[OSM-dev] Lite OSM backend

Nick Hill nick at nickhill.co.uk
Thu Sep 7 23:00:59 BST 2006


OJW wrote:
> The idea is to not have any processing stages which require memory 
> proportional to the size of the OSM database.  As planet.osm has grown from 
> 200MB to 3GB in a few months, I suggest thinking quite carefully before 
> "storing all nodes in a lookup table".

Hello Oliver. Eventually, the look-up table could get large, but this 
needn't be a problem. The code that generates the compacted files could 
be linked against bdb, which would provide an index. I expect the 
process would be fast on a modern desktop PC if the table is indexed.
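
For example, something along these lines might do it (an untested 
sketch, assuming a 4.x-era Berkeley DB; the int32 id / uint32 offset 
layout is just illustrative):

#include <db.h>
#include <stdint.h>
#include <string.h>

static DB *open_index(const char *path)
{
    DB *dbp;
    if (db_create(&dbp, NULL, 0) != 0)
        return NULL;
    if (dbp->open(dbp, NULL, path, NULL, DB_BTREE, DB_CREATE, 0664) != 0)
        return NULL;
    return dbp;
}

/* Remember where a node was written in the compacted node file. */
static int put_offset(DB *dbp, int32_t id, uint32_t offset)
{
    DBT key, data;
    memset(&key, 0, sizeof key);
    memset(&data, 0, sizeof data);
    key.data = &id;      key.size = sizeof id;
    data.data = &offset; data.size = sizeof offset;
    return dbp->put(dbp, NULL, &key, &data, 0);
}

/* Look the offset back up while generating the segment/way files. */
static int get_offset(DB *dbp, int32_t id, uint32_t *offset)
{
    DBT key, data;
    memset(&key, 0, sizeof key);
    memset(&data, 0, sizeof data);
    key.data = &id;
    key.size = sizeof id;
    data.data = offset;
    data.ulen = sizeof *offset;
    data.flags = DB_DBT_USERMEM;
    return dbp->get(dbp, NULL, &key, &data, 0);
}

The b-tree keeps the id-to-offset mapping on disk, so memory use stays 
bounded however big planet.osm gets.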

> 
> Node IDs will get dropped anyway when the node file is filtered for 
> "compacted" format (i.e. after it's been used to update the segments file, 
> and before it gets deployed to a rendering device), at which point each node 
> will be just "lat,long,flags"
> 
>> Drop the int32 id (implied by file offset), and the double lat1, lat2, 
>> lon1, lon2 in the segment file. Again, a lookup table will be required 
>> to match OSM id to offset when generating the way file.
> 
> That's an interesting idea, and it comes down to how fast you can find the 
> node.  

If the page which contains the node is already in memory, the lookup 
costs roughly the latency of main memory, which on modern computers is 
on the order of tens of nanoseconds.

If the page isn't in memory, it will take around 7 milliseconds to 
fetch. But if the node table is pre-sorted on the first 16 bits of the 
integer lat/lon before processing, the nodes for a 655 m x 655 m area 
(at the equator; smaller near the poles) will sit in the same page, so 
you take roughly one 7 ms page fault on the node table every 655 meters.
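
The pre-sort itself could be an ordinary qsort() on the top 16 bits, 
something like this untested sketch (it assumes a hypothetical 
fixed-point int32 lat/lon; right-shifting negative values is 
implementation-defined in C, so real code might bias to unsigned first):

#include <stdint.h>
#include <stdlib.h>

struct node {
    int32_t lat, lon;   /* hypothetical fixed-point coordinates */
    uint8_t flags;
};

/* Order nodes by the ~655 m cell they fall in, so that nodes in the
 * same cell end up in the same disk page of the node file. */
static int by_cell(const void *a, const void *b)
{
    const struct node *na = a, *nb = b;
    int32_t ka = na->lat >> 16, kb = nb->lat >> 16;
    if (ka != kb) return ka < kb ? -1 : 1;
    ka = na->lon >> 16;
    kb = nb->lon >> 16;
    if (ka != kb) return ka < kb ? -1 : 1;
    return 0;
}

/* qsort(nodes, n, sizeof *nodes, by_cell); then write the file in order. */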

> 
> From a "locality of reference" point of view, it's quite attractive for the 
> renderer to be able to read the segments file a few bytes at a time and have 
> everything it needs to render the map, without having to retrieve node 
> 12900155 and node 32102221 from storage.   From my limited understanding of
> computers, I think that working sequentially through the segments file, and 
> keeping a small set of working data that can fit into CPU registers, might 
> make things faster.

If the node file is pre-sorted so that proximate nodes are stored in 
the same page, there will be almost no speed advantage. If you have a 
simple algorithm working through the segment file (one small enough not 
to flush the CPU's L1/L2 cache), the node data will often already be in 
an L1 or L2 cache whose latency is just a few nanoseconds.

> The other consideration is that storing coordinates in the segment file means 
> you can filter it by area (again, using 12 bytes of storage in your filtering 
> program instead of 3GB) without having to recalculate the offsets for each 
> node.  Again, some programs might find a speed advantage in being able to 
> quickly and easily filter the dataset (e.g. a device which occasionally 
> copies segments within 20 miles of your position from flash memory into RAM 
> and uses that as a working copy)

If the device is running an operating system which supports virtual 
memory, and the data is sorted (to be page-friendly), there is no need 
for the programmer to worry about copying the data; the OS will take 
care of it.

If you access a given point in the file and it is not in memory, the OS 
will pull in the 4 KB or 8 KB page from the hard drive or flash memory 
which contains that data. Any further access to that part of the file is 
then more-or-less instant, and the page normally stays cached until the 
OS finds a better use for the memory.
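
On anything POSIX-like this is just mmap(): map the file and 
dereference it, and the kernel does the paging. A minimal untested 
sketch (the file name is made up for illustration):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("segments.bin", O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0)
        return 1;

    const char *segs = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (segs == MAP_FAILED)
        return 1;

    /* The first touch of a page costs one fault; later touches of the
     * same page hit RAM until the kernel evicts it. */
    volatile char c = segs[0];
    (void)c;

    munmap((void *)segs, st.st_size);
    close(fd);
    return 0;
}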

Embedded Linux systems, Windows CE, PalmOS, etc. support the necessary 
features.

> 
> My idea for the segments file was to drop the segment ID and node IDs when 
> preparing the file for use by a renderer, leaving just 2 sets of coordinates, 
> and the attribute flags

I think the biggest issue is deciding which segments get fed to the 
renderer. Unless you use a brute-force search, you will need an index 
of some sort.

I suggest the following:
Using the same prototypes I listed earlier, before creating the binary 
file, pre-sort the segments on node A's lat, then node A's lon, using 
only the first 12 bits of each (don't store the lat/lon in the segment 
records).

Whilst writing the segments out to the segments file, whenever the 
first 12 bits of the lat or the first 12 bits of the lon change, write 
the lat/lon/offset to an index file.
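
In C the index pass might look something like this (untested; the 
record layouts are purely illustrative, not a proposed format):

#include <stdint.h>
#include <stdio.h>

struct seg {                   /* in-memory record, already sorted */
    int32_t a_lat, a_lon;      /* node A's coords, used only for sorting */
    uint32_t node_a, node_b;   /* offsets into the node file */
    uint8_t flags;
};

struct disk_seg {              /* on-disk record: coords dropped */
    uint32_t node_a, node_b;
    uint8_t flags;
};

struct index_entry { int32_t lat12, lon12; uint32_t offset; };

static void write_segments(FILE *out, FILE *idx,
                           const struct seg *segs, size_t n)
{
    int32_t prev_lat = INT32_MIN, prev_lon = INT32_MIN;

    for (size_t i = 0; i < n; i++) {
        int32_t lat12 = segs[i].a_lat >> 20;  /* keep the first 12 bits */
        int32_t lon12 = segs[i].a_lon >> 20;
        if (lat12 != prev_lat || lon12 != prev_lon) {
            /* New tile started: record where it begins in the file. */
            struct index_entry e = { lat12, lon12, (uint32_t)ftell(out) };
            fwrite(&e, sizeof e, 1, idx);
            prev_lat = lat12;
            prev_lon = lon12;
        }
        struct disk_seg d = { segs[i].node_a, segs[i].node_b, segs[i].flags };
        fwrite(&d, sizeof d, 1, out);
    }
}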

Two consecutive offset entries in the index file then tell you which 
segments make up the 10 km x 10 km tile you are interested in. It is 
really that easy! If you are near the edge of a tile, you will want to 
include adjacent tiles so that you don't have cut segments.
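
The lookup is then a scan of the (small) index for the wanted cell; the 
matching entry and the one after it bracket the tile's segments. 
Untested, same illustrative layout as above:

#include <stdint.h>
#include <stdio.h>

struct index_entry { int32_t lat12, lon12; uint32_t offset; };

/* Returns 1 and fills [*start, *end) with byte offsets into the
 * segments file; the last tile runs to the end of the file. */
static int find_tile(FILE *idx, int32_t lat12, int32_t lon12,
                     uint32_t file_size, uint32_t *start, uint32_t *end)
{
    struct index_entry e, next;

    rewind(idx);
    while (fread(&e, sizeof e, 1, idx) == 1) {
        if (e.lat12 == lat12 && e.lon12 == lon12) {
            *start = e.offset;
            *end = (fread(&next, sizeof next, 1, idx) == 1)
                       ? next.offset
                       : file_size;
            return 1;
        }
    }
    return 0;   /* empty tile: no segments start in this cell */
}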

There is a balance between the size of the tiles and the size of the 
index. If you index on the first 16 bits of both lat and lon, you get 
655 m x 655 m tiles, but the index then has to cover up to 2^32 cells 
instead of 2^24, which could make it very large.




