[OSM-dev] Lite OSM backend
Nick Hill
nick at nickhill.co.uk
Thu Sep 7 23:00:59 BST 2006
OJW wrote:
> The idea is to not have any processing stages which require memory
> proportional to the size of the OSM database. As planet.osm has grown from
> 200MB to 3GB in a few months, I suggest thinking quite carefully before
> "storing all nodes in a lookup table".
Hello Oliver. Eventually, the look-up table could get large, but this
needn't be a problem. The code which generates the compacted files could
be linked against Berkeley DB (bdb), which would provide an index. I
expect the process would be fast on a modern desktop PC if the table is
indexed.
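As a rough sketch (assuming Berkeley DB 4.x's C API; the file name and
record sizes are just illustrative, not an agreed format), the
id-to-offset table could look like this:

#include <db.h>
#include <stdint.h>
#include <string.h>

static DB *open_index(const char *path)
{
    DB *dbp;
    if (db_create(&dbp, NULL, 0) != 0)
        return NULL;
    if (dbp->open(dbp, NULL, path, NULL, DB_BTREE, DB_CREATE, 0644) != 0)
        return NULL;
    return dbp;
}

/* Record the offset at which node `id` was written. */
static int put_offset(DB *dbp, uint32_t id, uint32_t offset)
{
    DBT key, data;
    memset(&key, 0, sizeof key);
    memset(&data, 0, sizeof data);
    key.data = &id;      key.size = sizeof id;
    data.data = &offset; data.size = sizeof offset;
    return dbp->put(dbp, NULL, &key, &data, 0);
}

/* Look the offset back up when generating the segment/way files. */
static int get_offset(DB *dbp, uint32_t id, uint32_t *offset)
{
    DBT key, data;
    memset(&key, 0, sizeof key);
    memset(&data, 0, sizeof data);
    key.data = &id;  key.size = sizeof id;
    data.data = offset;
    data.ulen = sizeof *offset;
    data.flags = DB_DBT_USERMEM;
    return dbp->get(dbp, NULL, &key, &data, 0);
}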
>
> Node IDs will get dropped anyway when the node file is filtered for
> "compacted" format (i.e. after it's been used to update the segments file,
> and before it gets deployed to a rendering device), at which point each node
> will be just "lat,long,flags"
>
>> Drop the int32 id (implied by file offset), double lat1, double lat2,
>> double lon1, double lon2 in the segment file. Again, a lookup table will
>> be required to match OSM id to offset when generating the way file.
>
> That's an interesting idea, and it comes down to how fast you can find the
> node.
If the page containing the node is already in memory, the lookup costs
little more than main-memory latency, which on modern computers is on
the order of tens of nanoseconds.
If the page isn't in memory, it will take around 7 milliseconds to
fetch. If the node table is pre-sorted on the first 16 bits of integer
lat/lon prior to processing, the nodes for a 655 m x 655 m area (at the
equator, smaller near the poles) will be in the same page. So you'll
take at most one ~7 ms page fault on the node table every 655 metres.
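A minimal sketch of that sort key, assuming the compacted node record
stores fixed-point coordinates (degrees * 1e7 in an int32 is an assumed
convention; one step of the top 16 bits is then 65536e-7 degrees,
roughly 655 m at the equator):

#include <stdint.h>
#include <stdlib.h>

/* Compacted node record as discussed: no id, just coordinates and
 * flags (degrees * 1e7 in an int32 is an assumed convention). */
struct node {
    int32_t lat;
    int32_t lon;
    uint8_t flags;
};

/* Bias the coordinate to be non-negative, then keep the top 16 bits;
 * one step is 65536e-7 degrees, roughly 655 m at the equator. */
static uint32_t cell16(int32_t coord)
{
    return ((uint32_t)coord + 1800000000u) >> 16;
}

/* qsort comparator: nodes in the same ~655 m cell become adjacent,
 * so they land in the same page of the output file. */
static int node_cmp(const void *pa, const void *pb)
{
    const struct node *a = pa, *b = pb;
    uint32_t ka = (cell16(a->lat) << 16) | cell16(a->lon);
    uint32_t kb = (cell16(b->lat) << 16) | cell16(b->lon);
    return (ka > kb) - (ka < kb);
}

Sorting the node array with qsort(nodes, n, sizeof(struct node),
node_cmp) before writing it out leaves each cell's nodes contiguous,
and hence in the same page.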
>
> From a "locality of reference" point of view, it's quite attractive for the
> renderer to be able to read the segments file a few bytes at a time and have
> everything it needs to render the map, without having to retrieve node
> 12900155 and node 32102221 from storage. From my limited understanding of
> computers, I think that working sequentially through the segments file, and
> keeping a small set of working data that can fit into CPU registers, might
> make things faster.
If the node file is pre-sorted so that proximate nodes are stored in
the same page, there will be almost no speed advantage. If you have a
simple algorithm working through the segment file (one small enough not
to flush the CPU L1/L2 cache), the node data will often already be in
an L1 or L2 cache whose latency is just a few nanoseconds.
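For illustration, the renderer's inner loop might look like the sketch
below, reusing the node struct from the earlier sketch; the segment
layout and draw_line() are hypothetical, not part of any agreed format:

#include <stddef.h>
#include <stdint.h>

struct segment {
    uint32_t node_a;   /* index into the sorted node array */
    uint32_t node_b;
    uint8_t  flags;
};

/* Hypothetical renderer entry point. */
void draw_line(int32_t lat1, int32_t lon1,
               int32_t lat2, int32_t lon2, uint8_t flags);

static void render(const struct segment *segs, size_t nsegs,
                   const struct node *nodes)
{
    for (size_t i = 0; i < nsegs; i++) {
        /* With the node file sorted by cell, a and b are usually in
         * a page (and often a cache line) touched moments earlier. */
        const struct node *a = &nodes[segs[i].node_a];
        const struct node *b = &nodes[segs[i].node_b];
        draw_line(a->lat, a->lon, b->lat, b->lon, segs[i].flags);
    }
}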
> The other consideration is that storing coordinates in the segment file means
> you can filter it by area (again, using 12 bytes of storage in your filtering
> program instead of 3GB) without having to recalculate the offsets for each
> node. Again, some programs might find a speed advantage in being able to
> quickly and easily filter the dataset (e.g. a device which occasionally
> copies segments within 20 miles of your position from flash memory into RAM
> and uses that as a working copy)
If the device runs an operating system which supports virtual memory,
and the data is sorted (to be page-friendly), the programmer needn't
worry about copying the data at all. The OS will take care of it.
If you access a given point in the file and it is not in memory, the OS
will pull in the 4 KB or 8 KB page from the hard drive or flash memory
which contains that data. Any further access to that area of the file
will be more-or-less instant, and the page normally stays in memory
until the OS finds a better use for the memory.
Embedded Linux systems, Windows CE, PalmOS, etc. support the necessary
features.
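A minimal sketch of leaving the paging to the OS, assuming POSIX mmap()
and the node struct from the earlier sketch (the file name is
illustrative):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static const struct node *map_nodes(const char *path, size_t *count)
{
    int fd = open(path, O_RDONLY);   /* e.g. "nodes.bin" */
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    void *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                /* the mapping outlives the descriptor */
    if (base == MAP_FAILED)
        return NULL;
    *count = (size_t)st.st_size / sizeof(struct node);
    return base;              /* nodes[i] now pages in on demand */
}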
>
> My idea for the segments file was to drop the segment ID and node IDs when
> preparing the file for use by a renderer, leaving just 2 sets of coordinates,
> and the attribute flags
I think the biggest issue is deciding which segments will be fed to the
renderer. Unless you use a brute-force search, you will need an index
of some sort.
I suggest the following:
Using the same prototypes as I listed, pre-sort the segments before
creating the binary file: sort on node a's lat, then node a's lon,
using only the first 12 bits of each (don't store the lat/lon in the
segment records).
Whilst writing the segments out to the segments file, whenever the
first 12 bits of lat or the first 12 bits of lon change, write the
lat/lon/offset to an index file, as sketched below.
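Here is a rough sketch of that writing pass (all record layouts are
assumptions, not an agreed format):

#include <stdio.h>
#include <stdint.h>

struct seg_sorted {            /* pre-sort record: node a's coords kept */
    int32_t  lat_a, lon_a;     /* degrees * 1e7, used only for sorting */
    uint32_t node_a, node_b;   /* offsets into the node file */
    uint8_t  flags;
};

struct seg_compact {           /* what actually goes in the segment file */
    uint32_t node_a, node_b;
    uint8_t  flags;
};

struct index_entry {
    uint16_t cell_lat, cell_lon;   /* first 12 bits of lat and lon */
    uint32_t offset;               /* start of this cell in the file */
};

/* Bias to non-negative, keep the top 12 bits (one step ~= 10 km). */
static uint16_t cell12(int32_t coord)
{
    return (uint16_t)(((uint32_t)coord + 1800000000u) >> 20);
}

static void write_files(const struct seg_sorted *in, size_t n,
                        FILE *segf, FILE *idxf)
{
    uint32_t prev = 0xffffffffu;
    for (size_t i = 0; i < n; i++) {
        uint32_t cell = ((uint32_t)cell12(in[i].lat_a) << 12)
                      | cell12(in[i].lon_a);
        if (cell != prev) {    /* cell changed: emit an index entry */
            struct index_entry e = { cell12(in[i].lat_a),
                                     cell12(in[i].lon_a),
                                     (uint32_t)ftell(segf) };
            fwrite(&e, sizeof e, 1, idxf);
            prev = cell;
        }
        /* compact record: coordinates dropped, flags kept */
        struct seg_compact out = { in[i].node_a, in[i].node_b,
                                   in[i].flags };
        fwrite(&out, sizeof out, 1, segf);
    }
}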
Two consecutive entries in the index file give you the offsets
bracketing the segments which make up the 10 km x 10 km tile you are
interested in. It is really that easy! If you are near the edge of a
tile, you will want to include adjacent tiles so that you don't have
cut segments.
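And the lookup side, again as a sketch: load the small index into
memory whole, then take two consecutive entries to bracket the cell:

static int find_cell(const struct index_entry *idx, size_t n,
                     uint16_t cell_lat, uint16_t cell_lon,
                     uint32_t *start, uint32_t *end)
{
    for (size_t i = 0; i < n; i++) {
        if (idx[i].cell_lat == cell_lat && idx[i].cell_lon == cell_lon) {
            *start = idx[i].offset;
            /* the last cell runs to end of file */
            *end = (i + 1 < n) ? idx[i + 1].offset : 0xffffffffu;
            return 0;
        }
    }
    return -1;   /* no segment starts in this cell */
}

A renderer would then fseek() to *start and fread() segments up to
*end, repeating for the neighbouring cells when near a tile edge.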
There is a balance between size of tile and size of index. If you index
on the first 16 bits of both lat and lon, you will have 655 m x 655 m
tiles, but an index pointing to all those tiles could be very large.