[OSM-dev] Lite OSM backend

Thu Sep 7 19:46:34 BST 2006

On Thursday 07 September 2006 01:11, Nick Hill wrote:
> Drop the node ID, which is implied by the offset in the node file. (This
> will require generating a lookup table when processing the osm file for
> when making the way file).

The idea is to not have any processing stages which require memory 
proportional to the size of the OSM database.  As planet.osm has grown from 
200MB to 3GB in a few months, I suggest thinking quite carefully before 
"storing all nodes in a lookup table".

Node IDs will get dropped anyway when the node file is filtered for 
"compacted" format (i.e. after it's been used to update the segments file, 
and before it gets deployed to a rendering device), at which point each node 
will be just "lat,long,flags"

> Drop the int32 id (implied by file offset) double lat1 double lat2
> double lon1 double lon2 in the segment file. Again, a lookup table will
> be required to match OSM id to offset when generating the way file.

That's an interesting idea, and it comes down to how fast you can find the 
node.  

From a "locality of reference" point of view, it's quite attractive for the 
renderer to be able to read the segments file a few bytes at a time and have 
everything it needs to render the map, without having to retrieve node 
12900155 and node 32102221 from storage.  From my limited understanding of 
computers, I think that working sequentially through the segments file, and 
keeping a small set of working data that can fit into CPU registers, might 
make things faster.

The other consideration is that storing coordinates in the segment file means 
you can filter it by area (again, using 12 bytes of storage in your filtering 
program instead of 3GB) without having to recalculate the offsets for each 
node.  Again, some programs might find a speed advantage in being able to 
quickly and easily filter the dataset (e.g. a device which occasionally 
copies segments within 20 miles of your position from flash memory into RAM 
and uses that as a working copy)

My idea for the segments file was to drop the segment ID and node IDs when 
preparing the file for use by a renderer, leaving just 2 sets of coordinates, 
and the attribute flags

On Thursday 07 September 2006 00:20, Jon Burgess wrote:
> I reckon a good option for the lat/lon would be a 32bit fixed point
> number. I think 2^32 data points over +-180 degrees gives a resolution
> of about 1cm across the globe (40,000 km circumference / 4 billion).

Very good idea, and something I want to implement.

On Thursday 07 September 2006 09:01, Nicola Ranaldo wrote:
> About flags, i think 32 bits are not sufficient. We should think about a
> system to extend tags. For example if the  32th is set to 1 then software
> may lookup a special file to find the other tags.

Again, very good idea, and one I hadn't thought of yet.  We might need to 
think carefully about how this would work (what can you use as an identifier, 
that passes unchanged through the filtering and compacting stages?  e.g. if 
you retain ID fields in the output files for that purpose, then you have to 
decide if it's better to spend that extra space on more binary attributes 
instead.

A variable-length record (flag 32 means read an additional 4 bytes, similar to 
the UTF-8 scheme) is an idea, but worth considering how it affects other 
programs (e.g. is it acceptable to require more complexity in the filtering 
programs, to deal with variable-length records)

On Thursday 07 September 2006 09:01, Nicola Ranaldo wrote:
> generating tools should sort tags
> by usage, assign the 31 most used to bits and generate a configuration file
> to get mapping.

That might be a useful feature for some applications, but since the choice of 
tags isn't specified in the file format, it's up to the people programming 
generators whether they want to spend memory and CPU on that task. 

Personally, I was assuming that people would want to specify their 32 tags (or 
64 or whatever) based on what their renderer was programmed to display.  e.g. 
a device for aircraft pilots might have just two tags for roads ("big" and 
"small"), while a device for cars might allocate most of its 32 tags to 
different types of road.  A device for walkers might choose to allocate 10 
different types of footpath, and lots of field/woodland boundaries but no 
road information. 

Similarly, the node file for use in the car device will have fuel stations and 
parking lots. The walker's device will have pubs and trig points. The pilot's 
device will have crop circles, aerials, and gliding schools.  The choice of 
tags to put in the output file is such a personal choice, so 
application-dependant, that I didn't anticipate much need to auto-generate it

Of course, you can use jorg's statistics page to help you when deciding which 
popular tags to select...

Other bits:

I'd be interested to hear from computer scientists about fancy algorithms for 
spacially-sorting the data, doing indexes, etc. First person to mention 
Hilbert space gets a nerd of the month award.

Is it worth upgrading the segments file to something that can be used in 
rendering - give each segment a set of lat/long points and some extended 
attributes to cover them all.

Do we need a file-header? e.g. to say what type of file (nodes file 
unfiltered, nodes file stripped of its ID tags, nodes file with 12 bytes of 
tag information per node, or with variable record-length...), and flags like 
how much processing has been done on each file (so that the processing tools 
can resume a partially-processed file), that sort of thing?

Is anyone prepared to write the "apply node coordinates to their respective 
segments" program - I'm a bit distracted with the Mapyrus renderer at the 
moment, but it would be very useful.  Basically, it loads x MB of node data 
into memory (x being set by the user depending on how good their computer 
is), then goes through the segments file replacing the blank coordinates with 
real ones for any that are in it's "block of nodes". (the segments file is 
operated on in-situ, and doesn't change size)

The naive C-program I tried reveals that it needs to be much much quicker, 
with better data-structures for holding the working-copy of nodes, and lots 
of experimentation of how many nodes and segments to keep in memory between 
subsequent disk accesses to optimise the speed.

Regards

OJW

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://lists.openstreetmap.org/pipermail/dev/attachments/20060907/ead161d0/attachment.pgp>