[osmosis-dev] Improving completeWays/completeRelations performance

Fri Feb 18 08:57:17 GMT 2011

Hi,

    in the long run I'd like to change the Geofabrik extracts so that 
they have the completeWays/completeRelations feature enabled. It's a 
pain because that totally breaks the elegant and well-performing 
streaming mode in Osmosis but it would really make the extracts more 
usable, and more in line with what people get from the API.

My biggest concern is the disk space used for temporary storage. If I 
read things correctly, a temporary storage of the input stream is 
created for each --bb or --bp task. So if you do something like

osmosis --rb planet --tee 5
   --bb ... --wb europe
   --bb ... --wb asia
   --bb ... --wb america
   --bb ... --wb australia
   --bb ... --wb africa

then you will temporarily have 5 copies of the planet file lying around. 
With only one copy, I could still hope that Linux file system buffers 
and a lot of RAM would soften the negative impact of file storage, but 
five copies will kill performance for sure.

I wonder if there is a way to at least reduce this to *one* temporary 
storage. The easiest thing I could imagine would be a new "multi-bb" (or 
"multi-bp") task that basically combines the tee and bb. That would be 
less elegant and would probably also be less efficient because it would 
not use multiple threads, but it could easily use one shared temporary 
storage.
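
As a rough illustration of the "multi-bb" idea, here is a minimal Python 
sketch that tests each node of a single input stream against several 
boxes in one loop, so the input is traversed only once. The types and 
the names multi_bb and contains are hypothetical and are not Osmosis 
classes or tasks; the box coordinates are made up for the example.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical toy types, not Osmosis classes.
BBox = Tuple[float, float, float, float]  # (left, bottom, right, top)
Node = Tuple[int, float, float]           # (id, lon, lat)


def contains(box: BBox, lon: float, lat: float) -> bool:
    """Return True if the coordinate lies inside the box (edges included)."""
    left, bottom, right, top = box
    return left <= lon <= right and bottom <= lat <= top


def multi_bb(nodes: Iterable[Node], boxes: Dict[str, BBox]) -> Dict[str, Set[int]]:
    """Read the node stream once and record, per box, the node IDs inside it."""
    selected: Dict[str, Set[int]] = {name: set() for name in boxes}
    for node_id, lon, lat in nodes:
        for name, box in boxes.items():
            if contains(box, lon, lat):
                selected[name].add(node_id)
    return selected


# Made-up boxes and nodes, for illustration only.
boxes = {
    "europe": (-10.0, 35.0, 30.0, 70.0),
    "africa": (-20.0, -35.0, 50.0, 37.0),
}
nodes = [(1, 13.4, 52.5), (2, 31.2, 30.0), (3, 0.0, 36.0)]
result = multi_bb(nodes, boxes)
```

The sketch only covers node selection; a real task would still need the 
one shared temporary storage to complete ways and relations.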

But I've been thinking: With the high performance of PBF reading, a 
two-pass operation should become possible. Simply read the input file 
twice, determining which objects to copy in pass 1, and actually copying 
them in pass 2. I'm just not sure how that could be made to fit in 
Osmosis. One way could be creating a special type of file, a "selection 
list", from a given entity stream. A new task "--write-selection-list" 
would dump the IDs of all nodes, ways, and relations that were either 
present or referenced in the entity stream:

osmosis --rb planet --tee 5
   --bb ... --write-selection-list europe.sel
   --bb ... --write-selection-list asia.sel
   --bb ... --write-selection-list america.sel
   --bb ... --write-selection-list australia.sel
   --bb ... --write-selection-list africa.sel

Then, in a second pass, one would use a new task 
"--apply-selection-list" to actually filter the objects:

osmosis --rb planet --tee 5
   --apply-selection-list europe.sel --wb europe
   --apply-selection-list asia.sel --wb asia
   ...
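
The two passes could be modelled roughly as follows. This is a minimal 
Python sketch under stated assumptions, not an Osmosis implementation: 
the tuple-based entity model, the one-ID-per-line file format and all 
function names are hypothetical, and the tiny "planet" is made up.

```python
import os
import tempfile
from typing import Iterable, Iterator, List, Set, Tuple

# Hypothetical toy entity: (type, id, references). The references are the
# way nodes or relation members as (type, id) pairs. Not an Osmosis class.
Ref = Tuple[str, int]
Entity = Tuple[str, int, List[Ref]]


def build_selection(stream: Iterable[Entity]) -> Set[Ref]:
    """Pass 1: collect every object present or referenced in the stream."""
    selection: Set[Ref] = set()
    for kind, ident, refs in stream:
        selection.add((kind, ident))
        selection.update(refs)
    return selection


def write_selection(selection: Set[Ref], path: str) -> None:
    """Dump the selection as one 'type id' pair per line (assumed format)."""
    with open(path, "w") as out:
        for kind, ident in sorted(selection):
            out.write(f"{kind} {ident}\n")


def read_selection(path: str) -> Set[Ref]:
    """Load a selection list fully into memory."""
    with open(path) as src:
        return {(kind, int(ident)) for kind, ident in (line.split() for line in src)}


def apply_selection(stream: Iterable[Entity], selection: Set[Ref]) -> Iterator[Entity]:
    """Pass 2: copy only the objects named in the selection list."""
    return (entity for entity in stream if (entity[0], entity[1]) in selection)


# Made-up data: node 2 lies outside the box but is used by way 10.
planet: List[Entity] = [
    ("node", 1, []),
    ("node", 2, []),
    ("node", 3, []),
    ("way", 10, [("node", 1), ("node", 2)]),
]
bb_output: List[Entity] = [
    ("node", 1, []),
    ("way", 10, [("node", 1), ("node", 2)]),
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "europe.sel")
    write_selection(build_selection(bb_output), path)               # pass 1
    extract = list(apply_selection(planet, read_selection(path)))   # pass 2
```

In this toy run the extract contains nodes 1 and 2 and way 10; node 2 
is picked up only because way 10 references it.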

The selection lists would be quite big and, for efficiency, would have 
to be kept fully in memory, so the above jobs could easily eat 20 GB of 
RAM (1.5 billion objects, 64-bit IDs, plus hash table overhead). 
Also, what I have sketched above would be able to give you

* all nodes in the bounding box
* all ways using any of these nodes
* all nodes used by any of these ways even if outside
* all relations using any of these nodes or ways
o all nodes and ways used by any of these relations even if outside
o but NOT all nodes used by a way drawn in through a relation.

(The points marked "*" are what the API does; the API does not do the 
"o" marked points even though users could be interested in them.)

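The rules listed above can be written down as a small Python sketch. It 
assumes that "relations using any of these nodes or ways" means 
relations with a member node inside the box or a member way selected by 
the box; relation-in-relation members are ignored. The select function 
and the data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Member = Tuple[str, int]  # ("node" or "way", id); toy model only


def select(
    inside: Set[int],
    ways: Dict[int, List[int]],
    relations: Dict[int, List[Member]],
) -> Tuple[Set[int], Set[int], Set[int]]:
    """Return (node IDs, way IDs, relation IDs) selected by the rules above."""
    # Ways using any node inside the box, and all nodes of those ways.
    sel_ways = {w for w, refs in ways.items() if inside.intersection(refs)}
    way_nodes = {n for w in sel_ways for n in ways[w]}
    # Relations using any inside node or selected way (assumed reading).
    sel_rels = {
        r
        for r, members in relations.items()
        if any(
            (kind == "node" and ref in inside) or (kind == "way" and ref in sel_ways)
            for kind, ref in members
        )
    }
    # Members of those relations, even if outside the box.
    rel_nodes = {ref for r in sel_rels for kind, ref in relations[r] if kind == "node"}
    rel_ways = {ref for r in sel_rels for kind, ref in relations[r] if kind == "way"}
    # Nodes of ways that came in only through a relation are NOT added.
    return inside | way_nodes | rel_nodes, sel_ways | rel_ways, sel_rels


# Made-up data: only node 1 is inside the box.
ways = {10: [1, 2], 11: [3, 4]}
relations = {100: [("way", 10), ("way", 11), ("node", 5)]}
sel_nodes, sel_way_ids, sel_rel_ids = select({1}, ways, relations)
```

Here way 11 is drawn in through relation 100, but its nodes 3 and 4 are 
missing from the result, which is the last "o" point in the list.
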
Does anybody have any thoughts about this, or maybe a different approach 
altogether?

Bye
Frederik