[OSM-dev] Osmosis bug when using 'completeWays' option?
Brett Henderson
brett at bretth.com
Mon Dec 17 06:26:43 GMT 2007
As already mentioned, the current osmosis pipeline design doesn't lend
itself very well to random data access. It focuses each task on a small
and specific purpose, which is great for utility but forces some
compromises in performance.
A couple of possible solutions come to mind:
1. Write a new task that combines file reading and bounding box
extraction. Randomly seeking over a raw xml file is unlikely to perform
ideally, but it may be the simplest to implement. This essentially
mimics the way Frederik's perl program works (see the sketch just after
this list).
2. Add seekable data support to the pipeline. This will take more
effort but may be the solution usable in the widest variety of scenarios.
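To make option 1 concrete, here is a rough sketch of the two-pass idea
using a StAX parser; the class name, the hard-coded example box and the
int-sized BitSet are all shortcuts of mine, not existing osmosis code:

    import java.io.FileInputStream;
    import java.util.BitSet;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    // Hypothetical two-pass cut: pass 1 marks node ids inside the box,
    // pass 2 re-reads the file and marks ways touching those nodes.
    public class TwoPassBboxCut {
        public static void main(String[] args) throws Exception {
            double left = 5.8, bottom = 50.7, right = 6.2, top = 51.0; // example box
            BitSet insideNodes = new BitSet(); // assumes ids fit an int index
            XMLInputFactory factory = XMLInputFactory.newInstance();

            // Pass 1: record the ids of nodes inside the bounding box.
            XMLStreamReader r = factory.createXMLStreamReader(new FileInputStream(args[0]));
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "node".equals(r.getLocalName())) {
                    double lat = Double.parseDouble(r.getAttributeValue(null, "lat"));
                    double lon = Double.parseDouble(r.getAttributeValue(null, "lon"));
                    if (lat >= bottom && lat <= top && lon >= left && lon <= right) {
                        insideNodes.set((int) Long.parseLong(r.getAttributeValue(null, "id")));
                    }
                }
            }
            r.close();

            // Pass 2: the "seek back" - re-read and mark ways that
            // reference at least one selected node.
            BitSet keptWays = new BitSet();
            long currentWay = -1;
            r = factory.createXMLStreamReader(new FileInputStream(args[0]));
            while (r.hasNext()) {
                if (r.next() != XMLStreamConstants.START_ELEMENT) continue;
                if ("way".equals(r.getLocalName())) {
                    currentWay = Long.parseLong(r.getAttributeValue(null, "id"));
                } else if ("nd".equals(r.getLocalName()) && currentWay >= 0
                        && insideNodes.get((int) Long.parseLong(r.getAttributeValue(null, "ref")))) {
                    keptWays.set((int) currentWay);
                }
            }
            r.close();
            System.out.println(insideNodes.cardinality() + " nodes, "
                    + keptWays.cardinality() + " ways selected");
        }
    }

A real task would also buffer and write the selected entities out in
order, but the two passes over a re-openable file are the essence of it.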
Option 2 would require a new data type to be added to the pipeline for
dealing with data as a complete (seekable) set. Currently the pipeline
can process entities and changes; it would be possible to add a new
type called "dataset" or similar. A "reader" task could then read an
entire data set, store it in a seekable (and indexed) form, and pass
that complete seekable set (exposed through a suitable interface) to
downstream tasks. That would eliminate the need for each task to
perform its own temporary-file buffering, which should scale more
effectively to large numbers of bounding boxes. The seekable store
creation task could even be completely disconnected from the data
processing task, allowing a seekable data store to be re-used between
osmosis invocations.
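Purely as a sketch, the new pipeline type might look something like the
following; none of these interfaces exist today, and Node, Way and
EntityContainer stand in for the existing entity classes:

    import java.util.Iterator;

    // Hypothetical "dataset" pipeline type: a complete, indexed data
    // set is handed to downstream tasks at once instead of entity by
    // entity.
    public interface Dataset {
        // Random access by id, backed by the store's indexes.
        Node getNode(long id);
        Way getWay(long id);

        // Spatial query, assuming the store also keeps a location index.
        Iterator<Node> getNodesInBox(double left, double bottom,
                                     double right, double top);

        // Full sequential scan, equivalent to today's streamed processing.
        Iterator<EntityContainer> iterate();
    }

    // A consuming task, by analogy with the existing Sink interface.
    interface DatasetSink {
        void process(Dataset dataset);
    }

A bounding box task written against DatasetSink could then serve many
boxes from one store instead of buffering the data once per box.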
If the above description sounds a lot like a database then you're right,
it is. The current temporary files in osmosis provide some database-like
features such as random access and indexes, but they're very simple and
limited. I've been thinking about writing a more sophisticated database
layer, or using something like bdb (Berkeley DB) to implement some of
this, but it is not trivial and could take some time to complete. Note
that the existing production mysql schema is not suitable for this: it
takes forever to import, and its indexing is unlikely to be ideal. A
simple read-only data store can provide huge speed advantages over a
full sql database but requires a brand new "schema".
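For illustration only, a read-only store along those lines can be as
crude as fixed-width records sorted by id, with lookups done by binary
search over the file; the record layout and class name below are
invented for the example:

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Sketch of a minimal read-only node store: fixed-width records
    // (id, lat, lon) sorted by id, looked up by binary search.
    public class NodeStore {
        private static final int RECORD_SIZE = 8 + 8 + 8; // long id, double lat, double lon
        private final RandomAccessFile file;
        private final long recordCount;

        public NodeStore(File f) throws IOException {
            file = new RandomAccessFile(f, "r");
            recordCount = file.length() / RECORD_SIZE;
        }

        // O(log n) seeks per lookup; no sql engine, no write support.
        public double[] getLatLon(long id) throws IOException {
            long low = 0, high = recordCount - 1;
            while (low <= high) {
                long mid = (low + high) >>> 1;
                file.seek(mid * RECORD_SIZE);
                long candidate = file.readLong();
                if (candidate == id) {
                    return new double[] { file.readDouble(), file.readDouble() };
                } else if (candidate < id) {
                    low = mid + 1;
                } else {
                    high = mid - 1;
                }
            }
            return null; // node not in the store
        }
    }

No import step beyond a sort, no transaction machinery, and the
"schema" is a comment at the top of the class.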
But I may be smoking something. I'm sure there are many ways of doing
this.
Note that all of this will come at a significant performance cost: most
osmosis tasks currently operate in a "streamy" fashion, which requires
minimal memory and is very fast for processing a planet file end to
end. Any random access processing will slow that down.
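To illustrate what "streamy" means, a typical task sees one entity at a
time and keeps almost no state. The sketch below only approximates the
real Sink interface; treat the names as illustrative:

    // Approximation of a streaming osmosis task: constant memory, a
    // single forward pass, no seeking. Sink, EntityContainer and Node
    // stand in for the real osmosis classes.
    public class NodeCounter implements Sink {
        private long nodeCount = 0;

        public void process(EntityContainer container) {
            if (container.getEntity() instanceof Node) {
                nodeCount++; // only one entity is in scope at a time
            }
        }

        public void complete() {
            System.out.println("nodes seen: " + nodeCount);
        }

        public void release() {
            // nothing to clean up for this task
        }
    }

A dataset-style task gives up exactly this property in exchange for
random access.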
Karl Newman wrote:
> On Dec 2, 2007 2:44 PM, Frederik Ramm <frederik at remote.org> wrote:
>
>> Hi,
>>
>>
>>>>> If you're working on a
>>>>> particular area, you might want to start with a simple bounding
>>>>> box first (no completeWays) to limit the data that is buffered.
>>>>>
>>>> Currently I'm working on a small area, but it will target the entire world.
>>>> Maybe I need to split the world into a few sub-areas first (e.g. Europe,
>>>> North America, South America, Africa and Asia) before cutting them into
>>>> small pieces.
>>>>
>>> I should talk to Brett about making some sort of shared entity
>>> cache/buffer that could be re-used by multiple downstream tasks, so in
>>> the case of a "tee" with "completeWays" bounding boxes there would be
>>> only one copy of all the entities instead of one set per tee.
>>>
>> Maybe the pre-cutting of a bounding box could be automated. Based on
>> the assumption that no way is longer than "x", the area filter could
>> hold only references to those objects which are *not* inside the
>> selected area but in the vicinity ("x" around the bounding box), for
>> possible later inclusion.
>>
>>
> That's a possibility; now solve for "x" :-)
>
>
>> As far as I understand, you're keeping the full objects. My Perl
>> polygon extractor only kept the object IDs (in a bit set) and
>> afterwards either used seek() to jump back to the beginning of the
>> relevant section in the input file to retrieve the missing objects,
>> or simply re-opened the file if it was not seekable. Also, for
>> effective output ordering, it created three temporary output files
>> (for nodes, ways, relations) and concatenated them at the end of the
>> operation. I fear it will be very hard to put such optimisations into
>> Osmosis as they don't fit well with the pipeline concept.
>>
>>
> Well, it does keep the object IDs in a bit set, but the pipeline idea
> is the problem--the source of the entity data could be a database. The
> pipeline paradigm doesn't lend itself to seeking backwards. That's why
> we store the entities (three separate temporary files here, too).
>
> Karl
>