[OSM-dev] Osmosis bug when using 'completeWays' option?
Brett Henderson
brett at bretth.com
Mon Dec 17 06:26:43 GMT 2007
As already mentioned, the current osmosis pipeline design doesn't lend
itself very well to random data access. It focuses each task on a small
and specific purpose, which is great for utility but forces some
compromises in performance.
A couple of possible solutions come to mind:
1. Write a new task that combines file reading and bounding box
extraction. Randomly seeking over a raw xml file is unlikely to perform
ideally, but it may be the simplest to implement. This essentially
mimics the way Frederik's perl program works (see the sketch just after
this list).
2. Add seekable data support to the pipeline. This will take more
effort but may be the solution usable in the widest variety of scenarios.
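To make option 1 concrete, here is a rough sketch of the two-pass idea
using a StAX parser; the class name, the hard-coded example box and the
int-sized BitSet are all shortcuts of mine, not existing osmosis code:

    import java.io.FileInputStream;
    import java.util.BitSet;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    // Hypothetical two-pass cut: pass 1 marks node ids inside the box,
    // pass 2 re-reads the file and marks ways touching those nodes.
    public class TwoPassBboxCut {
        public static void main(String[] args) throws Exception {
            double left = 5.8, bottom = 50.7, right = 6.2, top = 51.0; // example box
            BitSet insideNodes = new BitSet(); // assumes ids fit an int index
            XMLInputFactory factory = XMLInputFactory.newInstance();

            // Pass 1: record the ids of nodes inside the bounding box.
            XMLStreamReader r = factory.createXMLStreamReader(new FileInputStream(args[0]));
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "node".equals(r.getLocalName())) {
                    double lat = Double.parseDouble(r.getAttributeValue(null, "lat"));
                    double lon = Double.parseDouble(r.getAttributeValue(null, "lon"));
                    if (lat >= bottom && lat <= top && lon >= left && lon <= right) {
                        insideNodes.set((int) Long.parseLong(r.getAttributeValue(null, "id")));
                    }
                }
            }
            r.close();

            // Pass 2: the "seek back" - re-read and mark ways that
            // reference at least one selected node.
            BitSet keptWays = new BitSet();
            long currentWay = -1;
            r = factory.createXMLStreamReader(new FileInputStream(args[0]));
            while (r.hasNext()) {
                if (r.next() != XMLStreamConstants.START_ELEMENT) continue;
                if ("way".equals(r.getLocalName())) {
                    currentWay = Long.parseLong(r.getAttributeValue(null, "id"));
                } else if ("nd".equals(r.getLocalName()) && currentWay >= 0
                        && insideNodes.get((int) Long.parseLong(r.getAttributeValue(null, "ref")))) {
                    keptWays.set((int) currentWay);
                }
            }
            r.close();
            System.out.println(insideNodes.cardinality() + " nodes, "
                    + keptWays.cardinality() + " ways selected");
        }
    }

A real task would also buffer and write the selected entities out in
order, but the two passes over a re-openable file are the essence of it.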
Option 2 would require a new data type to be added to the pipeline for
dealing with data as a complete (seekable) set. Currently the pipeline
can process entities and changes; it would be possible to add a new
type called "dataset" or similar. A "reader" task could then read an
entire data set, store it in a seekable (and indexed) form, and pass
that complete seekable set (exposed through a suitable interface) to
downstream tasks. That would eliminate the need for each task to
perform its own temporary-file buffering, which should scale more
effectively to large numbers of bounding boxes. The seekable store
creation task could even be completely disconnected from the data
processing task, allowing a seekable data store to be re-used between
osmosis invocations.
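Purely as a sketch, the new pipeline type might look something like the
following; none of these interfaces exist today, and Node, Way and
EntityContainer stand in for the existing entity classes:

    import java.util.Iterator;

    // Hypothetical "dataset" pipeline type: a complete, indexed data
    // set is handed to downstream tasks at once instead of entity by
    // entity.
    public interface Dataset {
        // Random access by id, backed by the store's indexes.
        Node getNode(long id);
        Way getWay(long id);

        // Spatial query, assuming the store also keeps a location index.
        Iterator<Node> getNodesInBox(double left, double bottom,
                                     double right, double top);

        // Full sequential scan, equivalent to today's streamed processing.
        Iterator<EntityContainer> iterate();
    }

    // A consuming task, by analogy with the existing Sink interface.
    interface DatasetSink {
        void process(Dataset dataset);
    }

A bounding box task written against DatasetSink could then serve many
boxes from one store instead of buffering the data once per box.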
If the above description sounds a lot like a database then you're right,
it is. The current temporary files in osmosis provide some database-like
features such as random access and indexes, but they're very simple and
limited. I've been thinking about writing a more sophisticated database
layer, or using something like bdb (Berkeley DB) to implement some of
this, but it is not trivial and could take some time to complete. Note
that the existing production mysql schema is not suitable for this: it
takes forever to import, and its indexing is unlikely to be ideal. A
simple read-only data store can provide huge speed advantages over a
full sql database but requires a brand new "schema".
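For illustration only, a read-only store along those lines can be as
crude as fixed-width records sorted by id, with lookups done by binary
search over the file; the record layout and class name below are
invented for the example:

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Sketch of a minimal read-only node store: fixed-width records
    // (id, lat, lon) sorted by id, looked up by binary search.
    public class NodeStore {
        private static final int RECORD_SIZE = 8 + 8 + 8; // long id, double lat, double lon
        private final RandomAccessFile file;
        private final long recordCount;

        public NodeStore(File f) throws IOException {
            file = new RandomAccessFile(f, "r");
            recordCount = file.length() / RECORD_SIZE;
        }

        // O(log n) seeks per lookup; no sql engine, no write support.
        public double[] getLatLon(long id) throws IOException {
            long low = 0, high = recordCount - 1;
            while (low <= high) {
                long mid = (low + high) >>> 1;
                file.seek(mid * RECORD_SIZE);
                long candidate = file.readLong();
                if (candidate == id) {
                    return new double[] { file.readDouble(), file.readDouble() };
                } else if (candidate < id) {
                    low = mid + 1;
                } else {
                    high = mid - 1;
                }
            }
            return null; // node not in the store
        }
    }

No import step beyond a sort, no transaction machinery, and the
"schema" is a comment at the top of the class.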
But I may be smoking something. I'm sure there are many ways of doing
this.
Note that all of this will come at a significant performance cost: most
osmosis tasks currently operate in a "streamy" fashion, which requires
minimal memory and is very fast for processing a planet file end to
end. Any random access processing will slow that down.
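To illustrate what "streamy" means, a typical task sees one entity at a
time and keeps almost no state. The sketch below only approximates the
real Sink interface; treat the names as illustrative:

    // Approximation of a streaming osmosis task: constant memory, a
    // single forward pass, no seeking. Sink, EntityContainer and Node
    // stand in for the real osmosis classes.
    public class NodeCounter implements Sink {
        private long nodeCount = 0;

        public void process(EntityContainer container) {
            if (container.getEntity() instanceof Node) {
                nodeCount++; // only one entity is in scope at a time
            }
        }

        public void complete() {
            System.out.println("nodes seen: " + nodeCount);
        }

        public void release() {
            // nothing to clean up for this task
        }
    }

A dataset-style task gives up exactly this property in exchange for
random access.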
Karl Newman wrote:
> On Dec 2, 2007 2:44 PM, Frederik Ramm <frederik at remote.org> wrote:
>
>> Hi,
>>
>>
>>>>> If you're working on a
>>>>> particular area, you might want to start with a simple bounding
>>>>> box first (no completeWays) to limit the data that is buffered.
>>>>>
>>>> Currently I'm working on a small area, but it will target the entire world.
>>>> Maybe I need to split the world into a few sub-areas first (e.g. Europe,
>>>> North America, South America, Africa and Asia) before cutting them into
>>>> small pieces.
>>>>
>>> I should talk to Brett about making some sort of shared entity
>>> cache/buffer that could be re-used by multiple downstream tasks, so in
>>> the case of a "tee" with "completeWays" bounding boxes there would be
>>> only one copy of all the entities instead of one set per tee.
>>>
>> Maybe the pre-cutting of a bounding box could be automated. Based on
>> the assumption that no way is longer than "x", the area filter could
>> hold only references to those objects which are *not* inside the
>> selected area but in the vicinity ("x" around the bounding box), for
>> possible later inclusion.
>>
>>
> That's a possibility; now solve for "x" :-)
>
>
>> As far as I understand, you're keeping the full objects. My Perl
>> polygon extractor only kept the object IDs (in a bit set) and
>> afterwards either used seek() to jump back to the beginning of the
>> relevant section in the input file to retrieve the missing objects,
>> or simply re-opened the file if it was not seekable. Also, for
>> effective output ordering, it created three temporary output files
>> (for nodes, ways, relations) and concatenated them at the end of the
>> operation. I fear it will be very hard to put such optimisations into
>> Osmosis as they don't fit well with the pipeline concept.
>>
>>
> Well, it does keep the object IDs in a bit set, but the pipeline idea
> is the problem--the source of the entity data could be a database. The
> pipeline paradigm doesn't lend itself to seeking backwards. That's why
> we store the entities (three separate temporary files here, too).
>
> Karl
>