[OSM-dev] Osmosis bug when using 'completeWays' option?

Mon Dec 17 09:27:37 GMT 2007

Hi!

There is probably no one-size-fits-all. For some datasets and some
operations streaming will be fine, for others some sort of database is
needed.

To explore the "database option" without the overhead of using a real
database like MySQL or PostgreSQL I have started experimenting with
SQLite. SQLite uses only a single file and is therefore easy to set up
and use. Libraries for all major languages are available and because
it uses more or less standard SQL, it is easier to use than a custom
setup built on raw DB files or the like.

There is a Ruby library at osmlib.rubyforge.org/osmlib-sqlite which
can currently import an .osm file into an SQLite database and dump the
SQLite database out again as an .osm file. The schema for the database is
a very simple direct mapping of the osm data model into tables for
nodes, ways, relations, node_tags, way_tags, relation_tags, way_nodes,
and members. In addition each table has a "marked" column, so you can
mark data you are interested in using SQL commands and only this data is
dumped out. The idea is to have the following workflow:

1. Create database and fill it from .osm file (done)
2. Create indexes on the database (the SQL to create the indexes exists
   but is not integrated yet, so it has to be run by hand)
3. Decide which parts of the database you are interested in and run
   SQL commands to mark the data. Eventually there could be tools to
   help you do that for common filtering tasks. Basically these are
   statements like:
   UPDATE way_tags SET marked=1 WHERE key='highway'
4. Call a script that (recursively) marks data needed for data integrity
   (like marking all nodes used in already marked ways).
5. Dump the marked objects in the database out into a .osm file. (done)
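
A rough sketch of steps 2 to 4 in Python with the standard sqlite3 module
follows. It is only an illustration under assumptions: the table names
come from the list above, but all column names (way_id, node_id, seq and
so on), the index names, and the helper functions are hypothetical and
need not match the real osmlib-sqlite schema. Relations and members are
left out; handling them would require repeating the dependency step until
nothing new gets marked.

```python
import sqlite3

# Hypothetical minimal schema. Table names follow the post; every column
# name and index name here is an assumption made for this sketch.
SCHEMA = """
CREATE TABLE nodes     (id INTEGER PRIMARY KEY, lat REAL, lon REAL,
                        marked INTEGER DEFAULT 0);
CREATE TABLE ways      (id INTEGER PRIMARY KEY, marked INTEGER DEFAULT 0);
CREATE TABLE way_tags  (way_id INTEGER, key TEXT, value TEXT,
                        marked INTEGER DEFAULT 0);
CREATE TABLE way_nodes (way_id INTEGER, node_id INTEGER, seq INTEGER,
                        marked INTEGER DEFAULT 0);
CREATE INDEX way_tags_key  ON way_tags (key);
CREATE INDEX way_nodes_way ON way_nodes (way_id);
"""


def build_demo():
    """Create an in-memory database with two ways sharing one node."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO nodes (id, lat, lon) VALUES (?, ?, ?)",
                   [(1, 49.0, 8.4), (2, 49.1, 8.5), (3, 49.2, 8.6)])
    db.executemany("INSERT INTO ways (id) VALUES (?)", [(10,), (11,)])
    db.executemany(
        "INSERT INTO way_tags (way_id, key, value) VALUES (?, ?, ?)",
        [(10, "highway", "residential"), (10, "name", "Main St"),
         (11, "waterway", "river")])
    db.executemany(
        "INSERT INTO way_nodes (way_id, node_id, seq) VALUES (?, ?, ?)",
        [(10, 1, 0), (10, 2, 1), (11, 2, 0), (11, 3, 1)])
    return db


def mark_highways(db):
    """Step 3: mark the interesting tags, then the ways carrying them."""
    db.execute("UPDATE way_tags SET marked=1 WHERE key='highway'")
    db.execute("UPDATE ways SET marked=1 WHERE id IN "
               "(SELECT way_id FROM way_tags WHERE marked=1)")


def mark_dependencies(db):
    """Step 4: mark everything the already marked ways depend on."""
    marked_ways = "(SELECT id FROM ways WHERE marked=1)"
    db.execute("UPDATE way_tags SET marked=1 WHERE way_id IN " + marked_ways)
    db.execute("UPDATE way_nodes SET marked=1 WHERE way_id IN " + marked_ways)
    db.execute("UPDATE nodes SET marked=1 WHERE id IN "
               "(SELECT node_id FROM way_nodes WHERE marked=1)")


def marked_ids(db, table):
    """Return the ids of marked rows; 'table' must be a trusted name."""
    query = "SELECT id FROM " + table + " WHERE marked=1 ORDER BY id"
    return [row[0] for row in db.execute(query)]


if __name__ == "__main__":
    db = build_demo()
    mark_highways(db)
    mark_dependencies(db)
    print(marked_ids(db, "ways"), marked_ids(db, "nodes"))
```

In the demo data only way 10 is a highway, so way 10 and its two nodes end
up marked, while node 3, which is used only by the waterway, stays
unmarked. The dump step would then select rows with marked=1.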

If this proves to be a viable approach I could even imagine that somebody
would prepare the SQLite database for the planet file so that people can
download it directly, sparing them the first step.

Jochen

On Mon, Dec 17, 2007 at 05:26:43PM +1100, Brett Henderson wrote:
> Date: Mon, 17 Dec 2007 17:26:43 +1100
> From: Brett Henderson <brett at bretth.com>
> To: OSM-Dev Openstreetmap <dev at openstreetmap.org>
> Subject: Re: [OSM-dev] Osmosis bug when using 'completeWays' option?
> 
> As already mentioned, the current osmosis pipeline design doesn't lend 
> itself very well to random data access.
> 
> The current design focuses each task on a small and specific purpose 
> which is great for increased utility, but leads to compromises in 
> performance.
> 
> A couple of possible solutions that come to mind are:
> 1. Write a new task that combines file read and bounding box 
> extraction.  Randomly seeking over a raw xml file is unlikely to provide 
> ideal performance but it may be simplest to implement.  This essentially 
> mimics the way Frederik's perl program works.
> 2. Add seekable data support to the pipeline.  This will take more 
> effort but may be the best solution usable in a wide variety of scenarios.
> 
> Option two would require a new data type to be added to the pipeline for 
> dealing with data as a complete (seekable) set.  Currently the pipeline 
> can process entities and changes.  It would be possible to add a new 
> type called dataset or similar.  A "reader" task could then read an 
> entire data set, store it in a seekable (and indexed) form, and pass 
> that complete seekable set (exposed through a suitable interface) to 
> downstream tasks.  That would eliminate the need for each task to 
> perform its own temporary file data buffering which will scale more 
> effectively to large numbers of bounding boxes.  It would be possible to 
> completely disconnect the seekable store creation task from the data 
> processing task which would allow a seekable data store to be re-used 
> between osmosis invocations.
> 
> If the above description sounds a lot like a database then you're right, 
> it is.  The current temporary files in osmosis provide some database 
> like features such as random access and indexes but they're very simple 
> and limited.  I've been thinking about writing a more complicated 
> database or using something like bdb to implement some of this but it is 
> not trivial and could take some time to complete.  Note that the 
> existing production mysql schema is not suitable for this: it takes 
> forever to import and the indexing is unlikely to be ideal.  A simple 
> read-only data store can provide huge speed advantages over a full sql 
> database but requires a brand new "schema".
> 
> But, I may be smoking something.  I'm sure there are many ways of doing 
> this.
> 
> Note that all of this will take a significant performance hit.  Most 
> osmosis tasks currently operate in a "streamy" fashion which requires 
> minimal memory and is very fast for processing a planet file end to 
> end.  Any random access processing will slow that down.
> 
> Karl Newman wrote:
> > On Dec 2, 2007 2:44 PM, Frederik Ramm <frederik at remote.org> wrote:
> >   
> >> Hi,
> >>
> >>     
> >>>>> If you're working on a
> >>>>> particular area, you might want to start with a simple bounding
> >>>>> box first (no completeWays) to limit the data that is buffered.
> >>>>>           
> >>>> Currently I'm working on a small area, but it will target the entire world.
> >>>> Maybe I need to split the world into a few sub areas first (e.g. Europe,
> >>>> North America, South America, Africa and Asia) before cutting them into
> >>>> small pieces.
> >>>>         
> >>> I should talk to Brett about making some sort of shared entity
> >>> cache/buffer that could be re-used by multiple downstream tasks, so in
> >>> the case of a "tee" with "completeWays" bounding boxes there would be
> >>> only one copy of all the entities instead of one set per tee.
> >>>       
> >> Maybe the pre-cutting of a bounding box could be automated. Based on
> >> the assumption that no way is longer than "x", the area filter could
> >> hold only references to those objects which are *not* inside the
> >> selected area but in the vicinity ("x" around the bounding box), for
> >> possible later inclusion.
> >>
> >>     
> > That's a possibility; now solve for "x" :-)
> >
> >   
> >> As far as I understand, you're keeping the full objects. My Perl
> >> polygon extractor only kept the object IDs (in a bit set) and
> >> afterwards either used seek() to jump back to the beginning of the
> >> relevant section in the input file to retrieve the missing objects (or
> >> simply re-opened it if it was not seekable). Also, for effective
> >> output ordering, it created three temporary output files (for nodes,
> >> ways, relations) and concatenated them at the end of the operation. I
> >> fear it will be very hard to put such optimisations into Osmosis as
> >> they don't fit well with the pipeline concept.
> >>
> >>     
> > Well, it does keep the object IDs in a bit set, but the pipeline idea
> > is the problem--the source of the entity data could be a database. The
> > pipeline paradigm doesn't lend itself to seeking backwards. That's why
> > we store the entities (three separate temporary files, here, too).
> >
> > Karl
> >
> > _______________________________________________
> > dev mailing list
> > dev at openstreetmap.org
> > http://lists.openstreetmap.org/cgi-bin/mailman/listinfo/dev
> >   
> 
> 
> 
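
For reference, the two-pass approach described in the quoted thread (keep
only object IDs on the first pass, then go back over the input to fetch
the objects that are missing) can be sketched roughly as below. This is
not Frederik's Perl program or any Osmosis code. The entity format, the
function names, and the use of a Python set in place of a bit set are all
assumptions made for illustration, and the sketch assumes that nodes come
before ways in the input and that the input can be re-read from the start.

```python
def extract_complete_ways(read_entities, in_bbox):
    """Two-pass extract that keeps ways complete.

    read_entities: callable returning a fresh iterator of (kind, entity)
                   pairs on every call (stands in for re-opening or
                   seeking back in the input file).
    in_bbox:       callable (lat, lon) -> bool.
    """
    inside_nodes = set()   # stand-in for a bit set of node IDs
    needed_nodes = set()   # nodes outside the box that kept ways refer to
    kept_ways = set()

    # Pass 1: remember IDs only, never whole objects.
    for kind, ent in read_entities():
        if kind == "node":
            if in_bbox(ent["lat"], ent["lon"]):
                inside_nodes.add(ent["id"])
        elif kind == "way":
            if any(ref in inside_nodes for ref in ent["nodes"]):
                kept_ways.add(ent["id"])
                needed_nodes.update(ent["nodes"])

    # Pass 2: re-read the input and emit everything that was selected.
    wanted_nodes = inside_nodes | needed_nodes
    result = []
    for kind, ent in read_entities():
        if kind == "node" and ent["id"] in wanted_nodes:
            result.append((kind, ent["id"]))
        elif kind == "way" and ent["id"] in kept_ways:
            result.append((kind, ent["id"]))
    return result


# Tiny made-up data set: node 1 is inside the box, nodes 2 and 3 are not.
DEMO = [
    ("node", {"id": 1, "lat": 0.5, "lon": 0.5}),
    ("node", {"id": 2, "lat": 2.0, "lon": 2.0}),
    ("node", {"id": 3, "lat": 5.0, "lon": 5.0}),
    ("way", {"id": 10, "nodes": [1, 2]}),
    ("way", {"id": 11, "nodes": [3]}),
]


def read_demo():
    return iter(DEMO)


def unit_box(lat, lon):
    return 0.0 <= lat <= 1.0 and 0.0 <= lon <= 1.0


if __name__ == "__main__":
    print(extract_complete_ways(read_demo, unit_box))
```

Because the second pass needs to start over at the beginning of the input,
this only works when the source can be re-read, which is the point Karl
raises about sources such as a database feeding a pipeline.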

-- 
Jochen Topf  jochen at remote.org  http://www.remote.org/jochen/  +49-721-388298