[OSM-dev] Osmosis multi-task questions

Brett Henderson brett at bretth.com
Mon Nov 26 23:38:06 GMT 2007


Frederik Ramm wrote:
> Hi,
>
>    I'm thinking about using Osmosis to produce daily "mini planets"
> for a number of areas, e.g. one each for Germany and the neighbouring
> countries, but then also one each for the 16 German "Länder" (states).
>
> There are probably no 100% answers for the following but maybe someone
> has experimented with these things and has some ideas/insights on the
> best procedures.
>
> I am thinking about this:
>
> 1. Each week, get full planet file and do a bounding box extract for
> Europe.
>
> 2. Daily, apply the diff to this file. This will add a little non-Europe data
> each day but I can ignore that.
>   
You can run the result through another bounding box task to remove any 
unwanted data if necessary.
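For example, a second crop pass might look like this (the file names and coordinates are illustrative, not a tested recipe):

```
osmosis --rx europe-updated.osm \
    --bb left=-11 right=35 top=72 bottom=34 \
    --wx europe-cropped.osm
```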
> 3. Daily, split out desired polygons from patched planet file.
>
> Assuming for a moment that I'd not use a database; would it be
> possible (and sensible) to use the "tee" task in Osmosis to branch
> from one --read-xml into 20 --border-polygon/--write-xml tasks, so
> that all areas I am interested in get cut out in one go? I guess I
> would have to create a very long command line naming all the output
> pipes of the tee and assigning them to each of the --border-polygon
> tasks, right?
>   
Yes, that's right.  Lambertus is doing something similar where he's 
producing around 200 bounding boxes in a single osmosis invocation.  
That is many times faster than invoking osmosis once for each area.
> Can I make use of the default pipe connection between the
> border-polygon and write-xml if I pair them, e.g.
>
> osmosis 
>    --rx ... (use default pipe) --tee (name 20 output pipes) 
>    --bp (name 1 input pipe but use default pipe on output) --wx 
>    --bp (name another input pipe, again use default pipe on out) --wx
>    ...
>   
This won't work currently.  The outputs of each task get added to a 
queue, so the default allocation works on a FIFO basis.  Ages ago kleptog 
suggested using a stack (i.e. FILO) approach instead, which would make 
default connectivity work in more cases.  A stack would change the 
existing behaviour so I've held back from making the change, but I think 
I need to do it one of these days.

If you want to use default pipe connection, you'll have to do something 
like this:
osmosis
    --rx
    --tee 20
    --bp file=poly01.txt
    --bp file=poly02.txt
    ...
    --bp file=poly20.txt
    --wx out01.osm
    --wx out02.osm
    ...
    --wx out20.osm

Note that paired tasks are not next to each other on the command line: 
the --tee registers 20 outputs, the --bp tasks read each of those and 
produce another 20 outputs, and the --wx tasks then pick each of those 
up and write to file.

You have to be careful to make sure they're lined up correctly, but it 
should be possible.
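Alternatively, you can name every pipe explicitly with inPipe/outPipe 
arguments so the pairing doesn't depend on the FIFO ordering at all. A 
two-output sketch (pipe and file names are just placeholders):

```
osmosis --rx planet.osm \
    --tee 2 outPipe.0=a outPipe.1=b \
    --bp file=poly01.txt inPipe.0=a outPipe.0=x \
    --bp file=poly02.txt inPipe.0=b outPipe.0=y \
    --wx out01.osm inPipe.0=x \
    --wx out02.osm inPipe.0=y
```

It's more typing, but each connection is spelled out so the task order 
on the command line no longer matters.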
> In a case like the one sketched initially - select 5 neighbouring
> regions plus 16 regions inside one of the selected - could I even 
> construct a command line that would do something like this:
>
> read xml -> tee into 5 pipes, each with own bounding polygon, four
> of them written directly to file, fifth tees into 16 pipes, one of
> which is written directly to file, the others are again used as
> bounding polygon inputs and then written
>   
That sounds okay.
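A rough sketch of that nested layout using explicit pipe names (all 
file names are invented, and the second --tee output count follows your 
description):

```
osmosis --rx europe.osm \
    --tee 5 outPipe.0=r1 outPipe.1=r2 outPipe.2=r3 outPipe.3=r4 outPipe.4=de \
    --bp file=region1.poly inPipe.0=r1 outPipe.0=w1 --wx region1.osm inPipe.0=w1 \
    ...
    --bp file=region4.poly inPipe.0=r4 outPipe.0=w4 --wx region4.osm inPipe.0=w4 \
    --bp file=germany.poly inPipe.0=de outPipe.0=de2 \
    --tee 16 inPipe.0=de2 outPipe.0=g0 outPipe.1=l1 ... outPipe.15=l15 \
    --wx germany.osm inPipe.0=g0 \
    --bp file=land01.poly inPipe.0=l1 outPipe.0=lw1 --wx land01.osm inPipe.0=lw1 \
    ...
```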
> Could I expect better performance by using a Mysql database which is
> reset every week and has the diffs applied to it? When extracting
> polygons from the Mysql instead of from an XML file, would the same
> "tee" strategy make sense or would Mysql reading be fast enough to
> just extract every polygon sequentially? I see that the read-mysql
> task has no option to make use of Mysql indexes for selecting bounding
> boxes, so I assume it would always feed the full data set into the
> select_polygon which might be less than ideal. Maybe I could trick
> Osmosis into operating on a "view" of a node table that contains only
> nodes in a certain lat/lon range?
>   
With a decent machine, I think you'll find file-based performance more 
than adequate.  A file will be far quicker than using MySQL as raw data 
storage without any indexing.  If bounding boxes could be extracted 
directly from the db it might be quicker, but it's not something I've 
tried.  A new mysql task for reading bounding boxes directly from a db 
could be very useful; patches welcome of course :-)
> The whole thing is to be run on a quad-core machine with 8 GB and
> reasonably fast disk arrays.
>   
One thing to note is that --bb and --bp both accept an idTrackerType 
argument.  The default value of IdList is best for smallish areas 
(roughly less than 1/32 of the entire planet), but for huge areas 
(Europe might be an example of this) it may be worth specifying 
idTrackerType=BitSet.  The only way to know is to test it out.

IdList consumes 4 bytes of memory per node id inside the area (ways and 
relations are not very relevant), while BitSet consumes 
maximum_node_id_in_input_source/8 bytes of memory regardless of the 
area size.  IdList is far more efficient for areas containing small 
numbers of nodes, whereas BitSet is more efficient for areas containing 
a large percentage of the overall planet.
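To put rough numbers on that trade-off (the node counts below are made 
up for illustration, using the 4-bytes-per-id and one-bit-per-possible-id 
figures above):

```shell
# Rough memory comparison of the two id trackers.
max_node_id=300000000        # illustrative highest node id in the input
nodes_in_area=5000000        # illustrative node count inside the polygon

idlist_bytes=$((nodes_in_area * 4))   # IdList: 4 bytes per stored id
bitset_bytes=$((max_node_id / 8))     # BitSet: one bit per possible id

echo "IdList: $idlist_bytes bytes, BitSet: $bitset_bytes bytes"

# IdList wins while the area holds fewer than max_node_id/32 nodes,
# which is where the "1/32 of the planet" rule of thumb comes from.
crossover=$((max_node_id / 32))
echo "crossover at $crossover nodes"
```

With these made-up figures IdList needs about 20 MB against BitSet's 
37.5 MB, so IdList is still the better choice for an area of this size.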

Brett
