[OSM-dev] Osmosis multi-task questions
Brett Henderson
brett at bretth.com
Mon Nov 26 23:38:06 GMT 2007
Frederik Ramm wrote:
> Hi,
>
> I'm thinking about using Osmosis to produce daily "mini planets"
> for a number of areas, e.g. one each for Germany and the neighbouring
> countries, but then also one each for the 16 German "Länder" (states).
>
> There are probably no 100% answers for the following but maybe someone
> has experimented with these things and has some ideas/insights on the
> best procedures.
>
> I am thinking about this:
>
> 1. Each week, get full planet file and do a bounding box extract for
> Europe.
>
> 2. Daily, apply the diff to this file. This will add a little non-Europe data
> each day but I can ignore that.
>
You can run the result through another bounding box task to remove any
unwanted data if necessary.
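Steps 2 and 3 might then look something like the sketch below. All file names and the bounding box coordinates are made up, and the pipes are named explicitly so the connections are unambiguous; which of --ac's two input pipes takes the base stream and which takes the change stream is worth verifying against the task documentation.

osmosis --rx file=europe.osm outPipe.0=base \
        --rxc file=daily.osc outPipe.0=diff \
        --ac inPipe.0=base inPipe.1=diff outPipe.0=patched \
        --bb left=-12 right=35 bottom=34 top=72 inPipe.0=patched outPipe.0=cropped \
        --wx file=europe-new.osm inPipe.0=cropped

The --bb task here re-crops the patched data so the small amount of non-Europe data introduced by each diff doesn't accumulate.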
> 3. Daily, split out desired polygons from patched planet file.
>
> Assuming for a moment that I'd not use a database; would it be
> possible (and sensible) to use the "tee" task in Osmosis to branch
> from one --read-xml into 20 --border-polygon/--write-xml tasks, so
> that all areas I am interested in get cut out in one go? I guess I
> would have to create a very long command line naming all the output
> pipes of the tee and assigning them to each of the --border-polygon
> tasks, right?
>
Yes, that's right. Lambertus is doing something similar, producing
around 200 bounding boxes in a single osmosis invocation. That is many
times faster than invoking osmosis once for each area.
> Can I make use of the default pipe connection between the
> border-polygon and write-xml if I pair them, e.g.
>
> osmosis
> --rx ... (use default pipe) --tee (name 20 output pipes)
> --bp (name 1 input pipe but use default pipe on output) --wx
> --bp (name another input pipe, again use default pipe on out) --wx
> ...
>
This won't work currently. The outputs of each task get added to a
queue, so the default allocation works on a FIFO basis. Ages ago kleptog
suggested using a stack (ie. FILO) approach instead, which would make
default connectivity work in more cases. A stack would change the
existing behaviour, so I've held back from making the change, but I
think I need to do it one of these days.
If you want to use default pipe connection, you'll have to do something
like this:
osmosis
--rx
--tee 20
--bp file=poly01.txt
--bp file=poly02.txt
...
--bp file=poly20.txt
--wx out01.osm
--wx out02.osm
...
--wx out20.osm
Note that paired tasks are not next to each other on the command line:
the tee registers 20 outputs, the --bp tasks read each of those and
produce another 20 outputs, and the --wx tasks then pick each of those
up and write them to file.
You have to be careful to make sure they're lined up correctly, but it
should be possible.
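If keeping that implicit ordering straight gets error-prone, the connections can instead be spelled out with the inPipe.N/outPipe.N arguments that every task accepts. A two-area sketch (pipe names and file names are invented):

osmosis --rx file=planet.osm outPipe.0=all \
        --tee 2 inPipe.0=all outPipe.0=t1 outPipe.1=t2 \
        --bp file=poly01.txt inPipe.0=t1 outPipe.0=area1 \
        --bp file=poly02.txt inPipe.0=t2 outPipe.0=area2 \
        --wx file=out01.osm inPipe.0=area1 \
        --wx file=out02.osm inPipe.0=area2

With explicit names the tasks can appear in any order on the command line, since nothing depends on the FIFO queue.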
> In a case like the one sketched initially - select 5 neighbouring
> regions plus 16 regions inside one of the selected - could I even
> construct a command line that would do something like this:
>
> read xml -> tee into 5 pipes, each with own bounding polygon, four
> of them written directly to file, fifth tees into 16 pipes, one of
> which is written directly to file, the others are again used as
> bounding polygon inputs and then written
>
That sounds okay.
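A sketch of that nested layout, using explicit pipe names to keep the two tee levels apart. All file and pipe names are invented, and only one neighbouring country and one German Land are shown; the real command would repeat the pattern for the remaining regions.

osmosis --rx file=europe.osm outPipe.0=eu \
        --tee 2 inPipe.0=eu outPipe.0=p1 outPipe.1=p2 \
        --bp file=france.poly inPipe.0=p1 outPipe.0=fr \
        --wx file=france.osm inPipe.0=fr \
        --bp file=germany.poly inPipe.0=p2 outPipe.0=de \
        --tee 2 inPipe.0=de outPipe.0=de1 outPipe.1=de2 \
        --wx file=germany.osm inPipe.0=de1 \
        --bp file=bayern.poly inPipe.0=de2 outPipe.0=by \
        --wx file=bayern.osm inPipe.0=by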
> Could I expect better performance by using a Mysql database which is
> reset every week and has the diffs applied to it? When extracting
> polygons from the Mysql instead of from an XML file, would the same
> "tee" strategy make sense or would Mysql reading be fast enough to
> just extract every polygon sequentially? I see that the read-mysql
> task has no option to make use of Mysql indexes for selecting bounding
> boxes, so I assume it would always feed the full data set into the
> select_polygon which might be less than ideal. Maybe I could trick
> Osmosis into operating on a "view" of a node table that contains only
> nodes in a certain lat/lon range?
>
With a decent machine, I think you'll find file-based performance more
than adequate. A file will be far quicker than using MySQL as raw data
storage without any indexing. If bounding boxes could be extracted
directly from the db it might be quicker, but it's not something I've
tried. A new mysql task for reading bounding boxes directly from a db
could be very useful; patches welcome of course :-)
> The whole thing is to be run on a quad-core machine with 8 GB and
> reasonably fast disk arrays.
>
One thing to note is that --bb and --bp both accept an idTrackerType
argument. The default value of IdList is best for small'ish areas
(approximately less than 1/32 of the entire planet), but for huge areas
(Europe might be an example of this) it may be worth specifying
idTrackerType=BitSet. The only way to know is to test it out.
IdList consumes 4 bytes of memory per node id inside the area (ways and
relations are not very relevant), while BitSet consumes
maximum_node_id_in_input_source/8 bytes of memory regardless of how many
nodes fall inside the area. IdList is far more efficient for areas
containing small numbers of nodes; BitSet is more efficient for areas
containing a large percentage of the overall planet.
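As a rough back-of-envelope comparison, suppose (hypothetically) an extract containing 50 million node ids from an input whose maximum node id is 300 million:

```shell
# IdList: 4 bytes per node id inside the area.
echo $(( 50000000 * 4 ))    # 200000000 bytes, i.e. ~190 MB

# BitSet: one bit per possible node id, i.e. max_node_id / 8 bytes,
# regardless of how many ids actually fall inside the area.
echo $(( 300000000 / 8 ))   # 37500000 bytes, i.e. ~36 MB
```

Setting 4n equal to max_node_id/8 shows BitSet wins once the area holds more than max_node_id/32 of the ids, which is where the 1/32-of-the-planet rule of thumb above comes from.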
Brett