[OSM-dev] [OSM-talk] Osmosis error with multiple bouding boxes

Fri Oct 26 00:59:09 BST 2007

On 10/25/07, Brett Henderson <brett at bretth.com> wrote:
> Karl Newman wrote:
> > Crossposting to dev
> >> Osmosis needs some work with regard to tee and many outputs still, because
> >> it's quitting here with 100+...on out of heap space errors (The server has
> >> 4GB), with 50 pipes it worked. But other than that it works many times
> >> faster than processing the planet all over for each tile. So I'm happy with
> >> it.
> >>
> >
> > Maybe the task could be tweaked with an option to buffer all the data
> > into a file, then send it on to each pipe sequentially. You could also
> > set up a hierarchical/recursive series of Osmosis calls (sort of lame,
> > I know).
> >
> Not sure what you mean by the hierarchical/recursive call bit ...
>

Oh, it was a dumb idea. I meant call Osmosis in a shell script or
something a number of times with progressively smaller tiles. The
first run breaks the planet into a number of tiles, then for each of
those tiles, it's called again to further shrink it. I don't think
that's necessary, though.

> Firstly, as Jon mentioned the task memory limit may need to be
> increased.  If that solves the problem we're done :-)
>
> Otherwise ...
>
> Firstly a bit of background.  Most osmosis tasks maintain no state so
> have very small RAM requirements.  The --bounding-box and
> --bounding-polygon tasks are an exception because they maintain a BitSet
> for storing all the nodes, ways and segments that have been passed to
> the output.  To be honest I haven't checked to see how much RAM this
> uses but so far it hasn't been an issue for me.  Presumably the RAM
> usage is something like (NUM_ENTITIES / 8) bytes.  Each bit will take
> 1/8 of a byte which isn't much but will add up with the ids in the osm
> db.  It will also obviously add up when multiple bounding box tasks are
> being used at the same time.  The usage is probably proportional to the
> highest id for each type regardless of whether the bit is used.  If the
> highest id is 100,000,000 but only 50,000,000 are used, 100,000,000 bits
> will still be required.  I don't know if the java implementation is
> smart enough to store cleared bits sparsely but I doubt it.
>
> A couple of solutions might be:
> 1. As you've already suggested, store all data to a temp file then
> forward to each output sequentially.  My main concern is that this will
> probably be slower than re-reading the planet for each bounding box.
> I've improved performance of temp files recently but they're still not
> blazingly fast.

Hmm... I guess the XML parsers are pretty good/speedy; no real
advantage to my idea.

> 2. Persist the BitSet.  There's already an IndexStore which is basically
> a way of storing and loading long values indexed by a long.  This could
> be used to write out bits to disk and reload randomly.  I think this
> will be much quicker than loading complete objects from disk but it will
> be a significant performance hit on the current in-memory implementation.
>
> I suspect that the fastest approach will be to work around it by simply
> running as many bounding box tasks as will fit in RAM, and invoking
> osmosis several times if many bounding boxes are required re-reading the
> planet for each invocation.  Or perhaps a variant on Option 1 might be
> appropriate where the --tee task stores all input data then forwards it
> to several output tasks at a time but not all at once to avoid reading
> from the store more times than necessary.  I'll call this Option 3.
>
> 3. Store all input data.  Forward to a configurable number of outputs at
> a time.  For example, if 150 bboxes are required, store all input data,
> then forward to 50 bboxes at a time.

Yes, that might be better. I'm thinking about this in terms of a
"tiling" task. The general structure I had in mind was to read in all
the nodes and split them out into a series of certain-size tiles (I
was hoping for 4 degrees by 4 degrees, but maybe it'll have to be 2x2
or 1x1 depending on memory). It would hold back a copy of all the
nodes in a random-access store, then read the ways and send them to
the appropriate tile along with the nodes from the random access store
that are needed to make the way complete. (This might need some
tweaking, else you'd get nodes after ways in the file, maybe could be
fixed by sort?). Relations would be similar, but that needs a bit more
thought.

I don't know. This is generally how I did the "completeWays" option to
the area filter. Any idea how this could be modularized or pipelined?

>
> If anybody wishes to experiment, the classes in question are:
> BigBitSet - A wrapper around java BitSet allowing the use of long
> indexes (currently just casts to an int but will be extended when
> necessary) and negative numbers.
> AreaFilter - The base class for --bounding-box (BoundingBoxFilter) and
> --bounding-polygon (PolygonFilter) tasks.  It is the one that utilises
> BigBitSet.
> EntityTee - The task implementation for "tee-ing" input data to multiple
> destinations.
> ChangeTee - The same as EntityTee but for change streams.  Probably not
> applicable here.
> IndexStore - Can be used for random access long data type persistence.
> SimpleObjectStore - Useful for simple store and forward solutions.
>
> If somebody wishes to improve temp file performance, the main classes to
> look at all classes in the com.bretth.osmosis.core.store package.
> StoreWriter, StoreReader, ObjectWriter, ObjectReader are some of the
> classes used by all object store implementations.  There may be some
> performance gains possible.
>
> Cheers,
> Brett
>
> _______________________________________________
> dev mailing list
> dev at openstreetmap.org
> http://lists.openstreetmap.org/cgi-bin/mailman/listinfo/dev
>