[osmosis-dev] --used-node performance and a possible way to improve it
Igor Podolskiy
igor.podolskiy at vwi-stuttgart.de
Wed May 11 20:29:53 BST 2011
Hi Osmosis developers,
recently I found myself filtering large extracts of data with fairly
restrictive tag-based filters. Imagine a task like "get all
administrative boundaries in the state of Baden-Württemberg, Germany".
It was taking a long time, lots of CPU, and lots of I/O. And I asked
myself what it was doing all that time, actually.
I had a pipeline like [1] (OSMembrane screenshot). Just the
old-fashioned filter for ways and relations with a merge at the end (in
the screenshot, top row is ways, bottom row is relations).
Turns out that the main culprit is the --used-node task. As you surely
know, it works like this (rough sketch below):
1. Store all incoming nodes, ways and relations in a "simple object
store".
2. While doing so, record all node references in an id tracker.
3. Replay the simple object store to the output, filtering out the
unneeded nodes.
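In (very rough) Java, the idea is something like this - note that the
types here are simplified stand-ins I made up for illustration, not the
actual Osmosis classes:

import java.util.*;

// Toy stand-ins for the real entity types and the downstream sink.
interface Entity { long getId(); }
interface Sink { void process(Entity e); void complete(); }
class Node implements Entity {
    final long id;
    Node(long id) { this.id = id; }
    public long getId() { return id; }
}
class Way implements Entity {
    final long id;
    final List<Long> nodeIds;
    Way(long id, List<Long> nodeIds) { this.id = id; this.nodeIds = nodeIds; }
    public long getId() { return id; }
    List<Long> getNodeIds() { return nodeIds; }
}

class UsedNodeSketch {
    // the "simple object store"; the real task spills this to a gzipped
    // temp file instead of keeping it in memory
    private final List<Entity> store = new ArrayList<Entity>();
    // the "id tracker": ids of all referenced nodes
    private final Set<Long> requiredNodeIds = new HashSet<Long>();

    // steps 1+2: buffer every entity and record node references
    void process(Entity entity) {
        store.add(entity);
        if (entity instanceof Way) {
            requiredNodeIds.addAll(((Way) entity).getNodeIds());
        }
    }

    // step 3: replay the store, dropping nodes nobody references
    void complete(Sink sink) {
        for (Entity entity : store) {
            if (entity instanceof Node
                    && !requiredNodeIds.contains(entity.getId())) {
                continue;
            }
            sink.process(entity);
        }
        sink.complete();
    }
}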
Basically, in a workflow like
read from disk -> filter ways -> get used nodes -> write
you write _everything_ you got from disk back to disk and then read it
back again in --used-node. More than that, the only way to control which
filesystem that second write goes to is the java.io.tmpdir property,
which isn't really intuitive. And you spend CPU time compressing and
decompressing the intermediate store.
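Just to make the tmpdir point concrete, the intermediate store boils
down to something along these lines (again only an illustration with
made-up names, not the actual Osmosis store code):

import java.io.*;
import java.util.zip.GZIPOutputStream;

// Every entity gets serialized, gzipped and written to a temp file, and
// read back later. File.createTempFile() honours -Djava.io.tmpdir=...,
// which is the only knob you have for where this extra I/O lands.
public class TempStoreSketch {
    public static void main(String[] args) throws IOException {
        File spill = File.createTempFile("osmosis-store", ".gz");
        spill.deleteOnExit();
        ObjectOutputStream out = new ObjectOutputStream(
                new GZIPOutputStream(new FileOutputStream(spill)));
        try {
            out.writeObject("imagine ~180 MB of serialized entities here");
        } finally {
            out.close();
        }
        System.out.println("intermediate store at: " + spill.getAbsolutePath());
    }
}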
So, in actual numbers for the boundaries in Baden-Württemberg (a 128 MB
PBF as of today), this workflow boils down to "read 128 MB from disk,
write ~180 MB of gzipped serialized objects to disk, read those ~180 MB
back from disk, write 2 MB of PBF to disk". In the example pipeline
shown above, those ~180 MB of gzipped serialized objects even get
written and read _twice_ because there are two --used-node tasks.
This seemed, well, a little wasteful to me. You should only pay for what
you get (the 2 MB), not for everything there is (the 128 MB), and surely
not twice :) So I thought up an approach that avoids the intermediate
stores altogether.
It involves a task that takes two input streams and produces a single
one. It works like this (rough sketch below):
1. Read everything from the first stream: ignore all nodes, record the
node ids required by the ways and relations in an id tracker just like
--used-node does, and pass the ways and relations downstream immediately.
2. Read everything from the second stream: ignore all ways and relations
and only pass the required nodes (based on the id tracker) downstream.
It's a bit like a merge with an id tracker.
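In the same toy Java as above (reusing the made-up Entity/Node/Way/Sink
types from the first sketch, so again not the real Osmosis API), the
core of it would look roughly like this:

import java.util.*;

class FastUsedNodeSketch {
    // the "id tracker": node ids referenced by ways and relations
    private final Set<Long> requiredNodeIds = new HashSet<Long>();

    // Pass 1: drop nodes, remember node references, forward ways and
    // relations immediately - nothing is buffered or written to disk.
    void processFirstStream(Iterator<Entity> firstStream, Sink sink) {
        while (firstStream.hasNext()) {
            Entity entity = firstStream.next();
            if (entity instanceof Node) {
                continue;
            }
            if (entity instanceof Way) {
                requiredNodeIds.addAll(((Way) entity).getNodeIds());
            }
            // (node members of relations would be recorded the same way)
            sink.process(entity);
        }
    }

    // Pass 2: drop ways and relations, forward only the required nodes.
    void processSecondStream(Iterator<Entity> secondStream, Sink sink) {
        while (secondStream.hasNext()) {
            Entity entity = secondStream.next();
            if (entity instanceof Node
                    && requiredNodeIds.contains(entity.getId())) {
                sink.process(entity);
            }
        }
        sink.complete();
    }
}

The only state kept between the two passes is the id tracker, which is
exactly why no intermediate store is needed.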
In terms of the complete workflow, it involves reading the source file
from disk twice; the pipeline equivalent to [1] is shown in [2] (another
OSMembrane screenshot).
I implemented this task (named --fast-used-node, better name needed ;))
and made a couple of measurements for the example I mentioned above
(admin boundaries in Baden-Württemberg).
Pipeline [1] with --used-node: ~312+-10 seconds
Pipeline [2] with --fast-used-node: ~140+-10 seconds
A simple linear pipeline (read -> filter ways -> used nodes -> write)
takes about 140 seconds with --used-node and ~90 seconds in total with
--fast-used-node.
All numbers were taken on a Pentium Dual-Core E5300 2.6 GHz, Win7 Pro
32-bit, a vanilla SATA disk, and the default 64 MB heap size (irrelevant
for this task). Both approaches seem to be CPU-bound (so the
compression/decompression is more of a problem than the I/O in and of
itself).
Of course, everything has a price. First, you effectively need to read
the source file from disk twice; just splitting up a single stream and
buffering it isn't enough, because the node stream cannot be consumed
before the first stream has been read completely, so all buffers will
eventually fill up and everything will come to a halt. This also assumes
that the source can be read twice in the first place, so network sources
or stdin won't work, at least for now. Also, the pipelines get more
complex, and the whole principle is a bit harder to understand than the
straightforward one. And finally, it changes the output order to
"ways/relations first, then nodes" - I don't know whether that is a big
problem.
Anyway, for my use case it works[TM], and I figured that this use case -
restrictive tag-based filtering of a big on-disk source file - would be
quite common.
So what do you think - do we/you want that patch with --fast-used-node
(and probably a similar one for --fast-used-way)? :) Or is it too
special? Or not worthwhile for some other reason?
I would be really glad to hear your feedback, if you can spare some of
your time for it.
In the hope this will help someone with something,
Best regards,
Igor
[1] http://i.imgur.com/beqT6.png
[2] http://i.imgur.com/nV3kL.png