[osmosis-dev] --used-node performance and a possible way to improve it

Igor Podolskiy igor.podolskiy at vwi-stuttgart.de
Wed May 11 20:29:53 BST 2011


Hi Osmosis developers,

recently I found myself filtering large extracts of data with fairly 
restrictive tag-based filters. Imagine a task like "get all 
administrative boundaries in the state of Baden-Württemberg, Germany". 
It was taking a long time, with lots of CPU and lots of I/O, and I asked 
myself what it was actually doing all that time.

I had a pipeline like [1] (OSMembrane screenshot). Just the 
old-fashioned filter for ways and relations with a merge at the end (in 
the screenshot, top row is ways, bottom row is relations).

Turns out that the main culprit is the --used-node task. As you surely 
know, it works like this:

1. Store all incoming nodes, ways and relations in a "simple object 
store".
2. While doing so, record all node references.
3. Replay the simple object store to the output, filtering out unneeded 
nodes.
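The three steps above can be sketched in (much simplified) Java. Entity, 
Node and Way below are stand-ins I made up for this sketch, not the real 
Osmosis classes, and the compressed on-disk object store is replaced by 
a plain in-memory list:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of the store-and-replay idea behind --used-node.
// All types are simplified stand-ins, not the real Osmosis classes.
public class UsedNodeSketch {

    interface Entity { }
    record Node(long id) implements Entity { }
    record Way(long id, List<Long> nodeRefs) implements Entity { }

    static List<Entity> filterUsedNodes(List<Entity> input) {
        // Steps 1 + 2: buffer everything (Osmosis spills this to a
        // compressed on-disk store) and record referenced node ids.
        List<Entity> store = new ArrayList<>();
        Set<Long> usedNodeIds = new HashSet<>();
        for (Entity e : input) {
            store.add(e);
            if (e instanceof Way w) {
                usedNodeIds.addAll(w.nodeRefs());
            }
        }
        // Step 3: replay the store, dropping unreferenced nodes.
        List<Entity> output = new ArrayList<>();
        for (Entity e : store) {
            if (e instanceof Node n && !usedNodeIds.contains(n.id())) {
                continue;
            }
            output.add(e);
        }
        return output;
    }

    public static void main(String[] args) {
        List<Entity> in = List.of(
                new Node(1), new Node(2), new Node(3),
                new Way(10, List.of(1L, 3L)));
        // Node 2 is referenced by no way, so it is dropped.
        System.out.println(filterUsedNodes(in).size()); // prints 3
    }
}
```

Note that the whole input has to be buffered before the first entity can 
be replayed, which is exactly where the extra disk round-trip comes from.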

Basically, in a workflow like

    read from disk -> filter ways -> get used nodes -> write

you write _everything_ you got from disk back to disk and then read it 
back again in --used-node. On top of that, you can only control which 
filesystem that second write goes to by setting the java.io.tmpdir 
property, which isn't exactly intuitive. And you spend CPU time 
compressing and decompressing the intermediate store.
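For completeness, redirecting that intermediate store today looks 
something like this (the paths and the filter expression are only 
illustrative; the stock osmosis launcher scripts pass JAVACMD_OPTIONS 
through to the JVM):

```shell
# Example only: point the intermediate object store at another filesystem.
export JAVACMD_OPTIONS="-Djava.io.tmpdir=/mnt/scratch/osmosis-tmp"
osmosis --read-pbf file=extract.osm.pbf \
        --tag-filter accept-ways boundary=administrative \
        --used-node \
        --write-pbf file=boundaries.osm.pbf
```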

So, in actual numbers for boundaries in Baden-Württemberg (128 MB PBF as 
of today) this workflow boils down to "read 128 MB from disk, write ~180 
MB gzipped serialized objects to disk, read ~180 MB from disk, write 2 
MB PBF to disk." In the example pipeline shown above, those ~180 MB of 
gzipped serialized objects get written and read _twice_ because of two 
--used-node tasks.

This seemed, well, a little wasteful to me. You should only pay for what 
you're getting (the 2 MB), not for everything there is (the 128 MB), and 
surely not twice :) So I thought up an approach which avoids the 
intermediate stores.

It involves a task that takes two input streams and produces a single 
one. It works like this:

1. Read everything from the first stream and ignore all nodes, record 
the required ids for ways and relations in an id tracker just like 
--used-node, and pass the ways and relations downstream immediately.

2. Read everything from the second stream, ignore all ways and relations 
and only pass the required nodes (based on the id tracker) downstream.

It's a bit like a merge with an id tracker.
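The two passes can be sketched like this; as in the sketch above, 
Entity, Node and Way are made-up stand-ins for the real Osmosis classes, 
and each "stream" is modelled as a list read front to back:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of the proposed two-input task (--fast-used-node).
// All types are simplified stand-ins, not the real Osmosis classes.
public class FastUsedNodeSketch {

    interface Entity { }
    record Node(long id) implements Entity { }
    record Way(long id, List<Long> nodeRefs) implements Entity { }

    static List<Entity> fastUsedNode(List<Entity> firstPass,
                                     List<Entity> secondPass) {
        List<Entity> output = new ArrayList<>();
        Set<Long> usedNodeIds = new HashSet<>();

        // Pass 1: ignore nodes, record node references, and forward
        // ways (and relations) downstream immediately -- no object store.
        for (Entity e : firstPass) {
            if (e instanceof Node) {
                continue;
            }
            if (e instanceof Way w) {
                usedNodeIds.addAll(w.nodeRefs());
            }
            output.add(e);
        }

        // Pass 2: ignore ways/relations, forward only referenced nodes.
        for (Entity e : secondPass) {
            if (e instanceof Node n && usedNodeIds.contains(n.id())) {
                output.add(n);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // Both passes read the same source, mirroring the file being
        // read from disk twice.
        List<Entity> source = List.of(
                new Node(1), new Node(2), new Node(3),
                new Way(10, List.of(1L, 3L)));
        System.out.println(fastUsedNode(source, source).size()); // prints 3
    }
}
```

Nothing is buffered beyond the id tracker, and the output order falls 
out naturally as "ways/relations, then nodes".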

In terms of the complete workflow, it involves reading the source file 
from disk twice; the pipeline equivalent to [1] is shown in [2] (another 
OSMembrane screenshot).

I implemented this task (named --fast-used-node, better name needed ;)) 
and made a couple of measurements for the example I mentioned above 
(admin boundaries in Baden-Württemberg).

Pipeline [1] with --used-node: ~312+-10 seconds
Pipeline [2] with --fast-used-node: ~140+-10 seconds

A simple pipeline like read -> filter ways -> used-node -> write takes 
about 140 seconds with --used-node; with --fast-used-node the same 
pipeline completes in ~90 seconds.

All numbers on Pentium Dual-Core E5300 2.6 GHz, Win7 Pro 32-Bit, vanilla 
SATA disk, default 64MB heap size (irrelevant for this task). Both 
approaches seem to be CPU-bound (so the compression/decompression is 
more a problem than the IO in and of itself).

Of course, everything has a price. First, you effectively need to read 
the source file from disk twice; just splitting up a stream and 
buffering it isn't enough, because the node pass cannot start until the 
first pass has seen every way and relation, so the buffers would have to 
hold essentially the whole file, eventually fill up, and bring 
everything to a halt. This also assumes that the source stream can be 
read twice in the first place, so network sources or stdin won't work, 
at least for now. Also, the pipelines get more complex, and the whole 
principle is a bit harder to understand than the straightforward one.

And finally, it changes the output order to "ways/relations, then 
nodes" - I don't know whether that's a big problem.

Anyway, for my use case, it works[TM], and I figured that this use case 
- restrictive tag-based filtering of big source file-on-disk datasets - 
would be quite common.

So what do you think - do we/you want that patch with --fast-used-node 
(and probably a similar one for --fast-used-way)? :) Or is it too 
special? Or not worthwhile for some other reason?

I would be really glad to hear your feedback, if you can spare some of 
your time for it.

In the hope this will help someone with something,
Best regards,
Igor

[1] http://i.imgur.com/beqT6.png
[2] http://i.imgur.com/nV3kL.png


