Hi Igor,<br><br>You're describing a common problem with the Osmosis pipeline.  Many scenarios would be improved if you could access the data stream twice rather than having to buffer all data within the task itself.<br>

<br>Your solution of requiring two input streams is quite practical.  I've often been tempted to do something similar.  The main issue is that it relies on the user passing an identical input stream twice, which is somewhat error-prone and ... well, just feels messy.  Without a good solution I've just avoided solving the problem at all :-)<br>

<br>I've dabbled with the idea of allowing pipeline restarts.  The only way I can think of to do it without requiring a more complicated task API is to allow tasks to throw a special type of exception (e.g. RequestStreamRestartException), which reader tasks (e.g. --read-xml) could catch and treat as a request to begin reading from the start of the stream again.  Reader tasks not supporting the new exception would simply abort as per normal exception handling.  This approach has the benefit that end-users could construct pipelines without considering whether data is read once or twice.  The downside is a bunch of added complexity, though.  I haven't had time to investigate it properly.<br>
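
<br>A rough sketch of how such a restart could look, purely as an illustration.  Everything in it is hypothetical: RequestStreamRestartException does not exist in Osmosis, and ItemSink, RestartableReader and CountThenLabel are simplified stand-ins rather than the real task API.  It also assumes the reader and the task run in the same thread, so that the exception propagates up the call stack to the reader.<br>

```java
/** Hypothetical exception from the text above; it is not part of Osmosis. */
final class RequestStreamRestartException extends RuntimeException {
}

/** Hypothetical, simplified sink: receives items, then an end-of-stream signal. */
interface ItemSink {
    void process(String item);

    void complete();
}

/** A reader that supports restarts by reading its source again from the start. */
final class RestartableReader {
    private final String[] source; // stand-in for a file that can be read again

    RestartableReader(String... source) {
        this.source = source;
    }

    /** Pushes the source into the sink and returns how many times it was read. */
    int run(ItemSink sink) {
        int passes = 0;
        while (true) {
            passes++;
            try {
                for (String item : source) {
                    sink.process(item);
                }
                sink.complete();
                return passes;
            } catch (RequestStreamRestartException e) {
                // The downstream task asked for the data again: loop and re-read.
            }
        }
    }
}

/** Example two-pass task: counts items first, then labels each item with the total. */
final class CountThenLabel implements ItemSink {
    private final StringBuilder out = new StringBuilder();
    private boolean firstPass = true;
    private int total;

    public void process(String item) {
        if (firstPass) {
            total++;
        } else {
            out.append(item).append('/').append(total).append(' ');
        }
    }

    public void complete() {
        if (firstPass) {
            firstPass = false;
            throw new RequestStreamRestartException(); // request a second pass
        }
    }

    String output() {
        return out.toString().trim();
    }

    public static void main(String[] args) {
        RestartableReader reader = new RestartableReader("a", "b", "c");
        CountThenLabel task = new CountThenLabel();
        System.out.println(reader.run(task)); // 2
        System.out.println(task.output());    // a/3 b/3 c/3
    }
}
```

<br>A reader without the catch block would let the exception escape and abort the pipeline, which matches the fallback behaviour described above.<br>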

<br>I'm a little hesitant to see this type of workaround become the norm, but I don't have a better option right now.  Given that your tasks are independent of existing tasks I don't have any objection to them being added.<br>

<br>Cheers,<br>Brett<br><br><div class="gmail_quote">On Thu, May 12, 2011 at 5:29 AM, Igor Podolskiy <span dir="ltr">&lt;<a href="mailto:igor.podolskiy@vwi-stuttgart.de">igor.podolskiy@vwi-stuttgart.de</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">Hi Osmosis developers,<br>

<br>

recently I found myself filtering large extracts of data with fairly restrictive tag-based filters. Imagine a task like "get all administrative boundaries in the state of Baden-Württemberg, Germany". It was taking a long time, lots of CPU, and lots of I/O. And I asked myself what it was doing all that time, actually.<br>


<br>

I had a pipeline like [1] (OSMembrane screenshot). Just the old-fashioned filter for ways and relations with a merge at the end (in the screenshot, top row is ways, bottom row is relations).<br>

<br>

Turns out that the main culprit is the --used-node task. As you surely know, it works like this:<br>

<br>

1. Store all incoming nodes, ways and relations in a "simple object store".<br>

2. While doing so, record all node references.<br>

3. Replay the simple object store to the output, filtering out unneeded nodes.<br>

<br>

Basically, in a workflow like<br>

<br>

   read from disk -> filter ways -> get used nodes -> write<br>

<br>

you write _everything_ you got from disk back to disk and then read it back again in --used-node. More than that, you can only control which filesystem this intermediate write goes to by setting the java.io.tmpdir property, which isn't really intuitive. And you spend CPU time compressing and decompressing the intermediate store.<br>


<br>

So, in actual numbers for boundaries in Baden-Württemberg (128 MB PBF as of today) this workflow boils down to "read 128 MB from disk, write ~180 MB gzipped serialized objects to disk, read ~180 MB from disk, write 2 MB PBF to disk." In the example pipeline shown above, those ~180 MB of gzipped serialized objects get written and read _twice_ because of two --used-node tasks.<br>


<br>

This seemed, well, a little wasteful to me. You should only pay for what you are getting (the 2 MB), not for everything there is (the 128 MB), and surely not twice :) So I thought up an approach which avoids intermediate stores.<br>


<br>

It involves a task that takes two input streams and produces a single one. It works like this:<br>

<br>

1. Read everything from the first stream and ignore all nodes, record the node ids required by the ways and relations in an id tracker just like --used-node, and pass the ways and relations downstream immediately.<br>

<br>

2. Read everything from the second stream, ignore all ways and relations and only pass the required nodes (based on the id tracker) downstream.<br>
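
<br>A minimal sketch of these two steps, for illustration only. Entity, EntitySink and EntitySource are hypothetical, simplified stand-ins and not the Osmosis task API; ids are plain ints so that a BitSet can play the role of the id tracker (real OSM ids are 64-bit).<br>

```java
import java.util.BitSet;

/** Hypothetical, simplified entity; not the Osmosis Entity class. */
final class Entity {
    final String type;     // "node", "way" or "relation"
    final int id;          // int keeps the BitSet-based tracker simple
    final int[] nodeRefs;  // node ids referenced by a way or relation (empty for nodes)

    Entity(String type, int id, int... nodeRefs) {
        this.type = type;
        this.id = id;
        this.nodeRefs = nodeRefs;
    }
}

/** Hypothetical downstream receiver. */
interface EntitySink {
    void process(Entity entity);
}

/** Hypothetical input stream that pushes all of its entities into a sink. */
interface EntitySource {
    void readInto(EntitySink sink);
}

final class TwoPassUsedNode {
    /** Both sources must deliver the same data, e.g. two reads of one file. */
    static void run(EntitySource first, EntitySource second, EntitySink out) {
        BitSet requiredNodes = new BitSet(); // stand-in for the id tracker

        // Step 1: ignore nodes, record referenced node ids, forward ways/relations at once.
        first.readInto(entity -> {
            if (entity.type.equals("node")) {
                return;
            }
            for (int ref : entity.nodeRefs) {
                requiredNodes.set(ref);
            }
            out.process(entity);
        });

        // Step 2: ignore ways and relations, forward only the nodes recorded in step 1.
        second.readInto(entity -> {
            if (entity.type.equals("node")) {
                if (requiredNodes.get(entity.id)) {
                    out.process(entity);
                }
            }
        });
    }

    public static void main(String[] args) {
        Entity[] data = {
            new Entity("node", 1), new Entity("node", 2), new Entity("node", 3),
            new Entity("way", 10, 1, 3)
        };
        EntitySource source = sink -> {
            for (Entity entity : data) {
                sink.process(entity);
            }
        };
        StringBuilder result = new StringBuilder();
        run(source, source, entity ->
            result.append(entity.type).append(':').append(entity.id).append(' '));
        System.out.println(result.toString().trim()); // way:10 node:1 node:3
    }
}
```

<br>Nothing is buffered between the two steps; the only state kept is the set of required node ids. The output order is ways and relations first, then nodes.<br>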

<br>

It's a bit like a merge with an id tracker.<br>

<br>

In terms of the complete workflow, it involves reading the source file from disk twice; the pipeline equivalent to [1] is shown in [2] (another OSMembrane screenshot).<br>

<br>

I implemented this task (named --fast-used-node, better name needed ;)) and made a couple of measurements for the example I mentioned above (admin boundaries in Baden-Württemberg).<br>

<br>

Pipeline [1] with --used-node: ~312 ± 10 seconds<br>

Pipeline [2] with --fast-used-node: ~140 ± 10 seconds<br>

<br>

A simple pipeline like read -> filter ways -> used nodes -> write takes about 140 seconds with --used-node; the --fast-used-node version takes ~90 seconds in total.<br>

<br>

All numbers were measured on a Pentium Dual-Core E5300 2.6 GHz, Win7 Pro 32-bit, vanilla SATA disk, default 64 MB heap size (irrelevant for this task). Both approaches seem to be CPU-bound (so the compression/decompression is more of a problem than the I/O in and of itself).<br>


<br>

Of course, everything has a price. First, you effectively need to read the source file twice from disk; just splitting up a stream and buffering it isn't enough, as all buffers will eventually fill up and everything will come to a halt. That assumes that the source stream can be read twice in the first place, so network sources or stdin won't work, at least for now. Also, the pipelines get more complex, and the whole principle is a bit harder to understand than the straightforward one.<br>


<br>

And finally, it changes the sorting order to "ways/relations, then nodes". I don't know whether this is a big problem.<br>

<br>

Anyway, for my use case, it works[TM], and I figured that this use case (restrictive tag-based filtering of large source datasets stored on disk) would be quite common.<br>

<br>

So what do you think - do we/you want that patch with --fast-used-node (and probably a similar one for --fast-used-way)? :) Or is it too special? Or not worthwhile for some other reason?<br>

<br>

I would be really glad to hear your feedback, if you can spare some of your time for it.<br>

<br>

In the hope this will help someone with something,<br>

Best regards,<br>

Igor<br>

<br>

[1] <a href="http://i.imgur.com/beqT6.png" target="_blank">http://i.imgur.com/beqT6.png</a><br>

[2] <a href="http://i.imgur.com/nV3kL.png" target="_blank">http://i.imgur.com/nV3kL.png</a><br>

<br>

_______________________________________________<br>

osmosis-dev mailing list<br>

<a href="mailto:osmosis-dev@openstreetmap.org" target="_blank">osmosis-dev@openstreetmap.org</a><br>

<a href="http://lists.openstreetmap.org/listinfo/osmosis-dev" target="_blank">http://lists.openstreetmap.org/listinfo/osmosis-dev</a><br>

</blockquote></div><br>