[osmosis-dev] --used-node performance and a possible way to improve it
Igor Podolskiy
igor.podolskiy at vwi-stuttgart.de
Fri Jun 3 21:04:40 BST 2011
Hi Brett, hi @osmosis-dev,
> You're describing a common problem with the Osmosis pipeline. Many
> scenarios would be improved if you could access the data stream twice
> rather than having to buffer all data within the task itself.
good to know I'm not alone ;)
In the meantime I've been working quite a lot with Osmosis, with the
standard tasks, and with my custom ones, and I've been following the
discussions on this list, especially your posts about metadata... so now
I think I have some clue about what actually lies behind these problems
and solutions, mine included. In a way, this is also a reply to your
recent post about metadata. Feel free to skip my ramblings :)
Here's the TL;DR version: my --fast-used-* is really just a workaround
and should stay a plugin, outside the main tree. What I think we need is
a much more generic solution, such as a second type of communication
channel between the tasks for flow control information.
Full version:
There seem to be many use cases for Osmosis which more or less blow up
the current streaming principle, --used-* being only one of them. That's
why we keep talking on this list about additional metadata, the
completeWays performance of --bbox, the performance of --used-node, and
RestartExceptions.
Even the <bound> handling is actually wrong in many places, and it can't
be made right because of the stream ordering requirements. I somewhat
fixed it in --merge, but it's more or less unfixable in, say,
--apply-change without caching the whole stream, which leads straight to
the --used-node problem (lots of wasted CPU, I/O and disk space).
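To make that waste concrete, here is a minimal sketch of the pattern a
--used-node style task is forced into under the current single-pass
model. The class and method names are illustrative stand-ins, not the
actual Osmosis code, and the in-memory list stands in for the temporary
disk store the real task has to use:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Illustrative only: because nodes arrive before the ways that
    // reference them, every node must be buffered until the whole
    // stream has passed.
    class UsedNodeFilterSketch {

        private final List<Long> bufferedNodeIds = new ArrayList<Long>();
        private final Set<Long> referencedNodeIds = new HashSet<Long>();

        void processNode(long nodeId) {
            // We cannot decide yet whether this node is needed, so buffer it.
            // (The real task has to keep the full node data, on disk.)
            bufferedNodeIds.add(nodeId);
        }

        void processWay(long wayId, long[] nodeRefs) {
            // Only now, long after the nodes went by, do we learn which
            // of them actually matter.
            for (long ref : nodeRefs) {
                referencedNodeIds.add(ref);
            }
            // Ways would have to be buffered too, to keep nodes before
            // ways in the output; omitted here for brevity.
        }

        void complete() {
            // Effectively a second pass, but over our own cached copy
            // of the stream.
            for (Long nodeId : bufferedNodeIds) {
                if (referencedNodeIds.contains(nodeId)) {
                    emitNode(nodeId);
                }
            }
        }

        private void emitNode(long nodeId) {
            System.out.println("keep node " + nodeId); // stands in for passing downstream
        }
    }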
As I used Osmosis over the last couple of weeks, I often found myself
thinking: can I do X with Osmosis? Can I optimize step Y of the
pipeline? And the answer is almost always: no, it requires replaying the
stream. No, it requires a bit of information from another task up or
down the pipeline. No, it requires some coordination or synchronization
between multiple tasks.
In short, and I think that's the core problem: the answer is no because
_the tasks do not know enough about each other_ both before and during
processing.
And as tempting as metadata embedded in the data stream (like you
implemented in your last commits) or a workaround like my
--fast-used-node might be, in my very humble and uninformed opinion both
treat the symptoms, not this core problem.
What we have now is very similar to the plain old analog telephone
network: we transmit both data and control information over a single
channel. And there's a reason the telecom networks switched to
out-of-band signaling, with separate channels for data and for metadata
like call setup: it is more flexible. Before that, you had crazy
workaround stuff like "hook flash", where you signaled information by
actually interrupting the channel with a particular timing. Maybe that's
just me, but throwing an exception up the pipeline is in a way very much
like a hook flash :)
As to the RestartStreamException in particular, I have a feeling that it
isn't really going to work well with --tee, --buffer, --merge,
--apply-change and similar tasks. And even if it did, an exception
seriously messes up your control flow. Once you throw it, you cannot
really make any assumptions about what state you're currently in. And
even if you work around that, the code is going to be _really_ messy.
So I think what we need is a full-blown "control plane" for
communication between tasks. Like in telecommunications, it should be
orthogonal to the data streams; otherwise you will always have the
problem that the bit of information you need is in the wrong place in
the stream. We see this with <bound>: you need to write it out at the
start of the stream, but in the current model you only know it at the
end. That's why you would need to cache a GB-sized stream, which seems
pretty wasteful for 4 doubles.
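To give a rough idea of what I mean, here is a sketch of such a channel.
Everything in it is hypothetical - none of these names exist in Osmosis
today - and it only shows the principle: control information travels on
its own channel, so a task that happens to know a fact early (say,
--merge, which could derive the merged bound from its two inputs) can
publish it, and a downstream writer can pick it up without the fact
having to occupy a particular slot in the entity stream:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CountDownLatch;

    // Hypothetical sketch of an out-of-band "control plane" shared by
    // the tasks of one pipeline.
    class PipelineControlChannel {

        private final Map<String, Object> facts =
                new ConcurrentHashMap<String, Object>();
        private final Map<String, CountDownLatch> latches =
                new ConcurrentHashMap<String, CountDownLatch>();

        // A task announces a fact (e.g. the bound) as soon as it knows it,
        // independently of where that fact would sit in the data stream.
        void publish(String key, Object value) {
            facts.put(key, value);
            latchFor(key).countDown();
        }

        // Another task blocks until the fact is available.
        Object await(String key) throws InterruptedException {
            latchFor(key).await();
            return facts.get(key);
        }

        private CountDownLatch latchFor(String key) {
            CountDownLatch latch = latches.get(key);
            if (latch == null) {
                CountDownLatch created = new CountDownLatch(1);
                latch = latches.putIfAbsent(key, created);
                if (latch == null) {
                    latch = created;
                }
            }
            return latch;
        }
    }

A writer could then await("bound") before emitting its header while the
entity stream itself flows unchanged. That only helps when some task
upstream actually knows the bound early, of course, but the point is
that the position in the stream is no longer the limiting factor.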
I'd like a way for a task to know what is downstream and upstream of it
in the pipeline. I'd like proper synchronization like locks and latches
and waits between tasks - yes, synchronization is complex and can lead
to deadlocks, but we can also have deadlocks _now_, without the benefits
of synchronization. I'd like a way for one task to be able to say what
it is going to do to the stream with respect to the sort ordering or the
bounding box. And I'd like other tasks to be able to reason about that
in the context of the current pipeline.
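That last wish could look something like the following, again with
purely hypothetical names: each task declares which properties of the
stream it requires and which it guarantees, and the pipeline (or another
task) can check the combination before any data flows:

    import java.util.EnumSet;
    import java.util.Set;

    // Hypothetical sketch: none of these types exist in Osmosis.
    enum StreamProperty {
        SORTED,              // entity ordering (type, then id) is preserved
        BOUND_RESTRICTED,    // output is limited to a declared bounding box
        REFERENCES_COMPLETE  // every referenced node/way is present in the stream
    }

    // A task states what it needs from upstream and what it promises
    // downstream, so an impossible pipeline can be rejected before
    // processing starts.
    interface DeclaringTask {
        Set<StreamProperty> requires();
        Set<StreamProperty> guarantees();
    }

    // Example: a bounding-box style filter needs a sorted stream and
    // promises a sorted, bound-restricted one in return.
    class BoundingBoxFilterSketch implements DeclaringTask {

        public Set<StreamProperty> requires() {
            return EnumSet.of(StreamProperty.SORTED);
        }

        public Set<StreamProperty> guarantees() {
            return EnumSet.of(StreamProperty.SORTED,
                              StreamProperty.BOUND_RESTRICTED);
        }
    }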
Enough I'd-likes. :) If you have a good "control plane", good things
happen, that's it, actually ;) I understand very well that this is going
to make the task API more complex. But I think it would pay off, as you
could use Osmosis for more tasks. Passing metadata in the data stream,
exceptions and --fast-used-node make things more complex as well, and
the payoff is less IMHO. You're right when you say that implementing a
not-good-enough solution is worse than not implementing one at all - so
the question for me is, really: what is "good enough"?
I have some - very basic - thoughts about how that "control plane" could
work. I think a Wiki page with a more thorough description of that
approach would be a better way to communicate it. I'll try to do that.
Or would you prefer to have it here on the list?
As to your hesitation to accept the --fast-used-* workaround: you're
absolutely right, I understand that now. There are issues with those
tasks, and I could and maybe will address some of those issues - but it
will still stay a workaround. Now even I don't think it should go in the
main distribution.
I've packaged --fast-used-* as a plugin and I'm going to make it public
somewhere for those who need it now, like myself. What would be a good
place to do that, BTW?
Thank you for taking the time to read all of this,
Greetings from Stuttgart, Germany,
Igor