[osmosis-dev] --used-node performance and a possible way to improve it
Igor Podolskiy
igor.podolskiy at vwi-stuttgart.de
Fri Jun 3 21:04:40 BST 2011
Hi Brett, hi @osmosis-dev,
> You're describing a common problem with the Osmosis pipeline. Many
> scenarios would be improved if you could access the data stream twice
> rather than having to buffer all data within the task itself.
good to know I'm not alone ;)
In the meantime I've been working quite a lot with Osmosis, with the
standard tasks, and with my custom ones, and I've been following the
discussions on this list, especially your posts about metadata... so now
I think I have some clue about what actually lies behind these problems
and solutions, mine included. In a way, this is also a reply to your
recent post about metadata. Feel free to skip my ramblings :)
Here's the TL;DR version: my --fast-used-* is really just a workaround
and should stay a plugin, outside the main tree. What I think we need is
a much more generic solution, such as a second type of communication
channel between the tasks for flow control information.
Full version:
There seem to be many use cases for Osmosis which more or less blow up
the current streaming principle, --used-* being only one of them. That's
why we keep talking on this list about additional metadata, the
completeWays performance of --bbox, the performance of --used-node, and
RestartExceptions.
Even the <bound> handling is actually wrong in many places, and it can't
be made right because of the stream ordering requirements. I somewhat
fixed it in --merge, but it's more or less unfixable in, say,
--apply-change without caching the whole stream, which leads straight to
the --used-node problem (lots of wasted CPU, I/O and disk space).
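To make that waste concrete, here is a minimal sketch of the pattern a
--used-node style task is forced into under the current single-pass
model. The class and method names are illustrative stand-ins, not the
actual Osmosis code, and the in-memory list stands in for the temporary
disk store the real task has to use:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Illustrative only: because nodes arrive before the ways that
    // reference them, every node must be buffered until the whole
    // stream has passed.
    class UsedNodeFilterSketch {

        private final List<Long> bufferedNodeIds = new ArrayList<Long>();
        private final Set<Long> referencedNodeIds = new HashSet<Long>();

        void processNode(long nodeId) {
            // We cannot decide yet whether this node is needed, so buffer it.
            // (The real task has to keep the full node data, on disk.)
            bufferedNodeIds.add(nodeId);
        }

        void processWay(long wayId, long[] nodeRefs) {
            // Only now, long after the nodes went by, do we learn which
            // of them actually matter.
            for (long ref : nodeRefs) {
                referencedNodeIds.add(ref);
            }
            // Ways would have to be buffered too, to keep nodes before
            // ways in the output; omitted here for brevity.
        }

        void complete() {
            // Effectively a second pass, but over our own cached copy
            // of the stream.
            for (Long nodeId : bufferedNodeIds) {
                if (referencedNodeIds.contains(nodeId)) {
                    emitNode(nodeId);
                }
            }
        }

        private void emitNode(long nodeId) {
            System.out.println("keep node " + nodeId); // stands in for passing downstream
        }
    }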
As I used Osmosis over the last couple of weeks, I often found myself
thinking: can I do X with Osmosis? Can I optimize step Y of the
pipeline? And the answer is almost always: no, it requires replaying the
stream. No, it requires a bit of information from another task up or
down the pipeline. No, it requires some coordination or synchronization
between multiple tasks.
In short, and I think that's the core problem: the answer is no because
_the tasks do not know enough about each other_ both before and during
processing.
And as tempting as metadata embedded in the data stream (like you
implemented in your last commits) or a workaround like my
--fast-used-node might be, in my very humble and uninformed opinion both
treat the symptoms, not this core problem.
What we have now is very similar to the plain old analog telephone
network: we transmit both data and control information over a single
channel. And there's a reason the telecom networks switched to
out-of-band signaling, with separate channels for data and for metadata
like call setup: it is more flexible. Before that, you had crazy
workaround stuff like "hook flash", where you signaled information by
actually interrupting the channel with a particular timing. Maybe that's
just me, but throwing an exception up the pipeline is in a way very much
like a hook flash :)
As to the RestartStreamException in particular, I have a feeling that it
isn't really going to work well with --tee, --buffer, --merge,
--apply-change and similar tasks. And even if it did, an exception
seriously messes up your control flow. Once you throw it, you cannot
really make any assumptions about what state you're currently in. And
even if you work around that, the code is going to be _really_ messy.
So I think what we need is a full-blown "control plane" for
communication between tasks. Like in telecommunications, it should be
orthogonal to the data streams; otherwise you will always have the
problem that the bit of information you need is in the wrong place in
the stream. We see this with <bound>: you need to write it out at the
start of the stream, but in the current model you only know it at the
end. That's why you would need to cache a GB-sized stream, which seems
pretty wasteful for 4 doubles.
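To give a rough idea of what I mean, here is a sketch of such a channel.
Everything in it is hypothetical - none of these names exist in Osmosis
today - and it only shows the principle: control information travels on
its own channel, so a task that happens to know a fact early (say,
--merge, which could derive the merged bound from its two inputs) can
publish it, and a downstream writer can pick it up without the fact
having to occupy a particular slot in the entity stream:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CountDownLatch;

    // Hypothetical sketch of an out-of-band "control plane" shared by
    // the tasks of one pipeline.
    class PipelineControlChannel {

        private final Map<String, Object> facts =
                new ConcurrentHashMap<String, Object>();
        private final Map<String, CountDownLatch> latches =
                new ConcurrentHashMap<String, CountDownLatch>();

        // A task announces a fact (e.g. the bound) as soon as it knows it,
        // independently of where that fact would sit in the data stream.
        void publish(String key, Object value) {
            facts.put(key, value);
            latchFor(key).countDown();
        }

        // Another task blocks until the fact is available.
        Object await(String key) throws InterruptedException {
            latchFor(key).await();
            return facts.get(key);
        }

        private CountDownLatch latchFor(String key) {
            CountDownLatch latch = latches.get(key);
            if (latch == null) {
                CountDownLatch created = new CountDownLatch(1);
                latch = latches.putIfAbsent(key, created);
                if (latch == null) {
                    latch = created;
                }
            }
            return latch;
        }
    }

A writer could then await("bound") before emitting its header while the
entity stream itself flows unchanged. That only helps when some task
upstream actually knows the bound early, of course, but the point is
that the position in the stream is no longer the limiting factor.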
I'd like a way for a task to know what is downstream and upstream of it
in the pipeline. I'd like proper synchronization like locks and latches
and waits between tasks - yes, synchronization is complex and can lead
to deadlocks, but we can also have deadlocks _now_, without the benefits
of synchronization. I'd like a way for one task to be able to say what
it is going to do to the stream with respect to the sort ordering or the
bounding box. And I'd like other tasks to be able to reason about that
in the context of the current pipeline.
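That last wish could look something like the following, again with
purely hypothetical names: each task declares which properties of the
stream it requires and which it guarantees, and the pipeline (or another
task) can check the combination before any data flows:

    import java.util.EnumSet;
    import java.util.Set;

    // Hypothetical sketch: none of these types exist in Osmosis.
    enum StreamProperty {
        SORTED,              // entity ordering (type, then id) is preserved
        BOUND_RESTRICTED,    // output is limited to a declared bounding box
        REFERENCES_COMPLETE  // every referenced node/way is present in the stream
    }

    // A task states what it needs from upstream and what it promises
    // downstream, so an impossible pipeline can be rejected before
    // processing starts.
    interface DeclaringTask {
        Set<StreamProperty> requires();
        Set<StreamProperty> guarantees();
    }

    // Example: a bounding-box style filter needs a sorted stream and
    // promises a sorted, bound-restricted one in return.
    class BoundingBoxFilterSketch implements DeclaringTask {

        public Set<StreamProperty> requires() {
            return EnumSet.of(StreamProperty.SORTED);
        }

        public Set<StreamProperty> guarantees() {
            return EnumSet.of(StreamProperty.SORTED,
                              StreamProperty.BOUND_RESTRICTED);
        }
    }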
Enough I'd-likes. :) If you have a good "control plane", good things
happen, that's it, actually ;) I understand very well that this is going
to make the task API more complex. But I think it would pay off, as you
could use Osmosis for more tasks. Passing metadata in the data stream,
exceptions and --fast-used-node make things more complex as well, and
the payoff is less IMHO. You're right when you say that implementing a
not-good-enough solution is worse than not implementing one at all - so
the question for me is, really: what is "good enough"?
I have some - very basic - thoughts about how that "control plane" could
work. I think a Wiki page with a more thorough description of that
approach would be a better way to communicate it. I'll try to do that.
Or would you prefer to have it here on the list?
As to your hesitation to accept the --fast-used-* workaround: you're
absolutely right, I understand that now. There are issues with those
tasks, and I could and maybe will address some of those issues - but it
will still stay a workaround. Now even I don't think it should go in the
main distribution.
I've packaged --fast-used-* as a plugin and I'm going to make it public
somewhere for those who need it now, like myself. What would be a good
place to do that, BTW?
Thank you for taking the time to read all of this,
Greetings from Stuttgart, Germany,
Igor