<div class="gmail_quote">On Fri, Mar 12, 2010 at 11:24 PM, Lars Francke <span dir="ltr"><<a href="mailto:lars.francke@gmail.com">lars.francke@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<br><div class="im">

> Most OSM systems tend to have a large number of disorganised and<br>

> uncontrolled clients.  Does this work well with the AMQP paradigm?  In other<br>

> words, does it take administrative overhead to register new subscriptions to<br>

> a queue?  What happens if large numbers of subscriptions are created then<br>

> the clients disappear?  Is AMQP targeted at a world where the clients are<br>

> relatively controlled and small in number?  It's important to minimise<br>

> administration overhead where possible.<br>

<br>

</div>I'll try to answer those in order.<br>

Subscriptions in AMQP are defined client side[1] so there is _no_<br>

administrative overhead at all. Clients only need to know which<br>

exchange/queue they should bind to and that's just a string they need<br>

to know.<br>

<br>

Queues can be declared in different ways; one way is to declare a queue as<br>

autoDelete[2]: "true if we are declaring an autodelete queue (server<br>

will delete it when no longer in use)". Again no administrative<br>

overhead. It just disappears when no longer used so no messages are<br>

routed there. So depending on the use case there are multiple options.<br>
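<br>
To make the auto-delete behaviour concrete, here is a minimal sketch in Python. It is a toy in-memory model, not a real AMQP client: the ToyBroker class, its method names, the queue name "client-42" and the routing key "osm.diff" are all made up for illustration. It only mimics the semantics described above, where the client declares and binds its own queue and the queue vanishes once its last consumer is gone.<br>

```python
from collections import defaultdict, deque


class ToyBroker:
    """In-memory stand-in for an AMQP broker (hypothetical, not a real client API)."""

    def __init__(self):
        self.queues = {}                    # queue name -> deque of messages
        self.auto_delete = set()            # names of auto-delete queues
        self.bindings = defaultdict(set)    # routing key -> bound queue names
        self.consumers = defaultdict(int)   # queue name -> active consumer count

    def declare_queue(self, name, auto_delete=False):
        # Declared by the client itself; no administrator is involved.
        self.queues.setdefault(name, deque())
        if auto_delete:
            self.auto_delete.add(name)

    def bind(self, queue, routing_key):
        self.bindings[routing_key].add(queue)

    def consume(self, queue):
        self.consumers[queue] += 1

    def cancel(self, queue):
        self.consumers[queue] -= 1
        # An auto-delete queue goes away once its last consumer leaves.
        if self.consumers[queue] == 0 and queue in self.auto_delete:
            self.delete_queue(queue)

    def delete_queue(self, queue):
        self.queues.pop(queue, None)
        self.auto_delete.discard(queue)
        self.consumers.pop(queue, None)
        for bound in self.bindings.values():
            bound.discard(queue)

    def publish(self, routing_key, message):
        # Direct-style routing: deliver to every queue bound with this key.
        targets = sorted(self.bindings.get(routing_key, ()))
        for queue in targets:
            self.queues[queue].append(message)
        return len(targets)


broker = ToyBroker()
broker.declare_queue("client-42", auto_delete=True)
broker.bind("client-42", "osm.diff")
broker.consume("client-42")
delivered_while_connected = broker.publish("osm.diff", "diff #1")
broker.cancel("client-42")  # the client disappears
delivered_after_disconnect = broker.publish("osm.diff", "diff #2")
```

With a real broker the same steps would be calls in a client library; the point here is only that nothing has to be configured on the server beforehand, and nothing is left behind afterwards.<br>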

<br>

I don't have much real-world experience with AMQP, but I believe it is<br>

targeted at both worlds: small and controlled, and large and<br>

uncontrolled. The structure they have found seems to work really well<br>

in either case.<br>

<br>

Since the installation of the RabbitMQ server I did not have to touch<br>

it again. The only thing one _could_ do is to add some kind of user<br>

management (currently everyone is allowed to do everything, for example<br>

sending fake messages[4]) to allow only certain users to publish to<br>

certain exchanges, but that is two lines in the admin console.<br></blockquote><div><br>Okay, that all sounds promising.  My main queueing experience is with MQ Series, or WebSphere MQ, or whatever it's called these days.  It's a little more top heavy :-)<br>

 </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<div class="im"><br>

> Clients will experience outages whether that be due to a network problem,<br>

> server reboot, or just not running a system 24x7.  Presumably they need a<br>

> way to catch up on missed events.  There are a few options: 1. The server<br>

> holds events for clients until they become available again, 2. The client<br>

> catches up using an out of band mechanism (eg. downloads diffs directly), or<br>

> 3. The client can request that the server begin sending data from a specific<br>

> point.  I think that only options 1 and 2 are possible using AMQP.  1 is not<br>

> scalable, and 2 adds additional client complexity.  3 is what I'd like to<br>

> see, but I don't think it can be done using a typical reliable messaging<br>

> system such as AMQP.  I hope I'm wrong though.<br>

<br>

</div>Options 1) and 3) are kind of interchangeable and it depends on the<br>

circumstances which one to use. Both are possible.<br>

<br>

I'm glad to inform you that you are indeed kind of wrong. AMQP is used<br>

for this kind of thing through private "unnamed" temporary reply<br>

queues. An example: Program A needs all diffs since time T, so it<br>

creates a private reply queue (the server generates a temporary name)<br>

and then sends a message to the "Query queue" with the routing<br>

key "osm.query.osc". The payload is just the timestamp (or any<br>

extra options one might think of) and the "replyTo"[3] field is set to<br>

the previously created private queue. Some process (it might be a different<br>

one depending on the kind of query; it could, for example, decide dynamically<br>

to use an API or an XAPI reply handler depending on the request)<br>

reads the message and begins sending the reply on the private queue,<br>

which is automatically destroyed at the end.<br></blockquote><div><br>Okay, that sounds workable.  Perhaps a little more complicated than an HTTP or raw socket request, but that's not necessarily bad.<br><br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
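The private reply queue exchange described above can be sketched roughly as follows. This is a toy in-memory model in Python rather than real AMQP client code: the function names, the "tmp-reply-" naming scheme and the sample DIFFS data are assumptions made for illustration, while the routing key "osm.query.osc" and the reply-to idea come from the description above.<br>

```python
import itertools
from collections import deque

# Hypothetical diff store: (timestamp, diff name) pairs held by the responder.
DIFFS = [(100, "a.osc"), (160, "b.osc"), (220, "c.osc")]

queues = {"query": deque()}   # queue name -> pending messages
_counter = itertools.count(1)


def declare_reply_queue():
    """Create a private queue under a broker-generated temporary name."""
    name = f"tmp-reply-{next(_counter)}"
    queues[name] = deque()
    return name


def request_diffs_since(timestamp):
    """Client side: publish a query whose reply_to names the private queue."""
    reply_to = declare_reply_queue()
    queues["query"].append(
        {"routing_key": "osm.query.osc", "reply_to": reply_to, "body": timestamp}
    )
    return reply_to


def handle_one_query():
    """Responder side: answer the oldest query on the queue named in reply_to."""
    message = queues["query"].popleft()
    for stamp, name in DIFFS:
        if stamp >= message["body"]:
            queues[message["reply_to"]].append(name)


def drain_and_delete(reply_to):
    """Client side: read every reply, then drop the private queue."""
    return list(queues.pop(reply_to))


reply_queue = request_diffs_since(150)
handle_one_query()
replies = drain_and_delete(reply_queue)
```

The client never needs to know which process answered; it only knows the name of its own private queue, which is gone once the replies have been read.<br>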


<div class="im"><br>

> Something to note about the current replication mechanism is that it doesn't<br>

> use any transactional capabilities other than creating files in a particular<br>

> order.  All replication state tracking is client side where transactions are<br>

> actually occurring (eg. writing to a database, updating a planet file, etc)<br>

> which keeps the server highly scalable and agnostic of client reliability.<br>

<br>

</div>I didn't mean to say that what you're doing is wrong or bad in any<br>

way! I'm sorry if it came across as if I want to disregard everything<br>

you have done with Osmosis. The replication diffs are a huge step<br>

forward.<br></blockquote><div><br>It's all fine :-)  I'm more than happy to see alternative ideas out there.  And to be perfectly honest, if a better system comes out of it and Osmosis replication can be retired then I'm happy to reclaim a bit of my life back ;-)<br>

 </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<div class="im"><br>

> I don't know how you'd hook into the Ruby API effectively and reliably.  You<br>

> can't just wait for changeset closure events because changesets can remain<br>

> open for large periods of time.  You really want to be replicating data as<br>

> soon as possible after it becomes available to API queries.  This may mean<br>

> receiving notification about every single entity as it is created, modified<br>

> or deleted from the db, but this will result in huge numbers of events which<br>

> will be difficult to process in an efficient manner.<br>

<br>

</div>That's exactly what I wanted to do: Send one message for each<br>

successful API call (create, update, delete). Yes, it'll be a lot of<br>

messages but, judging by the minutely diffs, nothing even remotely<br>

hard to handle. Those messages should of course not be processed on<br>

the API server but if the messages are routed to the dev server it<br>

shouldn't have a huge impact. It would obviously put _some_ burden on<br>

the API server(s) but we'd gain a lot of flexibility and<br>

opportunities. We'd certainly be the first big open source project<br>

that has this kind of (for lack of a better term) fire-hose stream of<br>

its data. All it takes to make it even more accessible is for someone<br>

to write the PubSubHubBub/XMPP implementation.<br></blockquote><div><br>Yep, okay.  I'm not 100% convinced this is the right approach, but again I'm more than happy to see what comes out of it.<br><br>In the meantime, if you do need something more regular than minute updates, the existing Osmosis mechanism should be able to get down to around 5 second intervals.  I'm not keen to publish files publicly with that type of interval, but a second process consuming them on the same server (or one in close proximity) should be fine.<br>

 </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<br>

As to the closing of changesets I thought of another small nicety<br>

(which I've not yet implemented so this might still fall apart): I<br>

planned a tool that subscribes to all the API messages and just holds a<br>

list of timers (even with a million open changesets at a time it<br>

shouldn't be a problem) which are adjusted accordingly. So when it<br>

gets a "changeset open" message it creates a timer that expires on the<br>

"closed at" time of the changeset and that is reset on every change to<br>

the changeset (of course factoring in the limitations like 24h, 50000<br>

elements). At some point the timer will fire and it can check the API<br>

to fetch all the associated metadata and send a message "changeset<br>

close". This would have the benefit that there is only one call to the<br>

API for every opened changeset (which I hope would be okay, load wise)<br>

and all consumers can profit from this information.<br></blockquote><div><br>I didn't understand all of this to be honest.  I think I understand how existing changesets work, and how they're auto-closed after intervals.  But I'm not sure where your idea fits into that.<br>
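<br>
One way to read the timer idea quoted above is sketched here in Python. It is only an illustration under stated assumptions: the ChangesetTimers class and its method names are hypothetical, the one hour IDLE_TIMEOUT is an assumed value that is not given in this thread, and the 24h and 50000 element limits are the ones mentioned above. Time is passed in explicitly as seconds so the logic can be followed without real timers.<br>

```python
import heapq

IDLE_TIMEOUT = 60 * 60        # assumed idle timeout in seconds (not stated above)
MAX_LIFETIME = 24 * 60 * 60   # the 24h limit mentioned above
MAX_ELEMENTS = 50000          # the 50000 element limit mentioned above


class ChangesetTimers:
    """Tracks when each open changeset is expected to close."""

    def __init__(self):
        self.opened_at = {}   # changeset id -> time it was opened
        self.expires_at = {}  # changeset id -> current expected close time
        self.elements = {}    # changeset id -> number of changes seen so far
        self.heap = []        # (expiry, changeset id); may hold stale entries

    def _schedule(self, cid, now):
        # The timer can never run past the hard lifetime limit.
        hard_limit = self.opened_at[cid] + MAX_LIFETIME
        expiry = min(now + IDLE_TIMEOUT, hard_limit)
        self.expires_at[cid] = expiry
        heapq.heappush(self.heap, (expiry, cid))

    def on_open(self, cid, now):
        self.opened_at[cid] = now
        self.elements[cid] = 0
        self._schedule(cid, now)

    def on_change(self, cid, now):
        self.elements[cid] += 1
        if self.elements[cid] >= MAX_ELEMENTS:
            # A full changeset is treated as closing immediately.
            self.expires_at[cid] = now
            heapq.heappush(self.heap, (now, cid))
        else:
            self._schedule(cid, now)

    def pop_expired(self, now):
        """Return ids whose timer has fired; the caller would then query the API."""
        closed = []
        while self.heap and now >= self.heap[0][0]:
            expiry, cid = heapq.heappop(self.heap)
            # Skip entries superseded by a later reset or an earlier close.
            if self.expires_at.get(cid) == expiry:
                closed.append(cid)
                for table in (self.opened_at, self.expires_at, self.elements):
                    del table[cid]
        return closed


timers = ChangesetTimers()
timers.on_open(1, now=0)
timers.on_change(1, now=1800)                # resets the timer to 1800 + 3600
fired_early = timers.pop_expired(now=3600)   # the original timer is stale
fired_late = timers.pop_expired(now=5400)
```

Each id returned by pop_expired would trigger the single API lookup and the "changeset close" message, so every other consumer gets that information without polling the API itself.<br>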

 </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<div class="im"><br>

> I also think you'll run into a fair bit of<br>

> resistance trying to incorporate changes into the Ruby API, it's simpler at<br>

> least to remain independent where possible.  Unless you want to achieve<br>

> sub-second replication, the current approach could be run with a very short<br>

> replication interval.  The main restriction on replication interval now is<br>

> downloading large numbers of files from the planet server, not the<br>

> extraction of data from the database.<br>

<br>

</div>Yes I think that's going to be the problem (incorporate it into the<br>

API). That's why I currently implement it by running off the diff files<br>

you produce. That generates the same messages and allows for a very<br>

realistic testing scenario while at the same time not having to change<br>

anything in the API for now.<br></blockquote><div><br>Yep, it will be easier to include something if you have a working prototype first.  Having a client base ready to go will also help.<br> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">


<br>

As I've said before: We thought about this solely for the dev servers<br>

(and the wikimedia toolserver(s)) but I've since come to believe that<br>

something(!) like this would be very nice and an innovative step for<br>

OpenStreetMap's future.<br>

<div class="im"><br>

> I guess something to consider is who are the clients of the mechanism.<br>

> Somebody wanting to see activity in a geographical area may not care about<br>

> reliability and perhaps something like XMPP is appropriate here.  But<br>

> anybody wanting reliable replication (ie. TRAPI) will need something robust<br>

> that guarantees delivery and data ordering.<br>

<br>

</div>I agree. AMQP is the solution I chose but there are of course other<br>

ways. AMQP allows for robust, reliable and ordered delivery of<br>

messages (it even has transactions if needed), and XMPP and others can<br>

be integrated into this (easily, from what I've read; I have not done it<br>

myself).<br>

<div class="im"><br>

> Anyway, it's good to hear that some fresh minds are interested in the<br>

> problem of changeset distribution.  I'm very interested to hear what comes<br>

> out of it.<br>

<br>

</div>Me too :)<br>

Thanks for your comments. It's always good to get a set of fresh eyes<br>

on an idea and I'd be glad to listen to any further input you (and<br>

others) might have.<br></blockquote><div> <br>Good luck :-)<br><br>Cheers,<br>Brett<br></div></div><br>