<div class="gmail_quote">On Fri, Nov 20, 2009 at 5:15 PM, Lars Francke <span dir="ltr"><<a href="mailto:lars.francke@gmail.com">lars.francke@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Once again: Thanks for all your work on this!<br>
After taking a stab at it myself I certainly have a new appreciation<br>
for what you've done.<br></blockquote><div><br>Hehe, I started doing this in 2006 I think and thought it'd be done and dusted in a few months. 3 years later and I'm still doing it ...<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div class="im"><br>
> The "history" diffs are in the process of being generated and are well<br>
> through 2008 as we speak. These are effectively daily diffs but aren't<br>
> getting deleted on a rolling window basis. This is effectively creating a<br>
> full history dump of the database. This has been in the wings for a while,<br>
> but only possible now that there is some more disk space available. These<br>
> are still timestamp based extracts due to transaction id queries being<br>
> useless for historical queries. As a result of the use of timestamps, these<br>
> will be run with a large delay to avoid missing data. I'll probably set<br>
> this delay to 1 day to be safe, but perhaps a couple of hours would be<br>
> enough.<br>
<br>
> The first few years' worth of history diffs have been created using the
> "old" Osmosis version. So is it possible that they are missing a few
> transactions too? (As a result of the "one off" bug).

The off-by-one was in the transaction id calculation code, which is only used
for the minute-replicate diffs. The history diffs are generated using the
older style timestamp range queries, so they shouldn't have the same problem.

But there could certainly be bugs in the history diffs; let me know if you see
anything.

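To illustrate what I mean by a timestamp range query with a safety delay, here
is a rough sketch (the JDBC URL, table and column names are placeholders I've
made up, not the actual Osmosis code):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;

    // Rough illustration only: extract a day's worth of changes, but never
    // query closer to "now" than a safety lag, so rows committed late (with
    // older timestamps) have had time to become visible.
    public class TimestampExtractSketch {
        private static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

        public static void main(String[] args) throws Exception {
            long safetyLag = DAY_MILLIS;  // the 1 day delay discussed above
            Timestamp windowEnd = new Timestamp(System.currentTimeMillis() - safetyLag);
            Timestamp windowBegin = new Timestamp(windowEnd.getTime() - DAY_MILLIS);

            Connection conn = DriverManager.getConnection("jdbc:postgresql:osm"); // placeholder
            try {
                PreparedStatement stmt = conn.prepareStatement(
                    // table and column names are placeholders for the real schema
                    "SELECT id, version FROM nodes "
                        + "WHERE tstamp > ? AND tstamp <= ? ORDER BY tstamp");
                stmt.setTimestamp(1, windowBegin);
                stmt.setTimestamp(2, windowEnd);
                ResultSet rs = stmt.executeQuery();
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " v" + rs.getInt("version"));
                }
            } finally {
                conn.close();
            }
        }
    }

The point is simply that the upper bound of the window trails real time by the
delay, so a larger delay is safer at the cost of latency.
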
<div class="im"><br>
> Moving away from a file-based<br>
> distribution approach has serious implications for reliability in the face<br>
> of server and network outages, cacheability, bandwidth consumption, and<br>
> server resource usage. As a result, the existing approach is likely to<br>
> represent the state of the art in the near to medium future. We need to<br>
> stabilise the existing features before attempting new ones :-)<br>
<br>
> I thought about replication over PubSubHubbub, which should take care
> of bandwidth, cacheability, server resource usage (with fat pings) and
> a few other problems. But I've done no work on it yet or even thought
> it through. It just seemed like a fitting concept for the type (or _one_
> of the types) of replication we need.
> The MusicBrainz project is facing much the same problem as we are and
> they're using a very similar solution
> (http://musicbrainz.org/doc/Replication_Mechanics).

The MusicBrainz replication scheme (on first read) sounds pretty similar to
what we're doing now. In other words, the client/slave tracks which sequence
number it has reached and downloads replication files until it reaches the
current point. One difference is that they're replicating between identical
schemas whereas Osmosis is more general, but the idea seems to be the same.

As for publish/subscribe mechanisms, I'm less sure. There are a few things I
wish to achieve in order to maximise fault tolerance and promote loose
coupling between systems:

1. Zero administration per client on the server side. In other words, I don't
   wish to have to perform setup per client on the server. A possible
   exception is authentication.
2. Zero state managed per client on the server side. To maximise scalability
   and minimise administration, I'd rather all per-client state be managed on
   the client side.
3. Clients must be able to re-sync after a network or client/server outage
   without gaps in the data. To do this they need to be able to ask the server
   to start sending data from a specific point, with the server limiting how
   far back it will allow.

Point 3 is the most problematic from a pub/sub perspective, because most
pub/sub mechanisms have a single server publishing updates and
already-subscribed clients receiving them. It is hard for a client to re-sync
from a known point if it has missed updates. I'd rather the server not have to
know which updates each client has received, and track that on the client side
instead.

It may be necessary to write a server app from scratch. It could run regular
extracts from the db (e.g. every 10 seconds or so) but not publish them
publicly. The existing minute-replicate process could switch to consuming
these extracts and roll them into minute chunks.

The server app would be multi-threaded, with a master thread retrieving
updates from the db and then notifying client-specific threads when new data
is available. Each client-specific thread would begin sending data from the
point the client requests when it first connects, then push subsequent updates
to the client as it is notified of each new extract. The client would be
responsible for tracking which sequence it had successfully received and
committed to its output data store. The server would wrap all diffs in a
replication XML structure providing the timestamp the change represents and
the current replication sequence number.

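As a very rough sketch of the shape I have in mind (every name below is made
up for illustration; this isn't an existing Osmosis component), the master
thread and per-client threads could look something like this:

    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.LinkedBlockingQueue;

    // Illustration of the proposed push server: one master thread polls the db,
    // and one thread per connected client pushes each new extract down its socket.
    public class PushServerSketch {
        // One bounded queue per connected client; offer() never blocks, so a slow
        // or dead client cannot stall the master thread.
        private final List<BlockingQueue<String>> clientQueues =
            new CopyOnWriteArrayList<BlockingQueue<String>>();
        private long sequence = 0;

        public void runMasterLoop() throws InterruptedException {
            while (true) {
                String diff = extractChangesSinceLastPoll();  // query the production db
                long seq = ++sequence;
                String envelope =
                    "<replication sequence=\"" + seq + "\" timestamp=\""
                        + new java.util.Date() + "\">" + diff + "</replication>";
                for (BlockingQueue<String> queue : clientQueues) {
                    queue.offer(envelope);                    // notify the client threads
                }
                Thread.sleep(10000);                          // e.g. every 10 seconds
            }
        }

        // Called when a client connects; it gets its own queue and its own thread.
        public void addClient(final ClientConnection client) {
            final BlockingQueue<String> queue = new LinkedBlockingQueue<String>(100);
            clientQueues.add(queue);
            new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            client.send(queue.take());        // blocks only this client
                        }
                    } catch (InterruptedException e) {
                        clientQueues.remove(queue);
                    }
                }
            }).start();
        }

        private String extractChangesSinceLastPoll() { return "<osmChange/>"; }

        // Stand-in for the real socket handling.
        public interface ClientConnection { void send(String envelope); }
    }

The master thread only ever does non-blocking offers to in-memory queues, so
clients can't slow down the extraction loop, while each client thread is free
to block on its own socket. The envelope carries the sequence number and
timestamp so the client can persist its position after each successful commit.
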
With the above approach, the master thread would be the only process accessing
the production db, and it would never block based on client activity. Clients
could begin processing from any point (within limits), allowing them to load
their database to a known point with planet + day/hour diffs and then continue
from there. If clients lost connectivity for some time, they could resume
where they left off unless they'd fallen outside the maximum re-sync window,
in which case they'd have to catch up via normal diffs.

The client threads could be placed in a pool that limits the number of clients
to a sensible number. Connections could require a user id if necessary to
limit the number of consumers using this mechanism. If large numbers of
consumers started using it, these replication systems could be cascaded in a
hierarchy for scalability.

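The re-sync limit could be enforced with a simple check when a client first
connects and nominates a starting sequence (again, just a sketch with made-up
names):

    // Sketch of the check applied when a client connects and asks to resume
    // from a particular sequence number.
    public class ResyncWindowSketch {
        // e.g. keep one day's worth of 10-second extracts available (86400 / 10)
        private static final long MAX_RESYNC_WINDOW = 8640;

        public static boolean canResume(long requestedSequence, long currentSequence) {
            long oldestAvailable = Math.max(0, currentSequence - MAX_RESYNC_WINDOW);
            // A client that has fallen behind this point must catch up via the
            // normal planet + day/hour diffs before reconnecting.
            return requestedSequence >= oldestAvailable;
        }
    }
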
I don't think code complexity would be terribly high. The most difficult part
is the threading aspect of the server, but it would be simpler than Osmosis
itself in many respects because the threads are mostly independent. Having
said all that, I do have a tendency to reinvent that which is already
available, so perhaps this has already been solved elsewhere :-)

One fairly major consideration is what impact this type of system would have
on OSM infrastructure. It is likely to be more error prone and require more
maintenance than the existing approach. The push mechanism, while eliminating
network "chattiness", also makes the stream uncacheable, which has
implications for bandwidth consumption.

Anyway, that's a dump of my thoughts. I certainly won't implement anything in
the near future, so feel free to have a play and see what you can come up
with.

Brett