[OSM-talk] Rebuild plan (followups to rebuild list, please)
dermotm at gmail.com
Thu Mar 22 14:07:40 GMT 2012
I have waited a day to reply to the sudden wave of feedback regarding
the rebuild task list and plan. In this way I hope to ensure that my
reply is constructive and useful. I urge others to adopt a similar approach.
It would be nice to be able to claim that it was gratifying to see
such a sudden surge of interest in a topic for which it has, until
now, been difficult to drum up much enthusiasm. Those who have
participated in the process of getting us to the point where we have a
plan and an emerging toolset - they deserve our thanks and they have
mine. Those who have chosen to snipe, often in non-specific terms, at
this plan, imperfect though it may be, well, I think they should
consider how things get done around here. Clearly they would have done
a better job and it's unfortunate that they did not step forward in a
timely fashion and do so.
All this being said, allow me to address those criticisms that have
been made in specific enough terms to allow it. There is a risk I will
leave out something important, but something tells me I'll hear about
that soon enough. I will politely request that followups be made to
rebuild at openstreetmap.org, a list that is open to all interested
parties and that exists for the purpose of such discussions as this. I
personally will assume that any followup not to rebuild@ is
unproductive punditry that need not be addressed in actual planning.
"The plan should be postponed until after April 1st"
To this I will simply state that deadlines are a Good Thing when you
are trying to get something done. Until we have completed this task it
is good that we should work to some deadlines even if they have to
evolve in the light of circumstances. If a safe rebuild or a portion
of it really has to slip beyond 1st April then that will have to
happen. There is, however, no virtue in ensuring that we slip by a
token few days just to prove that the world will not end. But be
assured that the plan is a living document that will not ignore what
we learn along the way.
"There should be _much_ more test runs and validation of the edits made"
The more testing the better, this is clear. I hope that those calling
for improvements here have read, understood and fed back weaknesses
found in the test suite:
(all files test*)
Unless you prefer to systematically verify every object in the planet
file, this will provide the single greatest chance of successful data
migration. We do also need spot checking of data changes made to a
real API database and this is planned. It will need manpower, of
course, something that is still lacking in this process.
Let me recap the planned nature of these tests. As can be seen
from the plan, this weekend is to see a test run on a subset or
subsets of the data set on the dev server, these subsets chosen
because they are representative of many of the important test cases
(and probably with regard to the locations where volunteer data
checkers have the local knowledge to most easily spot unexpected
behaviour).
As this is a fast moving process, the plan does not yet reflect the
fact that we also hope to commission the new database server and
install a full API database. The redaction process will then also be
commenced on this box (we have a choice whether to test the offline or
online redaction), something that will give us the fairest benchmark
(and the most random distribution of test cases) possible. Even during
the running of this full planet test it will be possible to view and
validate the decisions being made.
Until we run these tests we don't know how we will have to react to
what we find. If we discover that data is vanishing all over the place
and wrong redactions are happening, this will oblige much greater
caution than if everything behaves well. The benchmarking will also be
revealing. If live redaction on an unloaded API turns out to suggest
(a random figure with no basis, used for effect) a whole month of
database churning, that might indicate that an offline redaction is
much smarter (consider the scope for edit conflicts, or plain
degradation of API performance).
But we have to perform the tests first - after that, if we can see
that our projections are flawed, we will need to address this.
"This can be done without downtime and should be"
Two points need to be made about this, and both are hinted at above.
Firstly, _if_ we wish to use the opportunity of the licence change to
migrate to the new server (and database version), something Matt is
keen to do, this will require at least some downtime. A separate
discussion must be had about the principle of live redaction vs.
offline redaction (the latter assumed to be quicker, and to avoid
certain theoretical issues such as the performance hit and redactions
conflicting with real edits).
We still lack the benchmarks to make a truly informed decision between
the live and offline options. The plan, as many of you have mentioned,
assumes that the offline approach is the safest path to a swift
completion. Maybe we will learn more this weekend.
"Downtime should have more notice"
Yes, it should. Maybe we will manage to shorten the length of it
and/or move it to a more acceptable time. There are not many of us and
we are under pressure.
"The pace of the plan fails to heed the scope for error"
The less one knows about the rebuild process the more scary this
aspect will seem. Put briefly, the entire process is reversible. No
version history is being deleted from the database, only the current
version records will be altered or (marked) deleted. In the event that
we make an abject mess of the rebuild we can simply roll forward the
historical versions of each object to recover the state we were in at
the start of the process. We can do this selectively per object or
across the entire data set if an unrecoverable snag is exposed.
Clearly, it would be distressing, annoying and personally very
embarrassing to have to do so, and if we take 3 days of downtime all
in the name of getting back to where we started then nobody will claim
that it's a good thing. But in assessing the scope for error it is
important to acknowledge that what we are risking is disruption, not
data loss.
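The roll-forward idea above can be sketched in a few lines (a toy
model only; the class and field names here are my own invention for
illustration, not the real OSM schema or rebuild code). The point is
that redaction merely flags the current version while the full version
history survives, so recovery is just re-serving an earlier version:

```python
class MapObject:
    """Toy model of a versioned map object with reversible redaction."""

    def __init__(self, oid):
        self.oid = oid
        self.versions = []  # full history; never deleted

    def add_version(self, data):
        self.versions.append({"v": len(self.versions) + 1,
                              "data": data,
                              "redacted": False})

    def redact_current(self):
        # Redaction only marks the latest version; nothing is removed.
        self.versions[-1]["redacted"] = True

    def current(self):
        # What the API would serve: the latest non-redacted version.
        for ver in reversed(self.versions):
            if not ver["redacted"]:
                return ver
        return None

    def undo_redaction(self):
        # Roll forward per object: clear the flags, recover old state.
        for ver in self.versions:
            ver["redacted"] = False


node = MapObject(1)
node.add_version("v1 data")
node.add_version("v2 data")
node.redact_current()
assert node.current()["data"] == "v1 data"  # older version still served
node.undo_redaction()
assert node.current()["data"] == "v2 data"  # state before redaction restored
```

Because the flag lives on individual version records, this recovery can
be done selectively per object or across the whole data set, which is
exactly what makes the process reversible rather than destructive.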
"1st April has been held out to mappers as an agree-by date. But now
we are starting early"
It's a somewhat fair point. Many of us will have little sympathy with
drama queens who have left a lot of their peers guessing, causing them
to take the trouble to remap their stuff, only to theatrically
agree at the last minute. Guys, if you're reading, just effing agree
and be done with it, or refuse if you like, whatever point you were
making has been made. But this is also an issue for the lost mappers
campaign and all the excellent email chasing that a lot of you have
heroically been doing. We have, as a community, tended to see April
1st as a step change - throw a switch at midnight and it's done. This
has allowed some people (and I've even done this myself) to suggest to
mappers that they have until 1st April to agree, whereas others may
have looked forward to switching the attribution on their tile server
on the morning of the 1st.
Can we get a win-win scenario here? Again, I really want to see the
benchmarks from this weekend. I also really want to avoid forced
procrastination for no reward. As per the plan document, there is some
scope for reprocessing of objects based on "new information". See the
stern warnings, though, it's very much a measure of last resort. If we
go for an offline rebuild, there may be slightly higher scope to
handle some level of (deserving) late agreement before we switch back
to read-write, though as a very secondary coder in this effort I am
not in a position to promise this.
This mail is long already, so apologies if I have missed something
important. I look forward to seeing a lot of you on rebuild@ and we
can identify any gaps together.
Everyone here has their own song,
and I just can't find mine