[Rebuild] Communication to data consumers wrt the licence change (draft)

Sat Mar 24 21:29:01 GMT 2012

Frederik wrote:
> On 03/23/2012 04:17 PM, Dermot McNally wrote:
> > So the "we" in Simon's words is either me and Matt or just Matt, in
> > as much as I defer to his judgement on matters relating to his code and
> > how best to deploy it.
>
> Right. And as far as I understood, Matt has never ever said that he
> prefers the offline process. Instead, he has said that in the absence of
> anything "having to be completed" on 1st April, he would actually prefer
> the online process, but that this is surely not going to be complete on
> 1st April.
>
> So either I have completely misread what Matt has said, or else other
> people are, in their heads, combining the "oh god we must have something
> to show on 1st April" panic that emanates from OSMF board with Matt's
> factual "online process is better but not going to be completed on 1st
> April", and arriving at: "Matt (a technician to whose judgement we can
> defer) says that offline is better."
>
> Which would be a really grand mis-interpretation of what he says.

i would characterise my position as "there are a number of advantages
to the online process, but speed is not one of them". preferring one
method or another implies that the situation is a level playing field,
when it is not. there is a time constraint. as for the trade-offs
between the advantages and disadvantages of each method, i think
that's the open discussion that Dermot was in favour of.

> And the reason I am so upset about this is the untruthfulness.

whoa, i think that's going a bit far.

> I want the board to stand up and say: "Yes, we have heard Matt who has
> said that the online process is better but cannot be completed by April
> 1st. We have also heard that even the offline process is not guaranteed
> to be completed by April 1st. We chose to *ignore* the sentiment of the
> technician because we think that the offline process is actually better
> for the project."
>
> What is happening here is that Matt, a technician with no history of
> political grand-standing, is being mis-quoted because of those who want
> results by April 1st no matter what the cost, nobody actually has the
> spine to say it.

ok. i, being a board member but in this instance speaking for myself,
want to meet the goal that we set ourselves for completing this
process. not because it's political, but because when i set myself a
goal i feel duty-bound to do as much as possible to reach it. in this
case, the two processes will reach much the same result, but one has a
much greater chance of reaching it by the date we identified as being
reasonable.

> > Rather Matt reached the conclusion that to have a sufficiently
> > performant rebuild process, the offline approach, which is by
> > definition faster than the live one, would be a better bet.
>
> Again, this is only true if you insert *someone else*'s definition of
> "sufficiently performant". Matt never said (and of course he can correct
> me if I misrepresent him) that, say, completing in Mid-April is not
> "sufficiently performant".

"sufficiently performant" being, in this case, the performance
sufficient to reach the self-imposed goal. it's like running a
marathon; i've got a time i'd promised myself i'd finish by and my
pacing has been off, i'm behind the time, so it seems that a sprint
finish is in order.

> > That we are now discussing it is good. I'd prefer that we not do so
> > under a cloud of conspiracy theories and threats of retribution.
>
> Remember when only recently Steve and Mikel signed an anti-Google blog
> post with their osmfoundation.org email addresses and there were
> different opinions from different board members as to whether this had
> been a board decision or not. *This* is the environment in which we are
> operating. I want full accountability, not "Oh, I thought that Matt had
> recommended this option. Now you say he didn't? I'm confused. Surely I
> would never have given the orders to shut down the database for a week...".
>
> But let me get back to the most prominent reason why I am skeptical of
> the "hard" cutover.
>
> The "soft" cutover is one where the API functions normally for however
> long it takes to make the changes. My estimate is that this might be two
> weeks. If it is done after one week, or if it should take three weeks or
> god forbid even four, the project will not be hurt apart from the fact
> that the license change is delayed (which, considering that we're so
> late in giving proper notice, would perhaps not even be a bad thing).

late in giving proper notice? the April 1st date had been announced months ago.

> This means that even if something should go wrong, we'll be in a
> situation where technicians can still get the sleep that is required to
> keep their health up, where they can still do their day job, and all.
>
> On the other hand, the "hard" cutover means that most - or even all -
> activity in the project comes to a halt while the change is going on. I
> agree with you that this is not a problem if it can be done in a day or
> two. However, it is entirely possible, and indeed likely, that it will
> take longer.

absolutely. we need to test and benchmark this stuff as much as
possible ahead of time to give ourselves confidence in this process.
this is why we've been writing a test suite for the license change
code and why we're attempting to benchmark the license change process.
any and all help would be greatly appreciated, and we can work
towards having confidence that this will go right first-time with a
minimum of fuss.

the benchmarking process might show that downtime will be extreme. as
Keynes said, "when the facts change, I change my mind", so that would
be cause for a re-think. currently, i am not anticipating any more
extreme downtime than we had for the 0.5->0.6 API migration, and in
the process we'll have also upgraded to the latest replication-capable
PostgreSQL. until we have hard facts, all else is speculation.

> That after the end of one day, a mistake is spotted that
> requires that we have to re-start. That something breaks, a transaction
> aborts, whatever. Remember this is the very first time this code is
> used. Shit *will* happen. And when it happens, the eyes of all of OSM
> are upon the - as you correctly say - too few people handling this.
> People who have a day job, a real life, maybe even a family. The whole
> project is halted, and a handful of people (likely fewer) will have to
> work day and night to get things going again. Or, to decide to continue
> with a half-working solution because there's no time to fix the bug
> properly. And so on.

yeah, it was pretty much the same when we did the 0.5->0.6 API
migration. it's very stressful, and definitely some short,
unsatisfactory nights. but that's not a good reason not to do this,
imho. this license change has been dragging on long enough, and it's
well past the time we would have all wanted it to be completed when it
began. so let's pull together and JFDI, once and for all.

> I can see that somebody who wants to get the license change done by 1st
> April at all costs would prefer the second option. Not because it
> promises to be faster, but because it makes sure that the community
> squeezes every last drop of energy out of the few who have to do the job.
>
> But I don't think it is good. It's not good style to treat your own
> people like that, to wantonly create such a high-stress situation. It is
> not good leadership, it is not good management, it is just reckless, and
> all for this stupid "April 1st" fetish. I don't like the spirit behind
> this, I don't like the attitude, I don't like an OSM Foundation that
> sets its priorities in this way.

believe me, back in November, April sounded like it was a long way
away. i was enthusiastic about the prospect of all of us drawing a
line in the sand and saying that it will be done. the line is much,
much closer now - but the desire to say that it will all be done
remains. were it the case that the April 1st goal was unreachable for
good reason from the start, i think we'd all have accepted that and
aimed for a more appropriate goal. however, the goal was reachable and
the only reason for delaying is that we have failed to be adequately
ready for it. *i* have failed to be adequately ready for it. i don't
think it's good enough to say "i was lazy - let's just move the
goalposts".

> You may be wishing Matt a happy birthday but at the same time you're
> pushing forward a plan that makes him and possibly a very small number
> of others the single point of failure of the whole project for days,
> when this could easily be avoided.

volunteers welcome to help out!

cheers,

matt