[Osmf-talk] Community consultation: plans to hire a Senior Site Reliability Engineer
Jorden Verwer
jorden at verwer.express
Sun Jul 26 14:51:56 UTC 2020
Hello,
It's been a while since I last participated on this mailing list, but
this subject drew my attention. See, all the people I know who call
themselves "Senior Site Reliability Engineer" are just glorified system
administrators - and not particularly good ones at that. Now that I know
that this term is just a Googleism, it makes more sense - there is a
broad desire in the industry to emulate Google, because Google is
successful (for certain definitions of the term "success", anyway), so
emulating them must necessarily lead to success for other organizations
as well, right? And obviously it's enough to just copy the labels
without actively implementing any meaningful changes, because then
everything else will automatically fall into place...
Cynicism aside, I recognize that there may be valid reasons for adopting
a new approach to system administration; I just don't think one should
blindly follow Google in any matter. Google has a tendency to behave
very idiosyncratically, sometimes even going against basic design
principles of systems they use. While this may or may not work for them,
it often doesn't work for others, especially those that don't understand
the "why" of Google's approach and concentrate only on the "how" (which
they often don't really understand either). Please note that I'm not
accusing OSMF of any of this; I'm just explaining my reasons for
replying.
Reading through the introductory chapter of the book that was mentioned
earlier, one piece in particular had me worried:
"And once an SRE team is in place, their potentially unorthodox
approaches to service management require strong management support. For
example, the decision to stop releases for the remainder of the quarter
once an error budget is depleted might not be embraced by a product
development team unless mandated by their management."
The "we're out of money, drop everything" line of thinking is extremely
orthodox. I hope I'm misinterpreting this text fragment, because if its
authors truly think this is an unorthodox approach, I see no reason to
take anything else they say seriously either.
Something else caught my eye as well:
"integrating developers into on-call pager rotations"
This is a bad idea. Most competent developers don't want to waste their
time on such menial jobs. I recognize that it's sometimes a necessity in
smaller organizations, but bigger organizations waste their talents by
making developers perform operational chores. To be fair, they do
present a credible rationale for this approach, but I think developers
should be held accountable for excess operational workloads caused by
their efforts regardless of arbitrary limits like the 50% criterion that
is mentioned here.
Then later on it turns out that they committed the cardinal sin of using
a term they made up ("error budget") before defining it, causing me to
erroneously interpret it as a financial concept through no fault of my
own. I'm intentionally not going back to edit the earlier part of my
email, to show you just how annoying this is. I'd recommend attending a
high-quality course in technical writing to prevent this from happening
again. And yes, I realize I'm barking up the wrong tree here, but I just
had to vent.
Having said all that, I do think some aspects of "the Google approach"
are valuable and insightful, especially the part about monitoring. I'm
not dismissing it outright. It actually seems
to be one of the better "DevOps" (a term which is far too broad to
really be meaningful) approaches out there. And I certainly don't want
to tell others how they should be doing their jobs, so I won't. If this
is how the volunteers want to strengthen their team, I won't tell them
otherwise. If, on the other hand, this is something the board wants and
the people doing system administration were involved merely to
rubber-stamp the board's plan, so to speak, then I'd like to encourage
them to openly voice any objections they may have. I don't know which
scenario is correct and which isn't, but the people who need to know
certainly do.
All in all, I'd be willing to give this a try, but the permanent (or
open-ended) nature of the proposed employment contract makes this
impossible. Furthermore, my experience in leading and participating in
volunteer organizations that employ paid staff as well has taught me
that such a setting requires even more time than usual to evaluate an
employee's fitness for a particular function. Therefore I would like to
suggest starting out with (for instance) a one-year contract and then
starting a thorough evaluation at about three quarters of that time. Of
course, an employee who's clearly dysfunctional should be fired
earlier, but I'm talking about the "We're okay with what you've been
doing so far" scenario. You'll really want to ask yourself whether this
person is a good long-term addition to the team, leaving open the option
that despite all their good work, the answer is no.
Regards,
Jorden