[Osmf-talk] Community consultation: plans to hire a Senior Site Reliability Engineer

Sun Jul 26 14:51:56 UTC 2020

Hello,

It's been a while since I last participated on this mailing list, but 
this subject drew my attention. See, all the people I know who call 
themselves "Senior Site Reliability Engineer" are just glorified system 
administrators - and not particularly good ones at that. Now that I know 
that this term is just a Googleism, it makes more sense - there is a 
broad desire in the industry to emulate Google, because Google is 
successful (for certain definitions of the term "success", anyway), so 
emulating them must necessarily lead to success for other organizations 
as well, right? And obviously it's enough to just copy the labels 
without actively implementing any meaningful changes, because then 
everything else will automatically fall into place...

Cynicism aside, I recognize that there may be valid reasons for adopting 
a new approach to system administration; I just don't think one should 
blindly follow Google in any matter. Google has a tendency to behave 
very idiosyncratically, sometimes even going against basic design 
principles of systems they use. While this may or may not work for them, 
it often doesn't work for others, especially those that don't understand 
the "why" of Google's approach and concentrate only on the "how" (which 
they often don't really understand either). Please note that I'm not 
accusing OSMF of any of this; I'm just explaining my reasons for 
replying.

Reading through the introductory chapter of the book that was mentioned 
earlier, one piece in particular had me worried:

"And once an SRE team is in place, their potentially unorthodox 
approaches to service management require strong management support. For 
example, the decision to stop releases for the remainder of the quarter 
once an error budget is depleted might not be embraced by a product 
development team unless mandated by their management."

The "we're out of money, drop everything" line of thinking is extremely 
orthodox. I hope I'm misinterpreting this text fragment, because if its 
authors truly think this is an unorthodox approach, I see no reason to 
take anything else they say seriously either.

Something else caught my eye as well:

"integrating developers into on-call pager rotations"

This is a bad idea. Most competent developers don't want to waste their 
time on such menial jobs. I recognize that it's sometimes a necessity in 
smaller organizations, but bigger organizations waste their talents by 
making developers perform operational chores. To be fair, they do 
present a credible rationale for this approach, but I think developers 
should be held accountable for excess operational workloads caused by 
their efforts regardless of arbitrary limits like the 50% criterion that 
is mentioned here.

Then later on it turns out that they committed the cardinal sin of using 
a term they made up ("error budget") before defining it, causing me to 
erroneously interpret it as a financial concept through no fault of my 
own. I'm intentionally not going back to edit the earlier part of my 
email, to show you just how annoying this is. I'd recommend attending a 
high-quality course in technical writing to prevent this from happening 
again. And yes, I realize I'm barking up the wrong tree here, but I just 
had to vent.

Having said all that, I do think some aspects of "the Google approach" 
are certainly valuable and insightful, especially the part about 
monitoring. I'm certainly not dismissing it outright. It actually seems 
to be one of the better "DevOps" (a term which is far too broad to 
really be meaningful) approaches out there. And I certainly don't want 
to tell others how they should be doing their jobs, so I won't. If this 
is how the volunteers want to strengthen their team, I won't tell them 
otherwise. If, on the other hand, this is something the board wants and 
the people doing system administration were involved merely to 
rubber-stamp the board's plan, so to speak, then I'd like to encourage 
them to openly voice any objections they may have. I don't know which 
scenario is correct and which isn't, but the people who need to know 
certainly do.

All in all, I'd be willing to give this a try, but the permanent (or 
open-ended) nature of the proposed employment contract makes such a trial 
impossible. Furthermore, my experience in leading and participating in 
volunteer organizations that employ paid staff as well has taught me 
that such a setting requires even more time than usual to evaluate an 
employee's fitness for a particular function. Therefore I would like to 
suggest starting out with (for instance) a one-year contract and then 
starting a thorough evaluation at about three quarters of that time. Of 
course, an employee who's clearly dysfunctional should be fired 
earlier, but I'm talking about the "We're okay with what you've been 
doing so far" scenario. You'll really want to ask yourself whether 
this person is a good long-term addition to the team, leaving open 
the option that despite all their good work, the answer is no.

Regards,

Jorden