[Osmf-talk] Community consultation: plans to hire a Senior Site Reliability Engineer

Fri Jul 24 15:44:58 UTC 2020

Donal Hunt's reference to the SRE book is extremely valuable.  I support 
his words. I support his stress on creating scalable, reliable 
infrastructure and products.

I was struck by the following words in the book:

/In general, an SRE team is responsible for the availability, latency, 
performance, efficiency, change management, monitoring, emergency 
response, and capacity planning of their service(s)./

I have reworked the job advertisement, mainly to slightly improve 
clarity and to untangle the scope of work. See far below.

I think the scope as it stands is too broad for one person, strays 
from the responsibilities of an SRE team quoted above, and is not tightly 
focused on creating scalable, reliable infrastructure and products.  
However, the scope of work is probably usable for our merged 
SRE/Sysadmin team so I have let it stand. If much of it is delegated to 
the team of voluntary sysadmins then it may be workable.

The relationship between the SRE and the SysAdmin team currently is 
cloudy, which may lead to problems and turf wars. The Board must expect 
to be called upon to define roles.

Something that irks me is that the SRE is supposed to deal with 
'users'.  I'm not sure who 'users' are in our context, but whoever they 
are they should not have direct access to the SRE. The SRE is not the 
'Helpdesk'.  Perhaps 'users' can log bugs and issues for attention of 
the team and of course can deal directly with the Board member who 
manages the SRE function.

That Board member above should be the manager of the SRE. This advert 
tiptoes cautiously around that responsibility but IMHO that nettle must 
be bravely grasped as any relationship other than direct management by 
one person will fail in ugly, messy ways.

Craig Allan
===============================

On 2020/07/24 14:09, Donal Hunt wrote:
> ...
>
> I've been both a system administrator and an SRE manager over the past 
> 2 decades and the philosophy between the roles is quite different. 
> There is a tendency to interchange sysadmin, devops and SRE but I 
> would argue that they are distinct roles with differing end goals. An 
> SRE will be invaluable in helping the organisation take stock of where 
> they are at and support the delivery of changes that will create 
> scalable, reliable infrastructure and products.
>
> I would encourage people to read the first chapter 
> <https://landing.google.com/sre/sre-book/chapters/introduction/> of 
> the SRE book which captured the essence of the discipline back in 2016 
> / 2017.
>
> Donal

===========================================

=Senior site reliability engineer, OpenStreetMap Foundation

The OpenStreetMap Foundation (OSMF) operates the systems behind 
OpenStreetMap, a global voluntary mapping project. The system uses 
about 100 physical and virtual servers around the globe. Keeping this 
core technical infrastructure running is a key responsibility of the 
OSMF. Until now the system operations and development roles have been 
admirably managed by a team of very skilled volunteers, but with the 
continuous growth of the system the Board is now looking to transition 
a key role to a permanent staff member.

The engineer will work full time, and will be managed by one member of 
the OSMF Board.  The engineer will work with the existing team of 
volunteers, and with support of the Board will be able to delegate 
aspects of the Scope of Work (listed below) to members of that team.

An opportunity to apply for this position will be made available to the 
members of the current voluntary sysadmin team before the position is 
more widely advertised.

==Scope of work

===Operations
     Management, installation, configuration and maintenance of the 
current system, and response to outages
     Management of relationships with data centres
     Disaster recovery

===Development
     Improvement of all system infrastructure (hardware, software, 
network, data centres…)
     Support for the application upgrade pipeline

===Management
     With Board support, manage and adjust the delegation of work to 
volunteers
     Support, mentor and enable volunteers and (eventually) co-workers
     Risk assessment and mitigation planning

===Policing
     Enforcement of usage policies
     Identifying and limiting abuse
     Support Board revision of usage policy

===Support
     Interaction with users, dealing with user requests
     First line of answering user tickets
     Management of GitHub issues

===Strategy
     Coordinating projects to work on with the Board
     Helping the Board establish long-term system development plans

==Current Projects
For information, current systems project proposals that are under 
consideration include:

===Operations
         AWS auditing and improvements
         Improving, centralising and reworking logging, monitoring, 
reporting, and alerting
         Improving and reworking the tile serving architecture and 
infrastructure
         Moving some infrastructure to containers or cloud.
         Moving to ‘server as a resource’ and away from 1 service = 1 
server.
         Upgrading servers to Ubuntu 20.04
         Testing and improving backups
         Improving redundancy and availability of services
         Modernising runtime environments
         Network upgrades in Amsterdam
         Implementing Zero Downtime Upgrades (web, API, possibly other 
deployments)
         Improved storage and hosting of community data (aerial imagery, 
maps, photos…)
         Forum software upgrade
         Relaunch of GPX planet dumps
===Development
         Improving the continuous integration and deployment pipelines
===Management
         Improving disaster recovery preparations
         Improving onboarding documentation
===Policing
         Improving policy documents and anti-abuse enforcement
===Support
         nop
===Strategy
         nop

==Profile

The applicant should be a great communicator, with an excellent command of 
written and spoken English, and should be willing and able to 
collaborate online.  They should be a creative and inventive problem-solver.

Being already involved in OpenStreetMap as a contributor, or having 
experience with other Open Source or Open Data or volunteer communities, 
will be useful to understand how our voluntary community works.  It 
should be noted that the sysadmin team and board are all volunteers who 
have full-time jobs outside OpenStreetMap. The successful engineer 
should be able to self-organise and find direction in a sometimes 
difficult environment that will benefit from their good communication 
and inter-personal skills.

==Technical requirements

/The key words "MUST", "SHOULD", "MAY" are to be interpreted as 
described in RFC 2119./

The applicant MUST demonstrate experience with:

     Ubuntu or Debian based server administration
     Nginx
     Apache
     Shell scripting
     Git and GitHub
     HTTP
     AWS

The applicant SHOULD have experience with:

     Squid
     DNS
     Chef
     Load balancing and high availability architectures
     Containerisation

The applicant MAY have experience with:

     Varnish
     Python
     Mapnik
     Nominatim
     Leaflet
     Vector tiles
     Docker
     PostgreSQL, PostGIS
     MediaWiki
     Ruby, Rails

==Employment/contracting structure

The person will work from their own premises, and most of the time will 
determine their schedule.

The OSMF is incorporated in England, deals frequently with UK entities, 
and has most of its servers in York, London and Amsterdam. A base in the 
UK or another country from which travel to these places is easy would 
make some things easier, but it’s not required. The OpenStreetMap 
Foundation is a global organisation; working with people and systems in 
different time zones and handling related scheduling constraints is 
expected.

If the person is based in the UK, IR35 legislation makes it a lot 
simpler for everyone if the OSMF hires them as an employee, rather than 
a contractor or similar.  The contract would in any case be permanent, 
not fixed-term or temporary.
