[Osmf-talk] Community consultation: plans to hire a Senior Site Reliability Engineer
Craig Allan
allan at iafrica.com
Fri Jul 24 15:44:58 UTC 2020
Donal Hunt's reference to the SRE book is extremely valuable. I support
his words. I support his stress on creating scalable, reliable
infrastructure and products.
I was struck by the following words in the book:
/In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management,
monitoring, emergency response, and capacity planning of their service(s)./
I have reworked the job advertisement, mainly to slightly improve
clarity and to untangle the scope of work. See far below.
I think the scope as it stands is too broad for one person, strays
from the responsibilities of an SRE team quoted above, and is not tightly
focused on creating scalable, reliable infrastructure and products.
However, the scope of work is probably usable for our merged
SRE/Sysadmin team so I have let it stand. If much of it is delegated to
the team of voluntary sysadmins then it may be workable.
The relationship between the SRE and the SysAdmin team currently is
cloudy, which may lead to problems and turf wars. The Board must expect
to be called upon to define roles.
Something that irks me is that the SRE is supposed to deal with
'users'. I'm not sure who 'users' are in our context, but whoever they
are they should not have direct access to the SRE. The SRE is not the
'Helpdesk'. Perhaps 'users' can log bugs and issues for attention of
the team and of course can deal directly with the Board member who
manages the SRE function.
That Board member above should be the manager of the SRE. This advert
tiptoes cautiously around that responsibility but IMHO that nettle must
be bravely grasped as any relationship other than direct management by
one person will fail in ugly, messy ways.
Craig Allan
===============================
On 2020/07/24 14:09, Donal Hunt wrote:
> ...
>
> I've been both a system administrator and an SRE manager over the past
> 2 decades and the philosophy between the roles is quite different.
> There is a tendency to interchange sysadmin, devops and SRE but I
> would argue that they are distinct roles with differing end goals. An
> SRE will be invaluable in helping the organisation take stock of where
> they are at and support the delivery of changes that will create
> scalable, reliable infrastructure and products.
>
> I would encourage people to read the first chapter
> <https://landing.google.com/sre/sre-book/chapters/introduction/> of
> the SRE book which captured the essence of the discipline back in 2016
> / 2017.
>
> Donal
===========================================
=Senior site reliability engineer, OpenStreetMap Foundation
The OpenStreetMap Foundation (OSMF) operates the systems behind
OpenStreetMap, a global voluntary mapping project. The system uses
about 100 physical and virtual servers around the globe. Keeping this
core technical infrastructure running is a key responsibility of the
OSMF. Until the present time the system operations and development roles
have been admirably managed by a team of very skilled volunteers but
with the continuous growth of the system the management Board is now
looking to transition a key role to a permanent staff member.
The engineer will work full time, and will be managed by one member of
the OSMF Board. The engineer will work with the existing team of
volunteers, and with support of the Board will be able to delegate
aspects of the Scope of Work (listed below) to members of that team.
An opportunity to apply for this position will be made available to the
members of the current voluntary sysadmin team before the position is
more widely advertised.
==Scope of work
===Operations
Management, installation, configuration, maintenance and responding
to outages of the current system
Management of relationships with data centres
Disaster recovery
===Development
Improvement of all system infrastructure (hardware, software,
network, data centres…)
Support for the applications upgrading pipeline
===Management
With Board support, manage and adjust the delegation of work to
volunteers
Support, mentor and enable volunteers and (eventually) co-workers
Risk assessment and mitigation planning
===Policing
Enforcement of usage policies
Identifying and limiting abuse
Support Board revision of usage policy
===Support
Interaction with users, dealing with user requests
First line of answering user tickets
Management of GitHub issues
===Strategy
Coordinating projects to work on with the Board
Helping the Board establish long-term system development plans
==Current Projects
For information, current systems project proposals that are under
consideration include:
===Operations
AWS auditing and improvements
Improving, centralising and reworking logging, monitoring,
reporting, and alerting
Improving and reworking the tile serving architecture and
infrastructure
Moving some infrastructure to containers or cloud.
Moving to ‘server as a resource’ and away from 1 service = 1
server.
Upgrading servers to Ubuntu 20.04
Testing and improving backups
Improving redundancy and availability of services
Modernising runtime environments
Network upgrades in Amsterdam
Implementing Zero Downtime Upgrades (web, API, possibly other
deployments)
Improved storage and hosting of community data (aerial imagery,
maps, photos…)
Forum software upgrade
Relaunch of GPX planet dumps
===Development
Improving the continuous integration and deployment pipelines
===Management
Improving disaster recovery preparations
Improving onboarding documentation
===Policing
Improving policy documents and anti-abuse enforcement
===Support
None
===Strategy
None
==Profile
The applicant should be a great communicator, with an excellent command of
written and spoken English, and should be willing and able to
collaborate online. They should be a creative and inventive problem-solver.
Being already involved in OpenStreetMap as a contributor, or having
experience with other Open Source or Open Data or volunteer communities,
will be useful to understand how our voluntary community works. It
should be noted that the sysadmin team and board are all volunteers who
have full-time jobs outside OpenStreetMap. The successful engineer
should be able to self-organise and find direction in a sometimes
difficult environment that will benefit from their good communication
and inter-personal skills.
==Technical requirements
/The key words "MUST", "SHOULD", "MAY" are to be interpreted as
described in RFC 2119./
The applicant MUST demonstrate experience with:
Ubuntu or Debian based server administration
Nginx
Apache
Shell scripting
Git and GitHub
HTTP
AWS
The applicant SHOULD have experience with:
Squid
DNS
Chef
Load balancing and high availability architectures
Containerisation
The applicant MAY have experience with:
Varnish
Python
Mapnik
Nominatim
Leaflet
Vector tiles
Docker
PostgreSQL, PostGIS
MediaWiki
Ruby, Rails
==Employment/contracting structure
The person will work from their own premises, and most of the time will
determine their schedule.
The OSMF is incorporated in England, deals frequently with UK entities,
and has most of its servers in York, London and Amsterdam. A base in the
UK or another country from which travel to these places is easy would
make some things easier, but it’s not required. The OpenStreetMap
Foundation is a global organisation; working with people and systems in
different time zones and handling related scheduling constraints is
expected.
If the person is based in the UK, IR35 legislation makes it a lot
simpler for everyone if the OSMF hires them as an employee, rather than
a contractor or similar. The contract would in any case be permanent,
not fixed-term or temporary.