<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <div class="moz-cite-prefix">Donal Hunt's reference to the SRE book

      is extremely valuable.  I support his words. I support his stress

      on creating scalable, reliable infrastructure and products.<br>

      <div><br>

      </div>

      I was struck by the following words in the book:<br>

    </div>

    <div class="moz-cite-prefix"><br>

    </div>

    <div class="moz-cite-prefix">

      <pre><i>In general, an SRE team is responsible for the </i><em>availability, latency, performance, efficiency, change management, 

monitoring, emergency response, and capacity planning</em><i> of their service(s).</i></pre>

    </div>

    <div class="moz-cite-prefix"><br>

    </div>

    <div class="moz-cite-prefix">I have reworked the job advertisement,

      mainly to slightly improve clarity and to untangle the scope of

      work. See far below.</div>

    <div class="moz-cite-prefix"><br>

    </div>

    <div class="moz-cite-prefix">I think the scope as it stands is too

      broad for one person, strays from the responsibilities of an SRE

      team quoted above, and is not tightly focused on creating

      scalable, reliable infrastructure and products.  However, the

      scope of work is probably usable for our merged SRE/Sysadmin team

      so I have let it stand. If much of it is delegated to the team of

      voluntary sysadmins then it may be workable.   <br>

    </div>

    <div class="moz-cite-prefix"><br>

    </div>

    <div class="moz-cite-prefix">The relationship between the SRE and

      the SysAdmin team is currently unclear, which may lead to problems

      and turf wars. The Board must expect to be called upon to define

      roles.<br>

    </div>

    <div class="moz-cite-prefix"><br>

    </div>

    <div class="moz-cite-prefix">Something that irks me is that the SRE

      is supposed to deal with 'users'.  I'm not sure who 'users' are in

      our context, but whoever they are they should not have direct

      access to the SRE. The SRE is not the 'Helpdesk'.  Perhaps 'users'

      can log bugs and issues for attention of the team and of course

      can deal directly with the Board member who manages the SRE

      function. <br>

    </div>

    <div class="moz-cite-prefix"><br>

    </div>

    <div class="moz-cite-prefix">That Board member above should be the

      manager of the SRE. This advert tiptoes cautiously around that

      responsibility but IMHO that nettle must be bravely grasped as any

      relationship other than direct management by one person will fail

      in ugly, messy ways. <br>

    </div>

    <div class="moz-cite-prefix"><br>

    </div>

    <div class="moz-cite-prefix">Craig Allan</div>

    <div class="moz-cite-prefix">===============================<br>

    </div>

    <div class="moz-cite-prefix"><br>

    </div>

    <div class="moz-cite-prefix">On 2020/07/24 14:09, Donal Hunt wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAF1AMMMv0Zgb6YYJgA5MHc5C51dXWZMUJSyU9prSHTcB=hS-hw@mail.gmail.com">

      <meta http-equiv="content-type" content="text/html; charset=UTF-8">

      <div dir="ltr">...

        <div class="gmail_quote">

          <div><br>

          </div>

          <div>I've been both a system administrator and an SRE manager

            over the past 2 decades and the philosophy between the roles

            is quite different. There is a tendency to interchange

            sysadmin, devops and SRE but I would argue that they are

            distinct roles with differing end goals. An SRE will be

            invaluable in helping the organisation take stock of where

            they are at and support the delivery of changes that will

            create scalable, reliable infrastructure and products.</div>

          <div><br>

          </div>

          <div>I would encourage people to read the <a

              href="https://landing.google.com/sre/sre-book/chapters/introduction/"

              moz-do-not-send="true">first chapter</a> of the SRE book

            which captured the essence of the discipline back in 2016 /

            2017.</div>

          <div><br>

          </div>

          <div>Donal</div>

        </div>

      </div>

    </blockquote>

    <p>===========================================</p>

    <p>=Senior site reliability engineer, OpenStreetMap Foundation<br>

      <br>

      The OpenStreetMap Foundation (OSMF) operates the systems behind

      OpenStreetMap, a global voluntary mapping project. The system

      uses about 100 physical and virtual servers around the globe.

      Keeping this core technical infrastructure running is a key

      responsibility of the OSMF. Until now, the system operations

      and development roles have been admirably managed by a team of

      very skilled volunteers, but with the continuous growth of the

      system the management Board is now looking to transition a key

      role to a permanent staff member.<br>

      <br>

      The engineer will work full time, and will be managed by one

      member of the OSMF Board.  The engineer will work with the

      existing team of volunteers, and with support of the Board will be

      able to delegate aspects of the Scope of Work (listed below) to

      members of that team. <br>

      <br>

      An opportunity to apply for this position will be made available

      to the members of the current voluntary sysadmin team before the

      position is more widely advertised.<br>

      <br>

      ==Scope of work<br>

      <br>

      ===Operations<br>

          Management, installation, configuration, maintenance and

      responding to outages of the current system <br>

          Management of relationships with data centres<br>

          Disaster recovery <br>

          <br>

      ===Development<br>

          Improvement of all system infrastructure (hardware, software,

      network, data centres…)<br>

          Support for the applications upgrading pipeline<br>

             <br>

      ===Management<br>

          With Board support, manage and adjust the delegation of work

      to volunteers<br>

          Support, mentor and enable volunteers and (eventually)

      co-workers<br>

          Risk assessment and mitigation planning<br>

      <br>

      ===Policing<br>

          Enforcement of usage policies<br>

          Identifying and limiting abuse<br>

          Support Board revision of usage policy<br>

          <br>

      ===Support<br>

          Interaction with users, dealing with user requests<br>

          First line of answering user tickets<br>

          Management of GitHub issues<br>

      <br>

      ===Strategy<br>

          Coordinating projects to work on with the Board<br>

          Helping the Board establish long-term system development plans<br>

      <br>

      ==Current Projects<br>

      For information, current systems project proposals that are under

      consideration include:<br>

      <br>

      ===Operations<br>

              AWS auditing and improvements<br>

              Improving, centralising and reworking logging, monitoring,

      reporting, and alerting<br>

              Improving and reworking the tile serving architecture and

      infrastructure<br>

              Moving some infrastructure to containers or cloud. <br>

              Moving to ‘server as a resource’ and away from 1 service =

      1 server.<br>

              Upgrading servers to Ubuntu 20.04<br>

              Testing and improving backups<br>

              Improving redundancy and availability of services<br>

              Modernising runtime environments<br>

              Network upgrades in Amsterdam<br>

              Implementing Zero Downtime Upgrades (web, API, possibly

      other deployments)<br>

              Improved storage and hosting of community data (aerial

      imagery, maps, photos…)<br>

              Forum software upgrade<br>

              Relaunch of GPX planet dumps    <br>

      ===Development<br>

              Improving the continuous integration and deployment

      pipelines<br>

      ===Management<br>

              Improving disaster recovery preparations<br>

              Improving onboarding documentation<br>

      ===Policing<br>

              Improving policy documents and anti-abuse enforcement<br>

      ===Support<br>

              None at present<br>

      ===Strategy<br>

              None at present<br>

      <br>

      <br>

      ==Profile<br>

      <br>

      The applicant should be a great communicator, with an excellent

      command of written and spoken English, and should be willing and

      able to collaborate online.  They should be a creative and

      inventive problem-solver.<br>

      <br>

      Being already involved in OpenStreetMap as a contributor, or

      having experience with other Open Source or Open Data or volunteer

      communities, will be useful to understand how our voluntary

      community works.  It should be noted that the sysadmin team and

      board are all volunteers who have full-time jobs outside

      OpenStreetMap. The successful engineer should be able to

      self-organise and find direction in a sometimes difficult

      environment that will benefit from their good communication and

      inter-personal skills.<br>

      <br>

      ==Technical requirements<br>

      <br>

      <i>The key words "MUST", "SHOULD", "MAY" are to be interpreted as

        described in RFC 2119.</i><br>

            <br>

      The applicant MUST demonstrate experience with:<br>

      <br>

          Ubuntu or Debian based server administration<br>

          Nginx<br>

          Apache<br>

          Shell scripting<br>

          Git and GitHub<br>

          HTTP<br>

          AWS<br>

      <br>

      The applicant SHOULD have experience with:<br>

          <br>

          Squid<br>

          DNS<br>

          Chef<br>

          Load balancing and high availability architectures<br>

          Containerisation<br>

      <br>

      The applicant MAY have experience with:<br>

      <br>

          Varnish<br>

          Python<br>

          Mapnik<br>

          Nominatim<br>

          Leaflet<br>

          Vector tiles<br>

          Docker<br>

          PostgreSQL, PostGIS<br>

          MediaWiki<br>

          Ruby, Rails<br>

      <br>

      <br>

      ==Employment/contracting structure<br>

      <br>

      The person will work from their own premises, and most of the time

      will determine their schedule.<br>

      <br>

      The OSMF is incorporated in England, deals frequently with UK

      entities, and has most of its servers in York, London and

      Amsterdam. A base in the UK or another country from which travel

      to these places is easy would make some things easier, but it’s

      not required. The OpenStreetMap Foundation is a global

      organisation; working with people and systems in different time

      zones and handling related scheduling constraints is expected.<br>

      <br>

      If the person is based in the UK, IR35 legislation makes it a lot

      simpler for everyone if the OSMF hires them as an employee, rather

      than a contractor or similar.  The contract would in any case be

      permanent, not fixed-term or temporary.<br>

    </p>

  </body>

</html>