Senior Site Reliability Engineer
| Posting date: | 30 July 2024 |
|---|---|
| Salary: | £52,412 to £78,517 per year |
| Hours: | Full time |
| Closing date: | 18 August 2024 |
| Location: | S3 7UF |
| Company: | Government Recruitment Service |
| Job type: | Permanent |
| Job reference: | 363068/5 |
Summary
Do you like finding the root cause of a problem and building automated solutions to make sure it doesn’t happen again?
If so, we’d love to hear from you.
As a Senior Site Reliability Engineer, you will drive adoption of SRE best practice across our cloud estate.
Using both your soft skills and technical experience, you will work with teams to ensure standards and governance are met when onboarding our services into the cloud, through a dedicated assessment stage gate process. This in turn ensures our citizen-facing applications satisfy all the required operational and security needs for running in production.
You will execute deployments using runbooks, investigate production incidents and provide dedicated support to teams to determine the root cause.
You'll help to reduce toil and increase automation, improving reliability while reducing the time to live and the cost spent on repetitive tasks.
You will work collaboratively with development teams, provide guidance on best practice and ensure monitoring of applications is enabled.
Successful candidates will be expected to provide an on-call service to help restore services, using dedicated runbooks or technical experience.
As part of the role, you may be required to travel regularly to the other digital hubs. The frequency of this will be discussed further should you be successful.
Please note this role requires you to pass Security Check clearance. For further information, please see 'Selection process details'.
In the SRE team, you will work with application teams across the department to develop reliable and secure solutions for citizens across the UK.
You will work with development teams from the design phase to help them use good practice and department standards when building their application infrastructure.
Additionally, responsibilities of the role will include:
- Contribute authoritative advice and guidance to others in the organisation and externally
- Design and develop the techniques for improving application reliability, runbooks, knowledge transfer to DWP Digital's User Experience Command Centre (UXCC), and ongoing SRE strategy within your Functional and Professional Communities
- Manage the error budget agreed with the product owner for the application and ensure that work is balanced in alignment with it
- Act as the focal point for the investigation and resolution of major or complex incidents for the service, ensuring people with the right skills and expertise are proactively available to respond effectively
- Assess the impact of change requests in consultation with stakeholders, providing technical expertise and authorising the implementation of subsequent changes
- Manage on-call rotations such that all applications have out-of-hours SRE coverage
- Coach and mentor application development and operations engineers in the practice and techniques of SRE
- Conduct retrospectives for all high-priority and major incidents, ensuring they are completed quickly and published
- Routinely seek views and capture ideas from stakeholders and team members for improvements and encourage collaboration and innovation
- Take part in interdepartmental discussions and meetings with a wide variety of external bodies and organisations on a local, regional, national or international basis, and lead community discussions about SRE best practice within Engineering
Check out these blogs: "Working in DWP's hybrid cloud services group" and "Sam's life in the clouds".