Company Description
Company Overview:
We are collaborating with a large enterprise client seeking an experienced Senior Site Reliability Engineer for a contract position. The ideal candidate will focus on ensuring system reliability, scalability, and performance while working remotely during hours that overlap with U.S. time zones.
Job Description
Job Title: Senior Site Reliability Engineer (SRE) - Contract Role
Location: Remote (must be available during hours that overlap with U.S. time zones)
Key Responsibilities:
- Identify and resolve complex bugs by working within the codebase and utilizing runbooks.
- Write and maintain code to enhance system reliability, scalability, and performance.
- Restart services and implement changes to the codebase as required.
- Investigate complex system issues and develop effective resolutions.
- Design and build fault-tolerant, scalable systems for high availability and performance.
- Apply reliability methods and metrics such as Design for Reliability (DFR), Failure Mode and Effects Analysis (FMEA), and Mean Time Between Failures (MTBF).
- Develop and maintain reliability standards and documentation.
Qualifications
Required Skills and Experience:
- 5-7 years of experience in Site Reliability Engineering or a related field.
- Proven experience in designing and implementing fault-tolerant, scalable systems at an enterprise level.
- Deep understanding of reliability methods and metrics, including DFR, FMEA, and MTBF.
- Proficiency with tools such as Datadog, PagerDuty, Marvin, and Backstage, as well as with pipeline deployment processes and rollback procedures.
- Strong coding skills in one or more programming languages commonly used in SRE.
- Exceptional analytical skills to investigate complex issues and devise effective solutions.
- Willingness to learn new products and tools provided by the company.
- Excellent communication skills and ability to work effectively within a distributed team environment.
- Must be able to work remotely with significant overlap with U.S. time zones.
Preferred Qualifications:
- Experience with runbooks and operational excellence methodologies.
- Familiarity with large-scale enterprise systems and environments.
- Relevant certifications in reliability engineering, cloud platforms, or related technologies.