Career Renew is recruiting on behalf of one of its clients for a Site Reliability Engineer (Remote) position; candidates can be based anywhere.
We are looking for a highly motivated engineer with SRE or DevOps experience who can help us develop and automate the various services we operate as part of our ecosystem. In this role, you will drive availability and reliability across multiple engineering teams and work closely with them to ensure that the operational aspects of managing services are automated and observable.
What you'll be doing:
- Building automation and management systems to deliver the services that enable our ecosystem to function
- Coaching teams across the ecosystem on best practices for deployment, observability and scalability
- Collaborating with other SREs and engineering leaders to ensure our architecture and operations are world-class
- Cultivating a culture of learning by providing insight into performance and reliability at an operational level
What you'll need:
- Experience building and delivering large-scale software systems
- Previous experience working with both bare-metal infrastructure (e.g., Equinix) and cloud infrastructure (ideally GCP)
- Experience using the following technologies: BigQuery, Prometheus, Grafana and VictoriaMetrics
- Experience operating as an SRE (or in a similar role), with hands-on experience implementing processes that drive reliability and performance
- A history of working across organizations to codify and implement best practices for both the operation and the construction of software systems; knowledge of CI/CD best practices and the ability to implement them is a plus
- Deep working knowledge of Kubernetes (or other container orchestration systems) and associated technologies
- Clear communication skills (written and verbal) to document processes and architectures