The Internal SRE ensures the reliability, scalability, and performance of internal systems and infrastructure. This role involves monitoring, automation, incident management, and maintaining self-hosted platforms to support smooth development operations. The Internal SRE works closely with cross-functional teams to manage GitLab CI/CD workflows and cloud infrastructure on AWS. The position emphasizes proactive problem-solving, automation, and collaboration to continuously improve system stability and efficiency.
Responsibilities:
- Manage and maintain GitLab environments to ensure high availability and security.
- Design and implement CI/CD pipelines to automate software delivery.
- Monitor and troubleshoot system performance issues using observability tools such as Prometheus, Grafana, or Datadog.
- Collaborate with development teams to align infrastructure efforts with project needs and timelines.
- Build and maintain infrastructure as code (IaC) solutions using tools like Terraform and Ansible.
- Manage AWS services, including ECS, S3, API Gateway, DynamoDB, RDS, IAM, and VPC.
- Participate in incident response, conducting root cause analysis and post-incident reviews.
- Automate manual tasks to improve operational efficiency and reduce technical debt.
Minimum Qualifications:
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- Equivalent work experience in SRE, DevOps, or infrastructure management may substitute for formal education.
- GitLab Administration: Experience managing and securing self-hosted GitLab environments.
- CI/CD Workflows: Expertise in designing and maintaining automated pipelines for continuous delivery.
- AWS Cloud Expertise: Strong knowledge of AWS services, including ECS, S3, API Gateway, DynamoDB, RDS, IAM, VPC, and Lambda.
- Infrastructure-as-Code: Proficiency in Terraform, Ansible, or similar tools.
- Monitoring and Observability: Experience with Prometheus, Grafana, Datadog, or other observability platforms.
- Automation and Scripting: Proficiency in Python, Bash, or other scripting languages to automate tasks.
- Incident Management: Ability to lead incident response efforts and conduct root cause analysis.
- Collaboration and Communication: Strong interpersonal skills to work effectively across teams and with stakeholders.
The base pay for this position ranges from $110,000 to $125,000 and will vary depending on how well an applicant's skills and experience align with the job description above.
We will accept applications until February 18, 2025.