Site Reliability Engineer (Internal Engineering) (Remote)

Description

The Internal SRE ensures the reliability, scalability, and performance of internal systems and infrastructure. This role involves monitoring, automation, incident management, and maintaining self-hosted platforms to support smooth development operations. The Internal SRE works closely with cross-functional teams to manage GitLab CI/CD workflows and cloud infrastructure on AWS. The position emphasizes proactive problem-solving, automation, and collaboration to continuously improve system stability and efficiency.

Responsibilities:

Manage and maintain GitLab environments to ensure high availability and security.
Design and implement CI/CD pipelines to automate software delivery.
Monitor and troubleshoot system performance issues, using observability tools like Prometheus, Grafana, or Datadog.
Collaborate with development teams to align infrastructure efforts with project needs and timelines.
Build and maintain infrastructure as code (IaC) solutions using tools like Terraform and Ansible.
Manage AWS services, including ECS, S3, API Gateway, DynamoDB, RDS, IAM, and VPC.
Participate in incident response, conducting root cause analysis and post-incident reviews.
Automate manual tasks to improve operational efficiency and reduce technical debt.

Minimum Qualifications:

Bachelor’s degree in Computer Science, Information Technology, or a related field.
Equivalent work experience in SRE, DevOps, or infrastructure management may substitute for formal education.
GitLab Administration: Experience managing and securing self-hosted GitLab environments.
CI/CD Workflows: Expertise in designing and maintaining automated pipelines for continuous delivery.
AWS Cloud Expertise: Strong knowledge of AWS services, including ECS, S3, API Gateway, DynamoDB, RDS, IAM, VPC, and Lambda.
Infrastructure-as-Code: Proficiency in Terraform, Ansible, or similar tools.
Monitoring and Observability: Experience with Prometheus, Grafana, Datadog, or other observability platforms.
Automation and Scripting: Proficiency in Python, Bash, or other scripting languages to automate tasks.
Incident Management: Ability to lead incident response efforts and conduct root cause analysis.
Collaboration and Communication: Strong interpersonal skills to work effectively across teams and with stakeholders.

The base pay for this position ranges from $110,000 to $125,000 and will vary depending on how well an applicant's skills and experience align with the job description above.

We will accept applications until 2/18/2025.
