Description

About Henry Meds:
Tens of millions of Americans are unable to manage their chronic conditions with commercial medications. Using specialized compounded formulas tailored to individual patient needs, Henry helps people who have been left behind by the commercial market, all while remaining easy, accessible, and affordable. Our customers get access to the care they need, and save thousands of dollars on out-of-pocket healthcare expenses per year!

Enjoy the casual culture, remote-first workplace, and generous PTO/benefits!

Apply today to make a direct, daily impact at one of the fastest-growing startups in the country. We are excited to meet you!

Position Overview:
We are seeking our first Lead Site Reliability Engineer (SRE). In this role, you will ensure the reliability, scalability, and performance of complex systems and cloud infrastructure. Your first priority will be to define observability guidelines for the company. The role involves close collaboration with engineering and security teams to integrate SRE principles throughout the software development lifecycle. Strong analytical skills are essential for diagnosing and resolving issues, and leadership abilities are crucial for mentoring junior engineers and fostering a culture of continuous improvement. As the first SRE hire, you will be instrumental in building the team and setting the direction of our DevOps culture. You will also assist in hiring for our SRE, Platform, and Shared Services teams.

Duties and Responsibilities:

Architect and create our observability and monitoring system.
Create a disaster recovery plan and facilitate disaster recovery testing. Familiarity with DiRT (Disaster Recovery Testing) exercises is a plus.
Define how configuration and networking are managed per environment, and how all systems are monitored, supported, and scaled in production.
Oversee teams who are responsible for the design, architecture, and development of operational infrastructure within our platform.
Assist in hiring engineers to handle daily operations, and embed SRE practices across the department.
Provide architectural and technical guidance and mentorship to SRE teams, fostering skill development and building strong, capable SRE practices.
Lead and prioritize multiple projects, create roadmaps, and drive implementation plans.
Partner with product and engineering stakeholders to proactively identify operational needs and deliver solutions.

You will likely have:

Experience in GCP working with stakeholders to develop and document resilient services across multiple edge and availability zones, maintaining comprehensive, documented disaster recovery plans, and regularly conducting drills and exercises to test and validate their effectiveness.
Experience managing identity and access management (IAM) to control access to resources and services in GCP, and working with stakeholders to develop security practices and procedures that ensure compliance with industry best practices and regulations.
Experience managing cloud security and monitoring systems that ensure system health.
Experience leading incident management processes, conducting post-mortems, and driving improvements to prevent future incidents.
Experience setting availability expectations, addressing performance issues, uncovering observability gaps, leading problem management, and driving capacity planning.
The ability to manage cloud operations, including installing, maintaining, and monitoring network resources.
Experience defining SLOs and SLIs, leading on-call support schedules, troubleshooting, building support playbooks, implementing monitoring, alerting, and logging standards, and conducting performance testing.
Experience creating playbooks using a chaos engineering mindset and resilience testing.
Experience architecting infrastructure as code (IaC) using Terraform.

You may have:

10+ years of overall experience in a DevOps or Site Reliability Engineering environment
2+ years of leading Cloud SRE teams across AWS and Google Cloud Platform
5+ years of hands-on experience with infrastructure design and deployment using cloud PaaS and IaaS offerings
5+ years of experience in cloud and system observability (Datadog, Grafana, Cloud Profiler) and alerting (OpsGenie, PagerDuty, GCP Cloud Monitoring)
5+ years of experience architecting and building infrastructure with a focus on redundancy, reliability, disaster response, and recovery
5+ years of configuration and management experience with cloud networking technologies (GCP IAM model, Terraform, gcloud CLI)
5+ years of cloud operations experience with automation solutions
5+ years of experience with cloud solutions (Google Cloud Platform, Cloud Run, containers, Terraform, GCS), C#, and TypeScript

Company Offers:

Platinum PPO Healthcare + Vision & Dental (Henry covers 99% for employees and 50% for their qualified dependents).
401(k) with matching contributions beginning your first day.
Unlimited PTO.
Fully remote position with occasional travel.
Impactful, rewarding work as part of a fast-growing brand helping thousands of people every day.

Equal Opportunity Statement:

Henry Meds is committed to promoting an inclusive work environment free of discrimination and harassment. We value a diverse and balanced team where everyone can belong.

Applicants must be authorized to work for ANY employer in the U.S. We cannot sponsor or take over sponsorship of an employment visa at this time.
