Director of Site Reliability and Cloud Infrastructure
Job Overview:
We are seeking a highly skilled and strategic Director of Site Reliability and Cloud Infrastructure to join our team. In this role, you will initially work as an individual contributor, developing, maintaining, and enhancing our infrastructure hands-on while ensuring security, reliability, and scalability. As you establish a strong foundation, you will also be responsible for collaborating with our existing vendors and scaling the internal team by hiring additional team members focused on security, site reliability, and cloud infrastructure.
This position is perfect for a seasoned leader who thrives in both hands-on technical work and strategic leadership. You will play a critical part in shaping the future of our infrastructure and ensuring that our systems are both secure and highly available.
Key Responsibilities:
Hands-On Infrastructure Management:
Develop and maintain scalable and automated infrastructure solutions, particularly on AWS.
Implement and manage monitoring, alerting, and logging systems to detect and address reliability and security risks.
Manage incident response and resolution processes to minimize downtime, prevent recurrence, and ensure robust disaster recovery practices.
Conduct system performance tuning, capacity planning, and optimization to effectively manage resource utilization and loads.
Vendor Collaboration and Oversight:
Build and maintain strong relationships with cloud, security, and infrastructure vendors, ensuring their services meet performance, compliance, and security needs.
Lead contract negotiations and performance reviews for external vendors, ensuring alignment with internal standards and SLAs.
Team Building and Leadership:
Hire, mentor, and lead a high-performing team of site reliability engineers (SREs), security experts, and infrastructure engineers.
Develop career growth plans and technical progression frameworks for team members, ensuring skills development in cloud technologies and SRE best practices.
Create a cohesive vision for cloud infrastructure, reliability, and security, aligning with the broader organizational goals.
Security and Compliance Leadership:
Implement and maintain security best practices, including compliance with SOC2, HIPAA, and other relevant standards.
Ensure the infrastructure is protected against threats and vulnerabilities.
Drive innovation in cloud infrastructure and security, continuously improving our processes and systems.
Automation and Tooling:
Build and maintain automation tools and scripts to streamline system updates, deployments, and monitoring.
Design and oversee CI/CD pipelines, ensuring seamless integration with development and operations teams.
Collaboration and Stakeholder Management:
Work closely with the development, operations, and product teams to ensure alignment on priorities and collaboration on large-scale projects.
Provide technical guidance and mentorship across teams, championing a culture of reliability, automation, and security.
Communicate progress, risks, and issues clearly to both technical and non-technical stakeholders.
Qualifications:
Bachelor’s degree in Computer Science, Engineering, or a related field.
Proven experience in a senior leadership role managing cloud infrastructure and site reliability, preferably within an AWS environment (EC2, S3, RDS, ELB, etc.).
Hands-on experience with infrastructure as code (e.g., Terraform, CloudFormation) and automation tools (e.g., Ansible, Jenkins).
Strong scripting skills (Python, Bash) and the ability to automate complex tasks.
Demonstrated success in scaling infrastructure and teams, particularly within high-availability and high-growth environments.
Solid understanding of networking, cloud security, and compliance standards (e.g., SOC2, HIPAA).
Strong incident management skills and the ability to lead post-incident reviews to drive improvements.
Excellent communication skills and the ability to collaborate effectively with cross-functional teams.
Experience in hiring, developing, and managing technical teams with a focus on career development and innovation.
Preferred Qualifications:
Experience in a high-growth SaaS company, especially within healthcare or other regulated industries.
Familiarity with cloud cost optimization, scalability best practices, and disaster recovery strategies.
Demonstrated ability to lead through influence, setting technical direction and ensuring execution across teams.
Relevant certifications: AWS (Solutions Architect, DevOps Engineer, Security), CCSP, CISSP.
Perks - What you can expect:
Competitive salaries
Remote/hybrid environment
Potential equity compensation for outstanding performance
Flexible PTO
Company-wide sponsored lunches
Company-paid disability and life insurance benefits
Company-paid family and medical leave
Medical, dental, and vision insurance benefits
Discounted pet insurance
FSA/DCA and commuter benefits
401k
Prompt Therapy Solutions, Inc is an equal opportunity employer and does not discriminate on the basis of race, color, religion, ethnicity, ancestry, national origin, sex, gender, gender identity, sexual orientation, age, marital status, veteran status, disability, medical condition, or any other protected characteristic. We celebrate diversity and are committed to creating an inclusive environment for all employees.
Prompt Therapy Solutions, Inc is an E-Verify Employer.