Description

(Additional locations: San Francisco CA, Sunnyvale CA)

The ML Compute Platform is part of Cruise’s AI Foundation org and owns a cloud-agnostic, reliable, and cost-efficient compute backend built for the specific needs of Cruise AI. We enable high agility for building new features, supporting innovation, and optimizing for the prioritized use cases that are relevant to Cruise AI/ML today and in the future. The team is tasked with ensuring the reliable training and deployment of SOTA ML models and associated workloads, prioritizing high performance, concurrency, availability, and scalability. We focus on improving efficiency in both training and deploying cutting-edge foundation models, and on maximizing the utilization of powerful GPUs such as the H100 and A100. Our team objectives revolve around reliability, cost-effectiveness, scalability, and velocity.

We are looking for an ML Engineer who specializes in building and managing robust, scalable, and high-performance compute platforms tailored for ML workflows. Collaborating closely with our scientists, you will ensure efficient model training and seamless integration into our production environment, ultimately supporting the safe and continuous operation of Cruise and our fleet of autonomous vehicles. The ideal candidate has experience building and running scalable distributed machine learning platforms, brings innovative ideas and approaches, and has intellectual curiosity, strong problem-solving skills, and a bias towards action.

If you are looking to solve one of today’s most complex engineering challenges, see the results of your work in hundreds of self-driving cars, and make a positive impact in the world starting in our cities, join us!

What you’ll be doing:

Design core platform backend software components
Work with cloud platforms such as GCP and Azure
Thrive in a dynamic, multi-tasking environment with ever-evolving priorities. Interface with other teams to incorporate their innovations and vice versa
Analyze and improve efficiency, scalability, and stability of various system resources
Proactively identify, design, and drive large initiatives across Cruise ML workflows
Work on large scale initiatives to raise the overall Cruise engineering bar
Participate in or lead open source projects, and drive community recognition for Cruise engineering

At a Minimum We'd Like You To Have

8+ years of industry experience
Expertise in Go, C++, Python, or another relevant programming language
Strong background with Kubernetes at scale
Relevant experience building large-scale distributed systems
Experience leading and driving large scale initiatives
Experience working with Google Cloud Platform, Microsoft Azure, or Amazon Web Services

It's Preferred If You Have

Hands-on experience in ML platforms
Experience with GPU/TPU optimizations
Experience with training frameworks such as PyTorch and TorchX
Experience with the Ray framework
Leadership/active participation in the open source community
Experience with infrastructure applications or similar experience

The salary range for this position is $229,500 - $270,000. Compensation will vary depending on location, job-related knowledge, skills, and experience. You may also be offered a bonus and benefits. These ranges are subject to change.
