(Additional locations: San Francisco CA, Sunnyvale CA)
The ML Compute Platform team is part of Cruise’s AI Foundation org and owns a cloud-agnostic, reliable, and cost-efficient compute backend built for the specific needs of Cruise AI. We enable high agility for building new features, supporting innovation, and optimizing for the prioritized use cases that are relevant to Cruise AI/ML today and in the future. This team is tasked with ensuring the reliable training and deployment of SOTA ML models and associated workloads, prioritizing high performance, concurrency, availability, and scalability. We emphasize efficiency in both training and deploying cutting-edge foundational models, along with maximizing the utilization of powerful GPUs such as the H100 and A100. Our team objectives revolve around reliability, cost-effectiveness, scalability, and velocity.
We are looking for an ML Engineer who specializes in building and managing robust, scalable, and high-performance compute platforms tailored for ML workflows. Collaborating closely with our scientists, they will enable efficient model training and seamless integration into our production environment, ultimately ensuring the safe and continuous operation of Cruise and our fleet of autonomous vehicles. The ideal candidate has experience building and running scalable distributed machine learning platforms, brings innovative ideas and approaches, and has intellectual curiosity, strong problem-solving skills, and a bias toward action.
If you are looking to solve one of today’s most complex engineering challenges, see the results of your work in hundreds of self-driving cars, and make a positive impact in the world starting in our cities, join us!
What you’ll be doing:
- Design core platform backend software components
- Work with cloud platforms such as GCP and Azure
- Thrive in a dynamic, multi-tasking environment with ever-evolving priorities, and interface with other teams to incorporate their innovations and share ours
- Analyze and improve the efficiency, scalability, and stability of various system resources
- Proactively identify, drive, and design large initiatives across Cruise ML workflows
- Work on large-scale initiatives to raise the overall Cruise engineering bar
- Participate in or lead open source projects, and drive community recognition for Cruise engineering
At a Minimum We'd Like You To Have
- 8+ years of industry experience
- Expertise in Go, C++, Python, or another relevant programming language
- Strong background with Kubernetes at scale
- Relevant experience building large-scale distributed systems
- Experience leading and driving large-scale initiatives
- Experience working with Google Cloud Platform, Microsoft Azure, or Amazon Web Services
It's Preferred If You Have
- Hands-on experience with ML platforms
- Experience with GPU/TPU optimizations
- Experience with training frameworks like PyTorch, TorchX
- Experience with the Ray framework
- Leadership or active participation in the open source community
- Experience with infrastructure applications, or similar experience
The salary range for this position is $229,500 - $270,000. Compensation will vary depending on location, job-related knowledge, skills, and experience. You may also be offered a bonus and benefits. These ranges are subject to change.