Support Operations Engineer
CoreWeave
Description

About the Team

CoreWeave’s Support Operations team ensures peak performance and reliability across thousands of nodes in multiple supercomputer clusters, each with tens of thousands of GPUs.

Collaborate with pioneering generative AI labs, world-renowned VFX organizations, and visionary developers and artists. These innovators leverage our cutting-edge GPU cloud infrastructure to power their mission-critical workflows and achieve unprecedented capabilities.

 

About the Role:

As a Support Operations Engineer, you will be responsible for deploying, configuring, and maintaining CoreWeave’s GPU fleet across our growing number of data centers in the U.S., Europe, and beyond.

 

What You'll Do:

  • You’ll monitor our fleet’s health, performance, and reliability through our observability stack: Grafana, Prometheus, and VictoriaMetrics.
  • You’ll use CoreWeave Kubernetes to troubleshoot customer support requests and act as a technical escalation point for the Cloud Support Engineers.
  • You’ll learn from your fellow Support Operations Engineer teammates and mentor junior engineers and new hires
  • You’ll leverage your knowledge of Linux (Ubuntu) to diagnose, troubleshoot, and rectify bugs across the fabric.
  • You’ll assist and collaborate with other teams involved in the management and operation of CoreWeave infrastructure.
  • You’ll offer expertise, guidance, and troubleshooting support to ensure the smooth functioning and optimal performance of the clusters.
  • You’ll support some of the world’s largest bare-metal fleets of dedicated servers running the latest NVIDIA H100 GPU technology on InfiniBand deployments
  • You’ll have a front-row seat for the deployment of new CoreWeave supercomputing clusters for unprecedented customer workloads in AI/HPC
  • You’ll work hand in hand with our Data Center Technicians to install, configure, and troubleshoot all aspects of data center infrastructure
  • You’ll liaise with Cloud Operations to ensure that the CoreWeave platform is scalable, reliable, and stable
  • You’ll partner with our network engineers and software developers to collect failure logs, reproduce issues, and ultimately solve the world’s hardest problems
  • You’ll work with our Technical Writing team to identify, create, and maintain documentation of troubleshooting workflows, corner-case scenarios, and new discoveries
  • You’ll serve as a technical liaison on incidents and escalations, communicating with all stakeholders
  • You’ll participate in a 24/7 on-call rotation every few months, ensuring that mission-critical alerts are addressed for infrastructure resiliency.
  • You’ll develop alerting, telemetry, and new metrics to proactively prevent issues across the fleet and reduce the need for reactive support

What we look for:

  • A working knowledge of cloud computing, virtualization, and container technologies
  • A working knowledge of Linux - tell us about your favorite Linux distro
  • A working knowledge of Kubernetes and Docker
  • A prior role in systems administration, Site Reliability Engineering, DevOps, or Infrastructure Operations
  • A prior role in HPC/AI
  • A knack for solving problems - recognizing technical issues, developing appropriate solutions, and following through to completion
  • A love for creating documentation and processes to better your team’s internal knowledge base
  • An interest in building the world’s largest bespoke supercomputers for leading AI labs
  • A solid understanding of distributed computing environments and their building blocks, such as storage volumes, private networks, load balancers, and virtual machines
  • Excellent communication skills (both written and verbal)
  • A willingness to work in a very fast-paced environment with dynamic priorities and ever-changing developments
  • A highly independent engineer who also collaborates well as part of a team
  • A willingness and interest to travel to CoreWeave data centers as needed

Plus Points:

  • Prior experience with computer hardware or server hardware - did you build your own PC at home?
  • Prior experience in a data center as an engineer or a technician - what kind of servers did you work on?
  • Prior experience with NVIDIA GPUs and CUDA technologies
  • Prior experience with Supermicro, Dell, HPE, and Gigabyte systems
  • Prior experience with HPC systems
  • Prior experience with AI / ML

 

Our compensation reflects the cost of labor across several US geographic markets. The base pay for this position ranges from $75,000/year to $110,000/year. Pay is based on a number of factors including market location and may vary depending on job-related knowledge, skills, and experience.  

 

Hybrid Workplace

If you reside within a 30-mile radius of our New Jersey, New York, or Philadelphia offices, we're excited for you to join us in the office at least three times a week, given the significance we place on fostering connections, collaboration, and creativity within our office culture. Our commitment to operating as a hybrid workplace underscores our dedication to enabling our employees to tailor their work-life balance to their individual preferences.
