Software Engineer, Research Infrastructure
Company: OpenAI
Location: San Francisco
Posted on: April 17, 2025
Job Description:
Team: Scaling

This role will support the fleet infrastructure team
at OpenAI. The fleet team focuses on running the world's largest,
most reliable, and most frictionless GPU fleet to support OpenAI's
general-purpose model training and deployment. Work on this team
includes:
- Maximizing GPUs doing useful work by building user-friendly
scheduling and quota systems
- Running a reliable and low-maintenance platform by building
push-button automation for Kubernetes cluster provisioning and
upgrades
- Supporting research workflows with service frameworks and
deployment systems
- Ensuring fast model startup times through high-performance
snapshot delivery across blob storage down to hardware caching
- Much more!

About the Role

As an engineer within Fleet
infrastructure, you will design, write, deploy, and operate
infrastructure systems for model deployment and training on one of
the world's largest GPU fleets. The scale is immense, the timelines
are tight, and the organization is moving fast; this is an
opportunity to shape a critical system in support of OpenAI's
mission to advance AI capabilities responsibly.

This role is based in San Francisco, CA. We use a hybrid work model
of 3 days in the office per week and offer relocation assistance to
new employees.

In this role, you will:
- Design, implement, and operate components of our compute fleet
including job scheduling, cluster management, snapshot delivery,
and CI/CD systems.
- Interface with researchers and product teams to understand
workload requirements.
- Collaborate with hardware, infrastructure, and business teams
to provide a high-utilization and high-reliability service.

You might thrive in this role if you:
- Have experience with hyperscale compute systems.
- Possess strong programming skills.
- Have experience working in public clouds (especially
Azure).
- Have experience working in Kubernetes.
- Have an execution-focused mentality paired with a rigorous
focus on user requirements.
- As a bonus, have an understanding of AI/ML workloads.

About OpenAI

OpenAI is an AI research and deployment company dedicated to
ensuring that general-purpose artificial intelligence benefits all
of humanity. We push the boundaries of the capabilities of AI
systems and seek to safely deploy them to the world through our
products. AI is an extremely powerful tool that must be created
with safety and human needs at its core, and to achieve our
mission, we must encompass and value the many different
perspectives, voices, and experiences that form the full spectrum
of humanity.

We are an equal opportunity employer and do not
discriminate on the basis of race, religion, national origin,
gender, sexual orientation, age, veteran status, disability or any
other legally protected status.

For US Based Candidates: Pursuant to
the San Francisco Fair Chance Ordinance, we will consider qualified
applicants with arrest and conviction records.

We are committed to
providing reasonable accommodations to applicants with
disabilities, and requests can be made via this
link.

Compensation

$360K - $440K + Offers Equity