OpenAI

Site Reliability Engineer, Frontier Systems Infrastructure

Job Description

Posted on: 2025-12-17

Responsibilities

  • Spin up and scale large Kubernetes clusters, including automation for provisioning and lifecycle management.
  • Build software abstractions to unify multiple clusters for training workloads.
  • Manage node bring-up from bare metal through firmware upgrades.
  • Improve operational metrics such as cluster restart times and upgrade cycles.
  • Integrate networking and hardware health systems for reliability.
  • Develop monitoring and observability systems for cluster stability.
  • Deliver engineering work at the level expected of a software engineer.

Job Requirements

  • Experience in infrastructure, systems, or distributed systems engineering in large-scale environments.
  • Strong knowledge of Kubernetes internals and cluster scaling patterns.
  • Proficiency in cloud infrastructure concepts and automation tools.
  • Experience with bare-metal Linux environments and GPU hardware.
  • Ability to solve high-impact operational problems and build automation.
  • Strong programming or scripting skills (Python, Go, etc.).
  • Background in GPU workloads or high-performance computing is a bonus.