OpenAI

Site Reliability Engineer, Frontier Systems Infrastructure

Job Description

Posted on: 
2025-11-04

Responsibilities

  • Spin up and scale large Kubernetes clusters with automation for provisioning and lifecycle management.
  • Build software abstractions for seamless interface to training workloads across multiple clusters.
  • Manage node bring-up from bare metal through firmware upgrades for fast deployment.
  • Improve operational metrics, reducing cluster restart times and accelerating upgrade cycles.
  • Integrate networking and hardware health systems for end-to-end reliability.
  • Develop monitoring and observability systems to maintain cluster stability.
  • Execute tasks at the level of a software engineer while managing operations.

Job Requirements

  • Experience as an infrastructure, systems, or distributed systems engineer in large-scale environments.
  • Strong knowledge of Kubernetes, cluster scaling patterns, and containerized workloads.
  • Proficiency in cloud infrastructure concepts and automating operations.
  • Familiarity with bare-metal Linux environments and GPU hardware.
  • Strong programming or scripting skills (Python, Go, etc.).
  • Enjoy solving high-impact operational problems and building automation.
  • Ability to balance careful engineering with urgency in mission-critical systems.
Apply now

More job openings