OpenAI

Site Reliability Engineer, Frontier Systems Infrastructure

Job Description

Posted on: 2026-04-25

Responsibilities

  • Spin up and scale large Kubernetes clusters, automating provisioning and lifecycle management.
  • Build software abstractions that provide a seamless interface across multiple clusters.
  • Manage node bring-up from bare metal through firmware upgrades.
  • Improve operational metrics and reduce cluster restart times.
  • Integrate networking and hardware health systems for reliability.
  • Develop monitoring systems to detect issues and maintain stability.
  • Write and ship code at the level of a software engineer.

Job Requirements

  • Experience in infrastructure, systems, or distributed systems engineering in high-availability environments.
  • Strong knowledge of Kubernetes internals and cluster scaling patterns.
  • Proficiency in cloud infrastructure concepts and automation tools.
  • Familiarity with bare-metal Linux environments and GPU hardware.
  • Strong programming or scripting skills (Python, Go, etc.).
  • Ability to solve operational problems and build automation.
  • Ability to balance engineering quality with urgency in mission-critical systems.