

Site Reliability Engineer, Frontier Systems Infrastructure
Location
San Francisco, CA
Level
Mid-Level
Department
Consumer Electronics
Type
Salary
$255,000 - $490,000
Job Description
Posted on:
2025-11-04
Responsibilities
- Spin up and scale large Kubernetes clusters with automation for provisioning and lifecycle management.
- Build software abstractions that give training workloads a seamless interface across multiple clusters.
- Manage node bring-up from bare metal through firmware upgrades to enable fast deployment.
- Improve operational metrics, reducing cluster restart times and accelerating upgrade cycles.
- Integrate networking and hardware health systems for end-to-end reliability.
- Develop monitoring and observability systems to maintain cluster stability.
- Work at the level of a software engineer while also managing operations.
Job Requirements
- Experience as an infrastructure, systems, or distributed systems engineer in large-scale environments.
- Strong knowledge of Kubernetes, cluster scaling patterns, and containerized workloads.
- Proficiency in cloud infrastructure concepts and operations automation.
- Familiarity with bare-metal Linux environments and GPU hardware.
- Strong programming or scripting skills (Python, Go, etc.).
- Enthusiasm for solving high-impact operational problems and building automation.
- Ability to balance careful engineering with urgency in mission-critical systems.




