Site Reliability Engineer, Frontier Systems Infrastructure
Location
San Francisco, CA
Level
Mid-Level
Department
Consumer Electronics
Type
Salary
$255,000 - $490,000
Job Description
Posted on
2025-12-17
Responsibilities
- Spin up and scale large Kubernetes clusters, including automation for provisioning and lifecycle management.
- Build software abstractions to unify multiple clusters for training workloads.
- Manage node bring-up from bare metal through firmware upgrades.
- Improve operational metrics such as cluster restart times and upgrade cycles.
- Integrate networking and hardware health systems to improve cluster reliability.
- Develop monitoring and observability systems for cluster stability.
- Work at the level of a software engineer, including writing production-quality code.
Job Requirements
- Experience in infrastructure, systems, or distributed systems engineering in large-scale environments.
- Strong knowledge of Kubernetes internals and cluster scaling patterns.
- Proficiency in cloud infrastructure concepts and automation tools.
- Experience with bare-metal Linux environments and GPU hardware.
- Ability to solve high-impact operational problems and build automation.
- Strong programming or scripting skills (Python, Go, etc.).
- Background in GPU workloads or high-performance computing is a bonus.