Site Reliability Engineer, HPC and LSF

Location

Seattle, WA

Level

Mid-Level

Department

Semiconductors

Type

Salary

$120,000 - $190,000

Job Description

Posted on:

2025-11-06

Troubleshoot incoming support requests in a large-scale HPC environment.
Enhance deployment automation, configuration management, observability, and operational monitoring.
Ensure compute servers are configured correctly with the appropriate Operating System.
Perform comprehensive troubleshooting from bare metal to application level.
Collaborate with specialist teams to resolve issues.
Work with domain experts to improve chip development processes utilizing infrastructure.
Contribute to overall quality and improve time to market for next-generation chips.

Proficient in administering Centos/RHEL Linux distributions.
Understanding of container technologies like Docker.
Proficiency in Python and UNIX scripting languages such as bash.
Excellent problem-solving skills for complex systems analysis.
Strong communication and teamwork skills.
BS in Computer Science or equivalent experience with 2+ years of relevant post-degree experience.
Solid understanding of cluster configuration management tools such as Ansible.