NVIDIA

Site Reliability Engineer, HPC and LSF

Job Description

Posted on: 
2025-11-06

Responsibilities

  • Troubleshoot incoming support requests in a large-scale HPC environment.
  • Enhance deployment automation, configuration management, observability, and operational monitoring.
  • Ensure compute servers are configured correctly with the appropriate Operating System.
  • Perform comprehensive troubleshooting from bare metal to application level.
  • Collaborate with specialist teams to resolve issues.
  • Work with domain experts to improve chip development processes utilizing infrastructure.
  • Contribute to overall quality and improve time to market for next-generation chips.

Job Requirements

  • Proficient in administering Centos/RHEL Linux distributions.
  • Understanding of container technologies like Docker.
  • Proficiency in Python and UNIX scripting languages such as bash.
  • Excellent problem-solving skills for complex systems analysis.
  • Strong communication and teamwork skills.
  • BS in Computer Science or equivalent experience with 2+ years of relevant post-degree experience.
  • Solid understanding of cluster configuration management tools such as Ansible.
Apply now

More job openings