AMD

HPC Systems Engineer

Job Description

Posted on: 2026-03-19

Responsibilities

  • Develop, implement, and maintain GPU-based clusters for optimal performance.
  • Administer ML/AI platforms, managing deployments, resource allocation, and security.
  • Automate system provisioning and cluster management end-to-end.
  • Collaborate with cross-functional teams to support AI-related projects and provide technical expertise.
  • Monitor and evaluate the performance of AI systems and clusters.
  • Use AI/ML to improve internal processes and tools.
  • Manage multiple projects simultaneously while ensuring adherence to industry standards.

Job Requirements

  • Experience in developing Python-based AI applications and user interfaces.
  • Proficiency in infrastructure engineering for AI and HPC workloads.
  • Familiarity with managing SLURM and Kubernetes clusters.
  • Experience in optimizing GPU clusters and managing GPU-based services.
  • Knowledge of automation/monitoring tools such as Ansible, Terraform, and Prometheus.
  • Strong problem-solving and troubleshooting skills.
  • Excellent communication skills for effective collaboration.