

HPC Systems Engineer
Location
San Jose, CA
Level
Mid-Level
Department
Semiconductors
Type
Salary
Job Description
Posted on:
2026-04-10
The HPC Systems Engineer role at AMD involves the design, development, and administration of High-Performance Computing (HPC) infrastructure, GPU clusters, and AI workload schedulers, with a focus on building scalable and performant services.
Responsibilities
- Develop, implement, and maintain GPU-based clusters for optimal performance.
- Administer ML/AI platforms, managing deployments and resource allocation.
- Automate system provisioning and cluster management end to end.
- Collaborate with cross-functional teams to support AI infrastructure requirements.
- Monitor and evaluate the performance of AI systems and clusters.
- Use AI/ML to improve internal processes and tools for service delivery.
- Manage GPU clusters and optimize GPU-based services/tools/software.
Job Requirements
- Experience in developing Python-based AI applications.
- Proficiency in HPC infrastructure engineering for AI/HPC domain.
- Familiarity with SLURM and Kubernetes management.
- Strong organizational and problem-solving skills.
- Excellent verbal and written communication skills.
- Experience with automation/monitoring tools such as Ansible, Terraform, etc.
- Demonstrated experience with AI workload schedulers and optimization techniques.




