

HPC Systems Engineer
Location
San Jose, CA
Level
Mid-Level
Department
Semiconductors
Type
Salary
Job Description
Posted on:
2026-03-19
Responsibilities
- Design, deploy, and maintain GPU-based clusters, and tune them for performance.
- Administer ML/AI platforms, managing deployments, resource allocation, and security.
- Automate system provisioning and cluster management end-to-end.
- Collaborate with cross-functional teams to support AI-related projects and provide technical expertise.
- Monitor and evaluate the performance of AI systems and clusters.
- Use AI/ML to improve internal processes and tools.
- Manage multiple projects simultaneously while ensuring adherence to industry standards.
Job Requirements
- Experience developing Python-based AI applications and user interfaces.
- Proficiency in infrastructure engineering for AI and HPC workloads.
- Familiarity with managing SLURM and Kubernetes clusters.
- Experience in optimizing GPU clusters and managing GPU-based services.
- Knowledge of automation/monitoring tools such as Ansible, Terraform, and Prometheus.
- Strong problem-solving and troubleshooting skills.
- Excellent communication skills for effective collaboration.


