

AI/HPC Systems Performance Engineer
Location
Menlo Park, CA
Level
Full-Time
Department
Software Development
Type
Salary
$147,000 - $208,000
Job Description
Posted on: 2025-05-06
Responsibilities
- Develop solutions for large-scale training systems as part of a multi-disciplinary team.
- Monitor and troubleshoot performance of the communication system.
- Benchmark overall performance and identify potential issues across the stack.
- Develop and deploy solutions to address performance issues in the communication library, RDMA transport, and networking.
- Evaluate and debug host networking protocols such as RDMA.
- Triage performance issues in distributed applications.
- Collaborate with teams to improve system performance.
Job Requirements
- Bachelor's degree in Computer Science, Computer Engineering, or related field.
- 4+ years of work experience in relevant fields (BS/MS/PhD preferred).
- Experience with communication libraries like MPI, NCCL, and UCX.
- Experience with RDMA and host networking protocols.
- Understanding of AI training workloads and their network demands.
- Knowledge of RDMA congestion control mechanisms.
- Experience with machine learning frameworks such as PyTorch and TensorFlow.