Meta

AI/HPC Systems Performance Engineer

Job Description

Posted on: 2025-05-06

Responsibilities

  • Develop solutions for large scale training systems as part of a multi-disciplinary team.
  • Monitor and troubleshoot performance of the communication system.
  • Benchmark overall performance and identify potential issues across the stack.
  • Develop and deploy solutions to address performance issues in communication libraries, RDMA transport, and networking.
  • Evaluate and debug host networking protocols such as RDMA.
  • Triage performance issues in distributed applications.
  • Collaborate with teams to improve system performance.

Job Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, or related field.
  • 4+ years of work experience in relevant fields (BS/MS/PhD preferred).
  • Experience with communication libraries like MPI, NCCL, and UCX.
  • Experience with RDMA and host networking protocols.
  • Understanding of AI training workloads and their network demands.
  • Knowledge of RDMA congestion control mechanisms.
  • Experience with machine learning frameworks such as PyTorch and TensorFlow.