Microsoft

Site Reliability Engineer

Job Description

Posted on: 2026-02-14

Responsibilities

  • Own end-to-end reliability for Azure Storage hardware in lab environments.
  • Partner with silicon, firmware, BIOS, networking, and OS teams for DPU hardware validation.
  • Define and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
  • Lead incident response and mitigation for hardware- and firmware-related issues.
  • Build automation for provisioning and recovery of DPU-enabled Azure Storage systems.
  • Develop reliability validation strategies, including stress and fault-injection testing.
  • Create and maintain operational runbooks and diagnostics for DPU platforms.

Job Requirements

  • Associate's or Bachelor's degree in Computer Science, IT, or a related field.
  • 2+ years of technical experience in software engineering, network engineering, or systems administration.
  • Experience validating large-scale distributed systems.
  • Proficiency in programming or scripting languages (C++, C#, Python, etc.).
  • Hands-on experience with Microsoft Azure lab infrastructure and live-site operations.
  • Understanding of networking and performance characteristics of I/O-intensive systems.
  • Familiarity with firmware lifecycles and hardware validation processes.