Director, System Reliability Engineering

7.00 to 12.00 Years Redmond (Washington) 14 May, 2025

Job Location	Redmond (Washington)
Education	BE/ B.Tech (Engineering)
Salary	As per Industry Standards
Industry	Software Services, Internet/Dot com/ISP
Functional Area	Production/Manufacturing/Maintenance/Packaging
Employment Type	Full-time

Job Description

Overview

Microsoft Silicon, Cloud Hardware Infrastructure Engineering (SCHIE) is the team behind Microsoft's expanding Cloud Infrastructure and is responsible for powering Microsoft's Intelligent Cloud mission. SCHIE delivers the core infrastructure and foundational technologies for more than 200 Microsoft online businesses, including Bing, MSN, Office 365, Xbox Live, Teams, OneDrive and the Microsoft Azure platform, globally with our server and data center infrastructure, security and compliance, operations, globalization, and manageability solutions. Our focus is on smart growth, high efficiency, and delivering a trusted experience to customers and partners worldwide, and we are looking for passionate, high-energy engineers to help achieve that mission.

As Microsoft's Cloud business continues to grow, the ability to deploy new offerings and HW infrastructure on time, in high volume, with high quality and at the lowest cost is of paramount importance. To achieve this goal, the Hardware, Infrastructure Management, and Fundamentals Engineering (HIFE) team is instrumental in defining and delivering operational measures of success for Cloud infrastructure reliability, improving the planning process, manufacturing, quality, delivery at scale, serviceability and sustainability.

We are looking for a System Reliability Engineering Leader with a passion for customer-focused solutions, insight and industry knowledge to envision and implement future technical solutions that will optimize the Cloud infrastructure and its reliability.

We are looking for an experienced Director, System Reliability Engineering who will be responsible for driving reliability performance across architecture, design, component and material selections, manufacturing and integration of datacenter hardware, ensuring that all electrical, mechanical, thermal, environmental, transportation and operational aspects, along with telemetry, diagnostics and the SW/FW stack of the cloud solution, are optimized throughout the lifecycle of each cloud service. The candidate will interact with Engineering, Supply Chain, Sourcing, Manufacturing & Quality, Fleet Management, Datacenter Operations, and other internal and external stakeholders.

Qualifications

Required/minimum qualifications

Bachelor's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 8 years technical engineering experience
OR Master's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 7 years technical engineering experience
OR Doctorate Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 5 years technical engineering experience.
5 years of people management, including resource planning, career development and performance management.
5 years of experience in system reliability, site reliability engineering, or infrastructure engineering, with at least 1 year focused on AI systems.

Other Requirements

Ability to meet Microsoft, customer and/or government security screening requirements is necessary for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check.

Preferred Qualifications:

Bachelor's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 12 years technical engineering experience
OR Master's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 10 years technical engineering experience
OR Doctorate Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 7 years technical engineering experience.
Experience in AI lifecycle, including model training, deployment, monitoring, and retraining.
Experience in cloud fleet management, telemetry, diagnostic and troubleshooting of IT systems.
Experience and knowledge in the server industry product development process.
Experience in managing cross-functional teams and large-scale distributed systems.
Experience with system reliability, manufacturing process and datacenter operations, leading continuous improvements through automation.
Experience with liquid cooling infrastructure for IT racks.

Reliability Engineering M5

The typical base pay range for this role across the U.S. is USD $137,600 - $267,000 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $180,400 - $294,000 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here.

Microsoft will accept applications for the role until May 26th, 2025.

#azurehwjobs #HIFE #Azure #Cloud #Hardware #AHSI

Responsibilities

Lead the design, implementation, and continuous improvement of reliability practices across our AI infrastructure. Ensure the performance, scalability, and resilience of AI systems in production environments.
Lead the development and execution of both systems and components reliability engineering strategies for all Cloud platforms and services.
Collaborate across HW and SW architecture, data engineering, and platform teams to ensure robust deployment of resilient solutions and services.
Lead strategic innovations and develop processes to integrate industry practices to ensure efficiency in achieving high reliability and quality.
Design and implement observability frameworks tailored to AI workloads.
Drive incident response, root cause analysis, and postmortem processes for HW system outages or degradations.
Establish and monitor SLAs (Availability, Node In Service, Time to restore Availability) for all cloud services, ensuring alignment with business goals and product requirements.
Foster a culture of reliability, automation, consistency of execution and continuous improvement across engineering teams.
Support manufacturing, datacenter operation, troubleshooting and diagnostic methods to optimize the cloud infrastructure reliability.

Locations

Redmond, Washington, United States

Keyskills :
system reliability engineering cloud infrastructure development people management experience cross functional collaboration performance optimization techniques production product development manufacturing mechanical engineering industry knowledge
