Six Sigma in the datacenter drives a zero-defects culture

Situation

Like many IT organizations, Microsoft IT wants to keep its global infrastructure available at all times. Scope, scale, and an environment where production code is frequently disrupted by software builds all contribute to the challenge of providing complete availability and reliability.

Solution

For the first time, Microsoft IT successfully applied Six Sigma methodologies to global datacenter operations. Server defects are now systematically identified and eradicated. The organization standardized its platform, built a robust BI system to identify and publicize defects at all levels of the organization, and empowered its staff to proactively address defects on an ongoing basis.

Benefits

Microsoft made a strong commitment to using statistical methods and data-driven decisions to drive down defects in its worldwide server infrastructure. The defect eradication framework has clearly improved availability, productivity, performance, and security. Other groups within the enterprise can also leverage this framework.

February 2016

Microsoft constantly strives to improve reliability and availability in its worldwide datacenter network. Microsoft IT developed an innovative defect reduction program that applies ITIL Problem Management and Six Sigma methodologies to datacenter operations and assets for the first time. Internal hosting consumers now use a greatly standardized and simplified environment. Supported by a robust data-driven framework, Microsoft IT professionals strive to eradicate defects in the datacenter and systematically reduce IT infrastructure failures.

Situation

Worldwide, Microsoft IT manages more than 40,000 datacenter servers. Servers are spread across datacenters and subsidiaries in the Americas, Europe, Middle East, and Asia Pacific regions. The IT infrastructure is a critical component of the overall business: Microsoft servers run thousands of line-of-business applications and process $40B in sales annually.

Like other large enterprises that manage global datacenters, Microsoft finds it a challenge to keep its IT infrastructure available at all times. Interruptions in service or outright server failures disrupt business in many ways. Software development can slow, system failures can occur during critical business periods, and disruptions can potentially contribute to revenue losses. Availability and reliability are even more at risk at a software and services company like Microsoft, where server availability can easily be disrupted by software changes, trials during incident troubleshooting, infrastructure changes, and so on.

The following figure shows the scale and complexity of the Microsoft IT server platform.

Figure 1. Server platform operations

Problem management framework

Microsoft was already leveraging the Information Technology Infrastructure Library (ITIL) Problem Management discipline in its datacenters, and had successfully improved datacenter processes using Six Sigma continuous improvement methodologies. Microsoft has taken Six Sigma beyond process improvement and applied it to datacenter assets and operations. At the same time, Microsoft did not want to lose its significant investment in ITIL Problem Management.

Microsoft chose to apply Six Sigma methodologies to its ITIL Problem Management practices, and established a comprehensive framework for defect management. The goal was to reduce defects, enabling the infrastructure to be more consistent, reliable, and stable worldwide. By applying Six Sigma methodologies to its ITIL Problem Management practices, Microsoft sought to create a zero defects culture, empowering its IT managers and server owners to consistently focus on proactive defect eradication.

Organization support

The IT organization knew that to successfully make such a significant investment, it would need to create both a comprehensive framework and a culture change. This meant getting support at all levels of the organization, including executive management and the consumers of its hosting service. However, until the team had hard data, they knew it would be a challenge to get executive buy-in on the project and create internal support from stakeholders around the world.

Therefore, Microsoft IT correlated server incidents with deviations from configuration standards, also referred to as defects, for 90 days. The results clearly showed that servers with fewer defects had fewer failures; servers with more defects had more failures. With the data in hand, the organization obtained the necessary approvals to move the project forward.

Solution

Microsoft approached identifying and eradicating defects by phasing in improvements gradually, accommodating the complex hosting environment. The project had three distinct phases. At a high level, they were:

1. Establish platform standards. Microsoft established platform configuration standards, reduced environment complexity across the enterprise by driving compliance to standards, and developed a comprehensive configuration baseline.
2. Measure and publish. With an established baseline in place, configuration gaps (defects) were identified. Robust data collection and business intelligence (BI) systems made defects visible to all levels of the IT organization, driving adoption.
3. Eradicate defects. Defects with the greatest potential to cause failure were prioritized and remediated. Managers and server owners were empowered to understand the impact of defects in their infrastructure portfolio, and were given tools to remediate them.

The phases were implemented sequentially, over about five years. The scope and scale of the project were large, touching five global datacenters, almost 300 virtual branch offices, hundreds of storage switches and arrays, and thousands of Microsoft SQL Server instances. The project spanned all ITIL processes, from build, to run, to support, and problem management. In the process, Microsoft moved from 2.5 Sigma to 4.0 Sigma, which reflects a significant decrease in defects, and which resulted in greater infrastructure availability and fewer incidents. In 2013, the DPMO (Defects per Million Opportunities) defect eradication program was launched across Microsoft IT.

The following sections explain in more detail how Microsoft developed its system to identify, accurately measure, and proactively correct defects in order to improve IT infrastructure availability and reliability.
Establish platform standards

Microsoft first needed to reduce complexity and standardize its datacenter platform. The goal was to work within a contained set of configuration standards, with the understanding that the latest versions of software and hardware are generally more stable than their predecessors. Microsoft also adopted the same product support lifecycle as external customers who receive product support from Microsoft.

In driving comprehensive adoption to the most current hardware and software standards, the organization accomplished two sets of goals. First, it established consistent and updated configuration standards for every server to adhere to, ensured that key stakeholders understood the standards, and drove compliance to them. Second, it created an effective baseline for defect measurement.

Detailed server platform configuration standards were defined with precise business impact mapping. Building on the thorough platform knowledge within the problem management discipline, a virtual team of subject matter experts with deep technical expertise contributed to this aspect of the initiative. Microsoft also leveraged a wide range of other sources for robust input. These include industry best practices, the results of root cause analysis from past major or high priority incidents, critical server failure event data, deep analysis of the infrastructure, product group recommendations, and customer engagement data. With this information, Microsoft was able to both define the standards and map specific standards non-compliance to a definite set of infrastructure failures.

The areas where detailed server platform configuration standards were defined include:

- Microsoft operating systems
- Microsoft SQL Server and Database management
- Storage area network (SAN)
- Storage
- Clustering technologies
- Hyper-V layer
- Hardware models
- Firmware
- Third-party applications
- Miscellaneous software packages

To help create a deep level of infrastructure accountability, this foundational phase also included Microsoft mapping servers to specific owners and teams within the organization.

Measure and publish

With consistent datacenter platforms and a baseline in place, Microsoft next leveraged System Center Configuration Manager (SCCM), System Center Operations Manager (SCOM), and an in-house tool to collect server data. At regular intervals, large amounts of data were collected to validate compliance to the configuration standards. Microsoft created a robust BI system to analyze and report on the data. Compliance performance against the configuration standards was published and visible up to the CIO level. Role-specific views of reports were created to support the needs of both long-term and immediate decision making. The data transparency alone gave stakeholders an incentive to invest in their server footprint; after data was published, close to 20 percent of defects were proactively remediated.

Collect data

SCCM and SCOM collected inventory and configuration data from servers, while monitoring them for availability. Additional configuration data was collected by the in-house tool. The daily and weekly data collection processes mine infrastructure assets without affecting their functionality. Customized Windows Management Instrumentation requests, registry calls, and customized SQL Server queries captured additional configuration data. Results from the configuration scans were then compared to the platform configuration standards.
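The comparison of scan results against the configuration standards can be pictured with a small sketch. This is not Microsoft IT's tooling: the setting names, server names, and the `find_defects` helper are hypothetical, and the sketch assumes each standard is a single expected value per setting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Defect:
    """One deviation of a server setting from the configuration standard."""
    server: str
    setting: str
    expected: str
    observed: Optional[str]  # None means the scan did not report the setting


def find_defects(standard, scans):
    """Return one Defect per setting whose scanned value differs from the standard."""
    defects = []
    for server, observed in sorted(scans.items()):
        for setting, expected in standard.items():
            actual = observed.get(setting)
            if actual != expected:
                defects.append(Defect(server, setting, expected, actual))
    return defects


# Hypothetical baseline and scan results, for illustration only.
STANDARD = {"os_build": "2012R2", "firmware": "2.4", "antivirus": "enabled"}
SCANS = {
    "srv-01": {"os_build": "2012R2", "firmware": "2.4", "antivirus": "enabled"},
    "srv-02": {"os_build": "2008R2", "firmware": "2.4"},
}

defects = find_defects(STANDARD, SCANS)
for d in defects:
    print(f"{d.server}: {d.setting} expected {d.expected!r}, found {d.observed!r}")
```

In this sketch a missing setting counts as a defect, on the assumption that a value the scan cannot confirm should not be treated as compliant.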
Establish a BI system

Embracing a zero defects culture required a well-defined BI system. Microsoft developed a robust BI system for data analysis and reporting that allowed Microsoft IT and business groups to focus on the specific goal of reducing the number of server defects in production.

Microsoft needed to empower its stakeholders. Stakeholders needed to both understand the impact of defects in their infrastructure and have the tools to remediate defects. The Server Deployment and Operations BI team built a series of self-service reports that help anyone in Microsoft IT manage defects in their server portfolio. In the reports, ownership data was coupled with each server's defects, the potential severity per defect, and the configuration remediation steps per defect. With this rollup, each IT organization could see a comprehensive representation of the defects in its own server footprint.

Figure 2 shows the data available to the IT organizations. The red lines represent defects within a specific server, called out with their descriptions, priority, and remediation requirements. DPMO values are calculated from the underlying count of defects and servers.

Figure 2. Reporting structure

Provide specific data views

Microsoft needed to provide data views for a variety of business functions, such as leadership and engineering. A CIO usually has different information needs than a network engineer. This differentiated approach supports data-driven decisions, regardless of where in an organization hierarchy the defects occur.

Microsoft created a leadership and manager view to provide insights on the health of the production environment for an extended period. DPMO trends for a 16-week period allowed this layer of the organization to drive decisions and priorities that are tied to critical business schedules.

Figure 3. Role-specific views

For engineers and IT Pros, flexible, self-service reports allow server owners to quickly find defects in their servers, to easily understand issues, and, most importantly, to take necessary remediation steps.

Eradicate defects

In the final phase of the project, infrastructure risk was assessed, and then the most dangerous defects were identified, prioritized, and addressed. The team incorporated input from root cause analysis in order to assess and prioritize infrastructure risk. By analyzing existing problem records, Microsoft could quantify infrastructure components that could eventually fail, causing service disruptions to customers. After each remediation effort, the teams were empowered to perform simple hypothesis tests, which consistently demonstrated a statistically significant reduction in infrastructure failures (p-value < 0.05).

Prioritize risk

Microsoft used the Six Sigma Risk Priority Number (RPN) index to prioritize risk. The higher the index, the higher the priority. Configuration deviations that cause severe business impact are categorized as priority 1 (P1) defects. The rest are categorized as priority 2 and 3, based on their potential impact to underlying services. Using the RPN calculation, Microsoft categorized all defects with a score of 200 or more as P1.
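As a rough illustration of the RPN index, the sketch below uses the conventional failure mode and effects analysis form, in which RPN is the product of severity, occurrence, and detection ratings, each on a 1 to 10 scale. Only the P1 cutoff of 200 comes from the text; the rating scales, the function names, and the example defects are assumptions, and the cutoffs for priority 2 and 3 are not stated, so they are left out.

```python
P1_THRESHOLD = 200  # from the text: an RPN of 200 or more is priority 1


def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Conventional FMEA Risk Priority Number: the product of three 1-10 ratings."""
    ratings = {"severity": severity, "occurrence": occurrence, "detection": detection}
    for name, value in ratings.items():
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return severity * occurrence * detection


def is_p1(score: int) -> bool:
    """True when a defect's RPN reaches the P1 cutoff."""
    return score >= P1_THRESHOLD


# Hypothetical defects with assumed (severity, occurrence, detection) ratings.
examples = {
    "unsupported firmware": (9, 5, 6),
    "stale monitoring agent": (4, 3, 5),
}

for name, ratings in examples.items():
    score = rpn(*ratings)
    label = "P1" if is_p1(score) else "lower priority"
    print(f"{name}: RPN {score} ({label})")
```

Because the index is a product, a defect that is hard to detect can reach P1 even when its likelihood is moderate, which is one reason the three ratings are scored separately.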

The following figure shows how risk criteria were prioritized.

Figure 4. Risk Prioritization

The RPN calculation took into account potential impact to the business, likelihood of failure, and the ability to detect the risk. Finalized standards with appropriate risk categorization were then ported to a SQL Server database, enabling a data collection engine to collectively analyze and assign defect priority throughout the infrastructure.

Apply Six Sigma

Once the business understood the areas of opportunity, the IT project team worked with the appropriate business teams to apply the Six Sigma DPMO framework. The DPMO framework was applied to P1 defects across the environment. The teams worked together to execute continuous defect eradication initiatives. As mentioned earlier, after defects began reporting up through the organization, 20 percent of defects were proactively remediated. After the business and IT teams started working together, another 40 percent of defects were resolved within a year.

The DPMO effort began in 2012, when a DPMO Proof of Concept was published. The DPMO components include:

- Defects. Deviations from approved standards or accepted configurations.
- Opportunities. Measurable service components that can deviate from agreed service levels and impact customer experience.

Once defects and opportunities are known, the DPMO ratio can be calculated. DPMO also can be used to calculate process efficiency and effectiveness. Once the defects, opportunities, and DPMO were measured through the data collection process, the team was ready to derive business insight and introduce a zero defects culture approach to IT infrastructure management.
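The DPMO ratio and its relationship to a sigma level can be sketched as follows. The formula is the standard one (defects divided by total opportunities, scaled to one million), and the sigma conversion uses the conventional 1.5 sigma shift; whether Microsoft IT applied the same shift is an assumption, and the server and opportunity counts in the example are invented for illustration.

```python
from statistics import NormalDist


def dpmo(defects: int, units: int, opportunities_per_unit: int) -> float:
    """Defects per million opportunities."""
    total_opportunities = units * opportunities_per_unit
    if total_opportunities <= 0:
        raise ValueError("there must be at least one opportunity")
    return defects / total_opportunities * 1_000_000


def sigma_level(dpmo_value: float, shift: float = 1.5) -> float:
    """Convert DPMO to a sigma level using the conventional 1.5 sigma shift."""
    if not 0 < dpmo_value < 1_000_000:
        raise ValueError("DPMO must be strictly between 0 and 1,000,000")
    defect_free_fraction = 1 - dpmo_value / 1_000_000
    return NormalDist().inv_cdf(defect_free_fraction) + shift


# Hypothetical example: 1,242 defects across 40,000 servers,
# each checked against 5 configuration standards.
example_dpmo = dpmo(defects=1_242, units=40_000, opportunities_per_unit=5)
print(f"DPMO: {example_dpmo:.0f}, sigma level: {sigma_level(example_dpmo):.2f}")
```

Under this convention, 2.5 Sigma corresponds to roughly 158,655 DPMO and 4.0 Sigma to roughly 6,210 DPMO, which gives a sense of the scale of the improvement reported for the program.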

The following figure depicts an overview of the defect reduction framework.

Figure 5. Framework overview

Benefits

By standardizing its server platform, and then measuring and eradicating defects in its IT infrastructure, Microsoft has created a more stable server platform, improved process efficacy, and reduced enterprise risk. Adopting a systematic, data-driven action framework that focuses on measurable and quantifiable results drove process improvements and real business impact. For example:

- Before the close of each quarter, 40 percent of P1 defects are eradicated on servers hosting revenue-generating applications, revenue-processing applications, or both.
- Publishing DPMO and Six Sigma RPN values in the CIO scorecard enabled the organization to improve the DPMO score by 20 percent, without intervention from a project team.
- During the first year of implementation, Microsoft improved the number of remediated defects by 40 percent, without investing in additional resources.
- Overall, mean time between sequential failures has gone from 18 days to 125 days.

A strong commitment to using statistical methods to make data-driven decisions helped Microsoft drive significant business impact. As of publication, Microsoft IT servers fail six times less often, compared to three years prior.

Microsoft IT successfully applied the Six Sigma framework to datacenter operations for the first time. By comparing server defects and ticket-to-asset ratios, the organization determined that servers with large numbers of defects are more likely to have incidents, and may cause corresponding downtime for hosting customers. Merging Six Sigma methodology with Problem Management practices has provided a continuous improvement framework that Microsoft uses to drive down defects on an ongoing basis.

Reliability improvements

In about 18 months, Microsoft IT servers reduced their ticket-to-asset ratio, compared to earlier servers with a comparable number of defects. For the consumers of the Microsoft IT hosting infrastructure, this means a platform with fewer issues overall. The following figures show 2014 data compared to 2015 data.

Figure 6. Correlation of risk to ticket volume per asset, 2014

Figure 7. Correlation of risk to ticket volume per asset, 2015

Availability increased

Defect reduction has directly affected availability. The figure below represents the behavior of the environment quarter over quarter after defect remediation projects were implemented in different areas of the infrastructure. In every quarter, the failure rate after remediation consistently decreased by over 50 percent, regardless of the number of servers remediated.

Figure 8. Failure to asset ratio over 90 days

After each project, Microsoft saw a substantial drop in the number of unexpected failures per server. Failures take the server down for a considerable amount of time, which directly affects application availability for customers and, in turn, overall end-to-end business processes. In extreme scenarios, failures can lead to revenue loss.

Every quarter, Microsoft IT performed a simple hypothesis test to show the statistical significance of the reduction in infrastructure failures. The test results below, performed on three data sets from three consecutive quarters, show that there is a significant difference in failure rates before and after the remediation efforts.

Figure 9. Change in failure rates before and after remediation

Portability

The defect eradication framework can be applied wherever a deviation can be measured. At time of publication, Microsoft IT is collaborating with other groups to adapt the framework for their use. For example, the Microsoft Azure product team is considering leveraging the framework to improve availability and reliability of the Azure Infrastructure as a Service (IaaS) offering. Microsoft IT is also partnering with the System Center product group to enable seamless integration with their product. Internally, Microsoft IT has begun using this framework to improve reliability of its network infrastructure and reduce infrastructure security risks.
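The quarterly hypothesis test on failure rates before and after remediation could take several forms, and the text does not say which one Microsoft IT used. The sketch below assumes a one-sided two-proportion z-test, treating each server as either having failed or not during a 90-day window; the function name and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(fail_before, n_before, fail_after, n_after):
    """One-sided test of H1: the failure rate before remediation is higher than after.

    Returns the z statistic and the p-value under the pooled null hypothesis.
    """
    rate_before = fail_before / n_before
    rate_after = fail_after / n_after
    pooled = (fail_before + fail_after) / (n_before + n_after)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if std_error == 0:
        raise ValueError("no variation in the data; the test is undefined")
    z_stat = (rate_before - rate_after) / std_error
    return z_stat, 1 - NormalDist().cdf(z_stat)


# Hypothetical quarter: 60 of 500 servers failed before remediation, 25 of 500 after.
z, p_value = two_proportion_z_test(60, 500, 25, 500)
print(f"z = {z:.2f}, p-value = {p_value:.5f}, significant at 0.05: {p_value < 0.05}")
```

A p-value below 0.05 in a test of this kind would support the claim that the drop in failures is unlikely to be due to chance alone, which is the threshold the case study cites.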

Conclusion

For the first time, Microsoft IT moved beyond applying Six Sigma to its datacenter processes. The organization successfully adopted the ITIL Problem Management discipline and Six Sigma methodologies to address operations defects in its hosting environment. Microsoft IT managers are now empowered with a defect eradication framework that results in minimal infrastructure risks and greater availability worldwide.

Efforts to deploy structured defect eradication projects across the Microsoft IT environment were very effective for three reasons: compliance with configuration standards was recognized as significant, the risk to the infrastructure was well understood across the business, and a robust and accurate set of data was readily available for consumption.

Microsoft can now save time for consumers of its hosting infrastructure environment by improving availability, customer productivity, and satisfaction. Microsoft IT hosting customers can confidently rely on a service that is ready and able to deliver when they need it.

For more information

Microsoft IT: www.microsoft.com/itshowcase

For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Order Centre at (800) 933-4750. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information via the web, go to: www.microsoft.com

© 2016 Microsoft Corporation. All rights reserved. Microsoft and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners. This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.