Hardware design for Cloud-scale datacenters USENIX LISA14



Public Cloud
Disaster Recovery / Business Continuity

5.8+ billion worldwide queries each month
250+ million active users
400+ million active accounts
2.4+ million emails per day
8.6+ trillion objects in Microsoft Azure storage
48+ million users in 41 markets
50+ million active users
1 in 4 enterprise customers
50+ billion minutes of connections handled each month
200+ Cloud Services
1+ billion customers
20+ million businesses
90+ markets worldwide

Design                     | <10K SMB/Enterprise      | 100K Hosters                                      | 1M Cloud-Scale
# SKUs                     | Several                  | Limited                                           | Extremely limited
Redundancy model           | Hardware based (Hot-*)   | Software based (Local datacenter)                 | Software based (Geo-distributed)
HW availability            | 99.999% or higher        | 99.9% - 99.999%                                   | 99% - 99.9%
HW type                    | Enterprise SKU           | Off-the-shelf design, custom integration          | Custom designs, custom integration
Infrastructure co-design   | None                     | Limited integration with Datacenter and Network   | OS, Datacenter, Server and Network tightly integrated
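The HW availability figures in the table are easier to compare when converted to expected downtime. The following is a minimal sketch of that arithmetic, assuming a 365-day year; the function name is illustrative and not part of the talk.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def downtime_hours_per_year(availability_percent: float) -> float:
    """Expected unavailable hours per year for a given availability percentage."""
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


# The three availability levels that appear in the table.
for pct in (99.0, 99.9, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% available: about {hours:.2f} hours down per year")
```

At 99% a server may be unavailable for roughly 87.6 hours a year, which is why the cloud-scale column pairs lower hardware availability with software-based, geo-distributed redundancy.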

Operations                 | <10K SMB/Enterprise              | 100K Hosters                     | 1M Cloud-Scale
Break/fix support          | 24 hours x 7 days                | 8 hours x 5 days                 | Up to 1-2 weeks
Issue triage model         | IT admin                         | Some automation, Admin support   | Fully automated, Machine learning
OOB HW management          | Full command set, BMC required   | Basic feature set, BMC required  | Power On/Off only, No BMC
Management domain scale    | 100's of servers                 | 1000's of servers                | 10's of 1000's of servers
FRU granularity            | Hot-swappable components         | Component replacement            | Entire server replacement

[Diagram: two Storage Stamps, each containing a Partition Layer and a Stream Layer with Intra-Stamp Replication]

[Diagram labels: Commit operation, Write, Erasure Coding operations]

Query distribution
[Diagram: index units 1 to n; index unit i holds partitions i1 to im (Partition 11 through Partition nm)]
Source: Web search using mobile cores, ISCA 2010
Query performance is measured as an aggregate of ALL compute nodes
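One way to read the point about aggregate performance is as a scatter-gather pattern: a query fans out to every partition and the partial results are merged, so the slowest partition bounds the response time. The sketch below illustrates that reading only; the in-memory index, function names, and latency model are assumptions and do not come from the talk or the cited paper.

```python
from concurrent.futures import ThreadPoolExecutor


def search_partition(partition: dict, term: str) -> list:
    """Return (doc_id, score) hits from one partition (hypothetical in-memory index)."""
    return list(partition.get(term, []))


def distributed_query(partitions: list, term: str, top_k: int = 3) -> list:
    """Fan the query out to every partition, then merge and rank the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(lambda p: search_partition(p, term), partitions))
    merged = [hit for partial in partials for hit in partial]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)[:top_k]


def query_latency_ms(partition_latencies_ms: list) -> float:
    """With a parallel fan-out, the answer is ready only when the slowest partition replies."""
    return max(partition_latencies_ms)


# Three toy partitions; the third holds no documents for the term.
partitions = [
    {"cloud": [("a", 0.9)]},
    {"cloud": [("b", 0.5), ("c", 0.95)]},
    {},
]
print(distributed_query(partitions, "cloud", top_k=2))
print(query_latency_ms([12.0, 15.0, 90.0]))
```

Under this model a single slow node delays every query that touches it, which is consistent with measuring performance across all compute nodes rather than per server.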


Performance, Customization, Uniformity, Power, Agility, Cost, Reliability, Simplicity

Architecture should adapt to a variety of cloud workloads
Support for global datacenter operating environments, with two standards groups listed: (CISPR, ANSI, IEC) and (UL, IEC, CSA)

Design Principles
Standardization & Modularization
Design Simplicity
Operations Excellence

Open CloudServer (OCS) design
Open Source Code: Chassis management, Operations Toolkit
Specifications: Chassis, Blade, Mezzanines, Management APIs, Certification Requirements
Mechanical CAD Models: Chassis, Blade, Mezzanines
Board Files & Gerbers: Power Distribution Backplane, Tray Backplane
http://www.opencompute.org/wiki/server/specsanddesigns

12U Shared Chassis
EIA Rack Mountable
Shared infrastructure for efficiency and TCO optimization
[Diagram labels: Shared management, Shared power, Signal backplane, Compute blade, Shared fans, JBOD expansion]

Blind-mated signal connectivity
Simplified installation and repair
Cable-free design for significantly fewer operator errors during servicing
Reduces need for cabling reseats
Signal Backplane: blind-mated connectors (12V Power, Ethernet, SAS, Management)
[Chart: Network Repairs; categories H/W Replaced and Reseated; values 75% and 25%]

HDDs are #1 failure item
AFR increases with temperature [1]
Simplified fan control cools HDDs
HDDs in front of hot motherboard
Closed loop fan moderates temperatures
[1] DSN 2011: Impact of Temperature on Hard Disk Drive Reliability in Large Datacenters
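Closed-loop fan control can be pictured as a feedback loop: read the drive temperatures, compare the hottest one with a target, and adjust the fan duty cycle accordingly. The following is a minimal proportional-control sketch; the target temperature, gain, and duty limits are illustrative assumptions, not values from the OCS design.

```python
def fan_duty_percent(hdd_temps_c, target_c=40.0, gain=5.0,
                     min_duty=20.0, max_duty=100.0):
    """Map the hottest HDD reading to a PWM duty cycle (all constants are hypothetical)."""
    error = max(hdd_temps_c) - target_c
    # Stay at the minimum duty below the target; ramp up linearly above it.
    duty = min_duty + gain * max(error, 0.0)
    return min(max_duty, duty)


# Each control cycle re-reads the sensors, so the fan speed follows the temperature.
for readings in ([35.0, 36.5], [41.0, 44.0], [52.0, 70.0]):
    print(readings, "->", fan_duty_percent(readings), "% PWM")
```

Driving the loop from the hottest drive rather than the average keeps every HDD under the target, at the cost of somewhat higher fan power.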

Chassis Manager (CM)
Secure OOB management
Low-cost embedded x86 SoC
REST API for machine management
CLI interface for human operations
Hard-wired management
On/Off to blade power cut-off circuit
IPMI-over-serial out of band communication
Fan and PSU control and monitoring
Remote switch and CM power control
[Block diagram labels: x86 SoC with 1GbE links and serial ports COM1 to COM6; CPLD; Serial Multiplexer (x2); RS232 serial TX/RX to/from blades; Blade Address; Blade Enable ON/OFF; Serial to I2C (x2); I2C Mux; PMBUS; 6 PSU; PWM and GPIO fan control; fan groups labeled 6, 6, 6; Remote Power Control; PDB ON/OFF]

Security at all layers: Hardware, UEFI, APIs, User Management
Trusted Platform Module v1.2: Blades and Chassis manager
UEFI Firmware v2.3.2: Secure BIOS and Boot
Chassis manager interfaces: TLS (SSL) and IPsec for communication encryption
User Management: Active Directory integration and authentication, Role Based Management

BMC-Lite
IPMI basic mode over Serial
I2C Master (SDR)
UART I/O
System Event Log
Power Control
Also labeled on the slide: KVM, Video drivers; Ethernet, Network Stack or SOL; USB; Full IPMI Command Set
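IPMI messages carry 8-bit two's-complement checksums, so even a reduced BMC speaking IPMI over a serial link has to compute and verify them. The sketch below shows only that checksum; the helper names are hypothetical, and serial framing and the rest of the message layout are deliberately left out.

```python
def ipmi_checksum(data: bytes) -> int:
    """Two's-complement checksum: the byte that makes the total sum 0 modulo 256."""
    return (-sum(data)) & 0xFF


def checksum_ok(data_with_checksum: bytes) -> bool:
    """A block is valid when its bytes, checksum included, sum to 0 modulo 256."""
    return (sum(data_with_checksum) & 0xFF) == 0


# Example header bytes chosen for illustration: 0x20 and 0x18.
header = bytes([0x20, 0x18])
checksum = ipmi_checksum(header)
print(hex(checksum))
print(checksum_ok(header + bytes([checksum])))
```

A receiver can validate a block with a single running sum and no table lookups, which suits a small management controller.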

Targeted for deployment and production support
Features
http://github.com/msopentech/ocsoperationstoolkit

Identify defective components by physical location
Summarize data for quick repairs

View configuration command: View-WcsConfig

View-Disk, View-Dimm, View-Nic, View-Fru, etc.

Check, clear, and log the Windows System Event Log and BMC SEL
View contents of BMC SEL

Commands
Example: Update-WcsConfig command


Q & A