Today's Data Centers. How can we improve efficiencies?

Transcription:

Today's Data Centers
- O(100K) servers per data center
- Tens of megawatts, difficult to power and cool
- Very noisy
- Security taken very seriously
- Incrementally upgraded: 3-year server depreciation, upgraded quarterly
- Applications change very rapidly (weekly, monthly)
- Many advantages, including economies of scale, data all in one place, etc.
- Very high volumes mean every efficiency improvement counts
- At data center scale, you don't need an order-of-magnitude improvement to make sense
- Positive ROI is easier to achieve at large scale
How can we improve efficiencies?
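To make the "every efficiency improvement counts at scale" point concrete, here is a back-of-the-envelope sketch; all numbers other than the O(100K) server count are assumptions chosen only for illustration, not figures from the talk.

```python
# Back-of-the-envelope ROI at data center scale. All numbers except the
# O(100K) server count are illustrative assumptions, not figures from the talk.

servers = 100_000          # O(100K) servers per data center (from the slide)
power_per_server_w = 200   # assumed average draw per server
price_per_kwh = 0.07       # assumed energy price, USD
hours_per_year = 24 * 365

efficiency_gain = 0.05     # assume a modest 5% efficiency improvement

baseline_kwh = servers * power_per_server_w / 1000 * hours_per_year
annual_savings = baseline_kwh * price_per_kwh * efficiency_gain

print(f"Baseline energy: {baseline_kwh:,.0f} kWh/year")
print(f"Annual savings from a 5% improvement: ${annual_savings:,.0f}")
```

Even with these modest assumptions the savings land in the hundreds of thousands of dollars per year per data center, which is why incremental gains that would not pay off for a single machine are worth engineering effort at this scale.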

Microsoft Cloud Services

Efficiency via Specialization: FPGAs, ASICs (Source: Bob Brodersen, Berkeley Wireless group)

Original Design Requirements

Version 1: Designed for Bing

Configuration?

~2013 Microsoft Open Compute Server
- Two 8-core Xeon 2.1 GHz CPUs
- 64 GB DRAM
- 4 HDDs, 2 SSDs
- 10 Gb Ethernet

Catapult V1 Accelerator Card
- Altera Stratix V D5 (2011 part)
- Logic: ~5.5M ASIC gates + registers
- 2,014 20Kb memories: 2 x 40b @ 200 MHz = 4 TB/sec
- PCIe Gen 2 x8, 8 GB DDR3
(Diagram: Stratix V FPGA with 8 GB DDR3 and a PCIe Gen3 x8 connection)
What if a single FPGA is too small?
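A quick sanity check of the 4 TB/sec figure, reading the slide's "2014 20Kb memories" as 2,014 on-chip block RAMs, each with two 40-bit ports clocked at 200 MHz:

```python
# Sanity-check of the on-chip memory bandwidth figure on the slide:
# 2,014 20Kb block RAMs, each with two 40-bit ports, clocked at 200 MHz.

num_block_rams = 2014
ports_per_ram = 2
bits_per_port = 40
clock_hz = 200e6

bits_per_second = num_block_rams * ports_per_ram * bits_per_port * clock_hz
terabytes_per_second = bits_per_second / 8 / 1e12

print(f"Aggregate internal bandwidth: {terabytes_per_second:.1f} TB/s")  # ~4.0 TB/s
```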

Scalable Reconfigurable Fabric
- 1 FPGA board per server
- 48 servers per ½ rack
- 6x8 torus network among the FPGAs
- 12.5 Gb/s over SAS SFF-8088 cables
(Diagram: data center server, 1U, ½ width)
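A minimal sketch of what a 6x8 torus means for each board: every FPGA has exactly four directly cabled neighbors, with wraparound links at the edges of the half-rack. The function and coordinate scheme below are illustrative, not Catapult's actual addressing.

```python
# Minimal sketch of neighbor addressing in the 6x8 torus described on the
# slide: 48 FPGAs per half-rack, each linked to its north/south/east/west
# neighbor, with wraparound at the edges. Names and layout are illustrative.

ROWS, COLS = 6, 8  # 6x8 torus -> 48 FPGAs, one per server

def neighbors(row: int, col: int) -> dict:
    """Return the torus coordinates of the four directly connected FPGAs."""
    return {
        "north": ((row - 1) % ROWS, col),
        "south": ((row + 1) % ROWS, col),
        "west":  (row, (col - 1) % COLS),
        "east":  (row, (col + 1) % COLS),
    }

# A corner FPGA still has four neighbors thanks to the wraparound links.
print(neighbors(0, 0))
# {'north': (5, 0), 'south': (1, 0), 'west': (0, 7), 'east': (0, 1)}
```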

Version 2: Designed for all Microsoft

In the data center, sameness == goodness
- Not possible everywhere, e.g., GPU SKUs
- Already have enough difference (divergence) from new components; e.g., Intel processors change every 1-2 years
Economies of scale: divergence is very costly
- Stock spares for each type of machine
- Many parts already end-of-lifed
- Keeping track of what needs to be purchased
- Reduced volumes == increased pricing
V1 was designed for Bing; could others use V1?
Can we design a system that is useful for all data center users?

FPGAs to Azure
- At the time, Microsoft was moving to a single SKU, so Azure needed to adopt FPGAs as well
- Azure networking saw a need for network crypto and software-defined networking acceleration
- But could you do it with our V1 architecture? It was designed without network modifications because Azure was not yet on board
- Developed a new architecture to accommodate all known uses

Converged Bing/Azure Architecture: Catapult V2 mezzanine card on a WCS 2.0 server blade
(Diagram: two CPUs with DRAM connected by QPI; the Catapult V2 FPGA attaches over PCIe Gen3 2x8 and sits between the NIC (PCIe Gen3 x8) and the 40 Gb/s QSFP link to the switch; also labeled: WCS Gen4.1 blade with Mellanox NIC and Catapult FPGA, Pikes Peak option card, mezzanine connectors, WCS tray backplane)
Completely flexible architecture:
1. Can act as a local compute accelerator
2. Can act as a network/storage accelerator
3. Can act as a remote compute accelerator

Network Connectivity (IP)

How Should We View Our Data Center Servers? Depends on your point of view!

http://www.vectorworldmap.com/vectormaps/vector-world-map-v2.1.jpg

http://texina.net/world_map_chinese.jpg

A Texan's View of the US (from Cardcow.com)

http://pandawhale.com/post/15302/california-for-beginners

Classic View of a Computer (diagram: CPU with DRAM, network, and storage)

Networking View of a Computer (diagram: a network offload accelerator sits between the CPU and the network; DRAM and storage attach to the CPU)

Offload Accelerator View of a Server (diagram: several accelerators attach to the CPU, which reaches the network through a NIC; DRAM and storage attach to the CPU)

Our View of a Data Center Computer (diagram: CPUs with DRAM, with the FPGA accelerator placed between the server and the network; storage attaches to the CPU)

Benefits
- Software receives packets slowly (interrupt or polling), then must parse the packet and start the right work
- The FPGA processes every packet anyway: packet arrival is an event the FPGA handles
- Identify FPGA work; pass CPU work to the CPU
- Map common-case work to the FPGA, so the processor never sees the packet
- Can read/modify system memory to keep application state consistent
- The CPU is a complexity offload engine for the FPGA!
- Many possibilities: distributed machine learning, software-defined networking, memcached get
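A toy sketch of the dispatch idea described above: because the FPGA already sees every packet, it can serve the common case itself and enqueue only the complex remainder for the CPU. This is not Catapult code; the operation names and handler signatures are hypothetical.

```python
# Illustrative sketch (not Catapult code) of the dispatch idea on the slide:
# the FPGA serves the common case itself and hands only the complex remainder
# to the CPU. Operation names and handler signatures are hypothetical.

COMMON_OPS = {"memcached_get", "feature_extraction", "sdn_rewrite"}

def on_packet_arrival(packet: dict, fpga_handle, cpu_queue: list) -> None:
    """Called for every arriving packet; packet arrival is the event."""
    op = packet.get("op")
    if op in COMMON_OPS:
        fpga_handle(packet)          # common-case work stays on the FPGA
    else:
        cpu_queue.append(packet)     # CPU acts as the "complexity offload engine"

cpu_work = []
on_packet_arrival({"op": "memcached_get", "key": "k1"}, print, cpu_work)
on_packet_arrival({"op": "rare_admin_request"}, print, cpu_work)
print(cpu_work)   # only the uncommon request reaches the CPU
```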

Case 1: Use as a local accelerator (Bing Ranking as a Service)

Bing Document Ranking Flow
(Diagram: a query fans out to Selection-as-a-Service machines, the selected documents go to Ranking-as-a-Service machines, and the "10 blue links" are returned; each service runs across many IFMs, labeled IFM 1 through IFM 48 in the figure)
Selection-as-a-Service (SaaS):
- Find all docs that contain the query terms
- Filter and select candidate documents for ranking
Ranking-as-a-Service (RaaS):
- Compute scores for how relevant each selected document is to the search query
- Sort the scores and return the results
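A toy end-to-end sketch of this two-stage flow; the corpus, scoring function, and cut-off are placeholders, not Bing internals.

```python
# Toy sketch of the two-stage flow on the slide: Selection-as-a-Service picks
# candidate documents containing the query terms, Ranking-as-a-Service scores
# and sorts them. Corpus, scorer, and cut-off are placeholders.

CORPUS = {
    "doc1": "fpga acceleration in the data center",
    "doc2": "cooking with cast iron",
    "doc3": "fpga based ranking acceleration",
}

def selection_service(query: str) -> list[str]:
    terms = query.split()
    return [doc for doc, text in CORPUS.items()
            if all(term in text for term in terms)]

def ranking_service(query: str, candidates: list[str], top_k: int = 10):
    def score(doc: str) -> int:               # stand-in for the real ranker
        return sum(CORPUS[doc].count(t) for t in query.split())
    return sorted(((d, score(d)) for d in candidates),
                  key=lambda x: x[1], reverse=True)[:top_k]

candidates = selection_service("fpga acceleration")
print(ranking_service("fpga acceleration", candidates))
```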

FE: Feature Extraction
(Diagram: a {Query, Document} pair, e.g., Query: "FPGA Configuration", enters the feature-extraction stage, which emits dynamic features such as NumberOfOccurrences_0 = 7, NumberOfOccurrences_1 = 4, NumberOfTuples_0_1 = 1; ~4K dynamic features and ~2K synthetic features feed the L2 model, which produces the score)
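A small sketch of the kind of dynamic features named on the slide, computed for a {Query, Document} pair. The real pipeline produces ~4K dynamic and ~2K synthetic features; this toy version computes only the ones shown in the figure, and the function name is illustrative.

```python
# Toy feature extraction for a {Query, Document} pair, producing features of
# the kind named on the slide (per-term occurrence counts, one co-occurrence
# tuple). The real pipeline computes thousands of such features.

def extract_features(query: str, document: str) -> dict:
    terms = query.lower().split()
    words = document.lower().split()
    features = {}
    for i, term in enumerate(terms):
        features[f"NumberOfOccurrences_{i}"] = words.count(term)
    # A simple "tuple" feature: adjacent occurrences of term 0 followed by term 1.
    if len(terms) >= 2:
        features["NumberOfTuples_0_1"] = sum(
            1 for a, b in zip(words, words[1:]) if a == terms[0] and b == terms[1]
        )
    return features

print(extract_features("fpga configuration",
                       "FPGA configuration tools reconfigure the FPGA fabric"))
# {'NumberOfOccurrences_0': 2, 'NumberOfOccurrences_1': 1, 'NumberOfTuples_0_1': 1}
```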

HW vs. SW Latency and Load: Bing Production Results
(Plot: 99.9th-percentile query latency versus queries/sec, comparing software and FPGA; curves shown for 99.9% software latency, 99.9% FPGA latency, average software load, and average FPGA query load)

Case 2: Use as a remote accelerator

Feature Extraction: FPGA faster than needed
- A single feature-extraction FPGA is much faster than a single server can use
- Wasted capacity and/or wasted FPGA resources
- Two choices: somehow reduce performance and save FPGA resources, or allow multiple servers to use a single FPGA
- Use the network to transfer requests and return responses
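The capacity-matching argument can be made concrete with a quick calculation; the throughput numbers below are made-up assumptions purely to show the reasoning, not measured Catapult figures.

```python
# Back-of-the-envelope capacity matching for the slide's point that one
# feature-extraction FPGA is much faster than one server needs. The numbers
# are assumptions chosen only to illustrate the reasoning.

fpga_docs_per_sec = 200_000           # assumed FPGA feature-extraction throughput
server_demand_docs_per_sec = 25_000   # assumed per-server demand

servers_per_fpga = fpga_docs_per_sec // server_demand_docs_per_sec
print(f"One FPGA could serve roughly {servers_per_fpga} servers "
      f"if requests are shipped over the network.")
```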

Inter-FPGA Communication
- FPGAs can encapsulate their own UDP packets
- Low-latency inter-FPGA communication (LTL)
- Can provide strong network primitives
- But this topology opens up other opportunities
(Diagram: FPGAs sit in front of the servers' NICs and reach each other through the ToR switches and the L0/L1/L2 data center network, with cluster switches CS0-CS3 and spine switches SP0-SP3)
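To make "FPGAs can encapsulate their own UDP packets" concrete, here is a software sketch of building and sending such a message. The header layout, field names, and addresses are invented for the example; this is not the real LTL wire format.

```python
# Illustrative only: the slide's point is that FPGAs can build and send their
# own UDP packets for low-latency inter-FPGA messaging (LTL). The header
# layout below is invented for the example; it is not the real LTL format.

import socket
import struct

def build_ltl_like_message(dest_fpga_id: int, seq: int, payload: bytes) -> bytes:
    # Hypothetical 8-byte header: destination FPGA id, sequence number, padding.
    return struct.pack("!IH2x", dest_fpga_id, seq) + payload

def send_to_fpga(host: str, port: int, message: bytes) -> None:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, (host, port))

msg = build_ltl_like_message(dest_fpga_id=42, seq=7, payload=b"feature request")
# send_to_fpga("10.0.0.42", 5000, msg)   # address and port are placeholders
print(len(msg), "bytes on the wire (header + payload)")
```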

Hardware Acceleration as a Service: Across the Data Center (or even across the Internet)
(Diagram: FPGAs reached through ToR and cluster switches serve workloads such as HPC, speech-to-text, Bing Ranking HW, large-scale deep learning, and Bing Ranking SW)

Current Bing Acceleration: DNNs

BrainWave: Scaling FPGAs to Ultra-Large DNN Models
- Use FPGAs to implement deep neural network evaluation (inference)
- Map model weights to internal FPGA memories: huge amounts of bandwidth
- Since FPGA memories are limited, distribute models across as many FPGAs as needed
- Use HaaS to manage multi-FPGA execution, LTL to communicate
- Designed for a batch size of 1; GPUs and Google's TPU are designed for larger batch sizes, which increases queuing delay or decreases efficiency
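A minimal sketch of the partitioning idea: if a model's weights exceed one FPGA's on-chip memory, split whole layers across as many FPGAs as needed. The capacity figure and the greedy split are assumptions for illustration, not BrainWave's actual allocator.

```python
# Minimal sketch of distributing model weights across FPGAs when they do not
# fit in one device's on-chip memory. Capacity and the greedy split are
# assumptions for illustration, not BrainWave's actual allocator.

FPGA_ONCHIP_MB = 30   # assumed usable on-chip SRAM per FPGA, in MB

def partition_weights(layer_sizes_mb, capacity_mb=FPGA_ONCHIP_MB):
    """Greedily assign whole layers to FPGAs until each FPGA's SRAM is full."""
    fpgas, current, used = [], [], 0.0
    for layer, size in enumerate(layer_sizes_mb):
        if used + size > capacity_mb and current:
            fpgas.append(current)
            current, used = [], 0.0
        current.append(layer)
        used += size
    if current:
        fpgas.append(current)
    return fpgas

# A hypothetical model whose layers total ~100 MB of weights:
layers_mb = [12, 18, 25, 25, 20]
print(partition_weights(layers_mb))   # [[0, 1], [2], [3], [4]] -> 4 FPGAs
```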

Case 3: Use as an infrastructure accelerator

FPGA SmartNIC for Cloud Networking
- Azure runs software-defined networking on the hosts: Software Load Balancer, Virtual Networks, new features each month
- Before, we relied on ASICs to scale and to be COGS-competitive at 40G+
- A 12-to-18-month ASIC cycle plus the time to roll out new hardware is too slow to keep up with SDN
- The SmartNIC gives us the agility of SDN with the speed and COGS of hardware
- The base SmartNIC provides common functions like crypto, GFT, QoS, and RDMA on all hosts
(Diagram: VFP in the VMSwitch applies per-layer rule/action tables (SLB decap, SLB NAT, VNET, ACL, metering) with actions such as decap, DNAT, rewrite, allow, and meter, via a rewrite/transposition engine; the 50G SmartNIC offloads GFT flow/action processing, e.g., flow 1.2.3.1->1.3.4.1, 62362->80 maps to Decap, DNAT, Rewrite, Meter, alongside crypto, RDMA, and QoS, with SR-IOV host bypass for the VM)
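A toy match/action table in the spirit of the slide's GFT example: the first packet of a flow goes through the software (VFP) layers to compute its actions, which are then cached so subsequent packets of the flow are handled directly. Field names and the action set are simplified for illustration.

```python
# Toy GFT-style flow table: match a flow's 4-tuple and apply its cached
# action list; on a miss, the software path computes and installs the actions.
# Field names and the action set are simplified for illustration.

flow_table = {
    # (src_ip, dst_ip, src_port, dst_port) -> ordered action list
    ("1.2.3.1", "1.3.4.1", 62362, 80): ["decap", "dnat", "rewrite", "meter"],
}

def process_packet(pkt: dict) -> list:
    key = (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"])
    actions = flow_table.get(key)
    if actions is None:
        # Exception path: software (VFP) evaluates the SLB/VNET/ACL layers,
        # then installs the resulting actions so hardware handles the rest.
        actions = ["decap", "dnat", "rewrite", "meter"]  # placeholder result
        flow_table[key] = actions
    return actions

print(process_packet({"src_ip": "1.2.3.1", "dst_ip": "1.3.4.1",
                      "src_port": 62362, "dst_port": 80}))
```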

Azure Accelerated Networking
(Diagram: VM guest OS and hypervisor with VFP; the VM's NIC path goes through GFT on the FPGA)
- With SR-IOV turned on, the VM accesses the NIC hardware directly and sends messages with no OS/hypervisor call
- The FPGA determines the flow of each packet and rewrites the header to make it data center compatible
- Reduces latency to roughly bare metal
- Azure now has the fastest public cloud network: 25 Gb/s at 25 µs latency
- Fast crypto developed

Catapult Academic Program
- Jointly funded by Intel and Microsoft; some system administration/tools servers funded by NSF under FAbRIC
- Hosted in the UTexas supercomputer facility under the FAbRIC project
- We provide a PCIe device driver, the shell (initially compiled/encrypted; discussing source access under NDA with lawyers), and "Hello, world!" programs
- Individual V1 boards sent to you
- Remote access to 6*48 servers, each with a V1 board
- Accessible with a one-page proposal to catapult@microsoft.com
- See https://aka.ms/catapult-academic for details

Conclusion: Hardware Specialization on Homogeneous Machines
- FPGAs are being deployed for all new Azure and Bing machines
- Many other properties as well
- Ability to reprogram a data center's hardware: converts homogeneous machines into specialized SKUs dynamically
- The same hardware supports DNNs, Bing features, and Azure networking
- Hyperscale performance with low-latency communication: exa-ops of performance with an O(10 µs) diameter