Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud

Transcription:

Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud Doug Burger Director, Hardware, Devices, & Experiences MSR NExT November 15, 2015

The Cloud is a Growing Disruptor for HPC: Moore's Law, homogeneity, economics, disruption

A 2-3 Horse Race

Hyperscale Cloud Fabrics [diagram: core switches (CS) feeding top-of-rack switches (ToR) in the datacenter network hierarchy]

Accelerator Constraints of the Cloud: homogeneity, efficiency (ASICs)

Catapult Project History:
- December 9, 2010: initial meeting
- Christmas break 2010: feasible to accelerate ranking?
- January 12, 2011: meeting with Bing leadership
- 2011 (v0): ported the then-current Bing ranking stack, built BFB board
- 2012 (v1): developed distributed architecture
- 2013: took v1 to scale, Bing pilot
- 2014 (v2): developed new architecture, commenced work with Azure
- 2015: mainstreamed into production and expansion; Intel announced the $16.7B Altera acquisition

Microsoft Open Compute Server: two 8-core Xeon 2.1 GHz CPUs, 64 GB DRAM, 4 HDDs, 2 SSDs, 10 Gb Ethernet, no cable attachments to the server

Catapult V1 Accelerator Card: Altera Stratix V D5 (172.6K ALMs, 2,014 M20Ks, 457 KLEs; 1 KLE == ~12K gates; an M20K is a 2.5 KB SRAM), PCIe Gen2 x8 (Gen3-capable), 8 GB DDR3, 20 Gb network among FPGAs

6x8 Torus in a 2x24 Server Layout
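The 6x8 torus gives each of the 48 FPGAs in a pod four directly wired neighbors, with wraparound links at the edges. A minimal sketch of the neighbor arithmetic, assuming a row-major coordinate assignment (the actual cable mapping onto the 2x24 server layout is not specified in the slide):

```python
# Hypothetical sketch of 6x8 torus wiring: each of 48 FPGAs gets (x, y)
# coordinates and links to four neighbors, wrapping at the edges.
# Row-major index-to-coordinate assignment is an assumption for
# illustration, not the documented Catapult cabling.

def torus_neighbors(idx, cols=8, rows=6):
    """Return the four wraparound neighbors of FPGA `idx` in a rows x cols torus."""
    x, y = idx % cols, idx // cols
    return [
        ((x + 1) % cols) + y * cols,   # east
        ((x - 1) % cols) + y * cols,   # west
        x + ((y + 1) % rows) * cols,   # south
        x + ((y - 1) % rows) * cols,   # north
    ]
```

Wraparound is what keeps the worst-case hop count low: every node has degree 4, so a failed FPGA can be routed around without isolating its neighbors.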

1,632-server pilot deployed in the production BN datacenter

Target: Accelerate Ranking as a Service. Selection-as-a-Service (SaaS): find all docs that contain the query terms, then filter and select candidate documents for ranking. Ranking-as-a-Service (RaaS): compute a relevance score for each selected doc, then sort the scores and return the results. [diagram: a query fans out to SaaS instances 1-48, whose selected documents feed RaaS instances 1-48 to produce the "10 blue links"]

FPGA Accelerator for Bing Ranking: 12-stage pipeline mapped across FPGAs 0-11. Software ranking stages shown: Query Augmentation, Query Understanding, Document Selection, Document Ranking, Caption Generation, Page Assembly. Accelerated stages: FE (Feature Extraction), ~4K document features, hand-coded Verilog; FFE (Free-Form Expressions), e.g. FFE #1 = (2*NumberOfOccurrences_0 + NumberOfOccurrences_1) (2*NumberOfTuples_0_1); MLS (Machine Learning Scoring), ~2K synthetic features scored by decision trees over thresholds T1-T3. Demonstrated ~2x throughput gain and stability, justifying production.
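The FFE above can be mirrored in software to show what the FPGA's expression engine computes. A minimal sketch, assuming features arrive as a name-keyed dictionary and assuming the operator between the two parenthesized terms (dropped in the transcript) was a subtraction; the feature names follow the slide:

```python
# Software mirror of a Free-Form Expression (FFE). On Catapult these run
# on soft expression-evaluation cores; here we just mirror the arithmetic.
# ASSUMPTION: the transcript drops the operator between the two terms of
# FFE #1; a subtraction is assumed for illustration.

def ffe_1(features):
    """FFE #1 ~= (2*NumberOfOccurrences_0 + NumberOfOccurrences_1)
                 - (2*NumberOfTuples_0_1)"""
    return (2 * features["NumberOfOccurrences_0"]
            + features["NumberOfOccurrences_1"]
            - 2 * features["NumberOfTuples_0_1"])
```

The appeal of FFEs is exactly that they stay this simple: arithmetic over already-extracted features, so ranking engineers can add new expressions without re-synthesizing the hand-coded feature-extraction logic.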

Pilot Results (FPGA vs. Software) [charts: average latency vs. throughput and 95th-percentile latency vs. throughput, HW vs. SW; the FPGA meets Bing's latency target at ~2x the software throughput]

Catapult V1 Shell Architecture [block diagram: PCIe Gen2 x8 core (Gen3-capable) with DMA engine (64 slots, 2 x 16 RAMs, 32 B wide, 64 KB per slot); two 4 GB SO-DIMM DDR3 cores; 256 Mb config NAND with remote system update (RSU); inter-FPGA router with four SLIII transceiver cores; JTAG/reconfig driver, status LEDs, and 12 V / 1.5 V / 0.85 V voltage regulators; role logic attaches via the local application I/O interface]

Production issues at scale:
- Build system: license servers, availability of source, build machines
- Scale-out qualification of IP
- Clean interfaces for a high-productivity development environment
- Shell/driver/application versioning and deployment; backwards compatibility
- Health monitoring and failure diagnostics: continuous reporting of interface health, soft error rate, etc.
- Debugging (esp. on livesite): Flight Data Recorder to replay the bug-generating condition
- System integrity testing across many servers/vendors
- Scalability of verification
- In situ updates to drivers, golden image, shell
- Supply chain management

Azure SmartNIC (announced at ONS):
- Use an FPGA for reconfigurable functions; FPGAs are already used in Bing (Catapult)
- Roll out hardware as we do software
- Programmed using Generic Flow Tables (GFT): a language for programming SDN to hardware, using connections and structured actions as primitives
- SmartNIC can also do crypto, QoS, storage acceleration, and more; 40 Gb bidirectional AES demo
[diagram: host with CPU and NIC ASIC plus FPGA, connected to the ToR switch]
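GFT's syntax is not shown in the slides, but its "connections and structured actions as primitives" model can be sketched as a match-action table keyed by a connection's 5-tuple. Everything below (the field names, the NAT-style rewrite action) is hypothetical illustration, not the real GFT API:

```python
# Hypothetical match-action sketch in the spirit of GFT: a connection is
# identified by its 5-tuple, and the cached entry lists the structured
# actions (e.g. a NAT-style header rewrite) applied to every packet of
# that connection. Field names and actions are illustrative assumptions.

flow_table = {}  # 5-tuple -> list of actions

def add_flow(five_tuple, actions):
    flow_table[five_tuple] = actions

def process(packet):
    """Apply the cached actions for this packet's connection, if any."""
    key = (packet["src_ip"], packet["dst_ip"],
           packet["src_port"], packet["dst_port"], packet["proto"])
    for action in flow_table.get(key, []):
        packet = action(packet)
    return packet

def rewrite_dst(new_ip):
    """Example structured action: rewrite the destination IP."""
    def action(packet):
        return {**packet, "dst_ip": new_ip}
    return action
```

The point of the model is that only the first packet of a connection needs the full (slow) SDN policy evaluation; every subsequent packet hits the cached per-connection actions, which is what the FPGA accelerates.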

FPGAs versus GPUs (arrows indicate projected trajectories):

|             | CPUs            | GPUs              | FPGAs                           |
|-------------|-----------------|-------------------|---------------------------------|
| Language    | C/C++           | CUDA              | Verilog -> OpenCL (?)           |
| Performance | 400 Gflops      | 6 Tflops -> 10T   | 100G -> 1T -> 4T                |
| Efficiency  | 5 Gflops/W      | 20 Gflops/W       | 40-50 G/W -> 80-100 G/W         |
| Scale       | 2M+ and growing | 1s -> 10s -> 100s | 10Ks -> 100Ks -> 1M+            |
| DRAM BW     | 85 GB/s         | 2x240 GB/s        | 10GB/s -> 20GB/s -> 200-500GB/s |

Large-Scale Reconfigurable Computing for HPC [diagram: a CS/ToR fabric of FPGA-equipped servers running mixed workloads: Deep Learning, HPC / MPI Offload, Bing Ranking HW, Deep Compression, Bing Ranking SW]

Conclusions:
- We are at the dawn of a new era, with programmable logic playing a central role in systems at massive scale
- A new kind of computer
- Will enable new applications and services to be cost-effective
- Will change system architecture, both in the server and at cloud scale