Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud

Doug Burger, Director, Hardware, Devices, & Experiences, MSR NExT
November 15, 2015

The Cloud is a Growing Disruptor for HPC: Moore's Law, Homogeneity, Economics, Disruption

A 2-3 Horse Race

Hyperscale Cloud Fabrics
[Diagram: network fabric with CS switches above ToR switches]

Accelerator Constraints of the Cloud: Homogeneity, Efficiency (ASICs)

Catapult Project History
- December 9, 2010: initial meeting
- Christmas break 2010: feasible to accelerate ranking?
- January 12, 2011: meeting with Bing leadership
- 2011 (v0): ported the then-current Bing ranking stack, built BFB board
- 2012 (v1): developed distributed architecture
- 2013: took v1 to scale, Bing pilot
- 2014 (v2): developed new architecture, commenced work with Azure
- 2015: mainstreamed, production and expansion
- Intel announced Altera acquisition, $16.7B

Microsoft Open Compute Server
- Two 8-core Xeon 2.1 GHz CPUs
- 64 GB DRAM
- 4 HDDs, 2 SSDs
- 10 Gb Ethernet
- No cable attachments to server
Microsoft Confidential

Catapult V1 Accelerator Card
- Altera Stratix V D5
  - 172.6K ALMs, 2,014 M20Ks
  - 457 KLEs; 1 KLE == ~12K gates
  - M20K is a 2.5 KB SRAM
- PCIe Gen 2 x8, 8 GB DDR3
- 20 Gb network among FPGAs
[Board photo labels: 8GB DDR3, Stratix V, PCIe Gen3 x8]
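
The per-block figures on this slide can be multiplied out to get rough chip totals. A minimal sketch using only the numbers quoted above; the function and constant names are made up for illustration, and the gate count relies on the slide's own rule of thumb:

```python
# Figures quoted on the slide for the Altera Stratix V D5.
M20K_BLOCKS = 2014        # number of M20K block RAMs
M20K_BYTES = 2560         # each M20K is a 2.5 KB SRAM (2.5 * 1024 bytes)
KLES = 457                # thousands of logic elements
GATES_PER_KLE = 12_000    # slide's rule of thumb: 1 KLE == ~12K gates


def on_chip_sram_bytes() -> int:
    """Total M20K capacity in bytes."""
    return M20K_BLOCKS * M20K_BYTES


def approx_gate_count() -> int:
    """Rough gate-equivalent count from the KLE rule of thumb."""
    return KLES * GATES_PER_KLE


if __name__ == "__main__":
    print(f"on-chip SRAM: {on_chip_sram_bytes() / 2**20:.2f} MiB")
    print(f"approx. gates: {approx_gate_count() / 1e6:.2f} M")
```

That works out to a little under 5 MiB of block RAM and roughly 5.5 million gate equivalents per FPGA.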

6x8 Torus in a 2x24 Server Layout
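
In a 6x8 torus each of the 48 FPGAs has four neighbors, with links wrapping around at the edges. A minimal sketch of that neighbor calculation; the row-major numbering and the direction names are assumptions for illustration, and the slide does not say how torus coordinates map onto the 2x24 physical server layout:

```python
ROWS, COLS = 6, 8  # 6x8 torus: 48 FPGAs in total


def node_id(row: int, col: int) -> int:
    """Row-major node number (an assumed numbering, not from the slide)."""
    return row * COLS + col


def torus_neighbors(row: int, col: int) -> dict:
    """Return the four neighbors of (row, col), wrapping at the edges."""
    return {
        "north": ((row - 1) % ROWS, col),
        "south": ((row + 1) % ROWS, col),
        "west": (row, (col - 1) % COLS),
        "east": (row, (col + 1) % COLS),
    }


if __name__ == "__main__":
    # A corner node still has four neighbors thanks to the wraparound links.
    for direction, (r, c) in torus_neighbors(0, 0).items():
        print(direction, (r, c), "-> node", node_id(r, c))
```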

1,632-server pilot deployed in production BN datacenter

Target: Accelerate Ranking as a Service
[Diagram: Query -> SaaS 1 ... SaaS 48 -> Selected Documents -> RaaS 1 ... RaaS 48 -> 10 blue links]
Selection-as-a-Service (SaaS)
- Find all docs that contain query terms
- Filter and select candidate documents for ranking
Ranking-as-a-Service (RaaS)
- Compute relevance scores for each selected doc
- Sort the scores and return the results
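
The two-phase flow on this slide (select candidates, then score and sort them) can be sketched in a few lines. This is a toy illustration only: the in-memory index, the function names, and the term-count scoring function are all assumptions, not Bing's actual selection or ranking logic.

```python
from typing import Callable, Dict, List, Tuple


def select(query_terms: List[str], index: Dict[str, str]) -> List[str]:
    """Selection: keep the documents that contain every query term."""
    selected = []
    for doc_id, text in index.items():
        words = set(text.lower().split())
        if all(term.lower() in words for term in query_terms):
            selected.append(doc_id)
    return selected


def term_count_score(query_terms: List[str], text: str) -> float:
    """Placeholder relevance score: total occurrences of the query terms."""
    words = text.lower().split()
    return float(sum(words.count(term.lower()) for term in query_terms))


def rank(
    query_terms: List[str],
    doc_ids: List[str],
    index: Dict[str, str],
    score: Callable[[List[str], str], float] = term_count_score,
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Ranking: score each selected doc, sort descending, return the top results."""
    scored = [(doc_id, score(query_terms, index[doc_id])) for doc_id in doc_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    docs = {"a": "fpga fabric fpga", "b": "gpu fabric", "c": "fpga fabric"}
    query = ["fpga", "fabric"]
    print(rank(query, select(query, docs), docs))
```

The ranking phase is the part the FPGA pipeline targets; selection stays outside it in this sketch.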

FPGA Accelerator for Bing Ranking
[Diagram: 12-stage pipeline, FPGA 0 through FPGA 11, input Document + Query. Stage labels: Query Augmentation, Query Understanding, Document Selection, Document Ranking, Caption Generation, Page Assembly]
FE: Feature Extraction
- Document features
- Hand-coded Verilog
- ~4K features
FFE: Free-Form Expressions
- ~2K synthetic features
- FFE #1 = (2*NumberOfOccurrences_0 + NumberOfOccurrences_1) (2 * NumberOfTuples_0_1)
MLS: Machine Learning Scoring
[Diagram: decision trees comparing features (FE7, FE9, FFE2, FFE3) against thresholds (T1, T2, T3), with scores at the leaves]
Demonstrated ~2x throughput gain and stability justifying production
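
The three stages named on this slide (feature extraction, free-form expressions, machine learning scoring) chain together roughly as follows. This is a software sketch under stated assumptions, not the hardware design: the feature extractor is a toy term counter, the synthetic feature uses only the first parenthesized group of FFE #1, the tree encoding is invented for illustration, and summing tree outputs is an assumption about how the scores combine.

```python
from typing import Dict, List, Union

Features = Dict[str, float]
# A tree is either a leaf score or (feature_name, threshold, if_greater, otherwise).
Tree = Union[float, tuple]


def extract_features(query_terms: List[str], doc_words: List[str]) -> Features:
    """FE stage (toy version): count occurrences of each query term."""
    return {
        f"NumberOfOccurrences_{i}": float(doc_words.count(term))
        for i, term in enumerate(query_terms)
    }


def weighted_occurrences(f: Features) -> float:
    """FFE stage (illustrative): 2*NumberOfOccurrences_0 + NumberOfOccurrences_1."""
    return 2 * f.get("NumberOfOccurrences_0", 0.0) + f.get("NumberOfOccurrences_1", 0.0)


def score_tree(tree: Tree, f: Features) -> float:
    """MLS stage: walk one decision tree, comparing features against thresholds."""
    while isinstance(tree, tuple):
        name, threshold, if_greater, otherwise = tree
        tree = if_greater if f.get(name, 0.0) > threshold else otherwise
    return tree


def score_document(trees: List[Tree], f: Features) -> float:
    """Combine per-tree scores by summation (an assumption)."""
    return sum(score_tree(tree, f) for tree in trees)


if __name__ == "__main__":
    feats = extract_features(["fpga", "cloud"], "fpga cloud fpga".split())
    feats["FFE1"] = weighted_occurrences(feats)
    trees = [
        ("FFE1", 3.0, 1.0, 0.0),
        ("NumberOfOccurrences_1", 0.5, ("NumberOfOccurrences_0", 5.0, 0.9, 0.4), 0.1),
    ]
    print(score_document(trees, feats))
```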

Pilot Results (FPGA vs. Software)
[Two plots, each with HW and SW curves: Average Latency vs. Throughput, and 95% Latency vs. Throughput]
Bing's latency target at ~2X throughput

Catapult V1 Shell Architecture
[Block diagram of the FPGA shell. Recoverable labels: two 4GB SO-DIMMs, each behind a DDR3 core; PCIe core, Gen2 x8 (Gen3 capable); PCIe DMA; inter-FPGA router; four SLIII cores; local application; I/O; RSU; reconfig; JTAG; status LEDs; xcvr config; driver; 256 Mb NAND; voltage regulator (12V, 1.5V, 0.85V); 64 slots, 64KB / slot; 2 x 16 RAMs; 32B]

Production issues at scale
- Build system
  - License servers, availability of source, build machines
- Scale-out qualification of IP
- Clean interfaces for high-productivity development environment
- Shell/driver/application versioning and deployment
  - Backwards compatibility
- Health monitoring and failure diagnostics
  - Continuous reporting of interface health, soft error rate, etc.
- Debugging (esp. on livesite)
  - Flight Data Recorder to replay bug-generating condition
- System integrity testing: many servers/vendors
- Scalability of verification
- In situ updates to drivers, golden image, shell
- Supply chain management
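
One common way to build a recorder like the Flight Data Recorder mentioned above is a fixed-size circular buffer that always holds the most recent events, so the lead-up to a failure can be dumped and replayed. A minimal software sketch of that idea; the class name, event format, and capacity are assumptions for illustration and do not describe the actual hardware recorder:

```python
from collections import deque
from typing import List, Tuple

Event = Tuple[str, str]  # (source, message); an assumed event format


class FlightDataRecorder:
    """Keep only the most recent `capacity` events in a circular buffer."""

    def __init__(self, capacity: int) -> None:
        # deque with maxlen drops the oldest entry once the buffer is full.
        self._events = deque(maxlen=capacity)

    def record(self, source: str, message: str) -> None:
        self._events.append((source, message))

    def dump(self) -> List[Event]:
        """Return the retained events, oldest first, for offline replay."""
        return list(self._events)


if __name__ == "__main__":
    recorder = FlightDataRecorder(capacity=3)
    for i in range(5):
        recorder.record("router", f"packet {i}")
    print(recorder.dump())  # only the last three events survive
```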

Azure SmartNIC
- Announced at ONS
- Use an FPGA for reconfigurable functions
  - FPGAs are already used in Bing (Catapult)
  - Roll out hardware as we do software
- Programmed using Generic Flow Tables (GFT)
  - Language for programming SDN to hardware
  - Uses connections and structured actions as primitives
- SmartNIC can also do Crypto, QoS, storage acceleration, and more
  - 40Gb bidirectional AES demo
[Diagram labels: Host, CPU, NIC ASIC, FPGA, ToR]
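
The idea of "connections and structured actions as primitives" can be pictured as a table keyed by connection, where each entry holds an ordered list of actions to apply to matching packets. The sketch below is a generic match-action table written for illustration; it is not the GFT language or API, and every name in it (Connection, FlowTable, rewrite_dst, mark_qos) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Packet = Dict[str, object]
Action = Callable[[Packet], Packet]


@dataclass(frozen=True)
class Connection:
    """Hypothetical connection key (a conventional 5-tuple)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def rewrite_dst(new_ip: str) -> Action:
    """Action factory: return a packet copy with a new destination address."""
    return lambda packet: {**packet, "dst_ip": new_ip}


def mark_qos(level: int) -> Action:
    """Action factory: return a packet copy tagged with a QoS level."""
    return lambda packet: {**packet, "qos": level}


class FlowTable:
    def __init__(self) -> None:
        self._flows: Dict[Connection, List[Action]] = {}

    def install(self, conn: Connection, actions: List[Action]) -> None:
        self._flows[conn] = list(actions)

    def process(self, conn: Connection, packet: Packet) -> Optional[Packet]:
        """Apply the connection's actions in order; None signals a table miss."""
        actions = self._flows.get(conn)
        if actions is None:
            return None
        for action in actions:
            packet = action(packet)
        return packet


if __name__ == "__main__":
    table = FlowTable()
    conn = Connection("10.0.0.1", "10.0.0.2", 5000, 443, "tcp")
    table.install(conn, [rewrite_dst("192.168.1.9"), mark_qos(3)])
    print(table.process(conn, {"dst_ip": "10.0.0.2", "payload": "hello"}))
```

In a design like this, a miss would typically fall back to software, which can then install an entry so later packets on the same connection take the fast path.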

FPGAs versus GPUs
             | CPUs            | GPUs               | FPGAs
Language     | C/C++           | CUDA               | Verilog -> OpenCL (?)
Performance  | 400 Gflops      | 6 Tflops -> 10T    | 100G -> 1T -> 4T
Efficiency   | 5 Gflops/W      | -> 20 Gflops/W     | 40-50 G/W -> 80-100 G/W
Scale        | 2M+ and growing | 1s -> 10s -> 100s  | 10Ks -> 100Ks -> 1M+
DRAM BW      | 85 GB/s         | 2x240 GB/s         | 10GB/s -> 20GB/s -> 200-500GB/s
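
The Performance row gives a feel for how many devices a petaflop takes. A back-of-the-envelope sketch using the slide's per-device figures; picking 1 Tflops for the FPGA (the middle of the 100G -> 1T -> 4T range) is an assumption, and the calculation treats the figures as peak rates with perfect scaling:

```python
import math

# Per-device figures from the Performance row (flops).
PEAK_FLOPS = {
    "CPU": 400e9,   # 400 Gflops
    "GPU": 6e12,    # 6 Tflops
    "FPGA": 1e12,   # assumed: middle of the slide's 100G -> 1T -> 4T range
}
PETAFLOP = 1e15


def devices_for(target_flops: float, per_device_flops: float) -> int:
    """Devices needed to reach the target, assuming perfect scaling."""
    return math.ceil(target_flops / per_device_flops)


if __name__ == "__main__":
    for name, flops in PEAK_FLOPS.items():
        print(f"{name}: {devices_for(PETAFLOP, flops)} devices per petaflop")
```

At the Scale row's deployment sizes (10Ks to 1M+ FPGAs), the aggregate lands well past a petaflop even at the lower per-device figures.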

Large-Scale Reconfigurable Computing for HPC
[Diagram: CS and ToR switch fabric with FPGA pools assigned to workloads. Labels: Deep Learning, HPC / MPI Offload, Bing Ranking HW, Deep Compression, Bing Ranking SW]

Conclusions
- We are at the dawn of a new era
  - Programmable logic playing a central role in systems at massive scale
- A new kind of computer
  - Will enable new applications and services to be cost effective
  - Will change system architecture, both in server and at cloud scale