Application-Specific Hardware. in the real world

Size: px

Start display at page:

Download "Application-Specific Hardware. in the real world"

Ethan Carr
5 years ago
Views:

1 Application-Specific Hardware in the real world 1

2 2

3 Large-Scale Reconfigurable Computing in a Microsoft Datacenter ISCA 14

4 Capabilities Performance/Watt $

6 FPGAs ASICs Source: Bob Broderson, Berkeley Wireless group

8 1U, 2U, or 4U rack-mounted 1/2/4 x 10Ge ports Up to 4 PCIe x16 slots 2 sockets, 6-core Intel Westmere

12 Two 8-core Xeon 2.1 GHz CPUs 64 GB DRAM 4 2 TB, GB 10 Gb Ethernet No cable attachments to server 68 ⁰C

8 lanes to Mini-SAS SFF-8088 connectors Powered by PCIe slot

13 Altera Stratix V GS D5 172k ALMs, 2,014 M20Ks, 1,590 DSPs 8GB DDR MB Configuration Flash Config PCIe Gen 3 x8 8 lanes to Mini-SAS SFF-8088 connectors Powered by PCIe slot Stratix V Flash PCIe Gen3 x8 4x 20 Gbps Torus Network 8GB DDR3

14 FPGA 1U Mezz Conn.

16 8 shell cables 6 shell cables

17 Data Center Server (1U, ½width)

18 Infrastructure and Platform Architecture To enable productive use of the FPGA: (1) APIs for interfacing software with the FPGA SW Interface (2) interfaces b/n FPGA and board-level functions shell (3) support for resilience and debugging Error correction

19 Software Interface Design goals: Host-to-FPGA: Latency (10us/16KB) multithreading Custom PCIe interface with DMA support No sys calls 1 I/P, 1 O/P buffer in non-paged memory Buffer has 64 slots of size 64KB Status bits Services initiated through calls to low-level software library

20 Math Acceleration Service Physics Engine Comp. Vision Service FPGA F W PG e A bsearch Pi F p P e G li A ne FPGA

21 4 GB DDR ECCSO-DIMM 72 4 GB DDR ECCSO-DIMM 72 Shell DDR3 Core 0 DDR3 Core 1 Role Config Flash (RSU) JTAG Mb QSPI Config Flash Host CPU 8 x8 PCIe Core DMA Engine Application LEDs Temp Sensors I 2 C Inter-FPGARouter xcvr reconfig SEU North SLIII South SLIII East SLIII West SLIII

Selection as a Service (SaaS) SaaS 1 IF I M FM 1 Ranking as a Service (RaaS) RaaS1 IF I M FM 1 1 Query SaaS IF I M 2 FM 2 SaaS IF I M 3 FM 3 Selected Documents RaaS2 IF I M FM 2 2 RaaS3 IF I M FM 3 3

22 Selection as a Service (SaaS) SaaS 1 IF I M FM 1 Ranking as a Service (RaaS) RaaS1 IF I M FM 1 1 Query SaaS IF I M 2 FM 2 SaaS IF I M 3 FM 3 Selected Documents RaaS2 IF I M FM 2 2 RaaS3 IF I M FM blue links SaaS 4 IFM44 RaaS48 IFM44 Ported to Catapult Selection-as-a-Service (SaaS) -Find all docs that contain query terms, -Filter and select candidate documents for ranking Ranking-as-a-Service (RaaS) - Compute scores for how relevant each selected document is for the search query - Sort the scores and return the results

23 Query: FPGA Configuration Document NumberOfOccurrences_0 = 7 NumberOfOccurrences_1 = 4 NumberOfTuples_0_1 = 1 Score

24 Document NumberOfOccurrences_0 = 7 NumberOfOccurrences_1 = 4 NumberOfTuples_0_1 = 1 FFE #1 =(2*NumberOfOccurrences_0 + NumberOfOccurrences_1) (2 * NumberOfTuples_0_1) Metafeature #1 = 9 Score

25 PCIe Compressed Document Stream Preprocessing FSM Control/Data Tokens Distribution latches Free Form Expression (FFE) Feature Gathering Network 196 feature families 43 state machines 2.6K dynamic features extracted in less than 4us (~600us in SW)

26 Out pu Cluster 0 Core 0 Core 1 Core 2 tfst Complex Core 3 Core 4 Core 5 Scheduler I-Mem Feature Store Thread 0 Thread 1 Thread 2 Thread 3 F D E M W

Document Scoring Request Return Score Server Server Server FPGA 5 Document Scoring

27 Document 8-Stage Pipeline FE: FeatureExtraction FPGA 0 FPGA 1 Route to Head Route to Head RaaS Servers Server Server FFE: Free-Form Expressions FPGA 2 FPGA 3 FPGA 4 Document Scoring Request Return Score Server Server Server FPGA 5 Document Scoring Request Server Score Compute Score Compute Score FPGA 6 FPGA 7 Return Score Server Server

29 Accelerating Large-Scale Services Bing Search 1,632 Servers with FPGAs Running Bing Page Ranking Service (~30,000 lines of C++) Reduce no. of Servers More compute time for improving relevance

32 Google - Tensor Processing Unit 32

33 AGoldenAge in Microprocessor Design Stunning progress in microprocessor design 40 years 10 6 x faster! Three architectural innovations(~1000x) Width: bit(~8x) Instruction level parallelism: 4-10 clock cycles per instruction to 4+ instructions per clock cycle (~10-20x) Multicore: 1 processor to16 cores (~16x) Clock rate: 3 to 4000 MHz (~1000x thru technology & architecture) Made possible by ICtechnology: Moore s Law: growth in transistor count (2X every 1.5 years) Dennard Scaling: power/transistor shrinks at same rate as transistors are added (constant per mm2 of silicon)

34 End of Growth of Performance? Since Transistorsnot getting muchbetter Powerbudget not getting muchhigher Already switched from 1inefficient processor/chipto Nefficient processors/chip Only path left isdomainspecific Architetures Justdo afew tasks,but extremely well

35 TPUOrigin Starting as far back as 2006, Google engineers had discussions about deploying GPUs, FPGAs, or custom ASICs in their data centers. They concluded that they can use the excess capacity of the large data centers. The conversation changed in 2013 when it was projected that if people used voice search for 3 minutes a day using speech recognition DNNs, it would have required Google s data centers to double in order to meet computation demands. Google then started a high-priority project to quickly produce a custom ASIC for inference. The goal was to improve cost-performance by 10x over GPUs. Given this mandate, the TPU was designed, verified, built, and deployed in data centers in just 15 months

Neural nets Weights 3 Kinds of Popular NNs Multi-Layer Perceptrons(MLP) Each new layer is a set of nonlinear functions of weighted sum of all outputs (

subsets of outputs from the prior layer, which also reuses the weights. http://cs231n.github.

36 Neural nets Weights 3 Kinds of Popular NNs Multi-Layer Perceptrons(MLP) Each new layer is a set of nonlinear functions of weighted sum of all outputs ( fully connected) from a prior one Convolutional Neural Networks(CNN) Each ensuing layer is a set of nonlinear functions of weighted sums of spatially nearby subsets of outputs from the prior layer, which also reuses the weights. Recurrent Neural Networks(RNN) Each subsequent layer is a collection of nonlinear functions of weighted sums of outputs and the previous state. The most popular RNN is Long Short-Term Memory (LSTM). 36

37 Inference DatacenterWorkload(95%)

38 TPU architecture PCIe coprocessor No internal instruction fetch CISC-like instructions from host: Read_Host_Memory Read_Weights MatrixMultiply/Convolve Activate Write_Host_Memory Off-chip DDR3 weight memory 38

39 TPU architecture MACs for core computation 24MB Unified Buffer Store intermediate results Sized to match pitch of matmult unit, simplify compilation w/ specific apps Tiny control logic 39

40 TPU architecture systolic structure 40

Software Stack Software stack is split into auserspace Driver and a KernelDriver. The Kernel Driver is lightweight Handles only memorymanagement andinterrupts.

41 Software Stack Software stack is split into auserspace Driver and a KernelDriver. The Kernel Driver is lightweight Handles only memorymanagement andinterrupts. The User Space driver changes frequently. It sets up and controls TPU execution Reformats data into TPU order Translates API calls into TPU instructions, and turns them into an application binary.

42 System configurations * TPU is less than half die size of the Intel Haswell processor K80 and TPU in 28nm process, Haswell fabbed in intel 22nm process Widely deployed in Google data centers 42

43 Performance Roofline curves computation vs memory-intensity Ridge point at intensity where app becomes compute-bound Before ridge = memory-bound After ridge = compute-bound Below curve = response timeconstrained 43

44 Performance energy efficiency 44

45 Performance energy proportionality 45

46 Design space exploration 46

47 TPU v2 At HotChips 2017: 2x 128x128x32b mixed multiply units (MXUs) 64GB HBM 64x TPU modules per pod 4TB HBM Some available in TensorFlow cloud svc 47

48 Google vs. Microsoft Why Google ASIC? Why Microsoft FPGA? Flexibility? Programmability? Cost and usefulness over time? 48

49 References Figures and slides are from Norman P. Jouppi, et al., "In-Datacenter PerformanceAnalysis of atensor Processing Unit", 44th IEEE/ACM International Symposium on Computer Architecture (ISCA-44), Toronto, Canada, June David Patterson, "Evaluation of the Tensor Processing Unit: A Deep Neural NetworkAccelerator for the Datacenter", NAE Regional Meeting,April KazSato, An in-depth look at Google s firsttensor Processing Unit (TPU),

Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud

Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud Doug Burger Director, Hardware, Devices, & Experiences MSR NExT November 15, 2015 The Cloud is a Growing Disruptor for HPC Moore s