Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Transcription:

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Stony Brook University

Toward Efficiency of Cloud Servers: The number of servers grows rapidly; the user base is increasing exponentially; new services appear daily; there is a constant need for more servers. Many costs: HW, space, power. We want maximum server efficiency.

Achieving Server Efficiency: Must target two classes of workloads: server software and machine learning.

Talk Outline Machine Learning Accelerators Fused-Layer CNN Accelerators [MICRO 16] Flexible Buffering [FCCM 17] Resource Partitioning [FPL 16][ISCA 17] Server Workloads CloudSuite Benchmarks [ASPLOS 12][TopPicks 14][ISPASS 16] Proactive Instruction Fetch [MICRO 08][MICRO 11] Hardware Memoization [ASPLOS 15] Scale-Out Processors [ISCA 12] Reactive NUCA, Cuckoo Directories [ISCA 09][TopPicks 10][HPCA 10] Cache Bursts, Spatio-Temporal Streaming [MICRO 08][HPCA 09][TopPicks 10]

Deep Convolutional Neural Networks: Revolutionizing machine learning; need fast and efficient evaluation mechanisms. Lots of computation, a good target for accelerators: GPUs, FPGAs, ASICs.

CNN Accelerator Efficiency. What we want: all compute units do useful work all the time. Main challenges: (1) off-chip data transfer, which starves compute units and has an expensive energy cost; (2) mapping computation to compute units: an accelerator has thousands of compute units, so how to make sure no compute unit is left idle?

Our Contributions Reduce off-chip data transfer: Fused-Layer CNN Accelerators [MICRO 16] Flexible Buffering [FCCM 17] Increase Compute Unit utilization: Resource Partitioning [ISCA 17]

Feature Map Data in CNN Acceleration: Layers are computed one after another, using external memory at each layer. The input and output are small, but inter-layer data are large. (Figure: Conv. Layer 0 through Conv. Layer X pass feature maps through external memory.)
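As a rough illustration of why the inter-layer feature maps dominate traffic, here is a minimal Python sketch (not from the talk; the layer shapes and 16-bit element size are assumptions) that tallies the DRAM traffic when every intermediate feature map is written out and read back between layers.

```python
# Minimal sketch (not the talk's implementation): layer-by-layer CNN evaluation
# where every intermediate feature map round-trips through external memory.
# Layer shapes and the 16-bit element size are assumptions for illustration.

def fmap_bytes(height, width, channels, elem_bytes=2):
    """Size of one feature map in bytes."""
    return height * width * channels * elem_bytes

# (out_height, out_width, out_channels) of a few early VGG-like layers
layers = [(224, 224, 64), (224, 224, 64), (112, 112, 128), (112, 112, 128)]

# Every feature map except the final output is written to DRAM by one layer
# and read back by the next, so it crosses the chip boundary twice.
intermediate = layers[:-1]
dram_traffic = sum(2 * fmap_bytes(h, w, c) for h, w, c in intermediate)
print(f"inter-layer DRAM traffic: {dram_traffic / 2**20:.1f} MiB per image")
```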

Feature Map Transfer Challenge: A large amount of data is transferred on and off chip, e.g., for VGGNet-E (includes pooling layers). (Figure: input, output, and weight data size [MB] for convolutional layers 1-16.) MBs of data are transferred on and off chip per image.

Fused-Layer CNNs: Demonstrate the existence of inter-layer data locality; re-order computation to cache intermediate data; trade on-chip buffer for reduced off-chip transfer. (Figure: Conv. Layer 0 through Conv. Layer X, with intermediate data kept on chip instead of in external memory.)

Background: CNN Evaluation: Each layer's output becomes the input for the next layer; data goes to external memory in between layers.

Layer Fusion: Save memory bandwidth by processing multiple layers together. Challenges: intermediate data is too large to store on chip, and a layer's output order is different than the next layer's input order. Approach: re-arrange CNN evaluation operations to exploit inter-layer locality.

Inter-Layer Locality: Sliding Pyramid. (Figure: input feature maps (N) feed Layer 1, producing intermediate feature maps (M), which feed Layer 2, producing output feature maps (P). A tile of the input yields one output pixel; sliding to output pixel 2 uses tile 2, which overlaps tile 1 and requires only a small amount of new data. The overlapping computation is avoided by storing intermediate values on chip.)
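To make the pyramid concrete, the following is a small sketch (my own illustration, assuming two 3x3, stride-1 layers with no padding) of how the input window needed for one layer-2 output pixel overlaps the window for the next pixel, so only a thin slice of new data enters the pyramid as it slides.

```python
# Minimal sketch of the "sliding pyramid" for two 3x3 convolution layers
# (stride 1, no padding assumed): the input-column range (pyramid base) needed
# for one layer-2 output column, and the overlap between adjacent pyramids.

K1, K2 = 3, 3  # kernel widths of layer 1 and layer 2 (assumed)

def pyramid_base(out_col):
    """Half-open input-column range needed for layer-2 output column out_col."""
    # layer-2 output column out_col needs layer-1 output columns [out_col, out_col + K2)
    mid_lo, mid_hi = out_col, out_col + K2
    # each layer-1 output column m needs input columns [m, m + K1)
    return mid_lo, mid_hi + K1 - 1

lo0, hi0 = pyramid_base(0)
lo1, hi1 = pyramid_base(1)
print(f"pyramid for output col 0 covers input cols [{lo0}, {hi0})")
print(f"pyramid for output col 1 covers input cols [{lo1}, {hi1})")
print(f"new input columns when sliding by one: {hi1 - hi0}")
# Keeping the overlapping layer-1 results on chip means only the new column of
# input (and the corresponding new layer-1 column) is needed per sliding step.
```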

Layer Fusion Exploration Results (example: VGGNet-E, layers 1-5). (Figure: DRAM transfer [MB] vs. extra on-chip storage required [KB].) No fused layers: 0 KB storage, 86 MB transfer. An intermediate design point: 118 KB storage, 25 MB transfer. All layers fused: 362 KB storage, 3.6 MB transfer.

Fused-Layer Benefits: Validated the concept with a prototype FPGA, comparing fused-layer against FPGA'15 [Zhang et al.]: 95% transfer reduction (on 5 layers of VGGNet-E) with just 20% extra on-chip memory cost. CPUs experience ~6x speedup from fusing layers when targeting a buffer size of roughly the L1 cache size. GPUs can similarly do fused-layer evaluation (bandwidth-limited mobile/embedded GPUs), though implementing this can be challenging. Massive bandwidth reduction; works on FPGA, GPU, and CPU!

Our Contributions Reduce off-chip data transfer: Fused-Layer CNN Accelerators [MICRO 16] Flexible Buffering [FCCM 17] Increase Compute Unit utilization: Resource Partitioning [ISCA 17]

Feature Map Transfer Challenge (recap): A large amount of data is transferred on and off chip, e.g., for VGGNet-E (includes pooling layers). (Figure: input, output, and weight data size [MB] for convolutional layers 1-16.) MBs of data are transferred on and off chip per image.

Weight Data Transfer Challenge: A large amount of data is transferred on and off chip, e.g., for VGGNet-E (includes pooling layers). (Figure: input, output, and weight data size [MB] for convolutional layers 1-16 and fully-connected layers F1-F3; weight size dominates in the fully-connected layers.) MBs of data are transferred on and off chip per image.

Weight Bandwidth and Batching: Weight transfer is a limiter of throughput. (Figure: required weight bandwidth [GB/s] vs. throughput [images/s] for VGGNet-E and AlexNet, compared against DDR3 (800 MHz) bandwidth.) Batching reduces weight transfer: read a weight onto the chip once and reuse it for a batch of images. Batching is limited in existing accelerators: the required on-chip buffer capacity is the number of images times the per-image buffer size. Optimized batching is needed to minimize bandwidth.
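A back-of-the-envelope Python sketch of the batching arithmetic behind this chart (the ~144M 32-bit VGGNet-E weights, the 12.8 GB/s DDR3 channel, and the target throughput are my illustrative assumptions, not figures from the talk):

```python
# Back-of-the-envelope sketch of weight bandwidth vs. batching (illustrative
# numbers, assumed rather than taken from the talk).

VGG_WEIGHT_GB = 144e6 * 4 / 1e9    # ~144M parameters, 32-bit floats (approx.)
DDR3_PEAK_GBS = 12.8               # one 64-bit DDR3 channel at an 800 MHz clock

def weight_bandwidth_gbs(images_per_s, batch_size):
    """Required DRAM bandwidth for weights when they are read once per batch."""
    return VGG_WEIGHT_GB * images_per_s / batch_size

for g in (1, 4, 16, 64):
    bw = weight_bandwidth_gbs(images_per_s=60, batch_size=g)
    verdict = "fits within" if bw < DDR3_PEAK_GBS else "exceeds"
    print(f"batch {g:3d}: {bw:5.1f} GB/s for weights ({verdict} DDR3 peak)")
```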

Fully-Connected Layers. (Figure: an input vector and a set of weight vectors produce an output vector; each output element combines the input vector with one weight vector.)
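For reference, a minimal numpy sketch of the fully-connected computation and of how batching reuses the weights (the dimensions and batch size here are arbitrary):

```python
# Minimal numpy sketch of a fully-connected layer: each output element is the
# dot product of the input vector with one weight vector. Batching G inputs
# turns the matrix-vector product into a matrix-matrix product, so each weight
# is read once per batch instead of once per image. Dimensions are arbitrary.
import numpy as np

N, M, G = 4096, 1000, 8                        # inputs, outputs, batch (assumed)
W = np.random.rand(M, N).astype(np.float32)    # one weight vector per output

x = np.random.rand(N).astype(np.float32)
y = W @ x                                      # single image
assert y.shape == (M,)

X = np.random.rand(N, G).astype(np.float32)
Y = W @ X                                      # batch of G images, same weights
assert Y.shape == (M, G)
```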

Output Buffer Size and Input Retransfer. (Figure: input vectors are accumulated against the weight vectors into on-chip output buffers, accumulating the 1st input, the 2nd input, and so on until all inputs are accumulated.) Buffer the whole output vector and the input vector is read once; buffer half of the output vector and the input vector is read twice. A smaller output buffer means more input vector retransfer.

Input vs. Weight Transfer. (Figure: MB transferred per image vs. batch size G, for output-buffer budgets of 160 KB, 320 KB, 640 KB, and 1280 KB; small batches are weight-dominant and large batches are input-dominant. First fully-connected layer of VGGNet-E, 32-bit floating point.) Different budgets have different optimal points, and different layers also have different optimal points; flexible buffering is needed to stay near the optimal points.
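The following sketch models this trade-off in a simplified way (my reading of the slide, not the paper's exact cost model): weights are amortized over the batch, the input vector is re-read once per on-chip output slice, and a small sweep finds the best batch size for each output-buffer budget.

```python
# Simplified transfer model (my reading of the slide, not the paper's exact
# model): with an output-buffer budget shared by a batch of G images, only a
# slice of each output vector fits on chip, so the input vector is re-read
# once per slice, while weights are read once per batch.
import math

def mb_per_image(n_in, n_out, batch, outbuf_kb, elem_bytes=4):
    out_elems_per_image = max(1, (outbuf_kb * 1024) // (elem_bytes * batch))
    passes = math.ceil(n_out / out_elems_per_image)     # input re-reads
    input_bytes = n_in * elem_bytes * passes
    weight_bytes = n_in * n_out * elem_bytes / batch    # amortized over the batch
    return (input_bytes + weight_bytes) / 2**20

# First fully-connected layer of VGGNet-E: 25088 inputs, 4096 outputs, fp32.
for budget_kb in (160, 320, 640, 1280):
    best_g = min(range(1, 1001), key=lambda g: mb_per_image(25088, 4096, g, budget_kb))
    print(f"outbuf {budget_kb:4d} KB: best batch ~{best_g:3d}, "
          f"{mb_per_image(25088, 4096, best_g, budget_kb):5.1f} MB/image")
```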

Optimized Batching Contributions: Batch size affects input feature map retransfer, an important trade-off overlooked by existing designs. Conv layers also need batching; existing designs only consider batching FC layers. New accelerator design with flexible buffering: Escher. Different layers can use different optimized batch sizes, and batching is supported in both Conv and FC layers. Escher vs. existing batching: reduces bandwidth by up to 2.4x; batching in Conv layers reduces bandwidth by up to 1.7x. Flexible buffering enables optimized batching.

Our Contributions Reduce off-chip data transfer: Fused-Layer CNN Accelerators [MICRO 16] Flexible Buffering [FCCM 17] Increase Compute Unit utilization: Resource Partitioning [ISCA 17]

CNN Accelerator Underutilization: The state of the art puts all compute in one processor, a Single-CLP (Convolutional Layer Processor). Dimension mismatch causes underutilization: CNN layers each have different dimensions (N, R, C, M), while the traditional approach uses one fixed-dimension CLP. (Figure: part of the CLP goes unutilized when mapped to a convolutional layer.) One size fits all leads to poor compute utilization.

Compute-Unit Underutilization Problem: Average utilization is 56%. (Figure: CLP datapath with input and output buffers.) Dimension mismatch leads to poor dynamic utilization.

Our Multi-CLP Solution: Replace a single large CLP with multiple smaller CLPs, each optimized for a subset of layers so that CLP dimensions fit the CNN layers. (Figure: a Single-CLP accelerator runs one CLP for layers 1 to 5; a Multi-CLP accelerator runs CLP0 for layer 1, CLP1 for layer 3, and CLP2 for layers 2, 4, and 5.) Use specialization to increase dynamic utilization.
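A toy utilization model (my own sketch with AlexNet-like channel counts, not the paper's analysis) shows how a single fixed-dimension CLP wastes cycles on layers whose channel counts do not match its parallelism, while per-layer-sized CLPs do not:

```python
# Toy utilization model (not the paper's analysis): a CLP with parallelism
# (Tn, Tm) over input/output channels pads each layer up to multiples of
# Tn and Tm, wasting cycles when the layer's channel counts N, M don't match.
import math

def utilization(N, M, Tn, Tm):
    cycles_used = math.ceil(N / Tn) * math.ceil(M / Tm)    # with padding
    cycles_ideal = (N / Tn) * (M / Tm)                     # perfect packing
    return cycles_ideal / cycles_used

layers = [(3, 64), (64, 192), (192, 384), (384, 256), (256, 256)]  # AlexNet-like

# Single CLP sized for the large middle layers:
single = [utilization(N, M, 48, 64) for N, M in layers]
print("Single-CLP per-layer utilization:", [f"{u:.2f}" for u in single])

# Multi-CLP: the tiny first layer gets its own small CLP; the rest share one
# whose dimensions divide their channel counts evenly.
multi = [utilization(3, 64, 3, 16)] + [utilization(N, M, 32, 64) for N, M in layers[1:]]
print("Multi-CLP per-layer utilization: ", [f"{u:.2f}" for u in multi])
```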

Throughput Improvement (Virtex-7 690T, 16-bit fixed-point): Throughput is proportional to compute utilization. (Figure: relative throughput of Single-CLP vs. Multi-CLP for AlexNet, VGGNet-E, SqueezeNet, and GoogLeNet.) Multi-CLP significantly increases throughput.

Scalability Projection (AlexNet, 32-bit floating point, 100 MHz). (Figure: throughput [images/s] vs. DSP slices for Single-CLP and Multi-CLP, with the V7-485T, V7-690T, VU9P, and VU11P devices marked.) The Multi-CLP benefit scales with FPGA size.

CNN Accelerator Conclusions: Fusing convolutional layers demonstrated new locality in CNN evaluation, using a small on-chip memory to save off-chip data transfer. Batch size trades input transfer against weight transfer; finding the optimal batch size per layer benefits batching of both Conv and FC layers. A Single-CLP suffers dynamic underutilization; a Multi-CLP design has more flexibility, with each CLP optimized for a subset of layers.

Debugging Systems with FPGAs is Difficult: It involves HW, SW, and the OS at the same time, a painfully slow FPGA compilation process, tedious host rebooting due to freezes, and poor visibility for debugging.

State of the Art: Testbenches for HW exclude SW and the OS; existing full-system simulation targets SoC ASICs; HW-SW co-simulation excludes the OS; debugging on hardware requires synthesis, place and route, and reboot.

The VM-HDL Co-Simulation Framework. (Figure: a virtual machine runs the host system's software and operating system, while an HDL simulator runs the FPGA hardware design; the two sides are connected by a simulated PCIe link with PCIe queue and bridge blocks. GDB can be attached and waveforms can be saved.)

Run Time Comparison
                   Real System (sec.)    Co-Simulation (sec.)
Compilation        -                     167
Synthesis          1617                  -
Place and Route    2672                  -
Reboot             120                   -
Execution          0.000032              6.02
Total              4409 (1h:13m:29s)     173 (2m:53s)

Talk Outline Machine Learning Accelerators Fused-Layer CNN Accelerators [MICRO 16] Flexible Buffering [FCCM 17] Resource Partitioning [FPL 16][ISCA 17] Server Workloads CloudSuite Benchmarks [ASPLOS 12][TopPicks 14][ISPASS 16] Proactive Instruction Fetch [MICRO 08][MICRO 11] Hardware Memoization [ASPLOS 15] Scale-Out Processors [ISCA 12] Reactive NUCA, Cuckoo Directories [ISCA 09][TopPicks 10][HPCA 10] Cache Bursts, Spatio-Temporal Streaming [MICRO 08][HPCA 09][TopPicks 10]

Thanks! Happy to take any questions and comments: now, after the talk, or via email at mferdman@cs.stonybrook.edu.