1 Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks?
Discussion of the FPGA '17 paper by Intel Corp. (Nurvitadhi et al.)
Andreas Kurth

2 In short: The situation
Implementing DNNs efficiently is very important.
Image credit: NVIDIA

3 In short: The problem
Which computing device is most suitable for this task?
This is an open question that depends on many factors, but GPUs (and ASICs such as DaDianNao and the TPU) are the de facto standard.
Why not FPGAs? They can nominally be more energy-efficient than GPUs, but their inferior memory interfaces and floating-point performance negate this advantage.

4 In short: The solution
Intel claims its upcoming FPGA families will address this. One flagship FPGA (Stratix 10 SX2800) will feature:
>5k hard-macro floating-point units (FPUs)
28 MB of on-chip RAM
up to 1 TB/s off-chip memory bandwidth (HBM2)
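As a rough sanity check on such headline figures, peak FP32 throughput can be estimated as units x FLOPs per unit per cycle x clock. The sketch below is only illustrative: it assumes each hard FPU completes one fused multiply-add (2 FLOPs) per cycle and uses a hypothetical count of 5,000 units, since the slide only states ">5k". Neither assumption is taken from the paper.

```python
def peak_tflops(num_fpus: int, clock_mhz: float, flops_per_cycle: int = 2) -> float:
    """Peak throughput in TFLOP/s.

    flops_per_cycle=2 is an assumption (one multiply-add per FPU per cycle).
    """
    return num_fpus * flops_per_cycle * clock_mhz * 1e6 / 1e12


def clock_for_target_mhz(target_tflops: float, num_fpus: int,
                         flops_per_cycle: int = 2) -> float:
    """Clock (MHz) needed to reach target_tflops with num_fpus units."""
    return target_tflops * 1e12 / (num_fpus * flops_per_cycle) / 1e6


if __name__ == "__main__":
    # Hypothetical unit count: the slide only says ">5k".
    assumed_fpus = 5000
    print(f"Clock needed for 9.2 TFLOP/s: "
          f"{clock_for_target_mhz(9.2, assumed_fpus):.0f} MHz")
```

With more than 5,000 units, the clock required for the same peak is correspondingly lower, so the result is best read as an upper bound under these assumptions.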

5 In short: The evaluation
According to Intel's calculations, the SX2800 matches or outperforms a state-of-the-art GPU (NVIDIA TITAN X) in terms of
nominal GEMM performance: 9.2 TFLOP/s (SX2800) vs. 11 TFLOP/s (TITAN X) (FP32) and
energy efficiency: 60 GFLOP/s/W vs. 45 GFLOP/s/W (FP32),
as well as in benchmarks (SX2800 relative to TITAN X):
Benchmark                             Performance   Energy efficiency
sparse (85% pruned) AlexNet           1.1x          1.9x
DNNs with narrow (int6) data types    1.5x          2.1x
BinaryNet                             5.4x          5.0x
Ternary ResNet-50 (ImageNet)          2.0x          2.7x
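The nominal figures above also imply a power budget for each device, since power = throughput / efficiency. The following sketch only rearranges the numbers quoted on the slide; the function name is made up for illustration, and the resulting wattages are derived values, not figures stated in the paper.

```python
def implied_power_watts(peak_tflops: float, gflops_per_watt: float) -> float:
    """Power implied by a peak throughput and an efficiency figure."""
    return peak_tflops * 1000.0 / gflops_per_watt


if __name__ == "__main__":
    # Nominal FP32 numbers as quoted on the slide.
    devices = {
        "SX2800": (9.2, 60.0),
        "TITAN X": (11.0, 45.0),
    }
    for name, (tflops, eff) in devices.items():
        print(f"{name}: about {implied_power_watts(tflops, eff):.0f} W implied")
    print(f"Performance ratio (FPGA/GPU): {9.2 / 11.0:.2f}")
    print(f"Efficiency ratio (FPGA/GPU):  {60.0 / 45.0:.2f}")
```

Read this way, the nominal FPGA advantage comes from a lower implied power draw rather than from higher raw throughput.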

6 If you only remember one thing from this talk...
Intel claims their next-generation FPGAs will surpass state-of-the-art GPUs in energy efficiency and match them in performance on SGEMM operations, both nominally and for real DNN workloads.

7 How does Intel justify this claim?
1) DNN trends could favor FPGAs.
2) FPGA architecture and technology are closing the gap to GPUs.
3) Intel developed a computational template that matches (1) to (2).

8 DNN trends could favor FPGAs
1) DNNs are getting deeper (more layers) to increase accuracy, but they are not getting larger in terms of memory.
Table 1: Recent ImageNet challenge winners.
The increased compute density and the irregularity employed across layers (e.g., sizes, links) are thought to be favorable for FPGAs.

9 DNN trends could favor FPGAs
2) Compact data types (e.g., FP16, Int8, but also binary or ternary) reduce the number of computations and the memory size at moderate accuracy losses.
Figure 5.b) Binarized matrix multiply implemented with XNOR and bitcount.
Even though modern GPUs support FP16 and Int8, operations on data types other than FP32/64 (e.g., the binary XNOR-Net) can be favorable for FPGAs.
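The XNOR-and-bitcount idea can be sketched in a few lines. The example assumes values in {-1, +1} are packed into an integer with +1 encoded as bit 1 and -1 as bit 0; the dot product of two length-n vectors is then 2 * popcount(XNOR(a, b)) - n. The function names and the encoding are illustrative choices, not the paper's hardware design.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
        elif v != -1:
            raise ValueError("binary vectors may only contain +1 or -1")
    return word


def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two packed +1/-1 vectors of length n."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask   # bit is 1 where the signs agree
    matches = bin(xnor).count("1")     # bitcount (popcount)
    return 2 * matches - n             # agreements minus disagreements


def binary_matmul(a_rows, b_cols, n):
    """Matrix product from packed rows of A and packed columns of B."""
    return [[binary_dot(r, c, n) for c in b_cols] for r in a_rows]


if __name__ == "__main__":
    a = [1, -1, 1, 1, -1, -1, 1, -1]
    b = [1, 1, -1, 1, -1, 1, 1, -1]
    reference = sum(x * y for x, y in zip(a, b))
    assert binary_dot(pack_bits(a), pack_bits(b), len(a)) == reference
```

Each multiply-accumulate collapses into one bit of an XNOR plus a share of a popcount, which is the kind of bit-level operation that maps directly onto FPGA logic.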

10 DNN trends could favor FPGAs
3) Weights and neurons are never 100% non-zero (e.g., in non-fully-connected layers or after ReLU), yet zeros wastefully participate in calculations.
Sparsity can additionally be increased by pruning weights that are deemed unimportant.
Above a certain level of sparsity, FPGAs support sparse calculations more efficiently than GPUs because the computations become irregular.
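To make the effect of pruning and zero-skipping concrete, here is a minimal sketch. It uses simple magnitude-based pruning as a stand-in for "weights that are deemed unimportant"; the paper's actual pruning method may differ, and the function names are hypothetical.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = round(len(weights) * sparsity)  # number of weights to remove
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:k]:
        pruned[i] = 0.0
    return pruned


def sparse_dot(weights, activations):
    """Dot product that skips zero operands; also returns the MACs performed."""
    total, macs = 0.0, 0
    for w, x in zip(weights, activations):
        if w == 0.0 or x == 0.0:
            continue  # a zero contributes nothing, so skip the multiply
        total += w * x
        macs += 1
    return total, macs


if __name__ == "__main__":
    weights = [0.1 * (i + 1) * (-1) ** i for i in range(20)]
    activations = [max(0.0, (i % 3) - 1.0) for i in range(20)]  # ReLU-like
    pruned = prune_by_magnitude(weights, 0.85)
    value, macs = sparse_dot(pruned, activations)
    print(f"{macs} of {len(weights)} multiply-accumulates actually needed")
```

The data-dependent `continue` is what makes the workload irregular: which operations run depends on the values, which is awkward for lockstep SIMD execution.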

11 FPGA architecture and technology are closing the gap to GPUs
Increased on-chip RAM: 28.6 MB on SX2800 vs. about 13 MB on TITAN X
On-par bandwidth to main memory (HBM2)
HyperFlex to increase clock frequencies
Nearly on-par peak FP32 performance: 9.2 TFLOP/s on SX2800 vs. 11 TFLOP/s on TITAN X
Larger set of native data types through bit-level manipulations and FP16/32.

12 Customizable Hardware Architecture Template for DNNs
Figure 4: Customizable hardware architecture template for DNNs.

13 Evaluation: Methodology
Table 2: FPGAs and GPU under study.
FPGA: Altera Quartus Early Beta for synthesis; Altera Early Power Estimator and post-implementation(?) netlist to estimate performance and power
GPU: nvprof for performance and power numbers on an implementation

14 Evaluation: Dense DNNs
The numbers are taken from the respective data sheets, not from an implementation.
They did not create an optimized FPGA implementation because FP32 dense matrix multiplications are a sweet spot for GPUs, not FPGAs.
The FPGA frequency is not specified.

15 Evaluation: Sparse (Pruned) DNNs
85% pruned sparsity on AlexNet (<1% accuracy loss) with FP32
GPU implementation: extension of the optimized open-source MAGMA dense matrix multiplication library; performs worse than dense multiplications. (cuSPARSE targets >99% sparsity.)
FPGA implementation: with Sparse PEs, giving 4x speedup. 300 MHz is the conservative estimate; 500 MHz and 700 MHz are the moderate and aggressive projections, respectively.

16 Evaluation: Compact DNNs
6-bit integers for weights and neurons
GPU: no implementation; nominal Int8 peak performance used instead.
FPGA implementation: based on systolic GEMM, achieving 920 MHz because it is well optimized for frequency.
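For readers unfamiliar with narrow integer types, the sketch below shows one common way to map floating-point values to 6-bit signed integers (range -32..31) and compute a dot product with integer accumulation. It assumes symmetric per-vector scaling, which is an illustrative choice; the slide does not say which quantization scheme the paper uses.

```python
INT6_MIN, INT6_MAX = -32, 31  # range of a 6-bit two's-complement integer


def quantize_int6(values):
    """Symmetric quantization: returns (int6 values, scale factor)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / INT6_MAX if max_abs > 0 else 1.0
    quantized = [max(INT6_MIN, min(INT6_MAX, round(v / scale))) for v in values]
    return quantized, scale


def int6_dot(qa, scale_a, qb, scale_b):
    """Integer multiply-accumulate, rescaled to a float at the end."""
    acc = sum(x * y for x, y in zip(qa, qb))  # pure integer arithmetic
    return acc * scale_a * scale_b


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.75]
    neurons = [1.0, 0.5, -0.5, 2.0]
    qw, sw = quantize_int6(weights)
    qn, sn = quantize_int6(neurons)
    exact = sum(w * x for w, x in zip(weights, neurons))
    print(f"exact={exact:.4f}  int6 approx={int6_dot(qw, sw, qn, sn):.4f}")
```

The inner loop needs only small integer multipliers and an accumulator, which is why narrow types reduce both logic area and memory traffic.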

17 Evaluation: Binary DNNs
1-bit types for both weights and neurons
GPU implementation: xnor_gemm kernel from BinaryNet (CUDA threads perform xnor and bitcount operations; 32 32-bit bitcounts per SM per cycle)
FPGA implementation: systolic array of Binary PEs, 256-wide binary dot product operations; synthesized for both Arria and Stratix; measured on Arria, simulated on Stratix

18 Evaluation: Ternary ResNet-50
Ternary weights (-1, 0, +1), FP32 neurons; within 1% accuracy of full ResNet
70..80% sparsity across weights and neurons (ideally 3.3..5x op. reduction)
GPU implementation: Torch, batch size 64, cuDNN 5 with most aggressive performance setting, 3x faster than closest other implementation
FPGA implementation: exploits sparsity only in neurons to achieve a simpler design (450 MHz as a "conservative estimation", but only 2x op. reduction)
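The "ideal" operation reduction follows directly from the sparsity: if a fraction s of the operations involves a zero, only (1 - s) remain, a reduction of 1 / (1 - s). The sketch below reproduces the 3.3..5x range and shows why ternary weights need no multiplier at all. The function names are illustrative.

```python
def ideal_op_reduction(sparsity: float) -> float:
    """Ideal reduction factor when a fraction `sparsity` of ops is skipped."""
    return 1.0 / (1.0 - sparsity)


def ternary_dot(weights, neurons):
    """Dot product with weights in {-1, 0, +1}: add, subtract, or skip."""
    acc, ops = 0.0, 0
    for w, x in zip(weights, neurons):
        if w == 0 or x == 0.0:
            continue  # zero weight or zero neuron: nothing to do
        acc += x if w == 1 else -x  # no multiplication required
        ops += 1
    return acc, ops


if __name__ == "__main__":
    for s in (0.7, 0.8):
        print(f"sparsity {s:.0%}: ideal reduction {ideal_op_reduction(s):.1f}x")
    print(ternary_dot([1, 0, -1, 1], [2.0, 3.0, 0.0, 0.5]))
```

By the same formula, the 2x reduction the FPGA design achieves corresponds to skipping about half of the operations, consistent with exploiting sparsity in the neurons only.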

19 Conclusion
Can FPGAs beat GPUs in performance for next-generation DNNs?
Yes, if the Stratix meets Intel's performance projections (SX2800 relative to TITAN X):
Benchmark                             Performance   Energy efficiency
sparse (85% pruned) AlexNet           1.1x          1.9x
DNNs with narrow (int6) data types    1.5x          2.1x
BinaryNet                             5.4x          5.0x
Ternary ResNet-50 (ImageNet)          2.0x          2.7x

20 My opinion: The good
Intel is trying to accelerate innovation in FPGAs (e.g., memory, architectural mix) and wants to challenge the market lead of GPUs for DNNs.
Concretely, Intel proposes an accelerator template and evaluates it in a promising case study.
They use a competitive baseline for the GPU, or GPU peak numbers where no such baseline was available.

21 My opinion: The bad
The paper is more marketing than science:
Core methods are based on unreleased tools, devices, and benchmarks, so the community cannot reproduce the results and the claims are not falsifiable.
The comparison to related work is basically "all other work is based on obsolete platforms and/or not focused on emerging DNNs".
Energy efficiency is important, but so are time-to-market and cost scaling. It remains unclear whether and how the proposed accelerator template can be integrated into a productive development framework, and whether Intel can price FPGAs competitively for wide adoption in HPC (SX2800 vs. TITAN X: 15k $ vs. 1.5k $).
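The cost argument can be made concrete with a back-of-the-envelope calculation of nominal throughput per dollar. It combines the prices quoted on this slide with the nominal FP32 peaks quoted earlier in the talk (9.2 TFLOP/s for the SX2800, 11 TFLOP/s for the TITAN X); the function name is illustrative, and real procurement costs will differ.

```python
def gflops_per_dollar(peak_tflops: float, price_usd: float) -> float:
    """Nominal peak throughput bought per dollar of device price."""
    return peak_tflops * 1000.0 / price_usd


if __name__ == "__main__":
    fpga = gflops_per_dollar(9.2, 15_000)   # SX2800, price from the slide
    gpu = gflops_per_dollar(11.0, 1_500)    # TITAN X, price from the slide
    print(f"SX2800:  {fpga:.2f} GFLOP/s per $")
    print(f"TITAN X: {gpu:.2f} GFLOP/s per $")
    print(f"GPU advantage: {gpu / fpga:.1f}x")
```

On these nominal numbers the GPU delivers roughly an order of magnitude more FP32 throughput per dollar, which is the gap any energy savings would have to offset.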

22 My opinion: The ugly
They compare their next-generation devices, which still have not hit the market by Q[...], to a GPU that was in mass production in Q[...].
Their main figure of merit, energy efficiency, is based on preliminary estimations, not measurements.
Moreover, really significant advantages result only at very aggressive projections.

23 Outlook: The systems perspective
GPU-like system integration on PCIe bus: The Intel Programmable Acceleration Card with Intel Arria 10 GX FPGA plugs into a server to accelerate workloads. Announced for Q[...]. Image credit: Intel
In-package integration on server socket: Intel Arria 10 GX MCP co-integrated in a single package with a 15-core Broadwell EP, interconnected with QPI (?) for high-bandwidth, low-latency shared memory. Image credit: TheNextPlatform

Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks? Eriko Nurvitadhi, Ganesh Venkatesh, Jaewoong Sim, Debbie Marr, Randy Huang, Jason Gee Hock Ong, Yeong Tat Liew, Krishnan