On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing


1 On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. Mayank Daga, Ashwin M. Aji, and Wu-chun Feng, Dept. of Computer Science

2 Sampling of fields that use GPUs: Mac OS X, Cosmology, Molecular Dynamics and Modeling, Computational Fluid Dynamics

3 GPUs in HPC
Columns: Rank, Computer, R max, R peak, %age (R max / R peak)
Rank 1: K computer, SPARC64 VIIIfx 2.0 GHz, Tofu Interconnect
Rank 2: Tianhe-1A, NUDT TH MPP, X Ghz 6C, NVIDIA GPU FT C
Rank 3: Jaguar, Cray XT5-HE Opteron 6-core 2.6 GHz
Rank 4: Nebulae, Dawning TC6300 Blade, Intel X5650, NVIDIA Tesla C2050 GPU
Rank 5: TSUBAME 2.0, HP ProLiant SL390 G7 Xeon 6C X5670, NVIDIA GPU
Systems with GPUs achieve only ~50% of R peak. Systems without GPUs achieve ~84% of R peak.

4 Architecture of a Discrete GPU. Diagram labels: System Memory (Host); x86 CPU Cores; DMA/PCIe; SIMD Engines (~500 Gflops/s); Thread Execution Control; Thread Processors; Device Memory

5 A Reason for Poor Efficiency. Diagram labels: Sequential Processor: p; Symmetric Multi-Core (N-core): p/n plus Overhead; Accelerator-based System: p' plus Data Transfer Overhead


6 A Reason for Poor Efficiency. Diagram labels: Sequential Processor: p; Symmetric Multi-Core (N-core): p/n plus Overhead; Accelerator-based System: p' plus Data Transfer Overhead. Chart (FMAD): Time (ms) for Single-core CPU, Multi-core CPU (4 cores), and Discrete GPU, split into Serial Time, Parallel Time, and Overhead

7 Ideal Efficiency Scenario. Diagram labels: Sequential Processor: p; Symmetric Multi-Core (N-core): p/n plus Overhead; Accelerator-based System: p' plus Data Transfer Overhead

8 Ideal Efficiency Scenario. Diagram labels: Sequential Processor: p; Symmetric Multi-Core (N-core): p/n plus Overhead; Accelerator-based System: p'

9 Ideal Placement of CPU and GPU Cores. Diagram labels: System Memory (Host); x86 CPU Cores; DMA/PCIe; SIMD Engines; Thread Execution Control; Thread Processors; Device Memory

10 Ideal Placement of CPU and GPU Cores: Toward a fused CPU+GPU. Diagram labels: x86 CPU Cores; SIMD Engines; Thread Execution Control; Thread Processors; Device Memory

11 Outline: Motivation; AMD Fusion APU: A Fused CPU+GPU; Revisiting Amdahl's Law; Experimental Analysis; Application Benchmarks; Results and Discussion; Conclusion and Future Work

12 AMD Fusion APU: A Fused CPU+GPU. Diagram labels: High Performance Bus and Memory Controller; System Memory; x86 CPU Cores; SIMD Engines; Thread Execution Control; Thread Processors; Unified Video Decoder; Platform Interfaces

13 State of the Data Transfers. Discrete GPU: System Memory (Host), PCIe Transfer, Device Memory. AMD Fusion APU (1st Generation): System Memory split into an x86 partition and a 192 MB SIMD Engines partition, with memcpy between them. AMD provides high speed block transfer engines that move data between the x86 and SIMD memory partitions. (AMD, "AMD Fusion Family of APUs: Enabling a Superior, Immersive PC Experience")

14 Outline: Motivation; AMD Fusion APU: A Fused CPU+GPU; Revisiting Amdahl's Law; Experimental Analysis; Application Benchmarks; Results and Discussion; Conclusion and Future Work


15 Revisiting Amdahl's Law (M. Hill and M. Marty, "Amdahl's Law in the Multi-core Era"). Charts: speedup values for different serial fractions, Symmetric Multi-core vs. Asymmetric Multi-core. Takeaway: Higher Efficiency of Asymmetric Chips
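The speedup curves on this slide follow Hill and Marty's model. A minimal sketch of their symmetric and asymmetric speedup formulas is given here, under stated assumptions: the function names are hypothetical, the slide's serial fraction is converted to their parallel fraction f, n is the chip budget in base-core equivalents (BCEs), r is the number of BCEs spent on one large core, and perf(r) = sqrt(r) is their modeling assumption, not a measurement from this talk.

```python
import math


def perf(r):
    # Hill and Marty's assumption: a core built from r base-core
    # equivalents (BCEs) runs sqrt(r) times faster than a base core.
    return math.sqrt(r)


def speedup_symmetric(serial, n, r):
    # n BCEs are split into n / r identical cores of r BCEs each.
    f = 1.0 - serial  # parallel fraction
    return 1.0 / (serial / perf(r) + f * r / (perf(r) * n))


def speedup_asymmetric(serial, n, r):
    # One large core of r BCEs plus (n - r) base cores.
    # The serial part runs on the large core; the parallel part uses all cores.
    f = 1.0 - serial  # parallel fraction
    return 1.0 / (serial / perf(r) + f / (perf(r) + n - r))


if __name__ == "__main__":
    # Hypothetical chip: 64 BCEs, one 16-BCE large core in the asymmetric case.
    for serial in (0.5, 0.1, 0.01):
        sym = speedup_symmetric(serial, n=64, r=16)
        asym = speedup_asymmetric(serial, n=64, r=16)
        print(f"serial={serial}: symmetric={sym:.2f} asymmetric={asym:.2f}")
```

With r = 1 the two formulas coincide and reduce to the classic Amdahl's Law; the gap between them only opens once part of the budget is spent on a larger core.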

16 Revisiting Amdahl's Law. Diagram labels: Sequential Processor: p; Symmetric Multi-Core (N-core): p/n plus overhead o; Accelerator-based System: p' plus overhead o

17 Revisiting Amdahl's Law. Sequential Processor: p. Accelerator-based System: p' (?) plus o (?). o_DiscreteGPU vs. o_Fusion: Fusion is expected to be better than the discrete GPU. p_DiscreteGPU vs. p_Fusion: depends on several factors, like algorithmic mapping, memory bandwidth, number of compute units, etc.
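The comparison on these slides can be sketched as a simple additive time model. This is an illustration under assumptions, not the authors' model: the function names and the example numbers are hypothetical, and the separate serial term is taken from the "Serial Time" legend of the earlier FMAD chart.

```python
def time_sequential(serial, p):
    # Sequential processor: serial part plus the parallelizable part p.
    return serial + p


def time_symmetric(serial, p, n, overhead):
    # Symmetric N-core chip: p is divided across n cores, plus overhead o.
    return serial + p / n + overhead


def time_accelerator(serial, p_prime, overhead):
    # Accelerator-based system: p' is the accelerated parallel time,
    # o is the overhead (dominated by data transfer on a discrete GPU).
    return serial + p_prime + overhead


def fused_wins(p_fused, o_fused, p_discrete, o_discrete):
    # The fused chip is faster overall when its slower kernel (larger p')
    # is more than offset by its smaller overhead o.
    return p_fused + o_fused < p_discrete + o_discrete


if __name__ == "__main__":
    # Hypothetical times in ms, chosen only to illustrate the trade-off.
    print(time_sequential(10.0, 90.0))
    print(time_symmetric(10.0, 90.0, n=4, overhead=2.5))
    print(time_accelerator(10.0, p_prime=5.0, overhead=40.0))
    print(fused_wins(p_fused=15.0, o_fused=5.0, p_discrete=5.0, o_discrete=40.0))
```

The model makes the slide's point explicit: o_Fusion is expected to be smaller than o_DiscreteGPU, while the ordering of p_Fusion and p_DiscreteGPU depends on the workload, so the total can go either way.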

18 Implications. Asymmetric chips always offer better efficiency than symmetric chips if researchers continue to address scheduling and overhead challenges. Fusing CPU and GPU cores reduces data transfer overhead to a great extent. AMD Fusion, Intel Knights Ferry, and NVIDIA Tegra are all steps in the right direction. Our focus today: AMD Fusion

19 Outline: Motivation; AMD Fusion APU: A Fused CPU+GPU; Revisiting Amdahl's Law; Experimental Analysis; Application Benchmarks; Results and Discussion; Conclusion and Future Work

20 Experimental Analysis: Systems. AMD Fusion (engineering sample): dual CPU cores + 80 GPU cores. AMD Radeon HD 5870: high-powered discrete GPU, 1600 GPU cores. AMD Radeon HD 5450: low-powered discrete GPU, 80 GPU cores.

21 Experimental Setup


22 Experimental Analysis: SHOC Benchmarks (Application Benchmarks). Bandwidth Test: measures PCIe bandwidth for discrete GPUs and memory bandwidth for the APU. FFT: measures performance of a 2-D Fast Fourier Transform; computes multiple FFTs of size 512 in parallel. MD: measures performance of pairwise calculation of the Lennard-Jones potential. Scan: measures performance of the parallel prefix sum algorithm on a large array of floating point data. Reduction: measures performance of a sum reduction operation using floating point data.
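For readers unfamiliar with the Scan and Reduction benchmarks, the two operations can be sketched as follows. This is a sequential Python simulation of the step-by-step parallel pattern, written under assumptions: the function names are hypothetical, and this is not the SHOC implementation, which runs as GPU kernels.

```python
def inclusive_scan(values):
    # Step-doubling (Hillis-Steele style) prefix sum.
    # Each pass could run all of its additions in parallel on a GPU.
    out = list(values)
    n = len(out)
    stride = 1
    while stride < n:
        prev = out[:]  # every element reads only the previous pass's values
        for i in range(stride, n):
            out[i] = prev[i] + prev[i - stride]
        stride *= 2
    return out


def tree_reduce_sum(values):
    # Pairwise (tree) sum reduction: each pass folds the upper half
    # of the active range onto the lower half.
    data = list(values)
    if not data:
        return 0
    n = len(data)
    while n > 1:
        half = (n + 1) // 2
        for i in range(n // 2):
            data[i] = data[i] + data[i + half]
        n = half
    return data[0]


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 5]
    print(inclusive_scan(sample))
    print(tree_reduce_sum(sample))
```

Both operations do very little arithmetic per byte moved, which is why the later slides label them I/O-bound.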


23 Bandwidth Test. Chart panels: Host to Device and Device to Host, each as Bandwidth (GB/s) vs. Size (KB), for Zacate APU, Radeon HD 5870, and Radeon HD 5450

24 Fast Fourier Transform (FFT). The APU reduces data transfer time for all problem sizes. Kernel execution time is higher for the APU because of its lower memory bandwidth. Chart: Time (ms) vs. Problem Size (MB), split into Data Transfer and Kernel Execution

25 Molecular Dynamics (compute-bound). The APU reduces data transfer time for all problem sizes. The kernel executes fastest on the discrete AMD 5870 due to more and faster GPU cores. The fused Zacate APU is next fastest. Chart: Time (ms) vs. Number of Atoms, split into Data Transfer and Kernel Execution
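The Molecular Dynamics benchmark is described earlier as a pairwise calculation of the Lennard-Jones potential. A minimal sketch of that calculation uses the standard form V(r) = 4ε[(σ/r)^12 − (σ/r)^6]. The function names, the default ε = σ = 1, and the all-pairs loop are assumptions for illustration; the benchmark's actual kernel and data layout are not shown in the slides.

```python
import math


def lj_potential(r, epsilon=1.0, sigma=1.0):
    # Standard 12-6 Lennard-Jones pair potential at separation r.
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def total_lj_energy(positions, epsilon=1.0, sigma=1.0):
    # Sum the pair potential over every unordered pair of atoms.
    # The O(N^2) arithmetic per atom moved is what makes this compute-bound.
    total = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            total += lj_potential(r, epsilon, sigma)
    return total


if __name__ == "__main__":
    # Hypothetical positions in units of sigma.
    atoms = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.0, 1.2, 0.0)]
    print(total_lj_energy(atoms))
```

On a GPU each atom's inner loop would typically be assigned to one thread, so more and faster cores translate directly into a shorter kernel time, consistent with the ordering reported on this slide.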


26 Scan (I/O-bound). Total execution time is equal for discrete and fused GPUs. This is stunning given that discrete GPUs have 20 times more cores. These cores are computationally more powerful as well. Chart: Time (ms) vs. Problem Size (MB), split into Data Transfer and Kernel Execution

27 Reduction (I/O-bound). Total execution time is 3 times better for the APU. The efficacy of the APU increases as the problem size increases. Chart: Time (ms) vs. Vector Size (MB), split into Data Transfer and Kernel Execution


28 Reduction. Chart panels: Transfer Time, Kernel Execution Time, and Total Execution Time, each as Time (ms) vs. Vector Size (MB), for AMD Fusion and AMD Radeon HD 5870

29 Outline: Motivation; AMD Fusion APU: A Fused CPU+GPU; Revisiting Amdahl's Law; Experimental Analysis; Application Benchmarks; Results and Discussion; Conclusion and Future Work

30 Future Work. A more robust model also capturing the computational differences between fused and discrete GPUs. Power modeling based on AMD Power Gating technology.

31 Conclusion. A fused CPU+GPU is a step in the right direction for efficient supercomputers. Data transfer overhead is largely mitigated (up to 6x). Application execution time can be largely sped up (up to 3x in some cases). No change is needed in the programming model. But this is still not a panacea: GPU cores on the APU are not yet as powerful or as plentiful in number as the discrete GPUs, and device memory bandwidth does not yet match that of discrete GPUs. Questions? Contact: Mayank Daga (mdaga@cs.vt.edu), Ashwin M. Aji (aaji@cs.vt.edu), Dr. Wu-chun Feng (feng@cs.vt.edu)
