On the efficiency of the Accelerated Processing Unit for scientific computing
1 24th High Performance Computing Symposium, Pasadena, April 5th 2016. On the efficiency of the Accelerated Processing Unit for scientific computing. I. Said, P. Fortin, J.-L. Lamotte, R. Dolbeau, H. Calandra. Contact: isaid@uh.edu
2 HPC ecosystem: The expanding demands of GPUs. I. Said, 24th High Performance Computing Symposium, 04/05/2016, 1/34. Graphics Processing Units (GPUs) are widely used for scientific computing: linear algebra, numerical simulations and iterative methods, signal processing, etc. However: applications with heavy CPU-GPU communication can be bottlenecked by the PCI Express bus, and CPU + discrete GPU systems (GPUs are not standalone) require large amounts of energy.
3 HPC ecosystem: Towards unifying CPUs and GPUs. [Block diagram: a CPU + discrete GPU system, in which a multi-core CPU (L1/L2/L3 caches) and a GPU (compute units CU 0..N-1 with PEs, register files, local memory and L1/L2 caches) communicate over the PCI Express bus between system memory and GPU main memory; versus the Accelerated Processing Unit (APU), in which a quad-core CPU module and an integrated GPU module share the system memory through the UNB and the ONION and GARLIC buses.]
4 HPC ecosystem: Why use APUs? Strengths: no PCI Express bus; integrated GPUs can address the entire memory; low-power processors (95 W TDP at most, versus up to 150 W TDP for CPUs and up to 250 W for GPUs). Weaknesses: low compute power compared to discrete GPUs (Kaveri APU (A K): 730 GFlop/s on the integrated GPU; Phenom CPU (X6 1055T): 130 GFlop/s; Tahiti GPU (HD 7970): 3700 GFlop/s), and an order of magnitude less memory bandwidth than GPUs (APU: up to 25 GB/s; GPU: up to 300 GB/s).
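A quick sanity check on these figures is the theoretical peak per TDP watt (a minimal sketch using only the peak GFlop/s and TDP values quoted on this slide; note that measured whole-system efficiency, reported later in the talk, can rank the devices differently):

```python
# Theoretical peak performance (GFlop/s) and TDP (W) as quoted on this slide.
DEVICES = {
    "Kaveri APU (integrated GPU)": (730, 95),
    "Phenom X6 1055T CPU": (130, 150),
    "Tahiti HD 7970 GPU": (3700, 250),
}

def peak_per_tdp_watt(gflops, watts):
    """Upper bound on power efficiency: theoretical peak divided by TDP."""
    return gflops / watts

for name, (gflops, watts) in DEVICES.items():
    print(f"{name}: {peak_per_tdp_watt(gflops, watts):.2f} GFlop/s/W (theoretical)")
```

By this purely theoretical metric the discrete GPU looks the most efficient; the measured system-level results on the power-efficiency slides reverse that ordering, which is the point of the study.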
5 HPC ecosystem: Motivations and context. Can we find a range of applications (with appropriate problem sizes) for which APUs may be suitable and/or more power efficient than discrete GPUs? In the scope of this work, we only consider using the integrated GPU of the APU, since it represents the major share of the compute power (Kaveri: 87%).
6 Outline: Data placement strategies; Benchmarks: details and OpenCL implementations; Floating point performance evaluation; Power efficiency evaluation; Conclusion.
8 Understanding the memory system: there are multiple memory locations, so data placement on the APU matters; software manipulations are needed to ensure zero-copy.
9 The APU memory subsystem. Onion: coherent bus (slow). Garlic: non-coherent bus (full memory bandwidth).
10 The APU memory subsystem. c: regular CPU memory (size depends on the RAM).
11 The APU memory subsystem. g: fixed size (512 MB to 4 GB). cg: explicit copy from CPU memory to GPU memory. gc: explicit copy from GPU memory to CPU memory.
12 The APU memory subsystem. u: zero-copy and non-coherent (read-only accesses from the GPU cores); fixed and limited size (up to 1 GB).
13 The APU memory subsystem. z: zero-copy and coherent memory.
14 The APU memory subsystem. p: zero-copy memory that lies in the GPU memory; limited size (up to 512 MB).
15 Data placement strategies on the APU. An OpenCL data copy kernel from buffer A to buffer B: store buffers A and B in different memory locations and evaluate the different combinations, for example cggc (explicit copy) and zz (zero-copy).
16 Data placement benchmark on the APU. [Figure: time (ms) per phase (init, iwrite, kernel, oread, obackup) for the cggc, zgc, ugc, zz, uz, up and pp strategies.] Using zero-copy yields at most 60% of the maximum sustained bandwidth. We select the most relevant strategies: cggc, ugc, up and zz.
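The same bandwidth accounting can be reproduced on any machine with a host-side copy (a hedged NumPy sketch standing in for the OpenCL copy kernel of the talk; the buffer size is an illustrative choice):

```python
import time
import numpy as np

def sustained_copy_bandwidth(nbytes=64 * 1024 * 1024, repeats=5):
    """Time a buffer-to-buffer copy and report sustained bandwidth in GB/s.

    The talk's copy kernel performs the same A -> B transfer on the GPU for
    each data placement strategy; here a NumPy copy stands in for it.
    """
    a = np.ones(nbytes, dtype=np.uint8)
    b = np.empty_like(a)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.copyto(b, a)
        best = min(best, time.perf_counter() - t0)
    # Each copy reads A once and writes B once: 2 * nbytes of traffic.
    return 2 * nbytes / best / 1e9

print(f"sustained copy bandwidth: {sustained_copy_bandwidth():.1f} GB/s")
```

Dividing such a measured figure by the platform's peak bandwidth gives the "fraction of maximum sustained bandwidth" criterion used to rank the strategies.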
17 Outline: Data placement strategies; Benchmarks: details and OpenCL implementations; Floating point performance evaluation; Power efficiency evaluation; Conclusion.
18 Applicative benchmarks on the APU. Matrix multiplication: a building block in linear algebra (e.g. the LINPACK benchmark); a compute-bound algorithm; evaluates the sustained compute gap between GPUs and APUs. 8th-order 3D finite difference stencil: a building block of seismic workflows (e.g. Reverse Time Migration); a memory-bound algorithm; evaluates the APU memory performance and the impact of the data placement strategies on the APU performance.
19 Matrix multiplication: SGEMM (BLAS). C = βC + αAB, with α = β = 1. A, B and C are square matrices of dimension N×N, N ∈ [64, 4096]. Compute complexity O(N³), storage complexity O(N²). We include the possible CPU-GPU data transfers and study their impact on the application performance.
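The benchmarked operation and its flop count can be pinned down with a short reference implementation (a NumPy sketch of the SGEMM definition above, not the OpenCL kernels actually benchmarked):

```python
import numpy as np

def sgemm(alpha, A, B, beta, C):
    """C := beta*C + alpha*A@B, the BLAS SGEMM operation benchmarked here."""
    return beta * C + alpha * (A @ B)

def sgemm_gflops(n, seconds):
    """2*N^3 flops (N^3 multiplies + N^3 adds) for square N x N matrices."""
    return 2 * n**3 / seconds / 1e9

n = 64
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)
C = np.random.rand(n, n).astype(np.float32)
out = sgemm(1.0, A, B, 1.0, C)   # alpha = beta = 1, as in the talk
```

The 2N³ flop count divided by the measured kernel time is what the GFlop/s axes on the following slides report.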
20 Matrix multiplication: OpenCL deployment. A 2D work-item grid over the 2D square matrices (N ∈ [64, 4096]), with X elements of C per work-item (X = 2 or X = 4 in practice). Implementations: scalar (global memory; natural blocking thanks to OpenCL), local scalar (A and B are partitioned using the local memory), vectorized (explicit vectorization), local vectorized (local memory + explicit vectorization), image (cache-friendly tiled layout format using the texture memory).
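The "local" variants tile A and B so that each work-group reuses a block loaded into fast on-chip memory. The blocking pattern (not the OpenCL syntax) can be sketched as follows:

```python
import numpy as np

def blocked_matmul(A, B, tile=16):
    """Tiled matrix multiply: each (i, j) tile of C accumulates products of
    tile-sized blocks of A and B, mirroring how the 'local' OpenCL kernels
    stage blocks of A and B in local (on-chip) memory before using them."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                # In OpenCL these two blocks would be copied into local memory
                # once per work-group and reused by all of its work-items.
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.rand(64, 64)
B = np.random.rand(64, 64)
assert np.allclose(blocked_matmul(A, B), A @ B)
```

Each element of A and B is read from slow memory n/tile times instead of n times, which is why partitioning pays off on both the GPU and the APU.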
21 Finite difference stencil: a linear combination of neighboring values weighted by coefficients: U_{i,j,k} = Σ_{l=-p/2}^{p/2} a_l U_{i+l,j,k} + Σ_{l=-p/2}^{p/2} a_l U_{i,j+l,k} + Σ_{l=-p/2}^{p/2} a_l U_{i,j,k+l}, with p = 8. Problem sizes N×N×32, N ∈ [64, 1024]. Compute complexity O(N³), storage complexity O(N³). Data snapshotting every K iterations (K ∈ [1, 10]).
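As a concrete instance of the formula above, here is the p = 8 stencil applied along one axis (a sketch: the coefficients a_l are taken to be the standard 8th-order central weights for a second derivative, which may differ from the ones used in the talk):

```python
import numpy as np

# Standard 8th-order central coefficients for d2/dx2 on a unit grid;
# assumed here for illustration, the talk's a_l coefficients may differ.
COEFFS = {0: -205.0/72.0, 1: 8.0/5.0, 2: -1.0/5.0, 3: 8.0/315.0, 4: -1.0/560.0}

def stencil_1d(u):
    """Apply the p = 8 stencil along a 1-D array, interior points only."""
    p = 8
    out = np.zeros(len(u) - p)
    for l in range(-p // 2, p // 2 + 1):
        a_l = COEFFS[abs(l)]           # symmetric weights: a_{-l} = a_l
        out += a_l * u[p // 2 + l : len(u) - p // 2 + l]
    return out

# Sanity check: the second derivative of x^2 is 2 everywhere, and the
# scheme is exact on low-degree polynomials.
x = np.arange(32, dtype=float)
print(stencil_1d(x**2))  # ~2.0 at every interior point
```

The 3D kernel of the talk applies this same 1-D combination along each of the three axes and sums the results, giving 25 input points per output point.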
22 Finite difference stencil: OpenCL deployment. A 2D work-item grid over the 3D domain, with X columns along the Z axis per work-item (X = 2 or X = 4 in practice) and register blocking when traversing the Z dimension. Implementations: scalar (global memory), local scalar (local memory to exploit memory access redundancies), vectorized (global memory + explicit vectorization), local vectorized (local memory + explicit vectorization).
23 Outline: Data placement strategies; Benchmarks: details and OpenCL implementations; Floating point performance evaluation; Power efficiency evaluation; Conclusion.
24 Matrix multiplication: CPU performance. [Figure: GFlop/s versus N (higher is better), with the theoretical peak performance and the scalar, vectorized, local vectorized, OpenMP and GotoBLAS versions.] OpenCL > OpenMP thanks to the OpenCL natural blocking; vectorized > scalar thanks to SSE; local vectorized > vectorized thanks to partitioning A and B; GotoBLAS is the best thanks to close-to-hardware optimizations.
25 Matrix multiplication: GPU performance. [Figure: GFlop/s versus N (higher is better), with the theoretical peak performance and the scalar, vectorized, image, local scalar and local vectorized versions.] local vectorized (up to TFlop/s) > vectorized; OpenCL images offer only a 7% enhancement (GCN); vectorized versions >> scalar versions (not expected).
26 Matrix multiplication: APU performance. [Figure: GFlop/s versus N (higher is better), with the theoretical peak performance and the scalar, vectorized, image, local scalar and local vectorized versions.] Similarly to the GPU, vectorized versions > scalar versions, and the local memory enhances the performance; OpenCL images improve the performance by 25%.
27 Matrix multiplication: APU performance and data placement strategies. We now include the timing of the CPU-GPU interactions, take the best OpenCL implementations (vectorized, local vectorized) and combine them with the data placement strategies cggc, ugc, up and zz. [Figure: GFlop/s versus N (higher is better) for each implementation-strategy pair, with the theoretical peak performance.] Best: local vectorized coupled with the cggc data placement strategy; local vectorized-zz is only 3% lower than local vectorized-cggc (an enhancement of the Onion bandwidth as compared to older APUs).
28 Matrix multiplication: performance comparison. [Figure: GFlop/s versus N (higher is better) for the CPU (GotoBLAS), APU, GPU, and a performance projection APU(Onion=Garlic*), where * mimics upcoming APUs with fully unified memory.] CPU > APU for N ≤ 100 (small matrices); APU > GPU for N ≤ 700 (medium-sized matrices); GPU > APU for N > 700 (large matrices, for which the transfer times are small compared to the computation).
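The crossover can be reproduced with a simple cost model (the sustained rates and PCI Express bandwidth below are illustrative assumptions, not the measured values behind the figure):

```python
def matmul_time(n, gflops, pcie_gbps=None):
    """Seconds for one N x N SGEMM: 2*N^3 flops of compute, plus, for a
    discrete GPU, transferring A and B in and C out over PCI Express
    (3 * N^2 single-precision floats)."""
    t = 2 * n**3 / (gflops * 1e9)
    if pcie_gbps is not None:
        t += 3 * n * n * 4 / (pcie_gbps * 1e9)
    return t

# Illustrative sustained rates (assumptions, not the paper's measurements).
APU_GFLOPS, GPU_GFLOPS, PCIE_GBPS = 400, 1500, 6

for n in (128, 512, 2048):
    apu = matmul_time(n, APU_GFLOPS)
    gpu = matmul_time(n, GPU_GFLOPS, PCIE_GBPS)
    print(f"N={n}: APU {apu*1e3:.2f} ms, GPU {gpu*1e3:.2f} ms")
```

Because the transfer term grows as N² while the compute term grows as N³, the transfer-free APU wins below some N and the faster GPU wins above it; with these toy numbers the crossover lands near N ≈ 550, in the same regime as the measured N ≈ 700.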
29 Finite difference stencil: CPU performance. [Figure: GFlop/s versus N×N×32 (higher is better) for the scalar, vectorized, local vectorized and OpenMP versions.] Explicit vectorization (SSE) helped deliver the best performance.
30 Finite difference stencil: GPU performance. [Figure: GFlop/s versus N×N×32 (higher is better) for the scalar, vectorized, local scalar and local vectorized versions.] Scalar matches vectorized thanks to GCN (Graphics Core Next); scalar code + OpenCL local memory offered the best performance.
31 Finite difference stencil: APU performance. [Figure: GFlop/s versus N×N×32 (higher is better) for the scalar, vectorized, local scalar and local vectorized versions.] local scalar gives the best performance numbers for N ≥ 128; vectorization is not needed thanks to GCN.
32 Finite difference stencil: APU performance and data placement strategies. Fixed problem size 1024x1024x32 (128 MB); one snapshot every K computations (1 ≤ K ≤ 10); we select the best OpenCL implementations (scalar, local scalar) and combine them with the data placement strategies cggc, ugc, up and zz. [Figure: GFlop/s versus K computations + 1 snapshot (higher is better) for each implementation-strategy pair.] Best: local scalar with zz for 1 ≤ K ≤ 3 and with cggc for 3 < K ≤ 10. Kaveri is the first APU (compared to older ones) that enables performance gains when using zero-copy buffers.
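The K-dependence is a plain amortization effect: the snapshot transfer cost is paid once per K stencil iterations, so effective throughput rises with K. A small model makes this explicit (all timings below are illustrative placeholders, not the paper's measurements):

```python
def effective_gflops(k, t_iter=0.025, t_snapshot=0.02, gflop_per_iter=2.0):
    """Effective GFlop/s when one snapshot follows every K stencil iterations.

    t_iter (s per iteration), t_snapshot (s per snapshot transfer) and
    gflop_per_iter are assumed illustrative values, not measured ones.
    """
    return k * gflop_per_iter / (k * t_iter + t_snapshot)

for k in (1, 2, 5, 10):
    print(f"K={k}: {effective_gflops(k):.1f} GFlop/s effective")
```

As K grows the snapshot cost is amortized away and the effective rate approaches the compute-only rate (here 2.0/0.025 = 80 GFlop/s); at K = 1 the transfer dominates, which is exactly the regime where zero-copy placement pays off.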
33 Finite difference stencil: performance comparison. [Figure: GFlop/s versus K computations + 1 snapshot (higher is better) for the CPU, APU, GPU, and a performance projection APU(Onion=Garlic*), where * mimics upcoming APUs with fully unified memory.] APU > CPU for all K; GPU > APU for 2 ≤ K ≤ 10; APU > GPU when performing one snapshot after each iteration (K = 1).
34 APU performance evaluation: conclusions. The APU can be an attractive solution for a high rate of data snapshotting (finite difference) and for medium-sized problems (matrix multiplication). In the other cases there is a 3x to 4x practical GPU/APU performance gap (against a 5x to 10x theoretical gap). Power is gaining interest in the HPC community (Green500, the power wall and Exascale): what about power consumption?
35 Outline: Data placement strategies; Benchmarks: details and OpenCL implementations; Floating point performance evaluation; Power efficiency evaluation; Conclusion.
36 Power efficiency: methodology for power measurement. Tools and metric: a Raritan PX (DPXR8A-16) PDU monitors the power consumption; the performance-per-Watt (PPW) metric is used. Methodology: we measure the power drawn by the system as a whole, with the same functional hardware components for the three architectures (CPU+GPU for the GPU-based solutions); the electric efficiency of the Power Supply Units (PSUs) matters.
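The PPW metric itself is straightforward to compute from the PDU samples (a hedged sketch; the sample format and the run below are hypothetical):

```python
def performance_per_watt(total_gflop, elapsed_s, power_samples_w):
    """GFlop/s/W: sustained compute rate divided by the mean whole-system
    power draw, as sampled by the PDU over the run."""
    mean_power = sum(power_samples_w) / len(power_samples_w)
    return (total_gflop / elapsed_s) / mean_power

# Hypothetical run: 1000 GFlop in 2 s while the PDU reports around 150 W.
print(performance_per_watt(1000, 2.0, [148, 151, 150, 149, 152]))  # ~3.33
```

Because the whole system is measured, idle components, the memory system and PSU losses all count against each architecture, which is what makes the comparison fair across CPU, GPU and APU boxes.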
37 Power efficiency: matrix multiplication. [Figure: GFlop/s/W versus N (higher is better) for cpu, tahiti and kaveri; measured system power draws up to 55 W, 197 W and 145 W.] The CPU offers a very low power efficiency (about 1 GFlop/s/W); the APU is 20% more power efficient than the GPU.
38 Power efficiency: finite difference stencil. [Figure: GFlop/s/W versus K computations + 1 snapshot (higher is better) for the CPU, GPU and APU; problem size 1024x1024x32 (128 MB); measured system power draws up to 62 W, 222 W and 159 W.] The CPU offers a very low power efficiency (0.08 GFlop/s/W); the APU is 13% more power efficient than the GPU. The gain is higher for the compute-bound algorithm (matrix multiplication): flops consume less power than memory accesses.
39 Outline: Data placement strategies; Benchmarks: details and OpenCL implementations; Floating point performance evaluation; Power efficiency evaluation; Conclusion.
40 Conclusions and perspectives. Conclusions: the APU (almost) always outperforms the CPU; the APU can match or outperform discrete GPUs for some medium-sized problems and for problems with high communication requirements (snapshotting). Performance + power consumption: despite a 3.3-fold performance difference, APUs are more power efficient than GPUs. Perspectives: promising features with upcoming APUs: full memory unification (at the hardware level); HBM (High Bandwidth Memory) and an increased compute unit count; OpenPOWER and NVLink.
Collaborators: Richard T. Mills, Argonne National Laboratory Sarat Sreepathi, Oak Ridge National Laboratory Forrest M. Hoffman, Oak Ridge National Laboratory Jitendra Kumar, Oak Ridge National Laboratory
More informationEarly Experiences Writing Performance Portable OpenMP 4 Codes
Early Experiences Writing Performance Portable OpenMP 4 Codes Verónica G. Vergara Larrea Wayne Joubert M. Graham Lopez Oscar Hernandez Oak Ridge National Laboratory Problem statement APU FPGA neuromorphic
More informationPedraforca: a First ARM + GPU Cluster for HPC
www.bsc.es Pedraforca: a First ARM + GPU Cluster for HPC Nikola Puzovic, Alex Ramirez We ve hit the power wall ALL computers are limited by power consumption Energy-efficient approaches Multi-core Fujitsu
More informationFPGA-based Supercomputing: New Opportunities and Challenges
FPGA-based Supercomputing: New Opportunities and Challenges Naoya Maruyama (RIKEN AICS)* 5 th ADAC Workshop Feb 15, 2018 * Current Main affiliation is Lawrence Livermore National Laboratory SIAM PP18:
More informationAccelerating Molecular Modeling Applications with Graphics Processors
Accelerating Molecular Modeling Applications with Graphics Processors John Stone Theoretical and Computational Biophysics Group University of Illinois at Urbana-Champaign Research/gpu/ SIAM Conference
More informationWhat is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms
CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D
More informationPerformance of Multicore LUP Decomposition
Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations
More informationOpenStaPLE, an OpenACC Lattice QCD Application
OpenStaPLE, an OpenACC Lattice QCD Application Enrico Calore Postdoctoral Researcher Università degli Studi di Ferrara INFN Ferrara Italy GTC Europe, October 10 th, 2018 E. Calore (Univ. and INFN Ferrara)
More informationEfficient and Scalable Shading for Many Lights
Efficient and Scalable Shading for Many Lights 1. GPU Overview 2. Shading recap 3. Forward Shading 4. Deferred Shading 5. Tiled Deferred Shading 6. And more! First GPU Shaders Unified Shaders CUDA OpenCL
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A
More informationIntroduction to Multicore Programming
Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming
More informationSolving Dense Linear Systems on Graphics Processors
Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad
More informationHPCC Results. Nathan Wichmann Benchmark Engineer
HPCC Results Nathan Wichmann Benchmark Engineer Outline What is HPCC? Results Comparing current machines Conclusions May 04 2 HPCChallenge Project Goals To examine the performance of HPC architectures
More informationPyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python. F.D. Witherden, M. Klemm, P.E. Vincent
PyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python F.D. Witherden, M. Klemm, P.E. Vincent 1 Overview Motivation. Accelerators and Modern Hardware Python and PyFR. Summary. Motivation
More informationLecture 3: Intro to parallel machines and models
Lecture 3: Intro to parallel machines and models David Bindel 1 Sep 2011 Logistics Remember: http://www.cs.cornell.edu/~bindel/class/cs5220-f11/ http://www.piazza.com/cornell/cs5220 Note: the entire class
More informationArm Processor Technology Update and Roadmap
Arm Processor Technology Update and Roadmap ARM Processor Technology Update and Roadmap Cavium: Giri Chukkapalli is a Distinguished Engineer in the Data Center Group (DCG) Introduction to ARM Architecture
More informationPorting a parallel rotor wake simulation to GPGPU accelerators using OpenACC
DLR.de Chart 1 Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC Melven Röhrig-Zöllner DLR, Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU)
More informationChallenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery
Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured
More informationHigh performance Computing and O&G Challenges
High performance Computing and O&G Challenges 2 Seismic exploration challenges High Performance Computing and O&G challenges Worldwide Context Seismic,sub-surface imaging Computing Power needs Accelerating
More informationThe Many-Core Revolution Understanding Change. Alejandro Cabrera January 29, 2009
The Many-Core Revolution Understanding Change Alejandro Cabrera cpp.cabrera@gmail.com January 29, 2009 Disclaimer This presentation currently contains several claims requiring proper citations and a few
More informationImplicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC
Fourth Workshop on Accelerator Programming Using Directives (WACCPD), Nov. 13, 2017 Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC Takuma
More informationThe GPU as a co-processor in FEM-based simulations. Preliminary results. Dipl.-Inform. Dominik Göddeke.
The GPU as a co-processor in FEM-based simulations Preliminary results Dipl.-Inform. Dominik Göddeke dominik.goeddeke@mathematik.uni-dortmund.de Institute of Applied Mathematics University of Dortmund
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationLECTURE ON PASCAL GPU ARCHITECTURE. Jiri Kraus, November 14 th 2016
LECTURE ON PASCAL GPU ARCHITECTURE Jiri Kraus, November 14 th 2016 ACCELERATED COMPUTING CPU Optimized for Serial Tasks GPU Accelerator Optimized for Parallel Tasks 2 ACCELERATED COMPUTING CPU Optimized
More informationPreparing seismic codes for GPUs and other
Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of
More informationNVidia s GPU Microarchitectures. By Stephen Lucas and Gerald Kotas
NVidia s GPU Microarchitectures By Stephen Lucas and Gerald Kotas Intro Discussion Points - Difference between CPU and GPU - Use s of GPUS - Brie f History - Te sla Archite cture - Fermi Architecture -
More informationEfficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford
Efficiency and Programmability: Enablers for ExaScale Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Scientific Discovery and Business Analytics Driving an Insatiable
More informationProfiling GPU Code. Jeremy Appleyard, February 2016
Profiling GPU Code Jeremy Appleyard, February 2016 What is Profiling? Measuring Performance Measuring application performance Usually the aim is to reduce runtime Simple profiling: How long does an operation
More informationParticle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA
Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran
More information"On the Capability and Achievable Performance of FPGAs for HPC Applications"
"On the Capability and Achievable Performance of FPGAs for HPC Applications" Wim Vanderbauwhede School of Computing Science, University of Glasgow, UK Or in other words "How Fast Can Those FPGA Thingies
More informationHETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE
HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE Haibo Xie, Ph.D. Chief HSA Evangelist AMD China OUTLINE: The Challenges with Computing Today Introducing Heterogeneous System Architecture (HSA)
More informationEmbedded real-time stereo estimation via Semi-Global Matching on the GPU
Embedded real-time stereo estimation via Semi-Global Matching on the GPU Daniel Hernández Juárez, Alejandro Chacón, Antonio Espinosa, David Vázquez, Juan Carlos Moure and Antonio M. López Computer Architecture
More informationParallel FFT Program Optimizations on Heterogeneous Computers
Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid
More informationUsing GPUs for unstructured grid CFD
Using GPUs for unstructured grid CFD Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Schlumberger Abingdon Technology Centre, February 17th, 2011
More information