nci.org.au @NCInews



1 @NCInews

2 NCI and Raijin
National Computational Infrastructure

3 Our Partners

4 Accelerators
General purpose, highly parallel processors
- High FLOPs/watt and FLOPs/$
- Unit of execution: kernel
- Separate memory subsystem
GPGPU example: Nvidia Tesla GPU (K80)
- 2 x 2496 cores (562 MHz/875 MHz)
- 2 x 12 GB RAM
- 500 GB/s mem bandwidth
- 2.91 Tflops double prec.
- 8.74 Tflops single prec.
Coprocessor example: Intel Xeon Phi 7120X (MIC architecture)
- 61 cores (244 threads)
- 16 GB RAM
- 352 GB/s mem bandwidth
- 1.2 Tflops double prec.

5 Dell C4130 Node
[Diagram: CPU0 (12 cores) and CPU1 (12 cores) linked by QPI; four K80 boards (GPU0 to GPU7) on PCIE-3 x16 links; IB FDR on a PCIE-3 x8 link]

              Raijin Node                        Dell C4130 Node
Processor     2 SandyBridge Xeon E5-2670 CPUs    2 Haswell Xeon E5-2670 v3 CPUs
#Cores        16                                 24
Memory        32 GB                              128 GB
Network       Infiniband FDR                     Infiniband FDR
Accelerator   None                               4 NVIDIA Tesla K80s

6 NVIDIA Tesla K80 GPU
[Diagram: Tesla K80 board with two GK210 GPUs, each with 12GB GDDR5, joined by a PCIe switch to a PCIe Gen3 connector]
- cores: 4992 (2 x 2496)
- memory: 24 GB (48x256M)
- memory BW: 480 GB/s (384-bit wide)
- clock base: 562 MHz
- clock max: 875 MHz
- power: 300W max.
- SP: 5.61/8.74 TFLOPs
- DP: 1.87/2.91 TFLOPs
- Architecture: Kepler
- PCIe: Gen 3 (15.7 GB)

7 GK210 and SMX
- Number of SMX: 13/15
- Manufacturing: TSMC 28nm
- Register File Size: 512 KB
- Shared Mem / L1 Cache: 128 KB
- Transistor Count: 7.1 B
[Diagram: SMX with single-prec cores, double-prec units, special-func units and load/store units]
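The peak TFLOPs figures quoted for the K80 can be cross-checked from the SMX count and the two clock rates. The sketch below is only an illustration: the per-SMX unit counts (192 single-precision cores, 64 double-precision units) are assumptions taken from NVIDIA's published Kepler specifications, since those numbers did not survive in the slide text, and `peak_tflops` is a hypothetical helper name.

```python
# Rough peak-FLOPs check for one Tesla K80 board (two GK210 GPUs).
GPUS_PER_K80 = 2
SMX_PER_GPU = 13          # 13 of 15 SMX enabled, as on the GK210 slide
SP_CORES_PER_SMX = 192    # assumption: from NVIDIA's Kepler specifications
DP_UNITS_PER_SMX = 64     # assumption: from NVIDIA's Kepler specifications
FLOPS_PER_FMA = 2         # one fused multiply-add counts as two operations


def peak_tflops(units_per_smx, clock_mhz):
    """Peak TFLOPs for the whole board at the given clock."""
    units = GPUS_PER_K80 * SMX_PER_GPU * units_per_smx
    return units * FLOPS_PER_FMA * clock_mhz * 1e6 / 1e12


for label, units in (("SP", SP_CORES_PER_SMX), ("DP", DP_UNITS_PER_SMX)):
    base = peak_tflops(units, 562)    # base clock, MHz
    boost = peak_tflops(units, 875)   # maximum autoboost clock, MHz
    print(f"{label}: {base:.2f} / {boost:.2f} TFLOPs")
```

Under these assumptions the results round to the 5.61/8.74 (SP) and 1.87/2.91 (DP) TFLOPs listed for the K80, and 2 x 13 x 192 gives the 4992 cores.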

8 Software Stack
Item            Software
OS              CentOS
Kernel          el6.x86_64
OFED            Mellanox OFED 2.1
Host Compiler   Intel-CC/
MPI Library     OpenMPI/1.6.5
MKL Library     Intel-MKL/
CUDA Toolkit    CUDA 6.5
CUDA Driver

9 HPL
[Chart: HPL GFLOPS and speedups. Actual GFLOPS and acceleration (X) for Raijin, Haswell, Haswell+2K80s, Haswell+4K80s, 2node 4K80 and 2node 8K80]
- Binary version: hpl-2.1_cuda-6.5 from NVIDIA
- Auto boost is used in all the tests (a manual clock may give better results)
- Some experiments are not fully tuned (e.g., half GPUs)
- Speedups are relative to one Raijin node

10 Power Consumption (Watts)
[Charts: power (W) over time for 2n 4K80, 2n 8K80, Haswell+2K80 and Haswell+4K80]
- System power reading from ipmi-sensors
- As a reference, 2 Raijin nodes consume ~600 W

11 GPU Autoboost
[Chart: CLOCK and POWER (power in watts)]
- HPL benchmarking on a single node using 8 GPUs
- Power consumption is for GPUs only
- Clock ranges from 374 MHz to 875 MHz

12 NAMD
[Chart: acceleration (X) for Raijin, Haswell (24 cores), Haswell+2K80s and Haswell+4K80s on apoa, f1atpase and stmv]
- GPU version: NAMD 2.10_Linux_x86_64-multicore-CUDA
- CPU version: NAMD 2.10
- Speedups are relative to one Raijin node

13 NAMD STMV Comparison with Raijin
[Chart: days/ns and power against number of nodes]
- The performance of 24 nodes using MPI is similar to that of one GPU node (24 nodes: 0.696 days/ns)
- The 24 nodes consume 5463 W, compared with 3111 W for the GPU node

14 HPL Tuning on a GPU node
[Chart: GFLOPS for the fermi, naïve and highly-tuned runs]
- HPL running on a single node with 8 GPUs, with the same input
- Code version does matter: from fermi to the NVIDIA-hpl-2.1 binary
- Tuning does matter: an optimised binary is not sufficient

15 Hybrid Programming Model
- NUMA-aware, accelerator-aware
- 1 billion vs 1000 x 1000 x 1000
- MPI + OpenMP + CUDA
Accelerator Programming
- CUDA
- OpenMP 4.0
- OpenCL
- OpenACC
- MIC

16 Locality and Affinity
Display multi-GPU and IB affinity, NUMA locality

17 Locality and Affinity
Display multi-GPU and IB affinity, NUMA locality
[Diagram: node r3596 with CPU0 (12 cores) and CPU1 (12 cores) linked by QPI; four K80 boards (GPU0 to GPU7) on PCIE-3 x16 links; IB FDR on a PCIE-3 x8 link, with Infiniband 56Gb/s connecting to r3597]

18 Execution Model of HPL
    module load openmpi/1.6.5 cuda/
    export OMP_NUM_THREADS=3
    export MKL_NUM_THREADS=3
    mpirun -np 16 --bind-to-none ... ./run_script

    # run_script
    export OMP_NUM_THREADS=3
    export MKL_NUM_THREADS=3
    # Each local rank gets one GPU and three CPU cores.
    case $OMPI_COMM_WORLD_LOCAL_RANK in
    [0])
        export CUDA_VISIBLE_DEVICES=0
        numactl --physcpubind=0,2,4 ./xhpl
        ;;
    [1])
        export CUDA_VISIBLE_DEVICES=1
        numactl --physcpubind=6,8,10 ./xhpl
        ;;
    ...
    [7])
        export CUDA_VISIBLE_DEVICES=7
        numactl --physcpubind=19,21,23 ./xhpl
        ;;
    esac
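The three case arms shown in run_script (ranks 0, 1 and 7) fit a regular pattern, which the sketch below spells out. It rests on assumptions that the slide does not state: that even-numbered cores belong to one socket and odd-numbered cores to the other, that ranks 0 to 3 use the first socket and ranks 4 to 7 the second, and that the elided ranks follow the same rule. `rank_binding` is a hypothetical helper, not part of HPL or Open MPI.

```python
def rank_binding(local_rank, ranks_per_socket=4, threads=3, sockets=2):
    """Return the GPU index and CPU cores for one local MPI rank.

    Assumes core numbers alternate between the two sockets
    (even = socket 0, odd = socket 1); this is inferred from the
    bindings shown for ranks 0, 1 and 7 and may not match the real node.
    """
    socket = local_rank // ranks_per_socket      # which CPU the rank uses
    slot = local_rank % ranks_per_socket         # position within that CPU
    first_core = sockets * threads * slot + socket
    cores = [first_core + sockets * i for i in range(threads)]
    return {"gpu": local_rank, "cores": cores}


for rank in range(8):
    binding = rank_binding(rank)
    core_list = ",".join(str(core) for core in binding["cores"])
    print(f"rank {rank}: CUDA_VISIBLE_DEVICES={binding['gpu']} "
          f"numactl --physcpubind={core_list} ./xhpl")
```

The point of such a mapping is that each rank drives one GPU from three cores on a single socket, matching OMP_NUM_THREADS=3 and MKL_NUM_THREADS=3; the actual GPU-to-socket wiring should be checked on the node itself.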

19 A program view from a computer scientist
Resource        Utilisation
Computation     CPU, ILP, Parallelism
Memory          Caching, Conflict, Locality
Communication   Bandwidth, Latency
I/O             IO Caching, Granularity
[Diagram labels: Program, Machine]

20 PAPI CUDA Component
- papi/5.4.1-cuda module on Raijin supporting CUDA counters
Sample output:
    Process GPU results on 8 GPUs...
    PAPI countervalue > cuda:::device:0:inst_executed
    PAPI countervalue > cuda:::device:0:elapsed_cycles_sm
    PAPI countervalue > cuda:::device:1:inst_executed
    PAPI countervalue > cuda:::device:1:elapsed_cycles_sm
    PAPI countervalue > cuda:::device:2:inst_executed
    PAPI countervalue > cuda:::device:2:elapsed_cycles_sm
    PAPI countervalue > cuda:::device:3:inst_executed
    PAPI countervalue > cuda:::device:3:elapsed_cycles_sm
    PAPI countervalue > cuda:::device:4:inst_executed
    PAPI countervalue > cuda:::device:4:elapsed_cycles_sm
    PAPI countervalue > cuda:::device:5:inst_executed
    PAPI countervalue > cuda:::device:5:elapsed_cycles_sm
    PAPI countervalue > cuda:::device:6:inst_executed
    PAPI countervalue > cuda:::device:6:elapsed_cycles_sm
    PAPI countervalue > cuda:::device:7:inst_executed
    PAPI countervalue > cuda:::device:7:elapsed_cycles_sm

21 Performance Modelling
Performance Modelling (or Performance Expectation)
- estimate baseline performance
- estimate potential benefit
- identify critical resources
Benchmarking is not performance modelling
Combine performance tools with analytical methods
Compute: 61.3% walltime
- 17.5% in scalar numeric ops
- 2.5% in vector numeric ops
- 80.0% in memory accesses
MPI: 31.8% walltime
- 57.6% in collective calls, process rate 12.6 MB/s
- 42.4% in point-to-point calls, process rate 108 MB/s
I/O: 6.9% walltime
- 0% in reads, process read rate 0 MB/s
- 100% in writes, process write rate 28.9 MB/s
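One simple way to "estimate potential benefit" from a walltime breakdown like the one above is Amdahl's law: only the accelerated fraction shrinks, while the rest stays fixed. The sketch below applies it to the 61.3% compute share; this is a generic textbook estimate chosen for illustration, not necessarily the analytical method used in the talk, and the candidate speedup factors are arbitrary.

```python
import math


def overall_speedup(fraction, factor):
    """Amdahl's law: speedup of the whole run when `fraction` of the
    walltime is accelerated by `factor` and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


COMPUTE_FRACTION = 0.613   # compute share of walltime from the report above

# Hypothetical compute speedups; math.inf gives the upper bound.
for factor in (2, 5, 10, math.inf):
    print(f"compute x{factor}: overall x"
          f"{overall_speedup(COMPUTE_FRACTION, factor):.2f}")
```

With MPI and I/O left at 31.8% and 6.9% of walltime, even an unlimited compute speedup is bounded at roughly 2.6x overall, which is why identifying the critical resource comes before porting.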

22 Computational Intensity
Computational Intensity = number of calculation operations per memory load/store

Example loop                           CI     Key factor
A(:) = B(:) + C(:)                     0.33   Memory
A(:) = c * B(:)                        0.5    Memory
A(:) = B(:) * C(:) + D(:)              0.5    Memory
A(:) = B(:) * C(:) + D(:) * E(:)       0.6    Memory
A(:) = c * B(:) + d * C(:)             1.0    Still memory
A(:) = c + B(:) * (d + B(:) * e)       2.0    Calculation
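The CI column can be reproduced by counting, per array element, the arithmetic operations and the array loads and stores. The counts below are an assumption about how the table was built: scalars such as c, d and e are taken to live in registers, and an array that appears twice (B in the last loop) is loaded once.

```python
# (loop, arithmetic ops per element, array loads + stores per element)
LOOPS = [
    ("A(:) = B(:) + C(:)",               1, 3),
    ("A(:) = c * B(:)",                  1, 2),
    ("A(:) = B(:) * C(:) + D(:)",        2, 4),
    ("A(:) = B(:) * C(:) + D(:) * E(:)", 3, 5),
    ("A(:) = c * B(:) + d * C(:)",       3, 3),
    ("A(:) = c + B(:) * (d + B(:) * e)", 4, 2),
]


def intensity(ops, memory_accesses):
    """Computational intensity: operations per memory load/store."""
    return ops / memory_accesses


for loop, ops, mem in LOOPS:
    print(f"{loop:36s} CI = {intensity(ops, mem):.2f}")
```

Under this counting the six loops give 0.33, 0.5, 0.5, 0.6, 1.0 and 2.0, matching the table; only the last loop does enough arithmetic per memory access to be limited by calculation rather than memory.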

23 Working and To Do
Profiling Tools
- nvprof, nsight, etc.
- PAPI CUDA components
- CUDA-aware MPI
- OpenMPI built with CUDA awareness
- GPU Direct RDMA
- PBS Scheduling and GPUs
- resource utilisation
- nvidia-smi permissions

