NCI and Raijin
GPGPU and Coprocessor Accelerators
- General-purpose, highly parallel processors
- High FLOPs/watt and FLOPs/$
- Unit of execution: the kernel
- Separate memory subsystem

GPGPU example: NVIDIA Tesla K80 GPU
- 2 x 2496 cores (562 MHz / 875 MHz)
- 2 x 12 GB RAM
- 500 GB/s memory bandwidth
- 2.91 TFLOPS double precision
- 8.74 TFLOPS single precision

Coprocessor example: Intel Xeon Phi 7120X (MIC architecture)
- 61 cores (244 threads)
- 16 GB RAM
- 352 GB/s memory bandwidth
- 1.2 TFLOPS double precision
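As a minimal sketch of these two points (the kernel as the unit of execution, and the separate device memory), the CUDA example below is our own illustration rather than anything from the benchmarks that follow; the saxpy kernel name and sizes are arbitrary.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Kernel: the unit of execution on the GPU, run by many threads in parallel.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *x = (float *)malloc(bytes), *y = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // Separate memory subsystem: allocate on the device and copy explicitly.
    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);   // kernel launch
    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %f\n", y[0]);   // expect 4.0
    cudaFree(d_x); cudaFree(d_y); free(x); free(y);
    return 0;
}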
Dell C4130 Node
[Diagram: node topology - CPU0 and CPU1 (12 cores each, linked by QPI); four K80s (GPU0-GPU7) attached over PCIe-3 x16; FDR InfiniBand HCA on PCIe-3 x8]

                Raijin Node                    Dell C4130 Node
Processor       2 x SandyBridge Xeon E5-2670   2 x Haswell Xeon E5-2670 v3
#Cores          16                             24
Memory          32 GB                          128 GB
Network         InfiniBand FDR                 InfiniBand FDR
Accelerator     None                           4 x NVIDIA Tesla K80
NVIDIA Tesla K80 GPU
[Diagram: K80 board - two GK210 GPUs, each with 12 GB GDDR5, behind a PCIe switch on a PCIe Gen3 connector]

Tesla K80 (2 x GK210)
Cores                   4992
Memory                  24 GB (48 x 256M)
Memory BW               480 GB/s (384-bit wide)
Base clock              562 MHz
Max boost clock         875 MHz
Power                   300 W max.
SP (base/boost)         5.61 / 8.74 TFLOPS
DP (base/boost)         1.87 / 2.91 TFLOPS
Architecture            Kepler
PCIe                    Gen 3 (15.7 GB/s)
GK210 and SMX
Number of SMX           13 (of 15 on the die)
Manufacturing           TSMC 28 nm
Register file size      512 KB
Shared mem / L1 cache   128 KB
Transistor count        7.1 B

Per SMX
Single-precision cores  192
Double-precision units  64
Special-function units  32
Load/store units        32
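As a quick sanity check we add here (assuming each core or DP unit retires one fused multiply-add, i.e. 2 FLOPs, per cycle), the TFLOPS figures quoted earlier follow directly from these unit counts and the clocks:

\[
\begin{aligned}
\text{SP peak} &= 2 \times 13 \times 192 \times 2 \times 875\,\text{MHz} \approx 8.74\ \text{TFLOPS} \\
\text{DP peak} &= 2 \times 13 \times 64 \times 2 \times 875\,\text{MHz} \approx 2.91\ \text{TFLOPS}
\end{aligned}
\]

where the leading 2 is the two GK210s on the board and 13 the active SMX per GK210; at the 562 MHz base clock the same arithmetic gives 5.61 and 1.87 TFLOPS.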
Software Stack
Item            Software
OS              CentOS (kernel 3.14.46.el6.x86_64)
OFED            Mellanox OFED 2.1
Host compiler   Intel-CC/12.1.9.293
MPI library     OpenMPI/1.6.5
MKL library     Intel-MKL/12.1.9.293
CUDA toolkit    CUDA 6.5
CUDA driver     340.87
HPL
[Chart: HPL GFLOPS and speedups per configuration]

Configuration      GFLOPS   Speedup
Raijin             302      1.00x
Haswell            742      2.46x
Haswell + 2 K80s   2084     6.90x
Haswell + 4 K80s   5122     16.95x
2 nodes, 4 K80s    6659     22.04x
2 nodes, 8 K80s    13960    46.20x

- Binary version: hpl-2.1_cuda-6.5 from NVIDIA
- Auto boost is used in all the tests (manually set clocks may give better results)
- Some experiments are not fully tuned (e.g., half the GPUs)
- Speedups are based on one Raijin node
Power Consumption (Watts)
[Charts: power (W) over time for the Haswell+2K80, Haswell+4K80, 2-node 4K80 and 2-node 8K80 runs]
- System power readings from ipmi-sensors
- As a reference, 2 Raijin nodes consume ~600 W
GPU Autoboost Clock
[Chart: GPU clock (MHz) and GPU power (W) over time]
- HPL benchmarking on a single node using 8 GPUs
- Power consumption is for the GPUs only
- Clock ranges from 374 MHz to 875 MHz
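The slide does not say how these traces were collected; one hedged way to gather per-GPU clock and power samples is a small NVML poller such as the sketch below (standard NVML calls; the sampling interval and duration are arbitrary; link with -lnvidia-ml).

#include <stdio.h>
#include <unistd.h>
#include <nvml.h>

int main(void)
{
    unsigned int ngpus;
    nvmlInit();
    nvmlDeviceGetCount(&ngpus);

    for (int sample = 0; sample < 60; sample++) {        // one sample per second
        for (unsigned int i = 0; i < ngpus; i++) {
            nvmlDevice_t dev;
            unsigned int sm_mhz, mwatts;
            nvmlDeviceGetHandleByIndex(i, &dev);
            nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_mhz);  // current SM clock
            nvmlDeviceGetPowerUsage(dev, &mwatts);                // board power in mW
            printf("%d gpu%u clock=%u MHz power=%.1f W\n",
                   sample, i, sm_mhz, mwatts / 1000.0);
        }
        sleep(1);
    }
    nvmlShutdown();
    return 0;
}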
NAMD
[Chart: NAMD speedup over one Raijin node for the apoa1, f1atpase and stmv benchmarks; Raijin = 1.00x, Haswell (24 cores) ~2.0-2.3x, Haswell + 2 K80s ~5.7-8.5x, Haswell + 4 K80s ~9.0-13.7x]
- GPU version: NAMD 2.10_Linux_x86_64-multicore-CUDA
- CPU version: NAMD 2.10
- Speedups are based on one Raijin node
NAMD STMV Comparison with Raijin
[Chart: days/ns and power (W) vs number of Raijin nodes (4-36)]
- The performance of 24 Raijin nodes using MPI is similar to one GPU node: 0.696 days/ns vs 0.681 days/ns
- The power consumption is 5463 W for the 24 Raijin nodes compared to 3111 W for the GPU node
HPL Tuning on a GPU Node
Binary         GFLOPS
fermi          3804
naïve          5936
highly-tuned   6659

- HPL running on a single node with 8 GPUs, with the same input
- Code version does matter: from the fermi code to the NVIDIA hpl-2.1 binary
- Tuning does matter: an optimised binary alone is not sufficient
Hybrid Programming Model
- NUMA-aware, accelerator-aware
- 1 billion vs 1000 x 1000 x 1000
- MPI + OpenMP + CUDA (see the sketch below)

Accelerator programming: CUDA, OpenMP 4.0, OpenCL, OpenACC, MIC
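A minimal sketch of how the three levels can be combined (our own illustration, not the HPL or NAMD code), assuming only standard MPI-3, OpenMP and the CUDA runtime: each rank derives its GPU from its node-local rank, runs OpenMP threads on the host, and launches kernels on its own device. OpenMPI's OMPI_COMM_WORLD_LOCAL_RANK variable, as used in the HPL run script later, is an alternative to MPI_Comm_split_type.

#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *a, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Accelerator awareness: map this rank to a GPU via its node-local rank.
    MPI_Comm local;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local);
    int local_rank, ngpus;
    MPI_Comm_rank(local, &local_rank);
    cudaGetDeviceCount(&ngpus);
    cudaSetDevice(local_rank % ngpus);

    // Host-side OpenMP work alongside the GPU (NUMA placement via numactl/OMP_PLACES).
    double host_sum = 0.0;
    #pragma omp parallel for reduction(+:host_sum)
    for (int i = 0; i < 1000000; i++)
        host_sum += i * 1.0e-6;

    // Device-side work on this rank's GPU.
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d);

    printf("rank %d uses GPU %d of %d, host_sum=%.3f\n",
           rank, local_rank % ngpus, ngpus, host_sum);

    MPI_Finalize();
    return 0;
}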
Locality and Affinity Display
- Multi-GPU and IB affinity, NUMA locality
[Diagram: node r3596 - CPU0 and CPU1 (12 cores each, linked by QPI); four K80s (GPU0-GPU7) on PCIe-3 x16; FDR InfiniBand (56 Gb/s, PCIe-3 x8) connected to r3597]
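One hedged way to recover this locality information programmatically (an illustration; the figure above may have been produced with other tools) is to ask the CUDA runtime for each GPU's PCI bus ID and match it against the NUMA node exposed under /sys:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);

    for (int i = 0; i < ngpus; i++) {
        char busid[64];
        cudaDeviceProp prop;
        cudaDeviceGetPCIBusId(busid, sizeof(busid), i);   // e.g. "0000:04:00.0"
        cudaGetDeviceProperties(&prop, i);
        // The owning NUMA node can then be read from
        // /sys/bus/pci/devices/<busid>/numa_node to decide CPU binding.
        printf("GPU %d: %s  PCI %s\n", i, prop.name, busid);
    }
    return 0;
}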
Execution Model of HPL

module load openmpi/1.6.5 cuda/6.5 ...
export OMP_NUM_THREADS=3
export MKL_NUM_THREADS=3
mpirun -np 16 --bind-to-none ... ./run_script

# run_script
export OMP_NUM_THREADS=3
export MKL_NUM_THREADS=3
case $OMPI_COMM_WORLD_LOCAL_RANK in
[0]) export CUDA_VISIBLE_DEVICES=0
     numactl --physcpubind=0,2,4 ./xhpl
     ;;
[1]) export CUDA_VISIBLE_DEVICES=1
     numactl --physcpubind=6,8,10 ./xhpl
     ;;
...
[7]) export CUDA_VISIBLE_DEVICES=7
     numactl --physcpubind=19,21,23 ./xhpl
     ;;
esac
A Program View from a Computer Scientist
Resource utilisation (mapping the program onto the machine):
- Computation: CPU, ILP, parallelism
- Memory: caching, conflicts, locality
- Communication: bandwidth, latency
- I/O: IO caching, granularity
PAPI CUDA Component
The papi/5.4.1-cuda module on Raijin supports CUDA counters.

Sample output:
Process GPU results on 8 GPUs ...
PAPI counter value 4432  --> cuda:::device:0:inst_executed
PAPI counter value 9977  --> cuda:::device:0:elapsed_cycles_sm
PAPI counter value 4432  --> cuda:::device:1:inst_executed
PAPI counter value 10228 --> cuda:::device:1:elapsed_cycles_sm
PAPI counter value 4432  --> cuda:::device:2:inst_executed
PAPI counter value 9961  --> cuda:::device:2:elapsed_cycles_sm
PAPI counter value 4432  --> cuda:::device:3:inst_executed
PAPI counter value 9885  --> cuda:::device:3:elapsed_cycles_sm
PAPI counter value 4432  --> cuda:::device:4:inst_executed
PAPI counter value 9942  --> cuda:::device:4:elapsed_cycles_sm
PAPI counter value 4432  --> cuda:::device:5:inst_executed
PAPI counter value 9852  --> cuda:::device:5:elapsed_cycles_sm
PAPI counter value 4432  --> cuda:::device:6:inst_executed
PAPI counter value 9836  --> cuda:::device:6:elapsed_cycles_sm
PAPI counter value 4432  --> cuda:::device:7:inst_executed
PAPI counter value 9757  --> cuda:::device:7:elapsed_cycles_sm
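A hedged sketch of how a code might read such counters through the PAPI CUDA component (event names as in the output above; the choice of device 0 and the two events is illustrative):

#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int evset = PAPI_NULL;
    long long values[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    PAPI_create_eventset(&evset);

    // CUDA component events, named as in the sample output above.
    PAPI_add_named_event(evset, "cuda:::device:0:inst_executed");
    PAPI_add_named_event(evset, "cuda:::device:0:elapsed_cycles_sm");

    PAPI_start(evset);
    /* ... launch CUDA kernels on device 0 here ... */
    PAPI_stop(evset, values);

    printf("inst_executed     = %lld\n", values[0]);
    printf("elapsed_cycles_sm = %lld\n", values[1]);

    PAPI_shutdown();
    return 0;
}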
Performance Modelling
Performance modelling (or performance expectation):
- estimate baseline performance
- estimate potential benefit
- identify critical resources

Benchmarking is not performance modelling; combine performance tools with analytical methods.

Example breakdown of a run:
Compute: 61.3% of walltime
- 17.5% in scalar numeric ops
- 2.5% in vector numeric ops
- 80.0% in memory accesses
MPI: 31.8% of walltime
- 57.6% in collective calls, process rate 12.6 MB/s
- 42.4% in point-to-point calls, process rate 108 MB/s
I/O: 6.9% of walltime
- 0% in reads, process read rate 0 MB/s
- 100% in writes, process write rate 28.9 MB/s
Computational Intensity
Computational intensity (CI) = number of calculation operations per memory load/store

Example loop                           CI     Key factor
A(:) = B(:) + C(:)                     0.33   Memory
A(:) = c * B(:)                        0.5    Memory
A(:) = B(:) * C(:) + D(:)              0.5    Memory
A(:) = B(:) * C(:) + D(:) * E(:)       0.6    Memory
A(:) = c * B(:) + d * C(:)             1.0    Still memory
A(:) = c + B(:) * (d + B(:) * e)       2.0    Calculation
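CI matters because it bounds achievable throughput. As a roofline-style estimate we add here (not stated on the slide), with memory bandwidth $BW$ in bytes/s and $w$ bytes per element:

\[
P_{\text{attainable}} \;\approx\; \min\!\left(P_{\text{peak}},\ \mathrm{CI} \times \frac{BW}{w}\right)
\]

For example, a loop with CI = 0.5 streaming doubles (w = 8) on a 480 GB/s K80 is capped near 0.5 x 480/8 = 30 GFLOPS per board, far below the 1.87-2.91 TFLOPS double-precision peak.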
Working and To Do
- Profiling tools
  - nvprof, nsight, etc.
  - PAPI CUDA components
- CUDA-aware MPI (see the sketch below)
  - OpenMPI 1.10.0 built with CUDA awareness
  - GPU Direct RDMA
- PBS scheduling and GPUs
  - resource utilisation
  - nvidia-smi permissions
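For reference, "CUDA-aware MPI" means device buffers can be passed straight to MPI calls; a minimal hedged sketch (assuming an OpenMPI build with CUDA support, as in the to-do item above, and at least two ranks):

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf;
    cudaSetDevice(rank % 8);                 // assumes up to 8 GPUs per node
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    // With a CUDA-aware MPI, the device pointer goes directly into MPI calls;
    // the library stages it through host memory or GPU Direct RDMA as available.
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d done\n", rank);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}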
References
- Tesla K80 GPU Accelerator Board Specification, NVIDIA, January 2015
- NVIDIA's CUDA Compute Architecture: Kepler GK110/210 (white paper)
- GPU Performance Analysis and Optimisation, NVIDIA, 2015
- OpenMPI with RDMA Support and CUDA
- GPU Hardware Execution Model and Overview, University of Utah, 2011
- NCI: http://www.
- NVIDIA CUDA: http://www.nvidia.com/object/cuda_home_new.html