HPC trends: (Myths about) accelerator cards & more
June 24, 2015 - Martin Schreiber, M.Schreiber@exeter.ac.uk
Outline
HPC & current architectures
Performance: Flop/s, Bandwidth, Similarities between CPUs and GPUs
Programming models: OpenCL & OpenMP
Some applications: ODE, PinT
Future trend(s)
No more performance by increasing the clock rate: limitation reached
(Using the TOP500 to Trace and Project Technology and Architecture Trends, SC11, Peter M. Kogge, Timothy J. Dysart)
Increasing the clock frequency led to performance improvements in the past (until ~2004)
Almost no performance gains anymore from higher clock rates
One way to keep increasing performance is to parallelize data processing
Trend towards hybrid architectures
Programmability?
Hybrid architectures: NVIDIA GPUs, Intel CPUs, Intel Xeon Phi (the newcomer), AMD CPUs, AMD GPUs
Image sources:
https://folding.stanford.edu/home/faq/faq-gpu3-common/
http://www.gameswelt.de/prozessoren-intel/news/ivy-bridge-cpus-ab-dem-29.-april,156589
http://spectrum.ieee.org/semiconductors/processors/what-intels-xeon-phi-coprocessor-means-for-the-future-of-supercomputing
http://bitbitbyte.com/2012/03/22/leaked-amd-7990-dual-gpu-specs-leak-try-to-steal-thunder-from-nvidias-party/
Performance improvements
Algorithmic / mathematical
Instruction parallelism: pipelining, using optimized instructions (FMA)
Data parallelism: smart instructions (SIMD/SIMT), many-core level, many-node level (distributed memory) [not part of this talk]
(Diagram: input data is split, processed in parallel with OpenMP/OpenCL/CUDA/... [part of this talk], and gathered into output data.)
Performance limitations
Compute-bound: computation-intensive operations (e.g. matrix-matrix multiplication)
Memory-bound: data access limited by memory bandwidth (e.g. stencil computations)
Accelerator cards (GPU/Xeon Phi) offer solutions to both challenges: => more computing power, => more bandwidth
Accelerator cards are not so different from CPUs => see next slides
Exploiting computational power: Flop/s
Theoretical max. performance in Flop/s for GPUs / CPUs / Xeon Phi
(Chart from http://docs.nvidia.com/cuda/cuda-c-programming-guide/; data points include Tesla (single precision), Tesla K40 and K80 (double precision), Xeon Phi Knights Corner and Knights Landing (double precision).)
Getting Flop/s on GPUs
Multiple Streaming Multiprocessors (SMs)
Single-instruction, multiple-thread programming (SIMT)
Half-warp: 16 compute units / threads, ~ vector-wise processing
Same operations: operations for threads within one half-warp should be identical to reach maximum performance
http://docs.nvidia.com/cuda/cuda-c-programming-guide
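A minimal OpenCL kernel sketch of this idea (kernel and variable names are illustrative, not from the talk): every work-item applies the same arithmetic to its own element, so all threads of a half-warp execute an identical instruction stream.

    // Illustrative only: same operation for every work-item -> no divergence
    __kernel void saxpy(const float alpha,
                        __global const float *x,
                        __global float *y)
    {
        const size_t i = get_global_id(0);   // one element per thread
        y[i] = alpha * x[i] + y[i];          // identical instructions across the half-warp
    }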
Getting Flop/s on CPUs: vector-wise processing via SIMD
Multi-core (hyper-threading on multi-cores is similar to SMs on NVIDIA GPUs)
SIMD: vector-wise processing, similar to a half-warp
Same operations: operations on the vector elements have to be identical
(Diagram: two processors on two sockets, multiple cores per socket; each core processes vectors of 8 elements.)
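A minimal sketch of vector-wise processing with OpenMP (array names assumed, not from the slides): the loop body is identical for all elements, so the compiler can map it onto 256-bit or 512-bit vector instructions.

    // Sketch: identical operation per element -> vectorizable loop
    void scale_add(double *a, const double *b, const double *c, int n)
    {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            a[i] = b[i] * c[i] + a[i];   // maps to a vector FMA where available
    }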
Current processing block sizes on GPUs / CPUs / Xeon Phi
GPUs: half-warp of 16 threads (SP: 16 scalars, DP: 8 scalars); particular size of block data transfers: 512 bit
CPUs: AVX: 256 bit; AVX2: 256 bit + FMA + ...
Xeon Phi: 512-bit vectors + FMA + ... (SP: 16 scalars, DP: 8 scalars)
=> Maximizing Flop/s on current and future architectures relies on vector-oriented processing!
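For the 256-bit AVX2 + FMA case, a hedged sketch with explicit intrinsics (function and array names are assumptions; n is assumed to be a multiple of 4, i.e. 4 doubles per 256-bit register):

    #include <immintrin.h>

    // Illustrative only: explicit 256-bit AVX2+FMA processing, 4 doubles per step
    void fma_avx2(double *a, const double *b, const double *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m256d vb = _mm256_loadu_pd(&b[i]);
            __m256d vc = _mm256_loadu_pd(&c[i]);
            __m256d va = _mm256_loadu_pd(&a[i]);
            _mm256_storeu_pd(&a[i], _mm256_fmadd_pd(vb, vc, va));  // a = b*c + a
        }
    }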
Exploiting data movement optimizations: efficiently moving the data that computations operate on
Max. bandwidth (chart), including Xeon Phi (Knights Corner)
Latency hiding (the cache hierarchy is not considered here)
Loading data from main memory takes >100 cycles on all architectures; strategies to hide this latency are required
=> Over-subscription of the physically available resources
GPUs: execute a different group of threads in case of a pipeline stall (zero-overhead thread scheduling)
CPUs: hyper-threading: more logical cores than physical floating-point units available
Xeon Phi: hyper-threading: each physical core has 3 additional hardware threads
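A small hedged sketch of the over-subscription idea on Knights Corner (the thread count is an assumption based on the 60 physical cores and 4 hardware threads per core mentioned in these slides, not a tuned value):

    #include <omp.h>

    // Sketch: run more OpenMP threads than physical cores to hide memory latency
    void run_oversubscribed(double *a, const double *b, long n)
    {
        omp_set_num_threads(240);   // e.g. 60 cores x 4 hardware threads
        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            a[i] += b[i];
    }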
Block-wise reading
Data is always loaded block-wise from main memory; blocks have a certain size and alignment
GPU: optimize for coalesced memory access (supported features depend on the compute capability)
CPU/Xeon Phi: use SIMD data loads, permutations, gather, scatter, etc.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
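Two illustrative OpenCL kernels (names assumed) contrasting the access patterns: consecutive work-items should touch consecutive addresses so that a half-warp's loads fall into as few memory blocks as possible.

    // Coalesced: thread i reads element i -> one block transfer per half-warp
    __kernel void copy_coalesced(__global const float *in, __global float *out)
    {
        size_t i = get_global_id(0);
        out[i] = in[i];
    }

    // Strided: neighboring threads read far-apart elements -> many transactions
    __kernel void copy_strided(__global const float *in, __global float *out,
                               const int stride)
    {
        size_t i = get_global_id(0);
        out[i] = in[i * stride];
    }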
Computations vs. memory access: the roofline model
The roofline model shows the performance limitations of your application: applications with low arithmetic intensity are limited by bandwidth; otherwise optimize for TLP, ILP, SIMD, ... (example plot: Intel Xeon, Clovertown)
Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures, Samuel Williams, Andrew Waterman, and David Patterson
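The roofline bound itself is a one-liner; a sketch with placeholder peak numbers (the values below are assumptions for illustration, not figures from the talk):

    #include <math.h>

    // Attainable GFlop/s = min(peak Flop/s, arithmetic intensity * bandwidth)
    double roofline_gflops(double arithmetic_intensity /* Flops per byte */)
    {
        const double peak_gflops = 1000.0;   // assumed compute peak (GFlop/s)
        const double peak_gbytes = 200.0;    // assumed memory bandwidth (GB/s)
        return fmin(peak_gflops, arithmetic_intensity * peak_gbytes);
    }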
Computations vs. memory access: a roofline model for GPU & Xeon Phi
(Figure from: N-body tutorial on Xeon Phi, Rio Yokota)
Programmability: making computational performance accessible
One decade ago: OpenGL/DirectX shader programming
Nowadays:
CUDA (language extension; NVIDIA only)
OpenCL (API; NVIDIA, AMD, CPUs, Xeon Phi)
OpenMP simd (language extension; CPUs, Xeon Phi)
(OpenACC)
Case study: accelerator architectures for ODE solvers (joint work with Ozgur Akman and Eleftherios Avramidis)
Parameter estimation for biological systems: solving an independent set of ODEs, no branch divergence in the kernels
OpenMP on Xeon Phi (60+1 physical cores):
Threading (threaded vs. non-threaded code): speedup = 124.5
SIMD (vectorized vs. non-vectorized code): speedup = 5.7
Alignment: speedup = 1.02
Benchmarks computed at IPVS, Univ. Stuttgart
Case study: accelerator architectures for ODE solvers (joint work with Ozgur Akman and Eleftherios Avramidis)
OpenCL + OpenMP: performance boost with OpenMP on Intel architectures
CPU, OpenCL vs. OpenMP: speedup = 3.1
CPU vs. Xeon Phi: speedup = 3.2
New NVIDIA GPU: speedup = 1.6
OpenCL is not performance portable (see OpenCL vs. OpenMP)
Currently, no OpenMP support for NVIDIA Tesla cards
OpenMP SIMD fails for another ODE (for this one, the K10 is faster than the Xeon Phi => future work)
Benchmarks computed at IPVS, Univ. Stuttgart
Parallelization in time (see the clock-rate limitation from the first slides; joint work with Beth Wingate, Adam Peddle, Terry Haut, et al.)
Compensates for strong-scalability limitations
Multiple simulation instances, coarse and fine time stepping
Iterative error-correction method in time (example for iterative correction: solving ODEs)
Future target application: climate & weather (GungHo!/LFRic)
Parallelization in time. Example: parallel-in-time solver for the rotational shallow-water equations
Benchmarks computed on the MAC Cluster, TU Munich
A Decentralized Parallelization-in-Time Approach with Parareal, M. Schreiber, A. Peddle, T. Haut, B. Wingate, in review
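A generic sketch of the standard Parareal correction step (not the code from the referenced paper; propagators, types and the scalar model problem are placeholders): a cheap coarse propagator G sweeps sequentially over the time slices, while the expensive fine propagator F can run on all slices in parallel.

    typedef double state_t;   // assumed scalar state for brevity

    // Placeholder propagators for du/dt = -u: one big vs. many small Euler steps
    static state_t coarse_step(state_t u) { return u + 0.1 * (-u); }
    static state_t fine_step(state_t u)
    {
        for (int i = 0; i < 10; i++) u += 0.01 * (-u);
        return u;
    }

    // One Parareal iteration: u_new[n+1] = G(u_new[n]) + F(u_old[n]) - G(u_old[n])
    // u_new[0] must be set to the initial condition; arrays hold num_slices+1 states.
    void parareal_iteration(state_t *u_new, const state_t *u_old, int num_slices)
    {
        for (int n = 0; n < num_slices; n++) {
            // In the parallel version the fine_step() calls for all slices run
            // concurrently; only this coarse sweep is sequential.
            u_new[n + 1] = coarse_step(u_new[n])
                         + fine_step(u_old[n])
                         - coarse_step(u_old[n]);
        }
    }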
Outlook
Future trends: hardware
Xeon Phi: Knights Landing at the end of this year
Two different versions (see the image on the right)
Supports the OpenMP programming model
3+ TFlops
http://www.v3.co.uk/img/374/305374/intel-xeon-phi-roadmap.png
Future trends? Xeon Phi + GPUs
Intel Xeon Phi:
Trend towards GPUs: smaller caches, many-core system [1]
OpenCL support not optimal
Supports the OpenMP programming model efficiently
=> Vectorization mandatory! No vectorization => no Flop/s
GPUs (AMD/NVIDIA):
Trend towards CPU-like architectures: caches, support for C++11, ...
Good CUDA support, but proprietary
OpenCL support not optimal for NVIDIA cards
=> OpenMP could be the future of threaded and SIMD parallelism => OpenMP for GPUs?
[1] https://en.wikipedia.org/wiki/xeon_phi#knights_corner
Thank you for your attention Interested in code optimization? => M.Schreiber@exeter.ac.uk!
Additional slides
Branching
GPUs:
if (i == 0) a[i] += c[i]; else a[i] += b[i];
would result in branch divergence and serialization of all threads
Solution: predicate registers; information in the registers decides whether an instruction is executed for a certain thread
CPUs:
Use masking / blend SIMD instructions, or use arithmetic/bit tricks:
for (int i = 0; i < N; i++) {
    int m = (i == 0);
    a[i] += c[i]*m + b[i]*(1 - m);
}
http://www.slideshare.net/ttyman1/gpu-performance-prediction-using-highlevel-application-models
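A hedged sketch of the masking/blend variant with AVX intrinsics (names and the data-based condition are assumptions; n is assumed to be a multiple of 4): both branch results are computed, then a comparison mask selects per element.

    #include <immintrin.h>

    // Branch-free blend: where cond[i] == 0 take a[i]+c[i], otherwise a[i]+b[i]
    void branch_free(double *a, const double *b, const double *c,
                     const double *cond, int n)
    {
        const __m256d zero = _mm256_setzero_pd();
        for (int i = 0; i < n; i += 4) {
            __m256d va     = _mm256_loadu_pd(&a[i]);
            __m256d take_c = _mm256_add_pd(va, _mm256_loadu_pd(&c[i]));   // "if" result
            __m256d take_b = _mm256_add_pd(va, _mm256_loadu_pd(&b[i]));   // "else" result
            __m256d mask   = _mm256_cmp_pd(_mm256_loadu_pd(&cond[i]), zero, _CMP_EQ_OQ);
            _mm256_storeu_pd(&a[i], _mm256_blendv_pd(take_b, take_c, mask));
        }
    }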
Data alignment
GPU: buffers are automatically aligned; assume aligned buffers to avoid uncoalesced memory access
CPU/Xeon Phi:
Avoid misaligned memory access; align memory buffers at 64-byte boundaries:
int posix_memalign(void **memptr, size_t alignment, size_t size);
Provide information on the alignment, e.g.
#pragma omp parallel for simd \
    aligned(variable)
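A minimal usage sketch of posix_memalign for the CPU/Xeon Phi case (helper name is an assumption): allocate a buffer aligned to a 64-byte boundary, as recommended above.

    #include <stdlib.h>

    // Allocate num_elements doubles aligned to 64 bytes; free with free()
    double *alloc_aligned(size_t num_elements)
    {
        void *ptr = NULL;
        if (posix_memalign(&ptr, 64, num_elements * sizeof(double)) != 0)
            return NULL;            // allocation failed
        return (double *)ptr;
    }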
OpenCL
... lots of boilerplate code to create the context, load & compile kernels, etc. ... really a lot! (about 200 lines of code) ...
clEnqueueWriteBuffer(command_queue, client_mem_x, CL_TRUE, 0, sizeof(someobject), host_x, 0, NULL, NULL);
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &param_global_workgroup_size, &opencl_local_workgroup_size, 0, NULL, NULL);
API; explicit buffer management; kernel-oriented programming; explicit kernel execution; ...
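For orientation, a heavily condensed sketch of the host-side boilerplate referred to above (error handling, platform/device queries and argument setup are omitted; the kernel name "ode_kernel" is an assumption):

    #include <CL/cl.h>

    // Illustrative setup only; real code adds error checks and cleanup
    void opencl_setup(const char *kernel_source)
    {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

        cl_program program = clCreateProgramWithSource(context, 1, &kernel_source, NULL, NULL);
        clBuildProgram(program, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(program, "ode_kernel", NULL);  // assumed name
        // ... create buffers, set kernel arguments, enqueue (see above), release objects
    }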
OpenMP
#pragma offload_transfer target(mic) \
    in(params : length(n) ALLOC) \
    in(x : length(m) ALLOC)

#pragma omp parallel for simd aligned(x, params)
for (int i = 0; i < num_sims; i++)
    ODEBenchmark_OpenMP_ver2(x, params, i, num_sims);

// KERNEL
#pragma omp declare simd aligned(i_x, params_g) SPEC_TARGETS
void ODEBenchmark_OpenMP_ver2(
    double *i_x, double *params_g,
    int x, int realizations)
{ ... }

Aligned memory allocation; offload model; SIMD annotation; kernel-like programming; support for Fortran
Case study: accelerator architectures for ODE solvers (joint work with Ozgur Akman and Eleftherios Avramidis)
OpenCL only, different architectures
Poor performance on Intel CPU & Xeon Phi
Benchmarks computed at IPVS, Univ. Stuttgart
ODEs: exploration of different architectures and parallelization models required
Software: Intel compiler infrastructure (icpc/ifort); GNU compiler (offload support in 5.0); OpenCL/CUDA support; (OpenACC support?)
Hardware: (Intel) CPUs; 2x Xeon Phi (onboard Phis available?); 2x NVIDIA GPUs; 2x AMD GPUs
PinT: requirements
Software requirements: Python related: Python 2.7 and 3.4, MPI4py, Matplotlib
Future software requirements: see the GungHo/LFRic project; LAPACK library support (e.g. Intel MKL) for EV decomposition
Hardware: focus on CPUs (maybe Xeon Phi); ~500 GB storage space