HPC trends: (Myths about) accelerator cards & more. June 24, 2015 - Martin Schreiber, M.Schreiber@exeter.ac.uk

Outline
- HPC & current architectures
- Performance: Flop/s, bandwidth
- Programming models: OpenCL & OpenMP
- Some applications: ODEs, PinT
- Similarities between CPUs and GPUs
- Future trend(s)

No more performance by increasing clock rate: limitation reached
- Increasing the frequency led to performance improvements in the past (until ~2004)
- Almost no performance gains anymore from higher clock rates
- One way out is to parallelize data processing
- Trend towards hybrid architectures
- Programmability?
(See "Using the TOP500 to Trace and Project Technology and Architecture Trends", SC11, Peter M. Kogge, Timothy J. Dysart)

Hybrid architectures
- NVIDIA GPUs
- Intel CPUs
- Intel XeonPhi
- AMD CPUs (the newcomer)
- AMD GPUs
Image sources:
https://folding.stanford.edu/home/faq/faq-gpu3-common/
http://www.gameswelt.de/prozessoren-intel/news/ivy-bridge-cpus-ab-dem-29.-april,156589
http://spectrum.ieee.org/semiconductors/processors/what-intels-xeon-phi-coprocessor-means-for-the-future-of-supercomputing
http://bitbitbyte.com/2012/03/22/leaked-amd-7990-dual-gpu-specs-leak-try-to-steal-thunder-from-nvidias-party/

Performance improvements
- Algorithmic / mathematical
- Instruction parallelism: pipelining, using optimized instructions (FMA)
- Data parallelism: smart instructions (SIMD/SIMT), many-core level, many-node level (distributed memory) [not part of this talk]
[Diagram: input data is processed in parallel over time (OpenMP/OpenCL/CUDA/...) into output data; this data parallelization is the part covered in this talk]

Performance limitations
- Compute-bound: computation-intensive operations (e.g. matrix-matrix multiplication)
- Memory-bound: data accesses run into the memory bandwidth limit (e.g. stencil computations)
(See the sketch below for the two regimes.)
Accelerator cards (GPU/XeonPhi) offer solutions for both challenges:
=> More computing power
=> More bandwidth
Accelerator cards are not so different from CPUs => see next slides
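
A minimal C sketch of the two regimes (my illustration, not from the slides; sizes and function names are hypothetical): the triple-loop matrix product performs O(N^3) flops on O(N^2) data and tends to be compute-bound, while the 3-point stencil performs only ~2 flops per loaded value and is typically memory-bound.

    #include <stdlib.h>

    #define N 512

    /* Compute-bound: C = A*B does 2*N^3 flops on 3*N^2 doubles of data */
    static void matmul(const double *A, const double *B, double *C)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i * N + k] * B[k * N + j];
                C[i * N + j] = sum;
            }
    }

    /* Memory-bound: 3-point stencil does ~2 flops per grid point loaded/stored */
    static void stencil(const double *in, double *out, int n)
    {
        for (int i = 1; i < n - 1; i++)
            out[i] = 0.5 * (in[i - 1] + in[i + 1]);
    }

    int main(void)
    {
        double *A = calloc(N * N, sizeof(double));
        double *B = calloc(N * N, sizeof(double));
        double *C = calloc(N * N, sizeof(double));
        double *u = calloc(N, sizeof(double));
        double *v = calloc(N, sizeof(double));

        if (!A || !B || !C || !u || !v)
            return 1;

        matmul(A, B, C);   /* arithmetic intensity grows with blocking/caching */
        stencil(u, v, N);  /* arithmetic intensity stays at a fraction of a flop per byte */

        free(A); free(B); free(C); free(u); free(v);
        return 0;
    }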

Exploiting computational power: Flop/s

Theoretical max. performance in Flop/s: GPUs / CPUs / XeonPhi
[Chart: theoretical peak Flop/s over time; series: Tesla (single precision), XeonPhi Knights Landing (double precision), Tesla K80 (double precision), Tesla K40 (double precision), XeonPhi Knights Corner (double precision). Source: http://docs.nvidia.com/cuda/cuda-c-programming-guide/]

Getting Flop/s on GPUs
- Multiple Streaming Multiprocessors (SMs)
- Single instruction, multiple threads (SIMT) programming
- Half-warp: 16 compute units / threads ~ vector-wise processing
- Same operations: operations for the threads in one half-warp should be identical to reach max. performance
(Source: http://docs.nvidia.com/cuda/cuda-c-programming-guide)

Getting Flop/s on CPUs: vector-wise processing
- Multi-core (hyperthreading on multi-cores is similar to SMs on NVIDIA GPUs)
- SIMD: vector-wise processing, similar to a half-warp
- Same operations: operations on the vector elements have to be identical
[Diagram: two sockets (Processor 0 on socket 0, Processor 1 on socket 1) with cores Core0 ... Core11; each core processes a vector of elements Vec Elem. 0 ... Vec Elem. 7]

Current processing block sizes on GPUs / CPUs / XeonPhis
- GPUs: half-warp: 16 threads (SP: 16 scalars, DP: 8 scalars); particular size of block data transfers: 512 bit
- CPUs: AVX: 256 bit; AVX2: 256 bit + FMA + ...
- XeonPhi: AVX: 512 bit + FMA + ... (SP: 16 scalars, DP: 8 scalars)
=> Maximizing Flop/s on current and future architectures relies on vector-oriented processing! (see the sketch below)
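
As a small illustration of vector-oriented processing on CPUs (my sketch, assuming AVX and FMA are available and the code is compiled with e.g. -mavx -mfma; the function name is made up): one 256-bit register holds 4 doubles, so a single fused multiply-add updates 4 elements per instruction.

    #include <immintrin.h>

    /* y[i] += a * x[i], processing 4 doubles per 256-bit AVX register.
       n is assumed to be a multiple of 4 to keep the sketch short. */
    void axpy_avx(int n, double a, const double *x, double *y)
    {
        __m256d va = _mm256_set1_pd(a);
        for (int i = 0; i < n; i += 4) {
            __m256d vx = _mm256_loadu_pd(&x[i]);
            __m256d vy = _mm256_loadu_pd(&y[i]);
            vy = _mm256_fmadd_pd(va, vx, vy);  /* fused multiply-add: vy = va*vx + vy */
            _mm256_storeu_pd(&y[i], vy);
        }
    }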

Exploiting data movement optimizations: Efficiently moving data to run computations on

Max. bandwidth
[Chart: peak memory bandwidth; among the series: XeonPhi (Knights Corner)]

Latency hiding (the cache hierarchy is not considered here)
Loading data from main memory takes >100 cycles on all architectures; strategies to hide this latency are required => over-subscription of the physically available resources
- GPUs: execute a different group of threads in case of a pipeline stall (zero-overhead thread scheduling)
- CPUs: hyperthreading: more logical cores than physical floating-point units are available
- XeonPhi: hyperthreading: each physical core has 3 additional hyperthreads

Block-wise reading
- Data is always loaded block-wise from main memory; blocks have a certain size and alignment
- GPU: optimize for coalesced memory access (the supported features depend on the compute capability)
- CPU/XeonPhi: use SIMD data loads, permutations, gather, scatter, etc.
(Source: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html; a small data-layout sketch follows below.)
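
A data-layout sketch of why block-wise loading favors contiguous access (my example, not from the slides): with a struct-of-arrays layout the values touched by consecutive iterations, vector lanes, or GPU threads are contiguous, so every byte of a loaded block is used; an array-of-structs layout wastes most of each block.

    #include <stddef.h>

    /* Array-of-structs: reading only .x strides over memory, so most of each
       loaded block (cache line / memory transaction) is wasted. */
    struct particle_aos { double x, y, z, mass; };

    double sum_x_aos(const struct particle_aos *p, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += p[i].x;            /* stride of 4 doubles between accesses */
        return s;
    }

    /* Struct-of-arrays: the x values are contiguous, so consecutive iterations
       (or vector lanes / GPU threads) hit the same block => unit-stride access. */
    struct particle_soa { double *x, *y, *z, *mass; };

    double sum_x_soa(const struct particle_soa *p, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += p->x[i];           /* unit stride */
        return s;
    }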

Computations vs. memory access: a roofline model
- The roofline model can show the performance limitations of your application
- Optimize for TLP, ILP, SIMD, ...
[Plot: roofline for an Intel Xeon (Clovertown); the region to the left of the ridge point is limited by bandwidth]
(Reference: "Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures", Samuel Williams, Andrew Waterman, and David Patterson)
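
The roofline bound itself is a one-liner (my sketch; the peak values are placeholders, not measurements): attainable Flop/s = min(peak Flop/s, arithmetic intensity x peak bandwidth), with the arithmetic intensity given in flops per byte moved.

    #include <stdio.h>

    /* Roofline bound: attainable = min(peak_flops, intensity * peak_bandwidth). */
    static double roofline(double peak_flops, double peak_bw, double intensity)
    {
        double bw_bound = intensity * peak_bw;
        return bw_bound < peak_flops ? bw_bound : peak_flops;
    }

    int main(void)
    {
        double peak_flops = 1.0e12;  /* 1 TFlop/s, hypothetical */
        double peak_bw    = 2.0e11;  /* 200 GB/s, hypothetical  */

        /* Stencil-like kernel: ~2 flops per 8-byte value => intensity ~0.25 flop/byte */
        printf("stencil bound: %.3g Flop/s\n", roofline(peak_flops, peak_bw, 0.25));

        /* Blocked matrix-matrix multiplication can reach much higher intensity */
        printf("matmul  bound: %.3g Flop/s\n", roofline(peak_flops, peak_bw, 10.0));
        return 0;
    }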

Computations vs. memory access: a roofline model for GPU & XeonPhi
[Plot: rooflines for GPU & XeonPhi. Source: N-body tutorial on Xeon Phi, Rio Yokota]

Programmability: making computational performance accessible
- One decade ago: OpenGL/DirectX shader programming
- Nowadays:
  - CUDA (language extension; NVIDIA only)
  - OpenCL (API; NVIDIA, AMD, CPUs, XeonPhi)
  - OpenMP simd (language extension; CPUs, XeonPhi)
  - (OpenACC)

Case study: accelerator architectures for ODE solvers (joint work with Ozgur Akman and Eleftherios Avramidis)
- Parameter estimation for biological systems: no branch divergence in the kernels
- Solving an independent set of ODEs
OpenMP on XeonPhi (60+1 physical cores):
- Threading (threaded vs. non-threaded code): speedup = 124.5
- SIMD (vectorized vs. non-vectorized code): speedup = 5.7
- Alignment: speedup = 1.02
(Benchmarks computed at IPVS, Univ. Stuttgart)

Case study: accelerator architectures for ODE solvers (joint work with Ozgur Akman and Eleftherios Avramidis)
OpenCL + OpenMP:
- Performance boost with OpenMP on Intel architectures
- OpenCL is not performance portable (see the OpenCL vs. OpenMP speedups)
- Currently, no OpenMP support for NVIDIA Tesla cards
- OpenMP SIMD fails for another ODE (for this one, the K10 is faster than the XeonPhi => future work)
- CPU: OpenCL vs. OpenMP speedup = 3.1
- CPU vs. XeonPhi: speedup = 3.2
- New NVIDIA GPU: speedup = 1.6
(Benchmarks computed at IPVS, Univ. Stuttgart)

Parallelization-in-time (see the clock-rate limitation from the first slide)
(Joint work with Beth Wingate, Adam Peddle, Terry Haut, et al.)
- Compensate strong-scalability limitations
- Multiple simulation instances
- Coarse and fine time stepping
- Iterative error-correction method in time
- Future target application: climate & weather (GungHo!/LFRic)
[Figure: example of the iterative correction (solving ODEs); see the sketch below]
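
A sketch of such an iterative correction, in the spirit of the Parareal scheme named on the next slide (my illustration with a scalar ODE and Euler propagators, not the authors' code): a cheap coarse propagator G is run serially, the expensive fine propagator F runs independently, and therefore in parallel, over all time slices, and a serial sweep corrects the coarse prediction in every iteration.

    #include <stdio.h>

    #define SLICES 8
    #define K_ITER 4

    static double f(double u) { return -u; }          /* du/dt = -u */

    static double coarse(double u, double dt)         /* G: one Euler step */
    {
        return u + dt * f(u);
    }

    static double fine(double u, double dt)           /* F: 100 Euler sub-steps */
    {
        double h = dt / 100.0;
        for (int i = 0; i < 100; i++)
            u += h * f(u);
        return u;
    }

    int main(void)
    {
        double dt = 0.1, u[SLICES + 1], u_old[SLICES + 1];

        /* Initial serial coarse sweep */
        u[0] = 1.0;
        for (int n = 0; n < SLICES; n++)
            u[n + 1] = coarse(u[n], dt);

        for (int k = 0; k < K_ITER; k++) {
            double fine_val[SLICES];
            for (int n = 0; n <= SLICES; n++)
                u_old[n] = u[n];

            /* F uses values from the previous iteration; each slice is
               independent and could run in parallel (e.g. one per MPI rank). */
            for (int n = 0; n < SLICES; n++)
                fine_val[n] = fine(u_old[n], dt);

            /* Serial correction: u_{k+1}^{n+1} = G(u_{k+1}^n) + F(u_k^n) - G(u_k^n) */
            for (int n = 0; n < SLICES; n++)
                u[n + 1] = coarse(u[n], dt) + fine_val[n] - coarse(u_old[n], dt);
        }

        printf("u(T) ~ %f\n", u[SLICES]);
        return 0;
    }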

Parallelization-in-time
Example: parallel-in-time solver for the rotational shallow water equations
(Benchmarks computed on the MAC Cluster, TU Munich)
(Reference: "A Decentralized Parallelization-in-Time Approach with Parareal", M. Schreiber, A. Peddle, T. Haut, B. Wingate, in review)

Outlook

Future trends: hardware
XeonPhi: Knights Landing at the end of this year
- Two different versions (see the image on the right: http://www.v3.co.uk/img/374/305374/intel-xeon-phi-roadmap.png)
- Supports the OpenMP programming model
- 3+ TFlops

Future trends? XeonPhi + GPUs
Intel XeonPhi:
- Trend towards GPUs: smaller caches, many-core system [1]
- OpenCL support not optimal
- Supports the OpenMP programming model efficiently
- => Vectorization mandatory! No vectorization => no Flop/s
GPUs (AMD/NVIDIA):
- Trend towards a CPU-like architecture: caches, support for C++11, ...
- Good CUDA support, but proprietary
- OpenCL support not optimal for NVIDIA cards
=> OpenMP could be the future of threaded and SIMD parallelism
=> OpenMP for GPUs?
[1] https://en.wikipedia.org/wiki/xeon_phi#knights_corner

Thank you for your attention Interested in code optimization? => M.Schreiber@exeter.ac.uk!

Additional slides

Branching
GPUs:
    if (i == 0)
        a[i] += c[i];
    else
        a[i] += b[i];
- Would result in branch divergence and serialization of all threads
- Solution: predicate registers; information in the registers decides whether an instruction is executed for a certain thread
CPUs:
- Use masking / blend SIMD instructions (see the sketch below)
- Or use arithmetic/bit tricks:
    for (int i = 0; i < N; i++) {
        int m = (i == 0);
        a[i] += c[i]*m + b[i]*(1 - m);
    }
(Source: http://www.slideshare.net/ttyman1/gpu-performance-prediction-using-highlevel-application-models)
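
A minimal sketch of the masking/blend approach with AVX intrinsics (my example; the array names are hypothetical, compile with e.g. -mavx): a comparison produces a per-lane mask and a blend selects between the two candidate values, so the loop body contains no branch.

    #include <immintrin.h>

    /* a[i] += (i == 0) ? c[i] : b[i], four doubles at a time, branch-free.
       n is assumed to be a multiple of 4 for brevity. */
    void add_select_avx(int n, double *a, const double *b, const double *c)
    {
        for (int i = 0; i < n; i += 4) {
            /* lane indices i, i+1, i+2, i+3 */
            __m256d idx  = _mm256_set_pd(i + 3, i + 2, i + 1, i + 0);
            __m256d mask = _mm256_cmp_pd(idx, _mm256_setzero_pd(), _CMP_EQ_OQ);

            __m256d vb = _mm256_loadu_pd(&b[i]);
            __m256d vc = _mm256_loadu_pd(&c[i]);
            /* blendv picks vc where the mask is set (i == 0), vb elsewhere */
            __m256d sel = _mm256_blendv_pd(vb, vc, mask);

            __m256d va = _mm256_loadu_pd(&a[i]);
            _mm256_storeu_pd(&a[i], _mm256_add_pd(va, sel));
        }
    }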

Data alignment
GPU:
- Assume aligned buffers to avoid uncoalesced memory access
- Buffers are automatically aligned
CPU/XeonPhi:
- Avoid misaligned memory access:
    int posix_memalign(void **memptr, size_t alignment, size_t size);
- Align memory buffers at 64-byte boundaries
- Provide information on the alignment, e.g.
    #pragma omp parallel for simd aligned(variable)
(A short usage sketch follows below.)
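
A short usage sketch combining both points (my example, assuming a POSIX system and an OpenMP 4.0 compiler): allocate the buffer 64-byte aligned and tell the compiler about it via the aligned clause.

    #include <stdlib.h>

    int main(void)
    {
        int n = 1024;
        double *x;

        /* Allocate a 64-byte aligned buffer (cache-line / SIMD friendly) */
        if (posix_memalign((void **)&x, 64, n * sizeof(double)) != 0)
            return 1;

        /* Tell the compiler about the alignment so it can emit aligned SIMD loads/stores */
        #pragma omp parallel for simd aligned(x : 64)
        for (int i = 0; i < n; i++)
            x[i] = 2.0 * i;

        free(x);
        return 0;
    }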

OpenCL
- Lots of boilerplate code to create the context, load & compile the kernels, etc. ... Really a lot! (about 200 lines of code)
- API
- Explicit buffer management
- Kernel-oriented programming
- Explicit kernel execution
- ...

    clEnqueueWriteBuffer(command_queue, client_mem_x, CL_TRUE, 0,
                         sizeof(someobject), host_x, 0, NULL, NULL);

    clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                           &param_global_workgroup_size,
                           &opencl_local_workgroup_size,
                           0, NULL, NULL);

OpenMP
- Aligned memory allocation
- Offload model
- SIMD annotation
- Kernel-like programming
- Support for Fortran

    #pragma offload_transfer target(mic) \
        in(params : length(n) ALLOC) \
        in(x : length(m) ALLOC)

    #pragma omp parallel for simd aligned(x, params)
    for (int i = 0; i < num_sims; i++)
        ODEBenchmark_OpenMP_ver2(x, params, i, num_sims);

    // KERNEL
    #pragma omp declare simd aligned(i_x, params_g) SPEC_TARGETS
    void ODEBenchmark_OpenMP_ver2(
        double *i_x, double *params_g,
        int x, int realizations)
    { ... }

Case study: accelerator architectures for ODE solvers (joint work with Ozgur Akman and Eleftherios Avramidis)
- OpenCL only
- Different architectures
- Poor performance on Intel CPU & XeonPhi
(Benchmarks computed at IPVS, Univ. Stuttgart)

ODEs: exploration of different architectures and parallelization models required
Software:
- Intel compiler infrastructure (icpc/ifort)
- GNU compiler (offload support in 5.0)
- OpenCL/CUDA support
- (OpenACC support?)
Hardware:
- (Intel) CPUs
- 2x XeonPhi (onboard Phis available?)
- 2x NVIDIA GPUs
- 2x AMD GPUs

PinT: requirements
Software requirements:
- Python related: Python 2.7 and 3.4, MPI4py, Matplotlib
Future software requirements:
- See the GungHo/LFRic project
- LAPACK library support (e.g. Intel MKL) for EV decomposition
Hardware:
- Focus on CPUs (maybe XeonPhi)
- ~500 GB storage space