Parallel Accelerators
Přemysl Šůcha, "Parallel algorithms", 2017/2018, CTU/FEL

Topic Overview: Graphical Processing Units (GPU) and CUDA; vector addition on CUDA; Intel Xeon Phi; matrix equations on Xeon Phi.

Graphical Processing Units

GPU Nvidia - Roadmap

GPU - Use The GPU is especially well-suited to problems that can be expressed as data-parallel computations: the same program is executed on many data elements in parallel, with high arithmetic intensity. Applications that process large data sets can use a data-parallel programming model to speed up the computations (3D rendering, image processing, video encoding, ...). Many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing too (machine learning, general signal processing, physics simulation, finance, ...).

GPU - Overview CPU code runs on the host, GPU code runs on the device. A kernel is executed by many threads. Threads execute in 32-thread groups called warps, threads are grouped into blocks, and a collection of blocks is called a grid.
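To make this hierarchy concrete, here is a minimal illustrative CUDA kernel (hierarchyDemo and the 4-block / 128-thread launch are made-up names and numbers, not from the slides) that computes a thread's position in the grid and the warp it belongs to within its block:

__global__ void hierarchyDemo(int *out)
{
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index in the grid
    int warpId   = threadIdx.x / 32;                       // which 32-thread warp of the block
    out[globalId] = warpId;
}

// Launch: a grid of 4 blocks, 128 threads per block (i.e. 4 warps per block):
// hierarchyDemo<<<4, 128>>>(d_out);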

GPU - Hardware Organization Overview A GPU chip consists of one or more streaming multiprocessors (SMs). A multiprocessor contains 1 (CC 1.x), 2 (CC 2.x), or 4 (CC 3.x, 5.x, 6.x, 7.x) warp schedulers (CC = Compute Capability). Each warp scheduler has 2 (CC 5.x and 6.x) or 1 (CC 7.x) dispatch units. A multiprocessor also contains functional units of several types.

Streaming Multiprocessor (SM) - Volta

GPU - Functional Units INT, FP32 (CUDA Core) - functional units that execute most types of instructions, including most integer and single-precision floating-point instructions. FP64 (Double Precision) - executes double-precision floating-point instructions. SFU (Special Function Unit) - executes reciprocal and transcendental instructions such as sine, cosine, and reciprocal square root. LD/ST (Load/Store Unit) - handles load and store instructions. TENSOR CORE - matrix arithmetic for deep learning.

GPU - Tensor Core The V100 GPU contains 640 Tensor Cores: eight (8) per SM. A Tensor Core performs 64 floating-point FMA (fused multiply-add) operations per clock. Matrix-matrix multiplication (GEMM) operations are at the core of neural-network training and inferencing. Each Tensor Core operates on 4x4 matrices.
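Conceptually, the per-clock Tensor Core operation is D = A x B + C on 4x4 matrices (with FP16 multiplies and FP32 accumulation on Volta). The plain-C sketch below only illustrates that arithmetic; in real programs Tensor Cores are reached through libraries such as cuBLAS/cuDNN or the warp-level wmma API, not through scalar code like this.

/* Illustration only: one 4x4 fused multiply-add, D = A*B + C. */
void tensor_core_fma_4x4(const float A[4][4], const float B[4][4],
                         const float C[4][4], float D[4][4])
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];            /* accumulate onto the C operand */
            for (int k = 0; k < 4; ++k)
                acc += A[i][k] * B[k][j];   /* 4*4*4 = 64 multiply-adds in total */
            D[i][j] = acc;
        }
}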

Streaming Multiprocessor (SM) Each SM has a set of temporary registers split amongst threads. Instructions can access high-speed shared memory, a cache-backed constant space, local memory, and the global memory space (which is, in general, very slow).

GPU Architecture - Volta

GPU Architectures
Tesla Product          | Tesla K40     | Tesla M40     | Tesla P100    | Tesla V100
Manufacturing Process  | 28 nm         | 28 nm         | 16 nm         | 12 nm
Transistors            | 7.1 billion   | 8.0 billion   | 15.3 billion  | 21.1 billion
SMs / GPU              | 15            | 24            | 56            | 80
FP32 Cores / SM        | 192           | 128           | 64            | 64
FP32 Cores / GPU       | 2880          | 3072          | 3584          | 5120
Peak FP32 TFLOPS       | 5             | 6.8           | 10.6          | 15.7
Peak FP64 TFLOPS       | 1.7           | 0.21          | 5.3           | 7.8
GPU Boost Clock        | 810/870 MHz   | 1114 MHz      | 1480 MHz      | 1530 MHz
TDP                    | 235 W         | 250 W         | 300 W         | 300 W
Memory Interface       | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2
Maximum TDP            | 244 W         | 250 W         | 250 W         | 300 W

Single-Instruction, Multiple-Thread SIMT is an execution model in which single instruction, multiple data (SIMD) is combined with multithreading. The SM creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. The threads of a warp start together at the same program address, but each has its own instruction address counter and register state and is therefore free to branch and execute independently.
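A small illustrative kernel (divergenceDemo is a made-up name) shows what this independence costs in practice: when threads of the same warp take different branches, the warp executes both paths with the non-participating lanes masked off and then reconverges.

__global__ void divergenceDemo(int *out)
{
    int tid = threadIdx.x;
    if (tid % 2 == 0)
        out[tid] = 2 * tid;     // even lanes of the warp take this path
    else
        out[tid] = tid + 100;   // odd lanes take this one; the warp serializes the two paths
}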

CUDA The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). CUDA (Compute Unified Device Architecture) provides a way for a CUDA program to execute on any number of SMs: a multithreaded program is partitioned into blocks of threads that execute independently of each other, so a GPU with more multiprocessors automatically executes the program in less time than a GPU with fewer multiprocessors.

CUDA

Grid/Block/Thread Threads can be identified using a 1-D, 2-D, or 3-D thread index, forming a 1-D, 2-D, or 3-D block of threads, called a thread block. Blocks are organized into a 1-D, 2-D, or 3-D grid of thread blocks (the figure shows a 2-D grid of 2-D thread blocks).
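As a hedged sketch of the 2-D case (MatAdd, N, and the 16x16 block size are illustrative choices, not taken from the slides), each thread of a 2-D block within a 2-D grid can compute one element of an N x N matrix sum:

__global__ void MatAdd(const float *A, const float *B, float *C, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x-index within the grid
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y-index within the grid
    if (row < N && col < N)
        C[row * N + col] = A[row * N + col] + B[row * N + col];
}

// Launch with a 2-D grid of 2-D blocks, e.g. for N = 1024:
// dim3 threadsPerBlock(16, 16);
// dim3 numBlocks((N + 15) / 16, (N + 15) / 16);
// MatAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);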

Kernel CUDA C extends C by allowing the programmer to define C functions, called kernels.

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads inside 1 thread block
    VecAdd<<<1, N>>>(A, B, C);
}

threadIdx is a 3-component vector, so threads can be identified using a 1-D, 2-D, or 3-D thread index.

Memory Hierarchy Each thread has a private set of registers and local memory. Each thread block has shared memory visible to all threads of the block. All threads have access to the same global memory. There are also two additional read-only memory spaces accessible by all threads (constant and texture memory).
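A minimal sketch of how these spaces appear in a kernel (blockReverse is a made-up example that assumes a block size of at most 256 threads): the input and output arrays live in global memory, while the per-block tile[] buffer is declared __shared__ and is visible only to the threads of its block.

__global__ void blockReverse(const float *in, float *out, int n)
{
    __shared__ float tile[256];                     // shared memory: one tile per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // index into global memory
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // global -> shared
    __syncthreads();                                // make all loads visible to the whole block
    int mirror = blockDim.x - 1 - threadIdx.x;      // block-local mirrored lane
    if (i < n)
        out[i] = tile[mirror];                      // shared -> global: reverse each tile
}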

GPU Programming - Example Element by element vector addition [1] NVIDIA Corporation, CUDA Toolkit Documentation v9.0.176, 2017.

Element by element vector addition

/* Host main routine */
int main(void)
{
    int numElements = 50000;
    size_t size = numElements * sizeof(float);

    // Allocate the host input vectors A and B and the output vector C
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);

    // Initialize the host input vectors
    for (int i = 0; i < numElements; ++i)
    {
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;
    }

Element by element vector addition

    // Allocate the device input vectors A and B and the output vector C
    float *d_A = NULL;
    cudaMalloc((void **)&d_A, size);
    float *d_B = NULL;
    cudaMalloc((void **)&d_B, size);
    float *d_C = NULL;
    cudaMalloc((void **)&d_C, size);

    // Copy the host input vectors A and B in host memory to the device
    // input vectors in device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

Element by element vector addition

    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

    // Copy the device result vector in device memory to the host result vector
    // in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
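A hedged aside, not part of the transcribed slides (the full CUDA Toolkit vectorAdd sample checks every API call in this way): kernel launches are asynchronous and do not return a status directly, so launch errors are usually picked up right afterwards with cudaGetLastError(). The snippet assumes <stdio.h> and <stdlib.h> are included.

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to launch vectorAdd kernel: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }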

Element by element vector addition

    // Free device global memory
    err = cudaFree(d_A);
    err = cudaFree(d_B);
    err = cudaFree(d_C);

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}

Element by element vector addition

/**
 * CUDA Kernel Device code
 *
 * Computes the vector addition of A and B into C. The 3 vectors have the same
 * number of elements numElements.
 */
__global__ void vectorAdd(float *A, float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}
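As a side note that is not part of the toolkit sample, the same kernel is often written with a grid-stride loop, which keeps it correct for any numElements and any launch configuration:

__global__ void vectorAddStride(const float *A, const float *B, float *C, int numElements)
{
    // Each thread starts at its global index and strides by the total number
    // of threads in the grid until the whole vector has been covered.
    for (int i = blockDim.x * blockIdx.x + threadIdx.x;
         i < numElements;
         i += blockDim.x * gridDim.x)
    {
        C[i] = A[i] + B[i];
    }
}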

Intel Xeon Phi

Intel Xeon Phi - Roadmap

Intel Xeon Phi Intel Xeon Phi coprocessors are designed to extend the reach of applications that have demonstrated the ability to fully utilize the scaling capabilities of Intel Xeon processor-based systems. Code compiled for Xeon processors can run on a Xeon Phi (Knights Landing). Successful parallelization requires a program with many threads and vector operations.

Knights Landing Architecture

Knights Landing Architecture The chip consists of 36 tiles interconnected by a 2-D mesh. Each tile has two cores (Atom Silvermont architecture), two vector processing units (VPUs) per core, and 1 MB of shared L2 cache. Each core can execute 4 threads concurrently. The cache-coherent 2-D mesh provides higher bandwidth and lower latency compared to the 1-D ring interconnect on Knights Corner. The mesh enforces an XY routing rule.

Knights Landing Architecture Xeon Phi has 2 types of memory: (i) MCDRAM (Multi-Channel DRAM) and (ii) DDR. MCDRAM is a high-bandwidth memory integrated on the package; there are 8 devices of 2 GB each. MCDRAM can be configured at boot time into one of three modes: cache mode (MCDRAM is a cache for DDR), flat mode (MCDRAM is standard memory in the same address space as DDR; see the allocation sketch below), and hybrid mode (a combination of the two). DDR is a high-capacity memory external to the Knights Landing package.
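In flat mode the MCDRAM shows up as a separate NUMA node, so a program can either be bound to it wholesale (e.g. with numactl --membind and the MCDRAM node number) or place individual buffers there. The sketch below assumes the memkind library's hbwmalloc interface is available; it is not part of the lecture's demo code.

#include <hbwmalloc.h>   /* memkind's high-bandwidth memory allocator */
#include <stdlib.h>

/* Allocate n floats in MCDRAM when it is available, otherwise fall back to DDR. */
float *alloc_preferably_hbw(size_t n)
{
    if (hbw_check_available() == 0)                      /* 0 means MCDRAM is present */
        return (float *) hbw_malloc(n * sizeof(float));  /* freed later with hbw_free() */
    return (float *) malloc(n * sizeof(float));          /* freed later with free() */
}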

Vectorization Each core has two VPUs (Vector Processing Units); they are the heart of the computation. They process all floating-point computations using SSE, AVX, AVX2, ..., AVX-512. Each core can thus execute two 512-bit vector multiply-add instructions per cycle, i.e. compute 32 double-precision or 64 single-precision floating-point operations in each cycle.

Knights Corner vs. Knights Landing
Product Name                           | Intel Xeon Phi Coprocessor 7120X | Intel Xeon Phi Processor 7290F
Code Name                              | Knights Corner                   | Knights Landing
Lithography                            | 22 nm                            | 14 nm
Recommended Customer Price             | N/A                              | $6401.00
# of Cores                             | 61                               | 72
Processor Base Frequency               | 1.24 GHz                         | 1.50 GHz
Cache                                  | 30.5 MB L2                       | 36 MB L2
TDP                                    | 300 W                            | 260 W
Max Memory Size (depends on memory type) | 16 GB                          | 384 GB
Max Memory Bandwidth                   | 352 GB/s                         | 490 GB/s

Offloading Choose highly parallel sections of code to run on the coprocessor; serial code offloaded to the coprocessor will run much slower than on the CPU. Intel provides a set of compiler directives named LEO (Language Extensions for Offload). These directives manage the transfer of data and the execution of portions of code on the device.

#include <omp.h>

#pragma offload target(mic:0)
{
    // a highly parallel section
}

Offloading The host and the device do not share memory, therefore variables must be copied either explicitly or implicitly. Implicit copy is assumed for scalar variables and static arrays. Explicit copy must be managed by the programmer using clauses defined in LEO. Programmer clauses for explicit copy: in, out, inout, nocopy.

#pragma offload target(mic) in(data:length(size))

Offloading

void f()
{
    int x = 55;
    int y[10] = { 0,1,2,3,4,5,6,7,8,9 };

    // x, y sent from CPU
    // To use values computed into y by this offload in the next offload,
    // y is brought back to the CPU
    #pragma offload target(mic:0) in(x) inout(y)
    {
        y[5] = 66;
    }

    // The assignment to x on the CPU
    // is independent of the value of x on the coprocessor
    x = 30;

    // Reuse of x from the previous offload is possible using nocopy
    // However, array y needs to be sent again from the CPU
    #pragma offload target(mic:0) nocopy(x) in(y)
    {
        ... = y[5]; // Has value 66
        ... = x;    // x has value 55 from the first offload
    }
}

Xeon Phi Programming - Demo Simple matrix equation (A = a*A + B). [2] James Jeffers and James Reinders, Intel Xeon Phi Coprocessor High-Performance Programming, Morgan Kaufmann, 2013.

Simple matrix equation

int main(int argc, char* argv[])
{
    high_resolution_clock::time_point start = high_resolution_clock::now();
    uint64_t numThreads = 1;

    #pragma offload target(mic:0)
    {
        init();
        numThreads = omp_get_max_threads();
        #pragma omp parallel
        {
            kernel();
        }
    }

    double totalDuration =
        duration_cast<duration<double>>(high_resolution_clock::now() - start).count();

    return 0;
}

Simple matrix equation

#define IMCI_ALIGNMENT 64
#define ARR_SIZE (1024*1024)
#define ITERS_PER_LOOP 128
#define LOOP_COUNT (64*1024*1024)

#pragma omp declare target
float a;
float fa[ARR_SIZE] __attribute__((aligned(IMCI_ALIGNMENT)));
float fb[ARR_SIZE] __attribute__((aligned(IMCI_ALIGNMENT)));

inline void init()
{
    a = 1.1f;
    #pragma omp simd aligned(fa, fb: IMCI_ALIGNMENT)
    for (uint64_t i = 0; i < ARR_SIZE; ++i)
    {
        fa[i] = ((float) i) + 0.1f;
        fb[i] = ((float) i) + 0.2f;
    }
}

Simple matrix equation

inline void kernel()
{
    uint64_t offset = omp_get_thread_num() * ITERS_PER_LOOP;
    for (uint64_t j = 0; j < LOOP_COUNT; ++j)
    {
        #pragma omp simd aligned(fa, fb: IMCI_ALIGNMENT)
        for (uint64_t k = 0; k < ITERS_PER_LOOP; ++k)
            fa[k+offset] = a*fa[k+offset] + fb[k+offset];
    }
}
#pragma omp end declare target
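For reference (not from the slides, and only a plausible sketch): with the Intel C++ compiler, which understands both the OpenMP and the LEO offload pragmas, building and running this demo could look like

    icpc -qopenmp -O3 matrix_demo.cpp -o matrix_demo
    ./matrix_demo

where matrix_demo.cpp is an illustrative file name for the code above.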

References
[1] David M. Koppelman, GPU Microarchitecture, lecture notes, Louisiana State University, 2017.
[2] James Jeffers and James Reinders, Intel Xeon Phi Coprocessor High-Performance Programming, Morgan Kaufmann, 2013.
[3] NVIDIA, CUDA Toolkit Documentation v10.0, 2018. (http://docs.nvidia.com/cuda/index.html)
[4] Avinash Sodani, Knights Landing (KNL): 2nd Generation Intel Xeon Phi Processor, Intel, 2016.
[5] James Jeffers, James Reinders, and Avinash Sodani, Intel Xeon Phi Processor High Performance Programming, 2nd Edition, Morgan Kaufmann, 2016.

[6] Kevin, D., Effective Use of the Intel Compiler's Offload Features, https://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features, 2014.
[7] Fabio Affinito, Introduction to Intel Xeon Phi programming, CINECA, 2017.