Hardware accelerators: an introduction to modern HPC. CISM/CÉCI HPC training session Fall 2014


1 Hardware accelerators: an introduction to modern HPC. CISM/CÉCI HPC training session Fall 2014

2 More and more accelerators in Top500 fastest clusters

3 More and more accelerators in Top500 fastest clusters What are they? Should I care? Where do I start?

4 IBM PC XT (1982)

5

6 Intel 8087 math co-processor: a hardware accelerator for floating-point arithmetic

7

8 DSP Digital Signal Processor GPU Graphic Processing Unit FPGA Field-Programmable Gate Array

9 Hardware accelerators: offload some computational burden from the central processing unit to dedicated hardware.

10 Accelerators in the HPC world

11

12

13 "The beginning of wisdom is the definition of terms." *

Category                                         | Intel                                  | nvidia
Product series name                              | Xeon Phi                               | Tesla
Microprocessor architecture                      | MIC (Many Integrated Core Architecture)| Kepler
Shippable product (SKU)                          | 5110P                                  | K20X
Chip code name                                   | Knights Corner                         | GK110
Set of software (drivers, kernel modules, etc.)  | MPSS (Manycore Platform Software Stack)| CUDA toolkit

* Socrates (c. 470-399 B.C.)

14 Hardware accelerators: What is so appealing about them?

15 Xeon E5 (Sandy Bridge): 8 cores, Gflops, 135 W, RAM 80 GB/s

16 Xeon E5 (Sandy Bridge): 8 cores, Gflops, 135 W, RAM 80 GB/s
Xeon Phi 5110P: 60 cores, Gflops, 230 W, RAM 320 GB/s
Tesla K20X: 2688 cores, Gflops, 230 W, RAM 250 GB/s

17 Hardware accelerators: What is so appealing about them? Large number of cores. High memory throughput. Low power consumption. Small physical footprint.

18 Xeon E5 (Sandy Bridge): 8 cores, Gflops, 135 W, RAM 80 GB/s
Xeon Phi 5110P: 60 cores, Gflops, 230 W, RAM 320 GB/s
Tesla K20X: 2688 cores, Gflops, 230 W, RAM 250 GB/s

19 Sandy Bridge core

20 MIC core

21 GPU core

22 "Every action has its pleasures and its price." * Hardware accelerators: they can bring you more speed, but they are different from CPUs. To fully benefit from them: you must understand them, and you must adapt to them. * Socrates (c. 470-399 B.C.)

23 Hardware accelerators: GPU computing on nvidia Tesla. CISM/CÉCI HPC training session Fall 2013

24 Hardware accelerators: GPU computing on nvidia Tesla. 1. GPU hardware CISM/CÉCI HPC training session Fall 2013

25

26

27

28

29

30

31 Hardware accelerators: GPU computing on nvidia Tesla. 2. Programming model CISM/CÉCI HPC training session Fall 2013

32 [Diagram, garbled in transcription: overview of GPU programming platforms — DirectCompute (Microsoft), OpenMP and OpenACC (compiler directives), OpenCL (open platform, supported on lots of platforms), and CUDA, the main topic here]

33 OpenMP vs. CUDA

OpenMP: multithread SPMD — one thread per CPU core; few threads handle many data elements.
CUDA: multithread SIMD — one thread per data element; many threads handle few data elements.
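To make the contrast concrete, here is a minimal sketch (not from the slides; names illustrative) of the same vector addition in both models:

// OpenMP (SPMD): a few threads, each loops over many elements
void vecAddCPU(const float *a, const float *b, float *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// CUDA (SIMD-style): many threads, each handles one element
__global__ void vecAddGPU(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the grid may be slightly larger than n
        c[i] = a[i] + b[i];
}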

34 Threads are designed to be logically organized the way the data is organized: e.g. if the data is a matrix, then think of a 2D array of threads, indexed by threadIdx.x (i) and threadIdx.y (j). Threads are furthermore logically organized in blocks.
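A minimal sketch of that 2D mapping (an illustration, assuming a Width x Width matrix small enough for a single block):

__global__ void incrementMatrix(float *m, int Width)
{
    int i = threadIdx.x;           // column index, as in the slide's sketch
    int j = threadIdx.y;           // row index
    m[j * Width + i] += 1.0f;      // one thread per matrix element
}

// launch with a 2D block shaped like the data:
//   dim3 block(Width, Width);    // Width*Width must stay below the per-block thread limit
//   incrementMatrix<<<1, block>>>(d_m, Width);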

35

36 CPU and GPU have distinct memory modules (GPU: thousands of cores; CPU: 4-16 cores)
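Because the memories are separate, data must be copied explicitly between them; a minimal sketch of the canonical allocate/copy/compute/copy-back pattern (error checking omitted, N arbitrary):

void roundTrip(void)
{
    const int N = 1 << 20;
    size_t size = N * sizeof(float);
    float *h_a = (float *)malloc(size);                  // host (CPU) memory
    float *d_a;
    cudaMalloc(&d_a, size);                              // device (GPU) memory
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);  // CPU -> GPU over PCIe
    /* ... launch kernels operating on d_a ... */
    cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);  // GPU -> CPU
    cudaFree(d_a);
    free(h_a);
}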

37

38 New in CUDA 6

39 New in CUDA 6
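The CUDA 6 feature shown here is most likely Unified Memory (an assumption; the slides themselves carry no text). A minimal sketch, using the real cudaMallocManaged call and a hypothetical kernel named scale:

__global__ void scale(float *a, int n)            // hypothetical kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

void demo(int n)
{
    float *a;
    cudaMallocManaged(&a, n * sizeof(float));     // one pointer, valid on CPU and GPU
    for (int i = 0; i < n; ++i) a[i] = 1.0f;      // initialize on the host, no cudaMemcpy
    scale<<<(n + 255) / 256, 256>>>(a, n);        // use the same pointer on the device
    cudaDeviceSynchronize();                      // wait before touching it on the host
    cudaFree(a);
}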

40 Matrix multiplication

on cpu:

void MatMul(double *M, double *N, double *P)
{
    #pragma omp parallel for
    for (int x = 0; x < Width; ++x) {
        for (int y = 0; y < Width; ++y) {
            int row = x;
            int col = y;
            double sp = 0.0;
            for (int k = 0; k < Width; ++k) {
                double a = M[row * Width + k];
                double b = N[k * Width + col];
                sp += a * b;
            }
            P[row * Width + col] = sp;
        }
    }
}

on gpu:

__global__ void MatMulKernel(double *Md, double *Nd, double *Pd, int Width)
{
    int row = threadIdx.x;
    int col = threadIdx.y;
    double sp = 0.0;
    for (int k = 0; k < Width; ++k) {
        double a = Md[row * Width + k];
        double b = Nd[k * Width + col];
        sp += a * b;
    }
    Pd[row * Width + col] = sp;
}

...
dim3 Grid(1, 1);             // 1 block
dim3 Block(Width, Width);    // Width^2 threads
MatMulKernel<<<Grid, Block>>>(Md, Nd, Pd, Width);

41 Data size matters. [Plot: square matrix multiplication, seconds vs. matrix size, CPU vs. GPU, single precision. CPU = Intel(R) Xeon(R) E5 (1 core) @ 2.20GHz; GPU = nvidia Tesla M series; CUDA C / GCC]

42 Data size matters. [Plot: square matrix multiplication, GFlops vs. matrix size, GPU vs. CPU, single precision; same hardware as the previous slide]

43 Computation complexity matters. [Plot: square matrix addition, seconds vs. matrix size, single precision; same CPU and GPU as before]

44 Computation complexity matters. [Plot: MatMul vs. MatAdd compared]

45 Memories

GPU memory:

Memory   | Location | Access | Scope                         | Lifetime
Register | On-chip  | R/W    | One thread                    | Thread
Local    | Off-chip | R/W    | One thread                    | Thread
Shared   | On-chip  | R/W    | All threads in a common block | Block
Global   | Off-chip | R/W    | All threads + host            | Application
Constant | Off-chip | R      | All threads + host            | Application
Texture  | Off-chip | R      | All threads + host            | Application

CPU memory:

Memory | Location | Access (from GPU) | Scope | Lifetime
RAM    | Off-card | None              | --    | Application
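In CUDA C these memory spaces map onto declaration qualifiers; a hedged sketch (names illustrative):

__constant__ float coeffs[16];          // constant memory: read-only from kernels

__global__ void spaces(float *g)        // g points into global memory
{
    int tid = threadIdx.x;              // tid lives in a register: per-thread, per-kernel
    __shared__ float tile[128];         // shared memory: per-block, on-chip
    if (tid < 128) {
        tile[tid] = g[tid] * coeffs[0];
        __syncthreads();                // make the tile visible to the whole block
        g[tid] = tile[127 - tid];
    }
    // large per-thread arrays would spill to (off-chip) local memory
}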

46 Memory choices matter. [Plot: square matrix multiplication, FLOPs vs. matrix size, with vs. without shared memory, single precision. GPU = nvidia Tesla M series; CUDA C / GCC]

47 Streams
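A stream is an ordered queue of GPU operations; operations in different streams may run concurrently, which is what makes the overlap on the next slide possible. A minimal sketch (not from the slides) splitting a transfer over two streams — the host buffer must be pinned for asynchronous copies:

void twoStreams(size_t bytes)
{
    cudaStream_t s[2];
    float *h, *d;
    cudaMallocHost(&h, bytes);          // pinned host memory: required for async copies
    cudaMalloc(&d, bytes);
    size_t half = bytes / 2;
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&s[i]);
        cudaMemcpyAsync((char *)d + i * half, (char *)h + i * half,
                        half, cudaMemcpyHostToDevice, s[i]);
        // a kernel launched into s[i] here would overlap with the copy in the other stream
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
    cudaFree(d);
    cudaFreeHost(h);
}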

48 Overlapping matters. [Plot: time in ms for a 128MB transfer back and forth, without vs. with 4 streams. GPU = nvidia Tesla M series; CUDA C / GCC]

49 Hardware accelerators: GPU computing on nvidia Tesla. 3. GPU toolkit CISM/CÉCI HPC training session Fall 2013

50 CUDA dev tools

51 CUDA math libraries

52

53

54 cublas drop-in replacement example: Octave

Compile the nvidia-provided fortran_thunking.c into a library:
$ mkdir cublas_thunking && cd $_
$ g++ -c -DCUBLAS_GFORTRAN -o fortran_thunking.o $CUDAROOT/src/fortran_thunking.c
$ ar rvs libcublas_thunking.a fortran_thunking.o

Configure with links to cublas:
$ ./configure LDFLAGS="-lcublas -lcudart -L$(pwd)/cublas_thunking -lcublas_thunking"

Patch the Octave code to call cublas_dgemm rather than blas' dgemm:
$ sed -i 's/dgemm/cublas_dgemm/g' liboctave/dMatrix.cc

Compile:
$ make
$ make install

New in CUDA 6: nvblas, a pure drop-in replacement.
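For reference, the routine the thunking wrapper ultimately invokes is cuBLAS's DGEMM; a hedged sketch of the direct C call (v2 API, matrices already resident on the GPU, column-major as in BLAS):

#include <cublas_v2.h>

// C = 1.0 * A * B + 0.0 * C for n x n matrices dA, dB, dC in GPU memory
void gpu_dgemm(cublasHandle_t handle, int n,
               const double *dA, const double *dB, double *dC)
{
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
}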

55 Octave benchmark code

printf('n\tseconds\tgflops/s\tmbytes\n')
for n = [2, 2.^(1:9), 768:256: :512:12288]
  a = randn(n); b = randn(n); c = zeros(n);
  tstart = time;
  c = a*b;
  tend = time;
  seconds = tend - tstart;
  gflops = 2*n^3 / seconds * 1e-9;
  mbytes = n*n*3*8/1024/1024;
  printf('%d\t%f\t%f\t%f\n', n, seconds, gflops, mbytes)
end

56 cublas drop-in replacement example: Octave. [Plot: GFlops vs. matrix size for cublas, openblas, monothreaded openblas, and reference blas. openblas and blas on Intel(R) Xeon(R) E5 CPU; cublas on nvidia Tesla M series]

57

58 & gnumpy

59

60

61

62 Hardware accelerators: MIC computing with Intel Xeon Phi. CISM/CÉCI HPC training session Fall 2013

63 Hardware accelerators: MIC computing with Intel Xeon Phi. 1. Xeon Phi hardware CISM/CÉCI HPC training session Fall 2013

64

65

66

67

68 Hardware accelerators: MIC computing with Intel Xeon Phi. 2. Programming model CISM/CÉCI HPC training session Fall 2013

69 Programming models

70 Execution models: MKL, OpenCL, offload OpenMP, offload Intel MPI, native OpenMP, native Intel MPI

71 [Table, garbled in transcription: programming models (OpenCL, OpenMP, MPI, MKL) against execution models (offload, hybrid, native; symmetric for OpenCL), with each combination rated Easy, Bit more complex, Truly complex, or Impossible]

72 Matrix multiplication

on cpu:

void MatMul(double *M, double *N, double *P)
{
    #pragma omp parallel for
    for (int x = 0; x < Width; ++x) {
        for (int y = 0; y < Width; ++y) {
            int row = x;
            int col = y;
            double sp = 0.0;
            for (int k = 0; k < Width; ++k) {
                double a = M[row * Width + k];
                double b = N[k * Width + col];
                sp += a * b;
            }
            P[row * Width + col] = sp;
        }
    }
}

on Xeon Phi (offload):

void MatMul(double *M, double *N, double *P)
{
    #pragma offload target(mic) \
        in(M:length(Width*Width)) \
        in(N:length(Width*Width)) \
        out(P:length(Width*Width))
    #pragma omp parallel for
    for (int x = 0; x < Width; ++x) {
        for (int y = 0; y < Width; ++y) {
            int row = x;
            int col = y;
            double sp = 0.0;
            for (int k = 0; k < Width; ++k) {
                double a = M[row * Width + k];
                double b = N[k * Width + col];
                sp += a * b;
            }
            P[row * Width + col] = sp;
        }
    }
}

73 Data size matters. [Plot: square matrix multiplication, seconds vs. matrix size, CPU vs. Phi, double precision. CPU = Intel(R) Xeon(R) E5 (8 cores) @ 2.20GHz; Phi = Xeon Phi 5110P; Intel compiler]

74 Data size matters. [Plot: square matrix multiplication, FLOPs vs. matrix size, Phi vs. CPU, double precision; same hardware]

75 Code vectorization matters. [Plot: square matrix multiplication, FLOPs vs. matrix size, with vs. without auto-vectorization, double precision; same hardware]

76 Multi-threading matters. [Plot: square matrix multiplication, FLOPs vs. number of hardware threads, double precision; same hardware]

77 Hardware accelerators: MIC computing with Intel Xeon Phi. 3. Xeon Phi toolkit CISM/CÉCI HPC training session Fall 2013

78 Intel dev tools

79

80 Xeon Phi libraries

81 MKL drop-in replacement example: Octave

Set up the environment to use the Intel compiler:
$ module load intel/clusterstudio
$ export CC=icc
$ export F77=ifort
$ export CFLAGS="-O3 -ipo -std=c99 -fpic -DMKL_LP64"
$ export CPPFLAGS="-I$MKLROOT/include -I$MKLROOT/include/fftw"
$ export LDFLAGS="-L$MKLROOT/lib/intel64 -L$MKLROOT/../compiler/lib/intel64"

Configure with links to MKL:
$ ./configure \
    --with-blas="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread" \
    --with-lapack="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread"

82 MKL drop-in replacement example: Octave. [Plot: GFlops vs. matrix size for Phi-MKL, MKL, monothreaded MKL, and reference blas. MKL and blas on Intel(R) Xeon(R) E5 CPU; MKL-Phi on Xeon Phi 5110P]

83 MKL drop-in replacement example: Octave. [Plot: same as the previous slide, with cublas added for comparison]

84 Matrix multiplication: on cpu and on Xeon Phi (native) the source code is identical; the same OpenMP loop nest is simply cross-compiled for the MIC target.

void MatMul(double *M, double *N, double *P)
{
    #pragma omp parallel for
    for (int x = 0; x < Width; ++x) {
        for (int y = 0; y < Width; ++y) {
            int row = x;
            int col = y;
            double sp = 0.0;
            for (int k = 0; k < Width; ++k) {
                double a = M[row * Width + k];
                double b = N[k * Width + col];
                sp += a * b;
            }
            P[row * Width + col] = sp;
        }
    }
}

85 Native OpenMP. [Plot: square matrix multiplication, FLOPs vs. matrix size, native vs. offload vs. CPU, double precision. CPU = Intel(R) Xeon(R) E5 (8 cores) @ 2.20GHz; Phi = Xeon Phi 5110P; Intel compiler]

86 Hardware accelerators: Multi and Hybrid strategies CISM/CÉCI HPC training session Fall 2013

87 cublas manual CPU/GPU workdivision

88 cublas manual CPU/GPU workdivision

89 CUDA manual multi-gpu workdivision
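Manual multi-GPU work division amounts to selecting each device in turn and handing it a slice of the data; a minimal sketch (not from the slides), with a hypothetical kernel slice_kernel:

void multiGpu(const float *h_x, int n)
{
    int ndev;
    cudaGetDeviceCount(&ndev);
    if (ndev > 8) ndev = 8;                     // cap for this sketch's fixed array
    float *d_x[8];
    int chunk = n / ndev;                       // assume ndev divides n
    for (int dev = 0; dev < ndev; ++dev) {
        cudaSetDevice(dev);                     // subsequent calls target this device
        cudaMalloc(&d_x[dev], chunk * sizeof(float));
        cudaMemcpyAsync(d_x[dev], h_x + dev * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice);
        // slice_kernel<<<(chunk + 255)/256, 256>>>(d_x[dev], chunk);  // hypothetical
    }
    for (int dev = 0; dev < ndev; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();                // wait for each device to finish
        cudaFree(d_x[dev]);
    }
}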

90 Automatic workdivision with Magma

91 Remote GPU utilization

92 MKL CPU/Phi auto-workdivision

$ export OFFLOAD_REPORT=1
$ time octave-mkl/bin/octave -q bench.m
n       Seconds  Gflops/s  Mbytes
[...]
[MKL] [MIC --] [AO Function]            DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision]
[MKL] [MIC 00] [AO DGEMM CPU Time]      seconds
[MKL] [MIC 00] [AO DGEMM MIC Time]      seconds
[MKL] [MIC --] [AO Function]            DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision]
[MKL] [MIC 00] [AO DGEMM CPU Time]      seconds
[MKL] [MIC 00] [AO DGEMM MIC Time]      seconds
[...]

93 Intel MPI CPU/Phi hybrid computing and/or multi-phi computing

94 Hardware accelerators: Summary and wrap-up CISM/CÉCI HPC training session Fall 2013

95 Xeon Phi core (1 to 1.3 GHz):
- in-order architecture, x86 + MIC extensions
- 4 hardware threads
- 1 SPU: 1 double op/cycle
- 1 VPU: 32 float op/cycle, 16 double op/cycle; supports fused mult-add and transcendentals; 4 clock latency

nvidia Kepler SMX (735 to 745 MHz):
- 192 SP CUDA cores: 2 float op/cycle, supports fused mult-add
- 64 DP units: 2 double op/cycle, supports fused mult-add
- 32 SFU units: 1 double op/cycle, supports transcendentals

96 What are they? Expensive extra hardware designed to crunch data.
Strength: parallel data processing.
Weakness: non-trivial to optimize code.
Big help: vendor software (Intel compiler, CUDA libraries).
Major issue: data transfers.

97 Should I care? Maybe.

98 Should I care? You have:
- not much data
- lots of code to develop
- a simple algorithm with a low flop-per-byte count
- an algorithm that is not easily parallelized or that requires frequent global synchronization
Clear NO.

99 Should I care? You have:
- ready-made libraries or extensions
- large data and data parallelism
- complex algorithms with a large flop-per-byte count
Clear YES.

100 GPU vs. Xeon Phi [comparison table, garbled in transcription: the two are scored against each other on processing power, ease of programming, availability of dev tools, availability of libraries, versatility, and cost of HW/SW, with '=' marking a tie and 'v' an advantage]

101 Where do I start?
Look for available CUDA libraries related to your work.
Found? Look for a CUDA-enabled gpu (in your laptop?). Found? Use it to develop, then use the big ones on the clusters.
No library found? Does lots of data-intensive code exist? No? You might want to consider writing your own kernel.
Otherwise, try the Xeon Phi if you can afford one (do you have access to the Intel compiler?).

102

103

104

105

106 MatMul*: A simple CPU version in C

void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int x = 0; x < Width; ++x) {
        for (int y = 0; y < Width; ++y) {
            int row = x;
            int col = y;
            float sp = 0.0;
            for (int k = 0; k < Width; ++k) {
                float a = M[row * Width + k];
                float b = N[k * Width + col];
                sp += a * b;
            }
            P[row * Width + col] = sp;
        }
    }
}

* a.k.a. the 'Hello world' of high performance computing

David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

107 MatMul: A simple GPU version in CUDA/C

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N to device memory
    cudaMalloc(&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc(&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc(&Pd, size);

David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

108 MatMul: Output Matrix Data Transfer

    // 2. Kernel invocation code - to be shown later

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}

David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

109 MatMul: Kernel Function

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    int row = threadIdx.x;
    int col = threadIdx.y;
    float sp = 0.0;
    for (int k = 0; k < Width; ++k) {
        float a = Md[row * Width + k];
        float b = Nd[k * Width + col];
        sp += a * b;
    }
    Pd[row * Width + col] = sp;
}

David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

110 MatMul

void MatrixMulOnHost(float* M, ...)
{
    for (int x = 0; x < Width; ++x) {
        for (int y = 0; y < Width; ++y) {
            int row = x;
            int col = y;
            float sp = 0.0;
            for (int k = 0; k < Width; ++k) {
                float a = M[row * Width + k];
                float b = N[k * Width + col];
                sp += a * b;
            }
            P[row * Width + col] = sp;
        }
    }
}

__global__ void MatrixMulKernel(float* Md, ...)
{
    int row = threadIdx.x;
    int col = threadIdx.y;
    float sp = 0.0;
    for (int k = 0; k < Width; ++k) {
        float a = Md[row * Width + k];
        float b = Nd[k * Width + col];
        sp += a * b;
    }
    Pd[row * Width + col] = sp;
}

111 MatMul: Kernel invocation

    // 1. Allocate and load M, N to device memory ...

    // 2. Kernel invocation code
    // Setup the execution configuration
    dim3 dimGrid(1, 1);            // 1 block
    dim3 dimBlock(Width, Width);   // Width^2 threads

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    // 3. Read P from the device ...

David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

112 MatMul

CPU:

void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int x = 0; x < Width; ++x)
        for (int y = 0; y < Width; ++y) {
            float sp = 0.0;
            for (int k = 0; k < Width; ++k) {
                float a = M[x * Width + k];
                float b = N[k * Width + y];
                sp += a * b;
            }
            P[x * Width + y] = sp;
        }
}

GPU:

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    ...
    dim3 dimGrid(1, 1);            // 1 block
    dim3 dimBlock(Width, Width);   // Width^2 threads
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    ...
}

113 Only One Thread Block Used

One block of threads computes matrix Pd; each thread computes one element of Pd.
Each thread loads a row of matrix Md and a column of matrix Nd, and performs one multiply and one addition for each pair of Md and Nd elements.
Compute to off-chip memory access ratio close to 1:1 (not very high).
Size of matrix limited by the number of threads allowed in a thread block.

David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

114 Matrix Multiplication Using Multiple Blocks

Break up Pd into tiles (Pdsub); each block calculates one tile.
Each thread calculates one element; block size equals tile size (TILE_WIDTH).

David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign

115 MatMul: Kernel using Multiple Blocks

    // 1. Allocate and load M, N to device memory ...

    // 2. Kernel invocation code
    // Setup the execution configuration
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    // 3. Read P from the device ...

116 MatMul: Kernel using Multiple Blocks

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element (and of M)
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd (and of N)
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float sp = 0.0;
    for (int k = 0; k < Width; ++k) {
        float a = Md[Row * Width + k];
        float b = Nd[k * Width + Col];
        sp += a * b;
    }
    Pd[Row * Width + Col] = sp;
}

117 MatMul: Kernel using Multiple Blocks

(same kernel as on the previous slide)

Md, Nd, Pd are on the GPU DIMM (far, far away, but global).
a, b, sp are local to the thread (close, but private).

118 MatMul: Use Shared Memory to reuse global memory data

Each input element is read by Width threads. Load each element into shared memory and have several threads use the local version, reducing the required memory bandwidth.

David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign

119 Shared memory Matrix Multiplication

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Declare shared memory
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;

    float sp = 0.0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Each thread loads one element of M and one of N into shared memory
        Mds[ty][tx] = Md[row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[col + (m*TILE_WIDTH + ty)*Width];
        // Make sure everyone has copied before we go on with computations
        __syncthreads();
        // Compute sp taking values from shared memory
        for (int k = 0; k < TILE_WIDTH; ++k)
            sp += Mds[ty][k] * Nds[k][tx];
        // Make sure everyone is done before the tiles are overwritten
        __syncthreads();
    }
    Pd[row*Width + col] = sp;
}

120 Shared memory Matrix Multiplication

Previous version, each thread does:
- (2 x Width) accesses to GPU DIMM
- (Width) multiplications, (Width - 1) additions

Shared memory version, each thread does:
- (2 x Width / TILE_WIDTH) accesses to GPU DIMM
- (2 x Width) accesses to shared memory
- (Width) multiplications, (Width - 1) additions

Shared memory: ~1.5 TB/s, near-zero latency. GPU DIMM: ~100 GB/s, a latency of many clock cycles. E.g. for Width = 1024 and TILE_WIDTH = 16, DIMM accesses per thread drop from 2048 to 128.

121 Native OpenMP

122 Native OpenMP

/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gcc -fopenmp hello.openmp.c

123 Offload OpenMP

124 Offload OpenMP

125 Hybrid OpenMP

126 Hybrid OpenMP

127 Native intel MPI

128 Native intel MPI

129 Hybrid intel MPI

130 Hybrid intel MPI

mpirun -n 4 -host localhost ./hello.mpi : -n 40 -host mic0 /tmp/hello.mpi.mic

131 Offload intel MPI

132 Offload intel MPI

133 Native MKL

134 Native MKL

135 Automatic offload MKL

136 Automatic offload MKL

137 Automatic offload MKL

138 Compiler-assisted Offload MKL

139 Compiler-assisted Offload MKL
