Hardware accelerators: an introduction to modern HPC. CISM/CÉCI HPC training session Fall 2014
1 Hardware accelerators: an introduction to modern HPC. CISM/CÉCI HPC training session Fall 2014
2 More and more accelerators in Top500 fastest clusters
3 More and more accelerators in Top500 fastest clusters. What are they? Should I care? Where do I start?
4 IBM PC XT (1982)
6 Intel 8087 math co-processor: hardware accelerator for floating-point arithmetic
8 DSP: Digital Signal Processor. GPU: Graphics Processing Unit. FPGA: Field-Programmable Gate Array.
9 Hardware accelerators: offload some computational burden from the central processing unit to dedicated hardware.
10 Accelerators in the HPC world
13 The beginning of wisdom is the definition of terms. *
Term | Intel | NVIDIA
Product series | Xeon Phi | Tesla
Microprocessor architecture | MIC (Many Integrated Core Architecture) | Kepler
Shippable product (SKU) | 5110P | K20X
Chip code name | Knights Corner | GK110
Set of software (drivers, kernel modules, etc.) | MPSS (Manycore Platform Software Stack) | CUDA toolkit
* Socrates
14 Hardware accelerators: What is so appealing about them?
15 Xeon E (Sandy Bridge): 8 cores, Gflops, 135W, RAM 80 GB/s
16 Xeon E (Sandy Bridge): 8 cores, Gflops, 135W, RAM 80 GB/s. Xeon Phi 5110P: 60 cores, Gflops, 230W, RAM 320 GB/s. Tesla K20X: 2688 cores, Gflops, 230W, RAM 250 GB/s
17 Hardware accelerators: What is so appealing about them? Large number of cores. Large memory throughput. Low electrical consumption. Small physical footprint.
18 Xeon E (Sandy Bridge): 8 cores, Gflops, 135W, RAM 80 GB/s. Xeon Phi 5110P: 60 cores, Gflops, 230W, RAM 320 GB/s. Tesla K20X: 2688 cores, Gflops, 230W, RAM 250 GB/s
19 Sandy Bridge core
20 MIC core
21 GPU core
22 Every action has its pleasures and its price. * Hardware accelerators: they can bring you more speed, but they are different from CPUs. To fully benefit from them, you must understand them and you must adapt to them. * Socrates
23 Hardware accelerators: GPU computing on NVIDIA Tesla. CISM/CÉCI HPC training session Fall 2013
24 Hardware accelerators: GPU computing on nvidia Tesla. 1. GPU hardware CISM/CÉCI HPC training session Fall 2013
31 Hardware accelerators: GPU computing on nvidia Tesla. 2. Programming model CISM/CÉCI HPC training session Fall 2013
32 [Figure: programming options for GPUs: DirectCompute (Microsoft product, game-oriented), OpenCL, OpenACC, OpenMP 4.0, and CUDA (main topic here)]
33 OpenMP: multithread SPMD; one thread per CPU core; few threads handle many data elements. CUDA: multithread SIMD; one thread per data element; many threads handle few data elements.
34 Threads are designed to be logically organized like the data are organized, e.g. if the data is a matrix, think of a 2D array of threads, with index i given by threadIdx.x and index j by threadIdx.y. Threads are furthermore logically organized in blocks.
36 CPU and GPU have distinct memory modules cores 4-16 cores
38 New in CUDA 6
39 New in CUDA 6
40 Matrix multiplication
on cpu
    void MatMul(double *M, double *N, double *P, int Width)
    {
        #pragma omp parallel for
        for (int x = 0; x < Width; ++x) {
            for (int y = 0; y < Width; ++y) {
                int row = x;
                int col = y;
                double sp = 0.0;
                for (int k = 0; k < Width; ++k) {
                    double a = M[row * Width + k];
                    double b = N[k * Width + col];
                    sp += a * b;
                }
                P[row * Width + col] = sp;
            }
        }
    }
on gpu
    __global__ void MatMulKernel(double *Md, double *Nd, double *Pd, int Width)
    {
        int row = threadIdx.x;
        int col = threadIdx.y;
        double sp = 0.0;
        for (int k = 0; k < Width; ++k) {
            double a = Md[row * Width + k];
            double b = Nd[k * Width + col];
            sp += a * b;
        }
        Pd[row * Width + col] = sp;
    }
    ...
    dim3 Grid(1, 1);           // 1 block
    dim3 Block(Width, Width);  // Width^2 threads
    MatMulKernel<<<Grid, Block>>>(Md, Nd, Pd, Width);
41 Data size matters. [Plot: square matrix multiplication, SP, time in seconds vs. matrix size, CPU vs. GPU.] CPU = Intel(R) Xeon(R) CPU E (1 2.20GHz; GPU = nvidia M GHz; CUDA C/GCC
42 Data size matters. [Plot: square matrix multiplication, SP, GFlops vs. matrix size, CPU vs. GPU.] CPU = Intel(R) Xeon(R) CPU E (1 2.20GHz; GPU = nvidia M GHz; CUDA C/GCC
43 Computation complexity matters. [Plot: square matrix addition, SP, time in seconds vs. matrix size.] CPU = Intel(R) Xeon(R) CPU E (1 2.20GHz; GPU = nvidia M GHz; CUDA C/GCC
44 Computation complexity matters MatMul MatAdd
45 Memories
GPU memory
Memory | Location | Access | Scope | Lifetime
Register | On-chip | R/W | One thread | Thread
Local | Off-chip | R/W | One thread | Thread
Shared | On-chip | R/W | All threads in a common block | Block
Global | Off-chip | R/W | All threads + host | Application
Constant | Off-chip | R | All threads + host | Application
Texture | Off-chip | R | All threads + host | Application
CPU memory
RAM | Off-card | None | Application
46 Memory choices matter. [Plot: square matrix multiplication, SP, FLOPs vs. matrix size, with and without shared memory.] GPU = nvidia M GHz; CUDA C/GCC
47 Streams
48 Overlapping matters. [Plot: time in ms for a 128MB transfer back and forth, 4 streams.] GPU = nvidia M GHz; CUDA C/GCC
49 Hardware accelerators: GPU computing on nvidia Tesla. 3. GPU toolkit CISM/CÉCI HPC training session Fall 2013
50 CUDA dev tools
51 CUDA math libraries
54 cublas drop-in replacement example: Octave
Compile nvidia-provided fortran_thunking.c into a library
$ mkdir cublas_thinking && cd $_
$ g++ -c -DCUBLAS_GFORTRAN -o fortran_thunking.o $CUDAROOT/src/fortran_thunking.c
$ ar rvs libcublas_thinking.a fortran_thunking.o
Configure with links to cublas
$ ./configure LDFLAGS=" -lcublas -lcudart -L$(pwd)/cublas_thinking -lcublas_thinking"
Patch Octave code to call cublas_dgemm rather than blas' dgemm
$ sed -i 's/dgemm/cublas_dgemm/g' liboctave/dMatrix.cc
Compile
$ make
$ make install
New in CUDA 6: nvblas, a pure drop-in
55 Octave benchmark code
printf('n\tseconds\tgflops/s\tmbytes\n')
for n = [2, 2.^(1:9), 768:256: :512:12288]
  a = randn(n); b = randn(n); c = zeros(n);
  tstart = time;
  c = a*b;
  tend = time;
  seconds = tend - tstart;
  gflops = 2*n^3 / seconds * 1e-9;
  mbytes = n*n*3*8/1024/1024;
  printf('%d\t%f\t%f\t%f\n', n, seconds, gflops, mbytes)
end
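The bookkeeping done by the benchmark can be restated in C as a small sketch. It assumes, like the Octave script, that an n x n product costs 2*n^3 floating-point operations and that three matrices of 8-byte doubles are held in memory. The function names are illustrative only and belong to neither Octave nor cublas.

```c
#include <assert.h>

/* Gflop/s achieved by an n x n matrix product that took `seconds`,
   using the 2*n^3 operation count from the Octave script. */
double matmul_gflops(double n, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * n * n * n / seconds * 1e-9;
}

/* Footprint, in units of 1024*1024 bytes, of three n x n matrices
   of 8-byte doubles (a, b and c in the script). */
double matmul_mbytes(double n)
{
    return n * n * 3.0 * 8.0 / 1024.0 / 1024.0;
}
```

For instance, a 1000 x 1000 product measured at 2 seconds corresponds to about 1 Gflop/s, and three 1024 x 1024 matrices occupy 24 of these megabyte units.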
56 cublas drop-in replacement example: Octave. [Plot: GFlops for cublas, openblas, openblas monothreaded, and blas.] openblas and blas on Intel(R) Xeon(R) CPU E GHz; cublas on nvidia M GHz
58 & gnumpy
62 Hardware accelerators: MIC computing with Intel Compiler on CISM/CÉCI HPC training session Fall 2013
63 Hardware accelerators: MIC computing with Intel Xeon Phi. 1. Xeon Phi hardware CISM/CÉCI HPC training session Fall 2013
68 Hardware accelerators: MIC computing with Intel Xeon Phi. 2. Programming model CISM/CÉCI HPC training session Fall 2013
69 Programming models
70 Execution models MKL OpenCL Offload OpenMP Offload MPI Intel MPI Native OpenMP Native Intel MPI
71 OpenCL (symmetric) Execution models Offload Hybrid Native OpenMP Programming models MPI MKL Easy Bit more complex Truly complex Impossible
72 Matrix multiplication
on cpu
    void MatMul(double *M, double *N, double *P, int Width)
    {
        #pragma omp parallel for
        for (int x = 0; x < Width; ++x) {
            for (int y = 0; y < Width; ++y) {
                int row = x;
                int col = y;
                double sp = 0.0;
                for (int k = 0; k < Width; ++k) {
                    double a = M[row * Width + k];
                    double b = N[k * Width + col];
                    sp += a * b;
                }
                P[row * Width + col] = sp;
            }
        }
    }
on Xeon Phi (offload)
    void MatMul(double *M, double *N, double *P, int Width)
    {
        #pragma offload target(mic) \
            in(M:length(Width*Width)) \
            in(N:length(Width*Width)) \
            out(P:length(Width*Width))
        #pragma omp parallel for
        for (int x = 0; x < Width; ++x) {
            for (int y = 0; y < Width; ++y) {
                int row = x;
                int col = y;
                double sp = 0.0;
                for (int k = 0; k < Width; ++k) {
                    double a = M[row * Width + k];
                    double b = N[k * Width + col];
                    sp += a * b;
                }
                P[row * Width + col] = sp;
            }
        }
    }
73 Data size matters. [Plot: square matrix multiplication, DP, time in seconds vs. matrix size, CPU vs. Phi.] CPU = Intel(R) Xeon(R) CPU E (8 2.20GHz; Phi = Xeon Phi 5110P GHz; Intel compiler
74 Data size matters. [Plot: square matrix multiplication, DP, FLOPs vs. matrix size, CPU vs. Phi.] CPU = Intel(R) Xeon(R) CPU E (8 2.20GHz; Phi = Xeon Phi 5110P GHz; Intel compiler
75 Code vectorization matters. [Plot: square matrix multiplication, DP, FLOPs vs. matrix size, with and without auto-vectorization.] CPU = Intel(R) Xeon(R) CPU E (8 2.20GHz; Phi = Xeon Phi 5110P GHz; Intel compiler
76 Multi-threading matters. [Plot: square matrix multiplication, DP, FLOPs vs. number of hardware threads.] CPU = Intel(R) Xeon(R) CPU E (8 2.20GHz; Phi = Xeon Phi 5110P GHz; Intel compiler
77 Hardware accelerators: MIC computing with Intel Xeon Phi. 3. Xeon Phi toolkit CISM/CÉCI HPC training session Fall 2013
78 Intel dev tools
80 Xeon Phi libraries
81 MKL drop-in replacement example: Octave
Set up environment to use Intel compiler
$ module load intel/clusterstudio
$ export CC=icc
$ export F77=ifort
$ export CFLAGS="-O3 -ipo- -std=c99 -fpic -DMKL_LP64"
$ export CPPFLAGS="-I$MKLROOT/include -I$MKLROOT/include/fftw"
$ export LDFLAGS="-L$MKLROOT/lib/intel64 -L$MKLROOT/../compiler/lib/intel64"
Configure with links to mkl
$ ./configure --with-blas="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread" --with-lapack="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread"
82 MKL drop-in replacement example: Octave. [Plot: GFlops for Phi-MKL, MKL, MKL monothreaded, and blas.] MKL and blas on Intel(R) Xeon(R) CPU E GHz; MKL-Phi on Xeon Phi 5110P GHz
83 MKL drop-in replacement example: Octave. [Plot: GFlops for cublas, Phi-MKL, MKL, MKL monothreaded, and blas.] MKL and blas on Intel(R) Xeon(R) CPU E GHz; MKL-Phi on Xeon Phi 5110P GHz
84 Matrix multiplication
on cpu and on Xeon Phi (native): the source code is the same
    void MatMul(double *M, double *N, double *P, int Width)
    {
        #pragma omp parallel for
        for (int x = 0; x < Width; ++x) {
            for (int y = 0; y < Width; ++y) {
                int row = x;
                int col = y;
                double sp = 0.0;
                for (int k = 0; k < Width; ++k) {
                    double a = M[row * Width + k];
                    double b = N[k * Width + col];
                    sp += a * b;
                }
                P[row * Width + col] = sp;
            }
        }
    }
85 Native OpenMP. [Plot: square matrix multiplication, DP, FLOPs vs. matrix size, Native vs. Offload vs. CPU.] CPU = Intel(R) Xeon(R) CPU E (8 2.20GHz; Phi = Xeon Phi 5110P GHz; Intel compiler
86 Hardware accelerators: Multi and Hybrid strategies CISM/CÉCI HPC training session Fall 2013
87 cublas manual CPU/GPU workdivision
88 cublas manual CPU/GPU workdivision
89 CUDA manual multi-gpu workdivision
90 Automatic workdivision with Magma
91 Remote GPU utilization
92 MKL CPU/Phi auto-workdivision
$ export OFFLOAD_REPORT=1
$ time octave mkl/bin/octave -q bench.m
n Seconds Gflops/s Mbytes
[...]
[MKL] [MIC --] [AO Function] DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision]
[MKL] [MIC 00] [AO DGEMM CPU Time] seconds
[MKL] [MIC 00] [AO DGEMM MIC Time] seconds
[MKL] [MIC --] [AO Function] DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision]
[MKL] [MIC 00] [AO DGEMM CPU Time] seconds
[MKL] [MIC 00] [AO DGEMM MIC Time] seconds
[...]
93 Intel MPI CPU/Phi hybrid computing and/or multi-phi computing
94 Hardware accelerators: Summary and wrap-up CISM/CÉCI HPC training session Fall 2013
95 Xeon Phi core: 1 to 1.3 GHz; in-order architecture; x86 + MIC extensions; 4 hardware threads. 1 SPU: 1 double op/cycle. 1 VPU: 32 float op/cycle, 16 double op/cycle; supports fused mult-add; supports transcendentals; 4 clock latency.
nvidia Kepler SMX: 735 to 745 MHz. 192 SP CUDA cores: 2 double op/cycle; supports fused mult-add. 64 DP units: 2 double op/cycle; supports fused mult-add. 32 SFU units: 1 double op/cycle; supports transcendentals.
96 What are they? Expensive extra hardware designed to crunch data. Strength: parallel data processing. Weakness: non-trivial to optimize code. Big help: vendor software (Intel compiler, CUDA libraries). Major issue: data transfers.
97 Should I care? Maybe.
98 Should I care? You have: not much data; lots of code to develop; a simple algorithm with a low flop-per-byte count; an algorithm that is not easily parallelized or that requires frequent global synchronization. Clear NO.
99 Should I care? You have: ready-made libraries or extensions; large data and data parallelism; complex algorithms with a large flop-per-byte count. Clear YES.
100 [Comparison chart: processing power, ease of programming, availability of dev tools, availability of libraries, versatility, cost of HW/SW]
101 Where do I start? Look for available CUDA libraries related to your work. Found? Look for a CUDA-enabled GPU (in your laptop?). Found? Use it to develop, then use the big ones on the clusters. No library found? Does lots of data-intensive code exist? No? You might want to consider writing your own kernel. Otherwise, try Xeon Phi if you can afford one (do you have access to the Intel compiler?).
106 MatMul*: A simple CPU version in C
    void MatrixMulOnHost(float *M, float *N, float *P, int Width)
    {
        for (int x = 0; x < Width; ++x) {
            for (int y = 0; y < Width; ++y) {
                int row = x;
                int col = y;
                float sp = 0.0;
                for (int k = 0; k < Width; ++k) {
                    float a = M[row * Width + k];
                    float b = N[k * Width + col];
                    sp += a * b;
                }
                P[row * Width + col] = sp;
            }
        }
    }
* a.k.a. the 'Hello world' of high performance computing
David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
107 MatMul: A simple GPU version in CUDA/C
    void MatrixMulOnDevice(float *M, float *N, float *P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;

        // 1. Allocate and load M, N to device memory
        cudaMalloc(&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc(&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

        // Allocate P on the device
        cudaMalloc(&Pd, size);
108 MatMul: Output Matrix Data Transfer
        // 2. Kernel invocation code to be shown later

        // 3. Read P from the device
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

        // Free device matrices
        cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }
109 MatMul: Kernel Function
    __global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width)
    {
        int row = threadIdx.x;
        int col = threadIdx.y;
        float sp = 0.0;
        for (int k = 0; k < Width; ++k) {
            float a = Md[row * Width + k];
            float b = Nd[k * Width + col];
            sp += a * b;
        }
        Pd[row * Width + col] = sp;
    }
110 MatMul
    void MatrixMulOnHost(float *M, ...)
    {
        for (int x = 0; x < Width; ++x) {
            for (int y = 0; y < Width; ++y) {
                int row = x;
                int col = y;
                float sp = 0.0;
                for (int k = 0; k < Width; ++k) {
                    float a = M[row * Width + k];
                    float b = N[k * Width + col];
                    sp += a * b;
                }
                P[row * Width + col] = sp;
            }
        }
    }
    __global__ void MatrixMulKernel(float *Md, ...)
    {
        int row = threadIdx.x;
        int col = threadIdx.y;
        float sp = 0.0;
        for (int k = 0; k < Width; ++k) {
            float a = Md[row * Width + k];
            float b = Nd[k * Width + col];
            sp += a * b;
        }
        Pd[row * Width + col] = sp;
    }
111 MatMul: Kernel invocation
    // 1. Allocate and load M, N to device memory
    // 2. Kernel invocation code
    // Setup the execution configuration
    dim3 dimGrid(1, 1);           // 1 block
    dim3 dimBlock(Width, Width);  // Width^2 threads
    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    // 3. Read P from the device ...
112 MatMul
CPU
    void MatrixMulOnHost(float *M, float *N, float *P, int Width)
    {
        for (int x = 0; x < Width; ++x)
            for (int y = 0; y < Width; ++y) {
                float sp = 0.0;
                for (int k = 0; k < Width; ++k) {
                    float a = M[x * Width + k];
                    float b = N[k * Width + y];
                    sp += a * b;
                }
                P[x * Width + y] = sp;
            }
    }
GPU
    void MatrixMulOnDevice(float *M, float *N, float *P, int Width)
    {
        ...
        dim3 dimGrid(1, 1);           // 1 block
        dim3 dimBlock(Width, Width);  // Width^2 threads
        MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
        ...
    }
113 Only One Thread Block Used. One block of threads computes matrix Pd; each thread computes one element of Pd. Each thread loads a row of matrix Md, loads a column of matrix Nd, and performs one multiply and one addition for each pair of Md and Nd elements. The compute to off-chip memory access ratio is close to 1:1 (not very high). The size of the matrix is limited by the number of threads allowed in a thread block.
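The size limit stated on this slide can be made concrete with a short C sketch. With one thread per element and a single Width x Width block, Width*Width must not exceed the per-block thread limit of the device. The limit is passed in as a parameter because it depends on the hardware (1024 on many NVIDIA GPUs, which should be checked against the device properties); the function name is hypothetical.

```c
#include <assert.h>

/* Largest Width such that a Width x Width thread block fits within
   max_threads_per_block (a device-specific limit given by the caller). */
int max_single_block_width(int max_threads_per_block)
{
    assert(max_threads_per_block >= 1);
    int width = 1;
    while ((width + 1) * (width + 1) <= max_threads_per_block)
        ++width;
    return width;
}
```

With a limit of 1024 threads per block, the single-block kernel cannot handle matrices larger than 32 x 32, which is why the following slides split the work over several blocks.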
114 Matrix Multiplication Using Multiple Blocks. Break up Pd into tiles. Each block calculates one tile; each thread calculates one element. Block size equals tile size. [Figure: Md, Nd and Pd of size WIDTH x WIDTH, with tile Pdsub of size TILE_WIDTH x TILE_WIDTH, indexed by block (bx, by) and thread (tx, ty)]
115 MatMul: Kernel using Multiple Blocks
    // 1. Allocate and load M, N to device memory
    // 2. Kernel invocation code
    // Setup the execution configuration
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    // 3. Read P from the device ...
116 MatMul: Kernel using Multiple Blocks
    __global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width)
    {
        // Calculate the row index of the Pd element and M
        int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        // Calculate the column index of Pd and N
        int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;
        float sp = 0.0;
        for (int k = 0; k < Width; ++k) {
            float a = Md[Row * Width + k];
            float b = Nd[k * Width + Col];
            sp += a * b;
        }
        Pd[Row * Width + Col] = sp;
    }
117 MatMul: Kernel using Multiple Blocks (same kernel as on slide 116). Md, Nd, Pd are on GPU DIMM (far far away but global). a, b, sp are local to the thread (close but private).
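The index arithmetic of the multi-block kernel can be checked on the CPU with a minimal C sketch that replaces the GPU grid with four nested loops. It assumes, as the kernel invocation on slide 115 implicitly does, that Width is a multiple of TILE_WIDTH. The function name is illustrative and the code emulates only the indexing, not the multiplication.

```c
#include <assert.h>
#include <stdlib.h>

/* Emulates the grid/block index arithmetic of the multi-block kernel
   with plain loops. Returns 1 if every element of a width x width
   matrix is visited exactly once, 0 otherwise. */
int tiles_cover_matrix_once(int width, int tile_width)
{
    assert(width > 0 && tile_width > 0 && width % tile_width == 0);
    int blocks = width / tile_width;   /* the grid is blocks x blocks */
    int *visits = calloc((size_t)width * width, sizeof *visits);
    if (visits == NULL)
        return 0;

    for (int by = 0; by < blocks; ++by)              /* blockIdx.y  */
        for (int bx = 0; bx < blocks; ++bx)          /* blockIdx.x  */
            for (int ty = 0; ty < tile_width; ++ty)  /* threadIdx.y */
                for (int tx = 0; tx < tile_width; ++tx) {  /* threadIdx.x */
                    int row = by * tile_width + ty;
                    int col = bx * tile_width + tx;
                    visits[row * width + col] += 1;
                }

    int ok = 1;
    for (int i = 0; i < width * width; ++i)
        if (visits[i] != 1)
            ok = 0;
    free(visits);
    return ok;
}
```

When Width is not a multiple of TILE_WIDTH, the integer division in the grid size drops the last partial tile; a real kernel would then need a rounded-up grid and a bounds check, which this sketch deliberately leaves out.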
118 MatMul: Use Shared Memory to reuse global memory data. Each input element is read by Width threads. Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth.
119 Shared memory Matrix Multiplication
    __global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width)
    {
        // Declare shared memory
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

        int tx = threadIdx.x;
        int ty = threadIdx.y;

        // Identify the row and column of the Pd element to work on
        int row = blockIdx.y * TILE_WIDTH + ty;
        int col = blockIdx.x * TILE_WIDTH + tx;
        float sp = 0.0;

        // Loop over the Md and Nd tiles required to compute the Pd element
        for (int m = 0; m < Width / TILE_WIDTH; ++m) {
            // Each thread loads one element of M and of N in shared memory
            Mds[ty][tx] = Md[row * Width + (m * TILE_WIDTH + tx)];
            Nds[ty][tx] = Nd[col + (m * TILE_WIDTH + ty) * Width];
            // Make sure everyone has copied before we go on with computations
            __syncthreads();
            // Compute sp taking values from shared memory
            // (the slide stops here; the lines below are the usual completion)
            for (int k = 0; k < TILE_WIDTH; ++k)
                sp += Mds[ty][k] * Nds[k][tx];
            // Make sure everyone is done before the tiles are overwritten
            __syncthreads();
        }
        Pd[row * Width + col] = sp;
    }
120 Shared memory Matrix Multiplication
Previous version, each thread does: (2 x Width) accesses to GPU DIMM; (Width) multiplications; (Width-1) additions.
Shared memory version, each thread does: (2 x Width / TILE_WIDTH) accesses to GPU DIMM; (2 x Width) accesses to shared memory; (Width) multiplications; (Width-1) additions.
Shared memory: ~1.5Tb/s, near zero latency. GPU DIMM: ~100Gb/s, clock cycles latency.
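The access counts on this slide can be written down as two small C functions, as a sketch for comparing the two kernels at a given size. It assumes that Width is a multiple of TILE_WIDTH; the function names are hypothetical.

```c
#include <assert.h>

/* GPU DIMM (global memory) loads per thread in the previous version:
   one Md and one Nd element per iteration of the k loop. */
int global_loads_naive(int width)
{
    return 2 * width;
}

/* GPU DIMM loads per thread in the shared memory version: one Md and
   one Nd element per tile, with width / tile_width tiles to process. */
int global_loads_tiled(int width, int tile_width)
{
    assert(tile_width > 0 && width % tile_width == 0);
    return 2 * (width / tile_width);
}
```

For Width = 1024 and TILE_WIDTH = 16, each thread goes from 2048 to 128 accesses to GPU DIMM, a reduction by a factor of TILE_WIDTH; the remaining accesses are served by shared memory.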
121 Native OpenMP
122 Native OpenMP /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gcc -fopenmp hello.openmp.c
123 Offload OpenMP
124 Offload OpenMP
125 Hybrid OpenMP
126 Hybrid OpenMP
127 Native intel MPI
128 Native intel MPI
129 Hybrid intel MPI
130 Hybrid intel MPI mpirun -n 4 -host localhost ./hello.mpi : -n 40 -host mic0 /tmp/hello.mpi.mic
131 Offload intel MPI
132 Offload intel MPI
133 Native MKL
134 Native MKL
135 Automatic offload MKL
136 Automatic offload MKL
137 Automatic offload MKL
138 Compiler-assisted Offload MKL
139 Compiler-assisted Offload MKL
ECE 498AL Applied Parallel Programming Lecture 1: Introduction 1 Course Goals Learn how to program massively parallel processors and achieve high performance functionality and maintainability scalability
More informationA Multi-Tiered Optimization Framework for Heterogeneous Computing
A Multi-Tiered Optimization Framework for Heterogeneous Computing IEEE HPEC 2014 Alan George Professor of ECE University of Florida Herman Lam Assoc. Professor of ECE University of Florida Andrew Milluzzi
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationIntroduction to Parallel Computing with CUDA. Oswald Haan
Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries
More informationResources Current and Future Systems. Timothy H. Kaiser, Ph.D.
Resources Current and Future Systems Timothy H. Kaiser, Ph.D. tkaiser@mines.edu 1 Most likely talk to be out of date History of Top 500 Issues with building bigger machines Current and near future academic
More informationG P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G
Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty
More informationCUDA Experiences: Over-Optimization and Future HPC
CUDA Experiences: Over-Optimization and Future HPC Carl Pearson 1, Simon Garcia De Gonzalo 2 Ph.D. candidates, Electrical and Computer Engineering 1 / Computer Science 2, University of Illinois Urbana-Champaign
More informationMaster Informatics Eng.
Advanced Architectures Master Informatics Eng. 2018/19 A.J.Proença Data Parallelism 3 (GPU/CUDA, Neural Nets,...) (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2018/19 1 The
More informationCOMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers
COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu
More informationModule 2: Introduction to CUDA C. Objective
ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding
More informationCUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)
CUDA programming model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA programming model 1/23 Outline 1 CUDA qualifiers 2 CUDA Kernel Thread hierarchy Kernel, configuration
More informationGPU Computation CSCI 4239/5239 Advanced Computer Graphics Spring 2014
GPU Computation CSCI 4239/5239 Advanced Computer Graphics Spring 2014 Solutions to Parallel Processing Message Passing (distributed) MPI (library) Threads (shared memory) pthreads (library) OpenMP (compiler)
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationECE 574 Cluster Computing Lecture 15
ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements
More informationAccelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include
3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI
More informationHeterogeneous Computing and OpenCL
Heterogeneous Computing and OpenCL Hongsuk Yi (hsyi@kisti.re.kr) (Korea Institute of Science and Technology Information) Contents Overview of the Heterogeneous Computing Introduction to Intel Xeon Phi
More informationMany-core Processors Lecture 1. Instructor: Philippos Mordohai Webpage:
1 CS 677: Parallel Programming for Many-core Processors Lecture 1 Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Objectives Learn how to program
More informationMassively Parallel Architectures
Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI joel.falcou@lri.fr Bat. 490 - Bureau 104 20 janvier 2009 Motivation The CELL processor Harder,Better,Faster,Stronger
More informationCPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationIntroduction to the Xeon Phi programming model. Fabio AFFINITO, CINECA
Introduction to the Xeon Phi programming model Fabio AFFINITO, CINECA What is a Xeon Phi? MIC = Many Integrated Core architecture by Intel Other names: KNF, KNC, Xeon Phi... Not a CPU (but somewhat similar
More informationThe Stampede is Coming: A New Petascale Resource for the Open Science Community
The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation
More informationGPU Programming. Ringberg Theorie Seminar 2010
or How to tremendously accelerate your code? Michael Kraus, Christian Konz Max-Planck-Institut für Plasmaphysik, Garching Ringberg Theorie Seminar 2010 Introduction? GPU? GPUs can do more than just render
More informationCOSC 6374 Parallel Computations Introduction to CUDA
COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationSome notes on efficient computing and high performance computing environments
Some notes on efficient computing and high performance computing environments Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public
More informationECE 8823: GPU Architectures. Objectives
ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading
More informationCS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationJosef Pelikán, Jan Horáček CGG MFF UK Praha
GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationLecture 6 CSE 260 Parallel Computation (Fall 2015) Scott B. Baden. Computing with Graphical Processing Units CUDA Programming Matrix multiplication
Lecture 6 CSE 260 Parallel Computation (Fall 2015) Scott B. Baden Computing with Graphical Processing Units CUDA Programming Matrix multiplication Announcements A2 has been released: Matrix multiplication
More informationCartoon parallel architectures; CPUs and GPUs
Cartoon parallel architectures; CPUs and GPUs CSE 6230, Fall 2014 Th Sep 11! Thanks to Jee Choi (a senior PhD student) for a big assist 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~ socket 14 ~ core 14 ~ HWMT+SIMD
More informationWhat is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms
CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D
More informationarxiv: v1 [physics.comp-ph] 4 Nov 2013
arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department
More informationToday s Content. Lecture 7. Trends. Factors contributed to the growth of Beowulf class computers. Introduction. CUDA Programming CUDA (I)
Today s Content Lecture 7 CUDA (I) Introduction Trends in HPC GPGPU CUDA Programming 1 Trends Trends in High-Performance Computing HPC is never a commodity until 199 In 1990 s Performances of PCs are getting
More informationTutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers
Tutorial Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationHybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS
+ Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationThe Era of Heterogeneous Computing
The Era of Heterogeneous Computing EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model -------------------------------------------------------
More informationScientific Programming in C XIV. Parallel programming
Scientific Programming in C XIV. Parallel programming Susi Lehtola 11 December 2012 Introduction The development of microchips will soon reach the fundamental physical limits of operation quantum coherence
More informationGPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Building systems with GPUs is hard. Why? 2 Goal of
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationIntroduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series
Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationMemory concept. Grid concept, Synchronization. GPU Programming. Szénási Sándor.
Memory concept Grid concept, Synchronization GPU Programming http://cuda.nik.uni-obuda.hu Szénási Sándor szenasi.sandor@nik.uni-obuda.hu GPU Education Center of Óbuda University MEMORY CONCEPT Off-chip
More informationModule Memory and Data Locality
GPU Teaching Kit Accelerated Computing Module 4.5 - Memory and Data Locality Handling Arbitrary Matrix Sizes in Tiled Algorithms Objective To learn to handle arbitrary matrix sizes in tiled matrix multiplication
More informationIntroduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series
Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca March 13, 2014 Outline 1 Heterogeneous Computing 2 GPGPU - Overview Hardware Software
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More information