Hardware accelerators: an introduction to modern HPC. CISM/CÉCI HPC training session Fall 2014


1 Hardware accelerators: an introduction to modern HPC. CISM/CÉCI HPC training session Fall 2014

2 More and more accelerators in Top500 fastest clusters

3 More and more accelerators in Top500 fastest clusters What are they? Should I care? Where do I start?

4 IBM PC XT (1982)

5

6 Intel 8087 math co-processor: a hardware accelerator for floating-point arithmetic

7

8 DSP Digital Signal Processor GPU Graphic Processing Unit FPGA Field-Programmable Gate Array

9 Hardware accelerators: offload some computational burden from the central processing unit to dedicated hardware.

10 Accelerators in the HPC world

11

12

13 "The beginning of wisdom is the definition of terms." *

Category                                         | Intel                                  | nvidia
Product series name                              | Xeon Phi                               | Tesla
Microprocessor architecture                      | MIC (Many Integrated Core Architecture)| Kepler
Shippable product (SKU)                          | 5110P                                  | K20X
Chip code name                                   | Knights Corner                         | GK110
Set of software (drivers, kernel modules, etc.)  | MPSS (Manycore Platform Software Stack)| CUDA toolkit

* Socrates (c. 470-399 B.C.)

14 Hardware accelerators: What is so appealing about them?

15 Xeon E5 (Sandy Bridge): 8 cores, Gflops, 135 W, RAM 80 GB/s

16 Xeon E5 (Sandy Bridge): 8 cores, Gflops, 135 W, RAM 80 GB/s
Xeon Phi 5110P: 60 cores, Gflops, 230 W, RAM 320 GB/s
Tesla K20X: 2688 cores, Gflops, 230 W, RAM 250 GB/s

17 Hardware accelerators: What is so appealing about them? Large number of cores. High memory throughput. Low power consumption. Small physical footprint.

18 Xeon E5 (Sandy Bridge): 8 cores, Gflops, 135 W, RAM 80 GB/s
Xeon Phi 5110P: 60 cores, Gflops, 230 W, RAM 320 GB/s
Tesla K20X: 2688 cores, Gflops, 230 W, RAM 250 GB/s

19 Sandy Bridge core

20 MIC core

21 GPU core

22 "Every action has its pleasures and its price." * Hardware accelerators: they can bring you more speed, but they are different from CPUs. To fully benefit from them: you must understand them, and you must adapt to them. * Socrates (c. 470-399 B.C.)

23 Hardware accelerators: GPU computing on nvidia Tesla. CISM/CÉCI HPC training session Fall 2013

24 Hardware accelerators: GPU computing on nvidia Tesla. 1. GPU hardware CISM/CÉCI HPC training session Fall 2013

25

26

27

28

29

30

31 Hardware accelerators: GPU computing on nvidia Tesla. 2. Programming model CISM/CÉCI HPC training session Fall 2013

32 [Diagram, garbled in transcription: overview of GPU programming platforms — DirectCompute (Microsoft), OpenMP and OpenACC (compiler directives), OpenCL (open platform, supported on lots of platforms), and CUDA, the main topic here]

33 OpenMP vs. CUDA

OpenMP: multithread SPMD — one thread per CPU core; few threads handle many data elements.
CUDA: multithread SIMD — one thread per data element; many threads handle few data elements.
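To make the contrast concrete, here is a minimal sketch (not from the slides; names illustrative) of the same vector addition in both models:

// OpenMP (SPMD): a few threads, each loops over many elements
void vecAddCPU(const float *a, const float *b, float *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// CUDA (SIMD-style): many threads, each handles one element
__global__ void vecAddGPU(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the grid may be slightly larger than n
        c[i] = a[i] + b[i];
}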

34 Threads are designed to be logically organized the way the data is organized: e.g. if the data is a matrix, then think of a 2D array of threads, indexed by threadIdx.x (i) and threadIdx.y (j). Threads are furthermore logically organized in blocks.
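A minimal sketch of that 2D mapping (an illustration, assuming a Width x Width matrix small enough for a single block):

__global__ void incrementMatrix(float *m, int Width)
{
    int i = threadIdx.x;           // column index, as in the slide's sketch
    int j = threadIdx.y;           // row index
    m[j * Width + i] += 1.0f;      // one thread per matrix element
}

// launch with a 2D block shaped like the data:
//   dim3 block(Width, Width);    // Width*Width must stay below the per-block thread limit
//   incrementMatrix<<<1, block>>>(d_m, Width);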

35

36 CPU and GPU have distinct memory modules (GPU: thousands of cores; CPU: 4-16 cores)
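Because the memories are separate, data must be copied explicitly between them; a minimal sketch of the canonical allocate/copy/compute/copy-back pattern (error checking omitted, N arbitrary):

void roundTrip(void)
{
    const int N = 1 << 20;
    size_t size = N * sizeof(float);
    float *h_a = (float *)malloc(size);                  // host (CPU) memory
    float *d_a;
    cudaMalloc(&d_a, size);                              // device (GPU) memory
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);  // CPU -> GPU over PCIe
    /* ... launch kernels operating on d_a ... */
    cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);  // GPU -> CPU
    cudaFree(d_a);
    free(h_a);
}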

37

38 New in CUDA 6

39 New in CUDA 6
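The CUDA 6 feature shown here is most likely Unified Memory (an assumption; the slides themselves carry no text). A minimal sketch, using the real cudaMallocManaged call and a hypothetical kernel named scale:

__global__ void scale(float *a, int n)            // hypothetical kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

void demo(int n)
{
    float *a;
    cudaMallocManaged(&a, n * sizeof(float));     // one pointer, valid on CPU and GPU
    for (int i = 0; i < n; ++i) a[i] = 1.0f;      // initialize on the host, no cudaMemcpy
    scale<<<(n + 255) / 256, 256>>>(a, n);        // use the same pointer on the device
    cudaDeviceSynchronize();                      // wait before touching it on the host
    cudaFree(a);
}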

40 Matrix multiplication

on cpu:

void MatMul(double *M, double *N, double *P)
{
    #pragma omp parallel for
    for (int x = 0; x < Width; ++x) {
        for (int y = 0; y < Width; ++y) {
            int row = x;
            int col = y;
            double sp = 0.0;
            for (int k = 0; k < Width; ++k) {
                double a = M[row * Width + k];
                double b = N[k * Width + col];
                sp += a * b;
            }
            P[row * Width + col] = sp;
        }
    }
}

on gpu:

__global__ void MatMulKernel(double *Md, double *Nd, double *Pd, int Width)
{
    int row = threadIdx.x;
    int col = threadIdx.y;
    double sp = 0.0;
    for (int k = 0; k < Width; ++k) {
        double a = Md[row * Width + k];
        double b = Nd[k * Width + col];
        sp += a * b;
    }
    Pd[row * Width + col] = sp;
}

...
dim3 Grid(1, 1);             // 1 block
dim3 Block(Width, Width);    // Width^2 threads
MatMulKernel<<<Grid, Block>>>(Md, Nd, Pd, Width);

41 Data size matters. [Plot: square matrix multiplication, seconds vs. matrix size, CPU vs. GPU, single precision. CPU = Intel(R) Xeon(R) E5 (1 core) @ 2.20GHz; GPU = nvidia Tesla M series; CUDA C / GCC]

42 Data size matters. [Plot: square matrix multiplication, GFlops vs. matrix size, GPU vs. CPU, single precision; same hardware as the previous slide]

43 Computation complexity matters. [Plot: square matrix addition, seconds vs. matrix size, single precision; same CPU and GPU as before]

44 Computation complexity matters. [Plot: MatMul vs. MatAdd compared]

45 Memories

GPU memory:

Memory   | Location | Access | Scope                         | Lifetime
Register | On-chip  | R/W    | One thread                    | Thread
Local    | Off-chip | R/W    | One thread                    | Thread
Shared   | On-chip  | R/W    | All threads in a common block | Block
Global   | Off-chip | R/W    | All threads + host            | Application
Constant | Off-chip | R      | All threads + host            | Application
Texture  | Off-chip | R      | All threads + host            | Application

CPU memory:

Memory | Location | Access (from GPU) | Scope | Lifetime
RAM    | Off-card | None              | --    | Application
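In CUDA C these memory spaces map onto declaration qualifiers; a hedged sketch (names illustrative):

__constant__ float coeffs[16];          // constant memory: read-only from kernels

__global__ void spaces(float *g)        // g points into global memory
{
    int tid = threadIdx.x;              // tid lives in a register: per-thread, per-kernel
    __shared__ float tile[128];         // shared memory: per-block, on-chip
    if (tid < 128) {
        tile[tid] = g[tid] * coeffs[0];
        __syncthreads();                // make the tile visible to the whole block
        g[tid] = tile[127 - tid];
    }
    // large per-thread arrays would spill to (off-chip) local memory
}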

46 Memory choices matter. [Plot: square matrix multiplication, FLOPs vs. matrix size, with vs. without shared memory, single precision. GPU = nvidia Tesla M series; CUDA C / GCC]

47 Streams
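A stream is an ordered queue of GPU operations; operations in different streams may run concurrently, which is what makes the overlap on the next slide possible. A minimal sketch (not from the slides) splitting a transfer over two streams — the host buffer must be pinned for asynchronous copies:

void twoStreams(size_t bytes)
{
    cudaStream_t s[2];
    float *h, *d;
    cudaMallocHost(&h, bytes);          // pinned host memory: required for async copies
    cudaMalloc(&d, bytes);
    size_t half = bytes / 2;
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&s[i]);
        cudaMemcpyAsync((char *)d + i * half, (char *)h + i * half,
                        half, cudaMemcpyHostToDevice, s[i]);
        // a kernel launched into s[i] here would overlap with the copy in the other stream
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
    cudaFree(d);
    cudaFreeHost(h);
}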

48 Overlapping matters. [Plot: time in ms for a 128MB transfer back and forth, without vs. with 4 streams. GPU = nvidia Tesla M series; CUDA C / GCC]

49 Hardware accelerators: GPU computing on nvidia Tesla. 3. GPU toolkit CISM/CÉCI HPC training session Fall 2013

50 CUDA dev tools

51 CUDA math libraries

52

53

54 cublas drop-in replacement example: Octave

Compile the nvidia-provided fortran_thunking.c into a library:
$ mkdir cublas_thunking && cd $_
$ g++ -c -DCUBLAS_GFORTRAN -o fortran_thunking.o $CUDAROOT/src/fortran_thunking.c
$ ar rvs libcublas_thunking.a fortran_thunking.o

Configure with links to cublas:
$ ./configure LDFLAGS="-lcublas -lcudart -L$(pwd)/cublas_thunking -lcublas_thunking"

Patch the Octave code to call cublas_dgemm rather than blas' dgemm:
$ sed -i 's/dgemm/cublas_dgemm/g' liboctave/dMatrix.cc

Compile:
$ make
$ make install

New in CUDA 6: nvblas, a pure drop-in replacement.
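For reference, the routine the thunking wrapper ultimately invokes is cuBLAS's DGEMM; a hedged sketch of the direct C call (v2 API, matrices already resident on the GPU, column-major as in BLAS):

#include <cublas_v2.h>

// C = 1.0 * A * B + 0.0 * C for n x n matrices dA, dB, dC in GPU memory
void gpu_dgemm(cublasHandle_t handle, int n,
               const double *dA, const double *dB, double *dC)
{
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
}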

55 Octave benchmark code

printf('n\tseconds\tgflops/s\tmbytes\n')
for n = [2, 2.^(1:9), 768:256: :512:12288]
  a = randn(n); b = randn(n); c = zeros(n);
  tstart = time;
  c = a*b;
  tend = time;
  seconds = tend - tstart;
  gflops = 2*n^3 / seconds * 1e-9;
  mbytes = n*n*3*8/1024/1024;
  printf('%d\t%f\t%f\t%f\n', n, seconds, gflops, mbytes)
end

56 cublas drop-in replacement example: Octave. [Plot: GFlops vs. matrix size for cublas, openblas, monothreaded openblas, and reference blas. openblas and blas on Intel(R) Xeon(R) E5 CPU; cublas on nvidia Tesla M series]

57

58 & gnumpy

59

60

61

62 Hardware accelerators: MIC computing with Intel Xeon Phi. CISM/CÉCI HPC training session Fall 2013

63 Hardware accelerators: MIC computing with Intel Xeon Phi. 1. Xeon Phi hardware CISM/CÉCI HPC training session Fall 2013

64

65

66

67

68 Hardware accelerators: MIC computing with Intel Xeon Phi. 2. Programming model CISM/CÉCI HPC training session Fall 2013

69 Programming models

70 Execution models: MKL, OpenCL, offload OpenMP, offload Intel MPI, native OpenMP, native Intel MPI

71 [Table, garbled in transcription: programming models (OpenCL, OpenMP, MPI, MKL) against execution models (offload, hybrid, native; symmetric for OpenCL), with each combination rated Easy, Bit more complex, Truly complex, or Impossible]

72 Matrix multiplication

on cpu:

void MatMul(double *M, double *N, double *P)
{
    #pragma omp parallel for
    for (int x = 0; x < Width; ++x) {
        for (int y = 0; y < Width; ++y) {
            int row = x;
            int col = y;
            double sp = 0.0;
            for (int k = 0; k < Width; ++k) {
                double a = M[row * Width + k];
                double b = N[k * Width + col];
                sp += a * b;
            }
            P[row * Width + col] = sp;
        }
    }
}

on Xeon Phi (offload):

void MatMul(double *M, double *N, double *P)
{
    #pragma offload target(mic) \
        in(M:length(Width*Width)) \
        in(N:length(Width*Width)) \
        out(P:length(Width*Width))
    #pragma omp parallel for
    for (int x = 0; x < Width; ++x) {
        for (int y = 0; y < Width; ++y) {
            int row = x;
            int col = y;
            double sp = 0.0;
            for (int k = 0; k < Width; ++k) {
                double a = M[row * Width + k];
                double b = N[k * Width + col];
                sp += a * b;
            }
            P[row * Width + col] = sp;
        }
    }
}

73 Data size matters. [Plot: square matrix multiplication, seconds vs. matrix size, CPU vs. Phi, double precision. CPU = Intel(R) Xeon(R) E5 (8 cores) @ 2.20GHz; Phi = Xeon Phi 5110P; Intel compiler]

74 Data size matters. [Plot: square matrix multiplication, FLOPs vs. matrix size, Phi vs. CPU, double precision; same hardware]

75 Code vectorization matters. [Plot: square matrix multiplication, FLOPs vs. matrix size, with vs. without auto-vectorization, double precision; same hardware]

76 Multi-threading matters. [Plot: square matrix multiplication, FLOPs vs. number of hardware threads, double precision; same hardware]

77 Hardware accelerators: MIC computing with Intel Xeon Phi. 3. Xeon Phi toolkit CISM/CÉCI HPC training session Fall 2013

78 Intel dev tools

79

80 Xeon Phi libraries

81 MKL drop-in replacement example: Octave

Set up the environment to use the Intel compiler:
$ module load intel/clusterstudio
$ export CC=icc
$ export F77=ifort
$ export CFLAGS="-O3 -ipo -std=c99 -fpic -DMKL_LP64"
$ export CPPFLAGS="-I$MKLROOT/include -I$MKLROOT/include/fftw"
$ export LDFLAGS="-L$MKLROOT/lib/intel64 -L$MKLROOT/../compiler/lib/intel64"

Configure with links to MKL:
$ ./configure \
    --with-blas="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread" \
    --with-lapack="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread"

82 MKL drop-in replacement example: Octave. [Plot: GFlops vs. matrix size for Phi-MKL, MKL, monothreaded MKL, and reference blas. MKL and blas on Intel(R) Xeon(R) E5 CPU; MKL-Phi on Xeon Phi 5110P]

83 MKL drop-in replacement example: Octave. [Plot: same as the previous slide, with cublas added for comparison]

84 Matrix multiplication: on cpu and on Xeon Phi (native) the source code is identical; the same OpenMP loop nest is simply cross-compiled for the MIC target.

void MatMul(double *M, double *N, double *P)
{
    #pragma omp parallel for
    for (int x = 0; x < Width; ++x) {
        for (int y = 0; y < Width; ++y) {
            int row = x;
            int col = y;
            double sp = 0.0;
            for (int k = 0; k < Width; ++k) {
                double a = M[row * Width + k];
                double b = N[k * Width + col];
                sp += a * b;
            }
            P[row * Width + col] = sp;
        }
    }
}

85 Native OpenMP. [Plot: square matrix multiplication, FLOPs vs. matrix size, native vs. offload vs. CPU, double precision. CPU = Intel(R) Xeon(R) E5 (8 cores) @ 2.20GHz; Phi = Xeon Phi 5110P; Intel compiler]

86 Hardware accelerators: Multi and Hybrid strategies CISM/CÉCI HPC training session Fall 2013

87 cublas manual CPU/GPU workdivision

88 cublas manual CPU/GPU workdivision

89 CUDA manual multi-gpu workdivision
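Manual multi-GPU work division amounts to selecting each device in turn and handing it a slice of the data; a minimal sketch (not from the slides), with a hypothetical kernel slice_kernel:

void multiGpu(const float *h_x, int n)
{
    int ndev;
    cudaGetDeviceCount(&ndev);
    if (ndev > 8) ndev = 8;                     // cap for this sketch's fixed array
    float *d_x[8];
    int chunk = n / ndev;                       // assume ndev divides n
    for (int dev = 0; dev < ndev; ++dev) {
        cudaSetDevice(dev);                     // subsequent calls target this device
        cudaMalloc(&d_x[dev], chunk * sizeof(float));
        cudaMemcpyAsync(d_x[dev], h_x + dev * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice);
        // slice_kernel<<<(chunk + 255)/256, 256>>>(d_x[dev], chunk);  // hypothetical
    }
    for (int dev = 0; dev < ndev; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();                // wait for each device to finish
        cudaFree(d_x[dev]);
    }
}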

90 Automatic workdivision with Magma

91 Remote GPU utilization

92 MKL CPU/Phi auto-workdivision

$ export OFFLOAD_REPORT=1
$ time octave-mkl/bin/octave -q bench.m
n       Seconds  Gflops/s  Mbytes
[...]
[MKL] [MIC --] [AO Function]            DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision]
[MKL] [MIC 00] [AO DGEMM CPU Time]      seconds
[MKL] [MIC 00] [AO DGEMM MIC Time]      seconds
[MKL] [MIC --] [AO Function]            DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision]
[MKL] [MIC 00] [AO DGEMM CPU Time]      seconds
[MKL] [MIC 00] [AO DGEMM MIC Time]      seconds
[...]

93 Intel MPI CPU/Phi hybrid computing and/or multi-phi computing

94 Hardware accelerators: Summary and wrap-up CISM/CÉCI HPC training session Fall 2013

95 Xeon Phi core (1 to 1.3 GHz):
- in-order architecture, x86 + MIC extensions
- 4 hardware threads
- 1 SPU: 1 double op/cycle
- 1 VPU: 32 float op/cycle, 16 double op/cycle; supports fused mult-add and transcendentals; 4 clock latency

nvidia Kepler SMX (735 to 745 MHz):
- 192 SP CUDA cores: 2 float op/cycle, supports fused mult-add
- 64 DP units: 2 double op/cycle, supports fused mult-add
- 32 SFU units: 1 double op/cycle, supports transcendentals

96 What are they? Expensive extra hardware designed to crunch data.
Strength: parallel data processing.
Weakness: non-trivial to optimize code.
Big help: vendor software (Intel compiler, CUDA libraries).
Major issue: data transfers.

97 Should I care? Maybe.

98 Should I care? You have:
- not much data
- lots of code to develop
- a simple algorithm with a low flop-per-byte count
- an algorithm that is not easily parallelized or that requires frequent global synchronization
Clear NO.

99 Should I care? You have:
- ready-made libraries or extensions
- large data and data parallelism
- complex algorithms with a large flop-per-byte count
Clear YES.

100 GPU vs. Xeon Phi [comparison table, garbled in transcription: the two are scored against each other on processing power, ease of programming, availability of dev tools, availability of libraries, versatility, and cost of HW/SW, with '=' marking a tie and 'v' an advantage]

101 Where do I start?
Look for available CUDA libraries related to your work.
Found? Look for a CUDA-enabled gpu (in your laptop?). Found? Use it to develop, then use the big ones on the clusters.
No library found? Does lots of data-intensive code exist? No? You might want to consider writing your own kernel.
Otherwise, try the Xeon Phi if you can afford one (do you have access to the Intel compiler?).

102

103

104

105

106 MatMul*: A simple CPU version in C

void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int x = 0; x < Width; ++x) {
        for (int y = 0; y < Width; ++y) {
            int row = x;
            int col = y;
            float sp = 0.0;
            for (int k = 0; k < Width; ++k) {
                float a = M[row * Width + k];
                float b = N[k * Width + col];
                sp += a * b;
            }
            P[row * Width + col] = sp;
        }
    }
}

* a.k.a. the 'Hello world' of high performance computing

David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

107 MatMul: A simple GPU version in CUDA/C

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N to device memory
    cudaMalloc(&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc(&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc(&Pd, size);

David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

108 MatMul: Output Matrix Data Transfer

    // 2. Kernel invocation code - to be shown later

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}

David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

109 MatMul: Kernel Function

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    int row = threadIdx.x;
    int col = threadIdx.y;
    float sp = 0.0;
    for (int k = 0; k < Width; ++k) {
        float a = Md[row * Width + k];
        float b = Nd[k * Width + col];
        sp += a * b;
    }
    Pd[row * Width + col] = sp;
}

David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

110 MatMul

void MatrixMulOnHost(float* M, ...)
{
    for (int x = 0; x < Width; ++x) {
        for (int y = 0; y < Width; ++y) {
            int row = x;
            int col = y;
            float sp = 0.0;
            for (int k = 0; k < Width; ++k) {
                float a = M[row * Width + k];
                float b = N[k * Width + col];
                sp += a * b;
            }
            P[row * Width + col] = sp;
        }
    }
}

__global__ void MatrixMulKernel(float* Md, ...)
{
    int row = threadIdx.x;
    int col = threadIdx.y;
    float sp = 0.0;
    for (int k = 0; k < Width; ++k) {
        float a = Md[row * Width + k];
        float b = Nd[k * Width + col];
        sp += a * b;
    }
    Pd[row * Width + col] = sp;
}

111 MatMul: Kernel invocation

    // 1. Allocate and load M, N to device memory ...

    // 2. Kernel invocation code
    // Setup the execution configuration
    dim3 dimGrid(1, 1);            // 1 block
    dim3 dimBlock(Width, Width);   // Width^2 threads

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    // 3. Read P from the device ...

David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

112 MatMul

CPU:

void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int x = 0; x < Width; ++x)
        for (int y = 0; y < Width; ++y) {
            float sp = 0.0;
            for (int k = 0; k < Width; ++k) {
                float a = M[x * Width + k];
                float b = N[k * Width + y];
                sp += a * b;
            }
            P[x * Width + y] = sp;
        }
}

GPU:

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    ...
    dim3 dimGrid(1, 1);            // 1 block
    dim3 dimBlock(Width, Width);   // Width^2 threads
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    ...
}

113 Only One Thread Block Used

One block of threads computes matrix Pd; each thread computes one element of Pd.
Each thread loads a row of matrix Md and a column of matrix Nd, and performs one multiply and one addition for each pair of Md and Nd elements.
Compute to off-chip memory access ratio close to 1:1 (not very high).
Size of matrix limited by the number of threads allowed in a thread block.

David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

114 Matrix Multiplication Using Multiple Blocks

Break up Pd into tiles (Pdsub); each block calculates one tile.
Each thread calculates one element; block size equals tile size (TILE_WIDTH).

David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign

115 MatMul: Kernel using Multiple Blocks

    // 1. Allocate and load M, N to device memory ...

    // 2. Kernel invocation code
    // Setup the execution configuration
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    // 3. Read P from the device ...

116 MatMul: Kernel using Multiple Blocks

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element (and of M)
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd (and of N)
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float sp = 0.0;
    for (int k = 0; k < Width; ++k) {
        float a = Md[Row * Width + k];
        float b = Nd[k * Width + Col];
        sp += a * b;
    }
    Pd[Row * Width + Col] = sp;
}

117 MatMul: Kernel using Multiple Blocks

(same kernel as on the previous slide)

Md, Nd, Pd are on the GPU DIMM (far, far away, but global).
a, b, sp are local to the thread (close, but private).

118 MatMul: Use Shared Memory to reuse global memory data

Each input element is read by Width threads. Load each element into shared memory and have several threads use the local version, reducing the required memory bandwidth.

David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign

119 Shared memory Matrix Multiplication

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Declare shared memory
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;

    float sp = 0.0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Each thread loads one element of M and one of N into shared memory
        Mds[ty][tx] = Md[row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[col + (m*TILE_WIDTH + ty)*Width];
        // Make sure everyone has copied before we go on with computations
        __syncthreads();
        // Compute sp taking values from shared memory
        for (int k = 0; k < TILE_WIDTH; ++k)
            sp += Mds[ty][k] * Nds[k][tx];
        // Make sure everyone is done before the tiles are overwritten
        __syncthreads();
    }
    Pd[row*Width + col] = sp;
}

120 Shared memory Matrix Multiplication

Previous version, each thread does:
- (2 x Width) accesses to GPU DIMM
- (Width) multiplications, (Width - 1) additions

Shared memory version, each thread does:
- (2 x Width / TILE_WIDTH) accesses to GPU DIMM
- (2 x Width) accesses to shared memory
- (Width) multiplications, (Width - 1) additions

Shared memory: ~1.5 TB/s, near-zero latency. GPU DIMM: ~100 GB/s, a latency of many clock cycles. E.g. for Width = 1024 and TILE_WIDTH = 16, DIMM accesses per thread drop from 2048 to 128.

121 Native OpenMP

122 Native OpenMP

/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gcc -fopenmp hello.openmp.c

123 Offload OpenMP

124 Offload OpenMP

125 Hybrid OpenMP

126 Hybrid OpenMP

127 Native intel MPI

128 Native intel MPI

129 Hybrid intel MPI

130 Hybrid intel MPI

mpirun -n 4 -host localhost ./hello.mpi : -n 40 -host mic0 /tmp/hello.mpi.mic

131 Offload intel MPI

132 Offload intel MPI

133 Native MKL

134 Native MKL

135 Automatic offload MKL

136 Automatic offload MKL

137 Automatic offload MKL

138 Compiler-assisted Offload MKL

139 Compiler-assisted Offload MKL
