From Hello World to Exascale


1 From Hello World to Exascale. Rob Farber, Chief Scientist, BlackDog Endeavors, LLC. Author, CUDA Application Design and Development. Research consultant: ICHEC and others. Columnist: Doctor Dobb's Journal (CUDA & OpenACC tutorials), The Code Project (OpenCL tutorials).

2 What is a co-processor? The host application reaches the device through a layered software stack:

HOST Application
CUDA Libraries
CUDA Thrust API
CUDA Runtime API
CUDA Driver API
Device driver
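To make the layers concrete, here is a small sketch (an illustration, not from the slides) of the same device allocation expressed at two levels of the stack: through Thrust, which manages memory automatically, and through the lower-level runtime API:

    #include <thrust/device_vector.h>
    #include <cuda_runtime.h>

    int main()
    {
        // High level: Thrust owns the allocation and frees it automatically
        thrust::device_vector<float> v(1024);

        // Lower level: the CUDA runtime API requires manual management
        float *d_p = NULL;
        cudaMalloc(&d_p, 1024 * sizeof(float));
        cudaFree(d_p);
        return 0;
    }

Each layer trades control for convenience; the driver API below the runtime offers still finer control at the cost of more code.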

3 Three rules for fast co-processor (GPU) codes (a sketch of rule 1 follows this list):

1. Get the data on the device (and keep it there!) The PCIe x16 v2.0 bus moves 8 GiB/s in a single direction; 20-series GPUs have far higher on-board memory bandwidth.
2. Give the device enough work to do. Assume 10 µs of launch latency on a 1 TF device: every microsecond of idle time wastes (10^-6 s × 10^12 flop/s) = 1M operations.
3. Reuse and locate data to avoid global memory bandwidth bottlenecks. Tflop/s-capable hardware delivers only a small fraction of peak flop/s when global-memory limited, which can cause a 100x slowdown! Corollary: avoid malloc/free!
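As a sketch of rule 1 (an illustration, not from the slides; the kernel name, sizes, and constants are made up), allocate once, keep the data resident on the GPU across many kernel launches, and copy back only the final result:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scaleKernel(float *a, int n, float s)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) a[tid] = a[tid] * s + 1.0f; // work on device-resident data
    }

    int main()
    {
        const int N = 1 << 20;
        float *d_a = NULL;
        cudaMalloc(&d_a, N * sizeof(float));     // allocate once, not per iteration
        cudaMemset(d_a, 0, N * sizeof(float));

        // Many launches reuse the same device data: no PCIe traffic in the loop
        for (int iter = 0; iter < 1000; ++iter)
            scaleKernel<<<(N + 255) / 256, 256>>>(d_a, N, 0.999f);

        float result;                            // one small copy at the end
        cudaMemcpy(&result, d_a, sizeof(float), cudaMemcpyDeviceToHost);
        printf("a[0] = %g\n", result);
        cudaFree(d_a);
        return 0;
    }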

4 If you know C++, you are already programming GPUs! (We will revisit these examples in the workshop.) The serial version:

    //seqSerial.cpp
    #include <iostream>
    #include <vector>
    using namespace std;

    int main()
    {
        const int N=50000;

        // task 1: create the array
        vector<int> a(N);

        // task 2: fill the array
        for(int i=0; i < N; i++) a[i]=i;

        // task 3: calculate the sum of the array
        int sumA=0;
        for(int i=0; i < N; i++) sumA += a[i];

        // task 4: calculate the sum of 0 .. N-1
        int sumCheck=0;
        for(int i=0; i < N; i++) sumCheck += i;

        // task 5: check the results agree
        if(sumA == sumCheck) cout << "Test Succeeded!" << endl;
        else { cerr << "Test FAILED!" << endl; return(1); }
        return(0);
    }

The Thrust version:

    //seqCuda.cu
    #include <iostream>
    using namespace std;
    #include <thrust/reduce.h>
    #include <thrust/sequence.h>
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>

    int main()
    {
        const int N=50000;

        // task 1: create the array
        thrust::device_vector<int> a(N);

        // task 2: fill the array
        thrust::sequence(a.begin(), a.end(), 0);

        // task 3: calculate the sum of the array
        int sumA = thrust::reduce(a.begin(), a.end(), 0);

        // task 4: calculate the sum of 0 .. N-1
        int sumCheck=0;
        for(int i=0; i < N; i++) sumCheck += i;

        // task 5: check the results agree
        if(sumA == sumCheck) cout << "Test Succeeded!" << endl;
        else { cerr << "Test FAILED!" << endl; return(1); }
        return(0);
    }
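As a usage note (assuming a standard CUDA toolkit install), the serial version builds with any C++ compiler while the Thrust version builds with nvcc; Thrust ships with the toolkit, so no extra libraries are needed:

    g++ seqSerial.cpp -o seqSerial
    nvcc seqCuda.cu -o seqCuda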

5 Thrust saves handling details. The same program can mix Thrust containers with a hand-written CUDA kernel launched through the runtime:

    //seqRuntime.cu
    #include <iostream>
    using namespace std;
    #include <thrust/reduce.h>
    #include <thrust/sequence.h>
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>

    __global__ void fillKernel(int *a, int n)
    {
        int tid = blockIdx.x*blockDim.x + threadIdx.x;
        if (tid < n) a[tid] = tid;
    }

    void fill(int* d_a, int n)
    {
        int nThreadsPerBlock = 512;
        int nBlocks = n/nThreadsPerBlock + ((n%nThreadsPerBlock)?1:0);
        fillKernel<<< nBlocks, nThreadsPerBlock >>>(d_a, n);
    }

    int main()
    {
        const int N=50000;

        // task 1: create the array
        thrust::device_vector<int> a(N);

        // task 2: fill the array using the runtime
        fill(thrust::raw_pointer_cast(&a[0]), N);

        // task 3: calculate the sum of the array
        int sumA = thrust::reduce(a.begin(), a.end(), 0);

        // task 4: calculate the sum of 0 .. N-1
        int sumCheck=0;
        for(int i=0; i < N; i++) sumCheck += i;

        // task 5: check the results agree
        if(sumA == sumCheck) cout << "Test Succeeded!" << endl;
        else { cerr << "Test FAILED!" << endl; return(1); }
        return(0);
    }

6 CUDA is based on a 1D, 2D, or 3D grid. All parallel loops are broken into blocks, and only threads within a block can communicate! Conceptually, the kernel launch in seqRuntime.cu (previous slide) replaces a parallel for loop in which each iteration becomes a thread:

    // Parallel for loop
    for(int i=0; i < N; i++) fillKernel(a, N);

7 Each thread needs to find its location. Each thread in fillKernel() (seqRuntime.cu, previous slides) calculates a different value for tid:

    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if (tid < n) a[tid] = tid;
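The same index arithmetic extends to higher dimensions. As a sketch (not from the slides; the kernel and sizes are illustrative), a 2D kernel combines the x and y components of blockIdx, blockDim, and threadIdx to locate its element in a matrix:

    __global__ void fill2D(float *a, int width, int height)
    {
        // each thread computes its (row, col) position in the 2D grid
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < height && col < width)
            a[row * width + col] = row * width + col; // row-major linear index
    }

    // launch with a 2D grid of 2D blocks, e.g. for a 1024x768 matrix:
    // dim3 threads(16, 16);
    // dim3 blocks((1024 + 15) / 16, (768 + 15) / 16);
    // fill2D<<<blocks, threads>>>(d_a, 1024, 768);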

8 Scalability is required to use all those cores (the strong-scaling execution model). Each thread running fillKernel() writes its own element of the array: a[tid] = tid.

9 Reduction Gack! Behind the one-line Thrust call

    // task 3: calculate the sum of the array
    int sumA = thrust::reduce(a.begin(), a.end(), 0);

lies code like the following generalized reduction header, which works through the memory hierarchy from fastest to slowest:

    #include <stdio.h>
    #include <cstdlib>
    #include <iostream>
    #ifndef REDUCE_H
    #define REDUCE_H

    // Define the number of blocks as a multiple of the number of SMs
    // and the number of threads as the maximum resident on an SM
    #define N_BLOCKS (1*14)
    #define N_THREADS 1024
    #define WARP_SIZE 32

    template <class T, typename UnaryFunction, typename BinaryFunction>
    __global__ void _functionReduce(T *g_odata, unsigned int n, T initVal,
                                    UnaryFunction fcn, BinaryFunction fcn1)
    {
        T myVal = initVal;

        // 1) Use the fastest memory (registers) first.
        const int gridSize = blockDim.x*gridDim.x;
        //for(int i = blockIdx.x*blockDim.x + threadIdx.x; i < n; i += gridSize)
        //    myVal = fcn1(fcn(i), myVal);
        for(int i = n-1 - (blockIdx.x*blockDim.x + threadIdx.x); i >= 0; i -= gridSize)
            myVal = fcn1(fcn(i), myVal);

        // 2) Use the second fastest memory (shared memory) in a warp-
        // synchronous fashion. Create shared memory for the per-block
        // reduction and reuse the registers in the first warp.
        volatile __shared__ T smem[N_THREADS-WARP_SIZE];

        // put all the register values into shared memory
        if(threadIdx.x >= WARP_SIZE) smem[threadIdx.x - WARP_SIZE] = myVal;
        __syncthreads(); // wait for all threads in the block to complete

        if(threadIdx.x < WARP_SIZE) {
            // now using just one warp
    #pragma unroll
            for(int i=threadIdx.x; i < (N_THREADS-WARP_SIZE); i += WARP_SIZE)
                myVal = fcn1(myVal, (T)smem[i]);
            smem[threadIdx.x] = myVal; // save this warp's myVal at the start of smem
        }

        // reduce within shared memory
        if (threadIdx.x < 16) smem[threadIdx.x] = fcn1((T)smem[threadIdx.x], (T)smem[threadIdx.x + 16]);
        if (threadIdx.x < 8)  smem[threadIdx.x] = fcn1((T)smem[threadIdx.x], (T)smem[threadIdx.x + 8]);
        if (threadIdx.x < 4)  smem[threadIdx.x] = fcn1((T)smem[threadIdx.x], (T)smem[threadIdx.x + 4]);
        if (threadIdx.x < 2)  smem[threadIdx.x] = fcn1((T)smem[threadIdx.x], (T)smem[threadIdx.x + 2]);
        if (threadIdx.x < 1)  smem[threadIdx.x] = fcn1((T)smem[threadIdx.x], (T)smem[threadIdx.x + 1]);

        // 3) Use global memory as a last resort to transfer results to the host:
        // write the result for each block to global memory.
        if (threadIdx.x == 0) g_odata[blockIdx.x] = smem[0];
        // Can put the final reduction across SMs here if desired.
    }

    template<typename T, typename UnaryFunction, typename BinaryFunction>
    inline void partialReduce(const int n, T** d_partialVals, T initVal,
                              UnaryFunction const& fcn, BinaryFunction const& fcn1)
    {
        if(*d_partialVals == NULL)
            cudaMalloc(d_partialVals, (N_BLOCKS+1) * sizeof(T));
        _functionReduce<T><<< N_BLOCKS, N_THREADS >>>(*d_partialVals, n, initVal, fcn, fcn1);
    }

    template<typename T, typename UnaryFunction, typename BinaryFunction>
    inline T functionReduce(const int n, T** d_partialVals, T initVal,
                            UnaryFunction const& fcn, BinaryFunction const& fcn1)
    {
        partialReduce(n, d_partialVals, initVal, fcn, fcn1);

        // Get the per-block values onto the host.
        // Note: uses the default stream in the current context.
        T h_partialVals[N_BLOCKS];
        if(cudaMemcpy(h_partialVals, *d_partialVals, sizeof(T)*N_BLOCKS,
                      cudaMemcpyDeviceToHost) != cudaSuccess) {
            std::cerr << "_functionReduce copy failed!" << std::endl;
            exit(1);
        }

        // Perform the final reduction on the host
        T val = h_partialVals[0];
        for(int i=1; i < N_BLOCKS; i++) val = fcn1(h_partialVals[i], val);
        return(val);
    }
    #endif
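A minimal usage sketch (my own illustration, not from the slides; the functor names are made up, and the header above is assumed to be saved as functionReduce.h): summing the sequence 0..n-1 by passing an index-to-value transform and addition as the two functors. Note that the binary functor must be __host__ __device__ because functionReduce applies it on the host for the final pass.

    #include <iostream>
    #include <cuda_runtime.h>
    #include "functionReduce.h" // the header above; file name assumed

    struct SequenceValue {      // hypothetical unary functor: value at index i
        __device__ int operator()(int i) const { return i; }
    };
    struct Add {                // hypothetical binary functor: the reduction op
        __host__ __device__ int operator()(int a, int b) const { return a + b; }
    };

    int main()
    {
        const int n = 50000;
        int *d_partialVals = NULL; // functionReduce allocates this on first use
        int sum = functionReduce(n, &d_partialVals, 0, SequenceValue(), Add());
        std::cout << "sum = " << sum << std::endl; // expect n*(n-1)/2 = 1249975000
        cudaFree(d_partialVals);
        return 0;
    }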

10 The NVIDIA Visual Profiler is your friend! For an OpenACC matrix multiply, the profiler timeline shows three phases: move matrices a, b, and c to the coprocessor (GPU); perform the matrix multiply (line 24 in main); move matrix c back to the host. (Farber, Pragmatic Parallelism Part 1: Introducing OpenACC)
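For context, here is a hedged sketch of the kind of OpenACC matrix multiply that article profiles (a reconstruction under assumed sizes and names, not the article's exact listing; it requires an OpenACC compiler such as pgcc/nvc):

    // sketch of an OpenACC matrix multiply; SIZE and names are illustrative
    #define SIZE 1000
    float a[SIZE][SIZE], b[SIZE][SIZE], c[SIZE][SIZE];

    void matmul(void)
    {
        // copyin moves a and b to the GPU; copyout returns c to the host,
        // which is exactly the traffic the profiler timeline shows
    #pragma acc kernels copyin(a, b) copyout(c)
        for (int i = 0; i < SIZE; ++i)
            for (int j = 0; j < SIZE; ++j) {
                float sum = 0.0f;
                for (int k = 0; k < SIZE; ++k)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
    }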

11 We have only touched the surface of the CUDA ecosystem.

12 Congrats on your first CUDA program! thrust::transform_reduce() uses a functor to operate on (transform) the data, then applies the reduction. Surprise: you are now petascale-to-exascale capable!
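As a short illustration of that pattern (a sketch, not from the slides; the Square functor is made up for the example), here is the sum of squares of a device vector computed with thrust::transform_reduce:

    #include <iostream>
    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/transform_reduce.h>
    #include <thrust/functional.h>

    struct Square {  // unary transform: x -> x*x
        __host__ __device__ long long operator()(int x) const {
            return (long long)x * x;
        }
    };

    int main()
    {
        const int N = 1000;
        thrust::device_vector<int> a(N);
        thrust::sequence(a.begin(), a.end(), 0); // a = 0, 1, ..., N-1

        // transform each element with Square, then reduce with plus
        long long sumSq = thrust::transform_reduce(
            a.begin(), a.end(), Square(), 0LL, thrust::plus<long long>());

        // closed-form check: sum of squares 0..N-1 is (N-1)*N*(2N-1)/6
        std::cout << sumSq << " == " << (long long)(N-1)*N*(2*N-1)/6 << std::endl;
        return 0;
    }

Using 0LL as the initial value keeps the accumulation in 64-bit arithmetic, which avoids overflow for larger N.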
