
1 GPU AND MANYCORE

2 CPUs+GPUs
[Block diagram of a two-socket server: each socket has cores with private L1/L2 and a shared L3, a memory controller driving banks (1A-3D), and 40x PCIe lanes; the PCH (PCI, SCSI/SATA, USB, audio, etc.) attaches via DMI2 at 20 Gb/s; the two sockets are joined by 2 QPI links at 2x 19.2 GB/s full duplex; GPUs sit on PCIe x16 slots.]
PCIe: 1 GB/s per lane, so x16 => 16 GB/s full duplex

3 Lookahead: Programming Model? #pragma omp accelerate [clauses...] Structured-block

4 Programming Model
GPU is a multi-processor
One or more processes per GPU?
How to start these processes?
Persistent or fork-join?
Distributed memory w.r.t. CPU cores?
What about memory across GPU cores?
[Figure: CPU cores sharing one memory; each GPU's cores with their own memory.]

5 Programming Model?
Execution model: fork off low-synchronization work-items to each accelerator
Memory model: explicit memory copy to each accelerator; support shared memory

6

7 Volta

8 SM

9 Tesla GPU Generations

|                        | Tesla K40     | Tesla M40     | Tesla P100    | Tesla V100    |
| SMs                    | 15            | 24            | 56            | 80            |
| FP32 Cores/GPU         | 2880          | 3072          | 3584          | 5120          |
| FP64 Cores/GPU         | 960           | 96            | 1792          | 2560          |
| GPU Boost Clock        | 810/875 MHz   | 1114 MHz      | 1480 MHz      | 1462 MHz      |
| Peak FP32 TFLOPS       | 5             | 6.8           | 10.6          | 15            |
| Peak FP64 TFLOPS       | 1.7           | 0.21          | 5.3           | 7.5           |
| Texture Units          | 240           | 192           | 224           | 320           |
| Memory Interface       | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 |
| Memory Size            | Up to 12 GB   | Up to 24 GB   | 16 GB         | 16 GB         |
| L2 Cache Size          | 1536 KB       | 3072 KB       | 4096 KB       | 6144 KB       |
| Shared Mem Size/SM     | 16/32/48 KB   | 96 KB         | 64 KB         | Up to 96 KB   |
| Register File Size/GPU | 3840 KB       | 6144 KB       | 14336 KB      | 20480 KB      |
| TDP                    | 235 W         | 250 W         | 300 W         | 300 W         |
| Transistors            | 7.1 billion   | 8 billion     | 15.3 billion  | 21.1 billion  |

10 Kepler (Source: NVIDIA)

11 Kepler: SMX (Source: NVIDIA)

12 Pascal P100

13 P100 SM

14 3D Memory 3D Chip-on-Wafer integration 5X bandwidth 2.5X capacity 4X energy efficiency

15 Parallel Memory Read
PCI Express: 16 GB/s
Broadwell memory BW: ~60 GB/s
P100 memory BW: 732 GB/s
NVLink: 20 GB/s (per link)

16 NVLink
Differential signaling with embedded clock
PCIe programming model (w/ DMA+)
Unified Memory (cache coherency in Gen 2.0)
5 to 12x the bandwidth of PCIe

17 Inter-GPU NVLink Source: nvidia, 2016

18 CPU vs GPU architecture
Memory latency needs to be hidden: run many threads
[Figure: CPU devotes ~8 MB to cache; GPU ~64 KB per SM]
The GPU can do this because of its high compute density (Source: NVIDIA)

19 CUDA Architecture (Courtesy NVIDIA)

20 GPU Performance
Massively parallel: a few thousand cores
Low power
Massively threaded: many thousand threads
Hardware-supported threads

21 Programming Model? #pragma omp accelerate [clauses...] Structured-block
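
The accelerate directive above is the lecture's lookahead sketch, not a ratified pragma. For comparison, a minimal sketch of what standard OpenMP (4.0 and later) actually provides via the target construct; scale_add, a, b, and n are illustrative names, not from the slides:

// Offload a loop to an accelerator with explicit data mapping.
void scale_add(int n, float *a, const float *b)
{
    // Map 'a' both ways and 'b' to the device, run the loop there.
    #pragma omp target teams distribute parallel for \
            map(tofrom: a[0:n]) map(to: b[0:n])
    for (int i = 0; i < n; i++)
        a[i] = a[i] + 2.0f * b[i];
}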

22 CUDA Programming Model
Co-processor with many cores
CPU code schedules multi-threaded tasks
Message passing and shared memory models
GPU threads are organized hierarchically: grids, blocks, warps
Shared memory
Memory hierarchy opaque
Parallel read/write

23 CUDA is C-like
Integrated host+device app C program
Serial or modestly parallel parts in host C code
Highly parallel parts in device SPMD kernel C code

Serial Code (host)
Parallel Kernel (device): KernelA<<< nblk, ntid >>>(args); ...
Serial Code (host)
Parallel Kernel (device): KernelB<<< nblk, ntid >>>(args);
(Courtesy Kirk & Hwu)
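
To make the slide's pattern concrete, a minimal self-contained sketch in the same KernelA<<<nblk, ntid>>> style; the kernel body and buffer names are assumptions, not from the slides:

#include <cuda_runtime.h>

// Assumed placeholder kernel: each thread doubles one element.
__global__ void KernelA(float *args, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) args[i] *= 2.0f;
}

int main(void)
{
    const int n = 1024, ntid = 256, nblk = (n + ntid - 1) / ntid;
    float *d;
    cudaMalloc((void**)&d, n * sizeof(float));
    // ... serial host code fills d ...
    KernelA<<<nblk, ntid>>>(d, n);   // parallel kernel (device)
    cudaDeviceSynchronize();         // back to serial host code
    cudaFree(d);
    return 0;
}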

24 CUDA Devices and Threads
A compute device:
  Is a coprocessor to the CPU or host
  Has its own DRAM (device memory)
  Runs many threads in parallel
  Is typically a GPU but can also be another type of parallel processing device
Data-parallel portions of an application are expressed as device kernels, which run on many threads
Differences between GPU and CPU threads:
  GPU threads are extremely lightweight: very little creation overhead
  GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
(Courtesy Kirk & Hwu)

25 Extended C
Declspecs: __global__, __device__, __shared__, __local__, __constant__
Built-in variables: threadIdx, blockIdx
Intrinsics: __syncthreads()
Runtime API: memory, symbol, execution management; function launch

__device__ float filter[N];

__global__ void convolve(float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);
(Courtesy Kirk & Hwu)

26 Extended-C SW stack
Integrated source (foo.cu) -> nvcc C/C++ frontend
  Device path: OCG -> GPU assembly (foo.s) -> SASS (foo.sass)
  Host path: CPU host code (foo.cpp) -> gcc / cl
Tools: cuda-gdb, CUDA Visual Profiler, Parallel Nsight

27 CUDA software pipeline
A source file has a mix of host and device code
nvcc separates device code from host code and compiles device code into PTX/cubin
Host code is output as C code; nvcc can invoke the host compiler, or it can be compiled later
Applications can link to the generated host code
  Host code includes the PTX/cubin code as a global initialized data array, plus cudart (CUDA C runtime) function calls to load and launch kernels
Alternatively, one may load and execute the PTX/cubin using the CUDA driver API; the generated host code is then ignored

28 CUDA software architecture
Provides library functions for the host as well as the device
Implements a subset of stdlib (Source: NVIDIA)

29 System Requirements
CUDA GPU with CUDA device driver
CUDA software:
  CUDA Toolkit: tools to build a CUDA app, libraries, header files, and other resources
  CUDA SDK: sample projects (with configurations) including utility functions
C/C++ compiler (needs to be a compatible version)

30 Arrays of Parallel Threads
A CUDA kernel is executed many times: by a block of threads running concurrently, once per thread, each running the same kernel (SPMD)
Threads have access to their ID and may compute different memory addresses or take different control paths:

float x = input[tid];
float y = func(x);
output[tid] = y;

[Figure: the same three-line body replicated across threads 0..7, each with its own thread ID.]
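
A sketch of the slide's snippet wrapped in an actual kernel; func() is the slide's abstract placeholder, given a trivial body here:

// Each thread derives its own ID and therefore its own addresses.
__device__ float func(float x) { return x * x; }  // stand-in body

__global__ void apply(const float *input, float *output)
{
    int tid = threadIdx.x;   // per-thread ID within the block
    float x = input[tid];
    float y = func(x);
    output[tid] = y;
}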

31 Arrays of Parallel Blocks
Multiple blocks of threads may execute a kernel: a grid of blocks
Threads within a block communicate using shared memory, global memory, atomic operations, and barriers
Threads in different blocks share only global memory
[Figure: Thread Block 0, Thread Block 1, ..., Thread Block N, each running the same per-thread body indexed by thread ID.]

32 Main CUDA Construct
Run k instances of function f
Such parallel functions are called kernels, declared with the __global__ specifier

// Kernel definition
__global__ void f(float* F) {
    int id = threadIdx.x;
}

int main() {
    // Kernel invocation
    f<<<1, N>>>(F);
}

Object-oriented parts of C++ were not supported in device code in earlier CUDA versions

33 Kernel Invocation
<<<A, B>>> specifies a 2-level hierarchy: a grid of blocks; A blocks, each block of size B
All threads within a block are scheduled on the same SIMD SM
  Can share local memory (actually called shared memory in CUDA lingo; there is separate thread-local memory which, ironically, may not be physically close)
  Can synchronize with each other within the block: __syncthreads()
Different blocks are only loosely tied
  Must be able to execute independently (concurrently)
  Do share global memory

34 Thread Execution
A block does not execute in a SIMD fashion
Executed in groups of 32 parallel threads called warps (there need not be 32 physical cores)
All threads of a warp start together, but may diverge by branching
Branch paths are serialized until they converge back: an important efficiency consideration
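
A hedged illustration of the divergence point (kernel names and bodies are invented for the example): in divergent_kernel, the two halves of every warp take different paths and are serialized; in warp_aligned_kernel, the branch granularity is a whole warp, so no warp diverges:

__global__ void divergent_kernel(float *d) {
    int tid = threadIdx.x;
    if (tid % 2 == 0) d[tid] *= 2.0f;   // half of each warp
    else              d[tid] += 1.0f;   // the other half: serialized
}

__global__ void warp_aligned_kernel(float *d) {
    int tid = threadIdx.x;
    if ((tid / 32) % 2 == 0) d[tid] *= 2.0f;  // whole warps per path
    else                     d[tid] += 1.0f;  // no intra-warp divergence
}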

35 Grid/Block Dimension
Invocation: <<<A, B>>>; A and B need not be ints
A is a (up to) two-dimensional vector: dim3 A(m, n, 1); m, n are ints
B is a (up to) three-dimensional vector: dim3 B(a, b, c); a, b, c are ints
  a * b * c <= 512 on Tesla (1024 on Fermi, Kepler)
Up to 8 blocks may co-exist on an SM (16 on Kepler); resource sharing can further limit the counts; at least 1 must fit
c is the most significant dimension, a is the least
Dereference: B.x, B.y, and B.z
Thread ID = (x + y * a + z * a * b), where (x, y, z) = threadIdx
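
Putting the slide's formula into code, a minimal sketch (the grid/block sizes are arbitrary examples):

#include <cuda_runtime.h>

__global__ void Kernel(int *out)
{
    // The slide's flattened in-block thread ID: x + y*a + z*a*b,
    // with (a, b) = (blockDim.x, blockDim.y).
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    out[tid] = tid;   // illustration: every block writes the same values
}

int main(void)
{
    dim3 A(2, 2);        // 2 x 2 grid of blocks (z defaults to 1)
    dim3 B(8, 4, 2);     // 8*4*2 = 64 threads per block, within limits
    int *d;
    cudaMalloc((void**)&d, 64 * sizeof(int));
    Kernel<<<A, B>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}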

36 Thread ID
[Figure: a block of dimensions a x b x c; a thread at coordinates (x, y, z) has ID x + y * a + z * a * b.]

37 CUDA Memory Model Overview
Global memory
  Main host-device data communication path
  Visible to all threads
  Long latency
  L1 and L2 caches exist on Fermi/Kepler
Shared memory
  Fast memory; use as scratch
  Shared across a block
More memory segments: constant and texture (read-only, cached)
[Figure: host connected to device global memory; each block with its own shared memory and per-thread registers.] (courtesy Kirk & Hwu)

38 Memory Model Details
Shared memory is tied to a block; its lifetime ends with the block
Global, constant, and texture memories are persistent across kernels (within an application)
These are recognized as device memory, separate from host memory
The app must explicitly allocate/de-allocate device memory and manage data transfer between host and device memory

39 CUDA Device Memory Allocation
cudaMalloc()
  Allocates in global memory
  Parameters: address of a pointer to the allocated object; size of the allocated object
  No graphics display reset
cudaFree()
  Frees an object from global memory
Both are called on the host! Device pointers appear like host pointers
[Figure: host, grid, blocks, registers, shared memory, global memory.] (Thanks Kirk & Hwu)

40 CUDA Device Memory Allocation (cont.)
Code example:
  Allocate a 64 * 64 single-precision float array
  Attach the allocated storage to dM
  Prefix/suffix 'd' is often used for device data

int TILE_WIDTH = 64;
float *dM;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
cudaMalloc((void**)&dM, size);
...
cudaFree(dM);
(Courtesy Kirk & Hwu)

41 Example Memory Copy

size_t size = N * sizeof(float);
// Allocate vectors in host memory
float* hA = (float*)malloc(size);
float* hB = (float*)malloc(size);
// Declare device vector handles
float *dA, *dB;
// Allocate vectors in device global memory
cudaMalloc(&dA, size);
cudaMalloc(&dB, size);
// Copy host->device
cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);
// Invoke kernel on GPU
ProcessDo<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, N);
// Copy result, hB, from device memory to host memory
cudaMemcpy(hB, dB, size, cudaMemcpyDeviceToHost);
// Free device memory
cudaFree(dA); cudaFree(dB);

See cudaMallocPitch() and cudaMalloc3D() for allocating 2D/3D arrays; they pad to meet alignment for efficient access (also see cudaMemcpy2D(), cudaMemcpy3D(), cudaMemcpyToSymbol()).

42 Example Host-Device Data Transfer

cudaMemcpy(dM, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, dM, size, cudaMemcpyDeviceToHost);

Transfers a 64 * 64 float array; M is in host memory and dM is in device memory
cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants
Also see cudaMemcpyAsync(). Recall the allocation earlier.
(Courtesy Kirk & Hwu)

43 More Ways to Initialize

__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));

There is also page-locked (i.e., pinned) host memory: cudaHostAlloc() and cudaFreeHost()
  Copies between page-locked host memory and device memory can be performed concurrently with kernel execution
  Page-locked host memory can be directly mapped into the address space of the device
  Bandwidth between host memory and device memory is generally higher
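
A minimal sketch of the pinned-memory path the slide describes, assuming a single stream and an illustrative buffer size:

#include <cuda_runtime.h>

int main(void)
{
    const size_t size = 1 << 20;    // 1 MB, arbitrary for the example
    float *h_pinned, *d;
    // Pinned host allocation: eligible for async, full-speed copies.
    cudaHostAlloc((void**)&h_pinned, size, cudaHostAllocDefault);
    cudaMalloc((void**)&d, size);
    // An async copy in a non-default stream can overlap kernel execution.
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMemcpyAsync(d, h_pinned, size, cudaMemcpyHostToDevice, s);
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h_pinned);
    return 0;
}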

44 Memory Mapping
A block of page-locked host memory can be mapped into the address space of the device
  Use the flag cudaHostAllocMapped in cudaHostAlloc()
  cudaHostAlloc() returns a host memory pointer; the device memory pointer can be retrieved using cudaHostGetDevicePointer()
Accessing host-mapped memory from within a kernel has some advantages:
  Data transfers are implicitly performed as needed by the kernel
  Kernel-originated data transfers can overlap with kernel execution; no need to use streams to overlap data transfers with kernel execution
The application must synchronize memory accesses using streams or events to prevent RAW, WAR, or WAW hazards.

45 Memory Access Efficiency
All (active) threads of a warp load/store together
Concurrent accesses by a warp can be coalesced into fewer memory transactions
Shared memory is distributed into banks; non-conflicting accesses are efficient
Atomic operations are serialized
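
A sketch of what coalescing means in practice (kernel names invented for the example): in coalesced(), the 32 threads of a warp touch 32 consecutive words, which the hardware can merge into a few transactions; in strided(), each warp's accesses are spread 'stride' words apart and cannot be coalesced:

__global__ void coalesced(float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d[i] += 1.0f;                  // consecutive words per warp
}

__global__ void strided(float *d, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    d[i] += 1.0f;                  // scattered: many transactions per warp
}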

46 CUDA Function Qualifiers

__device__ float dSomeName() {}
__global__ void kSomeName() {}

__global__ defines a kernel function
  Must return void
  Called from host, runs on device
  No recursion
__device__ functions are executed on and callable from the device
A __host__ qualifier also exists; __device__ and __host__ can be used together

47 CUDA Function Declarations (cont.)
__device__ functions are inlined for Tesla; can use __noinline__ and __forceinline__
Recursion is not supported in older versions
For functions executed on the device:
  No static variables inside the function
  No variable number of arguments
(Courtesy Kirk & Hwu)

48 Calling a Kernel Function - Thread Creation
Kernels are called with an execution configuration:

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);        // 5000 thread blocks
dim3 DimBlock(4, 8, 8);       // 256 threads/block
size_t SharedMemBytes = 64;   // 64 bytes of shared mem, dynamically allocated (more later)
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

Calls to a kernel function are asynchronous
Parameters are passed through shared/constant memory
Provision for explicit sync for blocking
(Courtesy Kirk & Hwu)

49 CUDA Variable Qualifiers
__device__
  Can be used in conjunction with the qualifiers below
  Resides in global memory; visible to all threads
  Accessible from the host through the runtime library
__constant__
  Read only on device
__shared__
  Has the lifetime of the block
  Accessible only from threads of that block
  extern allowed only with a single array; cannot be initialized at declaration
  Writes to non-volatile variables may be delayed; __syncthreads() required to ensure visibility across threads
Restrictions: file scope only; no extern; no struct/union in formal params; no struct/union on local vars; cannot be used on host-local variables; __constant__/__shared__ are static

50 External Shared Memory Variables
Instead of:

extern __shared__ short array0[128];
extern __shared__ float array1[64];
extern __shared__ int array2[256];

Do:

extern __shared__ char array[];
__device__ void func() // or __global__
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int*   array2 = (int*)&array1[64];
}

51 Memory Usage
Local variables go in registers or local memory
Pointers in device code must be resolved at compile time: whether pointing to shared or global memory [compute capability 1.x]
Following a global memory pointer in host code is not allowed
The address obtained by taking the address of a __device__, __shared__, or __constant__ variable can only be used in device code
The address of a __device__ or __constant__ variable can be queried using cudaGetSymbolAddress() on the host

52 Built-in Vector Types
char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4,
short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4,
int1, uint1, int2, uint2, int3, uint3, int4, uint4,
long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4,
float1, float2, float3, float4, double2, dim3

int2 point = make_int2(x, y); // x, y are ints

53 OpenACC

// Simple kernels region: vector add
void aplusb(int n, float *restrict a, float *b)
{
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}

// Jacobi iteration
#pragma acc data copy(A, Anew)
while ( error > tol && iter < iter_max ) {
    error = 0.f;
    #pragma acc kernels
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i]);
            error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
        }
    }
    #pragma acc kernels
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }
    iter++;
}

54 OpenCL
Execution model: kernels are executed by one or more work-items; work-items are collected into work-groups, and each work-group executes on a compute unit
Memory model: kernel data must be specifically placed in one of four address spaces: global memory, constant memory, local memory, or private memory; the location of the data determines how quickly it can be processed

55 Matrix Multiplication Example
Illustrates basic memory/thread management
  Assumes square matrices for simplicity
  No shared memory usage yet
  Local, register usage
  Thread ID usage
  Memory data transfer between host and device
(Courtesy Kirk & Hwu)

56 Square Matrix Multiplication
P = M * N, each of size WIDTH x WIDTH
Without tiling:
  One thread calculates one element of P
  M and N are loaded WIDTH times from global memory
[Figure: M, N, and P as WIDTH x WIDTH grids.] (Courtesy Kirk & Hwu)

57 MATMUL: A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
(Courtesy Kirk & Hwu)

58 MATMUL: Matrix Data Transfer

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;
    // 1. Allocate and load M, N to device memory
    cudaMalloc(&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc(&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    // Allocate P on the device
    cudaMalloc(&Pd, size);
    // 2. Kernel invocation code to be shown later
    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
(Courtesy Kirk & Hwu)

59 MATMUL: Kernel Function

// Matrix multiplication kernel - per-thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue stores the matrix element computed by the thread
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}

60 MATMUL: Matrix Data Transfer

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *dM, *dN, *dP;
    // 1. Allocate and load M, N to device memory
    cudaMalloc(&dM, size);
    cudaMemcpy(dM, M, size, cudaMemcpyHostToDevice);
    cudaMalloc(&dN, size);
    cudaMemcpy(dN, N, size, cudaMemcpyHostToDevice);
    // Allocate P on the device
    cudaMalloc(&dP, size);
    // 2. Kernel invocation code to be shown later
    // 3. Read P from the device
    cudaMemcpy(P, dP, size, cudaMemcpyDeviceToHost);
    // Free device matrices
    cudaFree(dM); cudaFree(dN); cudaFree(dP);
}
(Courtesy Kirk & Hwu)

61 MATMUL: Matrix Data Transfer

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *dM, *dN, *dP;
    // 1. Allocate and load M, N to device memory
    cudaMalloc(&dM, size);
    cudaMemcpy(dM, M, size, cudaMemcpyHostToDevice);
    cudaMalloc(&dN, size);
    cudaMemcpy(dN, N, size, cudaMemcpyHostToDevice);
    // Allocate P on the device
    cudaMalloc(&dP, size);
    // 2. Set up the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);
    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(dM, dN, dP, Width);
    // 3. Read P from the device
    cudaMemcpy(P, dP, size, cudaMemcpyDeviceToHost);
    // Free device matrices
    cudaFree(dM); cudaFree(dN); cudaFree(dP);
}

62 Single Thread Block
One thread block computes Pd; each thread computes one element of Pd
Each thread:
  Loads a row of matrix Md
  Loads a column of matrix Nd
  Performs one multiply and one addition for each pair of Md and Nd elements
Compute to off-chip memory access ratio close to 1:1 (not very high)
Size of matrix limited by the number of threads allowed in a thread block
[Figure: Grid 1, Block 1; thread (2, 2) computing one element of Pd from a row of Md and a column of Nd.] (Courtesy Kirk & Hwu)

63 MATMUL: Multiple Blocks
A 2D thread block computes a (TILE_WIDTH)^2 sub-matrix of Pd, with (TILE_WIDTH)^2 threads/block
Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks
Still need a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than the max grid size!
[Figure: Md, Nd, Pd tiled by (bx, by) block and (tx, ty) thread coordinates.] (Courtesy Kirk & Hwu)

64 MATMUL: Multiple Blocks
A 2D thread block computes a (TILE_WIDTH)^2 sub-matrix of Pd, with (TILE_WIDTH)^2 threads/block
If the grid size required is greater than the maximum allowed, invoke multiple kernels
[Figure: the (bx, by) block and (tx, ty) thread coordinates selecting a row of Md and a column of Nd, with loop index k.]
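
A sketch completing these two slides: the same dot-product loop as slide 59, but indexed through blockIdx so a 2D grid of blocks covers the whole matrix. The kernel name is a variant invented here, and blockDim plays the role of TILE_WIDTH:

__global__ void MatrixMulKernelMB(float* Md, float* Nd, float* Pd, int Width)
{
    // (blockDim.x, blockDim.y) == (TILE_WIDTH, TILE_WIDTH)
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    Pd[row * Width + col] = Pvalue;
}

// Launch (assuming Width divisible by TILE_WIDTH):
//   dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
//   dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
//   MatrixMulKernelMB<<<dimGrid, dimBlock>>>(dM, dN, dP, Width);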

65 Synchronization (Block)
void __syncthreads()
  Block barrier AND ensures all global/shared memory accesses by all threads in the block are visible
int __syncthreads_count(int predicate)
  Returns the count of threads for which predicate != 0
int __syncthreads_and(int predicate)
  Returns non-zero iff predicate != 0 for all threads
int __syncthreads_or(int predicate)
  Returns non-zero iff predicate != 0 for any thread

66 Intra-Warp Synchronization
int __all(int predicate)
  Returns non-zero iff predicate != 0 for all threads of the warp
int __any(int predicate)
  Returns non-zero iff predicate != 0 for any thread of the warp
unsigned int __ballot(int predicate)
  Returns an int with the nth bit set iff predicate != 0 for the nth thread of the warp
Only supported by devices of sufficiently high compute capability

67 Atomic Operations
Read-modify-write on one 32-bit or 64-bit word in global or shared memory
Atomic ops on 64-bit words in shared memory available since compute capability 2.0
If the target is mapped page-locked memory, it's not atomic from the host's perspective
Only supported on signed and unsigned integers, except atomicExch() and atomicAdd(), which in 2.0 also work on float
More generic compare-and-swap: atomicCAS()
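
A classic use of atomicAdd(), sketched as a global-memory histogram (names invented for the example); concurrent increments to the same bin are serialized by the hardware, per the previous slide:

__global__ void histogram(const unsigned char *data, int n,
                          unsigned int *bins /* 256 zero-initialized bins */)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // safe concurrent increment
}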

68 CAS Example

__device__ double atomicAdd(double* address, double val)
{
    double old = *address, assumed;
    do {
        assumed = old;
        old = __longlong_as_double(
                  atomicCAS((unsigned long long int*)address,
                            __double_as_longlong(assumed),
                            __double_as_longlong(val + assumed)));
    } while (assumed != old);
    return old;
}

69 Thrust Template Library

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

int main(void)
{
    thrust::host_vector<int> h_vec(1 << 24);
    // Initialize host array now...

    // Transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;
    // Do more...

    // Later transfer data back to the host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    return 0;
}
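
Thrust also supplies device algorithms; for instance, a sort like the one hand-coded later in this lecture can be had in one call. A minimal sketch:

#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Given device iterators, thrust::sort runs on the GPU.
void sort_on_gpu(thrust::device_vector<int>& d_vec)
{
    thrust::sort(d_vec.begin(), d_vec.end());
}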

70 Checking Return Values
Functions return cudaError_t:
  cudaSuccess
  cudaErrorInvalidValue
  cudaErrorInvalidSymbol
  cudaErrorInvalidDevicePointer
  cudaErrorInvalidMemcpyDirection
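
These error codes are commonly checked with a wrapper macro; a sketch (CUDA_CHECK is a local convention here, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap each runtime call so failures report file/line instead of
// being silently ignored.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err));                     \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc((void**)&d, size));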

71 CUDA Sort Example

#define NUM 256

int main(int argc, char** argv)
{
    int values[NUM];
    for (int i = 0; i < NUM; i++)
        values[i] = getNextValue();
    int *dvalues;
    cudaMalloc((void**)&dvalues, sizeof(int) * NUM);
    cudaMemcpy(dvalues, values, sizeof(int) * NUM, cudaMemcpyHostToDevice);
    ParallelSort<<<1, NUM, sizeof(int) * NUM>>>(dvalues);
    cudaMemcpy(values, dvalues, sizeof(int) * NUM, cudaMemcpyDeviceToHost);
    cudaFree(dvalues);
}
(Contd...)

72 Sort Kernel - pg 1

__device__ inline void swap(int &a, int &b)
{
    int tmp = a;   // We expect these in registers
    a = b;
    b = tmp;
}

__global__ static void ParallelSort(int *values)
{
    extern __shared__ int shared[];
    const unsigned int tid = threadIdx.x;
    // Copy input to shared mem.
    shared[tid] = values[tid];
    __syncthreads();

73 Sort Kernel - pg 2

    // Parallel bitonic sort.
    for (unsigned int k = 2; k <= NUM; k <<= 1) {
        for (unsigned int j = k / 2; j > 0; j >>= 1) {
            unsigned int ixj = tid ^ j;
            if (ixj > tid) {
                if ((tid & k) == 0) {
                    if (shared[tid] > shared[ixj])
                        swap(shared[tid], shared[ixj]);
                } else {
                    if (shared[tid] < shared[ixj])
                        swap(shared[tid], shared[ixj]);
                }
            }
            __syncthreads();
        }
    }
    values[tid] = shared[tid];   // Write result.
}
