
Stanford University. NVIDIA Tesla M2090. NVIDIA GeForce GTX 690.

Moore's Law

Clock Speed (chart: CPU clock speed in MHz, on a log scale from 1 to 10000, versus date of introduction from 1974 to 2011, for the 8080, 8086, 80286, 80386, 80486, Pentium, Pentium II, Pentium III, Pentium 4 Willamette, Pentium 4 Prescott, Core 2, Nehalem, and Sandy Bridge).

How did they use all those additional transistors? Additional functionality: floating point units, SSE vector units, caches (data, instructions, translation lookaside buffer for virtual memory), hardware prefetcher. Instruction level parallelism: instruction pipelining, superscalar execution, out of order execution, speculative execution, branch prediction.

Multicore CPUs (chart: maximum number of cores, 1 to 16, versus date of introduction, 2003 to 2011, for AMD Opteron CPUs; six-core AMD Opteron pictured, image from AMD).

Graphics Processing Unit (GPU). Leverages demand and volume from video game players. Consumer versions are widely available at any electronics store at a variety of price points. Programmable via free software options. (Images from NVIDIA.)

Compute and Bandwidth Performance (charts of floating-point throughput and memory bandwidth). Source: NVIDIA CUDA C Programming Guide Version 4.2.

Hardware Characteristics. 3 or 4 generations of NVIDIA CUDA architectures were released from 2007 to 2012. Key characteristics are: architectural differences (double precision floating point on newer generations); memory system (newer generations are much more flexible; L1 and L2 caches have been added in later generations); number of cores and the rate at which they're clocked; bandwidth (depends on the memory clock and the width of the memory interface); amount of on-board memory (256 MB to 6 GB); power consumption (higher end GPUs need auxiliary 6- and/or 8-pin PCIe power connectors).

NVIDIA Fermi Architecture

Fermi Streaming Multiprocessor

GPU Parallelism. A multicore CPU needs one or two threads per core to run efficiently; GPUs need thousands to tens of thousands of threads to run efficiently. Each time a GPU computes a frame (which it does tens of times a second) it uses a thread per pixel, of which there are millions. In the context of computation, this means fine-grain parallelism. This is possible because GPU threads are different from CPU threads.

GPU Threads. Lightweight compared to CPU threads: creation, scheduling, and destruction are done in hardware, and switching between threads is fast. There are many times more threads than cores, typically about 100x. Why so many threads?

GPU Memory System. The GPU doesn't rely solely on cache to hide memory latency, so many more transistors are available for computational units. It can do without cache because it's specialized to handle parallel computations: when a thread stalls due to memory access latency, a core switches to executing another thread, and when that thread eventually stalls, the core switches to another thread, and so on. Registers are partitioned to allow fast switching (one clock cycle). More threads means more opportunity for latency hiding. A GPU can run a single-threaded function, but the performance will be horrible.

GPU Programming Today: NVIDIA CUDA C and Fortran, OpenCL, Microsoft DirectCompute, OpenACC.

Getting Started with CUDA. Hardware: CUDA requires compatible hardware (the emulator is no longer available); any reasonably new NVIDIA GPU supports CUDA; check the NVIDIA website for more information: http://www.nvidia.com/object/cuda_learn_products.html. Software: CUDA C is freely available from the NVIDIA website: http://www.nvidia.com/getcuda. This consists of the CUDA driver (part of the display driver), the toolkit (compiler and libraries), and the SDK code examples. Windows, Linux, and Mac OS X are supported. Use the Quick Start Guides for installation and verification.
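
As a quick sanity check that the driver and toolkit are installed, a small program that enumerates the available GPUs can be built with nvcc and run; this is a minimal sketch (the file name device_check.cu and the output format are arbitrary, not part of the slides):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t status = cudaGetDeviceCount(&count);
    if (status != cudaSuccess || count == 0) {
        printf("No CUDA-capable device found\n");
        return 1;
    }
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        /* Print name, compute capability, and total global memory. */
        printf("Device %d: %s, compute capability %d.%d, %zu MB\n",
               i, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024 * 1024));
    }
    return 0;
}

Compile and run with, for example, nvcc device_check.cu -o device_check && ./device_check.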

CUDA C. CPU code: an API for interacting with the GPU(s), plus an extension for easily invoking computational kernels that run on the GPU. GPU code: a subset of C with extensions; kernels must return void; recursion, malloc/free or new/delete, printf, assert, etc. are only supported on Fermi GPUs and beyond. A parallel programming model. Libraries, callable from the CPU and run on the GPU: BLAS, FFT, CURAND, CUSPARSE, NPP, Thrust, etc.

Simple CUDA C Program.

/* __global__ is the keyword that indicates GPU code. Each thread computes
   a unique index for the data it will access and produce. */
__global__ void addVectors(float *a, float *b, float *c)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    c[idx] = a[idx] + b[idx];
}

/* CPU code calls the GPU code and specifies how many GPU threads to start,
   in this case one thread per element in the vectors. */
addVectors<<<nValues/256, 256>>>(a_d, b_d, c_d);

CPU and GPU Memory. The CPU and GPU each have their own physical memory. Data is transferred over PCI Express (PCIe): 8 GB/s theoretical peak for Gen 2 x16, up to ~5.5 GB/s observed.

Allocating GPU Memory. GPU memory is explicitly allocated and freed with cudaMalloc and cudaFree. Pointers to memory allocated on the GPU are not valid on the CPU, and vice versa. The GPU uses a virtual memory system, but: on Windows Vista systems (and their derivatives, e.g. Windows HPC Server 2008) and up, allocating beyond physical memory will automatically result in paging to CPU memory; on all other operating systems allocations will fail (done for performance reasons). An allocation has the life of the host CPU process/thread and is automatically cleaned up by the driver if the application doesn't do so.

cudaMalloc.

cudaError_t cudaMalloc(void **devPtr, size_t size)

float *a_d;
status = cudaMalloc((void **) &a_d, 1024*sizeof(float));
assert(status == cudaSuccess);

There is no type distinction between CPU and GPU pointers, so adopting a standard naming convention is recommended, e.g. a _d suffix for device pointers. Be sure to implement some sort of error checking consistent with your application; one possible pattern is sketched below.
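
A common pattern (not from the slides) is to wrap every runtime call in a checking macro so failures report the file and line; the macro name CUDA_CHECK is arbitrary:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Abort with a readable message if a CUDA runtime call fails. */
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

/* Usage: CUDA_CHECK(cudaMalloc((void **) &a_d, 1024*sizeof(float))); */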

cudaFree.

cudaError_t cudaFree(void *devPtr)

status = cudaFree(a_d);
assert(status == cudaSuccess);

cudaMemcpy.

cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)

Copies from the memory area pointed to by src to the memory area pointed to by dst. kind specifies the direction of the copy: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice. Calls with dst and src pointers inconsistent with the copy direction will result in undefined behavior (typically garbage in the destination, or perhaps even an application crash). cudaMemcpy will block until the memory copy has completed; options for asynchronous copies are also available (see the sketch after the next example).

cudaMemcpy Example.

size_t bytes = 1024*sizeof(float);
a = (float *) malloc(bytes);
b = (float *) malloc(bytes);
cudaMalloc((void **) &a_d, bytes);
cudaMalloc((void **) &b_d, bytes);

cudaMemcpy(a_d, a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(b_d, a_d, bytes, cudaMemcpyDeviceToDevice);
cudaMemcpy(b, b_d, bytes, cudaMemcpyDeviceToHost);

for (int n = 0; n < 1024; n++)
    assert(a[n] == b[n]);
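
For the asynchronous option mentioned on the previous slide, a copy can be issued into a stream and overlapped with host work; this is a rough sketch, not from the slides, that reuses bytes and a_d from the example above and assumes the host buffer is allocated as page-locked (pinned) memory so the copy can actually proceed asynchronously:

cudaStream_t stream;
cudaStreamCreate(&stream);

float *a_pinned;
cudaMallocHost((void **) &a_pinned, bytes);   /* page-locked host memory */

/* The copy is only enqueued; control returns to the host immediately. */
cudaMemcpyAsync(a_d, a_pinned, bytes, cudaMemcpyHostToDevice, stream);

/* ... the CPU can do independent work here ... */

cudaStreamSynchronize(stream);                /* wait for the copy to finish */

cudaFreeHost(a_pinned);
cudaStreamDestroy(stream);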

Parallel Programming Model. Parallel portions of the code are initiated from the CPU and run on the GPU. Parallelism is based on many threads running in parallel: the developer writes one thread program, and each instance of the thread uses a unique index to determine which portion of the computation to perform. Sometimes referred to as SIMD (Single Instruction Multiple Data), SPMD (Single Program Multiple Data), or SIMT (Single Instruction Multiple Threads).

Parallel Threads (diagram: each thread has an index idx running from 0 to nThreads - 1 and executes the same body: x = input[idx]; y = func(x); output[idx] = y;).

Thread Cooperation. It is useful to have threads cooperate with one another, to share intermediate results and reduce memory accesses (stencil operations, etc.). Cooperation is difficult to scale: synchronization is expensive and there is potential for deadlock. Kernels are therefore launched as a grid of thread blocks.

Grid of Thread Blocks (diagram: a grid of blocks 0, 1, 2, ..., nBlocks - 1). Threads in the same block can cooperate (more on this later when we discuss shared memory); threads in different blocks cannot cooperate.

Hardware Execution. Software to hardware mapping: a thread runs on a thread processor; a thread block runs on a multiprocessor; a grid runs on the device.

Scalability Across GPUs. Blocks are scheduled across one or more multiprocessors. A correctly written program will work for any number of multiprocessors and any ordering of blocks; blocks which won't fit are queued and started when other blocks finish. (Diagram: the same grid of blocks scheduled over time on Device A and on Device B.)

GPU Code. A kernel is a C function with restrictions: it must return void, cannot access host memory, takes no variable number of arguments, allows no recursion on older generations of GPUs, and has no static variables. Function arguments are automatically copied from host to device, but not the memory that backs pointers.

Function Qualifiers. Function qualifiers are used to specify where a function will be called from and where it will execute. __global__: called from the host and executes on the device. __device__: called from the device and executes on the device. __host__: called from the host and executes on the host; combine with __device__ for overloading.
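
A small sketch, not from the slides, showing the three qualifiers together (the function names are made up):

/* Compiled for both host and device; callable from either side. */
__host__ __device__ float square(float x)
{
    return x * x;
}

/* Device-only helper; callable only from GPU code. */
__device__ float scaledSquare(float x, float s)
{
    return s * square(x);
}

/* Kernel: called from the host, executes on the device. */
__global__ void scaleKernel(float *a, float s, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = scaledSquare(a[idx], s);
}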

Kernel Launch. Special syntax for invoking kernels:

myKernel<<<dim3 dimGrid, dim3 dimBlock>>>( );

<<< , >>> is referred to as the execution configuration. dimGrid is the number of blocks in the grid, one or two dimensional: dimGrid.x, dimGrid.y. dimBlock is the number of threads in a block, one, two, or three dimensional: dimBlock.x, dimBlock.y, dimBlock.z. Multidimensional grids and blocks are for programming convenience. Unspecified dim3 fields default to 1.

Execution Configuration Examples.

dim3 grid, block;
grid.x = 2; grid.y = 4;
block.x = 16; block.y = 16;
myKernel<<<grid, block>>>( );

dim3 grid(2, 4), block(16, 16);
myKernel<<<grid, block>>>( );

myKernel<<<8, 256>>>( );

Built-in Variables. __global__ and __device__ functions have access to several automatically defined variables: dim3 gridDim, the dimension of the grid in blocks; dim3 blockDim, the dimension of the block in threads; dim3 blockIdx, the block index within the grid; dim3 threadIdx, the thread index within the block.

Globally Unique Thread Indices (diagram: with blockIdx.x running 0, 1, 2, ..., nBlocks - 1 and threadIdx.x running 0..3 within each block, the global index idx runs 0, 1, 2, ..., 11). idx = blockIdx.x*blockDim.x + threadIdx.x;

Vector Addition Example.

__global__ void addVectors(float *a, float *b, float *c, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N) c[idx] = a[idx] + b[idx];
}

...
blockSize = 256;
dim3 dimGrid(ceil(nValues/(float)blockSize));
addVectors<<<dimGrid, blockSize>>>(a_d, b_d, c_d, nValues);
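
Putting the pieces from the preceding slides together, a complete host program for the vector addition might look like the following; this is a sketch under the conventions used above (the _d suffix, one thread per element), not code taken from the slides:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void addVectors(float *a, float *b, float *c, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N) c[idx] = a[idx] + b[idx];
}

int main(void)
{
    int nValues = 1 << 20;
    size_t bytes = nValues*sizeof(float);

    /* Host allocations and initialization. */
    float *a = (float *) malloc(bytes);
    float *b = (float *) malloc(bytes);
    float *c = (float *) malloc(bytes);
    for (int n = 0; n < nValues; n++) { a[n] = n; b[n] = 2.0f*n; }

    /* Device allocations. */
    float *a_d, *b_d, *c_d;
    cudaMalloc((void **) &a_d, bytes);
    cudaMalloc((void **) &b_d, bytes);
    cudaMalloc((void **) &c_d, bytes);

    /* Copy inputs to the device, launch one thread per element, copy the result back. */
    cudaMemcpy(a_d, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b, bytes, cudaMemcpyHostToDevice);

    int blockSize = 256;
    int nBlocks = (nValues + blockSize - 1)/blockSize;  /* same as ceil(nValues/(float)blockSize) */
    addVectors<<<nBlocks, blockSize>>>(a_d, b_d, c_d, nValues);

    cudaMemcpy(c, c_d, bytes, cudaMemcpyDeviceToHost);

    /* Verify. */
    for (int n = 0; n < nValues; n++) assert(c[n] == a[n] + b[n]);
    printf("vector addition OK\n");

    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    free(a); free(b); free(c);
    return 0;
}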

Reduction. Combine the elements of an array using an associative, commutative operator. Typical examples include sum, min, max, product, etc.

sum = 0.;
for (n = 0; n < nValues; n++)
    sum += a[n];

Note that CUDPP and Thrust implement very efficient reductions as library calls: http://gpgpu.org/developer/cudpp and http://code.google.com/p/thrust/.
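
As a point of comparison, and not part of the slides, the same sum can be computed with a single Thrust call; Thrust is a C++ template library, so the file is compiled with nvcc as C++:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdio>

int main(void)
{
    /* Fill a device vector with ones and sum it on the GPU. */
    thrust::device_vector<float> a(1024, 1.0f);
    float sum = thrust::reduce(a.begin(), a.end(), 0.0f);
    std::printf("sum = %f\n", sum);   /* expect 1024.0 */
    return 0;
}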

Generic Parallel Reduction. Implemented using recursive pairwise reduction. (Diagram: the values 7, 3, 8, -5, 19, 27, 0, 9 are summed pairwise to 10, 3, 46, 9, then to 13, 55, then to the final result 68.)

Simple Parallel Reduction (diagram). Starting from 7, 3, 8, -5, 19, 27, 0, 9, each kernel launch adds the upper half of the array onto the lower half: after kernel 1 the first four elements are 26, 30, 8, 4; after kernel 2 the first two are 34, 34; after kernel 3 the first element holds the sum, 68.

Simple Parallel Reduction.

__global__ void sumReductionKernel(float *a, int nThreads)
{
    int idx = blockDim.x*blockIdx.x + threadIdx.x;
    if (idx < nThreads) a[idx] += a[idx + nThreads];
}

...
nThreads = nValues/2;
while (nThreads > 0) {
    dimGrid.x = ceil((float)nThreads/blockSize);
    sumReductionKernel<<<dimGrid, blockSize>>>(a_d, nThreads);
    nThreads /= 2;
}
cudaMemcpy(a, a_d, sizeof(float), cudaMemcpyDeviceToHost);
printf("sum of a = %f\n", a[0]);

Memory Model. So far we've seen per-thread variables (like idx), which are stored in registers, and memory in the off-chip DRAM (device/global memory). Registers: accessible by one thread, life of the thread. Device memory: accessible by all threads, life of the application.

Shared Memory. Exchanging data through device memory is expensive due to bandwidth and latency, and it also requires multiple kernel launches (global synchronization between kernel launches). Instead, use high performance on-chip memory: ~100 times lower latency than device memory, ~10 times more bandwidth. There are 16 KB to 48 KB of SRAM per multiprocessor: 16 KB on compute capability 1.x, 16 KB to 48 KB on compute capability 2.x+. Shared memory is allocated per thread block, can be read and written by any thread in the block, and has the lifetime of the thread block.

Expanded Memory Model. Registers: accessible by one thread, life of the thread. Shared memory: accessible by all threads in a block, life of the thread block. Device memory: accessible by all threads, life of the application.

Variable Qualifiers. __device__: located in off-chip DRAM memory; allocated with cudaMalloc (the __device__ qualifier is implied); life of the application; accessible from threads and host. __shared__: located in on-chip shared memory; life of the thread block; only accessible from threads within the block. __constant__: see the documentation. Unqualified variables in device code normally reside in registers.
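
As a brief illustration of the __constant__ qualifier mentioned above (the programming guide has the details), a small read-only table can be placed in constant memory and filled from the host with cudaMemcpyToSymbol; this sketch and its names are illustrative only:

/* Read-only coefficient table, cached and visible to all threads. */
__constant__ float coeffs_c[16];

__global__ void applyCoeffs(float *a, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N) a[idx] *= coeffs_c[idx % 16];
}

/* Host side: copy the table into constant memory before the launch. */
float coeffs[16] = { /* ... */ };
cudaMemcpyToSymbol(coeffs_c, coeffs, 16*sizeof(float));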

Example of Shared Memory Declaration.

#define BLOCKSIZE 256

__global__ void myKernel(float *a, int nValues)
{
    /* Per thread block shared memory. */
    __shared__ float a_s[BLOCKSIZE];

    /* Local (per thread) variables. */
    int idx;
    ...
}

More on Shared Memory Declaration.

__global__ void myKernel(float *a, int nValues)
{
    /* Per thread block shared memory.
       The size of a_s is specified at kernel launch. */
    extern __shared__ float a_s[];

    /* Local variables. */
    int idx;
    ...
}

...
bytes = 256*sizeof(float);
myKernel<<<dimGrid, dimBlock, bytes>>>(a_d, nValues);

Thread Synchronization. Threads can cooperate by writing and reading shared memory, but there is potential for race conditions: a thread reads from shared memory before another thread has written the data, etc. __syncthreads() synchronizes all threads in a block. It acts as a barrier: no thread in the block can continue until all threads reach it. It is allowed in conditional code only if the conditional is uniform across the entire thread block!
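
A tiny sketch of the pattern, not from the slides: each thread stages one element in shared memory, and the barrier guarantees that all writes are visible before any thread reads its neighbor's element (the kernel assumes it is launched with 256 threads per block):

__global__ void shiftLeft(float *a, int N)
{
    __shared__ float a_s[256];
    int idx = blockIdx.x*blockDim.x + threadIdx.x;

    if (idx < N) a_s[threadIdx.x] = a[idx];

    /* Without this barrier a thread could read a_s[threadIdx.x + 1]
       before its neighbor has written it: a race condition. */
    __syncthreads();

    if (idx < N - 1 && threadIdx.x < blockDim.x - 1)
        a[idx] = a_s[threadIdx.x + 1];
}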

Better Parallel Reduction (diagram: in kernel 1, block 0 reduces a[0]..a[255] into a[0], block 1 reduces a[256]..a[511] into a[1], block 2 reduces a[512]..a[767] into a[2], and so on; in kernel 2, block 0 reduces those partial results into a[0]). Reductions within each thread block significantly reduce the number of kernel invocations and the amount of data written back to memory from each kernel.

Better Parallel Reduction: GPU Code (1).

#define BLOCKSIZE 256

__global__ void sumReductionKernel(float *a, int nValues)
{
    int n = BLOCKSIZE/2;
    int idx = blockIdx.x*blockDim.x + threadIdx.x;

    /* Shared memory common to all threads within a block. */
    __shared__ float a_s[BLOCKSIZE];

    /* Load data from global memory into shared memory. */
    if (idx < nValues)
        a_s[threadIdx.x] = a[idx];
    else
        a_s[threadIdx.x] = 0.f;

    __syncthreads();
    ...

Better Parallel Reduction: GPU Code (2).

    ...
    /* Reduction within this thread block. */
    while (n > 0) {
        if (threadIdx.x < n)
            a_s[threadIdx.x] += a_s[threadIdx.x + n];
        n /= 2;
        __syncthreads();
    }

    /* Thread 0 writes the one value from this block back to global memory. */
    if (threadIdx.x == 0)
        a[blockIdx.x] = a_s[0];
}

Better Parallel Reduction: CPU Code.

...
nThreads = nValues;
while (nThreads > 0) {
    dimGrid.x = ceil((float)nThreads/BLOCKSIZE);
    sumReductionKernel<<<dimGrid, BLOCKSIZE>>>(a_d, nThreads);
    nThreads /= BLOCKSIZE;
}
...

Processor-Memory Gap. From: Computer Architecture: A Quantitative Approach by Hennessy and Patterson.

Thread Warps. Thread blocks are made up of groups of threads called warps. The warp size on all current hardware is 32, but it could change on future hardware (it can be queried through the device properties, as sketched below). A warp is executed in lock-step SIMD fashion on a multiprocessor; the hardware automatically handles divergence due to branching. Note that you are free to specify an arbitrary number of threads per block, but the hardware can only work in increments of warps: the number of threads is internally rounded up to a multiple of the warp size and the extras are masked out in terms of memory accesses. Trivia: "warp" is a term which comes from weaving; they are threads woven in parallel.
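
For example, the warp size can be read from the device properties on the host; a small fragment, not from the slides:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                   /* device 0 */
printf("warp size: %d threads\n", prop.warpSize);    /* 32 on current hardware */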

Warps and Half Warps (diagram: a thread block is made up of warps 0, 1, 2, ..., n, and each warp is split into two half warps).

Compute Capability. Compute capability is a versioning scheme for keeping track of multiprocessor capabilities/features. Compute capability 1.0: Tesla architecture, the first CUDA-capable multiprocessor. Compute capability 1.1: adds atomic operations for global memory, etc. Compute capability 1.2: Tesla 2 architecture; doubles the number of registers from 1.0 and 1.1; adds atomic operations for shared memory, etc. Compute capability 1.3: adds double precision floating point, etc. Compute capability 2.x: Fermi architecture. Compute capability 3.x: Kepler architecture. The compute capability of a GPU can be queried at runtime.

Memory Coalescing. Coalescing is the process of combining global memory accesses (loads or stores) across the threads within a warp or half warp into one or more transactions. How coalescing is performed depends on the compute capability: 1.0 and 1.1 have the same coalescing characteristics; 1.2 and 1.3 have the same coalescing characteristics; 1.0 and 1.1 are subsets of 1.2 and 1.3; 2.0 and 3.0 add L1 and L2 caches. Global memory is divided into segments of size 32, 64, 128, and 256 bytes. Pointers from cudaMalloc are always at least 256-byte aligned.

Global Memory Segments

Coalescing on Compute Capability 1.2 and 1.3. Global memory accesses by a half warp are combined to minimize the total number of transactions, which eliminates the dependence on the order in which threads access data present in compute capability 1.0 and 1.1. In addition, transaction sizes are automatically reduced to avoid wasted bandwidth: the transaction size is recursively reduced if only the upper or lower half of the segment is needed (see the programming guide for more details of the algorithm). The minimum transaction size is 32 bytes, with per-thread word sizes of 32, 64, and 128 bits. Note that the coalescing for compute capability 1.2 and 1.3 is a superset of the requirements for compute capability 1.0 and 1.1: code that is efficient on 1.0 and 1.1 will continue to be efficient on 1.2 and 1.3, but not necessarily vice versa.

Examples of Memory Transactions (diagrams)

Coalescing and Caching on Compute Capability 2.x+. Each multiprocessor has 64 KB of SRAM for shared memory and L1 cache; the split between shared memory and L1 can be chosen on a per-kernel basis. The GPU as a whole has an L2 cache. Memory accesses are coalesced across the full warp of 32 threads. The cache line size is 128 bytes, so any misses in L1 will result in one or more 128-byte transactions from L2 to L1. If the L1 cache is bypassed, either through a compilation flag or an inline assembly instruction, the requests are served from L2 using 32-byte transactions. There is no way to bypass the L2 cache.

Big Picture on Global Memory Access. On a CPU, you want spatial locality of data down a thread; on a GPU, you want spatial locality of data across threads. This is true for both NVIDIA and AMD GPUs. (Diagram: CPU versus GPU access patterns over time.)
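
To make the contrast concrete, here is a sketch, not from the slides, of the two access patterns for processing N elements; in the GPU-friendly kernel consecutive threads touch consecutive addresses on every iteration, while the CPU-style kernel gives each thread its own contiguous chunk:

/* GPU-friendly: on each loop iteration the threads of a warp read
   consecutive addresses, so the accesses coalesce into few transactions. */
__global__ void coalescedAccess(float *a, int N)
{
    int stride = gridDim.x*blockDim.x;
    for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < N; i += stride)
        a[i] *= 2.0f;
}

/* CPU-style: each thread walks its own contiguous chunk, so at any instant
   the threads of a warp touch addresses that are far apart (poorly coalesced). */
__global__ void chunkedAccess(float *a, int N)
{
    int nThreads = gridDim.x*blockDim.x;
    int chunk = (N + nThreads - 1)/nThreads;
    int start = (blockIdx.x*blockDim.x + threadIdx.x)*chunk;
    for (int i = start; i < start + chunk && i < N; i++)
        a[i] *= 2.0f;
}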