Multi-Processors and GPU

1 Multi-Processors and GPU Philipp Koehn 7 December 2016

2 Predicted CPU Clock Speed 1
- Clock speed 1971: 740 kHz, 2016: 28.7 GHz
- Source: Horowitz "The Singularity is Near" (2005)

3 Actual CPU Clock Speed 2
- Clock speed 2016: 3 GHz

4 Why? 3
- Intel estimate, around 2000: 1 MW power consumption by 2016?

5 Moore's Law 4
- Number of transistors per chip still grows exponentially

6 What to do with the Transistors? 5
- More parallelism for faster execution of instructions
- More processors on a chip

7 6 multi-processors

8 Intel Core i7: Quad-Core 7

9 Intel Xeon Phi: 72 cores (2016) 8

10 Handling Multiple Processes 9
- Kernel can keep multiple processes running
- Each process is assigned to a core
  - each core has a local cache
  - all cores share a common cache, common memory
- Synchronization between cores not trivial, e.g., cache coherence

11 More Parallelism 10
- Multiple processes not always the best way to parallelize
- Often, parallel execution within a process would be helpful
- Example: matrix multiplication
  - loops over different parts of the data
  - instructions highly independent
  - can be executed in parallel
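The independence claimed for matrix multiplication can be made concrete with a minimal sketch. It assumes square row-major matrices stored in a flat `std::vector<double>` and uses C++11 `std::thread`; the names `Matrix`, `multiply_rows`, and `parallel_multiply` are hypothetical and not from the slides. Each thread writes a disjoint range of output rows, so no locking is needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Row-major n x n matrix in a flat vector (hypothetical helper type).
using Matrix = std::vector<double>;

// Computes rows [row_begin, row_end) of C = A * B.
// Each output row reads A and B only, so row ranges are independent.
void multiply_rows(const Matrix& a, const Matrix& b, Matrix& c,
                   std::size_t n, std::size_t row_begin, std::size_t row_end) {
    for (std::size_t i = row_begin; i < row_end; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

// Splits the rows of C across up to num_threads worker threads.
Matrix parallel_multiply(const Matrix& a, const Matrix& b,
                         std::size_t n, std::size_t num_threads) {
    Matrix c(n * n, 0.0);
    if (num_threads == 0) num_threads = 1;
    const std::size_t chunk = (n + num_threads - 1) / num_threads;  // rows per thread, rounded up
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // no rows left for this thread
        workers.emplace_back(multiply_rows, std::cref(a), std::cref(b),
                             std::ref(c), n, begin, end);
    }
    for (auto& w : workers) w.join();
    return c;
}
```

For example, `parallel_multiply({1, 2, 3, 4}, {5, 6, 7, 8}, 2, 2)` computes each of the two output rows on its own thread and yields `{19, 22, 43, 50}`.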

12 Multi-Threading 11
- Parallel execution within a process
- No switching of process context (e.g., virtual address space)
- Supported by various libraries
  - pthread in C and C++
  - std::thread in C++11
  - threading in Python
- Programmer has to take care of conflicts
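One kind of conflict the programmer has to handle is several threads updating the same variable. The sketch below is an illustration under assumptions, not part of the slides: the function name `count_with_mutex` is hypothetical, and a `std::mutex` is only one of several ways to protect the shared counter.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Each of num_threads threads adds 1 to a shared counter `increments` times.
// The mutex serializes the read-modify-write on `counter`; without it,
// concurrent increments could be lost.
long count_with_mutex(int num_threads, int increments) {
    long counter = 0;
    std::mutex counter_mutex;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, &counter_mutex, increments] {
            for (int i = 0; i < increments; ++i) {
                std::lock_guard<std::mutex> lock(counter_mutex);
                ++counter;
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

With the lock in place, `count_with_mutex(4, 10000)` returns 40000 on every run. Removing the `lock_guard` line makes the result undefined, and in practice it often comes out lower.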

13 12 computer graphics

14 Computer Graphics 13

15 Computer Graphics 14

16 tl;dr 15
- Given: 3D models of objects, lighting, textures
- Ray tracing
- Lots of vector and matrix operations
- Color value for each pixel on the screen has to be computed

17 High Demand 16
- Computer games on regular PCs
- Game consoles
  - Atari ( )
  - Nintendo/Wii (since 1977)
  - PlayStation (since 1994)
  - Xbox (since 2001)
- 100s of millions sold

18 17 history

19 VGA Controller 18

20 GPU 19

21 Co-Processor 20
- CPU handles the bulk of the complexity
- GPU focuses on specific problems

22 Graphics Pipeline 21
- Initially: dedicated hardware for core steps

23 Unified GPU Architecture 22

24 23 gpu

25 Streaming Multiprocessor (SM) 24
- Fetches instruction (I-Cache)
- Has to apply it over a vector of data
- Each vector element is processed in one thread (MT Issue)
- Thread is handled by scalar processor (SP)
- Special function units (SFU)

26 Taxonomy 25
- SISD (single instruction, single data)
  - uni-processors (6502, Intel until 1990s)
- MIMD (multiple instruction, multiple data)
  - Intel Core i7
  - multiple cores on a chip
  - each core runs instructions that operate on their own data
- SIMD (single instruction, multiple data)
  - Streaming Multi-Processors
  - multiple cores on a chip
  - same instruction executed on different data

27 GPU Architecture 26

28 Graphics Programming 27
- Libraries that support all steps of graphics pipeline
  - Open standard: OpenGL
  - Microsoft: Direct3D
- Libraries handle mapping to GPU hardware

29 Direct3D Pipeline 28

30 29 more uses for gpus

31 Deep Learning 30

32 Deep Learning 31
- The latest machine learning hype
- Computationally
  - lots of matrix multiplications
  - lots of vector operations
  - massive data sets
- Just what GPUs are good at

33 CUDA 32
- Extension of C++ to support general GPU programming
- Fairly low-level
  - identify parts of program to be handled by GPU
  - define function to be executed by a thread
  - define how many threads are used
- Key concepts
  - kernel = function to be executed by a thread
  - thread block = set of threads to be executed in parallel
  - thread grid = set of thread blocks
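The relationship between kernel, thread block, and thread grid can be imitated on a CPU. The following is a sketch in plain C++, not CUDA: `LaunchIndex`, `saxpy_kernel`, and `launch_saxpy` are hypothetical stand-ins for CUDA's built-in index variables and launch syntax, and the two loops run sequentially where a GPU would run the threads in parallel.

```cpp
#include <vector>

// CPU stand-in for CUDA's built-in index variables (hypothetical type).
struct LaunchIndex {
    int block_idx;   // which block within the grid
    int block_dim;   // number of threads per block
    int thread_idx;  // which thread within the block
};

// "Kernel": the work of one thread, here one element of y = alpha * x + y.
void saxpy_kernel(const LaunchIndex& idx, int n, float alpha,
                  const std::vector<float>& x, std::vector<float>& y) {
    const int i = idx.block_idx * idx.block_dim + idx.thread_idx;
    if (i < n) y[i] = alpha * x[i] + y[i];  // threads past the end do nothing
}

// "Launch": visits every thread of every block in the grid, one after another.
void launch_saxpy(int num_blocks, int threads_per_block, int n, float alpha,
                  const std::vector<float>& x, std::vector<float>& y) {
    for (int b = 0; b < num_blocks; ++b)
        for (int t = 0; t < threads_per_block; ++t)
            saxpy_kernel({b, threads_per_block, t}, n, alpha, x, y);
}
```

Calling `launch_saxpy(2, 4, 5, 2.0f, x, y)` covers 5 elements with a grid of 2 blocks of 4 threads; the 3 surplus threads in the second block fail the `i < n` check and leave `y` untouched.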

34 Example 33
- Serial loop

    void example(int n, float alpha, float *x, float *y) {
      for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
    }
    example(n, 2.0, x, y);

- Parallel with CUDA

    #define THREADS 256
    // __global__ marks a kernel: launched from the CPU, run by each GPU thread
    __global__ void cuda_example(int n, float alpha, float *x, float *y) {
      // global index of this thread within the grid
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] = alpha * x[i] + y[i];  // last block may extend past n
    }
    int nblocks = (n + THREADS - 1) / THREADS;  // round up
    cuda_example<<< nblocks, THREADS >>>(n, 2.0, x, y);
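The expression `(n + THREADS - 1) / THREADS` is an integer ceiling division. A small worked sketch, with the hypothetical helper names `blocks_needed` and `idle_threads`, shows why the kernel needs the `i < n` check whenever `n` is not a multiple of the block size.

```cpp
// Number of blocks needed so that blocks * threads_per_block >= n.
constexpr int blocks_needed(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Threads in the last block whose index is at or beyond n; these are the
// threads that the `if (i < n)` check in the kernel keeps from writing.
constexpr int idle_threads(int n, int threads_per_block) {
    return blocks_needed(n, threads_per_block) * threads_per_block - n;
}

// Worked numbers for a block size of 256.
static_assert(blocks_needed(1000, 256) == 4, "1000 elements need 4 blocks");
static_assert(idle_threads(1000, 256) == 24, "4 * 256 - 1000 = 24");
static_assert(blocks_needed(1024, 256) == 4, "exact multiple: no extra block");
```

Plain integer division `n / THREADS` would give 3 blocks for `n = 1000` and leave the last 232 elements unprocessed.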

35 Memory Levels 34

36 35 multiprocessor architecture

37 Nvidia Titan X 36
- 20 streaming multiprocessors, 3584 cores
- Clock speed 1.4 GHz
- Memory size 12 GB, bandwidth 320 GB/sec
- Retail price $1200
Philipp Koehn, Computer Systems Fundamentals: Multi-Processors and GPU, 7 December 2016
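A back-of-the-envelope calculation relates these figures to the earlier SAXPY example. The core count, clock speed, and bandwidth are taken from the slide. The sketch assumes that each core completes one multiply-add (counted as 2 floating-point operations) per cycle and that caches play no role; neither assumption comes from the slides.

```cpp
// Figures from the slide.
constexpr double kCores = 3584.0;
constexpr double kClockHz = 1.4e9;
constexpr double kBandwidthBytesPerSec = 320e9;

// Assumption (not from the slide): one multiply-add = 2 flops per core per cycle.
constexpr double kFlopsPerCoreCycle = 2.0;

// Peak arithmetic rate under that assumption: about 1.0e13 flops per second.
constexpr double peak_flops() {
    return kCores * kClockHz * kFlopsPerCoreCycle;
}

// SAXPY moves 12 bytes per element (read x[i], read y[i], write y[i], 4 bytes
// each) for 2 flops, so memory bandwidth limits it to about 5.3e10 flops/sec.
constexpr double saxpy_flops_at_full_bandwidth() {
    return kBandwidthBytesPerSec / 12.0 * 2.0;
}
```

Under these assumptions, SAXPY can reach only about half a percent of the peak arithmetic rate, which is why memory speed matters so much for simple element-wise kernels.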

38 Multithreaded Multiprocessor 37

39 Single Instruction, Multiple Thread 38
- Each scalar processor
  - executes same instruction on different data
  - has own register file
- Branch synchronization
  - if threads diverge on conditional branches
  - execute different paths separately
- Shared memory
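What "execute different paths separately" costs can be seen in a small CPU emulation. This is a sketch under assumptions: the function `run_divergent_branch` is hypothetical, one vector element plays the role of one thread, and real hardware tracks active threads with masks rather than with a `std::vector<bool>`.

```cpp
#include <cstddef>
#include <vector>

// Emulates a group of lock-step threads executing
//     if (x[i] < 0) x[i] = -x[i]; else x[i] = 2 * x[i];
// Returns how many passes over the group were needed (1 or 2 for non-empty x).
int run_divergent_branch(std::vector<int>& x) {
    std::vector<bool> takes_if(x.size());
    bool any_if = false;
    bool any_else = false;
    for (std::size_t i = 0; i < x.size(); ++i) {
        takes_if[i] = x[i] < 0;  // every thread evaluates the condition
        if (takes_if[i]) any_if = true; else any_else = true;
    }
    int passes = 0;
    if (any_if) {    // pass for the "if" path: other threads sit idle
        for (std::size_t i = 0; i < x.size(); ++i)
            if (takes_if[i]) x[i] = -x[i];
        ++passes;
    }
    if (any_else) {  // pass for the "else" path: other threads sit idle
        for (std::size_t i = 0; i < x.size(); ++i)
            if (!takes_if[i]) x[i] = 2 * x[i];
        ++passes;
    }
    return passes;   // all threads continue together from here
}
```

For input `{-1, 2, -3, 4}` the threads diverge and two passes are needed; for `{1, 2}` all threads take the same path and one pass suffices.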

40 39 instructions

41 Basics 40
- Design more similar to MIPS than x86
- Various data types, each of different sizes
  - untyped bit arrays (8, 16, 32, 64 bits)
  - unsigned integers (8, 16, 32, 64 bits)
  - signed integers (8, 16, 32, 64 bits)
  - floating point (16, 32, 64 bits)

42 Basic Instructions 41
- Arithmetic instructions operate on registers
    add d, a, b       d = a+b
    mul d, a, b       d = a*b
    mad d, a, b, c    d = a*b+c
    mov d, a          d = a
- Special functions handled by SFU processors
  - square root (sqrt)
  - sine (sin)
  - cosine (cos)
  - binary logarithm (lg2)
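The three-operand instructions can be read as small functions whose return value is the destination register `d`. The C++ stand-ins below are hypothetical illustrations, not an instruction-set emulator; the `sfu_` names only mirror the slide's statement that such functions are handled by the SFUs.

```cpp
#include <cmath>

// Arithmetic instructions: the return value plays the role of register d.
inline float add(float a, float b) { return a + b; }
inline float mul(float a, float b) { return a * b; }
inline float mad(float a, float b, float c) { return a * b + c; }

// Special functions (computed here with the C++ standard library).
inline float sfu_sqrt(float a) { return std::sqrt(a); }
inline float sfu_lg2(float a) { return std::log2(a); }

// One SAXPY element, y = alpha * x + y, maps onto a single mad.
inline float saxpy_element(float alpha, float x, float y) {
    return mad(alpha, x, y);
}
```

This is why multiply-add is worth a dedicated instruction: the inner statement of the earlier SAXPY example needs one `mad` instead of a `mul` followed by an `add`.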

43 Memory Access 42
- Different memory spaces (global, shared, local, const)
- Different data sizes (8, 16, 32, 64 bits)
- Load (ld) and store (st)
- Atomic memory read, write, add, min, max, and, ...
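Atomic operations let many threads update one memory location without losing updates. The sketch below uses C++ `std::atomic` on the CPU as an analogy; it is not GPU code, and the function names `atomic_sum` and `atomic_min` are hypothetical. The minimum is built from a compare-and-swap loop, one common way to express such an operation in standard C++.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Atomic add: each thread adds its value to a shared total without a lock.
int atomic_sum(const std::vector<int>& values) {
    std::atomic<int> total{0};
    std::vector<std::thread> workers;
    for (int v : values)
        workers.emplace_back([&total, v] { total.fetch_add(v); });
    for (auto& w : workers) w.join();
    return total.load();
}

// Atomic min via compare-and-swap: retry until the stored value is no
// larger than the candidate.
void atomic_min(std::atomic<int>& target, int candidate) {
    int current = target.load();
    while (candidate < current &&
           !target.compare_exchange_weak(current, candidate)) {
        // On failure compare_exchange_weak refreshes `current`,
        // and the loop condition is checked again.
    }
}
```

`atomic_sum({1, 2, 3, 4})` returns 10 regardless of the order in which the four threads run.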

44 Control Flow 43
- Branch (conditional on register value = 0)
- Subroutine call: call, ret
- Synchronization: bar.sync forces all threads to synchronize
- Terminate thread: exit
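The effect of a barrier such as bar.sync can be sketched with CPU threads: no thread proceeds past the barrier until every thread has reached it. The `Barrier` class and the `neighbour_squares` function below are hypothetical helpers written with a mutex and condition variable (C++20 also offers `std::barrier`); they only illustrate the idea and say nothing about how the GPU implements it.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Minimal reusable barrier for a fixed number of threads (hypothetical helper).
class Barrier {
public:
    explicit Barrier(std::size_t count)
        : count_(count), waiting_(0), generation_(0) {}

    void sync() {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::size_t gen = generation_;
        if (++waiting_ == count_) {  // last thread to arrive releases the rest
            waiting_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return gen != generation_; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t count_, waiting_, generation_;
};

// Phase 1: thread i writes i*i into shared[i].
// Phase 2: thread i reads the slot of its right neighbour.
// The barrier guarantees that all writes are finished before any read.
std::vector<int> neighbour_squares(std::size_t num_threads) {
    std::vector<int> shared(num_threads), result(num_threads);
    Barrier barrier(num_threads);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < num_threads; ++i) {
        workers.emplace_back([&, i] {
            shared[i] = static_cast<int>(i * i);
            barrier.sync();
            result[i] = shared[(i + 1) % num_threads];
        });
    }
    for (auto& w : workers) w.join();
    return result;
}
```

With 4 threads the result is `{1, 4, 9, 0}`. Without the call to `sync()`, a thread could read its neighbour's slot before the neighbour has written it.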

45 44 memory

46 Overview 45
- Memory has to be very fast
- Graphics card has several DRAM chips outside the GPU (fast access, high bandwidth, lots of pins)
- Cache on chip: L2 cache associated with each DRAM chip
- Virtual memory addresses handled by memory management unit (MMU)

47 Levels 46
- Global: external DRAM (not on chip)
- Shared: per streaming multiprocessor
- Local: in DRAM, but cached on chip
- Constant: read-only, in DRAM

48 Graphics-Related Optimizations 47
- Texture memory for read-only texture maps
- There are also special instructions to deal with textures
