Design of Digital Circuits Lecture 22: GPU Programming. Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zurich Spring May 2018

Size: px
Start display at page:

Download "Design of Digital Circuits Lecture 22: GPU Programming. Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zurich Spring May 2018"

Transcription

1 Desig of Digital Circuits Lecture 22: GPU Programmig Dr. Jua Gómez Lua Prof. Our Mutlu ETH Zurich Sprig May 2018

2 Ageda for Today GPU as a accelerator Program structure Bulk sychroous programmig model Memory hierarchy ad memory maagemet Performace cosideratios Memory access SIMD utilizatio Atomic operatios Data trasfers 2

3 Recommeded Readigs CUDA Programmig Guide Hwu ad Kirk, Programmig Massively Parallel Processors, Third Editio,

4 A Example GPU

5 NVIDIA GeForce GTX 285 NVIDIA-speak: 240 stream processors SIMT executio Geeric speak: 30 cores 8 SIMD fuctioal uits per core Slide credit: Kayvo Fatahalia 5

6 NVIDIA GeForce GTX 285 core (I) 64 KB of storage for thread cotexts (registers) = SIMD fuctioal uit, cotrol shared across 8 uits = multiply-add = multiply = istructio stream decode = executio cotext storage Slide credit: Kayvo Fatahalia 6

7 NVIDIA GeForce GTX 285 core (II) 64 KB of storage for thread cotexts (registers) Groups of 32 threads share istructio stream (each group is a Warp): they execute the same istructio o differet data Up to 32 warps are iterleaved i a FGMT maer Up to 1024 thread cotexts ca be stored Slide credit: Kayvo Fatahalia 7

8 NVIDIA GeForce GTX 285 Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex 30 cores o the GTX 285: 30,720 threads Slide credit: Kayvo Fatahalia 8

9 Evolutio of NVIDIA GPUs #Stream Processors GFLOPS 0 GTX 285 (2009) GTX 480 (2010) GTX 780 (2013) GTX 980 (2014) P100 (2016) V100 (2017) 0 Stream Processors GFLOPS 9

10 NVIDIA V100 NVIDIA-speak: 5120 stream processors SIMT executio Geeric speak: 80 cores 64 SIMD fuctioal uits per core Specialized Fuctioal Uits for Machie Learig (tesor cores i NVIDIA-speak) 10

11 NVIDIA V100 Block Diagram 80 cores o the V

12 NVIDIA V100 Core 15.7 TFLOPS Sigle Precisio 7.8 TFLOPS Double Precisio 125 TFLOPS for Deep Learig (Tesor cores ) 12

13 GPU Programmig

14 Recall: Vector Processor Disadvatages -- Works (oly) if parallelism is regular (data/simd parallelism) ++ Vector operatios -- Very iefficiet if parallelism is irregular -- How about searchig for a key i a liked list? Fisher, Very Log Istructio Word architectures ad the ELI-512, ISCA

15 Geeral Purpose Processig o GPU Easier programmig of SIMD processors with MD GPUs have democratized High Performace Computig (HPC) Great FLOPS/$, massively parallel chip o a commodity PC May workloads exhibit iheret parallelism Matrices Image processig However, this is ot for free New programmig model Algorithms eed to be re-implemeted ad rethought Still some bottleecks CPU-GPU data trasfers (PCIe, NVLINK) DRAM memory badwidth (GDDR5, HBM2) Data layout 15

16 CPU vs. GPU Differet desig philosophies CPU: A few out-of-order cores GPU: May i-order FGMT cores CPU GPU Cotrol ALU ALU ALU ALU Cache DRAM DRAM Slide credit: Hwu & Kirk 16

17 GPU Computig Computatio is offloaded to the GPU Three steps CPU-GPU data trasfer (1) GPU kerel executio (2) GPU-CPU data trasfer (3) CPU cores CPU memory Matrix 1 GPU memory Matrix 2 GPU cores 3 17

18 Traditioal Program Structure CPU threads ad GPU kerels Seuetial or modestly parallel sectios o CPU Massively parallel sectios o GPU Serial Code (host) Parallel Kerel (device) KerelA<<< Blk, Thr >>>(args);... Serial Code (host) Parallel Kerel (device) KerelB<<< Blk, Thr >>>(args);... Slide credit: Hwu & Kirk 18

19 Recall: MD Sigle procedure/program, multiple data This is a programmig model rather tha computer orgaizatio Each processig elemet executes the same procedure, except o differet data elemets Procedures ca sychroize at certai poits i program, e.g. barriers Essetially, multiple istructio streams execute the same program Each program/procedure 1) works o differet data, 2) ca execute a differet cotrol-flow path, at ru-time May scietific applicatios are programmed this way ad ru o MIMD hardware (multiprocessors) Moder GPUs programmed i a similar way o a SIMD hardware 19

20 CUDA/OpeCL Programmig Model SIMT or MD Bulk sychroous programmig Global (coarse-grai) sychroizatio betwee kerels The host (typically CPU) allocates memory, copies data, ad lauches kerels The device (typically GPU) executes kerels Grid (NDRage) Block (work-group) Withi a block, shared memory, ad sychroizatio Thread (work-item) 20

21 Trasparet Scalability Hardware is free to schedule thread blocks Device Kerel grid Block 0 Block 1 Device Block 2 Block 3 time Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 time Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 Block 4 Block 5 Block 6 Block 7 Each block ca execute i ay order relative to other blocks. Slide credit: Hwu & Kirk 21

22 Memory Hierarchy Grid (Device) Block (0, 0) Block (1, 0) Shared memory Shared memory Registers Registers Registers Registers Thread (0, 0) Thread (1, 0) Thread (0, 0) Thread (1, 0) Global / Texture & Surface memory Host Costat memory 22

23 Traditioal Program Structure i CUDA Fuctio prototypes float serialfuctio(); global void kerel(); mai() 1) Allocate memory space o the device cudamalloc(&d_i, bytes); 2) Trasfer data from host to device cudamemcpy(d_i, h_i, ); 3) Executio cofiguratio setup: #blocks ad #threads 4) Kerel call kerel<<<executio cofiguratio>>>(args); 5) Trasfer results from device to host cudamemcpy(h_out, d_out, ); repeat as eeded Kerel global void kerel(type args,) Automatic variables trasparetly assiged to registers Shared memory: shared Itra-block sychroizatio: sycthreads(); Slide credit: Hwu & Kirk 23

24 CUDA Programmig Laguage Memory allocatio cudamalloc((void**)&d_i, #bytes); Memory copy cudamemcpy(d_i, h_i, #bytes, cudamemcpyhosttodevice); Kerel lauch kerel<<< #blocks, #threads >>>(args); Memory deallocatio cudafree(d_i); Explicit sychroizatio cudadevicesychroize(); 24

25 Idexig ad Memory Access Images are 2D data structures height x width Image[j][i], where 0 j < height, ad 0 i < width Image[0][1] Image[1][2]

26 Image Layout i Memory Row-major layout Image[j][i] = Image[j x width + i] Stride = width Image[0][1] = Image[0 x 8 + 1] Image[1][2] = Image[1 x 8 + 2] 26

27 Idexig ad Memory Access: 1D Grid Oe GPU thread per pixel Grid of Blocks of Threads griddim.x, blockdim.x blockidx.x, threadidx.x blockidx.x Block 0 Thread 0 Thread 1 Thread 2 Thread 3 threadidx.x Block 0 6 * = 25 blockidx.x * blockdim.x + threadidx.x 27

28 Idexig ad Memory Access: 2D Grid 2D blocks griddim.x, griddim.y Block (0, 0) threadidx.x = 1 threadidx.y = 0 Row = blockidx.y * blockdim.y + threadidx.y Col = blockidx.x * blockdim.x + threadidx.x blockidx.x = 2 blockidx.y = 1 Row = 1 * = 3 Col = 0 * = 1 Image[3][1] = Image[3 * 8 + 1] 28

29 Brief Review of GPU Architecture (I) Streamig Processor Array Tesla architecture (G80/GT200) Streamig Processor Array TPC TPC TPC.. TPC TPC SM Istructio Cache SM SM Istructio Fetch/Dispatch Register File Texture Uit SFU Texture L1 Cache SFU Shared Memory Costat Cache 29

30 Brief Review of GPU Architecture (II) Streamig Multiprocessors (SM) Streamig Processors () Streamig Multiprocessor Istructio Cache Warp Scheduler Warp Scheduler Dispatch Uit Dispatch Uit Register File Blocks are divided ito warps SIMD uit (32 threads) LD/ST LD/ST LD/ST LD/ST SFU Block 0 s warps t0 t1 t2 t31 Block 1 s warps t0 t1 t2 t31 Block 2 s warps t0 t1 t2 t31 LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST SFU SFU SFU Shared Memory / L1 Cache Costat Cache NVIDIA Fermi architecture 30

31 Brief Review of GPU Architecture (III) Streamig Multiprocessors (SM) or Compute Uits (CU) SIMD pipelies Streamig Processors () or CUDA cores Vector laes Number of SMs x s across geeratios Tesla (2007): 30 x 8 Fermi (2010): 16 x 32 Kepler (2012): 15 x 192 Maxwell (2014): 24 x 128 Pascal (2016): 56 x 64 Volta (2017): 80 x 64 31

32 Performace Cosideratios

33 Performace Cosideratios Mai bottleecks Global memory access CPU-GPU data trasfers Memory access Latecy hidig Occupacy Memory coalescig Data reuse Shared memory usage SIMD (Warp) Utilizatio: Divergece Atomic operatios: Serializatio Data trasfers betwee CPU ad GPU Overlap of commuicatio ad computatio 33

34 Memory Access

35 Latecy Hidig FGMT ca hide log latecy operatios (e.g., memory accesses) Occupacy: ratio of active warps 4 active warps 2 active warps Warp 0 Warp 0 Istructio 3 Istructio 3 Warp 1 Istructio 2 Warp 1 Istructio 2 time Warp 2 Warp 3 Istructio 1 Istructio 1 Warp 0 Istructio 4 (Log latecy) time Warp 1 Istructio 3 Warp 0 Istructio 4 (Log latecy) Warp 1 Istructio 3 Warp 0 Warp 0 Istructio 5 Istructio 5 35

36 Occupacy SM resources (typical values) Maximum umber of warps per SM (64) Maximum umber of blocks per SM (32) Register usage (256KB) Shared memory usage (64KB) Occupacy calculatio Number of threads per block (defied by the programmer) Registers per thread (kow at compile time) Shared memory per block (defied by the programmer) 36

37 Memory Coalescig Whe accessig global memory, we wat to make sure that cocurret threads access earby memory locatios Peak badwidth utilizatio occurs whe all threads i a warp access oe cache lie Not coalesced Coalesced Md Nd Thread 1 Thread 2 WIDTH WIDTH Slide credit: Hwu & Kirk 37

38 Ucoalesced Memory Accesses Access directio i Kerel code M 0,0 M 1,0 M 2,0 M 3,0 M 0,1 M 1,1 M 2,1 M 3,1 M 0,2 M 1,2 M 2,2 M 3,2 M 0,3 M 1,3 M 2,3 M 3,3 Time Period 2 T 1 T 2 T 3 T 4 Time Period 1 T 1 T 2 T 3 T 4 M M 0,0 M 1,0 M 2,0 M 3,0 M 0,1 M 1,1 M 2,1 M 3,1 M 0,2 M 1,2 M 2,2 M 3,2 M 0,3 M 1,3 M 2,3 M 3,3 Slide credit: Hwu & Kirk 38

39 Coalesced Memory Accesses Access directio i Kerel code M 0,0 M 1,0 M 2,0 M 3,0 M 0,1 M 1,1 M 2,1 M 3,1 M 0,2 M 1,2 M 2,2 M 3,2 M 0,3 M 1,3 M 2,3 M 3,3 Time Period 1 T 1 T 2 T 3 T 4 Time Period 2 T 1 T 2 T 3 T 4 M M 0,0 M 1,0 M 2,0 M 3,0 M 1,1 M 0,1 M 2,1 M 3,1 M 0,2 M 1,2 M 2,2 M 3,2 M 0,3 M 1,3 M 2,3 M 3,3 Slide credit: Hwu & Kirk 39

40 AoS vs. SoA Array of Structures vs. Structure of Arrays Structure of Arrays (SoA) struct foo{ float a[8]; float b[8]; float c[8]; it d[8]; } A; Array of Structures (AoS) struct foo{ float a; float b; float c; it d; } A[8]; 40

41 CPUs Prefer AoS, GPUs Prefer SoA Liear ad strided accesses GPU CPU 12.0# 5.0# Throughput#(GB/s)# 11.0# 10.0# 9.0# 8.0# 7.0# 6.0# 5.0# 4.0# 3.0# 2.0# 1.0# GPU# Throughput#(GB/s)# 4.5# 4.0# 3.5# 3.0# 2.5# 2.0# 1.5# 1.0# 0.5# 1CPU# 2CPU# 4CPU# 0.0# 1# 2# 4# 8# 16# 32# 64# 128# 256# 512# 1024# 0.0# 1# 2# 4# 8# 16# 32# 64# 128# 256# 512# 1024# Stride#(Structure#size)# Stride#(Structure#size)# AMD Kaveri A K Sug+, DL: A data layout trasformatio system for heterogeeous computig, INPAR

42 Data Reuse Same memory locatios accessed by eighborig threads for (it i = 0; i < 3; i++){ for (it j = 0; j < 3; j++){ sum += gauss[i][j] * Image[(i+row-1)*width + (j+col-1)]; } } 42

43 Data Reuse: Tilig To take advatage of data reuse, we divide the iput ito tiles that ca be loaded ito shared memory shared it l_data[(l_size+2)*(l_size+2)]; Load tile ito shared memory sycthreads(); for (it i = 0; i < 3; i++){ for (it j = 0; j < 3; j++){ sum += gauss[i][j] * l_data[(i+l_row-1)*(l_size+2)+j+l_col-1]; } } 43

44 Shared Memory Shared memory is a iterleaved (baked) memory Each bak ca service oe address per cycle Typically, 32 baks i NVIDIA GPUs Successive 32-bit words are assiged to successive baks Bak = Address % 32 Bak coflicts are oly possible withi a warp No bak coflicts betwee differet warps 44

45 Shared Memory Bak Coflicts (I) Bak coflict free Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Thread 6 Thread 7 Bak 0 Bak 1 Bak 2 Bak 3 Bak 4 Bak 5 Bak 6 Bak 7 Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Thread 6 Thread 7 Bak 0 Bak 1 Bak 2 Bak 3 Bak 4 Bak 5 Bak 6 Bak 7 Thread 15 Bak 15 Thread 15 Bak 15 Liear addressig: stride = 1 Radom addressig 1:1 Slide credit: Hwu & Kirk 45

46 Shared Memory Bak Coflicts (II) N-way bak coflicts Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 8 Thread 9 Thread 10 Thread 11 Bak 0 Bak 1 Bak 2 Bak 3 Bak 4 Bak 5 Bak 6 Bak 7 Bak 15 Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Thread 6 Thread 7 Thread 15 x8 x8 Bak 0 Bak 1 Bak 2 Bak 7 Bak 8 Bak 9 Bak 15 2-way bak coflict: stride = 2 8-way bak coflict: stride = 8 Slide credit: Hwu & Kirk 46

47 Reducig Shared Memory Bak Coflicts Bak coflicts are oly possible withi a warp No bak coflicts betwee differet warps If strided accesses are eeded, some optimizatio techiues ca help Paddig Radomized mappig Rau, Pseudo-radomly iterleaved memory, ISCA 1991 Hash fuctios V.d.Braak+, Cofigurable XOR Hash Fuctios for Baked Scratchpad Memories i GPUs, IEEE TC,

48 SIMD Utilizatio

49 SIMD Utilizatio Itra-warp divergece Compute(threadIdx.x); if (threadidx.x % 2 == 0){ Do_this(threadIdx.x); } else{ Do_that(threadIdx.x); } Compute If Else 49

50 Icreasig SIMD Utilizatio Divergece-free executio Compute(threadIdx.x); if (threadidx.x < 32){ Do_this(threadIdx.x * 2); } else{ Do_that((threadIdx.x%32)*2+1); } Compute If Else 50

51 Vector Reductio: Naïve Mappig (I) Thread 0 Thread 2 Thread 4 Thread 6 Thread 8 Thread iteratios Slide credit: Hwu & Kirk 51

52 Vector Reductio: Naïve Mappig (II) Program with low SIMD utilizatio shared float partialsum[] usiged it t = threadidx.x; for (it stride = 1; stride < blockdim.x; stride *= 2) { } sycthreads(); if (t % (2*stride) == 0) partialsum[t] += partialsum[t + stride]; 52

53 Divergece-Free Mappig (I) All active threads belog to the same warp Thread 0 Thread 1 Thread 2 Thread 14 Thread iteratios 3 Slide credit: Hwu & Kirk 53

54 Divergece-Free Mappig (II) Program with high SIMD utilizatio shared float partialsum[] usiged it t = threadidx.x; for (it stride = blockdim.x; stride > 1; stride >> 1){ } sycthreads(); if (t < stride) partialsum[t] += partialsum[t + stride]; 54

55 Atomic Operatios

56 Shared Memory Atomic Operatios Atomic Operatios are eeded whe threads might update the same memory locatios at the same time CUDA: it atomicadd(it*, it); PTX: atom.shared.add.u32 %r25, [%rd14], %r24; SASS: Tesla, Fermi, Kepler /*00a0*/ LDSLK P0, R9, [R8]; IADD R10, R9, R7; STSCUL P1, [R8], R10; BRA 0xa0; Maxwell, Pascal, Volta /*01f8*/ ATOMS.ADD RZ, [R7], R11; Native atomic operatios for 32-bit iteger, ad 32-bit ad 64-bit atomiccas 56

57 Atomic Coflicts We defie the itra-warp coflict degree as the umber of threads i a warp that update the same memory positio The coflict degree ca be betwee 1 ad 32 th0 th1 th0 th1 th1 t coflict th0 th1 th0 0 2 t base 2 2 t base No atomic coflict = cocurret updates Shared memory Atomic coflict = serialized updates Shared memory 57

58 Histogram Calculatio Histograms cout the umber of data istaces i disjoit categories (bis) for (each pixel i i image I){ Pixel = I[i] Pixel = Computatio(Pixel) Histogram[Pixel ]++ } // Read pixel // Optioal computatio // Vote i histogram bi Iput data data[] data[+1] data[+2]... data[2-1] data[0] data[1] data[2]... data[-1] Thread 0 Thread 1 Thread 2 Thread Histogram... B-1 Atomic additios 58

59 Histogram Calculatio of Natural Images Freuet coflicts i atural images

60 Optimizig Histogram Calculatio Privatizatio: Per-block sub-histograms i shared memory Block 0 Block 1 Block 2 Block 3 Block 0 Block 1 Block 2 Block 3 Block 0 Block 1 Block 2 Block 3 Block 0 Block 1 Block 2 Block 3 Shared memory Block 0 s sub-histo Block 1 s sub-histo Block 2 s sub-histo Block 3 s sub-histo b0 b1 b2 b3 b0 b1 b2 b3 b0 b1 b2 b3 b0 b1 b2 b3 b0 b1 b2 b3 Global memory Fial histogram Gomez-Lua+, Performace Modelig of Atomic Additios o GPU Scratchpad Memory, IEEE TPDS,

61 Data Trasfers betwee CPU ad GPU

62 Data Trasfers Sychroous ad asychroous trasfers Streams (Commad ueues) Seuece of operatios that are performed i order CPU-GPU data trasfer Kerel executio D iput data istaces, B blocks GPU-CPU data trasfer Default stream t T Copy data t E Execute 62

63 Asychroous Trasfers Computatio divided ito Streams D iput data istaces, B blocks Streams D/Streams data istaces B/Streams blocks t T Copy data t E Execute Copy data Execute Estimates t E + t T Streams t E >= t T (domiat kerel) t T + t E Streams t T > t E (domiat trasfers) 63

64 Overlap of Commuicatio ad Computatio Applicatios with idepedet computatio o differet data istaces ca beefit from asychroous trasfers For istace, video processig Nostreamed executio A seuece of 6 frames is trasferred to device 6 x b blocks compute o the seuece of frames Streamed executio A chuk of 2 frames is trasferred to device 2 x b blocks compute o the chuk, while the secod chuk is beig trasferred Executio time saved thaks to streams Gomez-Lua+, Performace models for asychroous data trasfers o cosumer Graphics Processig Uits, JPDC,

65 Summary GPU as a accelerator Program structure Bulk sychroous programmig model Memory hierarchy ad memory maagemet Performace cosideratios Memory access Latecy hidig: occupacy (TLP) Memory coalescig Data reuse: shared memory SIMD utilizatio Atomic operatios Data trasfers 65

66 PUMPS+AI Summer School, Barceloa, July

67 Desig of Digital Circuits Lecture 22: GPU Programmig Dr. Jua Gómez Lua Prof. Our Mutlu ETH Zurich Sprig May 2018

Design of Digital Circuits Lecture 21: SIMD Processors II and Graphics Processing Units

Design of Digital Circuits Lecture 21: SIMD Processors II and Graphics Processing Units Desig of Digital Circuits Lecture 21: SIMD Processors II ad Graphics Processig Uits Dr. Jua Gómez Lua Prof. Our Mutlu ETH Zurich Sprig 2018 17 May 2018 New Course: Bachelor s Semiar i Comp Arch Fall 2018

More information

Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Computer Graphics Hardware A Overview Graphics System Moitor Iput devices CPU/Memory GPU Raster Graphics System Raster: A array of picture elemets Based o raster-sca TV techology The scree (ad a picture)

More information

Design of Digital Circuits Lecture 20: SIMD Processors. Prof. Onur Mutlu ETH Zurich Spring May 2018

Design of Digital Circuits Lecture 20: SIMD Processors. Prof. Onur Mutlu ETH Zurich Spring May 2018 Desig of Digital Circuits Lecture 20: SIMD Processors Prof. Our Mutlu ETH Zurich Sprig 2018 11 May 2018 New Course: Bachelor s Semiar i Comp Arch Fall 2018 2 credit uits Rigorous semiar o fudametal ad

More information

Computer Architecture Lecture 8: SIMD Processors and GPUs. Prof. Onur Mutlu ETH Zürich Fall October 2017

Computer Architecture Lecture 8: SIMD Processors and GPUs. Prof. Onur Mutlu ETH Zürich Fall October 2017 Computer Architecture Lecture 8: SIMD Processors ad GPUs Prof. Our Mutlu ETH Zürich Fall 2017 18 October 2017 Ageda for Today & Next Few Lectures SIMD Processors GPUs Itroductio to GPU Programmig Digitaltechik

More information

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Virtual Memory Prof. Yajig Li Uiversity of Chicago A System with Physical Memory Oly Examples: most Cray machies early PCs Memory early all embedded systems

More information

Multiprocessors. HPC Prof. Robert van Engelen

Multiprocessors. HPC Prof. Robert van Engelen Multiprocessors Prof. Robert va Egele Overview The PMS model Shared memory multiprocessors Basic shared memory systems SMP, Multicore, ad COMA Distributed memory multicomputers MPP systems Network topologies

More information

Appendix D. Controller Implementation

Appendix D. Controller Implementation COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Appedix D Cotroller Implemetatio Cotroller Implemetatios Combiatioal logic (sigle-cycle); Fiite state machie (multi-cycle, pipelied);

More information

CMSC Computer Architecture Lecture 11: More Caches. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 11: More Caches. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 11: More Caches Prof. Yajig Li Uiversity of Chicago Lecture Outlie Caches 2 Review Memory hierarchy Cache basics Locality priciples Spatial ad temporal How to access

More information

CSC 220: Computer Organization Unit 11 Basic Computer Organization and Design

CSC 220: Computer Organization Unit 11 Basic Computer Organization and Design College of Computer ad Iformatio Scieces Departmet of Computer Sciece CSC 220: Computer Orgaizatio Uit 11 Basic Computer Orgaizatio ad Desig 1 For the rest of the semester, we ll focus o computer architecture:

More information

Instruction and Data Streams

Instruction and Data Streams Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Data Parallelism 1 (vector & SIMD extesios) (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Istructio ad

More information

Multi-Threading. Hyper-, Multi-, and Simultaneous Thread Execution

Multi-Threading. Hyper-, Multi-, and Simultaneous Thread Execution Multi-Threadig Hyper-, Multi-, ad Simultaeous Thread Executio 1 Performace To Date Icreasig processor performace Pipeliig. Brach predictio. Super-scalar executio. Out-of-order executio. Caches. Hyper-Threadig

More information

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 10: Caches Prof. Yajig Li Uiversity of Chicago Midterm Recap Overview ad fudametal cocepts ISA Uarch Datapath, cotrol Sigle cycle, multi cycle Pipeliig Basic idea,

More information

Chapter 4 Threads. Operating Systems: Internals and Design Principles. Ninth Edition By William Stallings

Chapter 4 Threads. Operating Systems: Internals and Design Principles. Ninth Edition By William Stallings Operatig Systems: Iterals ad Desig Priciples Chapter 4 Threads Nith Editio By William Stalligs Processes ad Threads Resource Owership Process icludes a virtual address space to hold the process image The

More information

Bank-interleaved cache or memory indexing does not require euclidean division

Bank-interleaved cache or memory indexing does not require euclidean division Bak-iterleaved cache or memory idexig does ot require euclidea divisio Adré Sezec To cite this versio: Adré Sezec. Bak-iterleaved cache or memory idexig does ot require euclidea divisio. 11th Aual Workshop

More information

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D

More information

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter

More information

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017 Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures

More information

Chapter 4 The Datapath

Chapter 4 The Datapath The Ageda Chapter 4 The Datapath Based o slides McGraw-Hill Additioal material 24/25/26 Lewis/Marti Additioal material 28 Roth Additioal material 2 Taylor Additioal material 2 Farmer Tae the elemets that

More information

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1 Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei

More information

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34 1 / 34 GPU Programming Lecture 2: CUDA C Basics Miaoqing Huang University of Arkansas 2 / 34 Outline Evolvements of NVIDIA GPU CUDA Basic Detailed Steps Device Memories and Data Transfer Kernel Functions

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

Lecture 1: Introduction and Strassen s Algorithm

Lecture 1: Introduction and Strassen s Algorithm 5-750: Graduate Algorithms Jauary 7, 08 Lecture : Itroductio ad Strasse s Algorithm Lecturer: Gary Miller Scribe: Robert Parker Itroductio Machie models I this class, we will primarily use the Radom Access

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis Itro to Algorithm Aalysis Aalysis Metrics Slides. Table of Cotets. Aalysis Metrics 3. Exact Aalysis Rules 4. Simple Summatio 5. Summatio Formulas 6. Order of Magitude 7. Big-O otatio 8. Big-O Theorems

More information

Introduction to GPU programming. Introduction to GPU programming p. 1/17

Introduction to GPU programming. Introduction to GPU programming p. 1/17 Introduction to GPU programming Introduction to GPU programming p. 1/17 Introduction to GPU programming p. 2/17 Overview GPUs & computing Principles of CUDA programming One good reference: David B. Kirk

More information

SCI Reflective Memory

SCI Reflective Memory Embedded SCI Solutios SCI Reflective Memory (Experimetal) Atle Vesterkjær Dolphi Itercoect Solutios AS Olaf Helsets vei 6, N-0621 Oslo, Norway Phoe: (47) 23 16 71 42 Fax: (47) 23 16 71 80 Mail: atleve@dolphiics.o

More information

Course Site: Copyright 2012, Elsevier Inc. All rights reserved.

Course Site:   Copyright 2012, Elsevier Inc. All rights reserved. Course Site: http://cc.sjtu.edu.c/g2s/site/aca.html 1 Computer Architecture A Quatitative Approach, Fifth Editio Chapter 2 Memory Hierarchy Desig 2 Outlie Memory Hierarchy Cache Desig Basic Cache Optimizatios

More information

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school

More information

Josef Pelikán, Jan Horáček CGG MFF UK Praha

Josef Pelikán, Jan Horáček CGG MFF UK Praha GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

Elementary Educational Computer

Elementary Educational Computer Chapter 5 Elemetary Educatioal Computer. Geeral structure of the Elemetary Educatioal Computer (EEC) The EEC coforms to the 5 uits structure defied by vo Neuma's model (.) All uits are preseted i a simplified

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5 Morga Kaufma Publishers 26 February, 28 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Set-Associative Cache Architecture Performace Summary Whe CPU performace icreases:

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter The Processor Part A path Desig Itroductio CPU performace factors Istructio cout Determied by ISA ad compiler. CPI ad

More information

Transforming Irregular Algorithms for Heterogeneous Computing - Case Studies in Bioinformatics

Transforming Irregular Algorithms for Heterogeneous Computing - Case Studies in Bioinformatics Trasformig Irregular lgorithms for Heterogeeous omputig - ase Studies i ioiformatics Jig Zhag dvisor: Dr. Wu Feg ollaborator: Hao Wag syergy.cs.vt.edu Irregular lgorithms haracterized by Operate o irregular

More information

By: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,

More information

CS 179 Lecture 4. GPU Compute Architecture

CS 179 Lecture 4. GPU Compute Architecture CS 179 Lecture 4 GPU Compute Architecture 1 This is my first lecture ever Tell me if I m not speaking loud enough, going too fast/slow, etc. Also feel free to give me lecture feedback over email or at

More information

CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW. Prof. Yanjing Li University of Chicago

CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW. Prof. Yanjing Li University of Chicago CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW Prof. Yajig Li Uiversity of Chicago Admiistrative Stuff Lab2 due toight Exam I: covers lectures 1-9 Ope book, ope otes, close device

More information

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 1 Itroductio to Computers ad C++ Programmig Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 1.1 Computer Systems 1.2 Programmig ad Problem Solvig 1.3 Itroductio to C++ 1.4 Testig

More information

Performance optimization with CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Performance optimization with CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Performance optimization with CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 18 Strategies for Query Processig Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio DBMS techiques to process a query Scaer idetifies

More information

GPUMP: a Multiple-Precision Integer Library for GPUs

GPUMP: a Multiple-Precision Integer Library for GPUs GPUMP: a Multiple-Precisio Iteger Library for GPUs Kaiyog Zhao ad Xiaowe Chu Departmet of Computer Sciece, Hog Kog Baptist Uiversity Hog Kog, P. R. Chia Email: {kyzhao, chxw}@comp.hkbu.edu.hk Abstract

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5.

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5. Morga Kaufma Publishers 26 February, 208 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Virtual Memory Review: The Memory Hierarchy Take advatage of the priciple

More information

GPU programming basics. Prof. Marco Bertini

GPU programming basics. Prof. Marco Bertini GPU programming basics Prof. Marco Bertini CUDA: atomic operations, privatization, algorithms Atomic operations The basics atomic operation in hardware is something like a read-modify-write operation performed

More information

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Pseudocode ( 1.1) High-level descriptio of a algorithm More structured

More information

CS2410 Computer Architecture. Flynn s Taxonomy

CS2410 Computer Architecture. Flynn s Taxonomy CS2410 Computer Architecture Dept. of Computer Sciece Uiversity of Pittsburgh http://www.cs.pitt.edu/~melhem/courses/2410p/idex.html 1 Fly s Taxoomy SISD Sigle istructio stream Sigle data stream (SIMD)

More information

Simple and Fast Parallel Algorithms for the Voronoi Maps and the Euclidean Distance Map, with GPU implementations

Simple and Fast Parallel Algorithms for the Voronoi Maps and the Euclidean Distance Map, with GPU implementations Simple ad Fast Parallel Algorithms for the Vorooi Maps ad the Euclidea Distace Map, with GPU implemetatios Takumi Hoda, Shiosuke Yamamoto, Hiroaki Hoda, Koji Nakao, Yasuaki Ito Departmet of Iformatio Egieerig

More information

Programming with Shared Memory PART II. HPC Spring 2017 Prof. Robert van Engelen

Programming with Shared Memory PART II. HPC Spring 2017 Prof. Robert van Engelen Programmig with Shared Memory PART II HPC Sprig 2017 Prof. Robert va Egele Overview Sequetial cosistecy Parallel programmig costructs Depedece aalysis OpeMP Autoparallelizatio Further readig HPC Sprig

More information

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects. The

More information

VOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017

VOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017 VOLTA: PROGRAMMABILITY AND PERFORMANCE Jack Choquette NVIDIA Hot Chips 2017 1 TESLA V100 21B transistors 815 mm 2 80 SM 5120 CUDA Cores 640 Tensor Cores 16 GB HBM2 900 GB/s HBM2 300 GB/s NVLink *full GV100

More information

UH-MEM: Utility-Based Hybrid Memory Management. Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, Onur Mutlu

UH-MEM: Utility-Based Hybrid Memory Management. Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, Onur Mutlu UH-MEM: Utility-Based Hybrid Memory Maagemet Yag Li, Saugata Ghose, Jogmoo Choi, Ji Su, Hui Wag, Our Mutlu 1 Executive Summary DRAM faces sigificat techology scalig difficulties Emergig memory techologies

More information

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time ( 3.1) Aalysis of Algorithms Iput Algorithm Output A algorithm is a step- by- step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects.

More information

Analysis of Algorithms

Analysis of Algorithms Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Ruig Time Most algorithms trasform iput objects ito output objects. The

More information

SPIRAL DSP Transform Compiler:

SPIRAL DSP Transform Compiler: SPIRAL DSP Trasform Compiler: Applicatio Specific Hardware Sythesis Peter A. Milder (peter.milder@stoybroo.edu) Fraz Frachetti, James C. Hoe, ad Marus Pueschel Departmet of ECE Caregie Mello Uiversity

More information

The University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition.

The University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition. Computer Architecture A Quatitative Approach, Sixth Editio Chapter 2 Memory Hierarchy Desig 1 Itroductio Programmers wat ulimited amouts of memory with low latecy Fast memory techology is more expesive

More information

Lecture 3: Introduction to CUDA

Lecture 3: Introduction to CUDA CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Introduction to CUDA Some slides here are adopted from: NVIDIA teaching kit Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

More information

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 17 GPUs

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 17 GPUs CS 152 Computer Architecture ad Egieerig CS252 Graduate Computer Architecture Lecture 17 GPUs Krste Asaovic Electrical Egieerig ad Computer Scieces Uiversity of Califoria at Berkeley http://www.eecs.berkeley.edu/~krste

More information

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1 Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Memory Hierarchy (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Itroductio Programmers wat ulimited amouts

More information

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU

More information

Introduction to Parallel Computing with CUDA. Oswald Haan

Introduction to Parallel Computing with CUDA. Oswald Haan Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries

More information

Computer Architecture. Microcomputer Architecture and Interfacing Colorado School of Mines Professor William Hoff

Computer Architecture. Microcomputer Architecture and Interfacing Colorado School of Mines Professor William Hoff Computer rchitecture Microcomputer rchitecture ad Iterfacig Colorado School of Mies Professor William Hoff Computer Hardware Orgaizatio Processor Performs all computatios; coordiates data trasfer Iput

More information

CSE 417: Algorithms and Computational Complexity

CSE 417: Algorithms and Computational Complexity Time CSE 47: Algorithms ad Computatioal Readig assigmet Read Chapter of The ALGORITHM Desig Maual Aalysis & Sortig Autum 00 Paul Beame aalysis Problem size Worst-case complexity: max # steps algorithm

More information

Isn t It Time You Got Faster, Quicker?

Isn t It Time You Got Faster, Quicker? Is t It Time You Got Faster, Quicker? AltiVec Techology At-a-Glace OVERVIEW Motorola s advaced AltiVec techology is desiged to eable host processors compatible with the PowerPC istructio-set architecture

More information

CUDA Parallelism Model

CUDA Parallelism Model GPU Teaching Kit Accelerated Computing CUDA Parallelism Model Kernel-Based SPMD Parallel Programming Multidimensional Kernel Configuration Color-to-Grayscale Image Processing Example Image Blur Example

More information

Cluster Computing Spring 2004 Paul A. Farrell

Cluster Computing Spring 2004 Paul A. Farrell Cluster Computig Sprig 004 3/18/004 Parallel Programmig Overview Task Parallelism OS support for task parallelism Parameter Studies Domai Decompositio Sequece Matchig Work Assigmet Static schedulig Divide

More information

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming Lecture Notes 6 Itroductio to algorithm aalysis CSS 501 Data Structures ad Object-Orieted Programmig Readig for this lecture: Carrao, Chapter 10 To be covered i this lecture: Itroductio to algorithm aalysis

More information

UNIVERSITY OF MORATUWA

UNIVERSITY OF MORATUWA UNIVERSITY OF MORATUWA FACULTY OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING B.Sc. Egieerig 2014 Itake Semester 2 Examiatio CS2052 COMPUTER ARCHITECTURE Time allowed: 2 Hours Jauary 2016

More information

2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions

2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions Administrative L6: Memory Hierarchy Optimization IV, Bandwidth Optimization Next assignment available Goals of assignment: simple memory hierarchy management block-thread decomposition tradeoff Due Tuesday,

More information

CUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)

CUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5) CUDA programming model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA programming model 1/23 Outline 1 CUDA qualifiers 2 CUDA Kernel Thread hierarchy Kernel, configuration

More information

COMP Parallel Computing. PRAM (1): The PRAM model and complexity measures

COMP Parallel Computing. PRAM (1): The PRAM model and complexity measures COMP 633 - Parallel Computig Lecture 2 August 24, 2017 : The PRAM model ad complexity measures 1 First class summary This course is about parallel computig to achieve high-er performace o idividual problems

More information

Data Structures and Algorithms. Analysis of Algorithms

Data Structures and Algorithms. Analysis of Algorithms Data Structures ad Algorithms Aalysis of Algorithms Outlie Ruig time Pseudo-code Big-oh otatio Big-theta otatio Big-omega otatio Asymptotic algorithm aalysis Aalysis of Algorithms Iput Algorithm Output

More information

Page 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory!

Page 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory! Why Care About the Memory Hierarchy? Memory Virtual Memory -DRAM Memory Gap (latecy) Reasos: Multi process systems (abstractio & memory protectio) Solutio: Tables (holdig per process traslatios) Fast traslatio

More information

Parallel Computing. Lecture 19: CUDA - I

Parallel Computing. Lecture 19: CUDA - I CSCI-UA.0480-003 Parallel Computing Lecture 19: CUDA - I Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com GPU w/ local DRAM (device) Behind CUDA CPU (host) Source: http://hothardware.com/reviews/intel-core-i5-and-i7-processors-and-p55-chipset/?page=4

More information

Design of Digital Circuits Lecture 17: Out-of-Order, DataFlow, Superscalar Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018

Design of Digital Circuits Lecture 17: Out-of-Order, DataFlow, Superscalar Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018 Desig of Digital Circuits Lecture 17: Out-of-Order, DataFlow, Superscalar Executio Prof. Our Mutlu ETH Zurich Sprig 2018 27 April 2018 Ageda for Today & Next Few Lectures Sigle-cycle Microarchitectures

More information

COSC 6374 Parallel Computations Introduction to CUDA

COSC 6374 Parallel Computations Introduction to CUDA COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo

More information

GPU CUDA Programming

GPU CUDA Programming GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel

More information

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS APPLICATION NOTE PACE175AE BUILT-IN UNCTIONS About This Note This applicatio brief is iteded to explai ad demostrate the use of the special fuctios that are built ito the PACE175AE processor. These powerful

More information

Lecture 8: GPU Programming. CSE599G1: Spring 2017

Lecture 8: GPU Programming. CSE599G1: Spring 2017 Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library

More information

Chapter 3 Classification of FFT Processor Algorithms

Chapter 3 Classification of FFT Processor Algorithms Chapter Classificatio of FFT Processor Algorithms The computatioal complexity of the Discrete Fourier trasform (DFT) is very high. It requires () 2 complex multiplicatios ad () complex additios [5]. As

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 4 The Processor Pipeliig Sigle-Cycle Disadvatages & Advatages Clk Uses the clock cycle iefficietly the clock cycle must

More information

Parallel Accelerators

Parallel Accelerators Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon

More information

cutt: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs

cutt: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs cutt: A High-Performace Tesor Traspose Library for CUDA Compatible GPUs ANTT-PEKKA HYNNNEN, NVDA Corporatio, Sata Clara, CA 95050 DMTRY. LYAKH, Natioal Ceter for Computatioal Scieces, Oak Ridge Natioal

More information

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000.

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000. 5-23 The course that gives CM its Zip Memory Maagemet II: Dyamic Storage Allocatio Mar 6, 2000 Topics Segregated lists Buddy system Garbage collectio Mark ad Sweep Copyig eferece coutig Basic allocator

More information

High Performance Computing and GPU Programming

High Performance Computing and GPU Programming High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz

More information

Introduction to GPGPUs and to CUDA programming model

Introduction to GPGPUs and to CUDA programming model Introduction to GPGPUs and to CUDA programming model www.cineca.it Marzia Rivi m.rivi@cineca.it GPGPU architecture CUDA programming model CUDA efficient programming Debugging & profiling tools CUDA libraries

More information

Introduction to SWARM Software and Algorithms for Running on Multicore Processors

Introduction to SWARM Software and Algorithms for Running on Multicore Processors Itroductio to SWARM Software ad Algorithms for Ruig o Multicore Processors David A. Bader Georgia Istitute of Techology http://www.cc.gatech.edu/~bader Tutorial compiled by Rucheek H. Sagai M.S. Studet,

More information

Stevina Dias* Sherrin Benjamin* Mitchell D silva* Lynette Lopes* *Assistant Professor Dwarkadas J Sanghavi College of Engineering, Vile Parle

Stevina Dias* Sherrin Benjamin* Mitchell D silva* Lynette Lopes* *Assistant Professor Dwarkadas J Sanghavi College of Engineering, Vile Parle GPU Programmig Models Stevia Dias* Sherri Bejami* Mitchell D silva* Lyette Lopes* *Assistat Professor Dwarkadas J Saghavi College of Egieerig, Vile Parle Abstract The CPU, the brais of the computer is

More information

CUDA Architecture & Programming Model

CUDA Architecture & Programming Model CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface ARM Editio Chapter 6 Parallel Processors from Cliet to Cloud Itroductio Goal: coectig multiple computers to get higher performace Multiprocessors

More information

Cache-Optimal Methods for Bit-Reversals

Cache-Optimal Methods for Bit-Reversals Proceedigs of the ACM/IEEE Supercomputig Coferece, November 1999, Portlad, Orego, U.S.A. Cache-Optimal Methods for Bit-Reversals Zhao Zhag ad Xiaodog Zhag Departmet of Computer Sciece College of William

More information

Programming in CUDA. Malik M Khan

Programming in CUDA. Malik M Khan Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement

More information

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include 3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI

More information

Threads and Concurrency in Java: Part 1

Threads and Concurrency in Java: Part 1 Cocurrecy Threads ad Cocurrecy i Java: Part 1 What every computer egieer eeds to kow about cocurrecy: Cocurrecy is to utraied programmers as matches are to small childre. It is all too easy to get bured.

More information

Practical Introduction to CUDA and GPU

Practical Introduction to CUDA and GPU Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing

More information

CS : Programming for Non-Majors, Summer 2007 Programming Project #3: Two Little Calculations Due by 12:00pm (noon) Wednesday June

CS : Programming for Non-Majors, Summer 2007 Programming Project #3: Two Little Calculations Due by 12:00pm (noon) Wednesday June CS 1313 010: Programmig for No-Majors, Summer 2007 Programmig Project #3: Two Little Calculatios Due by 12:00pm (oo) Wedesday Jue 27 2007 This third assigmet will give you experiece writig programs that

More information

Threads and Concurrency in Java: Part 1

Threads and Concurrency in Java: Part 1 Threads ad Cocurrecy i Java: Part 1 1 Cocurrecy What every computer egieer eeds to kow about cocurrecy: Cocurrecy is to utraied programmers as matches are to small childre. It is all too easy to get bured.

More information

Neural Networks A Model of Boolean Functions

Neural Networks A Model of Boolean Functions Neural Networks A Model of Boolea Fuctios Berd Steibach, Roma Kohut Freiberg Uiversity of Miig ad Techology Istitute of Computer Sciece D-09596 Freiberg, Germay e-mails: steib@iformatik.tu-freiberg.de

More information

FPGA IMPLEMENTATION OF BASE-N LOGARITHM. Salvador E. Tropea

FPGA IMPLEMENTATION OF BASE-N LOGARITHM. Salvador E. Tropea FPGA IMPLEMENTATION OF BASE-N LOGARITHM Salvador E. Tropea Electróica e Iformática Istituto Nacioal de Tecología Idustrial Bueos Aires, Argetia email: salvador@iti.gov.ar ABSTRACT I this work, we preset

More information

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015.

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015. Presetatio for use with the textbook Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Hash Tables xkcd. http://xkcd.com/221/. Radom Number. Used with permissio uder Creative

More information

Module 3: CUDA Execution Model -I. Objective

Module 3: CUDA Execution Model -I. Objective ECE 8823A GPU Architectures odule 3: CUDA Execution odel -I 1 Objective A more detailed look at kernel execution Data to thread assignment To understand the organization and scheduling of threads Resource

More information