Design of Digital Circuits Lecture 22: GPU Programming. Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zurich Spring May 2018
2 Agenda for Today GPU as an accelerator Program structure Bulk synchronous programming model Memory hierarchy and memory management Performance considerations Memory access SIMD utilization Atomic operations Data transfers
3 Recommended Readings CUDA Programming Guide Hwu and Kirk, Programming Massively Parallel Processors, Third Edition
4 An Example GPU
5 NVIDIA GeForce GTX 285 NVIDIA-speak: 240 stream processors SIMT execution Generic speak: 30 cores, 8 SIMD functional units per core Slide credit: Kayvon Fatahalian
6 NVIDIA GeForce GTX 285 core (I) 64 KB of storage for thread contexts (registers) (Figure legend: SIMD functional unit with control shared across 8 units; multiply-add; multiply; instruction stream decode; execution context storage.) Slide credit: Kayvon Fatahalian
7 NVIDIA GeForce GTX 285 core (II) 64 KB of storage for thread contexts (registers) Groups of 32 threads share an instruction stream (each group is a Warp): they execute the same instruction on different data Up to 32 warps are interleaved in an FGMT manner Up to 1024 thread contexts can be stored Slide credit: Kayvon Fatahalian
8 NVIDIA GeForce GTX 285 30 cores on the GTX 285: 30,720 threads Slide credit: Kayvon Fatahalian
9 Evolution of NVIDIA GPUs (Figure: #Stream Processors and GFLOPS grow across GTX 285 (2009), GTX 480 (2010), GTX 780 (2013), GTX 980 (2014), P100 (2016), V100 (2017).)
10 NVIDIA V100 NVIDIA-speak: 5120 stream processors SIMT execution Generic speak: 80 cores, 64 SIMD functional units per core Specialized Functional Units for Machine Learning (tensor cores in NVIDIA-speak)
11 NVIDIA V100 Block Diagram 80 cores on the V100
12 NVIDIA V100 Core 15.7 TFLOPS Single Precision 7.8 TFLOPS Double Precision 125 TFLOPS for Deep Learning (Tensor cores)
13 GPU Programming
14 Recall: Vector Processor Disadvantages -- Works (only) if parallelism is regular (data/SIMD parallelism) ++ Vector operations -- Very inefficient if parallelism is irregular -- How about searching for a key in a linked list? Fisher, "Very Long Instruction Word architectures and the ELI-512," ISCA
15 General Purpose Processing on GPU Easier programming of SIMD processors with SPMD GPUs have democratized High Performance Computing (HPC) Great FLOPS/$, massively parallel chip on a commodity PC Many workloads exhibit inherent parallelism Matrices Image processing However, this is not for free New programming model Algorithms need to be re-implemented and rethought Still some bottlenecks CPU-GPU data transfers (PCIe, NVLINK) DRAM memory bandwidth (GDDR5, HBM2) Data layout
16 CPU vs. GPU Different design philosophies CPU: A few out-of-order cores GPU: Many in-order FGMT cores (Figure: the CPU die is dominated by control logic and cache around a few ALUs; the GPU die is dominated by many ALUs.) Slide credit: Hwu & Kirk
17 GPU Computing Computation is offloaded to the GPU Three steps: (1) CPU-GPU data transfer (2) GPU kernel execution (3) GPU-CPU data transfer
18 Traditional Program Structure CPU threads and GPU kernels Sequential or modestly parallel sections on CPU Massively parallel sections on GPU Serial Code (host) Parallel Kernel (device) KernelA<<< nBlk, nThr >>>(args); ... Serial Code (host) Parallel Kernel (device) KernelB<<< nBlk, nThr >>>(args); ... Slide credit: Hwu & Kirk
19 Recall: SPMD Single procedure/program, multiple data This is a programming model rather than computer organization Each processing element executes the same procedure, except on different data elements Procedures can synchronize at certain points in program, e.g. barriers Essentially, multiple instruction streams execute the same program Each program/procedure 1) works on different data, 2) can execute a different control-flow path, at run-time Many scientific applications are programmed this way and run on MIMD hardware (multiprocessors) Modern GPUs are programmed in a similar way on SIMD hardware
20 CUDA/OpenCL Programming Model SIMT or SPMD Bulk synchronous programming Global (coarse-grain) synchronization between kernels The host (typically CPU) allocates memory, copies data, and launches kernels The device (typically GPU) executes kernels Grid (NDRange) Block (work-group) Within a block: shared memory and synchronization Thread (work-item)
21 Transparent Scalability Hardware is free to schedule thread blocks (Figure: the same 8-block kernel grid runs on a device that executes 2 blocks at a time or on one that executes 4 blocks at a time.) Each block can execute in any order relative to other blocks. Slide credit: Hwu & Kirk
22 Memory Hierarchy (Figure: within the grid (device), each block has its own shared memory and each thread its own registers; all threads access global / texture & surface memory and constant memory, which the host also reads and writes.)
23 Traditional Program Structure in CUDA Function prototypes float serialFunction(...); __global__ void kernel(...); main() 1) Allocate memory space on the device cudaMalloc(&d_in, bytes); 2) Transfer data from host to device cudaMemcpy(d_in, h_in, ...); 3) Execution configuration setup: #blocks and #threads 4) Kernel call kernel<<<execution configuration>>>(args); 5) Transfer results from device to host cudaMemcpy(h_out, d_out, ...); repeat as needed Kernel __global__ void kernel(type args, ...) Automatic variables transparently assigned to registers Shared memory: __shared__ Intra-block synchronization: __syncthreads(); Slide credit: Hwu & Kirk
24 CUDA Programming Language Memory allocation cudaMalloc((void**)&d_in, #bytes); Memory copy cudaMemcpy(d_in, h_in, #bytes, cudaMemcpyHostToDevice); Kernel launch kernel<<< #blocks, #threads >>>(args); Memory deallocation cudaFree(d_in); Explicit synchronization cudaDeviceSynchronize();
25 Indexing and Memory Access Images are 2D data structures height x width Image[j][i], where 0 <= j < height, and 0 <= i < width Image[0][1] Image[1][2]
26 Image Layout in Memory Row-major layout Image[j][i] = Image[j x width + i] Stride = width Image[0][1] = Image[0 x 8 + 1] Image[1][2] = Image[1 x 8 + 2]
27 Indexing and Memory Access: 1D Grid One GPU thread per pixel Grid of Blocks of Threads gridDim.x, blockDim.x blockIdx.x, threadIdx.x With 4 threads per block, thread 1 of block 6 handles pixel 6 * 4 + 1 = 25 blockIdx.x * blockDim.x + threadIdx.x
28 Indexing and Memory Access: 2D Grid 2D blocks gridDim.x, gridDim.y Row = blockIdx.y * blockDim.y + threadIdx.y Col = blockIdx.x * blockDim.x + threadIdx.x Example: Row = 1 * blockDim.y + threadIdx.y = 3 Col = 0 * blockDim.x + threadIdx.x = 1 Image[3][1] = Image[3 * 8 + 1]
29 Brief Review of GPU Architecture (I) Streaming Processor Array Tesla architecture (G80/GT200) (Figure: the array is built from TPCs; each SM holds an instruction cache, instruction fetch/dispatch, a register file, SPs, SFUs, shared memory and a constant cache; texture units with a texture L1 cache sit alongside.)
30 Brief Review of GPU Architecture (II) Streaming Multiprocessors (SM) Streaming Processors (SP) Blocks are divided into warps SIMD unit (32 threads) (Figure: a Fermi SM with instruction cache, two warp schedulers with dispatch units, register file, SPs, LD/ST units, SFUs, shared memory / L1 cache and constant cache; each block's threads t0..t31 form its warps.) NVIDIA Fermi architecture
31 Brief Review of GPU Architecture (III) Streaming Multiprocessors (SM) or Compute Units (CU) SIMD pipelines Streaming Processors (SP) or CUDA cores Vector lanes Number of SMs x SPs across generations Tesla (2007): 30 x 8 Fermi (2010): 16 x 32 Kepler (2012): 15 x 192 Maxwell (2014): 24 x 128 Pascal (2016): 56 x 64 Volta (2017): 80 x 64
32 Performance Considerations
33 Performance Considerations Main bottlenecks Global memory access CPU-GPU data transfers Memory access Latency hiding Occupancy Memory coalescing Data reuse Shared memory usage SIMD (Warp) Utilization: Divergence Atomic operations: Serialization Data transfers between CPU and GPU Overlap of communication and computation
34 Memory Access
35 Latency Hiding FGMT can hide long latency operations (e.g., memory accesses) Occupancy: ratio of active warps (Figure: with 4 active warps, the long-latency instruction of Warp 0 is hidden by issuing instructions from Warps 1-3; with only 2 active warps, the pipeline idles while Warps 0 and 1 wait.)
36 Occupancy SM resources (typical values) Maximum number of warps per SM (64) Maximum number of blocks per SM (32) Register usage (256KB) Shared memory usage (64KB) Occupancy calculation Number of threads per block (defined by the programmer) Registers per thread (known at compile time) Shared memory per block (defined by the programmer)
37 Memory Coalescing When accessing global memory, we want to make sure that concurrent threads access nearby memory locations Peak bandwidth utilization occurs when all threads in a warp access one cache line (Figure: uncoalesced vs. coalesced access patterns over matrices Md and Nd.) Slide credit: Hwu & Kirk
38 Uncoalesced Memory Accesses Access direction in kernel code (Figure: in each time period, the concurrent accesses of threads T1..T4 to matrix M are a full row apart (stride = width) in the linear row-major layout.) Slide credit: Hwu & Kirk
39 Coalesced Memory Accesses Access direction in kernel code (Figure: in each time period, the concurrent accesses of threads T1..T4 fall on consecutive elements of the linear row-major layout.) Slide credit: Hwu & Kirk
40 AoS vs. SoA Array of Structures vs. Structure of Arrays Structure of Arrays (SoA) struct foo { float a[8]; float b[8]; float c[8]; int d[8]; } A; Array of Structures (AoS) struct foo { float a; float b; float c; int d; } A[8];
41 CPUs Prefer AoS, GPUs Prefer SoA Linear and strided accesses (Figure: throughput in GB/s vs. stride (structure size) for the GPU and for 1, 2 and 4 CPU threads on an AMD Kaveri.) Sung+, "DL: A data layout transformation system for heterogeneous computing," INPAR
42 Data Reuse Same memory locations accessed by neighboring threads for (int i = 0; i < 3; i++){ for (int j = 0; j < 3; j++){ sum += gauss[i][j] * Image[(i+row-1)*width + (j+col-1)]; } }
43 Data Reuse: Tiling To take advantage of data reuse, we divide the input into tiles that can be loaded into shared memory __shared__ int l_data[(L_SIZE+2)*(L_SIZE+2)]; Load tile into shared memory __syncthreads(); for (int i = 0; i < 3; i++){ for (int j = 0; j < 3; j++){ sum += gauss[i][j] * l_data[(i+l_row-1)*(L_SIZE+2)+j+l_col-1]; } }
44 Shared Memory Shared memory is an interleaved (banked) memory Each bank can service one address per cycle Typically, 32 banks in NVIDIA GPUs Successive 32-bit words are assigned to successive banks Bank = Address % 32 Bank conflicts are only possible within a warp No bank conflicts between different warps
45 Shared Memory Bank Conflicts (I) Bank conflict free (Figure: linear addressing with stride 1 maps each thread to its own bank; a random 1:1 permutation is also conflict free.) Slide credit: Hwu & Kirk
46 Shared Memory Bank Conflicts (II) N-way bank conflicts (Figure: stride 2 produces a 2-way bank conflict, with two threads per bank; stride 8 produces an 8-way bank conflict, with eight threads per bank.) Slide credit: Hwu & Kirk
47 Reducing Shared Memory Bank Conflicts Bank conflicts are only possible within a warp No bank conflicts between different warps If strided accesses are needed, some optimization techniques can help Padding Randomized mapping Rau, "Pseudo-randomly interleaved memory," ISCA 1991 Hash functions Van den Braak+, "Configurable XOR Hash Functions for Banked Scratchpad Memories in GPUs," IEEE TC
48 SIMD Utilization
49 SIMD Utilization Intra-warp divergence Compute(threadIdx.x); if (threadIdx.x % 2 == 0){ Do_this(threadIdx.x); } else{ Do_that(threadIdx.x); } (Figure: every lane executes Compute, but only the even lanes are active on the if-path and only the odd lanes on the else-path.)
50 Increasing SIMD Utilization Divergence-free execution Compute(threadIdx.x); if (threadIdx.x < 32){ Do_this(threadIdx.x * 2); } else{ Do_that((threadIdx.x%32)*2+1); } (Figure: every lane executes Compute; the first warp takes the if-path and the second warp the else-path, each with all lanes active.)
51 Vector Reduction: Naïve Mapping (I) (Figure: threads 0, 2, 4, ... add pairs of elements; each iteration doubles the stride and halves the number of active threads, which remain scattered across the warp.) Slide credit: Hwu & Kirk
52 Vector Reduction: Naïve Mapping (II) Program with low SIMD utilization __shared__ float partialSum[]; unsigned int t = threadIdx.x; for (int stride = 1; stride < blockDim.x; stride *= 2) { __syncthreads(); if (t % (2*stride) == 0) partialSum[t] += partialSum[t + stride]; }
53 Divergence-Free Mapping (I) All active threads belong to the same warp (Figure: in each iteration, a contiguous group of threads 0, 1, 2, ... adds elements from the upper half onto the lower half; the stride halves each iteration.) Slide credit: Hwu & Kirk
54 Divergence-Free Mapping (II) Program with high SIMD utilization __shared__ float partialSum[]; unsigned int t = threadIdx.x; for (int stride = blockDim.x/2; stride > 0; stride >>= 1) { __syncthreads(); if (t < stride) partialSum[t] += partialSum[t + stride]; }
55 Atomic Operations
56 Shared Memory Atomic Operations Atomic operations are needed when threads might update the same memory locations at the same time CUDA: int atomicAdd(int*, int); PTX: atom.shared.add.u32 %r25, [%rd14], %r24; SASS on Tesla, Fermi, Kepler: /*00a0*/ LDSLK P0, R9, [R8]; IADD R10, R9, R7; STSCUL P1, [R8], R10; BRA 0xa0; SASS on Maxwell, Pascal, Volta: /*01f8*/ ATOMS.ADD RZ, [R7], R11; Native atomic operations for 32-bit integer, and 32-bit and 64-bit atomicCAS
57 Atomic Conflicts We define the intra-warp conflict degree as the number of threads in a warp that update the same memory position The conflict degree can be between 1 and 32 (Figure: when th0 and th1 update different shared-memory positions there is no atomic conflict and the updates proceed concurrently; when they update the same position the updates are serialized.)
58 Histogram Calculation Histograms count the number of data instances in disjoint categories (bins) for (each pixel i in image I){ Pixel = I[i] // Read pixel Pixel' = Computation(Pixel) // Optional computation Histogram[Pixel']++ // Vote in histogram bin } (Figure: each thread reads its data instances from the input and performs atomic additions on the shared histogram bins 0 .. B-1.)
59 Histogram Calculation of Natural Images Frequent conflicts in natural images
60 Optimizing Histogram Calculation Privatization: Per-block sub-histograms in shared memory (Figure: each block accumulates its own sub-histogram in shared memory; the sub-histograms are then merged into the final histogram in global memory.) Gómez-Luna+, "Performance Modeling of Atomic Additions on GPU Scratchpad Memory," IEEE TPDS
61 Data Transfers between CPU and GPU
62 Data Transfers Synchronous and asynchronous transfers Streams (Command queues) Sequence of operations that are performed in order CPU-GPU data transfer Kernel execution GPU-CPU data transfer D input data instances, B blocks (Figure: in the default stream, the copy time tT and the execution time tE are fully serialized.)
63 Asynchronous Transfers Computation divided into nStreams D input data instances, B blocks Each stream works on D/nStreams data instances and B/nStreams blocks (Figure: the copy of one chunk overlaps with the execution of the previous chunk.) Estimates tE + tT/nStreams when tE >= tT (dominant kernel) tT + tE/nStreams when tT > tE (dominant transfers)
64 Overlap of Communication and Computation Applications with independent computation on different data instances can benefit from asynchronous transfers For instance, video processing Non-streamed execution A sequence of 6 frames is transferred to device 6 x b blocks compute on the sequence of frames Streamed execution A chunk of 2 frames is transferred to device 2 x b blocks compute on the chunk, while the second chunk is being transferred Execution time saved thanks to streams Gómez-Luna+, "Performance models for asynchronous data transfers on consumer Graphics Processing Units," JPDC
65 Summary GPU as an accelerator Program structure Bulk synchronous programming model Memory hierarchy and memory management Performance considerations Memory access Latency hiding: occupancy (TLP) Memory coalescing Data reuse: shared memory SIMD utilization Atomic operations Data transfers
66 PUMPS+AI Summer School, Barcelona, July
67 Design of Digital Circuits Lecture 22: GPU Programming Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zurich Spring May 2018
More informationIntroduction to Parallel Computing with CUDA. Oswald Haan
Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries
More informationComputer Architecture. Microcomputer Architecture and Interfacing Colorado School of Mines Professor William Hoff
Computer rchitecture Microcomputer rchitecture ad Iterfacig Colorado School of Mies Professor William Hoff Computer Hardware Orgaizatio Processor Performs all computatios; coordiates data trasfer Iput
More informationCSE 417: Algorithms and Computational Complexity
Time CSE 47: Algorithms ad Computatioal Readig assigmet Read Chapter of The ALGORITHM Desig Maual Aalysis & Sortig Autum 00 Paul Beame aalysis Problem size Worst-case complexity: max # steps algorithm
More informationIsn t It Time You Got Faster, Quicker?
Is t It Time You Got Faster, Quicker? AltiVec Techology At-a-Glace OVERVIEW Motorola s advaced AltiVec techology is desiged to eable host processors compatible with the PowerPC istructio-set architecture
More informationCUDA Parallelism Model
GPU Teaching Kit Accelerated Computing CUDA Parallelism Model Kernel-Based SPMD Parallel Programming Multidimensional Kernel Configuration Color-to-Grayscale Image Processing Example Image Blur Example
More informationCluster Computing Spring 2004 Paul A. Farrell
Cluster Computig Sprig 004 3/18/004 Parallel Programmig Overview Task Parallelism OS support for task parallelism Parameter Studies Domai Decompositio Sequece Matchig Work Assigmet Static schedulig Divide
More informationLecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming
Lecture Notes 6 Itroductio to algorithm aalysis CSS 501 Data Structures ad Object-Orieted Programmig Readig for this lecture: Carrao, Chapter 10 To be covered i this lecture: Itroductio to algorithm aalysis
More informationUNIVERSITY OF MORATUWA
UNIVERSITY OF MORATUWA FACULTY OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING B.Sc. Egieerig 2014 Itake Semester 2 Examiatio CS2052 COMPUTER ARCHITECTURE Time allowed: 2 Hours Jauary 2016
More information2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions
Administrative L6: Memory Hierarchy Optimization IV, Bandwidth Optimization Next assignment available Goals of assignment: simple memory hierarchy management block-thread decomposition tradeoff Due Tuesday,
More informationCUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)
CUDA programming model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA programming model 1/23 Outline 1 CUDA qualifiers 2 CUDA Kernel Thread hierarchy Kernel, configuration
More informationCOMP Parallel Computing. PRAM (1): The PRAM model and complexity measures
COMP 633 - Parallel Computig Lecture 2 August 24, 2017 : The PRAM model ad complexity measures 1 First class summary This course is about parallel computig to achieve high-er performace o idividual problems
More informationData Structures and Algorithms. Analysis of Algorithms
Data Structures ad Algorithms Aalysis of Algorithms Outlie Ruig time Pseudo-code Big-oh otatio Big-theta otatio Big-omega otatio Asymptotic algorithm aalysis Aalysis of Algorithms Iput Algorithm Output
More informationPage 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory!
Why Care About the Memory Hierarchy? Memory Virtual Memory -DRAM Memory Gap (latecy) Reasos: Multi process systems (abstractio & memory protectio) Solutio: Tables (holdig per process traslatios) Fast traslatio
More informationParallel Computing. Lecture 19: CUDA - I
CSCI-UA.0480-003 Parallel Computing Lecture 19: CUDA - I Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com GPU w/ local DRAM (device) Behind CUDA CPU (host) Source: http://hothardware.com/reviews/intel-core-i5-and-i7-processors-and-p55-chipset/?page=4
More informationDesign of Digital Circuits Lecture 17: Out-of-Order, DataFlow, Superscalar Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018
Desig of Digital Circuits Lecture 17: Out-of-Order, DataFlow, Superscalar Executio Prof. Our Mutlu ETH Zurich Sprig 2018 27 April 2018 Ageda for Today & Next Few Lectures Sigle-cycle Microarchitectures
More informationCOSC 6374 Parallel Computations Introduction to CUDA
COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo
More informationGPU CUDA Programming
GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications
More informationIntroduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel
More informationAPPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS
APPLICATION NOTE PACE175AE BUILT-IN UNCTIONS About This Note This applicatio brief is iteded to explai ad demostrate the use of the special fuctios that are built ito the PACE175AE processor. These powerful
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationChapter 3 Classification of FFT Processor Algorithms
Chapter Classificatio of FFT Processor Algorithms The computatioal complexity of the Discrete Fourier trasform (DFT) is very high. It requires () 2 complex multiplicatios ad () complex additios [5]. As
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 4 The Processor Pipeliig Sigle-Cycle Disadvatages & Advatages Clk Uses the clock cycle iefficietly the clock cycle must
More informationParallel Accelerators
Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon
More informationcutt: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs
cutt: A High-Performace Tesor Traspose Library for CUDA Compatible GPUs ANTT-PEKKA HYNNNEN, NVDA Corporatio, Sata Clara, CA 95050 DMTRY. LYAKH, Natioal Ceter for Computatioal Scieces, Oak Ridge Natioal
More informationBasic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000.
5-23 The course that gives CM its Zip Memory Maagemet II: Dyamic Storage Allocatio Mar 6, 2000 Topics Segregated lists Buddy system Garbage collectio Mark ad Sweep Copyig eferece coutig Basic allocator
More informationHigh Performance Computing and GPU Programming
High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz
More informationIntroduction to GPGPUs and to CUDA programming model
Introduction to GPGPUs and to CUDA programming model www.cineca.it Marzia Rivi m.rivi@cineca.it GPGPU architecture CUDA programming model CUDA efficient programming Debugging & profiling tools CUDA libraries
More informationIntroduction to SWARM Software and Algorithms for Running on Multicore Processors
Itroductio to SWARM Software ad Algorithms for Ruig o Multicore Processors David A. Bader Georgia Istitute of Techology http://www.cc.gatech.edu/~bader Tutorial compiled by Rucheek H. Sagai M.S. Studet,
More informationStevina Dias* Sherrin Benjamin* Mitchell D silva* Lynette Lopes* *Assistant Professor Dwarkadas J Sanghavi College of Engineering, Vile Parle
GPU Programmig Models Stevia Dias* Sherri Bejami* Mitchell D silva* Lyette Lopes* *Assistat Professor Dwarkadas J Saghavi College of Egieerig, Vile Parle Abstract The CPU, the brais of the computer is
More informationCUDA Architecture & Programming Model
CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New
More informationCOMPUTER ORGANIZATION AND DESIGN
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface ARM Editio Chapter 6 Parallel Processors from Cliet to Cloud Itroductio Goal: coectig multiple computers to get higher performace Multiprocessors
More informationCache-Optimal Methods for Bit-Reversals
Proceedigs of the ACM/IEEE Supercomputig Coferece, November 1999, Portlad, Orego, U.S.A. Cache-Optimal Methods for Bit-Reversals Zhao Zhag ad Xiaodog Zhag Departmet of Computer Sciece College of William
More informationProgramming in CUDA. Malik M Khan
Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement
More informationAccelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include
3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI
More informationThreads and Concurrency in Java: Part 1
Cocurrecy Threads ad Cocurrecy i Java: Part 1 What every computer egieer eeds to kow about cocurrecy: Cocurrecy is to utraied programmers as matches are to small childre. It is all too easy to get bured.
More informationPractical Introduction to CUDA and GPU
Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing
More informationCS : Programming for Non-Majors, Summer 2007 Programming Project #3: Two Little Calculations Due by 12:00pm (noon) Wednesday June
CS 1313 010: Programmig for No-Majors, Summer 2007 Programmig Project #3: Two Little Calculatios Due by 12:00pm (oo) Wedesday Jue 27 2007 This third assigmet will give you experiece writig programs that
More informationThreads and Concurrency in Java: Part 1
Threads ad Cocurrecy i Java: Part 1 1 Cocurrecy What every computer egieer eeds to kow about cocurrecy: Cocurrecy is to utraied programmers as matches are to small childre. It is all too easy to get bured.
More informationNeural Networks A Model of Boolean Functions
Neural Networks A Model of Boolea Fuctios Berd Steibach, Roma Kohut Freiberg Uiversity of Miig ad Techology Istitute of Computer Sciece D-09596 Freiberg, Germay e-mails: steib@iformatik.tu-freiberg.de
More informationFPGA IMPLEMENTATION OF BASE-N LOGARITHM. Salvador E. Tropea
FPGA IMPLEMENTATION OF BASE-N LOGARITHM Salvador E. Tropea Electróica e Iformática Istituto Nacioal de Tecología Idustrial Bueos Aires, Argetia email: salvador@iti.gov.ar ABSTRACT I this work, we preset
More informationHash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015.
Presetatio for use with the textbook Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Hash Tables xkcd. http://xkcd.com/221/. Radom Number. Used with permissio uder Creative
More informationModule 3: CUDA Execution Model -I. Objective
ECE 8823A GPU Architectures odule 3: CUDA Execution odel -I 1 Objective A more detailed look at kernel execution Data to thread assignment To understand the organization and scheduling of threads Resource
More information