Around GPGPU: architecture, programming, and arithmetic

Size: px
Start display at page:

Download "Around GPGPU: architecture, programming, and arithmetic"

Transcription

1 Around GPGPU: architecture, programming, and arithmetic Sylvain Collange, Arénaire, LIP, ENS Lyon RAIM'11, Perpignan February 9, 2011

2 Key challenges for parallel architectures In computer architecture how to? Scalability Overcome memory wall and power wall? Programmability Write portable, reusable parallel software with minimal effort? Reliability Tolerate hardware faults? Validate large-scale parallel applications? In computer arithmetic how to? Adjust precision to minimize power and memory transfers? Make error analysis easy, debug floating-point programs? Validate, certify numerical algorithms and results? 2

3 Why should we care about GPUs? Today Yesterday ( ) Homogeneous multi-core Discrete components Superscalar CPU GPU ASIC FPGA Tomorrow Today ( ) Heterogeneous multi-core Intel Sandy Bridge shipping AMD Fusion sampling NVIDIA Denver/Maxwell project announced Today's GPU = tomorrow's throughput processor Latencyoptimized cores Throughputoptimized cores (Configurable?) hardware accelerators Heterogeneous multi-core chip 3

4 Outline How a GPU works Software: fine-grained vs. coarse-grained threading Hardware: from multi-core to SIMT GPU programming guidelines Arithmetic features 4

5 Threading granularity e.g. In-place vector scaling X a X X Coarse-grained threading Decouple tasks to reduce conflicts and inter-thread communication T0 T1 T2 T3 X[0..3] X[4..7] X[8..11] X[12-15] Fine-grained threading Interleave tasks Exhibit locality: neighbor threads share memory Exhibit regularity: neighbor threads have a similar behavior X T0 T1 T2 T3 X[0] X[4] X[8] X[12] X[1] X[2] X[5] X[6] X[9] X[10] X[13] X[14] X[3] X[7] X[11] X[15] 5

6 Regularity Similarity in behavior between threads Instruction regularity Regular Irregular Thread mul mul mul mul mul add store load add add add add load mul sub add Time Control regularity i=17 i=17 i=17 i=17 switch(i) { case 2:... case 17:... case 21:... } i=21 i=4 i=17 i=2 Memory regularity load X[8] load X[9] load X[10] load X[11] load X[8] load X[0] load X[11] load X[3] X Memory 6

7 Outline How a GPU works Software: fine-grained vs. coarse-grained threading Hardware: from multi-core to SIMT GPU programming guidelines Arithmetic features 7

8 Homogeneous multi-core Replication of the complete execution engine Threads: coarse-grained T0 T2 X Machine code Architecture move i slice_begin loop: load t X[i] mul t a t store X[i] t add i i+1 branch i<slice_end? loop add i 18 IF store X[17] mul IF ID EX LSU add i 50 IF store X[49] mul IF ID EX LSU Memory Improves throughput thanks to explicit parallelism 8

9 Interleaved multi-threading Time-multiplexing of processing units Threads: fine-grained X T0 T1 T2 T3 Architecture mul mul Fetch Decode Machine code move i tid loop: load t X[i] mul t a t store X[i] t add i i+tcnt branch i<n? loop add i 21 add i 23 load X[16] store X[17] load X[18] store X[19] Execute L/S Unit Memory Improves latency tolerance thanks to explicit parallelism 9

10 Single-instruction, multiple threads (SIMT) Cooperative sharing of fetch/decode, load-store units Fetch 1 instruction on behalf of several threads Read 1 memory location and broadcast to several registers X Threads: fine-grained T0 T1 T2 T3 (0) mul (0-3) load IF (0-3) store ID (1) mul (2) mul (3) mul EX Memory (0) (1) (2) (3) LSU Improves Area/Power-efficiency thanks to regularity Consolidates memory transactions: less memory pressure 10

11 Around GPGPU fine-grained multi-thread architectures exploiting regularity: architecture, programming, and arithmetic Sylvain Collange, Arénaire, LIP, ENS Lyon RAIM'11, Perpignan February 9, 2011

12 Core 32 Core 18 Core 17 Core 16 Core 2 Core 1 Example GPU: NVIDIA GeForce GTX 580 SIMT: warps of 32 synchronized threads 16 Streaming Multiprocessors / chip 2 16 cores / SM, 48 warps / SM Warp 1 Warp 3 Warp 2 Warp 4 Warp 47 Warp 48 Time 1580 Gflop/s Up to threads in flight SM1 SM16 12

13 SIMD vs. SIMT Instruction regularity Control regularity Memory regularity SIMD Vectorization at compile-time Software-managed Bit-masking, predication Compiler selects: vector load-store or gather-scatter SIMT Vectorization at runtime Hardware-managed Stack, counters, multiple PCs Hardware-managed Gather-scatter with hardware coalescing SIMT architecture : run regular fine-grained threads on SIMD units SIMT = one possible implementation of SPMD 13

14 GPU design space: not just SIMT How can we execute SPMD threads? MIMD (multi-core) Horizontal SIMT NVIDIA Tesla NVIDIA Fermi Multi-threading (Hyperthreading-like) Vertical SIMT Pipelined vectors (Cray-like) This is the GPU architect's problem Programmer's point of view: just threads Microarchitecture-specific optimizations Or just focus on locality and regularity AMD Evergreen AMD Northern Islands 14

15 Outline How a GPU works GPU programming guidelines Consequences of fine-grained threading Bottlenecks and limitations The more threads, the better? Arithmetic features 15

16 Consequence: SoA data layout Array of Structures (AoS) Good for random access to the whole structure (RGB pixel) struct Pixel { float r, g, b; }; Pixel image_aos[480][640]; Structure of Arrays (SoA) Good for sequential access Efficient access to a subset of fields (only green) With fine-grained threading: brings regularity struct Image { float R[480][640]; float G[480][640]; float B[480][640]; }; Image image_soa; 16

17 Case study: call stack Coarse-grained (x86 ) Split stacks Fine-grained (Fermi) Interleaved stacks T0 T1 SP T0 T1 T2 T3 SP Cache line T2 T3 Memory Memory No false sharing Good for coherent mem Regularity cache way conflicts [MengSka09] False sharing Good for spacial locality! 17

18 Outline How a GPU works GPU programming guidelines Consequences of fine-grained threading Bottlenecks and limitations The more threads, the better? Arithmetic features 18

19 Our #1 constraint is power Power measurements on NVIDIA GT200 [CDT10] Energy/op (nj) Total power (W) Instruction control Multiply-add on element vector Load 128B from DRAM With the same amount of energy Load 1 word from DRAM Compute 44 flops Need to keep memory traffic low Standard solution: caches, on-chip memory 19

20 On-chip memory on GPUs Conventional wisdom Cache area in CPU vs. GPU according to the NVIDIA CUDA Programming Guide Actual data GPU NVIDIA GF110 AMD Cayman Register files + caches 3.9 MB 7.7 MB At this rate, will catch up with CPUs by

21 Little's law: data=throughput latency Throughput (GB/s) Intel Core i , Latency (ns) NVIDIA GeForce GTX 580 L1 L2 320 DRAM ns21

22 Outline How a GPU works GPU programming guidelines Consequences of fine-grained threading Bottlenecks and limitations The more threads, the better? Arithmetic features 22

23 As many threads as possible? Example: SGEMM from CUBLAS 1.1 From: Vasily Volkov. Programming inverse memory hierarchy : case of stencils on GPUs. ParCFD, Addresses, indices (affine, linear increase) Duplicated data (uniform) Useful data 15 registers / thread Temporary data 512 threads / CTA 15 registers/thread Only 2 registers/thread really needed: 87% wastage 23

24 Overhead of SIMT Run as many threads as possible? Maximal parallelism better latency tolerance Less locality: need to keep private data of each thread More thread management overhead Initialization, redundant operations Instruction-Level Parallelism is still alive Out-of-order completion of load/stores on Tesla, Fermi Superscalar (supervector?) execution on GF104 VLIW on AMD architectures Thread count is not increasing, flops/thread is GPU (year) Threads MFlops/thread GT200 (2008) GF110 (2010)

25 Fewer threads, more computations Volkov SGEMM 8 elements computed / thread Unrolled loops Less traffic through shared memory, more through registers Overhead amortized 1920 registers vs for the same amount of work Works for redundant computations too Success story +60% compared to CUBLAS 1.1 Adopted in CUBLAS 2.0 More in [Volk10] Adresses, indices Useful data Duplicated data Temporary data 64 threads / CTA 30 registers / thread 25

26 Takeaway Distribute work and data Favor SoA Favor locality and regularity Use common sense (avoid extraneous copies or indirections) More threads higher performance Saturate instruction-level parallelism first (almost free) Then use data parallelism (expensive in terms of locality) Compiler optimization: unroll & jam And/or hardware-based solutions flexible register allocation data deduplication [CDZ09] 26

27 Outline How a GPU works GPU programming guidelines Arithmetic features IEEE-754? A bit of history FP capabilities 27

28 Every new generation is now IEEE-754 The Vector Unit floating-point precision is approximately IEEE. E. Lindholm, M. Kilgard, H. Moreton, A user-programmable vertex engine, SIGGRAPH, 2001 The vector unit can perform four IEEE single-precision multiply, add, or multiply-add operations, as well as inner products, max, min, and so on. J. Montrym, H. Moreton, The GeForce 6800, IEEE Micro, 2005 The floating-point add and multiply operations are compatible with the IEEE 754 standard for single-precision FP numbers, including not-a-number (NaN) and infinity values. Erik Lindholm et al., NVIDIA Tesla: a unified graphics and computing architecture, IEEE Micro, 2008 Single precision floating point instructions now support subnormal numbers by default in hardware, as well as all four IEEE rounding modes (nearest, zero, positive infinity, and negative infinity). NVIDIA's next generation CUDA compute architecture: Fermi Whitepaper, 2009 All compute devices follow the IEEE standard for binary floating-point arithmetic with the following deviations: [ 2-page long bullet list ] NVIDIA CUDA C Programming Guide,

29 Some partial, recent GPU history Microsoft DirectX 7.x a 9.0b 9.0c NVIDIA NV10 NV20 NV30 NV40 G70 G80-G90 GT200 GF100 FP 16 Programmable FP 32 shaders Dynamic control flow SIMT? CUDA ATI/AMD FP 24 CTM FP 64 CAL R100 R200 R300 R400 R500 R600 R700 Evergreen NI GPGPU traction

30 Arithmetic features 2006 (ATI R500, NVIDIA G70): Cray-1-like FP Truncated multipliers, adders with 2 guard bits and no sticky 41 / 41 1 Same GPU, different units: different behavior 2007 (ATI R600, NVIDIA G80) Correct IEEE-754 rounding to the nearest for +, Integer arithmetic and logical ops 2008 (AMD R670, NVIDIA GT200) Binary (AMD Evergreen, NVIDIA GF100) 4 mandatory IEEE rounding modes FMA for both Binary32 and Binary64 Subnormals at full-speed 30

31 Hardware elementary functions We therefore conclude that (1) the entire function library should be included in the hardware if and only if COMPLEX data types and their corresponding arithmetic are formally introduced; (2) the following error/accuracy criterion should be adopted and met by the implementation: [Correct rounding]. If either of these conditions is not met, then none of the elementary functions should be included in the hardware. G. Paul, M.W. Wilson, Should the elementary function library be incorporated into computer instruction sets?, TOMS, years later: still no complex datatypes nor correct rounding of elementary functions But we have hardware elementary functions on GPUs 1/x, 1/ x, log 2, 2 x, sin, cos Accuracy: 22 to 23 bits Applications: graphics, physics, finance 31

32 Graphics is bandwidth-starved too Lower-precision format: Binary16 11-bit significand, 5-bit exponent In IEEE-754:2008 Back to the 50's: Block Floating-Point formats One shared exponent, multiple significands More compact storage for correlated FP data 1, , , , m 1 m 2 m 3 m 4 e f 1 =m 1 x2 e f 2 =m 2 x2 e f 3 =m 3 x2 e f 4 =m 4 x2 e Lossy compression of textures in memory Hardware-based on-the-fly decompression Lossless compression of frame buffer, depth buffer 32

33 Fused multiply-add (FMA) (a b+c) with only one rounding error Higher accuracy a b Error-free transformations Different behavior than a b+c Loss of commutativity (+ in dot product ) + c a b a b+c R(a b+c) On all recent GPU Reasons GPUs do lots of linear algebra Easier to implement 4-operand instructions than on CPUs Used for IEEE-compliant division 33

34 Full-speed subnormals On CPUs, no hardware support for subnormals Micro-coded exception handler (x86 ) Software exception handler (Alpha ) On GPUs, subnormals in hardware at full-speed NV GT200 in binary64 NV Fermi AMD Evergreen Reasons Producing output denormals is free with an FMA (Cell SPU) GPU pipelines are latency-tolerant (GF100: 16-cycle FMA) Exception handling breaks regularity inefficient in SIMT 34

35 Static rounding attributes On CPUs Rounding mode as a mode for each thread Get/set with e.g. fegetround() and fesetround() On NVIDIA GPUs Rounding direction: flag in the instruction word Benefit: zero-overhead mode switch Reason: no legacy numeric code on GPUs Applications Interval arithmetic Interval CUDA SDK sample 100 speedup for the same development effort Stochastic arithmetic [JL10] 35

36 AMD Cypress floating-point unit 36

37 AMD Cypress floating-point unit Example Operation 4 FMA Rx Ax Bx + Cx Ry Ay By + Cy 4 UMA Rx (Ax Bx) + Cx Ry (Ay By) + Cy UDOT4 Rx (Ax Bx) + (Ay By) + (Az Bz) + (Aw Bw) 2 UDOT2 Rx (Ax Bx) + (Ay By) Rz (Az Bz) + (Aw Bw) 2 UMFMA Rx (Bx Az) Bz + Cx Ry (By Aw) Bw + Cy FMA Binary64 Rxy Axy Bxy + Cxy 37

38 Next arithmetic features? Arithmetic exceptions, traps, flags Easier to implement on a GPU than on a CPU Bullet feature for IEEE-754-compliance Market demand? Speed? Fused operations (FDOT2, FADD2 ) Fused dot product a b+c d: natural evolution of AMD's FPU Not in IEEE-754: not a drop-in replacement for unfused ops Parallel long accumulators 1 accum per thread: cannot fit on-chip with ~25000 threads Shared long accumulator In software using atomic instructions? Hardware assist for high-radix carry-save, conflict resolution? 38

39 Conclusion Software side Fine-grained thread parallelism Regularity Specialized in FP arithmetic From 8-bit fixed point to IEEE-754:2008 in 10 years Now better FP support than on most CPUs Specialized in graphics Exotic arithmetic units Can HPC learn from computer graphics? Fixed-function units, memory compression? Driven mostly by economic forces 39

40 FP OXO Format FMA UMA Rounding Subnormals Inf, NaN Flags Exceptions Intel X86 80 bit 4 Dynamic Microcode IBM PowerPC 64 bit 4 Dyn. Microcode Intel IA bit 4 Dyn. + Stat. Microcode 32 bit RZ IBM Cell SPU 64 bit 4 Dyn. Output NVIDIA 32 bit 2 Static GT bit 4 Static 32 bit RN AMD RV bit RN 32 bit 4 Static NVIDIA GF100 AMD Evergreen 64 bit 4 Static 32 bit 4 Dyn. 64 bit 4 Dyn. OpenCL 1.1 Direct3D bit Opt N/A RN Opt 64 bit 4 StaticRN 32 bit RN 64 bit RN 40

41 References [MengSka09] Jiayuan Meng, Kevin Skadron. Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling. ICCD [Glew09] Andy Glew. Coherent vector lane threading. Berkeley ParLab Seminar, [CDT09] Sylvain Collange, David Defour, Arnaud Tisserand. Power consumption of GPUs from a software perspective. ICCS [Volk10] Vasily Volkov. Better Performance at Lower Occupancy. GTC [CDZ09] Sylvain Collange, David Defour, Yao Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. Europar HPPC09, 2009 [JL10] Fabienne Jezequel, Jean-Luc Lamotte. Numerical validation of Slater integrals computation on GPU. SCAN

Around GPGPU: architecture, programming, and arithmetic

Around GPGPU: architecture, programming, and arithmetic Around GPGPU: architecture, programming, and arithmetic Sylvain Collange, Arénaire, LIP, ENS Lyon David Defour, DALI, ELIAUS, Université de Perpignan November 10, 2010 Key challenges for parallel architectures

More information

Improving the efficiency of parallel architectures with regularity

Improving the efficiency of parallel architectures with regularity Improving the efficiency of parallel architectures with regularity Sylvain Collange Arénaire, LIP, ENS de Lyon Dipartimento di Ingegneria dell'informazione Università di Siena February 2, 2011 Where I

More information

GPU & Computer Arithmetics

GPU & Computer Arithmetics GPU & Computer Arithmetics David Defour University of Perpignan Key multicore challenges Performance challenge How to scale from 1 to 1000 cores The number of cores is the new MegaHertz Power efficiency

More information

Introduction to GPU architecture

Introduction to GPU architecture Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr ADA - 2017 Graphics processing unit (GPU) GPU or GPU Graphics

More information

Introduction to GPU architecture

Introduction to GPU architecture Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr ADA - 2017 Graphics processing unit (GPU) GPU or GPU Graphics

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

Parallel programming: Introduction to GPU architecture. Sylvain Collange Inria Rennes Bretagne Atlantique

Parallel programming: Introduction to GPU architecture. Sylvain Collange Inria Rennes Bretagne Atlantique Parallel programming: Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr Outline of the course Feb 29: Introduction to GPU architecture Let's pretend

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

GPU programming: Code optimization part 2. Sylvain Collange Inria Rennes Bretagne Atlantique

GPU programming: Code optimization part 2. Sylvain Collange Inria Rennes Bretagne Atlantique GPU programming: Code optimization part 2 Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr Outline Memory optimization Memory access patterns Global memory optimization Shared

More information

Optimization Techniques for Parallel Code 3. Introduction to GPU architecture

Optimization Techniques for Parallel Code 3. Introduction to GPU architecture Optimization Techniques for Parallel Code 3. Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr OPT - 2018 Graphics

More information

Optimization Techniques for Parallel Code 2. Introduction to GPU architecture

Optimization Techniques for Parallel Code 2. Introduction to GPU architecture Optimization Techniques for Parallel Code 2. Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr OPT - 2017 What

More information

Parallel programming: Introduction to GPU architecture

Parallel programming: Introduction to GPU architecture Parallel programming: Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr http://www.irisa.fr/alf/collange/ PPAR - 2018 Outline of the course March

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Interval arithmetic on graphics processing units

Interval arithmetic on graphics processing units Interval arithmetic on graphics processing units Sylvain Collange*, Jorge Flórez** and David Defour* RNC'8 July 7 9, 2008 * ELIAUS, Université de Perpignan Via Domitia ** GILab, Universitat de Girona How

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel

More information

GRAPHICS PROCESSING UNITS

GRAPHICS PROCESSING UNITS GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Simty: Generalized SIMT execution on RISC-V

Simty: Generalized SIMT execution on RISC-V Simty: Generalized SIMT execution on RISC-V CARRV 2017 Sylvain Collange INRIA Rennes / IRISA sylvain.collange@inria.fr From CPU-GPU to heterogeneous multi-core Yesterday (2000-2010) Homogeneous multi-core

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

GPU Computation Strategies & Tricks. Ian Buck NVIDIA GPU Computation Strategies & Tricks Ian Buck NVIDIA Recent Trends 2 Compute is Cheap parallelism to keep 100s of ALUs per chip busy shading is highly parallel millions of fragments per frame 0.5mm 64-bit

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

Antonio R. Miele Marco D. Santambrogio

Antonio R. Miele Marco D. Santambrogio Advanced Topics on Heterogeneous System Architectures GPU Politecnico di Milano Seminar Room A. Alario 18 November, 2015 Antonio R. Miele Marco D. Santambrogio Politecnico di Milano 2 Introduction First

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter

More information

Threading Hardware in G80

Threading Hardware in G80 ing Hardware in G80 1 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA 2 3D 3D API: API: OpenGL OpenGL or or Direct3D Direct3D GPU Command &

More information

Outline. L9: Project Discussion and Floating Point Issues. Project Parts (Total = 50%) Project Proposal (due 3/8) 2/13/12.

Outline. L9: Project Discussion and Floating Point Issues. Project Parts (Total = 50%) Project Proposal (due 3/8) 2/13/12. Outline L9: Project Discussion and Floating Point Issues Discussion of semester projects Floating point Mostly single precision until recent architectures Accuracy What s fast and what s not Reading: Ch

More information

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer Outline We ll be focussing on optimizing global memory throughput on Fermi-class GPUs

More information

Computer Architecture 计算机体系结构. Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU. Chao Li, PhD. 李超博士

Computer Architecture 计算机体系结构. Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU. Chao Li, PhD. 李超博士 Computer Architecture 计算机体系结构 Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2017 Review Thread, Multithreading, SMT CMP and multicore Benefits of

More information

Master Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.

Master Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms. Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading

More information

Exotic Methods in Parallel Computing [GPU Computing]

Exotic Methods in Parallel Computing [GPU Computing] Exotic Methods in Parallel Computing [GPU Computing] Frank Feinbube Exotic Methods in Parallel Computing Dr. Peter Tröger Exotic Methods in Parallel Computing FF 2012 Architectural Shift 2 Exotic Methods

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization

More information

GPU for HPC. October 2010

GPU for HPC. October 2010 GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

The Bifrost GPU architecture and the ARM Mali-G71 GPU

The Bifrost GPU architecture and the ARM Mali-G71 GPU The Bifrost GPU architecture and the ARM Mali-G71 GPU Jem Davies ARM Fellow and VP of Technology Hot Chips 28 Aug 2016 Introduction to ARM Soft IP ARM licenses Soft IP cores (amongst other things) to our

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017 Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 12

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

General-purpose SIMT processors. Sylvain Collange INRIA Rennes / IRISA

General-purpose SIMT processors. Sylvain Collange INRIA Rennes / IRISA General-purpose SIMT processors Sylvain Collange INRIA Rennes / IRISA sylvain.collange@inria.fr From GPU to heterogeneous multi-core Yesterday (2000-2010) Homogeneous multi-core Discrete components Today

More information

Current Trends in Computer Graphics Hardware

Current Trends in Computer Graphics Hardware Current Trends in Computer Graphics Hardware Dirk Reiners University of Louisiana Lafayette, LA Quick Introduction Assistant Professor in Computer Science at University of Louisiana, Lafayette (since 2006)

More information

! Readings! ! Room-level, on-chip! vs.!

! Readings! ! Room-level, on-chip! vs.! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information

ECE 8823: GPU Architectures. Objectives

ECE 8823: GPU Architectures. Objectives ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Tuning CUDA Applications for Fermi. Version 1.2

Tuning CUDA Applications for Fermi. Version 1.2 Tuning CUDA Applications for Fermi Version 1.2 7/21/2010 Next-Generation CUDA Compute Architecture Fermi is NVIDIA s next-generation CUDA compute architecture. The Fermi whitepaper [1] gives a detailed

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 2018/19 A.J.Proença Data Parallelism 3 (GPU/CUDA, Neural Nets,...) (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2018/19 1 The

More information

CSCI-GA Graphics Processing Units (GPUs): Architecture and Programming Lecture 2: Hardware Perspective of GPUs

CSCI-GA Graphics Processing Units (GPUs): Architecture and Programming Lecture 2: Hardware Perspective of GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 2: Hardware Perspective of GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com History of GPUs

More information

Real-Time Rendering Architectures

Real-Time Rendering Architectures Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

Handout 3. HSAIL and A SIMT GPU Simulator

Handout 3. HSAIL and A SIMT GPU Simulator Handout 3 HSAIL and A SIMT GPU Simulator 1 Outline Heterogeneous System Introduction of HSA Intermediate Language (HSAIL) A SIMT GPU Simulator Summary 2 Heterogeneous System CPU & GPU CPU GPU CPU wants

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

Scientific Computing on GPUs: GPU Architecture Overview

Scientific Computing on GPUs: GPU Architecture Overview Scientific Computing on GPUs: GPU Architecture Overview Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder PPAM 2011 Tutorial Toruń, Poland, September 11 http://gpgpu.org/ppam11

More information

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes. HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation

More information

GPU Architecture. Michael Doggett Department of Computer Science Lund university

GPU Architecture. Michael Doggett Department of Computer Science Lund university GPU Architecture Michael Doggett Department of Computer Science Lund university GPUs from my time at ATI R200 Xbox360 GPU R630 R610 R770 Let s start at the beginning... Graphics Hardware before GPUs 1970s

More information

CS 152 Computer Architecture and Engineering. Lecture 16: Graphics Processing Units (GPUs)

CS 152 Computer Architecture and Engineering. Lecture 16: Graphics Processing Units (GPUs) CS 152 Computer Architecture and Engineering Lecture 16: Graphics Processing Units (GPUs) Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

From Brook to CUDA. GPU Technology Conference

From Brook to CUDA. GPU Technology Conference From Brook to CUDA GPU Technology Conference A 50 Second Tutorial on GPU Programming by Ian Buck Adding two vectors in C is pretty easy for (i=0; i

More information

CS377P Programming for Performance GPU Programming - II

CS377P Programming for Performance GPU Programming - II CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

Understanding Outstanding Memory Request Handling Resources in GPGPUs

Understanding Outstanding Memory Request Handling Resources in GPGPUs Understanding Outstanding Memory Request Handling Resources in GPGPUs Ahmad Lashgar ECE Department University of Victoria lashgar@uvic.ca Ebad Salehi ECE Department University of Victoria ebads67@uvic.ca

More information

Graphics Processor Acceleration and YOU

Graphics Processor Acceleration and YOU Graphics Processor Acceleration and YOU James Phillips Research/gpu/ Goals of Lecture After this talk the audience will: Understand how GPUs differ from CPUs Understand the limits of GPU acceleration Have

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

Introduction to GPU programming with CUDA

Introduction to GPU programming with CUDA Introduction to GPU programming with CUDA Dr. Juan C Zuniga University of Saskatchewan, WestGrid UBC Summer School, Vancouver. June 12th, 2018 Outline 1 Overview of GPU computing a. what is a GPU? b. GPU

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Nam Sung Kim. w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices *

Nam Sung Kim. w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices * Nam Sung Kim w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices * modern GPU architectures deeply pipelined for efficient resource sharing several buffering

More information

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

Scalable Multi Agent Simulation on the GPU. Avi Bleiweiss NVIDIA Corporation San Jose, 2009

Scalable Multi Agent Simulation on the GPU. Avi Bleiweiss NVIDIA Corporation San Jose, 2009 Scalable Multi Agent Simulation on the GPU Avi Bleiweiss NVIDIA Corporation San Jose, 2009 Reasoning Explicit State machine, serial Implicit Compute intensive Fits SIMT well Collision avoidance Motivation

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Lecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Lecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011) Lecture 7: The Programmable GPU Core Kayvon Fatahalian CMU 15-869: Graphics and Imaging Architectures (Fall 2011) Today A brief history of GPU programmability Throughput processing core 101 A detailed

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

CUDA Architecture & Programming Model

CUDA Architecture & Programming Model CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New

More information

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010 Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)

More information

By: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,

More information

High Performance Graphics 2010

High Performance Graphics 2010 High Performance Graphics 2010 1 Agenda Radeon 5xxx Product Family Highlights Radeon 5870 vs. 4870 Radeon 5870 Top-Level Radeon 5870 Shader Core References / Links / Screenshots Questions? 2 ATI Radeon

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing explicit sends & receives Which execution model control parallel

More information

From Application to Technology OpenCL Application Processors Chung-Ho Chen

From Application to Technology OpenCL Application Processors Chung-Ho Chen From Application to Technology OpenCL Application Processors Chung-Ho Chen Computer Architecture and System Laboratory (CASLab) Department of Electrical Engineering and Institute of Computer and Communication

More information

Course II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan

Course II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan Course II Parallel Computer Architecture Week 2-3 by Dr. Putu Harry Gunawan www.phg-simulation-laboratory.com Review Review Review Review Review Review Review Review Review Review Review Review Processor

More information

NVIDIA Fermi Architecture

NVIDIA Fermi Architecture Administrivia NVIDIA Fermi Architecture Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Assignment 4 grades returned Project checkpoint on Monday Post an update on your blog beforehand Poster

More information

CS516 Programming Languages and Compilers II

CS516 Programming Languages and Compilers II CS516 Programming Languages and Compilers II Zheng Zhang Spring 2015 Jan 22 Overview and GPU Programming I Rutgers University CS516 Course Information Staff Instructor: zheng zhang (eddy.zhengzhang@cs.rutgers.edu)

More information

GPU Basics. Introduction to GPU. S. Sundar and M. Panchatcharam. GPU Basics. S. Sundar & M. Panchatcharam. Super Computing GPU.

GPU Basics. Introduction to GPU. S. Sundar and M. Panchatcharam. GPU Basics. S. Sundar & M. Panchatcharam. Super Computing GPU. Basics of s Basics Introduction to Why vs CPU S. Sundar and Computing architecture August 9, 2014 1 / 70 Outline Basics of s Why vs CPU Computing architecture 1 2 3 of s 4 5 Why 6 vs CPU 7 Computing 8

More information