Improving the efficiency of parallel architectures with regularity


1 Improving the efficiency of parallel architectures with regularity Sylvain Collange Arénaire, LIP, ENS de Lyon Dipartimento di Ingegneria dell'informazione Università di Siena February 2, 2011

2 Where I come from. Arénaire, LIP, École normale supérieure de Lyon: 11 faculty members, 17 PhD students/postdocs/staff; computer arithmetic at all levels (validated algorithms, function approximation in software, hardware operators on FPGAs). DALI, LIRMM, Université de Perpignan: 7 faculty members, 6 PhD students/postdocs; computer arithmetic (compensation, certification) and computer architecture (simulation, micro-architecture, GPUs). 2

3 Context. Accelerate compute-intensive applications: HPC (computational fluid dynamics, seismic imaging, DNA folding, phylogenetics) and multimedia (3D rendering, video, image processing). Current constraints: power consumption, and the cost of moving and retaining data. Focus on the GPU: the video-game mass market brings economies of scale and amortized R&D, yielding a high-performance parallel processor useful for much more than graphics, now in the #1 supercomputer (Tianhe-1A). 3

4 Why should we care about GPUs? Yesterday (2000-2010): homogeneous multi-core, plus discrete components (superscalar CPU, GPU, ASIC, FPGA). Today: the heterogeneous multi-core chip (Intel Sandy Bridge shipping, AMD Fusion sampling, NVIDIA Denver/Maxwell project announced), combining latency-optimized cores, throughput-optimized cores, and (configurable?) hardware accelerators. 4

6 Outline Step-by-step introduction to GPU architecture Software model Exploiting explicit parallelism Exploiting implicit regularity Exploiting data regularity Limitations of current GPUs Dynamic data deduplication Static data deduplication Ongoing work Conclusion 6

7 Software view. Programming model: threads, SPMD (Single Program, Multiple Data): one kernel code, many threads, with unspecified execution order between explicit synchronization barriers. Languages: graphics shaders (HLSL, Cg, GLSL) and GPGPU (C for CUDA, OpenCL). Example kernel: in-place scalar-vector product X ← a·X; for n threads: X[tid] ← a * X[tid]. Let's build a computer to run these programs. 7
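The SPMD contract above can be made concrete with a minimal sketch (plain Python, for illustration only): one kernel body, instantiated once per thread ID, with the in-place scalar-vector product from the slide. A sequential loop is one valid implementation of the unspecified-order semantics.

```python
def kernel(tid, X, a):
    # Each thread scales one element; the model says nothing about
    # the order in which different tids run between barriers.
    X[tid] = a * X[tid]

def launch(n, X, a):
    # A GPU runs the threads in parallel; a plain loop implements
    # the same semantics, since the threads are independent here.
    for tid in range(n):
        kernel(tid, X, a)

X = [1.0, 2.0, 3.0, 4.0]
launch(len(X), X, 2.0)
print(X)  # [2.0, 4.0, 6.0, 8.0]
```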

8 1st step: sequential processor. Machine code: move i 0 / loop: load t X[i] / mul t a t / store X[i] t / add i i+1 / branch i<n? loop. Pipeline: Fetch, Decode, Execute, Load/Store unit, Memory. Can exploit implicit instruction-level parallelism to increase throughput. Obstacle: not scalable. David Patterson (UC Berkeley): Power Wall + Memory Wall + ILP Wall = Brick Wall. 8

9 2nd step: homogeneous multi-core. Replication of the complete execution engine; software uses coarse-grained threading. Machine code: move i slice_begin / loop: load t X[i] / mul t a t / store X[i] t / add i i+1 / branch i<slice_end? loop. Each thread (T0, T1) runs on its own fetch/decode/execute/load-store pipeline. Improves throughput thanks to explicit parallelism. 9

10 3rd step: interleaved multi-threading. Time-multiplexing of processing units; software uses coarse-grained or fine-grained threading. Machine code (fine-grained threading): move i tid / loop: load t X[i] / mul t a t / store X[i] t / add i i+tcnt / branch i<n? loop. Threads T0-T3 share a single fetch/decode/execute/load-store pipeline in alternation. Hides latency thanks to explicit parallelism. 10
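The fine-grained work partition in this machine code can be sketched directly: thread tid of tcnt handles elements tid, tid+tcnt, tid+2·tcnt, ... (a sketch in plain Python; the hardware interleaves the threads rather than running them one after another).

```python
def thread_body(tid, tcnt, X, a, n):
    # The per-thread loop from the slide: start at tid, step by tcnt.
    i = tid
    while i < n:
        X[i] = a * X[i]
        i += tcnt

def run(tcnt, X, a):
    n = len(X)
    for tid in range(tcnt):  # hardware would time-multiplex these
        thread_body(tid, tcnt, X, a, n)

X = list(range(8))
run(4, X, 10)
print(X)  # [0, 10, 20, 30, 40, 50, 60, 70]
```

Every element is visited exactly once, whatever tcnt is, which is why this partition needs no coordination between threads.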

11 4th step: clustered multi-core. For each individual unit, select between replication and time-multiplexing. Examples: Sun UltraSPARC T2, AMD Bulldozer. Threads T0-T3 are grouped into clusters that share some units (fetch, decode, load-store) and replicate others (execute). An area-efficient tradeoff. 11

12 Regularity: similarity in behavior between threads. Instruction regularity: regular when, at each time step, all threads execute the same instruction (mul, then add, in lockstep); irregular when they execute different ones (mul/load, add/mul, store/sub). Control regularity: regular when all threads take the same path through switch(i), e.g. i=17 in every thread; irregular when i differs across threads (i=21, i=4, i=17, i=2). Memory regularity: regular when threads access consecutive addresses (load X[8], X[9], X[10], X[11]); irregular when accesses are scattered (load X[8], X[0], X[11], X[3]). 12

13 Final step: SIMT. Cooperative sharing of the fetch/decode and load-store units: fetch 1 instruction on behalf of several threads; read 1 memory location and broadcast it to several registers. Threads T0-T3 execute the load, mul, and store together, each on its own lane. In NVIDIA-speak this is SIMT (Single Instruction, Multiple Threads), and a convoy of synchronized threads is a warp. Improves area/power efficiency thanks to regularity, and consolidates memory transactions: less memory pressure. 13
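The transaction consolidation mentioned above can be sketched as follows: per-lane addresses falling in the same aligned segment are served by a single memory transaction. This is a simplified model; the 128-byte segment size is an assumption (it matches common GPU memory-transaction granularities, not a figure from this deck).

```python
SEGMENT = 128  # assumed transaction granularity, in bytes

def coalesce(addresses):
    # One transaction per distinct aligned segment touched by the warp.
    return sorted({addr // SEGMENT for addr in addresses})

# Regular case: 32 lanes load consecutive 4-byte words -> 1 transaction.
regular = [4 * lane for lane in range(32)]
print(len(coalesce(regular)))    # 1

# Irregular case: a large stride -> one transaction per lane.
irregular = [256 * lane for lane in range(32)]
print(len(coalesce(irregular)))  # 32
```

The same 32 loads cost either 1 or 32 transactions depending only on memory regularity, which is the whole argument for keeping accesses consecutive.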

14 So how does it differ from SIMD? Instruction regularity: SIMD vectorizes at compile-time; SIMT vectorizes at runtime. Control regularity: SIMD is software-managed (bit-masking, predication); SIMT is hardware-managed (stack, counters, multiple PCs). Memory regularity: in SIMD the compiler selects vector load-store or gather-scatter; in SIMT gather-scatter is hardware-managed, with coalescing. SIMD is static, SIMT is dynamic: a similar opposition as VLIW vs. superscalar. 14

15 Example GPU: NVIDIA GeForce GTX 580. SIMT with warps of 32 threads; 16 SMs per chip; 2 × 16 cores and 48 warps per SM, with each SM time-multiplexing its warps over its cores. Up to 16 × 48 × 32 = 24,576 threads in flight, for roughly 1.5 Tflop/s of peak single-precision throughput. 15

16 Outline Step-by-step introduction to GPU architecture Software model Exploiting explicit parallelism Exploiting implicit regularity Exploiting data regularity Limitations of current GPUs Dynamic data deduplication Static data deduplication Ongoing work Conclusion 16

17 Knowing our baseline. Design and run micro-benchmarks targeting the NVIDIA Tesla architecture: go far beyond published specifications, understand design decisions. Run power studies [CDT09]: energy measurements on micro-benchmarks, to understand power constraints. 17

18 Barra: a functional GPU simulator [CDDP10]. Modeled after NVIDIA Tesla GPUs: executes native CUDA binaries and reproduces SIMT execution. Built within the Unisim framework. Fast: uses SSE instructions and multi-threading. Accurate: bit-accurate floating-point. Produces low-level statistics and allows experimenting with architecture changes. 18

19 Primary constraint: power. Power measurements on NVIDIA GT200 [CDT09]: energy per operation (nJ) for instruction control, a multiply-add on a 32-element vector, and a 128 B load from DRAM, together with total power (W). Bottom line: with the same amount of energy, you can load 1 word from DRAM or compute 44 flops. We need to keep memory traffic low; the standard solution is caches. 19

20 On-chip memory. Conventional wisdom: the NVIDIA CUDA Programming Guide depicts GPUs as devoting far less area to caches than CPUs. Actual data: register files + caches total 3.9 MB on the NVIDIA GF110 and 7.7 MB on the AMD Cayman. At this rate, GPUs will soon catch up with CPUs. 20

21 The cost of SIMT: register wastage. SIMD code: mov i 0 / loop: vload T X[i] / vmul T a T / vstore X[i] T / add i i+16 / branch i<n? loop. SIMT code (per thread): mov i tid / loop: load t X[i] / mul t a t / store X[i] t / add i i+tcnt / branch i<n? loop. In SIMD, only the load, mul and store are vector instructions, and a single scalar copy of a, i and n lives in scalar registers next to the vector register T. In SIMT, every instruction and every register (t, a, i, n) is duplicated across all threads, even when the values are identical or follow a simple pattern. 21

22 SIMD vs. SIMT. Instruction regularity: vectorization at compile-time (SIMD) vs. at runtime (SIMT). Control regularity: software-managed bit-masking and predication (SIMD) vs. hardware-managed stack, counters, multiple PCs (SIMT). Memory regularity: compiler-selected vector load-store or gather-scatter (SIMD) vs. hardware-managed gather-scatter with coalescing (SIMT). Data regularity: scalar registers and scalar instructions (SIMD) vs. duplicated registers and duplicated operations (SIMT). 22

23 Uniform and affine vectors, at warp granularity. Uniform vector: the value does not depend on the lane ID; within a warp, v[i] = c (e.g. c = 5 in every lane). Affine vector: within a warp, v[i] = b + i·s for a base b and stride s, an affine relation between value and lane ID (e.g. b = 8, s = 1 gives 8, 9, 10, ...; s = 2 gives 8, 10, 12, ...). Generic vector: anything else. 23
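These definitions translate directly into a classifier; a sketch in plain Python, following exactly the slide's two tests (v[i] = c and v[i] = b + i·s):

```python
def classify(v):
    # Candidate base and stride from the first two lanes.
    b = v[0]
    s = v[1] - v[0] if len(v) > 1 else 0
    if all(x == b for x in v):
        return "uniform"
    if all(x == b + i * s for i, x in enumerate(v)):
        return "affine"   # base b, stride s
    return "generic"

print(classify([5, 5, 5, 5]))      # uniform
print(classify([8, 10, 12, 14]))   # affine (b=8, s=2)
print(classify([8, 11, 9, 14]))    # generic
```

A uniform vector is the s = 0 special case of an affine one, which is why two lanes suffice to represent both (a point the deck uses later for tagging).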

24 How many uniform / affine vectors? Analysis by simulation with Barra on applications from the CUDA SDK, at a granularity of 16 threads, classifying register-file inputs and outputs as uniform, affine, or generic. Result: 49% of all reads from the register file are affine, and 32% of all instructions compute on affine vectors. This is what we have in the wild; how do we capture it? 24

25 Outline Step-by-step introduction to GPU architecture Software model Exploiting explicit parallelism Exploiting implicit regularity Exploiting data regularity Limitations of current GPUs Dynamic data deduplication Static data deduplication Ongoing work Conclusion 25

26 Tagging registers. Associate a tag with each vector register: Uniform (U), Affine (A), or unknown (K); 2 lanes are enough to encode uniform and affine vectors. Propagate tags across arithmetic instructions. Example trace for the kernel loop (mov i tid / loop: load t X[i] / mul t a t / store X[i] t / add i i+tcnt / branch i<n? loop): i starts affine (A) since tid is affine; the load address is U[A] and its result t is unknown (K); mul t a t stays K; add i i+tcnt is A + U = A; the branch compares A < U. 26
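The propagation rules can be sketched as a small lattice over the tags U, A, K. This is a simplified illustration, not the exact hardware rule set (the real design also tracks strides and special cases):

```python
def add_tag(x, y):
    # U+U stays uniform; A+U and A+A stay affine (strides add);
    # anything touching K is unknown.
    if "K" in (x, y):
        return "K"
    if "A" in (x, y):
        return "A"
    return "U"

def mul_tag(x, y):
    # A*U stays affine (the stride is scaled by a uniform value),
    # but A*A is not affine in general (it has an i^2 term).
    if "K" in (x, y):
        return "K"
    if x == "A" and y == "A":
        return "K"
    if "A" in (x, y):
        return "A"
    return "U"

print(add_tag("A", "U"))  # A   (i + tcnt in the kernel loop)
print(mul_tag("U", "K"))  # K   (t = a*t, once t came from memory)
```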

27 Dynamic data deduplication: results. Detects 79% of affine input operands and 75% of affine computations (detected uniform and affine inputs and outputs, compared against the totals; the remainder is tagged unknown). 27

28 New pipeline. A new deduplication stage, in parallel with the predication control stage, tracks a tag alongside each register ID. Split the register-file banks into a scalar part and a vector part; apply fine-grained clock-gating on the vector RF and the SIMD datapaths; share execution units cooperatively. Pipeline: Fetch → Deduplication (reg ID + tag) / Decode → Branch / Mask → Read operands (scalar RF + vector RF) → Execute ... 28

29 Power savings. The vector register file is inactive for 38% of operands, and the SIMD datapaths are inactive for 24% of instructions, so both can be clock-gated. S. Collange, D. Defour, Y. Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. Euro-Par HPPC'09, 2009. 29

30 SIMD vs. SIMT. Instruction regularity: vectorization at compile-time (SIMD) vs. at runtime (SIMT). Control regularity: software-managed bit-masking and predication (SIMD) vs. hardware-managed stack, counters, multiple PCs (SIMT). Memory regularity: compiler-selected vector load-store or gather-scatter (SIMD) vs. hardware-managed gather-scatter with coalescing (SIMT). Data regularity: scalar registers and scalar instructions (SIMD) vs. data deduplication at runtime (SIMT). 30

31 Outline Step-by-step introduction to GPU architecture Software model Exploiting explicit parallelism Exploiting implicit regularity Exploiting data regularity Limitations of current GPUs Dynamic data deduplication Static data deduplication Ongoing work Conclusion 31

32 A scalarizing compiler? Programming models and architectures: sequential programming maps to a scalar CPU through a traditional compiler, and to a scalar+SIMD CPU through a vectorizing compiler; the SPMD model (CUDA, OpenCL) maps to SIMT GPUs through an SPMD compiler. The missing arrow: a scalarizing compiler that maps SPMD programs onto actual scalar+SIMD CPUs. 32

33 Still uniform and affine vectors. Uniform vectors and affine vectors with known stride enable: scalar registers and scalar operations (uniform inputs give uniform outputs, as in c = a + b); uniform branches (a uniform condition c means if(c) goes the same way in all threads); and vector load/store instead of gather/scatter for affine, aligned addresses (x = t[i] with affine i reads the consecutive locations t[8]..t[15]). 33

34 From SPMD to SIMD. Forward dataflow analysis: statically propagate tags in the dataflow graph, with tags C(v) (constant of value v), U (uniform), A(s) (affine with known stride s), and K (unknown); also propagate alignment. SPMD code in SSA form: mov i0 tid / phi i1 φ(i0, i2) / load t X[i1] / mul t a t / store X[i1] t / add i2 i1+tcnt / branch i2<n? loop. The analysis gives i0: A(1); i1: φ(A(1), ...) pending the back edge; t: K with address U[A(1)]; i2: A(1) + C(tcnt) = A(1); branch condition A(1) < C(n). 34

35 From SPMD to SIMD. At the fixed point, i1 = φ(A(1), A(1)) = A(1), so the SPMD kernel is rewritten as SIMD code: mov i0 0 / phi i1 φ(i0, i2) / vload t X[i1] / vmul t a t / vstore X[i1] t / add i2 i1+tcnt / branch i2<n? loop. The affine index becomes a scalar loop counter, and the per-thread loads and stores become unit-strided vector accesses. 35
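The static side of the analysis can be sketched with a tag domain that carries strides, mirroring the i2 = i1 + tcnt step above. A sketch under simplified assumptions (tags as tuples, add only; the real analysis also handles mul, φ-nodes, and alignment):

```python
def add(x, y):
    # Tags: ("C", v) constant, ("U",) uniform, ("A", s) affine
    # with stride s, ("K",) unknown.
    if x[0] == "K" or y[0] == "K":
        return ("K",)
    if x[0] == "A" and y[0] == "A":
        return ("A", x[1] + y[1])   # strides add
    if x[0] == "A" or y[0] == "A":
        a = x if x[0] == "A" else y
        return ("A", a[1])          # adding uniform/constant keeps the stride
    if x[0] == "C" and y[0] == "C":
        return ("C", x[1] + y[1])
    return ("U",)

tid  = ("A", 1)   # lane ID: affine, stride 1
tcnt = ("U",)     # thread count: a uniform kernel parameter
i = tid
for _ in range(3):        # loop body: i = i + tcnt, iterated to a fixed point
    i = add(i, tcnt)
print(i)  # ('A', 1)
```

The stride-1 tag survives the loop, which is exactly what licenses turning load t X[i] into a unit-strided vload.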

36 Results: instructions, registers. Benchmarks: CUDA SDK, SGEMM, Rodinia, Parboil. Static operands: a large fraction of input operands are identified as uniform or affine rather than unknown. Split register allocation: a substantial share of registers can be allocated as scalar rather than vector. 36

37 Results: memory, control. Loads are identified as uniform or unit-strided vs. other; branches as uniform vs. divergent. The compiler thus identifies at compile-time the situations that SIMT detects at runtime: uniform branches; uniform and unit-strided loads & stores; scalar instructions and registers. 37

38 Static vs. dynamic data deduplication. Static deduplication: allows simpler hardware; governs register allocation and instruction scheduling; enables constant propagation; applies to memory (call stack ...). Dynamic deduplication: preserves binary compatibility; captures dynamic behavior; unaffected by (future) software complexity. 38

39 Outline Step-by-step introduction to GPU architecture Software model Exploiting explicit parallelism Exploiting implicit regularity Exploiting data regularity Limitations of current GPUs Dynamic data deduplication Static data deduplication Ongoing work Conclusion 39

40 My register file is full of holes! Static partitioning of the RF between warps is space-inefficient: across the memory and compute phases of warps 0-2, many registers are dead, and many live ones hold affine or uniform values. We need more dynamic, flexible register allocation (AMD's is more flexible, but still static), and it must take affine registers into account. Solution: play Tetris? Or use caches. 40

41 Ongoing work: affine caches. As a level-1 cache, an extension to the RF: 1 cache line = 1 spilled vector register, giving space-efficient storage of call stacks. A unified data array is indexed through separate tags for generic and for affine lines, and one line can store 1 generic vector or 16 affine vectors; lookup, fill, and write-back work as in a conventional cache. Inspired by zero-content augmented caches [DPS09]. 41
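The space saving comes from representing an affine line by its (base, stride) pair instead of all lane values. A sketch of that compression (plain Python; the 16-lane width and the "2 words instead of 16" ratio are illustrative assumptions matching the slide's line sizes):

```python
WARP = 16  # lanes per stored vector, as on the slide

def compress(v):
    # An affine vector collapses to (base, stride): 2 words, not 16.
    b, s = v[0], v[1] - v[0]
    if v == [b + i * s for i in range(len(v))]:
        return ("affine", b, s)
    return ("generic", list(v))

def expand(entry):
    # Reconstruct the full vector from either representation.
    if entry[0] == "affine":
        _, b, s = entry
        return [b + i * s for i in range(WARP)]
    return entry[1]

v = [100 + 4 * i for i in range(WARP)]  # e.g. consecutive addresses
e = compress(v)
print(e)  # ('affine', 100, 4)
assert expand(e) == v
```

Since most spilled values in the measurements above are affine, one physical line's worth of storage can hold many such compressed registers, which is the intuition behind "1 generic vector or 16 affine vectors".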

42 Affine cache as L0: RF replacement. An affine ALU, fed from an affine array with its own tags, handles most control flow and address computations; the vector ALUs/FPUs, each with a per-lane L0 array, do the heavy lifting. With SIMT execution, only 1 tag lookup is needed per operand, since the translation from architectural to micro-architectural register IDs is the same for all lanes. Open question: coordinate warp scheduling with the cache replacement policy? 42

43 Bottom line Introduced regularity: instruction, control, memory, data Current GPGPU applications exhibit significant data regularity Both static and dynamic techniques can exploit it Enables power savings Less data storage and movement 43

44 Future challenges. SIMT bridges the gap between superscalar and SIMD: a smooth, dynamic tradeoff between regularity and efficiency (flops/W). On the efficiency-vs-regularity plane, superscalar serves the most irregular workloads (transaction processing), SIMD the most regular (dense linear algebra), and SIMT sits in between (computer graphics). Future microarchitectural improvements: dynamic deduplication, decoupled scalar-SIMT, affine caches? Related directions: dynamic warp formation [FSYA07], dynamic warp subdivision [MTS10], NIMT [Glew09]. 44

45 References
[Glew09] Andy Glew. Coherent vector lane threading. Berkeley ParLab Seminar, 2009.
[CDT09] Sylvain Collange, David Defour, Arnaud Tisserand. Power consumption of GPUs from a software perspective. ICCS 2009.
[CDDP10] Sylvain Collange, Marc Daumas, David Defour, David Parello. Barra: a parallel functional simulator for GPGPU. MASCOTS 2010.
[CDZ09] Sylvain Collange, David Defour, Yao Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. Euro-Par HPPC'09, 2009.
[FSYA07] Wilson W. L. Fung, Ivan Sham, George Yuan, Tor M. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. MICRO 2007.
[MTS10] Jiayuan Meng, David Tarjan, Kevin Skadron. Dynamic warp subdivision for integrated branch and memory divergence tolerance. ISCA 2010.
[DPS09] Julien Dusser, Thomas Piquet, André Seznec. Zero-content augmented caches. ICS 2009.


More information

Simultaneous Branch and Warp Interweaving for Sustained GPU Performance

Simultaneous Branch and Warp Interweaving for Sustained GPU Performance Simultaneous Branch and Warp Interweaving for Sustained GPU Performance Nicolas Brunie Sylvain Collange Gregory Diamos by Ravi Godavarthi Outline Introduc)on ISCA'39 (Interna'onal Society for Computers

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 2018/19 A.J.Proença Data Parallelism 3 (GPU/CUDA, Neural Nets,...) (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2018/19 1 The

More information

Lecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Lecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011) Lecture 7: The Programmable GPU Core Kayvon Fatahalian CMU 15-869: Graphics and Imaging Architectures (Fall 2011) Today A brief history of GPU programmability Throughput processing core 101 A detailed

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Spring Prof. Hyesoon Kim

Spring Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim 2 Warp is the basic unit of execution A group of threads (e.g. 32 threads for the Tesla GPU architecture) Warp Execution Inst 1 Inst 2 Inst 3 Sources ready T T T T One warp

More information

Lecture 27: Multiprocessors. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs

Lecture 27: Multiprocessors. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Lecture 27: Multiprocessors Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood programming model

More information

Dynamic detection of uniform and affine vectors in GPGPU computations

Dynamic detection of uniform and affine vectors in GPGPU computations Dynamic detection of uniform and affine vectors in GPGPU computations Sylvain Collange, David Defour, Yao Zhang To cite this version: Sylvain Collange, David Defour, Yao Zhang. Dynamic detection of uniform

More information

Computer Architecture: SIMD and GPUs (Part II) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: SIMD and GPUs (Part II) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: SIMD and GPUs (Part II) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 19: SIMD

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

Processors. Young W. Lim. May 12, 2016

Processors. Young W. Lim. May 12, 2016 Processors Young W. Lim May 12, 2016 Copyright (c) 2016 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version

More information

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture

More information

Stack-less SIMT reconvergence at low cost

Stack-less SIMT reconvergence at low cost Stack-less SIMT reconvergence at low cost Sylvain Collange To cite this version: Sylvain Collange. Stack-less SIMT reconvergence at low cost. 2011. HAL Id: hal-00622654 https://hal.archives-ouvertes.fr/hal-00622654

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

ECE 8823: GPU Architectures. Objectives

ECE 8823: GPU Architectures. Objectives ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming Lecturer: Alan Christopher Overview GP-GPU: What and why OpenCL, CUDA, and programming GPUs GPU Performance

More information

Exotic Methods in Parallel Computing [GPU Computing]

Exotic Methods in Parallel Computing [GPU Computing] Exotic Methods in Parallel Computing [GPU Computing] Frank Feinbube Exotic Methods in Parallel Computing Dr. Peter Tröger Exotic Methods in Parallel Computing FF 2012 Architectural Shift 2 Exotic Methods

More information

COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture

COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State University (SDSU) Posted:

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

Introduction to GPU programming with CUDA

Introduction to GPU programming with CUDA Introduction to GPU programming with CUDA Dr. Juan C Zuniga University of Saskatchewan, WestGrid UBC Summer School, Vancouver. June 12th, 2018 Outline 1 Overview of GPU computing a. what is a GPU? b. GPU

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Soft GPGPUs for Embedded FPGAS: An Architectural Evaluation

Soft GPGPUs for Embedded FPGAS: An Architectural Evaluation Soft GPGPUs for Embedded FPGAS: An Architectural Evaluation 2nd International Workshop on Overlay Architectures for FPGAs (OLAF) 2016 Kevin Andryc, Tedy Thomas and Russell Tessier University of Massachusetts

More information

Continuum Computer Architecture

Continuum Computer Architecture Plenary Presentation to the Workshop on Frontiers of Extreme Computing: Continuum Computer Architecture Thomas Sterling California Institute of Technology and Louisiana State University October 25, 2005

More information

DATA-LEVEL PARALLELISM IN VECTOR, SIMD ANDGPU ARCHITECTURES(PART 2)

DATA-LEVEL PARALLELISM IN VECTOR, SIMD ANDGPU ARCHITECTURES(PART 2) 1 DATA-LEVEL PARALLELISM IN VECTOR, SIMD ANDGPU ARCHITECTURES(PART 2) Chapter 4 Appendix A (Computer Organization and Design Book) OUTLINE SIMD Instruction Set Extensions for Multimedia (4.3) Graphical

More information

Vector Processors. Kavitha Chandrasekar Sreesudhan Ramkumar

Vector Processors. Kavitha Chandrasekar Sreesudhan Ramkumar Vector Processors Kavitha Chandrasekar Sreesudhan Ramkumar Agenda Why Vector processors Basic Vector Architecture Vector Execution time Vector load - store units and Vector memory systems Vector length

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Improving Performance of Machine Learning Workloads

Improving Performance of Machine Learning Workloads Improving Performance of Machine Learning Workloads Dong Li Parallel Architecture, System, and Algorithm Lab Electrical Engineering and Computer Science School of Engineering University of California,

More information

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for

More information

CS 179 Lecture 4. GPU Compute Architecture

CS 179 Lecture 4. GPU Compute Architecture CS 179 Lecture 4 GPU Compute Architecture 1 This is my first lecture ever Tell me if I m not speaking loud enough, going too fast/slow, etc. Also feel free to give me lecture feedback over email or at

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

Programmer's View of Execution Teminology Summary

Programmer's View of Execution Teminology Summary CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 28: GP-GPU Programming GPUs Hardware specialized for graphics calculations Originally developed to facilitate the use of CAD programs

More information

Graphics Processor Acceleration and YOU

Graphics Processor Acceleration and YOU Graphics Processor Acceleration and YOU James Phillips Research/gpu/ Goals of Lecture After this talk the audience will: Understand how GPUs differ from CPUs Understand the limits of GPU acceleration Have

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD)

Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD) Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G (TUDo( TUDo), Dominik Behr (AMD) Conference on Parallel Processing and Applied Mathematics Wroclaw, Poland, September 13-16, 16, 2009 www.gpgpu.org/ppam2009

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng

More information

! Readings! ! Room-level, on-chip! vs.!

! Readings! ! Room-level, on-chip! vs.! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III)

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III) EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III) Mattan Erez The University of Texas at Austin EE382: Principles of Computer Architecture, Fall 2011 -- Lecture

More information

Introduction to CUDA (1 of n*)

Introduction to CUDA (1 of n*) Agenda Introduction to CUDA (1 of n*) GPU architecture review CUDA First of two or three dedicated classes Joseph Kider University of Pennsylvania CIS 565 - Spring 2011 * Where n is 2 or 3 Acknowledgements

More information

Mattan Erez. The University of Texas at Austin

Mattan Erez. The University of Texas at Austin EE382V (17325): Principles in Computer Architecture Parallelism and Locality Fall 2007 Lecture 12 GPU Architecture (NVIDIA G80) Mattan Erez The University of Texas at Austin Outline 3D graphics recap and

More information

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu

More information