Antonio R. Miele Marco D. Santambrogio

1 Advanced Topics on Heterogeneous System Architectures GPU Politecnico di Milano Seminar Room A. Alario 18 November, 2015 Antonio R. Miele Marco D. Santambrogio Politecnico di Milano

2 2 Introduction First GPU released in 1999 Used for the purpose of graphics processing GPU architecture rapidly evolved providing higher computational power by means of parallelization GPU architecture evolved also to support programmability of its components

3 3 Introduction In 2006, NVIDIA introduced the GeForce 8800 GPU supporting a new programming language: CUDA (Compute Unified Device Architecture) Subsequently, the broader industry pushed for OpenCL, a vendor-neutral version of the same ideas Idea: Take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing Host CPU issues data-parallel kernels to GP-GPU for execution

4 4 Introduction CPU and GPU performance trends FLOPS FLoating-point OPerations per Second

5 5 Graphics pipeline At the beginning there was the graphics pipeline

6 Graphics pipeline 6

7 7 Vertex generation The host interface is the communication bridge between the CPU and the GPU It receives commands from the CPU and also pulls geometry information from system memory It outputs a stream of vertices in object space with all their associated information (normals, texture coordinates, per vertex color etc)

8 8 Vertex processing The vertex processing stage receives vertices from the host interface in object space and outputs them in screen space Transformations are based on matrix multiplications No new vertices are created in this stage, and no vertices are discarded (input/output has 1:1 mapping)

9 9 Vertex processing 1. Model to world coordinates 2. World to eye coordinates 3. Eye to clip coordinates Textures may be also used for advanced transformations (they provide height maps for displacement mapping)
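
The chain of transformations listed above can be sketched in a few lines. This is a minimal illustration, not the slides' own code: the helper names (`mat_mul`, `mat_vec`, `translation`, `transform_vertices`) are hypothetical, matrices are assumed to be 4x4 row-major nested lists, and an identity matrix stands in for the eye-to-clip projection to keep the numbers easy to follow.

```python
# Hypothetical helpers: 4x4 row-major matrices, homogeneous (x, y, z, w) vertices.
def mat_mul(a, b):
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

def translation(tx, ty, tz):
    """Build a translation matrix."""
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

def transform_vertices(vertices, model, view, projection):
    """Apply model->world, world->eye and eye->clip to every vertex.

    One output vertex per input vertex: the mapping is 1:1.
    """
    mvp = mat_mul(projection, mat_mul(view, model))
    return [mat_vec(mvp, v) for v in vertices]

IDENTITY = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
vertices = [[0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
model = translation(2.0, 0.0, 0.0)    # place the object in the world
view = translation(0.0, 0.0, -5.0)    # move the world in front of the eye
clip = transform_vertices(vertices, model, view, IDENTITY)
```

Each vertex is transformed independently of the others, which is what makes this stage data-parallel.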

10 10 Primitive generation The primitive assembler groups the vertices that form one primitive (e.g. a triangle)

11 11 Primitive processing Various operations are performed: Perspective division and viewport transformation Clipping
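
A rough sketch of the two operations named above, under stated assumptions: the clip volume follows the OpenGL-style convention -w <= x, y, z <= w, the screen origin is the top-left corner, and the function names are hypothetical. Real clipping cuts primitives against the volume; the test here only classifies a single vertex.

```python
def perspective_divide(clip):
    """Clip coordinates (x, y, z, w) -> normalized device coordinates."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)

def viewport_transform(ndc, width, height):
    """Map NDC in [-1, 1] to pixel coordinates (assumed top-left origin)."""
    x, y, z = ndc
    sx = (x + 1.0) * 0.5 * width
    sy = (1.0 - y) * 0.5 * height   # flip y: +y in NDC points up on screen
    return (sx, sy, z)

def inside_clip_volume(clip):
    """True if a vertex lies inside the assumed clip volume -w <= x,y,z <= w."""
    x, y, z, w = clip
    return all(-w <= c <= w for c in (x, y, z))
```

For example, `viewport_transform(perspective_divide((0.0, 0.0, 1.0, 2.0)), 640, 480)` places the vertex at the center of a 640x480 screen.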

12 12 Fragment generation Geometry information is transformed into raster information (pixels in output) Determine which pixels a primitive overlaps Aliasing and other issues
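
One common way to determine which pixels a triangle overlaps is to evaluate three edge functions at every pixel center. The sketch below assumes this technique (the slides do not say which one the hardware uses), samples at pixel centers, and counts pixels lying exactly on an edge as covered; `edge` and `rasterize` are hypothetical names.

```python
def edge(a, b, p):
    """Signed area test: which side of the edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2, width, height):
    """Return the (x, y) pixels whose centers fall inside the triangle."""
    covered = []
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)          # sample at the pixel center
            w0 = edge(v1, v2, p)
            w1 = edge(v2, v0, p)
            w2 = edge(v0, v1, p)
            # Accept both windings: all three tests must agree in sign.
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
               (w0 <= 0 and w1 <= 0 and w2 <= 0):
                covered.append((x, y))
    return covered

pixels = rasterize((0, 0), (4, 0), (0, 4), 4, 4)
```

Every pixel is tested independently, so the loop body could run for all pixels at once on parallel hardware.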

13 13 Fragment processing Assign colors to pixels Shades the fragment by simulating the interaction of light and material

14 14 Fragment processing Effects of tessellation Texture mapping Lighting and texture

15 15 Pixel operations Before the final write occurs, some fragments are rejected by the z-buffer, stencil and alpha tests Finally pixels are copied into the framebuffer (the memory space connected to the screen controller)
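
The z-buffer test can be illustrated with a small sketch. It assumes a "less than" depth comparison where smaller z means closer to the viewer, and it leaves out the stencil and alpha tests; `write_fragments` and the tuple layout of a fragment are hypothetical.

```python
def write_fragments(fragments, width, height, far=1.0):
    """Keep, for every pixel, only the color of the closest fragment.

    fragments: iterable of (x, y, z, color); smaller z is assumed closer.
    """
    depth = [[far] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    for x, y, z, c in fragments:
        if z < depth[y][x]:        # depth test: reject hidden fragments
            depth[y][x] = z
            color[y][x] = c        # final write into the framebuffer
    return color

framebuffer = write_fragments(
    [(0, 0, 0.8, "red"), (0, 0, 0.3, "blue"), (0, 0, 0.5, "green")],
    width=2, height=1)
```

The green fragment arrives last but is rejected because the blue one is already closer.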

16 16 Graphics pipeline In each stage, processing can be performed in parallel on each chunk of data (vertex, fragment, pixel, ...) PARALLELISM!

17 17 Evolution of the graphics pipeline Pre GPU Fixed function GPU Programmable GPU Unified shader processors

18 Early 90s pre GPU 18

19 19 Goals of GPUs? Exploit parallelism: pipeline parallel, data-parallel, CPU and GPU executing in parallel Specific hardware accelerators: texture filtering, rasterization, MAD, sqrt, ...

20 20 Fixed function rasterization, texture mapping, depth testing, etc. 3dfx voodoo (1996) Required separate VGA card for 2D

21 21 NVIDIA GeForce 256 (1999) All stages implemented in hardware Fixed function rasterization, texture mapping, depth testing, etc.

22 22 NVIDIA GeForce 3 (2001) Optionally bypass fixed-function with a programmable vertex shader Shader: a mini-program defining the logic of a pipeline stage A specific shading language has to be used (e.g. GLSL, the OpenGL Shading Language) Programmable

23 23 NVIDIA GeForce 6 (2004) Improved programmability in fragment shader Vertex shader can read textures Dynamic branches Programmable

24 24 Pipelined architecture NVIDIA GeForce 6 (2004) Multiple cores for each stage Programmable stages The introduction of programmable stages requires fetch and decode units

25 25 NVIDIA GeForce 7800 (2005) Vertex Fixed stages Programmable stages Fragment The introduction of programmable stages requires fetch and decode units Composite

26 26 NVIDIA GeForce 8 (2006) Ground-up architecture redesign New geometry shader after the vertex shader Introduction of the unified shader processor Geometry shader Introduction of CUDA Employment of GPU for general purpose computing: GP-GPU Programmable

27 27 NVIDIA GeForce 8800 (2006) Introduction of Issue Units for managing thread generation and scheduling Fixed stages

28 28 Why a single shader processor? Non-unified shader processors Vertex shader bottleneck Pixel shader Heavy pixel workload Vertex shader Pixel shader Problems in balancing workload in pipeline stages Heavy geometry workload

29 29 Why a single shader processor? Non-unified shader processors Unified shader Heavy pixel workload Unified shader Heavy geometry workload Optimal usage of processing resources

30 30 Unified shader processor How the unified shader processor works Three key ideas: Instantiate many shader processors Replicate ALUs inside the shader processor to enable SIMD processing Interleave the execution of many groups of SIMD threads

31 31 Example: a diffuse reflectance shader Shader programming model: fragments (or more in general work items) are processed independently The function has to be executed for each fragment

32 32 Example: a diffuse reflectance shader Shader programming model: fragments (or more in general work items) are processed independently The function has to be executed for each fragment One instruction stream per fragment
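
The shader code itself is not part of this transcript, so the following is only a sketch of what a diffuse reflectance function typically computes: the surface color scaled by the clamped dot product of the normal and the light direction. The constant `ALBEDO` replaces the texture lookup a real shader would perform, and all names are hypothetical.

```python
def dot(a, b):
    """Dot product of two 3-component vectors."""
    return sum(x * y for x, y in zip(a, b))

def diffuse_shader(normal, light_dir, albedo):
    """Shade ONE fragment; reads no data belonging to other fragments."""
    intensity = max(0.0, min(1.0, dot(normal, light_dir)))
    return tuple(intensity * c for c in albedo)

LIGHT = (0.0, 0.0, 1.0)
ALBEDO = (1.0, 0.5, 0.25)   # assumed constant instead of a texture sample
normals = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]

# The same function is executed once per fragment: one instruction
# stream per fragment, with no dependency between iterations.
shaded = [diffuse_shader(n, LIGHT, ALBEDO) for n in normals]
```

Because no iteration depends on another, the fragments can be distributed over any number of cores.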

33 Basic architecture of a modern CPU 33

34 34 Basic architecture of a modern GPU Remove components that help a single instruction stream run faster

35 35 Replicate cores Replicate cores to run several threads in parallel 2 cores process 2 instruction streams in parallel

36 36 Replicate cores Replicate cores to run several threads in parallel 4 cores process 4 instruction streams in parallel

37 37 Replicate cores Replicate cores to run several threads in parallel 16 cores process 16 instruction streams in parallel

38 38 Replicate cores Replicate cores to run several threads in parallel 16 cores process 16 instruction streams in parallel PROBLEM: many work items should be able to share the same instruction stream, but each core has its own fetch and decode unit, which is only worthwhile when the cores run different instruction streams

39 39 Replicate ALUs within the core SIMD processing

40 40 Replicate ALUs within the core SIMD processing Original compiled shader: Processing one item using scalar operations on scalar registers

41 41 Replicate ALUs within the core SIMD processing New compiled shader: Processing 8 items using vector operations on vector registers

43 43 Replicate ALUs within the core SIMD processing does not imply SIMD instructions Option 1: Explicit vector instructions Cray, Intel/AMD x86 SSE, IBM AltiVec (explicit vector length) Option 2: Scalar instructions with implicit HW vectorization HW determines instruction stream sharing across ALUs NVIDIA GeForce (SIMT warps), ATI Radeon architectures SIMT: single instruction multiple threads Split identical independent work items over multiple threads executed in lockstep An instruction stream of scalar instructions is shared among the various threads

44 44 Merging two-level replications Result: multicore architecture where each core is a SIMD architecture 16 cores, each one having 8 ALUs = 128 simultaneous threads
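
The two levels of replication can be mimicked in plain Python. This is only an emulation under assumptions: a list of 8 values stands in for a vector register, a list comprehension stands in for one vector instruction, and `vector_mul` and `run_on_core` are hypothetical names.

```python
CORES = 16   # replicated cores, from the slide above
LANES = 8    # ALUs per core, from the slide above

def vector_mul(a, b):
    """One 'vector instruction': the same multiply applied on every lane."""
    return [x * y for x, y in zip(a, b)]

def run_on_core(kd, n_dot_l):
    """Shade 8 work items at once with a single shared instruction stream."""
    assert len(kd) == LANES and len(n_dot_l) == LANES
    clamped = [max(0.0, min(1.0, v)) for v in n_dot_l]   # vector clamp
    return vector_mul(kd, clamped)                       # vector multiply

simultaneous_threads = CORES * LANES
result = run_on_core([1.0] * LANES, [0.5] * LANES)
```

The fetch and decode work is paid once per instruction and amortized over all 8 lanes.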

45 45 Branches Branches have to be accurately handled
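
The slide does not spell out how branches are handled. A commonly described approach, assumed here, is predication: when lanes that share an instruction stream disagree on a condition, both sides of the branch are executed in turn with the non-participating lanes masked off. The function below is a hypothetical emulation that also counts how many passes were needed.

```python
import math

def simt_branch(xs):
    """Emulate `if x > 0: y = sqrt(x) else: y = 0` on lockstep lanes."""
    mask = [x > 0 for x in xs]        # per-lane branch outcome
    out = [None] * len(xs)
    passes = 0
    if any(mask):                     # 'then' path, lanes with mask off idle
        passes += 1
        for i, active in enumerate(mask):
            if active:
                out[i] = math.sqrt(xs[i])
    if not all(mask):                 # 'else' path, lanes with mask on idle
        passes += 1
        for i, active in enumerate(mask):
            if not active:
                out[i] = 0.0
    return out, passes
```

With divergent input such as `[4.0, -1.0, 9.0, -2.0]` two passes are needed and half of the lanes idle in each; with uniform input a single pass suffices.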

48 48 Stalls The execution of an instruction may have a data dependency with a previous one (still running) -> stall! Access to the texture memory (100x slower than ALU instructions)

49 49 Stalls Stalls due to data dependencies have to be managed as well Memory accesses cause many stalls due to the considerably higher execution time with respect to ALU instructions (x100/x1000) Fancy caches and logic avoiding stalls in CPUs have been removed However, on GPU we can run concurrently MANY independent instruction streams

50 50 Stalls Interleave processing of many streams on a single core to hide stalls caused by latency operations
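
A back-of-the-envelope model shows why interleaving hides latency. It assumes every group alternates a fixed number of ALU cycles with a fixed stall, and that switching between groups is free; both the model and the function names are illustrative assumptions, not a description of a specific GPU.

```python
def alu_utilization(groups, compute_cycles, stall_cycles):
    """Fraction of cycles in which the ALUs have work to do."""
    busy = groups * compute_cycles
    period = compute_cycles + stall_cycles
    return min(1.0, busy / period)

def groups_to_hide(compute_cycles, stall_cycles):
    """Smallest number of interleaved groups that keeps the ALUs busy."""
    period = compute_cycles + stall_cycles
    return -(-period // compute_cycles)   # integer ceiling division
```

With 10 ALU cycles followed by a 30-cycle stall, one group keeps the ALUs busy 25% of the time, while four interleaved groups cover the stall completely. More groups need more on-chip storage for their contexts, which is the trade-off behind dimensioning the context.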

54 54 Stalls Interleaving between contexts can be managed by HW or SW or both NVIDIA/AMD Radeon GPUs approach HW schedules and manages all contexts Special on-chip storage holds work item state

55 55 How to dimension the context Maximal latency hiding ability Low latency hiding ability

56 56 Basic architecture of a modern GPU Summary: Use many slimmed down cores to run in parallel Pack cores full of ALUs (by sharing instruction streams across groups of work items) Option 1: explicit SIMD vector instructions Option 2: implicit sharing managed by HW Avoid latency stalls by interleaving execution of many groups of work items/threads/... When a group stalls, work on another group

57 57 16 streaming multiprocessors NVIDIA Fermi (2009) Each multiprocessor has 32 streaming cores SIMT, single instruction multi threads 6 memory ports 1 global scheduler

58 58 Streaming Multiprocessor (SM) 32 streaming cores 32-bit pipelined integer arithmetic unit (with support for 64-bit operations) (1 cycle) IEEE single/double-precision floating point unit providing multiply-add instructions (1 cycle) 16 load/store units Concurrent access to data in each address of the cache or DRAM (1 cycle) 4 special function units (SFUs) For transcendental functions (sine, cosine, square root, ...) Slower than other units (4 cycles) Decoupled from the dispatching units to improve performance

59 59 Streaming Multiprocessor (SM) Threads are grouped into sets of 32 threads sharing an instruction stream, called warps The SM has 2 scheduling and dispatching units Two warps are selected each clock cycle (fetch, decode and execute two warps in parallel) The register file may host up to 48 interleaved warps 1536 threads per SM, replicated across all 16 SMs of the chip!
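
The thread counts follow directly from the figures given on the Fermi slides (32 threads per warp, 48 resident warps per SM, 16 SMs). The chip-wide total below is computed from those figures rather than quoted from the slide.

```python
WARP_SIZE = 32        # threads sharing one instruction stream
WARPS_PER_SM = 48     # interleaved warps the register file can host
SM_COUNT = 16         # streaming multiprocessors on the chip

threads_per_sm = WARP_SIZE * WARPS_PER_SM      # 1536
threads_per_chip = threads_per_sm * SM_COUNT   # 24576
```

Only two warps per SM issue in a given cycle; the remaining resident warps are what the schedulers switch to when a warp stalls.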

60 60 Streaming Multiprocessor (SM) Each scheduler may execute an instruction on 16 ALU cores, 16 load/store units, or 4 SFUs Each double precision FPU instruction requires 2 ALU cores Each clock cycle the scheduler selects a warp that is ready to be executed Warps are independent -> no dependency check is required

61 61 Other features of Fermi 2-level distributed scheduler At chip level a global workload distribution engine dispatches thread blocks to various SMs At SM level each warp scheduler distributes warps Support for fast context switch (around 25 µs)

62 62 Other features of Fermi Support for concurrent kernel execution

63 63 NVIDIA Kepler (2012) Same architecture as Fermi with performance and power efficiency improvements Increased to 192 streaming cores per SM 32 special function units Improved warp scheduling (4 schedulers per SM) Other improvements Maxwell architecture (2014) presents further improvements

64 64 Cache organization CPU cores run efficiently when data is resident in cache Caches reduce latency and provide high bandwidth

65 65 Cache organization Initially GPU core was not provided with caches GPU core required a high-bandwidth connection to memory

66 66 Limited bandwidth A high-end GPU (e.g. Radeon HD 6970) has... Over twenty times (2.7 TFLOPS) the compute performance of quad-core CPU No large cache hierarchy to absorb memory requests GPU memory system is designed for throughput Wide bus (150 GB/sec) Repack/reorder/interleave memory requests to maximize use of memory bus Still, this is only 5 times the bandwidth available to CPU

67 67 Limited bandwidth If processors request data at too high a rate, the memory system cannot keep up Overcoming bandwidth limits is a common challenge for GPU-compute application developers Request data less often (instead, do more math): arithmetic intensity Might be quicker to calculate something from scratch on device instead of copying from host Fetch data from memory less often (share/reuse data across fragments): on-chip communication or storage Graphics workloads fit well with these issues More ALU operations than memory accesses
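
The notion of arithmetic intensity can be made concrete with the Radeon HD 6970 figures quoted on the previous slide (2.7 TFLOPS, 150 GB/s). The bound below is a simple roofline-style estimate added here as an illustration; the kernel figures in the comments are assumptions.

```python
PEAK_FLOPS = 2.7e12        # floating-point operations per second
BANDWIDTH = 150e9          # bytes per second over the memory bus

def balance_point():
    """FLOPs a kernel must perform per byte fetched to keep the ALUs busy."""
    return PEAK_FLOPS / BANDWIDTH

def attainable_flops(flops_per_byte):
    """Upper bound on throughput for a kernel of given arithmetic intensity."""
    return min(PEAK_FLOPS, BANDWIDTH * flops_per_byte)

# Assumed example kernel: y[i] = a * x[i] + y[i] on 4-byte floats
# performs 2 FLOPs per 12 bytes moved (read x, read y, write y).
saxpy_bound = attainable_flops(2 / 12)
```

Under this model a kernel needs about 18 FLOPs per byte fetched before the ALUs, rather than the memory bus, become the limit; the assumed kernel reaches only about 25 GFLOPS, under 1% of peak.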

68 68 Modern GPU memory hierarchy Modern GPUs are provided with local memories (not synched with main memory) texture caches (read-only) Moreover L1-L2 caches have been added In NVIDIA architectures only L2 is coherent!

69 69 Transmission cost Another relevant aspect is the CPU/GPU transmission bandwidth PCIe bandwidth: 8 GB/s in each direction Attempt to pipeline/multi-buffer uploads and downloads
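
The benefit of pipelining uploads and downloads can be estimated with a simple timing model. It assumes the 8 GB/s figure above, ideal overlap of upload, compute and download, and equal-sized chunks; the function names are hypothetical.

```python
PCIE_BYTES_PER_S = 8e9   # per direction, from the slide above

def transfer_seconds(n_bytes):
    """Time to move n_bytes across the link in one direction."""
    return n_bytes / PCIE_BYTES_PER_S

def serial_time(chunks, up, compute, down):
    """Upload, compute and download each chunk strictly one after another."""
    return chunks * (up + compute + down)

def pipelined_time(chunks, up, compute, down):
    """Overlap the three phases of different chunks (ideal multi-buffering)."""
    return (up + compute + down) + (chunks - 1) * max(up, compute, down)
```

For 4 chunks with phases of 1, 2 and 1 time units, the serial schedule takes 16 units and the pipelined one 10: once the pipeline is full, only the slowest phase determines the pace.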

70 70 NVIDIA GeForce 8 (2006) Each SM is provided with 16 KB shared memory 64 KB constant cache 8 KB texture cache Each process can access all memory locations at 86 GB/s with different latencies: Shared: 2 cycles Device: 300 cycles

71 71 NVIDIA Fermi (2009) Each SM is provided with 64 KB local shared memory used by thread blocks to cooperate Reduction of off-chip traffic Shared memory can be configured by the programmer to also obtain an L1 cache Introduced a chip-level L2 cache

72 72 NVIDIA Fermi (2009) Texture cache has been removed from L1 since not efficient for general purpose computing Fast atomic memory operations Read-modify-write, compare-and-swap Efficient sorting and building of data structures

73 73 NVIDIA Kepler (2012) Doubled Fermi cache size: 128K L1, 1536KB L2 Introduced a Read-only cache (similar to a texture cache) Added shuffle instructions

74 74 CPU/GPU interaction The CPU and GPU inside the PC work in parallel with each other There are two threads going on, one for the CPU and one for the GPU, which communicate through a command buffer: GPU reads commands from here Pending GPU commands CPU writes commands here

75 75 CPU/GPU interaction Communications between CPU and GPU are nonblocking (or asynchronous) In the CPU program below, the object is not drawn after statement A and before statement B: Statement A API call to draw object Statement B Instead, all the API call does is to add the command to draw the object to the GPU command buffer
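
The non-blocking behaviour can be imitated with a queue standing in for the command buffer and a thread standing in for the GPU. This is only an analogy with hypothetical names; the consumer thread is deliberately started after statement B so that the output order is deterministic.

```python
import queue
import threading

def run_command_buffer_demo():
    commands = queue.Queue()          # stands in for the GPU command buffer
    log = []

    def gpu_worker():
        while True:
            cmd = commands.get()      # GPU reads commands from here
            if cmd is None:           # sentinel: no more commands
                break
            log.append("GPU executes " + cmd)

    gpu = threading.Thread(target=gpu_worker)

    log.append("statement A")
    commands.put("draw object")       # the 'API call': only enqueues, returns at once
    log.append("statement B")

    gpu.start()                       # started late to keep the demo deterministic
    commands.put(None)
    gpu.join()
    return log
```

The returned log lists statement A, statement B and only then the execution of the draw command, which is the ordering the slide warns about.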

76 76 CPU/GPU interaction This leads to a number of synchronization considerations In the figure below, the CPU must not overwrite the data in the yellow block until the GPU is done with the black command, which references that data: GPU reads commands from here CPU writes commands here data

77 77 CPU/GPU interaction Modern graphics APIs implement semaphore style operations to keep this from causing problems If the CPU attempts to modify a piece of data that is being referenced by a pending GPU command, it will have to spin around waiting, until the GPU is finished with that command While this ensures correct operation it is not good for performance since there are a million other things we would rather do with the CPU instead of spinning The GPU will also drain a big part of the command buffer thereby reducing its ability to run in parallel with the CPU
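
The semaphore-style protection can be sketched with a `threading.Event` acting as a fence. The names and the dictionary standing in for the referenced data block are hypothetical, and a real driver would rather avoid the wait (for instance by renaming buffers) than block the CPU as done here.

```python
import threading

def run_fence_demo():
    data = {"color": "yellow"}        # block referenced by a pending command
    done = threading.Event()          # fence signalled by the 'GPU'
    seen = []

    def gpu_command():
        seen.append(data["color"])    # the pending command reads the data
        done.set()                    # signal: the data is no longer in use

    gpu = threading.Thread(target=gpu_command)
    gpu.start()

    done.wait()                       # CPU stalls here instead of doing useful work
    data["color"] = "overwritten"     # now safe to modify
    gpu.join()
    return seen, data["color"]
```

The command always observes the original value, at the price of the CPU sitting idle in `done.wait()`.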

78 Inlining data One way to avoid these problems is to inline all data to the command buffer and avoid references to separate data: GPU reads commands from here data CPU writes commands here However, this is also bad for performance, since we may need to copy several Mbytes of data instead of merely passing around a pointer

79 GPU readbacks The output of a GPU is a rendered image on the screen. What happens if the CPU tries to read it? Pending GPU commands CPU writes commands here GPU reads commands from here The GPU must be synchronized with the CPU, i.e. it must drain its entire command buffer, and the CPU must wait while this happens When the GPU is used for general purpose computing, the programmer has to explicitly manage memory transfers and synchronization

80 81 Other vendors We have analyzed NVIDIA GPUs so far There are many other GPU vendors, e.g. AMD, ARM, ... The overall GPU architecture is quite similar to the NVIDIA one

81 82 AMD Radeon HD 6970 (Cayman) 2010 SIMD function unit, control shared across 16 units (Up to 4 MUL-ADDs per clock) VLIW processing! Groups of 64 [fragments/vertices/etc.] share instruction stream Four clocks to execute an instruction for all fragments in a group

82 83 AMD Radeon HD 6970 (Cayman) 2010 There are 24 of these cores on the 6970: that's about 32,000 fragments!

83 84 ARM Mali-T628 (2014) Targeted for embedded computing

84 85 CPU and GPU within the same chip The trend in recent years has been to integrate the GPU within the same chip as the CPU Opportunities: Reduce offload cost Reduce memory copy/transfers Power management Steps: Remove the external communication link Define a unified memory architecture

85 86 Targeted for mobile and desktop computing Subsequent solutions integrated in Microsoft Xbox and Sony PlayStation 4 Architecture: AMD Llano (2011) CPU: AMD K10 quad-core GPU: AMD Radeon HD 6000

86 87 Intel Sandy Bridge (2011) First Intel generation (after Intel Westmere) with integrated GPU

87 88 Solutions for the mobile, desktop and server markets Architecture: Intel Skylake (2015) CPU: Intel multi-core (from m3 dual-core to i7 octa-core to Xeon E3 octa-core) GPU: Intel HD Graphics (up to 24 execution units) or Iris Graphics (up to 72 execution units)

88 89 Samsung Exynos 5422 (2014) Targeted for mobile computing Architecture: CPU: ARM big.LITTLE octa-core GPU: ARM Mali-T628 MP6

89 90 NVIDIA Tegra X1 (2015) Targeted for mobile computing Architecture: CPU: ARM big.LITTLE octa-core GPU: NVIDIA Maxwell with 256 CUDA cores

90 91 Final notes Generic many-core GPU Less space devoted to control logic and caches Large register files to support multiple thread contexts Low latency hardware managed thread switching Large number of ALUs per core with small user-managed cache per core Memory bus optimized for bandwidth ~150 GB/s bandwidth allows us to service a large number of ALUs simultaneously Simple ALUs Cache High Bandwidth bus to ALUs On Board System Memory Support for general purpose computing!

91 92 Final notes GPUs are massively parallel devices originally used for implementing the graphics pipeline GPUs can be also used for accelerating general purpose computations (GP-GPU!) Some languages have been developed (CUDA, OpenCL, C++ AMP)

92 93 Final notes An efficient GPU workload Has thousands of independent pieces of work Uses many ALUs on many cores Supports massive interleaving for latency hiding Is amenable to instruction stream sharing Maps to SIMD execution well Is compute-heavy: the ratio of math operations to memory access is high Not limited by bandwidth

93 94 References Material taken from other university courses on computer architectures, computer graphics and parallel computing GPUs.pdf f11/www/ NVIDIA website:


More information

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Analyzing a 3D Graphics Workload Where is most of the work done? Memory Vertex

More information

Mattan Erez. The University of Texas at Austin

Mattan Erez. The University of Texas at Austin EE382V (17325): Principles in Computer Architecture Parallelism and Locality Fall 2007 Lecture 12 GPU Architecture (NVIDIA G80) Mattan Erez The University of Texas at Austin Outline 3D graphics recap and

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1 Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

NVIDIA Fermi Architecture

NVIDIA Fermi Architecture Administrivia NVIDIA Fermi Architecture Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Assignment 4 grades returned Project checkpoint on Monday Post an update on your blog beforehand Poster

More information

GPUs and GPGPUs. Greg Blanton John T. Lubia

GPUs and GPGPUs. Greg Blanton John T. Lubia GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware

More information

! Readings! ! Room-level, on-chip! vs.!

! Readings! ! Room-level, on-chip! vs.! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads

More information

ASYNCHRONOUS SHADERS WHITE PAPER 0

ASYNCHRONOUS SHADERS WHITE PAPER 0 ASYNCHRONOUS SHADERS WHITE PAPER 0 INTRODUCTION GPU technology is constantly evolving to deliver more performance with lower cost and lower power consumption. Transistor scaling and Moore s Law have helped

More information

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Today Finishing up from last time Brief discussion of graphics workload metrics

More information

The Bifrost GPU architecture and the ARM Mali-G71 GPU

The Bifrost GPU architecture and the ARM Mali-G71 GPU The Bifrost GPU architecture and the ARM Mali-G71 GPU Jem Davies ARM Fellow and VP of Technology Hot Chips 28 Aug 2016 Introduction to ARM Soft IP ARM licenses Soft IP cores (amongst other things) to our

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

The NVIDIA GeForce 8800 GPU

The NVIDIA GeForce 8800 GPU The NVIDIA GeForce 8800 GPU August 2007 Erik Lindholm / Stuart Oberman Outline GeForce 8800 Architecture Overview Streaming Processor Array Streaming Multiprocessor Texture ROP: Raster Operation Pipeline

More information

Current Trends in Computer Graphics Hardware

Current Trends in Computer Graphics Hardware Current Trends in Computer Graphics Hardware Dirk Reiners University of Louisiana Lafayette, LA Quick Introduction Assistant Professor in Computer Science at University of Louisiana, Lafayette (since 2006)

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 GPGPU History Current GPU Architecture OpenCL Framework Example (and its Optimization) Alternative Frameworks Most Recent Innovations 2 1996: 3Dfx Voodoo 1 First graphical (3D) accelerator

More information

Introduction to Modern GPU Hardware

Introduction to Modern GPU Hardware The following content are extracted from the material in the references on last page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This

More information

Scheduling the Graphics Pipeline on a GPU

Scheduling the Graphics Pipeline on a GPU Lecture 20: Scheduling the Graphics Pipeline on a GPU Visual Computing Systems Today Real-time 3D graphics workload metrics Scheduling the graphics pipeline on a modern GPU Quick aside: tessellation Triangle

More information

Windowing System on a 3D Pipeline. February 2005

Windowing System on a 3D Pipeline. February 2005 Windowing System on a 3D Pipeline February 2005 Agenda 1.Overview of the 3D pipeline 2.NVIDIA software overview 3.Strengths and challenges with using the 3D pipeline GeForce 6800 220M Transistors April

More information

Bifrost - The GPU architecture for next five billion

Bifrost - The GPU architecture for next five billion Bifrost - The GPU architecture for next five billion Hessed Choi Senior FAE / ARM ARM Tech Forum June 28 th, 2016 Vulkan 2 ARM 2016 What is Vulkan? A 3D graphics API for the next twenty years Logical successor

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

GPU for HPC. October 2010

GPU for HPC. October 2010 GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Graphics Hardware. Instructor Stephen J. Guy

Graphics Hardware. Instructor Stephen J. Guy Instructor Stephen J. Guy Overview What is a GPU Evolution of GPU GPU Design Modern Features Programmability! Programming Examples Overview What is a GPU Evolution of GPU GPU Design Modern Features Programmability!

More information

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

Graphics and Imaging Architectures

Graphics and Imaging Architectures Graphics and Imaging Architectures Kayvon Fatahalian http://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/ About Kayvon New faculty, just arrived from Stanford Dissertation: Evolving real-time graphics

More information

1. Introduction 2. Methods for I/O Operations 3. Buses 4. Liquid Crystal Displays 5. Other Types of Displays 6. Graphics Adapters 7.

1. Introduction 2. Methods for I/O Operations 3. Buses 4. Liquid Crystal Displays 5. Other Types of Displays 6. Graphics Adapters 7. 1. Introduction 2. Methods for I/O Operations 3. Buses 4. Liquid Crystal Displays 5. Other Types of Displays 6. Graphics Adapters 7. Optical Discs 1 Structure of a Graphics Adapter Video Memory Graphics

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

GPU Computation Strategies & Tricks. Ian Buck NVIDIA GPU Computation Strategies & Tricks Ian Buck NVIDIA Recent Trends 2 Compute is Cheap parallelism to keep 100s of ALUs per chip busy shading is highly parallel millions of fragments per frame 0.5mm 64-bit

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 GPGPU History Current GPU Architecture OpenCL Framework Example Optimizing Previous Example Alternative Architectures 2 1996: 3Dfx Voodoo 1 First graphical (3D) accelerator for desktop

More information

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015!

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Advanced Topics on Heterogeneous System Architectures HSA foundation! Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2

More information

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27 1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution

More information

Parallel Programming on Larrabee. Tim Foley Intel Corp

Parallel Programming on Larrabee. Tim Foley Intel Corp Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This

More information

Administrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know.

Administrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know. Administrivia HW0 scores, HW1 peer-review assignments out. HW2 out, due Nov. 2. If you re having Cython trouble with HW2, let us know. Review on Wednesday: Post questions on Piazza Introduction to GPUs

More information

GPGPU introduction and network applications. PacketShaders, SSLShader

GPGPU introduction and network applications. PacketShaders, SSLShader GPGPU introduction and network applications PacketShaders, SSLShader Agenda GPGPU Introduction Computer graphics background GPGPUs past, present and future PacketShader A GPU-Accelerated Software Router

More information

GPU ARCHITECTURE Chris Schultz, June 2017

GPU ARCHITECTURE Chris Schultz, June 2017 GPU ARCHITECTURE Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE CPU versus GPU Why are they different? CUDA

More information

CS195V Week 9. GPU Architecture and Other Shading Languages

CS195V Week 9. GPU Architecture and Other Shading Languages CS195V Week 9 GPU Architecture and Other Shading Languages GPU Architecture We will do a short overview of GPU hardware and architecture Relatively short journey into hardware, for more in depth information,

More information

Vertex Shader Design I

Vertex Shader Design I The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture? This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital

More information

Introduction to CUDA

Introduction to CUDA Introduction to CUDA Overview HW computational power Graphics API vs. CUDA CUDA glossary Memory model, HW implementation, execution Performance guidelines CUDA compiler C/C++ Language extensions Limitations

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Technical Report on IEIIT-CNR

Technical Report on IEIIT-CNR Technical Report on Architectural Evolution of NVIDIA GPUs for High-Performance Computing (IEIIT-CNR-150212) Angelo Corana (Decision Support Methods and Models Group) IEIIT-CNR Istituto di Elettronica

More information

Numerical Simulation on the GPU

Numerical Simulation on the GPU Numerical Simulation on the GPU Roadmap Part 1: GPU architecture and programming concepts Part 2: An introduction to GPU programming using CUDA Part 3: Numerical simulation techniques (grid and particle

More information

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes. HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation

More information

AMD Radeon HD 2900 Highlights

AMD Radeon HD 2900 Highlights C O N F I D E N T I A L 2007 Hot Chips 19 AMD s Radeon HD 2900 2 nd Generation Unified Shader Architecture Mike Mantor Fellow AMD Graphics Products Group michael.mantor@amd.com AMD Radeon HD 2900 Highlights

More information

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources This Unit: Putting It All Together CIS 501 Computer Architecture Unit 12: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital Circuits

More information

Master Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.

Master Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms. Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Anatomy of AMD s TeraScale Graphics Engine

Anatomy of AMD s TeraScale Graphics Engine Anatomy of AMD s TeraScale Graphics Engine Mike Houston Design Goals Focus on Efficiency f(perf/watt, Perf/$) Scale up processing power and AA performance Target >2x previous generation Enhance stream

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Introduction to CUDA (1 of n*)

Introduction to CUDA (1 of n*) Agenda Introduction to CUDA (1 of n*) GPU architecture review CUDA First of two or three dedicated classes Joseph Kider University of Pennsylvania CIS 565 - Spring 2011 * Where n is 2 or 3 Acknowledgements

More information

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer Real-Time Rendering (Echtzeitgraphik) Michael Wimmer wimmer@cg.tuwien.ac.at Walking down the graphics pipeline Application Geometry Rasterizer What for? Understanding the rendering pipeline is the key

More information

Preparing seismic codes for GPUs and other

Preparing seismic codes for GPUs and other Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of

More information

Vector Processors and Graphics Processing Units (GPUs)

Vector Processors and Graphics Processing Units (GPUs) Vector Processors and Graphics Processing Units (GPUs) Many slides from: Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley TA Evaluations Please fill out your

More information