Antonio R. Miele Marco D. Santambrogio
1 Advanced Topics on Heterogeneous System Architectures: GPU. Politecnico di Milano, Seminar Room A. Alario, 18 November 2015. Antonio R. Miele, Marco D. Santambrogio, Politecnico di Milano.
2 Introduction: The first GPU was released in 1999 and was used for graphics processing. GPU architecture rapidly evolved, providing higher computational power by means of parallelization. It also evolved to support programmability of its components.
3 Introduction: In 2006, NVIDIA introduced the GeForce 8800 GPU, supporting a new programming language: CUDA (Compute Unified Device Architecture). Subsequently, the broader industry pushed for OpenCL, a vendor-neutral version of the same ideas. Idea: take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing. The host CPU issues data-parallel kernels to the GP-GPU for execution.
4 Introduction: CPU and GPU performance trends (figure; FLOPS = FLoating-point OPerations per Second).
5 Graphics pipeline: At the beginning there was the graphics pipeline.
6 Graphics pipeline
7 Vertex generation: The host interface is the communication bridge between the CPU and the GPU. It receives commands from the CPU and also pulls geometry information from system memory. It outputs a stream of vertices in object space with all their associated information (normals, texture coordinates, per-vertex color, etc.).
8 Vertex processing: The vertex processing stage receives vertices from the host interface in object space and outputs them in screen space. Transformations are based on matrix multiplications. No new vertices are created in this stage, and no vertices are discarded (input/output has a 1:1 mapping).
9 Vertex processing: 1. Model to world coordinates. 2. World to eye coordinates. 3. Eye to clip coordinates. Textures may also be used for advanced transformations (they provide height maps for displacement mapping).
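The three coordinate changes listed above can be sketched as a chain of 4x4 matrix multiplications applied to homogeneous vertices. This is only an illustration in plain Python: the helper names (`mat_mul`, `mat_vec`, `translation`, `transform_vertices`) and the example matrices are assumptions made for this sketch and do not come from the slides or from any graphics API.

```python
def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def mat_vec(m, v):
    """Multiply a 4x4 matrix by a homogeneous vertex (x, y, z, w)."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def translation(tx, ty, tz):
    """Hypothetical helper: a 4x4 translation matrix."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]


def transform_vertices(vertices, model, view, projection):
    """Apply model -> world -> eye -> clip to every vertex.

    The mapping is 1:1: one output vertex per input vertex, and each
    vertex is independent of the others, so the loop is data-parallel.
    """
    mvp = mat_mul(projection, mat_mul(view, model))
    return [mat_vec(mvp, v) for v in vertices]
```

For instance, with `model = translation(1, 0, 0)`, `view = translation(0, 0, -5)` and an identity projection, the vertex `[0, 0, 0, 1]` is mapped to `[1, 0, -5, 1]`.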
10 Primitive generation: The primitive assembler groups the vertices that form one primitive (e.g. a triangle).
11 Primitive processing: Various operations are performed: perspective division and viewport transformation; clipping.
12 Fragment generation: Geometry information is transformed into raster information (pixels in output). Determine which pixels a primitive overlaps. Aliasing and other issues arise.
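Determining which pixels a triangle overlaps can be illustrated with edge functions evaluated at pixel centers. The sketch below is one common textbook approach, not necessarily what any specific GPU implements; the function names and the choice of sampling at pixel centers are assumptions of this example.

```python
def edge(a, b, p):
    """Signed area test: which side of the edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covered_pixels(v0, v1, v2, width, height):
    """Return the (x, y) pixels whose centers fall inside the triangle."""
    pixels = []
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0 = edge(v1, v2, p)
            w1 = edge(v2, v0, p)
            w2 = edge(v0, v1, p)
            # Inside if all three tests agree in sign (either winding order).
            inside = (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
                     (w0 <= 0 and w1 <= 0 and w2 <= 0)
            if inside:
                pixels.append((x, y))
    return pixels
```

Every pixel is tested independently of the others, which is why this stage parallelizes well.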
13 Fragment processing: Assigns colors to pixels. Shades each fragment by simulating the interaction of light and material.
14 Fragment processing: Effects of tessellation. Texture mapping. Lighting and texture.
15 Pixel operations: Before the final write occurs, some fragments are rejected by the z-buffer, stencil and alpha tests. Finally, pixels are copied into the framebuffer (the memory space connected to the screen controller).
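As a rough illustration of the z-buffer test, the sketch below keeps, for every pixel, only the fragment closest to the viewer. The fragment layout `(x, y, depth, color)` and the convention that smaller depth means closer are assumptions of this example; stencil and alpha tests are left out.

```python
def write_fragments(fragments, width, height):
    """Resolve fragments into a color buffer using a depth (z-buffer) test."""
    depth = [[float("inf")] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    for x, y, z, c in fragments:
        # Reject the fragment if something closer was already written.
        if z < depth[y][x]:
            depth[y][x] = z
            color[y][x] = c
    return color
```

With three fragments on the same pixel at depths 0.8, 0.3 and 0.5, only the one at depth 0.3 survives, regardless of the order in which they arrive.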
16 Graphics pipeline: In each stage, operations can be performed in parallel on each chunk of data (vertex, fragment, pixel, ...). PARALLELISM!
17 Evolution of the graphics pipeline: Pre-GPU. Fixed-function GPU. Programmable GPU. Unified shader processors.
18 Early 90s: pre-GPU
19 Goals of GPUs? Exploit parallelism: pipeline parallel; data-parallel; CPU and GPU executing in parallel. Specific hardware accelerators: texture filtering, rasterization, MAD, sqrt, ...
20 3dfx Voodoo (1996): Fixed-function rasterization, texture mapping, depth testing, etc. Required a separate VGA card for 2D.
21 NVIDIA GeForce 256 (1999): All stages implemented in hardware. Fixed-function rasterization, texture mapping, depth testing, etc.
22 NVIDIA GeForce 3 (2001): Optionally bypass the fixed-function stage with a programmable vertex shader. Shader: a mini-program defining the logic of a pipeline stage. A specific shading language has to be used (e.g. GLSL, the OpenGL Shading Language).
23 NVIDIA GeForce 6 (2004): Improved programmability in the fragment shader. The vertex shader can read textures. Dynamic branches.
24 NVIDIA GeForce 6 (2004): Pipelined architecture. Multiple cores for each stage. Programmable stages. The introduction of programmable stages requires fetch and decode units.
25 NVIDIA GeForce 7800 (2005): Diagram of vertex, fragment and composite stages, marked as fixed or programmable. The introduction of programmable stages requires fetch and decode units.
26 NVIDIA GeForce 8 (2006): Ground-up architecture redesign. New geometry shader after the vertex shader. Introduction of the unified shader processor. Introduction of CUDA. Employment of the GPU for general-purpose computing: GP-GPU.
27 NVIDIA GeForce 8800 (2006): Introduction of issue units for managing thread generation and scheduling.
28 Why a single shader processor? Non-unified shader processors have problems balancing the workload across pipeline stages: with a heavy geometry workload the vertex shader is the bottleneck, while with a heavy pixel workload the pixel shader is.
29 Why a single shader processor? A unified shader handles both heavy pixel workloads and heavy geometry workloads, giving optimal usage of processing resources.
30 Unified shader processor: How the unified shader processor works. Three key ideas: instantiate many shader processors; replicate ALUs inside the shader processor to enable SIMD processing; interleave the execution of many groups of SIMD threads.
31-32 Example: a diffuse reflectance shader. Shader programming model: fragments (or, more generally, work items) are processed independently. The function has to be executed for each fragment, with one instruction stream per fragment.
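The shader code shown on the original slides is not part of this transcript. As a stand-in, here is a minimal sketch of a diffuse (Lambertian) shader written as an ordinary Python function: the color is the surface albedo scaled by the clamped dot product of the surface normal and the light direction. The function names and the fragment layout `(normal, albedo)` are assumptions of this example.

```python
def dot(a, b):
    """Dot product of two 3-component vectors."""
    return sum(x * y for x, y in zip(a, b))


def diffuse_shader(normal, light_dir, albedo):
    """Shade ONE fragment; both direction vectors are assumed normalized."""
    intensity = max(0.0, min(1.0, dot(normal, light_dir)))
    return tuple(channel * intensity for channel in albedo)


def shade_all(fragments, light_dir):
    """Run the same function on every fragment, each one independently."""
    return [diffuse_shader(n, light_dir, kd) for n, kd in fragments]
```

No fragment reads another fragment's result, so the calls in `shade_all` could run in any order or all at once.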
33 Basic architecture of a modern CPU
34 Basic architecture of a modern GPU: Remove the components that help a single instruction stream run faster.
35-38 Replicate cores: Replicate cores to run several threads in parallel: 2 cores process 2 instruction streams in parallel, 4 cores process 4, and 16 cores process 16. PROBLEM: many cores would run the same instruction stream, but since each core has its own fetch and decode unit, replicated cores are better used to run different instruction streams.
39-42 Replicate ALUs within the core: SIMD processing. Original compiled shader: processes one item using scalar operations on scalar registers. New compiled shader: processes 8 items using vector operations on vector registers.
43 Replicate ALUs within the core: SIMD processing does not imply SIMD instructions. Option 1: explicit vector instructions (Cray, Intel/AMD x86 SSE, IBM AltiVec; explicit vector length). Option 2: scalar instructions with implicit HW vectorization; the HW determines instruction stream sharing across ALUs (NVIDIA GeForce SIMT warps, ATI Radeon architectures). SIMT: single instruction, multiple threads. Identical independent work items are split over multiple threads executed in lockstep. An instruction stream of scalar instructions is shared among the various threads.
44 Merging two-level replications: Result: a multicore architecture where each core is a SIMD architecture. 16 cores, each with 8 ALUs = 128 simultaneous threads.
45-47 Branches: Branches have to be handled carefully.
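The transcript does not preserve the diagrams on these slides. A common way for lockstep hardware to handle a branch is predication: each path is issued for the whole group, and a per-lane mask decides which lanes keep the result. The sketch below models that idea in Python; the function name, the particular branch bodies and the instruction count are assumptions made for illustration.

```python
def simd_branch(xs):
    """Model `y = 2*x if x > 0 else -x` on lanes that execute in lockstep.

    Returns the per-lane results and the number of branch paths issued.
    """
    mask = [x > 0 for x in xs]   # one predicate bit per lane
    out = list(xs)
    paths_issued = 0
    if any(mask):                # 'then' path runs if any lane needs it
        paths_issued += 1
        out = [2 * x if m else o for x, m, o in zip(xs, mask, out)]
    if not all(mask):            # 'else' path runs if any lane needs it
        paths_issued += 1
        out = [o if m else -x for x, m, o in zip(xs, mask, out)]
    return out, paths_issued
```

When all lanes agree, only one path is issued; when they diverge, both are, and some ALUs sit idle during each pass.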
48 Stalls: The execution of an instruction may have a data dependency on a previous one that is still running -> stall! Example: an access to texture memory (100x slower than ALU instructions).
49 Stalls: Stalls due to data dependencies have to be managed as well. Memory accesses cause many stalls because their execution time is considerably higher than that of ALU instructions (100x to 1000x). The fancy caches and logic that avoid stalls in CPUs have been removed. However, on a GPU we can concurrently run MANY independent instruction streams.
50-53 Stalls: Interleave the processing of many streams on a single core to hide the stalls caused by high-latency operations.
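A back-of-the-envelope model shows why interleaving hides latency. Assume, purely for illustration, that every group alternates between a fixed number of ALU cycles and a fixed stall, and that switching between groups is free; neither assumption comes from the slides.

```python
def alu_utilization(num_groups, compute_cycles, stall_cycles):
    """Fraction of cycles the ALUs are busy with `num_groups` interleaved."""
    period = compute_cycles + stall_cycles
    return min(1.0, num_groups * compute_cycles / period)


def groups_needed(compute_cycles, stall_cycles):
    """Smallest number of groups that keeps the ALUs always busy."""
    period = compute_cycles + stall_cycles
    return -(-period // compute_cycles)  # integer ceiling division
```

With 10 compute cycles followed by a 90-cycle stall, one group keeps the ALUs busy 10% of the time, while 10 interleaved groups are enough to hide the stall completely in this idealized model.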
54 Stalls: Interleaving between contexts can be managed by HW, SW, or both. NVIDIA and AMD Radeon GPUs' approach: HW schedules and manages all contexts; special on-chip storage holds work item state.
55 How to dimension the context: configurations range from maximal latency-hiding ability to low latency-hiding ability.
56 Basic architecture of a modern GPU: Summary: use many slimmed-down cores to run in parallel. Pack cores full of ALUs (by sharing instruction streams across groups of work items). Option 1: explicit SIMD vector instructions. Option 2: implicit sharing managed by HW. Avoid latency stalls by interleaving the execution of many groups of work items/threads/...: when a group stalls, work on another group.
57 NVIDIA Fermi (2009): 16 streaming multiprocessors. Each multiprocessor has 32 streaming cores. SIMT: single instruction, multiple threads. 6 memory ports. 1 global scheduler.
58 Streaming Multiprocessor (SM): 32 streaming cores, each with a 32-bit pipelined integer arithmetic unit (with support for 64-bit operations; 1 cycle) and an IEEE single/double-precision floating-point unit providing multiply-add instructions (1 cycle). 16 load/store units: concurrent access to data at each address of the cache or DRAM (1 cycle). 4 special function units (SFUs): for transcendental functions (sine, cosine, square root, ...); slower than other units (4 cycles); decoupled from the dispatching units to improve performance.
59 Streaming Multiprocessor (SM): Threads are grouped into sets of 32 threads sharing an instruction stream, called warps. The SM has 2 scheduling and dispatching units. Two warps are selected each clock cycle (fetch, decode and execute two warps in parallel). The register file may host up to 48 interleaved warps: 1536 threads per SM! Globally, 24,576 threads!
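The thread counts follow directly from the Fermi figures given on these slides (32 threads per warp, up to 48 resident warps per SM, 16 SMs). A small worked calculation, with constant and function names chosen only for this sketch:

```python
WARP_SIZE = 32          # threads per warp
MAX_WARPS_PER_SM = 48   # warps interleaved on one SM
NUM_SMS = 16            # streaming multiprocessors on the chip


def threads_per_sm(warps=MAX_WARPS_PER_SM, warp_size=WARP_SIZE):
    """Resident threads on one streaming multiprocessor."""
    return warps * warp_size


def threads_on_chip(sms=NUM_SMS):
    """Resident threads on the whole chip."""
    return sms * threads_per_sm()
```

That is 48 x 32 = 1536 threads per SM, and 16 x 1536 = 24,576 threads across the chip.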
60 Streaming Multiprocessor (SM): Each scheduler may execute an instruction on 16 ALU cores, 16 load/store units, or 4 SFUs. Each double-precision FPU instruction requires 2 ALU cores. Each clock cycle, the scheduler selects a warp that is ready to be executed. Warps are independent -> no dependency check is required.
61 Other features of Fermi: 2-level distributed scheduler. At chip level, a global workload distribution engine dispatches thread blocks to the various SMs. At SM level, each warp scheduler distributes warps. Support for fast context switching (around 25 us).
62 Other features of Fermi: Support for concurrent kernel execution.
63 NVIDIA Kepler (2012): Same architecture as Fermi, with performance and power efficiency improvements. Increased to 192 streaming cores per SM. 32 special function units. Improved warp scheduling (4 schedulers per SM). Other improvements. The Maxwell architecture (2014) presents further improvements.
64 Cache organization: CPU cores run efficiently when data is resident in cache. Caches reduce latency and provide high bandwidth.
65 Cache organization: Initially, GPU cores were not provided with caches. GPU cores required a high-bandwidth connection to memory.
66 Limited bandwidth: A high-end GPU (e.g. Radeon HD 6970) has over twenty times (2.7 TFLOPS) the compute performance of a quad-core CPU and no large cache hierarchy to absorb memory requests. The GPU memory system is designed for throughput: a wide bus (150 GB/s); repack/reorder/interleave memory requests to maximize use of the memory bus. Still, this is only 5 times the bandwidth available to a CPU.
67 Limited bandwidth: If processors request data at too high a rate, the memory system cannot keep up. Overcoming bandwidth limits is a common challenge for GPU-compute application developers. Request data less often (instead, do more math): arithmetic intensity. It might be quicker to calculate something from scratch on the device instead of copying it from the host. Fetch data from memory less often (share/reuse data across fragments): on-chip communication or storage. Graphics workloads fit well with these constraints: more ALU operations than memory accesses.
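The arithmetic-intensity argument can be made concrete with the two figures quoted for the Radeon HD 6970 (2.7 TFLOPS, 150 GB/s). The sketch below uses a simple roofline-style bound; the function names are assumptions of this example, and the model ignores caches and on-chip reuse.

```python
def machine_balance(peak_flops, bandwidth_bytes_per_s):
    """FLOPs the chip can perform in the time it takes to fetch one byte."""
    return peak_flops / bandwidth_bytes_per_s


def attainable_flops(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Upper bound on throughput for a kernel with the given intensity."""
    return min(peak_flops, flops_per_byte * bandwidth_bytes_per_s)
```

`machine_balance(2.7e12, 150e9)` is 18 FLOPs per byte: a kernel that performs fewer than about 18 operations for each byte it fetches from memory is limited by bandwidth rather than by the ALUs. A kernel doing 1 FLOP per byte would reach at most 150 GFLOPS under this bound.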
68 Modern GPU memory hierarchy: Modern GPUs are provided with local memories (not synchronized with main memory) and texture caches (read-only). Moreover, L1 and L2 caches have been added. In NVIDIA architectures only L2 is coherent!
69 Transmission cost: Another relevant aspect is the CPU/GPU transmission bandwidth. PCIe bandwidth: 8 GB/s in each direction. Attempt to pipeline/multi-buffer uploads and downloads.
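The benefit of pipelining uploads, computation and downloads can be estimated with an idealized timing model. The sketch assumes equal-sized chunks, perfect overlap between the three stages and the 8 GB/s figure from the slide; the function names and the per-chunk timings in the usage note are made up for illustration.

```python
PCIE_BYTES_PER_S = 8e9  # 8 GB/s in each direction, as quoted on the slide


def transfer_seconds(num_bytes, bandwidth=PCIE_BYTES_PER_S):
    """Time to move `num_bytes` across the link in one direction."""
    return num_bytes / bandwidth


def sequential_time(chunks, upload, compute, download):
    """Each chunk is uploaded, processed and downloaded before the next."""
    return chunks * (upload + compute + download)


def pipelined_time(chunks, upload, compute, download):
    """Stages of different chunks overlap; the slowest stage sets the pace."""
    return (upload + compute + download) + (chunks - 1) * max(upload, compute, download)
```

For 4 chunks with 1 time unit of upload, 2 of compute and 1 of download each, the sequential schedule takes 16 units while the idealized pipelined schedule takes 10.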
70 NVIDIA GeForce 8 (2006): Each SM is provided with 16K shared memory, 64K constant cache and 8K texture cache. Each processor can access all memory locations at 86 GB/s with different latencies: shared, 2 cycles; device, 300 cycles.
71 NVIDIA Fermi (2009): Each SM is provided with 64K of local shared memory, used by thread blocks to cooperate, reducing off-chip traffic. Shared memory can be configured by the programmer to also obtain an L1 cache. Introduced a chip-level L2 cache.
72 NVIDIA Fermi (2009): The texture cache has been removed from L1 since it is not efficient for general-purpose computing. Fast atomic memory operations: read-modify-write, compare-and-swap. Efficient sorting and building of data structures.
73 NVIDIA Kepler (2012): Doubled Fermi cache size: 128K L1, 1536KB L2. Introduced a read-only cache (similar to a texture cache). Added shuffle instructions.
74 CPU/GPU interaction: The CPU and GPU inside the PC work in parallel with each other. There are two threads going on, one for the CPU and one for the GPU, which communicate through a command buffer: the CPU writes commands at one end, the GPU reads commands from the other, and the commands in between are pending GPU commands.
75 CPU/GPU interaction: Communications between CPU and GPU are non-blocking (asynchronous). In the CPU program below, the object is not drawn after statement A and before statement B: Statement A; API call to draw object; Statement B. Instead, all the API call does is add the command to draw the object to the GPU command buffer.
76 CPU/GPU interaction: This leads to a number of synchronization considerations. In the figure, the CPU must not overwrite the data in the yellow block until the GPU is done with the black command, which references that data.
77 CPU/GPU interaction: Modern graphics APIs implement semaphore-style operations to keep this from causing problems. If the CPU attempts to modify a piece of data that is being referenced by a pending GPU command, it will have to spin, waiting until the GPU is finished with that command. While this ensures correct operation, it is not good for performance, since there are a million other things we would rather do with the CPU instead of spinning. The GPU will also drain a big part of the command buffer, thereby reducing its ability to run in parallel with the CPU.
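The command-buffer behaviour described on these slides can be modelled with a simple queue. This is a toy, single-threaded sketch: the class and method names are invented for illustration and do not correspond to any real graphics API, and the "GPU" only advances when `gpu_step` is called.

```python
from collections import deque


class CommandBuffer:
    """Toy model of the queue between the CPU (producer) and GPU (consumer)."""

    def __init__(self):
        self.pending = deque()   # commands written by the CPU, not yet run
        self.executed = []       # commands the GPU has finished

    def submit(self, name, resource=None):
        """CPU side: non-blocking, only appends the command."""
        self.pending.append((name, resource))

    def gpu_step(self):
        """GPU side: consume and execute the oldest pending command."""
        if self.pending:
            self.executed.append(self.pending.popleft())

    def is_referenced(self, resource):
        """True if a pending command still refers to `resource`."""
        return any(r is resource for _, r in self.pending)

    def wait_for(self, resource):
        """CPU side: wait until no pending command uses `resource`.

        Returns how many commands the GPU had to drain in the meantime.
        """
        drained = 0
        while self.is_referenced(resource):
            self.gpu_step()
            drained += 1
        return drained
```

After submitting `clear`, `draw` (referencing a vertex list) and `present`, nothing has executed yet. Calling `wait_for` on the vertex list drains two commands before the CPU may safely overwrite the data, which is the waiting cost the slide warns about.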
78 Inlining data: One way to avoid these problems is to inline all data into the command buffer and avoid references to separate data. However, this is also bad for performance, since we may need to copy several megabytes of data instead of merely passing around a pointer.
79 GPU readbacks: The output of a GPU is a rendered image on the screen. What will happen if the CPU tries to read it? The GPU must be synchronized with the CPU, i.e. it must drain its entire command buffer, and the CPU must wait while this happens. When the GPU is used for general-purpose computing, the programmer has to explicitly manage memory transfers and synchronization.
80 Other vendors: We have analyzed NVIDIA GPUs so far. There are many other GPU vendors, e.g. AMD, ARM, ... The overall GPU architecture is quite similar to the NVIDIA one.
81 AMD Radeon HD 6970 (Cayman), 2010: SIMD function unit, control shared across 16 units (up to 4 MUL-ADDs per clock). VLIW processing! Groups of 64 [fragments/vertices/etc.] share an instruction stream. Four clocks to execute an instruction for all fragments in a group.
82 AMD Radeon HD 6970 (Cayman), 2010: There are 24 of these cores on the 6970: that's about 32,000 fragments!
83 ARM Mali 628 (2014): Targeted for embedded computing.
84 CPU and GPU within the same chip: The trend in recent years has been to integrate the GPU on the same chip as the CPU. Opportunities: reduce offload cost; reduce memory copies/transfers; power management. Steps: remove the external communication link; define a unified memory architecture.
85 AMD Llano (2011): Targeted for mobile and desktop computing. Subsequent solutions integrated in Microsoft Xbox and Sony PlayStation 4. Architecture: CPU: AMD K10 quad-core; GPU: AMD Radeon HD 6000.
86 Intel Sandy Bridge (2011): First Intel generation (after Intel Westmere) with integrated GPU.
87 Intel Skylake (2015): Solutions for the mobile, desktop and server markets. Architecture: CPU: Intel multi-core (from m3 dual-core to i7 octa-core to Xeon E3 octa-core); GPU: Intel HD Graphics (up to 24 execution units) or Iris Graphics (up to 72 execution units).
88 Samsung Exynos 5422 (2014): Targeted for mobile computing. Architecture: CPU: ARM big.LITTLE octa-core; GPU: ARM Mali-T628 MP6.
89 NVIDIA Tegra X1 (2015): Targeted for mobile computing. Architecture: CPU: ARM big.LITTLE octa-core; GPU: NVIDIA Maxwell with 256 CUDA cores.
90 Final notes: Generic many-core GPU. Less space devoted to control logic and caches. Large register files to support multiple thread contexts. Low-latency, hardware-managed thread switching. Large number of ALUs per core, with a small user-managed cache per core. Memory bus optimized for bandwidth: ~150 GB/s of bandwidth allows us to service a large number of ALUs simultaneously. Support for general-purpose computing! (Figure labels: simple ALUs; cache; high-bandwidth bus to ALUs; on-board system memory.)
91 Final notes: GPUs are massively parallel devices originally used for implementing the graphics pipeline. GPUs can also be used for accelerating general-purpose computations (GP-GPU!). Some languages have been developed (CUDA, OpenCL, C++ AMP).
92 Final notes: An efficient GPU workload has thousands of independent pieces of work (uses many ALUs on many cores; supports massive interleaving for latency hiding); is amenable to instruction stream sharing (maps well to SIMD execution); is compute-heavy: the ratio of math operations to memory accesses is high (not limited by bandwidth).
93 References: Material taken from other university courses on computer architectures, computer graphics and parallel computing: GPUs.pdf f11/www/ NVIDIA website:
Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Today Finishing up from last time Brief discussion of graphics workload metrics
More informationThe Bifrost GPU architecture and the ARM Mali-G71 GPU
The Bifrost GPU architecture and the ARM Mali-G71 GPU Jem Davies ARM Fellow and VP of Technology Hot Chips 28 Aug 2016 Introduction to ARM Soft IP ARM licenses Soft IP cores (amongst other things) to our
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationThe NVIDIA GeForce 8800 GPU
The NVIDIA GeForce 8800 GPU August 2007 Erik Lindholm / Stuart Oberman Outline GeForce 8800 Architecture Overview Streaming Processor Array Streaming Multiprocessor Texture ROP: Raster Operation Pipeline
More informationCurrent Trends in Computer Graphics Hardware
Current Trends in Computer Graphics Hardware Dirk Reiners University of Louisiana Lafayette, LA Quick Introduction Assistant Professor in Computer Science at University of Louisiana, Lafayette (since 2006)
More informationMartin Kruliš, v
Martin Kruliš 1 GPGPU History Current GPU Architecture OpenCL Framework Example (and its Optimization) Alternative Frameworks Most Recent Innovations 2 1996: 3Dfx Voodoo 1 First graphical (3D) accelerator
More informationIntroduction to Modern GPU Hardware
The following content are extracted from the material in the references on last page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This
More informationScheduling the Graphics Pipeline on a GPU
Lecture 20: Scheduling the Graphics Pipeline on a GPU Visual Computing Systems Today Real-time 3D graphics workload metrics Scheduling the graphics pipeline on a modern GPU Quick aside: tessellation Triangle
More informationWindowing System on a 3D Pipeline. February 2005
Windowing System on a 3D Pipeline February 2005 Agenda 1.Overview of the 3D pipeline 2.NVIDIA software overview 3.Strengths and challenges with using the 3D pipeline GeForce 6800 220M Transistors April
More informationBifrost - The GPU architecture for next five billion
Bifrost - The GPU architecture for next five billion Hessed Choi Senior FAE / ARM ARM Tech Forum June 28 th, 2016 Vulkan 2 ARM 2016 What is Vulkan? A 3D graphics API for the next twenty years Logical successor
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationGPU for HPC. October 2010
GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationGraphics Hardware. Instructor Stephen J. Guy
Instructor Stephen J. Guy Overview What is a GPU Evolution of GPU GPU Design Modern Features Programmability! Programming Examples Overview What is a GPU Evolution of GPU GPU Design Modern Features Programmability!
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationGraphics and Imaging Architectures
Graphics and Imaging Architectures Kayvon Fatahalian http://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/ About Kayvon New faculty, just arrived from Stanford Dissertation: Evolving real-time graphics
More information1. Introduction 2. Methods for I/O Operations 3. Buses 4. Liquid Crystal Displays 5. Other Types of Displays 6. Graphics Adapters 7.
1. Introduction 2. Methods for I/O Operations 3. Buses 4. Liquid Crystal Displays 5. Other Types of Displays 6. Graphics Adapters 7. Optical Discs 1 Structure of a Graphics Adapter Video Memory Graphics
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationGPU Computation Strategies & Tricks. Ian Buck NVIDIA
GPU Computation Strategies & Tricks Ian Buck NVIDIA Recent Trends 2 Compute is Cheap parallelism to keep 100s of ALUs per chip busy shading is highly parallel millions of fragments per frame 0.5mm 64-bit
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationMartin Kruliš, v
Martin Kruliš 1 GPGPU History Current GPU Architecture OpenCL Framework Example Optimizing Previous Example Alternative Architectures 2 1996: 3Dfx Voodoo 1 First graphical (3D) accelerator for desktop
More informationHSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015!
Advanced Topics on Heterogeneous System Architectures HSA foundation! Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2
More informationGPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27
1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution
More informationParallel Programming on Larrabee. Tim Foley Intel Corp
Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This
More informationAdministrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know.
Administrivia HW0 scores, HW1 peer-review assignments out. HW2 out, due Nov. 2. If you re having Cython trouble with HW2, let us know. Review on Wednesday: Post questions on Piazza Introduction to GPUs
More informationGPGPU introduction and network applications. PacketShaders, SSLShader
GPGPU introduction and network applications PacketShaders, SSLShader Agenda GPGPU Introduction Computer graphics background GPGPUs past, present and future PacketShader A GPU-Accelerated Software Router
More informationGPU ARCHITECTURE Chris Schultz, June 2017
GPU ARCHITECTURE Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE CPU versus GPU Why are they different? CUDA
More informationCS195V Week 9. GPU Architecture and Other Shading Languages
CS195V Week 9 GPU Architecture and Other Shading Languages GPU Architecture We will do a short overview of GPU hardware and architecture Relatively short journey into hardware, for more in depth information,
More informationVertex Shader Design I
The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationThis Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?
This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital
More informationIntroduction to CUDA
Introduction to CUDA Overview HW computational power Graphics API vs. CUDA CUDA glossary Memory model, HW implementation, execution Performance guidelines CUDA compiler C/C++ Language extensions Limitations
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationTechnical Report on IEIIT-CNR
Technical Report on Architectural Evolution of NVIDIA GPUs for High-Performance Computing (IEIIT-CNR-150212) Angelo Corana (Decision Support Methods and Models Group) IEIIT-CNR Istituto di Elettronica
More informationNumerical Simulation on the GPU
Numerical Simulation on the GPU Roadmap Part 1: GPU architecture and programming concepts Part 2: An introduction to GPU programming using CUDA Part 3: Numerical simulation techniques (grid and particle
More informationHiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.
HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation
More informationAMD Radeon HD 2900 Highlights
C O N F I D E N T I A L 2007 Hot Chips 19 AMD s Radeon HD 2900 2 nd Generation Unified Shader Architecture Mike Mantor Fellow AMD Graphics Products Group michael.mantor@amd.com AMD Radeon HD 2900 Highlights
More informationThis Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources
This Unit: Putting It All Together CIS 501 Computer Architecture Unit 12: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital Circuits
More informationMaster Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.
Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationAnatomy of AMD s TeraScale Graphics Engine
Anatomy of AMD s TeraScale Graphics Engine Mike Houston Design Goals Focus on Efficiency f(perf/watt, Perf/$) Scale up processing power and AA performance Target >2x previous generation Enhance stream
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationIntroduction to CUDA (1 of n*)
Agenda Introduction to CUDA (1 of n*) GPU architecture review CUDA First of two or three dedicated classes Joseph Kider University of Pennsylvania CIS 565 - Spring 2011 * Where n is 2 or 3 Acknowledgements
More informationReal-Time Rendering (Echtzeitgraphik) Michael Wimmer
Real-Time Rendering (Echtzeitgraphik) Michael Wimmer wimmer@cg.tuwien.ac.at Walking down the graphics pipeline Application Geometry Rasterizer What for? Understanding the rendering pipeline is the key
More informationPreparing seismic codes for GPUs and other
Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of
More informationVector Processors and Graphics Processing Units (GPUs)
Vector Processors and Graphics Processing Units (GPUs) Many slides from: Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley TA Evaluations Please fill out your
More information