Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688: Graphics Processors. Copyright by Alaa Alameldeen, 2018.

Why Graphics Processors?
- Graphics programs have different characteristics from general-purpose programs: highly parallel operations on lots of pixels
- Graphics algorithms have common kernels for texture, edge detection, rasterization, etc.
  - Rasterization: converting a geometric description (e.g., points, lines) into dots/pixels on the screen
- Graphics-intensive applications (e.g., games) need a lot of compute power
- Some graphics processors are expanding to general-purpose applications; these are called GPGPUs (General-Purpose GPUs)

Two Graphics Processor Types
- Vertex processors
  - Operate on the vertices of primitives (e.g., points, lines, triangles)
  - Operations: transforming coordinates to screen space, setting up lighting and texture parameters
  - Designed for low-latency, high-precision math operations
- Pixel-fragment processors
  - Operate on rasterizer output: fill the interiors of primitives and interpolate per-vertex parameters
  - Designed for high-latency, lower-precision texture filtering
- Typically need many more pixel-fragment processors than vertex processors (pixels are more numerous than vertices)
- The two types have been converging due to the need for programming generality

Nvidia Tesla Design Goals
- A unified processor architecture that executes both vertex programs and pixel-fragment shader programs
  - Enables dynamic load balancing of varying workloads
  - Allows sharing of expensive hardware (e.g., texture units)
  - Facilitates parallel computing capability
  - Downside: efficient load balancing between different shader types is more difficult
- Architectural scalability
- High performance
- Low power
- Area efficiency

Tesla Architecture
- Based on a scalable Streaming Processor Array (SPA), as in the GeForce 8800 GPU (paper figure 1)
  - Eight independent processing units: Texture/Processor Clusters (TPCs)
  - Each TPC has 16 streaming processors (SPs), organized as two 8-processor streaming multiprocessors (SMs)
  - The SPA performs all of the GPU's calculations
- Memory system:
  - External DRAM control
  - Fixed-function raster operation processors (ROPs) that perform color and depth frame-buffer operations directly on memory
  - Level-2 cache
  - Interconnection network

Tesla Unit Functions
- Host interface unit
  - Communicates with the host CPU and responds to CPU commands
  - Fetches data from system memory
  - Checks command consistency
  - Performs context switching
- Input assembler
  - Collects geometric primitives (points, lines, triangles, strips) and fetches the associated vertex input attribute data
- Work distribution units
  - Include vertex, pixel, and compute work distribution
  - Forward the input assembler's output to the processors

Texture/Processor Cluster (TPC)
- The number of TPCs scales from 1 to 8 or more, depending on required performance
- TPC components (paper figure 2):
  - Geometry controller
    - Maps the logical graphics vertex pipeline onto the physical SMs
    - Manages dedicated on-chip input and output vertex attribute storage, forwarding contents as needed
  - SM controller (SMC)
  - Two streaming multiprocessors (SMs)
  - Texture unit
    - Processes one group of four threads per cycle
    - Texture instructions: sources are texture coordinates, outputs are filtered samples
    - One texture unit serves each pair of SMs

Streaming Multiprocessor (SM)
- A unified graphics and computing multiprocessor
  - Executes vertex, geometry, and pixel-fragment shader programs, as well as parallel computing programs
- SM components (paper figure 3):
  - Eight streaming processors (SPs), each containing a scalar multiply-add (MAD) unit
  - Two special-function units (SFUs), each containing four FP multipliers
  - Multithreaded instruction fetch and issue unit (MT issue)
  - Instruction cache
  - Read-only constant cache (ROM)
  - 16 KB of read/write shared memory, which holds graphics input buffers or shared data for parallel computing
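As a side note (not part of the slides), several of these per-SM resources are visible to CUDA programs at runtime. The sketch below queries them through the standard cudaGetDeviceProperties call; the printed fields are real cudaDeviceProp members, but the reported values depend on the GPU in the machine.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query the first GPU

    printf("Device:                 %s\n", prop.name);
    printf("SM count:               %d\n", prop.multiProcessorCount);
    printf("Warp size:              %d threads\n", prop.warpSize);
    printf("Shared memory / block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("Max threads / block:    %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```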

SM Multithreading
- SMs are hardware multithreaded
  - Each SM manages and executes up to 768 (24 x 32) concurrent threads in hardware
- Graphics threads (e.g., vertex or pixel shaders) define how to process a vertex or a pixel
- Parallel computing threads use CUDA (Compute Unified Device Architecture) kernels, written in C, to specify how one thread computes results
- Each SM thread has its own thread execution state and can execute an independent code path
- Support for fine-grain parallelism
  - Lightweight thread creation and zero-overhead thread scheduling
  - Concurrent threads can synchronize at barriers quickly, with a single SM instruction
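To make the "one thread computes one result" idea concrete, here is a minimal CUDA kernel sketch (not from the slides; the kernel and array names are invented for illustration). The body is written from the perspective of a single thread; the hardware runs hundreds of such threads concurrently on each SM.

```cuda
// Each thread computes exactly one element of c = a + b.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's unique index
    if (i < n)                                      // guard threads past the end
        c[i] = a[i] + b[i];
}
```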

Single-Instruction, Multiple-Thread (SIMT)
- The Tesla SM creates, manages, schedules, and executes threads in warps, each containing 32 threads
  - Each SM manages a pool of 24 warps (768 total threads)
  - All threads in a warp are of the same type (pixel, geometry, vertex, or compute)
  - Warp threads start together at the same program address but can branch and execute independently
- The SIMT multithreaded instruction unit selects a warp that is ready to execute every cycle and issues that warp's next instruction (paper figure 4)
  - The instruction is broadcast to all of the warp's active threads
  - Individual threads can be inactive due to independent branching or predication
- The SM maps warp threads to SP cores; each thread has its own instruction address and register state
- Execution is very efficient when all warp threads follow the same execution path; otherwise, branch divergence causes serialization
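A small sketch of branch divergence (illustrative, not from the slides; the kernel name is invented). Because the condition depends on a thread's lane within its warp, the warp must execute both sides of the branch serially, masking off the inactive threads on each side:

```cuda
__global__ void divergent(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads in the same 32-thread warp take different paths here, so the
    // warp runs both paths one after the other, with the non-participating
    // threads masked off on each path (branch divergence -> serialization).
    if (threadIdx.x % 2 == 0)
        out[i] = 1.0f;   // even lanes
    else
        out[i] = -1.0f;  // odd lanes

    // A condition that is uniform across the warp (e.g., one based only on
    // blockIdx.x) would not diverge and would cost only the branch itself.
}
```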

SIMT vs. SIMD
- SIMD applies one instruction to multiple data lanes (e.g., vector instructions)
- SIMT applies one instruction to multiple independent threads in parallel
- SIMT enables programmers to write thread-level parallel code for independent threads, as well as data-parallel code for coordinated threads
  - Data-parallel code has much better performance
- Conversely, SIMD requires software to manually coalesce data into vectors and to manually manage divergence

Memory Access
- Three different memory spaces:
  - Local memory: per-thread private data, implemented in external DRAM
  - Shared memory: low-latency access to data shared within the same SM
  - Global memory: data shared by all threads, implemented in external DRAM
- Memory instructions:
  - Loads: load-global, load-shared, load-local
  - Stores: store-global, store-shared, store-local
- A fast barrier synchronization instruction within the SM
- Atomic operations
- To improve memory bandwidth, local and global loads/stores coalesce individual parallel thread accesses from the same warp into fewer memory-block accesses
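The sketch below (not from the slides; the kernel and variable names are invented) touches all three ideas at once: a coalesced global load, a shared-memory reduction with barrier synchronization, and an atomic update of a global result. Note the caveat in the final comment: floating-point atomicAdd on global memory requires a GPU newer than the G80 generation described in this lecture.

```cuda
// Each 256-thread CTA sums its chunk of the input array.
__global__ void blockSum(const float *in, float *total, int n) {
    __shared__ float partial[256];          // on-chip shared memory, per CTA

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced global load: consecutive threads in a warp read consecutive
    // addresses, so the hardware merges them into few memory-block accesses.
    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                        // fast barrier within the CTA

    // Tree reduction in low-latency shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // One atomic per CTA combines the partial sums in global memory.
    // (Float atomicAdd needs compute capability 2.0+, newer than G80.)
    if (threadIdx.x == 0)
        atomicAdd(total, partial[0]);
}
```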

Throughput Applications
- Characteristics of throughput applications:
  - Extensive data parallelism: many computations on independent data
  - Modest task parallelism: groups of threads execute the same program, while different groups can run different programs
  - Intensive FP arithmetic
  - Latency tolerance: throughput, not latency, is the performance metric
  - Streaming data flow: little reuse, high bandwidth
  - Modest inter-thread synchronization and communication
- Graphics and parallel computing applications show similar behavior
- Discuss: GPUs vs. CPUs

Parallel Computing Architecture
- Data-parallel problem decomposition
  - Need to partition large problems into smaller problems that can be solved in parallel (paper figure 5)
  - Need load balancing
- Parallel granularity is based on the thread, the CTA, or the grid (paper figure 6)
  - Sharing: local (private per thread), shared (per CTA), global (per grid)
- Cooperative thread array (CTA), also called a thread block
  - An array of concurrent threads that execute the same program and cooperate to compute a result
  - A CTA consists of 1 to 512 concurrent threads, each with a unique ID
  - Threads can be arranged in 1D, 2D, or 3D shapes, so a thread ID can have 1, 2, or 3 dimension indices
  - Each SM executes up to 8 CTAs concurrently
- CUDA programming model (paper figure 8); see the launch sketch below
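To connect the grid/CTA/thread hierarchy to source code, here is a minimal host-side launch sketch (not from the slides; the kernel and variable names are invented). The <<<grid, block>>> launch configuration chooses the number of CTAs in the grid and the number of threads per CTA:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // grid-wide thread ID
    if (i < n) data[i] *= alpha;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    dim3 block(256);                         // threads per CTA (<= 512 on Tesla)
    dim3 grid((n + block.x - 1) / block.x);  // enough CTAs to cover n elements
    scale<<<grid, block>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();                 // wait for the grid to finish
    cudaFree(d_data);
    return 0;
}
```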

Reading Assignment
- John Mellor-Crummey and Michael Scott, "Synchronization Without Contention," ACM Transactions on Computer Systems, 1991 (Review)
- Thomas Anderson, "The Performance of Spin-Lock Alternatives," IEEE Transactions on Parallel and Distributed Systems, 1990 (Skim)
- Ravi Rajwar and James Goodman, "Speculative Lock Elision: Enabling Highly-Concurrent Multithreaded Execution," MICRO 2001 (Skim)
- Project progress reports are due Monday