Lecture 25: Threads and GPUs
CSE 30321

Suggested Readings
- H&P: Chapter 7, especially 7.1-7.8 (over the next 2 weeks)
- Introduction to Parallel Computing: https://computing.llnl.gov/tutorials/parallel_comp/
- POSIX Threads Programming: https://computing.llnl.gov/tutorials/pthreads/
- How GPUs Work: www.cs.virginia.edu/~gfx/papers/pdfs/59_howthingswork.pdf

Course goals served by this lecture:
- Explain and articulate why modern microprocessors now have more than one core, and how software must adapt.
- Use knowledge about the underlying hardware to write more efficient software.
(Related course themes: processor components, multicore processors and programming, processor comparison, writing more efficient code, the right HW for the right application, HLL code translation.)

Recap: L23 - Intro to Parallel Processing
- Types of multi-processors: room-level vs. on-chip
- Types of machines:
  - SISD (uniprocessor; pipelined, superscalar)
  - SIMD (more today)
  - MISD (not really used)
  - MIMD (centralized & distributed shared memory)
- More detail about MIMD
- How parallelization impacts performance
- What can impede the performance of parallel machines?
  - Coherency, latency, contention, reliability, languages, algorithms

Recap: L24 - Parallel Processing on Multi-Core
- Simple quantitative examples of how (i) reliability, (ii) communication overhead, and (iii) load balancing impact the performance of parallel systems.

Today: L25 - Threads and GPUs
- All issues covered in L23 & L24 apply to on-chip computing systems as well as room-level computing systems.
- Technology drive to multi-core computing: why did clock rates flatten? Power!
- That said, two models that lend themselves especially well to on-chip parallelism deserve special discussion:
  - Threads
  - GPU-based computing

Threads First
- Outline of the threads discussion:
  - What's a thread? (How many people have heard of / used threads before?)
  - Coupling to architecture
    - Example: scheduling threads, assuming different architectural models
  - Programming models
  - Why intimate knowledge about the HW is important

Processes vs. Threads
- Process (created by the OS; much overhead). A process owns:
  - Process ID, process group ID, user ID
  - Working directory
  - Program instructions
  - Registers, stack space, heap
  - File descriptors
  - Shared libraries, shared memory
  - Semaphores, pipes, etc.
- Thread (can exist within a process; shares the process's resources). A thread duplicates only the bare essentials needed to execute code on the chip:
  - Program counter
  - Stack pointer
  - Registers
  - Scheduling priority
  - Set of pending, blocked signals
  - Thread-specific data
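The process/thread split above maps directly onto the POSIX Threads API from the suggested readings. A minimal sketch in C, assuming a toy 4-thread workload (the names worker and shared_data are illustrative, not from the lecture):

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    int shared_data[NUM_THREADS];     /* globals/heap: shared by all threads */

    /* Each thread gets only the "bare essentials": its own PC, stack, and
       registers. Everything else (globals, heap, files) is the process's. */
    static void *worker(void *arg) {
        long id = (long)arg;          /* arg and id live on this thread's stack */
        shared_data[id] = (int)(id * id);
        return NULL;
    }

    int main(void) {
        pthread_t tid[NUM_THREADS];
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_join(tid[t], NULL);   /* wait for each thread to finish */
        for (int t = 0; t < NUM_THREADS; t++)
            printf("thread %d wrote %d\n", t, shared_data[t]);
        return 0;
    }

Build with cc -pthread. Note how little per-thread state had to be created: only the start routine and its argument.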

Multi-threading (Parts A, B)
- Idea: perform multiple threads of execution in parallel
  - Replicate registers, PC, etc.
  - Fast switching between threads
- Flavors:
  - Fine-grain multithreading
    - Switch threads after each cycle
    - Interleave instruction execution
    - If one thread stalls, others are executed
  - Coarse-grain multithreading
    - Only switch on a long stall (e.g., an L2-cache miss)
    - Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)
  - SMT (Simultaneous Multi-Threading)
    - Especially relevant for superscalar machines

Coarse MT vs. Fine MT vs. SMT (Parts C-F)

Mixed Models
- Threaded systems and multi-threaded programs are not specific to multi-core chips.
  - In other words, you could imagine a multi-threaded uniprocessor too.
- However, you could have an N-core chip where:
  - N threads of a single process run on N cores, or
  - N processes run on N cores, and each core splits its time between M threads.

Or you can do both (real-life examples follow).

Writing threaded programs for the supporting HW (Part G)
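One concrete way software can exploit knowledge of the supporting HW is to pin threads to cores. A hedged sketch of the "N threads of one process on N cores" model, using Linux's non-portable pthread_attr_setaffinity_np (glibc-specific; everything else is standard pthreads, and worker is again an illustrative name):

    #define _GNU_SOURCE          /* for the *_np affinity calls (Linux/glibc) */
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *worker(void *arg) {
        printf("thread %ld running on core %d\n", (long)arg, sched_getcpu());
        return NULL;
    }

    int main(void) {
        long n = sysconf(_SC_NPROCESSORS_ONLN);   /* N = cores available */
        if (n > 64) n = 64;
        pthread_t tid[64];
        for (long c = 0; c < n; c++) {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET((int)c, &set);                /* thread c -> core c */
            pthread_attr_t attr;
            pthread_attr_init(&attr);
            pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
            pthread_create(&tid[c], &attr, worker, (void *)c);
            pthread_attr_destroy(&attr);
        }
        for (long c = 0; c < n; c++)
            pthread_join(tid[c], NULL);
        return 0;
    }

On non-Linux systems the affinity call must be replaced with the platform's equivalent; without it, the OS scheduler decides the thread-to-core mapping.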

Impact of modern processing principles (lots of state...)
- User: state used for application execution
- Supervisor: state used to manage user state
- Machine: state that configures the system
- Transient: state used during instruction execution
- Access-Enhancing: state used to simplify translation of other state names
- Latency-Enhancing: state used to reduce latency to other state values

Impact of modern processing principles (total state vs. time)
[Chart: estimated state in k bits, log scale from 0.01 to 100,000, plotted over 1970-2000 and broken down into Total State, Machine, Supervisor, User, Transient, Latency-Enhancing, and Access-Enhancing state.]

GPU discussion points
- Motivation for GPUs: the necessary processing vs. a generic CPU pipeline
- Example problem: GPU-based vs. uniprocessor Z-buffer problem
- What does a GPU architecture look like? (explained in the context of SIMD)
- Applicability to other computing problems

How is a frame rendered? (Part H)
- It is helpful to consider how the two standard graphics APIs, OpenGL and Direct3D, work.
- These APIs define a logical graphics pipeline that is mapped onto GPU hardware and processors, along with programming models and languages for the programmable stages.
- In other words, the API takes primitives like points, lines, and polygons, and converts them into pixels.
- How does the graphics pipeline do this?
  - First, note that "pipeline" here does not mean the 5-stage pipeline we discussed earlier.
  - Here, "pipeline" describes the sequence of steps that prepare an image/scene for rendering.
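As a mnemonic for that sequence of steps, here is the logical pipeline written out as a C enum (stage names follow the usual Direct3D/OpenGL terminology; the enum itself is purely illustrative, not any real API):

    /* The logical graphics pipeline: an ordered sequence of steps that turns
       primitives (points, lines, polygons) into pixels.                     */
    enum logical_pipeline_stage {
        INPUT_ASSEMBLY,   /* gather vertices/primitives submitted via the API  */
        VERTEX_SHADING,   /* transform & light each vertex (programmable)      */
        RASTERIZATION,    /* convert each primitive into candidate pixels      */
        PIXEL_SHADING,    /* compute each candidate pixel's color (programmable) */
        OUTPUT_MERGE      /* depth test (Z-buffer) + write to the framebuffer  */
    };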

How is a frame rendered? (The Direct3D pipeline)

Example: Textures (Parts I, J)
http://en.wikipedia.org/wiki/file:texturedm1a2.png

Example: Z-buffer
http://blog.yoz.sk/examples/pixelbenderdisplacement/zbuffer1map.jpg

GPUs
- GPU = Graphics Processing Unit
  - Efficient at manipulating computer graphics
  - A graphics accelerator uses custom HW that makes the mathematical operations behind graphics fast and efficient
- Why SIMD? Because you do the same thing to each pixel.
- API language compilers target industry-standard intermediate languages instead of machine instructions; the GPU driver software then generates optimized, GPU-specific machine instructions.
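The Z-buffer example above is exactly the kind of "same operation at every pixel" computation that SIMD hardware excels at. A minimal C sketch (the 4x4 buffer size and incoming fragment values are made up for illustration):

    #include <float.h>
    #include <stdio.h>

    #define W 4
    #define H 4
    #define NPIX (W * H)

    /* Z-buffer depth test: keep an incoming fragment only if it is closer
       (smaller z) than what is already stored at that pixel. The identical,
       independent per-pixel test is why this maps so well onto SIMD.       */
    static void depth_test(float *z_buf, float *color, const float *frag_z,
                           const float *frag_color, int npix) {
        for (int p = 0; p < npix; p++)
            if (frag_z[p] < z_buf[p]) {     /* new fragment is in front */
                z_buf[p] = frag_z[p];
                color[p] = frag_color[p];
            }
    }

    int main(void) {
        float z_buf[NPIX], color[NPIX], frag_z[NPIX], frag_color[NPIX];
        for (int p = 0; p < NPIX; p++) {
            z_buf[p] = FLT_MAX;             /* start "infinitely far away" */
            color[p] = 0.0f;                /* background color            */
            frag_z[p] = (float)(p % 3);     /* made-up incoming depths     */
            frag_color[p] = 1.0f;           /* made-up incoming color      */
        }
        depth_test(z_buf, color, frag_z, frag_color, NPIX);
        printf("pixel 0: z=%.1f color=%.1f\n", z_buf[0], color[0]);
        return 0;
    }

A GPU runs the body of that inner loop on many pixels at once instead of one at a time.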

GPUs (continued)
- GPUs are often part of a heterogeneous system:
  - GPUs don't do all the things a CPU does; they are good at some specific things, e.g., matrix-vector operations.
- SW support for GPU programming:
  - NVIDIA has graphics cards that support CUDA ("Compute Unified Device Architecture"), an API extension to C.
    - Allows specialized functions from a normal C program to run on the GPU's stream processors.
    - Allows C programs that can benefit from integrated GPU(s) to use them where appropriate, while also leveraging the conventional CPU.

Modern GPU HW
- No multi-level caches
- Hides memory latency with threads (in order to process all of the pixel data)
- GPU main memory is oriented toward bandwidth

GPUs for other problems
- More recently, GPUs have been found to be more efficient than general-purpose CPUs for many complex algorithms, often those with massive amounts of vector operations.

GPU Example
- ATI and NVIDIA teamed with Stanford to do GPU-based computation for protein folding.
- This was found to offer up to a 40X improvement over the more conventional approach.
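To make the CUDA model above concrete, here is a minimal CUDA C sketch of the matrix-vector case the lecture mentions. The kernel name matvec, the problem size n, and the use of unified memory via cudaMallocManaged are illustrative choices, not from the lecture; the __global__ function is the "specialized function from a normal C program" that runs on the GPU's stream processors:

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* y = A*x with one GPU thread per output row. Thousands of independent
       threads give the HW something to switch to while memory loads are in
       flight -- the "hide memory latency with threads" idea from above.    */
    __global__ void matvec(const float *A, const float *x, float *y, int n) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n) {
            float sum = 0.0f;
            for (int col = 0; col < n; col++)
                sum += A[row * n + col] * x[col];
            y[row] = sum;
        }
    }

    int main(void) {
        const int n = 1024;                        /* illustrative size */
        float *A, *x, *y;
        cudaMallocManaged(&A, n * n * sizeof(float));
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n * n; i++) A[i] = 1.0f;
        for (int i = 0; i < n; i++)     x[i] = 1.0f;

        matvec<<<(n + 255) / 256, 256>>>(A, x, y, n);  /* run on the GPU */
        cudaDeviceSynchronize();
        printf("y[0] = %.0f (expect %d)\n", y[0], n);

        cudaFree(A); cudaFree(x); cudaFree(y);
        return 0;
    }

Compile with nvcc. Note the division of labor the slides describe: the CPU part of the program sets up data and launches work, while the GPU runs the data-parallel kernel.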