
XMT-GPU A PRAM Architecture for Graphics Computation Tom DuBois, Bryant Lee, Yi Wang, Marc Olano and Uzi Vishkin

Bifurcation Between CPU and GPU. CPUs are general purpose and serial; GPUs are special purpose and parallel. CPUs are becoming more parallel: dual and quad cores are here, and roadmaps predict many-cores, though it is unclear how to build or program these many-cores. GPUs are becoming more general and are gaining momentum for more general applications.

Is Unification Possible? Can a single general-purpose, many-core processor replace a CPU + GPU? It must stand alone (no coprocessor needed), be easy and flexible to program, be competitive with CPUs on anything, and be competitive with GPUs on graphics. We choose XMT as a unification candidate in part because it satisfies the first three requirements. During the Q&A session we welcome your thoughts on what else could be used.

Main Experiment and Results. Can XMT satisfy the fourth requirement? We simulate surface shading (a common graphics application) on general-purpose (GP) and GPU representatives. The results are mixed: XMT is slightly faster on some GPU tasks, while GPUs are significantly faster on others. Unification is unlikely, but momentum may shift towards GP.

CPU History Overview. The serial random-access machine programming model has been a great success story and the dominant model for decades, popular in both theory and practice. It relies on faster serial hardware for performance gains, which is no longer sufficient. Multi-cores are available today (2-4 cores per chip), and many-cores are on the horizon (100s or 1000s of cores), but how will they look?

What Will Future CPUs Look Like? The future from major vendors is unclear; current proposals try to look like serial RAM to programmers, and the long-term software spiral is broken. The PRAM (parallel random-access machine) is the model preferred by the programming community: a natural extension of the serial model, scalable, and included in major algorithms textbooks. It was discounted because of the difficulty of building one, but building one has recently become feasible.

XMT: Explicit Multi-Threading. A PRAM-on-chip vision under development at the University of Maryland since 1997, targeting ~1000 cores on chip with PRAM-like programmability. An on-chip shared L1 cache provides the memory bandwidth necessary for PRAM. Previous work has established XMT's performance on a variety of applications; simulator and FPGA implementations are available at www.umiacs.umd.edu/users/vishkin/xmt

Programming XMT. XMTC is a single-program multiple-data (SPMD) extension of standard C that resembles a CRCW PRAM. Spawning creates lightweight, asynchronous threads; serial execution resumes once all threads complete.
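As a rough sketch of this SPMD style (XMTC-like pseudocode; the exact XMTC syntax may differ from this recollection), a parallel array addition might look like the following, where the spawn block launches one lightweight thread per index and `$` names the current thread's ID:

```c
/* XMTC-like pseudocode, not standard C: spawn(low, high)
   creates one asynchronous lightweight thread per index in
   [low, high], and $ is the current thread's ID. */
int A[N], B[N], C[N];

spawn(0, N - 1) {
    C[$] = A[$] + B[$];   /* each thread handles one element */
}
/* the join is implicit: execution is serial again here */
```

The point of the idiom is that the programmer states only what each thread does, matching the way PRAM algorithms are written.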

XMT FPGA In Use. A 64-processor, 75 MHz prototype has been used in an undergraduate theory class (and also by non-major freshmen and 35 high-school students), producing 6 significant projects, with no architecture discussion and minimal XMTC discussion.

GPU History Overview. The stream programming model: streams and kernels are simple and easily exploit locality for some tasks, but handle irregular, fine-grained, and serial code poorly. GPUs were originally very inflexible: programming meant setting bits for muxes. Modern GPUs are much more flexible (C-like languages, GPGPU) but are still tied to the stream model.

Very High Level GPU Pipeline. The old architecture was a fixed pipeline: vertex processing, then fragment processing, then pixel processing. The new virtual pipeline architecture has computation units capable of vertex, fragment, pixel, and other processing, managed by flow control.

More Detailed Modern GPU. From NVIDIA's GeForce 8800 GPU Architecture Overview: the design is virtually pipelined.

GeForce and XMT Similarities

GeForce and XMT Similarities Clusters of processors, functional units

GeForce and XMT Similarities. On-chip memory and access network

GeForce and XMT Similarities Control Logic

Does XMT Meet the Unification Requirements? The requirements: the ability to stand alone, ease and flexibility of programming, good performance on GP tasks, and competitiveness with modern GPUs. Graphics performance is the only unknown. GPUs are a unique competitor: successful, commodity, parallel hardware.

Our Experiment. We compared a simulated XMT system against several real GPUs on fragment shading: compute shading (memory-light and general) and texture shading (memory-heavy and specialized). Only the fragment-shading stage was compared.

Simulating Fragment Shading. The simulated XMT is used in place of the fragment-shading step in a software pipeline: the application renders through Mesa OpenGL (vertex processing, rasterization, fragment shading, other fragment operations) to the display, with the fragment-shading stage handled by an XMT fragment-shading program running on the XMT simulator.

The Competitors. Simulated XMT*; ATI x700 (released mid-2004); NVIDIA GeForce 7900 (released mid-2006); NVIDIA GeForce 8800 (released late 2006). *Scaled to the same FLOPS level as the GeForce 8800.

XMT Variants. We used 3 variants of XMT: version 1 is unmodified; version 2 adds a graphics ISA (floor, fraction, and linear-interpolation instructions); version 3 adds the graphics ISA plus 4-way vector computations on 8-bit arithmetic.

Compute Shader Results. x700: 354 FPS; GeForce 7900: 1917 FPS; GeForce 8800: 2760 FPS; XMT version 1: 1423 FPS; XMT version 2: 3222 FPS.

Texture Shader Results. x700: 1846 FPS; GeForce 7900: 8632 FPS; GeForce 8800: 3179 FPS; XMT version 1: 197 FPS; XMT version 2: 275 FPS; XMT version 3: 487 FPS.

Analysis. XMT compute-shades faster but texture-shades much slower, which is acceptable for some applications but not all. The GeForce GPUs follow the same trend, sacrificing speed on the most-used applications for greater flexibility on others.

Summary. The divide between CPU and GPU is blurring. A unified system would give ease of programming, good GP performance, and good graphics performance, but combination systems are needed for truly high-performance applications. An XMT + GPU system could provide the best of both worlds: how should work be partitioned between them, and can they cooperate? www.umiacs.umd.edu/users/vishkin/xmt