Graphics Hardware 2008

Similar documents
AMD Smarter Choice. Futures Some thoughts. Raja Koduri

Anatomy of AMD s TeraScale Graphics Engine

HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE

FUSION PROCESSORS AND HPC

THE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD

viewdle! - machine vision experts

AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING FELLOW 3 OCTOBER 2016

Use cases. Faces tagging in photo and video, enabling: sharing media editing automatic media mashuping entertaining Augmented reality Games

OPENCL TM APPLICATION ANALYSIS AND OPTIMIZATION MADE EASY WITH AMD APP PROFILER AND KERNELANALYZER

Panel Discussion: The Future of I/O From a CPU Architecture Perspective

Heterogeneous Computing

SIMULATOR AMD RESEARCH JUNE 14, 2015

HPG 2011 HIGH PERFORMANCE GRAPHICS HOT 3D

Multi-core processors are here, but how do you resolve data bottlenecks in native code?

Automatic Intra-Application Load Balancing for Heterogeneous Systems

HyperTransport Technology

Understanding GPGPU Vector Register File Usage

High Performance Graphics 2010

AMD APU and Processor Comparisons. AMD Client Desktop Feb 2013 AMD

GPGPU COMPUTE ON AMD. Udeepta Bordoloi April 6, 2011


AMD Radeon HD 2900 Highlights

AMD 780G. Niles Burbank AMD. an x86 chipset with advanced integrated GPU. Hot Chips 2008

Fusion Enabled Image Processing

AMD Radeon ProRender plug-in for Unreal Engine. Installation Guide

LIQUIDVR TODAY AND TOMORROW GUENNADI RIGUER, SOFTWARE ARCHITECT

Sequential Consistency for Heterogeneous-Race-Free

Generic System Calls for GPUs

AMD RYZEN PROCESSOR WITH RADEON VEGA GRAPHICS CORPORATE BRAND GUIDELINES

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

AMD CORPORATE TEMPLATE AMD Radeon Open Compute Platform Felix Kuehling

EXPLOITING ACCELERATOR-BASED HPC FOR ARMY APPLICATIONS

RegMutex: Inter-Warp GPU Register Time-Sharing

ADVANCED RENDERING EFFECTS USING OPENCL TM AND APU Session Olivier Zegdoun AMD Sr. Software Engineer

The mobile computing evolution. The Griffin architecture. Memory enhancements. Power management. Thermal management

FLASH MEMORY SUMMIT Adoption of Caching & Hybrid Solutions

BIOMEDICAL DATA ANALYSIS ON HETEROGENEOUS PLATFORM. Dong Ping Zhang Heterogeneous System Architecture AMD

The Rise of Open Programming Frameworks. JC BARATAULT IWOCL May 2015

OpenCL Implementation Of A Heterogeneous Computing System For Real-time Rendering And Dynamic Updating Of Dense 3-d Volumetric Data

AMD IOMMU VERSION 2 How KVM will use it. Jörg Rödel August 16th, 2011

Modern Processor Architectures. L25: Modern Compiler Design

INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS

HIGHLY PARALLEL COMPUTING IN PHYSICS-BASED RENDERING OpenCL Raytracing Based. Thibaut PRADOS OPTIS Real-Time & Virtual Reality Manager

The Road to the AMD. Fiji GPU. Featuring Die Stacking and HBM Technology 1 THE ROAD TO THE AMD FIJI GPU ECTC 2016 MAY 2015

! Readings! ! Room-level, on-chip! vs.!

MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011

AMD Radeon ProRender plug-in for Universal Scene Description. Installation Guide

SOLUTION TO SHADER RECOMPILES IN RADEONSI SEPTEMBER 2015

Maximizing Six-Core AMD Opteron Processor Performance with RHEL

MEASURING AND MODELING ON-CHIP INTERCONNECT POWER ON REAL HARDWARE

AMD Graphics Team Last Updated April 29, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview April 2013 Approved for public distribution

HPCA 18. Reliability-aware Data Placement for Heterogeneous memory Architecture

Run Anywhere. The Hardware Platform Perspective. Ben Pollan, AMD Java Labs October 28, 2008

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

ROCm: An open platform for GPU computing exploration

AMD HD3D Technology. Setup Guide. 1 AMD HD3D TECHNOLOGY: Setup Guide

Changing your Driver Options with Radeon Pro Settings. Quick Start User Guide v3.0

ASYNCHRONOUS SHADERS WHITE PAPER 0

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to

AMD Graphics Team Last Updated February 11, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview February 2013 Approved for public distribution

SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL

ACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research

Driver Options in AMD Radeon Pro Settings. User Guide

Release Notes Compute Abstraction Layer (CAL) Stream Computing SDK New Features. 2 Resolved Issues. 3 Known Issues. 3.

Solid State Graphics (SSG) SDK Setup and Raw Video Player Guide

CAUTIONARY STATEMENT 1 AMD NEXT HORIZON NOVEMBER 6, 2018

Accelerating Applications. the art of maximum performance computing James Spooner Maxeler VP of Acceleration

WHY PARALLEL PROCESSING? (CE-401)

GPGPU, 4th Meeting Mordechai Butrashvily, CEO GASS Company for Advanced Supercomputing Solutions

Broadcast-Quality, High-Density HEVC Encoding with AMD EPYC Processors

Fan Control in AMD Radeon Pro Settings. User Guide. This document is a quick user guide on how to configure GPU fan speed in AMD Radeon Pro Settings.

3D Numerical Analysis of Two-Phase Immersion Cooling for Electronic Components

Vulkan (including Vulkan Fast Paths)

STREAMING VIDEO DATA INTO 3D APPLICATIONS Session Christopher Mayer AMD Sr. Software Engineer

NEXT-GENERATION MATRIX 3D IMMERSIVE USER INTERFACE [ M3D-IUI ] H Raghavendra Swamy AMD Senior Software Engineer

1 HiPEAC January, 2012 Public TASKS, FUTURES AND ASYNCHRONOUS PROGRAMMING

CLICK TO EDIT MASTER TITLE STYLE. Click to edit Master text styles. Second level Third level Fourth level Fifth level

1 Presentation Title Month ##, 2012

Changing your Driver Options with Radeon Pro Settings. Quick Start User Guide v2.1

HETEROGENEOUS COMPUTING IN THE MODERN AGE. 1 HiPEAC January, 2012 Public

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

45-year CPU Evolution: 1 Law -2 Equations

GPU Fundamentals Jeff Larkin November 14, 2016

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Higher Level Programming Abstractions for FPGAs using OpenCL

KVM CPU MODEL IN SYSCALL EMULATION MODE ALEXANDRU DUTU, JOHN SLICE JUNE 14, 2015

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

NUMA Topology for AMD EPYC Naples Family Processors

DR. LISA SU

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

AMD EPYC CORPORATE BRAND GUIDELINES

Computer Architecture Lecture 27: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015

Transcription:

AMD Smarter Choice Graphics Hardware 2008 Mike Mantor AMD Fellow Architect michael.mantor@amd.com GPUs vs. Multi-core CPUs On a Converging Course or Fundamentally Different?

Many Cores Disruptive Change Moore s Law is enabling chaotic change Double transistor density every 18 months Creating need for new software models Power/Thermal Constrained World Energy/Cooling Cost Exceed IT equipment Cost Frequency alone is not our friend The Complexity Wall/Frequency Wall Single threaded performance maxed Customers expect added Visible Value 2

Models for the Future of Computers Flame, the robot, walks the way humans walk. (Credit: Image courtesy of Delft University of Technology) Human brain image licensed under Creative Commons 3.0 Unreported 2006 Stanford Racing Team Web design by ToTheWeb LLC F-16 Fighting Falcon image is a U.S. government work and is in the public license. See http://en.wikipedia.org/wiki/image:brain_090407.jpg http://cs.stanford.edu/group/roadrunner/images/graphics/junior-3.jpg domain. See http://en.wikipedia.org/wiki/image:f-16_fighting_falcon.jpg Human Brain Autonomous Vehicle Fighter Jet Heterogeneous Systems Parallel signal processing of sensor data Parallel Classification and Recognition Vision, Sound, Contact, Patterns, Object Serial Cognitive and Reasoning P aralle l S earch, S can, S ort 3

Multi-Core CPUs Advantages Optimized for execution of sequential program Complex Pipelines to achieve max frequency Out Of Order, Super Scalar to achieve max ILP Branch Predictors, Speculative execution Register Renaming Large Caches optimized for low latency access Multi-Core enables parallel multi-tasks/threads Improved user response Background task such as OS chores, virus scan, etc 4

Multi-Core CPUs Disadvantage Fine grain sharing of work between cores/caches OS Overhead spawn, communication, fork Finding large number of task to Multi-task is hard Embarrassing gparallel apps Limited Speed up (ALU & Bandwidth) Limited Power/Flop advantage 5

GPUs Advantages Optimized for structured parallel execution Extensive ALU counts & Memory Bandwidth Cooperative multi-threading hides latency Shared Instruction Resources Fixed function units for parallel workloads dispatch Extensive exploitation of Locality 6

GPUs Disadvantages Ineffective for Single threaded execution Divergence Parallelism lost opportunity Loss of efficiency Upload/download of data cost Requires large workload to fully utilize Small Caches are optimized for locality/throughput 7

Amdahl s Law Speed-up = 1 S W : %S Serial lwork S W + (1 S W ) / N N: Number of processors 9 9 8 8 7 7 Speed-up p 6 5 4 3 2 1 0 Speed-u up 6 5 4 3 2 1 0 1 2 4 8 1 2 4 8 Number of CPU Cores Number of CPU Cores 0% Serial 100% Serial 0% Serial 10% Serial 35% Serial 100% Serial 8

Amdahl s Law zoom out a bit Everyone knows Amdahl s Law, but quickly forgets Dr. Tom Puzak, IBM Research, 2007 140 120 100 Speed d-up 80 60 40 20 0% Serial 10% Serial 35% Serial 100% Serial 0 1 2 4 8 16 32 64 128 Number of CPU Cores Amdahl s Law seriously inhibits unstructured parallelism 9

Performance/(Cost & Watt) Illustration* Example: Assume 200 mm², 80W @ constant freq Relative Performance per core 0.25 0.5 1.0 1.4 mm2/core 1.56 mm2 6.25 mm2 25 mm2 50 mm2 Power/Core 0.6 W 2.5 W 10 W 20 W #of Cores (200mm² System) 128 32 8 4 System Performance for parallel Apps 32 16 8 5.6 Relative System Performance/mm² 16% 8% 4% 2.8% Relative System Performance/Watt 40% 20% 10% 7% 10 Structured Parallelism enables solutions for more flops less watts *Illustration only. Not based on actual performance.

Building Blocks - Best of Both Heterogeneous Cores (Mixed CPU/GPU) Lowest Power & Highest Performance Solution Serial Single threaded Leverage Fast OOCs Task Parallel Leverage multi-cpu or GPU cores Data Parallel - Leverage erage GPU or Application Specific Cores 11

Building Blocks - Best of Both (Con d) Unified Memory Architectures Remove large data copies of workload and results Reduce power consumption per computation Enable small job offload Remove OS Overhead and latency of communication Fast Synchronization Primitives Increasing capable memory systems & BW 12

AMD Stream SDK Software Development Stack Applications Compilers Libraries 3 rd Party Tools Brook+ & Other App Specific OpenCl ACM L Rapidmind Graphics API DirectX OpenGL AMD Runtime Multi-Core AMD CPUs Compute Abstraction Layer (CAL) AMD Stream Processors 13

Software Solution (Real Challenge) Development of Software often exceeds that of hardware Determine and implement long term solutions Enable highly coherent machines with heterogeneous accelerators Enable control of memory hierarchies for system scaling Communication/Messaging g Producer/Consumer relationships Simplifying Programming Model Build on emerging multi-core models Enable the Masses to Programming with parallel abilities Enable Application Specific Libraries Enable open standards d 14

Processing Efficiency* AMD FireStream 9250 Stream Processor ATI Radeon X1800 XT ATI Radeon HD 3850 ATI Rd Radeon HD 2900 XT ATI Radeon X1900 XTX ATI Radeon X1950 PRO Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 2005 2006 2007 2008 *Source: internal AMD test results 15

Disclaimer and Attribution DISCLAIMER The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION 2008 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD Opteron, ATI, the ATI logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners. 16