Spring 2009 Prof. Hyesoon Kim

Similar documents
Spring 2009 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim

Lecture: Benchmarks, Pipelining Intro. Topics: Performance equations wrap-up, Intro to pipelining

ECE 486/586. Computer Architecture. Lecture # 3

Hardware-driven Visibility Culling Jeong Hyun Kim

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

PowerVR Hardware. Architecture Overview for Developers

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

CS427 Multicore Architecture and Parallel Computing

ECE C61 Computer Architecture Lecture 2 performance. Prof. Alok N. Choudhary.

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Windowing System on a 3D Pipeline. February 2005

Course web site: teaching/courses/car. Piazza discussion forum:

Lecture 25: Board Notes: Threads and GPUs

Threading Hardware in G80

Portland State University ECE 588/688. Graphics Processors

Copyright Khronos Group, Page Graphic Remedy. All Rights Reserved

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

Mattan Erez. The University of Texas at Austin

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

Performance of computer systems

Designing for Performance. Patrick Happ Raul Feitosa

Graphics Hardware. Instructor Stephen J. Guy

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Graphics Processing Unit Architecture (GPU Arch)

Hardware-driven visibility culling

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Mali-400 MP: A Scalable GPU for Mobile Devices Tom Olson

CPE300: Digital System Architecture and Design

Performance OpenGL Programming (for whatever reason)

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

Cornell University CS 569: Interactive Computer Graphics. Introduction. Lecture 1. [John C. Stone, UIUC] NASA. University of Calgary

PowerVR Series5. Architecture Guide for Developers

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

CS451Real-time Rendering Pipeline

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

GeForce4. John Montrym Henry Moreton

A Bandwidth Effective Rendering Scheme for 3D Texture-based Volume Visualization on GPU

1.2.3 The Graphics Hardware Pipeline

Antonio R. Miele Marco D. Santambrogio

A MATLAB Interface to the GPU

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

Mattan Erez. The University of Texas at Austin

Chapter 1. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization

Optimisation. CS7GV3 Real-time Rendering

The Bifrost GPU architecture and the ARM Mali-G71 GPU

Building scalable 3D applications. Ville Miettinen Hybrid Graphics

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

CS230 : Computer Graphics Lecture 4. Tamar Shinar Computer Science & Engineering UC Riverside

The Role of Performance

This Unit. CIS 501 Computer Architecture. As You Get Settled. Readings. Metrics Latency and throughput. Reporting performance

Lecture 2. Shaders, GLSL and GPGPU

CS130 : Computer Graphics. Tamar Shinar Computer Science & Engineering UC Riverside

EE 4702 GPU Programming

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

The Application Stage. The Game Loop, Resource Management and Renderer Design

Comparing Reyes and OpenGL on a Stream Architecture

UberFlow: A GPU-Based Particle Engine

MEASURING COMPUTER TIME. A computer faster than another? Necessity of evaluation computer performance

Rendering Objects. Need to transform all geometry then

EECS 487: Interactive Computer Graphics

GPGPU introduction and network applications. PacketShaders, SSLShader

Computer Architecture CS372 Exam 3

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

Working with Metal Overview

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Programming on Larrabee. Tim Foley Intel Corp

Computer Architecture

CS425 Computer Systems Architecture

Analysis Report. Number of Multiprocessors 3 Multiprocessor Clock Rate Concurrent Kernel Max IPC 6 Threads per Warp 32 Global Memory Bandwidth

GPUs and GPGPUs. Greg Blanton John T. Lubia

2.11 Particle Systems

Lecture - 4. Measurement. Dr. Soner Onder CS 4431 Michigan Technological University 9/29/2009 1

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

Introduction to Parallel Programming Models

1.3 Data processing; data storage; data movement; and control.

Squeezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques

CSE 141 Summer 2016 Homework 2

Write only as much as necessary. Be brief!

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

Mattan Erez. The University of Texas at Austin

The NVIDIA GeForce 8800 GPU

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

Overview. Technology Details. D/AVE NX Preliminary Product Brief

Real-World Applications of Computer Arithmetic

Performance, Power, Die Yield. CS301 Prof Szajda

GeForce3 OpenGL Performance. John Spitzer

CS 428: Fall Introduction to. OpenGL primer. Andrew Nealen, Rutgers, /13/2010 1

CS 426 Parallel Computing. Parallel Computing Platforms

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

CS 341l Fall 2008 Test #2

Transcription:

Spring 2009 Prof. Hyesoon Kim

Benchmarking is critical to make a design decision and measuring performance Performance evaluations: Design decisions Earlier time : analytical based evaluations From 90 s: heavy rely on simulations. Processor evaluations Workload characterizations: better understand the workloads

Benchmarks Real applications and application suites E.g., SPEC CPU2000, SPEC2006, TPC-C, TPC-H, EEMBC, MediaBench, PARSEC, SYSmark Kernels Representative parts of real applications Easier and quicker to set up and run Often not really representative of the entire app Toy programs, synthetic benchmarks, etc. Not very useful for reporting Sometimes used to test/stress specific functions/features

Representative applications keeps growing with time!

Test, train and ref Test: simple checkup Train: profile input, feedback compilation Ref: real measurement. Design to run long enough to use for real system -> Simulation? Reduced input set Statistical simulation Sampling

Measure transaction-processing throughput Benchmarks for different scenarios TPC-C: warehouses and sales transactions TPC-H: ad-hoc decision support TPC-W: web-based business transactions Difficult to set up and run on a simulator Requires full OS support, a working DBMS Long simulations to get stable results

SPLASH: Scientific computing kernels Who used parallel computers? PARSEC: More desktop oriented benchmarks NPB: NASA parallel computing benchmarks Not many

GFLOPS, TFLOPS MIPS (Million instructions per second)

Speedup of arithmeitc means!= arithmetic mean of speedup Use geometric mean: n n i=1 Normalized execution time on i Neat property of the geometric mean: Consistent whatever the reference machine Do not use the arithmetic mean for normalized execution times

Often when making comparisons in comparch studies: Program (or set of) is the same for two CPUs The clock speed is the same for two CPUs So we can just directly compare CPI s and often we use IPC s

Average CPI = (CPI 1 + CPI 2 + + CPI n )/n A.M. of IPC = (IPC 1 + IPC 2 + + IPC n )/n Not Equal to A.M. of CPI!!! Must use Harmonic Mean to remain to runtime

H.M.(x 1,x 2,x 3,,x n ) = n 1 + 1 + 1 + + 1 x 1 x 2 x 3 x n What in the world is this? Average of inverse relationships

Average IPC = 1 = 1 A.M.(CPI) CPI 1 + CPI 2 + CPI 3 + + CPI n n n n n = n CPI 1 + CPI 2 + CPI 3 + + CPI n = n 1 + 1 + 1 + + 1 = H.M.(IPC) IPC 1 IPC 2 IPC 3 IPC n

Stanford graphics benchmarks Simple graphics workload. Academic Mostly game applications 3DMark: http://www.futuremark.com/benchmarks/3dmar kvantage Tom s hardware

Still graphics is the major performance bottlenecks Previous research: emphasis on graphics

Several genres of video games First Person Shooter Fast-paced, graphically enhanced Focus of this presentation Role-Playing Games Lower graphics and slower play Board Games Just plain boring

Physics Particle Event Collision Detection AI Rendering Display Computing

Current game design principles: higher frame rates imply the better game quality Recent study on frame rates [Claypool et al. MMCN 2006] very high frame rates are not necessary, very low frame rates impact the game quality severely

Snapshots of animation [Davis et al. Eurographics 2003] time

Game workload Computational workload Rendering workload Other workload Rasterization workload

Case study Workload characterization of 3D games, Roca, et al. IISWC 2006 [WOR] Use ATTILA

Average primitives per frame Average vertex shader instructions Vertex cache hit ratio System bus bandwidths Percentage of clipped, culled, and traversed triangles Average trianglesizes

GPU execution driven simulator https://attilaac.upc.edu/wiki/index.php/architecture Can simulate OpenGL at this moments

Index Buffer Vertex cache Streamer Vertex Request Buffer Primitive Assembly Clipping Triangle Setup Fragment Generation Hierarchical Z Z Cache Z test Z Cache Z test Hierarchical Z buffer HZ Cache Register File Shader Shader Shader Shader Shader Texture Filter Texture Cache Texture Address Attila architecture Unit Size Element width Streamer 48 16x4x32 bits Primitive Assembly 8 3x16x4x32 bits Clipping 4 3x4x32 bits Triangle Setup 12 3x4x32 bits Fragment Generation 16 3x4x32 bits Hierarchical Z 64 (2x16+4x32)x4 bits Z Tests 64 (2x16+4x32)x4 bits Interpolator --- --- Color Write 64 (2x16+4x32)x4 bits Unified Shader (vertex) 12+4 16x4x32 bits Unified Shader (fragment) 240+16 10x4x32 bits Table 2. Queue sizes and number of threads in the ATTILA reference architecture Interpolator Color cache Blend Color cache Blend MC0 MC1 MC2 MC3

Execution driven: Correctness, long development time, Execute binary Trace driven Easy to develop Simulation time could be shorten Large trace file size

No simulation is required To provide insights Statistical Methods CPU First-order GPU Warp level parallelism

CWP=4 Case1: 1 1 3 5 7 2 2 4 6 8 3 4 5 6 Memory 7 Waiting period 8 MWP=2 2 Computation + 4 Memory (a) Case2: 1 1 3 1 3 2 2 4 2 4 3 4 1 2 3 4 2 Computation + 4 Memory (b) Idle cycles Case3 : CWP=4 1 1 2 3 2 4 3 5 MWP > 4 4 5 6 6 7 7 8 8 Case4 : 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 8 Computation + 1 Memory 8 Computation + 1 Memory (a) (b)

Case6: (1 warp) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 8 Computation + 8 Memory (a) Case7: 1 1 1 1 1 1 1 1 2 2 2 2 1 2 2 2 (2 warps) 5 Computation + 4 Memory (b)

Hardware performance counters Built in counters (instruction count, cache misses, branch mispredicitons) Profiler Architecture simulator Characterized items Cache miss, branch misprediciton, row-buffer hit ratio

Top design (instruction, data flow from memory to CPU and GPU), Data/control signals CPU design Pipeline stages, SMT support, Fetch address calculation, branch misprediction, cache miss handling path Memory address calculation stage, vector processing units At least 5 MUXes, register, ALU, latches, Memory system: Load/store buffers, queues GPU Design Show at least 10 ALUs

One of the following items Detailed CPU pipeline design (more muxes and more adders) Detailed survey (more information from other sources) Detailed GPU pipeline design (more muxes and more adders) Detailed memory system (more queues) Detailed memory controller

I/O just a box Cache just one box or (tag + data) Report: explanations are required. ECC: just a box Design review: 30 min Feedback for final report