Performance, Cost and Amdahl's Law. Arquitectura de Computadoras


Performance, Cost and Amdahl's Law
Arquitectura de Computadoras
Arturo Díaz Pérez
Centro de Investigación y de Estudios Avanzados del IPN
adiaz@cinvestav.mx

Performance
Purchasing perspective: given a collection of machines, which has the
  » best performance?
  » least cost?
  » best performance / cost?
Design perspective: faced with design options, which has the
  » best performance improvement?
  » least cost?
  » best performance / cost?
Both require
  » a basis for comparison
  » a metric for evaluation
Our goal is to understand the cost and performance implications of architectural choices.

Two notions of performance

| Plane | DC to Paris | Speed | Passengers | Throughput (pmph = passengers × mph) |
|---|---|---|---|---|
| Boeing 747 | 6.5 hours | 610 mph | 470 | 286,700 |
| Concorde | 3 hours | 1350 mph | 132 | 178,200 |

Which has higher performance?
- Time to do the task (Execution Time): execution time, response time, latency
- Tasks per day, hour, week, sec, ns... (Performance): throughput, bandwidth
Response time and throughput are often in opposition.

What is Performance?
KEY: a measure of speed (rate)
- Car: miles driven per hour
- Car wash: cars washed per day
- Auto plant: cars built per year
Two metrics:
- Latency (response or execution time)
  » time from start to finish of a task
- Throughput (bandwidth)
  » rate of task completion = rate of task initiation = 1 / (time between task completions)
Deterministic vs. average

Definitions
Performance is in units of things-per-second: bigger is better.
If we are primarily concerned with response time:
$$\text{performance}(X) = \frac{1}{\text{execution\_time}(X)}$$
"X is n times faster than Y" means
$$n = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$

Example
Time of Concorde vs. Boeing 747?
- Concorde is 1350 mph / 610 mph ≈ 6.5 hours / 3 hours ≈ 2.2 times faster
Throughput of Concorde vs. Boeing 747?
- Concorde is 178,200 pmph / 286,700 pmph = 0.62 times as fast
- Boeing is 286,700 pmph / 178,200 pmph = 1.60 times faster
Boeing is 1.6 times (60%) faster in terms of throughput.
Concorde is 2.2 times (120%) faster in terms of flying time.
We will focus primarily on execution time for a single job.
Lots of instructions in a program => instruction throughput is important!
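The two ratios on this slide can be checked with a few lines of Python. This is a minimal sketch that uses only the figures from the DC-to-Paris table; the helper name `times_faster` is ours and does not come from the lecture.

```python
def times_faster(perf_x: float, perf_y: float) -> float:
    """Return n such that X is n times faster than Y (ratio of performances)."""
    return perf_x / perf_y

# Figures from the DC-to-Paris table
boeing = {"hours": 6.5, "mph": 610, "passengers": 470}
concorde = {"hours": 3.0, "mph": 1350, "passengers": 132}

# Throughput in passenger-miles per hour (pmph)
boeing_pmph = boeing["mph"] * boeing["passengers"]
concorde_pmph = concorde["mph"] * concorde["passengers"]

# Latency view: performance = 1 / execution time
latency_ratio = times_faster(1 / concorde["hours"], 1 / boeing["hours"])
# Throughput view: performance = passenger-miles per hour
throughput_ratio = times_faster(boeing_pmph, concorde_pmph)

print(f"Concorde is {latency_ratio:.1f}x faster in flying time")
print(f"Boeing 747 is {throughput_ratio:.1f}x faster in throughput")
```

The same function answers both questions; only the definition of "performance" passed in changes, which is why the two planes can each win.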

Relative Performance
Definition: X is n% faster than Y if
$$\frac{\text{execution rate}_X}{\text{execution rate}_Y} = \frac{\text{execution time}_Y}{\text{execution time}_X} = 1 + \frac{n}{100}$$
Example: X = 1 minute, Y = 2 minutes
$$\frac{2\ \text{minutes}}{1\ \text{minute}} = 1 + \frac{100}{100}$$
Thus, X is 100% faster than Y.
Example: a car wash that starts one car per minute and holds four cars
- Latency = four minutes per car
- Throughput = one car per minute
- Throughput > 1/Latency due to overlap
Key idea: pipelining
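A small sketch of the "n% faster" definition and of the car wash example. The function name `percent_faster` and the variable names are illustrative assumptions; the numbers are the ones on the slide.

```python
def percent_faster(time_x: float, time_y: float) -> float:
    """Return n such that X is n% faster than Y: time_y / time_x = 1 + n/100."""
    return (time_y / time_x - 1) * 100

# X takes 1 minute, Y takes 2 minutes
print(percent_faster(1, 2))

# Car wash: holds four cars and starts one car per minute.
latency_min = 4   # minutes each car spends in the wash
throughput = 1    # cars completed per minute in steady state

# Because four cars overlap, throughput exceeds 1 / latency.
print(throughput > 1 / latency_min)
```

Without overlap the car wash would finish one car every four minutes; pipelining is what separates throughput from the reciprocal of latency.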

Basis of Evaluation

| Basis | Pros | Cons |
|---|---|---|
| Actual target workload | representative | very specific; non-portable; difficult to run or measure; hard to identify cause |
| Full application benchmarks | portable; widely used; improvements useful in reality | less representative |
| Small kernel benchmarks | easy to run, early in design cycle | easy to fool |
| Microbenchmarks | identify peak capability and potential bottlenecks | peak may be a long way from application performance |

Metrics of performance
- Application: answers per month; useful operations per second
- Programming language / compiler / ISA: (millions of) instructions per second (MIPS); (millions of) floating-point operations per second (MFLOP/s)
- Datapath / control: megabytes per second
- Function units / transistors, wires, pins: cycles per second (clock rate)
Each metric has a place and a purpose, and each can be misused.

Aspects of CPU Performance
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
Which of the three factors (instruction count, CPI, clock rate) do the program, the compiler, the instruction set, the organization, and the technology affect?

Aspects of CPU Performance
What affects each term of CPU time = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle):

| | Instr. count | CPI | Clock rate |
|---|---|---|---|
| Program | X | | |
| Compiler | X | X | |
| Instr. set | X | X | X |
| Organization | | X | X |
| Technology | | | X |

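CPI: Average cycles per instruction
$$\text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}}$$
$$\text{CPU Time} = \text{Cycle Time} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$$
$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where } F_i = \frac{I_i}{\text{Instruction Count}}$$
Here $I_i$ is the number of executed instructions of class $i$ and $\text{CPI}_i$ is the cycle count of that class.
Invest resources where time is spent!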
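The two sums on this slide can be sketched as follows. The instruction mix and the 1 ns cycle time are hypothetical values chosen only for illustration, and the function names are ours.

```python
def average_cpi(mix):
    """mix: list of (instruction_count, cycles_per_instruction) per class.

    Implements CPI = sum(CPI_i * F_i) with F_i = I_i / instruction count.
    """
    total_instructions = sum(count for count, _ in mix)
    return sum(cpi * count / total_instructions for count, cpi in mix)

def cpu_time_from_mix(mix, cycle_time):
    """Implements CPU time = cycle time * sum(CPI_i * I_i)."""
    return cycle_time * sum(cpi * count for count, cpi in mix)

# Hypothetical program: 600 one-cycle, 300 two-cycle, 100 four-cycle instructions
mix = [(600, 1), (300, 2), (100, 4)]
cycle_time = 1e-9  # assumed clock period of 1 ns

cpi = average_cpi(mix)
total_time = cpu_time_from_mix(mix, cycle_time)
print(f"CPI = {cpi:.2f}, CPU time = {total_time:.2e} s")
```

Both routes agree: multiplying the average CPI by the instruction count and the cycle time gives the same CPU time as summing cycles class by class.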

Controversial Example
$$\text{CPU time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
Some have argued:
- CISC CPU Time = P × 8 × T = 8PT
- RISC CPU Time = 2P × 2 × T = 4PT
- RISC CPU Time = (CISC CPU Time) / 2
DISCLAIMER: The truth is much, much more complex.
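The argument reduces to plugging numbers into the CPU time product. A minimal sketch, assuming arbitrary values for P and T (they cancel in the ratio); the function name `cpu_time` is ours.

```python
def cpu_time(instructions: float, cpi: float, cycle_time: float) -> float:
    """CPU time = instruction count * cycles per instruction * seconds per cycle."""
    return instructions * cpi * cycle_time

P = 1_000_000   # arbitrary instruction count for the CISC program
T = 1e-9        # arbitrary cycle time in seconds

cisc = cpu_time(P, 8, T)        # P instructions at 8 cycles each
risc = cpu_time(2 * P, 2, T)    # twice the instructions at 2 cycles each

print(f"RISC time / CISC time = {risc / cisc:.2f}")
```

The sketch inherits the slide's simplification that both machines share the same cycle time T, which is exactly the kind of assumption the disclaimer warns about.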

Amdahl's Law
Speedup due to enhancement E:
$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
$$\text{ExTime(with } E) = \left((1-F) + \frac{F}{S}\right) \times \text{ExTime(without } E)$$
$$\text{Speedup(with } E) = \frac{1}{(1-F) + \frac{F}{S}}$$

Amdahl's Law
Let
$$\text{Speedup} = \frac{\text{new rate}}{\text{old rate}} = \frac{\text{old latency}}{\text{new latency}}$$
Consider an enhancement x that speeds up a fraction $f_x$ of a task by a factor $S_x$:
$$\text{Speedup}_{overall} = \frac{\text{old latency}}{\text{new latency}} = \frac{[(1-f_x) + f_x] \times \text{old latency}}{(1-f_x) \times \text{old latency} + (f_x/S_x) \times \text{old latency}}$$
Amdahl's Law gives:
$$\text{Speedup}_{overall} = \frac{1}{(1-f_x) + (f_x/S_x)}$$

Amdahl's Law, cont.
Example: $f_x = 95\%$ and $S_x = 1.10$
$$\text{Speedup}_{overall} = \frac{1}{(1-0.95) + (0.95/1.10)} \approx 1.094$$
Example: $f_x = 5\%$ and $S_x = 10$
$$\text{Speedup}_{overall} = \frac{1}{(1-0.05) + (0.05/10)} \approx 1.047$$
Example: $f_x = 5\%$ and $S_x \to \infty$
$$\text{Speedup}_{overall} = \frac{1}{(1-0.05) + (0.05/\infty)} \approx 1.052$$
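The three examples can be reproduced with a direct transcription of the formula. This is a sketch; the function name `amdahl_speedup` is ours, and `math.inf` stands in for the unbounded enhancement of the third example.

```python
import math

def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup when a fraction f of the task is sped up by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)

examples = [(0.95, 1.10), (0.05, 10), (0.05, math.inf)]
for f, s in examples:
    print(f"f = {f:.2f}, S = {s}: speedup = {amdahl_speedup(f, s):.4f}")
```

Printed to four decimals the results are slightly larger than the three-decimal figures on the slide, which appear to be truncated rather than rounded. Note how a modest 10% gain on 95% of the task beats even an infinite gain on 5% of it.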

Amdahl's Law Corollary
Since $S_x \to \infty$ implies $\text{Speedup}_{overall} \to \frac{1}{1-f_x}$, for real speedups:
$$\text{Speedup}_{overall} = \frac{1}{(1-f_x) + (f_x/S_x)} < \frac{1}{1-f_x}$$
Example:

| $f_x$ | $1/(1-f_x)$ |
|---|---|
| 1% | 1.01 |
| 2% | 1.02 |
| 5% | 1.05 |
| 10% | 1.11 |
| 20% | 1.25 |
| 50% | 2.00 |
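The bound in the corollary can be tabulated directly. A minimal sketch; the function name `max_speedup` is an illustrative choice.

```python
def max_speedup(f: float) -> float:
    """Upper bound on overall speedup when fraction f is made infinitely fast."""
    return 1.0 / (1.0 - f)

for percent in (1, 2, 5, 10, 20, 50):
    print(f"{percent:>3d} %  {max_speedup(percent / 100):.2f}")
```

Even when half of the task is accelerated without limit, the whole task only gets twice as fast.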

Standard Example: Load/Store Machine

| Operation | Frequency | Cycle Count |
|---|---|---|
| ALU Ops | 43% | 1 |
| Loads | 21% | 1 |
| Stores | 12% | 2 |
| Branches | 24% | 2 |

Suppose we could make stores execute in 1 cycle, at the cost of a 15% longer cycle time. Should we make this optimization?
Old CPI = 0.43 + 0.21 + (0.12 + 0.24) × 2 = 1.36
New CPI = 0.43 + 0.21 + 0.12 + 0.24 × 2 = 1.24
$$\frac{\text{New CPU time}}{\text{Old CPU time}} = \frac{P \times \text{New CPI} \times 1.15\,T}{P \times \text{Old CPI} \times T} = 1.05$$
Conclusion: Don't make the change.
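The decision can be checked numerically. This sketch uses the frequencies and cycle counts from the slide; the dictionary layout and names are our own.

```python
# (frequency, cycles) per operation class, from the slide
old_mix = {"alu": (0.43, 1), "load": (0.21, 1), "store": (0.12, 2), "branch": (0.24, 2)}
new_mix = dict(old_mix, store=(0.12, 1))  # stores now take 1 cycle

def cpi(mix):
    """Average CPI as the frequency-weighted sum of per-class cycle counts."""
    return sum(freq * cycles for freq, cycles in mix.values())

old_cpi = cpi(old_mix)
new_cpi = cpi(new_mix)
cycle_time_factor = 1.15  # the new clock period is 15% longer

# Instruction count P and the old cycle time T cancel in the ratio.
time_ratio = (new_cpi * cycle_time_factor) / old_cpi

print(f"old CPI = {old_cpi:.2f}, new CPI = {new_cpi:.2f}")
print(f"new time / old time = {time_ratio:.2f}")
```

A ratio above 1 means the modified machine is slower: the CPI gain on 12% of the instructions does not pay for a penalty applied to every cycle.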

Example (RISC processor)
Base Machine (Reg / Reg)

| Op | Freq | Cycles | CPI(i) | % Time |
|---|---|---|---|---|
| ALU | 50% | 1 | 0.5 | 23% |
| Load | 20% | 5 | 1.0 | 45% |
| Store | 10% | 3 | 0.3 | 14% |
| Branch | 20% | 2 | 0.4 | 18% |
| Typical mix | | | 2.2 | |

- How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
- How does this compare with using branch prediction to shave a cycle off the branch time?
- What if two ALU instructions could be executed at once?
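One way to work through the three questions is to recompute the CPI for each variant. This is a sketch under stated assumptions: instruction count and clock rate are taken as unchanged, and "two ALU instructions at once" is modelled as halving the effective ALU cycle count, which is our interpretation rather than something the slide specifies.

```python
# (frequency, cycles) per operation class for the base machine
base = {"alu": (0.50, 1), "load": (0.20, 5), "store": (0.10, 3), "branch": (0.20, 2)}

def cpi(mix):
    """Average CPI as the frequency-weighted sum of per-class cycle counts."""
    return sum(freq * cycles for freq, cycles in mix.values())

def speedup(changed):
    """Speedup over the base machine, assuming the same instruction count and clock."""
    return cpi(base) / cpi(changed)

better_cache = dict(base, load=(0.20, 2))    # loads take 2 cycles on average
branch_pred = dict(base, branch=(0.20, 1))   # one cycle shaved off branches
dual_alu = dict(base, alu=(0.50, 0.5))       # assumption: paired ALU ops cost half a cycle each

for name, mix in [("better cache", better_cache),
                  ("branch prediction", branch_pred),
                  ("dual ALU issue", dual_alu)]:
    print(f"{name}: CPI = {cpi(mix):.2f}, speedup = {speedup(mix):.3f}")
```

Under these assumptions the data cache gives the largest gain, which matches the "% Time" column: loads account for the biggest share of execution time.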

Evaluating Instruction Sets?
Design-time metrics:
- Can it be implemented, in how long, at what cost?
- Can it be programmed? Ease of compilation?
Static metrics:
- How many bytes does the program occupy in memory?
Dynamic metrics:
- How many instructions are executed?
- How many bytes does the processor fetch to execute the program?
- How many clocks are required per instruction (CPI)?
- How "lean" a clock is practical?
Best metric: time to execute the program! It is the product of instruction count, CPI, and cycle time.
NOTE: this depends on the instruction set, the processor organization, and compilation techniques.

Corollary: Make The Common Case Fast
All instructions require an instruction fetch; only a fraction require a data fetch/store.
- Optimize instruction access over data access
Programs exhibit locality
- spatial locality
- temporal locality

Corollary: Make The Common Case Fast
Access to small memories is faster
- Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories
Regs. → Cache → Memory → Disk/Tape

Marketing Metrics
Clock frequency: is 3 GHz better than 2 GHz?
- Only relevant for comparing processors from the same family
  » the same architecture
  » the same ISA
- Machines with different instruction sets? (Intel Pentium vs. PowerPC)
- Programs with different instruction mixes?
- Dynamic frequency of instructions
- Uncorrelated with performance

Marketing Metrics
$$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Time} \times 10^6} = \frac{\text{Clock Rate}}{\text{CPI} \times 10^6}$$
- machines with different instruction sets?
- programs with different instruction mixes?
- dynamic frequency of instructions
- uncorrelated with performance
$$\text{MFLOPS} = \frac{\text{FP Operations}}{\text{Time} \times 10^6}$$
- machine dependent
- often not where time is spent
Normalized FP operation weights:

| Operations | Weight |
|---|---|
| add, sub, compare, mult | 1 |
| divide, sqrt | 4 |
| exp, sin, ... | 8 |

Normalized MFLOPS
Not all machines implement the same FP operations
- Cray-1 does not implement divide
- Motorola 68882 does SQRT, SIN, and COS
Not all FP operations are the same
- ADD is much faster than divide
Normalized MFLOPS
- Assign a canonical number of FP operations to a program
$$\text{Normalized MFLOPS} = \frac{\text{Canonical FP operations}}{\text{time} \times 10^6}$$
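A sketch of the normalization, using the operation weights listed on the preceding "Marketing Metrics" slide (1 for add, sub, compare, mult; 4 for divide, sqrt; 8 for exp, sin). The operation counts and the run time below are hypothetical, and the function names are ours.

```python
# Canonical weights from the "Marketing Metrics" slide
WEIGHTS = {"add": 1, "sub": 1, "compare": 1, "mult": 1,
           "divide": 4, "sqrt": 4, "exp": 8, "sin": 8}

def normalized_mflops(op_counts, seconds):
    """Canonical FP operations / (time * 10^6)."""
    canonical_ops = sum(WEIGHTS[op] * n for op, n in op_counts.items())
    return canonical_ops / (seconds * 1e6)

def raw_mflops(op_counts, seconds):
    """Unweighted FP operations / (time * 10^6)."""
    return sum(op_counts.values()) / (seconds * 1e6)

# Hypothetical source-level operation counts and measured run time
ops = {"add": 4_000_000, "mult": 3_000_000, "divide": 500_000, "sin": 250_000}
seconds = 2.0

print(f"raw MFLOPS        = {raw_mflops(ops, seconds):.3f}")
print(f"normalized MFLOPS = {normalized_mflops(ops, seconds):.3f}")
```

Because the canonical count is fixed by the program rather than by the machine, two machines running the same program are compared on the same numerator.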


Benchmarks
Real programs
- Representative of real workload
- The only accurate way to characterize performance
- e.g., gcc, spice, ...
Kernels
- Representative program fragments
- Time-critical excerpts of real programs
- e.g., Livermore loops
Toy benchmarks
- 10-100 lines
- e.g., Sieve, Puzzle, Towers
Synthetic benchmarks
- Attempt to match average frequencies of real workloads
- e.g., Whetstone, Dhrystone

Benchmarking
Reproducible results: must control outside factors
Important factors
- Program input
- Version of program
- Version of compiler
- Optimization level
- Version of operating system
- Amount of memory
- Number and type of disks
- Version of CPU
- Cache configuration

Benchmarking: SPEC
Limitations of de facto benchmarks
Dhrystone
- Synthetic integer benchmark
- Heavy string emphasis
- Optimizing compilers cause MAJOR problems
Whetstone
- Synthetic floating-point benchmark
- Designed to thwart optimization
Linpack
- Floating-point kernel
- DAXPY(): A(I) = B(I) + C * D(I)

SPEC95
Standard Performance Evaluation Corporation
Eighteen application benchmarks (with inputs) reflecting a technical computing workload
- Eight integer: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
- Ten floating-point intensive: tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
Must run with standard compiler flags
- Eliminates special undocumented incantations that may not even generate working code for real programs

Benchmarking: SPEC2000

Integer

| Benchmark | Description |
|---|---|
| Gzip | Compression |
| Vpr | FPGA circuit placement and routing |
| Gcc | C programming language compiler |
| Mcf | Combinatorial optimization |
| Crafty | Game playing: chess |
| Parser | Word processing |
| Eon | Computer visualization |
| Perlbmk | Perl programming language |
| Gap | Group theory |
| Vortex | Object-oriented database |
| Bzip2 | Compression |

Floating Point

| Benchmark | Description |
|---|---|
| Wupwise | Physics: quantum chromodynamics |
| Swim | Shallow water modelling |
| Mgrid | Multigrid solver: 3D potential field |
| Applu | Partial differential equations |
| Mesa | 3D graphics library |
| Galgel | Computational fluid dynamics |
| Art | Image recognition, neural networks |
| Equake | Seismic wave propagation simulation |
| Facerec | Image processing: face recognition |
| Ammp | Computational chemistry |
| Lucas | Number theory / primality testing |
| Fma3d | Finite-element crash simulation |
| Sixtrack | Nuclear physics accelerator design |
| Apsi | Meteorology: pollutant distribution |

Summarizing Results: A Counter-Example
A car goes 30 MPH for the first ten miles and 90 MPH for the second ten miles. What is the car's average speed over the twenty miles?
Wrong answer:
$$\text{Avg Speed} = \frac{30\ \text{MPH} + 90\ \text{MPH}}{2} = 60\ \text{MPH}$$
Correct answer:
$$\text{Avg Speed} = \frac{\text{total distance}}{\text{total time}} = \frac{10\ \text{miles} + 10\ \text{miles}}{(10\ \text{miles} / 30\ \text{MPH}) + (10\ \text{miles} / 90\ \text{MPH})} = \frac{20\ \text{miles}}{(1/3)\ \text{hour} + (1/9)\ \text{hour}} = 45\ \text{MPH}$$
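The counter-example in code, as a minimal sketch; the function name `average_speed` and the `legs` list are illustrative choices.

```python
def average_speed(legs):
    """legs: list of (distance_miles, speed_mph); returns total distance / total time."""
    total_distance = sum(distance for distance, _ in legs)
    total_time = sum(distance / speed for distance, speed in legs)
    return total_distance / total_time

legs = [(10, 30), (10, 90)]  # ten miles at 30 MPH, then ten miles at 90 MPH

naive = sum(speed for _, speed in legs) / len(legs)  # arithmetic mean of the speeds
correct = average_speed(legs)

print(f"arithmetic mean of speeds: {naive:.1f} MPH")
print(f"total distance / total time: {correct:.1f} MPH")
```

The arithmetic mean overstates the result because the car spends three times as long on the slow leg as on the fast one.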

Summarizing Results: Averages
Use the ARITHMETIC mean for times (cycles per instruction):
$$\frac{1}{n}\sum_{i=1}^{n} \text{time}_i$$
Use the HARMONIC mean for rates (MIPS, MFLOPS):
$$\frac{n}{\sum_{i=1}^{n} \frac{1}{\text{rate}_i}}$$
Use the GEOMETRIC mean for ratios (normalized numbers):
$$\left(\prod_{i=1}^{n} \text{ratio}_i\right)^{1/n}$$
Better yet: don't average normalized numbers.

Summarizing Results: A Measure of Time
Property 1: A single-number performance measure for a set of benchmarks expressed in units of time should be directly proportional to the total (weighted) time consumed by the benchmarks.
Property 2: A single-number performance measure for a set of benchmarks expressed as a rate should be inversely proportional to the total (weighted) time consumed by the benchmarks.

Summarizing Results: Which Mean?
$T_i$ = execution time for benchmark $i$
$F_i$ = FP operations for benchmark $i$
$R_i = F_i / T_i$ = rate of benchmark $i$
Average time:
$$A_{mean} = \frac{1}{n}\sum_{i=1}^{n} T_i$$
Average rate:
$$A_{mean} = \frac{1}{n}\sum_{i=1}^{n} R_i$$
This violates Property 2: it is not proportional to the inverse of time.
Use the harmonic mean:
$$H_{mean} = \frac{n}{\sum_{i=1}^{n} \frac{1}{R_i}} = \frac{n}{\sum_{i=1}^{n} \frac{T_i}{F_i}}$$
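A short check of why the harmonic mean is the right summary for rates. The benchmark data are hypothetical, and the sketch assumes every benchmark performs the same number of FP operations, which is the case in which the harmonic mean coincides exactly with total work divided by total time.

```python
from statistics import harmonic_mean, mean

# Hypothetical benchmarks, each performing the same number of FP operations
flops = [100e6, 100e6, 100e6]   # F_i
times = [1.0, 2.0, 4.0]         # T_i in seconds
rates = [f / t for f, t in zip(flops, times)]  # R_i = F_i / T_i

true_rate = sum(flops) / sum(times)  # total work over total time

print(f"arithmetic mean of rates: {mean(rates) / 1e6:.2f} MFLOPS")
print(f"harmonic mean of rates:   {harmonic_mean(rates) / 1e6:.2f} MFLOPS")
print(f"total FLOPs / total time: {true_rate / 1e6:.2f} MFLOPS")
```

The arithmetic mean of the rates is dominated by the fastest benchmark, while the harmonic mean tracks the total time actually consumed, as Property 2 requires.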

Homework 2
Choose a program to evaluate the performance of a PC
- It can be for Linux or Windows
Choose performance metrics for:
- Speed of CPU
- Speed of main memory
- Speed of graphics applications
- Speed of hard disk
Run the performance program on two different computers
- Your assigned PC at the lab
- Your home computer
Compare the results for the two computers and state whether one is faster than the other according to each metric
Three-page report
- Describe the performance program (one page)
- Describe the performance tests and metrics (one page)
- Describe the characteristics of both computers, compare results, and draw conclusions (one page)

Homework 2

| | Computer A | Computer B | Comparison |
|---|---|---|---|
| CPU speed | | | |
| Memory speed | | | |
| Graphics speed | | | |
| HD speed | | | |

Due date: September 19th, 2008.