Lecture - 4. Measurement. Dr. Soner Onder CS 4431 Michigan Technological University 9/29/2009 1

Similar documents
CS654 Advanced Computer Architecture. Lec 2 - Introduction

CpE 442 Introduction to Computer Architecture. The Role of Performance

Instructor Information

ECE C61 Computer Architecture Lecture 2 performance. Prof. Alok N. Choudhary.

Lecture 2: Computer Performance. Assist.Prof.Dr. Gürhan Küçük Advanced Computer Architectures CSE 533

MEASURING COMPUTER TIME. A computer faster than another? Necessity of evaluation computer performance

Chapter 1. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

CMSC 611: Advanced Computer Architecture

B649 Graduate Computer Architecture. Lec 1 - Introduction

Course web site: teaching/courses/car. Piazza discussion forum:

Performance, Cost and Amdahl s s Law. Arquitectura de Computadoras

Final Lecture. A few minutes to wrap up and add some perspective

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Lecture Topics. Principle #1: Exploit Parallelism ECE 486/586. Computer Architecture. Lecture # 5. Key Principles of Computer Architecture

MOTIVATION. B649 Parallel Architectures and Programming

Pipelining. CS701 High Performance Computing

CO Computer Architecture and Programming Languages CAPL. Lecture 15

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding effects of underlying architecture

Lecture 3: Evaluating Computer Architectures. How to design something:

Computer Performance Evaluation: Cycles Per Instruction (CPI)

Quiz for Chapter 1 Computer Abstractions and Technology

Engineering 9859 CoE Fundamentals Computer Architecture

Review: Moore s Law. EECS 252 Graduate Computer Architecture Lecture 2. Bell s Law new class per decade

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

Page 1. Program Performance Metrics. Program Performance Metrics. Amdahl s Law. 1 seq seq 1

Performance of computer systems

CPE300: Digital System Architecture and Design

Computer Architecture. R. Poss

Dynamic Control Hazard Avoidance

Defining Performance. Performance 1. Which airplane has the best performance? Computer Organization II Ribbens & McQuain.

Performance, Power, Die Yield. CS301 Prof Szajda

The bottom line: Performance. Measuring and Discussing Computer System Performance. Our definition of Performance. How to measure Execution Time?

Designing for Performance. Patrick Happ Raul Feitosa

Computer Architecture

Introduction to Pipelined Datapath

Overview of Today s Lecture: Cost & Price, Performance { 1+ Administrative Matters Finish Lecture1 Cost and Price Add/Drop - See me after class

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Lecture 4: Instruction Set Architectures. Review: latency vs. throughput

Technology. Giorgio Richelli

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

EECS 322 Computer Architecture Superpipline and the Cache

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies

TDT 4260 lecture 7 spring semester 2015

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

Transistors and Wires

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Fundamentals of Quantitative Design and Analysis

Advanced Computer Architecture

Introduction to Computer Architecture II

COSC4201. Prof. Mokhtar Aboelaze York University

Computer Systems Performance Analysis and Benchmarking (37-235)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568

The Role of Performance

Lecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Topics. Digital Systems Architecture EECE EECE Need More Cache?

ECE 486/586. Computer Architecture. Lecture # 3

Vector and Parallel Processors. Amdahl's Law

History of the Intel 80x86

Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996

Topics. Computer Organization CS Improving Performance. Opportunity for (Easy) Points. Three Generic Data Hazards

Review: latency vs. throughput

CSE 141 Summer 2016 Homework 2

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture 1: Introduction

CS252 S05. Outline. Dynamic Branch Prediction. Static Branch Prediction. Dynamic Branch Prediction. Dynamic Branch Prediction

EECS4201 Computer Architecture

From CISC to RISC. CISC Creates the Anti CISC Revolution. RISC "Philosophy" CISC Limitations

T T T T T T N T T T T T T T T N T T T T T T T T T N T T T T T T T T T T T N.

Execution/Effective address

Copyright 2012, Elsevier Inc. All rights reserved.

CS430 Computer Architecture

DEPARTMENT OF ECE IV YEAR ECE EC6009 ADVANCED COMPUTER ARCHITECTURE LECTURE NOTES

Computer Systems Architecture Spring 2016

Performance evaluation. Performance evaluation. CS/COE0447: Computer Organization. It s an everyday process

Advanced Computer Architecture

Lecture: Benchmarks, Pipelining Intro. Topics: Performance equations wrap-up, Intro to pipelining

15-740/ Computer Architecture Lecture 4: Pipelining. Prof. Onur Mutlu Carnegie Mellon University

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds?

COSC3330 Computer Architecture Lecture 7. Datapath and Performance

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Modern Computer Architecture

Dynamic Hardware Prediction. Basic Branch Prediction Buffers. N-bit Branch Prediction Buffers

COSC 6385 Computer Architecture. Instruction Level Parallelism

Advanced Computer Architecture

ECE473 Computer Architecture and Organization. Pipeline: Control Hazard

EITF20: Computer Architecture Part2.2.1: Pipeline-1

DHANALAKSHMI SRINIVASAN INSTITUTE OF RESEARCH AND TECHNOLOGY. Department of Computer science and engineering

Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Admin

Review: Evaluating Branch Alternatives. Lecture 3: Introduction to Advanced Pipelining. Review: Evaluating Branch Prediction

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture

IC220 Slide Set #5B: Performance (Chapter 1: 1.6, )

Chapter-5 Memory Hierarchy Design

ECE 154A Introduction to. Fall 2012

Transcription:

Lecture - 4 Measurement Dr. Soner Onder CS 4431 Michigan Technological University 9/29/2009 1

Acknowledgements David Patterson Dr. Roger Kieckhafer 9/29/2009 2

Computer Architecture is Design and Analysis Design Architecture is an iterative process: Searching the space of possible designs At all levels of computer systems Analysis Creativity Cost / Performance Analysis Bad Ideas Good Ideas Mediocre Ideas 9/29/2009 3

What Computer Architecture brings to Table Other fields often borrow ideas from architecture Quantitative Principles of Design 1. Take Advantage of Parallelism 2. Principle of Locality 3. Focus on the Common Case 4. Amdahl s Law 5. The Processor Performance quation Careful, quantitative comparisons Define, quantity, and summarize relative performance Define and quantity relative cost Define and quantity dependability Define and quantity power Culture of anticipating and exploiting advances in technology Culture of well-defined interfaces that are carefully implemented and thoroughly checked 9/29/2009 4

1) Taking Advantage of Parallelism Increasing throughput of server computer via multiple processors or multiple disks Detailed HW design Carry lookahead adders uses parallelism to speed up computing sums from linear to logarithmic in number of bits per operand Multiple memory banks searched in parallel in set-associative caches Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence. Not every instruction depends on immediate predecessor executing instructions completely/partially in parallel possible Classic 5-stage pipeline: 1) Instruction Fetch (Ifetch), 2) Register Read (Reg), 3) xecute (ALU), 4) Data Memory Access (Dmem), 5) Register Write (Reg) 9/29/2009 5

Pipelined Instruction xecution Time (clock cycles) I n s t r. Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Ifetch Reg Ifetch ALU Reg DMem ALU Reg DMem Reg O r d e r Ifetch Reg Ifetch ALU Reg DMem ALU Reg DMem Reg 9/29/2009 6

Limits to pipelining Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: attempt to use the same hardware to do two different things at once Data hazards: Instruction depends on result of prior instruction still in the pipeline Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps). Time (clock cycles) I n s t r. O r d e r Ifetch Reg Ifetch ALU Reg Ifetch DMem ALU Reg Ifetch Reg DMem ALU Reg Reg DMem ALU Reg DMem Reg 9/29/2009 7

2) The Principle of Locality The Principle of Locality: Program access a relatively small portion of the address space at any instant of time. Two Different Types of Locality: Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse) Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access) Last 30 years, HW relied on locality for memory perf. P $ MM 9/29/2009 8

3) Focus on the Common Case Common sense guides computer design Since its engineering, common sense is valuable In making a design trade-off, favor the frequent case over the infrequent case.g., Instruction fetch and decode unit used more frequently than multiplier, so optimize it 1st.g., If database server has 50 disks / processor, storage dependability dominates system dependability, so optimize it 1st Frequent case is often simpler and can be done faster than the infrequent case.g., overflow is rare when adding 2 numbers, so improve performance by optimizing more common case of no overflow May slow down overflow, but overall performance improved by optimizing for the normal case What is frequent case and how much performance improved by making case faster > Amdahl s Law 9/29/2009 9

4) Amdahl s Law xtimenew xtimeold 1 Fractionenhanced ( Fraction ) + enhanced Speedup enhanced Speedup overall xtime xtime old new ( 1 Fraction ) enhanced 1 + Fraction Speedup enhanced enhanced Best you could ever hope to do: Speedup maximum 1 1 - Fraction ( ) enhanced 9/29/2009 10

Amdahl s Law example New CPU 10X faster I/O bound server, so 60% time waiting for I/O Speedup overall ( 1 Fraction ) enhanced 1 + Fraction Speedup enhanced enhanced 1 ( 1 0.4 ) + 0.4 10 1 0.64 1.56 Apparently, its human nature to be attracted by 10X faster, vs. keeping in perspective its just 1.6X faster 9/29/2009 11

5) Processor performance equation CPU CPU time time Seconds Seconds Instructions Instructions x x Cycles Cycles x x Seconds Seconds Program Program Program Program Instruction Instruction Cycle Cycle Inst Count CPI Clock Rate Program X Compiler X (X) Inst. Set. X X Organization X X Technology X 9/29/2009 12

Definition: Performance Performance is in units of things per sec bigger is better If we are primarily concerned with response time performance(x) 1 execution_time(x) " X is n times faster than Y" means Performance(X) n Performance(Y) xecution_time(y) xecution_time(x) 9/29/2009 13

Performance: What to measure Usually rely on benchmarks vs. real workloads To increase predictability, collections of benchmark applications, called benchmark suites, are popular SPCCPU: popular desktop benchmark suite CPU only, split between integer and floating point programs SPCint2000 has 12 integer, SPCfp2000 has 14 integer pgms SPCSFS (NFS file server) and SPCWeb (WebServer) added as server benchmarks Transaction Processing Council measures server performance and costperformance for databases TPC-C Complex query for Online Transaction Processing TPC-H models ad hoc decision support TPC-W a transactional web benchmark TPC-App application server and web services benchmark 9/29/2009 14

Relative Performance Metrics Given two design options (X and Y) xecution Time: T X xecution time of a workload run on option X T Y xecution time of a workload run on option Y Performance: Perf x 1/ T X Perf Y 1/ T Y Speedup of X over Y (S x/y ): Perf Perf X S X / Y Y T T Y X 9/29/2009 15

Relative Performance Metrics Percent Improvement in Performance X is n% faster than Y means: n 100 Perf X Perf Perf Y Y 100 Perf Perf X Y Perf Perf Y Y n 100 [ S 1] X / Y xample: Y takes 15 seconds to complete a task, X takes 10 seconds to complete the same task. 9/29/2009 16

Relative Performance Metrics Speedup: S x/y Percent faster n 9/29/2009 17

9/29/2009 18 Revisiting Amdahl s Law Amdahl s Law States: Plugging into the definition of Speedup yields: Note: If F 1.0, Then: ( ) ( ) + + S F F T S F T F T T 1 1 0 0 0 ( ) + S F F T T S 1 1 0 0 / S S S T T 0 / 0

Revisiting Amdahl s Law xample: An enhancement () improves the speed of Floating Point (FLP) instructions by a factor of 2. S 2 S / 0 1 ( 1 F ) + F 2 Where: F the fraction of FLP instructions in Program e.g. General Pgm: If F 0.1, Then S /0 e.g. Scientific Pgm: If F 0.9, Then S /0 9/29/2009 19

xecution Time CPU time Seconds Instructions x Cycles x Seconds Program Program Instruction Cycle xecution Time for a CPU T(N) N CPI τ 1 / f (clock period 1 / frequency) T(N) 9/29/2009 20

Cycles Per Instruction Calculating CPI: F k Fraction of instructions of type k, k {1 m} CPI k CPI for instruction type k CPI Mean CPI for entire program m CPI F k CPI k 1 k Conclusion: Invest Resources where time is Spent! Focus on instruction types for which (F k x CPI k ) is largest 9/29/2009 21

xample: Calculating CPI Typical Instruction Mix Type (k) F k CPI k F k *CPI k (% Time) ALU 50% 1.5 (33%) Load 20% 2.4 (27%) Store 10% 2.2 (13%) Branch 20% 2.4 (27%) CPI 1.5 9/29/2009 22

xample Suppose we have the following measurements: Frequency of FP operations: 25 % Average CPI of FP operations: 4.0 Average CPI of other instructions: 1.33 Frequency of FPSRQ: 2 % CPI of FPSQR: 20 Assume two design alternatives: Decrease the CPI of FPSQR to 2, or decrease the average CPI of all FP operations to 2.5. Use processor performance equation to calculate. Observe that the clock speed and instruction count remain identical. Find original CPI first: CPI 4 x 0.25 + 1.33 x 0.75 2.0 CPI new sqrt CPi original 0.02 x (CPI old FPSQRT CPI of new FPSQRT ) 2.0 0.02 * (20-2) 1.64 CPI new FP (0.75 x 1.33) + (0.25 x 2.5) 1.62 Speed-up CPI original / CPI new FP 2.0 / 1.62 1.23 1.23 9/29/2009 23

Throughput xecution Time is relative Depends on number of operations executed Interested how much work was done in that time W(N) Mean operations executed per unit time W ( N ) T N ( N ) Typical Units MIPS Millions of Instructions per Secon MFLOPS Millions of Floating Point Ops per Second Both MIPS & MFLOPS need time measured in μsec 9/29/2009 24

Throughput Marketing Hype vs. Actual Performance: MIPS: Machines with different instruction sets? Programs with different instruction mixes? Dynamic frequency of instructions Uncorrelated with performance MFLOPs: Machine dependent Often not where time is spent 9/29/2009 25

Programs to valuate Processor Performance Toy Benchmarks 10-100 line program e.g.: sieve, puzzle, quicksort Synthetic Benchmarks Attempt to match average frequencies of real workloads e.g., Whetstone, dhrystone Kernels Time critical excerpts from Real programs e.g., gcc, spice 9/29/2009 26

Benchmarking Games Differing configurations used on two systems Compiler wired to optimize the workload Test written to be biased towards one machine Workload arbitrarily picked Very small benchmarks used Benchmarks manually translated to optimize performance 9/29/2009 27

Common Benchmarking Mistakes Only average behavior represented in test workload Skewness of device demands ignored Loading level controlled inappropriately Caching effects ignored Buffer sizes not appropriate Inaccuracies due to sampling ignored 9/29/2009 28

Common Benchmarking Mistakes Ignoring monitoring overhead Not validating measurements Not ensuring same initial conditions Not measuring transient (cold start) performance Using device utilizations for performance comparisons Collecting too much data but doing too little analysis 9/29/2009 29

Revisiting SPC: System Perf. valuation Cooperative First Round 1989 10 programs yielding a single number Second Round 1992 SpecInt92 (6 integer programs) and SpecFP92 (14 floating point programs) Compiler Flags unlimited. Third Round 1995 Single flag setting for all programs; new set of programs benchmarks useful for 3 years 9/29/2009 30

SPC First Round One program: 99% of time in single line of code New front-end compiler could improve dramatically 800 700 600 500 400 300 200 100 0 gcc epresso spice doduc nasa7 li eqntott matrix300 fpppp tomcatv Benchmark 9/29/2009 31

How to Summarize Performance Arithmetic mean (weighted arithmetic mean) tracks execution time: Harmonic mean (weighted harmonic mean) of rates Normalized execution time Relative to some baseline system Handy for scaling performance. 9/29/2009 32

Summary Relative Performance Metrics Speedup: Perf Perf X S X / Y Y T T Y X Percent Improvement: n X Y [ S 1] 100 / Amdahl s Law Limited impact of enhancements S / 0 T T 0 1 T T 0 F ( F ) + 1 ( 1 F ) + S S F 9/29/2009 33

Summary Absolute Performance Metrics xec time: CPI: T ( N ) N CPI m CPI F k CPI k 1 k τ Throughput W ( N ) T N ( N ) Could be any granularity If measured in Instructions: 1 W ( N) CPI τ f CPI 9/29/2009 34

9/29/2009 35