Performance Evaluation


Performance Evaluation [Ch. 1]

© 2002 Edward F. Gehringer, ECE 463/521 Lecture Notes, Fall 2002. Based on notes from Drs. Tom Conte & Eric Rotenberg of NCSU. Portions adapted from notes by Drs. Mark Hill, David Wood, Guri Sohi, and Jim Smith of U. of Wisconsin.

What's performance?
  of a car?
  of a car wash?
  of a TV?

How should we measure the performance of a computer?
  The response time (or wall-clock time) it takes to complete a task? Why isn't this a good measure?
  The CPU time it takes to complete a task?
  The user CPU time it takes? How is this different from CPU time?

H&P use "system performance" to refer to elapsed time on an unloaded system, and "CPU performance" to refer to user CPU time on an unloaded system.

Amdahl's Law

The improvement in performance ("speedup") is limited by the part you cannot improve.

In the last class, we looked at improving performance by pipelining operations. What was the part we could not improve?

Lecture 2, Advanced Microprocessor Design

Speedup

$$\text{Speedup} = \frac{\text{Performance of task with gizmo}}{\text{Performance of task without gizmo}}$$

or

$$\text{Speedup} = \frac{\text{Execution time of task without gizmo}}{\text{Execution time of task with gizmo}}$$

However, usually we can't speed up (or "enhance") the whole task. So we have to speed up only a part.

  Speedup_enhanced = best-case speedup from gizmo alone
  Fraction_enhanced = fraction of task that gizmo can enhance

$$\text{Speedup}_{overall} = \frac{\text{Execution time of task without gizmo}}{\text{Execution time of task with gizmo}} = \frac{\text{Execution time}_{old}}{\text{Execution time}_{old} \times (1 - \text{Fraction}_{enhanced}) + \text{Execution time}_{old} \times \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$

$$= \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}} = \frac{1}{(1 - f) + \dfrac{f}{s}}$$

Amdahl's Law example

You do simulation of jet plane wings. One run takes one week on your fastest computer.

You get this ad in your mailbox: The Acme Hyperbole is the largest supercomputer ever built; it has 100,000 processors (great!)

It costs $1 bazillion (not so great).

Now, 1 week is about 600,000 sec, so you could run a simulation in 6 seconds, right?

Well, not all of a program can be done at the same time.
  Data dependences: x = ( ... ), followed by ( ... ) = x * y
  Control dependences: if xxx then yyy else zzz

Say 80% of your program is parallelizable (pretty good). How fast would your program finish?

  Speedup_enhanced = 100,000
  Fraction_enhanced = 0.8

$$\text{Speedup}_{overall} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}} = \frac{1}{(1 - 0.8) + \dfrac{0.8}{100000}} \approx \frac{1}{0.2} = 5$$

So the program runs approximately 5 times faster, finishing in about 1.4 days (one week divided by 5): not quite as great as one would hope. Worth a bazillion dollars?

Let's take another look at Amdahl's law, from the perspective that not all work is parallelizable. Recall: speedup is limited by the part you cannot improve. The common case matters most.
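The Acme arithmetic can be checked with a few lines of Python. This is only a sketch: the function names `amdahl_speedup` and `speedup_limit` are made up for illustration and are not from the notes, and the run time uses the exact 604,800 seconds in a week rather than the rounded 600,000.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup when a fraction f of the task is sped up by a factor s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


def speedup_limit(f: float) -> float:
    """Upper bound on the overall speedup as s grows without bound: 1 / (1 - f)."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)


SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY  # 604,800 s; the notes round to 600,000

# Acme Hyperbole: 80% of the program is parallelizable across 100,000 processors.
acme = amdahl_speedup(f=0.8, s=100_000)
run_time_days = SECONDS_PER_WEEK / acme / SECONDS_PER_DAY

print(f"Acme speedup: {acme:.4f}")
print(f"Upper bound:  {speedup_limit(0.8):.1f}")
print(f"New run time: {run_time_days:.2f} days")
```

The speedup sits just under the bound 1 / (1 - f) = 5, so buying even more processors would not help; only shrinking the serial 20% would.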

Case 1: Suppose f = 0.95 and s = 1.1. What kind of speedup can we get?

$$s_{overall} = \frac{1}{(1 - 0.95) + \dfrac{0.95}{1.10}} \approx 1.095$$

Case 2: Suppose f = 0.05 and s = 10.

$$s_{overall} = \frac{1}{(1 - 0.05) + \dfrac{0.05}{10}} \approx 1.047$$

Case 3: Suppose f = 0.05, s = ∞.

$$s_{overall} = \frac{1}{(1 - 0.05) + \varepsilon} \approx 1.053$$

A 10% improvement to the common case (Case 1) beats even an infinite speedup of the rare case (Case 3).

Workload selection

What workloads do we use to evaluate performance?

Observations
  A database search does different things from an FFT.
  Hardware good for searching databases isn't good for an FFT.

Solutions
  Ask the users
  Guess
  Standards: benchmark suites
    SPEC89, SPEC92, SPEC95, SPEC2K (for workstations): includes code and inputs
    Transaction Processing Performance Council TPC-A, B, C, D (web/database servers): no code, just a specification

    Perfect Club (for supercomputers): code, but you can rewrite it as long as results are the same

Graveyard of failed metrics

MIPS (million instructions per second)

$$\text{MIPS} = \frac{\text{instructions} / \text{program}}{(\text{time} / \text{program}) \times 10^{6}}$$

  Instruction sets are not the same across different vendors' machines.
  MIPS can be inversely proportional to performance! (Consider FP hardware vs. software emulation.)

MFLOPS (million floating-point operations per second)
  The set of FP instructions is not consistent across machines. (A Pentium has a divide; the Cray C90 supercomputer does not.)
  Integer-only code (e.g., a compiler) has a zero MFLOPS rating.

Peak performance (maximum performance for a synthetic string of instructions)
  Example: The DEC Alpha microprocessor has a peak performance of 1.2 BIPS.
  When compared using benchmarks, the actual rate is closer to 360 to 750 (DEC Alpha) MIPS.

So, what metric should we use? Run time.

Run time is the only unimpeachable measure of performance for processors. But there are many ways to interpret run time.
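Returning briefly to the MIPS pitfall above, a small Python sketch shows how a higher MIPS rating can go with a longer run time. The instruction counts and run times below are invented purely for illustration (they are not measurements of any real machine), and the helper name `mips` is likewise hypothetical.

```python
def mips(instruction_count: float, seconds: float) -> float:
    """MIPS = instruction count / (execution time x 10^6)."""
    return instruction_count / (seconds * 1e6)


# Hypothetical numbers for the SAME floating-point program on two machines.
# With FP hardware: few, relatively slow FP instructions.
hw_instructions, hw_seconds = 20_000_000, 0.2
# With software emulation: each FP operation expands into many fast integer instructions.
emu_instructions, emu_seconds = 600_000_000, 2.0

hw_mips = mips(hw_instructions, hw_seconds)
emu_mips = mips(emu_instructions, emu_seconds)

print(f"FP hardware:        {hw_mips:.0f} MIPS, run time {hw_seconds} s")
print(f"Software emulation: {emu_mips:.0f} MIPS, run time {emu_seconds} s")
```

Under these assumed numbers the emulating machine reports three times the MIPS while taking ten times as long, which is why run time, not MIPS, is the measure to trust.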

[Figure: timeline of the wall-clock time the user sees, divided among System, Program A (benchmark), and Program B (something else), with alternating I/O and Compute phases.]

What we care about: CPU benchmarking cares about these two:
  Single-program compute time
  Fraction of system time due to the single program

Compute time = CPU time

$$\text{CPU time} = \text{clock cycle count} \times \text{cycle time} = \frac{\text{clock cycle count}}{\text{clock rate (MHz)}}$$

Cycles per instruction:

$$\text{CPI} = \frac{\text{clock cycle count}}{\text{instruction count}}$$

So,

$$\text{clock cycle count} = \text{instruction count} \times \text{CPI}$$

And,

$$\text{CPU time} = \text{instruction count} \times \text{CPI} \times \text{cycle time} = IC \times CPI \times CT$$

We can improve CPU time in three ways.

Decrease instruction count (IC)
  Good compiler
  Better software algorithms

Decrease CPI (increase IPC, a.k.a. instruction-level parallelism, or ILP)
  Fancy hardware (e.g., caches, branch prediction, pipelining, superscalar)
  Good compiler

Decrease CT
  Deeper pipelining & really good circuit design
  Technology scaling
  Simple ISA; less aggressive ILP (simple microarchitecture)

The real story on RISC vs. CISC

RISC: Simple instructions
  Takes a lot of them to do anything: increases IC
  Easier to build hardware: decreases CT
  Easier to parallelize: decreases CPI
  Net effect: Need a lot of memory to hold the program, but runs faster if Inc(IC) < Dec(CT) and Dec(CPI).

CISC: Big honking complex instructions
  Takes very few to do anything: decreases IC
  Easier to program by hand
  Harder to build fast hardware: increases CT
  Harder to parallelize: increases CPI
  Net effect: Memory efficient. Runs faster if Dec(IC) > Inc(CT) and Inc(CPI).

Example: the same addition on each kind of machine

  CISC:
    ADD (C), (A), (B)

  RISC:
    LD  r1, (A)
    LD  r2, (B)
    ADD r3, r1, r2
    ST  r3, (C)
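The CPU time equation and the RISC/CISC trade-off above can be sketched in Python. The `Machine` class and every number in it are hypothetical, chosen only to show how a larger IC can be outweighed by smaller CPI and CT; they do not describe any real processor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    instruction_count: float  # IC: instructions executed by the program
    cpi: float                # average cycles per instruction
    cycle_time_ns: float      # CT: clock cycle time in nanoseconds

    def cpu_time_s(self) -> float:
        """CPU time = IC x CPI x CT, converted from nanoseconds to seconds."""
        return self.instruction_count * self.cpi * self.cycle_time_ns * 1e-9


# Hypothetical machines running the same program.
cisc = Machine("CISC-like", instruction_count=1.0e9, cpi=4.0, cycle_time_ns=2.0)
risc = Machine("RISC-like", instruction_count=1.5e9, cpi=1.5, cycle_time_ns=1.0)

# RISC wins when its growth in IC is smaller than its gain in time per instruction.
ic_growth = risc.instruction_count / cisc.instruction_count
per_instruction_gain = (cisc.cpi * cisc.cycle_time_ns) / (risc.cpi * risc.cycle_time_ns)
risc_wins = ic_growth < per_instruction_gain

for m in (cisc, risc):
    print(f"{m.name}: {m.cpu_time_s():.2f} s")
print(f"IC growth {ic_growth:.2f}x vs. per-instruction gain {per_instruction_gain:.2f}x")
```

Because CPU time is a product of three factors, comparing the ratio of ICs against the ratio of CPI x CT is equivalent to comparing the two CPU times directly.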

Retrospective

If memory is expensive, people hand code machines, and compilers are terrible: use CISC.
If memory is inexpensive, no one hand codes, and compilers are terrific: use RISC.

Which computer is faster?

                      Computer A   Computer B   Computer C
  Program P1 (sec)             1           10           20
  Program P2 (sec)          1000          100           20
  Total time (sec)          1001          110           40

  A is 10x faster than B for P1
  B is 10x faster than A for P2
  A is 20x faster than C for P1
  C is 50x faster than A for P2
  etc.

Total execution time gives the clearest picture:
  B is 1001/110 = 9.1x faster than A for both programs
  C is 25x faster than A for both programs
  C is 2.75x faster than B for both programs

Which would you buy? (Answer: C is fastest, overall.)

The arithmetic mean of times is a good measure too.

Example of means

                      A      B
  Prog 1              4      2
  Prog 2              4      7
  Harmonic mean       4      3.11
  Arithmetic mean     4      4.5

  (Rates given in instructions per second)

Which is faster, A or B?

Consider running an average instruction from Prog. 1 followed by one from Prog. 2:
  for A: (1/4 + 1/4) = 1/2
  for B: (1/2 + 1/7) = 9/14

A runs the two instructions faster (1/2 < 9/14), thus A is better.

Now look at the harmonic mean (Hmean) vs. the arithmetic mean (Amean).
  Hmean says A has a higher rate than B (4 vs. 3.11), so A is better.
  Amean says B has a higher rate than A (4.5 vs. 4), so B is better, but that's wrong!

If you used the wrong method to combine the numbers, you would buy the slower machine!

Note also that the definition of harmonic mean is just the average of the rates converted to times, then converted back to rates.

Rules

Use the arithmetic mean to combine run times:

$$\bar{x} = \sum_{i=1}^{N} \text{weight}_i \times \text{time}_i \qquad \left(= \frac{1}{N} \sum_{i=1}^{N} \text{time}_i \text{ for equal weights}\right)$$

Use the harmonic mean to combine rates (e.g., IPC), because it actually combines them as times and then converts back to a rate:

$$H = \frac{1}{\sum_{i=1}^{N} \dfrac{\text{weight}_i}{\text{rate}_i}} \qquad \left(= \frac{N}{\sum_{i=1}^{N} \dfrac{1}{\text{rate}_i}} \text{ for equal weights}\right)$$
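The two rules can be tried on the A/B rates from the example of means. This is a minimal Python sketch; the function names and the default of equal weights are assumptions made for illustration, not part of the notes.

```python
from typing import Optional, Sequence


def _weights(n: int, weights: Optional[Sequence[float]]) -> Sequence[float]:
    """Use the given weights, or equal weights 1/n when none are given."""
    return weights if weights is not None else [1.0 / n] * n


def arithmetic_mean(values: Sequence[float],
                    weights: Optional[Sequence[float]] = None) -> float:
    """Weighted arithmetic mean: sum of weight_i * value_i."""
    w = _weights(len(values), weights)
    return sum(wi * v for wi, v in zip(w, values))


def harmonic_mean(rates: Sequence[float],
                  weights: Optional[Sequence[float]] = None) -> float:
    """Weighted harmonic mean: 1 / sum of weight_i / rate_i."""
    w = _weights(len(rates), weights)
    return 1.0 / sum(wi / r for wi, r in zip(w, rates))


# Rates in instructions per second for Prog 1 and Prog 2.
rates_a = [4.0, 4.0]
rates_b = [2.0, 7.0]

# Time to run one instruction from each program (the ground truth).
time_a = sum(1.0 / r for r in rates_a)
time_b = sum(1.0 / r for r in rates_b)

print(f"Time for two instructions: A = {time_a:.3f}, B = {time_b:.3f}")
print(f"Harmonic mean of rates:    A = {harmonic_mean(rates_a):.2f}, B = {harmonic_mean(rates_b):.2f}")
print(f"Arithmetic mean of rates:  A = {arithmetic_mean(rates_a):.2f}, B = {arithmetic_mean(rates_b):.2f}")
```

Only the harmonic mean ranks the machines the same way the actual times do; the arithmetic mean of rates favors B even though B takes longer.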