MEASURING COMPUTER TIME. A computer faster than another? Necessity of evaluation computer performance

Similar documents
ECE C61 Computer Architecture Lecture 2 performance. Prof. Alok N. Choudhary.

Computer Architecture. Chapter 1 Part 2 Performance Measures

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Course web site: teaching/courses/car. Piazza discussion forum:

The bottom line: Performance. Measuring and Discussing Computer System Performance. Our definition of Performance. How to measure Execution Time?

Instructor Information

CpE 442 Introduction to Computer Architecture. The Role of Performance

Chapter 1. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

Lecture - 4. Measurement. Dr. Soner Onder CS 4431 Michigan Technological University 9/29/2009 1

CO Computer Architecture and Programming Languages CAPL. Lecture 15

Computer Performance Evaluation: Cycles Per Instruction (CPI)

Lecture 2: Computer Performance. Assist.Prof.Dr. Gürhan Küçük Advanced Computer Architectures CSE 533

Defining Performance. Performance 1. Which airplane has the best performance? Computer Organization II Ribbens & McQuain.

Page 1. Program Performance Metrics. Program Performance Metrics. Amdahl s Law. 1 seq seq 1

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Reporting Performance Results

Quiz for Chapter 1 Computer Abstractions and Technology

Performance, Cost and Amdahl s s Law. Arquitectura de Computadoras

Engineering 9859 CoE Fundamentals Computer Architecture

Defining Performance. Performance. Which airplane has the best performance? Boeing 777. Boeing 777. Boeing 747. Boeing 747

The Role of Performance

CS3350B Computer Architecture CPU Performance and Profiling

IC220 Slide Set #5B: Performance (Chapter 1: 1.6, )

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Response Time and Throughput

Overview of Today s Lecture: Cost & Price, Performance { 1+ Administrative Matters Finish Lecture1 Cost and Price Add/Drop - See me after class

Lecture 4: Instruction Set Architectures. Review: latency vs. throughput

Pipelining. CS701 High Performance Computing

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds?

Performance, Power, Die Yield. CS301 Prof Szajda

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568

Computer Architecture s Changing Definition

Lecture: Benchmarks, Pipelining Intro. Topics: Performance equations wrap-up, Intro to pipelining

Designing for Performance. Patrick Happ Raul Feitosa

ENGG 1203 Tutorial. Computer Arithmetic (1) Computer Arithmetic (3) Computer Arithmetic (2) Convert the following decimal values to binary:

Computer Performance. Reread Chapter Quiz on Friday. Study Session Wed Night FB 009, 5pm-6:30pm

CMSC 611: Advanced Computer Architecture

Introduction to Pipelined Datapath

Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding effects of underlying architecture

Performance Analysis

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

CS61C : Machine Structures

Computer Architecture

UCB CS61C : Machine Structures

Computer Architecture. Minas E. Spetsakis Dept. Of Computer Science and Engineering (Class notes based on Hennessy & Patterson)

CPE300: Digital System Architecture and Design

Lecture Topics. Principle #1: Exploit Parallelism ECE 486/586. Computer Architecture. Lecture # 5. Key Principles of Computer Architecture

Performance evaluation. Performance evaluation. CS/COE0447: Computer Organization. It s an everyday process

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture

CSE 141 Summer 2016 Homework 2

Computer Performance. Relative Performance. Ways to measure Performance. Computer Architecture ELEC /1/17. Dr. Hayden Kwok-Hay So

Vector and Parallel Processors. Amdahl's Law

CS654 Advanced Computer Architecture. Lec 2 - Introduction

ECE 486/586. Computer Architecture. Lecture # 3

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 1. Computer Abstractions and Technology

پردازش لوله ای و برداری

CMSC 411 Practice Exam 1 w/answers. 1. CPU performance Suppose we have the following instruction mix and clock cycles per instruction.

UCB CS61C : Machine Structures

What is Good Performance. Benchmark at Home and Office. Benchmark at Home and Office. Program with 2 threads Home program.

04S1 COMP3211/9211 Computer Architecture Tutorial 1 (Weeks 02 & 03) Solutions

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Advanced Computer Architecture

Computer Architecture. What is it?

Performance of computer systems

Lecture 3: Evaluating Computer Architectures. How to design something:

CS61C - Machine Structures. Week 6 - Performance. Oct 3, 2003 John Wawrzynek.

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

CS/ECE 752: Advanced Computer Architecture 1. Lecture 1: What is Computer Architecture?

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

From CISC to RISC. CISC Creates the Anti CISC Revolution. RISC "Philosophy" CISC Limitations

T T T T T T N T T T T T T T T N T T T T T T T T T N T T T T T T T T T T T N.

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

CMSC 313 Lecture 27. System Performance CPU Performance Disk Performance. Announcement: Don t use oscillator in DigSim3

Solutions for Chapter 4 Exercises

Evaluating Computers: Bigger, better, faster, more?

COSC3330 Computer Architecture Lecture 7. Datapath and Performance

Copyright 2012, Elsevier Inc. All rights reserved.

ECE331: Hardware Organization and Design

15-740/ Computer Architecture Lecture 4: Pipelining. Prof. Onur Mutlu Carnegie Mellon University

Review: latency vs. throughput

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

GRE Architecture Session

Performance. February 12, Howard Huang 1

Lecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S

Fundamentals of Computer Design

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology

EECS4201 Computer Architecture

Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 1. Computer Abstractions and Technology

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 14 Performance and Processor Design

EE282H: Computer Architecture and Organization. EE282H: Computer Architecture and Organization -- Course Overview

The Von Neumann Computer Model

Instruction Pipelining

Instruction Pipelining

Lecture 1: Introduction

Review: Moore s Law. EECS 252 Graduate Computer Architecture Lecture 2. Bell s Law new class per decade

UNIT I FUNDAMENTALS OF COMPUTER DESIGN

Transcription:

Necessity of evaluation computer performance MEASURING COMPUTER PERFORMANCE For comparing different computer performances User: Interested in reducing the execution time (response time) of a task. Computer centre administrator: throughput The computer user may say a computer is faster when a program runs in less time, while The computer centre manager may say a computer is faster when it completes more jobs in an hour COMPUTER ARCHITECTURE 2 TIME Time is the measure of computer performance The computer that performs the same amount of work in the least time is the fastest. Program execution time is measured in seconds per program. But, since with multiprogramming the CPU works on another program while waiting for I/O CPU time - the time CPU is computing (not including extra-time; e.g. time waiting for I/O or running other programs) A computer faster than another? when a program runs in less time (will say a computer user) the computer user is interested in reducing response time also referred to as execution time or latency. when computer completes more jobs in an hour (will say computer centre manager) the computer centre manager is interested in increasing throughput - the total amount of work done in a given time - sometimes called bandwidth COMPUTER ARCHITECTURE 3 COMPUTER ARCHITECTURE 4

USE accurate terms: Response time, execution time and throughput utilized for evaluate an entire computing work Response time (execution time, latency): how long the computer responds to user input or finishes a program; the sooner the better A user sees the result in 5 second after inputting the keyword to a library database system A simulation program finishes in one hour Throughput: Amount of work done in a given time. Throughput = how much work a computer can finish for a given time; the higher the better A web server serves up to 5 million requests per second Why throughput: a system runs multiple jobs simultaneously Bandwidth and latency for memory performance "A is faster than B" A is n times faster than B will mean COMPUTER ARCHITECTURE 5 COMPUTER ARCHITECTURE 6 "A is faster than B" "A is n% faster than B" Example 1 If machine A runs a program in 10 seconds and machine B runs the same program in 15 seconds, which of the following statements is true? a. A is 50% faster than B b. A is 33% faster than B COMPUTER ARCHITECTURE 7 COMPUTER ARCHITECTURE 8

Example 2 Machine A is n% faster than machine B can be expressed as: A is therefore 50% faster than B COMPUTER ARCHITECTURE 9 Using Cycles to Compute CPU Time Clock Cycle? Logic signal used to determine when state should be updated Ex: When does Register latch output of the adder? It takes a certain amount of time for the values (11, 22) to propagate through the adder and the result appear at the Output. The register cannot latch (store) the output of the adder before the value has been computed COMPUTER ARCHITECTURE 10 Using Cycles to Compute CPU Time CPU time is the time a processor spends executing a piece of software CPU execution time can be computed as: where: NrCk P = CPU clock cycles for a program T CLK = Clock period (clock cycle time) f CK = Clock frequency (clock rate) CPI = average clock cycles per instruction NrI = Number of instructions per program Clock Cycles Per Instruction (CPI) CPI = average clock cycles per instruction CPI is an average of all the instructions executed in a program CPI is useful in comparing two different implementations of the same architecture COMPUTER ARCHITECTURE 11 COMPUTER ARCHITECTURE 12

Example 3 Example 4 CPU clock frequency is 1 Megahertz (clock cycle time is 10-6 sec.) Program takes 4.5 million cycles to execute What s the CPU time? 4,500,000 * 10-6 = 4.5 seconds CPU clock frequency is 500 MHz Program takes 45 million cycles to execute What s the CPU time? 45,000,000 * (1/500,000,000) = 0.09 seconds COMPUTER ARCHITECTURE 13 COMPUTER ARCHITECTURE 14 Influences on CPU Performance Influences on CPU Performance Program Inst Count (NrI) CPI Compiler ( ) Instr. Set Clock Rate (f CLK ) Organisation Technology NrI CPI T CLK Program type Compiler quality ISA (low no. of complex instructions) ISA (high no. of simple instructions) Compiler quality organisation (parallel execution, hardwired control, memory bandwidth) hardware technology organisation COMPUTER ARCHITECTURE 15 COMPUTER ARCHITECTURE 16

Cycles Per Instruction CPI = Average Clock Cycles per Instruction Cycles Per Instruction This form can be used to express CPU time as but we can write where Ii represents number of times instruction i is executed in a program and CPI i represents the number of clock cycles for instruction i. Instruction Frequency" - what is the percentage of type i instructions from the total number of instructions per program? COMPUTER ARCHITECTURE 17 COMPUTER ARCHITECTURE 18 Overall CPI The later form of the CPI calculation multiplies each individual CPI i by the fraction of occurrence in a program Invest Resources where time is Spent! Example 5 A benchmark has 100 instructions: 25 instructions are loads/stores (each takes 2 cycles) 50 instructions are adds (each takes 1 cycle) 25 instructions are square root (each takes 100 cycles) CPI = ( (25 * 2) + (50 * 1) + (25 * 100) ) / 100 = 2600 / 100 = 26.0 COMPUTER ARCHITECTURE 19 COMPUTER ARCHITECTURE 20

Example 6: Calculating CPI MIPS OP Freq CPI i CPI i * F i (%time) ALU 50% 1 0.5 (33%) Load 20% 2 0.4 (27%) Store 10% 2 0.2 (13%) Branch 20% 2 0.4 (27%) 100% 1.5 Good news The rightmost form is very convenient since clock rate is fixed for a machine and CPI is usually a small number MIPS is easy to understand, especially by a customer MIPS matches intuition (Faster machines means bigger MIPS) Bad news MIPS is dependent of the instruction set, making it difficult to compare MIPS of computers with different instruction sets. MIPS varies between programs on the same computer; and most importantly, MIPS can vary inversely to performance COMPUTER ARCHITECTURE 21 COMPUTER ARCHITECTURE 22 Example: MIPS as a meaningless indicator of performance Machine A has a special instruction for performing square root calculations It takes 100 cycles to execute Machine B doesn t have the special instruction -- must perform square root calculations in software using simple instructions (.e.g, Add, Mult, Shift) that each take 1 cycle to execute Only for square root instructions, if f CLK = 1 MHz: Machine A: 1/100 MIPS = 0.01 MIPS Machine B: 1 MIPS COMPUTER ARCHITECTURE 23 MFLOPS Since MFLOPS were intended to measure floating-point performance, they are not applicable outside the range. Based on operations rather than instructions, MFLOPS is intended to be a fair comparison between different machines COMPUTER ARCHITECTURE 24

MFLOPS criticism MFLOPS is not dependable because the set of floating point operations is not consistent across machines MFLOPS rating is dependent on the machine (FP instructions implemented) Some machines have no divide instruction, or trigonometric functions. MFLOPS rating is dependent on the program For example, a program with 100% floating-points adds will have a higher rating than a program with 100% floatingpoint divides. BENCHMARKS Programs to Evaluate Performance Benchmarks can focus on specific aspects of a system Floating point & integer ALU Memory system I/O Operating System COMPUTER ARCHITECTURE 25 COMPUTER ARCHITECTURE 26 BENCHMARKS Patterson & Hennessy describe five types of benchmark programs, listed below in decreasing order of accuracy of prediction: Real applications Modified (or scripted) applications Kernels Toy benchmarks Synthetic benchmarks BENCHMARKS - Real applications Examples are compilers for C, text-processing software like Word, and other applications like Photoshop Real applications have input, output, and options that a user can select when running the program Real applications often encounter portability problems arising from dependences on the operating system or compiler Enhancing portability often means modifying the source and sometimes eliminating some important activity, such as interactive graphics, which tends to be more system dependent. COMPUTER ARCHITECTURE 27 COMPUTER ARCHITECTURE 28

BENCHMARKS-Modified applications Real applications are used as the building blocks for a benchmark, either with modifications to the application or with a script that acts as stimulus to the application Applications are modified for one of two primary reasons: to enhance portability or to focus on one particular aspect of system performance For example, to create a CPU-oriented benchmark, I/O may be removed or restructured to minimize its impact on execution time BENCHMARKS- Kernels Kernels are constructed by extracting small, key pieces from real programs Unlike real programs, no user would run kernel programs. They exist solely to evaluate performance Kernels are best used to isolate performance of individual features of a machine to explain the reasons for differences in performance of real programs. COMPUTER ARCHITECTURE 29 COMPUTER ARCHITECTURE 30 BENCHMARKS - Toy benchmarks - Toy benchmarks are typically between 10 and 100 lines of code and produce a result the user already knows before running the toy program Examples: Sieve of Eratosthenes, Puzzle, Quicksort BENCHMARKS- Synthetic benchmarks Similar in philosophy to kernels, synthetic benchmarks try to match the average frequency of operations and operands of a large set of programs Synthetic benchmarks are even further removed from reality than kernels because kernel code is extracted from real programs, while synthetic code is created artificially to match an average execution profile The most popular synthetic benchmarks: Whetstone (floating point operations) Dhrystone (fixed point operations) COMPUTER ARCHITECTURE 31 COMPUTER ARCHITECTURE 32

SPEC -the Standard Performance Evaluation Corporation SPEC is a non-profit corporation formed to "establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers" http://www.spec.org SPEC develops suites of benchmarks intended to measure computer performance These suites are packaged with source code and tools and are extensively tested for portability before release They are available to the public for a fee covering development and administrative costs By license agreement, SPEC members and customers agree to run and report results as specified in each benchmark suite's documentation. SPEC - Benchmark Suite Main categories of benchmark suites: Desktop benchmarks: CPU, memory, and graphics performance Sever benchmarks: throughput-oriented, I/O and OS intensive Embedded benchmarks: measuring the ability to meet deadline and save power COMPUTER ARCHITECTURE 33 COMPUTER ARCHITECTURE 34 Amdahl s Law The performance gain that can be obtained by improving some portion of a computer can be calculated using Amdahl's Law Amdahl's Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used Amdahl's Law defines the speedup that can be gained by using a particular feature Amdahl s Law Suppose that we can make an enhancement to a machine that will improve performance when it is used Speedup is the ratio: Performance for entire task u sin g the enhancement when possible Speedup = Performance for entire task without u sin g the enhancement Performancenew = Performanceold Entire task Performance u sin g the enhancement when possible Speedup = Entire task performance without u sin g the enhancement Performancenew ExecutionTime = = old Performanceold ExecutionTimenew COMPUTER ARCHITECTURE 35 COMPUTER ARCHITECTURE 36

Amdahl s Law Amdahl s Law t 1 (20s) t 2 (80s) 10x speedup for this component t 11 (2s) t 2 (80s) COMPUTER ARCHITECTURE 37 COMPUTER ARCHITECTURE 38 Amdahl s Law Example 1 Program runs for 100 seconds on a uniprocessor 10% of the program can be parallelized on a multiprocessor (F=0.1) Assume a multiprocessor with 10 processors (n=10) Amdahl s Law Example 2 Floating point instructions improved to run 2X; but only 10% of actual instructions are FP 0.1 ExTime new = ExTimeold 0.9 + = 0. 95 ExTimeold 2 Speedup = 1 ( 1 0.1) + 0.1 10 1 = = 0.01+ 0.9 1 0.91 = 1.1 Speedup overall = 1 = 1.053 0.95 COMPUTER ARCHITECTURE 39 COMPUTER ARCHITECTURE 40

Amdahl s Law Example 3 Suppose we could improve the speed of the CPU in our machine by a factor of five (without affecting I/O performance) for five times the cost. Also assume that the CPU is used 50% of the time, and the rest of the time the CPU is waiting for I/O. If the CPU is one-third of the total cost of the computer, is increasing the CPU speed by a factor of five a good investment from the cost / performance viewpoint? Answer: The speedup obtained is: 1 speedup = = 0.5 0.5 + 5 The new machine will cost 2,33 times the original machine: 1 0.6 = 1.67 2 1 *1+ *5 = 2.33 3 3 Since the cost increase is larger than the performance improvement, this change does not improve cost / performance COMPUTER ARCHITECTURE 41