Computer System. Performance

Similar documents
1.6 Computer Performance

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Introduction To Computer Architecture

Performance, Cost and Amdahl s s Law. Arquitectura de Computadoras

CS61C : Machine Structures

Instructor Information

CPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate:

Computer Science 246. Computer Architecture

Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding effects of underlying architecture

Instructor Information

Quantifying Performance EEC 170 Fall 2005 Chapter 4

Lecture 2: Computer Performance. Assist.Prof.Dr. Gürhan Küçük Advanced Computer Architectures CSE 533

Microarchitecture Overview. Performance

Computer Performance Evaluation: Cycles Per Instruction (CPI)

APPENDIX Summary of Benchmarks

Defining Performance. Performance 1. Which airplane has the best performance? Computer Organization II Ribbens & McQuain.

EECS2021. EECS2021 Computer Organization. EECS2021 Computer Organization. Morgan Kaufmann Publishers September 14, 2016

Microarchitecture Overview. Performance

Impact of Cache Coherence Protocols on the Processing of Network Traffic

Performance evaluation. Performance evaluation. CS/COE0447: Computer Organization. It s an everyday process

UCB CS61C : Machine Structures

Computer Performance. Reread Chapter Quiz on Friday. Study Session Wed Night FB 009, 5pm-6:30pm

What is Good Performance. Benchmark at Home and Office. Benchmark at Home and Office. Program with 2 threads Home program.

Defining Performance. Performance. Which airplane has the best performance? Boeing 777. Boeing 777. Boeing 747. Boeing 747

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 1. Computer Abstractions and Technology

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 1. Computer Abstractions and Technology

UCB CS61C : Machine Structures

CS61C - Machine Structures. Week 6 - Performance. Oct 3, 2003 John Wawrzynek.

Cache Optimization by Fully-Replacement Policy

Chapter 1. Computer Abstractions and Technology. Adapted by Paulo Lopes, IST

CS61C Performance. Lecture 23. April 21, 1999 Dave Patterson (http.cs.berkeley.edu/~patterson)

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX

The bottom line: Performance. Measuring and Discussing Computer System Performance. Our definition of Performance. How to measure Execution Time?

ADVANCED ELECTRONIC SOLUTIONS AVIATION SERVICES COMMUNICATIONS AND CONNECTIVITY MISSION SYSTEMS

CS 110 Computer Architecture

Overview of Today s Lecture: Cost & Price, Performance { 1+ Administrative Matters Finish Lecture1 Cost and Price Add/Drop - See me after class

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

TDT4255 Computer Design. Lecture 1. Magnus Jahre

Chapter 1. Computer Abstractions and Technology. Lesson 2: Understanding Performance

Workloads, Scalability and QoS Considerations in CMP Platforms

Lecture 3: Evaluating Computer Architectures. How to design something:

This Unit. CIS 501 Computer Architecture. As You Get Settled. Readings. Metrics Latency and throughput. Reporting performance

Performance, Power, Die Yield. CS301 Prof Szajda

ECE C61 Computer Architecture Lecture 2 performance. Prof. Alok N. Choudhary.

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

CS61C : Machine Structures

ECE369: Fundamentals of Computer Architecture

Chapter 1. and Technology

The Computer Revolution. Classes of Computers. Chapter 1

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor

Computer Engineering Fall Semester, 2011

Inserting Data Prefetches into Loops in Dynamically Translated Code in IA-32EL. Inserting Prefetches IA-32 Execution Layer - 1

Performance Prediction using Program Similarity

ECE 486/586. Computer Architecture. Lecture # 3

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure

Computing System Fundamentals/Trends + Review of Performance Evaluation and ISA Design

CMSC 611: Advanced Computer Architecture

Instruction Based Memory Distance Analysis and its Application to Optimization

Low-Complexity Reorder Buffer Architecture*

CSE2021 Computer Organization. Computer Abstractions and Technology

SimPoint 3.0: Faster and More Flexible Program Analysis

CS3350B Computer Architecture CPU Performance and Profiling

CS 61C: Great Ideas in Computer Architecture Performance and Floating Point Arithmetic

Chapter 1. The Computer Revolution

Performance of computer systems

Solutions for Chapter 4 Exercises

A Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading

Computer Organization & Assembly Language Programming (CSE 2312)

Chapter-5 Memory Hierarchy Design

ECE 252 / CPS 220 Advanced Computer Architecture I. Administrivia. Instructors and Course Website. Where to Get Answers

Data Hiding in Compiled Program Binaries for Enhancing Computer System Performance

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

Exploiting Streams in Instruction and Data Address Trace Compression

Lecture: Benchmarks, Pipelining Intro. Topics: Performance equations wrap-up, Intro to pipelining

Decoupled Zero-Compressed Memory

The Von Neumann Computer Model

5DV118 Computer Organization and Architecture Umeå University Department of Computing Science Stephen J. Hegner. Topic 1: Introduction

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 1. Computer Abstractions and Technology

Rechnerstrukturen

Response Time and Throughput

Outline. What is Performance? Restating Performance Equation Time = Seconds. CPU Performance Factors

CS 61C: Great Ideas in Computer Architecture Performance and Floating-Point Arithmetic

Performance of a Computer (Chapter 4)

Chapter 1. Computer Abstractions and Technology

Computing System Fundamentals/Trends + Review of Performance Evaluation and ISA Design

Three major components of Embedded Systems

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568/668

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 2: Figures of Merit and Evaluation Methodologies

IC220 Slide Set #5B: Performance (Chapter 1: 1.6, )

TDT 4260 lecture 2 spring semester 2015

The Von Neumann Computer Model

Architecture Cloning For PowerPC Processors. Edwin Chan, Raul Silvera, Roch Archambault IBM Toronto Lab Oct 17 th, 2005

Computer Architecture. Chapter 1 Part 2 Performance Measures

Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs

Quiz for Chapter 1 Computer Abstractions and Technology

Using Hardware Vulnerability Factors to Enhance AVF Analysis

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N.

The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation

Transcription:

Computer System Performance Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/ E-mail: viren@ee.iitb.ac.in CP-226:Computer Architecture Lecture 4 (01 Feb 2013)

Performance and Cost Which of the following airplanes has the best performance? Airplane Passengers Range (mi) Speed (mph) Boeing 737-100 101 630 598 Boeing 747 470 4150 610 BAC/Sud Concorde 132 4000 1350 Douglas DC-8-50 146 8720 544 How much faster is the Concorde vs. the 747 How much bigger is the 747 vs. DC-8? 2

Performance and Cost Which computer is fastest? Not so simple Scientific simulation FP performance Program development Integer performance Database workload Memory, I/O 3

Performance of Computers Want to buy the fastest computer for what you want to do? Workload is all-important Correct measurement and analysis Want to design the fastest computer for what the customer wants to pay? Cost is an important criterion 4

Defining Performance What is important to whom? Computer system user Minimize elapsed time for program = time_end time_start Called response time Computer center manager Maximize completion rate = #jobs/second Called throughput 5

Response Time vs. Throughput Is throughput = 1/av. response time? Only if NO overlap Otherwise, throughput > 1/av. response time E.g. a lunch buffet assume 5 entrees Each person takes 2 minutes/entrée Throughput is 1 person every 2 minutes BUT time to fill up tray is 10 minutes Why and what would the throughput be otherwise? 5 people simultaneously filling tray (overlap) Without overlap, throughput = 1/10 6

What is Performance for us? For computer architects CPU time = time spent running a program Intuitively, bigger should be faster, so: Performance = 1/X time, where X is response, CPU execution, etc. Elapsed time = CPU time + I/O wait We will concentrate on CPU time 7

Improve Performance Improve (a) response time or (b) throughput? Faster CPU Helps both (a) and (b) Add more CPUs Helps (b) and perhaps (a) due to less queueing 8

Performance Comparison Machine A is n times faster than machine B iff perf(a)/perf(b) = time(b)/time(a) = n Machine A is x% faster than machine B iff perf(a)/perf(b) = time(b)/time(a) = 1 + x/100 E.g. time(a) = 10s, time(b) = 15s 15/10 = 1.5 => A is 1.5 times faster than B 15/10 = 1.5 => A is 50% faster than B 9

Breaking Down Performance A program is broken into instructions H/W is aware of instructions, not programs At lower level, H/W breaks instructions into cycles Lower level state machines change state every cycle For example: 1GHz Snapdragon runs 1000M cycles/sec, 1 cycle = 1ns 2.5GHz Core i7 runs 2.5G cycles/sec, 1 cycle = 0.25ns 10

Iron Law Time Processor Performance = --------------- Program Instructions Cycles = X X Program Instruction Time Cycle (code size) (CPI) (cycle time) Architecture --> Implementation --> Realization Compiler Designer Processor Designer Chip Designer 11

Iron Law Instructions/Program Instructions executed, not static code size Determined by algorithm, compiler, ISA Cycles/Instruction Determined by ISA and CPU organization Overlap among instructions reduces this term Time/cycle Determined by technology, organization, clever circuit design 12

Our Goal Minimize time which is the product, NOT isolated terms Common error to miss terms while devising optimizations e.g. ISA change to decrease instruction count BUT leads to CPU organization which makes clock slower Bottom line: terms are inter-related 13

Other Metrics MIPS and MFLOPS MIPS = instruction count/(execution time x 10 6 ) = clock rate/(cpi x 10 6 ) But MIPS has serious shortcomings 14

Problems with MIPS E.g. without FP hardware, an FP op may take 50 single-cycle instructions With FP hardware, only one 2-cycle instruction Thus, adding FP hardware: CPI increases (why?) Instructions/program decreases (why?) Total execution time decreases BUT, MIPS gets worse! 50/50 => 2/1 50 => 1 50 => 2 50 MIPS => 2 MIPS 15

Problems with MIPS Ignores program Usually used to quote peak performance Ideal conditions => guaranteed not to exceed! When is MIPS ok? Same compiler, same ISA E.g. same binary running on AMD Phenom, Intel Core i7 Why? Instr/program is constant and can be ignored 16

Other Metrics MFLOPS = FP ops in program/(execution time x 10 6 ) Assuming FP ops independent of compiler and ISA Often safe for numeric codes: matrix size determines # of FP ops/program However, not always safe: Missing instructions (e.g. FP divide) Optimizing compilers Relative MIPS and normalized MFLOPS Adds to confusion 17

Use ONLY Time Rules Beware when reading, especially if details are omitted Beware of Peak Guaranteed not to exceed 18

Iron Law Example Machine A: clock 1ns, CPI 2.0, for program x Machine B: clock 2ns, CPI 1.2, for program x Which is faster and how much? Time/Program = instr/program x cycles/instr x sec/cycle Time(A) = N x 2.0 x 1 = 2N Time(B) = N x 1.2 x 2 = 2.4N Compare: Time(B)/Time(A) = 2.4N/2N = 1.2 So, Machine A is 20% faster than Machine B for this program 19

Iron Law Example Keep clock(a) @ 1ns and clock(b) @2ns For equal performance, if CPI(B)=1.2, what is CPI(A)? Time(B)/Time(A) = 1 = (Nx2x1.2)/(Nx1xCPI(A)) CPI(A) = 2.4 20

Iron Law Example Keep CPI(A)=2.0 and CPI(B)=1.2 For equal performance, if clock(b)=2ns, what is clock(a)? Time(B)/Time(A) = 1 = (N x 2.0 x clock(a))/(n x 1.2 x 2) clock(a) = 1.2ns 21

Which Programs Execution time of what program? Best case your always run the same set of programs Port them and time the whole workload In reality, use benchmarks Programs chosen to measure performance Predict performance of actual workload Saves effort and money Representative? Honest? Benchmarketing 22

How to Average Machine A Machine B Program 1 1 10 Program 2 1000 100 Total 1001 110 One answer: for total execution time, how much faster is B? 9.1x 23

How to Average Another: arithmetic mean (same result) Arithmetic mean of times: AM(A) = 1001/2 = 500.5 AM(B) = 110/2 = 55 500.5/55 = 9.1x n time( i) i= 1 Valid only if programs run equally often, so use weighted arithmetic mean: 1 n n i= 1 ( weight( i) time( i) ) 1 n 24

Other Averages E.g., 30 mph for first 10 miles, then 90 mph for next 10 miles, what is average speed? Average speed = (30+90)/2 WRONG Average speed = total distance / total time = (20 / (10/30 + 10/90)) = 45 mph 25

Harmonic Mean Harmonic mean of rates = Use HM if forced to start and end with rates (e.g. reporting MIPS or MFLOPS) Why? n i= 1 rate( n Rate has time in denominator Mean should be proportional to inverse of sums of time (not sum of inverses) See: J.E. Smith, Characterizing computer performance with a single number, CACM Volume 31, Issue 10 (October 1988), pp. 1202-1206. n 1 ) 26

Dealing with Ratios Machine A Machine B Program 1 1 10 Program 2 1000 100 Total 1001 110 If we take ratios with respect to machine A Machine A Machine B Program 1 1 10 Program 2 1 0.1 27

Dealing with Ratios Average for machine A is 1, average for machine B is 5.05 If we take ratios with respect to machine B Machine A Machine B Program 1 0.1 1 Program 2 10 1 Average 5.05 1 Can t both be true!!! Don t use arithmetic mean on ratios! 28

Geometric Mean Use geometric mean for ratios Geometric mean of ratios = Independent of reference machine In the example, GM for machine a is 1, for machine B is also 1 Normalized with respect to either machine n n ratio( i) i=1 29

But GM of ratios is not proportional to total time AM in example says machine B is 9.1 times faster GM says they are equal If we took total execution time, A and B are equal only if Program 1 is run 100 times more often than program 2 Generally, GM will mispredict for three or more machines 30

Summary Use AM for times Use HM if forced to use rates Use GM if forced to use ratios Best of all, use unnormalized numbers to compute time 31

Benchmarks: SPEC2000 System Performance Evaluation Cooperative Formed in 80s to combat benchmarketing SPEC89, SPEC92, SPEC95, SPEC2000 12 integer and 14 floating-point programs Sun Ultra-5 300MHz reference machine has score of 100 Report GM of ratios to reference machine 32

Benchmarks: SPEC CINT2000 Benchmark 164.gzip 175.vpr 176.gcc 181.mcf 186.crafty 197.parser 252.eon 253.perlbmk 254.gap 255.vortex 256.bzip2 300.twolf Description Compression FPGA place and route C compiler Combinatorial optimization Chess Word processing, grammatical analysis Visualization (ray tracing) PERL script execution Group theory interpreter Object-oriented database Compression Place and route simulator 33

Benchmarks: SPEC CFP2000 Benchmark 168.wupwise 171.swim 172.mgrid 173.applu 177.mesa 178.galgel 179.art 183.equake 187.facerec 188.ammp 189.lucas 191.fma3d 200.sixtrack 301.apsi Description Physics/Quantum Chromodynamics Shallow water modeling Multi-grid solver: 3D potential field Parabolic/elliptic PDE 3-D graphics library Computational Fluid Dynamics Image Recognition/Neural Networks Seismic Wave Propagation Simulation Image processing: face recognition Computational chemistry Number theory/primality testing Finite-element Crash Simulation High energy nuclear physics accelerator design Meteorology: Pollutant distribution 34

Thank You 35