CMSC 611: Advanced Computer Architecture

Similar documents
CMSC 611: Advanced Computer Architecture

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding effects of underlying architecture

Lecture - 4. Measurement. Dr. Soner Onder CS 4431 Michigan Technological University 9/29/2009 1

Computer Performance Evaluation: Cycles Per Instruction (CPI)

Performance, Power, Die Yield. CS301 Prof Szajda

Performance of computer systems

Course web site: teaching/courses/car. Piazza discussion forum:

Lecture 2: Computer Performance. Assist.Prof.Dr. Gürhan Küçük Advanced Computer Architectures CSE 533

Computer Systems Performance Analysis and Benchmarking (37-235)

Response Time and Throughput

Defining Performance. Performance 1. Which airplane has the best performance? Computer Organization II Ribbens & McQuain.

The Role of Performance

CpE 442 Introduction to Computer Architecture. The Role of Performance

CS252 S05. Outline. Dynamic Branch Prediction. Static Branch Prediction. Dynamic Branch Prediction. Dynamic Branch Prediction

Engineering 9859 CoE Fundamentals Computer Architecture

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture

CSCI 402: Computer Architectures. Computer Abstractions and Technology (4) Fengguang Song Department of Computer & Information Science IUPUI.

Designing for Performance. Patrick Happ Raul Feitosa

Reporting Performance Results

The bottom line: Performance. Measuring and Discussing Computer System Performance. Our definition of Performance. How to measure Execution Time?

What is Good Performance. Benchmark at Home and Office. Benchmark at Home and Office. Program with 2 threads Home program.

Page 1. Program Performance Metrics. Program Performance Metrics. Amdahl s Law. 1 seq seq 1

MEASURING COMPUTER TIME. A computer faster than another? Necessity of evaluation computer performance

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Performance evaluation. Performance evaluation. CS/COE0447: Computer Organization. It s an everyday process

Instructor Information

Lecture 3: Evaluating Computer Architectures. How to design something:

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

IC220 Slide Set #5B: Performance (Chapter 1: 1.6, )

1.6 Computer Performance

CO Computer Architecture and Programming Languages CAPL. Lecture 15

Alpha AXP Workstation Family Performance Brief - OpenVMS

Lecture: Benchmarks, Pipelining Intro. Topics: Performance equations wrap-up, Intro to pipelining

Quiz for Chapter 1 Computer Abstractions and Technology

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 1. Computer Abstractions and Technology

From CISC to RISC. CISC Creates the Anti CISC Revolution. RISC "Philosophy" CISC Limitations

The Von Neumann Computer Model

1.3 Data processing; data storage; data movement; and control.

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 1. Computer Abstractions and Technology

Dynamic Hardware Prediction. Basic Branch Prediction Buffers. N-bit Branch Prediction Buffers

Solutions for Chapter 4 Exercises

PERFORMANCE MEASUREMENT

Overview of Today s Lecture: Cost & Price, Performance { 1+ Administrative Matters Finish Lecture1 Cost and Price Add/Drop - See me after class

5DV118 Computer Organization and Architecture Umeå University Department of Computing Science Stephen J. Hegner. Topic 1: Introduction

ECE C61 Computer Architecture Lecture 2 performance. Prof. Alok N. Choudhary.

CSE4201 Instruction Level Parallelism. Branch Prediction

ECE 486/586. Computer Architecture. Lecture # 3

Instruction Level Parallelism

Chapter-5 Memory Hierarchy Design

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

This Unit. CIS 501 Computer Architecture. As You Get Settled. Readings. Metrics Latency and throughput. Reporting performance

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

Pipelining. CS701 High Performance Computing

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

ELE 455/555 Computer System Engineering. Section 1 Review and Foundations Class 5 Computer System Performance

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

Computer Performance. Reread Chapter Quiz on Friday. Study Session Wed Night FB 009, 5pm-6:30pm

04S1 COMP3211/9211 Computer Architecture Tutorial 1 (Weeks 02 & 03) Solutions

Computer Architecture

Quantifying Performance EEC 170 Fall 2005 Chapter 4

Hardware Speculation Support

Defining Performance. Performance. Which airplane has the best performance? Boeing 777. Boeing 777. Boeing 747. Boeing 747

CMSC 611: Advanced Computer Architecture

CS 110 Computer Architecture

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Outline Marquette University

Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996

CPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate:

Chapter 1. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

Performance. February 12, Howard Huang 1

CS3350B Computer Architecture CPU Performance and Profiling

Lecture 2: Performance

CS61C Performance. Lecture 23. April 21, 1999 Dave Patterson (http.cs.berkeley.edu/~patterson)

CS61C - Machine Structures. Week 6 - Performance. Oct 3, 2003 John Wawrzynek.

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from David Culler, UC Berkeley CS252, Spr 2002

Lecture Topics. Principle #1: Exploit Parallelism ECE 486/586. Computer Architecture. Lecture # 5. Key Principles of Computer Architecture

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

CSE 141 Computer Architecture Summer Session Lecture 2 Performance, ALU. Pramod V. Argade

CMSC 411 Practice Exam 1 w/answers. 1. CPU performance Suppose we have the following instruction mix and clock cycles per instruction.

IBM Power Systems Performance Report. POWER9, POWER8 and POWER7 Results

History of the Intel 80x86

The Computer Revolution. Classes of Computers. Chapter 1

Transistors and Wires

Vector and Parallel Processors. Amdahl's Law

Assessing and Understanding Performance

GRE Architecture Session

Computer Architecture. What is it?

Fundamentals of Quantitative Design and Analysis

EE108B Lecture 6. Performance and Compilers. Christos Kozyrakis Stanford University

Computer Performance. Relative Performance. Ways to measure Performance. Computer Architecture ELEC /1/17. Dr. Hayden Kwok-Hay So

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Advanced Computer Architecture (CS620)

Computer Architecture. Minas E. Spetsakis Dept. Of Computer Science and Engineering (Class notes based on Hennessy & Patterson)

Review Tomasulo. Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue. Tomasulo Algorithm and Branch Prediction

UCB CS61C : Machine Structures

Figure 1-1. A multilevel machine.

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Transcription:

CMSC 611: Advanced Computer Architecture Performance Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science

Important Equations (so far) Performance = 1 Execution time Speedup = Performance (B) Performance (A) = Time (A) Time (B) CPU time = Instructions Program " Cycles Instruction " Seconds Cycle CPU clock cycles = n " CPI i #Instructions i i=1

Amdahl s s Law The performance enhancement possible with a given improvement is limited by the amount that the improved feature is used Execution time after improvement = Execution time affected by the improvement Amount of improvement + Execution time unaffected A common theme in Hardware design is to make the common case fast Increasing the clock rate would not affect memory access time Using a floating point processing unit does not speed integer ALU operations Example: Floating point instructions improved to run 2X; but only 10% of actual instructions are floating point Exec-Time new = Exec-Time old x (0.9 +.1/2) = 0.95 x Exec-Time old Speedup overall = Exec-Time new / Exec-Time old = 1/0.95 = 1.053

Ahmdal s Law for Speedup Time old = Time old * ( Fraction unchanged + Fraction enhanced ) " Time new = Time old * Fraction unchanged + Fraction enhanced $ # Speedup enhanced % ' & Speedup overall = Time old Time new = = Speedup overall = Time old " Time old * Fraction unchanged + Fraction enhanced $ # 1 Fraction unchanged + Fraction enhanced Speedup enhanced 1 1" Fraction enhanced ( ) + Fraction enhanced Speedup enhanced Speedup enhanced % ' &

Can Hardware-Independent Metrics Predict Performance? KDF9 B5500 ICL 1907 1.1 µs ATLAS Time Instructions executed Code size in instructions Code size in bits 12 11 10 9 8 7 6 5 4 3 2 CDC 6600 NU 1108 The Burroughs B5500 machine is designed specifically for Algol 60 programs Although CDC 6600 s programs are over 3 times as big as those of B5500, yet the CDC machine runs them almost 6 times faster Code size cannot be used as an indication for performance 1

Comparing & Summarizing Performance Computer A Computer B Program 1 (seconds) 1 10 Program 2 (seconds) 1000 100 Total time (seconds) 1001 110 Wrong summary can present a confusing picture A is 10 times faster than B for program 1 B is 10 times faster than A for program 2 Total execution time is a consistent summary measure Relative execution times for the same workload Assuming that programs 1 and 2 are executing for the same number of times on computers A and B CPU Performance (B) CPU Performance (A) = Total execution time (A) Total execution time (B) = 1001 110 = 9.1 Execution time is the only valid and unimpeachable measure of performance

Performance Summary (Cont.) 1 n Arithmetic Mean (AM) =! Execution_ Time i n 1 i = Weighted Arithmetic Mean (WAM) = n! i= 1 w i " Execution_ Time i Where: n is the number of programs executed w i is a weighting factor that indicates the frequency of executing program i n with! w = and 0 w! 1 i = 1 i 1! i Weighted arithmetic means summarize performance while tracking exec. time Never use AM for normalizing time relative to a reference machine Time on A Time on B Norm. to A Norm. to B A B A B Program 1 1 10 1 10 0.1 1 Program 2 1000 100 1 0.1 10 1 AM of normalized time 1 5.05 5.05 1 AM of time 500.5 55 1 0.11 9.1 1

Performance Summary (Cont.) Geometric Mean (GM) = n n! i= 1 Execution_Time_ratio i Where: n is the number of programs executed With Geometric Mean ( X i ) Geometric Mean ( Y ) i = Geometric Mean & $ % X Y i i #! "! Geometric mean is suitable for reporting average normalized execution time Time on A Time on B Norm. to A Norm. to B A B A B Program 1 1 10 1 10 0.1 1 Program 2 1000 100 1 0.1 10 1 GM of time or normalized time 31.62 31.62 1 1 1 1

Performance Benchmarks Many widely-used benchmarks are small programs that have significant locality of instruction and data reference Universal benchmarks can be misleading since hardware and compiler vendors do optimize their design for these programs The best types of benchmarks are real applications since they reflect the end-user interest Architectures might perform well for some applications and poorly for others Compilation can boost performance by taking advantage of architecture-specific features Application-specific compiler optimization are becoming more popular

800 Effect of Compilation 700 600 500 400 300 200 100 0 gcc espresso spice doduc nasa7 li eqntott matrix300 fpppp tomcatv Benchmark Compiler Enhanced compiler App. and arch. specific optimization can dramatically impact performance

The SPEC Benchmarks SPEC stands for System Performance Evaluation Cooperative suite of benchmarks Created by a set of companies to improve the measurement and reporting of CPU performance SPEC2000 is the latest suite that consists of 12 integer (written in C) and 14 floating-point (in Fortran 77) programs Customized SPEC suites have been recently introduced to assess performance of graphics and transaction systems. Since SPEC requires running applications on real hardware, the memory system has a significant effect on performance

Hardware Model number Powerstation 550 CPU 41.67-MHz POWER 4164 FPU (floating point) Integrated Number of CPU 1 Cache size per CPU 64K data/8k instruction Memory 64 MB Disk subsystem Network interface N/A Software 2 400-MB SCSI OS type and revision AIX Ver. 3.1.5 Compiler revision AIX XL C/6000 Ver. 1.1.5 AIX XL Fortran Ver. 2.2 Other software File system type Firmware level Tuning parameters Background load System state Performance Reports None AIX N/A System None None Multi-user (single-user login) Guiding principle is reproducibility (report environment & experiments setup)

The SPEC Benchmarks SPEC ratio = Execution time on SUN SPARCstation 10/40 Execution time on the measure machine Bigger numeric values of the SPEC ratio indicate faster machine

10 9 8 7 6 5 4 3 2 1 SPEC95 for Pentium and Pentium Pro 10 9 8 7 6 5 4 3 2 1 0 50 100 Clock rate (MHz) 150 200 250 Pentium 0 50 100 150 Clock rate (MHz) 200 250 Pentium Pentium Pro Pentium Pro The performance measured may be different on other Pentium-based hardware with different memory system and using different compilers At the same clock rate, the SPECint95 measure shows that Pentium Pro is 1.4-1.5 times faster while the SPECfp95 shows that it is 1.7-1.8 times faster When the clock rate is increased by a certain factor, the processor performance increases by a lower factor

SPECbase CINT2000 Price-Performance Metric Prices reflects those of July 2001 SPEC CINT2000 per $1000 in price Different results are obtained for other benchmarks, e.g. SPEC CFP2000 With the exception of the Sunblade price-performance metrics were consistent with performance

Historic Perspective In early computers most instructions of a machine took the same execution time The measure of performance for old machines was the time required performing an individual operation (e.g. addition) New computers have diverse set of instructions with different execution times The relative frequency of instructions across many programs was calculated The average instruction execution time was measured by multiplying the time of each instruction by its frequency The average instruction execution time was a small step to MIPS that grew in popularity

Using MIPS MIPS = Million of Instructions Per Second one of the simplest metrics valid only in a limited context Instruction count MIPS (native MIPS) = 6 Execution time! 10 There are three problems with MIPS: MIPS specifies the instruction execution rate but not the capabilities of the instructions MIPS varies between programs on the same computer MIPS can vary inversely with performance (see next example) The use of MIPS is simple and intuitive, faster machines have bigger MIPS

Consider the machine with the following three instruction classes and CPI: Now suppose we measure the code for the same program from two different compilers and obtain the following data: Assume that the machine s clock rate is 500 MHz. Which code sequence will execute faster according to MIPS? According to execution time? Answer: Using the formula: Sequence 1: Sequence 2: Example Instruction class CPI for this instruction class A 1 B 2 C 3 Instruction count in (billions) for each Code from instruction class A B C Compiler 1 5 1 1 Compiler 2 10 1 1 CPU clock cycles = "CPI! C CPU clock cycles = (5!1 + 1!2 + 1!3)! 10 9 = 10!10 9 cycles CPU clock cycles = (10!1 + 1!2 + 1!3)! 10 9 = 15!10 9 cycles n i =1 i i

Using the formula: Sequence 1: Sequence 2: Example (Cont.) Exection time = CPU clock cycles Clock rate Execution time = (10!10 9 )/(500!10 6 ) = 20 seconds Execution time = (15!10 9 )/(500!10 6 ) = 30 seconds Therefore compiler 1 generates a faster program Using the formula: MIPS = Instruction count Execution time! 10 6 (5 + 1+ 1)! 10 Sequence 1: MIPS = = 350 6 20! 10 (10 + 1+ 1)! 10 Sequence 2: MIPS = = 400 6 30! 10 Although compiler 2 has a higher MIPS rating, the code from generated by compiler 1 runs faster 9 9