Designing for Performance. Patrick Happ Raul Feitosa


Objective
In this section we examine the most common approach to assessing processor and computer system performance. (W. Stallings)

Which one would you choose?
Name                Intel Core i7 4770K   AMD FX 9590
Number of cores     4                     8
Number of threads   8                     8
Frequency           3.5 GHz               4.7 GHz
Turbo frequency     3.9 GHz               5 GHz
Data width          64-bit                64-bit
TDP                 84 W                  220 W
Release             June 2013             July 2013

Outline
- Performance Assessment
- Amdahl's Law

Designing new systems
- Cost
- Size
- Reliability
- Security
- Power consumption
- Performance

CPU operations
- Fetch and decode instructions
- Load and store data
- Logic and arithmetic operations
- Clock pulse: all operations are governed by the system clock

Performance factors
- Clock speed or clock rate ($f$): expressed in multiples of Hz.
- Clock cycle or clock tick: one increment, or pulse, of the clock.
- Clock time, or cycle time ($\tau$): time between consecutive pulses.
$$\tau = \frac{1}{f}$$

Performance factors: clock speed
- Usually multiple clock cycles are required per instruction.
- The amount of work implied by one instruction varies considerably.
- Pipelining allows several instructions to be executed simultaneously.
- So, clock speed is not the whole story!

Performance factors
- $CPI$: average number of cycles per instruction.
- $I_i$: number of machine instructions of type $i$ executed by a program.
- $CPI_i$: number of cycles per instruction of type $i$.
- $I_c$: total number of machine instructions executed by a program.
$$I_c = \sum_{i=1}^{n} I_i \qquad CPI = \frac{\sum_{i=1}^{n} (CPI_i \times I_i)}{I_c}$$

Performance factors
- $T$: processor time needed to execute a program.
$$T = I_c \times CPI \times \tau$$
A refinement yields
$$T = I_c \times [p + (m \times k)] \times \tau$$
where
- $p$ is the number of processor cycles needed to decode and execute the instruction,
- $m$ is the number of memory references needed,
- $k$ is the ratio between memory cycle time and processor cycle time.
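As a rough illustration of the refined formula, the sketch below evaluates $T = I_c \times [p + (m \times k)] \times \tau$ in Python. The function name `processor_time` and all numeric values are hypothetical, chosen only to show how the terms combine; they are not taken from the slides.

```python
def processor_time(instr_count, p, m, k, clock_hz):
    """Return T = I_c * (p + m * k) * tau, with tau = 1 / f."""
    tau = 1.0 / clock_hz                 # cycle time in seconds
    cycles_per_instr = p + m * k         # processor cycles plus memory cycles
    return instr_count * cycles_per_instr * tau

# Hypothetical values: 2 million instructions, p = 2 cycles,
# m = 0.5 memory references per instruction, k = 4, f = 400 MHz.
t = processor_time(instr_count=2_000_000, p=2.0, m=0.5, k=4.0, clock_hz=400e6)
print(f"T = {t * 1e3:.1f} ms")
```

With these assumed values each instruction costs 4 cycles on average, so a slower memory (larger k) or more memory references (larger m) lengthens T even when the clock rate is unchanged.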

Performance factors: instruction execution rate
- Expressed in millions of instructions per second (MIPS) or millions of floating-point operations per second (MFLOPS).
- Heavily dependent on instruction set, compiler design, processor implementation, and cache and memory hierarchy.
$$\text{MIPS rate} = \frac{I_c}{T \times 10^6} = \frac{f}{CPI \times 10^6}$$

Performance factors
System attributes affecting the performance factors ($I_c$, $p$, $m$, $k$, $\tau$):
- Instruction set architecture
- Compiler technology
- VLSI technology
- Processor implementation
- Cache and memory hierarchy


Exercise 1
A program involves the execution of 2 million instructions on a 400 MHz processor. The CPI and the proportion of each of four instruction types are given below. Compute the average CPI.
Instruction type             CPI   Instruction mix
Arithmetic and logic         1     60%
Load/store with cache hit    2     18%
Branch                       4     12%
Load/store with cache miss   8     10%
The average CPI is
$$CPI = (1 \times 0.6) + (2 \times 0.18) + (4 \times 0.12) + (8 \times 0.1) = 2.24$$
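The calculation in Exercise 1 can be checked with a short Python sketch. It uses the table's values and the formulas for CPI, MIPS rate and T given earlier; the variable names are illustrative only.

```python
# (CPI_i, fraction of the instruction mix) for each type, from the exercise table
mix = {
    "arithmetic and logic": (1, 0.60),
    "load/store, cache hit": (2, 0.18),
    "branch": (4, 0.12),
    "load/store, cache miss": (8, 0.10),
}
instr_count = 2_000_000   # I_c
clock_hz = 400e6          # f = 400 MHz

# Weighted average of the per-type CPI values
avg_cpi = sum(cpi * frac for cpi, frac in mix.values())
# MIPS rate = f / (CPI * 10^6)
mips = clock_hz / (avg_cpi * 1e6)
# T = I_c * CPI * tau = I_c * CPI / f
exec_time = instr_count * avg_cpi / clock_hz

print(f"average CPI = {avg_cpi:.2f}")
print(f"MIPS rate   = {mips:.1f}")
print(f"T           = {exec_time * 1e3:.1f} ms")
```

With the table's values this gives an average CPI of 2.24, a rate of about 178.6 MIPS, and T = 11.2 ms.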

Exercise 2
Consider two hardware implementations, M1 and M2, of the same instruction set. There are three instruction classes: F, I and N. The M1 clock rate is 600 MHz. The clock cycle of M2 is 2 ns. The average CPI values for these three instruction classes are:
Class   CPI of M1   CPI of M2   Comments
F       5.0         4.0         floating-point
I       2.0         3.8         integer
N       2.4         2.0         non-arithmetic
a) Compute the peak performance of M1 and M2 in MIPS.
b) If 50% of the instructions executed in a given program belong to class N and the others are equally distributed between F and I, which is the faster machine, and by what factor?

Exercise 2 (continued)
c) A designer of M1 plans to change the design to improve performance. Assuming the instruction mix in (b), which of the options below would be most beneficial?
   1. Use an FPU twice as fast (CPI = 2.5 for class F).
   2. Add a second ALU to reduce the CPI for integer operations to 1.20.
   3. Use faster logic that allows a clock rate of 750 MHz while keeping the same CPI values.
d) The CPI values given above include cache misses, which occur 5 times per 100 executed instructions. Each cache miss incurs a 10-cycle penalty. The fourth redesign option consists of using a larger instruction cache so as to reduce the miss ratio from 5% to 3%. Compare this alternative with the previous options.
e) Characterize the application programs that run faster on M1 than on M2, i.e., discuss the instruction composition of such applications. Hint: let x, y and 1-x-y be the fractions of instructions belonging to classes F, I and N, respectively.
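One possible way to set up parts (b) and (c) is sketched below in Python. It assumes the comparison is made on the average time per instruction, (sum of CPI_i × fraction_i) / f, and converts M2's 2 ns clock cycle to 500 MHz. The helper name `time_per_instruction` is hypothetical; parts (a), (d) and (e) are not covered.

```python
# CPI per instruction class, from the exercise table
cpi_m1 = {"F": 5.0, "I": 2.0, "N": 2.4}
cpi_m2 = {"F": 4.0, "I": 3.8, "N": 2.0}
# Part (b): 50% class N, the rest split equally between F and I
mix = {"F": 0.25, "I": 0.25, "N": 0.50}

f_m1 = 600e6   # M1 clock rate in Hz
f_m2 = 500e6   # M2: a 2 ns clock cycle corresponds to 500 MHz

def time_per_instruction(cpi, mix, clock_hz):
    """Average time per instruction: (sum of CPI_i * fraction_i) / f."""
    return sum(cpi[c] * mix[c] for c in mix) / clock_hz

t1 = time_per_instruction(cpi_m1, mix, f_m1)
t2 = time_per_instruction(cpi_m2, mix, f_m2)
print(f"M1: {t1 * 1e9:.3f} ns, M2: {t2 * 1e9:.3f} ns, M2/M1 = {t2 / t1:.2f}")

# Part (c): the three redesign options for M1
options = {
    "1: faster FPU": time_per_instruction({**cpi_m1, "F": 2.5}, mix, f_m1),
    "2: second ALU": time_per_instruction({**cpi_m1, "I": 1.2}, mix, f_m1),
    "3: 750 MHz clock": time_per_instruction(cpi_m1, mix, 750e6),
}
for name, t in options.items():
    print(f"option {name}: {t * 1e9:.3f} ns per instruction")
```

Under this reading both machines have the same average CPI for the given mix, so the difference in part (b) comes entirely from the clock rates.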

Exercise 3
Consider two codes produced by two compilers for the same source program. The instructions of the machine that will execute these codes can be divided into class A (CPI = 1) and class B (CPI = 2). The number of executed instructions of each class is given below.
Class   Compiler 1   Compiler 2   Comments
A       600M         400M         CPI = 1
B       400M         400M         CPI = 2
a) Compute the execution time for both codes, assuming a clock rate of 1 GHz.
b) Which compiler produces the more efficient code, and by what factor?
c) Which code executes at the higher MIPS rate?
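A minimal Python sketch of Exercise 3, using the instruction counts from the table and the relations T = cycles / f and MIPS rate = I_c / (T × 10^6). The function name `summarize` is illustrative.

```python
clock_hz = 1e9                  # 1 GHz
cpi = {"A": 1, "B": 2}          # cycles per instruction for each class

# Executed instruction counts per class, from the exercise table
code = {
    "compiler 1": {"A": 600e6, "B": 400e6},
    "compiler 2": {"A": 400e6, "B": 400e6},
}

def summarize(counts):
    """Return (execution time in seconds, MIPS rate) for one compiled code."""
    cycles = sum(counts[c] * cpi[c] for c in counts)
    exec_time = cycles / clock_hz
    mips = sum(counts.values()) / (exec_time * 1e6)
    return exec_time, mips

results = {name: summarize(counts) for name, counts in code.items()}
for name, (exec_time, mips) in results.items():
    print(f"{name}: T = {exec_time:.2f} s, MIPS rate = {mips:.1f}")
```

The two metrics disagree here: the code from compiler 2 finishes sooner (1.2 s against 1.4 s), while the code from compiler 1 reports the higher MIPS rate. This is one reason execution time, not MIPS, is the safer basis for comparison.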

Benchmarks: motivation
A high-level language statement:
    A = B + C    /* assume all quantities in main memory */
Compiled code on CISC:
    add mem(b), mem(c), mem(a)
Compiled code on RISC:
    load mem(b), reg(1);
    load mem(c), reg(2);
    add reg(1), reg(2), reg(3);
    store reg(3), mem(a);

Benchmarks: definition
- Programs designed to test performance
- Written in a high-level language, and therefore portable
- Representative of a style of task (systems, numerical, commercial)
- Easily measured and widely distributed
- E.g. Standard Performance Evaluation Corporation (SPEC)
  - CPU2006 for computation-bound applications
    - 17 floating-point programs in C, C++, Fortran
    - 12 integer programs in C, C++
    - 3 million lines of code
  - Other suites: graphics, high performance, web, servers

Averaging results
By running m different benchmarks one obtains a more reliable comparison. The overall instruction execution rate may be expressed by the arithmetic mean $R_A$ or the harmonic mean $R_H$:
$$R_A = \frac{1}{m}\sum_{i=1}^{m} R_i \qquad R_H = \frac{m}{\sum_{i=1}^{m} \frac{1}{R_i}}$$
where $R_i$ is the instruction execution rate of the i-th benchmark.
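The two means can be compared with a small Python sketch. The rates below are hypothetical MIPS values, not measurements from the slides, and the function names are illustrative.

```python
def arithmetic_mean_rate(rates):
    """R_A = (1/m) * sum of R_i."""
    return sum(rates) / len(rates)

def harmonic_mean_rate(rates):
    """R_H = m / sum of (1 / R_i)."""
    return len(rates) / sum(1.0 / r for r in rates)

# Hypothetical instruction execution rates (MIPS) for m = 3 benchmarks
rates = [100.0, 200.0, 400.0]

print(f"arithmetic mean = {arithmetic_mean_rate(rates):.2f} MIPS")
print(f"harmonic mean   = {harmonic_mean_rate(rates):.2f} MIPS")
```

The harmonic mean is never larger than the arithmetic mean, so it gives more weight to the benchmarks on which the machine is slow.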

SPEC speed metric (slide 21)
- SPEC benchmarks are not concerned with instruction execution rates.
- A base runtime is defined for each benchmark using a reference machine.
- The speed metric is the ratio of the reference time to the system run time:
$$r_i = \frac{Tref_i}{Tsut_i}$$
- $Tref_i$: execution time of benchmark i on the reference machine.
- $Tsut_i$: execution time of benchmark i on the system under test.

Averaging SPEC metrics (slide 22)
- Overall performance is calculated by averaging the ratios for all 12 integer benchmarks.
- Use the geometric mean, which is appropriate for normalized numbers such as ratios:
$$r_G = \left(\prod_{i=1}^{n} r_i\right)^{1/n}$$

SPEC rate metric
- Measures the throughput, or rate, of a machine carrying out a number of tasks.
- Multiple copies of the benchmarks are run simultaneously; typically the number of copies equals the number of processors.
- The ratio is calculated as follows:
$$r_i = \frac{N \times Tref_i}{Tsut_i}$$
- $Tref_i$: reference execution time for benchmark i.
- $N$: number of copies running simultaneously.
- $Tsut_i$: elapsed time from the start of execution of all N programs until the completion of all copies of the program.
- Again, a geometric mean is calculated.
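A hedged sketch of the rate metric in Python follows. The timing pairs and N = 4 are made-up values for illustration, and the function names are hypothetical; only the formula N × Tref_i / Tsut_i and the use of the geometric mean come from the slide.

```python
import math

def rate_ratio(t_ref, t_sut, n_copies):
    """Rate ratio for one benchmark: N * Tref_i / Tsut_i."""
    return n_copies * t_ref / t_sut

def geometric_mean(values):
    """n-th root of the product of n values."""
    return math.prod(values) ** (1.0 / len(values))

# Hypothetical (Tref_i, Tsut_i) pairs in seconds, with N = 4 copies
runs = [(9000.0, 1200.0), (8000.0, 1600.0), (6000.0, 1000.0)]
ratios = [rate_ratio(t_ref, t_sut, n_copies=4) for t_ref, t_sut in runs]

print("rate ratios:", ratios)
print(f"rate metric (geometric mean) = {geometric_mean(ratios):.2f}")
```

Note that `math.prod` requires Python 3.8 or later.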

Exercise 4
The table below shows the execution times, in seconds, of two benchmarks on 3 different processors.
Benchmark   X    Y    Z
1           20   10   40
2           40   80   20
a) Compute the arithmetic mean value for each system using X as the reference machine and then using Y as the reference machine.
b) Compute the geometric mean value for each system using X as the reference machine and then using Y as the reference machine. Which is the most realistic result?
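The following Python sketch works through Exercise 4 under one assumption: each benchmark result is normalized as the speed ratio Tref_i / Tsut_i from the SPEC speed metric, so larger values mean a faster system. If the exercise intends normalized times instead, the ratios would be inverted. The function names are illustrative.

```python
import math

# Execution times in seconds for benchmarks 1 and 2, from the exercise table
times = {
    "X": [20, 40],
    "Y": [10, 80],
    "Z": [40, 20],
}

def speed_ratios(system, reference):
    """Per-benchmark ratio Tref_i / Tsut_i (assumed normalization)."""
    return [t_ref / t_sut
            for t_ref, t_sut in zip(times[reference], times[system])]

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    return math.prod(values) ** (1.0 / len(values))

for reference in ("X", "Y"):
    print(f"reference machine: {reference}")
    for system in times:
        r = speed_ratios(system, reference)
        print(f"  {system}: arithmetic = {arithmetic_mean(r):.3f}, "
              f"geometric = {geometric_mean(r):.3f}")
```

With this normalization the arithmetic mean ranks the systems differently depending on the reference machine, whereas the geometric mean gives the same relative result for either reference.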

Which one would you choose?
Intel Core i7 4770K or AMD FX 9590 (same specifications as in the comparison at the start of the presentation).

Ref: CPUBoss (link)

Outline
- Performance Assessment
- Amdahl's Law

Amdahl's Law
Formulated by Gene Amdahl.
- Estimates the potential speedup of a program using multiple processors.
- A fraction $p$ of the code is parallelizable with no scheduling overhead.
- A fraction $(1 - p)$ of the code is inherently serial.
- $T$ is the total execution time of the program on a single processor.
- $N$ is the number of processors that fully exploit the parallel portions of the code.
$$\text{Speedup} = \frac{\text{time to execute program on a single processor}}{\text{time to execute program on } N \text{ parallel processors}} = \frac{T(1-p) + Tp}{T(1-p) + \frac{Tp}{N}} = \frac{1}{(1-p) + \frac{p}{N}}$$

Amdahl's Law: conclusions
- Code needs to be parallelizable and parallelized!
- When $p$ is small, using parallel processors has little effect.
- As $N \to \infty$, the speedup is bounded by $1/(1 - p)$.
- Because the speedup is bounded, adding more processors gives diminishing returns.
$$\text{Speedup} = \frac{1}{(1-p) + \frac{p}{N}}$$
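The diminishing returns described above can be seen numerically. The Python sketch below evaluates the speedup formula for a hypothetical parallel fraction p = 0.9 and increasing processor counts; the value of p and the function name `amdahl_speedup` are assumptions made for illustration.

```python
def amdahl_speedup(p, n):
    """Speedup = 1 / ((1 - p) + p / N) for parallel fraction p on N processors."""
    return 1.0 / ((1.0 - p) + p / n)

p = 0.9   # hypothetical fraction of parallelizable code
for n in (1, 2, 4, 8, 16, 64, 1024):
    print(f"N = {n:4d}: speedup = {amdahl_speedup(p, n):.2f}")

# Upper bound as N grows without limit
print(f"limit 1 / (1 - p) = {1.0 / (1.0 - p):.2f}")
```

Each doubling of N buys less than the previous one, and no processor count can push the speedup past 1/(1 − p).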

Amdahl's Law: Exercise 5
A program spends 60% of its execution time on floating-point operations, 90% of which are executed in parallelizable loops. When the code is parallelized, coordination and synchronization between the parts make the part not involving floating-point operations 10% longer.
a) Find the improvement in execution time achieved by doubling the speed of the floating-point unit.
b) Find the improvement in execution time achieved by using two processors with the same speed and structure as the original one.
c) What would the improvement be if both changes were implemented?

Amdahl's Law: generalization to any design improvement
$$\text{Speedup} = \frac{\text{Execution time before enhancement}}{\text{Execution time after enhancement}}$$
Suppose that the enhancement affects a fraction $p$ of the total runtime before the enhancement, and that the speedup of that fraction is $SU_p$. Then
$$\text{Speedup} = \frac{1}{(1-p) + \frac{p}{SU_p}}$$

Amdahl's Law: generalized example
Suppose that a task spends 40% of its time on floating-point operations. A new FPU has speedup K. Then the overall speedup is
$$\text{Speedup} = \frac{1}{(1 - 0.4) + \frac{0.4}{K}} = \frac{1}{0.6 + \frac{0.4}{K}}$$
So, as K grows, the maximum speedup is $1/0.6 \approx 1.67$.
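To see how quickly the overall speedup approaches its bound, the sketch below evaluates the generalized formula for the 40% floating-point example with several values of K. The chosen K values and the function name `overall_speedup` are illustrative assumptions.

```python
def overall_speedup(fraction, local_speedup):
    """Generalized Amdahl's Law: 1 / ((1 - p) + p / SU_p)."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

fp_fraction = 0.4   # 40% of the time is spent on floating-point operations
for k in (2, 5, 10, 100, 1000):
    print(f"K = {k:4d}: overall speedup = {overall_speedup(fp_fraction, k):.3f}")

# Bound reached only when the floating-point time vanishes entirely
limit = 1.0 / (1.0 - fp_fraction)
print(f"upper bound = {limit:.2f}")
```

Even a tenfold faster FPU yields an overall speedup of only about 1.56, already close to the bound of 1.67.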

Homework: Exercise 6
A processor is used for an application in which 30%, 25% and 10% of the processing time is spent on floating-point addition, multiplication and division, respectively. For a new processor version, 3 alternatives are being considered, all of them involving nearly the same design and implementation cost. Which one should be selected?
a) Redesign the adder, making it twice as fast as the old one.
b) Redesign the multiplier, making it three times as fast as the old one.
c) Redesign the divider, making it ten times as fast as the old one.

Homework: Exercise 7
T is the average processing time of a computer operating at frequency f. Instructions are grouped into 3 types, as shown below.
Instruction type            CPI
Floating-point arithmetic   10
Integer arithmetic          5
Non-arithmetic              2
Typically a program executes the same proportion of instructions from all three types. Compute the MIPS rate and the new execution time if the FPU becomes twice as fast.

Homework: Exercise 8
Let f1 and f2 be the operating frequencies of processors P1 and P2, respectively. Assume that two compilers generate different executable codes for the same source program, which may be executed by P1 as well as by P2. The codes have the characteristics given below:
Instruction type            CPI   Proportion (compiler 1)   Proportion (compiler 2)
Floating-point arithmetic   10    20%                       30%
Integer arithmetic          5     30%                       10%
Non-arithmetic              2     50%                       60%
Compute the ratio f1/f2 for which the processing time of P1 executing code 1 equals the processing time of P2 executing code 2.

Homework: Exercise 9
The code of an application can be separated into a sequential part (S) and a parallelizable part (P). When the application runs on a single processor, the number of executed instructions of type P is twice the number of type S. When the application runs on multiple processors, the number of instructions of type S increases by 10%. Consider the following two configurations:
A) A single-processor machine operating at frequency 2f.
B) A four-processor machine operating at frequency f.
a) Determine the limit ratio r between the CPI of instructions of type P and of type S (r = CPI_P / CPI_S) for which configuration A is faster than configuration B.
b) Compute the upper limit for the speedup that can be achieved using multiple processors without changing the operating frequency.

Homework: Exercise 10
The following table shows the execution times, in seconds, of five different benchmark programs on three machines.
Benchmark   R       M       Z
E           417     244     134
F           83      70      70
H           66      153     135
I           39449   35527   66000
K           772     368     369
a) Compute the speed metric for each processor for each benchmark, normalized to machine R, using the equation given in slide 21 (SPEC speed metric). Then compute the arithmetic mean value.
b) Repeat a) using M as the reference machine. Which machine is the slowest based on each of the preceding two calculations?
c) Repeat the calculations of parts (a) and (b) using the geometric mean, defined in slide 22 (Averaging SPEC metrics). Which machine is the slowest based on the two calculations?

Textbook references
The topics are covered in:
- Stallings: sections 2.2, 2.3 and 2.5
- Tanenbaum: section 8.4
- Parhami: chapter 4
