Lecture: Benchmarks, Pipelining Intro. Topics: Performance equations wrap-up, Intro to pipelining

Similar documents
Advanced Computer Architecture (CS620)

Lecture: Pipelining Basics

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

ECE 486/586. Computer Architecture. Lecture # 3

Lecture 2: Performance

Performance evaluation. Performance evaluation. CS/COE0447: Computer Organization. It s an everyday process

Performance of computer systems

Performance, Power, Die Yield. CS301 Prof Szajda

CS3350B Computer Architecture CPU Performance and Profiling

Solutions for Chapter 4 Exercises

MEASURING COMPUTER TIME. A computer faster than another? Necessity of evaluation computer performance

Spring 2009 Prof. Hyesoon Kim

Transistors and Wires

ECE C61 Computer Architecture Lecture 2 performance. Prof. Alok N. Choudhary.

Course web site: teaching/courses/car. Piazza discussion forum:

HW1 Solutions. Type Old Mix New Mix Cost CPI

CMSC 611: Advanced Computer Architecture

Lecture 2: Pipelining Basics. Today: chapter 1 wrap-up, basic pipelining implementation (Sections A.1 - A.4)

Lecture 5: Pipelining Basics

Lecture 2: Computer Performance. Assist.Prof.Dr. Gürhan Küçük Advanced Computer Architectures CSE 533

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1)

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Quiz for Chapter 1 Computer Abstractions and Technology

Response Time and Throughput

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

Lecture - 4. Measurement. Dr. Soner Onder CS 4431 Michigan Technological University 9/29/2009 1

IC220 Slide Set #5B: Performance (Chapter 1: 1.6, )

Computer Architecture

Lecture: Pipeline Wrap-Up and Static ILP

CS654 Advanced Computer Architecture. Lec 2 - Introduction

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

CpE 442 Introduction to Computer Architecture. The Role of Performance

Reporting Performance Results

Computer Performance. Relative Performance. Ways to measure Performance. Computer Architecture ELEC /1/17. Dr. Hayden Kwok-Hay So

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

GRE Architecture Session

ECE/CS 552: Introduction to Computer Architecture ASSIGNMENT #1 Due Date: At the beginning of lecture, September 22 nd, 2010

Computer Performance. Reread Chapter Quiz on Friday. Study Session Wed Night FB 009, 5pm-6:30pm

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST

Outline Marquette University

CO Computer Architecture and Programming Languages CAPL. Lecture 15

Cache Organizations for Multi-cores

CO Computer Architecture and Programming Languages CAPL. Lecture 18 & 19

L19 Pipelined CPU I 1. Where are the registers? Study Chapter 6 of Text. Pipelined CPUs. Comp 411 Fall /07/07

The bottom line: Performance. Measuring and Discussing Computer System Performance. Our definition of Performance. How to measure Execution Time?

What is Good Performance. Benchmark at Home and Office. Benchmark at Home and Office. Program with 2 threads Home program.

Performance Analysis

CSCI 402: Computer Architectures. Computer Abstractions and Technology (4) Fengguang Song Department of Computer & Information Science IUPUI.

Pipelined CPUs. Study Chapter 4 of Text. Where are the registers?

What is Pipelining? RISC remainder (our assumptions)

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 1. Computer Abstractions and Technology

Page 1. Program Performance Metrics. Program Performance Metrics. Amdahl s Law. 1 seq seq 1

Defining Performance. Performance 1. Which airplane has the best performance? Computer Organization II Ribbens & McQuain.

CSE 548 Computer Architecture. Clock Rate vs IPC. V. Agarwal, M. S. Hrishikesh, S. W. Kechler. D. Burger. Presented by: Ning Chen

15-740/ Computer Architecture Lecture 4: Pipelining. Prof. Onur Mutlu Carnegie Mellon University

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 1. Computer Abstractions and Technology

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

Lecture 4: Instruction Set Architectures. Review: latency vs. throughput

Chapter 1. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

Lecture 12: Instruction Execution and Pipelining. William Gropp

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

ECE331: Hardware Organization and Design

EE382A Lecture 16A: Multi-Core Design Tradeoffs (cont.)

CSC630/CSC730 Parallel & Distributed Computing

SOFT 437. Software Performance Analysis. Ch 7&8:Software Measurement and Instrumentation

Practice Assignment 1

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

Computer Architecture. What is it?

Designing for Performance. Patrick Happ Raul Feitosa

Instructor Information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

COSC 6385 Computer Architecture - Pipelining

CS152 Computer Architecture and Engineering. Lecture 9 Performance Dave Patterson. John Lazzaro. www-inst.eecs.berkeley.

The Role of Performance

Review: latency vs. throughput

Performance Measurement (as seen by the customer)

EECS4201 Computer Architecture

Multiple Issue ILP Processors. Summary of discussions

This Unit. CIS 501 Computer Architecture. As You Get Settled. Readings. Metrics Latency and throughput. Reporting performance

CSE A215 Assembly Language Programming for Engineers

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

Lecture 18: Core Design, Parallel Algos

CS 5803 Introduction to High Performance Computer Architecture: RISC vs. CISC. A.R. Hurson 323 Computer Science Building, Missouri S&T

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding effects of underlying architecture

5DV118 Computer Organization and Architecture Umeå University Department of Computing Science Stephen J. Hegner. Topic 1: Introduction

Lecture 23: Storage Systems. Topics: disk access, bus design, evaluation metrics, RAID (Sections )

Computer Architecture. R. Poss

Introduction to Pipelined Datapath

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)

CS61C : Machine Structures

ECE 154A Introduction to. Fall 2012

CSE502 Lecture 15 - Tue 3Nov09 Review: MidTerm Thu 5Nov09 - Outline of Major Topics

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Transcription:

Lecture: Benchmarks, Pipelining Intro Topics: Performance equations wrap-up, Intro to pipelining 1

Measuring Performance Two primary metrics: wall clock time (response time for a program) and throughput (jobs performed in unit time) To optimize throughput, must ensure that there is minimal waste of resources 2

Benchmark Suites Performance is measured with benchmark suites: a collection of programs that are likely relevant to the user SPEC CPU 2006: cpu-oriented programs (for desktops) SPECweb, TPC: throughput-oriented (for servers) EEMBC: for embedded processors/workloads 3

Summarizing Performance Consider 25 programs from a benchmark set how do we capture the behavior of all 25 programs with a single number? P1 P2 P3 Sys-A 10 8 25 Sys-B 12 9 20 Sys-C 8 8 30 Sum of execution times (AM) Sum of weighted execution times (AM) Geometric mean of execution times (GM) 4

Sum of Weighted Exec Times Example We fixed a reference machine X and ran 4 programs A, B, C, D on it such that each program ran for 1 second The exact same workload (the four programs execute the same number of instructions that they did on machine X) is run on a new machine Y and the execution times for each program are 0.8, 1.1, 0.5, 2 With AM of normalized execution times, we can conclude that Y is 1.1 times slower than X perhaps, not for all workloads, but definitely for one specific workload (where all programs run on the ref-machine for an equal #cycles) 5

GM Example Computer-A Computer-B Computer-C P1 1 sec 10 secs 20 secs P2 1000 secs 100 secs 20 secs Conclusion with GMs: (i) A=B (ii) C is ~1.6 times faster For (i) to be true, P1 must occur 100 times for every occurrence of P2 With the above assumption, (ii) is no longer true Hence, GM can lead to inconsistencies 6

Problem 4 Consider 3 programs from a benchmark set. Assume that system-a is the reference machine. How does the performance of system-c compare against that of system-b (for all 3 metrics)? P1 P2 P3 Sys-A 5 10 20 Sys-B 6 8 18 Sys-C 7 9 14 Sum of execution times (AM) Sum of weighted execution times (AM) Geometric mean of execution times (GM) 7

Problem 4 Consider 3 programs from a benchmark set. Assume that system-a is the reference machine. How does the performance of system-c compare against that of system-b (for all 3 metrics)? P1 P2 P3 S.E.T S.W.E.T GM Sys-A 5 10 20 35 3 10 Sys-B 6 8 18 32 2.9 9.5 Sys-C 7 9 14 30 3 9.6 Relative to B, C provides a speedup of 0.97 (S.W.E.T) or 0.99 (GM) or 1.07 (S.E.T) Relative to B, C improves execution time by -3.3% (S.W.E.T) or -1% (GM) or 6.7% (S.E.T) 8

Summarizing Performance GM: does not require a reference machine, but does not predict performance very well So we multiplied execution times and determined that sys-a is 1.2x faster but on what workload? AM: does predict performance for a specific workload, but that workload was determined by executing programs on a reference machine Every year or so, the reference machine will have to be updated 9

Speedup Vs. Percentage Speedup is a ratio = old exec time / new exec time Improvement, Increase, Decrease usually refer to percentage relative to the baseline = (new perf old perf) / old perf A program ran in 100 seconds on my old laptop and in 70 seconds on my new laptop What is the speedup? (1/70) / (1/100) = 1.42 What is the percentage increase in performance? ( 1/70 1/100 ) / (1/100) = 42% What is the reduction in execution time? 30% 10

CPU Performance Equation Clock cycle time = 1 / clock speed CPU time = clock cycle time x cycles per instruction x number of instructions Influencing factors for each: clock cycle time: technology and pipeline CPI: architecture and instruction set design instruction count: instruction set design and compiler CPI (cycles per instruction) or IPC (instructions per cycle) can not be accurately estimated analytically 11

Problem 5 My new laptop has an IPC that is 20% worse than my old laptop. It has a clock speed that is 30% higher than the old laptop. I m running the same binaries on both machines. What speedup is my new laptop providing? 12

Problem 5 My new laptop has an IPC that is 20% worse than my old laptop. It has a clock speed that is 30% higher than the old laptop. I m running the same binaries on both machines. What speedup is my new laptop providing? Exec time = cycle time * CPI * instrs Perf = clock speed * IPC / instrs Speedup = new perf / old perf = new clock speed * new IPC / old clock speed * old IPC = 1.3 * 0.8 = 1.04 13

An Alternative Perspective - I Each program is assumed to run for an equal number of cycles, so we re fair to each program The number of instructions executed per cycle is a measure of how well a program is doing on a system The appropriate summary measure is sum of IPCs or AM of IPCs = 1.2 instr + 1.8 instr + 0.5 instr cyc cyc cyc This measure implicitly assumes that 1 instr in prog-a has the same importance as 1 instr in prog-b 14

An Alternative Perspective - II Each program is assumed to run for an equal number of instructions, so we re fair to each program The number of cycles required per instruction is a measure of how well a program is doing on a system The appropriate summary measure is sum of CPIs or AM of CPIs = 0.8 cyc + 0.6 cyc + 2.0 cyc instr instr instr This measure implicitly assumes that 1 instr in prog-a has the same importance as 1 instr in prog-b 15

AM and HM Note that AM of IPCs = 1 / HM of CPIs and AM of CPIs = 1 / HM of IPCs So if the programs in a benchmark suite are weighted such that each runs for an equal number of cycles, then AM of IPCs or HM of CPIs are both appropriate measures If the programs in a benchmark suite are weighted such that each runs for an equal number of instructions, then AM of CPIs or HM of IPCs are both appropriate measures 16

AM vs. GM GM of IPCs = 1 / GM of CPIs AM of IPCs represents thruput for a workload where each program runs sequentially for 1 cycle each; but high-ipc programs contribute more to the AM GM of IPCs does not represent run-time for any real workload (what does it mean to multiply instructions?); but every program s IPC contributes equally to the final measure 17

Problem 6 My new laptop has a clock speed that is 30% higher than the old laptop. I m running the same binaries on both machines. Their IPCs are listed below. I run the binaries such that each binary gets an equal share of CPU time. What speedup is my new laptop providing? P1 P2 P3 Old-IPC 1.2 1.6 2.0 New-IPC 1.6 1.6 1.6 18

Problem 6 My new laptop has a clock speed that is 30% higher than the old laptop. I m running the same binaries on both machines. Their IPCs are listed below. I run the binaries such that each binary gets an equal share of CPU time. What speedup is my new laptop providing? P1 P2 P3 AM GM Old-IPC 1.2 1.6 2.0 1.6 1.57 New-IPC 1.6 1.6 1.6 1.6 1.6 AM of IPCs is the right measure. Could have also used GM. Speedup with AM would be 1.3. 19

Building a Car Unpipelined Start and finish a job before moving to the next Jobs Time 20

The Assembly Line Pipelined Break the job into smaller stages A B C A B C Jobs A B C A B C Time 21

Clocks and Latches Stage 1 Stage 2 22

Clocks and Latches Stage 1 L Stage 2 L Clk 23

Title Bullet 24