ECE C61 Computer Architecture Lecture 2 performance. Prof. Alok N. Choudhary.

Similar documents
CpE 442 Introduction to Computer Architecture. The Role of Performance

Computer Architecture s Changing Definition

Overview of Today s Lecture: Cost & Price, Performance { 1+ Administrative Matters Finish Lecture1 Cost and Price Add/Drop - See me after class

Performance, Cost and Amdahl s s Law. Arquitectura de Computadoras

Lecture 2: Computer Performance. Assist.Prof.Dr. Gürhan Küçük Advanced Computer Architectures CSE 533

Lecture 3: Evaluating Computer Architectures. How to design something:

MEASURING COMPUTER TIME. A computer faster than another? Necessity of evaluation computer performance

Introduction to Pipelined Datapath

Course web site: teaching/courses/car. Piazza discussion forum:

Performance evaluation. Performance evaluation. CS/COE0447: Computer Organization. It s an everyday process

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

ECE 486/586. Computer Architecture. Lecture # 3

The bottom line: Performance. Measuring and Discussing Computer System Performance. Our definition of Performance. How to measure Execution Time?

Lecture - 4. Measurement. Dr. Soner Onder CS 4431 Michigan Technological University 9/29/2009 1

Computer Performance Evaluation: Cycles Per Instruction (CPI)

Defining Performance. Performance 1. Which airplane has the best performance? Computer Organization II Ribbens & McQuain.

Instructor Information

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance

CS61C : Machine Structures

UCB CS61C : Machine Structures

Defining Performance. Performance. Which airplane has the best performance? Boeing 777. Boeing 777. Boeing 747. Boeing 747

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 1. Computer Abstractions and Technology

Evaluating Computers: Bigger, better, faster, more?

CS/ECE 752: Advanced Computer Architecture 1. Lecture 1: What is Computer Architecture?

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 1. Computer Abstractions and Technology

Computer Performance. Reread Chapter Quiz on Friday. Study Session Wed Night FB 009, 5pm-6:30pm

The Role of Performance

Reporting Performance Results

Chapter 14 Performance and Processor Design

Computer Architecture. What is it?

UCB CS61C : Machine Structures

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture

CO Computer Architecture and Programming Languages CAPL. Lecture 15

Performance measurement. SMD149 - Operating Systems - Performance and processor design. Introduction. Important trends affecting performance issues

Chapter 1. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

Performance, Power, Die Yield. CS301 Prof Szajda

What is Good Performance. Benchmark at Home and Office. Benchmark at Home and Office. Program with 2 threads Home program.

Performance of computer systems

CS61C - Machine Structures. Week 6 - Performance. Oct 3, 2003 John Wawrzynek.

CPE300: Digital System Architecture and Design

COSC3330 Computer Architecture Lecture 7. Datapath and Performance

The Computer Revolution. Classes of Computers. Chapter 1

Page 1. Program Performance Metrics. Program Performance Metrics. Amdahl s Law. 1 seq seq 1

IC220 Slide Set #5B: Performance (Chapter 1: 1.6, )

15-740/ Computer Architecture Lecture 4: Pipelining. Prof. Onur Mutlu Carnegie Mellon University

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

Quantifying Performance EEC 170 Fall 2005 Chapter 4

Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding effects of underlying architecture

CS61C Performance. Lecture 23. April 21, 1999 Dave Patterson (http.cs.berkeley.edu/~patterson)

Chapter 1. Computer Abstractions and Technology. Lesson 2: Understanding Performance

Computer Architecture

Pipelining. CS701 High Performance Computing

CS 4200/5200 Computer Architecture I

Review: Salient features of MIPS I. CS152 Computer Architecture and Engineering Lecture 3

Response Time and Throughput

Lecture: Benchmarks, Pipelining Intro. Topics: Performance equations wrap-up, Intro to pipelining

Review: latency vs. throughput

The Von Neumann Computer Model

Lecture 4: Instruction Set Architectures. Review: latency vs. throughput

Computer Architecture. Chapter 1 Part 2 Performance Measures

CS61C : Machine Structures

Quiz for Chapter 1 Computer Abstractions and Technology

ECE369: Fundamentals of Computer Architecture

Announcements. 1 week extension on project. 1 week extension on Lab 3 for 141L. Measuring performance Return quiz #1

CS3350B Computer Architecture CPU Performance and Profiling

Engineering 9859 CoE Fundamentals Computer Architecture

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

1.3 Data processing; data storage; data movement; and control.

Designing for Performance. Patrick Happ Raul Feitosa

Copyright 2012, Elsevier Inc. All rights reserved.

PERFORMANCE MEASUREMENT

CMSC 611: Advanced Computer Architecture

EECS2021. EECS2021 Computer Organization. EECS2021 Computer Organization. Morgan Kaufmann Publishers September 14, 2016

Spring 2009 Prof. Hyesoon Kim

Fundamentals of Quantitative Design and Analysis

Types of Workloads. Raj Jain Washington University in Saint Louis Saint Louis, MO These slides are available on-line at:

Computer Performance. Relative Performance. Ways to measure Performance. Computer Architecture ELEC /1/17. Dr. Hayden Kwok-Hay So

15-740/ Computer Architecture Lecture 7: Pipelining. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/26/2011

CSE 141 Summer 2016 Homework 2

Chapter 1. Computer Abstractions and Technology. Adapted by Paulo Lopes, IST

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds?

Write only as much as necessary. Be brief!

Transistors and Wires

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

CS430 Computer Architecture

Outline Marquette University

Computer Architecture. Minas E. Spetsakis Dept. Of Computer Science and Engineering (Class notes based on Hennessy & Patterson)

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 1. Computer Abstractions and Technology

T T T T T T N T T T T T T T T N T T T T T T T T T N T T T T T T T T T T T N.

Lecture 1: Introduction

DEPARTMENT OF ECE IV YEAR ECE EC6009 ADVANCED COMPUTER ARCHITECTURE LECTURE NOTES

Performance. February 12, Howard Huang 1

Review: Moore s Law. EECS 252 Graduate Computer Architecture Lecture 2. Bell s Law new class per decade

Measuring Performance. Speed-up, Amdahl s Law, Gustafson s Law, efficiency, benchmarks

Advanced Computer Architecture (CS620)

This Unit. CIS 501 Computer Architecture. As You Get Settled. Readings. Metrics Latency and throughput. Reporting performance

Transcription:

ECE C61 Computer Architecture Lecture 2 performance Prof Alok N Choudhary choudhar@ecenorthwesternedu 2-1

Today s s Lecture Performance Concepts Response Time Throughput Performance Evaluation Benchmarks Announcements Processor Design Metrics Cycle Time Cycles per Instruction Amdahl s Law Speedup what is important Critical Path 2-2

Performance Concepts 2-3

Performance Perspectives Purchasing perspective Given a collection of machines, which has the - Best performance? - Least cost? - Best performance / cost? Design perspective Faced with design options, which has the Both require - Best performance improvement? - Least cost? - Best performance / cost? basis for comparison metric for evaluation Our goal: understand cost & performance implications of architectural choices 2-4

Two Notions of Performance Plane DC to Paris Speed Passengers Throughput (pmph) Boeing 747 65 hours 610 mph 470 286,700 Concorde 3 hours 1350 mph 132 178,200 Which has higher performance? Execution time (response time, latency, ) Time to do a task Throughput (bandwidth, ) Tasks per unit of time Response time and throughput often are in opposition 2-5

Definitions Performance is typically in units-per-second bigger is better If we are primarily concerned with response time performance = 1 execution_time " X is n times faster than Y" means ExecutionTime ExecutionTime y x = Performance Performance x y = n 2-6

Example Time of Concorde vs Boeing 747? Concord is 1350 mph / 610 mph = 22 times faster = 65 hours / 3 hours Throughput of Concorde vs Boeing 747? Concord is 178,200 pmph / 286,700 pmph Boeing is 286,700 pmph / 178,200 pmph = 062 times faster = 160 times faster Boeing is 16 times ( 60% ) faster in terms of throughput Concord is 22 times ( 120% ) faster in terms of flying time We will focus primarily on execution time for a single job Lots of instructions in a program => Instruction thruput important! 2-7

Benchmarks 2-8

Evaluation Tools Benchmarks, traces and mixes Macrobenchmarks and suites Microbenchmarks Traces Workloads Simulation at many levels ISA, microarchitecture, RTL, gate circuit Trade fidelity for simulation rate (Levels of abstraction) Other metrics Area, clock frequency, power, cost, Analysis Queuing theory, back-of-the-envelope Rules of thumb, basic laws and principles 2-9

Benchmarks Microbenchmarks Measure one performance dimension - Cache bandwidth - Memory bandwidth - Procedure call overhead - FP performance Insight into the underlying performance factors Not a good predictor of application performance Macrobenchmarks Application execution time - Measures overall performance, but on just one application - Need application suite 2-10

Why Do Benchmarks? How we evaluate differences Different systems Changes to a single system Provide a target Benchmarks should represent large class of important programs Improving benchmark performance should help many programs For better or worse, benchmarks shape a field Good ones accelerate progress good target for development Bad benchmarks hurt progress help real programs v sell machines/papers? Inventions that help real programs don t help benchmark 2-11

Popular Benchmark Suites Desktop SPEC CPU2000 - CPU intensive, integer & floating-point applications SPECviewperf, SPECapc - Graphics benchmarks SysMark, Winstone, Winbench Embedded EEMBC - Collection of kernels from 6 application areas Dhrystone - Old synthetic benchmark Servers SPECweb, SPECfs TPC-C - Transaction processing system TPC-H, TPC-R - Decision support system TPC-W - Transactional web benchmark Parallel Computers SPLASH - Scientific applications & kernels Most markets have specific benchmarks for design and marketing 2-12

SPEC CINT2000 2-13

tpc 2-14

Basis of Evaluation Pros representative portable widely used improvements useful in reality Actual Target Workload Full Application Benchmarks Cons very specific non-portable difficult to run, or measure hard to identify cause less representative easy to run, early in design cycle Small Kernel Benchmarks easy to fool identify peak capability and potential bottlenecks Microbenchmarks peak may be a long way from application performance 2-15

Programs to Evaluate Processor Performance (Toy) Benchmarks 10-100 line eg,: sieve, puzzle, quicksort Synthetic Benchmarks attempt to match average frequencies of real workloads eg, Whetstone, dhrystone Kernels Time critical excerpts 2-16

Announcements Website http://wwwecenorthwesternedu/~kcoloma/ece361 Next lecture Instruction Set Architecture 2-17

Processor Design Metrics 2-18

Metrics of Performance Application Seconds per program Programming Language Compiler ISA Datapath Control Function Units Transistors Wires Pins Useful Operations per second (millions) of Instructions per second MIPS (millions) of (FP) operations per second MFLOP/s Megabytes per second Cycles per second (clock rate) 2-19

Organizational Trade-offs Application Programming Language Compiler ISA Datapath Control Function Units Transistors Wires Pins Instruction Mix CPI Cycle Time CPI is a useful design measure relating the Instruction Set Architecture with the Implementation of that architecture, and the program measured 2-20

Processor Cycles Clock Combination Logic Cycle Most contemporary computers have fixed, repeating clock cycles 2-21

CPU Performance 2-22

Cycles Per Instruction (Throughput) Cycles per Instruction CPI = (CPU Time * Clock Rate) / Instruction Count = Cycles / Instruction Count CPU time = Cycle Time! " CPI j! I n j=1 j Instruction Frequency CPI = n! CPIj " Fj where Fj = j=1 I Instruction j Count 2-23

Principal Design Metrics: CPI and Cycle Time Performance = 1 ExecutionTime Performance = CPI 1! CycleTime Performance = 1 Cycles! Instruction Seconds Cycle = Instructions Seconds 2-24

Example Typical Mix Op Freq Cycles CPI ALU 50% 1 5 Load 20% 5 10 Store 10% 3 3 Branch 20% 2 4 How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? Load 20% x 2 cycles = 4 Total CPI 22 16 Relative performance is 22 / 16 = 138 How does this compare with reducing the branch instruction to 1 cycle? Branch 20% x 1 cycle = 2 Total CPI 22 20 22 Relative performance is 22 / 20 = 11 2-25

Summary: Evaluating Instruction Sets and Implementation Design-time metrics: Can it be implemented, in how long, at what cost? Can it be programmed? Ease of compilation? Static Metrics: How many bytes does the program occupy in memory? Dynamic Metrics: How many instructions are executed? How many bytes does the processor fetch to execute the program? How many clocks are required per instruction? How "lean" a clock is practical? Best Metric: Time to execute the program! CPI NOTE: Depends on instructions set, processor organization, and compilation techniques Inst Count Cycle Time 2-26

Amdahl's Law :: Make the Common Case Fast Speedup due to enhancement E: ExTime w/o E Performance w/ E Speedup(E) = -------------------- = --------------------- ExTime w/ E Performance w/o E Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected then, Performance improvement ExTime(with E) = ((1-F) + F/S) X ExTime(without E) is limited by how much the improved feature is used Invest resources where Speedup(with E) = ExTime(without E) time is spent ((1-F) + F/S) X ExTime(without E) 2-27

Marketing Metrics MIPS = Instruction Count / Time * 10^6 = Clock Rate / CPI * 10^6 machines with different instruction sets? programs with different instruction mixes? dynamic frequency of instructions uncorrelated with performance MFLOP/s= FP Operations / Time * 10^6 machine dependent often not where time is spent 2-28

Summary CPU time = Seconds = Instructions x Cycles x Seconds Program Program Instruction Cycle Time is the measure of computer performance! Good products created when have: Good benchmarks Good ways to summarize performance If not good benchmarks and summary, then choice between improving product for real programs vs improving product to get more sales sales almost always wins Remember Amdahl s Law: Speedup is limited by unimproved part of program 2-29

Critical Path 2-30

Range of Design Styles Custom Design Standard Cell Gate Array/FPGA/CPLD Custom Control Logic Custom ALU Custom Register File Gates Standard ALU Standard Registers Gates Routing Channel Gates Routing Channel Gates Performance Design Complexity (Design Time) Compact Longer wires 2-31

Implementation as Combinational Logic + Latch Clock Combinational Logic Latch Mealey Machine Moore Machine 2-32

Clocking Methodology Clock Combination Logic All storage elements are clocked by the same clock edge (but there may be clock skews) The combination logic block s: Inputs are updated at each clock tick All outputs MUST be stable before the next clock tick 2-33

Critical Path & Cycle Time Clock Critical path: the slowest path between any two storage devices Cycle time is a function of the critical path 2-34

Tricks to Reduce Cycle Time Reduce the number of gate levels A B C Pay attention to loading D A B C D One gate driving many gates is a bad idea Avoid using a small gate to drive a long wire Use multiple stages to drive large load INV4x Revise design Clarge INV4x 2-35

Summary Performance Concepts Response Time Throughput Performance Evaluation Benchmarks Processor Design Metrics Cycle Time Cycles per Instruction Amdahl s Law Speedup what is important Critical Path 2-36