UCB CS61C : Machine Structures

Similar documents
UCB CS61C : Machine Structures

Last time. Lecture #29 Performance & Parallel Intro

CS61C : Machine Structures

CS61C : Machine Structures

CS61C - Machine Structures. Week 6 - Performance. Oct 3, 2003 John Wawrzynek.

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

CS61C : Machine Structures

Architecture of Parallel Computer Systems - Performance Benchmarking -

A Fast Instruction Set Simulator for RISC-V

PIPELINING AND PROCESSOR PERFORMANCE

CS61C Performance. Lecture 23. April 21, 1999 Dave Patterson (http.cs.berkeley.edu/~patterson)

Computer Architecture. Introduction

Microarchitecture Overview. Performance

CS61C : Machine Structures

CPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate:

EKT 303 WEEK Pearson Education, Inc., Hoboken, NJ. All rights reserved.

SEN361 Computer Organization. Prof. Dr. Hasan Hüseyin BALIK (2 nd Week)

Performance evaluation. Performance evaluation. CS/COE0447: Computer Organization. It s an everyday process

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors

ECE C61 Computer Architecture Lecture 2 performance. Prof. Alok N. Choudhary.

Lightweight Memory Tracing

! CS61C : Machine Structures. Lecture 27 Performance II & Inter-machine Parallelism. !!Instructor Paul Pearce!

1.6 Computer Performance

Overview of Today s Lecture: Cost & Price, Performance { 1+ Administrative Matters Finish Lecture1 Cost and Price Add/Drop - See me after class

Information System Architecture Natawut Nupairoj Ph.D. Department of Computer Engineering, Chulalongkorn University

Footprint-based Locality Analysis

Lecture 3: Evaluating Computer Architectures. How to design something:

Performance, Cost and Amdahl s s Law. Arquitectura de Computadoras

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

Energy Models for DVFS Processors

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

Lecture 2: Computer Performance. Assist.Prof.Dr. Gürhan Küçük Advanced Computer Architectures CSE 533

Pipelining. CS701 High Performance Computing

Energy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012

Chapter 1. Computer Abstractions and Technology. Adapted by Paulo Lopes, IST

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor

ECE 486/586. Computer Architecture. Lecture # 3

Near-Threshold Computing: How Close Should We Get?

This Unit. CIS 501 Computer Architecture. As You Get Settled. Readings. Metrics Latency and throughput. Reporting performance

TDT4255 Computer Design. Lecture 1. Magnus Jahre

Microarchitecture Overview. Performance

NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Computing System Fundamentals/Trends + Review of Performance Evaluation and ISA Design

Computer System. Performance

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

Sandbox Based Optimal Offset Estimation [DPC2]

MEASURING COMPUTER TIME. A computer faster than another? Necessity of evaluation computer performance

Detection of Weak Spots in Benchmarks Memory Space by using PCA and CA

Computer Performance. Reread Chapter Quiz on Friday. Study Session Wed Night FB 009, 5pm-6:30pm

What is Good Performance. Benchmark at Home and Office. Benchmark at Home and Office. Program with 2 threads Home program.

CS3350B Computer Architecture CPU Performance and Profiling

Linux Performance on IBM zenterprise 196

Bias Scheduling in Heterogeneous Multi-core Architectures

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 1. Computer Abstractions and Technology

Open Access Research on the Establishment of MSR Model in Cloud Computing based on Standard Performance Evaluation

UCB CS61C : Machine Structures

Evaluating Computers: Bigger, better, faster, more?

Addressing End-to-End Memory Access Latency in NoC-Based Multicores

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Data Prefetching by Exploiting Global and Local Access Patterns

Chapter 1. and Technology

5DV118 Computer Organization and Architecture Umeå University Department of Computing Science Stephen J. Hegner. Topic 1: Introduction

CS 110 Computer Architecture

A Dynamic Program Analysis to find Floating-Point Accuracy Problems

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 1. Computer Abstractions and Technology

CpE 442 Introduction to Computer Architecture. The Role of Performance

Introduction to Pipelined Datapath

Lecture 33 Caches III What to do on a write hit? Block Size Tradeoff (1/3) Benefits of Larger Block Size

Energy-centric DVFS Controlling Method for Multi-core Platforms

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

Lecture - 4. Measurement. Dr. Soner Onder CS 4431 Michigan Technological University 9/29/2009 1

Perceptron Learning for Reuse Prediction

Quantifying Performance EEC 170 Fall 2005 Chapter 4

UC Berkeley CS61C : Machine Structures

The bottom line: Performance. Measuring and Discussing Computer System Performance. Our definition of Performance. How to measure Execution Time?

CSE2021 Computer Organization. Computer Abstractions and Technology

UC Berkeley CS61C : Machine Structures

CS61C : Machine Structures

UC Berkeley CS61C : Machine Structures

Defining Performance. Performance 1. Which airplane has the best performance? Computer Organization II Ribbens & McQuain.

CO Computer Architecture and Programming Languages CAPL. Lecture 15

Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding effects of underlying architecture

Defining Performance. Performance. Which airplane has the best performance? Boeing 777. Boeing 777. Boeing 747. Boeing 747

UC Berkeley CS61C : Machine Structures

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Potential for hardware-based techniques for reuse distance analysis

CS61C : Machine Structures

Outline. Lecture 40 Hardware Parallel Computing Thanks to John Lazarro for his CS152 slides inst.eecs.berkeley.

Direct-Mapped Cache Terminology. Caching Terminology. TIO Dan s great cache mnemonic. Accessing data in a direct mapped cache

The Role of Performance

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory

Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches

Computer Architecture s Changing Definition

CS61C : Machine Structures

Measuring Performance. Speed-up, Amdahl s Law, Gustafson s Law, efficiency, benchmarks

ISA-Aging. (SHRINK: Reducing the ISA Complexity Via Instruction Recycling) Accepted for ISCA 2015

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories

The Computer Revolution. Classes of Computers. Chapter 1

Performance analysis of Intel Core 2 Duo processor

Transcription:

inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 38 Performance 2008-04-30 Lecturer SOE Dan Garcia How fast is your computer? Every 6 months (Nov/June), the fastest supercomputers in the world face off. IBM s Blue Gene won again, with 65,536 processors. They use the LINPACK benchmark (A x = B). www.research.ibm.com/bluegene/

Why Performance? Faster is better! Purchasing Perspective: given a collection of machines (or upgrade options), which has the best performance? least cost? best performance / cost? Computer Designer Perspective: faced with design options, which has the best performance improvement? least cost? best performance / cost? All require basis for comparison and metric for evaluation! Solid metrics lead to solid progress! CS61C L38 Performance (2)

Two Notions of Performance Plane DC to Paris Top Speed Passengers Throughput (pmph) Boeing 747 6.5 hours 610 mph 470 286,700 BAD/Sud Concorde Which has higher performance? Interested in time to deliver 100 passengers? Interested in delivering as many passengers per day as possible? In a computer, time for one task called Response Time or Execution Time In a computer, tasks per unit time called Throughput or Bandwidth CS61C L38 Performance (3) 1350 3 hours 132 178,200 mph

Definitions Performance is in units of things per sec bigger is better If mostly concerned with response time performance(x) = 1 execution_time(x) F(ast) is n times faster than S(low) means: performance(f) n = = performance(s) execution_time(s) execution_time(f) CS61C L38 Performance (4)

Example of Response Time v. Throughput Time of Concorde vs. Boeing 747? Concord is 6.5 hours / 3 hours = 2.2 times faster Concord is 2.2 times ( 120% ) faster in terms of flying time (response time) Throughput of Boeing vs. Concorde? Boeing 747: 286,700 pmph / 178,200 pmph = 1.6 times faster Boeing is 1.6 times ( 60% ) faster in terms of throughput We will focus primarily on response time. CS61C L38 Performance (5)

Words, Words, Words Will (try to) stick to n times faster ; its less confusing than m % faster As faster means both decreased execution time and increased performance, to reduce confusion we will (and you should) use improve execution time or improve performance CS61C L38 Performance (6)

What is Time? Straightforward definition of time: Total time to complete a task, including disk accesses, memory accesses, I/O activities, operating system overhead,... real time, response time or elapsed time Alternative: just time processor (CPU) is working only on your program (since multiple processes running at same time) CPU execution time or CPU time Often divided into system CPU time (in OS) and user CPU time (in user program) CS61C L38 Performance (7)

How to Measure Time? Real Time Actual time elapsed CPU Time: Computers constructed using a clock that runs at a constant rate and determines when events take place in the hardware These discrete time intervals called clock cycles (or informally clocks or cycles) Length of clock period: clock cycle time (e.g., ½ nanoseconds or ½ ns) and clock rate (e.g., 2 gigahertz, or 2 GHz), which is the inverse of the clock period; use these! CS61C L38 Performance (8)

Measuring Time using Clock Cycles (1/2) CPU execution time for a program Units of [seconds / program] or [s/p] = Clock Cycles for a program x Clock Period Units of [s/p] = [cycles / p] x [s / cycle] = [c/p] x [s/c] Or = Clock Cycles for a program [c / p] Clock Rate [c / s] CS61C L38 Performance (9)

Measuring Time using Clock Cycles (2/2) One way to define clock cycles: Clock Cycles for program [c/p] = Instructions for a program [i/p] (called Instruction Count ) x Average Clock cycles Per Instruction [c/i] (abbreviated CPI ) CPI one way to compare two machines with same instruction set, since Instruction Count would be the same CS61C L38 Performance (10)

Performance Calculation (1/2) CPU execution time for program [s/p] = Clock Cycles for program [c/p] x Clock Cycle Time [s/c] Substituting for clock cycles: CPU execution time for program [s/p] = ( Instruction Count [i/p] x CPI [c/i] ) x Clock Cycle Time [s/c] = Instruction Count x CPI x Clock Cycle Time CS61C L38 Performance (11)

Performance Calculation (2/2) CPU time = Instructions x Cycles x Seconds Program Instruction Cycle CPU time = Instructions x Cycles x Seconds Program Instruction Cycle CPU time = Instructions x Cycles x Seconds CPU time = Program Instruction Cycle Seconds Program Product of all 3 terms: if missing a term, can t predict time, the real measure of performance CS61C L38 Performance (12)

How Calculate the 3 Components? Clock Cycle Time: in specification of computer (Clock Rate in advertisements) Instruction Count: Count instructions in loop of small program Use simulator to count instructions Hardware counter in spec. register CPI: (Pentium II,III,4) Calculate: Execution Time / Clock cycle time Instruction Count Hardware counter in special register (PII,III,4) CS61C L38 Performance (13)

Calculating CPI Another Way First calculate CPI for each individual instruction (add, sub, and, etc.) Next calculate frequency of each individual instruction Finally multiply these two for each instruction and add them up to get final CPI (the weighted sum) CS61C L38 Performance (14)

Example (RISC processor) Op Freq i CPI i Prod (% Time) ALU 50% 1.5 (23%) Load 20% 5 1.0 (45%) Store 10% 3.3 (14%) Branch 20% 2.4 (18%) Instruction Mix 2.2 (Where time spent) What if Branch instructions twice as fast? CS61C L38 Performance (15)

What Programs Measure for Comparison? Ideally run typical programs with typical input before purchase, or before even build machine Called a workload ; For example: Engineer uses compiler, spreadsheet Author uses word processor, drawing program, compression software In some situations its hard to do Don t have access to machine to benchmark before purchase Don t know workload in future CS61C L38 Performance (16)

Benchmarks Obviously, apparent speed of processor depends on code used to test it Need industry standards so that different processors can be fairly compared Companies exist that create these benchmarks: typical code used to evaluate systems Need to be changed every ~5 years since designers could (and do!) target for these standard benchmarks CS61C L38 Performance (17)

Example Standardized Benchmarks (1/2) Standard Performance Evaluation Corporation (SPEC) SPEC CPU2006 CINT2006 12 integer (perl, bzip, gcc, go,...) CFP2006 17 floating-point (povray, bwaves,...) All relative to base machine (which gets 100) Sun Ultra Enterprise 2 w/296 MHz UltraSPARC II They measure System speed (SPECint2006) System throughput (SPECint_rate2006) www.spec.org/osg/cpu2006/ CS61C L38 Performance (18)

Example Standardized Benchmarks (2/2) SPEC Benchmarks distributed in source code Members of consortium select workload 30+ companies, 40+ universities, research labs Compiler, machine designers target benchmarks, so try to change every 5 years SPEC CPU2006: CINT2006 perlbench C Perl Programming language bzip2 C Compression gcc C C Programming Language Compiler mcf C Combinatorial Optimization gobmk C Artificial Intelligence : Go hmmer C Search Gene Sequence sjeng C Artificial Intelligence : Chess libquantum C Simulates quantum computer h264ref C H.264 Video compression omnetpp C++ Discrete Event Simulation astar C++ Path-finding Algorithms xalancbmk C++ XML Processing CS61C L38 Performance (19) CFP2006 bwaves Fortran Fluid Dynamics gamess Fortran Quantum Chemistry milc C Physics / Quantum Chromodynamics zeusmp Fortran Physics / CFD gromacs C,Fortran Biochemistry / Molecular Dynamics cactusadm C,Fortran Physics / General Relativity leslie3d Fortran Fluid Dynamics namd C++ Biology / Molecular Dynamics dealll C++ Finite Element Analysis soplex C++ Linear Programming, Optimization povray C++ Image Ray-tracing calculix C,Fortran Structural Mechanics GemsFDTD Fortran Computational Electromegnetics tonto Fortran Quantum Chemistry lbm C Fluid Dynamics wrf C,Fortran Weather sphinx3 C Speech recognition

Another Benchmark PCs: Ziff-Davis Benchmark Suite Business Winstone is a system-level, applicationbased benchmark that measures a PC's overall performance when running today's top-selling Windows-based 32-bit applications it doesn't mimic what these packages do; it runs real applications through a series of scripted activities and uses the time a PC takes to complete those activities to produce its performance scores. Also tests for CDs, Content-creation, Audio, 3D graphics, battery life www.etestinglabs.com/benchmarks/ CS61C L38 Performance (20)

Performance Evaluation: An Aside Demo If we re talking about performance, let s discuss the ways shady salespeople have fooled consumers (so you don t get taken!) 5. Never let the user touch it 4. Only run the demo through a script 3. Run it on a stock machine in which no expense was spared 2. Preprocess all available data 1. Play a movie CS61C L38 Performance (21)

Megahertz Myth Marketing Movie CS61C L38 Performance (22)

Peer Instruction A. Rarely does a company selling a product give unbiased performance data. B. The Sieve of Eratosthenes and Quicksort were early effective benchmarks. C. A program runs in 100 sec. on a machine, mult accounts for 80 sec. of that. If we want to make the program run 6 times faster, we need to up the speed of mults by AT LEAST 6. CS61C L38 Performance (23) ABC 0: FFF 1: FFT 2: FTF 3: FTT 4: TFF 5: TFT 6: TTF 7: TTT

Peer Instruction Answers T R U E F A L S E F A L S E A. Rarely does a company selling a product give unbiased performance data. B. The Sieve of Eratosthenes, Puzzle and Quicksort were early effective benchmarks. C. A program runs in 100 sec. on a machine, mult accounts for 80 sec. of that. If we want to make the program run 6 times faster, we need to up the speed of mults by AT LEAST 6. A. TRUE. It is rare to find a company that gives Metrics that do not favor its product. B. Early benchmarks? Yes. Effective? No. Too simple! C. 6 times faster = 16 sec. mults must take -4 sec! I.e., impossible! CS61C L38 Performance (24) ABC 0: FFF 1: FFT 2: FTF 3: FTT 4: TFF 5: TFT 6: TTF 7: TTT

And in conclusion CPU time = Instructions x Cycles x Seconds Latency v. Throughput Performance doesn t depend on any single factor: need Instruction Count, Clocks Per Instruction (CPI) and Clock Rate to get valid estimations User Time: time user waits for program to execute: depends heavily on how OS switches between tasks CPU Time: time spent executing a single program: depends solely on design of processor (datapath, pipelining effectiveness, caches, etc.) Benchmarks Attempt to predict perf, Updated every few years Measure everything from simulation of desktop graphics programs to battery life Megahertz Myth MHz performance, it s just one factor CS61C L38 Performance (25) Program Instruction Cycle