CIS 501 Computer Architecture. Unit 2: Performance. Metrics (latency and throughput), reporting performance, benchmarking and averaging, performance analysis and pitfalls.

This Unit
CIS 501 Computer Architecture, Unit 2: Performance. Metrics: latency and throughput. Reporting performance. Benchmarking and averaging. Performance analysis and pitfalls. Slides developed by Milo Martin and Amir Roth at the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

Readings
Chapter 1.2-1.4 of MA:FSPTCM.

As You Get Settled
You drive two miles: 30 miles per hour for the first mile, 90 miles per hour for the second mile. Question: what was your average speed? Hint: the answer is not 60 miles per hour. Why not? Would the answer be different if each segment took equal time (rather than covering equal distance)?

Answer
You drive two miles: 30 miles per hour for the first mile, 90 miles per hour for the second mile. What was your average speed? The answer is not 60 miles per hour. The first mile takes 1/30 = 0.0333 hours; the second mile takes 1/90 = 0.0111 hours. That is 0.0222 hours per mile on average, i.e., 45 miles per hour. (If each segment had taken equal time rather than covering equal distance, the arithmetic average of 60 miles per hour would have been correct.)

Reasoning About Performance

Performance: Latency vs. Throughput
Latency (execution time): time to finish a fixed task. Throughput (bandwidth): number of tasks completed in a fixed time. These are genuinely different: you can exploit parallelism for throughput without improving latency (think of baking bread in batches), and the two goals are often contradictory. We will see many examples of this. Choose the definition of performance that matches your goals: a scientific program cares about latency; a web server cares about throughput. Example: move people 10 miles. A car has capacity 5 and speed 60 miles/hour; a bus has capacity 60 and speed 20 miles/hour. Latency: car = 10 minutes, bus = 30 minutes. Throughput (counting the return trip): car = 15 people per hour, bus = 60 people per hour. Related question: what is the fastest way to send 10 TB of data over a 1+ Gbit/second link?

Comparing Performance
A is X times faster than B if Latency(A) = Latency(B) / X, or equivalently Throughput(A) = Throughput(B) * X. A is X% faster than B if Latency(A) = Latency(B) / (1 + X/100), or Throughput(A) = Throughput(B) * (1 + X/100). Car/bus example: on latency, the car is 3 times (200%) faster than the bus; on throughput, the bus is 4 times (300%) faster than the car.
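A minimal Python sketch of the two examples above; the numbers come from the slides, and the helper names (latency_minutes, throughput_pph) are illustrative only:

```python
# Average speed over equal *distances* is the harmonic mean of the speeds,
# because total time = sum(distance / speed) over the segments.
miles = [1, 1]
speeds = [30, 90]                       # miles per hour
hours = [d / s for d, s in zip(miles, speeds)]
print(sum(miles) / sum(hours))          # ~45 mph, not 60

# Car vs. bus moving people 10 miles (the return trip counts for throughput).
def latency_minutes(distance_mi, speed_mph):
    return 60 * distance_mi / speed_mph

def throughput_pph(capacity, distance_mi, speed_mph):
    round_trip_hours = 2 * distance_mi / speed_mph
    return capacity / round_trip_hours

print(latency_minutes(10, 60), latency_minutes(10, 20))        # 10 vs 30 minutes
print(throughput_pph(5, 10, 60), throughput_pph(60, 10, 20))   # ~15 vs 60 people/hour
```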

Processor Performance and Workloads
Q: what does the "performance of a chip" mean? A: nothing, unless there is an associated workload. A workload is the set of tasks someone (you) cares about. Benchmarks are standard workloads used to compare performance across machines; they either are actual programs or are meant to be highly representative of the programs people actually run.

Benchmarking
Micro-benchmarks: non-standard "non-workloads". Tiny programs used to isolate specific aspects of performance; not representative of the complex behavior of real applications. Examples: binary tree search, Towers of Hanoi, 8-queens, etc.

SPEC Benchmarks
SPEC (Standard Performance Evaluation Corporation, http://www.spec.org/) is a consortium that collects, standardizes, and distributes benchmarks, and posts SPECmark results for different processors: one number that represents performance for the entire suite. There are benchmark suites for CPU, Java, I/O, Web, Mail, etc., updated every few years so that companies don't target the benchmarks. SPEC CPU 2006 has 12 integer benchmarks (bzip2, gcc, perl, hmmer (genomics), h264, etc.) and 17 floating-point benchmarks (wrf (weather), povray, sphinx3 (speech), etc.), written in C/C++ and Fortran.

SPECmark 2006
Reference machine: Sun UltraSPARC II at 296 MHz. Latency SPECmark: for each benchmark, take an odd number of samples and choose the median; take the latency ratio (reference machine time / your machine time); then take the geometric mean of the ratios over all benchmarks. Throughput SPECmark: run multiple benchmarks in parallel on a multi-processor system. Recent (latency) leaders: SPECint, Intel 3.3 GHz Xeon W5590 (34.2); SPECfp, Intel 3.2 GHz Xeon W3570 (39.3). (First time I've looked at this where the same chip was at the top of both.)
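A rough sketch of the latency SPECmark recipe in Python. The median-of-samples, ratio-to-reference, and geometric-mean steps come from the slide; the benchmark run times below are invented purely for illustration:

```python
from math import prod
from statistics import median

# Hypothetical run times in seconds: {benchmark: (reference machine samples, your machine samples)}
runs = {
    "bzip2": ([9650, 9700, 9630], [402, 398, 405]),
    "gcc":   ([8050, 8070, 8040], [310, 322, 315]),
    "wrf":   ([11120, 11200, 11090], [530, 528, 533]),
}

ratios = []
for name, (ref_samples, my_samples) in runs.items():
    # Odd number of samples per benchmark; score each by its median.
    ratios.append(median(ref_samples) / median(my_samples))   # bigger is better

# The SPECmark is the geometric mean of the per-benchmark ratios.
specmark = prod(ratios) ** (1 / len(ratios))
print(round(specmark, 1))
```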

Other Benchmarks
Parallel benchmarks: SPLASH2 (Stanford Parallel Applications for Shared Memory), NAS (another parallel benchmark suite), SPEC OpenMP (parallelized versions of SPECfp 2000), and SPECjbb (a Java multithreaded database-like workload). Transaction Processing Performance Council (TPC) benchmarks: TPC-C (on-line transaction processing, OLTP), TPC-H/R (decision support systems, DSS), TPC-W (e-commerce database back-end workload). These have parallelism (both intra-query and inter-query) and heavy I/O and memory components.

Mean (Average) Performance Numbers
Arithmetic mean: (1/N) * Σ_{P=1..N} Latency(P). Use it for quantities that are proportional to time (e.g., latency). You can add latencies, but not throughputs: Latency(P1+P2, A) = Latency(P1, A) + Latency(P2, A), but Throughput(P1+P2, A) != Throughput(P1, A) + Throughput(P2, A). Recall: 1 mile at 30 miles/hour plus 1 mile at 90 miles/hour does not average to 60 miles/hour. Harmonic mean: N / Σ_{P=1..N} (1/Throughput(P)). Use it for quantities that are inversely proportional to time (e.g., throughput). Geometric mean: the N-th root of Π_{P=1..N} Speedup(P). Use it for unitless quantities (e.g., speedup ratios).

CPU Performance
Recall the CPU performance equation. There are multiple aspects to performance, and it helps to isolate them: Latency = seconds / program = (instructions / program) * (cycles / instruction) * (seconds / cycle). Instructions / program is the dynamic instruction count, impacted by the program, the compiler, and the ISA. Cycles / instruction is the CPI, impacted by the program, compiler, ISA, and micro-architecture. Seconds / cycle is the clock period, impacted by the micro-architecture and the technology. For low latency (better performance), minimize all three. This is difficult: they often pull against one another. An example we have already seen is RISC vs. CISC ISAs: RISC has a low CPI and clock period but a high instruction count; CISC has a low instruction count but a high CPI and clock period.
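The three means and the performance equation as a small Python sketch; the sample inputs are invented, and the point is only which mean goes with which kind of quantity:

```python
from math import prod

def arithmetic_mean(latencies):
    # For quantities proportional to time (latencies add across programs).
    return sum(latencies) / len(latencies)

def harmonic_mean(throughputs):
    # For quantities inversely proportional to time (throughputs, rates).
    return len(throughputs) / sum(1 / t for t in throughputs)

def geometric_mean(speedups):
    # For unitless ratios (speedups).
    return prod(speedups) ** (1 / len(speedups))

def latency_seconds(insns, cpi, clock_hz):
    # seconds/program = (insns/program) * (cycles/insn) * (seconds/cycle)
    return insns * cpi / clock_hz

print(harmonic_mean([30, 90]))          # 45.0: the warm-up question again
print(geometric_mean([1.2, 0.8, 2.0]))  # ~1.24
print(latency_seconds(2e9, 1.5, 3e9))   # 1.0 second
```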

Improving Clock Frequency
Faster transistors: next topic. Micro-architectural techniques: multi-cycle processors break each instruction into small steps; less logic delay per step improves clock frequency, and different instructions take different numbers of cycles, but this increases CPI (CPI is now > 1). Pipelined processors do the same but overlap parts of different instructions (parallelism!): a faster clock with CPI still around 1, though pipeline hazards increase CPI. Topic for the following week.

Cycles per Instruction (CPI)
CPI: cycles per instruction, on average. IPC = 1/CPI is used more frequently than CPI; it is favored because "bigger is better", but it is harder to compute with. Different instructions have different cycle costs (e.g., an add typically takes 1 cycle, a divide takes more than 10 cycles), so CPI depends on the relative instruction frequencies. CPI example: a program executes an equal mix of integer, floating-point (FP), and memory operations, with cycles per instruction type of integer = 1, memory = 2, FP = 3. What is the CPI? (33% * 1) + (33% * 2) + (33% * 3) = 2. Caveat: this sort of calculation ignores many effects; it is for back-of-the-envelope arguments only.

CPI Example
Assume a processor with the following instruction frequencies and costs: integer ALU 50% at 1 cycle, load 20% at 5 cycles, store 10% at 1 cycle, branch 20% at 2 cycles. Which change would improve performance more: (A) branch prediction that reduces branch cost to 1 cycle, or (B) a faster data memory that reduces load cost to 3 cycles? Compute the CPIs. Base = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*2 = 2.0 CPI. A = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*1 = 1.8 CPI (1.11x, or 11%, faster). B = 0.5*1 + 0.2*3 + 0.1*1 + 0.2*2 = 1.6 CPI (1.25x, or 25%, faster). B is the winner.

Measuring CPI
How are CPI and execution time actually measured? Execution time: a stopwatch timer (the Unix time command). Then CPI = (CPU time * clock frequency) / dynamic instruction count; but how is the dynamic instruction count measured? More useful still is a CPI breakdown (CPI_CPU, CPI_MEM, etc.), so we know what the performance problems are and what to fix. Hardware event counters, available in most processors today, are one way to measure dynamic instruction count; CPI can then be calculated from counter frequencies and known event costs. Cycle-level micro-architecture simulation measures exactly what you want, including the impact of potential fixes, and is the method of choice for many micro-architects.
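The two CPI calculations above as a back-of-the-envelope helper. The instruction mix is written as {class: (fraction, cycles)}; the function and dictionary names are just for illustration:

```python
def cpi(mix):
    # mix: {instruction class: (fraction of dynamic instructions, cycles per instruction)}
    assert abs(sum(frac for frac, _ in mix.values()) - 1.0) < 1e-6
    return sum(frac * cycles for frac, cycles in mix.values())

base  = {"alu": (0.5, 1), "load": (0.2, 5), "store": (0.1, 1), "branch": (0.2, 2)}
opt_a = {**base, "branch": (0.2, 1)}   # branch prediction: branches now cost 1 cycle
opt_b = {**base, "load": (0.2, 3)}     # faster data memory: loads now cost 3 cycles

print(cpi(base), cpi(opt_a), cpi(opt_b))                # 2.0, ~1.8, ~1.6
print(cpi(base) / cpi(opt_a), cpi(base) / cpi(opt_b))   # ~1.11x vs 1.25x: B wins
```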

Simulator Performance Breakdown
(Chart from Romer et al., ASPLOS 1996.)

Improving CPI: Caching and Parallelism
CIS 501 is more about improving CPI than about improving clock frequency. Techniques we will look at: caching, speculation, multiple issue, out-of-order issue, vectors, multiprocessing, and more. Moore's Law can help CPI: more transistors. The best examples are caches (which improve the memory component of CPI). Parallelism: IPC > 1 implies executing instructions in parallel, and now also multi-processors (multi-cores); speculation, wide issue, out-of-order issue, and vectors are forms of parallelism too. All roads lead to multi-core. Why multi-core rather than still bigger caches or yet wider issue? Diminishing returns and limits to instruction-level parallelism (ILP); multi-core can provide performance that scales linearly with transistor count.

MIPS (the performance metric, not the ISA)
(Micro-)architects often ignore dynamic instruction count: they typically work with one ISA and one compiler and treat the count as fixed. The CPU performance equation then becomes: latency: seconds / instruction = (cycles / instruction) * (seconds / cycle); throughput: instructions / second = (instructions / cycle) * (cycles / second). MIPS = millions of instructions per second, where cycles / second is the clock frequency (in MHz). Example: CPI = 2, clock = 500 MHz: 0.5 IPC * 500 MHz = 250 MIPS.

Pitfalls of Partial Performance Metrics
Pitfall: MIPS may vary inversely with actual performance. If the compiler removes instructions, the program gets faster but its MIPS rating goes down. And the work per instruction varies (e.g., multiply vs. add, FP vs. integer).
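A sketch of the MIPS pitfall. The "naive" and "clever" compiler outputs below are hypothetical; the point is that fewer (but heavier) instructions can lower the MIPS rating while also lowering the run time:

```python
def mips(clock_mhz, cpi):
    # (insns/cycle) * (cycles/second), expressed in millions of instructions per second.
    return clock_mhz / cpi

def run_time_s(dyn_insns, cpi, clock_mhz):
    return dyn_insns * cpi / (clock_mhz * 1e6)

# Same 500 MHz machine, same source program, two hypothetical compilers.
naive  = {"insns": 1.0e9, "cpi": 2.0}   # many simple instructions
clever = {"insns": 0.6e9, "cpi": 2.5}   # fewer, but more complex, instructions

for name, prog in [("naive", naive), ("clever", clever)]:
    print(name,
          round(mips(500, prog["cpi"]), 1), "MIPS,",
          round(run_time_s(prog["insns"], prog["cpi"], 500), 2), "s")
# naive:  250.0 MIPS, 4.0 s
# clever: 200.0 MIPS, 3.0 s  <- lower MIPS, yet faster
```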

MHz (Megahertz) and GHz (Gigahertz)
1 Hertz = 1 cycle per second; 1 GHz is 1 cycle per nanosecond, and 1 GHz = 1000 MHz. (Micro-)architects often ignore dynamic instruction count, but the general public (mostly) also ignores CPI and equates clock frequency with performance! Which processor would you buy: processor A with CPI = 2 and a 5 GHz clock, or processor B with CPI = 1 and a 3 GHz clock? Probably A, but B is faster (assuming the same ISA and compiler). Classic example: the 800 MHz Pentium III was faster than the 1 GHz Pentium 4. A more recent example: the Core i7 is faster clock-for-clock than the Core 2, with the same ISA and compiler. Meta-point: the danger of partial performance metrics!

CPI and Clock Frequency
The clock frequency refers to the CPU clock; other system components have their own clocks (or none), so increasing the processor clock does not accelerate memory latency, for example. Example: a 1 GHz processor (1 ns clock period) running 80% non-memory instructions at 1 cycle (1 ns) and 20% memory instructions at 6 cycles (6 ns): (80% * 1) + (20% * 6) = 2 ns per instruction (also 500 MIPS). What is the impact of doubling the core clock frequency without speeding up the memory? Non-memory instructions now take 0.5 ns (still 1 cycle), but memory instructions keep their 6 ns latency (now 12 cycles): (80% * 0.5) + (20% * 6) = 1.6 ns per instruction (also 625 MIPS). Speedup = 2 / 1.6 = 1.25, which is much less than 2. What about an infinite clock frequency (non-memory instructions free)? Only a factor of about 1.67 speedup; an example of Amdahl's Law.

Performance Rules of Thumb
Design for actual performance, not peak performance. Peak performance is the performance you are guaranteed not to exceed; it is greater than actual, average, or sustained performance because of cache misses, branch mispredictions, limited ILP, and so on. For actual performance X, machine capability must be greater than X. It is easier to buy bandwidth than latency: which is easier, designing a train that (1) holds twice as much cargo, or (2) goes twice as fast? Use bandwidth to reduce latency. Build a balanced system: don't over-optimize 1% of the system to the detriment of the other 99%; system performance is often determined by the slowest component.

Amdahl's Law
A restatement of the law of diminishing returns: total speedup is limited by the non-accelerated piece. Analogy: you can drive to work faster, but after parking the car you still have to walk to the building. Consider a task with a parallel portion p and a serial portion s (fractions of the total, so s + p = 1). What is the speedup with n cores? Speedup(n, p, s) = (s + p) / (s + p/n). What about infinitely many cores? Speedup(p, s) = (s + p) / s = 1 / s. Example: suppose we can optimize 50% of program A. Even a magic optimization that makes that 50% disappear entirely yields only a 2x speedup.
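Both the clock-doubling example and Amdahl's Law in a few lines of Python; the function names are illustrative, and the fractions are assumed to sum to 1:

```python
def ns_per_insn(frac_mem, mem_ns, nonmem_ns):
    # Average instruction latency when memory time does not scale with the core clock.
    return (1 - frac_mem) * nonmem_ns + frac_mem * mem_ns

base     = ns_per_insn(0.20, 6.0, 1.0)   # 1 GHz core:      2.0 ns/insn
doubled  = ns_per_insn(0.20, 6.0, 0.5)   # 2 GHz core:      1.6 ns/insn
infinite = ns_per_insn(0.20, 6.0, 0.0)   # non-memory free: 1.2 ns/insn
print(base / doubled, base / infinite)   # 1.25, ~1.67

def amdahl_speedup(p, n):
    # p = accelerated (parallel) fraction, n = number of cores or acceleration factor.
    s = 1 - p
    return 1 / (s + p / n)

print(amdahl_speedup(0.5, 1e12))         # ~2.0: optimizing 50% of a program caps at 2x
```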

Amdahl's Law Graph
(Figure: Amdahl's Law. Source: Wikipedia.)

Summary
Latency = seconds / program = (instructions / program) * (cycles / instruction) * (seconds / cycle). Instructions / program is the dynamic instruction count, a function of the program, the compiler, and the instruction set architecture (ISA). Cycles / instruction is the CPI, a function of the program, compiler, ISA, and micro-architecture. Seconds / cycle is the clock period, a function of the micro-architecture and the technology parameters. Optimize each component: CIS 501 focuses mostly on CPI (caches, parallelism), with some attention to dynamic instruction count (compiler, ISA) and some to clock frequency (pipelining, technology).