Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Similar documents
UCB CS61C : Machine Structures

Chapter 1. Computer Abstractions and Technology. Adapted by Paulo Lopes, IST

UCB CS61C : Machine Structures

Response Time and Throughput

TDT4255 Computer Design. Lecture 1. Magnus Jahre

Defining Performance. Performance 1. Which airplane has the best performance? Computer Organization II Ribbens & McQuain.

Chapter 1. and Technology

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 1. Computer Abstractions and Technology

Defining Performance. Performance. Which airplane has the best performance? Boeing 777. Boeing 777. Boeing 747. Boeing 747

PIPELINING AND PROCESSOR PERFORMANCE

1.6 Computer Performance

CSE2021 Computer Organization. Computer Abstractions and Technology

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 1. Computer Abstractions and Technology

The Computer Revolution. Classes of Computers. Chapter 1

Rechnerstrukturen

A Fast Instruction Set Simulator for RISC-V

Architecture of Parallel Computer Systems - Performance Benchmarking -

Chapter 1. Computer Abstractions and Technology

Computer Architecture. Introduction

Performance, Power, Die Yield. CS301 Prof Szajda

ECE369: Fundamentals of Computer Architecture

Microarchitecture Overview. Performance

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 1. Computer Abstractions and Technology

EKT 303 WEEK Pearson Education, Inc., Hoboken, NJ. All rights reserved.

CPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate:

Last time. Lecture #29 Performance & Parallel Intro

Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding effects of underlying architecture

Chapter 1. The Computer Revolution

SEN361 Computer Organization. Prof. Dr. Hasan Hüseyin BALIK (2 nd Week)

IC220 Slide Set #5B: Performance (Chapter 1: 1.6, )

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors

Computer Performance. Reread Chapter Quiz on Friday. Study Session Wed Night FB 009, 5pm-6:30pm

EIE/ENE 334 Microprocessors

Chapter 1. Computer Abstractions and Technology. Lesson 2: Understanding Performance

Lightweight Memory Tracing

Chapter 1. Computer Abstractions and Technology. Jiang Jiang

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture

EECS2021E EECS2021E. The Computer Revolution. Morgan Kaufmann Publishers September 12, Chapter 1 Computer Abstractions and Technology 1

Information System Architecture Natawut Nupairoj Ph.D. Department of Computer Engineering, Chulalongkorn University

Computer Organization & Assembly Language Programming (CSE 2312)

Performance evaluation. Performance evaluation. CS/COE0447: Computer Organization. It s an everyday process

Energy Models for DVFS Processors

EE108B Lecture 6. Performance and Compilers. Christos Kozyrakis Stanford University

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

What is Good Performance. Benchmark at Home and Office. Benchmark at Home and Office. Program with 2 threads Home program.

Outline Marquette University

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems

Microarchitecture Overview. Performance

Computer System. Performance

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

CS 61C: Great Ideas in Computer Architecture Performance and Floating-Point Arithmetic

Detection of Weak Spots in Benchmarks Memory Space by using PCA and CA

Computing System Fundamentals/Trends + Review of Performance Evaluation and ISA Design

Energy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012

Footprint-based Locality Analysis

Mestrado em Informática

Pipelining. CS701 High Performance Computing

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

CPU Performance. Christos Kozyrakis. Stanford University C. Kozyrakis EE108b Lecture 5

CMSC 611: Advanced Computer Architecture

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

This Unit. CIS 501 Computer Architecture. As You Get Settled. Readings. Metrics Latency and throughput. Reporting performance

CSCI 402: Computer Architectures. Computer Abstractions and Technology (4) Fengguang Song Department of Computer & Information Science IUPUI.

The bottom line: Performance. Measuring and Discussing Computer System Performance. Our definition of Performance. How to measure Execution Time?

CS3350B Computer Architecture CPU Performance and Profiling

COMPUTER ORGANIZATION AND DESIGN

ECE C61 Computer Architecture Lecture 2 performance. Prof. Alok N. Choudhary.

Addressing End-to-End Memory Access Latency in NoC-Based Multicores

CS 110 Computer Architecture

Lecture 2: Computer Performance. Assist.Prof.Dr. Gürhan Küçük Advanced Computer Architectures CSE 533

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Bias Scheduling in Heterogeneous Multi-core Architectures

ECE331: Hardware Organization and Design

Transistors and Wires

Three major components of Embedded Systems

The Role of Performance

Sandbox Based Optimal Offset Estimation [DPC2]

Near-Threshold Computing: How Close Should We Get?

An Introduction to Parallel Architectures

Linux Performance on IBM zenterprise 196

EECS2021. EECS2021 Computer Organization. EECS2021 Computer Organization. Morgan Kaufmann Publishers September 14, 2016

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Computer Performance Evaluation: Cycles Per Instruction (CPI)

Lecture 3: Evaluating Computer Architectures. How to design something:

Performance of computer systems

Performance, Cost and Amdahl s s Law. Arquitectura de Computadoras

5DV118 Computer Organization and Architecture Umeå University Department of Computing Science Stephen J. Hegner. Topic 1: Introduction

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition. Chapter 1. Computer Abstractions and Technology

Open Access Research on the Establishment of MSR Model in Cloud Computing based on Standard Performance Evaluation

Course web site: teaching/courses/car. Piazza discussion forum:

Energy-centric DVFS Controlling Method for Multi-core Platforms

Overview of Today s Lecture: Cost & Price, Performance { 1+ Administrative Matters Finish Lecture1 Cost and Price Add/Drop - See me after class

Perceptron Learning for Reuse Prediction

Outline. What is Performance? Restating Performance Equation Time = Seconds. CPU Performance Factors

CS 61C: Great Ideas in Computer Architecture Performance and Floating-Point Arithmetic

CpE 442 Introduction to Computer Architecture. The Role of Performance

CSCI 402: Computer Architectures. Instructions: Language of the Computer (4) Fengguang Song Department of Computer & Information Science IUPUI

Transcription:

Performance Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu

Defining Performance (1) Which airplane has the best performance? Boeing 777 Boeing 747 BAC/Sud Concorde Douglas DC- 8-50 Boeing 777 Boeing 747 BAC/Sud Concorde Douglas DC- 8-50 0 200 400 600 Passenger Capacity 0 5000 10000 Cruising Range (miles) Boeing 777 Boeing 747 BAC/Sud Concorde Douglas DC- 8-50 Boeing 777 Boeing 747 BAC/Sud Concorde Douglas DC- 8-50 0 500 1000 1500 Cruising Speed (mph) 0 200000 400000 Passengers x mph 2

Defining Performance (2) Performance issues Measure, analyze, report, and summarize Make intelligent choices See through the marketing hype Key to understanding underlying organizational motivation Questions Why is some hardware better than others for different programs? What factors of system performance are hardware related? (e.g., Do we need a new machine, or a new operating system?) How does the machine s instruction set affect performance? 3

Computer Performance (1) Response time ( execution time, latency) The time between the start and completion of a task How long does it take for my job to run? How long must I wait for the database query? Throughput ( bandwidth) The total amount of work done in a given time How much work is getting done per unit time? What is the average execution rate? What if We replace the processor with a faster version? We add more processors? 4

Computer Performance (2) Relative performance Define Performance = 1 Execution Time X is n times faster than Y Performance X Performance Y = Execution time Y Execution time X = n Example: time taken to run a program 10s on machine A, 15s on machine B Execution Time B / Execution Time A = 15s / 10s = 1.5 Machine A is 1.5 times faster than machine B 5

Measuring Execution Time Elapsed time Total response time, including all aspects Processing, I/O, OS overhead, idle time Determines system performance CPU time Time spent processing a given job Discounts I/O time, other jobs shares Comprises user CPU time and system CPU time Different programs are affected differently by CPU and system performance Our focus: User CPU time 6

CPU Clocking Clock Operation of digital hardware governed by a constant-rate clock Clock ticks indicate when to start activities Clock period: duration of a clock cycle Clock frequency (rate): cycles per second Clock (cycles) Data transfer and computation Update state Clock period 7

CPU Time (1) CPU time CPU Time = CPU Clock Cycles Clock Cycle Time = CPU Clock Cycles Clock Rate Performance improved by Reducing the number of clock cycles Increasing clock rate (or decreasing the clock cycle time) Hardware designer must often trade off clock rate against cycle count 8

CPU Time (2) Example: Computer A: 2GHz clock, 10s CPU time Designing Computer B Aim for 6s CPU time Can do faster clock, but causes 1.2x clock cycles How fast must Computer B clock be? Clock Rate B = Clock Cycles B CPU Time B = 1.2 Clock Cycles A 6s Clock Cycles A = CPU Time A Clock Rate A = 10s 2GHz = 20 10 9 Clock Rate B = 1.2 20 109 6s = 24 109 6s = 4GHz 9

CPI (1) Instruction count and CPI Clock Cycles = Instruction Count Cycles per Instruction CPU Time = Instruction Count CPI Clock Cycle Time = Instruction Count CPI Clock Rate Instruction count for a program Determined by program, ISA, and compiler Average cycles per instruction (CPI) Determined by CPU hardware If different instructions have different CPI The average CPI affected by instruction mix 10

CPI (2) CPI example Computer A: Cycle time = 250ps, CPI = 2.0 Computer B: Cycle time = 500ps, CPI = 1.2 Same ISA Which is faster, and by how much? CPU Time A = Instruction Count CPI A Cycle Time A = I 2.0 250ps = I 500ps CPU Time B = Instruction Count CPI B Cycle Time B = I 1.2 500ps = I 600ps CPU Time B CPU Time A = I 600ps I 500ps = 1.2 11

CPI (3) CPI in more detail If different instruction classes take different numbers of cycles: n Clock Cycles = i=1( CPI i Instruction Count i ) Weighted average CPI CPI = Clock Cycles = ( Instruction Count i=1 n CPI i Instruction Count i Instruction Count ) Relative frequency 12

CPI (4) Example: Alternative compiled code sequences using instructions in classes A, B, C Class A B C CPI for class 1 2 3 IC in sequence 1 2 1 2 IC in sequence 2 4 1 1 Sequence 1: IC = 5 Clock cycles = 2x1+1x2+2x3 = 10 Avg. CPI = 10/5 = 2.0 Sequence 2: IC = 6 Clock cycles = 4x1+1x2+1x3 = 9 Avg. CPI = 9/6 = 1.5 13

MIPS MIPS: Millions of Instructions Per Second MIPS as a performance metric? Doesn t account for Differences in ISAs between computers Differences in complexity between instructions MIPS = Instruction count Execution time 10 6 = Instruction count Instruction count CPI Clock rate 10 6 = Clock rate CPI 10 6 CPI varies between programs on a given CPU 14

C Sort Example (1) Bubble sort in C void swap (int v[], int k) { int temp; temp = v[k]; v[k] = v[k+1]; v[k+1] = temp; } void sort (int v[], int n) { int i, j; for (i = 0; i < n; i += 1) { for (j = i 1; j >= 0 && v[j] > v[j + 1]; j -= 1) { swap(v, j); } } } 15

C Sort Example (2) Effect of compiler optimization 3 2.5 2 1.5 1 0.5 0 180000 160000 140000 120000 100000 80000 60000 40000 20000 0 Relative Performance none O1 O2 O3 Clock Cycles none O1 O2 O3 140000 120000 100000 80000 60000 40000 20000 1.5 0.5 Compiled with gcc for Pentium 4 under Linux 2 1 0 0 Instruction count none O1 O2 O3 CPI none O1 O2 O3 16

C Sort Example (3) Effect of language and algorithm 3 2.5 2 1.5 1 0.5 Bubblesort Relative Performance 0 2.5 C/none C/O1 C/O2 C/O3 Java/int Java/JIT Quicksort Relative Performance 2 1.5 1 0.5 0 3000 C/none C/O1 C/O2 C/O3 Java/int Java/JIT Quicksort vs. Bubblesort Speedup 2500 2000 1500 1000 500 0 C/none C/O1 C/O2 C/O3 Java/int Java/JIT 17

C Sort Example (4) Lessons Instruction count and CPI are not good performance indicators in isolation Compiler optimizations are sensitive to the algorithm Java/JIT compiled code is significantly faster than JVM interpreted Comparable to optimized C in some cases Nothing can fix a dumb algorithm! 18

Performance Summary CPU Time = Instructions Program Clock cycles Instruction Seconds Clock cycle Instruction Count CPI Clock Cycle Algorithm Programming language Compiler ISA Microarchitecture Technology 19

Benchmarks How to measure the performance? Performance best determined by running a real application Use programs typical of expected workload Or, typical of expected class of applications Small benchmarks Nice for architects and designers Easy to standardize Can be abused 20

SPEC CPU Benchmark (1) SPEC (Standard Performance Evaluation Corp.) Develops benchmarks for CPU, I/O, Web, http://www.spec.org SPEC CPU benchmark An industry-standardized, CPU-intensive benchmark suite, stressing a system's processor, memory subsystem and compiler. Companies have agreed on a set of real program and inputs Valuable indicator of performance (and compiler technology) CPU89 CPU92 CPU95 CPU2000 CPU2006 Can still be abused 21

SPEC CPU Benchmark (2) Benchmark games An embarrassed Intel Corp. acknowledged Friday that a bug in a software program known as a compiler had led the company to overstate the speed of its microprocessor chips on an industry benchmark by 10 percent. However, industry analysts said the coding error was a sad commentary on a common industry practice of cheating on standardized performance tests The error was pointed out to Intel two days ago by a competitor, Motorola came in a test known as SPECint92 Intel acknowledged that it had optimized its compiler to improve its test scores. The company had also said that it did not like the practice but felt to compelled to make the optimizations because its competitors were doing the same thing At the heart of Intel s problem is the practice of tuning compiler programs to recognize certain computing problems in the test and then substituting special handwritten pieces of code Saturday, January 6, 1996 New York Times 22

SPEC CPU Benchmark (3) SPEC CPU2006 Elapsed time to execute a selection of programs Negligible I/O, so focuses on CPU performance Normalize relative to reference machine Sun s historical Ultra Enterprise 2 introduced in 1997 296MHz UltraSPARC II processor Summarize as geometric mean of performance ratios CINT2006: 12 integer programs written in C and C++ CFP2006: 17 FP programs written in Fortran and C/C++ n n Execution time ratio i i=1 23

SPEC CPU Benchmark (4) SPEC CPU2006 (cont d) Integer Benchmarks (CINT2006) Floating Point Benchmarks (CFP2006) perlbench C Perl programming language bwaves Fortran Fluid dynamics bzip2 C Compression gamess Fortran Quantum chemistry gcc C C compiler milc C Physics: Quantum chromodynamics mcf C Combinatorial optimization zeusmp Fortran Physics / CFD gobmk C Artificial intelligence: Go gromacs C/Fortran Biochemistry / Molecular dynamics hmmer C Search gene sequence cactusadm C/Fortran Physics / General relativity sjeng C Artificial intelligence: Chess leslie3d Fortran Fluid dynamics libquantum C Physics: Quantum computing namd C++ Biology / Molecular dynamics h264ref C Video compression dealii C++ Finite element analysis omnetpp C++ Discrete event simulation soplex C++ Linear programming, optimization astar C++ Path-finding algorithms povray C++ Image ray-tracing xalancbmk C++ XML processing calculix C/Fortran Structural mechanics GemsFDTD Fortran Computational electromagnetics tonto Fortran Quantum chemistry lbm C Fluid dynamics wrf C/Fortran Weather prediction sphinx3 C Speech recognition 24

SPEC CPU Benchmark (5) CINT2006 for Opteron X4 2356 Name Description IC 10 9 CPI Tc (ns) Exec time Ref time SPECratio perl Interpreted string processing 2,118 0.75 0.40 637 9,770 15.3 bzip2 Block-sorting compression 2,389 0.85 0.40 817 9,650 11.8 gcc GNU C Compiler 1,050 1.72 0.40 724 8,050 11.1 mcf Combinatorial optimization 336 10.00 0.40 1,345 9,120 6.8 go Go game (AI) 1,658 1.09 0.40 721 10,490 14.6 hmmer Search gene sequence 2,783 0.80 0.40 890 9,330 10.5 sjeng Chess game (AI) 2,176 0.96 0.40 837 12,100 14.5 libquantum Quantum computer simulation 1,623 1.61 0.40 1,047 20,720 19.8 h264avc Video compression 3,102 0.80 0.40 993 22,130 22.3 omnetpp Discrete event simulation 587 2.94 0.40 690 6,250 9.1 astar Games/path finding 1,082 1.79 0.40 773 7,020 9.1 xalancbmk XML parsing 1,058 2.70 0.40 1,143 6,900 6.0 Geometric mean 11.7 25

SPEC Power Benchmark (1) SPECpower_ssj2008 The first industry-standard SPEC benchmark for evaluating the power and performance characteristics of server class computers Initially targets the performance of server-side Java Power consumption of server at different workload levels (0% ~ 100%) Performance: ssj_ops/sec Power: Watts (Joules/sec) 10 10 Overall ssj_ops per Watt = ssj_ops i power i i=0 i=0 26

SPEC Power Benchmark (2) SPECpower_ssj2008 for X4 2356 Target Load Performance Power Performance Actual Load ssj_ops Avg. Active Power (W) to Power Ratio 100% 99.3% 240,914 299 806 90% 90.7% 219,979 291 756 80% 80.1% 194,276 282 690 70% 70.5% 170,927 271 630 60% 59.9% 145,299 258 562 50% 49.5% 120,062 245 490 40% 40.2% 97,534 232 420 30% 30.2% 73,199 219 334 20% 19.9% 48,386 207 233 10% 9.8% 23,819 197 121 Active Idle 0 178 0 ssj_ops / power = 498 27

SPEC Power Benchmark (3) Low power at idle? Look back at X4 power benchmark At 100% load: 299W At 50% load: 245W (82%) At 10% load: 197W (66%) Google data center Mostly operates at 10% 50% load At 100% load less than 1% of the time Designing processors to make power proportional to load? 28

Other Benchmarks EEMBC Applications on embedded systems such as communication devices, automobiles, etc. Mediabench Set of multimedia applications (codecs, graphics, ) NAS Parallel benchmarks from NASA SPLASH, PARSEC Multithreaded benchmarks for multiprocessors 29

Amdahl s Law (1) Execution time after improvement T original T improved = T affected Improvement factor + T unaffected 1- f f Improved by S = T original ( 1 f + f S) T improved Example: multiply accounts for 80s/100s How much improvement in multiply performance to run a program 4 times faster? How about making it 5 times faster? 30

Amdahl s Law (2) Speedup and Amdahl s law Speedup = T original T improved = 1 ( 1 f + f S) Principles Make the common case fast As f 1, speedup S Speedup is limited by the fraction of code that can be optimized As S, speedup 1 / (1 f) Uncommon case can become the common one after improvement 31

Summary Performance is specific to a particular program(s) Total execution time is a consistent summary of the performance For a given architecture, performance increases come from Increases in clock rate (without adverse CPI affects) Improvements in processor organization that lower CPI Compiler enhancements that lower CPI and/or instruction count Algorithm/Language choices that affect instruction count Pitfall: Expecting improvement in one aspect of a machine s performance to affect the total performance 32