Motivation. EE 457 Unit 4. Throughput vs. Latency. Performance Depends on View Point?! Computer System Performance. An individual user wants to:

Transcription:

EE 457 Unit 4: Computer System Performance

Motivation
An individual user wants to:
  Minimize single-program execution time
A datacenter owner wants to:
  Maximize number of ____
  Minimize ____ ( ____ )
Image sources: http://e-tellgentnternetmarketng.com/webste/frustrated-computer-user-2/ http://www.ntomoble.com/2010/11/02/opera-celand-clean/

Performance Depends on View Point?!
What's faster: a 747 jumbo airliner or an F-22 fighter jet?
If you are an individual interested in getting from point A to point B, then the F-22.
  This is known as latency: the time from the start of an operation until it completes.
If you are trying to ____ number of people, then the 747.
  This is known as throughput.

Throughput vs. Latency
If latency is the time it takes for a job to complete and throughput = jobs / time, is throughput = 1 / latency?
  Latency is from the perspective of a ____.
  Throughput is from the perspective of ____.
____ is the great friend of throughput!
We will see many times in this course (pipelining, memory organization, etc.) that there is often not much we can do about latency, but there are lots of ways to improve throughput, hopefully without ____ latency too much, if at all.
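The question "is throughput = 1 / latency?" can be explored with a small sketch. It assumes a steady state in which a fixed number of jobs overlap in time; the function name and the numbers are hypothetical and chosen only for illustration.

```python
def steady_state_throughput(jobs_in_flight: int, latency_s: float) -> float:
    """Jobs completed per second when `jobs_in_flight` jobs overlap,
    each taking `latency_s` seconds from start to finish."""
    if jobs_in_flight < 1 or latency_s <= 0:
        raise ValueError("need at least one job in flight and a positive latency")
    return jobs_in_flight / latency_s


# Hypothetical numbers: every job takes 2 seconds from start to finish.
one_at_a_time = steady_state_throughput(1, 2.0)    # no overlap
four_overlapped = steady_state_throughput(4, 2.0)  # four jobs in flight

print(one_at_a_time)    # 0.5
print(four_overlapped)  # 2.0
```

With no overlap, throughput equals 1 / latency. Once jobs overlap, throughput rises while the latency of each individual job stays the same.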

What are the metrics?

Metrics
Execution Time
  Key point: when comparing different systems, execution time is the ultimate criterion (metric).
  Using a rate as a metric can often be misleading.
____ metrics
  Often not comparing apples to apples
  Often not normalized

What's Wrong with Rates
Two trains take two different routes from City A to City B and leave at the same time. Train 1 travels at 60 MPH, while train 2 travels at 75 MPH. Which one arrives first?
Example 1 (MIPS): You may hear that Computer 1 executes 500 MIPS while Computer 2 executes 750 MIPS. Which one executes a given program faster?
  Train speed corresponds to MIPS; the route corresponds to the program (how many instructions).
  MIPS is only useful for the same ____.
Example 2 (Clock Rate): You may hear that CPU1 runs at 2 GHz and CPU2 runs at 3 GHz. Which one executes a program faster (assume the same instruction set)?
  CPU1 may have ____ while CPU2 has ____, so CPU1 time < CPU2 time.

Wall Clock Time vs. CPU Time
Even execution time can be hard to measure accurately because the OS may allocate a percentage of compute cycles to other programs (also, part of a program's execution is spent in OS calls or waiting for I/O, etc.).
Wall clock time: the real time from when the user submitted the job until it was completed.
CPU time (user time + system time): the actual time the program used the CPU, either in the application code (user time) or in the OS (system time). Doesn't include I/O time.
Linux/Unix:
    % ____
    real 0m16.019s
    user 0m12.840s
    sys  0m0.180s
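The MIPS example above can be made concrete with a short sketch. The 500 and 750 MIPS ratings come from the slide; the instruction counts are hypothetical, standing in for the same program compiled to two different instruction sets.

```python
def exec_time_from_mips(instruction_count: int, mips: float) -> float:
    """Execution time in seconds for a program of `instruction_count`
    dynamic instructions on a machine sustaining `mips` million instr/sec."""
    return instruction_count / (mips * 1e6)


# Hypothetical instruction counts: the "route" differs per machine.
time_1 = exec_time_from_mips(1_000_000_000, 500)  # Computer 1, 500 MIPS
time_2 = exec_time_from_mips(2_000_000_000, 750)  # Computer 2, 750 MIPS

print(f"Computer 1: {time_1:.3f} s")  # Computer 1: 2.000 s
print(f"Computer 2: {time_2:.3f} s")  # Computer 2: 2.667 s
```

Under these assumed counts the machine with the lower MIPS rating finishes first, which is why a rate alone does not determine execution time.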

Performance
Performance is defined as the inverse of execution time:

$$\text{Performance} = \frac{1}{\text{Execution Time}}$$

We often want to compare relative performance, or speedup (how many times faster a new system is than an old one):

$$\text{Speedup} = \frac{\text{Performance}_{\text{New}}}{\text{Performance}_{\text{Old}}} = \frac{\text{Execution Time}_{\text{Old}}}{\text{Execution Time}_{\text{New}}}$$

Performance Equation
Execution time can be modeled using three components:
  IC: dynamic instruction count (not static instruction count)
  CPI: average number of clock cycles to execute each instruction
  Clock cycle time (clock period)

$$\text{Exec. Time} = \text{IC} \times \text{CPI} \times \text{Clock Cycle Time}$$

Dynamic vs. Static Instruction Count
Static instruction count is the number of written instructions.
Dynamic instruction count (or trace count) is how many instructions were executed at run time.
Would you prefer either:
  Small static IC & large dynamic IC, or
  Large static IC & small dynamic IC?
[Figure: code listing with labels LP, THN, and ELS and a BNE LP loop branch, contrasting static IC with dynamic IC]

What Affects Performance

Component            | SW/HW | Affects                           | Description
---------------------|-------|-----------------------------------|------------
Algorithm            | SW    | Instruc. count & CPI              | Determines how many instructions and which kind are executed
Programming language | SW    | Instruc. count & CPI              | Determines constructs that need to be translated and the kind of instructions
Compiler             | SW    | Instruc. count & CPI              | Efficiency of translation affects how many and which instructions are used
Instruction set      | HW    | Instruc. count, CPI, clock cycle  | Determines what instructions are available and what work each instruction performs
Microarchitecture    | HW    | CPI, clock cycle                  | Determines how each instruction is executed (CPI, clock period)

Source: H&P, Computer Organization & Design, 3rd Ed.
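As a minimal sketch of the performance equation and the speedup definition from this unit, the code below computes execution time from IC, CPI, and clock rate, then compares two systems. All numeric values are hypothetical and exist only to exercise the formulas.

```python
def execution_time(ic: int, cpi: float, clock_hz: float) -> float:
    """Exec. Time = IC * CPI * Clock Cycle Time, with cycle time = 1 / clock_hz."""
    return ic * cpi / clock_hz


def speedup(time_old: float, time_new: float) -> float:
    """How many times faster the new system is than the old one."""
    return time_old / time_new


# Hypothetical systems, both clocked at 1 GHz.
t_old = execution_time(ic=100_000_000, cpi=2.0, clock_hz=1e9)  # 0.2 s
t_new = execution_time(ic=80_000_000, cpi=1.5, clock_hz=1e9)   # 0.12 s

print(f"old: {t_old:.3f} s, new: {t_new:.3f} s")  # old: 0.200 s, new: 0.120 s
print(f"speedup: {speedup(t_old, t_new):.3f}")    # speedup: 1.667
```

Here the instruction count is the dynamic count, since only executed instructions consume cycles.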

CPI & Microarchitecture
[Figure: three datapath organizations (single-bus, two-bus, three-bus), each drawn with registers R0, R1, ..., Rn, a Y register, an ALU, and a Z register]
Clock cycles needed for Rdst <- Rsrc1 + Rsrc2:
  Single bus: Clock 1: Y <- Rsrc1; Clock 2: Z <- Rsrc2 + Y; Clock 3: Rdst <- Z
  Two-bus: Clock 1: Z <- Rsrc1 + Rsrc2; Clock 2: Rdst <- Z
  Three-bus: Clock 1: Rdst <- Rsrc1 + Rsrc2
General implication: fewer resources => more clock cycles (time).

Example
Processor A runs at 200 MHz and executes a 40-million-instruction program at a sustained 50 MIPS.
Processor B runs at 400 MHz and executes the same program (with a different compiler), which yields a count of 60 million instructions and a CPI of 6.
  What is the CPI of the program on Proc. A?
  Which processor executes the program faster, and by what factor?
  What is the MIPS rate of Proc. B?

Calculating CPI
CPI can be found by taking the expected value (weighted average) of each instruction type's CPI [i.e., for each type, CPI × frequency (probability) of that type of instruction]:

$$\text{CPI} = \sum_{\text{Type}_i} \text{CPI}_{\text{Type}_i} \cdot P(\text{InstructionType}_i)$$

In practice, CPI is often hard to find analytically because in modern processors instruction execution depends on earlier instructions.
Instead, we run benchmark applications on simulators to measure average CPI.

Example
If CLK = 1 MHz, what is the peak inst./sec.?

P1 type | CPI
--------|----
A       | 1
B       | 2
C       | 3

Average CPI = ____
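The Processor A / Processor B exercise above can be worked through with a short sketch. It relies on the relation MIPS = clock rate / (CPI × 10^6), which follows from the performance equation; the function names are illustrative, and the inputs are the figures stated in the exercise.

```python
def cpi_from_mips(clock_hz: float, mips: float) -> float:
    """CPI implied by a clock rate and a sustained MIPS rating."""
    return clock_hz / (mips * 1e6)


def mips_from_cpi(clock_hz: float, cpi: float) -> float:
    """Sustained MIPS rating implied by a clock rate and a CPI."""
    return clock_hz / (cpi * 1e6)


def execution_time(ic: float, cpi: float, clock_hz: float) -> float:
    """Exec. Time = IC * CPI / clock rate."""
    return ic * cpi / clock_hz


cpi_a = cpi_from_mips(200e6, 50)             # Processor A: 200 MHz, 50 MIPS
time_a = execution_time(40e6, cpi_a, 200e6)  # 40 million instructions
time_b = execution_time(60e6, 6, 400e6)      # Processor B: 60 million, CPI 6
mips_b = mips_from_cpi(400e6, 6)

print(f"CPI of A: {cpi_a:.1f}")                      # CPI of A: 4.0
print(f"time A: {time_a:.2f} s, B: {time_b:.2f} s")  # time A: 0.80 s, B: 0.90 s
print(f"A is faster by {time_b / time_a:.3f}x")      # A is faster by 1.125x
print(f"MIPS of B: {mips_b:.2f}")                    # MIPS of B: 66.67
```

Processor B has both the higher clock rate and the higher MIPS rating, yet it takes longer, because its compiler emits more instructions and each one takes more cycles.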

Example
Calculate the CPI of this snippet of code using the following CPIs for each instruction type.

          add  $s0,$zero,$zero
          addi $t1,$zero,4
    loop: lw   $t2,0($t0)
          add  $t2,$t2,$t1
          addi $t0,$t0,4
          addi $t1,$t1,-1
          bne  $t1,$zero,loop
          sw   $t2,0($t2)

Instruction type | CPI | Dynamic count
-----------------|-----|--------------
add / addi       | 1   | ____
lw / sw          | 4   | ____
bne              | 2   | ____

$$\text{CPI} = \sum_{\text{Type}_i} \text{CPI}_{\text{Type}_i} \cdot P(\text{InstructionType}_i)$$

Other Performance Measures
OPS/FLOPS = (floating-point) operations/sec.
  Maximum number of arithmetic operations per second the processor can achieve.
  Example: 4 FP ALUs on a processor running at 2 GHz => 8 GFLOPS.
Memory bandwidth (bytes/sec.)
  Maximum bytes of memory per second that can be read or written.
Programs are either memory bound or computationally bound.

Energy Proportional Computing
[Figure: desired power vs. utilization relationship]
Source: "The Case for Energy-Proportional Computing," Luiz André Barroso, Urs Hölzle, IEEE Computer, vol. 40 (2007).

AMDAHL'S LAW
What should I optimize?
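For the CPI example on the MIPS loop snippet earlier in this unit, the dynamic counts and the weighted-average CPI can be tallied as follows. This is a sketch that assumes the loop body runs four times (since $t1 starts at 4 and is decremented once per iteration) and groups instructions into the three types of the CPI table; the helper names are hypothetical.

```python
from collections import Counter

# Per-type CPIs from the example's table.
CPI_BY_TYPE = {"add/addi": 1, "lw/sw": 4, "bne": 2}


def dynamic_counts(loop_iterations: int) -> Counter:
    """Dynamic instruction counts per type for the snippet."""
    counts = Counter()
    counts["add/addi"] += 2                    # add and addi before the loop
    counts["lw/sw"] += loop_iterations         # lw, once per iteration
    counts["add/addi"] += 3 * loop_iterations  # add + two addi per iteration
    counts["bne"] += loop_iterations           # bne, once per iteration
    counts["lw/sw"] += 1                       # sw after the loop
    return counts


def average_cpi(counts: Counter) -> float:
    """Weighted average: total cycles divided by total dynamic instructions."""
    total_instructions = sum(counts.values())
    total_cycles = sum(CPI_BY_TYPE[kind] * n for kind, n in counts.items())
    return total_cycles / total_instructions


counts = dynamic_counts(loop_iterations=4)
print(dict(counts))                   # {'add/addi': 14, 'lw/sw': 5, 'bne': 4}
print(f"{average_cpi(counts):.3f}")   # 1.826
```

The static count of the snippet is 8 instructions, while the dynamic count under this assumption is 23, and it is the dynamic mix that determines the average CPI.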

Amdahl's Law
Where should we put our effort when trying to enhance the performance of a program?
Amdahl's Law = how much performance gain we get by improving only a part of the whole:

$$\text{ExecTime}_{\text{New}} = \text{ExecTime}_{\text{Unaffected}} + \frac{\text{ExecTime}_{\text{Affected}}}{\text{ImprovementFactor}}$$

$$\text{Speedup} = \frac{\text{ExecTime}_{\text{Old}}}{\text{ExecTime}_{\text{New}}} = \frac{1}{\text{Percent}_{\text{Unaffected}} + \dfrac{\text{Percent}_{\text{Affected}}}{\text{ImprovementFactor}}}$$

Amdahl's Law
Holds for both HW and SW.
  HW: Which instructions should we make fast? The most used (executed) ones.
  SW: Which portions of our program should we work to optimize?
Holds for parallelization of algorithms (converting code to run on multiple processors).
[Figure: original sequential program vs. parallelized program]

Example
A programmer parallelizes a function in his program to be run on 8 cores. The function accounted for 40% of the runtime of the overall program. What is the speedup of the enhancement?

Speedup = ???

Example
What if we improve only class B instructions?

P1 type | CPI    | Freq.
--------|--------|------
A       | 1      | 10%
B       | 2 -> 1 | 40%
C       | 3      | 50%

Class B accounts for 0.4 × 2 = 0.8 of the original average CPI of 2.4, i.e. 1/3 of execution time, and is improved by a factor of 2:

$$\text{Speedup} = \frac{1}{\dfrac{2}{3} + \dfrac{1/3}{2}} = \frac{1}{5/6} = \frac{6}{5} = 1.2$$
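Both Amdahl's Law examples above can be checked with a small sketch. The function name is illustrative; the 8-core case additionally assumes the parallelized function speeds up by exactly a factor of 8 (ideal scaling), which the slide does not state explicitly.

```python
def amdahl_speedup(fraction_affected: float, improvement_factor: float) -> float:
    """Overall speedup when only `fraction_affected` of the original
    execution time is improved by `improvement_factor`."""
    fraction_unaffected = 1.0 - fraction_affected
    return 1.0 / (fraction_unaffected + fraction_affected / improvement_factor)


# Example 1: 40% of the runtime parallelized across 8 cores (ideal scaling assumed).
parallel = amdahl_speedup(0.40, 8)

# Example 2: class B instructions go from CPI 2 to CPI 1.
old_cpi = 0.10 * 1 + 0.40 * 2 + 0.50 * 3  # average CPI before the change: 2.4
fraction_b = (0.40 * 2) / old_cpi         # share of time spent in class B: 1/3
class_b = amdahl_speedup(fraction_b, 2)

print(f"{parallel:.3f}")  # 1.538
print(f"{class_b:.3f}")   # 1.200
```

Even with 8 cores, the untouched 60% of the runtime limits the first example to roughly a 1.5x gain.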

Profiling
How do you know where time is being spent?
From a software (programming for performance) perspective, profilers are handy tools.
  They instrument your code to take statistics as it runs and can then show you what percentage of time each function, or even each line of code, was responsible for.
  Common profilers:
    'gprof' (usually standard with Unix / Linux installs) and 'g++'
    Intel VTune
    MS Visual Studio Profiling Tools
From a hardware perspective, simulators can help.
  SimpleScalar
  Simics
  Your own simulation model developed in Verilog/SystemC/etc.

gprof Output

      %   cumulative   self                self     total
     time   seconds   seconds      calls  s/call   s/call  name
    42.96      4.48      4.48   56091649    0.00     0.00  Board::operator<(Board const&) const
     6.43      5.15      0.67    2209524    0.00     0.00  std::_Rb_tree<...>::_M_lower_bound(...)
     5.08      5.68      0.53  108211500    0.00     0.00  __gnu_cxx::__normal_iterator<...>::operator+(...)
     4.51      6.15      0.47    4419052    0.00     0.00  Board::Board(Board const&)
     4.32      6.60      0.45    1500793    0.00     0.00  void std::__adjust_heap<...>(...)
     3.84      7.00      0.40   28553646    0.00     0.00  PuzzleMove::operator>(PuzzleMove const&) const

Credits
These slides were derived from Gandhi Puvvada's EE 457 Class Notes.