Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides.
Some material adapted from Hennessy & Patterson / 2003 Elsevier Science.


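$$\text{Performance} = \frac{1}{\text{Execution time}}$$

$$\text{Speedup} = \frac{\text{Performance}(B)}{\text{Performance}(A)} = \frac{\text{Time}(A)}{\text{Time}(B)}$$

$$\text{CPU time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

$$\text{CPU clock cycles} = \sum_{i=1}^{n} CPI_i \times \text{Instructions}_i$$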
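The relations above can be tried out numerically. The following is a minimal Python sketch, not part of the original slides; the function names and the sample numbers (2 billion instructions, CPI of 1.5, 500 MHz clock) are hypothetical and chosen only for illustration.

```python
def cpu_clock_cycles(cpi_by_class, count_by_class):
    """Sum CPI_i * Instructions_i over all instruction classes."""
    return sum(cpi_by_class[c] * count_by_class[c] for c in count_by_class)


def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = instructions * (cycles / instruction) * (seconds / cycle)."""
    return instruction_count * cpi / clock_rate_hz


def speedup(time_a, time_b):
    """How many times faster B is than A: Time(A) / Time(B)."""
    return time_a / time_b


# Hypothetical program: 2e9 instructions, average CPI 1.5, 500 MHz clock.
seconds = cpu_time(2e9, 1.5, 500e6)
print(seconds)  # 6.0
```

Halving the CPI while keeping the other two factors fixed halves the CPU time, which `speedup(cpu_time(2e9, 1.5, 500e6), cpu_time(2e9, 0.75, 500e6))` reports as a factor of 2.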

The performance enhancement possible with a given improvement is limited by the amount that the improved feature is used.

$$\text{Execution time after improvement} = \frac{\text{Execution time affected by the improvement}}{\text{Amount of improvement}} + \text{Execution time unaffected}$$

A common theme in hardware design is to make the common case fast.
Increasing the clock rate would not affect memory access time.
Using a floating-point processing unit does not speed up integer ALU operations.

Example: floating-point instructions are improved to run 2X faster, but only 10% of actual instructions are floating point.

$$\text{Exec-Time}_{new} = \text{Exec-Time}_{old} \times (0.9 + 0.1/2) = 0.95 \times \text{Exec-Time}_{old}$$

$$\text{Speedup}_{overall} = \frac{\text{Exec-Time}_{old}}{\text{Exec-Time}_{new}} = \frac{1}{0.95} = 1.053$$

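$$\text{Time}_{old} = \text{Time}_{old} \times \left(\text{Fraction}_{unchanged} + \text{Fraction}_{enhanced}\right)$$

$$\text{Time}_{new} = \text{Time}_{old} \times \left(\text{Fraction}_{unchanged} + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}\right)$$

$$\text{Speedup}_{overall} = \frac{\text{Time}_{old}}{\text{Time}_{new}} = \frac{\text{Time}_{old}}{\text{Time}_{old} \times \left(\text{Fraction}_{unchanged} + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}\right)} = \frac{1}{\text{Fraction}_{unchanged} + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$

$$\text{Speedup}_{overall} = \frac{1}{\left(1 - \text{Fraction}_{enhanced}\right) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$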
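The overall-speedup formula above (commonly known as Amdahl's Law) is easy to check numerically. The sketch below is an illustration only; the function name `overall_speedup` is an assumption, and the inputs reuse the floating-point example (10% of the work enhanced, enhanced part runs 2X faster).

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Speedup_overall = 1 / ((1 - F_enhanced) + F_enhanced / S_enhanced)."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Floating-point example: 10% of the work runs 2X faster.
print(round(overall_speedup(0.10, 2.0), 3))  # 1.053

# Even an arbitrarily large enhancement is bounded by the unchanged 90%.
print(round(overall_speedup(0.10, 1e12), 3))  # 1.111
```

The second call shows why the rarely used feature limits the gain: with 90% of the time untouched, the overall speedup can never exceed 1/0.9.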

[Figure: time, instructions executed, code size in instructions, and code size in bits, compared across the KDF9, B5500, ICL 1907, ATLAS, CDC 6600 and NU 1108 machines]

The Burroughs B5500 machine is designed specifically for Algol 60 programs.
Although the CDC 6600's programs are over 3 times as big as those of the B5500, the CDC machine runs them almost 6 times faster.
Code size cannot be used as an indication of performance.

                       Computer A   Computer B
Program 1 (seconds)    1            10
Program 2 (seconds)    1000         100
Total time (seconds)   1001         110

A wrong summary can present a confusing picture:
A is 10 times faster than B for program 1.
B is 10 times faster than A for program 2.

Total execution time is a consistent summary measure.
Relative execution times are compared for the same workload.
Assuming that programs 1 and 2 are executed the same number of times on computers A and B:

$$\frac{\text{CPU Performance}(B)}{\text{CPU Performance}(A)} = \frac{\text{Total execution time}(A)}{\text{Total execution time}(B)} = \frac{1001}{110} = 9.1$$

Execution time is the only valid and unimpeachable measure of performance.

$$\text{Arithmetic Mean (AM)} = \frac{1}{n} \sum_{i=1}^{n} \text{Execution\_Time}_i$$

$$\text{Weighted Arithmetic Mean (WAM)} = \sum_{i=1}^{n} w_i \times \text{Execution\_Time}_i$$

Where:
n is the number of programs executed
w_i is a weighting factor that indicates the frequency of executing program i, with

$$\sum_{i=1}^{n} w_i = 1 \quad \text{and} \quad 0 \le w_i \le 1$$

Weighted arithmetic means summarize performance while tracking execution time.
Never use AM for normalizing time relative to a reference machine.

                        Time on A   Time on B   Norm. to A       Norm. to B
                                                A       B        A       B
Program 1               1           10          1       10       0.1     1
Program 2               1000        100         1       0.1      10      1
AM of normalized time                           1       5.05     5.05    1
AM of time              500.5       55          1       0.11     9.1     1
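The pitfall in the table can be reproduced in a few lines. This is a minimal Python sketch under the table's numbers; the helper names (`arithmetic_mean`, `weighted_arithmetic_mean`, `normalize`) are assumptions made for illustration.

```python
import math

TIME_A = [1.0, 1000.0]   # seconds for programs 1 and 2 on computer A
TIME_B = [10.0, 100.0]   # seconds for programs 1 and 2 on computer B


def arithmetic_mean(times):
    """AM = (1/n) * sum of execution times."""
    return sum(times) / len(times)


def weighted_arithmetic_mean(times, weights):
    """WAM = sum of w_i * time_i, with weights in [0, 1] that sum to 1."""
    if len(times) != len(weights):
        raise ValueError("times and weights must have the same length")
    if any(w < 0 or w > 1 for w in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    return sum(w * t for w, t in zip(weights, times))


def normalize(times, reference):
    """Divide each time by the reference machine's time for the same program."""
    return [t / r for t, r in zip(times, reference)]


print(arithmetic_mean(TIME_A), arithmetic_mean(TIME_B))   # 500.5 55.0
# AM of normalized times: each machine looks about 5 times slower than the
# other, depending only on which one is picked as the reference.
print(arithmetic_mean(normalize(TIME_B, TIME_A)))  # B normalized to A
print(arithmetic_mean(normalize(TIME_A, TIME_B)))  # A normalized to B
```

Both normalized means come out near 5.05, which is the contradiction the table illustrates: the AM of normalized times depends on the choice of reference machine, while the AM of raw times does not.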

$$\text{Geometric Mean (GM)} = \sqrt[n]{\prod_{i=1}^{n} \text{Execution\_Time\_ratio}_i}$$

Where n is the number of programs executed, with

$$\frac{\text{Geometric Mean}(X_i)}{\text{Geometric Mean}(Y_i)} = \text{Geometric Mean}\left(\frac{X_i}{Y_i}\right)$$

The geometric mean is suitable for reporting average normalized execution time.

                               Time on A   Time on B   Norm. to A       Norm. to B
                                                       A       B        A       B
Program 1                      1           10          1       10       0.1     1
Program 2                      1000        100         1       0.1      10      1
GM of time or normalized time  31.62       31.62       1       1        1       1
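The geometric-mean row of the table, and the ratio property stated above it, can be checked with a short Python sketch. The function name `geometric_mean` is an assumption for illustration, and `math.prod` requires Python 3.8 or later.

```python
import math

TIME_A = [1.0, 1000.0]   # seconds for programs 1 and 2 on computer A
TIME_B = [10.0, 100.0]   # seconds for programs 1 and 2 on computer B


def geometric_mean(values):
    """GM = n-th root of the product of n positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty list of positive numbers")
    return math.prod(values) ** (1.0 / len(values))


gm_a = geometric_mean(TIME_A)
gm_b = geometric_mean(TIME_B)
print(round(gm_a, 2), round(gm_b, 2))  # 31.62 31.62

# GM(X) / GM(Y) equals GM(X / Y), so the result does not depend on
# which machine is used as the reference for normalization.
ratios = [a / b for a, b in zip(TIME_A, TIME_B)]
print(math.isclose(gm_a / gm_b, geometric_mean(ratios)))  # True
```

Unlike the arithmetic mean, the geometric mean gives the same relative ranking whichever machine the times are normalized to, although it no longer tracks total execution time.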

Many widely used benchmarks are small programs that have significant locality of instruction and data reference.
Universal benchmarks can be misleading, since hardware and compiler vendors optimize their designs for these programs.
The best types of benchmarks are real applications, since they reflect the end-user interest.
Architectures might perform well for some applications and poorly for others.
Compilation can boost performance by taking advantage of architecture-specific features.
Application-specific compiler optimizations are becoming more popular.

[Figure: bar chart (vertical axis 0 to 800) comparing "Compiler" and "Enhanced compiler" results for the benchmarks gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp and tomcatv]

Application- and architecture-specific optimization can dramatically impact performance.

SPEC stands for System Performance Evaluation Cooperative suite of benchmarks.
It was created by a set of companies to improve the measurement and reporting of CPU performance.
SPEC2000 is the latest suite; it consists of 12 integer (written in C) and 14 floating-point (in Fortran 77) programs.
Customized SPEC suites have recently been introduced to assess the performance of graphics and transaction systems.
Since SPEC requires running applications on real hardware, the memory system has a significant effect on performance.

Hardware
Model number            Powerstation 550
CPU                     41.67-MHz POWER 4164
FPU (floating point)    Integrated
Number of CPU           1
Cache size per CPU      64K data/8K instruction
Memory                  64 MB
Disk subsystem          2 400-MB SCSI
Network interface       N/A

Software
OS type and revision    AIX Ver. 3.1.5
Compiler revision       AIX XL C/6000 Ver. 1.1.5, AIX XL Fortran Ver. 2.2
Other software          None
File system type        AIX
Firmware level          N/A

System
Tuning parameters       None
Background load         None
System state            Multi-user (single-user login)

The guiding principle is reproducibility (report the environment and the experimental setup).

$$\text{SPEC ratio} = \frac{\text{Execution time on SUN SPARCstation 10/40}}{\text{Execution time on the measured machine}}$$

Bigger numeric values of the SPEC ratio indicate a faster machine.

[Figure: two panels plotting SPECint95 and SPECfp95 results (vertical axis 0 to 10) against clock rate (MHz, 0 to 250) for Pentium and Pentium Pro]

The performance measured may be different on other Pentium-based hardware with a different memory system and different compilers.
At the same clock rate, the SPECint95 measure shows that the Pentium Pro is 1.4-1.5 times faster, while SPECfp95 shows that it is 1.7-1.8 times faster.
When the clock rate is increased by a certain factor, the processor performance increases by a lower factor.

[Figure: SPECbase CINT2000 and SPEC CINT2000 per $1000 in price]

Prices reflect those of July 2001.
Different results are obtained for other benchmarks, e.g. SPEC CFP2000.
With the exception of the Sunblade, price-performance metrics were consistent with performance.

In early computers, most instructions of a machine took the same execution time.
The measure of performance for old machines was the time required to perform an individual operation (e.g. addition).
New computers have a diverse set of instructions with different execution times.
The relative frequency of instructions across many programs was calculated.
The average instruction execution time was measured by multiplying the time of each instruction by its frequency.
The average instruction execution time was a small step to MIPS, which grew in popularity.

MIPS = Millions of Instructions Per Second
It is one of the simplest metrics, and it is valid only in a limited context.

$$\text{MIPS (native MIPS)} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}}$$

There are three problems with MIPS:
MIPS specifies the instruction execution rate but not the capabilities of the instructions.
MIPS varies between programs on the same computer.
MIPS can vary inversely with performance (see next example).

The use of MIPS is simple and intuitive: faster machines have bigger MIPS.

Consider the machine with the following three instruction classes and CPI:

Instruction class   CPI for this instruction class
A                   1
B                   2
C                   3

Now suppose we measure the code for the same program from two different compilers and obtain the following data:

              Instruction count (in billions) for each instruction class
Code from     A      B      C
Compiler 1    5      1      1
Compiler 2    10     1      1

Assume that the machine's clock rate is 500 MHz. Which code sequence will execute faster according to MIPS? According to execution time?

Answer: Using the formula:

$$\text{CPU clock cycles} = \sum_{i=1}^{n} CPI_i \times C_i$$

Sequence 1: CPU clock cycles = $(5 \times 1 + 1 \times 2 + 1 \times 3) \times 10^{9} = 10 \times 10^{9}$ cycles
Sequence 2: CPU clock cycles = $(10 \times 1 + 1 \times 2 + 1 \times 3) \times 10^{9} = 15 \times 10^{9}$ cycles

Using the formula:

$$\text{Execution time} = \frac{\text{CPU clock cycles}}{\text{Clock rate}}$$

Sequence 1: Execution time = $(10 \times 10^{9}) / (500 \times 10^{6}) = 20$ seconds
Sequence 2: Execution time = $(15 \times 10^{9}) / (500 \times 10^{6}) = 30$ seconds

Therefore compiler 1 generates a faster program.

Using the formula:

$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}}$$

Sequence 1: MIPS = $\frac{(5 + 1 + 1) \times 10^{9}}{20 \times 10^{6}} = 350$
Sequence 2: MIPS = $\frac{(10 + 1 + 1) \times 10^{9}}{30 \times 10^{6}} = 400$

Although compiler 2 has a higher MIPS rating, the code generated by compiler 1 runs faster.
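The worked example can be replayed in code to see the MIPS rating and the execution time disagree. The following Python sketch uses the CPI values, instruction counts and 500 MHz clock rate from the example; the function and variable names are assumptions made for illustration.

```python
CLOCK_RATE_HZ = 500e6                    # 500 MHz, from the example
CPI = {"A": 1, "B": 2, "C": 3}           # cycles per instruction, by class
COUNTS = {                               # instruction counts, by compiler
    "Compiler 1": {"A": 5e9, "B": 1e9, "C": 1e9},
    "Compiler 2": {"A": 10e9, "B": 1e9, "C": 1e9},
}


def cpu_clock_cycles(counts, cpi):
    """Sum CPI_i * C_i over the instruction classes."""
    return sum(cpi[cls] * n for cls, n in counts.items())


def execution_time(counts, cpi, clock_rate_hz):
    """Execution time = CPU clock cycles / clock rate."""
    return cpu_clock_cycles(counts, cpi) / clock_rate_hz


def native_mips(counts, cpi, clock_rate_hz):
    """MIPS = instruction count / (execution time * 10^6)."""
    seconds = execution_time(counts, cpi, clock_rate_hz)
    return sum(counts.values()) / (seconds * 1e6)


for name, counts in COUNTS.items():
    seconds = execution_time(counts, CPI, CLOCK_RATE_HZ)
    mips = native_mips(counts, CPI, CLOCK_RATE_HZ)
    print(f"{name}: {seconds:.0f} s, {mips:.0f} MIPS")
# Compiler 1: 20 s, 350 MIPS
# Compiler 2: 30 s, 400 MIPS
```

Compiler 2 executes more of the cheap class A instructions, which raises its instruction rate while also lengthening the run; this is the sense in which MIPS can vary inversely with performance.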