Chapter 14 Performance and Processor Design

Similar documents
Performance measurement. SMD149 - Operating Systems - Performance and processor design. Introduction. Important trends affecting performance issues

ECE C61 Computer Architecture Lecture 2 performance. Prof. Alok N. Choudhary.

The Role of Performance

CPE300: Digital System Architecture and Design

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

Lecture Topics. Principle #1: Exploit Parallelism ECE 486/586. Computer Architecture. Lecture # 5. Key Principles of Computer Architecture

CS 426 Parallel Computing. Parallel Computing Platforms

Reporting Performance Results

Evolution of ISAs. Instruction set architectures have changed over computer generations with changes in the

Performance of computer systems

Chapter 13 Reduced Instruction Set Computers

Handout 2 ILP: Part B

CMSC 313 Lecture 27. System Performance CPU Performance Disk Performance. Announcement: Don t use oscillator in DigSim3

Advanced Computer Architecture

Lecture 3: Evaluating Computer Architectures. How to design something:

Overview of Computer Organization. Chapter 1 S. Dandamudi

Page 1. Program Performance Metrics. Program Performance Metrics. Amdahl s Law. 1 seq seq 1

Computer Architecture. What is it?

Multiple Issue ILP Processors. Summary of discussions

Lecture 2: Computer Performance. Assist.Prof.Dr. Gürhan Küçük Advanced Computer Architectures CSE 533

Lecture 1: Introduction

Overview of Computer Organization. Outline

Modern Processors. RISC Architectures

Computer Performance Evaluation: Cycles Per Instruction (CPI)

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

Course web site: teaching/courses/car. Piazza discussion forum:

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

Instruction Level Parallelism

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

RISC Principles. Introduction

Instructor Information

Advanced issues in pipelining

Lecture 4: Instruction Set Design/Pipelining

Intel released new technology call P6P

MEASURING COMPUTER TIME. A computer faster than another? Necessity of evaluation computer performance

From CISC to RISC. CISC Creates the Anti CISC Revolution. RISC "Philosophy" CISC Limitations

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

CSE502 Graduate Computer Architecture. Lec 22 Goodbye to Computer Architecture and Review

Lecture 12. Motivation. Designing for Low Power: Approaches. Architectures for Low Power: Transmeta s Crusoe Processor

Ten Reasons to Optimize a Processor

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

Fundamentals of Quantitative Design and Analysis

Alpha AXP Workstation Family Performance Brief - OpenVMS

Latches. IT 3123 Hardware and Software Concepts. Registers. The Little Man has Registers. Data Registers. Program Counter

Power Struggles: Revisiting the RISC vs. CISC Debate on Contemporary ARM and x86 Architectures

Control Hazards. Branch Prediction

The Processor: Instruction-Level Parallelism

CS 5803 Introduction to High Performance Computer Architecture: RISC vs. CISC. A.R. Hurson 323 Computer Science Building, Missouri S&T

Historical Perspective and Further Reading 3.10

Lecture - 4. Measurement. Dr. Soner Onder CS 4431 Michigan Technological University 9/29/2009 1

Lecture 4: RISC Computers

Basics of Performance Engineering

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken

Overview of Today s Lecture: Cost & Price, Performance { 1+ Administrative Matters Finish Lecture1 Cost and Price Add/Drop - See me after class

Chapter 1: Fundamentals of Quantitative Design and Analysis

Superscalar Processors

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.

Uniprocessors. HPC Fall 2012 Prof. Robert van Engelen

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

Processor Organization and Performance

Multi-core Architectures. Dr. Yingwu Zhu

Kaisen Lin and Michael Conley

Lecture 4: ISA Tradeoffs (Continued) and Single-Cycle Microarchitectures

CS/ECE 752: Advanced Computer Architecture 1. Lecture 1: What is Computer Architecture?

Computer Systems Laboratory Sungkyunkwan University

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University

CS61C - Machine Structures. Week 6 - Performance. Oct 3, 2003 John Wawrzynek.

Copyright 2012, Elsevier Inc. All rights reserved.

Main Points of the Computer Organization and System Software Module

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Chapter 04: Instruction Sets and the Processor organizations. Lesson 20: RISC and converged Architecture

4.16. Historical Perspective and Further Reading. supercomputer: Any machine still on the drawing board.

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

TRIPS: Extending the Range of Programmable Processors

Spring 2014 Midterm Exam Review

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

Chapter 4 The Processor (Part 4)

Microprocessor Architecture Dr. Charles Kim Howard University

How What When Why CSC3501 FALL07 CSC3501 FALL07. Louisiana State University 1- Introduction - 1. Louisiana State University 1- Introduction - 2

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Lecture 4: RISC Computers

Independent DSP Benchmarks: Methodologies and Results. Outline

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

EC 513 Computer Architecture

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

SOFT 437. Software Performance Analysis. Ch 7&8:Software Measurement and Instrumentation

Outline. What Makes a Good ISA? Programmability. Implementability

Computer System Architecture

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

Interaction of JVM with x86, Sparc and MIPS

Evolution of Computers & Microprocessors. Dr. Cahit Karakuş

A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle.

CISC Attributes. E.g. Pentium is considered a modern CISC processor

Chapter 5:: Target Machine Architecture (cont.)

The Von Neumann Computer Model

Implementing Scheduling Algorithms. Real-Time and Embedded Systems (M) Lecture 9

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

Transcription:

Chapter 14 Performance and Processor Design Outline 14.1 Introduction 14.2 Important Trends Affecting Performance Issues 14.3 Why Performance Monitoring and Evaluation are Needed 14.4 Performance Measures 14.5 Performance Evaluation Techniques 14.5.1 Tracing and Profiling 14.5.2 Timings and Microbenchmarks 14.5.3 Application-Specific Evaluation 14.5.4 Analytic Models 14.5.5 Benchmarks 14.5.6 Synthetic Programs 14.5.7 Simulation 14.5.8 Performance Monitoring 14.6 Bottlenecks and Saturation 14.7 Feedback Loops 14.7.1 Negative Feedback 14.7.2 Positive Feedback Chapter 14 Performance and Processor Design Outline (continued) 14.8 Performance Techniques in Processor Design 14.8.1 Complex Instruction Set Computing (CISC) 14.8.2 Reduced Instruction Set Computing (RISC) 14.8.3 Post-RISC Processors 14.8.4 Explicitly Parallel Instruction Computing (EPIC) 1

Objectives After reading this chapter, you should understand: the need for performance measures. common performance metrics. several techniques for measuring relative system performance. the notions of bottlenecks, saturation and feedback. popular architectural design philosophies for processors. processor design techniques that increase performance. 14.1 Introduction Performance evaluation is useful for Consumers Developers Users System performance is affected by Software, particularly the operating system Hardware, particularly processors 2

14.2 Important Trends Affecting Performance Issues Early days of computing: performance evaluation almost exclusively for hardware Now: software must be evaluated too Raw measures influential (e.g., GHz) Industry standardization Evaluation tools becoming more sophisticated 14.3 Why Performance Monitoring and Evaluation are Needed Common purposes of performance evaluation Selection evaluation Performance projection Performance monitoring Performance evaluation useful when: Developing a system Deciding whether to buy or upgrade System tuning 3

14.4 Performance Measures Two types of measures Absolute performance measures (e.g., throughput) Relative performance measures (e.g., ease of use) Common performance measures: Turnaround time = time from when a job is submitted to when a result is returned Response time = time from when a user enters a request until the system displays a response System reaction time = time from when a user enters a request until the first time slice of service is given to that user request Represented by random variables 14.4 Performance Measures Other commonly employed performance measures: Variance in response time = how closely individual response times are to the mean response time Throughput = work per unit time Workload = amount of work submitted to a system; the system is evaluated in relation to a specified workload. Capacity = maximum throughput Utilization = fraction of time a resource is in use 4

14.5 Performance Evaluation Techniques Different techniques for different purposes Some evaluate the system as a whole Some isolate the performance of individual subsystems, components, functions or instructions Thorough evaluation involves using more than one technique 14.5.1 Tracing and Profiling Trace Record of system activity, typically a log of user or application requests to the operating system Characterizes a system s execution environment Manipulate to test for what if scenarios Standard traces Can be used to compare systems that execute in a similar environment Standard traces are difficult to obtain because Traces proprietary to installation where recorded Subtle differences between environments can make an impact on performance, hindering the portability of traces 5

14.5.1 Tracing and Profiling Profile Record of system activity in kernel mode (e.g., process scheduling and I/O management) Indicate which primitives are most heavily used and should be optimized 14.5.2 Timings and Microbenchmarks Timing Raw performance measure (e.g., cycles per second or instructions per second) Quick comparisons between hardware Comparisons between members of the same family of computers (e.g., Pentiums) Microbenchmark Measures the time required to perform a specific operating system operation (e.g., process creation) Also used for system operations (e.g., read/write bandwidth) Only used for measuring small aspects of system performance, not the system s performance as a whole 6

14.5.2 Timings and Microbenchmarks Microbenchmark suites Programs that contain a number of microbenchmarks to test different instructions and operations of a system lmbench Compare system performance between different UNIX platforms Several limitations Timings too coarse (used a software clock) for some tests Statistics reporting was not uniform hbench Analyzes the relationship between operating system primitives and hardware components Corrected some of the limitations of lmbench 14.5.3 Application-Specific Evaluation Vector-based methodology System vector Microbenchmark test for each primitive Vector consists of the results of these tests Application vector Profile the system when running the target application Vector consists of the demand on each primitive Performance of the system obtained by For each element in the system vector, multiply it by the element in the application vector that corresponds to the same primitive Sum the results 7

14.5.3 Application-Specific Evaluation Hybrid methodology Combines the vector-based methodology with a trace Useful for system s whose execution environment depends not only the target application, but on the stream of user requests (e.g., a Web server) Kernel program A simple algorithm (e.g., matrix inversion) or an entire program Executed on paper using manufacturer s timings Useful for consumers who have not yet purchased a system Not commonly used anymore 14.5.4 Analytic Models Analytic models Mathematical representations of computer systems Examples: those of queuing theory and Markov processes Pros A large body of results exist that can be applied to new models Can be relatively fast and accurate Cons Systems often too complex to model exactly Evaluator must be a skilled mathematician Must use other techniques to validate results 8

14.5.5 Benchmarks Production program or industry-standard benchmark Run to completion and timed on target machine Pros Run on the actual machine so human error minimal Capture most aspects of a system and its environment Typically already exist so easy to use Characteristics of a good benchmark Produces nearly the same result each run on the same machine Relevant to the types of applications executed on target system Widely used 14.5.5 Benchmarks Standard Performance Evaluation Corporation (SPEC) Develops industry-standard benchmarks, called SPECmarks Publishes thousands of results per month Criticized because workloads not relevant enough to actual workloads on systems being tested Continually works to develop more relevant benchmarks 9

14.5.5 Benchmarks Other Industry-standard benchmarks Business Application Performance Corporation (BAPCo): SYSmark, MobileMark, WebMark Transaction Processing Performance Council (TPC) benchmarks for database systems Standard Application (SAP) benchmarks evaluating scalability 14.5.6 Synthetic Programs Synthetic programs Programs constructed for a specific purpose (not a real program) To test a specific component To approximate the instruction mix of an application or group of applications Useful for isolating the performance of specific components, but not the system as a whole 10

14.5.6 Synthetic Programs Examples of synthetic programs Whetstone (floating point calculations), no longer used Dhrystone (execution of system programs), no longer used WinBench 99 (graphics, disk and video subsystems in a Windows environment) IOStone (file systems) Hartstone (real-time systems) STREAM (memory subsystem) 14.5.7 Simulation Simulation Computerized model of a system Useful in performance projection Results of a simulation must be validated Two types Event driven simulators controlled by events made to occur according to a probability distribution Script-driven simulators controlled by data carefully manipulated to reflect the system s anticipated environment Common errors Bugs in the simulator Deliberate omissions (due to complexity) Imprecise modeling 11

14.5.8 Performance Monitoring Performance monitoring Can locate inefficiencies in a system that administrators or developers can remove Software monitors Windows Task Manager and Linux proc file system Might distort results because these program require system resources Hardware monitors Use counting registers Record events such as TLB misses, clock ticks and memory operations 14.5.8 Performance Monitoring Figure 14.1 Summary of performance evaluation techniques. 12

14.6 Bottlenecks and Saturation Bottleneck Resource that performs its designated task slowly relative to other resources Degrades system performance Arrival rate > service rate Removing a bottleneck might not increase performance if there are other bottlenecks Saturated resource Processes competing for use of the resource interfere with each other s execution Thrashing occurs when memory is saturated 14.7 Feedback Loops Feedback loop Technique in which information about the current state of the system can affect arriving requests Negative feedback implies resource is saturated Positive feedback implies resource is underutilized 13

14.7.1 Negative Feedback Negative feedback Arrival rate a resource might decrease as a result of negative feedback Examples: Multiple print servers Print servers with long queues cause negative feedback Jobs go to other print servers Contributes to system stability 14.7.2 Positive Feedback Positive feedback Arrival rate a resource might increase as a result of positive feedback Might be misleading E.g., Processor utilization is low might cause the scheduler to admit more processes to that processor s queue Low utilization might be due to thrashing Admitting more processes causes more thrashing and worse performance Designers must be cautious of these types of situations 14

14.8 Performance Techniques in Processor Design Conceptually, a processor can be divided into: Instruction set Hardware implementation Instruction Set Architecture (ISA) Interface that describes the processor Instruction set, number of registers, etc. Like an API exposed by the processor 14.8.1Complex Instruction Set Computing (CISC) Complex instruction set computing (CISC) Incorporate frequently used sections of code into single machine language instructions Makes assembly code writing easier Saves memory Optimizes the execution of complex functions Popular until mid-1980s 15

14.8.1 Complex Instruction Set Computing (CISC) Characteristics of CISC processors Many instructions Instruction decoded by firmware Few general purpose registers Examples: Pentium, Athlon Pipelining Divides a processor s datapath into discrete stages One instruction per stage per clock cycle Increases parallelism and hence processor performance Originally developed for CISC processors First pipeline: IBM 7030 ( Stretch ) 14.8.2 Reduced Instruction Set Computing (RISC) Studies revealed that only a few simple instructions accounted for nearly all instructions executed by a processor IBM System/370 10 most frequently executed instructions accounted for 2/3 of instructions executed IBM Series/1 - programmers tended to generate semantically equivalent instruction sequences Provided compelling evidence for reduced instruction set computing (RISC) 16

14.8.2 Reduced Instruction Set Computing (RISC) Figure 14.2 The ten most frequently executed instructions on IBM s System/370 architecture. (Courtesy of International Business Machines Corporation.) 14.8.2 Reduced Instruction Set Computing (RISC) RISC Few instructions Complexity in the software Instruction decode hardwired All instructions a fixed size (typically, one machine word) All instructions require nearly the same amount execution time Many general purpose registers RISC performance gains vs. CISC Better use of pipelines Delayed branching Common instructions execute fast Fewer memory accesses 17

14.8.2 Reduced Instruction Set Computing (RISC) RISC performance losses vs. CISC Longer context switch time Consume more memory Slower for complex operations such as floating point Examples SPARC MIPS G5 14.8.2 Reduced Instruction Set Computing (RISC) Figure 14.3 RISC and CISC comparison. 18

14.8.3 Post-RISC Processors Modern processors Stray from traditional RISC and CISC designs Include anything that increases performance Common names: post-risc, second generation RISC and fast instruction set computing (FISC) Some do not agree that RISC and CISC designs converged RISC convergence to CISC Superscalar architecture Out of order execution (OOO) Branch prediction On-chip floating point and vector support Additional, infrequently used instructions 14.8.3 Post-RISC Processors CISC convergence to RISC Core RISC instruction set All instructions decoded to RISC instructions (e.g., Pentium) Complex instructions only included to provide backwards compatibility 19

14.8.4 Explicitly Parallel Instruction Computing (EPIC) Motivations Hardware complexity from superscalar architectures do not scale well Exploit instruction-level parallelism (ILP) Characteristics Many execution units Borrows from superscalar and VLIW techniques Compiler decides path of execution; no OOO Multi-op instructions Branch predication Speculative loading 14.8.4 Explicitly Parallel Instruction Computing (EPIC) Figure 14.4 Instruction execution in an (a) EPIC processor and (b) post-risc processor. 20