Computer Science 246. Computer Architecture

Computer Architecture Spring 2010 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu

Lecture Outline
- Performance Metrics
- Averaging
- Amdahl's Law
- Benchmarks
- The CPU Performance Equation
- Optimal Pipelining Case Study
- Modern Processor Analysis
- Begin ISAs

Performance Metrics
- Better execution time is often what we target
- Throughput (tasks/sec) vs. latency (sec/task)
- How do we decide the tasks? Benchmarks
  - What the customer cares about: real applications
  - Representative programs (SPEC, SYSMARK, etc.)
  - Kernels: code fragments from real programs (Linpack)
  - Toy programs: Quicksort, Sieve of Eratosthenes
  - Synthetic programs: just a representative instruction mix (Whetstone, Dhrystone)

Measuring Performance
- Total Execution Time: this is the arithmetic mean
$$\frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$$
- This should be used when measuring performance in execution times (CPI)

Measuring Performance
- What if P1 and P2 are not run equally?
- Weighted Execution Time:
$$\sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i$$

Measuring Performance
- Normalized Execution Time: normalize to a reference machine
- Can only use the geometric mean (the arithmetic mean can vary depending on the reference machine)
$$\sqrt[n]{\prod_{i=1}^{n} \text{ExecutionTimeRatio}_i}$$
- Problem: the result is a ratio, not an execution time
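The three ways of summarizing execution times can be compared on a small example. The sketch below is only an illustration: the two machines, their execution times, and the 90/10 weighting are made-up numbers, and the helper names are hypothetical. It shows that the arithmetic mean of normalized times depends on the chosen reference machine, while the geometric mean does not, and that the geometric mean can hide a large difference in total execution time.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def weighted_mean(values, weights):
    # Weights are assumed to sum to 1 (share of the workload per program).
    return sum(w * v for w, v in zip(weights, values))

def geometric_mean(ratios):
    # n-th root of the product, computed through logarithms.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def normalized(times, reference):
    # Per-program execution time ratio relative to a reference machine.
    return [t / r for t, r in zip(times, reference)]

# Hypothetical execution times in seconds for programs P1 and P2.
times_a = [1.0, 1000.0]
times_b = [10.0, 100.0]

print("total time A, B:", sum(times_a), sum(times_b))
print("weighted time A, B (P1 is 90% of runs):",
      weighted_mean(times_a, [0.9, 0.1]), weighted_mean(times_b, [0.9, 0.1]))

for ref_name, ref in (("A", times_a), ("B", times_b)):
    for name, times in (("A", times_a), ("B", times_b)):
        ratios = normalized(times, ref)
        print(f"ref={ref_name} machine={name} "
              f"arith={arithmetic_mean(ratios):.3f} geo={geometric_mean(ratios):.3f}")
```

With these numbers each machine looks about five times slower than the other under the arithmetic mean, depending on which one is the reference, whereas the geometric mean rates them as equal even though their total execution times differ by roughly a factor of nine.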

Harmonic Mean: Motivation
- 30 mph for the first 10 miles
- 90 mph for the next 10 miles
- Average speed? (30 + 90)/2 = 60 mph. WRONG!
- Average speed = total distance / total time = 20/(10/30 + 10/90) = 45 mph

Harmonic Mean
- Each program has O operations
- n programs execute nO operations in $\sum_i T_i$ seconds
- The execution rate is then
$$\frac{nO}{\sum_{i} T_i} = \frac{n}{\sum_{i} (T_i / O)} = \frac{n}{\sum_{i} (1 / P_i)}$$
  where $P_i = O / T_i$ is the rate of execution of program i
- The harmonic mean should be used when measuring performance in execution rates (IPC)
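A minimal sketch of the harmonic mean, checked against the driving example from the motivation slide. The function name is hypothetical, and the equal-distance legs stand in for the assumption that every program performs the same number of operations.

```python
def harmonic_mean(rates):
    # n divided by the sum of reciprocal rates.
    return len(rates) / sum(1.0 / r for r in rates)

# Two 10-mile legs driven at 30 mph and 90 mph.
speeds = [30.0, 90.0]
distance_per_leg = 10.0

total_distance = distance_per_leg * len(speeds)
total_time = sum(distance_per_leg / s for s in speeds)

print("harmonic mean of speeds:", harmonic_mean(speeds))
print("total distance / total time:", total_distance / total_time)
print("arithmetic mean (misleading here):", sum(speeds) / len(speeds))
```

The harmonic mean agrees with total distance over total time only because both legs cover the same distance. If the legs differed, a weighted form would be needed.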

Amdahl's Law (Law of Diminishing Returns)
- Very intuitive: make the common case fast
$$\text{Speedup}_{overall} = \frac{\text{Execution time for task without enhancement}}{\text{Execution time for task using enhancement}}$$
$$\text{Execution time}_{new} = \text{Execution time}_{old} \times \left( (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right)$$

Amdahl's Law Corollary
$$\text{Speedup}_{overall} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$
As $\text{Speedup}_{enhanced} \to \infty$:
$$\text{Speedup}_{overall} \to \frac{1}{1 - \text{Fraction}_{enhanced}}$$

Amdahl's Law Example
- Fraction_enhanced = 95%, Speedup_enhanced = 1.1x
  Speedup_overall = 1/((1 - 0.95) + (0.95/1.1)) ≈ 1.095
- Fraction_enhanced = 5%, Speedup_enhanced = 10x
  Speedup_overall = 1/((1 - 0.05) + (0.05/10)) ≈ 1.047
- Make the common case fast!
- Fraction_enhanced = 5%, Speedup_enhanced = infinity
  Speedup_overall = 1/(1 - 0.05) ≈ 1.053
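The three cases above can be reproduced with a few lines of code. This is a sketch only: the function name is hypothetical, and the infinite speedup is modelled with `float("inf")`, which makes the enhanced term vanish.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Overall speedup = 1 / ((1 - F) + F / S)
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

cases = [
    (0.95, 1.1),            # small gain on the common case
    (0.05, 10.0),           # large gain on a rare case
    (0.05, float("inf")),   # unbounded gain on a rare case
]

for fraction, speedup in cases:
    print(f"F={fraction:.2f} S={speedup}: overall = {amdahl_speedup(fraction, speedup):.4f}")
```

A modest 1.1x improvement to 95% of the execution time beats even an unbounded improvement to 5% of it, which is the point of "make the common case fast".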

MIPS
$$\text{MIPS} = \frac{\text{instruction count}}{\text{execution time} \times 10^6} = \frac{\text{clock rate}}{\text{CPI} \times 10^6}$$
- Problems
  - ISAs are not equivalent, e.g. RISC vs. CISC: 1 CISC instruction may equal many RISC instructions!
  - Programs use different instruction mixes
- May be OK when comparing the same benchmarks, same ISA, same compiler, same OS
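The sketch below illustrates why MIPS can mislead across ISAs. The two machines and all their numbers are invented for illustration. Machine X needs three times as many instructions as machine Y for the same program, so it posts the higher MIPS rating while taking longer to finish.

```python
def execution_time(instruction_count, cpi, clock_rate_hz):
    # seconds = instructions * cycles/instruction / (cycles/second)
    return instruction_count * cpi / clock_rate_hz

def mips_from_time(instruction_count, exec_time_s):
    return instruction_count / (exec_time_s * 1e6)

def mips_from_clock(clock_rate_hz, cpi):
    return clock_rate_hz / (cpi * 1e6)

# Hypothetical machines running the same program at 1 GHz.
machines = {
    "X (simple instructions)": {"instruction_count": 3e9, "cpi": 1.0, "clock_rate_hz": 1e9},
    "Y (complex instructions)": {"instruction_count": 1e9, "cpi": 2.0, "clock_rate_hz": 1e9},
}

for name, m in machines.items():
    t = execution_time(**m)
    rating = mips_from_time(m["instruction_count"], t)
    print(f"{name}: time = {t:.2f} s, MIPS = {rating:.0f}")
```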

MFLOPS
- Same as MIPS, just FP ops
- Not useful either
  - FP-intensive apps needed
  - Traditionally, FP ops were slow, so INT could be ignored
  - BUT, now memory ops can be the slowest!
- Peak MFLOPS is a common marketing fallacy
  - Basically, it just says #FP-pipes x Clock Rate

GHz
- Is this a metric? Maybe as good as the others
- One number, no benchmarks, what can be better?
- Many designs are frequency driven

Processor          Clock Rate   SPEC FP2000
IBM POWER3         450 MHz      434
Intel PIII         1.4 GHz      456
Intel Pentium 4    2.4 GHz      833
Itanium-2          1.0 GHz      1356

Benchmark Suites
- SPEC CPU2000 (int and float) (Desktop, Server)
- EEMBC ("embassy"), SPECjvm (Embedded)
- TPC-C, TPC-H, SPECjbb, ECperf (Server)

SPEC CPU2000: Integer Benchmarks

Benchmark      Language   Category
164.gzip       C          Compression
175.vpr        C          FPGA Circuit Placement and Routing
176.gcc        C          C Programming Language Compiler
181.mcf        C          Combinatorial Optimization
186.crafty     C          Game Playing: Chess
197.parser     C          Word Processing
252.eon        C++        Computer Visualization
253.perlbmk    C          PERL Programming Language
254.gap        C          Group Theory, Interpreter
255.vortex     C          Object-oriented Database
256.bzip2      C          Compression
300.twolf      C          Place and Route Simulator

SPEC CPU2000: Floating Point Benchmarks

Benchmark      Language     Category
168.wupwise    Fortran 77   Physics / Quantum Chromodynamics
171.swim       Fortran 77   Shallow Water Modeling
172.mgrid      Fortran 77   Multi-grid Solver: 3D Potential Field
173.applu      Fortran 77   Parabolic / Elliptic Partial Differential Equations
177.mesa       C            3-D Graphics Library
178.galgel     Fortran 90   Computational Fluid Dynamics
179.art        C            Image Recognition / Neural Networks
183.equake     C            Seismic Wave Propagation Simulation
187.facerec    Fortran 90   Image Processing: Face Recognition
188.ammp       C            Computational Chemistry
189.lucas      Fortran 90   Number Theory / Primality Testing
191.fma3d      Fortran 90   Finite-element Crash Simulation
200.sixtrack   Fortran 77   High Energy Nuclear Physics Accelerator Design
301.apsi       Fortran 77   Meteorology: Pollutant Distribution

Server Benchmarks
- TPC-C (Online Transaction Processing, OLTP)
  - Models a simple order-entry application
  - Thousands of concurrent database accesses

System               # / Processor           tpm     $/tpm
Fujitsu PrimePower   128, 563 MHz SPARC64    455K    $28.58
HP SuperDome         64, 875 MHz PA8700      423K    $15.64
IBM p690             32, 1300 MHz POWER4     403K    $17.80

- TPC-H (Ad-hoc, decision support)
  - Data warehouse, backend analysis tools for data

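CPU Performance Equation
$$\text{Execution Time} = \frac{\text{seconds}}{\text{program}} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{seconds}}{\text{cycle}}$$
Labels from the slide's diagram (their placement under the three terms was lost in transcription): Program, Compiler (Scheduling), Technology, Architecture (ISA), Organization (uarch), Physical Design, Compiler, Microarchitects, Circuit Designers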

Common Architecture Trick
- Instructions/Program (path length) is constant
  - Same benchmark, same compiler
  - OK usually, but for some ideas the compiler may change
- Seconds/Cycle (cycle time) is constant
  - "My tweak won't impact cycle time"
  - Often a bad assumption
  - Current designs are ~12-20 FO4 inverter delays per cycle
- Just focus on Cycles/Instruction (CPI or IPC)
- Many architecture studies do just this!
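The CPU performance equation and the "hold everything but CPI constant" trick can be worked through numerically. All numbers below are hypothetical: a baseline design, a tweak that lowers CPI by 10%, and the same tweak when it also stretches the cycle time by 15%.

```python
def execution_time(instructions, cpi, cycle_time_s):
    # seconds/program = instructions/program * cycles/instruction * seconds/cycle
    return instructions * cpi * cycle_time_s

instructions = 1e9            # path length, assumed unchanged by the tweak
base_cpi, base_cycle = 1.5, 0.5e-9

baseline = execution_time(instructions, base_cpi, base_cycle)

# The tweak as usually reported: CPI improves, cycle time assumed constant.
tweak_ideal = execution_time(instructions, base_cpi * 0.90, base_cycle)

# The same tweak if the extra logic lengthens the cycle by 15%.
tweak_real = execution_time(instructions, base_cpi * 0.90, base_cycle * 1.15)

print(f"baseline:                  {baseline:.4f} s")
print(f"tweak, cycle constant:     {tweak_ideal:.4f} s (speedup {baseline / tweak_ideal:.3f})")
print(f"tweak, cycle 15% longer:   {tweak_real:.4f} s (speedup {baseline / tweak_real:.3f})")
```

With the cycle time held constant the speedup is simply the CPI ratio. Once the cycle time grows, the same CPI gain turns into a net slowdown, which is why the constant-cycle-time assumption is often a bad one.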

Instruction Set Architecture
"Instruction Set Architecture is the structure of a computer that a machine language programmer (or a compiler) must understand to write a correct (timing independent) program for that machine." (IBM, Introducing the IBM 360, 1964)
- The ISA defines:
  - Operations that the processor can execute
  - Data transfer mechanisms + how to access data
  - Control mechanisms (branch, jump, etc.)
- Contract between programmer/compiler + HW

Classifying ISAs

Stack
- Architectures with an implicit stack
  - Acts as source(s) and/or destination, TOS is implicit
  - Push and Pop operations have 1 explicit operand
- Example: C = A + B
    Push A    // S[++TOS] = Mem[A]
    Push B    // S[++TOS] = Mem[B]
    Add       // Tem1 = S[TOS--], Tem2 = S[TOS--], S[++TOS] = Tem1 + Tem2
    Pop C     // Mem[C] = S[TOS--]
- x86 FP uses a stack (complicates pipelining)
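The stack example can be traced with a tiny interpreter. This is an illustrative sketch, not a model of any real machine: the tuple-based instruction encoding, the function name, and the memory values are assumptions, and a Python list stands in for the hardware stack.

```python
def run_stack_program(program, memory):
    """Execute Push/Add/Pop instructions against a symbolic memory."""
    stack = []  # top of stack (TOS) is the end of the list
    for op, *operand in program:
        if op == "push":
            stack.append(memory[operand[0]])      # S[++TOS] = Mem[X]
        elif op == "add":
            tem1 = stack.pop()                    # Tem1 = S[TOS--]
            tem2 = stack.pop()                    # Tem2 = S[TOS--]
            stack.append(tem1 + tem2)             # S[++TOS] = Tem1 + Tem2
        elif op == "pop":
            memory[operand[0]] = stack.pop()      # Mem[X] = S[TOS--]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return memory

# C = A + B
program = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
memory = run_stack_program(program, {"A": 2, "B": 3, "C": 0})
print(memory)
```

Only Push and Pop name a memory location. Add takes both sources and its destination from the implicit top of stack.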

Accumulator
- Architectures with one implicit register
  - Acts as source and/or destination
  - One other source is explicit
- Example: C = A + B
    Load A     // (Acc)umulator <= A
    Add B      // Acc <= Acc + B
    Store C    // C <= Acc
- Accumulator is implicit: bottleneck?
- x86 uses accumulator concepts for integer

Register
- Most common approach
  - Fast, temporary storage (small)
  - Explicit operands (register IDs)
- Example: C = A + B

    Register-memory       load/store
    Load R1, A            Load R1, A
    Add R3, R1, B         Load R2, B
    Store R3, C           Add R3, R1, R2
                          Store R3, C

- All RISC ISAs are load/store
- IBM 360, Intel x86, Moto 68K are register-memory
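The two register-style sequences can be run side by side to see the trade-off. The interpreter below is a sketch under stated assumptions: instructions are encoded as tuples, names beginning with "R" are treated as registers and everything else as a memory symbol, and the memory values are made up.

```python
def run_register_program(program, memory):
    """Execute Load/Add/Store; Add operands may be registers or memory symbols."""
    regs = {}

    def value(operand):
        # Assumption of this sketch: register names start with "R".
        return regs[operand] if operand.startswith("R") else memory[operand]

    for op, *args in program:
        if op == "load":                 # Load Rd, X
            regs[args[0]] = memory[args[1]]
        elif op == "add":                # Add Rd, Rs, (Rt or X)
            regs[args[0]] = value(args[1]) + value(args[2])
        elif op == "store":              # Store Rs, X
            memory[args[1]] = regs[args[0]]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return memory

register_memory = [("load", "R1", "A"), ("add", "R3", "R1", "B"), ("store", "R3", "C")]
load_store = [("load", "R1", "A"), ("load", "R2", "B"),
              ("add", "R3", "R1", "R2"), ("store", "R3", "C")]

for name, program in (("register-memory", register_memory), ("load/store", load_store)):
    result = run_register_program(program, {"A": 2, "B": 3, "C": 0})
    print(f"{name}: {len(program)} instructions, C = {result['C']}")
```

Both sequences compute the same result. The register-memory version uses one fewer instruction, while the load/store version keeps memory access confined to Load and Store.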

Common Addressing Modes

Mode                 Example
Base/Displacement    Load R4, 100(R1)
Register Indirect    Load R4, (R1)
Indexed              Load R4, (R1+R2)
Direct               Load R4, (1001)
Memory Indirect      Load R4, @(R3)
Autoincrement        Load R4, (R2)+
Scaled               Load R4, 100(R2)[R3]
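The sketch below computes the effective address for each mode in the list. The semantics follow the usual textbook definitions rather than anything stated on the slide, and the function name, the mode strings, the register and memory contents, and the 4-byte operand size are all assumptions made for illustration.

```python
def effective_address(mode, regs, mem, disp=0, base=None, index=None, size=4):
    """Return the memory address a Load would access under the given mode."""
    if mode == "displacement":          # 100(R1)     -> 100 + R1
        return disp + regs[base]
    if mode == "register_indirect":     # (R1)        -> R1
        return regs[base]
    if mode == "indexed":               # (R1+R2)     -> R1 + R2
        return regs[base] + regs[index]
    if mode == "direct":                # (1001)      -> 1001
        return disp
    if mode == "memory_indirect":       # @(R3)       -> Mem[R3]
        return mem[regs[base]]
    if mode == "autoincrement":         # (R2)+       -> R2, then R2 += size
        address = regs[base]
        regs[base] += size
        return address
    if mode == "scaled":                # 100(R2)[R3] -> 100 + R2 + R3 * size
        return disp + regs[base] + regs[index] * size
    raise ValueError(f"unknown addressing mode: {mode}")

regs = {"R1": 200, "R2": 300, "R3": 8}
mem = {8: 500}  # word at address 8 holds a pointer

print(effective_address("displacement", regs, mem, disp=100, base="R1"))
print(effective_address("indexed", regs, mem, base="R1", index="R2"))
print(effective_address("memory_indirect", regs, mem, base="R3"))
print(effective_address("scaled", regs, mem, disp=100, base="R2", index="R3"))
print(effective_address("autoincrement", regs, mem, base="R2"), regs["R2"])
```

Autoincrement is the only mode here that changes register state as a side effect, which is one reason such modes complicate implementations.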

What leads to a good/bad ISA?
- Ease of Implementation (job of architect/designer)
  - Does the ISA lend itself to efficient implementations?
- Ease of Programming (job of programmer/compiler)
  - Can the compiler use the ISA effectively?
- Future Compatibility
  - ISAs may last 30+ years
  - Special features, address range, etc. need to be thought out

Implementation Concerns
- Simple decoding (fixed length)
- Compactness (variable length)
- Simple instructions (no load/update)
  - Things that get microcoded these days
  - Deterministic latencies are key!
  - Instructions with multiple exceptions are difficult
- More/fewer registers?
  - Slower register files, decoding, better compilers
- Condition codes/flags (scheduling!)

Programmability
- 1960s, early 70s
  - Code was mostly hand-coded
- Late 70s, early 80s
  - Most code was compiled, but hand-coded was better
- Mid-80s to present
  - Most code is compiled and almost as good as assembly
- Why?

Programmability: 70s, Early 80s
- Closing the Semantic Gap
  - High-level languages match assembly languages
- Efforts for computers to execute HLL directly, e.g. LISP Machine
  - Hardware type checking: special type bits let the type be checked efficiently at run-time
  - Hardware garbage collection
  - Fast function calls
  - Efficient representation of lists
- Never worked out: Semantic Clash
  - Too many HLLs? C was more popular?
  - Is this coming back with Java? (Sun's picoJava)

Programmability: 1980s-2000s
- In the Compiler We Trust
- Wulf: Primitives not Solutions
  - Compilers cannot effectively use complex instructions
  - Synthesize programs from primitives
- Regularity: same behavior in all contexts
  - No odd cases: things should be intuitive
- Orthogonality:
  - Data type independent of addressing mode
  - Addressing mode independent of operation performed

ISA Compatibility
"In Computer Architecture, no good idea ever goes unpunished." (Marty Hopkins, IBM Fellow)
- Never abandon the existing code base
- Extremely difficult to introduce a new ISA
  - Alpha failed, IA64 is struggling, the best solution may not win
  - x86 is the most popular and the least liked!
- Hard to think ahead, but...
  - An ISA tweak may buy 5-10% today
  - 10 years later it may buy nothing, but it must still be implemented
  - Register windows, delayed branches

CISC vs. RISC
- Debate raged from the early 80s through the 90s
- Now it is fairly irrelevant
- Despite this, Intel (x86 => Itanium) and DEC/Compaq (VAX => Alpha) have tried to switch
- Research in the late 70s/early 80s led to RISC
  - IBM 801: John Cocke, mid 70s
  - Berkeley RISC-1 (Patterson)
  - Stanford MIPS (Hennessy)

VAX
- 32-bit ISA, instructions could be huge (up to 321 bytes), 16 GPRs
- Operated on data types from 8 to 128 bits, decimals, strings
- Orthogonal, memory-to-memory, all operand modes supported
- Hundreds of special instructions
- Simple compiler, hand-coding was common
- CPI was over 10!

x86
- Variable-length ISA (1-16 bytes)
- FP operand stack
- 2-operand instructions (extended accumulator)
  - Register-register and register-memory support
- Scaled addressing modes
- Has been extended many times (as AMD did with x86-64)
- Intel initially went to IA64, now backtracked (EM64T)

RISC vs. CISC Arguments
- RISC
  - Simple implementation: load/store, fixed-format 32-bit instructions, efficient pipelines
  - Lower CPI
  - Compilers do a lot of the hard work
  - MIPS = Microprocessor without Interlocked Pipelined Stages
- CISC
  - Simple compilers (assists hand-coding, many addressing modes, many instructions)
  - Code density

MIPS/VAX Comparison
[Figure not preserved in this transcript]

After the dust settled
- Turns out it doesn't matter much
- Can decode CISC instructions into an internal micro-ISA
  - This takes a couple of extra cycles (PLA implementation) and a few hundred thousand transistors
  - In 20-stage pipelines and 55M-transistor processors this is minimal
  - Pentium 4 caches these micro-ops
- Actually may have some advantages
  - External ISA for compatibility, internal ISA can be tweaked each generation (Transmeta)