Microarchitecture Overview. Performance

Similar documents
Microarchitecture Overview. Performance

UCB CS61C : Machine Structures

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Advanced Processor Architecture

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

UCB CS61C : Machine Structures

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

Performance, Cost and Amdahl s s Law. Arquitectura de Computadoras

Intel released new technology call P6P

1.6 Computer Performance

ECE C61 Computer Architecture Lecture 2 performance. Prof. Alok N. Choudhary.

Execution-based Prediction Using Speculative Slices

TDT4255 Computer Design. Lecture 1. Magnus Jahre

Chapter 1. Computer Abstractions and Technology. Adapted by Paulo Lopes, IST

Lecture 2: Computer Performance. Assist.Prof.Dr. Gürhan Küçük Advanced Computer Architectures CSE 533

Overview of Today s Lecture: Cost & Price, Performance { 1+ Administrative Matters Finish Lecture1 Cost and Price Add/Drop - See me after class

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

Lecture 3: Evaluating Computer Architectures. How to design something:

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

Computer Architecture. Introduction. Lynn Choi Korea University

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

Chapter 1. and Technology

INSTRUCTION LEVEL PARALLELISM

Speculative Multithreaded Processors

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101

CPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate:

Multiple Issue ILP Processors. Summary of discussions

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Design of Experiments - Terminology

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor

EC 513 Computer Architecture

Superscalar Processors

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

Course web site: teaching/courses/car. Piazza discussion forum:

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Pentium IV-XEON. Computer architectures M

5DV118 Computer Organization and Architecture Umeå University Department of Computing Science Stephen J. Hegner. Topic 1: Introduction

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Intel Enterprise Processors Technology

Uniprocessors. HPC Fall 2012 Prof. Robert van Engelen

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Processor (IV) - advanced ILP. Hwansoo Han

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

XT Node Architecture

The Processor: Instruction-Level Parallelism

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

Computer Science 146. Computer Architecture

45-year CPU Evolution: 1 Law -2 Equations

EECS 322 Computer Architecture Superpipline and the Cache

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

The Pentium II/III Processor Compiler on a Chip

Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company

Multithreaded Processors. Department of Electrical Engineering Stanford University

Computer System. Performance

Evaluation of RISC-V RTL with FPGA-Accelerated Simulation

Impact of Cache Coherence Protocols on the Processing of Network Traffic

Lecture 4: Instruction Set Architectures. Review: latency vs. throughput

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

Superscalar Processor

The bottom line: Performance. Measuring and Discussing Computer System Performance. Our definition of Performance. How to measure Execution Time?

Lecture 1: Introduction

Page 1. Review: Dynamic Branch Prediction. Lecture 18: ILP and Dynamic Execution #3: Examples (Pentium III, Pentium 4, IBM AS/400)

Superscalar Machines. Characteristics of superscalar processors

Mestrado em Informática

How to write powerful parallel Applications

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Computer Architecture. Introduction

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Rechnerstrukturen

Chapter-5 Memory Hierarchy Design

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

5008: Computer Architecture

CS61C - Machine Structures. Week 6 - Performance. Oct 3, 2003 John Wawrzynek.

CS 152, Spring 2011 Section 8

CS425 Computer Systems Architecture

PIPELINING AND PROCESSOR PERFORMANCE

Jim Keller. Digital Equipment Corp. Hudson MA

Show Me the $... Performance And Caches

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University. P & H Chapter 4.10, 1.7, 1.8, 5.10, 6

Superscalar Organization

Instructor Information

Multicore and Parallel Processing

Superscalar Processors Ch 14

Transcription:

Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make a function faster 4 Execute more operations in parallel Pipelining Superscalar execution Multiple processors Scott Rixner Lecture 2 2 1

Pipelining 4 Basic microprocessor pipeline Instruction fetch (IF) Instruction decode (ID) Execute (EX) Memory access (MEM) Writeback (WB) Fetch Decode Execute Memory Write Scott Rixner Lecture 2 3 Superscalar Execution 4 Branch prediction 4 Instruction scheduling Compiler reordering In-order issue/completion Out-of-order issue/completion Register renaming Scott Rixner Lecture 2 4 2

Alpha 21264 Pipeline -IEEE Micro 3/99 Scott Rixner Lecture 2 5 Pentium Pro Pipeline 4 IFU1 Fetch 32 bytes from L1 $ 4 IFU2 Find instructions (branches to BTB) 4 IFU3 Align instructions 4 DEC1 3 decoders generate µops 4 DEC2 Move µops to dispatch queue 4 RAT Rename/allocate 40 registers 4 ROB Allocate reorder buffer (40) 4 DIS Dispatch µops to units 4 EX Execute 4 RET1 Mark ROB for retirement 4 RET2 Retire instructions Scott Rixner Lecture 2 6 3

Pentium 4 Pipeline 4 Stages 1-2 Trace cache next instruction pointer 4 Stages 3-4 Trace cache fetch 4 Stage 5 Drive (wire delay!) 4 Stages 6-8 Allocate and Rename 4 Stages 10-12 Schedule instructions Memory/fast ALU/slow ALU & general FP/simple FP 4 Stages 13-14 Dispatch 4 Stages 15-16 Register access 4 Stage 17 Execute 4 Stage 18 Set flags 4 Stage 19 Check branches 4 Stage 20 Drive (more wire delay!) Scott Rixner Lecture 2 7 UltraSPARC III Pipeline -IEEE Micro 5/99 Scott Rixner Lecture 2 8 4

Memory System Issues 4 Latency (10s to 100s of cycles) 4 Caching Locality Sources of misses (compulsory, capacity, conflict) 4 Load boosting Small basic blocks Critical word first 4 Prefetching 4 Load/store reordering Memory disambiguation Scott Rixner Lecture 2 9 ITRS Predictions Year 2001 2003 2006 2009 2010 2012 2013 Technology 250 180 150 130 100 1999 Transistors 40 76 200 520 1400 Clock Frequency 1400 1600 2000 2500 3000 Technology 150 107 70 45 32 2001 Transistors 276 439 878 2212 4424 Clock Frequency 1684 3088 5631 11511 19384 Technology 107 70 50 45 35 32 2003 Transistors 439 878 1756 2212 3511 4424 Clock Frequency 2976 6783 12369 15079 20065 22980 Technology 78 52 45 36 32 2005 Transistors 553 1106 2212 2212 4424 Clock Frequency 6783 12369 15079 20065 22980 -International Technology Roadmap for Semiconductors (1999, 2001, 2003, 2005) Scott Rixner Lecture 2 10 5

Transistor Counts - Intel Scott Rixner Lecture 2 11 2004: Intel Itanium 2 4 Issues up to 8 ops per cycle 4 0.13 micron process 4 592 million transistors 4 432 mm 2 die 4 128-bit bus 4 16KB data cache 4 16KB instruction cache 4 9MB L3 cache (256KB L2) 4 1.6 GHz Scott Rixner Lecture 2 12 6

2006: Intel Dual Core Itanium 2 4 2 Itanium 2 processors 4 Each core 2-way multithreading Issues up to 8 ops per cycle 16KB inst. & data L1 caches 1MB inst. & 256KB data L2 caches 12MB L3 cache 4 Virtualization technology 4 0.09 micron process 4 1.72 billion transistors Cores: 57M L1/L2 caches: 106.5M L3 caches: 1550M Bus logic and I/O: 6.7M 4 596 mm 2 die 4 128-bit bus 4 1.6 GHz Scott Rixner Lecture 2 13 Future Challenges 4 2000 Interconnect Design 4 2002 Design productivity Power management Multicore organization I/O bandwidth Circuit and process technology 4 2004 Drivers Systems on chip Mixed-signal chips Embedded memory Design Test Scott Rixner Lecture 2 14 7

Interconnect 4 Local Within a module Scales with technology 4 Global Connect major functional areas Does not scale well with technology On-chip interconnection networks 4 Architectural impact? Scott Rixner Lecture 2 15 Power Trends -Computer 1/04 4 Low power devices will consume too much power Scott Rixner Lecture 2 16 8

Power Gap -Computer 1/02 4 Low-power devices will be dominated by memory 4 Why? 4 What is the architectural impact of this? Scott Rixner Lecture 2 17 Design Productivity Gap 4 Communications-centric design 4 Robustness and Scalability 4 Cost optimization and cooptimization 4 What about the architect? -Computer 1/99 Scott Rixner Lecture 2 18 9

Test 4 Revolution in test equipment design? Cost of tester has overwhelmed other test costs Increased integration, open architecture Focus on test instrument, not infrastructure Focus on test capability Improve time to market 4 SoC testing Must test internal cores 4 High-speed signals New testing problems coming into the mainstream 4 More difficult to ensure reliability Scott Rixner Lecture 2 19 Performance Evaluation 4 Microbenchmarks Small snippets of code that directly measure performance of a particular feature 4 Kernels Functions that represents the important parts of applications 4 Applications Actual real world applications that a user may run 4 Whatever is available? Scott Rixner Lecture 2 20 10

Performance Comparison 4 X is n times faster than Y means: ExTime( Y ) ExTime( X ) = Perf ( X ) Perf ( Y ) = n 4 Same definition applies to other metrics, such as throughput Scott Rixner Lecture 2 21 Example Plane DC to Paris Speed Passengers Throughput (PMPH) Boeing 747 6.5 hours 610 mph 470 266,700 Concorde 3 hours 1350 mph 132 178,200 4 Performance metrics Time to run the task (travel time for each passenger) Throughput (person miles per hour = PMPH) 4 Comparisons Speed of Concorde > 747 Throughput of 747 > Concorde Scott Rixner Lecture 2 22 11

Amdahl s Law 4 Performance improvements depend on: How good is enhancement How often is it used 4 Speedup due to enhancement E (fraction p sped up by factor S): ExTime w/out E Perf w/ E Speedup(E) = = ExTime w/ E Perf w/out E ExTimenew = ExTimeold 1 ExTimeold Speedup( E) = = ExTime ( p) + Scott Rixner Lecture 2 23 new 1 p S ( 1 p) + p S SPEC CPU2000 Integer Benchmarks 4 gzip Compression 4 vpr FPGA circuit place and route 4 gcc C compiler 4 mcf Combinatorial optimization 4 crafty Chess player 4 parser Word processing 4 eon Computer visualization 4 perlbmk Perl programming language 4 gap Group theory, interpreter 4 vortex Object-oriented database 4 bzip2 Compression 4 twolf Place and route simulator Scott Rixner Lecture 2 24 12

SPEC CPU2006 Integer Benchmarks 4 perlbench Perl programming language 4 bzip2 Compression 4 gcc C compiler 4 mcf Combinatorial optimization 4 gobmk Go player 4 hmmer Gene sequence search 4 sjeng Chess player 4 libquantum Quantum computer simulator 4 h264ref Video compression 4 omnetpp Discrete event simulator 4 astar A* path finding 4 xalancbmk XML processing Scott Rixner Lecture 2 25 SPEC CPU Benchmarks 4 What are they measuring? 4 How are they selected? 4 Are they ideal performance measures/predictors? 4 Could we do better? Scott Rixner Lecture 2 26 13

SPEC2000 Results 4 SPEC2000 ratios are speedups over a 300MHz Sun Ultra 5 times 100 Processor SPEC2000 Int SPEC2000 FP Power (W) Alpha 21364 904 1,279 155 AMD Opteron 254 1,789 2,132 92 AMD Opteron 280 1,499 1,752 95 IBM Power4+ 1,077 1,598 100 IBM Power5 1,470 2,839 120 Intel Itanium 2 1,490 2,801 130 Intel XeonMP 1,388 1,314 140 Intel Xeon 1,810 1,909 130 Scott Rixner Lecture 2 27 Pentium Performance Scaling Pentium 12% 28% 55% 5000 4500 4000 SPECint2000 Rating 3500 3000 2500 2000 1500 1000 500 0 2000 2001 2002 2003 2004 2005 2006 2007 Year Scott Rixner Lecture 2 28 14

Next Classes 4 Thursday Front End 1 A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History Cooperative Prefetching: Compiler and Hardware Support for Effective Instruction Prefetching in Modern Processors 4 Tuesday Front End 2 A Scalable Front-End Architecture for Fast Instruction Delivery Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching Scott Rixner Lecture 2 29 15