Chapter 7 Digital Design and Computer Architecture, 2nd Edition David Money Harris and Sarah L. Harris Chapter 7 <1>

Chapter 7 :: Topics Introduction (done) Performance Analysis (done) Single-Cycle Processor (done) Multicycle Processor (done) Pipelined Processor (done) Exceptions (done) Advanced Microarchitecture (now) Chapter 7 <2>

Advanced Microarchitecture Deep Pipelining Branch Prediction Superscalar Processors Out of Order Processors Register Renaming SIMD Multithreading Multiprocessors Chapter 7 <3>

Deep Pipelining 10-20 stages typical Number of stages limited by: Pipeline hazards Sequencing overhead Power Cost Chapter 7 <4>
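A rough way to see the limit (a sketch, not an equation from the slides): splitting a fixed amount of combinational logic across N stages divides the logic delay but not the sequencing overhead of the pipeline registers, so the cycle time is approximately

$$T_c \approx \frac{t_{\text{logic}}}{N} + t_{\text{overhead}}$$

where $t_{\text{logic}}$ is the total combinational delay of the unpipelined datapath and $t_{\text{overhead}}$ is the register clock-to-Q plus setup time. As N grows, the overhead term dominates and each extra stage buys less speed, which is one reason stage counts settle in the 10-20 range.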

Branch Prediction Ideal pipelined processor: CPI = 1 Branch misprediction increases CPI Static branch prediction: Check direction of branch (forward or backward) If backward, predict taken Else, predict not taken Dynamic branch prediction: Keep history of last (several hundred) branches in branch target buffer, record: Branch destination Whether branch was taken Chapter 7 <5>
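A back-of-the-envelope model of the cost (my sketch, not a formula from the slides): if a fraction $b$ of instructions are branches, a fraction $m$ of those are mispredicted, and each misprediction wastes $P$ cycles of flushed work, then

$$\text{CPI} \approx 1 + b \cdot m \cdot P$$

For example, with $b = 0.2$, $m = 0.1$, and $P = 4$, CPI rises from the ideal 1 to about 1.08; because $P$ grows with pipeline depth, deep pipelines depend on accurate dynamic prediction.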

Branch Prediction Example
add  $s1, $0, $0      # sum = 0
add  $s0, $0, $0      # i = 0
addi $t0, $0, 10      # $t0 = 10
for:
beq  $s0, $t0, done   # if i == 10, branch
add  $s1, $s1, $s0    # sum = sum + i
addi $s0, $s0, 1      # increment i
j    for
done:
Chapter 7 <6>
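In C, the loop above is simply (my reconstruction from the comments, not code from the slides):

#include <stdio.h>

int main(void) {
    int sum = 0;
    for (int i = 0; i != 10; i = i + 1) {  /* beq exits when i == 10 */
        sum = sum + i;                     /* sum ends up as 0+1+...+9 = 45 */
    }
    printf("%d\n", sum);
    return 0;
}

The interesting part for branch prediction is the beq: it falls through (not taken) on every iteration and is taken only once, when the loop exits.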

1-Bit Branch Predictor Remembers whether branch was taken the last time and does the same thing Mispredicts first and last branch of loop Chapter 7 <7>

2-Bit Branch Predictor
[State diagram: four states, strongly taken, weakly taken, weakly not taken, strongly not taken; predict taken in the two taken states and not taken in the other two; each taken outcome moves one state toward strongly taken, each not-taken outcome moves one state toward strongly not taken]
Only mispredicts last branch of loop
Chapter 7 <8>
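To make the two predictors concrete, here is a small simulation of the loop branch above (my sketch, not code from the book). The beq falls through ten times per execution of the loop and is taken once on exit; the initial predictor state assumes the loop has already run once, so the counts show the steady-state behaviour claimed on the slides: two mispredictions per execution for the 1-bit predictor, one for the 2-bit predictor.

#include <stdio.h>
#include <stdbool.h>

int main(void) {
    /* State left over from a previous execution of the loop:
       the 1-bit predictor remembers the final, taken branch;
       the 2-bit counter sits at weakly not taken (1 on a 0..3 scale). */
    bool last_outcome = true;   /* 1-bit predictor: predict last outcome          */
    int  counter      = 1;      /* 2-bit predictor: 0..3, predict taken when >= 2 */

    for (int pass = 1; pass <= 2; pass++) {
        int miss1 = 0, miss2 = 0;
        for (int i = 0; i <= 10; i++) {
            bool taken = (i == 10);              /* beq branches to done when i == 10 */

            if (last_outcome != taken) miss1++;  /* 1-bit prediction */
            last_outcome = taken;

            bool predict_taken = (counter >= 2); /* 2-bit prediction */
            if (predict_taken != taken) miss2++;
            if (taken  && counter < 3) counter++;
            if (!taken && counter > 0) counter--;
        }
        printf("loop execution %d: 1-bit mispredicts %d, 2-bit mispredicts %d\n",
               pass, miss1, miss2);
    }
    return 0;
}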

Superscalar
Multiple copies of datapath execute multiple instructions at once
Dependencies make it tricky to issue multiple instructions at once
[Datapath figure: a two-way superscalar datapath that fetches two instructions per cycle from instruction memory, uses a register file with four read ports and two write ports, and feeds two ALUs and a two-ported data memory]
Chapter 7 <9>

Superscalar Example
lw  $t0, 40($s0)
add $t1, $s1, $s2
sub $t2, $s1, $s3
and $t3, $s3, $s4
or  $t4, $s1, $s5
sw  $s5, 80($s0)
Ideal IPC: 2
Actual IPC: 2
[Pipeline diagram: the six independent instructions issue in pairs over three consecutive cycles]
Chapter 7 <10>

Superscalar with Dependencies
lw  $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
or  $t3, $s5, $s6
sw  $s7, 80($t3)
Ideal IPC: 2
Actual IPC: 6/5 = 1.2
[Pipeline diagram: lw issues alone in cycle 1; add must wait out the two-cycle load latency, so the pipeline stalls and add and sub issue together in cycle 3; and and or issue in cycle 4; sw needs or's result $t3 and issues in cycle 5]
Chapter 7 <11>
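Working out the number, assuming the five-cycle issue schedule above:

$$\text{IPC} = \frac{6\ \text{instructions}}{5\ \text{cycles}} = 1.2$$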

Out of Order Processor Looks ahead across multiple instructions Issues as many instructions as possible at once Issues instructions out of order (as long as no dependencies) Dependencies: RAW (read after write): one instruction writes, later instruction reads a register WAR (write after read): one instruction reads, later instruction writes a register WAW (write after write): one instruction writes, later instruction writes a register Chapter 7 <12>
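The three dependency types can be seen in ordinary assignments; this is a sketch of mine with C variables standing in for registers, not an example from the text:

#include <stdio.h>

int main(void) {
    int s1 = 7, s2 = 3, t0, t1;

    t0 = s1 + s2;   /* writes t0                                          */
    t1 = t0 * 2;    /* RAW: reads t0, so it must follow the write above   */
    t0 = s2 - 1;    /* WAR: overwrites the t0 the previous line reads;    */
                    /* WAW: also conflicts with the first write of t0.    */
                    /* An out-of-order processor cannot hoist this line   */
                    /* above the other two without renaming t0.           */
    printf("t0=%d t1=%d\n", t0, t1);
    return 0;
}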

Out of Order Processor Instruction level parallelism (ILP): number of instructions that can be issued simultaneously (average < 3) Scoreboard: table that keeps track of: Instructions waiting to issue Available functional units Dependencies Chapter 7 <13>
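As a rough data-structure sketch (mine, not from the text), a scoreboard might keep one row per waiting instruction, recording its destination register, source registers, and required functional unit, plus which registers have writes outstanding and which units are busy; the issue logic scans for rows that are safe to issue:

#include <stdbool.h>

#define NUM_REGS  32
#define NUM_UNITS  4    /* hypothetical mix, e.g. two ALUs, a load/store unit, a branch unit */

/* One scoreboard row per instruction waiting to issue. */
struct sb_entry {
    int  dest_reg;        /* register this instruction will write */
    int  src_reg[2];      /* registers it reads (-1 if unused)    */
    int  unit;            /* functional unit it needs             */
    bool issued;
};

struct scoreboard {
    bool reg_pending[NUM_REGS];  /* true while an issued instruction will still write the register */
    bool unit_busy[NUM_UNITS];
};

/* An instruction may issue when its sources are not pending (no RAW hazard),
   its destination is not pending (no WAW hazard), and its unit is free. */
bool can_issue(const struct scoreboard *sb, const struct sb_entry *e)
{
    for (int i = 0; i < 2; i++)
        if (e->src_reg[i] >= 0 && sb->reg_pending[e->src_reg[i]])
            return false;
    if (sb->reg_pending[e->dest_reg])
        return false;
    return !sb->unit_busy[e->unit];
}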

Out of Order Processor Example
lw  $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
or  $t3, $s5, $s6
sw  $s7, 80($t3)
Ideal IPC: 2
Actual IPC: 6/4 = 1.5
Dependencies: add has a RAW dependence on lw (two-cycle latency between the load and the use of $t0); sub has a WAR dependence on add (sub overwrites the $t0 that add reads); and has a RAW dependence on sub; sw has a RAW dependence on or ($t3)
[Issue diagram: lw and or issue in cycle 1; sw issues in cycle 2; add and sub issue in cycle 3 after the load latency; and issues in cycle 4]
Chapter 7 <14>

Register Renaming
lw  $t0, 40($s0)
add $t1, $t0, $s1
sub $r0, $s2, $s3      (sub's destination renamed from $t0 to $r0)
and $t2, $s4, $r0
or  $t3, $s5, $s6
sw  $s7, 80($t3)
Ideal IPC: 2
Actual IPC: 6/3 = 2
[Issue diagram: with the WAR hazard on $t0 removed by renaming, lw and sub issue in cycle 1; and and or issue in cycle 2; add (a two-cycle RAW dependence on lw) and sw (RAW on or's $t3) issue in cycle 3]
Chapter 7 <15>

SIMD
Single Instruction Multiple Data (SIMD)
Single instruction acts on multiple pieces of data at once
Common application: graphics
Perform short arithmetic operations (also called packed arithmetic)
For example, add four 8-bit elements: padd8 $s2, $s0, $s1
[Figure: $s0 holds four packed bytes a3, a2, a1, a0 in bit positions 31:24, 23:16, 15:8, 7:0; $s1 holds b3, b2, b1, b0; the result $s2 holds a3+b3, a2+b2, a1+b1, a0+b0]
Chapter 7 <16>
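A software emulation of the packed add (a sketch of mine; padd8 is the instruction named on the slide, not a standard MIPS opcode) adds the four byte lanes independently, with each lane wrapping around instead of carrying into its neighbour:

#include <stdio.h>
#include <stdint.h>

/* Add the four packed 8-bit lanes of a and b with no carry between lanes. */
uint32_t padd8(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t ai  = (a >> (8 * lane)) & 0xFF;
        uint8_t bi  = (b >> (8 * lane)) & 0xFF;
        uint8_t sum = (uint8_t)(ai + bi);       /* wraps within the lane */
        result |= (uint32_t)sum << (8 * lane);
    }
    return result;
}

int main(void) {
    uint32_t s0 = 0x40302010;   /* a3=0x40 a2=0x30 a1=0x20 a0=0x10 */
    uint32_t s1 = 0x01020304;   /* b3=0x01 b2=0x02 b1=0x03 b0=0x04 */
    printf("%08x\n", (unsigned)padd8(s0, s1));  /* prints 41322314 */
    return 0;
}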

Advanced Architecture Techniques Multithreading Word processor: threads for typing, spell checking, printing Multiprocessors Multiple processors (cores) on a single chip Chapter 7 <17>

Threading: Definitions Process: program running on a computer Multiple processes can run at once: e.g., surfing Web, playing music, writing a paper Thread: part of a program Each process has multiple threads: e.g., a word processor may have threads for typing, spell checking, printing Chapter 7 <18>

Threads in Conventional Processor One thread runs at once When one thread stalls (for example, waiting for memory): Architectural state of that thread stored Architectural state of waiting thread loaded into processor and it runs Called context switching Appears to user like all threads running simultaneously Chapter 7 <19>

Multithreading Multiple copies of architectural state Multiple threads active at once: When one thread stalls, another runs immediately If one thread can't keep all execution units busy, another thread can use them Does not increase instruction-level parallelism (ILP) of single thread, but increases throughput Intel calls this hyperthreading Chapter 7 <20>

Multiprocessors Multiple processors (cores) with a method of communication between them Types: Homogeneous: multiple cores with shared memory Heterogeneous: separate cores for different tasks (for example, DSP and CPU in cell phone) Clusters: each core has own memory system Chapter 7 <21>

Other Resources Hennessy & Patterson's Computer Architecture: A Quantitative Approach Conferences: www.cs.wisc.edu/~arch/www/ ISCA (International Symposium on Computer Architecture) HPCA (International Symposium on High Performance Computer Architecture) Chapter 7 <22>