ECE 587 Advanced Computer Architecture I

Similar documents
ECE 588/688 Advanced Computer Architecture II

ECE 588/688 Advanced Computer Architecture II

Advanced Computer Architecture

Portland State University ECE 587/687. Superscalar Issue Logic

Spring 2014 Midterm Exam Review

Superscalar Organization

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.

Course web site: teaching/courses/car. Piazza discussion forum:

ECE 341. Lecture # 15

TECH. CH14 Instruction Level Parallelism and Superscalar Processors. What is Superscalar? Why Superscalar? General Superscalar Organization

Instruction Pipelining

COSC 6385 Computer Architecture - Pipelining

Computer Systems Architecture Spring 2016

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

ECE 571 Advanced Microprocessor-Based Design Lecture 4

Processor Architecture

The Processor: Instruction-Level Parallelism

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Instruction Pipelining

Instruction-Level Parallelism and Its Exploitation

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer

Instr. execution impl. view

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors

ECE 505 Computer Architecture

COMPUTER ORGANIZATION AND DESI

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

Chapter 06: Instruction Pipelining and Parallel Processing

Advanced Instruction-Level Parallelism

Lecture 15: Pipelining. Spring 2018 Jason Tang

Processor (IV) - advanced ILP. Hwansoo Han

2. [3 marks] Show your work in the computation for the following questions involving CPI and performance.

Computer Architecture

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

Portland State University ECE 587/687. Memory Ordering

Portland State University ECE 587/687. Memory Ordering

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

Organisasi Sistem Komputer

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

SUPERSCALAR AND VLIW PROCESSORS

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

ECE154A Introduction to Computer Architecture. Homework 4 solution

Pipelining. CSC Friday, November 6, 2015

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

The Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes.

CISC 662 Graduate Computer Architecture Lecture 6 - Hazards

When and Where? Course Information. Expected Background ECE 486/586. Computer Architecture. Lecture # 1. Spring Portland State University

Modern Computer Architecture

15-740/ Computer Architecture Lecture 7: Pipelining. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/26/2011

PIPELINING: HAZARDS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Lecture 7: Pipelining Contd. More pipelining complications: Interrupts and Exceptions

Dynamic Scheduling. CSE471 Susan Eggers 1

Multi-cycle Instructions in the Pipeline (Floating Point)

omputer Design Concept adao Nakamura

Processors. Young W. Lim. May 12, 2016

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Portland State University ECE 587/687. Caches and Prefetching

Portland State University ECE 588/688. Cray-1 and Cray T3E

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

CPE 631 Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

EXAM #1. CS 2410 Graduate Computer Architecture. Spring 2016, MW 11:00 AM 12:15 PM

CMSC 411 Practice Exam 1 w/answers. 1. CPU performance Suppose we have the following instruction mix and clock cycles per instruction.

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining, Branch Prediction, Trends

Photo David Wright STEVEN R. BAGLEY PIPELINES AND ILP

What is Pipelining? RISC remainder (our assumptions)

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

Chapter 3. Pipelining. EE511 In-Cheol Park, KAIST

Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Tutorial 11. Final Exam Review

Slide Set 8. for ENCM 501 in Winter Steve Norman, PhD, PEng

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Slide Set 7. for ENCM 501 in Winter Term, Steve Norman, PhD, PEng

CS 61C: Great Ideas in Computer Architecture. Lecture 13: Pipelining. Krste Asanović & Randy Katz

Control Hazards. Prediction

Lecture 1: Introduction

Lecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S

Architectures for Instruction-Level Parallelism

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 8 Instruction-Level Parallelism Part 1

Pipelining. Maurizio Palesi

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

Moore s Law. CS 6534: Tech Trends / Intro. Good Ol Days: Frequency Scaling. The Power Wall. Charles Reiss. 24 August 2016

Static, multiple-issue (superscaler) pipelines

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Transcription:

ECE 587 Advanced Computer Architecture I Instructor: Alaa Alameldeen alaa@ece.pdx.edu Spring 2015 Portland State University Copyright by Alaa Alameldeen and Haitham Akkary 2015 1

When and Where? When: Monday and Wednesday 6:40-8:30 PM Where: URBN 304 Office hours: Monday and Wednesday, after class Webpage: http://www.cecs.pdx.edu/~alaa/ece587/ Go to webpage for: Class Slides Papers Simulator information Homework and project assignments Portland State University ECE 587 Spring 2015 2

Course Description State of the art superscalar microprocessor design Emphasis on quantitative analysis Homeworks and project involve working with a high level performance simulator Emphasis on papers readings NOT on a textbook Tutorial papers Original sources and ideas papers Papers covering most recent trends Portland State University ECE 587 Spring 2015 3

Expected Background ECE 486/586 or equivalent Basic microprocessor organization Instruction sets: RISC, CISC Datapath design Pipelining Caches Basic branch prediction Programming experience in C Needed for homeworks and projects Portland State University ECE 587 Spring 2015 4

Grading Policy Class Participation 5% Paper Reviews 10% Homeworks 25% Project 30% Final Exam 30% Grading Scale: A: 92-100% A-: 86-91.5% B+: 80-85.5% B: 76-79.5% B-: 72-75.5% C+: 68-71.5% C: 64-67.5% C-: 60-63.5% D+: 57-59.5% D: 54-56.5% D-: 50-53.5% F: Below 50% Portland State University ECE 587 Spring 2015 5

Why Study Computer Architecture Technology advancements require continuous optimization of cost, performance, and power Moore s law Original version: Transistor scaling exponential Popular version: Processor performance exponentially increasing Innovation needed to satisfy market trends User and software requirements keep on changing Software developers expecting improvements in computing power Portland State University ECE 587 Spring 2015 6

Two important metrics Latency Response time Performance For different hardware structures (e.g, cache access, store buffer lookup) For different instructions/operations Execution time from start to finish Throughput or bandwidth Rate of task completion Rate of data transfer Portland State University ECE 587 Spring 2015 7

Instruction Cycle Five stages (cycles) for instruction processing: Instruction fetch (IF) Instruction decode, read operands (ID) Execute (EX) Memory read/write (MEM) Write back results (WB) IF ID EX MEM WB Most modern processors have many more stages Portland State University ECE 587 Spring 2015 8

Simplified Instruction Cycle For the remainder of this lecture, let s simplify the instruction processing to three stages (cycles): Instruction Fetch (f) Instruction Decode and Read Operands (d) Execute and write results (e) f d e Portland State University ECE 587 Spring 2015 9

Execution Time Execution time (Runtime) for a program is given by: Instructions per program x Cycles per instruction x Time per cycle (Cycle time) Runtime = I x CPI x t c Portland State University ECE 587 Spring 2015 10

Execution Time For a scalar processor (with a 3-cycle instruction processing), CPI = 3 Runtime = I x 3 x t c Portland State University ECE 587 Spring 2015 11

Improving Performance via Basic Pipelining F D E F D E F D E F D E Runtime = I x 1 x t c Portland State University ECE 587 Spring 2015 12

Superscalar Processors Superscalar processors: Multiple pipelines operate in parallel F D E F D E F D E F D E F D E F D E Runtime = I x 0.5 x t c Superscaler techniques have been applied to both CISC and RISC processors Portland State University ECE 587 Spring 2015 13

Superscalar Processors It is not guaranteed that a wide superscalar executes at maximum throughput for any given sequence of instructions Instructions are not independent Can t always find more than one instruction to issue per cycle Branches Don t know what instruction to fetch next The processor execution resources are limited Fetch and execution mechanisms Cache misses Portland State University ECE 587 Spring 2015 14

True Data Dependencies Also called data hazards, read-after-write (RAW) hazards An instruction may use a result produced by the previous instruction Both instructions may not execute simultaneously in multiple pipelines The second instruction must typically be stalled F D E F D S E Portland State University ECE 587 Spring 2015 15

Procedural Dependencies Also called control or branch hazards Instruction fetch implicitly depends on knowing the correct value for the program counter (PC) This is (in a sense) a true dependence on the PC Branches may change the program counter late in their execution, leading to pipeline stalls F D E F D E S S F D E S S F D E Portland State University ECE 587 Spring 2015 16

Procedural Dependencies (Cont.) CISC variable length instructions introduce another procedural dependency: Portions of an instruction must be decoded before the instruction length is known Portland State University ECE 587 Spring 2015 17

Resource Conflicts Also called structural hazards If two instructions try to use the same hardware resource simultaneously, then one must wait Solution 1: Duplicate hardware resources Can be very expensive Solution 2: Pipeline long latency execution units F D E E E F D S S S E E E F D E1 E2 E3 F D S E1 E2 E3 Portland State University ECE 587 Spring 2015 18

Instruction Issue Methods Instruction Issue is the process of initiating instruction execution in functional units Instruction Issue Policy is the mechanism the processor uses to issue instructions (and to find and examine instructions) Portland State University ECE 587 Spring 2015 19

IO Issue and IO Execution In-order (IO) issue and in-order (IO) execution requires instructions to be issued, executed and to complete in the same order they appear in the program Simple strategy to implement BUT More hazards hinder performance Portland State University ECE 587 Spring 2015 20

IO Issue and IO Execution Example: I1 requires 2 cycles to execute, I3 and I4 use same functional unit, same for I5 and I6, I5 has true dependence on I4 Decode Execute Writeback 1 I1 I2 2 I3 I4 I1 I2 3 I3 I4 I1 4 I4 I3 I1 I2 5 I5 I6 I4 I3 6 I6 I5 I4 7 I6 I5 8 I6 Portland State University ECE 587 Spring 2015 21

IO Issue and OO Execution In-order (IO) issue and out-of-order (OO) execution allows instructions to complete in a different order This prevents long operations from overly reducing performance, even for scalar processors (e.g., unrelated instructions can execute while a load from the L2 cache or a floating point divide is in progress) Portland State University ECE 587 Spring 2015 22

IO Issue and OO Execution Same example: I1 requires 2 cycles to execute, I3 and I4 use same functional unit, same for I5 and I6, I5 has true dependence on I4 Decode Execute Writeback 1 I1 I2 2 I3 I4 I1 I2 3 I4 I1 I3 I2 4 I5 I6 I4 I1 I3 5 I6 I5 I4 6 I6 I5 7 I6 Portland State University ECE 587 Spring 2015 23

IO Issue and OO Execution If out-of-order completion is allowed, it is also possible to have an output dependence Two outstanding instructions write to the same location They must complete in the correct order to make sure the correct result is stored This is also called a write-after-write (WAW) hazard DIV R3,R4,R5 ADD R3,R4,R1 ADD R5,R3,R3 ; Which R3? This can be overcome with register renaming Portland State University ECE 587 Spring 2015 24

OO Issue and OO Execution Out-of-order (OO) issue and out-of-order (OO) execution further improves performance by not stalling the processor in the presence of resource conflicts or true and output dependences Instructions that would cause a problem are left in an instruction window to be issued when the problem has cleared The processor thus can look ahead to the size of the window to find instructions to issue Portland State University ECE 587 Spring 2015 25

OO Issue and OO Execution Same example: I1 requires 2 cycles to execute, I3 and I4 use same functional unit, same for I5 and I6, I5 has true dependence on I4 Decode Window Execute Writeback 1 I1 I2 2 I3 I4 I1,I2 I1 I2 3 I5 I6 I3,I4 I1 I3 I2 4 I4,I5,I6 I6 I4 I1 I3 5 I5 I5 I4 I6 6 I5 Portland State University ECE 587 Spring 2015 26

OO Issue and OO Execution This method can cause antidependences An instruction that needs to read a result may have that result overwritten by a following instruction that was issued first We must make sure that the value is not overwritten until it has been read by all users This is also called a write-after-read (WAR) hazard. DIV STORE R3,R4,R5 A,R3 ADD R3,R4,1 ;Can t until after store ADD R5,R3,R3 Portland State University ECE 587 Spring 2015 27

Register Renaming Given enough on-chip storage, the hardware can automatically rename registers (as specified in the program code) to ensure that each refers to a unique location Register renaming can remove storage conflicts DIV STORE ADD ADD R3a,R4a,R5a A,R3a R3b,R4a,1 R5b,R3b,R3b Portland State University ECE 587 Spring 2015 28

RISC I Design Approach New architectures should be designed for HLL Does not matter which part of the system is in hardware and which is in software Architecture tradeoffs to build a cost-effective system: Which language constructs are used frequently? What is the distribution of various instructions? Dedicate available area for the most frequent constructs and operations Remember Amdahl s law Portland State University ECE 587 Spring 2015 29

Amdahl s Law Speedup 1 P 1 P / S P = proportion of computation improved S = improvement speedup Example: Parallel Execution P: Parallel portion, S: Serial portion = 1-P N: Number of Cores Speedup S 1 P / N 1 P 1 P / N Portland State University ECE 587 Spring 2015 30

Reading Assignments J. Smith and G. Sohi, The Microarchitecture of Superscalar Processors, Proc. IEEE, 1995 (Review) Read before Wed class Submit review before the beginning of class: Paper summary Strong points (2-4 points) Weak points (2-4 points) Review due in TA s email inbox before 6:40PM Wed Plain text, no attachments Subject line has to be: ECE587_REVIEW_04_01 HW1 out, due next Monday Apr 6. Portland State University ECE 587 Spring 2015 31