ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design


ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring 2016 [ material from Harris & Patterson & Hennessy] 1

Advanced processor design 1. Floating point arithmetic 2. SIMD (data-level parallelism) 3. Superscalar (instruction-level parallelism) 4. Multi-threading HW (thread-level parallelism) 2

1. Floating point. IEEE 754 single precision packs a sign bit S, an 8-bit Exponent, and a 23-bit Fraction: S | Exponent | Fraction. S: sign bit (0 non-negative, 1 negative). The significand is normalized: 1.0 <= significand < 2.0. It always has a leading 1 bit before the binary point, so there is no need to represent it explicitly (hidden bit); the significand is the Fraction with the leading "1." restored. 3

Example 1. What is the floating point representation of -0.75? -0.75 = (-1)^1 × 1.1_2 × 2^-1, so S = 1, Fraction = 1000…00_2, and Exponent = -1 + Bias = -1 + 127 = 126 = 01111110_2. Answer: 1 01111110 10000000000000000000000. 2. What is the decimal value of the IEEE number 1 01111100 10100000000000000000000? The sign bit S is 1. The exponent field contains 01111100_2 = 124, so the exponent is 124 - 127 = -3. The fraction is 0.101_2 = 0.625, so the significand is 1.625. Value = -1.625 × 2^-3 = -0.203125. 4
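To make the bit-level layout concrete, here is a small C sketch (an illustration added here, not from the slides) that unpacks a single-precision pattern into its sign, exponent, and fraction fields and reconstructs the value; running it on the pattern from question 2 reproduces -0.203125.

#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Unpack a 32-bit IEEE 754 single-precision pattern (normalized numbers only). */
static double decode_ieee754(uint32_t bits) {
    uint32_t sign     = (bits >> 31) & 0x1;      /* S: 1 bit                */
    uint32_t exponent = (bits >> 23) & 0xFF;     /* biased exponent: 8 bits */
    uint32_t fraction = bits & 0x7FFFFF;         /* fraction: 23 bits       */

    double significand = 1.0 + fraction / (double)(1 << 23);  /* restore hidden 1. */
    int e = (int)exponent - 127;                               /* remove the bias   */
    return (sign ? -1.0 : 1.0) * significand * pow(2.0, e);
}

int main(void) {
    /* Question 2: 1 01111100 10100000000000000000000 */
    uint32_t pattern = (1u << 31) | (124u << 23) | (0x5u << 20);
    printf("%f\n", decode_ieee754(pattern));     /* prints -0.203125 */
    return 0;
}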

Example of floating point addition. Consider a 4-digit decimal example: 9.999 × 10^1 + 1.610 × 10^-1. 1. Align decimal points: shift the number with the smaller exponent, giving 9.999 × 10^1 + 0.016 × 10^1. 2. Add significands: 9.999 × 10^1 + 0.016 × 10^1 = 10.015 × 10^1. 3. Normalize the result and check for over/underflow: 1.0015 × 10^2. 4. Round and renormalize if necessary: 1.002 × 10^2. 5

Example of floating point addition. Now consider a 4-digit binary example: 1.000_2 × 2^-1 + (-1.110_2 × 2^-2), i.e., 0.5 + (-0.4375). 1. Align binary points: shift the number with the smaller exponent, giving 1.000_2 × 2^-1 + (-0.111_2 × 2^-1). 2. Add significands: 1.000_2 × 2^-1 + (-0.111_2 × 2^-1) = 0.001_2 × 2^-1. 3. Normalize the result and check for over/underflow: 1.000_2 × 2^-4, with no over/underflow. 4. Round and renormalize if necessary: 1.000_2 × 2^-4 (no change) = 0.0625. 6
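The four steps can also be sketched in C on a toy representation (an added illustration under simplifying assumptions: a 4-bit significand held in sign-magnitude form as an integer scaled by 2^3, plus an exponent; not the slides' hardware algorithm).

#include <stdio.h>

/* Toy binary FP adder sketching the four steps above (align, add, normalize, round).
 * 1.000_2 is stored as mag = 8; value = sign * (mag / 8) * 2^exp. */
typedef struct { int sign; int mag; int exp; } toyfp;

static toyfp toy_add(toyfp a, toyfp b) {
    /* Step 1: align binary points - shift the smaller-exponent operand right. */
    while (a.exp > b.exp) { b.mag >>= 1; b.exp++; }
    while (b.exp > a.exp) { a.mag >>= 1; a.exp++; }

    /* Step 2: add (or subtract) the significands. */
    int s = a.sign * a.mag + b.sign * b.mag;
    toyfp r = { s < 0 ? -1 : 1, s < 0 ? -s : s, a.exp };

    /* Step 3: normalize so 8 <= mag < 16, i.e. 1.0 <= significand < 2.0. */
    while (r.mag >= 16) { r.mag >>= 1; r.exp++; }
    while (r.mag != 0 && r.mag < 8) { r.mag <<= 1; r.exp--; }

    /* Step 4: rounding would go here; dropped bits are simply truncated. */
    return r;
}

int main(void) {
    toyfp a = { +1, 8, -1 };   /* 1.000_2 x 2^-1  =  0.5    */
    toyfp b = { -1, 14, -2 };  /* -1.110_2 x 2^-2 = -0.4375 */
    toyfp r = toy_add(a, b);   /* expect 1.000_2 x 2^-4 = 0.0625 */
    printf("sign=%+d mag=%d exp=%d\n", r.sign, r.mag, r.exp);
    return 0;
}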

Floating point addition HW. [figure: floating point adder datapath, one block per step: step 1 align, step 2 add, step 3 normalize, step 4 round] 7

Comparison of HW area requirements: 32 LEs for an integer 32-bit addition versus 840 LEs for a floating point single-precision addition. Floating point operations also have larger delay and take multiple cycles in the EX stage. 8

Floating point ISA interface (e.g., ARM): a Floating Point Register File and Floating Point Instructions, e.g., FADD<S/D>{cond} Fd, Fn, Fm. 9

Improving throughput with SIMD and Superscalar. The optimization goal for multiple-issue processors is a pipeline CPI of less than 1, despite data hazard stalls, control stalls, and memory (cache miss) stalls. The goal of superscalar is to maximize throughput (i.e., IPC × frequency). 10

2. SIMD architectures. Single Instruction Multiple Data (SIMD): a single instruction acts on multiple pieces of data at once (aka vector instructions). Common applications: scientific computing, graphics. Requires a vector register file and multiple execution units. Examples: ARM NEON and Intel SSE2 extensions. 11

EXAMPLE: ARM NEON SIMD extensions. The new additional register file can be viewed as sixteen 128-bit quadword registers (Q0-Q15) or thirty-two 64-bit doubleword registers (D0-D31). Data packing inside a 128-bit register: 16 integer bytes (16 × 8 bits), 8 integer halfwords (8 × 16), 8 half-precision FP numbers (8 × 16), 4 integer words (4 × 32), 4 single-precision FP numbers (4 × 32), or 2 integer doublewords (2 × 64). Examples: VADD.I16 Q0, Q1, Q2 adds 16-bit integer elements stored in 128-bit Q registers; VMULL.S16 Q0, D2, D3 multiplies 4 × 16-bit lanes into 4 × 32-bit products in a 128-bit vector. There are also structured load and store instructions that can load or store single or multiple values from or to single or multiple lanes in a vector register. 12
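From C, the same lane-wise operations are reachable through the standard arm_neon.h intrinsics; the short sketch below (added for illustration, assuming a NEON-enabled compiler) mirrors the VADD.I16 and VMULL.S16 forms above.

#include <stdint.h>
#include <arm_neon.h>

/* Add two arrays of 8 signed 16-bit integers with one 128-bit NEON add,
 * the intrinsic-level counterpart of VADD.I16 Q0, Q1, Q2. */
void add8_i16(const int16_t *a, const int16_t *b, int16_t *out) {
    int16x8_t va = vld1q_s16(a);       /* load 8 x 16-bit lanes into a Q register */
    int16x8_t vb = vld1q_s16(b);
    int16x8_t vc = vaddq_s16(va, vb);  /* lane-wise 16-bit integer add            */
    vst1q_s16(out, vc);                /* store the 8 results back to memory      */
}

/* Widening multiply: 4 x 16-bit lanes -> 4 x 32-bit products,
 * analogous to VMULL.S16 Q0, D2, D3. */
int32x4_t mul4_widen(int16x4_t d2, int16x4_t d3) {
    return vmull_s16(d2, d3);
}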

3. Superscalar architectures. VLIW (Very Long Instruction Word): the compiler groups instructions to be issued together, packages them statically into issue slots, and detects and avoids hazards. Superscalar: the CPU examines the instruction stream and chooses instructions to issue each cycle; the compiler can help by reordering instructions, but the CPU resolves hazards using advanced techniques at runtime. Issue can be static in-order or dynamic out-of-order (details not discussed here). 13

Difference between superscalar and VLIW [from Fisher et al.] 14

Superscalar ARM Pipeline (issue width = 2). [figure: dual-issue datapath with duplicated fetch, register-file read/write ports, ALUs, and data-memory ports] Multiple copies of the datapath execute multiple instructions at once. Dependencies make it tricky to issue multiple instructions at once. 15

Superscalar example. Ideal IPC: 2; actual IPC: 2. The following pairs issue together, one pair per cycle:
LDR R8, [R0, #40] and ADD R9, R1, R2
SUB R10, R1, R3 and AND R11, R3, R4
ORR R12, R1, R5 and STR R5, [R0, #80]
[pipeline diagram omitted] Notice that there are no dependencies; all are resolved by the compiler, as in VLIW. 16

Static superscalar with hazard detection. Ideal IPC: 2; actual IPC: 6/5 = 1.2. Instruction sequence:
LDR R8, [R0, #40]
ADD R9, R8, R1 and SUB R8, R2, R3 (stall one cycle waiting for the loaded R8)
AND R10, R4, R8 and ORR R11, R5, R6 (delayed one cycle behind the pair above)
STR R7, [R11, #80]
[pipeline diagram omitted] Six instructions issue over five cycles. 17

Scheduling example. Assume a dual-issue ARM.
MOV R0, #0
Loop: LDR R1, [R10, 0]    ; R1 = array element
      ADD R1, R1, R12     ; add scalar in R12
      STR R1, [R10, 0]    ; store result
      ADD R10, R10, #-4   ; decrement pointer
      CMP R10, R0
      BNE Loop
Schedule (ALU/branch slot | load/store slot | cycle):
Loop: nop               | LDR R1, [R10, 0] | 1
      nop               | nop              | 2
      ADD R1, R1, R12   | nop              | 3
      ADD R10, R10, #-4 | STR R1, [R10, 0] | 4
      CMP R10, R0       | nop              | 5
      BNE Loop          | nop              | 6
IPC = 6/6 = 1 (peak dual-issue IPC = 2; single-issue IPC = 6/7 = 0.85). Can the compiler do a better scheduling to improve ILP? 18
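For reference, a plausible C-level view of this loop (a reconstruction added here, assuming R10 walks the array downward to the end address held in R0 and R12 holds the scalar) is:

/* Illustrative C reconstruction of the assembly loop above. */
void add_scalar(int *p /* R10 */, int *end /* R0 */, int s /* R12 */) {
    do {
        *p += s;          /* LDR / ADD / STR            */
        p -= 1;           /* ADD R10, R10, #-4          */
    } while (p != end);   /* CMP R10, R0 ; BNE Loop     */
}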

Limits to ILP: data dependencies. Data dependencies determine the order in which results must be computed, the possibility of hazards, and the degree of freedom in scheduling instructions, and thus limit ILP. Data/name dependency hazards: Read After Write (RAW), Write After Read (WAR), Write After Write (WAW). 19

Register hazards from dynamic scheduling.
Read After Write (RAW) hazard (true dependency):
  ADD R2, R1, R0
  SUB R4, R2, R3
A true data dependency, because a value is transmitted between the instructions.
Write After Read (WAR) hazard (anti-dependency):
  ADD R4, R2, R0
  ...
  SUB R2, R1, R3
Introduced by OOO execution. Just a name dependency, no values being transmitted; the dependency can be removed by renaming registers (either by the compiler or by HW).
Write After Write (WAW) hazard (output dependency):
  ADD R2, R1, R0
  ...
  SUB R2, R2, R3
Introduced by OOO execution. Just a name dependency, no values being transmitted; the dependency can be removed by renaming registers (either by the compiler or by HW). 20

Re-scheduled example.
Loop: LDR R1, [R10, 0]    ; R1 = array element
      ADD R1, R1, R12     ; add scalar in R12
      STR R1, [R10, 0]    ; store result
      ADD R10, R10, #-4   ; decrement pointer
      CMP R10, R0
      BNE Loop
Schedule (ALU/branch slot | load/store slot | cycle):
Loop: nop               | LDR R1, [R10, 0] | 1
      ADD R10, R10, #-4 | nop              | 2
      ADD R1, R1, R12   | nop              | 3
      CMP R10, R0       | STR R1, [R10, 4] | 4
      BNE Loop          | nop              | 5
IPC = 6/5 = 1.2 (cf. peak IPC = 2). 21

Exposing ILP using loop unrolling. Replicate the loop body to expose more parallelism and reduce loop-control overhead. Use different registers per replication (called register renaming) to avoid loop-carried anti-dependencies: a store followed by a load of the same register, aka a name dependence (reuse of a register name). 22

Loop unrolling and renaming.
Loop unrolling:
Loop: LDR R1, [R10, 0]
      ADD R1, R1, R12
      STR R1, [R10, 0]
      LDR R1, [R10, -4]
      ADD R1, R1, R12
      STR R1, [R10, -4]
      LDR R1, [R10, -8]
      ADD R1, R1, R12
      STR R1, [R10, -8]
      LDR R1, [R10, -12]
      ADD R1, R1, R12
      STR R1, [R10, -12]
      ADD R10, R10, #-16
      CMP R10, R0
      BNE Loop
Loop unrolling + renaming:
Loop: LDR R1, [R10, 0]
      ADD R1, R1, R12
      STR R1, [R10, 0]
      LDR R2, [R10, -4]
      ADD R2, R2, R12
      STR R2, [R10, -4]
      LDR R3, [R10, -8]
      ADD R3, R3, R12
      STR R3, [R10, -8]
      LDR R4, [R10, -12]
      ADD R4, R4, R12
      STR R4, [R10, -12]
      ADD R10, R10, #-16
      CMP R10, R0
      BNE Loop
23
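The same transformation can be sketched at the C level (an added illustration, assuming the element count is a multiple of four); the temporaries t1..t4 play the role of the renamed registers R1..R4.

/* Unrolled-by-4 version of the loop, with distinct temporaries standing in
 * for the renamed registers (illustrative sketch). */
void add_scalar_unrolled(int *p, int *end, int s) {
    do {
        int t1 = p[0]  + s;                 /* LDR/ADD using R1 */
        int t2 = p[-1] + s;                 /* LDR/ADD using R2 */
        int t3 = p[-2] + s;                 /* LDR/ADD using R3 */
        int t4 = p[-3] + s;                 /* LDR/ADD using R4 */
        p[0] = t1; p[-1] = t2; p[-2] = t3; p[-3] = t4;
        p -= 4;                             /* ADD R10, R10, #-16 */
    } while (p != end);                     /* CMP / BNE          */
}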

Loop unrolling scheduling (ALU/branch slot | load/store slot | cycle):
Loop: ADD R10, R10, #-16 | LDR R1, [R10, 0]  | 1
      nop                | LDR R2, [R10, 12] | 2
      ADD R1, R1, R12    | LDR R3, [R10, 8]  | 3
      ADD R2, R2, R12    | LDR R4, [R10, 4]  | 4
      ADD R3, R3, R12    | STR R1, [R10, 16] | 5
      ADD R4, R4, R12    | STR R2, [R10, 12] | 6
      nop                | STR R3, [R10, 8]  | 7
      CMP R10, R0        | STR R4, [R10, 4]  | 8
      BNE Loop           | nop               | 9
IPC = 15/9 = 1.67, closer to 2, but at the cost of registers and code size. 24

Limitations of static superscalar and VLIW. VLIW: SW is tied to HW, and some dependencies cannot be identified during compilation. Static superscalar: branch mispredictions and cache misses have a high impact, and RAW dependencies lead to further stalls. 1. Can we rename and schedule instructions dynamically in HW? 2. Can a ready instruction bypass stalled instructions and execute out-of-order (OoO)? 25

Gist of dynamic superscalar (ARM Cortex-A57). Decoder: breaks more complex instructions into uops, detects dependencies, renames registers to avoid WAR and WAW hazards, and dispatches instructions (max 3) to the appropriate issue queues. Issue queue: instructions wait until ready (i.e., all operands are available or forwarded); once ready, instructions are issued out-of-order as soon as an EX unit is available. WB: results are forwarded to any instructions waiting for operands in the queues, and results are written to the register file in order. 26
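A heavily simplified software model of the issue-queue behaviour described above (an added sketch, not the actual A57 microarchitecture; the names and structure are invented for illustration) might look like:

#include <stdbool.h>

/* Very simplified issue-queue model: an entry waits until both source tags
 * have been broadcast by earlier write-backs, then may issue out of order
 * as soon as an execution unit is free. Illustrative sketch only. */
typedef struct {
    int  dest_tag;        /* renamed destination register */
    int  src_tag[2];      /* renamed source registers     */
    bool src_ready[2];    /* set when the value has been produced/forwarded */
    bool valid;
} iq_entry;

/* Wake-up: a completing instruction broadcasts its destination tag. */
void wakeup(iq_entry *iq, int n, int tag) {
    for (int i = 0; i < n; i++)
        for (int s = 0; s < 2; s++)
            if (iq[i].valid && iq[i].src_tag[s] == tag)
                iq[i].src_ready[s] = true;
}

/* Select: pick any ready entry (out of order) if an EX unit is available. */
int select_ready(const iq_entry *iq, int n, bool ex_unit_free) {
    if (!ex_unit_free) return -1;
    for (int i = 0; i < n; i++)
        if (iq[i].valid && iq[i].src_ready[0] && iq[i].src_ready[1])
            return i;     /* index of the instruction to issue */
    return -1;
}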

Summary of superscalar architectures. Pros: improved single-thread throughput. Cons: it impacts silicon area, power consumption, and design complexity (especially for dynamic OoO). SW compilation techniques can enable more ILP from the same HW, or simplify the HW (e.g., VLIW) at the expense of code portability. 27

4. Multi-threaded and multi-core processors. Moore's Law + superscalar limitations lead to multi-threaded and multi-core processors. These improve total throughput but not single-thread performance, and applications need to be re-coded to use the parallelism. Topics: A. Software multi-threading. B. HW block multi-threading. C. HW interleaved multi-threading. D. Multi-core processors. 28

A. Software multi-threading. Used since the 1960s to enable multiple users to share the same computing system and to hide the latency of I/O operations. Creates the illusion of parallelism for multiple processes/threads using context switching: 1. Trap to the OS and flush the pipeline. 2. Save the process state (register file(s), PC, page table base register, etc.) in the OS process control block, which is in memory. 3. Restore the process state of a different ready process. 4. Start execution, filling the pipeline. On an I/O operation (or shared-resource conflict or timer interrupt): 1. The process is preempted and removed from the ready list. 2. The I/O operation is started. 3. Another active process is picked from the ready list and run. 4. When the I/O completes, the preempted process goes back in the ready list. Very high switching overhead (ok, since the wait is very long). 29
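As an added illustration of the state involved, the sketch below shows a hypothetical process control block and the conceptual flow of steps 1-4; the field and function names are invented, and a real OS performs the save/restore in assembly.

/* Sketch of the per-process state the OS saves on a software context switch. */
typedef struct {
    unsigned long regs[16];        /* general-purpose register file */
    unsigned long pc;              /* program counter               */
    unsigned long status;          /* processor status word         */
    unsigned long page_table_base; /* address-space state           */
} process_control_block;

/* Hypothetical low-level routines, implemented in assembly by a real OS. */
extern void save_state(process_control_block *pcb);
extern void restore_state(const process_control_block *pcb);

/* Conceptual flow of steps 1-4 on the slide. */
void context_switch(process_control_block *cur, const process_control_block *next) {
    /* 1. trap to the OS; the pipeline is flushed as part of the trap */
    save_state(cur);      /* 2. save current process state to its PCB in memory */
    restore_state(next);  /* 3. restore the state of a different ready process  */
    /* 4. execution resumes in the new process, refilling the pipeline */
}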

Hardware multithreading. Vertical waste (e.g., cache misses) and horizontal waste (ILP limits) limit the potential of superscalar processors; the idea is to fill the wasted slots with instructions from other threads. HW runs multiple threads concurrently, so the OS thinks there is more than one processor, each processing its own thread. HW threads share the cache memory, which enables rapid, fine-grain communication between threads. [figure: vertical and horizontal waste of the issue bandwidth of a four-wide superscalar processor] 30

B. Block multithreading (coarse grain). The core has multiple contexts; a hardware context is the program counter, register file, and other data required to enable a software thread to execute on a core. Each running thread executes in turn until a long-latency event, at which point it is switched out by flushing the pipeline and loading the context of another thread. Similar to software multithreading but with HW support for fast context switches: on the order of less than 10 cycles versus thousands of cycles. Block multithreading can hide long cache misses; however, it typically cannot hide short inter-instruction latencies. 31

Required changes for block multi-threading: 1. Fetch: multiple PCs, with a selector based on the active thread; misses in the caches trigger a context switch. 2. Read: the register file must be replicated, which might impact RF latency. 3. EX: nothing. 4. MEM: the data memory needs to be non-blocking. 5. WB: needs to know the thread id. 32

Example of operation 33

C. Interleaved (fine-grain) multi-threading. Dispatch instructions from different threads/processes in each cycle; the core is able to switch between contexts (threads) every cycle with no switch delay. This takes advantage of small latencies such as instruction execution latencies. It does not address horizontal waste, but it can eliminate vertical waste. 34

Interleaved architecture example Two threads: penalty of a taken branch is one clock! Five threads: penalty of a taken branch is zero! 35

Interleaved multi-threading architecture. Same architecture as for block multithreading, except that: data forwarding must be thread aware (a context id is carried with forwarded values); stage flushing must be thread aware (on a cache miss or exception, IF, ID, EX and MEM cannot be flushed indiscriminately, and the same holds for taken branches and regular software exceptions); and the thread selection algorithm is different: a different thread is selected in each cycle (round-robin), and on a long-latency event the selector puts the thread aside and removes it from selection. 36
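A minimal sketch of the per-cycle thread selector (an added illustration with invented names, assuming a fixed number of hardware contexts) could be:

#include <stdbool.h>

/* Round-robin selection over the hardware contexts, skipping any thread that
 * is parked on a long-latency event (e.g., an outstanding cache miss). */
#define NUM_CONTEXTS 4

typedef struct {
    bool active;           /* has instructions ready to fetch                 */
    bool waiting_on_miss;  /* parked until its long-latency event completes   */
} hw_context;

/* Returns the context to fetch from this cycle, or -1 if none are runnable. */
int select_thread(hw_context ctx[NUM_CONTEXTS], int last) {
    for (int i = 1; i <= NUM_CONTEXTS; i++) {
        int t = (last + i) % NUM_CONTEXTS;   /* round-robin starting after 'last' */
        if (ctx[t].active && !ctx[t].waiting_on_miss)
            return t;
    }
    return -1;
}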

Impact of increasing the number of contexts. Fetch can be simplified: no need for branch prediction and control hazard circuitry. Forwarding can be eliminated, and there is no need for data hazard hardware. Issues: single-thread performance can suffer, the register file needs to grow proportionally, and threads interfere in the cache. 37

D. Multi-core processors. Multiple independent cores. Hardware provides a single physical address space for all processors, and shared memory is used for communication; the big problem is cache coherency. An interconnection network enables cores to share items in their caches. [figure: multiple cores connected by an interconnection network] 38

Example: Sun/Oracle UltraSPARC T1. Targets the server market for transactional applications with high TLP. The T1 (introduced in 2005) has 8 cores at 1.4 GHz and dissipates 72 W. Each core supports four thread contexts, for a total of 32 HW threads; the OS sees 32 virtual/logical cores. Each core is an in-order scalar core with a simple 6-stage pipeline: fetch, thread select, decode, execute, memory, writeback. The L2 cache is shared among all cores using a crossbar. Static branch prediction using a hint bit. 39

Summary. 1. SIMD exploits data-level parallelism; it needs compiler intervention to explicitly use vector instructions. 2. Floating point and SIMD require register file and ISA extensions. 3. Superscalar exploits instruction-level parallelism; no recompilation is needed, though compilation helps improve performance and mitigate pressure on the HW. 4. Multi-cores exploit thread-level parallelism; they need programmer intervention to re-write the code. 40