EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)


EN164: Design of Computing Systems
Topic 08: Parallel Processor Design (introduction)
Professor Sherief Reda
http://scale.engin.brown.edu
Electrical Sciences and Computer Engineering, School of Engineering, Brown University
Spring 2014
[Material from Patterson & Hennessy, 4th ed., and Inside the Machine by J. Stokes]

Introduction to parallel processor design
Examples: Intel Core i7, AMD A10, IBM Power 8
1. SIMD (data-level parallelism)
2. Superscalar (instruction-level parallelism)
3. Multi-cores (thread-level parallelism)

1. SIMD architectures
Single Instruction, Multiple Data (SIMD): a single instruction acts on multiple pieces of data at once (aka vector instructions).
Common applications: scientific computing, graphics.
Requires a vector register file and multiple execution units.
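As an added illustration (not from the slides), the sketch below adds two float arrays four elements at a time using x86 SSE intrinsics, so one instruction operates on four data elements; the array length of 8 and the function name vec_add are arbitrary choices for the example.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps */

    /* Add two float arrays element-wise, 4 floats per SIMD instruction. */
    static void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);           /* load 4 floats         */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));  /* one add, 4 elements   */
        }
        for (; i < n; i++)                             /* scalar tail cleanup   */
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        float c[8];
        vec_add(a, b, c, 8);
        for (int i = 0; i < 8; i++)
            printf("%g ", c[i]);
        printf("\n");
        return 0;
    }

With optimization enabled, many compilers can also auto-vectorize the plain scalar loop; the intrinsics simply make the vector instructions explicit.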

2. Superscalar architectures
VLIW (Very Long Instruction Word): the compiler groups instructions to be issued together, packages them statically into issue slots, and detects and avoids hazards.
Superscalar: the CPU examines the instruction stream and chooses which instructions to issue each cycle; the compiler can help by reordering instructions, but the CPU resolves hazards at runtime using advanced techniques. Can be static (in-order) or dynamic (out-of-order).

Difference between superscalar and VLIW [figure from Fisher et al.]

MIPS with static dual issue
Two-issue packets:
- one ALU/branch instruction
- one load/store instruction
- 64-bit aligned: ALU/branch first, then load/store
- pad an unused slot with nop

Address   Instruction type   Pipeline stages
n         ALU/branch         IF ID EX MEM WB
n + 4     Load/store         IF ID EX MEM WB
n + 8     ALU/branch         IF ID EX MEM WB
n + 12    Load/store         IF ID EX MEM WB
n + 16    ALU/branch         IF ID EX MEM WB
n + 20    Load/store         IF ID EX MEM WB
(each two-instruction packet enters the pipeline one cycle after the previous one)

Pipeline design for multiple issue [figure]

Scheduling example for dual-issue MIPS
Schedule this code for dual-issue MIPS:

Loop: lw   $t0, 0($s1)       # $t0 = array element
      add  $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch if $s1 != 0

      ALU/branch              Load/store         Cycle
Loop: nop                     lw $t0, 0($s1)     1
      nop                     nop                2
      add $t0, $t0, $s2       nop                3
      addi $s1, $s1, -4       sw $t0, 0($s1)     4
      bne $s1, $zero, Loop    nop                5

IPC = 5/5 = 1 (c.f. peak dual-issue IPC = 2; a single-issue pipeline achieves IPC = 5/6 = 0.83 on this loop)

Limits to ILP: data dependencies
Data dependencies determine:
- the order in which results must be computed
- the possibility of hazards
- the degree of freedom in scheduling instructions => a limit to ILP
Data/name dependency hazards:
- Read After Write (RAW)
- Write After Read (WAR)
- Write After Write (WAW)
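For concreteness (an added illustration, not from the slides), the short C program below marks one instance of each dependency type on ordinary variables; the variable names are arbitrary:

    #include <stdio.h>

    int main(void)
    {
        int b = 2, c = 3, d;
        int a = b + c;   /* writes a                                      */
        d = a + 1;       /* RAW: reads a, must wait for the write above   */
        b = 7;           /* WAR: writes b after the earlier read of b     */
        a = d * 2;       /* WAW: second write to a; the two writes to a   */
                         /* must complete in program order                */
        printf("a=%d b=%d d=%d\n", a, b, d);
        return 0;
    }

The RAW dependency is a true data dependency; the WAR and WAW cases are name dependencies that renaming can remove, as the next two slides show.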

Name dependency (antidependence): WAR

lw  $s0, 0($t0)
add $t0, $s1, $s2      # WAR on $t0: written after the lw reads it

add $s4, $s2, $s0
...
sub $s2, $s1, $s3      # WAR on $s2

lw  $t2, 0($s2)
...
lw  $s2, 4($t0)        # WAR on $s2

Just a name dependency: no values are being transmitted.
The dependency can be removed by renaming registers (either by the compiler or by hardware).

Name dependency (output dependency): WAW

lw  $s0, 0($t0)
...
add $s0, $s1, $s2      # WAW on $s0

add $s2, $s1, $s0
...
sub $s2, $t2, $t3      # WAW on $s2

Just a name dependency: no values are being transmitted.
The dependency can be removed by renaming registers (either by the compiler or by hardware).

Re-scheduling example for dual-issue MIPS
Re-schedule this code for dual-issue MIPS:

Loop: lw   $t0, 0($s1)       # $t0 = array element
      add  $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch if $s1 != 0

      ALU/branch              Load/store         Cycle
Loop: nop                     lw $t0, 0($s1)     1
      addi $s1, $s1, -4       nop                2
      add $t0, $t0, $s2       nop                3
      bne $s1, $zero, Loop    sw $t0, 4($s1)     4

IPC = 5/4 = 1.25 (c.f. peak IPC = 2)
(the sw offset becomes 4 because the addi has already decremented $s1 by 4)

Exposing ILP using loop unrolling
- Replicate the loop body to expose more parallelism; this also reduces loop-control overhead.
- Use different registers per replication ("register renaming") to avoid loop-carried anti-dependencies: a store followed by a load of the same register, aka a name dependence (reuse of a register name).
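As an added sketch (not from the slides), here is the scalar-add loop from the earlier example written in C and unrolled four times, with a separate temporary per copy playing the role of the renamed registers $t0..$t3; the function name and the n % 4 == 0 assumption are choices made for the illustration:

    #include <stdio.h>

    /* Original (rolled) loop:
     *     for (int i = 0; i < n; i++) a[i] += s;
     * Unrolled by 4 below: the four independent temporaries t0..t3 remove the
     * name dependencies between iterations, so the four adds can be scheduled
     * in parallel. Assumes n % 4 == 0. */
    static void add_scalar_unrolled(int *a, int n, int s)
    {
        for (int i = 0; i < n; i += 4) {
            int t0 = a[i + 0] + s;
            int t1 = a[i + 1] + s;
            int t2 = a[i + 2] + s;
            int t3 = a[i + 3] + s;
            a[i + 0] = t0;
            a[i + 1] = t1;
            a[i + 2] = t2;
            a[i + 3] = t3;
        }
    }

    int main(void)
    {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        add_scalar_unrolled(a, 8, 10);   /* add the scalar 10 to each element */
        for (int i = 0; i < 8; i++)
            printf("%d ", a[i]);
        printf("\n");
        return 0;
    }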

Loop unrolling example

      ALU/branch              Load/store         Cycle
Loop: addi $s1, $s1, -16      lw $t0, 0($s1)     1
      nop                     lw $t1, 12($s1)    2
      add $t0, $t0, $s2       lw $t2, 8($s1)     3
      add $t1, $t1, $s2       lw $t3, 4($s1)     4
      add $t2, $t2, $s2       sw $t0, 16($s1)    5
      add $t3, $t3, $s2       sw $t1, 12($s1)    6
      nop                     sw $t2, 8($s1)     7
      bne $s1, $zero, Loop    sw $t3, 4($s1)     8

IPC = 14/8 = 1.75: closer to 2, but at the cost of registers and code size.

Overall superscalar organization: dynamic out-of-order (OoO) execution
- Multiple instructions are fetched and decoded in parallel.
- The decoder checks for dependencies and renames registers to avoid WAW and WAR hazards.
- Instructions wait in a dispatch buffer until their operands are available (avoids RAW hazards).
- When ready, instructions are dispatched to the execution units.
- The re-order buffer puts instructions back into program order for write-back.
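As a rough illustration of the renaming step in isolation (an added sketch, not from the slides), the program below assigns a fresh physical register to every architectural destination through a rename map, which removes WAW and WAR hazards while preserving RAW dependencies; the structure names and the tiny instruction format are invented for the example:

    #include <stdio.h>

    #define NUM_ARCH 8      /* architectural registers r0..r7            */

    /* A toy instruction: dst <- op(src1, src2). */
    struct insn { int dst, src1, src2; };

    int main(void)
    {
        /* rename map: architectural register -> current physical register */
        int map[NUM_ARCH];
        int next_phys = NUM_ARCH;          /* first free physical register */
        for (int r = 0; r < NUM_ARCH; r++)
            map[r] = r;                    /* identity mapping at start    */

        /* r1 <- r2+r3 ; r4 <- r1+r5 (RAW on r1) ; r1 <- r6+r7 (WAW on r1) */
        struct insn prog[] = { {1, 2, 3}, {4, 1, 5}, {1, 6, 7} };

        for (int i = 0; i < 3; i++) {
            struct insn in = prog[i];
            int p1 = map[in.src1];         /* sources read the current map */
            int p2 = map[in.src2];
            map[in.dst] = next_phys++;     /* destination gets a fresh reg */
            printf("r%d <- op(r%d, r%d)   renamed to   p%d <- op(p%d, p%d)\n",
                   in.dst, in.src1, in.src2, map[in.dst], p1, p2);
        }
        return 0;
    }

The two writes to r1 end up in different physical registers, so the WAW hazard disappears, while the read of r1 in the second instruction still sees the physical register written by the first (the RAW dependency is preserved).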

Summary of superscalar architectures
Pros: improved single-thread throughput: hides memory latency, avoids or reduces stalls, and fetches and executes multiple instructions per cycle.
Cons: increases silicon area, power consumption, and design complexity.
SW compilation techniques enable more ILP from the same HW, or can simplify the HW (e.g., VLIW) at the expense of code portability.
Conclusion: great single-thread performance, but at the expense of energy efficiency.

3. Multi-core processors
Moore's Law + ILP wall + power wall → multi-core processors.
- Improves total throughput but not single-thread performance.
- Applications need to be re-coded to exploit the parallelism.

Communication using shared memory
SMP: shared-memory multiprocessor.
- Hardware provides a single physical address space for all processors.
- Shared memory is used for communication.
- Memory access time: UMA (uniform) vs. NUMA (non-uniform).
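To make the shared-memory model concrete (an added sketch, not from the slides), the program below starts two POSIX threads that communicate through an ordinary global variable protected by a mutex; the variable and function names are arbitrary, and it is compiled with cc -pthread:

    #include <pthread.h>
    #include <stdio.h>

    /* Shared state: both threads see the same physical memory location. */
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);     /* serialize the read-modify-write */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* 200000: both threads' updates visible */
        return 0;
    }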

Cache coherence problem
Suppose two CPU cores share a physical address space (write-through caches):

Time step   Event                  CPU A's cache   CPU B's cache   Memory
0                                                                  0
1           CPU A reads X          0                               0
2           CPU B reads X          0               0               0
3           CPU A writes 1 to X    1               0               1

After step 3, CPU B's cache still holds the stale value 0 for X.

Defining cache coherence
Informally: reads return the most recently written value.
Formally:
- P writes X, then P reads X (with no intervening writes): the read returns the written value.
- P1 writes X, then P2 reads X (sufficiently later): the read returns the written value (c.f. CPU B reading X after step 3 in the example).
- P1 writes X and P2 writes X: all processors see the writes in the same order and end up with the same final value for X.

Maintaining cache coherence with a snooping protocol
A cache gets exclusive access to a block when the block is to be written:
- it broadcasts an invalidate message on the bus;
- a subsequent read in another cache misses, and the owning cache supplies the updated value.

CPU activity            Bus activity         CPU A's cache   CPU B's cache   Memory
                                                                             0
CPU A reads X           Cache miss for X     0                               0
CPU B reads X           Cache miss for X     0               0               0
CPU A writes 1 to X     Invalidate for X     1                               0
CPU B reads X           Cache miss for X     1               1               1
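The small simulation below (an added sketch, not part of the slides) replays the table above for a single block X under a simplified invalidate protocol with two caches; the state names and helper functions are invented for the illustration:

    #include <stdio.h>

    enum state { INVALID, SHARED, MODIFIED };
    struct cache { enum state st; int val; };

    static struct cache cpu[2];   /* CPU A = 0, CPU B = 1      */
    static int memory = 0;        /* backing memory for block X */

    /* Read X on CPU id: on a miss, fetch from the owning cache or from memory. */
    static int cpu_read(int id)
    {
        int other = 1 - id;
        if (cpu[id].st == INVALID) {                 /* cache miss for X          */
            if (cpu[other].st == MODIFIED) {         /* owning cache supplies data */
                memory = cpu[other].val;             /* and writes it back         */
                cpu[other].st = SHARED;
            }
            cpu[id].val = memory;
            cpu[id].st = SHARED;
        }
        return cpu[id].val;
    }

    /* Write X on CPU id: broadcast an invalidate, then update the local copy. */
    static void cpu_write(int id, int v)
    {
        cpu[1 - id].st = INVALID;                    /* invalidate for X on the bus    */
        cpu[id].val = v;
        cpu[id].st = MODIFIED;                       /* memory not updated (write-back) */
    }

    static void dump(const char *event)
    {
        printf("%-22s A=%d(%s) B=%d(%s) mem=%d\n", event,
               cpu[0].val, cpu[0].st == INVALID ? "I" : cpu[0].st == SHARED ? "S" : "M",
               cpu[1].val, cpu[1].st == INVALID ? "I" : cpu[1].st == SHARED ? "S" : "M",
               memory);
    }

    int main(void)
    {
        cpu_read(0);      dump("CPU A reads X");
        cpu_read(1);      dump("CPU B reads X");
        cpu_write(0, 1);  dump("CPU A writes 1 to X");
        cpu_read(1);      dump("CPU B reads X");
        return 0;
    }

Running it reproduces the rows of the table: after CPU A's write, B's copy is invalid and memory is stale; B's next read forces the owning cache to supply the value and write it back.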

Example: Intel Core i7
- Introduced in 2008; quad-core processor.
- Each core has a 4-wide superscalar out-of-order pipeline and supports integer/FP SIMD instructions.
- 64 KB L1 cache per core (32 KB I-cache and 32 KB D-cache), 256 KB L2 cache per core, shared 4-12 MB L3 cache.
- Clock frequency depends on the grading or binning of the part.

Summary
Examples: Intel Core i7, AMD A10, IBM Power 8
1. SIMD exploits data-level parallelism → needs compiler intervention to explicitly use vector instructions.
2. Superscalar exploits instruction-level parallelism → no recompilation needed, though compilation helps improve performance and reduce pressure on the HW.
3. Multi-cores exploit thread-level parallelism → need programmer intervention to re-write the code.