Lecture 26: Parallel Processing. Spring 2018 Jason Tang


Topics

- Static multiple issue pipelines
- Dynamic multiple issue pipelines
- Hardware multithreading

Taxonomy of Parallel Architectures

- Flynn categories: SISD, MISD, SIMD, and MIMD
- Levels of parallelism:
  - Bit level parallelism: 4-bit, 8-bit, 16-bit, 32-bit, now 64-bit processors
  - Instruction level parallelism (ILP): pipelining, VLIW, superscalar, out-of-order execution
  - Process/thread level parallelism (TLP): running different programs, or different parts of the same program, on one processor

Multiple Issue Pipelining

- In classic pipelining, the processor fetches one instruction per cycle
- In multiple issue pipelining, the processor fetches two (or more) instructions per cycle
- Results in a CPI of less than 1
- Two ways to implement multiple issue pipelines:
  - Static multiple issue: the compiler decides which instructions issue together, before execution
  - Dynamic multiple issue: the hardware decides which instructions issue together, during execution
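
As a rough back-of-the-envelope illustration (my own toy arithmetic, not taken from the slides), an ideal multiple issue pipeline that retires `issue_width` instructions every cycle has CPI = 1 / issue_width:

```python
# Toy CPI calculation for an ideal multiple-issue pipeline (illustrative
# assumptions: every cycle issues `issue_width` independent instructions
# with no stalls).
def ideal_cpi(instructions, issue_width):
    cycles = -(-instructions // issue_width)   # ceiling division
    return cycles / instructions

print(ideal_cpi(1000, 1))   # classic single-issue pipeline -> 1.0
print(ideal_cpi(1000, 2))   # dual-issue pipeline           -> 0.5
```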

Issue Slot

- Position within a processor from which an instruction can execute
- In simple processors, there is exactly one issue slot, which can perform any kind of instruction (integer arithmetic, floating point arithmetic, branching, etc.)
- Modern processors have multiple issue slots, but not all slots are equal
- Example: the Cortex-A53 has 2 issue slots:
  - Slot 0 can execute any instruction
  - Slot 1 can execute integer add, branch, load/store, or floating point

https://www.anandtech.com/show/8718/the-samsung-galaxy-note-4-exynos-review/3

Static Multiple Issue Pipeline

[Figure: two issue slots (Slot 0 and Slot 1), each advancing through Fetch, Decode, Execute, Memory, and Write Back, shown for three consecutive issue packets]

- Multiple instructions execute simultaneously, in different slots
- Issue packet: set of instructions executed together
- Compiler is responsible for generating the contents of each issue packet
- Compiler is responsible for preventing data and control hazards

Very Long Instruction Word (VLIW)

- Compiler generates a single long instruction that commands each issue slot
- If an instruction cannot be parallelized, or if a data/control hazard must be resolved via a bubble, the compiler places a no-op in that issue slot
- Example: Itanium was Intel's first attempt at a 64-bit processor
  - Not in any way related to x86-64 (more properly AMD64 or Intel 64)
  - EPIC: Explicitly Parallel Instruction Computing, the term Intel coined for its VLIW implementation
  - Each IA-64 issue packet is 128 bits, containing three instructions
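
To make the compiler's role concrete, here is a minimal sketch of static issue packing (my own toy model with made-up register names; it is not IA-64's actual bundling or scheduling rules): independent operations are packed into 3-wide packets, and each packet is padded with no-ops whenever a dependency forces a new packet.

```python
# Toy static-issue packer (illustrative only; real VLIW compilers also schedule
# around latencies and slot restrictions). Instructions are (dest, sources)
# pairs; an instruction may join the current packet only if it does not read a
# register written earlier in the same packet.
NOP = ("nop", [])

def pack(instructions, width=3):
    packets, current, written = [], [], set()
    for dest, srcs in instructions:
        if len(current) == width or any(s in written for s in srcs):
            packets.append(current + [NOP] * (width - len(current)))
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    if current:
        packets.append(current + [NOP] * (width - len(current)))
    return packets

# Example: the instruction writing r3 depends on r1, so it starts a new packet.
prog = [("r1", ["r5"]), ("r2", ["r6"]), ("r3", ["r1"]), ("r4", ["r7"])]
for p in pack(prog):
    print(p)
```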

Example IA-64 Flow

[Figure: IA-64 instruction flow, from https://www.pctechguide.com/cpu-architecture/ia-64-architecture]

Dynamic Multiple Issue Pipeline

- Superscalar processor fetches multiple instructions at once
- Dynamic pipeline scheduling: processor decides which instruction(s) to execute each cycle, to avoid hazards
- Out-of-order execution: processor may execute instructions in a different order than given by the compiler, to minimize stalls
- Requires duplicated and much more complex hardware

Dynamic Pipeline Scheduling

[Figure: block diagram of a dynamically scheduled pipeline: an instruction fetch and decode unit feeds four reservation stations, which feed two integer functional units, a floating point functional unit, and a load/store functional unit; results flow to a commit unit]

- Instructions are fetched in-order, and later committed ("retired") in-order
- Functional units execute instructions out-of-order

Out-of-Order Execution

- With standard pipelining, if an instruction stalls, all instructions after it are also stalled
- With dynamic scheduling, the Decode stage is split into:
  - Issue: decode the instruction, checking for structural hazards
  - Dispatch: wait until all data hazards have been resolved, then fetch operands; hold the instruction until a functional unit can execute it
- The commit unit is responsible for reordering the functional units' results to achieve in-order results
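
A minimal sketch of the commit unit's job (my own toy model, not any particular processor's design): functional units may finish out of order, but results only become architecturally visible once everything older has also finished.

```python
# Toy reorder buffer (illustrative assumptions): each instruction gets a slot in
# program order; results may be written into the buffer out of order, but commit
# only drains the head of the buffer, so architectural state updates stay in order.
class ReorderBuffer:
    def __init__(self, n_instructions):
        self.results = [None] * n_instructions   # filled out of order
        self.head = 0                            # next instruction to commit

    def complete(self, index, value):
        self.results[index] = value              # a functional unit finished

    def commit(self):
        committed = []
        while self.head < len(self.results) and self.results[self.head] is not None:
            committed.append((self.head, self.results[self.head]))
            self.head += 1
        return committed

rob = ReorderBuffer(3)
rob.complete(2, "add result")    # youngest instruction finishes first...
print(rob.commit())              # ...but nothing commits yet -> []
rob.complete(0, "div result")
print(rob.commit())              # head can now retire -> [(0, 'div result')]
```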

Dynamic Scheduling Example

- For a modern Intel Skylake processor, 64-bit integer division can take up to 90 clock cycles, while integer addition takes just one cycle
- With a single pipelined processor, a division followed by 8 additions would take 98 cycles
- With a single superscalar processor with 4 ALUs and a single FPU, and where there are no data dependencies between instructions, that same sequence would take 92 cycles
- If that superscalar processor also executes out of order, that same sequence would take 90 cycles

http://www.agner.org/optimize/instruction_tables.pdf
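
The cycle counts above can be reproduced with a simple back-of-the-envelope model (simplified assumptions: the latencies quoted on the slide, ideal issue, and the adds starting only after the divide finishes in the in-order cases):

```python
# Back-of-the-envelope model for the slide's example (simplified assumptions:
# 90-cycle divide, 1-cycle adds, ideal issue, no other stalls).
DIV, ADD, N_ADDS, ALUS = 90, 1, 8, 4

in_order_scalar = DIV + N_ADDS * ADD                    # adds wait for the divide: 98
in_order_superscalar = DIV + -(-N_ADDS // ALUS) * ADD   # adds still follow the divide,
                                                        # but issue 4 at a time: 92
out_of_order = max(DIV, -(-N_ADDS // ALUS) * ADD)       # adds overlap the divide: 90

print(in_order_scalar, in_order_superscalar, out_of_order)  # 98 92 90
```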

Cortex-M7 Pipeline

[Figure: Cortex-M7 pipeline diagram, from https://www.anandtech.com/show/8542/cortexm7-launches-embedded-iot-and-wearables/2]

Threading

- A software multithreaded system has a single program, with multiple program counters and stacks
  - Operating system decides which thread to run on which processor, based upon the process scheduler
  - A processor normally executes only one thread at a time
- A hardware multithreaded system has a processor that can truly execute multiple threads simultaneously, via dynamic scheduling

(Operating System Concepts, Silberschatz, Galvin, and Gagne)

Hardware Multithreading Subtypes

- Processor fetches from 2 (or more) different program counters, where each program counter corresponds to a thread
- Superthreading: processor executes instruction(s) from one thread on one clock cycle, then instruction(s) from the second thread on the next clock cycle
  - Number of instructions to execute is based upon available issue slots
  - Instead of the operating system choosing the time slice, the processor has a time slice of 1 clock cycle
- Simultaneous Multithreading (SMT): processor fills issue slots with a mix of instructions from all threads, to keep all functional units busy

https://arstechnica.com/features/2002/10/hyperthreading/

Hardware Multithreading

Example: processor with 4 issue slots, and 2 threads to execute

[Figure: issue-slot occupancy over time for Thread A and Thread B, shown for a plain superscalar processor, fine-grained superthreading, and simultaneous multithreading]

- Threads normally execute on a superscalar processor as per the top-right diagram
- With fine-grained superthreading (switching threads every clock cycle) and with an SMT processor, execution would instead look like the lower diagrams
- With coarse-grained superthreading, the processor switches threads only on expensive stalls, instead of every cycle
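
The two subtypes can also be sketched as a tiny slot-filling model (my own illustration with made-up instruction streams, not the slide's exact figure): superthreading gives all of a cycle's issue slots to one thread in round-robin order, while SMT fills slots with ready instructions from every thread.

```python
# Toy slot-filling model (illustrative assumptions: 4 issue slots, each thread is
# a queue of ready instructions, no dependencies or stalls modeled).
from collections import deque

def superthreading(threads, slots=4, cycles=4):
    schedule = []
    for cycle in range(cycles):
        t = cycle % len(threads)                      # one thread owns the whole cycle
        issued = [threads[t].popleft() for _ in range(slots) if threads[t]]
        schedule.append(issued)
    return schedule

def smt(threads, slots=4, cycles=4):
    schedule = []
    for _ in range(cycles):
        issued = []
        while len(issued) < slots and any(threads):   # fill slots from any ready thread
            for q in threads:
                if q and len(issued) < slots:
                    issued.append(q.popleft())
        schedule.append(issued)
    return schedule

a = deque(f"A{i}" for i in range(6))
b = deque(f"B{i}" for i in range(6))
print(superthreading([deque(a), deque(b)]))   # whole cycles alternate between threads
print(smt([deque(a), deque(b)]))              # each cycle mixes both threads
```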

Comparison of Hardware Multithreading

- Number of different threads in a pipeline stage in a given cycle: superthreading has exactly 1; simultaneous multithreading has 1 or more
- Scheduling algorithm: superthreading uses round-robin, or the processor may have priority scheduling; SMT schedules threads simultaneously
- Processor usage: superthreading is the same as a superscalar; SMT uses every available issue slot
- Stall penalty: superthreading partially hides it, because other threads execute during the stall; SMT completely hides it, because other threads fill the issue slots
- Power consumption: superthreading is the same as a superscalar; SMT uses more power