1. Imagine we have a non-pipelined processor running at 1 MHz and want to run a program with 1000 instructions.

a) How much time would it take to execute the program?
1 instruction per cycle, 1 MHz clock.
1000 instructions = 1000 cycles @ 10^6 Hz = 1000/10^6 s = 1000 us

Assuming ideal conditions (perfect pipelining and no hazards), how much time would it take to execute the same program in:

b) A 5-stage pipeline?
1 instruction per cycle, 5 MHz clock, +4 extra cycles to fill the pipeline.
(1000 + 4 cycles)/(5*10^6 Hz) = 200.8 us -> ~5x faster

c) A 20-stage pipeline?
1 instruction per cycle, 20 MHz clock, +19 extra cycles to fill the pipeline.
(1000 + 19 cycles)/(20*10^6 Hz) = 50.95 us -> ~20x faster

d) A 100-stage pipeline?
1 instruction per cycle, 100 MHz clock, +99 extra cycles to fill the pipeline.
(1000 + 99 cycles)/(100*10^6 Hz) = 10.99 us -> ~91x faster (approaching the ideal 100x)

Looking at those results, it seems clear that increasing pipeline depth should increase the execution speed of a processor. Why do you think that processor designers (see below) have not only stopped increasing pipeline length but, in fact, reduced it?

Pentium III Coppermine (1999): 10-stage pipeline
Pentium 4 NetBurst (2000): 20-stage pipeline
Pentium 4 Prescott (2004): 31-stage pipeline
Core i7 9xx Bloomfield (2008): 24-stage pipeline
Core i7 2xxx/3xxx Sandy Bridge (2013): 19-stage pipeline

Incidence of data hazards: the more stages in the pipeline, the further away a previous instruction can be while still creating a dependency.
Impact of both data and control hazards: the more stages in the pipeline, the more cycles we need to stall the pipeline when a hazard occurs.
Hardware cost: the more stages in the pipeline, the more inter-stage buffering and wiring is needed. Implementing forwarding is especially a concern, as the number of forwarding paths needed increases nearly quadratically with the number of stages.
Clock limitation: the clock cannot be made infinitely fast.
Power dissipation increases with clock frequency, rapidly increasing the consumed energy and the temperature of the chip.
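The timing model used above can be checked with a few lines of code. This is a helper of my own (not part of the original exercise): an N-instruction program on a k-stage pipeline needs N + (k - 1) cycles (k - 1 cycles to fill the pipeline), and the clock is assumed to scale linearly with depth.

```python
def pipelined_exec_time_us(instructions, stages, base_clock_hz=1e6):
    """Execution time in microseconds, assuming the k-stage pipeline
    can be clocked k times faster than the non-pipelined processor."""
    cycles = instructions + (stages - 1)   # k - 1 fill cycles
    clock_hz = base_clock_hz * stages      # idealized clock scaling
    return cycles / clock_hz * 1e6

print(pipelined_exec_time_us(1000, 1))    # 1000.0 us (non-pipelined)
print(pipelined_exec_time_us(1000, 5))    # 200.8 us
print(pipelined_exec_time_us(1000, 20))   # 50.95 us
print(pipelined_exec_time_us(1000, 100))  # 10.99 us
```

Note how the speedup saturates: the fill cycles (and, in reality, hazards) eat into the ideal k-times gain as k grows.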

2. Consider a simple program with two nested loops as the following:

while (true) {
    for (i=0; i<x; i++) {
        do_stuff
    }
}

With the following assumptions:
+ do_stuff has 20 instructions that can be executed ideally in the pipeline.
+ The overhead for control hazards is 3 cycles, regardless of the branch being unconditional or conditional.
+ The two loops can be translated into a single branch instruction each.

Calculate the instructions-per-cycle that can be achieved for different values of x (2, 4, 10, 100):

a) Without branch prediction.
Outer loop: always branches, so it always fails: 3-cycle penalty per outer-loop iteration.
Inner loop: branches (x-1) out of x iterations, so (x-1) failures: 3*(x-1) cycles of penalty per outer-loop iteration.
Instructions = x*(20+1) + 1
Cycles = Instructions + penalties = x*(20+1) + 1 + 3*(x-1) + 3

b) With a simple branch prediction policy: do the same as last time.
Outer loop: always branches, so it is always predicted correctly (except the first time): no penalty.
Inner loop: the first iteration (predicts not taken, but branches) and the last (predicts taken, but falls through) always fail; the rest are predicted correctly: two 3-cycle penalties per outer iteration.
Instructions = x*(20+1) + 1
Cycles = x*(20+1) + 1 + 3*2

c) With perfect branch prediction: always predicts correctly.
Both loops are always predicted correctly: no penalty.
Instructions = x*(20+1) + 1
Cycles = x*(20+1) + 1

                                   x = 2   x = 4   x = 10   x = 100
Instructions (per outer iteration)   43      85     211      2101
a) Cycles - w/o BP                   49      97     241      2401
a) IPC    - w/o BP                  0.878   0.876   0.876    0.875
b) Cycles - simple BP                49      91     217      2107
b) IPC    - simple BP               0.878   0.934   0.972    0.997
c) Cycles - perfect BP               43      85     211      2101
c) IPC    - perfect BP              1.000   1.000   1.000    1.000
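The table above can be regenerated mechanically. This sketch is my own helper, using the exercise's model (20-instruction loop body, one branch per loop, 3-cycle misprediction penalty):

```python
def ipc(x, policy):
    """IPC for one outer-loop iteration with x inner iterations."""
    instructions = x * (20 + 1) + 1          # 20 body instr + inner branch, per iteration, plus outer branch
    if policy == "none":                     # every taken branch stalls: (x-1) inner + 1 outer
        cycles = instructions + 3 * (x - 1) + 3
    elif policy == "simple":                 # predict same-as-last: first and last inner iterations miss
        cycles = instructions + 3 * 2
    else:                                    # perfect prediction
        cycles = instructions
    return instructions / cycles

for x in (2, 4, 10, 100):
    print(x, round(ipc(x, "none"), 3), round(ipc(x, "simple"), 3), round(ipc(x, "perfect"), 3))
```

The output reproduces the table row by row; note how simple prediction approaches IPC = 1 as x grows, because its fixed 6-cycle cost is amortized over more useful instructions.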

3. Consider a 10-stage fully pipelined processor as the one below:

IF IF IF | ID | EX EX | MEM MEM MEM | WB
(IF and MEM: 3 stages each; EX: 2 stages; ID and WB: 1 stage each)

a) How many cycles of penalty will be incurred by the different kinds of hazards in such a processor?

Data hazards: up to a 5-cycle penalty (1 EX + 3 MEM + 1 WB) for 2 consecutive instructions with a data dependency if forwarding is not implemented. The 1st instruction needs to complete WB before the 2nd can advance to EX (see below):

add r1,r0,r3   IF IF IF ID EX EX MEM MEM MEM WB
sub r4,r1,r3      IF IF IF ID stall stall stall stall stall EX EX MEM MEM MEM WB

Control hazards:
Unconditional branches: 3 cycles (2 IF + 1 ID). We don't know the instruction is a branch until the ID stage, by which point 3 wrong-path instructions have already been fetched:

b n      IF IF IF ID EX EX MEM
inst 1      IF IF IF ID EX EX
inst 2         IF IF IF ID EX
inst 3            IF IF IF ID
n                    IF IF IF

Conditional branches: 5 cycles (2 IF + 1 ID + 2 EX). We don't know whether we need to branch or not until the end of the EX stage, by which point 5 wrong-path instructions have already been fetched:

beq n    IF IF IF ID EX EX MEM
inst 1      IF IF IF ID EX EX
inst 2         IF IF IF ID EX
inst 3            IF IF IF ID
inst 4               IF IF IF
inst 5                  IF IF
n                          IF
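Both penalty counts follow from the stage lengths alone. The sketch below is my own helper (not from the exam), parameterized by the stage lengths given above:

```python
STAGES = {"IF": 3, "ID": 1, "EX": 2, "MEM": 3, "WB": 1}

def raw_stall_no_forwarding(stages, distance=1):
    """Stall cycles for a consumer `distance` instructions behind its
    producer when the operand only becomes available after the producer's WB."""
    wb_done = sum(stages.values())                        # producer's WB ends in cycle 10
    ex_start = distance + stages["IF"] + stages["ID"] + 1 # consumer's normal first EX cycle
    return max(0, wb_done + 1 - ex_start)

def branch_penalty(stages, resolved_after):
    """Wrong-path fetch cycles when the branch is resolved at the end of the
    given stage ('ID' for unconditional, 'EX' for conditional branches)."""
    order = ["IF", "ID", "EX", "MEM", "WB"]
    depth = sum(stages[s] for s in order[:order.index(resolved_after) + 1])
    return depth - 1   # every cycle after the branch's own first IF slot is lost

print(raw_stall_no_forwarding(STAGES))   # 5
print(branch_penalty(STAGES, "ID"))      # 3 (unconditional)
print(branch_penalty(STAGES, "EX"))      # 5 (conditional)
```

Changing the stage lengths in STAGES shows how deeper front ends and execution units directly inflate both penalties, which is the point of question 1's discussion.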

4. Consider the following program, which implements R = A^2 + B^2 + C^2 + D^2:

LD  r1, @A
MUL r2, r1, r1     -- A^2
LD  r3, @B
MUL r4, r3, r3     -- B^2
ADD r11, r2, r4    -- A^2 + B^2
LD  r5, @C
MUL r6, r5, r5     -- C^2
LD  r7, @D
MUL r8, r7, r7     -- D^2
ADD r12, r6, r8    -- C^2 + D^2
ADD r21, r11, r12  -- A^2 + B^2 + C^2 + D^2
ST  r21, @R

a) Simulate its execution in a basic 5-stage pipeline without forwarding (32 cycles in total):

LD  r1, A         IF ID EX MEM WB
MUL r2, r1, r1      IF ID Stall Stall EX MEM WB
LD  r3, B             IF Stall Stall ID EX MEM WB
MUL r4, r3, r3          IF ID Stall Stall EX MEM WB
ADD r11, r2, r4           IF Stall Stall ID Stall Stall EX MEM WB
LD  r5, C                   IF Stall Stall ID EX MEM WB
MUL r6, r5, r5                IF ID Stall EX MEM WB
LD  r7, D                       IF Stall ID EX MEM WB
MUL r8, r7, r7                    IF ID Stall Stall EX MEM WB
ADD r12, r6, r8                     IF Stall Stall ID Stall Stall EX MEM WB
ADD r21, r11, r12                     IF Stall Stall ID Stall Stall EX MEM WB
ST  r21, R                              IF Stall Stall ID Stall Stall EX MEM WB

b) Simulate the execution of the original code in a 5-stage pipeline with forwarding (19 cycles in total):

LD  r1, A         IF ID EX MEM WB
MUL r2, r1, r1      IF ID Stall EX MEM WB
LD  r3, B             IF Stall ID EX MEM WB
MUL r4, r3, r3          IF ID Stall EX MEM WB
ADD r11, r2, r4           IF Stall ID EX MEM WB
LD  r5, C                   IF ID EX MEM WB
MUL r6, r5, r5                IF ID Stall EX MEM WB
LD  r7, D                       IF Stall ID EX MEM WB
MUL r8, r7, r7                    IF ID Stall EX MEM WB
ADD r12, r6, r8                     IF Stall ID EX MEM WB
ADD r21, r11, r12                     IF ID EX MEM WB
ST  r21, R                              IF ID EX MEM WB

Note: the forwarding paths resolve dependencies between stages; e.g. the MEM output of LD r1, A is forwarded to the EX input of MUL r2, r1, r1.

c) Draw the dependency graph of the application.

LD A -> MUL r2 --+
LD B -> MUL r4 --+-> ADD r11 --+
LD C -> MUL r6 --+             +-> ADD r21 -> ST R
LD D -> MUL r8 --+-> ADD r12 --+

d) Based on the dependency graph, discuss the suitability of the code to be run in a 2-way superscalar.
The dependency graph shows many independent instructions that could be issued in parallel to exploit instruction-level parallelism. Only the last two instructions (ADD r21 and the ST, which form a serial chain) cannot be issued in parallel with anything else.

e) Simulate the execution of the original code in a 5-stage 2-way superscalar pipeline with forwarding (17 cycles in total):

LD  r1, A         IF ID EX MEM WB
MUL r2, r1, r1    IF ID Stall EX MEM WB
LD  r3, B            IF Stall ID EX MEM WB
MUL r4, r3, r3       IF ID Stall EX MEM WB
ADD r11, r2, r4         IF Stall ID EX MEM WB
LD  r5, C               IF Stall ID EX MEM WB
MUL r6, r5, r5             IF ID Stall EX MEM WB
LD  r7, D                  IF Stall ID EX MEM WB
MUL r8, r7, r7                IF ID Stall EX MEM WB
ADD r12, r6, r8               IF Stall ID EX MEM WB
ADD r21, r11, r12                IF ID EX MEM WB
ST  r21, R                       IF ID EX MEM WB

f) Simulate the execution of the reordered code in a 5-stage 2-way superscalar pipeline (11 cycles in total). Grouping the loads, then the multiplies, then the adds removes every stall:

1  LD  r1, A          IF ID EX MEM WB
2  LD  r3, B          IF ID EX MEM WB
3  LD  r5, C             IF ID EX MEM WB
4  LD  r7, D             IF ID EX MEM WB
5  MUL r2, r1, r1           IF ID EX MEM WB
6  MUL r4, r3, r3           IF ID EX MEM WB
7  MUL r6, r5, r5              IF ID EX MEM WB
8  MUL r8, r7, r7              IF ID EX MEM WB
9  ADD r11, r2, r4                IF ID EX MEM WB
10 ADD r12, r6, r8                IF ID EX MEM WB
11 ADD r21, r11, r12                 IF ID EX MEM WB
12 ST  r21, R                        IF ID EX MEM WB
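The reordering in f) falls straight out of the dependency graph: instructions at the same depth are mutually independent. This sketch is my own helper (not part of the exercise); it derives each instruction's depth from its RAW dependences:

```python
# (opcode, destination register or None, source registers)
PROGRAM = [
    ("LD",  "r1",  []),            ("MUL", "r2",  ["r1", "r1"]),
    ("LD",  "r3",  []),            ("MUL", "r4",  ["r3", "r3"]),
    ("ADD", "r11", ["r2", "r4"]),
    ("LD",  "r5",  []),            ("MUL", "r6",  ["r5", "r5"]),
    ("LD",  "r7",  []),            ("MUL", "r8",  ["r7", "r7"]),
    ("ADD", "r12", ["r6", "r8"]),
    ("ADD", "r21", ["r11", "r12"]),
    ("ST",  None,  ["r21"]),
]

def dep_levels(instrs):
    """Depth of each instruction in the RAW dependency graph (1 = no inputs)."""
    producer, levels = {}, []
    for i, (_, dst, srcs) in enumerate(instrs):
        deps = [producer[s] for s in srcs if s in producer]
        levels.append(1 + max((levels[d] for d in deps), default=0))
        if dst:
            producer[dst] = i
    return levels

print(dep_levels(PROGRAM))   # [1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 4, 5]
```

Levels 1 and 2 each contain four independent instructions (the loads and the multiplies), level 3 two, but levels 4 and 5 only one each: exactly the serial ADD/ST tail that limits the 2-way superscalar in d).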

5. Consider we want to execute 2 programs with 100 instructions each. The first program suffers an i-cache miss at instruction #30, and the second program another at instruction #70. Assume that:
+ There is enough parallelism to execute all instructions independently (no hazards).
+ Switching threads can be done instantaneously.
+ A cache miss requires 20 cycles to get the instruction to the cache.
+ The two programs would not interfere with each other's cache lines.

Calculate the execution time observed by each of the programs (cycles elapsed between the execution of the first and the last instruction of that application) and the total time to execute the workload:

a) Sequentially (no multithreading).
Thread A: 29 instr | miss discovered (1 cycle) | 20-cycle stall | 71 instr = 121 cycles
Thread B: 69 instr | miss discovered (1 cycle) | 20-cycle stall | 31 instr = 121 cycles
Total: 242   A: 121   B: 121

b) With coarse-grain multithreading.
Thread A runs 29 instructions (cycles 1-29); its miss wastes cycle 30 and triggers a switch. Thread B runs 69 instructions (cycles 31-99); its miss wastes cycle 100 and triggers a switch back to A, whose miss has long been serviced. A finishes its remaining 71 instructions (cycles 101-171), then B its remaining 31 (cycles 172-202).
Total: 202   A: 171   B: 172

c) With fine-grain multithreading.
The threads alternate cycle by cycle; while one thread is stalled on its 20-cycle miss, the other issues every cycle, so both misses are completely hidden and only the two miss-discovery cycles are lost.
Total: 202   A: 201   B: 201

d) And with 2-way simultaneous multithreading.
Both threads issue every cycle; when one thread stalls on its miss, the other can use both issue slots, so the misses are hidden. 200 instructions plus the two wasted miss-discovery slots over 2 issue slots per cycle give 202/2 = 101 cycles.
Total: 101   A: 101   B: 101
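The coarse-grain timeline in b) can be checked with a few lines of arithmetic. This is my own sketch, following the model above (20-cycle miss latency, one wasted cycle to discover each miss, instantaneous switching):

```python
def coarse_grain():
    """Timeline for the two 100-instruction threads under coarse-grain MT."""
    cycle = 29 + 1          # A issues instr 1-29, discovers its miss in cycle 30
    b_start = cycle + 1     # B takes over in cycle 31
    cycle += 69 + 1         # B issues 69 instr, discovers its miss in cycle 100
    cycle += 71             # A's miss was serviced during B's run; A finishes (101-171)
    a_time = cycle          # A observed: cycles 1-171
    cycle += 31             # B finishes its remaining instructions (172-202)
    b_time = cycle - b_start + 1   # B observed: cycles 31-202
    return a_time, b_time, cycle

print(coarse_grain())   # (171, 172, 202)
```

The same style of cycle bookkeeping reproduces the sequential case (29 + 1 + 20 + 71 = 121 per thread) and makes it obvious where coarse-grain wins: each thread's 20-cycle stall is overlapped with useful work from the other thread.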

6. Consider a 4-way superscalar processor with 4-way multithreading and the 4 threads below, with the instructions issued as per the diagrams. ICM stands for instruction cache miss, and means no instruction can be issued by that thread during that cycle.

Thread A / Thread B / Thread C / Thread D issue diagrams (read row by row across the four threads):
1 2 1 2 3 1 2 3 1 3 4 5 ICM ICM 4 5 6 6 ICM 2 3 4 7 8 7 4 5 5 6 9 10 11 12 8 6 7 8 ICM ICM 7 9 10 11 12 ICM ICM 8 9 10 ICM 9 10 11 12 13 13 14 14 15 16 15 16

Draw a diagram showing the execution flow (issue stage only) of the threads on a multithreaded processor (first 10 cycles only). Assume the processor starts issuing instructions from the different threads in order A >> B >> C >> D >> A:

a) Coarse-grain multithreading: swap thread upon an expensive operation (here, an ICM).
Thread A: 1 2 3 4 5 6 7 8 9 10 11 12, ICM, then Thread B: 1 2 3 4 5 6 7

b) Fine-grain multithreading: round robin over the ready threads (threads stalled on an ICM are skipped).
1 2 | 1 2 3 | 1 2 3 | 1 | 3 4 5 | 4 5 | 2 3 4 | 4 5 6 | 6

c) Simultaneous multithreading: issue instructions from several threads per cycle.
1 2 1 2 3 1 2 3 1 3 4 5 4 5 6 6 2 3 4 7 8 7 4 5 5 6 9 10 11 12 8 6 7 8 7 9 10 11 12

7. Consider a modern 6-core processor with 2-way multithreading and 4-way superscalar pipelines.

a) What is this processor's peak IPC?
6 cores, each of them a 4-way superscalar, can issue/commit up to 6*4 = 24 instructions in a single cycle.

b) How many concurrent threads are supported by this processor?
6 cores running 2 threads each: 12 in total.

c) Assume the processor has a MESI cache coherency protocol with copy-back (write-back) caches. For the following sequence of accesses to variable x, show the bus transactions, the cache states and the actions in main memory. Assume all the corresponding cache lines start in the I state.

core0: LDR r0, x -- Read miss: Cache0 sends a MEM_READ. No other cache has the line, so main memory sends the value. Cache0 changes to E.
core1: LDR r1, x -- Read miss: Cache1 sends a MEM_READ. Cache0 has the cache line, so it sends the value. Cache0 and Cache1 change to S.
core2: STR r0, x -- Write miss: Cache2 sends an RWITM (read with intent to modify). Cache0 and Cache1 change to I, main memory sends the value, and Cache2 changes to M.
core3: LDR r3, x -- Read miss: Cache3 sends a MEM_READ. Cache2 has the modified line, so it sends the value to Cache3 and updates main memory. Cache2 and Cache3 change to S.
core3: STR r5, x -- Write hit: Cache3 sends an INVALIDATE. Cache2 changes to I, Cache3 changes to M.
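The state transitions in c) can be replayed with a toy simulator. This is my own simplified sketch (it tracks only the one line holding x and ignores data values and write-backs beyond the state changes; the transaction names follow the answer above):

```python
M, E, S, I = "M", "E", "S", "I"

class Bus:
    """Minimal MESI model: one cache line, one state per core's cache."""
    def __init__(self, n_caches):
        self.caches = [I] * n_caches

    def read(self, c):                 # LDR
        if self.caches[c] != I:
            return                     # read hit: no bus traffic
        others = [i for i, st in enumerate(self.caches) if st != I]
        for i in others:               # an M/E owner supplies the line and
            self.caches[i] = S         # (if M) writes it back; all become S
        self.caches[c] = S if others else E   # E if memory supplied it

    def write(self, c):                # STR
        if self.caches[c] == I:        # write miss: RWITM invalidates everyone
            self.caches = [I] * len(self.caches)
        elif self.caches[c] == S:      # write hit on a shared line: INVALIDATE
            self.caches = [I if i != c else st
                           for i, st in enumerate(self.caches)]
        self.caches[c] = M             # E/M hits upgrade silently

bus = Bus(4)
bus.read(0);  print(bus.caches)   # ['E', 'I', 'I', 'I']
bus.read(1);  print(bus.caches)   # ['S', 'S', 'I', 'I']
bus.write(2); print(bus.caches)   # ['I', 'I', 'M', 'I']
bus.read(3);  print(bus.caches)   # ['I', 'I', 'S', 'S']
bus.write(3); print(bus.caches)   # ['I', 'I', 'I', 'M']
```

The five printed lines match the five steps of the answer, one state vector per access.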

8. Consider the following 3 processors:

                 Proc 1    Proc 2            Proc 3
# of cores       8         6                 8
Multithreading   No        4-way fine-grain  2-way SMT
Superscalar      4-way     4-way             2-way
Clock freq.      1.5 GHz   2.5 GHz           3 GHz

For each of them compute:

a) Single-thread peak performance.
Multiply the peak IPC of one core (its superscalar width) by the clock frequency:
Proc 1: 4*1.5 = 6 Ginstr/s
Proc 2: 4*2.5 = 10 Ginstr/s
Proc 3: 2*3 = 6 Ginstr/s

b) Peak IPC of the system (instructions per cycle).
Multiply the number of cores by the maximum number of instructions each can issue per cycle:
Proc 1: 8*4 = 32 instr/cycle
Proc 2: 6*4 = 24 instr/cycle
Proc 3: 8*2 = 16 instr/cycle

c) Peak computing throughput of the system (instructions per second).
Multiply the number of cores, the per-core IPC and the frequency:
Proc 1: 8*4*1.5 = 48 Ginstr/s
Proc 2: 6*4*2.5 = 60 Ginstr/s
Proc 3: 8*2*3 = 48 Ginstr/s

d) The total number of hardware threads supported by the system.
Multiply the number of cores by the threads per core:
Proc 1: 8*1 = 8 threads
Proc 2: 6*4 = 24 threads
Proc 3: 8*2 = 16 threads
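All four metrics are products of the same few parameters, which makes them easy to tabulate. A small helper of my own (not from the exercise):

```python
def peak_metrics(cores, width, ghz, threads_per_core):
    """Peak figures for a multicore processor: width = superscalar issue width."""
    return {
        "single_thread_Ginstr_s": width * ghz,          # one thread, one core
        "peak_ipc": cores * width,                      # whole chip, per cycle
        "throughput_Ginstr_s": cores * width * ghz,     # whole chip, per second
        "hw_threads": cores * threads_per_core,
    }

procs = {
    "Proc 1": peak_metrics(8, 4, 1.5, 1),   # no multithreading -> 1 thread/core
    "Proc 2": peak_metrics(6, 4, 2.5, 4),   # 4-way fine-grain
    "Proc 3": peak_metrics(8, 2, 3.0, 2),   # 2-way SMT
}

for name, m in procs.items():
    print(name, m)
```

Note the trade-off the numbers expose: Proc 2 has the best single-thread and total throughput, while Proc 1 and Proc 3 trade width against clock and thread count to reach the same 48 Ginstr/s.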