Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding


1 Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding
Keunsoo Kim, Sangpil Lee, Myung Kuk Yoon, *Gunjae Koo, Won Woo Ro, *Murali Annavaram
Yonsei University, *University of Southern California

2 GPUs Have Long Memory Latency
[Diagram: a GPU contains multiple cores (SMs), each with I-cache/fetch/decode, a warp scheduler, a register file, ALUs, a scratchpad, and an L1 cache; the cores reach the memory partitions (L2 cache, memory controller, off-chip interface) and off-chip DRAM through the on-chip interconnect.]

3 GPUs Have Long Memory Latency
[Same diagram, annotated with the round-trip DRAM access latency in cycles: a load must cross the interconnect, the L2, and the off-chip interface before its data returns.]

4 Long Latency Stalls Warp Execution
Instruction stream of a warp (a loop body; addresses 0x80 through 0xA0):
    0x80  ld.global $r0, [$r1]
    0x88  add       $r2, $r2, #1
    0x90  mul       $r6, $r0, $r6
    0x98  set.lt    $r4, $r2, $r5
    0xA0  br        $r4, 0x80
Instructions issue in-order, and ld.global is a long-latency memory instruction.

5 Long Latency Stalls Warp Execution
Stall point: the warp issues the long-latency ld.global $r0, [$r1] at 0x80 and keeps issuing in-order until it reaches the first consumer of $r0, where it stalls until the load returns.

6 Warp Interleaving
GPUs interleave multiple warps to hide latency: within each warp, instructions are issued in-order, but across warps they can be interleaved in any order.
[Diagram: Warps 1-4 each expose their next instruction and a per-warp scoreboard to the warp scheduler, which issues ready instructions to the execution units.]

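To make the interleaving concrete, here is a minimal Python sketch, not the hardware described in the talk: the Warp class, the instruction tuples, and the 400-cycle load latency are illustrative assumptions. Each cycle the scheduler issues from any warp whose next instruction has no in-flight source register; once every warp has reached the consumer of its own load, nothing is ready and the core stalls, which is the situation the next slides quantify.

    # Minimal interleaving model; encodings and latencies are assumptions.
    LOAD_LATENCY = 400

    class Warp:
        def __init__(self, wid, program):
            self.wid = wid
            self.pc = 0
            self.program = program   # list of (opcode, dst, src_regs)
            self.pending = {}        # reg -> cycle its value becomes ready

        def ready(self, now):
            # In-order issue: the next instruction stalls on any RAW hazard.
            if self.pc >= len(self.program):
                return False
            _, _, srcs = self.program[self.pc]
            return all(self.pending.get(r, 0) <= now for r in srcs)

        def issue(self, now):
            op, dst, _ = self.program[self.pc]
            self.pending[dst] = now + (LOAD_LATENCY if op == "ld.global" else 1)
            self.pc += 1
            return op

    loop = [("ld.global", "r0", ["r1"]),
            ("add",       "r2", ["r2"]),
            ("mul",       "r6", ["r0", "r6"]),   # consumer of the load
            ("set.lt",    "r4", ["r2", "r5"]),
            ("br",        "pc", ["r4"])]

    warps = [Warp(w, list(loop)) for w in range(4)]
    for cycle in range(12):
        runnable = [w for w in warps if w.ready(cycle)]
        if runnable:
            w = runnable[0]   # greedy pick; real schedulers vary
            print(f"cycle {cycle:2}: warp {w.wid} issues {w.issue(cycle)}")
        else:
            print(f"cycle {cycle:2}: all warps stalled on long-latency RAW")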

9 Need for Improving Latency Tolerance
[Stacked bar chart: fraction of total cycles per benchmark (SG, LI, MC, WP, KM, BF, SV, FD, MU, BT, LP, NW, FW, CF, LK, DW, PF, DC, S1, BP, S2, MT, LB, ST, AE, SF, MQ, SD), split into long-latency RAW stalls, load/store unit stalls, and other stalls.]
Long-latency RAW stalls: warps waiting for a response from the memory system, caused by long memory latency.
Load/store unit stalls: no more memory requests can be generated, caused by limited memory throughput.

10 Need for Improving Latency Tolerance
[Same chart, with the benchmarks divided into a latency-sensitive group and a latency-insensitive group; on average 23% of cycles are stalled on long memory latency.]
Long-latency RAW stalls: warps waiting for a response from the memory system, caused by long memory latency.
Load/store unit stalls: no more memory requests can be generated, caused by limited memory throughput.

11 More TLP Improves Performance
Better latency tolerance by interleaving more threads.
[Bar chart: relative performance (up to about 1.2) vs. active thread blocks per core: TB2, baseline, TB8, TB16.]

12 Cost of Increasing Thread Parallelism
Better latency tolerance by interleaving more threads, but more registers and scratchpad are needed to hold the extra thread contexts.
[Two charts: relative performance, and register/scratchpad storage per SM in KB (up to 350), for TB2, baseline, TB8, and TB16 active thread blocks per core.]

13 Cost of Increasing Thread Parallelism
Better latency tolerance by interleaving more threads, but more registers and scratchpad are needed for thread contexts, and the speedup diminishes as the thread count increases.
[Charts: performance gains of only +6.7% and +4.0% at the higher thread-block counts, while register/scratchpad storage grows by 1.5x to 2x.]

14 Cost of Increasing Thread Parallelism
Speedup diminishes as the thread count increases, while the storage cost keeps growing: can we improve latency hiding without adding more threads?

15 Cost of Increasing Thread Parallelism
Can we improve latency hiding without adding more threads? Exploit instruction-level parallelism!

16 Where is Instruction-level Parallelism?
[Diagram: the instruction stream around the stall; the warp is stalled on the first ld.global $r0, [$r1], while a second ld.global $r0, [$r1] sits further down the stream.]

17 Where is Instruction-level Parallelism?
The second ld.global $r0, [$r1] is long-latency independent, but it appears past a branch, beyond the stall point.

18 Where is Instruction-level Parallelism?
Within a single warp, the loop body repeats (two iterations shown); everything past the stall point that does not consume the in-flight load, including the next iteration's load, is potential ILP. Multiple warps expose the same pattern.
    ld.global $r0, [$r1]
    add       $r2, $r2, #1
    mul       $r6, $r0, $r6
    set.lt    $r4, $r2, $r5
    br        $r4, 0x80
    ld.global $r0, [$r1]
    add       $r2, $r2, #1
    mul       $r6, $r0, $r6
    set.lt    $r4, $r2, $r5
    br        $r4, 0x80

19 Where is Instruction-level Parallelism?
Rescheduling is required to exploit this instruction-level parallelism, and hardware schedulers may be too expensive to replicate for so many threads.
[Diagram: an execution order that moves long-latency independent instructions ahead of the two stalls; producer-consumer correctness between ld and mul is ignored for illustration.]

20 Solution: Warp Pre-execution
Idea: skip long-latency dependent instructions while maintaining in-order semantics.

21 Solution: Warp Pre-execution
Idea: skip long-latency dependent instructions while maintaining in-order semantics.
P-mode (pre-execution mode): at a long-latency stall point, switch the warp to P-mode, skip the instructions in the long-latency dependence chain, and execute the long-latency independent instructions.

22 Solution: Warp Pre-execution
Idea: skip long-latency dependent instructions while maintaining in-order semantics.
P-mode (pre-execution mode): at a long-latency stall point, switch the warp to P-mode, skip the instructions in the long-latency dependence chain, and execute the long-latency independent instructions.
N-mode (normal execution mode): after the long-latency operation completes, resume the warp at the stall point and execute the long-latency dependent instructions.
Key: keep fetch-decode-issue in-order. A sketch of this mode switch follows below.
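
The per-warp mode switch can be summarized as a small state machine. The following Python sketch is a simplification under assumed structures (the real design also copies the scoreboard and SIMT stack, as the next slides show); the class and field names are hypothetical.

    class WarpModeFSM:
        """Hypothetical per-warp mode bit; scoreboard/stack copies omitted."""
        def __init__(self):
            self.mode = "N"        # "N" = normal, "P" = pre-execution
            self.stall_pc = None   # where N-mode must resume, in order
            self.stall_reg = None  # destination of the stalled-on load

        def on_long_latency_raw(self, pc, reg):
            # N-mode reached a consumer of an in-flight long-latency load:
            # remember the stall point and start pre-executing ahead.
            if self.mode == "N":
                self.mode, self.stall_pc, self.stall_reg = "P", pc, reg

        def on_load_complete(self, reg):
            # The stalled-on load returned: resume N-mode at the stall point.
            if self.mode == "P" and reg == self.stall_reg:
                self.mode, resume = "N", self.stall_pc
                self.stall_pc = self.stall_reg = None
                return resume
            return None

    fsm = WarpModeFSM()
    fsm.on_long_latency_raw(pc=0x90, reg="r0")  # mul $r6,$r0,$r6 stalls on $r0
    print(fsm.mode)                             # -> P
    print(hex(fsm.on_load_complete("r0")))      # -> 0x90, back in N-mode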

23 Warp Pre-execution
Warp mode: Normal (N).
Key modifications: each warp has a mode (normal or pre-execution); there is a scoreboard and a SIMT stack per mode (N-scoreboard/N-stack and P-scoreboard/P-stack); and a P-queue records pre-executed instructions in program order, consumed from its head.

24 Warp Pre-execution
Warp mode: Normal (N).
A long-latency instruction issues, and the N-scoreboard marks its destination as $r0 (Long); the N-stack holds the full active mask (1111).

25 Warp Pre-execution
Warp mode: Pre-execution (N → P).
A long-latency dependent instruction is detected, so the warp switches from N-mode to P-mode: the N-scoreboard ($r0 (Long)) and the N-stack (1111) are copied into the P-scoreboard and P-stack.

26 Warp Pre-execution: Skip
Warp mode: Pre-execution (P).
In P-mode, instructions dependent on the in-flight long-latency operation are skipped; the skipped instruction's destination is marked in the P-scoreboard, here $r6 (Skip).

27 Warp Pre-execution
Warp mode: Pre-execution (P).
In P-mode, executed instructions are recorded in the P-queue: the independent set.lt is pre-executed and pushed.

28 Warp Pre-execution
Warp mode: Pre-execution (P).
Divergent state is handled using the P-stack: P-mode continues beyond branches, but there is no branch speculation; pre-execution stops at a long-latency dependent branch.

29 Warp Pre-execution: Prefetch
Warp mode: Pre-execution (P).
In P-mode, loads are used as accurate prefetches: the next ld.global issues its memory request but is treated as skipped, so the P-scoreboard marks $r0 (Skip). These P-mode issue rules are sketched below.
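
Putting the last few slides together, the P-mode issue rules are: skip anything whose source is marked Long or Skip, record what is actually pre-executed in the P-queue, and turn loads into prefetches that are themselves treated as skipped. Below is a minimal Python sketch with assumed data structures, and without the register renaming introduced later; the final calls show the prefetch marking $r0 (Skip), which is why the next slide's mul is skipped as well.

    from collections import deque

    p_scoreboard = {"r0": "Long"}   # copied from the N-scoreboard on N -> P
    p_queue = deque()               # pre-executed instructions, program order

    def send_to_memory(dst, srcs):  # stand-in for the load/store unit
        print(f"prefetch: {dst} <- [{srcs[0]}]")

    def execute(op, dst, srcs):     # stand-in for the ALU datapath
        pass

    def p_mode_issue(op, dst, srcs):
        blocked = any(p_scoreboard.get(r) in ("Long", "Skip") for r in srcs)
        if op == "ld.global":
            if not blocked:
                send_to_memory(dst, srcs)   # address computable: prefetch
            p_scoreboard[dst] = "Skip"      # loads are treated as skipped
            return "prefetch" if not blocked else "skipped"
        if blocked:
            p_scoreboard[dst] = "Skip"      # propagate the skip chain
            return "skipped"
        execute(op, dst, srcs)              # long-latency independent
        p_queue.append((op, dst, srcs))     # recorded for later reuse
        return "pre-executed"

    # One trip around the loop after the warp enters P-mode at the mul:
    print(p_mode_issue("mul",       "r6", ["r0", "r6"]))  # skipped (uses $r0)
    print(p_mode_issue("set.lt",    "r4", ["r2", "r5"]))  # pre-executed
    print(p_mode_issue("ld.global", "r0", ["r1"]))        # prefetch, $r0 -> Skip
    print(p_mode_issue("add",       "r2", ["r2"]))        # pre-executed
    print(p_mode_issue("mul",       "r6", ["r0", "r6"]))  # 2nd mul skipped too
    print([entry[0] for entry in p_queue])                # ['set.lt', 'add']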

30 Warp Pre-execution
Warp mode: Pre-execution (P).
The 2nd instance of mul is also skipped, because its $r0 source is now marked (Skip); the P-queue so far holds set.lt and add.

31 Warp Pre-execution
Warp mode: Pre-execution (P).
The 2nd instance of set.lt is pushed to the P-queue, which now holds set.lt, add, set.lt: the P-queue can hold multiple instances of the same instruction in the presence of loops.

32 Warp Pre-execution
Warp mode: Normal (P → N).
After the 1st ld.global completes, the warp switches back to N-mode and continues from the original next instruction at the stall point.

33 Warp Pre-execution: Reuse
Warp mode: Normal (N).
set.lt was pre-executed, so it is popped from the P-queue head: no re-execution, its result is reused immediately. A sketch of this reuse check follows below.
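
A minimal sketch of the reuse check on the N-mode side, with hypothetical entry contents: an instruction matching the P-queue head commits its pre-computed result and is popped, while skipped instructions, absent from the queue, execute normally.

    from collections import deque

    # Hypothetical P-queue entries: (pc, opcode, pre-computed result).
    p_queue = deque([(0x98, "set.lt", 1), (0x88, "add", 7), (0x98, "set.lt", 0)])

    def n_mode_issue(pc, op, execute):
        if p_queue and p_queue[0][0] == pc:
            _, _, result = p_queue.popleft()  # in-order pop from the head
            return ("reused", result)         # commit, no re-execution
        return ("executed", execute())        # e.g. the skipped mul

    # Resuming at the stall point after the first load returns:
    print(n_mode_issue(0x90, "mul", lambda: 42))     # ('executed', 42)
    print(n_mode_issue(0x98, "set.lt", lambda: 1))   # ('reused', 1)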

34 Warp Pre-execution: Prefetch Hit
Warp mode: Normal (N → P).
The prefetch generated for the 2nd ld.global makes it a cache hit; the N-scoreboard records $r0 (Cache hit).

35 Warp Pre-execution: Reuse
Warp mode: Normal (N → P).
The result of the 2nd add is likewise reused through the P-queue.

36 Problem: WAR and WAW Hazards
    add $r1, $r1, #4
    mul $r3, $r2, $r0    (skipped in P-mode: depends on long-latency $r0)
    sub $r2, $r4, $r5    (pre-executed in P-mode)
sub's $r2 must be written to the register file only after mul's $r2 source is read: sub can be pre-executed, but its result cannot update $r2 until mul has read the old $r2!

37 Solution: Register Renaming
    ld.global $r0, [$r1]
    add       $r1, $r1, #4
    mul       $r3, $r2, $r0     (skipped)
    sub       $p0, $r4, $r5     (destination $r2 renamed to $p0)
    set.lt    $r4, $p0, $r5     (source $r2 renamed to $p0)
Idea: store pre-executed results in rename registers. In P-mode, source and destination registers are renamed: mul's $r2 is not overwritten, because sub's destination $r2 is renamed to $p0, and subsequent reads of $r2 are renamed to $p0 as well. Values propagate through the rename registers in P-mode.

38 Solution: Register Renaming
In N-mode, renamed registers are committed in-order on reuse: when the warp re-encounters mul $r3, $r2, $r0 and sub $r2, $r4, $r5, the pre-computed sub result is read from rename register $p0 and written back to $r2. A sketch of this renaming rule follows below.
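
A minimal Python sketch of the renaming rule, with hypothetical names and values (the real design draws rename registers from the underutilized register file, as a later slide discusses): P-mode destinations go to rename registers so the old architectural value stays readable for skipped consumers, sources read through the rename map, and N-mode commit copies the renamed value back in order.

    free_pregs = ["p0", "p1"]   # pool of rename registers (assumed)
    rename = {}                 # arch reg -> rename reg, valid in P-mode
    regfile = {"r2": 7, "r4": 8, "r5": 3}   # architectural register file
    pregfile = {}               # rename register storage

    def p_mode_exec(dst, srcs, fn):
        # Sources read through the rename map, so P-mode values propagate.
        vals = [pregfile[rename[r]] if r in rename else regfile[r] for r in srcs]
        rename[dst] = free_pregs.pop(0)   # never overwrite the arch reg early
        pregfile[rename[dst]] = fn(*vals)

    def n_mode_commit(dst):
        # In-order commit on reuse: the renamed value becomes architectural.
        regfile[dst] = pregfile[rename.pop(dst)]

    # P-mode: sub $r2,$r4,$r5 runs ahead of the skipped mul that reads $r2.
    p_mode_exec("r2", ["r4", "r5"], lambda a, b: a - b)
    print(regfile["r2"])   # 7: the skipped mul can still read the old $r2
    # set.lt $r4,$r2,$r5 reads the renamed $r2 ($p0) through the map.
    p_mode_exec("r4", ["r2", "r5"], lambda a, b: int(a < b))
    # N-mode, once the skipped mul has read the old $r2: commit sub's result.
    n_mode_commit("r2")
    print(regfile["r2"])   # 5: sub's pre-computed result, committed in order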

39 Warp Pre-execution: Reuse
Warp mode: Normal (N → P).
Each P-queue entry records an instruction ID, the architectural destination, and a rename register ID, e.g. (set.lt, $r4, $p100). More details in the paper.

40 Do We Need More Registers?
[Bar chart: register utilization (%) per benchmark, baseline vs. with P-mode.]
GPUs underutilize the register file because registers are managed at thread-block granularity; the unused registers serve as P-mode rename registers, raising register utilization from 73% to 86%.

41 Performance and Storage
Up to +38% speedup for the latency-sensitive top 50% of applications when combined with OptTB (optimized thread block count), at only 10% on-chip storage overhead.
[Charts: relative performance and on-chip storage per SM (KB) for Base, 2xTLP, 2xTLP+OptTB, P-mode, and P-mode+OptTB; gains of +22%, +22%, and +38% appear among the improved configurations, and the P-mode storage overhead is about 1.1x versus 2x for doubled TLP.]
P-mode outperforms 2x more threads, with less storage overhead.

42 P-mode Instruction Coverage
[Stacked bar chart per benchmark: fraction of dynamic instructions that are pre-executed/prefetched vs. skipped.]
Pre-execution covers 38% of dynamic instructions: 16% useful instructions (pre-executed or prefetched) and 22% skipped. Coverage reaches up to 90%, with up to 60% useful pre-execution. Implication: GPU applications may have further ILP to exploit.

43 Conclusion
Scaling TLP for higher latency tolerance demands a high storage cost. Our solution, warp pre-execution, exploits instruction-level parallelism while retaining in-order fetch-decode-scheduling. The main challenges, WAW and WAR hazards, are solved by register renaming. Pre-execution shows promising results: a 38% performance improvement when combined with TLP throttling.

44 Thank you for listening!
Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding
Keunsoo Kim, Sangpil Lee, Myung Kuk Yoon, *Gunjae Koo, Won Woo Ro, *Murali Annavaram
Yonsei University, *University of Southern California
