Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding
1 Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding Keunsoo Kim, Sangpil Lee, Myung Kuk Yoon, *Gunjae Koo, Won Woo Ro, *Murali Annavaram Yonsei University *University of Southern California
2 GPUs Have Long Memory Latency
[Diagram: GPU cores (SMs), each with I-cache/fetch/decode, warp scheduler, register file, ALUs, scratchpad, and L1 cache, connected through the on-chip interconnect to memory partitions (L2 cache, memory controller, off-chip interface) and off-chip DRAM]
3 GPUs Have Long Memory Latency
[Same diagram, annotated with the DRAM access latency in cycles]
4 Long Latency Stalls Warp Execution
Instruction stream of a warp (issued in-order):
0x80 ld.global $r0, [$r1]   ; long-latency memory instruction
0x88 add $r2, $r2, #1
0x90 mul $r6, $r0, $r6
0x98 set.lt $r4, $r2, $r5
0xA0 br $r4, 0x80
5 Long Latency Stalls Warp Execution
Stall point: with in-order issue, the warp stalls at the first instruction that consumes the result of the long-latency ld.global (here, mul $r6, $r0, $r6 at 0x90) until the load completes.
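The stall above can be sketched with a toy in-order issue model (the branch is omitted for brevity). The 200-cycle load latency, the instruction tuples, and the run_in_order helper are illustrative assumptions, not figures from the talk.

```python
# Toy model of in-order issue with a scoreboard: an instruction may not
# issue until every source register it reads is ready.
DRAM_LATENCY = 200  # assumed round-trip memory latency in cycles

# (opcode, destination, sources, result latency); branch omitted for brevity
program = [
    ("ld.global", "$r0", ["$r1"], DRAM_LATENCY),
    ("add",       "$r2", ["$r2"], 1),
    ("mul",       "$r6", ["$r0", "$r6"], 1),  # consumes the load result
    ("set.lt",    "$r4", ["$r2", "$r5"], 1),
]

def run_in_order(prog):
    """Return the cycle at which the last instruction has issued."""
    ready_at = {}  # register -> cycle its value becomes available
    cycle = 0
    for op, dst, srcs, lat in prog:
        # In-order issue: stall until all sources are ready.
        cycle = max([cycle] + [ready_at.get(s, 0) for s in srcs])
        ready_at[dst] = cycle + lat
        cycle += 1  # one issue slot per cycle
    return cycle

print(run_in_order(program))  # → 202: mul waits ~200 cycles on $r0
```

Even though set.lt is independent of the load, in-order issue cannot move it ahead of the stalled mul.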
6 Warp Interleaving GPUs interleave multiple warps for latency hiding. For each warp, instructions are issued in-order; warps can be interleaved in any order. [Diagram: per-warp next instruction and scoreboard feeding a warp scheduler and the execution units]
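A minimal round-robin interleaving sketch: every warp runs the same two-instruction stream (a load, then a dependent use), and each cycle the scheduler issues from the first warp whose next instruction is ready. The 8-cycle latency and the simulate helper are illustrative assumptions.

```python
LOAD_LAT = 8  # assumed memory latency in cycles

def simulate(num_warps):
    """Cycles to finish [load, dependent use] in every warp."""
    pc = [0] * num_warps          # per-warp program counter
    load_ready = [0] * num_warps  # cycle the warp's load result arrives
    cycle, done = 0, 0
    while done < num_warps:
        for w in range(num_warps):        # round-robin warp scheduler
            if pc[w] == 0:                # issue the load
                load_ready[w] = cycle + LOAD_LAT
                pc[w] = 1
                break
            if pc[w] == 1 and cycle >= load_ready[w]:
                pc[w] = 2                 # dependent use issues
                done += 1
                break
        cycle += 1                        # one issue slot per cycle
    return cycle

print(simulate(1), simulate(8))  # → 9 16
```

With one warp the core idles for the whole load latency; with eight warps the other warps' loads fill those slots, so the per-warp cost falls from 9 cycles to 2.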
9 Need for Improving Latency Tolerance
[Stacked bar chart: fraction of total cycles spent in long-latency RAW stalls, load/store unit stalls, and other stalls, per benchmark (SG, LI, MC, WP, KM, BF, SV, FD, MU, BT, LP, NW, FW, CF, LK, DW, PF, DC, S1, BP, S2, MT, LB, ST, AE, SF, MQ, SD)]
Long-latency RAW stalls: warps waiting for a response from the memory system (long memory latency)
Load/store unit stalls: no more memory requests can be generated (limited memory throughput)
10 Need for Improving Latency Tolerance
[Same chart, with benchmarks split into latency-sensitive and latency-insensitive groups]
On average, 23% of cycles are stalled on long memory latency
11 More TLP Improves Performance
Better latency tolerance by interleaving more threads
[Chart: relative performance vs. active thread blocks per core (TB2, Baseline, TB8, TB16)]
12 Cost of Increasing Thread Parallelism
Better latency tolerance by interleaving more threads
Need more registers/scratchpad for thread contexts
[Charts: relative performance and on-chip storage per SM (KB, registers + scratchpad) vs. active thread blocks per core]
13 Cost of Increasing Thread Parallelism
Speedup diminishes as the thread count increases: +6.7% and +4.0% performance for 1.5x and 2x the storage
14 Cost of Increasing Thread Parallelism
Can we improve latency hiding without adding more threads?
15 Cost of Increasing Thread Parallelism
Exploit instruction-level parallelism!
16 Where is Instruction-level Parallelism?
[Instruction stream with the warp stalled at the long-latency dependent instruction]
17 Where is Instruction-level Parallelism?
Long-latency independent instructions appear past the branch
18 Where is Instruction-level Parallelism?
Across loop iterations, the stream repeats:
0x80 ld.global $r0, [$r1]
0x88 add $r2, $r2, #1
0x90 mul $r6, $r0, $r6
0x98 set.lt $r4, $r2, $r5
0xA0 br $r4, 0x80
(and again for the next iteration)
The add, set.lt, and br instructions are long-latency independent
19 Where is Instruction-level Parallelism?
Rescheduling is required to exploit this instruction-level parallelism, but hardware schedulers may be too expensive for multiple threads
[Execution order shown with independent instructions hoisted past the stall; producer-consumer correctness between ld and mul ignored for illustration]
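Finding the long-latency-independent instructions amounts to tainting the load's dependence chain through register dataflow, in the spirit of the skip logic described later; the instruction tuples and the helper function are illustrative assumptions.

```python
# (opcode, destination, sources) for one loop body from the slides
program = [
    ("ld.global", "$r0", ["$r1"]),
    ("add",       "$r2", ["$r2"]),
    ("mul",       "$r6", ["$r0", "$r6"]),  # consumes the load result
    ("set.lt",    "$r4", ["$r2", "$r5"]),
]

def independent_of_load(prog, load_dst="$r0"):
    """Names of instructions after the load that do not (transitively)
    depend on its result; the rest form the skipped dependence chain."""
    tainted = {load_dst}
    free = []
    for op, dst, srcs in prog[1:]:
        if tainted & set(srcs):
            tainted.add(dst)   # joins the long-latency chain -> skip
        else:
            free.append(op)    # long-latency independent -> pre-execute
    return free

print(independent_of_load(program))  # → ['add', 'set.lt']
```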
20 Solution: Warp Pre-execution Idea: Skip long-latency dependent instructions while maintaining in-order semantics
21 Solution: Warp Pre-execution Idea: Skip long-latency dependent instructions while maintaining in-order semantics P-mode (Pre-execution mode) At long-latency stall point, Switch warp to P-mode Skip instructions in long-latency dependence chain Execute long-latency independent instructions
22 Solution: Warp Pre-execution Idea: Skip long-latency dependent instructions while maintaining in-order semantics P-mode (Pre-execution mode) At long-latency stall point, Switch warp to P-mode Skip instructions in long-latency dependence chain Execute long-latency independent instructions N-mode (Normal execution mode) Resume warp at stall point after completion of long-latency operation Execute long-latency dependent instructions Key: Keep in-order Fetch-decode-issue
23 Warp Pre-execution Warp mode: Normal (N) Key modifications: Each warp has a mode (Normal or Pre-execution) Scoreboard and SIMT stack per mode P-queue: records pre-executed instructions in program order N-Scoreboard P-Scoreboard P-queue Head N-Stack P-Stack
24 Warp Pre-execution Warp mode: Normal (N) Long-latency Inst. N-Scoreboard $r0 (Long) P-Scoreboard N-Stack 1111 P-Stack P-queue Head
25 Warp Pre-execution Warp mode: Pre-execution (N → P) Long-latency dependent instruction detected: switch from N-mode to P-mode; the N-Scoreboard and N-Stack are copied to the P-Scoreboard and P-Stack N-Scoreboard $r0 (Long) P-Scoreboard $r0 (Long) P-queue Head N-Stack 1111 P-Stack 1111
26 Warp Pre-execution Skip Warp mode: Pre-execution (P) In P-mode, instructions dependent on the in-flight long-latency operation are skipped N-Scoreboard $r0 (Long) P-Scoreboard $r0 (Long) $r6 (Skip) P-queue Head N-Stack 1111 P-Stack 1111
27 Warp Pre-execution Warp mode: Pre-execution (P) In P-mode, executed instructions are recorded in P-queue N-Scoreboard $r0 (Long) P-Scoreboard $r0 (Long) $r6 (Skip) P-queue set.lt Head N-Stack 1111 P-Stack 1111
28 Warp Pre-execution Warp mode: Pre-execution (P) Divergent state handled using P-stack P-mode continues beyond branches No branch speculation Stop at dependent branch N-Scoreboard $r0 (Long) P-Scoreboard $r0 (Long) $r6 (Skip) P-queue set.lt Head N-Stack 1111 P-Stack
29 Warp Pre-execution Prefetch Warp mode: Pre-execution (P) In P-mode, loads are used as accurate prefetch (treated as skipped) N-Scoreboard $r0 (Long) P-Scoreboard $r0 (Skip) $r6 (Skip) P-queue set.lt Head N-Stack 1111 P-Stack
30 Warp Pre-execution Warp mode: Pre-execution (P) 2nd instance of mul is also skipped N-Scoreboard $r0 (Long) P-Scoreboard $r0 (Skip) $r6 (Skip) P-queue add set.lt Head N-Stack 1111 P-Stack
31 Warp Pre-execution Warp mode: Pre-execution (P) 2nd instance of set.lt pushed to P-queue The P-queue can hold multiple instances of an instruction in the presence of loops N-Scoreboard $r0 (Long) P-Scoreboard $r0 (Skip) $r6 (Skip) P-queue set.lt add set.lt Head N-Stack 1111 P-Stack
32 Warp Pre-execution Warp mode: Normal (P → N) After the 1st ld.global completes, switch to N-mode and continue from the original next instruction N-Scoreboard $r0 (Long) P-Scoreboard $r0 (Long) $r6 (Skip) P-queue set.lt add set.lt Head N-Stack 1111 P-Stack
33 Warp Pre-execution Reuse Warp mode: Normal (N) set.lt was pre-executed: popped from the P-queue head; no re-execution, the result is reused immediately N-Scoreboard P-Scoreboard P-queue set.lt add set.lt Head N-Stack 1111 P-Stack
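The record-and-reuse behavior of this walkthrough can be sketched as a FIFO keyed by instruction identity; the Warp class and its methods are illustrative assumptions, not the hardware interface.

```python
from collections import deque

class Warp:
    def __init__(self):
        self.mode = "N"
        self.pqueue = deque()   # (inst_id, result) in program order

    def pre_execute(self, inst_id, result):
        """In P-mode, record an executed independent instruction."""
        assert self.mode == "P"
        self.pqueue.append((inst_id, result))

    def issue(self, inst_id, compute):
        """In N-mode, reuse a pre-executed result if it is at the head."""
        if self.pqueue and self.pqueue[0][0] == inst_id:
            return self.pqueue.popleft()[1]   # reuse, no re-execution
        return compute()                      # execute normally

w = Warp()
w.mode = "P"                      # long-latency stall: switch to P-mode
w.pre_execute("set.lt", True)     # independent instruction, recorded
w.mode = "N"                      # load returned: resume at stall point
print(w.issue("mul", lambda: 42))     # skipped earlier -> computed now: 42
print(w.issue("set.lt", lambda: 0))   # reused from P-queue head: True
```

Because the P-queue is consumed strictly from the head, reuse preserves the warp's in-order semantics.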
34 Warp Pre-execution Prefetch Hit Warp mode: Normal (N) Prefetches generated for the 2nd ld.global now hit in the cache N-Scoreboard $r0 (Cache hit) P-Scoreboard P-queue set.lt add Head N-Stack P-Stack
35 Warp Pre-execution Reuse Warp mode: Normal (N) The result of the 2nd add is also reused through the P-queue N-Scoreboard P-Scoreboard P-queue set.lt add Head N-Stack P-Stack
36 Problem: WAR and WAW Hazards (N → P)
add $r1, $r1, #4
mul $r3, $r2, $r0
sub $r2, $r4, $r5
sub.$r2 must be written to the register file after mul.$r2 is read: sub can be pre-executed, but its result cannot update $r2 until mul.$r2 is read!
37 Solution: Register Renaming (N → P)
ld.global $r0, [$r1]
add $r1, $r1, #4
mul $r3, $r2, $r0
sub $p0, $r4, $r5 ($r2 renamed to $p0)
set.lt $r4, $p0, $r5 ($r2 renamed to $p0)
Idea: store pre-executed results in rename registers
In P-mode, source and destination registers are renamed: mul.$r2 is not overwritten; sub.$r2 is renamed to $p0, and subsequent reads of $r2 follow the rename to $p0
Values propagate via rename registers in P-mode
38 Solution: Register Renaming (N → P → N)
ld.global $r0, [$r1]
add $r1, $r1, #4
mul $r3, $r2, $r0
sub $p0, $r4, $r5 ($r2 renamed to $p0)
set.lt $r4, $p0, $r5 ($r2 renamed to $p0)
mul $r3, $r2, $r0
sub $r2, $r4, $r5 (result read from $p0, committed to $r2)
In N-mode, renamed registers are committed in order (reuse): the pre-computed sub.$r2 is read from rename register $p0
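The WAR/WAW fix above can be sketched as a small rename map: a P-mode destination gets a fresh rename register, and later P-mode reads follow the map. The register names mirror the slides; the RenameMap class and its allocator are illustrative assumptions.

```python
class RenameMap:
    def __init__(self):
        self.map = {}        # arch register -> rename register (P-mode only)
        self.next_free = 0

    def write(self, arch_reg):
        """Rename a P-mode destination so the N-mode value is not clobbered."""
        preg = f"$p{self.next_free}"
        self.next_free += 1
        self.map[arch_reg] = preg
        return preg

    def read(self, arch_reg):
        """P-mode sources read the renamed copy if one exists."""
        return self.map.get(arch_reg, arch_reg)

rm = RenameMap()
print(rm.write("$r2"))   # sub $r2,... pre-executed -> writes $p0
print(rm.read("$r2"))    # later P-mode read of $r2 follows the map -> $p0
print(rm.read("$r4"))    # unrenamed registers read architecturally -> $r4
```

On the switch back to N-mode, the hardware would commit $p0 to $r2 in program order; this sketch covers only the P-mode mapping.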
39 Warp Pre-execution Reuse Warp mode: Normal (N) P-queue entry format: Inst. ID = set.lt, Arch. Dst ID = $r4, Rename Register ID = $p100 More details in the paper N-Scoreboard P-Scoreboard P-queue set.lt add set.lt Head N-Stack 1111 P-Stack
40 Do We Need More Registers?
[Chart: register file utilization (%) per benchmark, Baseline vs. P-mode]
GPUs underutilize the register file (registers are managed at thread-block granularity)
Unused registers serve as P-mode rename registers
Register utilization increases from 73% to 86%
41 Performance and Storage
Up to +38% for the latency-sensitive top 50% of applications, combined with OptTB (optimized thread block count)
Only 10% storage overhead
[Charts: relative performance (+22%, +22%, +38% annotations) and on-chip storage per SM (KB, including P-mode overhead) for Base, 2xTLP, P-mode, and P-mode+OptTB]
Outperforms 2x more threads, with less storage overhead
42 P-mode Instruction Coverage
[Stacked bar chart per benchmark: fraction of dynamic instructions pre-executed/prefetched vs. skipped]
Pre-execution covers 38% of dynamic instructions: 16% useful instructions (pre-execution/prefetch), 22% skipped
Up to 90% coverage, up to 60% useful pre-execution
Implication: GPU apps may have further ILP to exploit
43 Conclusion Scaling TLP for higher latency tolerance demands high storage cost Our solution: warp pre-execution Exploits instruction-level parallelism Retains in-order fetch-decode-scheduling Challenges: WAW and WAR hazards Solution: register renaming Pre-execution shows promising results: 38% performance improvement (combined with TLP throttling)
44 Thank you for listening! Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding Keunsoo Kim, Sangpil Lee, Myung Kuk Yoon, *Gunjae Koo, Won Woo Ro, *Murali Annavaram Yonsei University *University of Southern California
More informationInstruction-Level Parallelism (ILP)
Instruction Level Parallelism Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance 2 approaches to exploit ILP: 1. Rely on hardware to help discover and exploit
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2018 Static Instruction Scheduling 1 Techniques to reduce stalls CPI = Ideal CPI + Structural stalls per instruction + RAW stalls per instruction + WAR stalls per
More informationCS 2410 Mid term (fall 2018)
CS 2410 Mid term (fall 2018) Name: Question 1 (6+6+3=15 points): Consider two machines, the first being a 5-stage operating at 1ns clock and the second is a 12-stage operating at 0.7ns clock. Due to data
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order
More informationCourse II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan
Course II Parallel Computer Architecture Week 2-3 by Dr. Putu Harry Gunawan www.phg-simulation-laboratory.com Review Review Review Review Review Review Review Review Review Review Review Review Processor
More informationLIMITS OF ILP. B649 Parallel Architectures and Programming
LIMITS OF ILP B649 Parallel Architectures and Programming A Perfect Processor Register renaming infinite number of registers hence, avoids all WAW and WAR hazards Branch prediction perfect prediction Jump
More informationAnnouncement. ECE475/ECE4420 Computer Architecture L4: Advanced Issues in Pipelining. Edward Suh Computer Systems Laboratory
ECE475/ECE4420 Computer Architecture L4: Advanced Issues in Pipelining Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcement Lab1 is released Start early we only have limited computing
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 18 Advanced Processors II 2006-10-31 John Lazzaro (www.cs.berkeley.edu/~lazzaro) Thanks to Krste Asanovic... TAs: Udam Saini and Jue Sun www-inst.eecs.berkeley.edu/~cs152/
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationInstruction-Level Parallelism. Instruction Level Parallelism (ILP)
Instruction-Level Parallelism CS448 1 Pipelining Instruction Level Parallelism (ILP) Limited form of ILP Overlapping instructions, these instructions can be evaluated in parallel (to some degree) Pipeline
More informationGood luck and have fun!
Midterm Exam October 13, 2014 Name: Problem 1 2 3 4 total Points Exam rules: Time: 90 minutes. Individual test: No team work! Open book, open notes. No electronic devices, except an unprogrammed calculator.
More informationPage 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 09
More informationExploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.
More informationDYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD
DYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson,
More informationBackground: Pipelining Basics. Instruction Scheduling. Pipelining Details. Idealized Instruction Data-Path. Last week Register allocation
Instruction Scheduling Last week Register allocation Background: Pipelining Basics Idea Begin executing an instruction before completing the previous one Today Instruction scheduling The problem: Pipelined
More informationPage 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson
More informationLecture 12 Branch Prediction and Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars Krste Asanovic Electrical Engineering and Computer
More informationGPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27
1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution
More informationCMSC411 Fall 2013 Midterm 2 Solutions
CMSC411 Fall 2013 Midterm 2 Solutions 1. (12 pts) Memory hierarchy a. (6 pts) Suppose we have a virtual memory of size 64 GB, or 2 36 bytes, where pages are 16 KB (2 14 bytes) each, and the machine has
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationCPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor
1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic
More informationEE 4683/5683: COMPUTER ARCHITECTURE
EE 4683/5683: COMPUTER ARCHITECTURE Lecture 4A: Instruction Level Parallelism - Static Scheduling Avinash Kodi, kodi@ohio.edu Agenda 2 Dependences RAW, WAR, WAW Static Scheduling Loop-carried Dependence
More informationOOO Execution and 21264
OOO Execution and 21264 1 Parallelism ET = IC * CPI * CT IC is more or less fixed We have shrunk cycle time as far as we can We have achieved a CPI of 1. Can we get faster? 2 Parallelism ET = IC * CPI
More informationProcessor Architecture
Processor Architecture Advanced Dynamic Scheduling Techniques M. Schölzel Content Tomasulo with speculative execution Introducing superscalarity into the instruction pipeline Multithreading Content Tomasulo
More informationAdvanced Computer Architecture
Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes
More informationSudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread
Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)
More informationPerformance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.
Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More information