Introduction to Control Divergence


Introduction to Control Divergence
Lecture slides and figures contributed from sources as noted (1)

Objective
Understand the occurrence of control divergence and the concept of thread reconvergence
v Also described as branch divergence or thread divergence
Cover a basic thread reconvergence mechanism: PDOM
v Set up the discussion of further optimizations and advanced techniques
Explore one approach for mitigating the performance degradation due to control divergence: dynamic warp formation (2)

Reading
W. Fung, I. Sham, G. Yuan, and T. Aamodt, "Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware," ACM TACO, June 2009
W. Fung and T. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," International Symposium on High Performance Computer Architecture, 2011
M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012
Figures from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013 (3)

Handling Branches
CUDA code: if( ) (true for some threads) else (true for others), with some threads taking the taken path and the rest the not-taken path
What if threads of the same warp take different branch directions? Branch divergence! (4)
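
To make the scenario above concrete, here is a minimal CUDA kernel (hypothetical, for illustration only) in which threads of the same warp take different sides of a branch depending on their thread index, forcing the SIMT hardware to serialize the two paths.

```cuda
// Minimal divergence example (hypothetical kernel, for illustration only).
// Even-numbered threads take the "if" path and odd-numbered threads take the
// "else" path, so every 32-thread warp splits into two mutually exclusive
// groups that the SIMT hardware must execute one after the other.
__global__ void divergent_kernel(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    if (tid % 2 == 0) {          // taken by half the threads of each warp
        out[tid] = in[tid] * 2.0f;
    } else {                     // taken by the other half
        out[tid] = in[tid] + 1.0f;
    }                            // both halves reconverge here
}
```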

Branch Divergence
Occurs within a warp
Branches lead to serialization of the branch-dependent code
v Performance issue: low warp utilization, with idle threads in the if-block and in the else-block until reconvergence (5)

Branch Divergence
[Figure: control flow graph A-G and a four-thread warp (threads 1-4) sharing a common PC]
Different threads follow different control flow paths through the kernel code
Thread execution is (partially) serialized
v Subsets of threads that follow the same path execute in parallel
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

Basic Idea
Split: partition a warp
v Two mutually exclusive thread subsets, each branching to a different target
v Identify the subsets with two activity masks → effectively two warps (see the sketch below)
Join: merge two subsets of a previously split warp
v Reconverge the mutually exclusive sets of threads
Orchestrate correct execution for nested branches
Note the long history of such techniques in SIMD processors (see the background in Fung et al.)
[Figure: thread warp T1-T4 with a common PC and an activity mask] (7)

Thread Reconvergence
Fundamental problem:
v Merge threads with the same PC
v How do we sequence the execution of threads? This can affect the ability to reconverge
Question: when can threads productively reconverge?
Question: when is the best time to reconverge?
[Figure: control flow graph A-G] (8)
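
One concrete reading of the split step: the two activity masks fall directly out of the branch outcome. The sketch below is illustrative (the struct and function names are invented), not a hardware interface.

```cuda
// Illustrative host-side sketch of splitting a warp on a branch outcome.
// 'active' is the warp's current activity mask; 'taken' has a 1 for every
// active thread whose branch predicate evaluated true.
#include <cstdint>

struct WarpSplit {
    uint32_t taken_mask;      // threads that branch to the taken target
    uint32_t not_taken_mask;  // threads that fall through
};

WarpSplit split_warp(uint32_t active, uint32_t taken)
{
    WarpSplit s;
    s.taken_mask     = active & taken;    // subset of active threads, taken side
    s.not_taken_mask = active & ~taken;   // the complementary subset
    // The two masks are mutually exclusive and together cover 'active',
    // so the join step can simply OR them back together at reconvergence.
    return s;
}
```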

Dominator
Node d dominates node n if every path from the entry node to n must go through d
[Figure: control flow graph A-G] (9)

Immediate Dominator
Node d immediately dominates node n if every path from the entry node to n must go through d and no other node that dominates n lies between d and n
[Figure: control flow graph A-G] (10)

Post-Dominator
Node d post-dominates node n if every path from node n to the exit node must go through d
[Figure: control flow graph A-G] (11)

Immediate Post-Dominator
Node d immediately post-dominates node n if every path from node n to the exit node must go through d and no other node that post-dominates n lies between d and n
[Figure: control flow graph A-G] (12)
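
The immediate post-dominator that PDOM uses as the reconvergence point can be computed offline from the kernel's control flow graph. The following self-contained sketch (not the GPGPU-Sim implementation; the A..G nodes and edges are assumed from the slides' running example, with the E node restored) runs the standard iterative post-dominance dataflow and prints, for example, pdom(B) = {B, E, G}, whose nearest element E is the reconvergence point used on the next slide.

```cuda
// Host-side sketch: compute post-dominator sets for the running A..G example
// CFG (A->{B,F}, B->{C,D}, C->E, D->E, E->G, F->G, G = exit).  Standard
// iterative dataflow: pdom(n) = {n} U intersection over successors s of pdom(s).
#include <cstdio>
#include <vector>
#include <set>
#include <algorithm>
#include <iterator>

int main()
{
    const int N = 7;                       // nodes A..G = 0..6
    const char *name = "ABCDEFG";
    std::vector<std::vector<int>> succ(N);
    succ[0] = {1, 5};  // A -> B, F
    succ[1] = {2, 3};  // B -> C, D
    succ[2] = {4};     // C -> E
    succ[3] = {4};     // D -> E
    succ[4] = {6};     // E -> G
    succ[5] = {6};     // F -> G
    succ[6] = {};      // G is the exit

    std::vector<std::set<int>> pdom(N);
    std::set<int> all;
    for (int i = 0; i < N; ++i) all.insert(i);
    for (int i = 0; i < N; ++i) pdom[i] = all;   // initialize to "everything"
    pdom[6] = {6};                               // exit post-dominates only itself

    bool changed = true;
    while (changed) {                            // iterate to a fixed point
        changed = false;
        for (int n = 0; n < N; ++n) {
            if (n == 6) continue;
            std::set<int> meet = all;
            for (int s : succ[n]) {              // intersect successors' sets
                std::set<int> tmp;
                std::set_intersection(meet.begin(), meet.end(),
                                      pdom[s].begin(), pdom[s].end(),
                                      std::inserter(tmp, tmp.begin()));
                meet = tmp;
            }
            meet.insert(n);                      // a node post-dominates itself
            if (meet != pdom[n]) { pdom[n] = meet; changed = true; }
        }
    }

    for (int n = 0; n < N; ++n) {
        printf("pdom(%c) = { ", name[n]);
        for (int d : pdom[n]) printf("%c ", name[d]);
        printf("}\n");                           // e.g. pdom(B) = { B E G }
    }
    return 0;
}
```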

Baseline: PDOM
[Figure: control flow graph with per-block active masks A/1111, B/1111, C/1001, D/0110, E/1111, G/1111 (F is not taken by any thread); a four-thread warp with a common PC executes A, B, then the divergent paths C and D, reconverges at E, and continues to G over time]
Reconvergence stack (Reconv. PC | Next PC | Active Mask):
-  | G | 1111
E  | D | 0110
E  | C | 1001  (TOS)
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

Stack Entry
A stack entry is a specification of a group of active threads that will execute that basic block
The natural nested structure of control flow exposes the use of stack-based serialization (14)
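
A compact software model of the per-warp reconvergence stack illustrated above (illustrative only; field names follow the slide, not the exact GPGPU-Sim structures): a divergent branch rewrites the top entry to resume at the reconvergence PC and pushes the two paths, and a path is popped when it reaches its reconvergence PC. For the example, the initial entry (-, A, 1111) diverges at B into (E, D, 0110) and (E, C, 1001), each of which pops when its path reaches E.

```cuda
// Host-side sketch of a per-warp SIMT reconvergence stack (PDOM scheme).
#include <cstdint>
#include <vector>

struct StackEntry {
    int      rpc;      // reconvergence PC (immediate post-dominator)
    int      next_pc;  // where this group of threads executes next
    uint32_t mask;     // which threads of the warp are active on this path
};

struct SimtStack {
    std::vector<StackEntry> s;

    // On a divergent branch at the top-of-stack entry: rewrite the TOS to the
    // reconvergence point, then push the not-taken and taken paths.
    void diverge(int taken_pc, int not_taken_pc, int reconv_pc,
                 uint32_t taken_mask, uint32_t not_taken_mask)
    {
        s.back().next_pc = reconv_pc;                 // resume here after join
        s.push_back({reconv_pc, not_taken_pc, not_taken_mask});
        s.push_back({reconv_pc, taken_pc,     taken_mask});
    }

    // Called after the TOS group advances to 'pc'.  When the path reaches its
    // reconvergence PC, pop it; execution continues with the next entry.
    void advance(int pc)
    {
        s.back().next_pc = pc;
        if (pc == s.back().rpc) s.pop_back();
    }

    uint32_t active_mask() const { return s.back().mask; }
    int      pc()          const { return s.back().next_pc; }
};
```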

More Complex Example
Stack-based implementation for nested control flow
v Stack entry RPC set to the IPDOM
Re-convergence at the immediate post-dominator of the branch
From Fung et al., "Dynamic Warp Formation: Efficient MIMD Control Flow in SIMD Graphics Hardware," ACM TACO, June 2009 (15)

Implementation
[Figure: SIMT core pipeline - I-Fetch, Decode, I-Buffer, Issue, PRF/RF, scalar pipelines, D-Cache, Writeback, pending warps - with the per-warp reconvergence stack (Reconv. PC | Next PC | Active Mask)]
GPGPU-Sim model:
v Implement the per-warp stack at the issue stage
v Acquire the active mask and PC from the TOS
v Scoreboard check prior to issue
v Register writeback updates the scoreboard and the ready bit in the instruction buffer
v When RPC = Next PC, pop the stack
Implications for instruction fetch?
From the GPGPU-Sim documentation, Sim_3.x_Manual#SIMT_Cores (16)

Implementation (2)
The warp PC (next instruction) is compared to the reconvergence PC
On a branch
v Can store the reconvergence PC as part of the branch instruction
v The branch unit has NextPC, TargetPC, and the reconvergence PC to update the stack
On reaching a reconvergence point
v Pop the stack
v Continue fetching from the NextPC of the next entry on the stack (17)

Can We Do Better?
Warps are formed statically
Key idea of dynamic warp formation
v Show a pool of warps and how they can be merged
At a high level, what are the requirements?
[Figure: control flow graph A-G] (18)

Compaction Techniques
Can we re-form warps so as to increase utilization?
Basic idea: compaction
v Re-form warps with threads that follow the same control flow path
v Increase the utilization of warps
Two basic types of compaction techniques
Inter-warp compaction
v Group threads from different warps
Intra-warp compaction
v Group threads within a warp
o Changing the effective warp size (19)

Inter-Warp Thread Compaction Techniques
Lecture slides and figures contributed from sources as noted (20)

Goal
[Figure: warps 0-5 diverging at an if/else - some threads take the taken path, others the not-taken path; can we merge threads across warps?] (21)

Reading
W. Fung, I. Sham, G. Yuan, and T. Aamodt, "Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware," ACM TACO, June 2009
W. Fung and T. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," International Symposium on High Performance Computer Architecture, 2011 (22)

DWF: Example
[Figure: baseline vs. dynamic warp formation for two warps x and y on the CFG A-G, with per-block active masks (A x/1111 y/1111, B x/1110 y/0011, C x/1000 y/0010, D x/0110 y/0001, E x/1110 y/0011, F x/0001 y/1100, G x/1111 y/1111). Legend: execution of warp x or warp y at a basic block; a new warp is created from scalar threads of both warp x and warp y executing at basic block D. DWF shortens the execution timeline relative to the baseline.]
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

How Does This Work?
Criteria for merging
v Same PC
v Complementary sets of active threads in each warp
v Recall: many warps per thread block, all executing the same code
What information do we need to merge two warps?
v Thread IDs and PCs
Ideally, how would you find/merge warps? (24)
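
The merge criteria above reduce to two checks, sketched below with invented names: the candidate warps must be at the same PC and must occupy disjoint lanes, so that their activity masks can simply be ORed (this is the lane-aware constraint revisited on the following slides).

```cuda
// Illustrative sketch of the dynamic-warp-formation merge test and merge step.
#include <cstdint>

struct DynWarp {
    int      pc;       // common PC of all threads in this forming warp
    uint32_t mask;     // occupied SIMD lanes (one bit per lane)
    int      tid[32];  // thread ID currently mapped to each lane (-1 if empty)
};

// Two groups can merge if they are at the same PC and their occupied lanes
// are complementary (no lane conflict), so register-file lane mappings hold.
bool can_merge(const DynWarp &a, const DynWarp &b)
{
    return a.pc == b.pc && (a.mask & b.mask) == 0;
}

void merge_into(DynWarp &dst, const DynWarp &src)
{
    dst.mask |= src.mask;                      // union of occupied lanes
    for (int lane = 0; lane < 32; ++lane)
        if (src.mask & (1u << lane))
            dst.tid[lane] = src.tid[lane];     // adopt src's thread in that lane
}
```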

DWF: Microarchitecture Implementation
[Figure: DWF hardware organization - thread scheduler with warp update registers (taken and not-taken), PC-Warp LUT (identifies occupied lanes and points to the warp being formed, assisting in aggregating threads), warp pool (PC, TIDs, priority per forming warp), warp allocator, issue logic, decode, I-cache, ALU lanes, and lane-indexed register files accessed by (TID, Reg#)]
Warps are formed dynamically in the warp pool
After commit, check the PC-Warp LUT and either merge into a forming warp or allocate a newly forming warp
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

DWF: Microarchitecture Implementation (Example)
[Figure: worked example of a branch (A: BEQ R2, B) updating the warp update registers, PC-Warp LUT, and warp pool as threads at PCs B and C are aggregated into new warps with no lane conflicts]
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

Resource Usage
Ideally we would like a small number of unique PCs in progress at a time → minimize overhead
Warp divergence will increase the number of unique PCs
v Mitigate via warp scheduling
Scheduling policies (a majority-policy sketch follows below)
v FIFO
v Program counter: address variation as a measure of divergence
v Majority/Minority: serve the most common PC vs. help the stragglers
v Post-dominator (catch up) (27)

Hardware Consequences
Expose the implications that warps have in the base design
v Implications for register file access → lane-aware DWF
Register bank conflicts
From Fung et al., "Dynamic Warp Formation: Efficient MIMD Control Flow in SIMD Graphics Hardware," ACM TACO, June 2009 (28)
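
As one plausible reading of the majority policy (a sketch under assumptions, not the paper's implementation), the scheduler counts ready threads per PC across the warp pool and issues from the most populated PC first, falling back to the stragglers only when the majority stalls.

```cuda
// Sketch of a "majority" issue policy over a pool of forming warps:
// issue from the PC that currently has the most ready threads behind it.
#include <bitset>
#include <cstdint>
#include <map>
#include <vector>

struct PoolWarp { int pc; uint32_t mask; };

int pick_majority_pc(const std::vector<PoolWarp> &pool)
{
    std::map<int, int> threads_at_pc;
    for (const PoolWarp &w : pool)                 // count ready threads per PC
        threads_at_pc[w.pc] += (int)std::bitset<32>(w.mask).count();

    int best_pc = -1, best_count = -1;
    for (const auto &kv : threads_at_pc)
        if (kv.second > best_count) { best_pc = kv.first; best_count = kv.second; }
    return best_pc;                                // -1 if the pool is empty
}
```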

Relaxing Implications of Warps
Thread swizzling
v Essentially remap work to threads so as to create more opportunities for DWF → requires a deep understanding of algorithm behavior and data sets
Lane swizzling in hardware
v Provide limited connectivity between register banks and lanes → avoiding full crossbars (29)

Summary
Control flow divergence is a fundamental performance limiter for SIMT execution
Dynamic warp formation is one way to mitigate these effects
v We will look at several others
Must balance a complex set of effects
v Memory behaviors
v Synchronization behaviors
v Scheduler behaviors (30)

Thread Block Compaction
W. Fung and T. Aamodt, HPCA 2011

Goal
Overcome some of the disadvantages of dynamic warp formation
v Impact of scheduling
v Breaking of implicit synchronization
v Reduction of memory coalescing opportunities (32)

DWF Pathologies: Starvation
Majority scheduling
v Best performing: prioritize the largest group of threads with the same PC
v Starvation → LOWER SIMD efficiency!
Other warp scheduler?
v Tricky: variable memory latency
Code: B: if (K > 10)  C: K = 10;  else  D: K = 0;  E: B = C[tid.x] + K;
[Figure: execution timeline illustrating starvation - the majority group keeps issuing at C and D while a minority group waits 1000s of cycles on memory latency before reaching E]
Wilson Fung, Tor Aamodt, Thread Block Compaction 33

DWF Pathologies: Extra Uncoalesced Accesses
Coalesced memory access = memory SIMD
v A first-order CUDA programmer optimization
v Not preserved by DWF
E: B = C[tid.x] + K;
[Figure: without DWF, the accesses to 0x100, 0x140, 0x180 coalesce into #Acc = 3; with DWF the same loads become #Acc = 9 uncoalesced accesses. The L1 cache absorbs the redundant memory traffic, at the cost of L1 port conflicts]
Wilson Fung, Tor Aamodt, Thread Block Compaction 34

DWF Pathologies: Implicit Warp Sync.
Some CUDA applications depend on the lockstep execution of static warps
[Figure: warps 0-2 whose threads implicitly synchronize by virtue of being in the same static warp]
From W. Fung, I. Sham, G. Yuan, and T. Aamodt, "Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware," ACM TACO, June 2009
Wilson Fung, Tor Aamodt (35)

Performance Impact
From W. Fung and T. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," International Symposium on High Performance Computer Architecture, 2011 (36)

Thread Block Compaction
Block-wide reconvergence stack
[Figure: warps 0-2 of a thread block sharing block-wide stack entries (PC | RPC | Active Mask); threads are regrouped into compacted warps at the divergent blocks C and D]
v Regroup threads within a block
Better reconvergence stack: likely convergence
v Converge before the immediate post-dominator
Robust
v Avg. 22% speedup on divergent CUDA apps
v No penalty on others
Wilson Fung, Tor Aamodt, Thread Block Compaction 37

GPU Microarchitecture
[Figure: SIMT cores connected through an interconnection network to memory partitions (last-level cache banks, off-chip DRAM channels); each SIMT core has a SIMT front end (fetch, decode, schedule, branch), a SIMD datapath, and a memory subsystem (shared memory, L1 D$, texture $, constant $, interconnect network port)]
More details in the paper
Wilson Fung, Tor Aamodt, Thread Block Compaction

Observation
Compute kernels usually contain divergent and non-divergent (coherent) code segments
Coalesced memory access usually occurs in coherent code segments
v DWF provides no benefit there
[Figure: coherent code runs as static warps with coalesced loads/stores; at a divergence point warps become dynamic, and at the reconvergence point the warps are reset to the static arrangement]
Wilson Fung, Tor Aamodt, Thread Block Compaction 39

Thread Block Compaction
Branch/reconvergence point
v All available threads arrive at the branch
v Insensitive to warp scheduling
Run a thread block like a warp
v The whole block moves between coherent and divergent code
v Block-wide stack to track execution paths and reconvergence
Warp compaction
v Regrouping with all available threads
v If there is no divergence, this gives the static warp arrangement
(Addresses the DWF pathologies: implicit warp sync., starvation, extra uncoalesced memory accesses)
Wilson Fung, Tor Aamodt, Thread Block Compaction 40

Thread Block Compaction
Code: A: K = A[tid.x];  B: if (K > 10)  C: K = 10;  else  D: K = 0;  E: B = C[tid.x] + K;
[Figure: block-wide stack (PC | RPC | Active Threads) and execution timeline - the block's warps execute A, diverge into compacted C and D warps, and reconverge at E]
Wilson Fung, Tor Aamodt, Thread Block Compaction 41

Thread Block Compaction
Barrier at every basic block?! (Idle pipeline)
Switch to warps from other thread blocks
v Multiple thread blocks run on a core
v Already done in most CUDA applications
[Figure: while block 0 waits at a branch for warp compaction, blocks 1 and 2 continue execution, hiding the barrier latency over time]
Wilson Fung, Tor Aamodt, Thread Block Compaction 42

High Level View
DWF: warps are broken down every cycle and threads are shepherded into new warps (PC-Warp LUT and warp pool)
TBC: warps are broken down at potentially divergent points and threads are compacted across the thread block (43)

Microarchitecture Modification
Per-warp stack → block-wide stack
I-buffer + TIDs → warp buffer
v Stores the dynamic warps
New unit: thread compactor
v Translates active masks into compacted dynamic warps
More detail in the paper
[Figure: modified SIMT core pipeline - fetch, I-cache, decode, warp buffer with scoreboard, thread compactor fed by the block-wide stack (branch target PC, valid bits, predicate/active mask), issue, register file, ALUs, memory unit]
Wilson Fung, Tor Aamodt, Thread Block Compaction 44

Microarchitecture Modification (2)
[Figure: the block-wide stack supplies the number of compacted warps and the thread compactor produces the TIDs of the compacted warps]
From W. Fung and T. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," International Symposium on High Performance Computer Architecture, 2011 (45)

Operation
Each priority encoder handles the threads mapped to the same lane
Warp 2 arrives first, creating the two target entries
Subsequent warps update the active mask; when the count of outstanding warps reaches zero, compaction proceeds (priority encoders)
Pick one thread mapped to each lane per dynamic warp (a sketch follows below)
From W. Fung and T. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," International Symposium on High Performance Computer Architecture, 2011 (46)
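
The per-lane compaction just described (and the array of priority encoders on the next slide) can be modeled as below; this is an illustrative sketch that assumes compaction preserves each thread's home lane, not the actual hardware.

```cuda
// Sketch of block-wide compaction with per-lane priority encoding (TBC-style,
// illustrative only).  Input: one bit-vector per static warp giving which
// threads of that warp are active on the current path.  Output: compacted
// dynamic warps; each output warp takes at most one thread per SIMD lane,
// which is what the array of priority encoders enforces in hardware.
#include <cstdint>
#include <vector>

std::vector<std::vector<int>>                 // dynamic warps as lists of TIDs
compact_block(const std::vector<uint32_t> &active, int warp_size = 32)
{
    std::vector<std::vector<int>> dyn_warps;
    std::vector<uint32_t> pending = active;   // threads not yet placed

    bool any_left = true;
    while (any_left) {
        any_left = false;
        std::vector<int> warp(warp_size, -1); // -1 marks an idle lane
        bool used = false;
        for (int lane = 0; lane < warp_size; ++lane) {
            // Priority encoder for this lane: pick the first static warp
            // that still has an active thread mapped to this lane.
            for (size_t w = 0; w < pending.size(); ++w) {
                if (pending[w] & (1u << lane)) {
                    warp[lane] = (int)w * warp_size + lane;  // TID keeps its lane
                    pending[w] &= ~(1u << lane);
                    used = true;
                    break;
                }
            }
        }
        for (uint32_t m : pending) if (m) { any_left = true; break; }
        if (used) dyn_warps.push_back(warp);
    }
    return dyn_warps;
}
```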

Thread Compactor
Converts active masks from the block-wide stack into thread IDs in the warp buffer
Array of priority encoders
[Figure: per-lane priority encoders (P-Enc) select thread IDs from the active mask and write the compacted warps into the warp buffer]
Wilson Fung, Tor Aamodt, Thread Block Compaction 47

Likely-Convergence
Immediate post-dominator: conservative
v All paths from a divergent branch must merge there
Convergence can happen earlier
v When any two of the paths merge
Example:
while (i < K) {
  X = data[i];
  A: if (X == 0)
  B:   result[i] = Y;
  C: else if (X == 1)
  D:   break;          // rarely taken
  E: i++;              // likely convergence point
}
F: return result[i];   // ipdom of A
Extended reconvergence stack to exploit this
v TBC: 30% speedup for Ray Tracing
Details in the paper
Wilson Fung, Tor Aamodt, Thread Block Compaction 48

Likely-Convergence (2)
NVIDIA uses the break instruction for loop exits
v That handles the last example
Our solution: likely-convergence points
[Figure: extended stack entries with fields PC | RPC | LPC | LPos | ActiveThds, showing the example's paths merging early - "Convergence!"]
This paper: only used to capture loop breaks
Wilson Fung, Tor Aamodt, Thread Block Compaction 49

Likely-Convergence (3)
Check for the likely-convergence PC and merge inside the stack
From W. Fung and T. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," International Symposium on High Performance Computer Architecture, 2011 (50)
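
A minimal sketch of a stack entry extended with a likely-convergence point, assuming a simplified merge rule (the field names follow the slide; the exact bookkeeping in the paper differs):

```cuda
// Illustrative sketch of a reconvergence-stack entry extended with a
// likely-convergence point.  Field names follow the slide (PC, RPC, LPC,
// LPos); the merge rule here is a simplified reading of the mechanism.
#include <cstdint>
#include <vector>

struct LcpEntry {
    int      pc;     // next PC for this group of threads
    int      rpc;    // guaranteed reconvergence PC (immediate post-dominator)
    int      lpc;    // likely (earlier) convergence PC, e.g. before a loop break
    int      lpos;   // index of the stack entry to merge into at the LPC
    uint32_t mask;   // active threads on this path
};

// When the top-of-stack path reaches its likely-convergence PC, merge its
// threads into the entry recorded at lpos instead of keeping them separate,
// so they regroup earlier than the immediate post-dominator would allow.
void check_likely_convergence(std::vector<LcpEntry> &stack)
{
    LcpEntry &top = stack.back();
    if (top.lpc >= 0 && top.pc == top.lpc && top.lpos < (int)stack.size() - 1) {
        stack[top.lpos].mask |= top.mask;   // merge inside the stack
        stack.pop_back();                   // this path's threads now wait there
    }
}
```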

Likely-Convergence (4)
Applies to both the per-warp stack (PDOM) and thread block compaction (TBC)
v Enables more thread grouping for TBC
v Side effect: reduces stack usage in some cases
Wilson Fung, Tor Aamodt, Thread Block Compaction 51

Evaluation
Simulation: GPGPU-Sim (2.2.1b)
v ~Quadro FX with L1 & L2 caches
21 benchmarks
v All of the original GPGPU-Sim benchmarks
v Rodinia benchmarks
v Other important applications:
  Face Detection from VisBench (UIUC)
  DNA sequencing (MUMmerGPU++)
  Molecular dynamics simulation (NAMD)
  Ray tracing from NVIDIA Research
Wilson Fung, Tor Aamodt, Thread Block Compaction 52

Experimental Results
Two benchmark groups:
v COH = non-divergent CUDA applications
v DIVG = divergent CUDA applications
[Figure: IPC relative to the baseline per-warp stack - DWF shows serious slowdowns from its pathologies; TBC shows no penalty on COH and a 22% speedup on DIVG]
Wilson Fung, Tor Aamodt, Thread Block Compaction 53

Effect on Memory Traffic
TBC does still generate some extra uncoalesced memory accesses
v The second access will hit in the L1 cache
[Figure: memory stalls and memory traffic normalized to the baseline across the DIVG and COH benchmarks for the TBC variants vs. DWF and the baseline; one outlier at 2.67x, but no change to the overall memory traffic into/out of a core]
Wilson Fung, Tor Aamodt, Thread Block Compaction 54

Conclusion
Thread Block Compaction
v Addressed some key challenges of DWF
v One significant step closer to reality
Benefits from advancements in the reconvergence stack
v Likely-convergence points
v Extensible: integrate other stack-based proposals
Wilson Fung, Tor Aamodt, Thread Block Compaction (55)

CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures
M. Rhu and M. Erez, ISCA 2012

Goals
Improve the performance of inter-warp compaction techniques
Predict when branches diverge
v Borrow the philosophy of branch prediction
Use the prediction to apply compaction only when it is beneficial (57)

Issues with Thread Block Compaction
[Figure: control flow graph A-G with an implicit barrier across warps to collect compaction candidates]
TBC: warps are broken down at potentially divergent points and threads are compacted across the thread block
The barrier synchronization overhead cannot always be hidden
When it works, it works well (58)

Divergence Behavior: A Closer Look
Compaction-ineffective branch: the loop executes a fixed number of times
Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012 (59)

Compaction-Resistant Branches
Control flow graph
Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012 (60)

Compaction-Resistant Branches (2)
Control flow graph
Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012 (61)

Impact of Ineffective Compaction
Threads are shuffled around with no performance improvement
However, this can lead to increased memory divergence!
Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012 (62)

Basic Idea
Only stall and compact when there is a high probability of compaction success
Otherwise allow warps to bypass the (implicit) barrier
Compaction-adequacy predictor!
v Think branch prediction! (63)

Example: TBC vs. CAPRI
Bypassing enables increased overlap of memory references
Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012 (64)

CAPRI: Example
No divergence: no stalling
First divergence: stall, initialize the history; all other warps at this branch now stall
Divergence with history available: predict and update the prediction
All other warps follow (one prediction per branch); the CAPT is updated
Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012 (65)

The Predictor
Prediction uses the active masks of all warps (the available threads in each lane)
v Need to understand what could have happened
Actual compaction only uses the warps that actually stalled
Minimum provides the maximum compaction ability, i.e., the number of compacted warps
Update the history predictor accordingly (a sketch follows below)
Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012 (66)
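
A sketch of the adequacy test and a one-bit history predictor in the spirit of CAPRI, assuming a lane-preserving compactor (so the most populated lane determines the compacted-warp count); the predictor interface and the CAPT analogue below are hypothetical, not the paper's exact structures.

```cuda
// Sketch of a compaction-adequacy check and history predictor (illustrative).
#include <cstdint>
#include <unordered_map>
#include <vector>

// A lane-preserving compactor needs as many dynamic warps as the most
// populated lane, so compaction helps only if that count is smaller than
// the number of warps that would otherwise stall at the barrier.
bool compaction_would_help(const std::vector<uint32_t> &path_masks, int warp_size = 32)
{
    int max_per_lane = 0;
    for (int lane = 0; lane < warp_size; ++lane) {
        int count = 0;
        for (uint32_t m : path_masks)
            if (m & (1u << lane)) ++count;
        if (count > max_per_lane) max_per_lane = count;
    }
    int nonempty_warps = 0;
    for (uint32_t m : path_masks) if (m) ++nonempty_warps;
    return max_per_lane < nonempty_warps;     // fewer warps after compaction?
}

// Minimal one-bit history predictor per branch PC (hypothetical interface):
// predict "compact" if the last compaction attempt at this branch helped.
struct CompactionPredictor {
    std::unordered_map<int, bool> history;                  // the CAPT analogue
    bool predict(int branch_pc) const {
        auto it = history.find(branch_pc);
        return it == history.end() ? true : it->second;     // default: compact
    }
    void update(int branch_pc, bool helped) { history[branch_pc] = helped; }
};
```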

Behavior
Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012 (67)

Impact of Implicit Barriers
The idle cycle count helps us understand the negative effects of implicit barriers in TBC
Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012 (68)

Summary
The synchronization overhead of thread block compaction can introduce performance degradation
Some branches diverge more than others
Apply TBC judiciously → predict when it is beneficial
Effectively predict when inter-warp compaction will be effective (69)

M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013

Goals
Understand the limitations of compaction techniques and their proximity to ideal compaction
Provide mechanisms to overcome these limitations and approach ideal compaction rates (71)

Limitations of Compaction
Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013 (72)

Mapping Threads to Lanes: Today
Linearization of thread IDs, then modulo assignment to lanes
[Figure: a grid of thread blocks; within a block, threads (0,0,0), (0,0,1), ... are linearized into warps 0, 1, ... and assigned to the scalar pipelines and their per-lane register files (RF0 .. RFn-1) by lane index] (73)

Data-Dependent Branches
Data-dependent control flow is less likely to produce lane conflicts
Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013 (74)
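
For reference, the baseline mapping shown above is just linearization followed by a modulo; a minimal device-side sketch:

```cuda
// Baseline (modulo) thread-to-lane mapping used by current GPUs.
__device__ void baseline_mapping(int *warp_id, int *lane_id)
{
    // Linearize the 3-D thread index within the block (x varies fastest).
    int linear_tid = threadIdx.x
                   + threadIdx.y * blockDim.x
                   + threadIdx.z * blockDim.x * blockDim.y;
    *warp_id = linear_tid / warpSize;   // which warp of the block
    *lane_id = linear_tid % warpSize;   // which SIMD lane within that warp
}
```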

Programmatic Branches
Programmatic branches can be correlated with lane assignments (under modulo assignment)
Program variables that behave like constants across threads can produce correlated branching behavior
Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013 (75)

P-Branches vs. D-Branches
P-branches are the problem!
Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013 (76)

Compaction Opportunities
Lane reassignment can improve compaction opportunities
Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013 (77)

Aligned Divergence
Threads mapped to the same lane tend to evaluate (programmatic) predicates the same way
v Empirically, this is rarely exhibited for input- or data-dependent control flow behavior
Compaction cannot help in the presence of lane conflicts → the performance of compaction mechanisms depends on both divergence patterns and lane conflicts
We need to understand the impact of lane assignment (78)

Impact of Lane Reassignment
Goal: improve compactability
Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013 (79)

Random Permutations
Do not always work well → they work well on average
A better understanding of programs can lead to better permutation choices
Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013 (80)

Mapping Threads to Lanes: New
[Figure: warps 0 .. N-1 mapped onto the scalar pipelines and per-lane register files with a remapped lane assignment; SIMD width = 8]
What criteria do we use for lane assignment? (81)

Balanced Permutation
Even warps: permutation within a half warp (XOR with evenwid = wid/2)
Odd warps: additionally swap the upper and lower half warps
Each lane has a single instance of a logical thread from each warp
Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013 (82)

Balanced Permutation (2)
Logical TID 0 of each warp is now assigned to a different lane
Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013 (83)

Characteristics
Vertical balance: each lane only has logical TIDs of distinct threads in a warp
Horizontal balance: logical TID x in all of the warps is bound to different lanes
This works when CTAs have fewer warps than the SIMD width: why?
Note that random permutations achieve this only on average (84)
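
A minimal XOR-based remapping in the spirit of balanced permutation (an illustrative sketch; the paper's actual scheme treats even and odd warps differently to preserve half-warp structure). XORing the lane index with the warp index already yields both balance properties above when the block has at most SIMD-width warps and the width is a power of two.

```cuda
// Illustrative XOR-based lane permutation (not the paper's exact scheme).
// For warp w and original lane l, the new lane is l XOR w.  Per warp this is
// a bijection on lanes (vertical balance), and for a fixed logical lane l,
// different warps w map it to different physical lanes (horizontal balance),
// provided the block has no more warps than the SIMD width (a power of two).
__host__ __device__ inline int permuted_lane(int warp_id, int lane_id, int simd_width)
{
    return (lane_id ^ warp_id) & (simd_width - 1);
}
```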

Impact on Memory Coalescing
Modern GPUs do not require ordered requests
Coalescing can occur across a set of requests → specific lane assignments do not affect coalescing behavior
The increase in L1 miss rate is offset by the benefits of compaction
Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013 (85)

Speedup of Compaction
Can improve the compaction rate for divergence due to the majority of programmatic branches
Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013 (86)

Compaction Rate vs. Utilization
Distinguish between compaction rate and utilization!
Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013 (87)

Application of Balanced Permutation
[Figure: SIMT core pipeline - I-Fetch, Decode, I-Buffer, Issue, PRF/RF, scalar pipelines, D-Cache, Writeback, pending warps]
The permutation is applied when the warp is launched
It is maintained for the life of the warp
It does not affect the baseline compaction mechanism
Enable/disable SLP to preserve target-specific, programmer-implemented optimizations (88)

Summary
Structural hazards limit the performance improvements from inter-warp compaction
Program behaviors produce correlated lane assignments today
Remapping threads to lanes extends compaction opportunities (89)

Summary: Inter-Warp Compaction
[Figure: control flow graph A-G spanning thread blocks, with scope for program properties, resource management, and microarchitecture]
Co-design of applications, resource management, software, and microarchitecture (90)

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)
[Legend: active thread / idle thread]

Intra-Warp Compaction Techniques
Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

Goal
Compaction: compact threads in a warp to coalesce (and eliminate)


More information

CS377P Programming for Performance GPU Programming - II

CS377P Programming for Performance GPU Programming - II CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline

More information

Motivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture

Motivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture Motivation Banked Register File for SMT Processors Jessica H. Tseng and Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA BARC2004 Increasing demand on

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 18-447: Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 Reminder: Homework 5 (Today) Due April 3 (Wednesday!) Topics: Vector processing,

More information

Scalarization and Temporal SIMT in GPUs: Reducing Redundant Operations for Better Performance and Higher Energy Efficiency. Jan Lucas TU Berlin - AES

Scalarization and Temporal SIMT in GPUs: Reducing Redundant Operations for Better Performance and Higher Energy Efficiency. Jan Lucas TU Berlin - AES Scalarization and Temporal SIMT in GPUs: Reducing Redundant Operations for Better Performance and Higher Energy Efficiency Jan Lucas TU Berlin - AES Overview What is a Scalarization? Why are Scalar Operations

More information

Maximizing Face Detection Performance

Maximizing Face Detection Performance Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount

More information

Nam Sung Kim. w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices *

Nam Sung Kim. w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices * Nam Sung Kim w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices * modern GPU architectures deeply pipelined for efficient resource sharing several buffering

More information

Write only as much as necessary. Be brief!

Write only as much as necessary. Be brief! 1 CIS371 Computer Organization and Design Final Exam Prof. Martin Wednesday, May 2nd, 2012 This exam is an individual-work exam. Write your answers on these pages. Additional pages may be attached (with

More information

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Convolution Soup: A case study in CUDA optimization The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Optimization GPUs are very fast BUT Naïve programming can result in disappointing performance

More information