Introduction to Control Divergence
Introduction to Control Divergence Lecture Slides and Figures contributed from sources as noted (1) Objective Understand the occurrence of control divergence and the concept of thread reconvergence v Also described as branch divergence and thread divergence Cover a basic thread reconvergence mechanism PDOM v Set up discussion of further optimizations and advanced techniques Explore one approach for mitigating the performance degradation due to control divergence dynamic warp formation (2)
Reading W. Fung, I. Sham, G. Yuan, and T. Aamodt, Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware, ACM TACO, June 2009 W. Fung and T. Aamodt, Thread Block Compaction for Efficient SIMT Control Flow, International Symposium on High Performance Computer Architecture, 2011 M. Rhu and M. Erez, CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures, ISCA 2012 Figure from M. Rhu and M. Erez, Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation, ISCA 2013 (3) Handling Branches CUDA Code: if( ) (True for some threads) else (True for others) T T T T taken not taken What if threads take different branches? Branch Divergence! (4)
Branch Divergence Occurs within a warp Branches lead to serialization of branch-dependent code v Performance issue: low warp utilization if( ) { } Idle threads else { } Reconvergence! (5) Branch Divergence A B C D E F G Thread Warp Thread 1 Thread 2 Thread 3 Thread 4 Common PC Different threads follow different control flow paths through the kernel code Thread execution is (partially) serialized v Subset of threads that follow the same path execute in parallel Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
Basic Idea Split: partition a warp v Two mutually exclusive thread subsets, each branching to a different target v Identify subsets with two activity masks → effectively two warps Join: merge two subsets of a previously split warp v Reconverge the mutually exclusive sets of threads Orchestrate the correct execution for nested branches Note the long history of techniques in SIMD processors (see background in Fung et al.) Thread Warp T1 T2 T3 T4 Common PC activity mask (7) Thread Reconvergence Fundamental problem: v Merge threads with the same PC v How do we sequence execution of threads? Since this can affect the ability to reconverge Question: When can threads productively reconverge? Question: When is the best time to reconverge? A B C D E F G (8)
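The split and join operations on activity masks can be sketched in a few lines of code. This is an illustrative model only; the mask representation and function names are assumptions, not from the lecture:

```python
# Illustrative sketch (assumed representation): an activity mask is a
# list of 0/1 per lane. Split produces two mutually exclusive masks;
# join reconverges them.

def split_warp(active_mask, taken):
    """Partition an active mask at a divergent branch.

    taken[i] = 1 if lane i's thread takes the branch. The two returned
    masks never overlap and OR back to the original mask.
    """
    taken_mask     = [a & t       for a, t in zip(active_mask, taken)]
    not_taken_mask = [a & (1 - t) for a, t in zip(active_mask, taken)]
    return taken_mask, not_taken_mask

def join_warp(mask_a, mask_b):
    """Reconverge two mutually exclusive subsets of a split warp."""
    assert not any(a & b for a, b in zip(mask_a, mask_b)), "subsets overlap"
    return [a | b for a, b in zip(mask_a, mask_b)]
```

For a fully active 4-thread warp where threads 0 and 3 take the branch, `split_warp([1,1,1,1], [1,0,0,1])` yields the masks `[1,0,0,1]` and `[0,1,1,0]`, and `join_warp` recombines them at the reconvergence point.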
Dominator Node d dominates node n if every path from the entry node to n must go through d A B C D E F G (9) Immediate Dominator Node d immediately dominates node n if every path from the entry node to n must go through d and no other dominator of n lies between d and n A B C D E F G (10)
Post Dominator Node d post-dominates node n if every path from node n to the exit node must go through d A B C D E F G (11) Immediate Post Dominator Node d immediately post-dominates node n if every path from node n to the exit node must go through d and no other post-dominator of n lies between d and n A B C D E F G (12)
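These definitions translate directly into the textbook iterative dataflow computation. The sketch below (an illustrative reconstruction run on a CFG shaped like the lecture's A..G example, not a hardware mechanism) computes post-dominator sets and the immediate post-dominator:

```python
# Hypothetical sketch: post-dominators via iterative dataflow,
# pdom(n) = {n} ∪ ⋂ pdom(s) over successors s of n.

def post_dominators(succ):
    """succ maps each node to its successor list; exit nodes have []."""
    nodes = list(succ)
    pdom = {n: ({n} if not succ[n] else set(nodes)) for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if not succ[n]:
                continue
            new = {n} | set.intersection(*[pdom[s] for s in succ[n]])
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom

def immediate_post_dominator(succ, n):
    """The strict post-dominator of n that every other strict
    post-dominator of n also post-dominates (the 'closest' one)."""
    pdom = post_dominators(succ)
    strict = pdom[n] - {n}
    for d in strict:
        if strict <= pdom[d]:
            return d
    return None

# Example CFG: B branches to C and D, which reconverge at E, then G (exit).
CFG = {'A': ['B'], 'B': ['C', 'D'], 'C': ['E'], 'D': ['E'],
       'E': ['G'], 'G': []}
```

Here `immediate_post_dominator(CFG, 'B')` is `'E'`: the first block every path out of the divergent branch at B must pass through, and hence the reconvergence point used by PDOM.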
Baseline: PDOM A A/1111 B B/1111 C C/1001 D D/0110 E E/1111 G G/1111 Stack (Reconv. PC, Next PC, Active Mask): - G 1111; TOS E D 0110; TOS E C 1001 Thread Warp Thread 1 Thread 2 Thread 3 Thread 4 Common PC Execution sequence: A B C D E G Time Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt A stack entry is a specification of a group of active threads that will execute that basic block The natural nested structure of control flow exposes the use of stack-based serialization (14)
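The stack-based serialization above can be mimicked in software. The sketch below is an assumption-laden model (not GPGPU-Sim code): each stack entry holds a reconvergence PC, a next PC, and an active mask, and popping an entry executes its path until the reconvergence PC is reached. It reproduces the A-B-C-D-E-G schedule and masks from the figure:

```python
# Illustrative model of a SIMT reconvergence (PDOM) stack.
# Masks are strings like '1001'; names and encodings are assumptions.

def simt_execute(succ, ipdom, branch_masks, start, full_mask):
    """succ: CFG successors; ipdom: immediate post-dominator of each
    divergent block; branch_masks[block][target]: mask of threads taking
    that target. Returns the serialized (block, mask) execution trace."""
    trace = []
    stack = [(None, start, full_mask)]      # (RPC, next PC, active mask)
    while stack:
        rpc, pc, mask = stack.pop()
        while pc is not None and pc != rpc:
            trace.append((pc, mask))
            targets = succ[pc]
            if len(targets) <= 1:
                pc = targets[0] if targets else None
            else:                            # divergent branch
                r = ipdom[pc]
                stack.append((rpc, r, mask))               # resume after join
                for t in reversed(targets):                # serialize sides
                    stack.append((r, t, branch_masks[pc][t]))
                pc = rpc                                   # entry is done
    return trace

CFG = {'A': ['B'], 'B': ['C', 'D'], 'C': ['E'], 'D': ['E'],
       'E': ['G'], 'G': []}
trace = simt_execute(CFG, {'B': 'E'},
                     {'B': {'C': '1001', 'D': '0110'}}, 'A', '1111')
```

Popping the taken-path entry executes C with mask 1001 until its PC equals the entry's RPC (E); the same happens for D with 0110, after which the bottom entry resumes at E with the full mask — exactly the "when RPC = Next PC, pop the stack" rule described on the next slide.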
More Complex Example Stack-based implementation for nested control flow v Stack entry RPC set to IPDOM Re-convergence at the immediate post-dominator of the branch From Fung et al., Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware, ACM TACO, June 2009 (15) Implementation I-Fetch Decode I-Buffer Issue PRF scalar pipelines D-Cache All Hit? Data Writeback pending warps Stack (Reconv. PC, Next PC, Active Mask): - G 1111; TOS E D 0110; TOS E C 1001 GPGPU-Sim model: v Implement per-warp stack at issue stage v Acquire the active mask and PC from the TOS v Scoreboard check prior to issue v Register writeback updates scoreboard and ready bit in instruction buffer v When RPC = Next PC, pop the stack Implications for instruction fetch? From GPGPU-Sim documentation (GPGPU-Sim_3.x_Manual#SIMT_Cores) (16)
Implementation (2) Warp PC (next instruction) compared to reconvergence PC On a branch v Can store the reconvergence PC as part of the branch instruction v Branch unit has NextPC, TargetPC and reconvergence PC to update the stack On reaching a reconvergence point v Pop the stack v Continue fetching from the NextPC of the next entry on the stack (17) Can We Do Better? Warps are formed statically Key idea of dynamic warp formation v Show a pool of warps and how they can be merged At a high level what are the requirements A B C D E F G (18)
Compaction Techniques Can we reform warps so as to increase utilization? Basic idea: Compaction v Reform warps with threads that follow the same control flow path v Increase utilization of warps Two basic types of compaction techniques Inter-warp compaction v Group threads from different warps Intra-warp compaction v Group threads within a warp o Changing the effective warp size (19) Inter-Warp Thread Compaction Techniques Lecture Slides and Figures contributed from sources as noted (20)
Goal Warp 0 Merge threads? Warp 1 Warp 2 Warp 3 if( ) { } Taken Not Taken Warp 4 Warp 5 else { } (21) Reading W. Fung, I. Sham, G. Yuan, and T. Aamodt, Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware, ACM TACO, June 2009 W. Fung and T. Aamodt, Thread Block Compaction for Efficient SIMT Control Flow, International Symposium on High Performance Computer Architecture, 2011 (22)
DWF: Example Baseline A x/1111 y/1111 B x/1110 y/0011 C x/1000 y/0010 D x/0110 y/0001 E x/1110 y/0011 F x/0001 y/1100 G x/1111 y/1111 Legend A A Execution of Warp x at Basic Block A A A Execution of Warp y at Basic Block A D A new warp created from scalar threads of both Warp x and y executing at Basic Block D Baseline: A A B B C C D D E E F F G G A A Dynamic Warp Formation: A A B B C D E F G G A A Time Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt How Does This Work? Criteria for merging v Same PC v Complements of active threads in each warp v Recall: many warps/TB all executing the same code What information do we need to merge two warps v Need thread IDs and PCs Ideally how would you find/merge warps? (24)
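The merging criteria can be expressed as a simple predicate. The sketch below is an illustrative model (names and mask representation are assumptions): two warp fragments can merge only if they sit at the same PC and their active lanes are complementary, since each thread must stay in its home lane to reach its register file bank:

```python
# Hypothetical sketch of the DWF merge test. Masks are 0/1 per lane.

def can_merge(pc1, mask1, pc2, mask2):
    """Mergeable iff same PC and no lane is active in both fragments
    (threads keep their home lane, so occupied lanes must not collide)."""
    return pc1 == pc2 and not any(a and b for a, b in zip(mask1, mask2))

def merge(mask1, mask2):
    """Union of two lane-disjoint activity masks."""
    return [a | b for a, b in zip(mask1, mask2)]
```

For example, fragments at PC `D` with masks `[0,1,1,0]` and `[1,0,0,1]` merge into a fully utilized warp, whereas `[0,1,1,0]` and `[0,1,0,0]` cannot: lane 1 is occupied in both, which is exactly the lane-conflict case the DWF hardware must detect.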
DWF: Microarchitecture Implementation Thread Scheduler with PC-Warp LUT (identifies occupied lanes, points to the warp being formed) Warp Update Registers T and NT (TID x N, REQ, PC) assist in aggregating threads Warp Pool entries (PC, TID x N, Prio) Warp Allocator Issue Logic Decode I-Cache Commit/Writeback ALU 1-4 and RF 1-4 addressed by (TID, Reg#) Warps are formed dynamically in the warp pool After commit, check the PC-Warp LUT and either merge into or allocate a newly forming warp Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt DWF: Microarchitecture Implementation Example A: BEQ R2, B Warps X and Y diverge at the branch; their threads are regrouped in warp pool entries for the taken and not-taken PCs No Lane Conflict Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
Resource Usage Ideally would like a small number of unique PCs in progress at a time → minimize overhead Warp divergence will increase the number of unique PCs v Mitigate via warp scheduling Scheduling policies v FIFO v Program counter: address variation as a measure of divergence v Majority/Minority: most common path vs. helping stragglers v Post dominator (catch up) (27) Hardware Consequences Expose the implications that warps have in the base design v Implications for register file access → lane-aware DWF Register bank conflicts From Fung et al., Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware, ACM TACO, June 2009 (28)
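The majority policy can be sketched as a vote over the warp pool. This is an illustrative model of the policy's intent only (the pool representation is an assumption): pick the PC backed by the largest number of pending threads, so big groups make progress and stay mergeable:

```python
from collections import Counter

def majority_pc(warp_pool):
    """warp_pool: list of (pc, active_mask) fragments, mask = 0/1 per lane.
    Return the PC with the most pending threads (ties broken arbitrarily),
    i.e. the 'majority' scheduling choice."""
    votes = Counter()
    for pc, mask in warp_pool:
        votes[pc] += sum(mask)
    return votes.most_common(1)[0][0]
```

With fragments `[('B',[1,1,1,0]), ('C',[1,0,0,0]), ('B',[1,1,0,0])]`, PC `B` wins with 5 threads versus 1 for `C`; the starvation pathology discussed later arises precisely because the minority PC (`C`) can then wait indefinitely.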
Relaxing Implications of Warps Thread swizzling v Essentially remap work to threads so as to create more opportunities for DWF → requires deep understanding of algorithm behavior and data sets Lane swizzling in hardware v Provide limited connectivity between register banks and lanes → avoiding full crossbars (29) Summary Control flow divergence is a fundamental performance limiter for SIMT execution Dynamic warp formation is one way to mitigate these effects v We will look at several others Must balance a complex set of effects v Memory behaviors v Synchronization behaviors v Scheduler behaviors (30)
Thread Block Compaction W. Fung and T. Aamodt HPCA 2011 Goal Overcome some of the disadvantages of dynamic warp formation v Impact of scheduling v Breaking implicit synchronization v Reduction of memory coalescing opportunities (32)
DWF Pathologies: Starvation Majority Scheduling Best Performing Prioritize largest group of threads with same PC Starvation LOWER SIMD Efficiency! Other Warp Scheduler? Tricky: Variable Memory Latency B: if (K > 10) C: K = 10; else D: K = 0; E: B = C[tid.x] + K; C C D D E E 1000s cycles D D E E Time Wilson Fung, Tor Aamodt Thread Block Compaction 33 DWF Pathologies: Extra Uncoalesced Accesses Coalesced Memory Access = Memory SIMD 1st Order CUDA Programmer Optimization Not preserved by DWF E: B = C[tid.x] + K; No DWF #Acc = 3 With DWF #Acc = 9 0x100 0x140 0x180 Memory L1 Cache Absorbs Redundant Memory Traffic L1$ Port Conflict Wilson Fung, Tor Aamodt Thread Block Compaction 34
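The 3-versus-9 access counts on the slide can be reproduced with a small model. The sketch below is illustrative and assumes 16-thread warps, 4-byte elements, and 64-byte memory segments starting at 0x100 (assumptions chosen to match the slide's addresses, not stated in the lecture):

```python
# Hypothetical coalescing model: every thread t loads C[t], and each warp
# issues one access per distinct 64-byte segment it touches.

def memory_accesses(warps, base=0x100, elem_bytes=4, seg_bytes=64):
    total = 0
    for tids in warps:
        segments = {(base + t * elem_bytes) // seg_bytes for t in tids}
        total += len(segments)
    return total

# Static warps: contiguous thread IDs, one 64B segment each.
static_warps = [list(range(0, 16)), list(range(16, 32)), list(range(32, 48))]

# After DWF, each reformed warp mixes threads from all three static warps,
# so each warp now touches all three segments.
dwf_warps = [[t for t in range(48) if t % 3 == i] for i in range(3)]
```

With static warps the model issues 3 accesses total (one per warp); with the shuffled DWF warps it issues 9, matching the slide's point that DWF does not preserve this first-order programmer optimization.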
DWF Pathologies: Implicit Warp Sync. Some CUDA applications depend on the lockstep execution of static warps Warp 0 Warp 1 Warp 2 From W. Fung, I. Sham, G. Yuan, and T. Aamodt, Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware, ACM TACO, June 2009 Wilson Fung, Tor Aamodt (35) Performance Impact From W. Fung and T. Aamodt, Thread Block Compaction for Efficient SIMT Control Flow, International Symposium on High Performance Computer Architecture, 2011 (36)
Thread Block Compaction n Block-wide Reconvergence Stack q One stack per thread block: each entry holds PC, RPC, and an active mask per warp (Warp 0, Warp 1, Warp 2) q Regroup threads within a block n Better Reconv. Stack: Likely Convergence q Converge before Immediate Post-Dominator n Robust q Avg. 22% speedup on divergent CUDA apps q No penalty on others Wilson Fung, Tor Aamodt Thread Block Compaction 37 GPU Microarchitecture SIMT Cores Done (Warp ID) Interconnection Network Memory Partition Last-Level Cache Banks Off-Chip DRAM Channel SIMT Front End SIMD Datapath Fetch Decode Schedule Branch Memory Subsystem SMem L1 D$ Tex $ Const$ Icnt. Network More Details in Paper Wilson Fung, Tor Aamodt Thread Block Compaction
Observation n Compute kernels usually contain divergent and non-divergent (coherent) code segments n Coalesced memory accesses usually occur in coherent code segments q DWF provides no benefit there Coherent Divergent Static Warp Divergence Dynamic Warp Reset Warps Coales. LD/ST Coherent Static Warp Recvg Pt. Wilson Fung, Tor Aamodt Thread Block Compaction 39 Thread Block Compaction n Branch/reconverge pt. q All avail. threads arrive at branch q Insensitive to warp scheduling n Run a thread block like a warp q Whole block moves between coherent/divergent code q Block-wide stack to track exec. paths reconvg. n Warp compaction q Regrouping with all avail. threads addresses Implicit Warp Sync., Starvation, and Extra Uncoalesced Memory Access q If no divergence, gives static warp arrangement Wilson Fung, Tor Aamodt Thread Block Compaction 40
Thread Block Compaction Block-wide stack (PC, RPC, Active Threads) A: K = A[tid.x]; B: if (K > 10) C: K = 10; else D: K = 0; E: B = C[tid.x] + K; Execution: divergent warps C C D D compacted across the block into C C C D D D Time Wilson Fung, Tor Aamodt Thread Block Compaction 41 Thread Block Compaction n Barrier every basic block?! (Idle pipeline) n Switch to warps from other thread blocks q Multiple thread blocks run on a core q Already done in most CUDA applications Branch Warp Compaction Block 0 Execution Block 1 Execution Block 2 Execution Time Wilson Fung, Tor Aamodt Thread Block Compaction 42
High Level View DWF: warp broken down every cycle and threads in a warp shepherded into a new warp (LUT and warp pool) TBC: warps broken down at potentially divergent points and threads compacted across the thread block (43) Microarchitecture Modification n Per-Warp Stack → Block-Wide Stack n I-Buffer + TIDs → Warp Buffer q Store the dynamic warps n New Unit: Thread Compactor q Translate active mask to compact dynamic warps n More Detail in Paper Fetch I-Cache Branch Target PC Valid[1:N] Decode Warp Buffer Score-Board Thread Compactor Issue Done (WID) Block-Wide Stack Pred. Active Mask RegFile ALU ALU MEM Wilson Fung, Tor Aamodt Thread Block Compaction 44
Microarchitecture Modification (2) TIDs of compacted warps #compacted warps From W. Fung and T. Aamodt, Thread Block Compaction for Efficient SIMT Control Flow, International Symposium on High Performance Computer Architecture, 2011 (45) Operation All threads mapped to the same lane Warp 2 arrives first, creating 2 target entries Subsequent warps update the active mask When the count of warps yet to arrive reaches zero, compact (priority encoder) Pick a thread mapped to this lane From W. Fung and T. Aamodt, Thread Block Compaction for Efficient SIMT Control Flow, International Symposium on High Performance Computer Architecture, 2011 (46)
Thread Compactor n Convert active masks from the block-wide stack to thread IDs in the warp buffer n Array of Priority-Encoders P-Enc P-Enc P-Enc P-Enc Warp Buffer Wilson Fung, Tor Aamodt Thread Block Compaction 47 Likely-Convergence n Immediate Post-Dominator: Conservative q All paths from divergent branch must merge there n Convergence can happen earlier q When any two of the paths merge while (i < K) { X = data[i]; A: if ( X == 0 ) B: result[i] = Y; C: else if ( X == 1 ) D: break; E: i++; } ipdom of A F: return result[i]; Rarely Taken n Extended Recvg. Stack to exploit this q TBC: 30% speedup for Ray Tracing Details in Paper Wilson Fung, Tor Aamodt Thread Block Compaction 48
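The per-lane priority encoding can be modeled compactly. The sketch below is an illustrative reconstruction (function name and representation are assumptions): active thread IDs are bucketed by home lane (tid mod warp size), and each "cycle" one thread per lane is drained into a compacted warp, with `None` marking an idle lane:

```python
# Hypothetical model of the thread compactor: one priority encoder per
# lane selects one pending thread for that lane per compacted warp.

def compact_block(active_tids, warp_size):
    """Compact a block-wide set of active thread IDs into dynamic warps.
    The home-lane constraint (tid % warp_size) is preserved so each
    thread still reaches its register file bank."""
    lanes = [sorted(t for t in active_tids if t % warp_size == lane)
             for lane in range(warp_size)]
    n_warps = max((len(q) for q in lanes), default=0)
    return [[q[i] if i < len(q) else None for q in lanes]
            for i in range(n_warps)]
```

For active threads {0, 1, 2, 5, 6, 7} with a warp size of 4, lanes 1 and 2 each hold two pending threads, so two compacted warps are produced: `[0, 1, 2, 7]` and `[None, 5, 6, None]`. The busiest lane dictates the warp count, which is the lane-conflict limitation the SLP paper later targets.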
Likely-Convergence (2) n NVIDIA uses break instruction for loop exits q That handles the last example n Our solution: Likely-Convergence Points Stack extended with fields: PC, RPC, LPC, LPos, ActiveThds Convergence! n This paper: only used to capture loop-breaks Wilson Fung, Tor Aamodt Thread Block Compaction 49 Likely-Convergence (3) Check! Merge inside the stack From W. Fung and T. Aamodt, Thread Block Compaction for Efficient SIMT Control Flow, International Symposium on High Performance Computer Architecture, 2011 (50)
Likely-Convergence (4) n Applies to both per-warp stack (PDOM) and thread block compaction (TBC) q Enables more thread grouping for TBC q Side effect: Reduces stack usage in some cases Wilson Fung, Tor Aamodt Thread Block Compaction 51 Evaluation n Simulation: GPGPU-Sim (2.2.1b) q ~Quadro FX, L1 & L2 Caches n 21 Benchmarks q All of GPGPU-Sim original benchmarks q Rodinia benchmarks q Other important applications: n Face Detection from Visbench (UIUC) n DNA Sequencing (MUMmerGPU++) n Molecular Dynamics Simulation (NAMD) n Ray Tracing from NVIDIA Research Wilson Fung, Tor Aamodt Thread Block Compaction 52
Experimental Results n 2 Benchmark Groups: q COH = Non-Divergent CUDA applications q DIVG = Divergent CUDA applications DWF: Serious Slowdown from pathologies TBC: No Penalty for COH, 22% Speedup on DIVG IPC Relative to Baseline Per-Warp Stack Wilson Fung, Tor Aamodt Thread Block Compaction 53 Effect on Memory Traffic n TBC does still generate some extra uncoalesced memory accesses q 2nd Acc will hit the L1 cache n No Change to Overall Memory Traffic In/out of a core Memory Traffic Normalized to Baseline (0%-300%): TBC-AG and TBC-RRB vs. DWF (up to 2.67x) and Baseline, across DIVG (BFS2, FCDT, HOTSP, LPS, MUM, MUMpp, NAMD, NVRT) and COH (AES, BACKP, CP, DG, HRTWL, LIB, LKYT, MGST, NNC, RAY, STMCL, STO, WP) benchmarks Wilson Fung, Tor Aamodt Thread Block Compaction 54
Conclusion Thread Block Compaction v Addressed some key challenges of DWF v One significant step closer to reality Benefit from advancements on the reconvergence stack v Likely-Convergence Point v Extensible: Integrate other stack-based proposals Wilson Fung, Tor Aamodt Thread Block Compaction (55) CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures M. Rhu and M. Erez ISCA 2012
Goals Improve the performance of inter-warp compaction techniques Predict when branches diverge v Borrow philosophy from branch prediction Use prediction to apply compaction only when it is beneficial (57) Issues with Thread Block Compaction A B C D E F G Implicit barrier across warps to collect compaction candidates TBC: warps broken down at potentially divergent points and threads compacted across the thread block Barrier synchronization overhead cannot always be hidden When it works, it works well (58)
Divergence Behavior: A Closer Look Compaction-ineffective branch Loop executes a fixed number of times Figure from M. Rhu and M. Erez, CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures, ISCA 2012 (59) Compaction-Resistant Branches Control flow graph Figure from M. Rhu and M. Erez, CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures, ISCA 2012 (60)
Compaction-Resistant Branches (2) Control flow graph Figure from M. Rhu and M. Erez, CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures, ISCA 2012 (61) Impact of Ineffective Compaction Threads shuffled around with no performance improvement However, can lead to increased memory divergence! Figure from M. Rhu and M. Erez, CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures, ISCA 2012 (62)
Basic Idea Only stall and compact when there is a high probability of compaction success Otherwise allow warps to bypass the (implicit) barrier Compaction adequacy predictor! v Think branch prediction! (63) Example: TBC vs. CAPRI Bypassing enables increased overlap of memory references Figure from M. Rhu and M. Erez, CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures, ISCA 2012 (64)
CAPRI: Example No divergence, no stalling Diverge, stall, initialize history; all others will now stall Diverge and history available: predict, update prediction All other warps follow (one prediction per branch), CAPT updated Figure from M. Rhu and M. Erez, CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures, ISCA 2012 (65) The Predictor Available threads in a lane Prediction uses active masks of all warps v Need to understand what could have happened Actual compaction only uses the actually stalled warps The minimum provides the maximum compaction ability, i.e., #compacted warps Update the history predictor accordingly Figure from M. Rhu and M. Erez, CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures, ISCA 2012 (66)
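One way to compute compaction adequacy from the active masks can be sketched as follows. This is an illustrative reconstruction, not the CAPRI hardware: the function names and the adequacy criterion (compaction must yield strictly fewer warps than simply issuing the inputs) are assumptions:

```python
# Hypothetical model of a compaction-adequacy test over active masks.
# Each mask is a list of 0/1 per lane, one mask per warp at the branch.

def compacted_warp_count(masks):
    """#warps after ideal compaction. Because threads cannot leave their
    home lane, the busiest lane bounds the result."""
    return max(sum(lane) for lane in zip(*masks))

def compaction_adequate(masks):
    """Adequate only if compaction yields fewer warps than issuing the
    non-empty input warps unchanged."""
    issued = sum(1 for m in masks if any(m))
    return compacted_warp_count(masks) < issued
```

Two warps with masks `[1,0,1,0]` and `[1,0,0,1]` both occupy lane 0, so compaction still needs 2 warps and stalling at the barrier buys nothing; masks `[1,0,0,0]` and `[0,1,1,1]` compact into a single warp, so stalling pays off. A CAPRI-style predictor learns which branches tend to fall in which class.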
Behavior Figure from M. Rhu and M. Erez, CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures, ISCA 2012 (67) Impact of Implicit Barriers Idle cycle count helps us understand the negative effects of implicit barriers in TBC Figure from M. Rhu and M. Erez, CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures, ISCA 2012 (68)
Summary The synchronization overhead of thread block compaction can introduce performance degradation Some branches diverge more than others Apply TBC judiciously → predict when it is beneficial Effectively predict when inter-warp compaction is effective (69) M. Rhu and M. Erez, Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation, ISCA 2013
Goals Understand the limitations of compaction techniques and proximity to ideal compaction Provide mechanisms to overcome these limitations and approach ideal compaction rates (71) Limitations of Compaction Figure from M. Rhu and M. Erez, Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation, ISCA 2013 (72)
Mapping Threads to Lanes: Today Linearization of thread IDs Grid 1 Block (0, 0) Block (1, 0) Block (0, 1) Block (1, 1) Warp 0 Warp 1 Modulo assignment to lanes: lane = linearized thread ID mod warp width, with one register file bank (RF0..RFn-1) and one scalar pipeline per lane (73) Data Dependent Branches Data-dependent control flows are less likely to produce lane conflicts Figure from M. Rhu and M. Erez, Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation, ISCA 2013 (74)
Programmatic Branches Programmatic branches can be correlated to lane assignments (with modulo assignment) Program variables that operate like constants across threads can produce correlated branching behaviors Figure from M. Rhu and M. Erez, Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation, ISCA 2013 (75) P-Branches vs. D-Branches P-branches are the problem! Figure from M. Rhu and M. Erez, Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation, ISCA 2013 (76)
Compaction Opportunities Lane reassignment can improve compaction opportunities Figure from M. Rhu and M. Erez, Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation, ISCA 2013 (77) Aligned Divergence Threads mapped to a lane tend to evaluate (programmatic) predicates the same way v Empirically, rarely exhibited for input-data-dependent control flow behavior Compaction cannot help in the presence of lane conflicts → performance of compaction mechanisms depends on both divergence patterns and lane conflicts We need to understand the impact of lane assignment (78)
Impact of Lane Reassignment Goal: Improve compactability Figure from M. Rhu and M. Erez, Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation, ISCA 2013 (79) Random Permutations Do not always work well → work well on average Better understanding of programs can lead to better permutation choices Figure from M. Rhu and M. Erez, Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation, ISCA 2013 (80)
Mapping Threads to Lanes: New Warp 0 Warp 1 ... Warp N-1, each lane with its register file bank (RF0..RFn-1) and scalar pipeline SIMD Width = 8 What criteria do we use for lane assignment? (81) Balanced Permutation Even warps: permutation within a half warp (XOR with wid/2) Odd warps: additionally swap the upper and lower halves Each lane has a single instance of a logical thread from each warp Figure from M. Rhu and M. Erez, Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation, ISCA 2013 (82)
Balanced Permutation (2) Even warps: XOR with wid/2 Logical TID of 0 in each warp is now assigned a different lane Figure from M. Rhu and M. Erez, Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation, ISCA 2013 (83) Characteristics Vertical Balance: Each lane only has logical TIDs of distinct threads in a warp Horizontal Balance: Logical TID x in all of the warps is bound to different lanes This works when CTAs have fewer than SIMD_Width warps: why? Note that random permutations achieve this only on average (84)
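A concrete permutation with both balance properties can be sketched as below. The exact formula is a reconstruction in the spirit of the slides (an assumption, not taken verbatim from the paper): even warps XOR the lane index with warp_id/2, which stays within a half warp, and odd warps additionally swap the upper and lower halves:

```python
# Hypothetical balanced-permutation sketch for SIMD width 8.
# The formula is an assumption chosen to satisfy both balance properties.

def balanced_lane(warp_id, logical_tid, width=8):
    """Map a warp's logical thread ID to a physical lane.

    Even warps: lane = tid XOR (warp_id // 2), a permutation within a
    half warp since warp_id // 2 < width // 2 for up to `width` warps.
    Odd warps: additionally XOR with width // 2 (swap the two halves).
    """
    lane = logical_tid ^ (warp_id // 2)
    if warp_id % 2 == 1:
        lane ^= width // 2
    return lane
```

Vertical balance holds because XOR with a constant is a bijection on lanes, so each warp still occupies every lane exactly once; horizontal balance holds because, across warps 0..width-1, logical TID x is XORed with every value 0..width-1 exactly once, so it lands in a different lane in every warp. This is why the property only works for CTAs with at most SIMD_Width warps.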
Impact on Memory Coalescing Modern GPUs do not require ordered requests Coalescing can occur across a set of requests → specific lane assignments do not affect coalescing behavior Increase in L1 miss rate is offset by benefits of compaction Figure from M. Rhu and M. Erez, Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation, ISCA 2013 (85) Speedup of Compaction Can improve the compaction rate of divergence due to the majority of programmatic branches Figure from M. Rhu and M. Erez, Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation, ISCA 2013 (86)
Compaction Rate vs. Utilization Distinguish between compaction rate and utilization! Figure from M. Rhu and M. Erez, Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation, ISCA 2013 (87) Application of Balanced Permutation I-Fetch Decode I-Buffer Issue PRF scalar pipelines D-Cache All Hit? Data Writeback pending warps Permutation is applied when the warp is launched Maintained for the life of the warp Does not affect the baseline compaction mechanism Enable/disable SLP to preserve target-specific, programmer-implemented optimizations (88)
Summary Structural hazards limit the performance improvements from inter-warp compaction Program behaviors produce correlated lane assignments today Remapping threads to lanes enables extension of compaction opportunities (89) Summary: Inter-Warp Compaction A B C D E F G Thread Block Thread Block Co-design of applications, resource management, software, and microarchitecture Program Properties Scope μarch Resource Management (90)
Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)
More informationLecture 27: Multiprocessors. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs
Lecture 27: Multiprocessors Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood programming model
More informationA Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines
A Key Theme of CIS 371: arallelism CIS 371 Computer Organization and Design Unit 10: Superscalar ipelines reviously: pipeline-level parallelism Work on execute of one instruction in parallel with decode
More informationOrchestrated Scheduling and Prefetching for GPGPUs. Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das
Orchestrated Scheduling and Prefetching for GPGPUs Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das Parallelize your code! Launch more threads! Multi- threading
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationHandout 3. HSAIL and A SIMT GPU Simulator
Handout 3 HSAIL and A SIMT GPU Simulator 1 Outline Heterogeneous System Introduction of HSA Intermediate Language (HSAIL) A SIMT GPU Simulator Summary 2 Heterogeneous System CPU & GPU CPU GPU CPU wants
More informationProcessors, Performance, and Profiling
Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode
More informationLeveraging Memory Level Parallelism Using Dynamic Warp Subdivision
Leveraging Memory Level Parallelism Using Dynamic Warp Subdivision Jiayuan Meng, David Tarjan, Kevin Skadron Univ. of Virginia Dept. of Comp. Sci. Tech Report CS-2009-02 Abstract SIMD organizations have
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationUnderstanding GPGPU Vector Register File Usage
Understanding GPGPU Vector Register File Usage Mark Wyse AMD Research, Advanced Micro Devices, Inc. Paul G. Allen School of Computer Science & Engineering, University of Washington AGENDA GPU Architecture
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:
More informationSIMD Divergence Optimization through Intra-Warp Compaction. Aniruddha Vaidya Anahita Shayesteh Dong Hyuk Woo Roy Saharoy Mani Azimi ISCA 13
SIMD Divergence Optimization through Intra-Warp Compaction Aniruddha Vaidya Anahita Shayesteh Dong Hyuk Woo Roy Saharoy Mani Azimi ISCA 13 Problem GPU: wide SIMD lanes 16 lanes per warp in this work SIMD
More informationBasics of Performance Engineering
ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently
More informationUnderstanding Outstanding Memory Request Handling Resources in GPGPUs
Understanding Outstanding Memory Request Handling Resources in GPGPUs Ahmad Lashgar ECE Department University of Victoria lashgar@uvic.ca Ebad Salehi ECE Department University of Victoria ebads67@uvic.ca
More informationComputer Architecture Lecture 15: GPUs, VLIW, DAE. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/20/2015
18-447 Computer Architecture Lecture 15: GPUs, VLIW, DAE Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/20/2015 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle
More informationA Hardware-Software Integrated Solution for Improved Single-Instruction Multi-Thread Processor Efficiency
Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 2012 A Hardware-Software Integrated Solution for Improved Single-Instruction Multi-Thread Processor Efficiency
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationPATS: Pattern Aware Scheduling and Power Gating for GPGPUs
PATS: Pattern Aware Scheduling and Power Gating for GPGPUs Qiumin Xu and Murali Annavaram Ming Hsieh Department of Electrical Engineering, University of Southern California Los Angeles, CA {qiumin, annavara}
More informationCUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD
More informationMIMD Synchronization on SIMT Architectures
MIMD Synchronization on SIMT Architectures Ahmed ElTantawy and Tor M. Aamodt University of British Columbia {ahmede,aamodt}@ece.ubc.ca Abstract In the single-instruction multiple-threads (SIMT) execution
More informationSpatio-Temporal SIMT and Scalarization for Improving GPU Efficiency
http://dx.doi.org/10.14279/depositonce-6262.) Spatio-Temporal SIMT and Scalarization for Improving GPU Efficiency Jan Lucas, Technische Universität Berlin Michael Andersch, Technische Universität Berlin
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationProfiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency
Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu
More informationLecture 9. Thread Divergence Scheduling Instruction Level Parallelism
Lecture 9 Thread Divergence Scheduling Instruction Level Parallelism Announcements Tuesday s lecture on 2/11 will be moved to room 4140 from 6.30 PM to 7.50 PM Office hours cancelled on Thursday; make
More informationAddressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
DOI 10.1007/s10766-012-0201-1 Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline Chunyang Gou Georgi N. Gaydadjiev Received: 17 August 2011 / Accepted: 14 June 2012 The Author(s)
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationEE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University
EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted
More informationComputer Architecture: SIMD and GPUs (Part II) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: SIMD and GPUs (Part II) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 19: SIMD
More informationCache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals
Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics
More informationEE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin
EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel
More informationGPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA
GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit
More informationLecture 11 Cache. Peng Liu.
Lecture 11 Cache Peng Liu liupeng@zju.edu.cn 1 Associative Cache Example 2 Associative Cache Example 3 Associativity Example Compare 4-block caches Direct mapped, 2-way set associative, fully associative
More informationLecture 2: CUDA Programming
CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:
More informationLecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections )
Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections 5.1-5.3) 1 Reducing Miss Rate Large block size reduces compulsory misses, reduces miss penalty in case
More information250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019
250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr
More informationLecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"
Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3
More informationComputer Architecture Lecture 24: Memory Scheduling
18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM
More informationRegMutex: Inter-Warp GPU Register Time-Sharing
RegMutex: Inter-Warp GPU Register Time-Sharing Farzad Khorasani* Hodjat Asghari Esfeden Amin Farmahini-Farahani Nuwan Jayasena Vivek Sarkar *farkhor@gatech.edu The 45 th International Symposium on Computer
More informationInstruction Level Parallelism (Branch Prediction)
Instruction Level Parallelism (Branch Prediction) Branch Types Type Direction at fetch time Number of possible next fetch addresses? When is next fetch address resolved? Conditional Unknown 2 Execution
More informationAdvanced Caching Techniques (2) Department of Electrical Engineering Stanford University
Lecture 4: Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee282 Lecture 4-1 Announcements HW1 is out (handout and online) Due on 10/15
More informationLecture 27: Pot-Pourri. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability
Lecture 27: Pot-Pourri Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood
More informationBanked Multiported Register Files for High-Frequency Superscalar Microprocessors
Banked Multiported Register Files for High-Frequency Superscalar Microprocessors Jessica H. T seng and Krste Asanoviü MIT Laboratory for Computer Science, Cambridge, MA 02139, USA ISCA2003 1 Motivation
More informationCS 1013 Advance Computer Architecture UNIT I
CS 1013 Advance Computer Architecture UNIT I 1. What are embedded computers? List their characteristics. Embedded computers are computers that are lodged into other devices where the presence of the computer
More informationRethinking Prefetching in GPGPUs: Exploiting Unique Opportunities
Rethinking Prefetching in GPGPUs: Exploiting Unique Opportunities Ahmad Lashgar Electrical and Computer Engineering University of Victoria Victoria, BC, Canada Email: lashgar@uvic.ca Amirali Baniasadi
More informationCPU Architecture. HPCE / dt10 / 2013 / 10.1
Architecture HPCE / dt10 / 2013 / 10.1 What is computation? Input i o State s F(s,i) (s,o) s Output HPCE / dt10 / 2013 / 10.2 Input and Output = Communication There are many different types of IO (Input/Output)
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationAnalyzing CUDA Workloads UsingaDetailedGPUSimulator
Analyzing CUDA Workloads UsingaDetailedGPUSimulator Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong and Tor M. Aamodt University of British Columbia, Vancouver, BC, Canada {bakhoda,gyuan,wwlfung,henryw,aamodt}@ece.ubc.ca
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationDivergence-Aware Warp Scheduling
Divergence-Aware Warp Scheduling Timothy G. Rogers Department of Computer and Electrical Engineering University of British Columbia tgrogers@ece.ubc.ca Mike O Connor NVIDIA Research moconnor@nvidia.com
More informationGRAPHICS PROCESSING UNITS
GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
More informationece4750-t11-ooo-execution-notes.txt ========================================================================== ece4750-l12-ooo-execution-notes.txt ==========================================================================
More informationInput and Output = Communication. What is computation? Hardware Thread (CPU core) Transforming state
What is computation? Input and Output = Communication Input State Output i s F(s,i) (s,o) o s There are many different types of IO (Input/Output) What constitutes IO is context dependent Obvious forms
More informationLecture 25: Board Notes: Threads and GPUs
Lecture 25: Board Notes: Threads and GPUs Announcements: - Reminder: HW 7 due today - Reminder: Submit project idea via (plain text) email by 11/24 Recap: - Slide 4: Lecture 23: Introduction to Parallel
More informationComputer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Branch Prediction Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 11: Branch Prediction
More informationAchieving Out-of-Order Performance with Almost In-Order Complexity
Achieving Out-of-Order Performance with Almost In-Order Complexity Comprehensive Examination Part II By Raj Parihar Background Info: About the Paper Title Achieving Out-of-Order Performance with Almost
More informationControl Hazards. Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More informationCUDA Performance Optimization. Patrick Legresley
CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationA Code Merging Optimization Technique for GPU. Ryan Taylor Xiaoming Li University of Delaware
A Code Merging Optimization Technique for GPU Ryan Taylor Xiaoming Li University of Delaware FREE RIDE MAIN FINDING A GPU program can use the spare resources of another GPU program without hurting its
More informationExploiting Core Criticality for Enhanced GPU Performance
Exploiting Core Criticality for Enhanced GPU Performance Adwait Jog, Onur Kayıran, Ashutosh Pattnaik, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, Chita R. Das. SIGMETRICS 16 Era of Throughput Architectures
More informationUnit 8: Superscalar Pipelines
A Key Theme: arallelism reviously: pipeline-level parallelism Work on execute of one instruction in parallel with decode of next CIS 501: Computer Architecture Unit 8: Superscalar ipelines Slides'developed'by'Milo'Mar0n'&'Amir'Roth'at'the'University'of'ennsylvania'
More informationExecution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures
Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture
More informationImpact of Warp Formation on GPU Performance
Impact of Warp Formation on GPU Performance Hong Jun Choi, Dong Oh Son, and Cheol Hong Kim Abstract As computing power of GPU increases dramatically, the GPU is widely used for general-purpose parallel
More informationCS377P Programming for Performance GPU Programming - II
CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline
More informationMotivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture
Motivation Banked Register File for SMT Processors Jessica H. Tseng and Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA BARC2004 Increasing demand on
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More information18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013
18-447: Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 Reminder: Homework 5 (Today) Due April 3 (Wednesday!) Topics: Vector processing,
More informationScalarization and Temporal SIMT in GPUs: Reducing Redundant Operations for Better Performance and Higher Energy Efficiency. Jan Lucas TU Berlin - AES
Scalarization and Temporal SIMT in GPUs: Reducing Redundant Operations for Better Performance and Higher Energy Efficiency Jan Lucas TU Berlin - AES Overview What is a Scalarization? Why are Scalar Operations
More informationMaximizing Face Detection Performance
Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount
More informationNam Sung Kim. w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices *
Nam Sung Kim w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices * modern GPU architectures deeply pipelined for efficient resource sharing several buffering
More informationWrite only as much as necessary. Be brief!
1 CIS371 Computer Organization and Design Final Exam Prof. Martin Wednesday, May 2nd, 2012 This exam is an individual-work exam. Write your answers on these pages. Additional pages may be attached (with
More informationConvolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam
Convolution Soup: A case study in CUDA optimization The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Optimization GPUs are very fast BUT Naïve programming can result in disappointing performance
More information