Occupancy-based compilation

1 Occupancy-based compilation Advanced Course on Compilers Spring 2015 (III-V): Lecture 10 Vesa Hirvisalo ESG/CSE/Aalto

2 Today: Threads and occupancy; GPUs as the example; SIMT execution: warp (thread-group) scheduling, divergent vs. uniform thread execution, stack-based and PC-based reconvergence; Divergence analysis: different forms of divergence, the micro-simd language and machine (a toy example), Gated Single Assignment form (GSA); Compiling into Cooperative Thread Arrays (CTAs): having a massive number of threads, GPU memory issues, CTA and warp formation. 2/40

3 Threads and occupancy 3/40

4 Having a massive number of threads. Repeated execution of code: programs spend their time in repeatedly executed parts. Dependence-based compilation: we consider loops as the primary form of repetition; this also covers many other control structures (e.g., recursion); we form the iteration space and analyze dependencies. Occupancy-based compilation: we consider threads as the primary form of repetition; we have a number of threads that execute the same code; we try to maximize the number of parallel threads; light threads help in maximizing HW occupancy. 4/40

5 GPUs as an example (1/2). There are several vendors offering different concepts. The two main ones: NVIDIA SM (Streaming Multiprocessor), with warps and warp scheduling; AMD VLIW4 (or VLIW5), older, basically dependency limited, and Graphics Core Next (GCN), newer, more similar to the NVIDIA SM. Both are moving toward scalar cores and occupancy-limited computing. We will (mostly) use NVIDIA terminology in the following. 5/40

6 GPUs as an example (2/2). Vector or VLIW cores (in parallel, SIMT style): especially VLIW is suitable for graphics; NVIDIA also used it in the past. Both need complex compilers: dependency limited; loop structures, static scheduling; static register conflict resolution. Scalar cores (in parallel, pure SIMT): more toward HW scheduling; the hardware tries to get utilization (occupancy) up; the software must offer threads for occupancy. Simpler compiler, and more predictable performance over a wide range of applications. 6/40

7 Utilization and occupancy. Occupancy measures warp hardware utilization: occupancy = actual active warps / maximum active warps. Why does the HW sit with nothing to do? Interconnects saturate (memory, registers, ...). If per-CTA resource requirements are high, the effective number of simultaneously resident CTAs, and thus threads, is limited by these requirements. High control flow divergence: threads in a warp have different control flow and cannot be run in parallel, i.e., the warping is "sparse" instead of being "dense". Inefficient scheduling mechanisms: after a long memory wait, many of the warps tend to become ready at roughly the same time. A sketch of the occupancy calculation is given below. 7/40
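
As a rough illustration of the occupancy formula above, the following host-side sketch computes how per-CTA requirements limit the number of resident CTAs and thus occupancy. All per-SM limits and per-CTA requirements are hypothetical example values, not figures from the lecture.

// Minimal sketch: estimating occupancy from per-CTA resource requirements.
#include <algorithm>
#include <cstdio>

int main() {
    // Hypothetical per-SM hardware limits (assumptions for illustration only).
    const int maxWarpsPerSM   = 48;
    const int maxThreadsPerSM = 1536;
    const int registersPerSM  = 32768;
    const int sharedMemPerSM  = 48 * 1024;   // bytes
    const int warpWidth       = 32;

    // Per-CTA requirements implied by a kernel (also assumptions).
    const int threadsPerCTA   = 256;
    const int regsPerThread   = 32;
    const int sharedMemPerCTA = 12 * 1024;   // bytes

    // Each resource limit caps the number of CTAs resident on one SM.
    int byThreads = maxThreadsPerSM / threadsPerCTA;
    int byRegs    = registersPerSM / (regsPerThread * threadsPerCTA);
    int bySmem    = sharedMemPerSM / sharedMemPerCTA;
    int ctas      = std::min({byThreads, byRegs, bySmem});

    int activeWarps = ctas * threadsPerCTA / warpWidth;
    printf("resident CTAs: %d, occupancy = %d/%d = %.2f\n",
           ctas, activeWarps, maxWarpsPerSM,
           (double)activeWarps / maxWarpsPerSM);
    return 0;
}

With these example numbers the register and shared-memory limits cap residency at 4 CTAs, giving 32 active warps out of 48, i.e., an occupancy of about 0.67.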

8 SIMT execution 8/40

9 SIMT microarchitecture. MIMD (Multiple Instruction Multiple Data): each core has its own instruction stream; co-operation and synchronization happen through memory. Vector-SIMD: there is only a single instruction stream; multiple separate vector lanes, like cores without the control logic (etc.); each lane operates on a word; there can be predication. SIMT: all cores have their own instruction stream; however, there is (usually) only one issue unit; the cores run the same instruction, use predication, and ignore irrelevant instructions. There are several flavors of each! Be careful about the terminology (processor, core, lane, ...). 9/40

10 Warp scheduling. Microthread issue is done by a warp scheduler. 1. The warp scheduler chooses a warp: the warp must be in an active block; the warp must be ready, i.e., not waiting for memory or register operands; the warp has a PC, which applies to all its unmasked threads. 2. The instruction for the warp is fetched and decoded; let x denote the number of functional units for this instruction. 3. The warp is executed, which can take several cycles: at each cycle, x threads are issued to functional units; this repeats until all threads in the warp are issued. Execution is usually pipelined, and it can take several (e.g., 20) cycles until the result is available. A sketch of this loop is given below. 10/40
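
The scheduling steps above can be mirrored by a small host-side simulation sketch. The types, fields, and the stub decode function below are my own simplifications; real warp scheduling is done in hardware.

#include <vector>

struct Warp  { int pc; bool ready; bool activeBlock; };
struct Instr { int latency; int funcUnits; };

// Stub decode: pretend every instruction uses 8 cores and has a 20-cycle latency.
Instr fetchAndDecode(int /*pc*/) { return {20, 8}; }

// One scheduling decision, mirroring steps 1-3 above.
void scheduleOneInstruction(std::vector<Warp>& warps, int warpWidth) {
    // 1. Choose a warp that is in an active block and ready.
    Warp* w = nullptr;
    for (auto& cand : warps)
        if (cand.activeBlock && cand.ready) { w = &cand; break; }
    if (!w) return;                          // nothing ready: the HW sits idle

    // 2. Fetch and decode the instruction at the warp's PC.
    Instr ins = fetchAndDecode(w->pc);
    int x = ins.funcUnits;                   // x functional units for this instruction

    // 3. Issue x threads per cycle until the whole warp has been issued.
    int issueCycles = (warpWidth + x - 1) / x;   // e.g., 32 threads on 8 cores = 4 cycles
    (void)issueCycles;                       // the result is available ins.latency cycles later
    w->pc += 1;
}

int main() {
    std::vector<Warp> warps = { {0, true, true}, {0, false, true} };
    scheduleOneInstruction(warps, 32);       // issues from the first (ready) warp
    return 0;
}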

11 Warp scheduling example (1/2). Assume: warp width 32; 8 cores, 2 SFUs. Code: I1: FMUL R17, R19, R29; // multiplication, uses a core. I2: RSQ R7, R7; // square root, uses a SFU. Execution: [Timing diagram: warp W00 first issues I1 on the 8 cores over a few cycles (32/8 = 4 issue cycles) and then issues I2 on the 2 SFUs, which occupies the SFUs for far longer (32/2 = 16 issue cycles).] 11/40

12 Warp scheduling example (2/2). Assume: two warp schedulers, 8 cores per scheduler, 12-stage pipeline. Code: I1: IADD R1, R0, R5; I2: IMAD.U16 R3, g [0x6].U16, R5L, R2; I3: IADD R2, R1, R5; // depends on I1 (via R1). Execution (note how low the utilization is): there is plenty of room to run other threads! Remember memory latencies. [Timing diagram: warps W00-W03 each issue I1, I2, and I3 in turn; because I3 depends on I1 through the 12-stage pipeline, long idle gaps remain between the issues.] 12/40

13 Stack-based branch divergence. [Figure: program flow through basic blocks A, B, C, and D with the set of active threads in each block, together with the reconvergence stack (columns: reconvergence PC (R pc), next PC, active mask) shown at four points in time: the initial stack contents (a single entry for A at the stack top), the stack after the divergence at A (a reconvergence entry for D plus entries with next PC C and next PC B), the stack after branch completion (the D entry plus the entry for C), and the stack after reconvergence (only the entry for D).] 13/40

14 Memory divergence. Memory delays can cause warp splitting. Typical current HW does not allow such splitting, i.e., execution is kept uniform even if latencies are divergent. Below we assume warp-splitting support. [Figure: program flow through blocks A, B, and C with the active mask of each split sub-warp over time; threads whose data is ready proceed while the others wait for memory, and the sub-warps reconverge when the memory accesses complete.] 14/40

15 Stack-based vs PC-based reconvergence. Warp splitting together with PC-based reconvergence needs more complex HW. [Figure: a program flow over blocks A to G, comparing the conventional execution sequence over time with the execution sequence obtained using dynamic warp subdivision.] 15/40

16 Divergence analysis 16/40

17 Microthread divergence. Data divergence occurs if the same variable name is mapped to different values in the environments of distinct processing elements. Data divergence produces the other two forms. Control divergence occurs when threads in a warp follow different paths after processing the same branch. Memory divergence occurs when a load or store instruction targeting data-divergent addresses causes different access delays for different threads; an affine address means consecutive threads access adjacent or regularly spaced memory locations, while a uniform address means all threads access the same memory location. Note: support for divergence is essential for manycores to target a wide set of applications; basic tile processors (e.g., Epiphany) or multicore CPUs do not have such support. A small kernel illustrating these forms is sketched below. 17/40
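
A minimal CUDA kernel can make the three forms concrete. The kernel below is an illustrative example of my own (not from the lecture); the table lookup and the modulo pattern are just placeholders for uniform, affine, and data-divergent behavior.

__global__ void divergence_kinds(const float* in, float* out,
                                 const float* table, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // "Tid": data divergent
    if (tid >= n) return;

    float u = table[0];            // uniform address: every thread reads the same word
    float a = in[tid];             // affine address: consecutive threads, stride 1
    float d = in[(tid * 37) % n];  // data-divergent address: irregular per thread

    if (tid % 2 == 0)              // control divergence: the predicate depends on Tid
        out[tid] = u + a;
    else
        out[tid] = u + d;
}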

18 The micro-simd language. We will use a simple (machine) language to demonstrate divergence.
Labels (L) ::= l ∈ N
Variables (V) ::= Tid | {v1, v2, ...}
Instructions ::=
- (jump if zero/not zero) bz/bnz v, l
- (unconditional jump) jump l
- (store into shared memory) store vx = v
- (load from shared memory) load v = vx
- (atomic increment) atominc v = vx
- (other binary operations) binop v1 = v2 op v3
- (immediate copy) const v = n
- (synchronization barrier) sync
- (halt execution) stop
Tid is the thread identifier (corresponding to the get_x_id() functions of OpenCL). 18/40

19 The micro-simd machine. The state of the machine is determined by: the program that the machine executes; the program counter (pc), which all processing elements (PEs) use; the contents of the shared memory, shared by all PEs; the contents of the local memories (every PE has a local memory; the load and store instructions transfer data between the local memory of a PE and the shared memory); the set of active threads Θ; and the contents of the synchronization stack. 19/40

20 The synchronization stack. The synchronization stack contains frames holding quadruples (l_id, Θ_done, l_next, Θ_todo), where l_id is the conditional branch that caused the divergence, Θ_done is the set of PEs that have reached the synchronization point, l_next indicates the instruction where Θ_todo will resume execution, and Θ_todo is the set of PEs waiting to execute. Basically, the machine pushes frames at branches and pops them at synchronization barriers. Note that this is different from the R-pc-based stack; it resembles the predicate mask of vector machines, but allows for nested "if" structures and loops that may include synchronization primitives (in micro-simd: atominc, sync). A data-structure sketch is given below. 20/40
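
A minimal data-structure sketch of such a stack, assuming Θ is represented as a bit mask over the PEs, could look as follows (the type names and the push helper are my own).

#include <cstdint>
#include <vector>

using ThreadSet = std::uint32_t;         // Θ as a bit mask, one bit per PE

struct SyncFrame {
    int       l_id;                      // the branch that caused the divergence
    ThreadSet theta_done;                // PEs that have reached the sync point
    int       l_next;                    // where theta_todo resumes execution
    ThreadSet theta_todo;                // PEs waiting to execute
};

struct MicroSimdState {
    int                    pc;
    ThreadSet              active;       // the currently active PEs (Θ)
    std::vector<SyncFrame> stack;        // the synchronization stack
};

// Push a frame at a divergent branch: the 'pending' group will resume at
// 'pendingTarget' once the matching barrier is reached.
inline void pushDivergence(MicroSimdState& m, int branchPc,
                           ThreadSet pending, int pendingTarget) {
    m.stack.push_back({branchPc, /*theta_done=*/0, pendingTarget, pending});
}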

21 Micro-SIMD machine operation (1/2). Upon reaching bz v, l we evaluate v in the local memory of each active PE. If the condition v is non-zero for every PE, we move to the next instruction, i.e., pc + 1. Similarly, if the condition is zero for every PE, we jump to the instruction at l. However, if we get distinct values for different PEs, then the branch is divergent: we split Θ into Θ_0 (condition not true) and Θ_n (condition true); we execute the "else" branch (with Θ_0); we push the other branch (i.e., Θ_n) onto the stack; basically, we have an execution pending on the stack. 21/40

22 Micro-SIMD machine operation (2/2). The handling of branch non-zero (bnz) is similar. Also the non-divergent branch instructions (where Θ_0 or Θ_n is empty) update the synchronization stack (in order not to get stuck when trying to pop a frame at a sync instruction). If we arrive at a barrier (sync) with a group Θ_n of PEs waiting to execute, then we resume their execution at the "then" branch, putting the previously active PEs on hold. If we reach a barrier without any PE waiting to execute, we synchronize the PEs with the current set of active PEs and resume execution at the next instruction after the barrier. 22/40

23 Example (1/5). Consider the following code in a language resembling OpenCL (note the use of the thread identifier Tid):

kernel TriangleSum(float* m, float* v, int c) {
  if (Tid < c) {
    int d = 0;
    float sum = 0.0F;
    int L = (Tid + 1) * c;
    for (int i = Tid; i < L; i += c) {
      if (d % 2) sum += m[i];
      d += 1;
    }
    v[d-1] = sum;
  }
}
23/40

24 Example (2/5). A simple translation yields the following code in micro-simd:
l00: const d = 0
l01: const s = 0
l02: const i = Tid
l03: binop x = Tid + 1
l04: binop L = c * x
l05: binop p = i - L
l06: bz p, l15
l07: binop p = d % 2
l08: bnz p, l11
l09: load x = i
l10: binop s = s + x
l11: sync
l12: binop d = d + 1
l13: binop i = i + c
l14: jmp l05
l15: sync
l16: binop x = d - 1
l17: store x = s
l18: stop
24/40

25 Example (3/5). If we execute the kernel for a 4x4 matrix (i.e., c = 4) with 4 threads (t0, t1, t2, t3), we can observe divergence. E.g., when executing l06 for the second time (cycle 17), the predicate p is 0 for thread t0 and non-zero for the other PEs; because of this, thread t1 is pushed onto the stack. Once t3 leaves the loop, the threads synchronize via the sync instruction at label l15 and resume lock-step execution. In the following (partial!) listing of the execution we mark an active thread with + and a passive thread with -. 25/40

26 Example (4/5). [Execution trace table with columns Cycle, Instruction, t0, t1, t2, t3; the per-thread +/- activity marks belong to the original slide and are not reproduced here. Starting at cycle 16, the traced instruction sequence is: l05: binop p = i - L; l06: bz p, l15; l07: binop p = d % 2; l08: bnz p, l11; l09: load x = i; l10: binop s = s + x; l11: sync; l12: binop d = d + 1; l13: binop i = i + c; l14: jmp l05; then the loop test l05, l06, l07, l08 again; and finally l06: bz p, l15 followed by l15: sync.] 26/40

27 Example (5/5). The instruction l08: bnz p, l11 cannot cause a divergence, even though the predicate p is data dependent on variable d, which is created inside a divergent loop. Variable d is not divergent, although the variable p that controls the loop is. Note that a variable can be divergent inside a loop but uniform outside it. Precise divergence analysis is needed to observe such behavior. Note that on a real GPU we could use simple predication, without any branching, for the "if"; in a PTX compilation, bra.uni instead of bra. 27/40

28 Gated Single Assignment form (GSA). GSA uses three special functions as instructions, µ, γ, and η, instead of the SSA phi-function (φ). The γ function is a join (for branches): γ(p, v1, v2) is v1 if the predicate (the gate) p is true, or else v2. The µ function is a loop join (for loop headers): µ(v1, v2) is v1 for the first iteration and v2 for the rest. η is the loop exit function η(p, v): it binds a loop-dependent value v to the loop predicate p. GSA allows control dependencies to be transformed into data dependencies. The placement of GSA functions into blocks is similar to SSA; however, to save space, we omit the details of GSA. 28/40

29 Example reconsidered with GSA (1/2). Below, sync instructions are implicit in the η and γ functions. Note that the labeling has changed (because of the assignment for the µ function).
l00: const d0 = 0
l01: const s0 = 0
l02: const i0 = Tid
l03: binop x0 = Tid + 1
l04: binop L0 = c * x0
l05: [i1, s1, d1] = µ[(i0, s0, d0), (i2, s3, d2)]
l06: binop p0 = i1 - L0
l07: bz p0, l16
l08: binop p1 = d1 % 2
l09: bnz p1, l12
l10: load x2 = i1
l11: binop s2 = s1 + x2
l12: [s3] = γ(p1, s2, s1)
l13: binop d2 = d1 + 1
l14: binop i2 = i1 + c
l15: jmp l05
l16: [s4, d3] = η(p0, (s1, d1))
l17: binop x3 = d3 - 1
l18: store x3 = s4
l19: stop
29/40

30 Example reconsidered with GSA (2/2). Using GSA we can analyze the divergence behavior, e.g.: The "if" branch instruction (bnz p1, l12) depends on d1, for which we have no divergent definition; thus, the "if" is not divergent. The "for" loop branch instruction (bz p0, l16) depends on p0, for which we have a divergent definition because Tid is divergent; thus, the "for" loop is divergent. The threads go "hand-in-hand" inside the loop, but their exit from the loop is divergent; note, however, that the fourth value of d (i.e., d3) is divergent. In general, we get more precision by considering values instead of variables. 30/40

31 Compiling CTAs 31/40

32 Cooperative Thread Arrays. Applications are typically divided into kernels, and each kernel is capable of spawning many threads. The threads are usually grouped together into thread blocks; to a compiler, these are known as cooperative thread arrays (CTAs). When an application starts its execution, the CTA scheduler initiates scheduling of CTAs onto the available processors. All the threads within a CTA are executed on the same processor in groups of, e.g., 32 threads. This collection of threads is referred to as a warp; the threads within a warp share the same instruction stream, i.e., single instruction multiple threads (SIMT). A launch-configuration sketch is given below. 32/40
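
The hierarchy can be seen directly in a CUDA launch. The sketch below is a generic example (the kernel and sizes are hypothetical): the grid is the set of CTAs, each CTA is a thread block, and the hardware partitions each block into warps of warpSize threads.

#include <cstdio>

__global__ void scale(float* data, float factor, int n) {
    // Each thread block is one CTA; the hardware splits it into warps of warpSize threads.
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;   // global thread id
    int warpId = threadIdx.x / warpSize;                  // warp index within the CTA
    (void)warpId;                                         // shown only to illustrate the partitioning
    if (tid < n) data[tid] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    dim3 cta(256);                        // 256 threads per CTA = 8 warps of 32
    dim3 grid((n + cta.x - 1) / cta.x);   // enough CTAs to cover n threads
    scale<<<grid, cta>>>(d, 2.0f, n);     // the CTA scheduler distributes the CTAs onto SMs
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}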

33 CTA formation. CTAs correspond to OpenCL workgroups (CUDA blocks); more fundamentally, a CTA is an abstraction which encapsulates all synchronization in a thread group. CTAs are partitioned into warps; this sub-division is transparent to the application programmer and is an architectural abstraction based on a cost model (memory, regs, ...) implied by the kernel code. Note the hierarchy: an application consists of kernels (Kernel 1, Kernel 2, Kernel 3, ...), a kernel of CTAs (CTA 1, CTA 2, CTA 3, ...), a CTA of warps (Warp 1, Warp 2, Warp 3, ...), and a warp of threads (Thread 1, Thread 2, Thread 3, ...). 33/40

34 CTA issues. The basic idea of having many threads: we can hide memory latency; while a thread waits for memory, we execute other threads. Two-level scheduling: the CTA scheduler (the compiler produces the CTAs) and the warp scheduler (the compiler must co-operate with the scheduler; divergence and synchronization are not trivial). Performance bottlenecks: 1. on-chip memory and register files are limiting factors on parallelism; 2. high control flow divergence; 3. inefficient scheduling mechanisms (round-robin is typical, and memory use is easily congested). 34/40

35 Memory addressing. Specific memory addressing: we know statically the memory space that we are using:
LLD.U32 R5, local [0x10]; // Load from local space
GLD.U32 R5, global14 [R3]; // Load from global space
G2R.U32 R0, g [A3+0x4].U32; // Load from shared space
Generic memory addressing: memory space resolution is dynamic; CUDA supports this (OpenCL calls for explicit usage):
LD.E R5, generic [R3];
35/40
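
In source terms, the distinction shows up as follows. The CUDA kernel below is an illustrative example of my own: the __shared__ buffer and the local array have statically known spaces, whereas the plain pointer parameter of the helper is generic and benefits from memory-space analysis.

// 'p' is a generic pointer: the callee does not know which space it points to.
__device__ float pick(const float* p, int i) {
    return p[i];
}

__global__ void spaces(float* g_out, const float* g_in, int n) {
    __shared__ float s_buf[256];                 // shared space, known statically
    float l_tmp[4];                              // local (private) space, known statically

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    s_buf[threadIdx.x] = g_in[tid];              // global -> shared
    __syncthreads();

    l_tmp[0] = s_buf[threadIdx.x];               // shared -> local
    // The same helper is called with pointers into different spaces; without
    // memory-space analysis the accesses inside it stay generic.
    g_out[tid] = pick(s_buf, threadIdx.x) + pick(l_tmp, 0);
}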

36 Memory space analysis. On some architectures, accesses can be generic: we know only the space of the pointer, not the points-to space; the HW resolves the access type dynamically, typically local (OpenCL: private), shared (OpenCL: local), or global; such HW support makes programming easier. Memory space analysis is not only about address conversions: it also supports alias analysis and memory disambiguation. It is basically a points-to analysis: a forward data-flow analysis using a lattice (as is typical for data-flow), with the meet operator moving toward "unknown"; see the sketch below. 36/40
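
A sketch of such a lattice and its meet operator (the names are my own assumptions) is given below. At a control-flow join the analysis meets the spaces inferred for a pointer along the incoming edges; accesses whose pointer stays Local, Shared, or Global can then be emitted as specific loads/stores, and the rest remain generic.

// Memory-space lattice for a forward data-flow (points-to style) analysis.
enum class Space { Undefined, Local, Shared, Global, Unknown };

// Meet moves toward Unknown: agreeing facts are kept, conflicting facts
// degrade to Unknown, and Undefined (no information yet) is the identity.
Space meet(Space a, Space b) {
    if (a == Space::Undefined) return b;
    if (b == Space::Undefined) return a;
    if (a == b) return a;
    return Space::Unknown;       // conflicting spaces from different paths
}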

37 Memory organization. GPUs usually have memory restrictions. Banked memory is the most typical: each bank can serve only one thread (core) at a time; a full crossbar would be very expensive. Therefore, memory accesses must be correctly structured; a 2-way conflict causes serialization and doubles access times. [Figure: three mappings of threads 0-7 to banks 0-7: linear addressing with stride 1 (no bank conflict), nonlinear (permuted) addressing (no bank conflict), and linear addressing with stride 2 (2-way bank conflict).] A shared-memory sketch is given below. 37/40
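
The effect can be illustrated with a small CUDA shared-memory kernel (an example of my own; the exact bank behavior depends on the hardware generation).

#define TILE 32

__global__ void bank_demo(const float* in, float* out) {
    __shared__ float buf[2 * TILE];

    int t = threadIdx.x;                  // 0..31, one warp assumed
    buf[t]        = in[t];                // fill the buffer
    buf[t + TILE] = in[t + TILE];
    __syncthreads();

    // Stride 1: thread t reads buf[t], so each thread hits a different bank.
    float conflict_free = buf[t];

    // Stride 2: thread t reads buf[2 * t], so pairs of threads hit the same
    // bank, giving a 2-way conflict that serializes the access.
    float two_way = buf[(2 * t) % (2 * TILE)];

    out[t] = conflict_free + two_way;
}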

38 Assigning microthreads. A cooperative thread array (CTA) is the structure that holds the generated threads. Registers are partitioned among threads, and the number of resident threads depends on that: smaller register usage = getting more threads (not like typical RISC compiling). We need a huge number of threads. Consider, e.g., memory latencies of x times the instruction cycle time and assume every y-th instruction accesses memory; then we need x/y ready warps. Typical GPU HW has, e.g., 48 warps scheduled per scheduler (if some instruction stalls, there are others to choose from). Note: having 32 threads in a warp, 48 warps per scheduler, 4 warp schedulers per processor, and 32 processors means 32 x 48 x 4 x 32 = 196,608 simultaneously scheduled threads. As threads are very short, the software should be able to create millions of threads to keep the HW busy! 38/40
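
The arithmetic can be written out as a small sketch; the latency and memory-instruction ratio below are assumed example values, while the warp and scheduler counts are the ones quoted above.

#include <cstdio>

int main() {
    // Assumed example values: 400-cycle memory latency, one memory op per 5 instructions.
    int x = 400, y = 5;
    printf("ready warps needed ~ x/y = %d\n", x / y);

    // Values quoted on the slide.
    int warpWidth = 32, warpsPerScheduler = 48, schedulersPerProc = 4, procs = 32;
    printf("simultaneously scheduled threads = %d\n",
           warpWidth * warpsPerScheduler * schedulersPerProc * procs);   // 196608
    return 0;
}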

39 Code partitioning to remove synchronization
Input: list of statements F in AST representation
Output: list X of code partitions free of barriers
Begin
  P = new partition;
  while F has next statement S do
    switch type of statement S
      case barrier:
        Add P to X; P = new partition;
      case simple statement:
        Add S to P;
      case statement sequence:
        Prepend the statements comprising S to F;
      otherwise:
        if S contains a barrier statement then
          Add P to X;
          Invoke recursively for the body of S, yielding a list L;
          Append L to X;
          P = new partition;
        else
          Add S to P;
39/40
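
To illustrate what the partitioning produces, the sketch below shows a hypothetical CUDA kernel with a single barrier and the two barrier-free partitions a compiler could extract from it (the names and code are my own, not the algorithm's actual output format).

// Hypothetical kernel with one barrier; the barrier splits the code in two.
__global__ void before(float* out, const float* in) {
    __shared__ float buf[256];
    buf[threadIdx.x] = in[threadIdx.x];        // partition 1
    __syncthreads();                           // barrier
    out[threadIdx.x] = buf[255 - threadIdx.x]; // partition 2
}

// Partition 1: everything up to (but excluding) the barrier.
__device__ void partition1(float* buf, const float* in, int tid) {
    buf[tid] = in[tid];
}

// Partition 2: everything after the barrier.
__device__ void partition2(float* out, const float* buf, int tid) {
    out[tid] = buf[255 - tid];
}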

40 Optimizations. Optimizing applications can be hard: some parallel algorithms are intrinsically divergent, and programmers need to know the overall platform properties. Precise identification of divergences is hard; compiler support is a big help. Divergence-aware optimizations: branch distribution merges common code inside potentially divergent program paths; branch fusion, a generalization of branch distribution, joins chains of common instructions present in two divergent paths; branch splitting divides a parallelizable loop enclosing a multi-path branch into multiple loops, each containing only one branch; thread reallocation regroups divergent threads among warps, so that only one or just a few warps contain divergent threads. A branch-distribution sketch is given below. 40/40
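
As an illustration of branch distribution, the sketch below (my own example) hoists the work common to both sides of a Tid-dependent branch out of the divergent region, leaving only a small divergent remainder; the arrays are assumed large enough for the launched threads.

// Before: the whole body, including the shared multiply, sits under a
// Tid-dependent branch and is executed divergently.
__global__ void before_bd(float* out, const float* a, const float* b) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0)
        out[tid] = a[tid] * b[tid] + 1.0f;
    else
        out[tid] = a[tid] * b[tid] - 1.0f;
}

// After: the common product is computed uniformly; only the small
// +1/-1 part remains divergent (here folded into a select).
__global__ void after_bd(float* out, const float* a, const float* b) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float common = a[tid] * b[tid];               // executed by all threads together
    float delta  = (tid % 2 == 0) ? 1.0f : -1.0f; // the divergent remainder
    out[tid] = common + delta;
}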
