Inthreads: Code Generation and Microarchitecture
Alex Gontmakher, Gregory Shklover, Assaf Schuster, Avi Mendelson
Outline
- Inthreads introduction
- Inthreads code generation
- Microarchitecture
Motivation
- Programs contain low-level parallelism
- Would be nice to be able to express it
- But: too complex for superscalar execution
- But: too low-level to be done by threads
- Solution: an extremely lightweight threading mechanism
Inthreads architecture
- Processor resources are shared between threads (just like SMT)
- Architectural state, including registers, is shared between threads (unlike SMT!)
- Lightweight thread context: OS-transparent
- Built-in synchronization
- Integration with processor hardware: more efficient communication & synchronization
Inthreads architecture: Thread manipulation
- inth.start tid, addr: create a new thread
  - Only a fixed number of threads is supported
  - No scheduling necessary for threads
- inth.halt: terminate the current thread
- inth.kill tid: terminate the given thread
Inthreads architecture: Synchronization
- Condition registers C1, C2, ..., Cn: binary semaphores
- inth.set Ci sets condition Ci; releases one waiting thread
- inth.clr Ci clears Ci
- inth.wait Ci waits for Ci to be set
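For intuition, the condition-register semantics described above can be emulated in software with ordinary pthreads primitives. This is a hypothetical sketch, not part of the architecture: the type and function names are ours, and a hardware condition register is of course just one bit, not a mutex/condvar pair.

```c
#include <pthread.h>
#include <stdbool.h>

/* Software emulation of one Inthreads condition register (illustrative). */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    bool            set;   /* the condition register's single bit */
} inth_cond;

void inth_cond_init(inth_cond *c) {
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cv, NULL);
    c->set = false;
}

/* inth.set Ci: set the bit and release one waiting thread */
void inth_set(inth_cond *c) {
    pthread_mutex_lock(&c->lock);
    c->set = true;
    pthread_cond_signal(&c->cv);       /* wake at most one waiter */
    pthread_mutex_unlock(&c->lock);
}

/* inth.clr Ci: clear the bit */
void inth_clr(inth_cond *c) {
    pthread_mutex_lock(&c->lock);
    c->set = false;
    pthread_mutex_unlock(&c->lock);
}

/* inth.wait Ci: block until the bit is set */
void inth_wait(inth_cond *c) {
    pthread_mutex_lock(&c->lock);
    while (!c->set)
        pthread_cond_wait(&c->cv, &c->lock);
    pthread_mutex_unlock(&c->lock);
}
```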
Inthreads programming model
- Threads share the registers cooperatively
  - Thread-private variables use different registers in each thread
  - Shared variables must be allocated to the same register in all threads
- No race conditions allowed
  - Accesses must be protected by synchronization
  - Speculative races are still possible (compiler analysis?)
- No function calls while threads are active
Inthreads Code Generation: Compilation flow
- Inthreads-C: C with explicit parallelization
- Flow: C source (.c) -> Programmer or Parallelizer -> Inthreads-C source (.c) -> Inthreads-C compiler -> object file (.o)
Inthreads-C Extension

    INTH_CLEAR(1);            /* initialize condition variable #1 */
    INTH_START(start2);       /* start the thread at label start2 (must be a macro) */
    #pragma inthread (1 of 2)
    {
        b1 = ...; c1 = ...;
        a1 = min(b1, c1);     /* do half of the job */
        INTH_WAIT(1);         /* wait till the second thread terminates */
        ret = min(a1, a2);    /* calculate the final value */
    }
    return ret;
    start2:
    #pragma inthread (2 of 2)
    {
        b2 = ...; c2 = ...;
        a2 = min(b2, c2);     /* do half of the job */
        INTH_SET(1);          /* notify the first thread */
        INTH_HALT();          /* terminate this thread */
    }
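The same two-way min computation can be expressed with conventional threads for comparison. In this sketch, pthread_create/pthread_join play the roles of INTH_START and INTH_SET/INTH_WAIT, and a plain global stands in for the shared register a2; all names here are illustrative, not from the Inthreads-C extension.

```c
#include <pthread.h>

static int min2(int x, int y) { return x < y ? x : y; }

static int a2_shared;              /* plays the role of the shared register a2 */

static void *second_half(void *arg) {
    const int *v = arg;            /* v[0], v[1] stand in for b2, c2 */
    a2_shared = min2(v[0], v[1]);  /* do half of the job */
    return NULL;                   /* INTH_HALT analogue */
}

int parallel_min4(int b1, int c1, int b2, int c2) {
    int args[2] = { b2, c2 };
    pthread_t t;
    pthread_create(&t, NULL, second_half, args);  /* INTH_START analogue */
    int a1 = min2(b1, c1);         /* first half of the job */
    pthread_join(&t, NULL);        /* INTH_WAIT analogue: a2 is now valid */
    return min2(a1, a2_shared);    /* calculate the final value */
}
```

The key difference from real Inthreads is cost: here the spawn and join go through the OS, whereas inth.start and inth.wait are single instructions operating on shared registers.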
Inthreads-C Extension: Correctness Requirements

    INTH_CLEAR(1);
    INTH_START(start2);
    #pragma reg_range (1 of 2)
    {
        b1 = ...; c1 = ...;
        a1 = min(b1, c1);
        INTH_WAIT(1);
        ret = min(a1, a2);
    }
    return ret;
    start2:
    #pragma reg_range (2 of 2)
    {
        b2 = ...; c2 = ...;
        a2 = min(b2, c2);
        INTH_SET(1);
        INTH_HALT();
    }

- b1, c1, b2, c2: private values; must be allocated to different registers
- a2: shared value; must be allocated to the same register in both threads
- Special semantics: the value of a2 must be transferred correctly
  - INTH_WAIT is an Acquire for a2
  - INTH_SET is a Release for a2
Optimization Correctness
- Problem: basic compiler assumptions are broken
- Execution doesn't follow a single path at any given moment

      Thread 1:            Thread 2:
      i1: A = 5            j1: A = 10
      i2: INTH_WAIT(1)     j2: INTH_SET(1)
      i3: A == ?

- Classic optimizations may produce incorrect code
  - Common Subexpression Elimination
  - Loop Invariant Motion
  - Dead Code Elimination
Internal Representation: (Concurrent) Control Flow Graph
[CFG figure: the spawning thread executes INTH_START, does its job, then INTH_WAIT and continues; the spawned thread does its job, then INTH_SET and INTH_HALT]
- INTH_START is represented by a branch instruction with 100% probability for each edge
- INTH_SET instructions are connected to the corresponding INTH_WAIT instructions by synchronization edges
- INTH_HALT is represented by a dead-end block
Concurrent Data Flow
- Concurrent data flow is propagated through synchronization edges (which connect an inth_set to the corresponding inth_wait)

      Thread 1:             Thread 2:
      A = 5
      INTH_SET(2)  --->     INTH_WAIT(2)
                            A = A + 1
      A == 5 ?              INTH_SET(3)
      INTH_WAIT(3) <---
      B = A + 2

- A is marked as shared through this INTH_SET(); meaning: live, unknown value
Register Allocation
- The problem: make sure concurrently-live values are allocated to different registers at each execution point
- Method 1: manual register file partitioning
  - Allocate a private subset of registers to each thread
  - Allocate a subset of registers for shared variables
  - Use the allocation during register assignment
  - Not user-friendly, not portable, sub-optimal
- Method 2: automatic register allocation
  - Uses a concurrent interference graph
  - Uniform approach guided by spill-cost heuristics
Register Allocation: Graph Coloring Algorithm
- Additional conflict edges are added to the interference graph to represent concurrently-live values
- Each node now has parallel and sequential conflicts
- Conservative: exact execution ordering is not calculated
[Interference graph figure: nodes A, B, C and D, E, F with sequential and parallel conflict edges]
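Once parallel conflict edges are merged into the interference graph, any coloring algorithm can run on it unchanged. The following is a minimal greedy-coloring sketch of that idea, not the paper's allocator: conflict[i][j] is true when values i and j either interfere sequentially or are concurrently live in parallel threads, and the color assigned to a node stands for its register number.

```c
#include <stdbool.h>

#define NVAL 6   /* six values, e.g. A..F from the example graph */

/* Greedily color the combined interference graph; returns the number of
 * registers (colors) used and fills color[] with each value's register. */
int color_values(bool conflict[NVAL][NVAL], int color[NVAL]) {
    int max_color = 0;
    for (int i = 0; i < NVAL; i++) {
        bool used[NVAL] = { false };
        for (int j = 0; j < i; j++)      /* colors of already-assigned neighbors */
            if (conflict[i][j])
                used[color[j]] = true;
        int c = 0;
        while (used[c]) c++;             /* lowest free color */
        color[i] = c;
        if (c + 1 > max_color) max_color = c + 1;
    }
    return max_color;
}
```

With parallel edges present, two values that are concurrently live in different threads always receive different colors, i.e. different registers, which is exactly the correctness condition stated above.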
A Running Example (from SPEC2K MCF)

    for( ; arc < stop_arcs; arc += nr_group ) {
        if( arc->ident > BASIC ) {
            red_cost = bea_compute_red_cost( arc );
            if( bea_is_dual_infeasible( arc, red_cost ) ) {
                basket_size++;
                perm[basket_size]->a = arc;
                perm[basket_size]->cost = red_cost;
                perm[basket_size]->abs_cost = ABS(red_cost);
            }
        }
    }
Example continued: parallelization

Worker thread 1 (threads 2 and 3 are analogous):

    for( ; arc1 < stop_arcs; arc1 += nr_group*3 ) {
        if( arc1->ident > BASIC ) {
            red_cost1 = bea_compute_red_cost( arc1 );
            res1 = bea_is_dual_infeasible( arc1, red_cost1 );
            INTH_SET(1);
            INTH_WAIT(2);    /* receive destination from the main thread */
            if( res1 ) {
                perm[dest1]->a = arc1;
                perm[dest1]->cost = red_cost1;
                perm[dest1]->abs_cost = ABS(red_cost1);
            }
        } else {
            res1 = 0;
            INTH_SET(1);
            INTH_WAIT(2);
        }
    }

Main (merging) thread:

    for (; ; ) {
        INTH_WAIT(1); basket_size += res1; dest1 = basket_size; INTH_SET(2);
        INTH_WAIT(3); basket_size += res2; dest2 = basket_size; INTH_SET(4);
        INTH_WAIT(5); basket_size += res3; dest3 = basket_size; INTH_SET(6);
    }
Inthreads Implementation
- Basic processor structure
- Instruction execution ordering
- Handling speculation
Basic Processor Structure
- SMT + extensions
  - SAME: concurrent fetch
  - SAME: multiple ROBs
  - DIFF: shared architectural registers
  - DIFF: shared renaming hardware
[Pipeline figure: per-thread Fetch -> Decode -> OOO Execution Core -> Commit; the Thread Control Unit exchanges waiting instructions, condition info, thread start/kill signals, and speculation info (branches, waits) with the core]
Instruction Execution Order
- Transfer of values between registers

      Unsynchronized (race condition, forbidden):
        Thread 1: MOVI R1, 10      Thread 2: MOVI R2, R1

      Synchronized:
        Thread 1: MOVI R1, 10      Thread 2: WAIT 5
                  COND.SET 5                 MOVI R2, R1

- In-order renaming ensures correct execution order
Instruction Execution Order
- Transfer of values between memory instructions

      Thread 1: STORE [100]      Thread 2: WAIT 5
                COND.SET 5                 LOAD [100]

- Instructions are inserted into the MOB in order
- The MOB will not reorder a load and a store if their addresses conflict
- In the absence of speculation, OOO execution handles register-shared threads correctly!
Speculation: the Problem

No Inthreads:

      BEQ R1, R0, ...
      MOVE R3, R4

  MOVE is placed after BEQ in the ROB: it will not be committed before BEQ, and will be squashed by BEQ.

Synchronized with Inthreads:

      Thread 1: BEQ R1, R0, ...   Thread 2: WAIT 5
                SET 5                       MOVE R3, R4

  MOVE is in a different ROB than BEQ and can potentially be committed before BEQ. Needs a special mechanism to know when BEQ is speculative and when BEQ is squashed.

Unsynchronized:

      Thread 1: BEQ R1, R0, ...   Thread 2: MOVE R3, R4
                MOVI R4, 10

  This is a race for access to R4: forbidden in correct programs, but it can still happen due to speculation.
Handling Speculation: Synch
- Execute condition sets only when non-speculative
- Timestamps are assigned during Decode
[Thread Control Unit figure: per-thread lists of unresolved branches with timestamps (e.g. T0: [100] beq R1,R2; [117] bne R5,R0; TN: [105] j 405888), executed and mispredicted branches, a First Unresolved Branch pointer per thread, and pending thread-control instructions (e.g. T0: [095] SET 4; [108] SET 5)]
Handling Speculation: Synch, cont'd
- Synchronization is a barrier for speculation
- Performance hit: 0% speedup on the running example
  - Can be worse than non-threaded execution
- Need to issue cond.sets speculatively
- Extended speculation definition:
  - A WAIT is speculative if the SET which provided its condition is speculative
  - Any instruction is speculative if a preceding WAIT in its ROB is speculative or a preceding branch is yet unresolved
Speculative Synchronization
[Figure: a chain of speculative synchronizations across threads, e.g. one thread executes BEQ then SET 5; a second thread executes BNE, WAIT 5, BEQ, then SET 6; a third thread executes WAIT 6 — each WAIT inherits the speculation of the unresolved branches and SETs before it]
Handling Speculation: Synch, cont'd
- For each SET/WAIT instruction, for each inthread, keep the timestamp of the latest branch that affects this instruction
- Computing the timestamp affecting a SET: the MAX of
  - the latest branch timestamps of all the active WAITs of the same inthread
  - the latest timestamp of all the unresolved branches in the same inthread
- Computing the timestamp affecting a WAIT: the MAX of
  - the same as for a SET
  - the timestamp of the SET providing the condition
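The MAX rules above reduce to two small functions. This is an illustrative sketch of the bookkeeping only (names are ours, and -1 encodes "depends on no unresolved branch"); the hardware would compute these maxima over the Thread Control Unit's tables rather than over arrays.

```c
static int max2i(int a, int b) { return a > b ? a : b; }

/* Timestamp affecting a SET: youngest over the active WAITs of the same
 * inthread and the unresolved branches of the same inthread. */
int set_dep_timestamp(const int *wait_ts, int n_waits,
                      const int *branch_ts, int n_branches) {
    int t = -1;                          /* -1: non-speculative */
    for (int i = 0; i < n_waits; i++)    t = max2i(t, wait_ts[i]);
    for (int i = 0; i < n_branches; i++) t = max2i(t, branch_ts[i]);
    return t;
}

/* Timestamp affecting a WAIT: the same, plus the timestamp of the SET
 * that provided its condition. */
int wait_dep_timestamp(const int *wait_ts, int n_waits,
                       const int *branch_ts, int n_branches,
                       int providing_set_ts) {
    return max2i(set_dep_timestamp(wait_ts, n_waits, branch_ts, n_branches),
                 providing_set_ts);
}
```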
Handling Speculation: Synch, cont'd
[Thread Control Unit example figure: a table of in-flight SET/WAIT instructions (T0: SET 2; T3: WAIT 2; T3: SET 3; T1: WAIT 0; T2: WAIT 3) with per-thread latest-branch timestamps (e.g. 103, 114, 098), bit vectors of committed and computed conditions, First and Last Unresolved Branch pointers per thread, mispredicted/unresolved branches, and the WAITs squashed on misprediction]
Handling Speculation, cont'd
- Assuming no unsynchronized races:
  - Register values are transferred speculatively between inthreads
  - 63% speedup on the running example
- Assuming unsynchronized races:
  - Transferring registers non-speculatively; performance hit: only 10% speedup
  - Alternative: a recovery mechanism in the Thread Control Unit; expensive
- Ensuring no unsynchronized races needs compiler support
Conclusion
- The basic Inthreads microarchitecture is similar to SMT
  - Speculative execution needs special recovery mechanisms for inter-thread dependencies
- Inthreads compilation
  - Register allocation problems unique to Inthreads
  - Accurate code analysis could allow a simpler implementation
Questions?
BACKUP foils
Optimizations Correctness
- Some optimizations are not affected
  - Cross jumping, block reordering, ...
- Some optimizations just need the corrected (concurrent) data flow
  - Dead code elimination, constant propagation, loop invariant motion, ...
- Some optimizations require algorithmic fixes
  - Global common subexpression elimination, register allocation, ...
- A formal approach: Delay Set Analysis
  - Still under investigation
Conservative Concurrent Conflict Analysis

      Thread 1:            Thread 2:
      A = ...
      INTH_SET(2)  --->    INTH_WAIT(2)
      B = ...              C = ...
      INTH_SET(3)  --->    INTH_WAIT(3)
                           D = ...

- Conservative analysis: A and B conflict with both C and D
- Precise analysis: the synchronization ordering removes most conflicts (A precedes C and D; B precedes D), leaving far fewer edges
Register Allocation
- Possible register allocation solutions:
  - GCC's old separate global + local allocation: static register file partitioning
  - GCC's new (Chaitin-style) algorithm: automatic partitioning (global, uniform approach)
  - Optimal allocator (based on an integer linear solver): still under investigation
Future work
- Our compiler sees the code of all the threads. Can we do better?
- Automatic parallelization