Inthreads: Code Generation and Microarchitecture

Alex Gontmakher, Gregory Shklover, Assaf Schuster, Avi Mendelson

Outline
- Inthreads introduction
- Inthreads code generation
- Microarchitecture

Motivation
- Programs contain low-level parallelism
- It would be nice to be able to express it
- But: it is too complex for superscalar execution, and too low-level to be handled by conventional threads
- Solution: an extremely lightweight threading mechanism

Inthreads Architecture
- Processor resources are shared between threads, just like SMT
- Architectural state, including registers, is shared between threads, unlike SMT
- Lightweight thread context: OS-transparent
- Built-in synchronization
- Integration with processor hardware enables more efficient communication and synchronization

Inthreads Architecture: Thread Manipulation
- inth.start tid, addr: create a new thread
  - Only a fixed number of threads is supported
  - No scheduling is necessary for threads
- inth.halt: terminate the current thread
- inth.kill tid: terminate the given thread

Inthreads Architecture: Synchronization
- Condition registers C1, C2, ..., Cn act as binary semaphores
- inth.set Ci sets condition Ci and releases one waiting thread
- inth.clr Ci clears Ci
- inth.wait Ci waits for Ci to be set

Inthreads Programming Model
- Threads share the registers cooperatively
  - Thread-private variables use different registers in each thread
  - Shared variables must be allocated to the same register in all threads
- No race conditions allowed
  - Accesses must be protected by synchronization
  - Speculative races are still possible (compiler analysis?)
- No function calls while threads are active

Inthreads Code Generation: Compilation Flow
- Inthreads-C: C with explicit parallelization
- Flow: the programmer writes a .c source; the parallelizer produces Inthreads-C source; the Inthreads-C compiler produces the .o object

Inthreads-C Extension

    INTH_CLEAR(1);           /* initialize condition variable #1 */
    INTH_START(start2);      /* start the thread at label start2; must be a macro */
    #pragma inthread (1 of 2)
    {
        b1 = ...;
        c1 = ...;
        a1 = min(b1, c1);    /* do half of the job */
        INTH_WAIT(1);        /* wait till the second thread terminates */
        ret = min(a1, a2);   /* calculate the final value */
    }
    return ret;

    start2:
    #pragma inthread (2 of 2)
    {
        b2 = ...;
        c2 = ...;
        a2 = min(b2, c2);    /* do half of the job */
        INTH_SET(1);         /* notify the first thread */
        INTH_HALT();         /* terminate this thread */
    }

Inthreads-C Extension: Correctness Requirements

    INTH_CLEAR(1);
    INTH_START(start2);
    #pragma reg_range (1 of 2)
    {
        b1 = ...;
        c1 = ...;
        a1 = min(b1, c1);
        INTH_WAIT(1);
        ret = min(a1, a2);
    }
    return ret;

    start2:
    #pragma reg_range (2 of 2)
    {
        b2 = ...;
        c2 = ...;
        a2 = min(b2, c2);
        INTH_SET(1);
        INTH_HALT();
    }

- b1, c1, b2, c2: private values; must be allocated to different registers
- a2: shared value; must be allocated to the same register in both threads
- Special semantics: the value of a2 must be transferred correctly
  - INTH_WAIT is an Acquire for a2
  - INTH_SET is a Release for a2

Optimization Correctness
- Problem: basic compiler assumptions are broken
- Execution doesn't follow a single path at any given moment:

    Thread 1              Thread 2
    i1: A = 5             j1: A = 10
    i2: INTH_WAIT(1)      j2: INTH_SET(1)
    i3: A == ?

- Classic optimizations may produce incorrect code:
  - Common subexpression elimination
  - Loop-invariant motion
  - Dead code elimination

Internal Representation: (Concurrent) Control Flow Graph
- INTH_START is represented by a branch instruction with 100% probability on each edge
- INTH_SET instructions are connected to the corresponding INTH_WAIT instructions by synchronization edges
- INTH_HALT is represented by a dead-end block
(Figure: CFG of two threads forked by INTH_START — one does its job, executes INTH_WAIT, and continues; the other does its job, executes INTH_SET, then INTH_HALT.)

Concurrent Data Flow
- Concurrent data flow is propagated through the synchronization edges that connect each INTH_SET to the corresponding INTH_WAIT
- Example (three threads):

    Thread 1          Thread 2          Thread 3
    A = 5             INTH_WAIT(2)
    INTH_SET(2)       A = A + 1
    A == 5?           INTH_SET(3)       INTH_WAIT(3)
                                        B = A + 2

- A is marked as shared through the INTH_SET(), meaning: live, unknown value

Register Allocation
- The problem: make sure concurrently-live values are allocated to different registers at each execution point
- Method 1: manual register file partitioning
  - Allocate a private subset of registers to each thread
  - Allocate a subset of registers for shared variables
  - Use the partitioning during register assignment
  - Not user-friendly, not portable, sub-optimal
- Method 2: automatic register allocation
  - Uses a concurrent interference graph
  - Uniform approach guided by spill-cost heuristics

Register Allocation: Graph Coloring Algorithm
- Additional conflict edges are added to the interference graph to represent concurrently-live values
- Each node now has parallel and sequential conflicts
- Conservative: exact execution ordering is not calculated
(Figure: interference graph over nodes A-F.)

A Running Example (from SPEC2K MCF)

    for( ; arc < stop_arcs; arc += nr_group ) {
        if( arc->ident > BASIC ) {
            red_cost = bea_compute_red_cost( arc );
            if( bea_is_dual_infeasible( arc, red_cost ) ) {
                basket_size++;
                perm[basket_size]->a = arc;
                perm[basket_size]->cost = red_cost;
                perm[basket_size]->abs_cost = ABS(red_cost);
            }
        }
    }

Example Continued: Parallelization

Worker thread (the first of three; the res2/res3 threads are analogous):

    for( ; arc1 < stop_arcs; arc1 += nr_group*3 ) {
        if( arc1->ident > BASIC ) {
            red_cost1 = bea_compute_red_cost( arc1 );
            res1 = bea_is_dual_infeasible( arc1, red_cost1 );
            INTH_SET(1);
            INTH_WAIT(2);    /* receive destination from the main thread */
            if( res1 ) {
                perm[dest1]->a = arc1;
                perm[dest1]->cost = red_cost1;
                perm[dest1]->abs_cost = ABS(red_cost1);
            }
        } else {
            res1 = 0;
            INTH_SET(1);
            INTH_WAIT(2);
        }
    }

Main thread:

    for (;;) {
        INTH_WAIT(1); basket_size += res1; dest1 = basket_size; INTH_SET(2);
        INTH_WAIT(3); basket_size += res2; dest2 = basket_size; INTH_SET(4);
        INTH_WAIT(5); basket_size += res3; dest3 = basket_size; INTH_SET(6);
    }

Inthreads Implementation
- Basic processor structure
- Instruction execution ordering
- Handling speculation

Basic Processor Structure
- SMT plus extensions:
  - Same as SMT: concurrent fetch
  - Same as SMT: multiple ROBs
  - Different: shared architectural registers
  - Different: shared renaming hardware
(Figure: per-thread Fetch stages feeding Decode, the OOO execution core, and Commit, with a Thread Control Unit exchanging condition info, thread start/kill signals, and speculation info for branches and waits.)

Instruction Execution Order
- Transfer of values between registers:

    Thread 1          Thread 2
    MOVI R1, 10       MOV R2, R1      <- race condition (forbidden)

    Thread 1          Thread 2
    MOVI R1, 10
    COND.SET 5        WAIT 5
                      MOV R2, R1      <- synchronized: correct

- In-order renaming ensures correct execution order

Instruction Execution Order (cont'd)
- Transfer of values between memory instructions:

    Thread 1          Thread 2
    STORE [100]
    COND.SET 5        WAIT 5
                      LOAD [100]

- Instructions are inserted into the MOB in order
- The MOB will not reorder a load and a store if their addresses conflict
- In the absence of speculation, the OOO core executes register-shared threads correctly

Speculation: the Problem

No inthreads:

    BEQ R1, R0, ...
    MOVE R3, R4

  MOVE is placed after BEQ in the ROB: it will not be committed before BEQ, and will be squashed by BEQ.

Synchronized with inthreads:

    Thread 1            Thread 2
    BEQ R1, R0, ...
    SET 5               WAIT 5
                        MOVE R3, R4

  MOVE is in a different ROB than BEQ and can potentially be committed before BEQ. A special mechanism is needed to know when BEQ is speculative and when BEQ is squashed.

Unsynchronized:

    Thread 1            Thread 2
    MOVI R4, 10         MOVE R3, R4

  This is a race for access to R4: forbidden in correct programs, but it can still happen due to speculation.

Handling Speculation: Synchronization
- Execute condition sets only when non-speculative
- The Thread Control Unit tracks, per thread: executed branches, mispredicted branches, unresolved branches, and the first unresolved branch
- Thread control instructions (e.g. T0: [095] SET 4, T0: [108] SET 5) carry timestamps assigned during decode

Handling Speculation: Synchronization (cont'd)
- Synchronization is a barrier for speculation
  - Performance hit: 0% speedup on the running example
  - Can be worse than non-threaded execution
- Need to issue cond.sets speculatively
- Extended speculation definition:
  - A WAIT is speculative if the SET which provided its condition is speculative
  - Any instruction is speculative if a preceding WAIT in its ROB is speculative, or a preceding branch is still unresolved

Speculative Synchronization
(Figure: chains of branches (BEQ, BNE) followed by SET 5 and SET 6, whose speculation propagates across the synchronization edges to the matching WAIT 5 and WAIT 6 in the other thread.)

Handling Speculation: Synchronization (cont'd)
- For each SET/WAIT instruction, for each inthread, keep the timestamp of the latest branch that affects this instruction
- Timestamps affecting a SET: the maximum of
  - the latest branch timestamps of all the active WAITs of the same inthread
  - the latest timestamp of all the unresolved branches in the same inthread
- Timestamps affecting a WAIT: the maximum of
  - the same terms as for a SET
  - the timestamp of the SET providing the condition

Handling Speculation: Synchronization (cont'd)
(Figure: worked example of the timestamp tracking — mispredicted and unresolved branches per thread (first and last unresolved branch), squashed WAITs, and committed/computed condition bits for the instructions T0: SET 2, T3: WAIT 2, T3: SET 3, T1: WAIT 0, and T2: WAIT 3, with branch timestamps such as 103, 114, and 098.)

Handling Speculation (cont'd)
- Assuming no unsynchronized races:
  - Register values are transferred speculatively between inthreads
  - 63% speedup on the running example
- Assuming unsynchronized races:
  - Registers are transferred non-speculatively
  - Performance hit: only 10% speedup
  - A recovery mechanism, as in the Thread Control Unit, is expensive
- Ensuring no unsynchronized races requires compiler support

Conclusion
- The basic Inthreads microarchitecture is similar to SMT
- Speculative execution needs special recovery mechanisms for inter-thread dependencies
- Inthreads compilation: register allocation problems are unique to Inthreads
- Accurate code analysis could allow a simpler implementation

Questions?

Backup Foils

Optimization Correctness
- Some optimizations are not affected: cross-jumping, block reordering, ...
- Some optimizations just need the corrected (concurrent) data flow: dead code elimination, constant propagation, loop-invariant motion, ...
- Some optimizations require algorithmic fixes: global common subexpression elimination, register allocation, ...
- A formal approach: delay set analysis (still under investigation)

Conservative Concurrent Conflict Analysis
(Figure: two threads forked by INTH_START — one defines A before INTH_SET(2) and B before INTH_SET(3); the other defines C and D after INTH_WAIT(3). Conservative analysis treats all of A, B, C, D as conflicting; precise analysis distinguishes the pairs that are actually live in parallel.)

Register Allocation
- Possible register allocation solutions:
  - GCC's old separate global + local allocation: static register file partitioning
  - GCC's new (Chaitin-style) algorithm: automatic partitioning (global, uniform approach)
  - An optimal allocator (based on an integer linear solver): still under investigation

Future Work
- Our compiler sees the code of all the threads. Can we do better?
- Automatic parallelization