Parallelism and runtimes

1 Parallelism and runtimes Advanced Course on Compilers Spring 2015 (III-V): Lecture 7 Vesa Hirvisalo ESG/CSE/Aalto

2 Today Parallel platforms Concurrency Consistency Examples of parallelism Regularity of data accesses Regularity of control flow Approaches to parallelism How we map control, data, and HW Runtimes and compiling Coordination and scalability Inter-procedural aspects 2/32

3 Parallel platforms 3/32

4 Some basics Basics (as for any concurrent execution) memory synchronization scheduling What is so hard here? software threads, communication, loops, etc. hardware threads, etc. Not just parallel execution, but statically several core-memory, core-core dependences races, concurrency control communication (typically buses + memory) These things do not scale well no silver bullet (all approaches have limitations) 4/32

5 Coordinated processing Coherence and store atomicity Strict coherence Store atomicity Plain coherence Consistency Sequential consistency Violations (e.g., speculative execution) Relaxations (with and without forwarding) Processor consistency, weak consistency Synchronization Hardware implementation Software implementation 5/32

6 Performance The motivation is performance that is often linked to scalability strong vs. weak scalability horizontal vs. vertical scalability Parallelism costs a lot of overhead work may be needed to make the parallelism happen The overhead typically includes managing the multiple threads communication among threads/processes synchronization Amdahl's law remember the serial part 6/32
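
The "serial part" reminder can be made concrete. A minimal sketch, assuming the usual statement of Amdahl's law, S(N) = 1 / ((1 - p) + p / N), where p is the parallelizable fraction of the run time and N the number of processing units. The function and parameter names are illustrative and do not come from the lecture.

```c
#include <assert.h>

/* Amdahl's law: upper bound on speedup when only part of the work is parallel.
 * parallel_fraction: share of the serial run time that can be parallelized (0..1).
 * workers: number of processing units (>= 1).
 * Both names are hypothetical, chosen for this sketch. */
double amdahl_speedup(double parallel_fraction, int workers)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(workers >= 1);
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / (double)workers);
}
```

With 90% of the work parallel and 10 workers the bound is about 5.26, and no number of workers pushes it past 10, because the 10% serial part remains.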

7 Speculation Often, speculation is needed. execution of something before knowledge of need with threads, SpMT Granularity affects a lot coarse-grained is with less overhead flexibility needed Memory issues coherence consistency Data and control exchange of data and control dependences 7/32

8 Consistency and coherence We must understand how memory works. the real memory is complex instead, we use an abstraction Consistency models are for the programmer. The basic problem is understanding memory access semantics. write (several times) read What value do we get (which write we do read)? Strict consistency any read to a memory location x returns the value stored by the most recent write operation to x This is often very impractical. real races? optimizations! 8/32

9 Consistency models Sequential consistency the result of any execution is the same as if the reads and writes occurred in some order the operations of each individual processor appear in this sequence in the order specified by its program Cache coherence distinction between local and global locally (sequentially) consistent view Processor consistency preserve order per processor Pipelined Random Access Memory Weak consistency division between synchronizing and non-sync. accesses synchronizing accesses sequentially consistent 9/32
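
The weak-consistency split between synchronizing and non-synchronizing accesses can be sketched with C11 atomics. This is only an analogy: the lecture does not name a language-level model, and the names payload, ready, publish, and try_consume are made up here. The flag accesses use memory_order_seq_cst to mirror "synchronizing accesses sequentially consistent"; the payload accesses are ordinary.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;          /* ordinary (non-synchronizing) data */
static atomic_bool ready;    /* synchronizing flag; static storage starts as false */

/* Producer side: write the data, then publish it through the flag. */
void publish(int value)
{
    payload = value;                                            /* ordinary access */
    atomic_store_explicit(&ready, true, memory_order_seq_cst);  /* synchronizing access */
}

/* Consumer side: returns true and fills *out only after the flag is seen.
 * Seeing the flag guarantees that the earlier payload write is visible. */
bool try_consume(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_seq_cst))    /* synchronizing access */
        return false;
    *out = payload;                                             /* ordinary access */
    return true;
}
```

One thread would call publish and another would poll try_consume. Only the flag needs the expensive ordering; the compiler and hardware stay free to optimize the ordinary accesses between synchronization points.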

10 Examples of parallelism 10/32

11 Regular code Consider
    for (i = 0; i < n; i++)
        C[i] = x * A[i] + B[2*i];
The code has regular data access (the strides are linear; having different step sizes does not matter) and regular control flow (no conditionals). 11/32

12 Irregular data accesses Consider
    for (i = 0; i < n; i++)
        E[C[i]] = D[A[i]] + B[i];
The code has irregular data access (dependent on other data; basically indirect accesses; the data is not known) and regular control flow (no conditionals). 12/32

13 Irregular control flow Consider
    for (i = 0; i < n; i++) {
        x = (A[i] > 0) ? y : z;
        C[i] = x * A[i] + B[i];
    }
The code has regular data access and irregular control flow (there is a conditional). 13/32

14 Simple irregular code Consider
    for (i = 0; i < n; i++)
        if (A[i] > 0)
            C[i] = x * A[i] + B[i];
The code has irregular control flow (there is a conditional) and irregular data access (the conditional affects the striding; the strides are punctuated). 14/32

15 Code with complex irregularities Consider
    for (i = 0; i < n; i++) {
        C[i] = false;
        j = 0;
        while (!C[i] && (j < m))
            if (A[i] == B[j++])
                C[i] = true;
    }
The code has irregular control flow and irregular data access; these two are interdependent. 15/32

16 MIMD code For the simple irregular code
        div m, n, nthr
        mul t, m, tidx
        add a_ptr, t
        add b_ptr, t
        add c_ptr, t
        sub t, nthr, 1
        br.neq t, tidx, ex
        rem m, n, nthr
    ex:
        load x, x_ptr
    loop:
        load a, a_ptr
        br.eq a, 0, done
        load b, b_ptr
        mul t, x, a
        add c, t, b
        store c, c_ptr
    done:
        add a_ptr, 1
        add b_ptr, 1
        add c_ptr, 1
        sub m, 1
        br.neq m, 0, loop
16/32
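
The prologue of the MIMD listing computes which block of the arrays each thread owns. A minimal C sketch of one thread's work under that block partitioning: tidx and nthr follow the listing, the function name is hypothetical, and giving the last thread its block plus the leftover elements is an assumption of this sketch, not a reading of the listing.

```c
#include <assert.h>
#include <stddef.h>

/* Work of one MIMD thread: a contiguous block of the simple irregular loop.
 * tidx: this thread's index, nthr: total number of threads. */
void irregular_block(const int *A, const int *B, int *C, int x,
                     size_t n, size_t tidx, size_t nthr)
{
    assert(nthr > 0 && tidx < nthr);
    size_t m = n / nthr;          /* elements per thread */
    size_t start = m * tidx;      /* first element of this thread's block */
    if (tidx == nthr - 1)
        m += n % nthr;            /* assumption: last thread also takes the leftovers */
    for (size_t i = start; i < start + m; i++)
        if (A[i] > 0)             /* each thread branches on its own data */
            C[i] = x * A[i] + B[i];
}
```

Every thread runs the same function with a different tidx. Because each thread has its own control flow, the conditional costs no more than it does in scalar code.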

17 Vector-SIMD code For the simple irregular code (note the vector instructions)
        load x, x_ptr
    loop:
        setvl vlen, n
        load.v VA, a_ptr
        load.v VB, b_ptr
        cmp.gt.v VF, VA, 0
        mul.sv VT, x, VA, VF
        add.vv VC, VT, VB, VF
        store.v VC, c_ptr, VF
        add a_ptr, vlen
        add b_ptr, vlen
        add c_ptr, vlen
        sub n, vlen
        br.neq n, 0, loop
17/32
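
The vector listing combines strip-mining (setvl) with predication (the flag register VF). A scalar C emulation of that structure may help in reading it. MAX_VLEN and the function name are hypothetical, and the inner loops stand in for single vector instructions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VLEN 4   /* hypothetical hardware vector length */

/* Strip-mined, predicated form of the simple irregular loop. */
void irregular_vector_style(const int *A, const int *B, int *C, int x, size_t n)
{
    assert(A && B && C);
    while (n != 0) {
        /* setvl: handle at most MAX_VLEN elements in this strip */
        size_t vlen = n < MAX_VLEN ? n : MAX_VLEN;
        int flag[MAX_VLEN];                    /* plays the role of VF */
        for (size_t k = 0; k < vlen; k++)      /* cmp.gt.v */
            flag[k] = A[k] > 0;
        for (size_t k = 0; k < vlen; k++)      /* masked mul.sv, add.vv, store.v */
            if (flag[k])
                C[k] = x * A[k] + B[k];
        A += vlen; B += vlen; C += vlen;       /* advance the pointers */
        n -= vlen;
    }
}
```

The conditional no longer changes the control flow of the loop. It only decides which lanes take effect, which is how irregular control is turned into masked data operations.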

18 SIMT code For the simple irregular code (note that there is no loop)
        br.gte tidx, n, done
        add a_ptr, tidx
        load a, a_ptr
        br.eq a, 0, done
        add b_ptr, tidx
        add c_ptr, tidx
        load x, x_ptr
        load b, b_ptr
        mul t, x, a
        add c, t, b
        store c, c_ptr
    done:
18/32

19 Approaches to compilation 19/32

20 Level and forms of parallelism Several levels for parallelism instruction-level parallelism often the same as pipelining thread-level parallelism Threads can differ traditional threads microthreads close to vectorization Synchronization between threads between tasks 20/32

21 Multiple threads Multithreading multiple threads shared memory Multiprocessing multiple processes distributed memory In practice, sadly, these terms are often used interchangeably! compiler technology: mostly the former note that other models exist we will review the latter shortly In any case we must synchronize! 21/32

22 Structuring the threads Often, the Fork/Join model is used. each program begins with a single thread new threads are forked, when a parallel region is reached threads are joined, when the parallel region ends Note in principle, the child threads are terminated in practice, the child threads often continue Why overhead costs of thread creation and termination Note the scheduling issues gang scheduling 22/32
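
A fork/join region can be sketched with OpenMP. This choice is an assumption, since the slide does not name a programming interface, and the function name is hypothetical. Built with -fopenmp (GCC/Clang) the loop runs on a team of threads; without it the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>

/* One parallel region in the fork/join style. */
void scale_add(const double *A, const double *B, double *C, double x, int n)
{
    assert(n >= 0);
    /* fork: the runtime hands chunks of the iteration space to a team of threads */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        C[i] = x * A[i] + B[i];
    /* join: implicit barrier here; only the initial thread continues */
}
```

OpenMP runtimes commonly keep the worker threads in a pool between parallel regions instead of destroying them. This matches the slide's remark that child threads often continue in practice, to avoid creation and termination overhead.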

23 Synchronization There are several synchronization mechanisms memory sharing channel sharing Essential who waits and how different semantics and implementation hardware involvement scheduling and memory Connection to dependences they are the cause static vs. dynamic 23/32

24 Compilation and mappings In parallel processing, we typically have control (i.e., the code running) data (i.e., accesses made by the code) multiple processing units memory Mapping our problem is often finding mappings between these usually there are restrictions Approaches Occupancy-based compilation not allowing the control to diverge Dependency-based compilation not allowing the data to diverge 24/32

25 Runtimes and compiling 25/32

26 Traditional coordination and scalability Coordination Communication Synchronization Traditionally, coherency and consistency are present E.g., micro-architecture-level speculation We have the semantics at the ISA level We can (locally!) check against that E.g., SMP systems We can interrupt a thread Threads are "co-operating", not "co-performing" The memory is coherent and consistent (wrt the task) MOESI (etc.) in multicores Partial coherency and consistency => a lot of trouble Manycores 26/32

27 Performance-oriented coordination Latency hiding is the key here We do coordination in a way that it hides the latency Effective computation goes on despite waits Examples Speculation In pipelining we start an operation before its data is there In transactional memory (TM) we roll back if we fail Cache Is a parallel structure (we check the tags in parallel) Prefetching and speculating (keeping data close to core) I/O wait We execute some other thread while waiting for a device Memory wait The core is fed some other thread Note that coherency and consistency are the key here Also the reason why TM is very hard for manycores 27/32

28 Platform support The classical example POSIX and compiler runtime on top of that (threads) Coordination Synchronization Communication Resource management Memory management Resource registration and partitioning Task management Task life cycle (there can be a lot!) Task placement, priorities, and dependencies Scheduling (toward HW assisted scheduling) 28/32

29 Compiler support Programs spend their time in repeatedly executed parts Only some loop structures are suitable for modern architectures Loops need to be re-structured Loop restructuring is hard A lot of complex tools (and theory) are needed Many transformations are based on separating Iterations Schedules Syntactic representations Statements can be represented independently of their location and control Typically supported by an optimization framework Dependences are essential 29/32
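
Separating iterations from their schedule can be illustrated with loop tiling. In this sketch both functions execute the same set of iterations; only the order in which they are visited differs. The transformation is legal because no iteration depends on another. TILE and the function names are hypothetical, and the example is not taken from the lecture.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4   /* hypothetical tile size; a real value depends on the cache */

/* Reference schedule: out[j][i] = in[i][j] for an n x n row-major matrix. */
void transpose_naive(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            out[j * n + i] = in[i * n + j];
}

/* Same iterations, different schedule: visit the iteration space tile by tile. */
void transpose_tiled(const int *in, int *out, size_t n)
{
    assert(in != out);
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t i = ii; i < ii + TILE && i < n; i++)
                for (size_t j = jj; j < jj + TILE && j < n; j++)
                    out[j * n + i] = in[i * n + j];
}
```

A compiler can only pick the tiled schedule after dependence analysis has shown that reordering preserves the result. This is why dependences are essential.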

30 Static analysis of calls Modern software consists of multiple compilation units a lot of small fragments (procedures, subroutines, functions, methods, operators,...) many of them are dynamic frequent call/return to a dynamic target The above makes control-flow analysis hard. We should statically understand the calling structure. As our analysis is static and the calls are dynamic, there are limits to this. However, if the code is static (e.g., no new subroutines loaded and linked), the problem is related to the code structure. 30/32

31 Caller-callee interaction Programs handle dynamic objects. We have analyses like pointer analysis shape analysis etc. Without such, efficient code is lost: what if on every line there is a call to something embedding a pointer to anything? if we have no understanding of the semantics, we cannot do any optimization In addition to understanding the call-return flow, it is important to understand what happens during the callee activations and use this information in the analysis of the caller. 31/32

32 Approaching interprocedural analysis We have two basic issues understanding the flow adding call-return to our understanding of control flow understanding the effects how the caller affects the callee how the callee affects the caller flow and context sensitivity Control-flow super graph (CFSG) is one way to solve, but region-based analysis yields faster solutions with context information included. call graphs and call strings lattices of transfer functions instead of values closure of meet repetition is needed summary analysis (forget the flow) 32/32
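
One simple instance of summary analysis is a flow-insensitive "may modify" summary propagated over the call graph until nothing changes. The sketch below is an illustration under assumptions, not the lecture's algorithm: the procedure count, the bit-set encoding of globals, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NPROC 4   /* hypothetical program with four procedures */

/* calls[p][q] is true when procedure p contains a call to q.
 * local_mod[p] is a bit set of the globals that p writes directly.
 * On return, mod[p] also covers everything its callees may write. */
void summarize_mod(bool calls[NPROC][NPROC],
                   const unsigned local_mod[NPROC],
                   unsigned mod[NPROC])
{
    assert(calls && local_mod && mod);
    for (int p = 0; p < NPROC; p++)
        mod[p] = local_mod[p];

    bool changed = true;
    while (changed) {                  /* repeat until a fixed point is reached */
        changed = false;
        for (int p = 0; p < NPROC; p++)
            for (int q = 0; q < NPROC; q++)
                if (calls[p][q]) {
                    unsigned merged = mod[p] | mod[q];   /* merge = set union */
                    if (merged != mod[p]) {
                        mod[p] = merged;
                        changed = true;
                    }
                }
    }
}
```

The iteration terminates because the bit sets only grow and are bounded, and it handles recursion without special cases. The price of forgetting the flow is precision: the summary says what a call may write, not when or under which conditions.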
