UPCRC. Illiac. Gigascale System Research Center. Petascale computing. Cloud Computing Testbed (CCT)


Silas Jack Stevens
5 years ago

2 Illiac; UPCRC; Petascale computing; Gigascale System Research Center; Cloud Computing Testbed (CCT)

3 Multi-Core: All Computers Are Now Parallel. We continue to have more transistors per chip (Moore's Law). We cannot continue increasing clock frequency (power). We cannot continue increasing single-thread performance (diminishing returns on added circuitry). Transistors are instead used to populate chips with an increasing number of cores.

4 The Computer Economy. Old game: each new hardware generation provides a better user experience (performance) with little or no change in software; people buy a new laptop every 3 years (or less), which is good for Intel and Microsoft. New game: each new chip generation provides a better experience only for applications that run in parallel and scale. Goal: a better user experience with an increased number of cores, with little or no code rewrite.

5 The Consequences of Failure. A very different business model for the IT industry. A slowdown.

6 Parallel Software is Hard. Prone to subtle, hard-to-reproduce bugs. Much more complex to test. More complex mapping of applications to hardware. Immature development environments. Lack of trained manpower.

7 Two Hypotheses. A: The development of parallel software is inherently hard (it is hard to think in parallel). B: The development of parallel software need not be (much) harder than the development of sequential software; it is currently hampered by a lack of good programming models, tools, education, etc., which can be cured by suitable investments.

8 Arguments for B. Some forms of parallel programming are easy. Most current parallel programming environments were developed to support hard forms of parallel programming (e.g., system code or performance programming): complex interactions; detailed, low-level resource management; machine-dependent code. Technologies (and $$) exist to do better.

9 (Some) Parallel Programming is Child's Play: Etoys. Shared-nothing programming style: a set of independent objects, each with its own program and local state; no shared state. An object updates its own state and can read the state of other objects. Global clock. Simple interaction model. Deterministic execution.

10 What is Determinism? Given a sequence of inputs, all executions of the program will have the same perceived behavior. It depends on what "same" is. It depends on what "perception" is.

11 What Is "Same"? Same performance? Same ResourceException? Can we assume addition is associative and commutative? "Same": an equivalence relation on executions.

12 What Can We Observe? Outputs. Non-recovered exceptions. When debugging, program execution state (assumes an operational semantic model).


13 Formalism. (Diagram: operations write b, read b, read a, write a, related by program order and by conflicts.) A program is deterministic if program order orders all conflicts. Deterministic programs have sequential operational semantics.

14 Race Free is not Enough. One thread: lock(l); x++; unlock(l); Another thread: lock(l); y=x; unlock(l); Deterministic = race free (all conflicting operations are synchronized) + ordering of synchronizations.
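The two critical sections on the slide above can be replayed in Python as a minimal sketch. Every access to x is protected by the lock, so there is no data race, yet the value of y depends on which thread acquires the lock first. The `order` parameter and the `Event` used to force each schedule are assumptions added for illustration; they are not part of the original slides.

```python
import threading

def run(order):
    """Run the slide's two critical sections with a forced lock order.

    `order` ("inc_first" or "read_first") is a hypothetical knob standing in
    for the scheduler's choice; a real program gets no such control.
    """
    state = {"x": 0, "y": None}
    l = threading.Lock()
    first_done = threading.Event()

    def increment():            # lock(l); x++; unlock(l);
        with l:
            state["x"] += 1

    def read():                 # lock(l); y = x; unlock(l);
        with l:
            state["y"] = state["x"]

    first, second = (increment, read) if order == "inc_first" else (read, increment)

    def go_first():
        first()
        first_done.set()

    def go_second():
        first_done.wait()       # forces the synchronization order
        second()

    threads = [threading.Thread(target=go_second), threading.Thread(target=go_first)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["y"]

print(run("inc_first"), run("read_first"))
```

Both schedules are race free, but they leave different values in y (1 and 0), which is why the slide adds an ordering requirement on top of race freedom.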

15 Why is Determinism Good? Easy to test: only one execution path. Easy to understand: execution is equivalent to sequential execution. Easy to debug. Easy to incrementally parallelize code. Can use current tools and methodologies for program development.

16 Do We Need Nondeterminism? Reactive code reacts to external events (OLTP, OS, GUI): nondeterminism is inherent; inputs are not sequential. How about transformational code? Machine-dependent code. Randomized algorithms.

17 Reduction. Sequential: a b c d e f → abcdef. Parallel: a b c → abc and d e f → def, then abc def → abcdef. Same set of issues as for optimizing compilers and run-time compilation.
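The regrouping on the reduction slide can be sketched as follows. The function names are illustrative, and the input is assumed to be non-empty. String concatenation is associative, so the tree-shaped reduction matches the sequential one; floating-point addition is not, which is the kind of issue optimizing compilers face when they reorder arithmetic.

```python
from functools import reduce
import operator

def sequential_reduce(op, items):
    """Left-to-right reduction: ((a op b) op c) op ..."""
    return reduce(op, items)

def tree_reduce(op, items):
    """Split in half, reduce each half, combine; the halves could run in parallel."""
    if len(items) == 1:
        return items[0]
    mid = len(items) // 2
    return op(tree_reduce(op, items[:mid]), tree_reduce(op, items[mid:]))

letters = list("abcdef")
# Associative operator: both groupings give the same string.
print(sequential_reduce(operator.add, letters), tree_reduce(operator.add, letters))

floats = [0.1, 0.2, 0.3]
# Floating-point addition: (0.1 + 0.2) + 0.3 differs from 0.1 + (0.2 + 0.3).
print(sequential_reduce(operator.add, floats) == tree_reduce(operator.add, floats))
```

Whether the two float results count as "the same" depends on the equivalence relation chosen for executions, as discussed on the earlier slides.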

18 Linked List Reduction. a b c d → abcd. Easy if nodes are stored in contiguous locations: a b c d → ab cd → abcd. Hard if nodes are not sorted.

19 Randomized Linked List Reduction. a b c d → abcd. Randomly pick half of the nodes. Break ties (adjacent nodes) by coin tossing. Each phase reduces the number of nodes by ~1/4; optimal within a (small) constant factor.
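One common way to realize the coin-tossing scheme above is the "random mate" technique, sketched here under assumptions: the list is stored as parallel arrays `vals` and `nxt`, every array slot belongs to the list, and a node absorbs its successor only when it tossed heads and the successor tossed tails. The slides do not give the exact rule, so this is an illustration rather than the presenter's algorithm.

```python
import random

def randomized_list_reduce(vals, nxt, head, op, rng):
    """Reduce a linked list held in arrays; nxt[i] is the successor index or None.

    Returns (reduced value, number of phases). Inputs are copied, not mutated.
    """
    vals, nxt = list(vals), list(nxt)
    live = set(range(len(vals)))
    phases = 0
    while nxt[head] is not None:
        heads = {i: rng.random() < 0.5 for i in live}
        # Heads followed by tails: no two adjacent nodes are both picked,
        # so the updates below touch disjoint nodes and could run in parallel.
        picked = [p for p in live
                  if heads[p] and nxt[p] is not None and not heads[nxt[p]]]
        for p in picked:
            q = nxt[p]
            vals[p] = op(vals[p], vals[q])   # absorb the successor's value
            nxt[p] = nxt[q]                  # splice the successor out
            live.discard(q)
        phases += 1
    return vals[head], phases

# Nodes stored out of order: a(1) -> b(3) -> c(0) -> d(2)
vals = ["c", "a", "d", "b"]
nxt = [2, 3, None, 0]
result, phases = randomized_list_reduce(vals, nxt, 1, lambda x, y: x + y, random.Random(0))
print(result)
```

The coins change how many phases are needed, not the reduced value, provided the operator is associative.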

20 Deterministic Linked List Reduction. A sequence of STOC/FOCS papers by Cole & Vishkin derives a deterministic, logarithmic, work-optimal algorithm. The algorithm is complex: asymptotically optimal but not practical. There is no proof that nondeterminism (or randomization) is necessary in parallel (transformational) computations, but it seems to make life easier in some cases (parallel graph algorithms, parallel optimization).

21 Goal. A shared-memory language that is deterministic by design and by default: unordered conflicts are detected at compile time if possible, at run time otherwise. Nondeterministic behavior has to be introduced explicitly, using nondeterministic control constructs, and is disciplined.

22 Hidden Parallelism (1). Use a conventional sequential language; let the compiler and run time introduce parallelism in a safe manner. Parallelizing compilers have had limited success; in particular, they are brittle. The user has no parallel performance model.

23 Hidden Parallelism (2). Use a functional programming language: copying and inefficient use of memory bandwidth; far from established practice. Use data parallelism (e.g., vector operations): good, but not enough; control parallelism is also needed.

24 Hidden Parallelism (3). Use annotations or semantically neutral syntax to declare the intended programming model: (1) for i = [lb..ub] loop_body (2) forall i = [lb..ub] loop_body. Both have the same semantics; loop-carried dependencies are allowed in the first case and disallowed in the second case (an exception is generated if a dependency exists).
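The contract of forall can be illustrated with a small run-time check. This is a sketch under assumptions: `Tracked`, `forall`, and `LoopCarriedDependence` are hypothetical names, the loop runs in reverse order only to show that any order must be legal, and the check happens after the loop, so the array may already be modified when the exception is raised.

```python
class LoopCarriedDependence(Exception):
    pass

class Tracked:
    """List wrapper recording which iteration accessed each index."""
    def __init__(self, data):
        self.data = list(data)
        self.readers = {}    # index -> iterations that read it
        self.writers = {}    # index -> iterations that wrote it
        self.current = None  # iteration currently executing

    def __getitem__(self, k):
        self.readers.setdefault(k, set()).add(self.current)
        return self.data[k]

    def __setitem__(self, k, v):
        self.writers.setdefault(k, set()).add(self.current)
        self.data[k] = v

def forall(lb, ub, body, arr):
    """Run body(i, arr) for i in [lb..ub]; raise if iterations conflict."""
    for i in reversed(range(lb, ub + 1)):   # any order must give the same result
        arr.current = i
        body(i, arr)
    for k, ws in arr.writers.items():
        accessors = ws | arr.readers.get(k, set())
        if len(accessors) > 1:              # written by one iteration, touched by another
            raise LoopCarriedDependence(f"index {k} accessed by iterations {sorted(accessors)}")

def double(i, a):        # independent iterations
    a[i] = a[i] * 2

def prefix_sum(i, a):    # iteration i depends on iteration i - 1
    a[i] = a[i - 1] + a[i]

ok = Tracked([1, 2, 3, 4])
forall(0, 3, double, ok)
print(ok.data)
```

Calling `forall(1, 3, prefix_sum, Tracked([1, 2, 3, 4]))` raises `LoopCarriedDependence`, whereas the same body in an ordinary `for` loop is legal.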

25 Run-Time Detection (1). Using thread-level speculation: iterates execute in parallel speculatively. Variables written are kept in cache; variables read are marked. Commit protocol: checks that no variable written by one thread was accessed by another thread during speculative execution. Can be implemented efficiently in hardware for short threads running concurrently on distinct cores [,Torrellas 2006].
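A software analogue of the commit protocol can make the idea concrete. This is a simplified sketch, not the hardware scheme cited on the slide: all iterations first run against the same snapshot with private write buffers, then commit in iteration order, and an iteration that read a location written by an earlier iteration is squashed and re-executed. The names `execute` and `speculative_loop` and the dictionary used as memory are assumptions.

```python
def execute(i, body, state):
    """Run one iteration against `state`, buffering writes and logging reads."""
    reads, writes = set(), {}

    def read(k):
        if k in writes:          # an iteration sees its own buffered writes
            return writes[k]
        reads.add(k)
        return state[k]

    def write(k, v):
        writes[k] = v            # kept private until commit

    body(i, read, write)
    return reads, writes

def speculative_loop(n, body, memory):
    """Execute iterations 0..n-1 with sequential semantics; return squash count."""
    snapshot = dict(memory)
    logs = [execute(i, body, snapshot) for i in range(n)]   # speculative phase
    dirty, squashed = set(), 0
    for i, (reads, writes) in enumerate(logs):              # in-order commit
        if reads & dirty:
            # The iteration read a stale value: redo it on the committed state.
            reads, writes = execute(i, body, memory)
            squashed += 1
        memory.update(writes)
        dirty.update(writes)
    return squashed

def prefix_sum(i, read, write):
    if i > 0:
        write(i, read(i - 1) + read(i))

mem = {0: 1, 1: 2, 2: 3, 3: 4}
print(speculative_loop(4, prefix_sum, mem), mem)
```

For the prefix-sum body, iterations 2 and 3 are squashed and the final memory matches sequential execution; a body whose iterations are independent commits with no squashes.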

26 Run-Time Detection (2). Good for ensuring that a specific parallel execution does not violate sequential semantics. Not good enough to ensure that no parallel execution will ever violate sequential semantics (i.e., that iterates are independent).

27 Compile-Time Detection. Hard for irregular data structures, dynamic partitions, etc. Possible approach: allow the user to annotate the program with type and effect annotations that restrict what can be accessed or updated by a task. This facilitates compiler analysis (restricts or eliminates run-time checks) and lets the user express implicit knowledge. Deterministic Parallel Java (DPJ) [Bocchino & Adve].
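The effect-annotation idea can be emulated in a few lines. This sketch is not DPJ syntax and does not use any DPJ API: each task declares, ahead of time, the named regions it reads and writes, and a checker rejects any pair of tasks where one writes a region the other touches. All names here (`interferes`, `check_noninterference`, the region labels) are hypothetical.

```python
from itertools import combinations

class Interference(Exception):
    pass

def interferes(a, b):
    """Two effect summaries interfere if one writes a region the other touches."""
    return bool(a["writes"] & (b["reads"] | b["writes"]) or b["writes"] & a["reads"])

def check_noninterference(tasks):
    """tasks maps a task name to {"reads": set, "writes": set} of region names."""
    for (n1, e1), (n2, e2) in combinations(tasks.items(), 2):
        if interferes(e1, e2):
            raise Interference(f"{n1} and {n2} touch a common region")
    return True

# Two tasks that share read-only data and write disjoint regions.
safe = {
    "left":  {"reads": {"Root"}, "writes": {"L"}},
    "right": {"reads": {"Root"}, "writes": {"R"}},
}
print(check_noninterference(safe))
```

Because the check looks only at declared effects, it runs before any task does; a compiler performing the equivalent analysis on annotated code would additionally verify that each task's body stays within its declared effects.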


36 Disciplined Nondeterminism: Linked List Reduction. repeat { pick p in List where p.next != null; p.val += p.next.val; p.next = p.next.next; } until pick fails;
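The repeat/pick loop above can be emulated sequentially. In this sketch, `Node`, `build`, and `reduce_list` are illustrative names, the list is assumed to be non-empty, "List" is interpreted as the nodes still reachable from the head, and a seeded random generator stands in for the nondeterministic choice.

```python
import random

class Node:
    def __init__(self, val, next=None):
        self.val = val
        self.next = next

def build(values):
    """Build a singly linked list from a non-empty sequence; return its head."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head

def reduce_list(head, rng):
    """Collapse the list into its head by repeatedly picking an arbitrary node."""
    while True:
        # "pick p in List where p.next != null": collect reachable nodes
        # that still have a successor.
        candidates = []
        p = head
        while p is not None:
            if p.next is not None:
                candidates.append(p)
            p = p.next
        if not candidates:           # "until pick fails"
            return head.val
        p = rng.choice(candidates)   # the nondeterministic choice
        p.val += p.next.val
        p.next = p.next.next

print(reduce_list(build(list("abcd")), random.Random(0)))
```

Every choice sequence yields the same reduced value as long as the combining operator is associative, which is what makes this form of nondeterminism disciplined: the order of picks varies, the observable result does not.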

37 Nondeterministic Iterator. Equivalent to sequential code. The construct precisely defines the possible serializations. Analysis indicates that the loop can proceed concurrently on nonadjacent nodes (or proceed speculatively with any number of nodes [Galois, Pingali]).
