The Concurrent Collections (CnC) Parallel Programming Model. Kathleen Knobe Intel.

Size: px

Start display at page:

Download "The Concurrent Collections (CnC) Parallel Programming Model. Kathleen Knobe Intel."

Spencer Austin
6 years ago
Views:

1 The Concurrent Collections (CnC) Parallel Programming Model Kathleen Knobe Intel *Intel and the Intel logo are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners

2 Cholesky Performance Aparna Chandramowlishwaran Rich Vuduc (Georgia Tech) Intel 2-socket x 4-core 2.8 GHz + Intel MKL

3 Eigensolver Performance ~2x speedup over MKL Intel 2-socket x 4-core 2.8 GHz + Intel MKL

4 Why is Concurrent Collections interesting for Autotuners? Goal of autotuning: find best options within the constraints of the application Goal of Concurrent Collections: Separation of concerns domain-expert: semantic requirements only tuning-expert: maximal flexibility (minimizes constraints) for tuning What is a tuning-expert? The same person as the domain-expert at a different time A different person A naïve runtime A sophisticated static analyzer An autotuner Concurrent Collections minimizes the constraints for the autotuner 4

5 Intel Concurrent Collections The application problem The work of the domain expert Semantic correctness Constraints required by the application Goal: serious separation of concerns: The domain expert does not need to know about parallelism Concurrent Collections Spec The work of the tuning expert Architecture Actual parallelism Locality Overhead Load balancing Distribution among processors Scheduling within a processor The tuning expert does not need to know about the domain. Mapping to target platform 5

6 The problem Serial languages over-constrain orderings Require arbitrary serialization Allow for overwriting of data The decision of if and when to execute are bound together Parallel programming languages embedded within serial languages Inherit problems of serial languages Too specific wrt type of parallelism in the application type target architecture 6

7 Raise the level of the programming model just enough to avoid over-constraints Concurrent Collections (only semantically required constraints) explicitly serial languages (over-constrained) explicitly parallel languages (over-constrained) 7

8 Outline Language concepts Runtime Abstract execution model Actual systems 2 examples of flexibility Checkpointing Interesting target architectures 8

9 Outline Language concepts Runtime Examples of flexibility 9

10 The Big Idea Don t specify what operations run in parallel difficult and depends on target Specify only the semantic ordering requirements easier and depends only on application 10

11 Exactly two sources of ordering requirements Producer / Consumer (Data Dependence) Producer must execute before consumer Controller / Controllee (Control Dependence) Controller must execute before controllee 11

12 Notation White Board Textual Slideware Computation Step foo (foo) (foo) Data Item x [x] [x] Control Tag T <T> <T> 12

13 Exactly two sources of ordering requirements Producer / Consumer (Data Dependence) Producer must execute before consumer Controller / Controllee (Control Dependence) Controller must execute before controllee Producer - consumer Controller - controllee <t2> (step1) [item] (step2) (step1) (step2) 13

14 Tag collections: new concept Tags are: Generalization of iteration space Typically related to the semantics of application. (step1) <t2> (step2) Loop nests components of tag <t2> are k and j (Step2) loop k = loop j = loop i = Z(i, j, k) = = A(k, j) end end end Loop iteration space is a sequence <1,1,1>, <1,1,2>, The body of the loop has access to its indices As each tuple in the sequence is created, the associated body is executed. The related tag collection is a set {<1,1>, <1, 2>, } The step code has access to its tags The time the associated body is executed is yet to be determined. 14

15 Tag collections: new concept Tag collections are like iteration spaces but more general Not just loop indices Trees (e.g., recursion) Graphs (e.g., irregular mesh) Sets (employees, molecules, frames of images) Not sequential But rather unordered Named To facilitate reuse 15

16 Collections of dynamic instances Static <t> (s) (q) Dynamic <t: 1> <t: 2> <t: 3> <t: 4> (s: 1) (s: 2) (s: 3) (s: 4) (q: 1) (q: 2) (q: 3) (q: 4) 16

17 Collections of dynamic instances Static (s1) [i] (s2) Dynamic [ i: 6 ] (s1: 5 ) [ i: 7] (s2: 7) (s2: 4) [i: 13] ( s1: 12 ) [i: 4] ( s1: 3 ) [ i:5] (s2: 13) 17

18 An Application (simple in the extreme) Break up an input string - sequences of repeated single characters Filter allowing only - sequences of odd length Input string Sequences of repeated characters Filtered sequences aaaffqqqmmmmmmm aaa ff qqq mmmmmmm aaa qqq mmmmmmm 18

19 How people think about their application: The white board drawing What are the high level operations? What are the chunks of data? What are the producer/consumer relationships? What are the inputs and outputs? [input: j] (createspan: j) [span: j, s] (processspan: j, s) [results: j, s] [Input] = aaaffqqqmmmmmmm [span: 1] = aaa [span: 2] = ff [span: 3] = qqq [span: 4] = mmmmmmm [results: 1] = aaa [results: 3] = qqq [results: 4] = mmmmmmm 19

20 Make it precise enough to execute How do we distinguish among the instances? Here the tag values are arbitrary. Often the tags have semantic meaning in the application. <spantag: 1> <spantag: 2> <spantag: 3> <spantag: 4> <spantag: j, s> [input: j] (createspan: j) [span: j, s] (processspan: j, s) [results: j, s] [Input] = aaaffqqqmmmmmmm [span: 1] = aaa [span: 2] = ff [span: 3] = qqq [span: 4] = mmmmmmm [results: 1] = aaa [results: 3] = qqq [results: 4] = mmmmmmm 20

21 <spantag> CnC specification [span] (processspan) [results] <spantag: j,s> :: (processspan: j,s); [span:j,s] -> (processspan: j,s); (processspan: j,s) -> [result: j,s]; Original scalar routine string processspanorig(string instr) { //unchanged} Glue code Step processspan( Graph_t &g, Tag_t t) { string outstr = processspanorig(g.span.get(t)); if (strlen(outstr) > 0) g.result.put(t); } 21

22 Make it precise enough to execute How do we distinguish among the instances? <stringtag: 1> <stringtag: 2> <spantag: 1, 1> <spantag: 1, 2> <spantag: 1, 3> <spantag: 1, 4> <stringtag: j> <spantag: j, s> [input: j] (createspan: j) [span: j, s] (processspan: j, s) [results: j, s] [input: 1] = aaaffqqqmmmmmmm [input: 2] = rrhhhhxxx [span: 1, 1] = aaa [span: 1, 2] = ff [span: 1, 3] = qqq [span: 1, 4] = mmmmmmm [results: 1, 1] = aaa [results: 1, 3] = qqq [results: 1, 4] = mmmmmmm 22

23 No thinking about parallelism Only domain/application knowledge Result is: Parallel Deterministic (wrt results) Race-free 23

24 <K> Experience with graph design questions (Cell detector: K) [Histograms: K] [Cell candidates: K ] [Input Image: K] (Cell tracker: K) [Labeled cells initial: K] (Arbitrator initial: K) [State measurements: K] <Cell tags Initial: K, cell> (Correction filter: K, cell) [Motion corrections: K, cell] [Labeled cells final: K] (Arbitrator final: K) [Final states: K] <Cell tags Final: K, cell> (Prediction filter: K, cell) 24 [Predicted states: K, cell]

25 Outline Language concepts Runtime Examples of flexibility 25

26 Semantic model: instances acquire attributes When this tag is available When these items are available This step will execute (sometime) <T: 3, 28> <T: 3, 29> and this tag is available This tag becomes available [X: 2, 28] (FOO: 3, 28) [X: 3, 28] [X: 4, 28] [Y: 3 ] When this step executes this step becomes enabled 26

27 Execution model Red: prescribed Blue: inputs available Red&blue: enabled Yellow: executing Gray: done&dead 27

28 CnC supports not only different runtimes but a wide range of runtime styles HP TStreams memory grain distribution schedule distributed static static static HP TStreams distributed static static dynamic Intel CnC shared static dynamic dynamic * Rice CnC shared static dynamic dynamic GaTech CnC shared dynamic dynamic dynamic * soon: the ability to plug in an application-specific or a domain-specific scheduler 28

29 Outline Language concepts Runtime Examples of flexibility 29

30 Checkpoint: save execution frontier As objects become part of the execution frontier, we can save them on disk. We maintain an independent execution frontier on disk. On disk 30

31 Restart: from disk Assume the executing program dies. We can restart from the execution frontier on disk. On disk 31

32 Checkpoint/continue Checkpoint/restart implies full stop on failure Checkpoint/continue would continue around the failed component 32

33 Dynamic GPU/vector operations Red step collection Blue step collection All enabled steps Enabled Blue steps Enabled Red steps Single bucket Bucket per collection 33

34 ILP Red step collection Blue step collection Step fusion yields additional parallelism Red-blue step collection <b> <i, j> <b, i, j > Compiled independently for ILP For ILP, compiled together assuming independence between red and blue 34

35 ILP Pick an enabled blue step an enabled red step or one of each Enabled Blue steps Enabled Red steps 35

36 The Intel community The HP community DPD Shin Lee Steve Rose Leo Treggiari Ilya Cherny Frank Schlimbach Ganesh Rao Nikolay Kurtov SPI Kath Knobe Geoff Lowney Mark Hampton Ryan Newton Others Steve Lang John Pieper Carl Offner Alex Nelson The academic community Rice University Vivek Sarkar Zoran Budimlic Sagnak Tasirlar David Peixotto Georgia Tech Rich Vuduc Aparna Chandramowlishwaran Colorado State Michelle Strout Novosibirsk Nikolay Kurtov 36

37 whatif.intel.com 37

The Concurrent Collections (CnC) Parallel Programming Model Tutorial. Kathleen Knobe Intel Vivek Sarkar Rice University

The Concurrent Collections (CnC) Parallel Programming Model Tutorial Kathleen Knobe Intel Vivek Sarkar Rice University kath.knobe@intel.com, vsarkar@rice.edu Tutorial CnC 09 July 23, 2009 *Intel and the