Architecture and Programming Model for High Performance Interactive Computation

Size: px

Start display at page:

Download "Architecture and Programming Model for High Performance Interactive Computation"

Virgil Mills
5 years ago
Views:

1 Architecture and Programming Model for High Performance Interactive Computation Based on Title of TM127, CAPSL TM-127 By Jack B. Dennis, Arvind, Guang R. Gao, Xiaoming Li and Lian-Ping Wang April 3rd, 2014 Haitao Wei CAPSL at UDEL CPEG-852:Advanced Topics in Computing Systems 4/22/14

2 Outline Introduction to DDDAS/Interaction Computation An Example and Problems Fresh Breeze Execution Model and Architecture Execution Model Memory Model Task Model Streaming and Transactions Stream Type and Operations Concurrency Operations of Transaction Style Conclusion 1

3 Outline Introduction to DDDAS/Interaction Computation An Example and Problems Fresh Breeze Execution Model and Architecture Execution Model Memory Model Task Model Streaming and Transactions Stream Type and Operations Concurrency Operations of Transaction Style Conclusion 2

4 An Example of DDDAS/Interaction Computation Radio Astronomy Receipt signals Antenna Array Antenna Array Filter signals and control Local Processor Local Processor Analyze signals Data Analysis Observer Make decision and change parameters 3

5 Dynamic Data Driven Application System (DDDAS) Challenges real time interaction with parts of the physical environment. management of processing and memory resources according to dynamic needs generated by local events input and output devices process streams of data items make decisions about the work using transaction processing 4

6 Our Solutions: Programming Model and Architecture Support Fresh Breeze Execution Model and Architecture based on codelet execution model support fine-grained execution and memory management Streaming support streaming data expression and operations Transaction support concurrency operations of transaction style 5

7 Outline Introduction to DDDAS/Interaction Computation An Example and Problems Fresh Breeze Execution Model and Architecture Execution Model Memory Model Task Model Streaming and Transactions Stream Type and Operations Concurrency Operations of Transaction Style Conclusion 6

8 CPU CPU Memory Memory Thread Unit Executor Locus A Single Thread Thread Unit Executor Locus A Thread Pool Coarse-Grain thread- The family home model Fine-Grain non-preemptive thread- The hotel model Coarse-Grain vs. Fine-Grain Multithreading 7

9 Case Studies of Fine-Gran Execution Models Dataflow Model (1970s - ) EARTH Model ( ) HTVM Model ( ) Fresh Breeze Model (2000 -) Runnemede Model (2010- ) 8

Model Dictate the ordering of memory operations Synchronization

10 Fresh Breeze Execution Model Task Model A set of rules for creating, destroying and managing threads Execution Model Memory Model Dictate the ordering of memory operations Synchronization Model Provide a set of mechanisms to protect from data races The Abstract Machine 9

11 Fresh Breeze Memory Model -- Main Features and Vision Global shared name space with one-level store A single-update storage model to eliminate the cache-coherence problem Concept of sealed memory chunks/sections with single assigned property Trees of fixed-sized chuncks Fine-Grain memory management support memory allocation and data transfer is performed entirely by architecture/hardware mechanisms[nd 10 security

12 Fresh Breeze Memory Model Master Chunk Data Chunk data item DAG heap chunk handle Arrays as Trees of Chunks Write Once then Read only Fix chunk size: 128 Bytes: 16 doubles, 32 integers, Chunk handle: 64 bits unique identifier Arrays: Three levels yields 4096 elements(longs) 11

13 Task/Concurrency Model Master Task Spawned Tasks spawn(n) Worker 0 Worker n-1 join_update join_update join_fetch Join ContinuationTask - Asynchronous tasking - Continuation Task receives children s results - Non-blocking continuation - Light-Weight Tasks 12

A0~A15 Data Chunk 16 elements Build Build

14 Example Dot Product sum=0; for(i=0;i<16*16*16;i++) sum+=a[i]*b[i]; Sync Chunk Build Vector Build Vector Build Vector Step1: Build Vector Join-update Build Chunk A0~A15 Data Chunk 16 elements Build Build Chunk Chunk A240~A255 Build Chunk Build Done Build Done Continue Codelet 13

15 Example Dot Product sum=0; for(i=0;i<16*16*16;i++) sum+=a[i]*b[i]; partial sum sum Traverse Vector partial sum Traverse Vector Traverse Vector Step2: Compute Comput e Comput Comput e e Comput e A0~A15 sum B0~B15 Reduce Reduce Reduce 14

16 Outline Introduction to DDDAS/Interaction Computation An Example and Problems Fresh Breeze Execution Model and Architecture Execution Model Memory Model Task Model Streaming and Transactions Stream Type and Operations Concurrency Operations of Transaction Style Conclusion 15

17 What We Will Learn Streaming stream data type and operations Future structure to implement streaming streaming on dataflow model: static dataflow and synchronous dataflow Transactions Determinate and Repeatable Using Future and Guard to implement concurrency operations of transaction style 16

18 Stream Type and Operations Stream: A sequence of values of type, maybe infinite Define a stream Stream <DataItem> instream = new Stream <DataItem>(); DateItem can be any data type Concatenate two streams Stream <DataItem> strm1 = strm0 + new Stream <DataItem>{i0, i1,... } Get first element in stream strm.first(); 17

19 Stream Type and Operations (cont d) Remove the first element in stream Stream <DataItem> strm1 = strm0.rest () Stream <DataItem> strm = strm.first () + strm.rest () Append an data item to stream strm.append(item) ; It is the end of data stream if ( strm.moredata ()) { statement } 18

An Stream Example Average Pairs Source Module Average Pairs Sink

true ) { int itm = genedataitem (); sourcestream.

moredata()) { int itm0 = instream.first (); instream = instream.

20 An Stream Example Average Pairs Source Module Average Pairs Sink Module Stream <int> sourcestream = new Stream <int> ( ) ; while ( true ) { int itm = genedataitem (); sourcestream.append (itm); } result sourcestream ; while(instream.moredata()) { int itm0 = instream.first (); instream = instream.rest (); int itm1 = instream.first (); instream = instream.rest (); outstream.append (( itm0 + itm1 ) / 2); } result outstream ; while (instream.moredata ()) { achive.put (instream.first()); instream = instream.rest (); } result achive.getachive(); 19

21 Stream Implementation in FreshBreeze Stream representation a linear chain of chunks, each chunk holds data items and a reference to the next chunk Stream operations FIFO queue operations on chain of chunks read from the head of the chain of chunks, write to the tail of the chain of chunks Synchronization between Producer and Consumer Special Object: Future 20

22 Future A future is a memory cell with a state waiting to receive a data value:status: undefined, defined, waiting Future Read and Future Write are Atomic 1. create future 2. T1 write future 3. T2 read future undef Data defined Data defined Read After Write T2 gets Data 21

Future (Cont d) A future is a memory cell with a state waiting to receive a data value:status: undefined, defined, waiting Future Read and Future Write are Atomic 1.

23 Future (Cont d) A future is a memory cell with a state waiting to receive a data value:status: undefined, defined, waiting Future Read and Future Write are Atomic 1. create future 2. T1 read future 3. T2 read future 4. T3 write future undef waiting waiting Data defined T1 Read T1 Read T1 Data T2 Read T2 Data Write After Read 22

24 Stream Operation Based on Future Fresh Breeze Instruction Set Support 4 stream operations New, Append, First and Rest 1. new stream undef 23

25 Stream Operation Based on Future Fresh Breeze Instruction Set Support 4 stream operations New, Append, First and Rest 1. new stream 2. append Data1 defined undef 24

26 Stream Operation Based on Future Fresh Breeze Instruction Set Support 4 stream operations New, Append, First and Rest 1. new stream 2. append 3. first Data1 defined undef Data1 25

27 Stream Operation Based on Future Fresh Breeze Instruction Set Support 4 stream operations New, Append, First and Rest 1. new stream 2. append 3. first 4. rest Data1 undef 26

28 Related Works on Streaming Based on Dataflow Model Streaming based on Static Dataflow Model Streaming based Synchronized Dataflow Model 27

29 Translate Average Pairs to Dataflow Graph Tail Recursive AveragePairs D reset AveragePairs Extract Compute [1] [2] + /2 concat R 28

30 Translate Average Pairs to Dataflow Graph General AveragePairs Compute D *TF T I + /2 R 29

31 Translate Average Pairs to Dataflow Graph General AveragePairs Compute 4321 TTF T I + /2 R 30

32 Translate Average Pairs to Dataflow Graph General AveragePairs Compute 432 TT F 1 T 1 I + /2 R 31

33 Translate Average Pairs to Dataflow Graph General AveragePairs Compute 432 TT T I 1 + /2 R 32

34 Translate Average Pairs to Dataflow Graph General AveragePairs Compute 43 T T T 2 2 I 1 + /2 R 33

35 Translate Average Pairs to Dataflow Graph General AveragePairs Compute 43 T T 2 I /2 R 34

36 Translate Average Pairs to Dataflow Graph General AveragePairs Compute 43 T T T 2 I 3 + /2 R 35

37 Translate Average Pairs to Dataflow Graph General AveragePairs Compute 4 T T T 3 3 I 2 + /2 1 R 36

38 Translate Average Pairs to Dataflow Graph General AveragePairs Compute 4 T T 3 2 I 3 + /2 R 1 37

39 Translate Average Pairs to Dataflow Graph General AveragePairs Compute 4 T T 3 I + /2 R

40 Translate Average Pairs to Dataflow Graph General AveragePairs Compute 5 T T T 4 4 I 3 + /2 R

41 Can Any Streaming Program Work? Translate any streaming program into dataflow graph---recursive and General Answer is YES, see Jack 94 s Paper [2] 40

42 Related Works on Streaming Based on Dataflow Model Streaming based on Static Dataflow Model Streaming based Synchronized Dataflow Model 41

43 Synchronous Dataflow Model Synchronous Dataflow Graph (SDF) Node (actor) represents computation Edge is FIFO queue representing data communication Weights on Edge presents producing rate and consuming rate A p FIFO q B Each time firing actor A produces p data items and actor B consumes q data items. 42

44 Average Pairs to Synchronous Dataflow Graph Source push=1 peek=2 pop=1 Average Pairs push=1 pop=1 Sink Extend the SDF to support peek sematic 43

45 Average Pairs to Synchronous Dataflow Graph Source Average Pairs Sink 44

46 Average Pairs to Synchronous Dataflow Graph Source Average Pairs Sink 45

47 Average Pairs to Synchronous Dataflow Graph Source Average Pairs Sink 46

48 Average Pairs to Synchronous Dataflow Graph Source Average Pairs Sink 47

49 Average Pairs to Synchronous Dataflow Graph Source Average Pairs Sink Details about SDF model, see Lee 87 s Paper [8] 48

50 Some Projects Based on SDF model Early Ptolemy Project at UC Berkeley Software Synthesis for Embedded system StreamIt at MIT streaming program language and compiler InforStream and SPL IBM streaming computing product Our work COStream hierarchical date flow programming language and compiler OpenStream language and compiler support for streaming in OpenMP 49

51 Resources Early Ptolemy Project at UC Berkeley StreamIt at MIT InforStream and SPL COStream OpenStream 50

52 Outline Introduction to DDDAS/Interaction Computation An Example and Problems Fresh Breeze Execution Model and Architecture Execution Model Memory Model Task Model Streaming and Transactions Stream Type and Operations Transactions Conclusion 51

53 Concurrent Transactions Scenario: A Simple Shared Hash Table Shared by two concurrent users. Either user may search the value corresponding to a key, and either user may add or delete entries Using concurrent shared queue User A User B Enter Requests Request Queue Process Queue 52

54 Determinate and Repeatable A system is determinate if the same ultimate output is produced for every run of the system for a presented input, and this is true for all choices of input and for all interpretations of operators. A parallel program schema is said to be repeatable if for a specific interpretation of the operators of the schema, the output for all runs will be strictly a function of the input. 53

55 Determinate f() g() Input: x for all interpretation Merge Output: f(x), g(x) g(x), f(x) Non-Determinate Observe the input and output Black Box Principle Determinant requires that a program schema produce the same results for given input regardless of the specific functions assigned to its operators 54

56 Repeatable f() Merge g() Input: x for specific interpretation f()=g() Output: f(x), f(x) Repeatable White Box Principle A program, with specified operations, is repeatable if any run of the program for a given input produces the same result 55

57 Determinate and Repeatable Determinate Repeatable 56

58 Revisit the Concurrent Request Example Request A Request B Enter Requests Request Queue Process Queue Non-determinate The result order of request A and request B can not be fixed Non-Repeatable Enter requests is interpreted as a non-determinate merge 57

59 Support Transaction Using Guard In FreshBreeze Guard object special data object which can only be accessed by GuardSwap instruction GuardSwap atomic instruction put the new data object into guard, and return the old data object in guard For the Concurrent Request Example using a guard to lock the tail of the queue each request needs to get the guard before be added to the tail of the queue 58

60 Concurrent Writes to Queue Two requests arrive Request A head RA defined undef RA defined undef guard RB defined undef Request B 59

61 Concurrent Writes to Queue Contend the guard guardswap (atomic) Request A head RA defined undef RA defined undef guard RB defined undef guardswap (atomic) Request B 60

62 Concurrent Writes to Queue Request A gets the guard and old tail guardswap (atomic) Request A head RA defined undef RA defined undef guard RB defined undef Request B 61

63 Concurrent Writes to Queue Request A substitute the old tail with the new request WriteFuture (atomic) Request A head RA defined undef RA defined guard RB defined undef Request B 62

64 Concurrent Writes to Queue head Request B gets guard and add to the tail RA Request A defined RA defined RB defined undef Homework: Support concurrent reads? guard Request B 63

Request B Enter Requests Request Queue Request A // Init queuehead, queuetail; TransactionManager () {

queuehead ) ; } enterrequest ( RQ request ) { QueueEntry newentry = new QueueEntry ( request ); Future

next); FutureWrite ( oldtail, newentry ); } Process Queue processqueue( Future <QueueEntry> reqqueue ) {

65 Request B Enter Requests Request Queue Request A // Init queuehead, queuetail; TransactionManager () { queuehead =queuetail new Future <QueueEntry> (); Guard managerguard = new Guard (queuetail); processqueue ( queuehead ) ; } enterrequest ( RQ request ) { QueueEntry newentry = new QueueEntry ( request ); Future <QueueEntry> oldtail = managerguard.guardswap (newentry.next); FutureWrite ( oldtail, newentry ); } Process Queue processqueue( Future <QueueEntry> reqqueue ) { QueueEntry entry = reqqueue.futureread (); <RQ> request = entry. request ; request. processrequest (); Future <QueueEntry> newhead = entry.next ; processqueue ( newhead ) ;} 64

66 Conclusion Codelet based fine-grain execution model is promising for DDDAS Streaming and Transaction are two important features that support Interactive computing Future and Guard are two special mechanisms to support synchronization and concurrency. 65

67 Reference 1. Dennis, J., Arvind, Gao, G., Li, X., Wang, L.P.: CH1: Architecture and Programming Model for High Performance Interactive Computation (2014) 2. Dennis, J.: Stream data types for signal processing. In: Advanced Topics in Dataflow Computing and Multithreading. pp IEEE Computer Society Press (1995) 3. Dennis, J.: The Fresh Breeze Model of Thread Execution. In Workshop on Programming Models for Ubiquitous Parallelism (2006) 4. Dennis, J.: A Fresh Breeze Processor Supporting Instruction Level Parallelism. Proceedings of Data-Flow Execution Models for Extreme Scale Computing (DFM) (2013) 5. Dennis, J., Gao, G.: Multiprocessor implementation of nondeterminate computations in a functional programming framework, Technical Report TM-375. Computation Structures Group, MIT, Cambridge, MA (1995), pubs/memos/memo-375/memo-375.pdf 6. Dennis, J., Gao, G., Sarkar, V.: Determinacy and repeatability of parallel program schemata. Proceedings of Data- Flow Execution Models for Extreme Scale Computing (DFM) pp (September 2012) 7. Dennis, J., Gao, G., Meng, X.: Experiments with the Fresh Breeze tree-based memory model. International Symposium on Supercomputing, Hamburg (June 2011) 8. Edward A. Lee, David G. Messerschmitt: Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing. IEEE Trans. Computers 36(1): (1987)

68 Concurrent Writes to Queue guardswap (atomic) Request A head RA defined undef RA defined undef guard guard RB defined undef guardswap (atomic) Request B 67

Toward A Codelet Based Execution Model and Its Memory Semantics

Toward A Codelet Based Execution Model and Its Memory Semantics -- For Future Extreme-Scale Computing Systems HPC Worshop Centraro, Italy June 20, 2012 Guang R. Gao ACM Fellow and IEEE Fellow Distinguished