Exploiting Locality of Array Data with Parallel Object-Oriented Model for Multithreaded Computation


Jonghoon Song, Juno Chang, Sangyong Han
Dept. of Computer Science, Seoul National Univ., Seoul, Korea
{song, chang, syhan}@pandora.snu.ac.kr

Heunghwan Kim
Dept. of Computer Science, Seowon Univ., Cheongju, Chungcheong-Do, Korea
khh@dragon.seowon.ac.kr

Abstract

The I-structure was designed to achieve efficiency and parallelism in functional programs that manipulate large data structures. Most multithreading models based on dataflow use I-structures, placing them in a global heap memory shared by all code blocks. In this setting, the locality of data structures cannot be exploited effectively in the many scientific application programs whose production and consumption patterns are highly regular. Although many research projects have addressed this problem, no satisfactory solution exists yet. In this paper, we exploit the locality of array data by using a parallel object-oriented model. In this model, the locality of array data is implicit in the membership of an object, and the features of the object-oriented programming paradigm also make it easy to write parallel programs.

1. Introduction

Multithreading has been proposed as an execution model for massively parallel processors. It tries to hide latency by switching among a set of ready threads, thus improving processor utilization. Both inter-processor communication latency and remote data access latency can be masked. Another view of multithreading based on dataflow is that it attempts to combine the instruction-level locality of the von Neumann model with the natural synchronization of the dataflow model. Many multithreading models lie on the spectrum between the pure von Neumann model and the pure dataflow model. As the base model moves closer to the von Neumann world, the locality of data structures can be better exploited. As the base model moves closer to dataflow, latencies are better tolerated and parallelism is more easily exploited. Most multithreading models based on dataflow use functional programming languages, and functional programs have abundant inherent parallelism. However, this abundant parallelism can overwhelm the machine resources and make it difficult to exploit the locality of computations [1].

Most multithreading models based on dataflow use the I-structure [2], which is placed in a global heap memory shared by all code blocks. Because a request to an I-structure incurs a long, unpredictable latency, requests and responses must be handled in split-phase fashion to tolerate that latency. This property suits application programs in which the patterns of production and consumption are irregular and nondeterministic. If the patterns are highly regular, however, this model suffers from inordinate synchronization overhead, which makes it impossible to exploit data locality efficiently. Many research projects [3,4,5,6] have tried to reduce the overhead of remote access. These projects focused on modifying the mechanism for handling array data to improve performance. Sometimes these mechanisms shift the overhead to the programmer, the compiler, or the hardware architecture.

The aim of this paper is to exploit the locality of array data by using a parallel object-oriented programming model [7]. In this model, it is easy to write parallel programs and to exploit the locality of array data naturally, because the locality of data is implicit in the membership of an object. We use a multithreaded model based on dataflow that employs frame-based synchronization. We have designed a parallel object-oriented programming model and implemented its run-time system on the multithreaded architecture DAVRID [8]. We analyze the locality of the parallel object-oriented model by simulating benchmarks on DAVRID.

2. Related Work

There is much work on exploiting the locality of data, computation, and communication in multithreaded execution. The study in [9] examined the nature of inter-processor communication between multithreaded processors under a non-blocking thread model that uses a frame-based storage model; its results can be used to develop techniques that reduce the amount of inter-processor communication and the associated overhead. The authors proposed grouping the threads within a loop or a function body that can be localized to a processor and allocating them to the same processor, so that communication between these threads remains local to that processor. This allocation strategy showed higher locality and exploits the memory hierarchy. Grouping threads, however, can reduce program parallelism, so care must be taken to avoid such a situation.

The functional programming model for multithreading has several problems in handling data structures and expressing nondeterministic computation [10]. Such problems can be solved by extending functional languages to incorporate non-functional data structures; Id [11] is a typical example. Non-strict semantics is worth noting, not only for its expressive power, but also for its potential to exploit more parallelism than strict semantics. However, non-strictness makes it difficult to analyze programs and generate code, because the order of sub-expression evaluation in non-strict programs cannot be statically determined [12].

There are also several studies that exploit locality while preserving non-strict semantics in multithreaded models based on dataflow. In [3], a hybrid data structure, the V-cell, consists of a number of fixed-size chunks of data elements. Each chunk is tagged with a presence bit, providing intra-chunk strictness and inter-chunk non-strictness in data structure accesses. The V-cell, however, assumes that the production and consumption patterns of the data structure are highly regular, so that a chunk can easily be determined before program execution. The data bundling scheme [4] was proposed to incorporate strictness into non-strict data structures while preserving their benefits. In this scheme, each bundle need not be fixed at compile time but can be determined dynamically, so the sizes and patterns of bundles for an I-structure can vary within a program. It also supports element-by-element access to an I-structure by tagging each element separately. This scheme enhances locality of reference and reduces synchronization overhead for non-strict data structures. We plan to introduce these schemes into our parallel object-oriented model. Other approaches that improve performance by exploiting locality include I-structure caching [6] and a modified I-structure access scheme with varying block size [5]. From the results of these studies, we conclude that exploiting locality within a single thread, among multiple threads, or in data structures shared by threads is worthwhile, but these approaches tend to shift the overhead to the programmer, the compiler, or the hardware architecture.

3. A Parallel Object-Oriented Model for Multithreaded Computation

We developed a parallel programming model [7] that integrates the functional programming model with the object-oriented model, and implemented it on a multithreaded architecture, DAVRID [8]. In this section, we briefly describe the DAVRID execution model and its architecture organization. We also describe the object-oriented functional programming model.

3.1 DAVRID

The DAVRID model is based on dynamic dataflow scheduling, where a node in the dataflow graph represents a thread. A thread is a statically determined sequence of RISC-style instructions operating on registers. A thread is enabled to execute only when all of its inputs are available. Multiple instances of a thread can be enabled at the same time; they are distinguished by the continuation <pointer to frame memory, pointer to thread code>. The thread enabling condition is detected by the synchronization mechanism, which matches inputs to a particular instance of a thread and decrements a synchronization counter (SC). When the SC becomes 0, that is, when all inputs to the thread are available, the corresponding thread can be enabled. Data structures, such as arrays and records, are stored in a global memory.
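The following minimal sketch (in Python, not the actual DAVRID implementation; the class and function names are invented for illustration) shows the frame-based synchronization just described: each thread instance is identified by a continuation <frame pointer, thread code pointer>, and its synchronization counter is decremented as inputs arrive.

    # A minimal sketch (assumed, not DAVRID code) of frame-based synchronization:
    # a per-instance synchronization counter (SC) is decremented as inputs arrive,
    # and the continuation <frame, thread code> is enabled when the SC reaches 0.
    class Frame:
        def __init__(self, sc_init):
            self.slots = {}          # input values delivered to this frame
            self.sc = sc_init        # remaining inputs before the thread is enabled

    def deliver(frame, thread_code, slot, value, ready_queue):
        """Store one input into the frame; enable the thread when SC reaches 0."""
        frame.slots[slot] = value
        frame.sc -= 1
        if frame.sc == 0:
            ready_queue.append((frame, thread_code))   # continuation is ready

    def run(ready_queue):
        while ready_queue:
            frame, thread_code = ready_queue.pop(0)
            thread_code(frame)       # an enabled thread runs without blocking

    # Example: a thread that needs two inputs (SC = 2).
    ready = []
    f = Frame(sc_init=2)
    add_thread = lambda fr: print("a + b =", fr.slots["a"] + fr.slots["b"])
    deliver(f, add_thread, "a", 3, ready)
    deliver(f, add_thread, "b", 4, ready)   # second input enables the thread
    run(ready)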

Figure 1. The Organization of DAVRID
Figure 2. Node Architecture

DAVRID consists of one or more clusters connected by an interconnection network, as shown in Figure 1. In a cluster, four nodes are connected by the NIMU (Node Interface and Management Unit). Each node consists of an SU (Synchronization Unit), a TPU (Thread Processing Unit), an FM (Frame Memory), and several FIFO queues, as shown in Figure 2. The NIMU manages the transmission of messages among the four nodes in a cluster and acts as the interface to the interconnection network. In addition, it holds a global structured data memory, the SM (Structured Memory), and manages the work load among nodes and clusters. The TPU executes the instructions of each thread sequentially; outgoing messages are sent to other units through the NIMU. The SU carries out thread synchronization and allocates/deallocates frames in the FM.

3.2 A Parallel Object-Oriented Model

We developed a parallel object-oriented language, OOId [7]. It is based on Id, an implicitly parallel programming language, and is extended with object-oriented features such as data abstraction and inheritance. A program written in OOId looks like other object-oriented programs rather than like a functional program, because OOId provides two kinds of views: a structural view and a functional view. The former has an object-oriented structure, while the latter has the same view as a function in an Id program.

In multithreaded execution using frame-based synchronization, a run-time data structure called a frame is dynamically allocated to execute a function or a loop iteration, which is formed as an independently executable unit at compile time. Each function is executed in parallel, exchanging synchronization messages with the others. There are many kinds of messages in a multithreaded system, and their patterns depend on the computational model on which the system is based. When the system is based on a functional model, many messages carry argument values, because the nonlocal variables of a function are transformed into arguments by lambda lifting [13] so that functions can be executed independently of their defining environment. When the system is based on an object-oriented model, on the other hand, many messages request the allocation of frames, because the function frame is split into two kinds of frames: an object frame and a member function frame.
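To make the difference in message patterns concrete, here is a toy contrast (an assumed Python example, not OOId; the names Scaler and scaled_sum_lifted are hypothetical). In the functional model, lambda lifting turns the free variable scale into an explicit argument, so every invocation must send its value; in the object model, scale lives with the object and the member function reads it locally.

    # Functional style: after lambda lifting, `scale` travels as an argument.
    def scaled_sum_lifted(scale, xs):
        return scale * sum(xs)

    # Object style: `scale` is stored once in the object (the "object frame");
    # member functions use it without receiving it again in a message.
    class Scaler:
        def __init__(self, scale):
            self.scale = scale            # kept for the object's whole lifetime

        def scaled_sum(self, xs):
            return self.scale * sum(xs)   # no argument message needed for `scale`

    print(scaled_sum_lifted(2, [1, 2, 3]))   # 12
    s = Scaler(2)
    print(s.scaled_sum([1, 2, 3]))           # 12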

The object frame keeps information about the members of an object, and the member function frame keeps information about the thread codes. By using the object frame, messages for sending argument values can be reduced, because the object frame maintains the object's state in the form of member variables, which any member function can use whenever necessary.

We have constructed a parallel object-oriented model based on OOId that reduces the total number of messages and the extra synchronization overhead by utilizing the state information of objects effectively. We also devised an object frame for realizing an object under frame-based synchronization. It is possible to distribute an array over the nodes by partitioning the array into objects and distributing them. The object frame stays alive until the object goes out of scope. Besides storing the members of an object, the object frame also serves as the frame of the constructor: because the constructor is called exactly once per object, while other member functions may be called at arbitrary times, the constructor's data can be kept statically in the object frame. The object frame stores the values of member variables, the interface data for the member functions, and the synchronization data of the constructor threads.

4. Array Handling Mechanism

4.1 Array Representation

In DAVRID, the declarative programming language Id- [15] has been used. Id- has a functional kernel, so arrays are created using monolithic array constructors called array comprehensions. In [2], Arvind et al. argued that array comprehensions lack expressiveness and that, for some problems, lower-level constructors for manipulating the elements of I-structures are necessary. An I-structure is a single-assignment array with element-level synchronization for reads and writes. An I-structure array, called an Iarray, is a special kind of array in which each element may be written no more than once. I-structures can achieve efficiency and parallelism in handling large data structures while freeing the programmer from the details of scheduling and synchronizing parallel activities, unlike data structures in imperative languages, although some restrictions apply. These advantages motivated most fine-grain dataflow or multithreaded execution models to support the I-structure. However, it cannot effectively exploit the locality of data structures in most scientific application programs, in which the patterns of production and consumption of data structures are highly regular, so putting the I-structure in a global area leads to ineffective accesses. In our parallel object-oriented model, we can distribute the array data as the member data of each object. In addition, the programmer can make decisions about the distribution of data or computation in the object-oriented programming paradigm.

The programmer understands the application domain and can make better data and computation partitioning decisions than the compiler [14]. For example, in the case of matrix multiplication, we treat each matrix as an array of vector objects and distribute them over the nodes, whereas in the functional model all the matrix data are allocated in an I-structure before execution. Figure 3 shows the data structure for the multiplication of two matrices. When we make each object a row vector or a column vector, we can compute this multiplication effectively. The bold rectangles in the figure show the objects of each matrix: in matrix A each row is declared as an object, while in matrix B each column is declared as an object. The following is the skeleton of the code in OOId. The class vector has three member functions. The constructor function is implicitly defined and includes the initialization of the member variables. Each object multiplies its data by the function multiply and stores the result into the new matrix C by the function store.

    vector = { class size =
        { n = size;
          data = make_array (1,size) { };
          multiply = {fun X = {...}};
          store = {fun index val = {...}}}};

    A = make_array (1,n) (vector n);
    B = make_array (1,n) (vector n);
    C = make_array (1,n) (vector n);

    _ = {for i <- 1 to n step 1 do
          _ = {for j <- 1 to n step 1 do
                _ = C[i].store j (A[i].multiply B[j]) }};

When we distribute the matrix data as the member data of objects, we can exploit locality and reduce inter-node communication. If we distribute the matrix without considering vector-type objects, we suffer from a concentration of messages at the main frame, as we have seen in irregular computations. Section 5 explains this with experimental results.

Figure 3. Matrix multiplication
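As a rough illustration of why this distribution helps, the following sketch (assumed Python pseudocode, not OOId or the DAVRID run-time; owner, matmul, and the round-robin placement are invented for the example) counts how many column objects of B would have to be fetched from another node when the work for each row of C runs on the node owning the corresponding row object of A.

    # Illustrative sketch of the distribution in Figure 3: rows of A and
    # columns of B are objects spread round-robin over the nodes, so only the
    # column object of B may need to be fetched from another node.
    NUM_NODES = 4

    def owner(obj_index):
        return obj_index % NUM_NODES          # simple round-robin placement

    def multiply(row, col):
        return sum(a * b for a, b in zip(row, col))

    def matmul(A_rows, B_cols):
        n = len(A_rows)
        C = [[0] * n for _ in range(n)]
        remote_fetches = 0
        for i in range(n):                     # work for row i runs on owner(i)
            for j in range(n):
                if owner(j) != owner(i):       # column object lives elsewhere
                    remote_fetches += 1        # one remote member-data request
                C[i][j] = multiply(A_rows[i], B_cols[j])
        return C, remote_fetches

    A_rows = [[1, 2], [3, 4]]
    B_cols = [[5, 7], [6, 8]]                  # columns of B, stored as objects
    C, fetches = matmul(A_rows, B_cols)
    print(C, "remote column fetches:", fetches)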

Figure 4. Array Handling in the Functional Model
Figure 5. Array Handling in the Parallel OO Model

4.2 Array Handling

In the functional model, an Iarray is allocated in a global area and can be shared, whereas in the parallel object-oriented model array data are distributed to the corresponding object frames, which own their member array data. Figure 4 and Figure 5 show the allocation and the retrieval of array data in each model.

Figure 4 shows the mechanism for array handling in the functional model. There are two kinds of arrows: the bold ones (labeled A) represent messages for allocating array data, and the dotted ones (labeled R) represent messages for requesting array data. Each arrow means the following:

A-1: The TPU (Thread Processing Unit) sends a message to the unit managing the I-structure memory to allocate an Iarray.
A-2: That unit allocates the Iarray.
A-3: It sends a message with the address of the Iarray to the SU (Synchronization Unit).
A-4: The SU writes the address into the frame memory and handles the synchronization information.
A-5: The SU notifies the TPU of the completion of the Iarray allocation.

When the message is local, we call it a notification, because units in the same node communicate through local queues. For accesses to Iarray data, the synchronization mechanism is similar to that for allocation:

R-1: The TPU sends a message requesting Iarray data.
R-2: The unit managing the I-structure memory fetches the data and sends a message with the data and synchronization information to the SU.
R-3: The SU writes the data into the frame memory and handles the synchronization information.
R-4: The SU notifies the TPU of the completion of the Iarray data fetch.
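For reference, the element-level synchronization behind the R-arrows can be sketched as follows (an assumed, simplified Python model of an I-structure with deferred reads, not the DAVRID implementation): a read of an element that has not yet been written is deferred and answered later, in split-phase fashion, when the write arrives.

    # Simplified I-structure sketch: single-assignment elements with
    # element-level synchronization; unsatisfied reads are deferred.
    EMPTY = object()

    class IStructure:
        def __init__(self, size):
            self.values = [EMPTY] * size
            self.deferred = [[] for _ in range(size)]   # waiting read requests

        def read(self, i, reply):
            """Split-phase read: `reply` runs now or when the element arrives."""
            if self.values[i] is EMPTY:
                self.deferred[i].append(reply)           # defer; do not block the TPU
            else:
                reply(self.values[i])

        def write(self, i, value):
            assert self.values[i] is EMPTY, "I-structure elements are single-assignment"
            self.values[i] = value
            for reply in self.deferred[i]:               # satisfy pending readers
                reply(value)
            self.deferred[i].clear()

    ia = IStructure(4)
    ia.read(2, lambda v: print("reader 1 got", v))       # arrives before the write
    ia.write(2, 42)                                      # releases the deferred read
    ia.read(2, lambda v: print("reader 2 got", v))       # satisfied immediately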

Figure 5 shows the mechanism for array handling in the parallel object-oriented model. The arrows mean the same as in Figure 4, but there are two kinds of requests: local requests and remote requests. If the TPU needs member data, the request is resolved locally; otherwise, it sends a message to the NIMU (Node Interface and Management Unit) for a remote data request. Each arrow means the following:

A-1: The TPU notifies the SU to allocate the array data.
A-2: The SU allocates the array data as member data.
A-3: The SU notifies the TPU of the completion of the allocation.

As Figure 5 shows, no message traffic incurs a long latency. The following arrows are for requesting array data:

LR-1: The TPU notifies the SU of a request for array data.
LR-2: The SU fetches the data.
LR-3: The SU notifies the TPU of the completion of the array data fetch.

A remote data request is still needed, even in the parallel object-oriented model, when the TPU needs member data of an object allocated on another node as arguments:

RR-1: The TPU sends a message requesting array data located on another node.
RR-2: The NIMU routes the request to the corresponding node.
RR-3: The node that receives the remote request message sends a message with the data and synchronization information to the SU.

The rest of the procedure is the same as for a remote request in the functional model. Once we retrieve data from a remote object frame, we do not need to retrieve it again, because it is preserved in the local area; this is the difference between the two models. Thus, by keeping an object frame, the parallel object-oriented model reduces the message traffic that incurs long latency and exploits locality within a node.
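This local reuse can be sketched as follows (assumed Python code, not the actual DAVRID run-time; LocalFrameArea, request_member, and the example object names are hypothetical): repeated uses of a remote member on the same node issue only one remote request.

    # Sketch of keeping remotely fetched member data in the local frame area.
    class LocalFrameArea:
        def __init__(self, node_id, request_member):
            self.node_id = node_id
            self.request_member = request_member  # issues the data request (RR-1 if remote)
            self.copies = {}                      # (object_id, member) -> preserved value
            self.remote_requests = 0

        def get_member(self, object_id, member, owner_node):
            key = (object_id, member)
            if key in self.copies:                # already preserved locally: no message
                return self.copies[key]
            if owner_node != self.node_id:        # member lives on another node
                self.remote_requests += 1         # one RR-1 message over the network
            value = self.request_member(object_id, member)
            self.copies[key] = value              # keep it for later local reuse
            return value

    # Example: node 0 reads a member of an object owned by node 1 three times.
    store = {("vecB_3", "data"): [5, 7, 9]}
    area = LocalFrameArea(node_id=0, request_member=lambda o, m: store[(o, m)])
    for _ in range(3):
        area.get_member("vecB_3", "data", owner_node=1)
    print("remote requests issued:", area.remote_requests)   # prints 1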

5. Analysis

In this section, we compare the performance of the parallel object-oriented model with that of the functional model. In the functional model, the I-structure is located in a global area and shared by all the loop functions, which are executed over many nodes. In the parallel object-oriented model, the object frame distributed to each node keeps its member data in local memory. We simulated both models on the DAVRID simulator, which executes benchmarks using an event-list scheduling mechanism [17]: the simulator maintains a data structure representing the event list and a variable representing the current simulation time; it puts generated events into the event list and then retrieves and processes them at the times at which they should be handled. The benchmark programs are matrix multiplication and the N-body problem, written in the OOId programming language. These programs are compiled to DAVRID PML (Parallel Machine Language) [15]. We compile each benchmark into two versions, one for the functional model and one for the parallel object-oriented model.

Matrix multiplication is the typical benchmark for evaluating the effects of parallel execution. In the functional model, all the matrix data are allocated in an I-structure before the multiplication, and the unfolding degree of the parallel loop frames is set to 3, 5, and 0 for the nested loop levels. In the parallel object-oriented model, we treat each matrix as an array of vector objects and distribute them over the nodes. The N-body problem computes the behavior of N particles interacting through gravitational or coulombic forces and demonstrates a programming style based on the concept of information hiding [16], in which data structures and their operations are associated and encapsulated within programs. In the functional model, all the structured data are allocated in an I-structure and the unfolding degree of the parallel loop frames is set to 15 and 0 for the nested loop levels, whereas in the parallel object-oriented model we treat each particle as an object.

Figure 6. Simulation results of the matrix multiplication benchmark: (a) total execution cycles and speedup; (b) execution cycles of each unit (N=60).

Figure 6(a) shows that the parallel object-oriented model improves performance by exploiting locality when the matrix multiplication benchmark is executed with various matrix sizes. The number of total execution cycles, used as the measure for comparison, is the total number of clock cycles that elapse in the DAVRID system from the start of program execution to its end; it is equivalent to the program execution time. The speedup is over 8, and it increases as the problem size increases. This means that treating each vector as an object enhances the locality of the data. The major cause of this speedup is the avoidance of concentrated accesses to the global data. Figure 6(b) compares the average execution cycles of the processor units in the two models. We can see a reduction of execution cycles in all units. Avoiding the concentration of data accesses reduces the execution cycles of the NIMU remarkably. The improvement in the TPUs and SUs results from reduced synchronization with other nodes, owing to the spatial locality of member data.

In the case of the N-body problem, two parameters affect performance: the number of bodies N and the time parameter T. Figure 7(a) shows that the average speedup is about 7 when the time parameter T is fixed at 20. Figure 7(b) compares the average execution cycles of the units in the two models. The parallel object-oriented model reduces the execution cycles of most units, especially the NIMU. The improvement of the NIMU results from the spatial locality of member data, in the same manner as for matrix multiplication.

As Figure 6(b) and Figure 7(b) show, the rate of improvement of the SU is relatively low, while the execution cycles of the NIMUs and TPUs decrease remarkably. In the N-body problem, the execution cycles of the SU even increase. This results from keeping array data in frame memory instead of the I-structure, which shifts part of the computational load of the NIMU onto the SUs. It can still improve total performance, because the NIMU usually carries a heavier load than the other units.

Figure 7(c) shows the change of speedup with the time parameter T. The speedup increases in proportion to the time parameter, which shows the exploitation of another kind of locality in the N-body problem. In the parallel object-oriented model, messages may concentrate at the node containing the main function frame, because that node keeps the object frame pointers; this increases the overhead of the SU at that node, since it must handle the concentrated messages. In the N-body problem, however, we can improve performance by increasing the time parameter T. This improvement comes from temporal locality: the data from the previous time step are kept in the object frame, whereas in the functional model all the data must be recalculated at each time step.

Figure 7. Simulation results of the N-body problem benchmark: (a) total execution cycles and speedup (T=20); (b) execution cycles of each unit (N=60, T=20); (c) change of speedup with time parameter T (N=60).

6. Conclusion and Future Work

Most multithreading models based on dataflow use functional programming languages, and functional programs have abundant inherent parallelism. However, this abundant parallelism can overwhelm the machine resources and make it difficult to exploit the locality of computations. Most such models use the I-structure, placed in a global heap memory shared by all code blocks, so requests and responses for I-structures must be handled in split-phase fashion to tolerate the long, unpredictable latency that each request incurs.

In this paper, we introduced a parallel object-oriented model that exploits the locality of array data by distributing the data as the member data of objects. In this model, it is easy to write parallel programs and to exploit the locality of array data naturally. Experimental results show that we can exploit locality when the benchmark has a regular access pattern, such as matrix multiplication. When the benchmark requires the creation of many objects and complex communication among them, we can still obtain a good performance improvement by using the state information kept in object frames.

One important piece of future work is to improve load balancing. In general, every message requesting a frame allocation carries the size of the frame in its operand, which helps the switching unit balance the computing load among nodes. This simple load balancing method may lead to message concentration on a specific frame, because the node that requests the allocation of an object array must keep the frame pointers and process the messages that try to read them. We have been studying new load balancing methods to decrease this concentration. In the new methods, the concept of a frame-token is used and the NIMU takes charge of managing the allocated frame pointers. This will decrease the synchronization overhead caused by message traffic for frame pointers and provide general strategies for data distribution.

References

[1] Bhanu Shankar, Lucas Roh, Wim Bohm, and Walid Najjar, "Control of Loop Parallelism in Multithreaded Code," Proc. of the Int'l Conference on Parallel Architectures and Compilation Techniques, June.
[2] Arvind, R. S. Nikhil, and K. K. Pingali, "I-Structures: Data Structures for Parallel Computing," ACM Transactions on Programming Languages and Systems, Vol. 11, No. 4, Oct.
[3] Walid Najjar, W. M. Miller, and A. P. W. Bohm, "Data Driven Vector Execution," Technical Paper, Dept. of Computer Science, Colorado State University.

[4] E. H. Rho, S. Y. Han, H. H. Kim, and D. J. Hwang, "Effects of Data Bundling in Non-strict Data Structure," Proc. of the Int'l Conference on Parallel Architectures and Compilation Techniques, June.
[5] Y. H. Kim, S. H. Kim, D. W. Rhee, H. H. Kim, Juno Chang, and S. Y. Han, "Exploiting the Locality of Data Structures in Multithreaded Architecture," Proc. of the Int'l Conference on Parallel and Distributed Systems, June.
[6] Hyong-Shik Kim, Soonhoi Ha, and Chu Shik Jhon, "Quantitative Analysis of Caching Effect of I-structure Data in Frame-based Multithreaded Processing," Proc. of the Int'l Conference on Parallel Processing, August.
[7] Juno Chang, J. H. Song, J. H. Kim, H. H. Kim, and S. Y. Han, "Implementation of an Object-Oriented Functional Language on the Multithreaded Architecture," Proc. of the Int'l Conference on Parallel and Distributed Systems, Dec.
[8] S. H. Ha, J. H. Kim, E. H. Rho, H. H. Kim, D. J. Hwang, and S. Y. Han, "A Massively Parallel Multithreaded Architecture: DAVRID," Proc. of the Int'l Conference on Computer Design, pp. 70-74, Oct.
[9] Lucas Roh and Walid Najjar, "Analysis of Communication and Overhead Reduction in Multithreaded Execution," Proc. of PACT '95, Cyprus.
[10] K. E. Schauser, "Compiling Lenient Languages for Parallel Asynchronous Execution," Ph.D. thesis, Computer Science Division, UC Berkeley.
[11] R. S. Nikhil, "Id Language Reference Manual (Version 90.1)," MIT CSG Memo 284-2, July.
[12] K. R. Traub, D. E. Culler, and K. E. Schauser, "Global Analysis for Partitioning Non-strict Programs into Sequential Threads," Motorola Technical Report MCRC-TR-26, April.
[13] T. Johnsson, "Lambda Lifting: Transforming Programs to Recursive Equations," Springer-Verlag LNCS 201, Sep.
[14] Andrew S. Grimshaw, "Easy-to-Use Object-Oriented Parallel Processing with Mentat," Computer, Vol. 26, No. 5, pp. 39-51, May.
[15] E. H. Rho, S. H. Ha, H. H. Kim, D. J. Hwang, and S. Y. Han, "Compilation of a Functional Language for the Multithreaded Architecture: DAVRID," Proc. of the Int'l Conference on Parallel Processing, Vol. 2, Aug.
[16] K. Mani Chandy and Stephen Taylor, An Introduction to Parallel Programming, Jones and Bartlett Publishers.
[17] M. H. MacDougall, Simulating Computer Systems: Techniques and Tools, The MIT Press.


Oops known as object-oriented programming language system is the main feature of C# which further support the major features of oops including: Oops known as object-oriented programming language system is the main feature of C# which further support the major features of oops including: Abstraction Encapsulation Inheritance and Polymorphism Object-Oriented

More information

Python in the Cling World

Python in the Cling World Journal of Physics: Conference Series PAPER OPEN ACCESS Python in the Cling World To cite this article: W Lavrijsen 2015 J. Phys.: Conf. Ser. 664 062029 Recent citations - Giving pandas ROOT to chew on:

More information

High Performance Computing

High Performance Computing Master Degree Program in Computer Science and Networking, 2014-15 High Performance Computing 1 st appello January 13, 2015 Write your name, surname, student identification number (numero di matricola),

More information

Breakpoints and Halting in Distributed Programs

Breakpoints and Halting in Distributed Programs 1 Breakpoints and Halting in Distributed Programs Barton P. Miller Jong-Deok Choi Computer Sciences Department University of Wisconsin-Madison 1210 W. Dayton Street Madison, Wisconsin 53706 Abstract Interactive

More information

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:

More information

High Performance Computing

High Performance Computing The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical

More information

DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor

DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor Kyriakos Stavrou, Paraskevas Evripidou, and Pedro Trancoso Department of Computer Science, University of Cyprus, 75 Kallipoleos Ave., P.O.Box

More information

WaveScalar. Winter 2006 CSE WaveScalar 1

WaveScalar. Winter 2006 CSE WaveScalar 1 WaveScalar Dataflow machine good at exploiting ILP dataflow parallelism traditional coarser-grain parallelism cheap thread management memory ordering enforced through wave-ordered memory Winter 2006 CSE

More information

Computer Systems A Programmer s Perspective 1 (Beta Draft)

Computer Systems A Programmer s Perspective 1 (Beta Draft) Computer Systems A Programmer s Perspective 1 (Beta Draft) Randal E. Bryant David R. O Hallaron August 1, 2001 1 Copyright c 2001, R. E. Bryant, D. R. O Hallaron. All rights reserved. 2 Contents Preface

More information

Offloading Java to Graphics Processors

Offloading Java to Graphics Processors Offloading Java to Graphics Processors Peter Calvert (prc33@cam.ac.uk) University of Cambridge, Computer Laboratory Abstract Massively-parallel graphics processors have the potential to offer high performance

More information

A Compiler-Directed Cache Coherence Scheme Using Data Prefetching

A Compiler-Directed Cache Coherence Scheme Using Data Prefetching A Compiler-Directed Cache Coherence Scheme Using Data Prefetching Hock-Beng Lim Center for Supercomputing R & D University of Illinois Urbana, IL 61801 hblim@csrd.uiuc.edu Pen-Chung Yew Dept. of Computer

More information

Module 13: INTRODUCTION TO COMPILERS FOR HIGH PERFORMANCE COMPUTERS Lecture 25: Supercomputing Applications. The Lecture Contains: Loop Unswitching

Module 13: INTRODUCTION TO COMPILERS FOR HIGH PERFORMANCE COMPUTERS Lecture 25: Supercomputing Applications. The Lecture Contains: Loop Unswitching The Lecture Contains: Loop Unswitching Supercomputing Applications Programming Paradigms Important Problems Scheduling Sources and Types of Parallelism Model of Compiler Code Optimization Data Dependence

More information

Parallel Logic Simulation on General Purpose Machines

Parallel Logic Simulation on General Purpose Machines Parallel Logic Simulation on General Purpose Machines Larry Soule', Tom Blank Center for Integrated Systems Stanford University Abstract Three parallel algorithms for logic simulation have been developed

More information

University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2.

University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2. University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2. Solutions Problem 1 Problem 3.12 in CSG (a) Clearly the partitioning can be

More information

CS 31: Intro to Systems Threading & Parallel Applications. Kevin Webb Swarthmore College November 27, 2018

CS 31: Intro to Systems Threading & Parallel Applications. Kevin Webb Swarthmore College November 27, 2018 CS 31: Intro to Systems Threading & Parallel Applications Kevin Webb Swarthmore College November 27, 2018 Reading Quiz Making Programs Run Faster We all like how fast computers are In the old days (1980

More information