Exploiting Locality of Array Data with Parallel Object-Oriented Model for Multithreaded Computation


Jonghoon Song, Juno Chang, Sangyong Han
Dept. of Computer Science, Seoul National Univ., Seoul, Korea
{song, chang, syhan}@pandora.snu.ac.kr

Heunghwan Kim
Dept. of Computer Science, Seowon Univ., Cheongju, Chungcheong-Do, Korea
khh@dragon.seowon.ac.kr

Abstract

The I-structure was designed to achieve efficiency and parallelism in functional programs that manipulate large data structures. Most multithreading models based on dataflow use I-structures, placing them in a global heap memory shared by all code blocks. In this setting, the locality of data structures cannot be exploited effectively in the many scientific application programs whose production and consumption patterns are highly regular. Although many research projects have addressed this problem, no satisfactory solution exists yet. In this paper, we exploit the locality of array data by using a parallel object-oriented model. In this model, the locality of array data is implicit in the membership of an object, and the features of the object-oriented programming paradigm also make it easy to write parallel programs.

1. Introduction

Multithreading has been proposed as an execution model for massively parallel processors. It tries to hide latency by switching among a set of ready threads, thus improving processor utilization. Both inter-processor communication latency and remote data access latency can be masked. Another view of multithreading based on dataflow is that it attempts to combine the instruction-level locality of the von Neumann model with the natural synchronization of the dataflow model. Many multithreading models lie on the spectrum between the pure von Neumann model and the pure dataflow model. As the base model moves closer to the von Neumann world, the locality of data structures can be better exploited. As the base model moves closer to dataflow, latencies are better tolerated and parallelism is more easily exploited. Most multithreading models based on dataflow use functional programming languages, and functional programs have abundant inherent parallelism. However, this abundant parallelism can overwhelm the machine resources and make it difficult to exploit the locality of computations [1].

Most multithreading models based on dataflow use the I-structure [2], which is placed in a global heap memory shared by all code blocks. Because a request to an I-structure incurs a long, unpredictable latency, requests and responses must be handled in split-phase fashion to tolerate that latency. This property suits application programs in which the patterns of production and consumption are irregular and nondeterministic. If the patterns are highly regular, however, this model suffers from inordinate synchronization overhead, which makes it impossible to exploit data locality efficiently. Many research projects [3,4,5,6] have tried to reduce the overhead of remote access. These projects focused on modifying the mechanism for handling array data to improve performance. Sometimes these mechanisms shift the overhead to the programmer, the compiler, or the hardware architecture.

The aim of this paper is to exploit the locality of array data by using a parallel object-oriented programming model [7]. In this model, it is easy to write parallel programs and to exploit the locality of array data naturally, because the locality of data is implicit in the membership of an object. We use a multithreaded model based on dataflow that employs frame-based synchronization. We have designed a parallel object-oriented programming model and implemented its run-time system on the multithreaded architecture DAVRID [8]. We analyze the locality of the parallel object-oriented model by simulating benchmarks on DAVRID.

2. Related Work

There is much work on exploiting the locality of data, computation, and communication in multithreaded execution. The study in [9] examined the nature of inter-processor communication between multithreaded processors under a non-blocking thread model that uses a frame-based storage model; its results can be used to develop techniques that reduce the amount of inter-processor communication and the associated overhead. The authors proposed grouping the threads within a loop or a function body that can be localized to a processor and allocating them to the same processor, so that communication between these threads remains local to that processor. This allocation strategy showed higher locality and exploits the memory hierarchy. Grouping threads, however, can reduce program parallelism, so care must be taken to avoid such a situation.

The functional programming model for multithreading has several problems in handling data structures and expressing nondeterministic computation [10]. Such problems can be solved by extending functional languages to incorporate non-functional data structures; Id [11] is a typical example. Non-strict semantics is worth noting, not only for its expressive power, but also for its potential to exploit more parallelism than strict semantics. However, non-strictness makes it difficult to analyze programs and generate code, because the order of sub-expression evaluation in non-strict programs cannot be statically determined [12].

There are also several studies that exploit locality while preserving non-strict semantics in multithreaded models based on dataflow. In [3], a hybrid data structure, the V-cell, consists of a number of fixed-size chunks of data elements. Each chunk is tagged with a presence bit, providing intra-chunk strictness and inter-chunk non-strictness in data structure accesses. The V-cell, however, assumes that the production and consumption patterns of the data structure are highly regular, so that a chunk can easily be determined before program execution. The data bundling scheme [4] was proposed to incorporate strictness into non-strict data structures while preserving their benefits. In this scheme, each bundle need not be fixed at compile time but can be determined dynamically, so the sizes and patterns of bundles for an I-structure can vary within a program. It also supports element-by-element access to an I-structure by tagging each element separately. This scheme enhances locality of reference and reduces synchronization overhead for non-strict data structures. We plan to introduce these schemes into our parallel object-oriented model. Other approaches that improve performance by exploiting locality include I-structure caching [6] and a modified I-structure access scheme with varying block size [5]. From the results of these studies, we conclude that exploiting locality within a single thread, among multiple threads, or in data structures shared by threads is worthwhile, but these approaches tend to shift the overhead to the programmer, the compiler, or the hardware architecture.

3. A Parallel Object-Oriented Model for Multithreaded Computation

We developed a parallel programming model [7] that integrates the functional programming model with the object-oriented model, and implemented it on a multithreaded architecture, DAVRID [8]. In this section, we briefly describe the DAVRID execution model and its architecture organization. We also describe the object-oriented functional programming model.

3.1 DAVRID

The DAVRID model is based on dynamic dataflow scheduling, where a node in the dataflow graph represents a thread. A thread is a statically determined sequence of RISC-style instructions operating on registers. A thread is enabled to execute only when all of its inputs are available. Multiple instances of a thread can be enabled at the same time; they are distinguished by the continuation <pointer to frame memory, pointer to thread code>. The thread enabling condition is detected by the synchronization mechanism, which matches inputs to a particular instance of a thread and decrements a synchronization counter (SC). When the SC becomes 0, that is, when all inputs to the thread are available, the corresponding thread can be enabled. Data structures, such as arrays and records, are stored in a global memory.
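The following minimal sketch (in Python, not the actual DAVRID implementation; the class and function names are invented for illustration) shows the frame-based synchronization just described: each thread instance is identified by a continuation <frame pointer, thread code pointer>, and its synchronization counter is decremented as inputs arrive.

    # A minimal sketch (assumed, not DAVRID code) of frame-based synchronization:
    # a per-instance synchronization counter (SC) is decremented as inputs arrive,
    # and the continuation <frame, thread code> is enabled when the SC reaches 0.
    class Frame:
        def __init__(self, sc_init):
            self.slots = {}          # input values delivered to this frame
            self.sc = sc_init        # remaining inputs before the thread is enabled

    def deliver(frame, thread_code, slot, value, ready_queue):
        """Store one input into the frame; enable the thread when SC reaches 0."""
        frame.slots[slot] = value
        frame.sc -= 1
        if frame.sc == 0:
            ready_queue.append((frame, thread_code))   # continuation is ready

    def run(ready_queue):
        while ready_queue:
            frame, thread_code = ready_queue.pop(0)
            thread_code(frame)       # an enabled thread runs without blocking

    # Example: a thread that needs two inputs (SC = 2).
    ready = []
    f = Frame(sc_init=2)
    add_thread = lambda fr: print("a + b =", fr.slots["a"] + fr.slots["b"])
    deliver(f, add_thread, "a", 3, ready)
    deliver(f, add_thread, "b", 4, ready)   # second input enables the thread
    run(ready)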

Figure 1. The Organization of DAVRID
Figure 2. Node Architecture

DAVRID consists of one or more clusters connected by an interconnection network, as shown in Figure 1. In a cluster, four nodes are connected by the NIMU (Node Interface and Management Unit). Each node consists of an SU (Synchronization Unit), a TPU (Thread Processing Unit), an FM (Frame Memory), and several FIFO queues, as shown in Figure 2. The NIMU manages the transmission of messages among the four nodes in a cluster and acts as the interface to the interconnection network. In addition, it holds a global structured data memory, the SM (Structured Memory), and manages the work load among nodes and clusters. The TPU executes the instructions of each thread sequentially; outgoing messages are sent to other units through the NIMU. The SU carries out thread synchronization and allocates/deallocates frames in the FM.

3.2 A Parallel Object-Oriented Model

We developed a parallel object-oriented language, OOId [7]. It is based on Id, an implicitly parallel programming language, and is extended with object-oriented features such as data abstraction and inheritance. A program written in OOId looks like other object-oriented programs rather than like a functional program, because OOId provides two kinds of views: a structural view and a functional view. The former has an object-oriented structure, while the latter has the same view as a function in an Id program.

In multithreaded execution using frame-based synchronization, a run-time data structure called a frame is dynamically allocated to execute a function or a loop iteration, which is formed as an independently executable unit at compile time. Each function is executed in parallel, exchanging synchronization messages with the others. There are many kinds of messages in a multithreaded system, and their patterns depend on the computational model on which the system is based. When the system is based on a functional model, many messages carry argument values, because the nonlocal variables of a function are transformed into arguments by lambda lifting [13] so that functions can be executed independently of their defining environment. When the system is based on an object-oriented model, on the other hand, many messages request the allocation of frames, because the function frame is split into two kinds of frames: an object frame and a member function frame.
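To make the difference in message patterns concrete, here is a toy contrast (an assumed Python example, not OOId; the names Scaler and scaled_sum_lifted are hypothetical). In the functional model, lambda lifting turns the free variable scale into an explicit argument, so every invocation must send its value; in the object model, scale lives with the object and the member function reads it locally.

    # Functional style: after lambda lifting, `scale` travels as an argument.
    def scaled_sum_lifted(scale, xs):
        return scale * sum(xs)

    # Object style: `scale` is stored once in the object (the "object frame");
    # member functions use it without receiving it again in a message.
    class Scaler:
        def __init__(self, scale):
            self.scale = scale            # kept for the object's whole lifetime

        def scaled_sum(self, xs):
            return self.scale * sum(xs)   # no argument message needed for `scale`

    print(scaled_sum_lifted(2, [1, 2, 3]))   # 12
    s = Scaler(2)
    print(s.scaled_sum([1, 2, 3]))           # 12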

The object frame keeps information about the members of an object, and the member function frame keeps information about the thread codes. By using the object frame, messages for sending argument values can be reduced, because the object frame maintains the object's state in the form of member variables, which any member function can use whenever necessary.

We have constructed a parallel object-oriented model based on OOId that reduces the total number of messages and the extra synchronization overhead by utilizing the state information of objects effectively. We also devised an object frame for realizing an object under frame-based synchronization. It is possible to distribute an array over the nodes by partitioning the array into objects and distributing them. The object frame stays alive until the object goes out of scope. Besides storing the members of an object, the object frame also serves as the frame of the constructor: because the constructor is called exactly once per object, while other member functions may be called at arbitrary times, the constructor's data can be kept statically in the object frame. The object frame stores the values of member variables, the interface data for the member functions, and the synchronization data of the constructor threads.

4. Array Handling Mechanism

4.1 Array Representation

In DAVRID, the declarative programming language Id- [15] has been used. Id- has a functional kernel, so arrays are created using monolithic array constructors called array comprehensions. In [2], Arvind et al. argued that array comprehensions lack expressiveness and that, for some problems, lower-level constructors for manipulating the elements of I-structures are necessary. An I-structure is a single-assignment array with element-level synchronization for reads and writes. An I-structure array, called an Iarray, is a special kind of array in which each element may be written no more than once. I-structures can achieve efficiency and parallelism in handling large data structures while freeing the programmer from the details of scheduling and synchronizing parallel activities, unlike data structures in imperative languages, although some restrictions apply. These advantages motivated most fine-grain dataflow or multithreaded execution models to support the I-structure. However, it cannot effectively exploit the locality of data structures in most scientific application programs, in which the patterns of production and consumption of data structures are highly regular, so putting the I-structure in a global area leads to ineffective accesses. In our parallel object-oriented model, we can distribute the array data as the member data of each object. In addition, the programmer can make decisions about the distribution of data or computation in the object-oriented programming paradigm.

The programmer understands the application domain and can make better data and computation partitioning decisions than the compiler [14]. For example, in the case of matrix multiplication, we treat each matrix as an array of vector objects and distribute them over the nodes, whereas in the functional model all the matrix data are allocated in an I-structure before execution. Figure 3 shows the data structure for the multiplication of two matrices. When we make each object a row vector or a column vector, we can compute this multiplication effectively. The bold rectangles in the figure show the objects of each matrix: in matrix A each row is declared as an object, while in matrix B each column is declared as an object. The following is the skeleton of the code in OOId. The class vector has three member functions. The constructor function is implicitly defined and includes the initialization of the member variables. Each object multiplies its data by the function multiply and stores the result into the new matrix C by the function store.

    vector = { class size =
        { n = size;
          data = make_array (1,size) { };
          multiply = {fun X = {...}};
          store = {fun index val = {...}}}};

    A = make_array (1,n) (vector n);
    B = make_array (1,n) (vector n);
    C = make_array (1,n) (vector n);

    _ = {for i <- 1 to n step 1 do
          _ = {for j <- 1 to n step 1 do
                _ = C[i].store j (A[i].multiply B[j]) }};

When we distribute the matrix data as the member data of objects, we can exploit locality and reduce inter-node communication. If we distribute the matrix without considering vector-type objects, we suffer from a concentration of messages at the main frame, as we have seen in irregular computations. Section 5 explains this with experimental results.

Figure 3. Matrix multiplication
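As a rough illustration of why this distribution helps, the following sketch (assumed Python pseudocode, not OOId or the DAVRID run-time; owner, matmul, and the round-robin placement are invented for the example) counts how many column objects of B would have to be fetched from another node when the work for each row of C runs on the node owning the corresponding row object of A.

    # Illustrative sketch of the distribution in Figure 3: rows of A and
    # columns of B are objects spread round-robin over the nodes, so only the
    # column object of B may need to be fetched from another node.
    NUM_NODES = 4

    def owner(obj_index):
        return obj_index % NUM_NODES          # simple round-robin placement

    def multiply(row, col):
        return sum(a * b for a, b in zip(row, col))

    def matmul(A_rows, B_cols):
        n = len(A_rows)
        C = [[0] * n for _ in range(n)]
        remote_fetches = 0
        for i in range(n):                     # work for row i runs on owner(i)
            for j in range(n):
                if owner(j) != owner(i):       # column object lives elsewhere
                    remote_fetches += 1        # one remote member-data request
                C[i][j] = multiply(A_rows[i], B_cols[j])
        return C, remote_fetches

    A_rows = [[1, 2], [3, 4]]
    B_cols = [[5, 7], [6, 8]]                  # columns of B, stored as objects
    C, fetches = matmul(A_rows, B_cols)
    print(C, "remote column fetches:", fetches)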

Figure 4. Array Handling in the Functional Model
Figure 5. Array Handling in the Parallel OO Model

4.2 Array Handling

In the functional model, an Iarray is allocated in a global area and can be shared, whereas in the parallel object-oriented model array data are distributed to the corresponding object frames, which own their member array data. Figure 4 and Figure 5 show the allocation and the retrieval of array data in each model.

Figure 4 shows the mechanism for array handling in the functional model. There are two kinds of arrows: the bold ones (labeled A) represent messages for allocating array data, and the dotted ones (labeled R) represent messages for requesting array data. Each arrow means the following:

A-1: The TPU (Thread Processing Unit) sends a message to the unit managing the I-structure memory to allocate an Iarray.
A-2: That unit allocates the Iarray.
A-3: It sends a message with the address of the Iarray to the SU (Synchronization Unit).
A-4: The SU writes the address into the frame memory and handles the synchronization information.
A-5: The SU notifies the TPU of the completion of the Iarray allocation.

When the message is local, we call it a notification, because units in the same node communicate through local queues. For accesses to Iarray data, the synchronization mechanism is similar to that for allocation:

R-1: The TPU sends a message requesting Iarray data.
R-2: The unit managing the I-structure memory fetches the data and sends a message with the data and synchronization information to the SU.
R-3: The SU writes the data into the frame memory and handles the synchronization information.
R-4: The SU notifies the TPU of the completion of the Iarray data fetch.
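For reference, the element-level synchronization behind the R-arrows can be sketched as follows (an assumed, simplified Python model of an I-structure with deferred reads, not the DAVRID implementation): a read of an element that has not yet been written is deferred and answered later, in split-phase fashion, when the write arrives.

    # Simplified I-structure sketch: single-assignment elements with
    # element-level synchronization; unsatisfied reads are deferred.
    EMPTY = object()

    class IStructure:
        def __init__(self, size):
            self.values = [EMPTY] * size
            self.deferred = [[] for _ in range(size)]   # waiting read requests

        def read(self, i, reply):
            """Split-phase read: `reply` runs now or when the element arrives."""
            if self.values[i] is EMPTY:
                self.deferred[i].append(reply)           # defer; do not block the TPU
            else:
                reply(self.values[i])

        def write(self, i, value):
            assert self.values[i] is EMPTY, "I-structure elements are single-assignment"
            self.values[i] = value
            for reply in self.deferred[i]:               # satisfy pending readers
                reply(value)
            self.deferred[i].clear()

    ia = IStructure(4)
    ia.read(2, lambda v: print("reader 1 got", v))       # arrives before the write
    ia.write(2, 42)                                      # releases the deferred read
    ia.read(2, lambda v: print("reader 2 got", v))       # satisfied immediately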

Figure 5 shows the mechanism for array handling in the parallel object-oriented model. The arrows mean the same as in Figure 4, but there are two kinds of requests: local requests and remote requests. If the TPU needs member data, the request is resolved locally; otherwise, it sends a message to the NIMU (Node Interface and Management Unit) for a remote data request. Each arrow means the following:

A-1: The TPU notifies the SU to allocate the array data.
A-2: The SU allocates the array data as member data.
A-3: The SU notifies the TPU of the completion of the allocation.

As Figure 5 shows, no message traffic incurs a long latency. The following arrows are for requesting array data:

LR-1: The TPU notifies the SU of a request for array data.
LR-2: The SU fetches the data.
LR-3: The SU notifies the TPU of the completion of the array data fetch.

A remote data request is still needed, even in the parallel object-oriented model, when the TPU needs member data of an object allocated on another node as arguments:

RR-1: The TPU sends a message requesting array data located on another node.
RR-2: The NIMU routes the request to the corresponding node.
RR-3: The node that receives the remote request message sends a message with the data and synchronization information to the SU.

The rest of the procedure is the same as for a remote request in the functional model. Once we retrieve data from a remote object frame, we do not need to retrieve it again, because it is preserved in the local area; this is the difference between the two models. Thus, by keeping an object frame, the parallel object-oriented model reduces the message traffic that incurs long latency and exploits locality within a node.
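This local reuse can be sketched as follows (assumed Python code, not the actual DAVRID run-time; LocalFrameArea, request_member, and the example object names are hypothetical): repeated uses of a remote member on the same node issue only one remote request.

    # Sketch of keeping remotely fetched member data in the local frame area.
    class LocalFrameArea:
        def __init__(self, node_id, request_member):
            self.node_id = node_id
            self.request_member = request_member  # issues the data request (RR-1 if remote)
            self.copies = {}                      # (object_id, member) -> preserved value
            self.remote_requests = 0

        def get_member(self, object_id, member, owner_node):
            key = (object_id, member)
            if key in self.copies:                # already preserved locally: no message
                return self.copies[key]
            if owner_node != self.node_id:        # member lives on another node
                self.remote_requests += 1         # one RR-1 message over the network
            value = self.request_member(object_id, member)
            self.copies[key] = value              # keep it for later local reuse
            return value

    # Example: node 0 reads a member of an object owned by node 1 three times.
    store = {("vecB_3", "data"): [5, 7, 9]}
    area = LocalFrameArea(node_id=0, request_member=lambda o, m: store[(o, m)])
    for _ in range(3):
        area.get_member("vecB_3", "data", owner_node=1)
    print("remote requests issued:", area.remote_requests)   # prints 1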

5. Analysis

In this section, we compare the performance of the parallel object-oriented model with that of the functional model. In the functional model, the I-structure is located in a global area and shared by all the loop functions, which are executed over many nodes. In the parallel object-oriented model, the object frame distributed to each node keeps its member data in local memory. We simulated both models on the DAVRID simulator, which executes benchmarks using an event-list scheduling mechanism [17]: the simulator maintains a data structure representing the event list and a variable representing the current simulation time; it puts generated events into the event list and then retrieves and processes them at the times at which they should be handled. The benchmark programs are matrix multiplication and the N-body problem, written in the OOId programming language. These programs are compiled to DAVRID PML (Parallel Machine Language) [15]. We compile each benchmark into two versions, one for the functional model and one for the parallel object-oriented model.

Matrix multiplication is the typical benchmark for evaluating the effects of parallel execution. In the functional model, all the matrix data are allocated in an I-structure before the multiplication, and the unfolding degree of the parallel loop frames is set to 3, 5, and 0 for the nested loop levels. In the parallel object-oriented model, we treat each matrix as an array of vector objects and distribute them over the nodes. The N-body problem computes the behavior of N particles interacting through gravitational or coulombic forces and demonstrates a programming style based on the concept of information hiding [16], in which data structures and their operations are associated and encapsulated within programs. In the functional model, all the structured data are allocated in an I-structure and the unfolding degree of the parallel loop frames is set to 15 and 0 for the nested loop levels, whereas in the parallel object-oriented model we treat each particle as an object.

Figure 6. Simulation results of the matrix multiplication benchmark: (a) total execution cycles and speedup; (b) execution cycles of each unit (N=60).

Figure 6(a) shows that the parallel object-oriented model improves performance by exploiting locality when the matrix multiplication benchmark is executed with various matrix sizes. The number of total execution cycles, used as the measure for comparison, is the total number of clock cycles that elapse in the DAVRID system from the start of program execution to its end; it is equivalent to the program execution time. The speedup is over 8, and it increases as the problem size increases. This means that treating each vector as an object enhances the locality of the data. The major cause of this speedup is the avoidance of concentrated accesses to the global data. Figure 6(b) compares the average execution cycles of the processor units in the two models. We can see a reduction of execution cycles in all units. Avoiding the concentration of data accesses reduces the execution cycles of the NIMU remarkably. The improvement in the TPUs and SUs results from reduced synchronization with other nodes, owing to the spatial locality of member data.

In the case of the N-body problem, two parameters affect performance: the number of bodies N and the time parameter T. Figure 7(a) shows that the average speedup is about 7 when the time parameter T is fixed at 20. Figure 7(b) compares the average execution cycles of the units in the two models. The parallel object-oriented model reduces the execution cycles of most units, especially the NIMU. The improvement of the NIMU results from the spatial locality of member data, in the same manner as for matrix multiplication.

As Figure 6(b) and Figure 7(b) show, the rate of improvement of the SU is relatively low, while the execution cycles of the NIMUs and TPUs decrease remarkably. In the N-body problem, the execution cycles of the SU even increase. This results from keeping array data in frame memory instead of the I-structure, which shifts part of the computational load of the NIMU onto the SUs. It can still improve total performance, because the NIMU usually carries a heavier load than the other units.

Figure 7(c) shows the change of speedup with the time parameter T. The speedup increases in proportion to the time parameter, which shows the exploitation of another kind of locality in the N-body problem. In the parallel object-oriented model, messages may concentrate at the node containing the main function frame, because that node keeps the object frame pointers; this increases the overhead of the SU at that node, since it must handle the concentrated messages. In the N-body problem, however, we can improve performance by increasing the time parameter T. This improvement comes from temporal locality: the data from the previous time step are kept in the object frame, whereas in the functional model all the data must be recalculated at each time step.

Figure 7. Simulation results of the N-body problem benchmark: (a) total execution cycles and speedup (T=20); (b) execution cycles of each unit (N=60, T=20); (c) change of speedup with time parameter T (N=60).

6. Conclusion and Future Work

Most multithreading models based on dataflow use functional programming languages, and functional programs have abundant inherent parallelism. However, this abundant parallelism can overwhelm the machine resources and make it difficult to exploit the locality of computations. Most such models use the I-structure, placed in a global heap memory shared by all code blocks, so requests and responses for I-structures must be handled in split-phase fashion to tolerate the long, unpredictable latency that each request incurs.

In this paper, we introduced a parallel object-oriented model that exploits the locality of array data by distributing the data as the member data of objects. In this model, it is easy to write parallel programs and to exploit the locality of array data naturally. Experimental results show that we can exploit locality when the benchmark has a regular access pattern, such as matrix multiplication. When the benchmark requires the creation of many objects and complex communication among them, we can still obtain a good performance improvement by using the state information kept in object frames.

One important piece of future work is to improve load balancing. In general, every message requesting a frame allocation carries the size of the frame in its operand, which helps the switching unit balance the computing load among nodes. This simple load balancing method may lead to message concentration on a specific frame, because the node that requests the allocation of an object array must keep the frame pointers and process the messages that try to read them. We have been studying new load balancing methods to decrease this concentration. In the new methods, the concept of a frame-token is used and the NIMU takes charge of managing the allocated frame pointers. This will decrease the synchronization overhead caused by message traffic for frame pointers and provide general strategies for data distribution.

References

[1] Bhanu Shankar, Lucas Roh, Wim Bohm, and Walid Najjar, "Control of Loop Parallelism in Multithreaded Code," Proc. of the Int'l Conference on Parallel Architectures and Compilation Techniques, June.
[2] Arvind, R. S. Nikhil, and K. K. Pingali, "I-Structures: Data Structures for Parallel Computing," ACM Transactions on Programming Languages and Systems, Vol. 11, No. 4, Oct.
[3] Walid Najjar, W. M. Miller, and A. P. W. Bohm, "Data Driven Vector Execution," Technical Paper, Dept. of Computer Science, Colorado State University.

[4] E. H. Rho, S. Y. Han, H. H. Kim, and D. J. Hwang, "Effects of Data Bundling in Non-strict Data Structure," Proc. of the Int'l Conference on Parallel Architectures and Compilation Techniques, June.
[5] Y. H. Kim, S. H. Kim, D. W. Rhee, H. H. Kim, Juno Chang, and S. Y. Han, "Exploiting the Locality of Data Structures in Multithreaded Architecture," Proc. of the Int'l Conference on Parallel and Distributed Systems, June.
[6] Hyong-Shik Kim, Soonhoi Ha, and Chu Shik Jhon, "Quantitative Analysis of Caching Effect of I-structure Data in Frame-based Multithreaded Processing," Proc. of the Int'l Conference on Parallel Processing, August.
[7] Juno Chang, J. H. Song, J. H. Kim, H. H. Kim, and S. Y. Han, "Implementation of an Object-Oriented Functional Language on the Multithreaded Architecture," Proc. of the Int'l Conference on Parallel and Distributed Systems, Dec.
[8] S. H. Ha, J. H. Kim, E. H. Rho, H. H. Kim, D. J. Hwang, and S. Y. Han, "A Massively Parallel Multithreaded Architecture: DAVRID," Proc. of the Int'l Conference on Computer Design, pp. 70-74, Oct.
[9] Lucas Roh and Walid Najjar, "Analysis of Communication and Overhead Reduction in Multithreaded Execution," Proc. of PACT '95, Cyprus.
[10] K. E. Schauser, "Compiling Lenient Languages for Parallel Asynchronous Execution," Ph.D. thesis, Computer Science Division, UC Berkeley.
[11] R. S. Nikhil, "Id Language Reference Manual (Version 90.1)," MIT CSG Memo 284-2, July.
[12] K. R. Traub, D. E. Culler, and K. E. Schauser, "Global Analysis for Partitioning Non-strict Programs into Sequential Threads," Motorola Technical Report MCRC-TR-26, April.
[13] T. Johnsson, "Lambda Lifting: Transforming Programs to Recursive Equations," Springer-Verlag LNCS 201, Sep.
[14] Andrew S. Grimshaw, "Easy-to-Use Object-Oriented Parallel Processing with Mentat," Computer, Vol. 26, No. 5, pp. 39-51, May.
[15] E. H. Rho, S. H. Ha, H. H. Kim, D. J. Hwang, and S. Y. Han, "Compilation of a Functional Language for the Multithreaded Architecture: DAVRID," Proc. of the Int'l Conference on Parallel Processing, Vol. 2, Aug.
[16] K. Mani Chandy and Stephen Taylor, An Introduction to Parallel Programming, Jones and Bartlett Publishers.
[17] M. H. MacDougall, Simulating Computer Systems: Techniques and Tools, The MIT Press.


Oops known as object-oriented programming language system is the main feature of C# which further support the major features of oops including: Oops known as object-oriented programming language system is the main feature of C# which further support the major features of oops including: Abstraction Encapsulation Inheritance and Polymorphism Object-Oriented

More information

Python in the Cling World

Python in the Cling World Journal of Physics: Conference Series PAPER OPEN ACCESS Python in the Cling World To cite this article: W Lavrijsen 2015 J. Phys.: Conf. Ser. 664 062029 Recent citations - Giving pandas ROOT to chew on:

More information

High Performance Computing

High Performance Computing Master Degree Program in Computer Science and Networking, 2014-15 High Performance Computing 1 st appello January 13, 2015 Write your name, surname, student identification number (numero di matricola),

More information

Breakpoints and Halting in Distributed Programs

Breakpoints and Halting in Distributed Programs 1 Breakpoints and Halting in Distributed Programs Barton P. Miller Jong-Deok Choi Computer Sciences Department University of Wisconsin-Madison 1210 W. Dayton Street Madison, Wisconsin 53706 Abstract Interactive

More information

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:

More information

High Performance Computing

High Performance Computing The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical

More information

DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor

DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor Kyriakos Stavrou, Paraskevas Evripidou, and Pedro Trancoso Department of Computer Science, University of Cyprus, 75 Kallipoleos Ave., P.O.Box

More information

WaveScalar. Winter 2006 CSE WaveScalar 1

WaveScalar. Winter 2006 CSE WaveScalar 1 WaveScalar Dataflow machine good at exploiting ILP dataflow parallelism traditional coarser-grain parallelism cheap thread management memory ordering enforced through wave-ordered memory Winter 2006 CSE

More information

Computer Systems A Programmer s Perspective 1 (Beta Draft)

Computer Systems A Programmer s Perspective 1 (Beta Draft) Computer Systems A Programmer s Perspective 1 (Beta Draft) Randal E. Bryant David R. O Hallaron August 1, 2001 1 Copyright c 2001, R. E. Bryant, D. R. O Hallaron. All rights reserved. 2 Contents Preface

More information

Offloading Java to Graphics Processors

Offloading Java to Graphics Processors Offloading Java to Graphics Processors Peter Calvert (prc33@cam.ac.uk) University of Cambridge, Computer Laboratory Abstract Massively-parallel graphics processors have the potential to offer high performance

More information

A Compiler-Directed Cache Coherence Scheme Using Data Prefetching

A Compiler-Directed Cache Coherence Scheme Using Data Prefetching A Compiler-Directed Cache Coherence Scheme Using Data Prefetching Hock-Beng Lim Center for Supercomputing R & D University of Illinois Urbana, IL 61801 hblim@csrd.uiuc.edu Pen-Chung Yew Dept. of Computer

More information

Module 13: INTRODUCTION TO COMPILERS FOR HIGH PERFORMANCE COMPUTERS Lecture 25: Supercomputing Applications. The Lecture Contains: Loop Unswitching

Module 13: INTRODUCTION TO COMPILERS FOR HIGH PERFORMANCE COMPUTERS Lecture 25: Supercomputing Applications. The Lecture Contains: Loop Unswitching The Lecture Contains: Loop Unswitching Supercomputing Applications Programming Paradigms Important Problems Scheduling Sources and Types of Parallelism Model of Compiler Code Optimization Data Dependence

More information

Parallel Logic Simulation on General Purpose Machines

Parallel Logic Simulation on General Purpose Machines Parallel Logic Simulation on General Purpose Machines Larry Soule', Tom Blank Center for Integrated Systems Stanford University Abstract Three parallel algorithms for logic simulation have been developed

More information

University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2.

University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2. University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2. Solutions Problem 1 Problem 3.12 in CSG (a) Clearly the partitioning can be

More information

CS 31: Intro to Systems Threading & Parallel Applications. Kevin Webb Swarthmore College November 27, 2018

CS 31: Intro to Systems Threading & Parallel Applications. Kevin Webb Swarthmore College November 27, 2018 CS 31: Intro to Systems Threading & Parallel Applications Kevin Webb Swarthmore College November 27, 2018 Reading Quiz Making Programs Run Faster We all like how fast computers are In the old days (1980

More information