Chapter 1

Parallel Sparse Matrix-Vector Multiplication using a Shared Virtual Memory Environment

Francois Bodin, Jocelyne Erhel, Thierry Priol
IRISA - Campus Universitaire de Beaulieu - Rennes Cedex - France

Reprinted from "Proc. 6th SIAM Conference on Parallel Processing for Scientific Computing", Norfolk, Virginia (USA), March 1993.

Abstract

Many iterative schemes in scientific applications require the multiplication of a sparse matrix by a vector. This kernel has mainly been studied on vector processors and shared-memory parallel computers. In this paper, we address the implementation issues that arise when using a shared virtual memory system on a distributed memory parallel computer. We study in detail the impact of loop distribution schemes in order to design an efficient algorithm.

1 Introduction

Many scientific applications require computations on large sparse matrices. Most iterative schemes include the multiplication of a sparse matrix by a vector, which should therefore be efficient on parallel computers. The algorithm depends directly on the storage scheme chosen; various schemes have been devised, as explained in [1]. This paper deals with a parallel version of this kernel designed for a distributed memory parallel computer (DMPC) equipped with a Shared Virtual Memory (SVM).

The basic idea of an SVM is to hide the underlying architecture of DMPCs by providing a virtual address space to the user. This address space is partitioned into pages spread among the local processor memories, and each local memory acts as a large software cache for storing the pages requested by its processor. A DMPC can thus be programmed like a more conventional shared-memory parallel computer. However, accessing data through an SVM can dramatically decrease the efficiency of a parallel algorithm if data locality is not well exploited. The aim of this paper is to demonstrate that using both an adequate data storage and a suitable loop distribution scheme keeps this overhead low. Blocked matrix-vector multiplication is not considered: such an algorithm can decrease the communication cost of vector broadcasting and reduction, but an efficient use of the SVM would then require a new storage of the matrix and so would introduce many changes in Fortran programs. We also do not address the use of the internal processor cache, which is studied in [6].

The paper is organized as follows. Sections 2 and 3 briefly describe the KOAN Shared Virtual Memory and the Fortran-S compiler targeted to KOAN. Section 4 is devoted to the description of the parallel algorithm, mainly to the impact of sharing the resulting vector; a specific loop partition scheme is designed in order to distribute the workload. Section 5 gives and comments on experimental results for various general non-symmetric sparse matrices.

This work is partially supported by Intel SSD under contract no. C.

2 KOAN: a Shared Virtual Memory for the iPSC/2

KOAN is a Shared Virtual Memory designed at IRISA for the Intel hypercube iPSC/2 [4]. It is embedded in the operating system of the iPSC/2, which allows it to use fast, low-level communication primitives as well as the Memory Management Unit (MMU). Pages are managed with the fixed distributed manager algorithm described in [5], and consistency of the shared memory is guaranteed by an invalidation protocol. This algorithm represents a good tradeoff between ease of implementation and efficiency. Let us now summarize some of the functionalities of the KOAN SVM system.

KOAN provides the user with several memory management protocols for handling particular memory access patterns. An important one occurs when several processors have to write into different locations of the same page. This pattern generates many messages, since the page has to move from processor to processor (the ping-pong effect, or false sharing). A weak cache coherence protocol can be used to let the processors modify their own copies of a page concurrently and to merge the local copies into the shared page afterwards.

Parallel algorithms based on a producer/consumer scheme are also inefficient on DMPCs when using an SVM system: typically, a page is first updated by one processor and then accessed by the other processors. KOAN can manage this memory access pattern efficiently by using the broadcasting facility of the underlying topology of the DMPC (hypercube, 2D mesh, etc.): the producer processor broadcasts all the pages updated during the producer phase to the consumer processors.

These two memory management protocols are available to the user through several operating system calls, which are used either to specify a program section where the weak cache coherence protocol has to be used or to delimit a producer phase.

3 Fortran-S: A programming interface for KOAN

A "user-friendly" programming environment for DMPCs has been designed at IRISA by providing high-level parallel constructs. The user's code is written in standard Fortran-77 and contains directives to express parallel loops and shared variables (a shared variable can be read or written by all the processors). Shared variables are used for data structures that can be computed in parallel, so parallel loops are intended to compute the values of the shared variables declared in the program.

Parallel execution is achieved using an SPMD (Single Program Multiple Data) execution model. At the beginning of the program execution, a thread is created on each processor and each processor starts to execute the program. All non-shared variables are duplicated on the processors, and the shared memory space is allocated by the KOAN SVM at the beginning of the execution. To execute a parallel do loop, the compiler distributes chunks of the iteration space to the processors according to the iteration distribution specified in the directives. The Fortran-S compiler is in charge of generating the parallel processes, with the KOAN low-level operating system calls, from the source code.

This approach provides a convenient parallel programming environment for several reasons. It allows an easy and efficient way of programming parallel algorithms; moreover, it facilitates debugging, since the programs can be compiled and executed on a workstation. Although parallelism is based on a shared-memory approach, the programming environment also provides message-based primitives that can be used to handle global operations and synchronizations efficiently.
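To make the SPMD execution model concrete, the sketch below shows roughly what a BLOCK-scheduled parallel loop over 1..n could be turned into on each node. This is only an illustration under our own assumptions, not the actual output of the Fortran-S compiler; we assume the usual iPSC calls mynode(), numnodes() and gsync() for node identification and global synchronization, and we assume n, x and y are declared elsewhere (y shared, x local).

C     SPMD sketch of a BLOCK-scheduled parallel loop
C     (illustration only; not generated by Fortran-S)
      integer me, p, chunk, lo, hi, i
      me = mynode()
      p  = numnodes()
C     each processor owns a contiguous chunk of the iteration space
      chunk = (n + p - 1) / p
      lo = me * chunk + 1
      hi = min(n, (me + 1) * chunk)
      do 10 i = lo, hi
         y(i) = 2.0 * x(i)
   10 continue
C     all processors synchronize at the end of the parallel loop
      call gsync()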
The prototype compiler is implemented using the Sigma-II system developed at Indiana University [3]. The compilation technique is not described further in this paper, for the sake of brevity. In the remainder of this section, we give a brief overview of the main directives.

In Fortran-S, shared variables must be declared explicitly by means of a directive, and the declaration must appear in the main program. All other variables are non-shared: each processor has its own copy of the variable. A shared variable is declared using the following directive:

      REAL V(N,N)
C$ann[Shared(v)]

The iterations of a parallel loop are distributed among the processors, and the processors synchronize at the end of all the iterations. Several static scheduling strategies are provided by the compiler, and the user can also define his own strategy, which may be dynamic. Among these strategies, the compiler can allocate chunks of consecutive iterations to the processors or can distribute the iterations cyclically. These schedulings provide a good load balance if the work in the iterations is almost equally distributed; otherwise, more sophisticated schemes must be used. Therefore the Fortran-S compiler also accepts user-defined iteration partitions, which can be either static or dynamic (i.e. set at run time). A parallel loop is declared using the directive:

C$ann[DoShared("scheduling")]
      do 20 nel = 1, nbnel
         sounds(nel) = sqrt(gama * p(nel) / ro(nel))
   20 continue

where the string "scheduling" indicates the scheduling strategy for the iterations and can take the following values:

1. "BLOCK": chunks of contiguous iterations are assigned to the processors.

2. "CYCLIC": iterations are distributed over the processors according to a modulo scheme; the first iteration is assigned to the first processor, the second to the second processor, and so on.

3. "USER": user-defined partitions. The partition of the iteration space is specified by the user at run time. This feature is important when the load balance depends on the data of the program.

The weak cache coherence protocol described briefly in Section 2 can be associated with a shared variable during the execution of a parallel loop thanks to the following directive:

C$ann[WeakCoherency(y)]

An example of its use is given below:

C$ann[DoShared("BLOCK")]
C$ann[WeakCoherency(y)]
      do 1 i = 1, n
         y(i) = f(...)
    1 continue

In this example y is assumed to be a shared variable written simultaneously by many processors, so there may be false sharing on the pages where the variable y is stored. The weak coherence protocol removes that phenomenon by merging the updates of the shared pages only once, at the end of the loop.
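For illustration (this is our own sketch, not part of the Fortran-S interface), the processor, numbered from 0 to p-1, that executes iteration i of a loop over 1..n under each of the two static schedulings can be written as:

C     Owner of iteration i under the two static schedulings
C     (illustration only; processors are numbered 0..p-1)
      integer function ownblk(i, n, p)
      integer i, n, p
C     BLOCK: contiguous chunks of ceiling(n/p) iterations
      ownblk = (i - 1) / ((n + p - 1) / p)
      end

      integer function owncyc(i, p)
      integer i, p
C     CYCLIC: iterations dealt out round-robin over the processors
      owncyc = mod(i - 1, p)
      end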

4 Algorithm and Data Structures

The multiplication of a sparse matrix by a vector is a CPU-intensive kernel found in most iterative schemes. The algorithm depends on the storage scheme, which may or may not include some zeros. Here we choose a compressed storage by rows, which is commonly used and well suited to parallel multiplication. Let n be the order of the matrix and nz the number of non-zero elements. A real array a of length nz contains all the coefficients, while an integer array ja of the same length nz contains the corresponding column indices. An auxiliary integer array ia of length n + 1 points to the first element of each row.

An intrinsic parallelism derives readily from this storage scheme: since the sparse matrix multiplication is expressed by rows, it is sufficient to partition the matrix by rows and to handle the different blocks of rows in parallel. More precisely, the sequential algorithm is composed of an outer loop on the rows with an inner loop on the elements of each row, as shown in the program below. The outer iterations on the rows are clearly independent, so they can be assigned to parallel tasks. Below are the sequential algorithm and the parallel version.

c     sequential version
      do i = 1, n
         y(i) = 0.
         do k = ia(i), ia(i+1)-1
            y(i) = y(i) + a(k) * x(ja(k))
         end do
      end do

c     parallel version
c     shared variables  : a, ja, y
c     private variables : n, i, k, temp, x, ia
C$ann[DoShared("sched")]
C$ann[WeakCoherency(y)]
      do i = 1, n
         temp = 0.
         do k = ia(i), ia(i+1)-1
            temp = temp + a(k) * x(ja(k))
         end do
         y(i) = temp
      end do
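As a concrete illustration of this storage (a small example of our own; the example matrix printed in the original paper is not reproduced here), consider a matrix of order n = 4 with nz = 7 non-zero entries:

          | 1. 0. 2. 0. |
      A = | 0. 3. 0. 0. |
          | 4. 0. 5. 6. |
          | 0. 0. 0. 7. |

      a(1:7)  = ( 1.  2.  3.  4.  5.  6.  7. )
      ja(1:7) = ( 1   3   2   1   3   4   4 )
      ia(1:5) = ( 1   3   4   7   8 )

With this layout, the inner loop for row i = 3 runs over k = ia(3), ..., ia(4)-1 = 4, 5, 6 and accumulates a(4)*x(1) + a(5)*x(3) + a(6)*x(4).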

The data structures are the operand vector x, the resulting vector y and the sparse matrix, composed of the three vectors a, ia, ja. Only the vector y is written; the others are read-only. The vectors x and ia, of length n and n + 1 respectively, can be copied into the local memories. On the contrary, we assume that the vectors a and ja cannot be duplicated because of memory requirements. In our environment, programming is simplified by the use of shared arrays, which are managed by KOAN and the compiler. Since the scheduling chosen for the algorithm requires a data distribution that differs from the initial distribution, the system has to move pages during the first executions of the multiplication. But after a few iterations the pages are located correctly and stay in the local memories, provided there is enough space, so that the SVM overhead becomes negligible.

Here we do not deal with further uses of the operand vector x in the application. In general, the matrix-vector multiply would require a global broadcast of the vector x, or implicit copies of it into the local memories via the SVM, introducing a meaningful overhead.

As far as the resulting vector y is concerned, two cases must be studied. If the vector is used throughout the application with the same static partition, it may remain local to the processors: thanks to the SPMD programming model, if the vector y is not declared as shared, each processor simply keeps its partial results. On the other hand, if its next use requires a different partition, the vector must be shared. We then use the weak coherence protocol to eliminate false sharing when updating the shared array y.

The rows must be partitioned in order to balance the workload. The simplest strategy, which is provided by automatic parallelizers for instance, is to divide the rows into slices of almost equal size. This has been done, for example, in [7] to implement a sparse linear solver on a KSR computer. However, the load of the algorithm is not measured by the order of the matrix, but rather by its number of non-zeros. Therefore, for some matrices, a better partition consists in computing slices with almost the same number of non-zeros but with unequal numbers of rows. We have experimented with both strategies, by using the directive "DoShared" with the two options "BLOCK" and "USER", where "USER" implements a partition of the iteration space into blocks with roughly the same number of non-zeros. In the latter strategy, the vector ia is used to compute the partition of the iteration space among the processors. Because this operation is done only once (the matrix structure is not modified between two matrix-vector multiplies, so the computed scheduling stays valid) and because the vector ia is a private vector, the overhead is kept low.
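A possible way to compute such a non-zero-balanced partition from the row pointer ia is sketched below. This is our own illustration under assumed conventions (the paper does not give the code it uses): processor q, numbered from 0 to p-1, is assigned rows first(q+1) to first(q+2)-1.

C     Compute a row partition balancing the number of non-zeros
C     (illustration only; not the authors' code)
      subroutine nzpart(n, p, ia, first)
      integer n, p, ia(n+1), first(p+1)
      integer q, row, nz, cut
      nz = ia(n+1) - 1
      first(1) = 1
      row = 1
      do q = 1, p - 1
C        cut = number of non-zeros the first q processors should own
         cut = (q * nz) / p
C        advance row until rows 1..row-1 hold at least cut non-zeros
         do while (row .le. n .and. ia(row) - 1 .lt. cut)
            row = row + 1
         end do
         first(q+1) = row
      end do
      first(p+1) = n + 1
      end

The resulting boundaries would then describe the iteration partition handed to the "USER" scheduling; since ia is private and the computation is linear in n, doing it once per matrix keeps the cost negligible, as noted above.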
5 Numerical Results

The algorithm is implemented using Fortran-S on a hypercube iPSC/2 with at most 32 processors. It should be noted that the peak performance of one processor is only 0.3 Mflops. The code is tested on various matrices (see Table 1) in order to measure the impact of the order and of the number of non-zeros; they are band matrices or come from the Harwell-Boeing collection [2]. The resulting vector y is either distributed in the local memories (y is private, so the results of the matrix-vector multiply are distributed among the local copies of y) or managed by the shared virtual memory (y is declared as shared) using the weak cache coherence protocol. The performance difference between the two experiments measures the cost of the weak coherence protocol used on the vector y. The two scheduling strategies described above (BLOCK and USER) are also tested. Timings in seconds and rates in Mflops are given for one hundred iterations in single precision (see Tables 2, 3, 4 and 5). Results for the large problems are not given on one or two processors because the size of the data exceeds the available memory.

The experimental results show that the size of the problem is well measured by the number of non-zeros in the matrix. Indeed, the results on the band matrices show similar performance, mainly for the version with distributed vectors, because they have almost the same number of non-zeros although they have different orders. Performance degradation comes both from load imbalance and from the merging of the pages where parts of the vector y are stored (this is due to the weak coherence protocol). The version with distributed vectors gives very good speed-ups, even for small problems on 32 processors, since data transfers between processors only occur during the first matrix-vector multiply; we recall that here the operand vector x is not updated.

The overhead induced by the merge into the shared vector is proportional to the maximal number of processors writing to the same page when merging the shared array. In the case of the BLOCK strategy, this quantity can be roughly estimated by max(2, min(p, c/(n/p))), where c is the page size (here 1024 words of 32 bits), p is the number of processors and n is the order of the matrix. Small matrices such as bcspwr09 and band24 lead to a high overhead with 32 processors, because the shared array fits in two pages. But for large n the merge operation becomes relatively cheap, since at most two processors share the same page. Load imbalance is particularly noticeable for the matrix orani678, because its distribution of non-zeros is far from uniform.

The first partition (BLOCK) leads to poor performance, whereas the second partition (USER) gives better performance by distributing the work equally. For band matrices, where the rows have roughly the same number of non-zeros, both strategies yield similar results. We have also measured the SVM overhead due to page moves for each iteration: as expected, the first iteration is quite costly, but in the following ones this overhead can effectively be neglected.

Table 1. Sparse matrices used: order and number of non-zeros of bcspwr09, lns3937, band5, orani678, band20 and bcsstk24.

Table 2. Results with distributed vectors and with row partitioning (BLOCK): time in seconds and rate in Mflops for each matrix of Table 1, as a function of the number of processors.

6 Conclusion

The results presented in this paper demonstrate that the use of a Shared Virtual Memory is an efficient way of programming distributed memory parallel computers. Our environment provides tools, based on directives, which automatically distribute both data and computations. However, users still have to distribute the loop iterations carefully in order to balance the workload and to limit remote accesses through the SVM as much as possible. We have shown that the computations involved in a sparse matrix-vector multiply can easily be distributed by using an adequate loop scheduling strategy. Concerning data locality, the overhead induced by reading the matrix comes from an initial system distribution that is not related to the loop distribution, but these page moves become negligible as soon as the number of iterations calling the sparse matrix-vector multiply becomes sizable. The main limitations to the speed-up are due to the effective sharing of the resulting vector and of the operand vector.

Table 3. Results with one shared vector and with row partitioning (BLOCK): time in seconds and rate in Mflops for each matrix of Table 1, as a function of the number of processors.

Table 4. Results with one shared vector and with non-zeros partitioning (USER): time in seconds and rate in Mflops for each matrix of Table 1, as a function of the number of processors.

Table 5. Results on band matrices (BLOCK): order, number of non-zeros, time and rate in Mflops for band5, band10 and band24, with the shared and with the distributed resulting vector, on 8 and on 32 processors.

We studied here the first effect, on a shared virtual memory, and found that the overhead becomes small for matrices of large order. The second limitation occurs in an application that calls the kernel iteratively and distributes the vector x at each iteration. Since this overhead increases with the number of processors, it may become the main bottleneck of an application [7]. We plan to investigate this problem, and the global data usage, by implementing an iterative sparse linear solver requiring a sparse matrix-vector multiply.

References

[1] I. Duff, A. Erisman, and J. Reid, Direct Methods for Sparse Matrices, Oxford University Press, London, 1986.

[2] I. Duff, R. Grimes, and J. Lewis, Sparse matrix test problems, ACM TOMS, 15 (1989), pp. 1-14.

[3] D. Gannon, J. K. Lee, B. Shei, S. Sarukai, S. Narayana, N. Sundaresan, D. Atapattu, and F. Bodin, Sigma II: a tool kit for building parallelizing compilers and performance analysis systems, Proceedings of the IFIP Edinburgh Workshop on Parallel Programming Environments, April 1992.

[4] Z. Lahjomri and T. Priol, KOAN: a shared virtual memory for the iPSC/2 hypercube, in CONPAR/VAPP 92, September 1992.

[5] K. Li, Shared Virtual Memory on Loosely Coupled Multiprocessors, PhD thesis, Yale University, September 1986.

[6] O. Temam and W. Jalby, Characterizing the behavior of sparse algorithms on caches, Proceedings of Supercomputing '92.

[7] D. Windheiser, E. Boyd, E. Hao, and S. Abraham, KSR1 Multiprocessor: Analysis of Latency Hiding Techniques in a Sparse Solver, Research Report, University of Michigan, November 1992.
