A New Parallel Matrix Multiplication Algorithm on Tree-Hypercube Network using IMAN1 Supercomputer
Orieb AbuAlghanam, Mohammad Qatawneh
Computer Science Department, University of Jordan

Hussein A. al Ofeishat
Computer Science Department, Al-Balqa Applied University, Jordan

Omar Adwan, Ammar Huneiti
Computer Information Systems, University of Jordan

Abstract—The tree-hypercube (TH) interconnection network is a relatively new interconnection network constructed from tree and hypercube topologies. TH was developed to support parallel algorithms for solving computation- and communication-intensive problems. In this paper, we propose a new parallel matrix multiplication algorithm on the TH network and present a broadcast communication operation for TH using the store-and-forward technique, namely the one-to-all broadcast operation, which allows a message to be transmitted through the shortest path from the source node to all other nodes. The proposed algorithm is implemented and evaluated in terms of running time, efficiency, and speedup for different data sizes on the IMAN1 supercomputer. The experimental results show that, as the number of processors increases, the runtime and parallel efficiency of the proposed algorithm decrease while the speedup increases, for matrices of size 1000×1000, 2000×2000, and 4000×4000.

Keywords—MPI; supercomputer; tree-hypercube; matrix multiplication

I. INTRODUCTION

Parallel matrix multiplication is considered a backbone for several scientific applications, and many studies have developed matrix multiplication from different perspectives. In [3], [6], [7], [10], [14] the authors studied the bandwidth limitation of memory-processor communication and how to reduce the gap between memory and processor speed, presenting a communication-efficient mapping of a large-scale matrix multiplication algorithm. In [1] a parallel matrix multiplication algorithm on hypercube multiprocessors was proposed, while in [2], [5], [6] the authors applied matrix multiplication with a hypercube algorithm on a multi-core processor cluster. In [3] the author proposed the Distribution-Independent Matrix Multiplication Algorithm (DIMMA), a new, fast, and scalable matrix multiplication algorithm for block-cyclic data distribution on distributed-memory concurrent computers. In this paper we apply matrix multiplication to the tree-hypercube network, which has previously been used for adaptive fault-tolerant routing algorithms [8], [11], [12], [17].

Many matrix multiplication algorithms have been proposed on different network types, both homogeneous and heterogeneous, in order to reduce the running time drastically and thereby improve system performance. Matrix multiplication demands high computation time, especially when the matrix becomes huge; some problem instances would need years to be solved on a personal computer. Interconnection networks are the core of a parallel processing system: they link the system's processors. Because network topology plays a large role in improving a parallel system's performance, several interconnection network topologies have been proposed for that purpose, such as the tree, hypercube, mesh, ring, and Hex-Cell (HC) [12], [13], [16]. Among the wide variety of interconnection network structures proposed for parallel computing systems, the tree-hypercube network has received much attention due to the attractive properties inherent in its topology [4], [8], [9], [15]. This paper aims to design and analyze an efficient matrix multiplication algorithm on the tree-hypercube network.
Experimentation with the proposed algorithm was conducted on IMAN1, Jordan's first supercomputer, which is available for use by academia and industry in Jordan and the region. The rest of the paper is organized as follows. Section II summarizes the definition of the tree-hypercube network. Section III explains the proposed algorithm and discusses sequential and parallel matrix multiplication. Section IV presents the evaluation results. Finally, Section V concludes the paper.

II. DEFINITION OF TREE-HYPERCUBE NETWORK

A tree-hypercube TH(s, d) can be defined as follows. Each node is labelled with a level number L and a node number X, in the form (L, X). The network is essentially a tree that contains a hypercube at each level, because level i of the tree has s^i nodes, and those processing elements are connected as a hypercube [4], [9]. For every level L between 0 and d, each node (L, X) at level L, where X = x_{(d log s)-1} ... x_0, is adjacent to the following s children at level L+1: (L+1, X·a) for a = 0, ..., s-1, as shown in Fig. 1(a) and (b); a short sketch of this labelling follows Fig. 1 below [8].

Fig. 1. Tree-hypercube: (a) TH(2,2) and (b) TH(2,3).
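To make the labelling concrete, the following C++ sketch (our illustration of the definition above; the function name and the integer encoding of X are ours) enumerates the s children of a node (L, X), treating the concatenation X·a as appending the digit a in base s:

#include <cstdio>
#include <utility>
#include <vector>

// Children of node (L, X) in a tree-hypercube TH(s, d):
// (L, X) is adjacent to (L+1, X*s + a) for a = 0, ..., s-1,
// where X*s + a encodes the concatenation X.a in base s.
std::vector<std::pair<int, long>> children(int s, int d, int L, long X) {
    std::vector<std::pair<int, long>> c;
    if (L >= d) return c;  // nodes at the last level have no children
    for (int a = 0; a < s; ++a)
        c.push_back({L + 1, X * s + a});
    return c;
}

int main() {
    // TH(2,2): levels 0..2 hold 1, 2, and 4 nodes (7 processors in total).
    for (auto [l, x] : children(2, 2, 1, 1))
        std::printf("child: (%d, %ld)\n", l, x);  // prints (2,2) and (2,3)
}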
III. PROPOSED ALGORITHM

Matrix multiplication is a simple and widely used algorithm in many scientific applications. It is well structured in the sense that the elements of the matrices can be evenly distributed to the nodes and the communication among nodes follows a regular pattern. Therefore, exploiting data parallelism based on message passing is suitable for solving the matrix multiplication problem. Fig. 2 shows a sequential algorithm for n × n matrix multiplication consisting of three nested loops; as can be seen, the complexity of the algorithm is O(n^3).

Matrix-Multiplication(A, B)
    n = A.rows
    let C be a new n × n matrix
    for i = 1 to n
        for j = 1 to n
            c_ij = 0
            for k = 1 to n
                c_ij = c_ij + a_ik * b_kj

Fig. 2. A sequential multiplication algorithm for two matrices [18].

In this section, we propose a new parallel matrix multiplication algorithm on the tree-hypercube network using the IMAN1 supercomputer. The data distribution for matrix multiplication can follow either striped partitioning or checkerboard partitioning. We use a block-striped partitioning distribution to study the performance of the message-passing execution model. Under column-wise block striping of an n × n matrix on p processors (P_0, P_1, ..., P_{p-1}), processor P_i holds columns (n/p)·i, (n/p)·i + 1, ..., (n/p)·(i+1) − 1 [5]; a small sketch of this computation follows the pseudocode below. The proposed parallel matrix multiplication algorithm on the tree-hypercube network consists of three stages, as shown in Fig. 3. The first stage distributes the data among the nodes (processors), the second stage performs the multiplication at each processor, and the third stage collects the results.

Input: matrices A and B
Output: matrix C on the tree-hypercube, computed by parallel matrix multiplication

Stage 1: Broadcast matrices A and B
1. The coordinator node (CN) generates a set of blocks of matrix A.
2. The CN generates a set of blocks of matrix B.
3. The CN waits for an acknowledgment message from each node that received data.
4. Each node sends a message to the CN announcing completion.
5. The CN stops the distribution process and announces the beginning of the next stage.

Stage 2: Perform the multiplication in parallel
6. Multiply the stripes of matrix A with the stripes of matrix B (for each block of data).
7. Use sequential matrix multiplication, with all processors computing their entries c_ij.

Stage 3: Collect the data
A. Global data combining
8. At each level of the tree-hypercube interconnection, send the partial product matrices in parallel toward the TH root nodes.
B. Combining data at the tree-hypercube root nodes
9. The CN assembles the matrices collected from the root nodes into matrix C.
10. Exit.

Fig. 3. Pseudocode of the tree-hypercube algorithm for matrix multiplication.
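As a concrete illustration of the column-wise block striping above, the following minimal C++ sketch (variable names are ours; n is assumed divisible by p, matching the balanced case considered later in the analysis) prints the column range owned by each processor:

#include <cstdio>

// Column-wise block striping of an n x n matrix over p processors:
// processor P_i owns columns (n/p)*i through (n/p)*(i+1) - 1.
int main() {
    int n = 8, p = 4;  // illustrative sizes
    for (int i = 0; i < p; ++i) {
        int first = (n / p) * i;
        int last = (n / p) * (i + 1) - 1;
        std::printf("P%d owns columns %d..%d\n", i, first, last);
    }
}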
Fig. 4. Parallel matrix multiplication: tree-hypercube view.

Fig. 4 shows a one-to-all broadcast on a tree-hypercube TH(2,3). Broadcast communication operations are used in many parallel algorithms, such as vector-matrix multiplication, vector-vector multiplication, matrix multiplication, Gaussian elimination, shortest paths, LU (lower-upper) factorization, maximal vectors, and prefix sums. Our one-to-all broadcast algorithm on the TH network is based on the E-cube routing algorithm, which can be found in [1]. Each node can send a message while receiving another message on the same or a different link; that is, bidirectional links are used. Also, a node can send or receive a message on only one of its links at a time. Moreover, a store-and-forward routing technique is used, in which each intermediate node must fully receive and store the entire message before sending it to the next node. The one-to-all broadcast presented in Fig. 4 sends a message from a source node to all other nodes in the network; a sketch of the per-level hypercube broadcast is given after Fig. 5 below. Applied to matrix multiplication on the tree-hypercube, the coordinator creates the partitions depending on the number of processors and the matrix size; once the data is received by a processor, it takes the part it is responsible for and resends the rest to the nodes that have a direct link with it. Within each level, the TH distributes data exactly as a hypercube does. After the data distribution is done, each processor performs its own computations and sends the result back to the parent from which it received the data, until all results are accumulated and returned to the coordinator.

A. Partition Procedure Analysis

The first partitioning of the original array takes n/p, assuming the partitioning is always balanced, i.e., the number of columns or rows is a multiple of the number of processors. In our tree-hypercube network only one processor, called the coordinator, initially holds the data. In the first step, part of the data is sent from the coordinator to its directly connected processors; then both processors send further parts of the data to the processors directly connected to them. In TH(2,2), each processor receives a row of matrix A and a column of matrix B, assuming the matrix size is 7 × 7.

Cannon's algorithm is a distributed algorithm for matrix multiplication on two-dimensional meshes. It is especially suitable for computers laid out in an N × N mesh and works well on homogeneous 2D grids; the main advantage of the algorithm is that its storage requirement remains constant, independent of the number of processors. As shown in Fig. 5, matrix multiplication using Cannon's algorithm with 49 processors aligns the blocks of A and B in such a way that each process multiplies its local sub-matrices. This is done by shifting all sub-matrices A_{i,j} to the left (with wraparound) by i steps and all sub-matrices B_{i,j} up (with wraparound) by j steps [5].

Fig. 5. Initial data distribution of matrices A (7 × 7), B (7 × 7), and C (7 × 7) on 49 processors.
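For reference, the per-level hypercube broadcast can be sketched with point-to-point MPI calls as below. This is our own minimal recursive-doubling sketch in the style of [5], assuming 2^d ranks in one level's communicator, source rank 0, and a receive buffer pre-sized identically on all ranks; it is not the authors' exact implementation, and in production code MPI_Bcast would normally be used.

#include <mpi.h>
#include <vector>

// One-to-all broadcast of buf on a 2^d-node hypercube, source = rank 0.
// In step i (i = d-1 .. 0), only ranks whose low i bits are zero take part:
// the partner with bit i clear sends, the partner with bit i set receives.
void hypercube_bcast(std::vector<double>& buf, int d, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);
    for (int i = d - 1; i >= 0; --i) {
        if ((rank & ((1 << i) - 1)) != 0) continue;  // idle in this step
        int partner = rank ^ (1 << i);
        if ((rank & (1 << i)) == 0)
            MPI_Send(buf.data(), (int)buf.size(), MPI_DOUBLE, partner, 0, comm);
        else
            MPI_Recv(buf.data(), (int)buf.size(), MPI_DOUBLE, partner, 0, comm,
                     MPI_STATUS_IGNORE);
    }
}

After d steps every rank holds the message; applied level by level, this reproduces the hypercube-style distribution within each TH level described above.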
B. Parallel Analysis

Parallel matrix multiplication is analyzed in terms of communication time, computation time, and overall complexity. The communication steps involve distributing the data splits among the processors and gathering the results. The number of processors affects both the data scatter and the communication time, and the initialization time for the tree-hypercube equals the number of steps needed to distribute the data among the processing units. We need 4 steps to scatter data among 7 processors and 6 steps among 15 processors, i.e., 2L steps for initialization, and the same number of steps to gather the results from all processors; the communication cost is thus on the order of 2L steps. The computation complexity is the time required to perform the matrix multiplication at each processor on its data partition of size n/p: each processor needs (n/p) × n^2 time, which makes the total computation complexity O(n^3/p). The execution time therefore depends on the number of processors and the data size; combining the multiplication time with the communication time gives a total of 2L + n^3/p, as illustrated by the sketch below.
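To make the model concrete, the following sketch evaluates the total time 2L + n^3/p with hypothetical per-step and per-operation cost constants (t_step and t_flop are ours, for illustration only, not values measured on IMAN1):

#include <cstdio>

// Illustrative cost model for the analysis above:
// T(n, p, L) = 2L * t_step          (store-and-forward communication)
//            + (n^3 / p) * t_flop   (local multiplication work)
double model_time(double n, double p, double L, double t_step, double t_flop) {
    return 2.0 * L * t_step + (n * n * n / p) * t_flop;
}

int main() {
    // TH(2,3): 15 processors, L = 3 levels; the constants are hypothetical.
    std::printf("T(1000, 15, 3) = %g s\n",
                model_time(1000, 15, 3, 1e-4, 1e-9));
}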
IV. EVALUATION RESULTS

Results are evaluated in terms of the speedup, running time, and parallel efficiency of matrix multiplication. The proposed parallel matrix multiplication on the tree-hypercube network is implemented with the Message Passing Interface (MPI) library, in which MPI processes are assigned to cores. The computation is parallel if each MPI process is assigned to its own core; if more than one MPI process is assigned to the same core, the computation is concurrent. The results were produced on the IMAN1 Zaina cluster, and the open-source Open MPI library was used in our implementation. The experiment was applied to the tree-hypercube topology with different matrix sizes and different numbers of processors, where the number of processors corresponds to the number of nodes in the tree-hypercube definition. The experiment was run multiple times and the average was taken as the result. The hardware and software specifications are listed in Table I.

TABLE I. HARDWARE AND SOFTWARE SPECIFICATIONS

Hardware: Dual Quad-Core Intel Xeon CPU with SMP, 16 GB RAM
Software: Scientific Linux 6.4 with Open MPI 1.5.4 and a C++ compiler
Matrix sizes: 1000×1000, 2000×2000, 4000×4000
Number of processors: 1, 3, 7, 15, 31

A. Run Time Evaluation

Fig. 6 depicts the runtime of sequential matrix multiplication for the different matrix sizes (1000×1000, 2000×2000, and 4000×4000); as the matrix size increases, the runtime increases proportionally.

Fig. 6. Runtime of sequential matrix multiplication for different matrix sizes.

Fig. 7 shows the runtime of parallel matrix multiplication for the same sizes. As illustrated in Fig. 7, when the number of processes increases, the time needed for the multiplication decreases. On the other hand, beyond a certain number of processes the runtime stops decreasing, because the subproblem becomes smaller than the number of processors warrants and the communication time grows, causing overhead. The general behaviour can be summarized in two scenarios:

Scenario 1: As the number of processors increases, the runtime decreases, due to parallelism and the distribution of tasks over several processors working together at the same time. (This case applies when moving from 3 processors to 31 processors.)

Scenario 2: As the number of processors increases, the runtime increases, due to communication overhead: increasing the number of processors beyond what a given data size warrants reduces the benefit of parallelism. (This case applies when moving from 15 to 31 processors for the 1000×1000 matrices, and may also occur beyond 31 processors for the 2000×2000 and 4000×4000 matrices.)

Fig. 7. Runtime of matrix multiplication versus the number of processors for different matrix sizes.

B. Speedup Evaluation

The speedup is the ratio of serial to parallel execution time, as depicted in Fig. 8 and defined in (1); the numbers of processors tested were 3, 7, 15, and 31. The speedup increases as the number of processors increases, but this does not always hold, because the limits of parallelism appear for the 1000×1000 matrices. A minimal timing harness is sketched after Fig. 8.

Speedup = sequential processing time / parallel processing time   (1)

Fig. 8. Speedup of the three matrix sizes for different numbers of processors.
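The runtimes behind Figs. 6 to 8 can be collected in the usual MPI way; the sketch below (our own minimal harness, not the authors' code; the sequential baseline is a placeholder) times the parallel region with MPI_Wtime and derives speedup and efficiency per (1) and (2):

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    MPI_Barrier(MPI_COMM_WORLD);  // align the start of the timed region
    double t0 = MPI_Wtime();
    // ... Stages 1-3 of Fig. 3: distribute, multiply, gather ...
    MPI_Barrier(MPI_COMM_WORLD);  // wait until all ranks finish
    double t_parallel = MPI_Wtime() - t0;

    if (rank == 0) {
        double t_serial = 1.0;  // placeholder: measured sequential runtime
        double speedup = t_serial / t_parallel;  // Eq. (1)
        double efficiency = speedup / p;         // Eq. (2)
        std::printf("p=%d T=%.3fs S=%.2f E=%.2f\n",
                    p, t_parallel, speedup, efficiency);
    }
    MPI_Finalize();
    return 0;
}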
C. Parallel Efficiency Evaluation

Efficiency is the ratio of speedup to the number of processors. In an ideal parallel system S = P and E = 1, but in practice S < P and E lies between 0 and 1. Equation (2) is used to calculate the efficiency. Fig. 9 displays the parallel efficiency of matrix multiplication for different numbers of processors and matrix sizes. For small numbers of processors (between 3 and 7), efficiency was 97% for multiplication of the 4000×4000 matrices. On the other hand, when the number of processors grew (between 15 and 31), efficiency started to decrease, reaching 8% for the 1000×1000 matrices on 31 processors, because the communication time between the processes increases. The communication time equals 2L, where L is the number of levels of the tree-hypercube, so it increases with the number of processors: with 3 processors the communication time is 2 steps, whereas with 31 processors it is 8 steps for the same data size. Also, as seen in (2), the number of processors is inversely related to efficiency.

Efficiency = speedup / number of processors   (2)

Fig. 9. Efficiency of the three matrix sizes for different numbers of processors.

V. CONCLUSIONS AND FUTURE WORK

In this paper, we presented matrix multiplication on the tree-hypercube and evaluated it in terms of running time, speedup, and efficiency. The algorithm was implemented using the MPI library, and the results were produced on the IMAN1 supercomputer. The evaluation was based on matrix size and number of processors. The results showed that the runtime generally decreases as the number of processors increases, except for the 1000×1000 matrices, whose runtime increased once the number of processors exceeded 15 because of communication overhead. The speedup for all matrix sizes increased with the number of processors, except in the cases where the communication overhead dominates. We achieved up to 97% efficiency for large matrices, such as the 4000×4000 matrix. In future work we aim to carry out a comparative study of matrix multiplication on different interconnection networks, apply the approach to other algorithms such as sorting, and compare the resulting performance with other studies.

REFERENCES
[1] P. Lee, "Parallel matrix multiplication algorithms on hypercube multiprocessors," International Journal of High Speed Computing, 1995.
[2] J. Zavala-Díaz, J. Pérez-Ortega, E. Salazar-Reséndiz, and L. Guadarrama-Rogel, "Matrix multiplication with a hypercube algorithm on a multi-core processor cluster," DYNA, vol. 82, no. 191.
[3] J. Choi, "A new parallel matrix multiplication algorithm on distributed-memory concurrent computers," in Proc. High Performance Computing on the Information Superhighway (HPC Asia '97).
[4] M. Al-Omari and H. Abu-Salem, "Tree-hypercubes: A multiprocessor interconnection topology," Abhath Al-Yarmouk, vol. 6, 1997.
[5] A. Grama, G. Karypis, and V. Kumar, Introduction to Parallel Computing, 2nd ed. Addison Wesley, 2003.
[6] S. Panigrahi and S. Chakraborty, "Statistical definition of an algorithm in PRAM model & analysis of 2×2 matrix multiplication in 2n processors using different networks," IEEE International Advance Computing Conference (IACC), 2014.
[7] A. Haron, J. Yu, R. Nane, M. Taouil, S. Hamdioui, and K. Bertels, "Parallel matrix multiplication on memristor-based computation-in-memory architecture," International Conference on High Performance Computing & Simulation (HPCS).
[8] M. Qatawneh, "Adaptive fault tolerant routing algorithm for tree-hypercube multicomputer," vol. 2, no. 2, 2006.
[9] M. Qatawneh, "Embedding linear array network into the tree-hypercube network," European Journal of Scientific Research, 10(2), 2005.
[10] Md. Nazrul Islam, Md. Shohidul Islam, M. A. Kashem, M. R. Islam, and M. S. Islam, "An empirical distributed matrix multiplication algorithm to reduce time complexity," in Proceedings of the International MultiConference of Engineers and Computer Scientists, vol. II, IMECS 2009, March 18-20, 2009, Hong Kong.
[11] M. Qatawneh, "Multilayer Hex-Cells: A new class of Hex-Cell interconnection networks for massively parallel systems," International Journal of Communications, Network and System Sciences, 4(11).
[12] M. Qatawneh, "Embedding binary tree and bus into Hex-Cell interconnection network," Journal of American Science, 7(12).
[13] M. Qatawneh, "New efficient algorithm for mapping linear array into Hex-Cell network," International Journal of Advanced Science and Technology, vol. 90.
[14] M. Saadeh, H. Saadeh, and M. Qatawneh, "Performance evaluation of parallel sorting algorithms on IMAN1 supercomputer," International Journal of Advanced Science and Technology, vol. 95, 2016.
[15] M. Qatawneh, A. Alamoush, and J. Alqatawna, "Section Based Hex-Cell Routing Algorithm (SBHCR)," International Journal of Computer Networks & Communications (IJCNC), 7(1).
[16] M. Qatawneh and H. Khattab, "New routing algorithm for Hex-Cell network," International Journal of Future Generation Communication and Networking, 8(2).
[17] M. Qatawneh, A. Sleit, and W. Almobaideen, "Parallel implementation of polygon clipping using transputer," American Journal of Applied Sciences, 6(2).
[18] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, 2nd ed. Addison Wesley, 2003.