A New Parallel Matrix Multiplication Algorithm on Tree-Hypercube Network using IMAN1 Supercomputer


Orieb AbuAlghanam, Mohammad Qatawneh (Computer Science Department, University of Jordan)
Hussein A. al Ofeishat (Computer Science Department, Al-Balqa Applied University, Jordan)
Omar Adwan, Ammar Huneiti (Computer Information Systems, University of Jordan)

Abstract: The tree-hypercube (TH) interconnection network is a relatively new interconnection network constructed from tree and hypercube topologies. TH was developed to support parallel algorithms for computation- and communication-intensive problems. In this paper, we propose a new parallel matrix multiplication algorithm on the TH network and present a broadcast communication operation for TH based on the store-and-forward technique, namely a one-to-all broadcast operation that transmits a message along the shortest path from the source node to all other nodes. The proposed algorithm is implemented and evaluated in terms of running time, speedup, and efficiency for different data sizes using the IMAN1 supercomputer. The experimental results show that the runtime of the proposed algorithm decreases and the speedup increases as the number of processors grows, up to the point where communication overhead dominates, for matrices of size 1000x1000, 2000x2000, and 4000x4000.

Keywords: MPI; supercomputer; tree-hypercube; matrix multiplication

I. INTRODUCTION

Parallel matrix multiplication is considered a backbone of several scientific applications, and many studies have approached it from different perspectives. In [3], [6], [7], [10], [14] the authors studied the limited memory-processor communication bandwidth and how to reduce the gap between memory and processor speed, and presented a communication-efficient mapping of a large-scale matrix multiplication algorithm. In [1] a parallel matrix multiplication algorithm on hypercube multiprocessors is presented, while in [2], [5], [6] the authors applied matrix multiplication with a hypercube algorithm on a multi-core processor cluster. In [3] the author proposed the Distribution-Independent Matrix Multiplication Algorithm (DIMMA), a new, fast, and scalable matrix multiplication algorithm for block-cyclic data distribution on distributed-memory concurrent computers. In this paper we apply matrix multiplication to the tree-hypercube, which has previously been used in adaptive fault-tolerant routing algorithms [8], [11], [12], [17].

Many matrix multiplication algorithms have been proposed for different network types, whether homogeneous or heterogeneous, in order to reduce running time and improve system performance. Matrix multiplication requires a large amount of computation time, especially when the matrix becomes huge; some problems would take years to solve on a personal computer. Interconnection networks are the core of a parallel processing system, since they link the system's processors. Because network topology plays a major role in the performance of a parallel system, several interconnection network topologies have been proposed for this purpose, such as the tree, hypercube, mesh, ring, and Hex-Cell (HC) [12], [13], [16]. Among the wide variety of interconnection network structures proposed for parallel computing systems, the tree-hypercube network has received much attention due to the attractive properties inherent in its topology [4], [8], [9], [15]. This paper aims to design and analyze an efficient matrix multiplication algorithm on the tree-hypercube network.
Experimentation with the proposed algorithm was conducted on the IMAN1 supercomputer, Jordan's first supercomputer, which is available for use by academia and industry in Jordan and the region. The rest of the paper is organized as follows. Section II summarizes the definition of the tree-hypercube network. Section III explains the proposed algorithm together with the sequential and parallel matrix multiplication it builds on. Section IV presents the evaluation results. Finally, Section V concludes the paper.

II. DEFINITION OF TREE-HYPERCUBE NETWORK

The tree-hypercube can be defined as follows. Each node is labelled with a level number L and a node number X, in the form (L, X). The network is essentially a tree that forms a hypercube at each level, because level i of the tree contains s^i nodes (processing elements) connected as a hypercube [4], [9]. For every level L between 0 and d, each node (L, X) in level L, where X = x_(d log s)-1 ... x_1 x_0, is adjacent to the following s children at level L+1: (L+1, X.a) for a = 0, ..., s-1, as shown in Fig. 1(a) and (b) [8].
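To make this labelling concrete, the following short C++ sketch lists the s children of a node (L, X) in TH(s, d). The struct and function names are our own, and the base-s address encoding X.a = X*s + a is an assumption consistent with the notation above, not code from the paper.

#include <cstdio>
#include <vector>

struct Node { int level; int addr; };               // a node (L, X) of TH(s, d)

// Children of (L, X) are (L+1, X.a) for a = 0..s-1, encoded here as X*s + a.
std::vector<Node> children(const Node& n, int s, int d) {
    std::vector<Node> kids;
    if (n.level >= d) return kids;                  // nodes at level d are leaves
    for (int a = 0; a < s; ++a)
        kids.push_back({n.level + 1, n.addr * s + a});
    return kids;
}

int main() {
    // Example: the root (0,0) of TH(2,3) has children (1,0) and (1,1).
    for (const Node& c : children({0, 0}, 2, 3))
        std::printf("(%d,%d)\n", c.level, c.addr);
    return 0;
}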

Fig. 1. Tree-hypercube: (a) TH(2,2) and (b) TH(2,3).

III. PROPOSED ALGORITHM

Matrix multiplication is a simple and widely used kernel in many scientific applications. It is well structured in the sense that the matrix elements can be distributed evenly among the nodes and the communication among nodes follows a regular pattern. Therefore, exploiting data parallelism based on message passing is well suited to the matrix multiplication problem. Fig. 2 shows a sequential algorithm for n x n matrix multiplication consisting of three nested loops; the complexity of the algorithm is O(n^3).

Matrix-Multiplication(A, B)
  n = A.rows
  let C be a new n x n matrix
  for i = 1 to n
    for j = 1 to n
      C[i,j] = 0
      for k = 1 to n
        C[i,j] = C[i,j] + A[i,k] * B[k,j]

Fig. 2. A sequential multiplication algorithm for two matrices [18].

In this section, we propose a new parallel matrix multiplication algorithm on the tree-hypercube network using the IMAN1 supercomputer. The data distribution for matrix multiplication can follow either striped partitioning or checkerboard partitioning; we use block-striped partitioning to study the performance of the message-passing execution model. In column-wise block striping of an n x n matrix on p processors (P0, P1, ..., Pp-1), processor Pi holds columns (n/p)*i, (n/p)*i + 1, ..., (n/p)*(i+1) - 1 [5], as illustrated in the sketch after Fig. 3. The proposed parallel matrix multiplication algorithm on the tree-hypercube network consists of three stages, as shown in Fig. 3: the first stage distributes the data among the nodes (processors), the second stage performs the multiplication at each processor, and the third stage collects the results.

Input: matrices A and B
Output: matrix C computed on the tree-hypercube by parallel matrix multiplication

Stage 1: Broadcast matrices A and B
1. The coordinator node (CN) generates a set of blocks of matrix A.
2. The CN generates a set of blocks of matrix B.
3. Wait for an acknowledgment message from each node that received its data.
4. Each node sends a message to the CN announcing completion of the transfer.
5. The CN stops the distribution process and announces the beginning of the next stage.

Stage 2: Multiply in parallel
6. Multiply the stripes of matrix A with the stripes of matrix B (for each block of data).
7. Each processor computes its entries Cij using the sequential matrix multiplication.

Stage 3: Collect the data
A. Global data combining
8. At each level of the tree-hypercube interconnection, send the partial product matrices in parallel toward the TH root nodes.
B. Combining data in the tree-hypercube root nodes
9. The CN assembles the matrices collected from the root nodes into matrix C.
10. Exit

Fig. 3. Pseudocode of the tree-hypercube algorithm for matrix multiplication.
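As a small illustration of the column-wise block striping and of the sequential kernel of Fig. 2, the C++ sketch below computes the column range owned by one processor and multiplies only that stripe. It is a sketch only: it assumes n is a multiple of p (the balanced case discussed later), row-major storage, and hypothetical function names.

#include <cstdio>
#include <vector>

// Columns owned by processor Pi under column-wise block striping:
// (n/p)*i .. (n/p)*(i+1) - 1.
void columnRange(int n, int p, int i, int& first, int& last) {
    first = (n / p) * i;
    last  = (n / p) * (i + 1) - 1;
}

// Sequential kernel of Fig. 2, restricted to one stripe of columns of C.
// A, B and C are n x n matrices stored row-major in flat vectors.
void multiplyStripe(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, int n, int firstCol, int lastCol) {
    for (int i = 0; i < n; ++i)
        for (int j = firstCol; j <= lastCol; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

int main() {
    int first, last;
    columnRange(8, 4, 2, first, last);                       // processor P2 when n = 8, p = 4
    std::printf("P2 owns columns %d..%d\n", first, last);     // prints 4..5
    return 0;
}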

Fig. 4. Parallel matrix multiplication, tree-hypercube view.

Fig. 4 shows a one-to-all broadcast on a tree-hypercube TH(2,3). Broadcast communication operations are used in many parallel algorithms, such as vector-matrix multiplication, vector-vector multiplication, matrix multiplication, Gaussian elimination, shortest paths, LU (lower-upper) factorization, maximal vectors, and prefix sums. Our one-to-all broadcast algorithm on the TH network is based on the E-cube routing algorithm, which is described in [1]. Each node can send a message while receiving another message on the same or a different link; that is, bidirectional links are used. However, a node can send or receive a message on only one of its links at a time. Moreover, a store-and-forward routing technique is used, in which each intermediate node must fully receive and store the entire message before forwarding it to the next node. The one-to-all broadcast shown in Fig. 4 can be defined as broadcasting a message from a source node to all other nodes in the network. Correspondingly, in matrix multiplication on the tree-hypercube, the coordinator creates the partitions depending on the number of processors and the matrix size; once the data is received by a processor, it keeps the part it is responsible for and resends the rest to the nodes that have a direct link with it (a minimal MPI sketch of this forwarding pattern is given after the partition analysis below). Within each level of the TH, the data is forwarded as in a hypercube. After the data distribution is done, each processor performs its own computations and sends the result back to the parent from which it received the data, until all results are accumulated and returned to the coordinator.

A. Partition Procedure Analysis

In the first step, the partitioning of the original array produces blocks of size n/p, assuming the partitioning is always balanced, i.e. the number of columns or rows equals the number of processors. In our tree-hypercube network only one processor, called the coordinator, initially holds the data. In the first step, part of the data is sent from the coordinator to the directly connected processors; then these processors start sending other parts of the data to the processors connected directly to them. In TH(2,2), each processor receives a row of matrix A and a column of matrix B, assuming the matrix size is 7x7. Cannon's algorithm is a distributed matrix multiplication algorithm for two-dimensional meshes. It is especially suitable for computers laid out in an N x N mesh and works well on homogeneous 2D grids; the main advantage of the algorithm is that its storage requirements remain constant and are independent of the number of processors. As shown in Fig. 5, matrix multiplication using Cannon's algorithm with 49 processors aligns the blocks of A and B in such a way that each process multiplies its local sub-matrices. This is done by shifting all sub-matrices Ai,j to the left (with wraparound) by i steps and all sub-matrices Bi,j up (with wraparound) by j steps [5].
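The following MPI fragment is a minimal sketch of the store-and-forward one-to-all broadcast described in this section, under two assumptions that the paper does not spell out: s = 2, and ranks are assigned level by level so that node (L, X) has rank (2^L - 1) + X. Each node receives the full message from its tree parent before forwarding it to its children; the rank mapping and variable names are illustrative only.

#include <mpi.h>
#include <cstdio>
#include <vector>

static void levelAndIndex(int rank, int& L, int& X) {
    L = 0;
    while (rank >= (1 << (L + 1)) - 1) ++L;        // find the level that contains this rank
    X = rank - ((1 << L) - 1);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<double> msg(4, 0.0);
    if (rank == 0) msg = {1.0, 2.0, 3.0, 4.0};     // coordinator's data

    int L, X;
    levelAndIndex(rank, L, X);

    if (rank != 0) {                               // receive the whole message from the parent first
        int parent = ((1 << (L - 1)) - 1) + X / 2;
        MPI_Recv(msg.data(), (int)msg.size(), MPI_DOUBLE, parent, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    for (int a = 0; a < 2; ++a) {                  // then forward it to each existing child
        int child = ((1 << (L + 1)) - 1) + 2 * X + a;
        if (child < size)
            MPI_Send(msg.data(), (int)msg.size(), MPI_DOUBLE, child, 0, MPI_COMM_WORLD);
    }
    std::printf("rank %d = node (%d,%d) received %.0f\n", rank, L, X, msg[0]);
    MPI_Finalize();
    return 0;
}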
B. Parallel Analysis

Parallel matrix multiplication is analyzed in terms of communication time, computation time, and complexity. The communication steps involve distributing the data blocks among the processors and gathering the results. The number of processors affects both the data scatter and the communication time; the initialization time for the tree-hypercube TH equals the number of steps needed to distribute the data among the processing units. We need 4 steps to distribute the data among 7 processors and 6 steps among 15 processors, that is, 2L steps for distributing the data and gathering the results, where L is the number of levels of the tree-hypercube. The total communication time therefore equals 2L. The computation term is the time required to perform matrix multiplication at each processor on a data block of size n/p: each processor needs (n/p) x n^2 time, so the total computation complexity is n^3/p. The execution time depends on the number of processors and the data size; as discussed above, it is the sum of the multiplication time and the communication time, i.e. 2L + n^3/p.

Fig. 5. Initial data distribution of matrices A (7x7), B (7x7), and C (7x7) on 49 processors.
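As a rough illustration of this cost model (a sketch only: one time unit per communication step and per multiply-add is assumed, and the function name is hypothetical), the predicted cost 2L + n^3/p can be evaluated for configurations like those used in the experiments:

#include <cstdio>

// Sketch: evaluate the cost model T = 2L + n^3/p from the analysis above,
// assuming one time unit per communication step and per multiply-add.
double predictedCost(double n, int p, int levels) {
    return 2.0 * levels + (n * n * n) / p;
}

int main() {
    // TH(2,2) has 7 processors and 2 levels; TH(2,3) has 15 processors and 3 levels.
    std::printf("n = 1000, p = 7:  %.0f units\n", predictedCost(1000, 7, 2));
    std::printf("n = 1000, p = 15: %.0f units\n", predictedCost(1000, 15, 3));
    return 0;
}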

IV. EVALUATION RESULTS

Results are evaluated in terms of speedup, running time, and the parallel efficiency of matrix multiplication. The proposed parallel matrix multiplication on the tree-hypercube network is implemented with the Message Passing Interface (MPI) library, in which MPI processes are assigned to cores. The computation is parallel if each MPI process is assigned to its own core, but if more than one MPI process is assigned to the same core, the computation is concurrent. These results were produced on the IMAN1 Zaina cluster, and the open-source Open MPI library was used in our implementation. The experiment was applied on the tree-hypercube topology with different matrix sizes and different numbers of processors, where the number of processors corresponds to the number of nodes in the tree-hypercube definition (see Table I). The experiment was run multiple times and the average was taken as the result. The hardware and software specifications are given in Table I.

TABLE I. HARDWARE AND SOFTWARE SPECIFICATION
Hardware specification: Dual Quad-Core Intel Xeon CPUs with SMP, 16 GB RAM
Software specification: Scientific Linux 6.4 with Open MPI 1.5.4 and a C++ compiler
Matrix size: 1000x1000, 2000x2000, 4000x4000
Number of processors: 1, 3, 7, 15, 31

A. Run Time Evaluation

Fig. 6 depicts the run time for sequential matrix multiplication at different matrix sizes; as the matrix size increases, the runtime increases proportionally.

Fig. 6. Runtime of sequential matrix multiplication for different matrix sizes.

Fig. 7 shows the run time for parallel matrix multiplication at different sizes; as illustrated there, the time needed for multiplication decreases as the number of processes increases. On the other hand, beyond a certain number of processes the run time stops decreasing, because the problem becomes small relative to the number of processors and the communication time grows into an overhead. The general behaviour can be summarized in two scenarios:

Scenario 1: When the number of processors increases, the run time decreases due to parallelism and the distribution of tasks over processors working together at the same time. (This case applies when moving from 3 processors up to 31 processors.)

Scenario 2: When the number of processors increases, the run time increases due to communication overhead, caused by using more processors than a given data size can benefit from, which erodes the gains from parallelism. (This case applies when moving from 15 to 31 processors for the 1000x1000 matrices, and may also occur beyond 31 processors for the 2000x2000 and 4000x4000 matrices.)

Fig. 7. Runtime of matrix multiplication versus the number of processors for different matrix sizes.

B. Speedup Evaluation

The speedup is the ratio of serial to parallel execution time, as computed by Eq. (1) and depicted in Fig. 8; the numbers of processors tested were 3, 7, 15, and 31. The speedup increases as the number of processors increases, but not always, because of the limits to parallelism that appear for the smallest matrix size (1000x1000).

Speedup = sequential processing time / parallel processing time    (1)

Fig. 8. Speedup of the three matrix sizes for different numbers of processors.
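Eq. (1) can be read off directly from measured times; the timings in the sketch below are placeholders for illustration only, not measurements from the paper.

#include <cstdio>

// Sketch of Eq. (1): Speedup = sequential time / parallel time.
int main() {
    const int procs[] = {3, 7, 15, 31};
    const double t_seq = 100.0;                     // hypothetical sequential time
    const double t_par[] = {35.0, 16.0, 8.5, 7.9};  // hypothetical parallel times
    for (int i = 0; i < 4; ++i)
        std::printf("p = %2d  speedup = %.2f\n", procs[i], t_seq / t_par[i]);
    return 0;
}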

C. Parallel Efficiency Evaluation

Efficiency is the ratio of speedup to the number of processors. In an ideal parallel system S = P and E = 1, but in practice S < P and E lies between 0 and 1. Equation (2) is used to calculate the efficiency.

Efficiency = speedup / number of processors    (2)

Fig. 9 displays the parallel efficiency of matrix multiplication for different numbers of processors and different matrix sizes. With a small number of processors (3 and 7), efficiency was 97% for the large matrix sizes. On the other hand, when the number of processors increased to between 15 and 31, efficiency started to decrease, reaching 8% with 31 processors on the 1000x1000 matrices, because the communication time between the processes grows. The total communication time equals 2L, where L is the number of levels of the tree-hypercube, so the communication time increases with the number of processors: with 3 processors the communication time is 2 steps, whereas with 31 processors it is 8 steps for the same data size. Also, as Eq. (2) shows, efficiency is inversely related to the number of processors.

Fig. 9. Efficiency of the three matrix sizes for different numbers of processors.
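A short sketch of Eq. (2) together with the 2L communication count discussed above. It assumes TH(2, L) with p = 2^(L+1) - 1 processors, which matches the processor counts 3, 7, 15, 31 used here; the speedup value in the example is hypothetical.

#include <cstdio>

double efficiency(double speedup, int p) { return speedup / p; }

int levels(int p) {                       // L such that p = 2^(L+1) - 1
    int L = 0;
    while ((1 << (L + 1)) - 1 < p) ++L;
    return L;
}

int main() {
    std::printf("p = 3:  communication steps = %d\n", 2 * levels(3));    // 2, as stated above
    std::printf("p = 31: communication steps = %d\n", 2 * levels(31));   // 8, as stated above
    std::printf("speedup 2.9 on 3 processors gives efficiency %.2f\n", efficiency(2.9, 3));
    return 0;
}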
V. CONCLUSIONS AND FUTURE WORK

In this paper, we presented matrix multiplication on the tree-hypercube network and evaluated it in terms of running time, speedup, and efficiency. The algorithm was implemented using the MPI library, and the results were produced on the IMAN1 supercomputer. The evaluation was based on matrix size and number of processors. The results showed that the runtime generally decreased as the number of processors increased, except for the 1000x1000 matrices, whose runtime increased once the number of processors exceeded 15 because of communication overhead. The speedup increased with the number of processors for all matrix sizes, except in the cases where the communication overhead increased. We achieved up to 97% efficiency for large matrices such as the 4000x4000 matrix. In future work we aim to carry out a comparative study of matrix multiplication on different interconnection networks, apply the same approach to other algorithms such as sorting, and compare the resulting performance with other studies.

REFERENCES
[1] P. Lee, "Parallel Matrix Multiplication Algorithms on Hypercube Multiprocessors," International Journal of High Speed Computing, 1995.
[2] J. Zavala-Díaz, J. Pérez-Ortega, E. Salazar-Reséndiz, and L. Guadarrama-Rogel, "Matrix multiplication with a hypercube algorithm on multi-core processor cluster," DYNA, vol. 82, no. 191.
[3] J. Choi, "A New Parallel Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers," High Performance Computing on the Information Superhighway, HPC Asia '97.
[4] M. Al-Omari and H. Abu-Salem, "Tree-hypercubes: A multiprocessor interconnection topology," Abhath Al-Yarmouk, 6, 1997.
[5] A. Grama, G. Karypis, and V. Kumar, Introduction to Parallel Computing, Second Edition, Addison Wesley, 2003.
[6] S. Panigrahi and S. Chakraborty, "Statistical Definition of an Algorithm in PRAM Model & Analysis of 2x2 Matrix Multiplication in 2n Processors Using Different Networks," IEEE International Advance Computing Conference (IACC), 2014.
[7] A. Haron, J. Yu, R. Nane, M. Taouil, S. Hamdioui, and K. Bertels, "Parallel Matrix Multiplication on Memristor-Based Computation-in-Memory Architecture," International Conference on High Performance Computing & Simulation (HPCS).
[8] M. Qatawneh, "Adaptive Fault Tolerant Routing Algorithm for Tree Hypercube Multicomputer," vol. 2, no. 2, 2006.
[9] M. Qatawneh, "Embedding Linear Array Network into the Tree-Hypercube Network," European Journal of Scientific Research, 10(2), 2005.
[10] Md. Nazrul Islam, Md. Shohidul Islam, M. A. Kashem, M. R. Islam, and M. S. Islam, "An Empirical Distributed Matrix Multiplication Algorithm to Reduce Time Complexity," Proceedings of the International MultiConference of Engineers and Computer Scientists, Vol. II, IMECS 2009, March 18-20, 2009, Hong Kong.
[11] M. Qatawneh, "Multilayer Hex-Cells: A New Class of Hex-Cell Interconnection Networks for Massively Parallel Systems," International Journal of Communications, Network and System Sciences, 4(11).
[12] M. Qatawneh, "Embedding Binary Tree and Bus into Hex-Cell Interconnection Network," Journal of American Science, 7(12).
[13] M. Qatawneh, "New Efficient Algorithm for Mapping Linear Array into Hex-Cell Network," International Journal of Advanced Science and Technology, 90.
[14] M. Saadeh, H. Saadeh, and M. Qatawneh, "Performance Evaluation of Parallel Sorting Algorithms on IMAN1 Supercomputer," International Journal of Advanced Science and Technology, Vol. 95, 2016.
[15] M. Qatawneh, A. Alamoush, and J. Alqatawna, "Section Based Hex-Cell Routing Algorithm (SBHCR)," International Journal of Computer Networks & Communications (IJCNC), 7(1).
[16] M. Qatawneh and H. Khattab, "New Routing Algorithm for Hex-Cell Network," International Journal of Future Generation Communication and Networking, 8(2).
[17] M. Qatawneh, A. Sleit, and W. Almobaideen, "Parallel implementation of polygon clipping using transputer," American Journal of Applied Sciences, 6(2).
[18] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, Second Edition, Addison Wesley, 2003.
