A Bandwidth Latency Tradeoff for Broadcast and Reduction

Peter Sanders and Jop F. Sibeyn
Max-Planck-Institut für Informatik, Im Stadtwald, 66123 Saarbrücken, Germany
{sanders, jopsi}

Abstract. A novel algorithm is presented for broadcasting in the single-port and in the duplex model on a completely connected interconnection network with start-up latencies. None of the existing broadcasting algorithms performs close to optimal for packets that are neither very large nor very small. The broadcasting schedule of the new algorithm is based on trees of fractional degree. It performs close to optimal for all packet sizes. For realistic values of P it is up to 50% faster than the best alternative.

1 Introduction

Broadcasting. Consider P processing units, PUs, of a parallel machine. Broadcasting, the operation in which one processor has to send a message M to all other PUs, is a crucial building block for many parallel algorithms. Since it can be implemented once and for all in communication libraries such as MPI [2], it makes sense to strive for optimal algorithms for all P and all message lengths k. Since broadcasting is sometimes a bottleneck operation, even constant factors should not be neglected. In addition, by reversing the direction of communication, broadcasting algorithms can usually be turned into reduction algorithms. Reduction is the task of computing a generalized sum ⊕_{i<P} M_i, where initially message M_i is stored on PU i and where ⊕ can be any associative operator. Broadcasting and reduction are among the most important communication primitives. For example, some of the best algorithms for matrix multiplication or dense matrix-vector multiplication have these two functions as their sole communication routines [1].

Assumptions and Communication Models. We study broadcasting long messages using the following assumptions, which reflect the properties of many current machines, at least for the transmission of long messages:
- The PUs are connected by a network.
- Communication is synchronous, i.e., both sender and receiver have to cooperate in transmitting a message.
- Transferring a message of length k between any pair of PUs takes time t + k. Here t is the start-up overhead and the time unit is defined as the time needed to transfer one data item.

In the last point we assume that the topology of the network does not play a role. However, our algorithms have enough locality to work well on networks with limited bisection width, and we will also informally discuss how the communication structures can be embedded into meshes. We are looking at two communication models. The first is the single-port model, in which a PU can only participate in one communication at a time: it can either send or receive a packet, but not both. The second is the somewhat more powerful duplex model, in which a PU can concurrently participate in sending one message and receiving another message (not necessarily from the same PU).

Lower Bounds. In the duplex model, all non-source PUs must receive the k data, and the whole broadcasting takes at least log P steps. Thus in the duplex model there is a lower bound of

T_lower,duplex = k + t·log P.

In the single-port model all non-source PUs must receive the k data. These (P − 1)·k data have been sent by at most P/2 PUs at a time. So, even with perfect balance the PUs are sending packets for (2 − 2/P)·k time. This gives

T_lower,one-port = (2 − 2/P)·k + t·log P.

For odd P there is always one idle PU; in that case the first term is 2·k. These lower bounds hold in full generality. For a large and natural class of algorithms, we can prove a stronger bound though. Consider algorithms that divide the total data set of size k into s packets of size k/s each. All PUs operate synchronously, and in every step they send or receive at most one packet. So, until step s, there is still at least one packet known only to the source. Thus, for given s, at least s + log P steps are required in the duplex model. Each step takes k/s + t time. For given k, t and P, the minimum is assumed for s = sqrt(k·log P/t):

T_lower,duplex = k + t·log P + 2·sqrt(k·t·log P).    (1)

We were not able to come up with a corresponding bound for the one-port model. The problem is that one cannot conclude that the routing requires 2·s + log P steps.

Previous Results. [Author's note: Peter may still write something nice here. In that case the next paragraph has to be reworked. Differentiation from the hypercube algorithm; possibly an example (P = 6) for the H-tree embedding.]

New Contributions. The earlier algorithms perform optimally in two extreme cases. If k is very large, the start-up effects become negligible, and one should concentrate on minimizing the amount of data to be transferred. In the other extreme case, if t is very large, the transmission time is negligible, and the sole objective is to minimize the number of communication steps. For intermediate cases the best of these approaches takes up to 50% more time than the lower bound, in the case of duplex connections even twice as much. Our refined broadcasting schedule comes much closer to the lower bounds for all values of k and t. For the duplex model we come to within lower-order terms of (1). The schedule is rather simple and might be implemented easily.
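The lower bounds just referred to are easy to evaluate numerically. The following Python sketch is our own illustration, not part of the paper, and all function names are ours; it computes the packetized duplex bound (1) and the general one-port bound for given k, t and P.

    import math

    def lower_bound_duplex(k, t, P):
        # Packetized duplex lower bound (1): k + t*log2(P) + 2*sqrt(k*t*log2(P)).
        log_p = math.log2(P)
        return k + t * log_p + 2.0 * math.sqrt(k * t * log_p)

    def lower_bound_one_port(k, t, P):
        # General one-port bound: (2 - 2/P)*k + t*log2(P); 2*k for odd P,
        # because then one PU is always idle.
        log_p = math.log2(P)
        bandwidth_term = 2.0 * k if P % 2 == 1 else (2.0 - 2.0 / P) * k
        return bandwidth_term + t * log_p

    # Example values: P = 1024 PUs, a message of k = 2**20 units, start-up t = 256.
    print(lower_bound_duplex(2**20, 256.0, 1024))
    print(lower_bound_one_port(2**20, 256.0, 1024))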

In general a duplex algorithm cannot be turned into a one-port algorithm by simply replacing every duplex step by two one-port steps: this would amount to constructing a coloring with two colors of a graph of degree two. For graphs containing odd cycles, such as the communication graph of our duplex algorithm, this is impossible. However, a modification of the schedule leads to a good one-port algorithm. It requires twice as long as the duplex algorithm.

Contents. After describing the known algorithms, we present the fractional-tree broadcasting: first for the simpler duplex model, then for the one-port model. At the end of the paper we consider embedding the fractional trees on sparse networks.

2 Simple Algorithms

In order to be able to compare the performance of our new algorithm, we first present two well-known algorithms.

2.1 Linear Pipeline Broadcasting

For messages with k ≫ t·P, a simple pipelined algorithm performs very well: the PUs are organized into a directed Hamiltonian path starting at the source PU, and the message is subdivided into s packets of size k/s. The packets are handed through the path in a pipelined fashion. An internal PU simply alternates between receiving a packet from its predecessor and forwarding it to its successor. An example is given in Figure 1.

Fig. 1. The simple pipelined algorithm for P = 4 and s = 5.

It is easy to see that the broadcasting time is

T_pipe,one-port = (d + 2·(s − 1))·(t + k/s).    (2)

Here d = P − 1 is the time step in which the first packet arrives at the last PU. (2) is minimized for s = sqrt(k·d/(2·t)). Substitution gives

T_pipe,one-port = (sqrt(d·t) + sqrt(2·k))² = (2 + o(1))·k + O(t·P).

This means that for k ≫ t·P broadcasting takes little more than the (near) optimal 2·k time. For the duplex model, the constant two can be dropped in the analysis. For long messages, the algorithm is then almost twice as fast.

2.2 Binary Tree Broadcasting

For shorter messages, the high latency of O(t·P) until the last PU begins to receive data is prohibitive for large P. This latency can be reduced to O(t·log P) by using a binary tree instead of a Hamiltonian path. Each internal PU performs the following three steps in a loop: receive a packet from the father; send the packet to the left child; send the packet to the right child. This approach is illustrated in Figure 2.

Fig. 2. The tree structures used for broadcasting, for trees of various depths d. The tree for depth d is constructed out of the trees for depths d − 1 and d − 2. The numbers indicate the routing steps in which a packet coming from the root is routed over the connections. Every three steps the root injects one packet.

For single-ported communication, each pass through the loop takes 3·(t + k/s). Similar to the above calculation, we get

T_bin,one-port = (d + 3·(s − 1))·(t + k/s).    (3)

Here d = O(log P) is the time step in which the first packet arrives at the last PU. (3) is minimized for s = sqrt(k·d/(3·t)). Substitution gives

T_bin,one-port = (sqrt(d·t) + sqrt(3·k))² = (3 + o(1))·k + O(t·log P).
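As a sanity check on (2) and (3), the following Python sketch (ours, not from the paper) evaluates both one-port times together with the near-optimal packet counts derived above. The concrete latency d = 3·⌈log₂ P⌉ assumed for the binary tree is our own choice; the text only states d = O(log P).

    import math

    def t_pipe_one_port(k, t, P, s):
        # Equation (2): (d + 2*(s-1)) * (t + k/s); d = P - 1 is the step in
        # which the first packet reaches the last PU of the Hamiltonian path.
        d = P - 1
        return (d + 2 * (s - 1)) * (t + k / s)

    def t_bin_one_port(k, t, P, s):
        # Equation (3): (d + 3*(s-1)) * (t + k/s); the text gives d = O(log P),
        # here we assume d = 3*ceil(log2(P)), i.e. three steps per tree level.
        d = 3 * math.ceil(math.log2(P))
        return (d + 3 * (s - 1)) * (t + k / s)

    def best_s(k, t, d, c):
        # s = sqrt(k*d/(c*t)) approximately minimizes (d + c*(s-1))*(t + k/s).
        return max(1, round(math.sqrt(k * d / (c * t))))

    k, t, P = 2**20, 256.0, 1024
    s_pipe = best_s(k, t, P - 1, 2)
    s_bin = best_s(k, t, 3 * math.ceil(math.log2(P)), 3)
    print(t_pipe_one_port(k, t, P, s_pipe), t_bin_one_port(k, t, P, s_bin))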

This is a good algorithm even for rather small k, but for large k it takes up to 50% more time than the linear pipeline. In the duplex model, the PUs are performing a two-step loop: receive a packet from the father and send the previous packet to the right child; send the latest packet to the left child. Thus each pass takes 2·(t + k/s) time. For large k the linear pipeline is up to twice as fast.

3 Fractional Tree Broadcasting

The question arises whether better tradeoffs between bandwidth and latency can be achieved for medium sized messages than either a pipeline with linear latency or a binary tree with logarithmic latency but slower message transmission. Indeed, we now design a family of broadcast algorithms which achieve execution times (2 + 4/r + o(1))·k + O(r·t·log P) in the single-ported model and (1 + 1/r + o(1))·k + O(r·t·log P) in the duplex model, for any integer r. The linear pipeline and the binary tree can be viewed as special cases of this scheme for r → ∞ and r = 1, respectively.

3.1 Duplex Model

The idea is to replace each node of a binary tree by a chain of r PUs. The input is fed into the head of this chain. The data is passed down the chain and on to a successor chain as in the linear pipeline algorithm. In addition, the PUs of the chain cooperate in feeding the data to the head of a second successor chain. Figure 3 shows the communication structure and an example. The resulting communication structure can be viewed as a tree with degree 1 + 1/r. We call it a fractional tree.

Fig. 3. Left: the communication structure in a chain of r PUs which cooperate to supply two successor nodes with all the data they have to receive. Right: a complete binary tree built out of such chains.

Schedule Details. Now we describe the routing schedule in more detail. The total data set of size k is divided into packets, which are grouped into runs of r packets each. The packets in each run are indexed from 1 through r. The same indices are used for the PUs in each chain. The routing is composed of phases. Each phase takes r + 1 steps. The steps are indexed from 0 through r. Initially all packets are stored in PU 1 of the topmost chain. Consider a chain at depth l. In phase t ≥ l this chain is involved in forwarding the packets of run t − l, which enter at PU 1, through the chain. Also it is involved in further forwarding the packets of run t − l − 1 through the chain and into both successor chains. Now we consider step j of phase t more precisely. Any PU i with i > j sends packet r + j − i of run t − l − 1 to the PU below it (for i = r, this means to the head of the first successor chain). For j > 0, PU i = j sends packet i of run t − l − 1 to the head of the second successor chain. Any PU i with i < j sends packet j − i of run t − l to the PU below it. The routing is illustrated in Figure 4.

Performance Analysis. Having established a smooth timing of the algorithm, the analysis can proceed analogously to that of the simple algorithms from Section 2.

A tree of depth d consists of 2^(d+1) − 1 chains, each with r PUs. Thus, for given r and P, the tree has a depth of d = ⌊log(P/r + 1)⌋ − 1. If the data are subdivided into s packets, then every phase takes (r + 1)·(t + k/s) time. There are s/r runs of packets. Thus, the whole routing consists of s/r + d + 1 phases, which can be performed in time

T_fract,duplex = (s·(r + 1)/r + d′)·(t + k/s).    (4)

Here d′ = (d + 1)·(r + 1) is the time step in which the first packet arrives at the last PU. (4) is minimized for s = sqrt((k·d′/t)·(r/(r + 1))). Substitution gives

T_fract,duplex = (1 + 1/r)·k + d′·t + 2·sqrt((1 + 1/r)·k·d′·t) = (1 + o(1))·(k + k/r + d·r·t).

There are two additional terms: one is due to the slower progress of the packets, the other to the start-up latencies. If t is very large, the best strategy is to choose r = 1. If k is very large, one should take r = P. For moderate r the dependency of d on r is not so important. In that case, the optimum is assumed for r = sqrt(k/(d·t)). Substitution gives

T_fract,duplex = (1 + o(1))·(k + 2·sqrt(k·log P·t)).    (5)

For many problem instances, the new approach is considerably faster than the best previous one. Actually, our result matches the lower bound of (1) for this type of algorithms.
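The duplex analysis can be condensed into a few lines of Python. The sketch below is ours; the way r and s are rounded is an arbitrary choice, and the function names are not from the paper.

    import math

    def tree_depth(P, r):
        # Depth in chains: a tree of depth d has 2**(d+1) - 1 chains of r PUs,
        # hence d = floor(log2(P/r + 1)) - 1.
        return int(math.floor(math.log2(P / r + 1))) - 1

    def t_fract_duplex(k, t, P, r, s):
        # Equation (4): (s*(r+1)/r + d1) * (t + k/s), where d1 = (d+1)*(r+1)
        # is the step in which the first packet reaches the last PU.
        d = tree_depth(P, r)
        d1 = (d + 1) * (r + 1)
        return (s * (r + 1) / r + d1) * (t + k / s)

    def choose_parameters(k, t, P):
        # Moderate-r choice from the text: r ~ sqrt(k/(d*t)) with d ~ log2(P),
        # then s ~ sqrt(k*d1*r/((r+1)*t)), rounded to a multiple of r.
        r = max(1, round(math.sqrt(k / (math.log2(P) * t))))
        d1 = (tree_depth(P, r) + 1) * (r + 1)
        s = max(r, r * round(math.sqrt(k * d1 * r / ((r + 1) * t)) / r))
        return r, s

    k, P = 2**20, 1024
    t = k / 4096
    r, s = choose_parameters(k, t, P)
    print(r, s, t_fract_duplex(k, t, P, r, s) / k)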

Fig. 4. The fractional-tree algorithm for the duplex model, here with r = 3. Depicted is the movement of packets through a chain at depth l and into the successor chains at depth l + 1. The five pictures show the development as a result of the r + 1 = 4 routing steps performed during one phase: the packets of the older run are forwarded down the chain and into both successor chains, while the packets of the newer run enter the chain. The PUs that in the next step send to the head of the second successor chain are indicated with &. All other PUs send to the PUs below them.

Performance Example. For large P almost a factor 2 is gained for t·P = k. In that case both previous algorithms require more than 2·k. According to (5), the fractional-tree algorithm requires only (1 + 2·sqrt(log P/P))·k. For large P this is hardly more than k. For concrete moderate k the situation is slightly less good than the above result suggests. As an example we consider P = 1024 and t = k/4096. We can take r = 8, d = 6 and s = 512. Then there are 64 runs of length 8, and the routing consists of 71 phases of 9 steps each. Thus, there are 639 steps that take 639·(k/512 + k/4096) = 1.40·k all together. How about the other algorithms? The pipelined algorithm in the duplex model takes T_pipe,duplex = (P + s)·(t + k/s) time, which is minimized for s = sqrt(P·k/t). So in our case, we should take s = 2048 and achieve a time consumption of 2.25·k. The binary tree algorithm in the duplex model takes T_bin,duplex = 2·(log P + s)·(t + k/s), which is minimized for s = sqrt(log P·k/t). In our case, we should take s = 202 and achieve a time consumption of 2.20·k.
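The example numbers can be reproduced by plugging the parameters into the three duplex formulas; the short Python sketch below (ours) does exactly that.

    import math

    # Example parameters from the text: P = 1024 PUs, start-up t = k/4096.
    k, P = 4096.0, 1024        # time is measured in units of k/4096, so t = 1
    t = k / 4096
    log_p = math.log2(P)

    # Fractional tree, equation (4): r = 8, d = 6, s = 512  ->  639 steps.
    r, d, s = 8, 6, 512
    t_fract = (s * (r + 1) / r + (d + 1) * (r + 1)) * (t + k / s)

    # Linear pipeline, duplex: (P + s) * (t + k/s) with s = sqrt(P*k/t).
    s_pipe = round(math.sqrt(P * k / t))
    t_pipe = (P + s_pipe) * (t + k / s_pipe)

    # Binary tree, duplex: 2 * (log2(P) + s) * (t + k/s), s = sqrt(log2(P)*k/t).
    s_bin = round(math.sqrt(log_p * k / t))
    t_bin = 2 * (log_p + s_bin) * (t + k / s_bin)

    print(t_fract / k, t_pipe / k, t_bin / k)   # roughly 1.40, 2.25, 2.20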

3.2 One-Port Model

As was pointed out in the introduction, for one-port machines scheduling the send and receive operations so that all communicating pairs of PUs agree cannot be achieved by simply doubling every step of a duplex algorithm. Therefore we do not simply copy the above algorithm, but construct a new schedule based on the same idea of trees of fractional degree.

Schedule Description. Again we work with chains of length r, but now we choose r even. Another difference with the duplex schedule is that now only the PUs with odd index within their chain are in charge of transferring packets to the head of the second successor chain. In this way, the synchronization is trivial: PUs with even indices receive in even steps and send in odd steps; PUs with odd indices send in even steps and receive in odd steps (though in some steps a PU is idle). An example is given in Figure 5.

Fig. 5. The fractional-tree algorithm for the one-port model, here with r = 4. Depicted is the movement of packets through a chain at depth l and into the successor chains at depth l + 1. The seven pictures show the development as a result of the r + 2 = 6 routing steps performed during one phase: the packets of the older run are forwarded down the chain and into both successor chains, while the packets of the newer run enter the chain. The PUs that in the next step send to the head of the second successor chain are indicated with &.

For chains of length r every phase takes r + 2 steps. PUs with even indices are idle for two steps: one sending and one receiving step. PUs with odd indices are idle during one receiving step.

In the duplex schedule PUs were sending all the time and receiving r packets during r + 1 steps. Thus here we pay some penalty: the average usage of the connections is slightly lower. During each phase a run of length r/2 progresses r steps deeper into the tree. As before, the depth d of the tree measured in chains equals d = ⌊log(P/r + 1)⌋ − 1.

Performance Analysis. We copy the analysis from Section 3.1 and modify it where necessary. If the data are subdivided into s packets, then every phase takes (r + 2)·(t + k/s) time. There are 2·s/r runs of packets. Thus, the whole routing consists of 2·s/r + d + 1 phases, which can be performed in time

T_fract,one-port = (2·s·(r + 2)/r + d′)·(t + k/s).    (6)

Here d′ = (d + 1)·(r + 2) is the time step in which the first packet arrives at the last PU. (6) is minimized for s = sqrt((k·d′/t)·(r/(2·(r + 2)))). Substitution gives

T_fract,one-port = (2 + 4/r)·k + d′·t + 2·sqrt(2·(1 + 2/r)·k·d′·t) = (1 + o(1))·(2·k + 4·k/r + r·d·t).

A good choice is r = 2·sqrt(k/(d·t)). Substitution gives

T_fract,one-port = (1 + o(1))·(2·k + 4·sqrt(k·log P·t)).    (7)

Except for lower-order terms, this is exactly twice the result for the duplex algorithm.

Performance Example. For large P almost a factor 3/2 is gained for t·P = k. In that case both previous algorithms require more than 3·k. According to (7), the fractional-tree algorithm requires only (2 + 4·sqrt(log P/P))·k. For large P this is hardly more than 2·k. As a concrete example we consider P = 1024 and t = k/4096. We can take r = 16, d = 5 and s = 408. Then there are 51 runs of length 8, and the routing consists of 57 phases of 18 steps each. Thus, there are 1026 steps that take 1026·(k/408 + k/4096) = 2.77·k all together. How about the other algorithms? The pipelined algorithm takes T_pipe,one-port = (P + 2·(s − 1))·(t + k/s) time, which is minimized for s = sqrt(P·k/(2·t)). So in our case, we should take s = 1448 and achieve a time consumption of 3.66·k. The binary tree algorithm takes T_bin,one-port = 3·(log P + s)·(t + k/s), which is minimized for s = sqrt(log P·k/t). In our case, we should take s = 202 and achieve a time consumption of 3.30·k.

4 Sparse Interconnection Networks

5 Conclusion

We have shown that the fractional-tree schedule leads to faster broadcasting in the one-port and in the duplex model. It is up to twice as fast as earlier simple approaches. In comparison to the optimal hypercube schedule it has the advantage of being simpler, and it requires much smaller bandwidth when embedded on networks such as meshes and hierarchical crossbars.
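The comparison behind this conclusion can also be checked numerically for the one-port model. The sketch below (ours) evaluates equation (6) and the two simple one-port schedules with the example parameters of Section 3.2.

    import math

    # One-port example parameters from Section 3.2: P = 1024, t = k/4096.
    k, P = 4096.0, 1024
    t = k / 4096
    log_p = math.log2(P)

    # Fractional tree, equation (6): r = 16, d = 5, s = 408  ->  1026 steps.
    r, d, s = 16, 5, 408
    t_fract = (2 * s * (r + 2) / r + (d + 1) * (r + 2)) * (t + k / s)

    # Linear pipeline: (P + 2*(s-1)) * (t + k/s) with s = sqrt(P*k/(2*t)).
    s_pipe = round(math.sqrt(P * k / (2 * t)))
    t_pipe = (P + 2 * (s_pipe - 1)) * (t + k / s_pipe)

    # Binary tree: 3 * (log2(P) + s) * (t + k/s) with s = sqrt(log2(P)*k/t).
    s_bin = round(math.sqrt(log_p * k / t))
    t_bin = 3 * (log_p + s_bin) * (t + k / s_bin)

    print(t_fract / k, t_pipe / k, t_bin / k)   # roughly 2.77, 3.66, 3.30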

References

1. Kumar, V., A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings, 1994.
2. Snir, M., S.W. Otto, S. Huss-Lederman, D.W. Walker, J. Dongarra, MPI: The Complete Reference, MIT Press, 1996.
