A Bandwidth Latency Tradeoff for Broadcast and Reduction

Peter Sanders and Jop F. Sibeyn
Max-Planck-Institut für Informatik, Im Stadtwald, 66123 Saarbrücken, Germany
sanders, jopsi@mpi-sb.mpg.de, http://www.mpi-sb.mpg.de/sanders, jopsi

Abstract. A novel algorithm is presented for broadcasting in the single-port and duplex model on a completely connected interconnection network with start-up latencies. None of the existing broadcasting algorithms performs close to optimal for packets that are neither very large nor very small. The broadcasting schedule of the new algorithm is based on trees of fractional degree. It performs close to optimal for all packet sizes. For realistic values of P it is up to 50% faster than the best alternative.

1 Introduction

Broadcasting. Consider P processing units (PUs) of a parallel machine. Broadcasting, the operation in which one processor has to send a message M to all other PUs, is a crucial building block for many parallel algorithms. Since it can be implemented once and for all in communication libraries such as MPI [2], it makes sense to strive for optimal algorithms for all P and all message lengths k. Since broadcasting is sometimes a bottleneck operation, even constant factors should not be neglected. In addition, by reversing the direction of communication, broadcasting algorithms can usually be turned into reduction algorithms. Reduction is the task of computing a generalized sum ⊕_{i<P} M_i, where initially message M_i is stored on PU i and where ⊕ can be any associative operator. Broadcasting and reduction are among the most important communication primitives. For example, some of the best algorithms for matrix multiplication or dense matrix-vector multiplication have these two functions as their sole communication routines [1].

Assumptions and Communication Models. We study broadcasting long messages using the following assumptions, which reflect the properties of many current machines, at least for the transmission of long messages: The PUs are connected by a network. Communication is synchronous, i.e., both sender and receiver have to cooperate in transmitting a message. Transferring a message of length k between any pair of PUs takes time t + k. Here t is the start-up overhead, and the time unit is defined as the time needed to transfer one data item.
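To make this cost model concrete, here is a minimal Python sketch (ours, not part of the paper) of the assumption just stated: sending one message of length m costs t + m, so splitting k data into s packets costs s·t + k. This is the bandwidth-latency tradeoff that the schedules below optimize.

    # Cost model sketch (our illustration). Sending a message of length m
    # between any pair of PUs costs t + m: start-up overhead t plus one
    # time unit per data item.
    def send_time(m, t):
        return t + m

    def split_send_time(k, s, t):
        # k data sent as s packets of size k/s: s*(t + k/s) = s*t + k.
        return s * send_time(k / s, t)

    if __name__ == "__main__":
        k, t = 4096.0, 1.0
        print(send_time(k, t))           # 4097.0 for one monolithic message
        print(split_send_time(k, 8, t))  # 4104.0: more start-ups, but small
                                         # packets enable pipelining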

In the last point we assume that the topology of the network does not play a role. However, our algorithms have enough locality to work well on networks with limited bisection width, and we will also informally discuss how the communication structures can be embedded into meshes.

We are looking at two communication models. The first is the single-port model, in which a PU can only participate in one communication at a time: it can either send or receive a packet, but not both. The second is the somewhat more powerful duplex model, in which a PU can concurrently participate in sending one message and receiving another message (not necessarily from the same PU).

Lower Bounds. In the duplex model, all non-source PUs must receive the k data, and the whole broadcasting takes at least log P steps. Thus in the duplex model there is a lower bound of

T_lower,duplex = k + t·log P.

In the single-port model all non-source PUs must receive the k data. These (P − 1)·k data have been sent by at most ⌊P/2⌋ PUs at a time. So, even with perfect balance, the PUs are sending packets for (2 − 2/P)·k time. This gives

T_lower,one-port = (2 − 2/P)·k + t·log P.

For odd P there is always one idle PU; in that case the first term is 2k. These lower bounds hold in full generality. For a large and natural class of algorithms, we can prove a stronger bound though. Consider algorithms that divide the total data set of size k into s packets of size k/s each. All PUs operate synchronously, and in every step they send or receive at most one packet. So, until step s, there is still at least one packet known only to the source (the source injects at most one packet per step). Thus, for given s, at least s + log P steps are required in the duplex model. Each step takes k/s + t time. For given k, t and P, the minimum is assumed for s = √(k·log P / t), giving

T_lower,duplex = k + t·log P + 2√(k·t·log P).   (1)

We were not able to come up with a corresponding bound for the one-port model. The problem is that one cannot conclude that the routing requires 2(s + log P) steps.

Previous Results. [Draft note, in German in the original: "Peter may still write something nice here. In that case the next paragraph has to be reworked. Delineation from the hypercube algorithm; possibly an example (P = 6) for the H-tree embedding."]

New Contributions. The earlier algorithms perform optimally in two extreme cases. If k is very large, the start-up effects become negligible, and one should concentrate on minimizing the amount of data to be transferred. In the other extreme case, if t is very large, the transmission time is negligible, and the sole objective is to minimize the number of communication steps. For intermediate cases the best of these approaches takes up to 50% more time than the lower bound; in the case of duplex connections, even twice as much. Our refined broadcasting schedule comes much closer to the lower bounds for all values of k and t. For the duplex model we come to within lower-order terms of (1). The schedule is rather simple and might be implemented easily.
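Equation (1) is easy to check numerically. The following sketch (our code; the function name is ours) minimizes (s + log P)(k/s + t) over the number of packets s and compares the result with the closed form:

    import math

    # Check of the refined duplex lower bound (1): with s packets at least
    # s + log P steps are needed, each of duration k/s + t.
    def lower_bound_duplex(k, t, P):
        logP = math.log2(P)
        best = min((s + logP) * (k / s + t) for s in range(1, 100001))
        closed = k + t * logP + 2 * math.sqrt(k * t * logP)
        return best, closed

    if __name__ == "__main__":
        best, closed = lower_bound_duplex(k=4096.0, t=1.0, P=1024)
        print(round(best, 1), round(closed, 1))  # both about 4510.8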

In general, a duplex algorithm cannot be turned into a one-port algorithm by simply replacing every duplex step by two one-port steps: this would amount to constructing a two-coloring of a graph of degree two. For graphs containing odd cycles, such as the communication graph of our duplex algorithm, this is impossible. However, a modification of the schedule leads to a good one-port algorithm. It requires twice as long as the duplex algorithm.

Contents. After describing the known algorithms, we present fractional-tree broadcasting: first for the simpler duplex model, then for the one-port model. At the end of the paper we consider embedding the fractional trees on sparse networks.

2 Simple Algorithms

In order to be able to compare the performance of our new algorithm, we first present two well-known algorithms.

2.1 Linear Pipeline Broadcasting

For messages with k ≫ t·P, a simple pipelined algorithm performs very well: the PUs are organized into a directed Hamiltonian path starting at the source PU, and the message is subdivided into s packets of size k/s. The packets are handed through the path in a pipelined fashion. An internal PU simply alternates between receiving a packet from its predecessor and forwarding it to its successor. An example is given in Figure 1.

Fig. 1. The simple pipelined algorithm for P = 4 and s = 5.

It is easy to see that the broadcasting time is

T_pipe,one-port = (d + 2(s − 1)) · (t + k/s).   (2)

Here d = P − 1 is the time step in which the first packet arrives at the last PU. (2) is minimized for s = √(k·d/(2t)). Substitution gives

T_pipe,one-port = (√(d·t) + √(2k))² = (2 + o(1))·k + O(t·P).

This means that for k ≫ t·P broadcasting takes little more than the (near) optimal 2k time. For the duplex model, the constant two can be dropped in the analysis. For long messages, the algorithm is then almost twice as fast.

2.2 Binary Tree Broadcasting

For shorter messages, the high latency of O(t·P) until the last PU begins to receive data is prohibitive for large P. This latency can be reduced to O(t·log P) by using a binary tree instead of a Hamiltonian path. Each internal PU performs the following three steps in a loop: receive a packet from the father; send the packet to the left child; send the packet to the right child. This approach is illustrated in Figure 2.

Fig. 2. The tree structures used for the broadcasting, for trees of various depths d. The tree for depth d is constructed out of the trees for depths d − 1 and d − 2. The numbers indicate the routing steps in which a packet coming from the root is routed over the connections. Every three steps the root injects one packet.

For single-ported communication, each pass through the loop takes 3(t + k/s). Similar to the above calculation, we get

T_bin,one-port = (d + 3(s − 1)) · (t + k/s).   (3)

Here d = O(log P) is the time step in which the first packet arrives at the last PU. (3) is minimized for s = √(k·d/(3t)). Substitution gives

T_bin,one-port = (√(d·t) + √(3k))² = (3 + o(1))·k + O(t·log P).

This is a good algorithm even for rather small k, but for large k it takes up to 50% more time than the linear pipeline. In the duplex model, the PUs perform a two-step loop: receive a packet from the father and send the previous packet to the right child; send the latest packet to the left child. Thus each pass takes 2(t + k/s) time. For large k, the linear pipeline is up to twice as fast.

3 Fractional Tree Broadcasting

The question arises whether better tradeoffs between bandwidth and latency can be achieved for medium-sized messages than either a pipeline with linear latency or a binary tree with logarithmic latency but slower message transmission. Indeed, we now design a family of broadcast algorithms which achieve execution times

(2 + 4/r + o(1))·k + O(r·t·log P) in the single-ported model and
(1 + 1/r + o(1))·k + O(r·t·log P) in the duplex model

for any integer r ≥ 1. The linear pipeline and the binary tree can be viewed as special cases of this scheme for r = P and r = 1, respectively.

3.1 Duplex Model

The idea is to replace each node of a binary tree by a chain of r PUs. The input is fed into the head of this chain. The data is passed down the chain and on to a successor chain as in the linear pipeline algorithm. In addition, the PUs of the chain cooperate in feeding the data to the head of a second successor chain. Figure 3 shows the communication structure and an example. The resulting communication structure can be viewed as a tree of degree 1 + 1/r. We call it a fractional tree.

Schedule Details. Now we describe the routing schedule in more detail. The total data set is divided into runs consisting of r packets each. The packets in each run are indexed from 1 through r. The same indices are used for the PUs in each chain. The routing is composed of phases. Each phase takes r + 1 steps. The steps are indexed from 0 through r. Initially all packets are stored in PU 1 of the chain at depth 0. Consider a chain at depth l. In phase t ≥ l, this chain is involved in forwarding the packets of run t − l that enter at PU 1 through the chain. Also it is involved in further forwarding the packets of run t − l − 1 through the chain and into both successor chains. Now we consider step j of phase t more precisely. Any PU i with i > j sends packet r + j − i of run t − l − 1 to the PU below it (for i = r, this means to the head of the successor chain). For j > 0, PU i = j sends packet i of run t − l − 1 to the head of the second successor chain. Any PU i with i < j sends packet j − i of run t − l to the PU below it. The routing is illustrated in Figure 4.

Performance Analysis. Having established a smooth timing of the algorithm, the analysis can proceed analogously to that of the simple algorithms from Section 2.

Fig. 3. Left: the communication structure in a chain of r PUs which cooperate to supply two successor nodes with all the data they have to receive. Right: a complete binary tree built out of such chains.

A tree of depth d consists of 2^{d+1} − 1 chains, each with r PUs. Thus, for given r and P, the tree has a depth of d = ⌊log(P/r)⌋. If the data are subdivided into s packets, then every phase takes (r + 1)(t + k/s) time. There are s/r runs of packets. Thus, the whole routing consists of s/r + d + 1 phases, which can be performed in time

T_fract,duplex = (s·(r + 1)/r + d′) · (t + k/s).   (4)

Here d′ = (d + 1)(r + 1) is the time step in which the first packet arrives at the last PU. (4) is minimized for s = √(k·d′/t · r/(r + 1)). Substitution gives

T_fract,duplex = (1 + 1/r)·k + d′·t + 2√(k·(1 + 1/r)·d′·t) = (1 + o(1)) · (k + k/r + d·r·t).

There are two additional terms: one is due to the slower progress of the packets, the other to the start-up latencies. If t is very large, the best strategy is to choose r = 1. If k is very large, one should take r = P. For moderate r, the dependency of d on r is not so important. In that case, the optimum is assumed for r = √(k/(d·t)). Substitution gives

T_fract,duplex = (1 + o(1)) · (k + 2√(k·t·log P)).   (5)

For many problem instances, the new approach is considerably faster than the best previous one. Actually, our result matches the lower bound of (1) for this type of algorithms.
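The tuning of (4) can be reproduced numerically. Below is a small Python sketch (ours; the variable names are our assumptions) that evaluates T_fract,duplex for given P, k, t and r, with s chosen as the minimizer derived above:

    import math

    # Evaluate equation (4) for the duplex fractional-tree algorithm.
    # d is the tree depth in chains; d' = (d+1)(r+1) is the step in which
    # the first packet reaches the last PU. (Our code, not the paper's.)
    def t_fract_duplex(P, k, t, r):
        d = math.floor(math.log2(P / r))
        d_prime = (d + 1) * (r + 1)
        s = math.sqrt(k * d_prime / t * r / (r + 1))  # minimizer of (4)
        return (s * (r + 1) / r + d_prime) * (t + k / s)

    if __name__ == "__main__":
        P, k = 1024, 4096.0
        t = k / 4096  # the example instance used below
        for r in (1, 4, 8, 16, P):
            print(r, round(t_fract_duplex(P, k, t, r) / k, 2))
        # r = 1 behaves like the binary tree (about 2.2k), r = P like the
        # pipeline (2.25k); r around 8 gives roughly 1.4k.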

Fig. 4. The fractional-tree algorithm for the duplex model. Here r = 3. Depicted is the movement of packets through a chain at depth l and into the successor chains at depth l + 1. The five pictures show the development as a result of the 4 = r + 1 routing steps performed during some phase t ≥ l + 1. The packets with indices 0, 1, 2 actually denote the packets of run t − l − 1. The packets with indices 3, 4, 5 denote the packets of run t − l. The PUs that in the next step send to the head of the second successor chain are indicated with &. All other PUs send to the PUs below them.

Performance Example. For large P, almost a factor of 2 is gained for t·P = k. In that case both previous algorithms require more than 2k. According to (5), the fractional-tree algorithm requires only (1 + 2√(log P / P))·k. For large P this is hardly more than k. For concrete, moderate k the situation is slightly less good than the above result suggests. As an example we consider P = 1024 and t = k/4096. We can take r = 8, d = 6 and s = 512. Then there are 64 runs of length 8, and the routing consists of 71 phases of 9 steps each. Thus, there are 639 steps, which take 639·(k/512 + k/4096) ≈ 1.40·k all together. How about the other algorithms? The pipelined algorithm in the duplex model takes T_pipe,duplex = (P + s)·(t + k/s) time, which is minimized for s = √(P·k/t). So in our case, we should take s = 2048 and achieve a time consumption of 2.25·k. The binary tree algorithm in the duplex model takes T_bin,duplex = (2·log P + 2s)·(t + k/s), which is minimized for s = √(log P · k/t). In our case, we should take s ≈ 202 and achieve a time consumption of about 2.2·k.
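These numbers are easy to verify mechanically. A short check (our code; the quantities are those quoted in the text) for the duplex example:

    import math

    # Recompute the duplex example: P = 1024, t = k/4096.
    k = 4096.0
    P, t = 1024, k / 4096

    # Fractional tree with r = 8, d = 6, s = 512.
    r, d, s = 8, 6, 512
    phases = s // r + d + 1                  # 71 phases of r + 1 = 9 steps
    steps = phases * (r + 1)                 # 639 steps
    print(steps, round(steps * (k / s + t) / k, 2))   # 639, 1.4

    # Linear pipeline with s = 2048.
    s = 2048
    print(round((P + s) * (t + k / s) / k, 2))        # 2.25

    # Binary tree with s = round(sqrt(log2(P) * k / t)) = 202.
    s = round(math.sqrt(math.log2(P) * k / t))
    print(round((2 * math.log2(P) + 2 * s) * (t + k / s) / k, 2))  # 2.2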

3.2 One-Port Model

As was pointed out in the introduction, for one-port machines, scheduling the send and receive operations so that all communicating pairs of PUs agree cannot be achieved by simply doubling every step of a duplex algorithm. Therefore we do not simply copy the above algorithm, but construct a new schedule based on the same idea of trees of fractional degree.

Schedule Description. Again we work with chains of length r, but now we choose r even. Another difference with the duplex schedule is that now only the PUs with odd index within their chain are in charge of transferring packets to the head of the second successor chain. In this way, the synchronization is trivial: PUs with even indices receive in even steps and send in odd steps; PUs with odd indices send in even steps and receive in odd steps (though in some steps a PU is idle). An example is given in Figure 5.

Fig. 5. The fractional-tree algorithm for the one-port model. Here r = 4. Depicted is the movement of packets through a chain at depth l and into the successor chains at depth l + 1. The seven pictures show the development as a result of the 6 = r + 2 routing steps performed during some phase t ≥ l + 1. The packets with indices 0, 1 actually denote the packets of run t − l − 1. The packets with indices 2, 3 denote the packets of run t − l. The PUs that in the next step send to the head of the second successor chain are indicated with &.

For chains of length r, every phase takes r + 2 steps. PUs with even indices are idle for two steps: one sending and one receiving step. PUs with odd indices are idle during

one receiving step. In the duplex schedule the PUs were sending all the time and receiving r packets during r + 1 steps. Thus here we pay some penalty: the average usage of the connections is slightly lower. During each phase, a run of length r/2 progresses r steps deeper into the tree. As before, the depth d of the tree, measured in chains, equals d = ⌊log(P/r)⌋.

Performance Analysis. We copy the analysis from Section 3.1 and modify it where necessary. If the data are subdivided into s packets, then every phase takes (r + 2)(t + k/s) time. There are 2s/r runs of packets. Thus, the whole routing consists of 2s/r + d + 1 phases, which can be performed in time

T_fract,one-port = (2s·(r + 2)/r + d′) · (t + k/s).   (6)

Here d′ = (d + 1)(r + 2) is the time step in which the first packet arrives at the last PU. (6) is minimized for s = √(k·d′/t · r/(2(r + 2))). Substitution gives

T_fract,one-port = (2 + 4/r)·k + d′·t + 2√(k·(2 + 4/r)·d′·t) = (1 + o(1)) · (2k + 4k/r + r·d·t).

A good choice is r = 2√(k/(d·t)). Substitution gives

T_fract,one-port = (1 + o(1)) · (2k + 4√(k·t·log P)).   (7)

Except for lower-order terms, this is exactly twice the result for the duplex algorithm.

Performance Example. For large P, almost a factor of 3/2 is gained for t·P = k. In that case both previous algorithms require more than 3k. According to (7), the fractional-tree algorithm requires only (2 + 4√(log P / P))·k. For large P this is hardly more than 2k. As a concrete example we consider P = 1024 and t = k/4096. We can take r = 16, d = 5 and s = 408. Then there are 51 runs of length 8, and the routing consists of 57 phases of 18 steps each. Thus, there are 1026 steps, which take 1026·(k/408 + k/4096) ≈ 2.77·k all together. How about the other algorithms? The pipelined algorithm takes T_pipe,one-port = (P + 2(s − 1))·(t + k/s) time, which is minimized for s = √(P·k/(2t)). So in our case, we should take s = 1448 and achieve a time consumption of 3.66·k. The binary tree algorithm in the one-port model takes T_bin,one-port = (3·log P + 3s)·(t + k/s), which is minimized for s = √(log P · k/t). In our case, we should take s ≈ 202 and achieve a time consumption of about 3.3·k.

4 Sparse Interconnection Networks

5 Conclusion

We have shown that the fractional-tree schedule leads to faster broadcasting in the one-port and duplex models. It is up to twice as fast as earlier simple approaches. In comparison to the optimal hypercube schedule, it has the advantage of being simpler, and it requires much smaller bandwidth when embedded on networks such as meshes and hierarchical crossbars.
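As with the duplex case, the concrete one-port example from Section 3.2 can be verified mechanically (our code; quantities as quoted there):

    import math

    # Recompute the one-port example: P = 1024, t = k/4096.
    k = 4096.0
    P, t = 1024, k / 4096

    # Fractional tree with r = 16, d = 5, s = 408.
    r, d, s = 16, 5, 408
    runs = s // (r // 2)                     # 51 runs of length r/2 = 8
    phases = runs + d + 1                    # 57 phases of r + 2 = 18 steps
    steps = phases * (r + 2)                 # 1026 steps
    print(steps, round(steps * (k / s + t) / k, 2))   # 1026, 2.77

    # Linear pipeline with s = 1448.
    s = 1448
    print(round((P + 2 * (s - 1)) * (t + k / s) / k, 2))   # 3.66

    # Binary tree with s = round(sqrt(log2(P) * k / t)) = 202.
    s = round(math.sqrt(math.log2(P) * k / t))
    print(round((3 * math.log2(P) + 3 * s) * (t + k / s) / k, 2))  # 3.3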

References

1. Kumar, V., A. Grama, A. Gupta, G. Karypis: Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings, 1994.
2. Snir, M., S.W. Otto, S. Huss-Lederman, D.W. Walker, J. Dongarra: MPI: The Complete Reference. MIT Press, 1996.