A Bandwidth Latency Tradeoff for Broadcast and Reduction
Peter Sanders and Jop F. Sibeyn
Max-Planck-Institut für Informatik, Im Stadtwald, 66123 Saarbrücken, Germany.
Email: {sanders, jopsi}

Abstract. A novel algorithm is presented for broadcasting in the single-port and duplex model on a completely connected interconnection network with start-up latencies. None of the existing broadcasting algorithms performs close to optimal for packets that are neither very large nor very small. The broadcasting schedule of the new algorithm is based on trees of fractional degree. It performs close to optimal for all packet sizes. For realistic values of P it is up to 50% faster than the best alternative.

1 Introduction

Broadcasting. Consider P processing units, PUs, of a parallel machine. Broadcasting, the operation in which one PU has to send a message M to all other PUs, is a crucial building block for many parallel algorithms. Since it can be implemented once and for all in communication libraries such as MPI [2], it makes sense to strive for optimal algorithms for all P and all message lengths k. Since broadcasting is sometimes a bottleneck operation, even constant factors should not be neglected. In addition, by reversing the direction of communication, broadcasting algorithms can usually be turned into reduction algorithms. Reduction is the task of computing a generalized sum ⊕_{i<P} M_i, where initially message M_i is stored on PU i and where ⊕ can be any associative operator. Broadcasting and reduction are among the most important communication primitives. For example, some of the best algorithms for matrix multiplication or dense matrix-vector multiplication have these two functions as their sole communication routines [1].

Assumptions and Communication Models. We study broadcasting long messages using the following assumptions, which reflect the properties of many current machines, at least for the transmission of long messages: The PUs are connected by a network.
Communication is synchronous, i.e., both sender and receiver have to cooperate in transmitting a message. Transferring a message of length k between any pair of PUs takes time t + k. Here t is the start-up overhead, and the time unit is defined as the time needed to transfer one data item.
In the last point we assume that the topology of the network does not play a role. However, our algorithms have enough locality to work well on networks with limited bisection width, and we will also informally discuss how the communication structures can be embedded into meshes.

We consider two communication models. The first is the single-port model, in which a PU can only participate in one communication at a time: it can either send or receive a packet, but not both. The second is the somewhat more powerful duplex model, in which a PU can concurrently send one message and receive another message (not necessarily from the same PU).

Lower Bounds. In the duplex model, all non-source PUs must receive the k data, and the whole broadcasting takes at least log P steps, since the number of informed PUs can at most double in each step. Thus in the duplex model there is a lower bound of

  T_lower,duplex = k + t log P.

In the single-port model all P - 1 non-source PUs must receive the k data. These (P - 1) k data have been sent while at any time at most ⌊P/2⌋ PUs can be sending. So, even with perfect balance, the PUs are sending packets for 2 (1 - 1/P) k time. This gives

  T_lower,one-port = 2 (1 - 1/P) k + t log P.

For odd P there is always one idle PU; in that case the first term is 2 k. These lower bounds hold in full generality. For a large and natural class of algorithms, we can prove a stronger bound though. Consider algorithms that divide the total data set of size k into s packets of size k/s each. All PUs operate synchronously, and in every step they send or receive at most one packet. So, until step s, there is still at least one packet known only to the source. Thus, for given s, at least s + log P steps are required in the duplex model. Each step takes k/s + t time. For given k, t and P, the minimum is assumed for s = sqrt(k log P / t):

  T_lower,duplex = k + t log P + 2 sqrt(k t log P).   (1)

We were not able to come up with a corresponding bound for the one-port model. The problem is that one cannot conclude that the routing requires s + log P steps.

Previous Results.
(Draft note, translated from German: Peter may still write something nice here; in that case the next paragraph has to be restructured. Delimitation against the hypercube algorithm; possibly an example (P = 6) for the H-tree embedding.)

New Contributions. The earlier algorithms perform optimally in two extreme cases. If k is very large, the start-up effects become negligible, and one should concentrate on minimizing the amount of data to be transferred. In the other extreme case, if t is very large, the transmission time is negligible, and the sole objective is to minimize the number of communication steps. For intermediate cases the best of these approaches takes up to 50% more time than the lower bound; in the case of duplex connections even twice as much. Our refined broadcasting schedule comes much closer to the lower bounds for all values of k and t. For the duplex model we come to within lower-order terms of (1). The schedule is rather simple and might be implemented easily.
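For later comparison, the lower bounds above are easy to evaluate numerically. The following minimal sketch is our own (the function names are not from the paper); log denotes the base-2 logarithm throughout:

```python
import math

def lower_bound_duplex(k, t, P):
    # Every non-source PU must receive the k data; at least log P steps.
    return k + t * math.log2(P)

def lower_bound_one_port(k, t, P):
    # (P - 1) k data must be sent, with at most P/2 PUs sending at any time.
    return 2 * (1 - 1 / P) * k + t * math.log2(P)

def lower_bound_duplex_packets(k, t, P):
    # Refined duplex bound (1) for algorithms that split the message into
    # s packets of size k / s, with the optimum s = sqrt(k log P / t).
    log_p = math.log2(P)
    return k + t * log_p + 2 * math.sqrt(k * t * log_p)
```

For instance, k = 4096, t = 1 and P = 1024 give 4106, 8194 and about 4511 time units, respectively.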
In general a duplex algorithm cannot be turned into a one-port algorithm by simply replacing every duplex step by two one-port steps: this would amount to constructing a two-coloring of a graph with degree two. For graphs containing odd cycles, such as the communication graph of our duplex algorithm, this is impossible. However, a modification of the schedule leads to a good one-port algorithm. It requires twice as long as the duplex algorithm.

Contents. After describing the known algorithms, we present fractional-tree broadcasting: first for the simpler duplex model, then for the one-port model. At the end of the paper we consider embedding the fractional trees on sparse networks.

2 Simple Algorithms

In order to be able to compare the performance of our new algorithm, we first present two well-known algorithms.

2.1 Linear Pipeline Broadcasting

For messages with k >> t P, a simple pipelined algorithm performs very well: The PUs are organized into a directed Hamiltonian path starting at the source PU, and the message is subdivided into s packets of size k/s. The packets are handed through the path in a pipelined fashion. An internal PU simply alternates between receiving a packet from its predecessor and forwarding it to its successor. An example is given in Figure 1.

Fig. 1. The simple pipelined algorithm for P = 4 and s = 5.

It is easy to see that the broadcasting time is

  T_pipe,one-port = (d + 2 (s - 1)) (t + k/s).   (2)
Here d = P - 1 is the time step in which the first packet arrives at the last PU. (2) is minimized for s = sqrt(k d / (2 t)). Substitution gives

  T_pipe,one-port = (sqrt(d t) + sqrt(2 k))^2 = (2 + o(1)) k + O(t P).

This means that for k >> t P broadcasting takes little more than the (near) optimal 2 k time. For the duplex model, the constant two can be dropped in the analysis. For long messages, the algorithm is almost twice as fast.

2.2 Binary Tree Broadcasting

For shorter messages, the high latency of O(t P) until the last PU begins to receive data is prohibitive for large P. This latency can be reduced to O(t log P) by using a binary tree instead of a Hamiltonian path. Each internal PU performs the following three steps in a loop: receive a packet from the father; send the packet to the left child; send the packet to the right child. This approach is illustrated in Figure 2.

Fig. 2. The tree structures used for broadcasting, for trees of various depths d. The tree for depth d is constructed out of the trees for depths d - 1 and d - 2. The numbers indicate the routing steps in which a packet coming from the root is routed over the connections.

Every three steps the root injects one packet. For single-ported communication, each pass through the loop takes 3 (t + k/s). Similar to the above calculation, we get

  T_bin,one-port = (d + 3 (s - 1)) (t + k/s).   (3)

Here d = O(log P) is the time step in which the first packet arrives at the last PU. (3) is minimized for s = sqrt(k d / (3 t)). Substitution gives

  T_bin,one-port = (sqrt(d t) + sqrt(3 k))^2 = (3 + o(1)) k + O(t log P).
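The two closed forms (2) and (3) can be compared numerically. The sketch below uses our own helper functions (not from the paper); in particular, the arrival step d of the first packet at the last PU of the binary tree is only O(log P), and taking it as 3 ceil(log2 P) is our rough assumption:

```python
import math

def t_pipe_one_port(k, t, P):
    # Optimized pipeline time (2): d = P - 1 and s = sqrt(k d / (2 t)).
    d = P - 1
    s = max(1, round(math.sqrt(k * d / (2 * t))))
    return (d + 2 * (s - 1)) * (t + k / s)

def t_bin_one_port(k, t, P):
    # Optimized binary-tree time (3) with s = sqrt(k d / (3 t)).
    # Assumption: the first packet needs about 3 steps per tree level.
    d = 3 * math.ceil(math.log2(P))
    s = max(1, round(math.sqrt(k * d / (3 * t))))
    return (d + 3 * (s - 1)) * (t + k / s)
```

For large k the pipeline wins and the ratio of the two times approaches 3/2; for large t the tree wins because of its O(t log P) latency.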
The binary tree is a good algorithm even for rather small k, but for large k it takes up to 50% more time than the linear pipeline. In the duplex model, the PUs perform a two-step loop: receive a packet from the father and send the previous packet to the right child; send the latest packet to the left child. Thus each pass takes 2 (t + k/s) time. For large k the linear pipeline is up to twice as fast.

3 Fractional Tree Broadcasting

The question arises whether better tradeoffs between bandwidth and latency can be achieved for medium-sized messages than either a pipeline with linear latency or a binary tree with logarithmic latency but slower message transmission. Indeed, we now design a family of broadcast algorithms which achieve execution times

  (2 + 4/r + o(1)) k + O(r t log P)

in the single-ported model and

  (1 + 1/r + o(1)) k + O(r t log P)

in the duplex model, for any integer r. The linear pipeline and the binary tree can be viewed as special cases of this scheme for r = P and r = 1, respectively.

3.1 Duplex Model

The idea is to replace each node of a binary tree by a chain of r PUs. The input is fed into the head of this chain. The data is passed down the chain and on to a successor chain as in the linear pipeline algorithm. In addition, the PUs of the chain cooperate in feeding the data to the head of a second successor chain. Figure 3 shows the communication structure and an example. The resulting communication structure can be viewed as a tree with degree 1 + 1/r. We call it a fractional tree.

Schedule Details. Now we describe the routing schedule in more detail. The total data set of size k is divided into runs consisting of r packets each. The packets in each run are indexed from 1 through r; the same indices are used for the PUs in each chain. The routing is composed of phases, each taking r + 1 steps. The steps are indexed from 0 through r. Initially all packets are stored in PU 1 of the chain at depth 0. Consider a chain at depth l.
In phase t >= l this chain is involved in forwarding the packets of run t - l, which enter at PU 1, through the chain. It is also involved in further forwarding the packets of run t - l - 1 through the chain and into both successor chains. Now we consider step j of phase t more precisely. Any PU i with i > j sends packet r + j - i of run t - l - 1 to the PU below it (for i = r, this means to the head of the successor chain). For j > 0, PU i = j sends packet i of run t - l to the head of the second successor chain. Any PU i with i < j sends packet j - i of run t - l to the PU below it. The routing is illustrated in Figure 4.

Performance Analysis. Having established a smooth timing of the algorithm, the analysis can proceed analogously to that of the simple algorithms from Section 2. A tree of
depth d consists of 2^(d+1) - 1 chains, each with r PUs. Thus, for given r and P, the tree has a depth of d = ⌊log(P/r)⌋ - 1.

Fig. 3. Left: the communication structure in a chain of r PUs which cooperate to supply two successor chains with all the data they have to receive. Right: a complete binary tree built out of such chains.

If the data are subdivided into s packets, then every phase takes (r + 1) (t + k/s) time. There are s/r runs of packets. Thus, the whole routing consists of s/r + d + 1 phases, which can be performed in time

  T_fract,duplex = (s (r + 1)/r + d') (t + k/s).   (4)

Here d' = (d + 1) (r + 1) is the time step in which the first packet arrives at the last PU. (4) is minimized for s = sqrt((k d'/t) (r/(r + 1))). Substitution gives

  T_fract,duplex = (1 + 1/r) k + d' t + 2 sqrt(k (1 + 1/r) d' t) = (1 + o(1)) (k + k/r + d r t).

There are two additional terms: one is due to the slower progress of the packets, the other to the start-up latencies. If t is very large, the best strategy is to choose r = 1. If k is very large, one should take r = P. For moderate r the dependency of d on r is not so important. In that case, the optimum is assumed for r = sqrt(k/(d t)). Substitution gives

  T_fract = (1 + o(1)) (k + 2 sqrt(k t log P)).   (5)

For many problem instances, the new approach is considerably faster than the best previous one. Actually, our result matches the lower bound of (1) for this type of algorithms.
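For concrete parameters, (4) can be evaluated directly. The helper below is our own sketch; rounding s to a multiple of r is our choice, so that the runs come out even:

```python
import math

def t_fract_duplex(k, t, P, r):
    # Evaluate (4) with d = floor(log2(P / r)) - 1, d' = (d + 1)(r + 1),
    # and s near its optimum sqrt((k d' / t) (r / (r + 1))).
    d = max(0, math.floor(math.log2(P / r)) - 1)
    dp = (d + 1) * (r + 1)
    s_opt = math.sqrt(k * dp / t * r / (r + 1))
    s = max(r, r * round(s_opt / r))          # round to a multiple of r
    return (s * (r + 1) / r + dp) * (t + k / s)
```

With k = 1, t = 1/4096 and P = 1024, the choice r = 8 yields about 1.40, whereas r = 1 (essentially the binary tree) yields about 2.20.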
Fig. 4. The fractional-tree algorithm for the duplex model, for r = 3. Depicted is the movement of packets through a chain at depth l and into the successor chains at depth l + 1. The five pictures show the development as a result of the 4 = r + 1 routing steps performed during some phase t >= l + 1. The packets with indices 0, 1, 2 denote the packets of run t - l - 1; the packets with indices 3, 4, 5 denote the packets of run t - l. The PUs that in the next step send to the head of the second successor chain are indicated with &. All other PUs send to the PUs below them.

Performance Example. For large P, almost a factor 2 is gained for t P = k. In that case both previous algorithms require more than 2 k. According to (5), the fractional-tree algorithm requires only (1 + 2 sqrt(log P / P)) k. For large P this is hardly more than k.

For concrete moderate k the situation is slightly less good than the above result suggests. As an example we consider P = 1024 and t = k/4096. We can take r = 8, d = 6 and s = 512. Then there are 64 runs of length 8, and the routing consists of 71 phases of 9 steps each. Thus, there are 639 steps that take 639 (k/512 + k/4096) ≈ 1.40 k all together. How about the other algorithms? The pipelined algorithm in the duplex model takes T_pipe,duplex = (P + s) (t + k/s) time, which is minimized for s = sqrt(P k / t). So in our case, we should take s = 2048 and achieve a time consumption of 2.25 k. The binary tree algorithm in the duplex model takes T_bin,duplex = 2 (log P + s) (t + k/s), which is minimized for s = sqrt(k log P / t). In our case, we should take s = 202 and achieve a time consumption of about 2.20 k.
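These numbers are easy to reproduce. A minimal script (our own; the variable names are ours, the formulas are the ones stated above):

```python
import math

k, P = 1.0, 1024
t = k / 4096
log_p = math.log2(P)

# Fractional tree, duplex: r = 8, d = 6, s = 512 (64 runs of 8 packets).
r, d, s = 8, 6, 512
steps = (s // r + d + 1) * (r + 1)        # 71 phases of 9 steps = 639 steps
t_fract = steps * (t + k / s)             # about 1.40 k

# Linear pipeline, duplex: s = sqrt(P k / t) = 2048.
s1 = round(math.sqrt(P * k / t))
t_pipe = (P + s1) * (t + k / s1)          # 2.25 k

# Binary tree, duplex: s = sqrt(k log P / t), about 202.
s2 = round(math.sqrt(k * log_p / t))
t_bin = 2 * (log_p + s2) * (t + k / s2)   # about 2.20 k
```

In this regime the fractional tree clearly beats both simple algorithms.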
3.2 One-Port Model

As was pointed out in the introduction, for one-port machines, scheduling the send and receive operations so that all communicating pairs of PUs agree cannot be achieved by simply doubling every step of a duplex algorithm. Therefore we do not simply copy the above algorithm, but construct a new schedule based on the same idea of trees of fractional degree.

Schedule Description. Again we work with chains of length r, but now we choose r even. Another difference from the duplex schedule is that now only the PUs with odd index within their chain are in charge of transferring packets to the head of the second successor chain. In this way, the synchronization is trivial: PUs with even indices receive in even steps and send in odd steps; PUs with odd indices send in even steps and receive in odd steps (though in some steps a PU is idle). An example is given in Figure 5.

Fig. 5. The fractional-tree algorithm for the one-port model, for r = 4. Depicted is the movement of packets through a chain at depth l and into the successor chains at depth l + 1. The seven pictures show the development as a result of the 6 = r + 2 routing steps performed during some phase t >= l + 1. The packets with indices 0 and 1 denote the packets of run t - l - 1; the packets with indices 2 and 3 denote the packets of run t - l. The PUs that in the next step send to the head of the second successor chain are indicated with &.

For chains of length r, every phase takes r + 2 steps. PUs with even indices are idle for two steps: one sending and one receiving step. PUs with odd indices are idle during
one receiving step. In the duplex schedule the PUs were sending all the time and receiving r packets during r + 1 steps. Thus here we pay some penalty: the average usage of the connections is slightly lower. During each phase a run of length r/2 progresses r steps deeper into the tree. As before, the depth d of the tree, measured in chains, equals d = ⌊log(P/r)⌋ - 1.

Performance Analysis. We copy the analysis from Section 3.1 and modify it where necessary. If the data are subdivided into s packets, then every phase takes (r + 2) (t + k/s) time. There are 2 s/r runs of packets. Thus, the whole routing consists of 2 s/r + d + 1 phases, which can be performed in time

  T_fract,one-port = (2 s (r + 2)/r + d') (t + k/s).   (6)

Here d' = (d + 1) (r + 2) is the time step in which the first packet arrives at the last PU. (6) is minimized for s = sqrt((k d'/t) (r/(2 (r + 2)))). Substitution gives

  T_fract,one-port = (2 + 4/r) k + d' t + 2 sqrt(k (2 + 4/r) d' t) = (1 + o(1)) (2 k + 4 k/r + r d t).

A good choice is r = 2 sqrt(k/(d t)). Substitution gives

  T_fract,one-port = (1 + o(1)) (2 k + 4 sqrt(2 k t log P)).   (7)

Except for the lower-order terms, this is exactly twice the result for the duplex algorithm.

Performance Example. For large P, almost a factor 3/2 is gained for t P = k. In that case both previous algorithms require more than 3 k. According to (7), the fractional-tree algorithm requires only (2 + 4 sqrt(log P / P)) k. For large P this is hardly more than 2 k. As a concrete example we again consider P = 1024 and t = k/4096. We can take r = 16, d = 5 and s = 408. Then there are 51 runs of length 8, and the routing consists of 57 phases of 18 steps each. Thus, there are 1026 steps that take 1026 (k/408 + k/4096) ≈ 2.77 k all together. How about the other algorithms? The pipelined algorithm takes T_pipe,one-port = (P - 1 + 2 (s - 1)) (t + k/s) time, which is minimized for s = sqrt(P k/(2 t)). So in our case, we should take s = 1448 and achieve a time consumption of 3.66 k. The binary tree algorithm in the one-port model takes T_bin,one-port = 3 (log P + s) (t + k/s), which is minimized for s = sqrt(k log P / t).
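The one-port numbers can be checked the same way as in the duplex case (again our own sketch, with our variable names and the formulas derived above):

```python
import math

k, P = 1.0, 1024
t = k / 4096
log_p = math.log2(P)

# Fractional tree, one-port: r = 16, d = 5, s = 408 (51 runs of 8 packets).
r, d, s = 16, 5, 408
steps = (2 * s // r + d + 1) * (r + 2)    # 57 phases of 18 steps = 1026 steps
t_fract = steps * (t + k / s)             # about 2.77 k

# Linear pipeline, one-port: s = sqrt(P k / (2 t)), about 1448.
s1 = round(math.sqrt(P * k / (2 * t)))
t_pipe = (P - 1 + 2 * (s1 - 1)) * (t + k / s1)   # about 3.66 k

# Binary tree, one-port: s = sqrt(k log P / t), about 202.
s2 = round(math.sqrt(k * log_p / t))
t_bin = 3 * (log_p + s2) * (t + k / s2)   # about 3.30 k
```

As in the duplex case, the fractional tree clearly wins in this regime.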
In our case, we should take s = and achieve a time consumption of : k. Sparse Interconnection Networks 5 Conclusion We have shown that the partial-tree schedule leads to faster broadcasting in the one-port and duplex model. It is up to twice as fast as earlier simple approaches. In comparison to the optimal hypercube schedule it has the advantage of being simpler and that it requires much smaller bandwith when embedding it on networks such as meshes and hierarchical crossbars.
References

1. Kumar, V., A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings, 1994.
2. Snir, M., S.W. Otto, S. Huss-Lederman, D.W. Walker, J. Dongarra, MPI: the Complete Reference, MIT Press, 1996.
Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid
More informationCommunication Fundamentals in Computer Networks
Lecture 10 Communication Fundamentals in Computer Networks M. Adnan Quaium Assistant Professor Department of Electrical and Electronic Engineering Ahsanullah University of Science and Technology Room 4A07
More informationCHAPTER 3 ASYNCHRONOUS PIPELINE CONTROLLER
84 CHAPTER 3 ASYNCHRONOUS PIPELINE CONTROLLER 3.1 INTRODUCTION The introduction of several new asynchronous designs which provides high throughput and low latency is the significance of this chapter. The
More informationModel Questions and Answers on
BIJU PATNAIK UNIVERSITY OF TECHNOLOGY, ODISHA Model Questions and Answers on PARALLEL COMPUTING Prepared by, Dr. Subhendu Kumar Rath, BPUT, Odisha. Model Questions and Answers Subject Parallel Computing
More informationAn efficient implementation of the greedy forwarding strategy
An efficient implementation of the greedy forwarding strategy Hannes Stratil Embedded Computing Systems Group E182/2 Technische Universität Wien Treitlstraße 3 A-1040 Vienna Email: hannes@ecs.tuwien.ac.at
More informationDense Matrix Algorithms
Dense Matrix Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text Introduction to Parallel Computing, Addison Wesley, 2003. Topic Overview Matrix-Vector Multiplication
More informationBlueGene/L. Computer Science, University of Warwick. Source: IBM
BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours
More information15-750: Parallel Algorithms
5-750: Parallel Algorithms Scribe: Ilari Shafer March {8,2} 20 Introduction A Few Machine Models of Parallel Computation SIMD Single instruction, multiple data: one instruction operates on multiple data
More informationEfficient Communication in Metacube: A New Interconnection Network
International Symposium on Parallel Architectures, Algorithms and Networks, Manila, Philippines, May 22, pp.165 170 Efficient Communication in Metacube: A New Interconnection Network Yamin Li and Shietung
More informationDeep Learning. Architecture Design for. Sargur N. Srihari
Architecture Design for Deep Learning Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation
More informationChapter 2 Overview of the Design Methodology
Chapter 2 Overview of the Design Methodology This chapter presents an overview of the design methodology which is developed in this thesis, by identifying global abstraction levels at which a distributed
More informationCS 455/555 Intro to Networks and Communications. Link Layer
CS 455/555 Intro to Networks and Communications Link Layer Dr. Michele Weigle Department of Computer Science Old Dominion University mweigle@cs.odu.edu http://www.cs.odu.edu/~mweigle/cs455-s13 1 Link Layer
More informationCross Layer Design in d
Australian Journal of Basic and Applied Sciences, 3(3): 59-600, 009 ISSN 99-878 Cross Layer Design in 80.6d Ali Al-Hemyari, Y.A. Qassem, Chee Kyun Ng, Nor Kamariah Noordin, Alyani Ismail, and Sabira Khatun
More informationHeuristics for Semi-External Depth-First Search on Directed Graphs
Heuristics for Semi-External Depth-First Search on Directed Graphs Jop F. Sibeyn Institut für Informatik, Universität Halle, Germany www.informatik.uni-halle.de/ jopsi James Abello ATT Labs Research, Florham
More informationParallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)
Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication
More informationAn In-place Algorithm for Irregular All-to-All Communication with Limited Memory
An In-place Algorithm for Irregular All-to-All Communication with Limited Memory Michael Hofmann and Gudula Rünger Department of Computer Science Chemnitz University of Technology, Germany {mhofma,ruenger}@cs.tu-chemnitz.de
More informationECE 333: Introduction to Communication Networks Fall Lecture 6: Data Link Layer II
ECE 333: Introduction to Communication Networks Fall 00 Lecture 6: Data Link Layer II Error Correction/Detection 1 Notes In Lectures 3 and 4, we studied various impairments that can occur at the physical
More informationWe assume uniform hashing (UH):
We assume uniform hashing (UH): the probe sequence of each key is equally likely to be any of the! permutations of 0,1,, 1 UH generalizes the notion of SUH that produces not just a single number, but a
More informationNetworks for Multi-core Chips A A Contrarian View. Shekhar Borkar Aug 27, 2007 Intel Corp.
Networks for Multi-core hips A A ontrarian View Shekhar Borkar Aug 27, 2007 Intel orp. 1 Outline Multi-core system outlook On die network challenges A simple contrarian proposal Benefits Summary 2 A Sample
More informationHomework #4 Due Friday 10/27/06 at 5pm
CSE 160, Fall 2006 University of California, San Diego Homework #4 Due Friday 10/27/06 at 5pm 1. Interconnect. A k-ary d-cube is an interconnection network with k d nodes, and is a generalization of the
More informationMITOCW watch?v=w_-sx4vr53m
MITOCW watch?v=w_-sx4vr53m The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To
More informationCharacterizing Graphs (3) Characterizing Graphs (1) Characterizing Graphs (2) Characterizing Graphs (4)
S-72.2420/T-79.5203 Basic Concepts 1 S-72.2420/T-79.5203 Basic Concepts 3 Characterizing Graphs (1) Characterizing Graphs (3) Characterizing a class G by a condition P means proving the equivalence G G
More information1 Leaffix Scan, Rootfix Scan, Tree Size, and Depth
Lecture 17 Graph Contraction I: Tree Contraction Parallel and Sequential Data Structures and Algorithms, 15-210 (Spring 2012) Lectured by Kanat Tangwongsan March 20, 2012 In this lecture, we will explore
More informationIntroduction to Parallel Programming
University of Nizhni Novgorod Faculty of Computational Mathematics & Cybernetics Section 3. Introduction to Parallel Programming Communication Complexity Analysis of Parallel Algorithms Gergel V.P., Professor,
More informationCS Parallel Algorithms in Scientific Computing
CS 775 - arallel Algorithms in Scientific Computing arallel Architectures January 2, 2004 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan
More informationParallel Programming Platforms
arallel rogramming latforms Ananth Grama Computing Research Institute and Department of Computer Sciences, urdue University ayg@cspurdueedu http://wwwcspurdueedu/people/ayg Reference: Introduction to arallel
More informationReduction of Periodic Broadcast Resource Requirements with Proxy Caching
Reduction of Periodic Broadcast Resource Requirements with Proxy Caching Ewa Kusmierek and David H.C. Du Digital Technology Center and Department of Computer Science and Engineering University of Minnesota
More informationBalance of Processing and Communication using Sparse Networks
Balance of essing and Communication using Sparse Networks Ville Leppänen and Martti Penttonen Department of Computer Science University of Turku Lemminkäisenkatu 14a, 20520 Turku, Finland and Department
More informationApproximation Algorithms
18.433 Combinatorial Optimization Approximation Algorithms November 20,25 Lecturer: Santosh Vempala 1 Approximation Algorithms Any known algorithm that finds the solution to an NP-hard optimization problem
More informationHigh Performance Computing. University questions with solution
High Performance Computing University questions with solution Q1) Explain the basic working principle of VLIW processor. (6 marks) The following points are basic working principle of VLIW processor. The
More informationBasic Communication Operations (Chapter 4)
Basic Communication Operations (Chapter 4) Vivek Sarkar Department of Computer Science Rice University vsarkar@cs.rice.edu COMP 422 Lecture 17 13 March 2008 Review of Midterm Exam Outline MPI Example Program:
More informationINTERCONNECTION NETWORKS LECTURE 4
INTERCONNECTION NETWORKS LECTURE 4 DR. SAMMAN H. AMEEN 1 Topology Specifies way switches are wired Affects routing, reliability, throughput, latency, building ease Routing How does a message get from source
More informationCS 770G - Parallel Algorithms in Scientific Computing Parallel Architectures. May 7, 2001 Lecture 2
CS 770G - arallel Algorithms in Scientific Computing arallel Architectures May 7, 2001 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan Kaufmann
More informationGeneralized Network Flow Programming
Appendix C Page Generalized Network Flow Programming This chapter adapts the bounded variable primal simplex method to the generalized minimum cost flow problem. Generalized networks are far more useful
More informationParallel Systems Course: Chapter VIII. Sorting Algorithms. Kumar Chapter 9. Jan Lemeire ETRO Dept. Fall Parallel Sorting
Parallel Systems Course: Chapter VIII Sorting Algorithms Kumar Chapter 9 Jan Lemeire ETRO Dept. Fall 2017 Overview 1. Parallel sort distributed memory 2. Parallel sort shared memory 3. Sorting Networks
More informationWhat is Multicasting? Multicasting Fundamentals. Unicast Transmission. Agenda. L70 - Multicasting Fundamentals. L70 - Multicasting Fundamentals
What is Multicasting? Multicasting Fundamentals Unicast transmission transmitting a packet to one receiver point-to-point transmission used by most applications today Multicast transmission transmitting
More informationParallel Algorithm Design. Parallel Algorithm Design p. 1
Parallel Algorithm Design Parallel Algorithm Design p. 1 Overview Chapter 3 from Michael J. Quinn, Parallel Programming in C with MPI and OpenMP Another resource: http://www.mcs.anl.gov/ itf/dbpp/text/node14.html
More informationLecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance
Lecture 13: Interconnection Networks Topics: lots of background, recent innovations for power and performance 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees,
More informationEfficient pebbling for list traversal synopses
Efficient pebbling for list traversal synopses Yossi Matias Ely Porat Tel Aviv University Bar-Ilan University & Tel Aviv University Abstract 1 Introduction 1.1 Applications Consider a program P running
More informationAn Efficient Method for Constructing a Distributed Depth-First Search Tree
An Efficient Method for Constructing a Distributed Depth-First Search Tree S. A. M. Makki and George Havas School of Information Technology The University of Queensland Queensland 4072 Australia sam@it.uq.oz.au
More informationGraphs. The ultimate data structure. graphs 1
Graphs The ultimate data structure graphs 1 Definition of graph Non-linear data structure consisting of nodes & links between them (like trees in this sense) Unlike trees, graph nodes may be completely
More informationTopology basics. Constraints and measures. Butterfly networks.
EE48: Advanced Computer Organization Lecture # Interconnection Networks Architecture and Design Stanford University Topology basics. Constraints and measures. Butterfly networks. Lecture #: Monday, 7 April
More informationCOSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters
COSC 6374 Parallel I/O (I) I/O basics Fall 2010 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network card 1 Network card
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #4 1/24/2018 Xuehai Qian xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Announcements PA #1
More informationJob Re-Packing for Enhancing the Performance of Gang Scheduling
Job Re-Packing for Enhancing the Performance of Gang Scheduling B. B. Zhou 1, R. P. Brent 2, C. W. Johnson 3, and D. Walsh 3 1 Computer Sciences Laboratory, Australian National University, Canberra, ACT
More informationApproximation and Heuristic Algorithms for Minimum Delay Application-Layer Multicast Trees
THE IBY AND ALADAR FLEISCHMAN FACULTY OF ENGINEERING Approximation and Heuristic Algorithms for Minimum Delay Application-Layer Multicast Trees A thesis submitted toward the degree of Master of Science
More information