CS174 Lecture 11. Routing in a Parallel Computer


We study the problem of moving packets around in a parallel computer. In this lecture we will consider parallel computers with hypercube connection networks. The methods we describe are easy to adapt to other connection networks, like butterflies or fat trees. An 8-processor machine is shown below.

[Figure: a 3-dimensional cube with a processor at each vertex, labeled with the 3-bit binary numbers 000 through 111; the bit positions correspond to the X, Y, and Z axes.]

The processors are at the vertices of the cube, and edges represent communication links. Each processor is numbered with a 3-bit binary number. Notice that the processor numbers are also the X-Y-Z coordinates of the 3D layout of the cube. A hypercube is a higher-dimensional cube. You can create a hypercube in n+1 dimensions from an n-dimensional one by making an extra copy of the n-dimensional cube, adding an extra address bit which is zero in one copy and one in the other, and then joining corresponding vertices. E.g. a 4D hypercube can be built from the 3D hypercube this way, as shown below.

[Figure: a 4-dimensional hypercube drawn as two joined copies of the 3D cube, with vertices labeled by 4-bit binary numbers along the X, Y, Z, and W axes.]

The processors are numbered with 4-bit binary numbers which represent the X-Y-Z-W coordinates.
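The adjacency structure just described (two processors are joined exactly when their binary addresses differ in one bit) can be sketched in a few lines of Python; the helper name `hypercube_edges` is mine, not the lecture's:

```python
def hypercube_edges(n):
    """Edges of the n-dimensional hypercube: vertices are the integers
    0 .. 2**n - 1, and two vertices are adjacent iff their binary
    addresses differ in exactly one bit (v and v XOR 2**b)."""
    return [(v, v ^ (1 << b))
            for v in range(2 ** n)
            for b in range(n)
            if v < v ^ (1 << b)]  # keep each undirected edge once

# The 3D cube has 8 vertices and 12 edges; each vertex has degree 3.
print(len(hypercube_edges(3)))  # -> 12
```

The `v < v ^ (1 << b)` filter just avoids listing each undirected edge twice.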

(As an aside, this is a valid projection of 4D space, and this is exactly what the 4D cube looks like. Projecting from 4D or n-D into 2D is analogous to 3D projection.)

Definitions

Processor number i wants to send a packet v_i to destination d(i). We assume that the computer runs synchronously, so that all processors try to send their packets at the same time.

Permutation Routing: In permutation routing, every processor tries to send to a different destination. That is, the function d(i) is a permutation.

Oblivious Routing: The route taken by a packet v_i depends only on its destination d(i) and not on any other packet's destination d(j).

Oblivious routing is the most practical method to use for parallel computers. Routing which is not oblivious requires central coordination. But that implies that some processor is able to look at the destinations of many packets in the current time step and compute a good global routing for them. This takes a considerable amount of time and communication. It is certainly not a good idea to use it for routing packets, which is the lowest-level operation that the parallel machine performs. Routing must be very fast because all of the high-level machine operations rely on it. E.g. some machines even borrow main memory from other machines on the network, so they must access that memory even faster than the simplest computations.

Collisions: A collision occurs when two packets arrive at the same processor at the same time and try to leave via the same link.

Bit Fixing: Bit fixing is a simple process for routing a single packet obliviously. Take the bit address of the source processor and change one bit at a time until it matches the address of the destination processor. Each time a bit is changed, the packet is forwarded to a neighboring processor. This works because in a hypercube, each processor i is connected to all the processors j whose bit addresses differ from i in exactly one bit. E.g. processor 1101 is connected to processor 1100, processor 1111, processor 1001, and processor 0101.
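Bit fixing is short enough to write out directly. The following is a minimal sketch (the function name `bit_fix_route` is mine), fixing bits from the most significant bit down:

```python
def bit_fix_route(src, dst, n):
    """Route from src to dst in an n-dimensional hypercube by fixing
    differing address bits one at a time, most significant bit first.
    Each bit flip moves the packet to a neighboring processor."""
    route, cur = [src], src
    for b in range(n - 1, -1, -1):       # scan bits left to right
        if (cur ^ dst) & (1 << b):       # bit b still differs from dst
            cur ^= 1 << b                # flip it: hop to a neighbor
            route.append(cur)
    return route

# 010 -> 111: fix the leading bit, then the trailing bit.
print([format(v, "03b") for v in bit_fix_route(0b010, 0b111, 3)])
# -> ['010', '110', '111']
```

The route visits one node per differing bit, which is exactly the optimality argument below.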
So changing one bit of the address corresponds to sending the packet from the current node to a neighboring node. It should be clear that bit fixing is an optimal routing scheme for a single packet: if the source address i and the destination address j differ in k bits, then the packet must traverse at least k links in the hypercube to get to its destination, and bit fixing gets it there in exactly k steps. If we have to route many packets, though, even if the routing is a permutation routing, bit fixing can cause collisions. In fact, so can any deterministic oblivious routing strategy. To deal with collisions, every processor has a queue and a prioritizing scheme for its incoming packets. If

several incoming packets try to leave along the same link, they are placed in a queue and sent off in different time steps.

[Figure: three packets v_1, v_2, v_3 with d(1) = d(2) = d(3) = 3 arriving at the same node and queueing on the same outgoing link.]

E.g. in the figure above, three packets arrive at the same node and attempt to leave along the same link (we show their destinations as the same, which doesn't happen with permutation routing, but what we really mean is that the next node address is the same). If the 3 packets arrive at time step T, the packets are queued, and v_1 leaves at time step T+1, v_2 leaves at T+2, and v_3 leaves at T+3. Thus routing can still occur with collisions, since all packets eventually get to their destinations, but there may be long delays because of queueing. In fact we have:

Theorem: Any deterministic oblivious permutation routing scheme for a parallel machine with N processors, each with n outward links, requires Omega(sqrt(N)/n) steps.

Luckily we can avoid this bad case by using a randomized routing scheme. In fact most permutations cause very few collisions. So the idea is to first route all the packets according to a random permutation, and then from there to their final destinations. That is,

Randomized Routing:
1. Choose a random permutation sigma of {1, ..., N}. Route packet v_i to destination sigma(i) using bit fixing.
2. Route packet v_i from sigma(i) to d(i) using bit fixing.

There is an important observation we can make about bit fixing. Namely, during a given phase of the above algorithm (step 1 or step 2), two packets can come together along a route and then separate, but only once. That is, a pair of routes can look like the left picture but not the right:

[Figure: two route diagrams, labeled "Possible" (two routes merge, share a stretch of edges, then separate) and "Impossible" (two routes merge, separate, and merge again).]
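The two-phase scheme is easy to express on top of bit fixing. Here is a small sketch (the function names are mine; like the lecture, it draws the intermediate destinations from a random permutation sigma):

```python
import random

def bit_fix_route(src, dst, n):
    """Bit fixing: flip differing address bits from the most
    significant bit down, hopping to a neighbor at each flip."""
    route, cur = [src], src
    for b in range(n - 1, -1, -1):
        if (cur ^ dst) & (1 << b):
            cur ^= 1 << b
            route.append(cur)
    return route

def randomized_routes(d, n):
    """Two-phase randomized routing on the n-cube: phase 1 sends
    packet i to sigma(i) for a random permutation sigma; phase 2
    sends it on to its real destination d[i].  Returns both lists
    of routes (the queueing itself is not simulated here)."""
    N = 2 ** n
    sigma = list(range(N))
    random.shuffle(sigma)                 # random intermediate stops
    phase1 = [bit_fix_route(i, sigma[i], n) for i in range(N)]
    phase2 = [bit_fix_route(sigma[i], d[i], n) for i in range(N)]
    return phase1, phase2
```

Each phase-1 route ends where the corresponding phase-2 route begins, so the two phases compose into a full route for every packet.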

To see this, notice that during bit-fixing routing, an intermediate address always looks like D_1, ..., D_k, S_{k+1}, ..., S_n, where S_i is a bit of the source address and D_j is a bit of the destination address. If two routes collide at the k-th step, that means their destination addresses agree in their first k bits and their source addresses agree in their last n-k bits. Let k_0 be the first value of k for which the packets collide. At each time step, we fix one more bit of the destination address, which means we increment k. Eventually, the destination bits must disagree (because the destinations are different). Let k_1 be the value of k at which this happens. Then the destination bit D_{k_1} is different for the two packets. At that point, the two packets separate. They will never collide again, because all the later intermediate addresses will include the D_{k_1} bit. This observation is the crux of the proof of the following theorem:

Theorem: The delay of a packet v_i is at most |S|, where S is the set of packets whose routes intersect v_i's route.

We won't give a detailed proof. It's enough to notice that whenever the routes of two packets intersect, one of the packets may be delayed by one time step. Once that packet is delayed by one time step at the first shared node, it will flow along the shared route behind the other packet, and will not be delayed any more by that packet. If the same route intersects other routes, each of them may add a delay of one time step. This happens either (i) because another packet collides with the current packet along a shared route, or (ii) because another packet collides with a packet that is ahead of the current packet along a part of the route which is shared by all three. In either case, an extra delay of at most one results. To get the running time of this scheme, we compute the expected value of the size of the set S above. Define an indicator random variable H_ij which is 1 when the routes of packet v_i and packet v_j share at least one edge, and 0 otherwise.
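For a concrete set of routes, the quantity |S| in the theorem is simple to compute; here is a sketch (the name `delay_bound` is mine):

```python
def delay_bound(routes, i):
    """Upper bound on the delay of packet i from the theorem above:
    the number of other routes sharing at least one directed edge
    with route i."""
    edges_i = {(routes[i][t], routes[i][t + 1])
               for t in range(len(routes[i]) - 1)}
    count = 0
    for j, r in enumerate(routes):
        if j == i:
            continue
        if any((r[t], r[t + 1]) in edges_i for t in range(len(r) - 1)):
            count += 1
    return count

# Three routes on the 3-cube; routes 1 and 2 each share an edge with route 0.
routes = [[0b000, 0b100, 0b110], [0b000, 0b100], [0b100, 0b110, 0b111]]
print(delay_bound(routes, 0))  # -> 2
```

Note that each intersecting route is counted once, no matter how many edges it shares, matching the "delayed at most once per intersecting route" argument.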
Then by the above theorem, the expected delay of packet v_i is at most the expected size of S, which is

    E[ sum_{j != i} H_ij ]

It's rather difficult to get an exact estimate of this quantity, because "H_ij = 1" is a complicated condition. It's easier to think about T(e), the number of routes that pass through a given edge e. Now suppose the route of packet v_i consists of the edges (e_1, e_2, ..., e_k). Every route that intersects v_i's route passes through at least one of these edges, and so

    sum_{j != i} H_ij  <=  sum_{l=1}^{k} T(e_l)

To use this bound we next compute E[T(e)], which turns out to be 1/2. To see that, notice that by symmetry

    E[T(e)] = (expected sum of the lengths of all routes) / (total edges in the network)

The expected sum of the lengths of all routes is the expected length of a route times N (the number of routes). The average length of a route is n/2, because an n-bit source address differs from a random destination address in n/2 bits on average. So the expected sum of route lengths is Nn/2. The total number of edges in the network is the number of nodes times the number of outbound links per node, which is Nn. So

    E[T(e)] = (Nn/2) / (Nn) = 1/2

Then if the path for packet v_i has k edges along it,

    E[ sum_{l=1}^{k} T(e_l) ] = k/2 <= n/2

Now we can apply a Chernoff bound to the probability of there being a substantial number of paths intersecting v_i's path. The Chernoff bound we use is

    Pr[ sum_{j != i} H_ij > (1 + delta) mu ] <= 2^(-mu delta)

(as an aside, this is the Chernoff bound for large delta, namely delta > 2e - 1). We now compute the probability that v_i is delayed at least 3n steps, so we require that (1 + delta) mu = 3n. Notice that we don't actually know what mu is, but we have a bound for it: mu <= 0.5n. It follows that mu delta = 3n - mu >= 2.5n. Thus the probability that v_i is delayed by at least 3n steps is bounded by 2^(-2.5n). This is a bound for the probability that a given packet is delayed more than 3n steps. But we would like a bound for the probability that some packet gets delayed more than 3n steps. For that, it's enough to use the sum rule (union bound) for probabilities. There are N = 2^n routes in total, so the probability that any of them is delayed more than 3n steps is bounded above by 2^n * 2^(-2.5n) = 2^(-1.5n). So we can make the following assertion:

    With probability at least 1 - 2^(-1.5n), every packet reaches its destination sigma(i) in 4n or fewer steps.

The 4n comes from the delay time (3n) plus the time for the bit-fixing steps, which is at most n. Notice that all of this applies to just one phase of the algorithm (recall that phase one routes a packet to a random location, and phase two routes it from there to its actual destination).
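The arithmetic in this bound is easy to check numerically; the following sketch (the helper name is mine) just evaluates the chain mu <= n/2, mu*delta >= 2.5n, union bound over 2^n packets:

```python
def phase_failure_bound(n):
    """Union bound on the probability that some packet in one phase
    is delayed more than 3n steps: each of the 2^n packets fails
    with probability at most 2^(-2.5n), giving 2^(-1.5n) overall."""
    mu_max = 0.5 * n              # E[delay] = mu <= n/2
    mu_delta = 3 * n - mu_max     # (1+delta)*mu = 3n  =>  mu*delta >= 2.5n
    per_packet = 2.0 ** (-mu_delta)   # Chernoff bound 2^(-mu*delta)
    return (2 ** n) * per_packet      # union bound over N = 2^n packets

# Already for a modest machine the failure probability is tiny:
print(phase_failure_bound(10))  # -> 2^(-15), about 3.05e-05
```

The bound shrinks exponentially in n, which is why "with high probability" here is meaningful even for small machines.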
So, applying the same bound to each of the two phases, the full algorithm routes all packets to their destinations with high probability in 8n or fewer steps.
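As a sanity check, the whole scheme can be simulated under simplifying assumptions (unit-capacity directed links, ties broken by packet index, phases run back to back); all names here are mine, and this is a sketch rather than the lecture's model:

```python
import random

def bit_fix_route(src, dst, n):
    """Bit fixing, most significant differing bit first."""
    route, cur = [src], src
    for b in range(n - 1, -1, -1):
        if (cur ^ dst) & (1 << b):
            cur ^= 1 << b
            route.append(cur)
    return route

def deliver(routes):
    """Synchronous store-and-forward: per time step, each directed
    link carries at most one packet.  Returns the number of steps
    until every packet reaches the end of its route."""
    pos = [0] * len(routes)           # index of each packet on its route
    steps = 0
    while any(pos[i] < len(r) - 1 for i, r in enumerate(routes)):
        used = set()                  # links claimed this step
        for i, r in enumerate(routes):
            if pos[i] < len(r) - 1:
                edge = (r[pos[i]], r[pos[i] + 1])
                if edge not in used:  # link free: advance one hop
                    used.add(edge)
                    pos[i] += 1
        steps += 1
    return steps

n = 6
N = 2 ** n
d = list(range(N)); random.shuffle(d)           # a permutation to route
sigma = list(range(N)); random.shuffle(sigma)   # random intermediate stops
t1 = deliver([bit_fix_route(i, sigma[i], n) for i in range(N)])
t2 = deliver([bit_fix_route(sigma[i], d[i], n) for i in range(N)])
print(t1 + t2)  # total steps for both phases; typically far below 8n
```

This toy model queues at links rather than modeling per-processor queues in detail, but it illustrates the point: observed totals are usually much smaller than the 8n worst-case bound.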