Lecture 4: Graph Algorithms
Definitions
- Undirected graph: G = (V, E); V is a finite set of vertices, E is a finite set of edges; any edge e = (u, v) is an unordered pair.
- Directed graph: the edges are ordered pairs.
- If e = (u, v) is an edge in an undirected graph => e is incident on vertices u and v.
- If the graph is directed => e is incident from u and incident into v.
- If e = (u, v) is an edge and G is undirected => u and v are adjacent to each other.
- If e = (u, v) is an edge and G is directed => v is adjacent to u.
Definitions
[Figure: an example undirected graph and an example directed graph]
Definitions
- A path from u to v: a sequence <v_0, v_1, v_2, ..., v_k> of vertices where v_0 = u, v_k = v, and each (v_i, v_{i+1}) is an edge. Path length = number of edges in the path.
- u is reachable from v if there exists a path from v to u.
- Simple path: all of its vertices are distinct.
- Cycle: a path with v_0 = v_k.
- Acyclic graph: a graph without cycles.
- Simple cycle: all the intermediate vertices are distinct.
- Connected graph: every pair of vertices is connected by a path.
- G' = (V', E') is a subgraph of G = (V, E) if V' is a subset of V and E' is a subset of E.
- Complete graph: each pair of vertices is adjacent.
- Weighted graph: each edge has an associated weight.
Adjacency Matrix Representation
- a_{i,j} = 1 if (v_i, v_j) ∈ E, and 0 otherwise
- Space complexity = Θ(|V|^2)
- Lower bound for an algorithm (using the adjacency matrix representation) that needs to traverse all the edges = Ω(|V|^2)
Adjacency List Representation
- Adj[v] = list of vertices adjacent to v
- Space complexity = Θ(|V| + |E|)
- Efficient when the graph is sparse (i.e. |E| << |V|^2)
- Lower bound for an algorithm (using the adjacency list representation) that needs to traverse all the edges = Ω(|V| + |E|)
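The two representations above can be sketched as follows; the 4-vertex example graph is an illustrative assumption.

```python
# Sketch: both representations of a small undirected graph (assumed example).
n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

# Adjacency matrix: Theta(|V|^2) space, O(1) edge lookup.
matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    matrix[u][v] = matrix[v][u] = 1   # undirected: symmetric entries

# Adjacency list: Theta(|V| + |E|) space, efficient when the graph is sparse.
adj = {v: [] for v in range(n)}
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)
```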
Minimum Spanning Tree: Prim's Algorithm
- Spanning tree of an undirected graph G: a subgraph that is a tree containing all the vertices of G.
- Minimum spanning tree (MST) of a weighted graph: a spanning tree with minimum weight.
- If G is not connected it cannot have a spanning tree; it has a spanning forest instead. Determining the spanning forest => apply an MST algorithm to each connected component.
- Assumption: we consider only connected graphs.
MST: Example
[Figure: a weighted graph G and an MST of G]
Prim's Algorithm
- Greedy algorithm, i.e. it makes whatever is the best choice at the moment.
- Idea: select an arbitrary starting vertex, then grow the tree by repeatedly choosing a new vertex and edge that are guaranteed to be in the MST.
- Notation: V_T = set that holds the vertices of the MST during its construction; d[v] = the weight of the least-weight edge from any vertex in V_T to vertex v.
- Complexity: T_s = Θ(n^2). The best implementation, using priority queues, takes O(|E| log |V|).
Prim's Algorithm
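As a concrete reference, here is a minimal sequential sketch of Prim's algorithm using the d[] notation from the slides. The 4-vertex weight matrix is an illustrative assumption, not the example from the lecture.

```python
# Minimal Prim's sketch: d[v] holds the lightest edge weight from the
# growing tree V_T to vertex v.  Theta(n^2) for the n scans of d[].
INF = float("inf")

def prim_mst_weight(n, w, start=0):
    """n vertices 0..n-1, w[u][v] = edge weight or INF. Returns MST weight."""
    in_tree = [False] * n
    d = [INF] * n
    d[start] = 0
    total = 0
    for _ in range(n):
        # pick the cheapest vertex not yet in the tree
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: d[v])
        in_tree[u] = True
        total += d[u]
        for v in range(n):            # relax edges leaving the new tree vertex
            if not in_tree[v] and w[u][v] < d[v]:
                d[v] = w[u][v]
    return total

# Assumed example graph: edges (0,1,1), (1,2,2), (0,2,3), (2,3,4).
w = [[INF] * 4 for _ in range(4)]
for u, v, c in [(0, 1, 1), (1, 2, 2), (0, 2, 3), (2, 3, 4)]:
    w[u][v] = w[v][u] = c
```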
Prim's Algorithm: Example
[Figure: step-by-step construction of the MST by Prim's algorithm]
Prim's Algorithm: Parallel Formulation
- It is hard to select more than one vertex to include in the MST: the iterations of the while loop cannot be easily parallelized.
- Parallelization: partition V into p subsets using the 1-D block mapping; V_i = the set of vertices assigned to P_i.
- Steps:
  - Each P_i computes d_i[u] = min { d_i[v] | v ∈ (V - V_T) ∩ V_i }
  - All-to-one reduction to compute the minimum over all d_i[u] at P_0
  - P_0 inserts the new vertex u into V_T and broadcasts u to all processes
  - The P_i responsible for vertex u marks it as belonging to V_T
  - Each process updates the value of d[v] for its local vertices
Prim's Algorithm: Parallel Formulation
Parallel Prim's Algorithm: Analysis
Steps (executed n times):
- P_i computes d_i[u] = min { d_i[v] | v ∈ (V - V_T) ∩ V_i } => takes Θ(n/p)
- All-to-one reduction to compute the minimum over all d_i[u] at P_0 => takes Θ(log p)
- P_0 inserts the new vertex u into V_T and broadcasts it to all processes => takes Θ(log p)
- The P_i responsible for vertex u marks it as belonging to V_T => takes Θ(1)
- Each process updates the value of d[v] for its local vertices => takes Θ(n/p)
T_p = Θ(n^2/p) + Θ(n log p)
Cost optimal if (p log p)/n = O(1) => p = O(n / log n)
Isoefficiency function: Θ(p^2 log^2 p)
Single-Source Shortest Paths: Dijkstra's Algorithm
Edsger W. Dijkstra (1930-2002) http://www.cs.utexas.edu/users/ewd/
- Single-source shortest paths problem: find the shortest paths from a vertex s to all other vertices in V. Shortest path = minimum-weight path.
- Dijkstra's algorithm (1959) solves the problem on both directed and undirected graphs with non-negative weights.
- Similar to Prim's algorithm; greedy: it always chooses an edge to the vertex that appears closest.
- l[u] = minimum cost to reach vertex u from vertex s by means of vertices in V_T.
Single-Source Shortest Paths: Dijkstra's Algorithm
T_s = Θ(n^2)
Example: Dijkstra's Single-Source Shortest-Paths Algorithm
[Figure: six snapshots of Dijkstra's algorithm on a 5-vertex graph with source s; the final shortest-path costs are l[u] = 8, l[v] = 9, l[x] = 5, l[y] = 7]
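A minimal Θ(n^2) sketch of the algorithm, matching the slide's l[] notation. The edge weights below are an assumption chosen so that the final distances reproduce the worked example's values (u = 8, v = 9, x = 5, y = 7).

```python
# Array-based Dijkstra: in each of the n rounds, pick the closest vertex
# not yet in V_T, then relax the edges leaving it.
INF = float("inf")

def dijkstra(n, w, s):
    """w[u][v] = non-negative edge weight or INF; returns list of costs l[]."""
    in_vt = [False] * n
    l = [INF] * n
    l[s] = 0
    for _ in range(n):
        u = min((v for v in range(n) if not in_vt[v]), key=lambda v: l[v])
        in_vt[u] = True
        for v in range(n):                  # relax edges out of u
            if not in_vt[v] and l[u] + w[u][v] < l[v]:
                l[v] = l[u] + w[u][v]
    return l

# Assumed directed graph; vertices s=0, u=1, v=2, x=3, y=4.
w = [[INF] * 5 for _ in range(5)]
for a, b, c in [(0, 1, 10), (0, 3, 5), (1, 2, 1), (1, 3, 2), (3, 1, 3),
                (3, 2, 9), (3, 4, 2), (2, 4, 4), (4, 0, 7), (4, 2, 6)]:
    w[a][b] = c
```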
Parallel Dijkstra's Algorithm: Analysis
Same as Prim's algorithm:
- T_p = Θ(n^2/p) + Θ(n log p)
- Cost optimal if (p log p)/n = O(1) => p = O(n / log n)
- Isoefficiency function: Θ(p^2 log^2 p)
All-Pairs Shortest Paths
- All-pairs shortest paths problem: find the shortest paths between all pairs of vertices v_i, v_j such that i ≠ j.
- Output: an n x n matrix D = (d_{i,j}) such that d_{i,j} is the cost of the shortest path from v_i to v_j.
- Dijkstra's algorithm: for graphs with non-negative weights. Idea: apply Dijkstra's algorithm from each vertex => T_s = Θ(n × n^2) = Θ(n^3)
- Floyd's algorithm: for graphs having negative weights but no negative-weight cycles.
Dijkstra's Algorithm: Parallel Formulation
Two parallel formulations:
- Source-partitioned formulation: partition the vertices among the processes and have each process compute the single-source shortest paths for all vertices assigned to it.
- Source-parallel formulation: assign each vertex to a set of processes and use the parallel formulation of the single-source shortest paths algorithm.
Source-Partitioned Formulation
- Uses n processes. Each process P_i finds the shortest paths from vertex v_i to all other vertices by executing the sequential Dijkstra's algorithm locally.
- If the adjacency matrix is replicated at each process => no communication.
- T_p = Θ(n^2); S = Θ(n^3)/Θ(n^2) = Θ(n); E = Θ(1) => very good algorithm!
- Isoefficiency: at most n processes can be used => p = n => W = Θ(n^3) = Θ(p^3) => not very scalable!
Source-Parallel Formulation
- Uses the parallel Dijkstra's algorithm for each vertex => p > n
- The p processes are divided into n partitions, each with p/n processes; each partition solves one single-source shortest paths problem.
- Total number of processes that can be used efficiently: O(n^2)
- T_p = Θ(n^2/(p/n)) + Θ(n log(p/n)) = Θ(n^3/p) + Θ(n log p)
- Cost optimal if p log p = O(n^2) => p = O(n^2 / log n)
- Isoefficiency due to communication: Θ((p log p)^1.5); isoefficiency due to concurrency: Θ(p^1.5) (because p = O(n^2)); overall isoefficiency: Θ((p log p)^1.5)
- More scalable than the source-partitioned formulation: it exploits more parallelism!
Floyd's Algorithm
Robert W. Floyd (1936-2001). "Bob used to say that he was planning to get a Ph.D. by the 'green stamp' method, namely by saving envelopes addressed to him as 'Dr. Floyd'. After collecting 500 such letters, he mused, a university somewhere in Arizona would probably grant him a degree." - D. Knuth (sigact.acm.org/floyd/)
Observations:
- A_k = {v_1, v_2, ..., v_k}, a subset of the vertices, for k <= n
- p_{i,j}^(k) = the minimum-weight path from v_i to v_j whose intermediate vertices belong to A_k; d_{i,j}^(k) = weight of p_{i,j}^(k)
- If v_k is not on the shortest path from v_i to v_j, then p_{i,j}^(k) = p_{i,j}^(k-1)
- If v_k is in p_{i,j}^(k), break p_{i,j}^(k) into two paths v_i -> v_k and v_k -> v_j. These paths use vertices from {v_1, v_2, ..., v_{k-1}} => d_{i,j}^(k) = d_{i,k}^(k-1) + d_{k,j}^(k-1)
- Recurrence relation:
  d_{i,j}^(k) = w(v_i, v_j)                                              if k = 0
  d_{i,j}^(k) = min{ d_{i,j}^(k-1), d_{i,k}^(k-1) + d_{k,j}^(k-1) }      if k >= 1
- The solution is given by D^(n).
Floyd's Algorithm
- T_s = Θ(n^3)
- Space complexity = Θ(n^2)
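The recurrence above translates directly into three nested loops. The 4-vertex weight matrix below is an illustrative assumption.

```python
# Minimal Floyd's algorithm sketch implementing
#   d[i][j](k) = min( d[i][j](k-1), d[i][k](k-1) + d[k][j](k-1) ).
INF = float("inf")

def floyd(d):
    """d: n x n matrix of edge weights (INF if absent, 0 on the diagonal).
    Updates d in place to all-pairs shortest path costs; Theta(n^3) time."""
    n = len(d)
    for k in range(n):                      # allow v_k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

# Assumed example: a 4-vertex directed weighted graph.
D = [[0,   3,   INF, 7],
     [8,   0,   2,   INF],
     [5,   INF, 0,   1],
     [2,   INF, INF, 0]]
```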
Floyd's Algorithm: Example
[Figure: the successive matrices D^(0), D^(1), D^(2), D^(3) for a small example graph]
Floyd's Algorithm: Parallel Formulation
- Partition the matrix D^(k) using 2-D block partitioning.
- Each (n/√p x n/√p) block is assigned to one process.
Floyd's Algorithm: Parallel Formulation
- Each process updates its part of the matrix during each iteration.
- During the k-th iteration, each process needs certain segments of the k-th row and the k-th column.
- Idea: broadcast the corresponding segments row-wise and column-wise.
Floyd's Algorithm: Parallel Formulation
Parallel Floyd's Algorithm: Analysis
- Assumption: bisection bandwidth = Θ(p)
- Broadcast of n/√p elements to √p - 1 processes => Θ((n/√p) log p) per iteration
- Synchronization step => Θ(log p)
- Local computation of D^(k) => Θ(n^2/p)
- T_p = Θ(n^3/p) + Θ((n^2/√p) log p)
- Cost optimal if (√p log p)/n = O(1) => p = O(n^2 / log^2 n)
- Isoefficiency due to communication: Θ(p^1.5 log^3 p); due to concurrency: Θ(p^1.5); overall isoefficiency: Θ(p^1.5 log^3 p)
- The communication overhead can be reduced by pipelining the execution of the algorithm => the most scalable algorithm for all-pairs shortest paths.
Transitive Closure
- Problem: determine if two vertices in a graph are connected.
- Transitive closure of G: the graph G* = (V, E*) where E* = { (v_i, v_j) | there is a path from v_i to v_j in G }
- Connectivity matrix A* = (a*_{i,j}): a*_{i,j} = 1 if there is a path from v_i to v_j or i = j, and a*_{i,j} = ∞ otherwise.
- Method 1: assign a weight of 1 to each edge, use any of the all-pairs shortest paths algorithms, and obtain A* from D: a*_{i,j} = 1 if d_{i,j} is finite or i = j, and a*_{i,j} = ∞ if d_{i,j} = ∞.
- Method 2: use Floyd's algorithm on the adjacency matrix, replacing min and + by the logical operations or and and. Initially a_{i,j} = 1 if i = j or (v_i, v_j) is an edge of G, and a_{i,j} = 0 otherwise. Final matrix A*: a*_{i,j} = ∞ if a_{i,j} = 0, and a*_{i,j} = 1 otherwise.
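Method 2 can be sketched as Floyd's triple loop over booleans; the small directed graph is an illustrative assumption.

```python
# Transitive closure via Floyd's algorithm with (min, +) replaced by (or, and).
def transitive_closure(a):
    """a: n x n boolean matrix with a[i][i] = True and a[i][j] = True for
    each edge (v_i, v_j).  Updates a in place to the reachability matrix."""
    n = len(a)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                a[i][j] = a[i][j] or (a[i][k] and a[k][j])
    return a

# Assumed example: edges 0 -> 1 and 1 -> 2; vertex 3 is isolated.
A = [[i == j for j in range(4)] for i in range(4)]
A[0][1] = True
A[1][2] = True
transitive_closure(A)
```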
Transitive Closure: Example
[Figure: a graph G and its transitive closure G*]
Connected Components
- Connected components of an undirected graph G: the maximal disjoint sets C_1, C_2, ..., C_k such that V = C_1 ∪ C_2 ∪ ... ∪ C_k, and for any u, v in C_i, u is reachable from v and v is reachable from u.
- The C_i are the equivalence classes of the vertices under the "is reachable from" relation.
[Figure: a graph with 3 connected components]
Depth-First Search (DFS) Based Algorithm
Idea: search deeper in the graph.
DFS algorithm:
- Edges are explored out of the most recently discovered vertex v that still has unexplored edges leaving it.
- When all of v's edges have been explored, the search backtracks to explore edges leaving the vertex from which v was discovered.
- The process continues until all vertices reachable from the original source have been discovered.
- If any undiscovered vertices remain, one of them is selected as a new source and the search is repeated, until all vertices are discovered.
T_s = Θ(|V| + |E|)
Depth-First Search (DFS) Based Algorithm
Using DFS to Find Connected Components
[Figure: a graph G and the connected components of G]
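The restart-from-a-new-source idea above can be sketched with an iterative DFS that labels each component; the example adjacency lists are an illustrative assumption.

```python
# Label connected components: run DFS, and each time an undiscovered vertex
# remains, restart from it with a fresh component id.
def connected_components(adj):
    """adj: dict vertex -> list of neighbors (undirected graph).
    Returns dict vertex -> component id."""
    comp = {}
    cid = 0
    for s in adj:
        if s in comp:
            continue
        comp[s] = cid                   # new source => new component
        stack = [s]
        while stack:
            u = stack.pop()
            for v in adj[u]:            # explore edges out of u
                if v not in comp:
                    comp[v] = cid
                    stack.append(v)
        cid += 1
    return comp

# Assumed example: components {0,1,2}, {3,4}, {5}.
g = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3], 5: []}
labels = connected_components(g)
```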
Connected Components: Parallel Formulation
- Partition the adjacency matrix of G into p parts and assign each part to one of p processes; P_i gets subgraph G_i.
- Algorithm:
  - First step: each P_i computes the depth-first forest of G_i using the DFS algorithm => p spanning forests.
  - Second step: pairwise merging of the p spanning forests into one spanning forest.
- How to merge the forests? Solution: use disjoint-set operations.
Connected Components: Parallel Formulation
Disjoint-set operations:
- Find(x): returns a pointer to the representative element of the set containing vertex x (unique to each set).
- Union(x, y): unites the sets containing elements x and y.
Merging algorithm: initially each vertex forms its own set; then
  for each edge (u,v) in E
    if ( Find(u) != Find(v) ) then Union(u,v)
Complexity: O(n) — when merging two forests, at most n-1 edges of one forest are merged into the edges of the other.
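A minimal disjoint-set sketch of the merging loop above; the edge list and the path-compression optimization are assumptions for illustration, not part of the slides.

```python
# Union-find with path halving; Find returns the set representative.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving (an optimization)
        x = parent[x]
    return x

def union(x, y):
    rx, ry = find(x), find(y)
    if rx != ry:
        parent[rx] = ry

# The slide's merging loop, on an assumed edge list of one forest.
for u, v in [(0, 1), (1, 2), (3, 4)]:
    if find(u) != find(v):
        union(u, v)
```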
Parallel Connected Components: Example
[Figure: a graph G partitioned between processes P1 and P2, and the spanning forests each computes]
Parallel CC using 1-D Block Mapping
- The adjacency matrix is partitioned into p stripes.
- Assume a p-process message-passing system.
- Step 1: computing the spanning forest from an (n/p) x n adjacency matrix => Θ(n^2/p)
- Step 2: pairwise merging, performed by embedding a virtual tree on the processes => log p merging stages, each taking Θ(n) => Θ(n log p)
- Communication time: spanning forests (Θ(n) edges) are sent between nearest neighbors => Θ(n log p)
- T_p = Θ(n^2/p) + Θ(n log p)
- Cost optimal if p = O(n / log n)
- Isoefficiency due to communication and extra computation: Θ(p^2 log^2 p); due to concurrency: Θ(p); overall isoefficiency: Θ(p^2 log^2 p) => same as Prim's algorithm and Dijkstra's algorithm.
Algorithms for Sparse Graphs
- Sparse graphs: |E| << |V|^2
- Algorithms based on the adjacency list representation; lower bound => Ω(|V| + |E|). The lower bound depends on the sparseness of the graph.
- It is difficult to achieve an even work distribution and low communication overhead for sparse graphs. Possible partitioning methods:
  - Assign an equal number of vertices and their adjacency lists to each process => may lead to significant load imbalance.
  - Assign an equal number of edges to each process => may require splitting the adjacency list of a vertex => communication overhead increases.
- Hard to derive efficient parallel formulations for general sparse graphs; efficient parallel formulations exist for some structured sparse graphs.
Sparse Graphs
[Figure: a linear graph, a grid graph, and a random sparse graph]
Maximal Independent Set (MIS)
- A set of vertices I is called independent if no pair of vertices in I is connected via an edge in G.
- I is a maximal independent set if including any vertex not in I would violate the independence property.
MIS: Simple Algorithm
Initially I is empty and the set of candidates is C = V.
Algorithm:
  while (C not empty) {
    move a vertex v from C into I;
    remove all vertices adjacent to v from C;
  }
Correctness: I is an independent set because every vertex whose subsequent inclusion would violate independence is removed from C; I is maximal because any vertex not in I is adjacent to at least one vertex in I.
Inherently serial!
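A direct sketch of this sequential loop; the 5-cycle example graph and the deterministic choice of `min(C)` are illustrative assumptions.

```python
# Sequential MIS: repeatedly move a candidate into I and drop its neighbors.
def maximal_independent_set(adj):
    """adj: dict vertex -> set of neighbors. Returns a maximal independent set."""
    I = set()
    C = set(adj)                    # candidate set, initially all of V
    while C:
        v = min(C)                  # any choice works; min makes it deterministic
        I.add(v)
        C.discard(v)
        C -= adj[v]                 # remove all vertices adjacent to v
    return I

# Assumed example: a 5-cycle 0-1-2-3-4-0.
cycle5 = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
mis = maximal_independent_set(cycle5)
```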
MIS: Luby's Algorithm
Initially I is empty and the set of candidates is C = V.
Algorithm:
  while (C not empty) {
    - assign a unique random number to each vertex in C;
    - if a vertex has a random number smaller than all of the random numbers of its adjacent vertices, include it in I;
    - update C by removing the vertices included in I and their adjacent vertices;
  }
Suitable for parallelization!
MIS: Luby's Algorithm
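The rounds of Luby's algorithm can be simulated sequentially as below. The seeded generator and the 5-cycle test graph are assumptions; in each round every local-minimum vertex joins I, which is what makes the rounds parallelizable.

```python
# Sequential simulation of Luby's randomized MIS rounds.
import random

def luby_mis(adj, seed=1):
    """adj: dict vertex -> set of neighbors. Returns a maximal independent set."""
    rng = random.Random(seed)       # seeded for reproducibility (an assumption)
    I = set()
    C = set(adj)
    while C:
        r = {v: rng.random() for v in C}          # distinct w.h.p.
        winners = {v for v in C
                   if all(r[v] < r[u] for u in adj[v] & C)}
        I |= winners                # local minima join the independent set
        for v in winners:
            C.discard(v)
            C -= adj[v]             # drop the winners' neighbors too
    return I

cycle5 = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
mis = luby_mis(cycle5)
```

The globally smallest random number always wins its neighborhood, so every round makes progress and the loop terminates.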
Luby's Algorithm: Shared Address Space Formulation
- I[i] = 1 if v_i is part of the MIS, 0 otherwise
- C[i] = 1 if v_i is part of the candidate set, 0 otherwise
- R[i] stores the random number assigned to vertex v_i
- During each iteration the vector C is partitioned among the p processes. Each process:
  - generates a random number for each of its assigned vertices in C;
  - waits for all processes to generate their random numbers;
  - for each vertex assigned to it, checks whether it can be included in I; if yes, sets the corresponding entry of I to 1;
  - updates C.
- Concurrent writes of zero into C do not affect correctness (the same value is written).
Single-Source Shortest Paths: Johnson's Algorithm
- A variant of Dijkstra's algorithm.
- Uses a priority queue Q to store the value l[v] for each vertex in V - V_T.
- In the priority queue, the element with the smallest value of l is always at the front => implemented as a min-heap => O(log n) time per update.
- Overall complexity => O(|E| log n)
Single-Source Shortest Paths: Johnson's Algorithm
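A heap-based sketch in the spirit of the algorithm above. The textbook version updates keys inside the min-heap; this sketch uses the common lazy-deletion variant instead, and the example graph (the same assumed weights as in the earlier Dijkstra example) is for illustration.

```python
# Heap-based single-source shortest paths: O(|E| log n) overall, since each
# edge relaxation pushes at most one entry onto the heap.
import heapq

def sssp(adj, s):
    """adj: dict u -> list of (v, w) with non-negative w. Returns dict l[]."""
    l = {s: 0}
    done = set()
    pq = [(0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if u in done:
            continue                # stale entry: lazy deletion instead of update
        done.add(u)
        for v, w in adj[u]:         # relax edges out of u
            if v not in l or d + w < l[v]:
                l[v] = d + w
                heapq.heappush(pq, (l[v], v))
    return l

# Assumed example graph; vertices s=0, u=1, v=2, x=3, y=4.
g = {0: [(1, 10), (3, 5)], 1: [(2, 1), (3, 2)], 2: [(4, 4)],
     3: [(1, 3), (2, 9), (4, 2)], 4: [(0, 7), (2, 6)]}
```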
Johnson's Algorithm: Parallelization Strategy
Problem: how to maintain the priority queue efficiently in a parallel implementation?
Simple strategy: P_0 maintains the queue; the other processes compute new values of l[v] and send them to P_0 to update Q.
Limitations:
- A single process updates Q => O(|E| log n) time => no asymptotic speedup!
- During each iteration the algorithm updates roughly |E|/|V| vertices => no more than |E|/|V| processes can be kept busy.
How to alleviate these limitations?
Johnson's Algorithm: Parallelization Strategy
First limitation:
- Distribute the maintenance of Q over multiple processes => nontrivial.
- Achieves only a small speedup of O(log n) (assuming one update takes O(1)).
Second limitation:
- Extract more than one vertex from Q at a time: if v is at the top of Q, extract all vertices u that have l[u] = l[v].
- If the minimum weight over all the edges is m => extract all vertices u such that l[u] <= l[v] + m (these are called safe vertices, i.e. updating them will not modify the next minimum in Q).
- Complicated algorithm with limited concurrency.
Johnson's Algorithm: Parallelization Strategy
Alternate approach: design a parallel algorithm that processes both safe and unsafe vertices concurrently, as long as they can be reached from the source via a path involving vertices whose shortest paths have already been computed.
- Each of the p processes extracts one of the top p vertices from Q and updates the values of the adjacent vertices.
- This does not ensure that the l value of a vertex extracted from the priority queue corresponds to the cost of the shortest path.
- Solution: detect when the shortest path to a vertex has been computed incorrectly and insert the vertex back into Q with the updated l value.
- Example: u has just been extracted, v has already been extracted, and u is adjacent to v. If l[v] + w(v,u) < l[u], then the shortest path to u was computed incorrectly => insert u back with l[u] = l[v] + w(v,u).
Johnson's Algorithm: Parallelization Strategy
[Figure: processing unsafe vertices concurrently across processes P_0, P_1, P_2]
Example: P_0 finds l[h] + w(h,g) = 5 < l[g] = 10 (from the previous iteration) => g is inserted back into Q with l[g] = 5.
Johnson's Algorithm: Distributed Memory Formulation
Idea: remove the bottleneck of working with a single priority queue.
- V is partitioned among the p processes.
- P_i has: a local priority queue Q_i, and an array sp[] that stores the shortest path from the source to the vertices assigned to P_i.
- sp[v] is updated from l[v] each time v is extracted from the queue.
- Initially sp[v] = ∞ and sp[s] = 0.
- Each process executes Johnson's algorithm locally. At the end, sp[v] stores the shortest path from the source to vertex v.
Johnson's Algorithm: Distributed Memory Formulation
How to maintain the queues? Assume (u,v) is an edge and P_i has just extracted u from Q_i:
- P_i sends l[u] + w(u,v) to P_j (the process responsible for v).
- P_j receives the message and sets l[v] in Q_j to min{ l[v], l[u] + w(u,v) }.
- P_j might have already computed sp[v] => two cases:
  - If sp[v] <= l[u] + w(u,v) => longer path => P_j does nothing.
  - If sp[v] > l[u] + w(u,v) => P_j must insert v back into Q_j with l[v] = l[u] + w(u,v) and disregard the value sp[v].
- The algorithm terminates when all the queues are empty.
Johnson's Algorithm: Distributed Memory Formulation
[Figure: the wave of activity in the priority queues for a grid graph]