Advanced Data Management


Advanced Data Management Medha Atre Office: KD-219 atrem@cse.iitk.ac.in Sept 26, 2016

Reachability query defined
Given a graph $G(V, E)$ with $V$ as the set of nodes and $E$ as the set of edges, a reachability query asks whether there exists any path between nodes $x$ and $y$. The pair of nodes can be any two nodes from the given graph. In a directed graph the path must be directed; in an undirected graph the path is undirected. The graphs may have cycles too, giving rise to strongly connected components.
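As a baseline with no index at all, a single reachability query can be answered by a breadth-first search from $x$. The sketch below assumes the graph is stored as a dictionary of successor lists; the names `reachable` and `graph` are illustrative and do not come from the lecture.

```python
from collections import deque

def reachable(adj, x, y):
    """Return True if a directed path from x to y exists.

    adj maps each node to a list of its successors (assumed representation).
    """
    if x == y:
        return True
    seen = {x}
    queue = deque([x])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == y:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Small example: a cycle a -> b -> c -> a, plus d -> a.
graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
```

Each query costs $O(|V| + |E|)$ in the worst case, which is what the indexing schemes discussed here try to avoid.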

Challenges
Identification of strongly connected components and merging them. Efficient traversal over large graphs, e.g., with several million vertices and edges. High computational complexity, $O(|V|^3)$, for standard algorithms like all-pairs shortest path, which also gives us the transitive closure for answering reachability. High storage space, $O(|V|^2)$, if the entire transitive closure is to be stored.
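The first challenge, merging strongly connected components, can be sketched with Kosaraju's two-pass algorithm. This is one standard choice, not necessarily the one used by any of the systems discussed; the function name `condense` and the dictionary-based graph representation are assumptions made for illustration.

```python
def condense(adj):
    """Merge each strongly connected component into one node.

    Returns (comp, dag): comp maps a node to its component representative,
    dag maps each representative to the set of representatives it points to.
    """
    nodes = set(adj) | {v for vs in adj.values() for v in vs}
    radj = {n: [] for n in nodes}
    for u in adj:
        for v in adj[u]:
            radj[v].append(u)

    # Pass 1: record nodes in order of DFS finishing time (iterative DFS).
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(adj.get(start, ())))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(adj.get(nxt, ()))))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph in decreasing finishing time.
    comp = {}
    for start in reversed(order):
        if start in comp:
            continue
        comp[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for prev in radj[node]:
                if prev not in comp:
                    comp[prev] = start
                    stack.append(prev)

    # Build the DAG over component representatives.
    dag = {c: set() for c in set(comp.values())}
    for u in adj:
        for v in adj[u]:
            if comp[u] != comp[v]:
                dag[comp[u]].add(comp[v])
    return comp, dag

# Example: the cycle a -> b -> c -> a collapses into one node; d stays separate.
comp, dag = condense({"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]})
```

After condensation the graph is a DAG, and two nodes in the same component are trivially reachable from each other.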

Key aspects to consider
Index creation time, e.g., $O(|V|^3)$ for a naïve technique like all-pairs shortest path. Index size, e.g., $O(|V|^2)$ for storing the complete transitive closure. Query answering time, $O(1)$ if we have the complete transitive closure. However, the first two aspects, the time to create the index and the index size, become prohibitively large if the graph is dense and has a large number of nodes. Hence several approaches have been proposed over the past decade to reduce index creation time, index size, or both, at the cost of some query answering time.
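The naïve end of this trade-off can be made concrete with Warshall's algorithm, which builds the full transitive closure in $O(|V|^3)$ time and $O(|V|^2)$ space and then answers each query with one lookup. This is a minimal sketch assuming nodes are numbered $0, \dots, n-1$; the function name is illustrative.

```python
def transitive_closure(n, edges):
    """Return an n x n boolean matrix with reach[u][v] True iff u can reach v
    by a path of at least one edge (Warshall's algorithm)."""
    reach = [[False] * n for _ in range(n)]
    for u, v in edges:
        reach[u][v] = True
    for k in range(n):            # allow k as an intermediate node
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach

# Index creation: O(|V|^3) time, O(|V|^2) space.
closure = transitive_closure(4, [(0, 1), (1, 2), (3, 0)])
# Query answering: O(1) lookup, e.g. closure[3][2].
```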

Existing published approaches that answer queries using only the index (node-idx-labels)
These focus on building a reachability index such that, for any pair of nodes $(u, v)$, the reachability query over them can be answered using only this index, without traversing any part of the original graph. 2-Hop [SODA 2002], Path-Tree [SIGMOD 2008], 3-Hop [SIGMOD 2009], PWAH8 [SIGMOD 2011], TF-label [SIGMOD 2013], Hierarchical Labeling (HL) and Distribution Labeling (DL) [VLDB 2013]. Not all of these build indexes of size $O(|V|^2)$; they try to improve upon this worst-case bound.

Existing published approaches that answer queries by partly using the index
These focus on reducing the time spent in building the index and the size (space) of the index. For certain queries they have to traverse part of the graph to give a definitive answer, because the index is incomplete and cannot settle those queries on its own. Tree+SSPI [VLDB 2005], GRIPP [SIGMOD 2007], GRAIL [VLDB 2010], Ferrari [ICDE 2013], SCARAB [SIGMOD 2012].

2-Hop [SODA 2002]
The aim is to reduce the size of the reachability index using a method called 2-hop cover, such that the entire transitive closure is not needed. 2-hop labeling: each vertex $v \in V$ has a label $L(v) = (L_{in}(v), L_{out}(v))$. $L_{in}(v)$ is a set of nodes $x \in V$ from which $v$ can be reached (a subset of such nodes, not necessarily all of them). $L_{out}(v)$ is defined similarly as a set of nodes reachable from $v$. $u \leadsto v$ iff $L_{out}(u) \cap L_{in}(v) \neq \emptyset$. 2-hop covers: for every $(u, v)$ s.t. $u \leadsto v$, let $P_{uv}$ be the set of all paths from $u$ to $v$. A hop is defined as a pair $(H, u)$ s.t. $H$ is a path with $u$ as one of its endpoints. $u$ is said to be the handle of hop $H$.
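The query side of 2-hop labeling is a set intersection. The sketch below hand-builds labels for the chain a -> b -> c using b as the single hop centre; the label contents and the function name `two_hop_reachable` are assumptions chosen for illustration, not output of the published algorithm.

```python
def two_hop_reachable(l_out, l_in, u, v):
    """u reaches v iff L_out(u) and L_in(v) share at least one hop centre."""
    return not l_out[u].isdisjoint(l_in[v])

# Hand-built labels for the chain a -> b -> c, with b as the only hop centre.
l_out = {"a": {"b"}, "b": {"b"}, "c": set()}
l_in = {"a": set(), "b": {"b"}, "c": {"b"}}
```

With these labels, all three reachable pairs (a, b), (a, c), (b, c) are answered through the shared centre b, while the transitive closure would store each pair explicitly.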

A collection of such hops is called a 2-hop cover if, for every $u, v \in V$ where $v$ is reachable from $u$, there exist two hops $(H_1, u)$ and $(H_2, v)$ such that the concatenation $H_1 H_2$ gives some path from $u$ to $v$. The optimal 2-hop cover problem is NP-hard. It can, however, be cast as an instance of the minimum set-cover problem, so we use the greedy set-cover method for the 2-hop cover problem. Start with a ground set $T = \{(u, v) \mid P_{uv} \neq \emptyset\}$. We update this $T$, calling it $T'$ in each iteration, by removing pairs of nodes as we cover them with hops.

For each vertex $w \in V$, we construct an undirected bipartite graph $G_w = (V_w, E_w)$ where $V_w$ contains $L_{in}$ and $L_{out}$ nodes representing the respective sets for every other node in the graph. There is an edge between the nodes for $u$ and $v$ if $w$ is a node on some path from $u$ to $v$. From all such bipartite graphs, iteratively choose the $w$ for which the following ratio is maximized, until $T'$ is empty (equivalent to greedy set-cover): $\frac{|\{(u, v)\} \cap T'|}{|C_{in}(v)| + |C_{out}(u)|}$. This greedy method is also equivalent to finding the densest subgraph of each $G_w$.
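The greedy loop can be sketched in a deliberately simplified form. The version below is an assumption-laden illustration, not the published algorithm: it first materialises the full transitive closure, and for each candidate centre $w$ it takes every still-uncovered pair that passes through $w$ instead of searching for the densest subgraph of $G_w$. All names are hypothetical.

```python
def greedy_two_hop(n, edges):
    """Simplified greedy 2-hop cover for nodes 0..n-1.

    Returns (l_in, l_out, reach). Simplification: each candidate centre w
    covers ALL uncovered pairs through w, not a densest subgraph of G_w.
    """
    # Transitive closure; reach[u][u] is True so a node can be its own centre.
    reach = [[u == v for v in range(n)] for u in range(n)]
    for u, v in edges:
        reach[u][v] = True
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True

    # Ground set T': reachable pairs not yet covered by any hop centre.
    uncovered = {(u, v) for u in range(n) for v in range(n)
                 if u != v and reach[u][v]}
    l_in = [set() for _ in range(n)]
    l_out = [set() for _ in range(n)]

    while uncovered:
        best = None  # (ratio, centre, pairs covered through that centre)
        for w in range(n):
            covered = {(u, v) for (u, v) in uncovered
                       if reach[u][w] and reach[w][v]}
            if not covered:
                continue
            sources = {u for u, _ in covered}
            targets = {v for _, v in covered}
            ratio = len(covered) / (len(sources) + len(targets))
            if best is None or ratio > best[0]:
                best = (ratio, w, covered)
        _, w, covered = best
        for u, v in covered:
            l_out[u].add(w)   # u reaches w
            l_in[v].add(w)    # w reaches v
        uncovered -= covered
    return l_in, l_out, reach

l_in, l_out, reach = greedy_two_hop(
    6, [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (5, 3)])
```

The loop always terminates because any uncovered pair $(u, v)$ is covered by choosing $w = u$. The resulting labels answer every query for distinct nodes by intersecting `l_out[u]` with `l_in[v]`.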

GRAIL [VLDB 2010]
Background: for a directed tree-shaped graph, we can create a complete interval reachability label of type $[l, r]$ for each node in 1 dimension, s.t., for any pair of nodes $(u, v)$, if $u \leadsto v$ then $v$'s label interval is completely contained within $u$'s label interval. The above observation does not hold for an acyclic graph which is not a tree.
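For a tree, one way to obtain such intervals is to let $r$ be the post-order rank of a node and $l$ the smallest rank in its subtree. The sketch below assumes this particular construction and a dictionary of child lists; `label_tree` and `contains` are illustrative names.

```python
def label_tree(children, root):
    """Assign each node an interval (l, r): r is its post-order rank and
    l is the smallest rank found in its subtree."""
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        low = None
        for child in children.get(node, ()):
            visit(child)
            child_low = labels[child][0]
            low = child_low if low is None else min(low, child_low)
        counter += 1
        labels[node] = (counter if low is None else low, counter)

    visit(root)
    return labels

def contains(outer, inner):
    """True iff interval inner lies completely inside interval outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

# Tree: r has children a and b; a has children c and d.
labels = label_tree({"r": ["a", "b"], "a": ["c", "d"]}, "r")
```

On a tree, `contains(labels[u], labels[v])` is an exact reachability test; on a general DAG the same test can report containment for unreachable pairs.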

Figure taken from Grail, PVLDB 2010.

Exceptions: (1, 2), (2, 4), (3, 4), (4, 7) etc. Exercise: compute all the exceptions from this node labeling scheme. Figure taken from Grail, PVLDB 2010.

The aim is to cover as many pairs of vertices of the graph as possible with hyperdimensional labels, because 1-dimensional labels are insufficient to report the true reachability, i.e., they may generate false positives as seen in the example before. To generate hyperdimensional labels, consider $k$ different spanning trees over the given directed acyclic graph, and for each spanning tree generate 1-dimensional labels for each node. At the end, combine the $k$ labels of each node, one from each spanning tree. A node $v$ is not reachable from $u$ if, for any of these $k$ pairs of label intervals $[l_v^i, r_v^i]$, $[l_u^i, r_u^i]$, $1 \le i \le k$, $v$'s label is not contained within $u$'s label. That is to say: with GRAIL, we will certainly know that $u \not\leadsto v$ by just checking the interval labels of $u, v$ whenever containment fails in some dimension.

Figure taken from Grail, PVLDB 2010.

If the $k$ interval labels of $u, v$ indicate that $u \leadsto v$ may hold, then we have to do a DFS walk over the original graph beginning from $u$ to check whether we actually reach $v$, due to the possibility of false positives. Key observation: if we consider a large enough number of DFS trees over the given DAG and generate interval labels for each of these trees, then we will most likely eliminate all false positives. Open question: can we come up with an upper bound on $k$ such that $k$-dimensional interval labeling will always eliminate all false positives? In practice $k$ is kept fixed and false positives are eliminated at run time. Optimization: while doing a DFS walk on the graph from $u$, if you are at vertex $w$ and $v$'s interval label is not contained in $w$'s, you can ignore that branch of the DFS since $w \not\leadsto v$.
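The labeling and query steps can be put together in a compact sketch. It assumes the input is already a DAG, uses $k$ randomised DFS traversals with post-order ranks, and falls back to a pruned DFS when the labels cannot rule reachability out. The function names, the `seed` parameter, and the example graph are assumptions, not taken from the GRAIL paper.

```python
import random

def grail_labels(adj, nodes, k, seed=0):
    """Build k interval labels (low, rank) per node from k random DFS runs."""
    rng = random.Random(seed)
    labels = {v: [] for v in nodes}
    has_parent = {v for vs in adj.values() for v in vs}
    roots = [v for v in nodes if v not in has_parent]
    for _ in range(k):
        rank = 0
        done = {}  # node -> (low, post-order rank) for this traversal

        def visit(node):
            nonlocal rank
            succs = list(adj.get(node, ()))
            rng.shuffle(succs)
            low = None
            for nxt in succs:
                if nxt not in done:
                    visit(nxt)
                nxt_low = done[nxt][0]  # also use already-visited successors
                low = nxt_low if low is None else min(low, nxt_low)
            rank += 1
            done[node] = (rank if low is None else low, rank)

        order = roots[:]
        rng.shuffle(order)
        for root in order:
            if root not in done:
                visit(root)
        for v in nodes:
            labels[v].append(done[v])
    return labels

def may_reach(labels, u, v):
    """False means u definitely cannot reach v; True means 'maybe'."""
    return all(lu[0] <= lv[0] and lv[1] <= lu[1]
               for lu, lv in zip(labels[u], labels[v]))

def grail_reachable(adj, labels, u, v):
    if not may_reach(labels, u, v):
        return False              # definite negative from the index alone
    if u == v:
        return True
    stack, seen = [u], {u}
    while stack:                  # DFS fallback to rule out false positives
        w = stack.pop()
        for nxt in adj.get(w, ()):
            if nxt == v:
                return True
            # Prune branches whose labels already exclude v.
            if nxt not in seen and may_reach(labels, nxt, v):
                seen.add(nxt)
                stack.append(nxt)
    return False

dag = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 6: [4]}
dag_nodes = list(range(7))
dag_labels = grail_labels(dag, dag_nodes, k=2)
```

Because every label is built from post-order ranks on a DAG, a reachable pair always passes the containment test in all $k$ dimensions, so a failed test is a safe negative answer.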

Partitioned Word-Aligned Hybrid compression [SIGMOD 2011]
Merge strongly connected components. Sort all the vertices in the resulting DAG using a topological sort, and assign a level $l_v$ to each node $v \in V_{dag}$ such that $l_v$ = (length of the longest possible path from a node $u$ to $v$, s.t., indegree($u$) = 0). Do a reverse DFS walk over this DAG, considering the topological sort. Key observation: when you do a reverse DFS walk on a DAG, vertices that are adjacent in the topological sort tend to cluster together in the adjacency list, hence techniques like run-length encoding can be applied to them. While doing a reverse DFS walk to compute reachability, compress contiguous sequences of vertices using compressed bitvectors; bitvector compression favours contiguous runs of 1s and 0s.
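The topological sort and the level assignment can be computed in one pass. The sketch below uses Kahn's algorithm, which is one standard option and an assumption here; `topo_levels` and the example DAG are illustrative.

```python
from collections import deque

def topo_levels(adj, nodes):
    """Return (order, level): a topological order of the DAG and, for each
    node, the length of the longest path reaching it from an indegree-0 node."""
    indeg = {v: 0 for v in nodes}
    for u in adj:
        for v in adj[u]:
            indeg[v] += 1
    level = {v: 0 for v in nodes}
    queue = deque(v for v in nodes if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj.get(u, ()):
            # Longest path to v goes through its deepest predecessor.
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order, level

pwah_dag = {"a": ["b", "c", "e"], "b": ["d"], "c": ["d"], "d": ["e"]}
order, level = topo_levels(pwah_dag, ["a", "b", "c", "d", "e"])
```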

Partitioned Word-Aligned Hybrid compression
Computing the reachability of any given intermediate node $w$ amounts to a bitwise OR of the compressed interval lists of all its children, adding the children themselves to the reachability bitvector, and then compressing the result. The Word-Aligned Hybrid compression scheme has limitations, hence the authors introduce the Partitioned Word-Aligned Hybrid (PWAH) scheme. The generated index is complete, and saves a significant amount of memory compared to approaches that store interval lists without any compression.
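The OR-of-children step can be illustrated with plain, uncompressed bitvectors. The sketch below uses Python integers as bitsets and processes nodes in reverse topological order; it leaves out the PWAH compression entirely, and the function names and example DAG are assumptions.

```python
def reachability_bitsets(adj, topo_order):
    """Compute, for every node of a DAG, a bitset of the nodes it can reach.

    Bit i of reach[v] is set iff v reaches topo_order[i]. No compression is
    applied here; PWAH would store these bitvectors in compressed form.
    """
    index = {v: i for i, v in enumerate(topo_order)}
    reach = {}
    for v in reversed(topo_order):      # children are processed before parents
        bits = 0
        for child in adj.get(v, ()):
            bits |= reach[child] | (1 << index[child])
        reach[v] = bits
    return index, reach

def bit_query(index, reach, u, v):
    """Answer 'does u reach v?' with a single bit test."""
    return bool((reach[u] >> index[v]) & 1)

bit_dag = {"a": ["b", "c", "e"], "b": ["d"], "c": ["d"], "d": ["e"]}
bit_index, bit_reach = reachability_bitsets(bit_dag, ["a", "b", "c", "d", "e"])
```

Because nodes are numbered by topological position, the set bits of each vector tend to form contiguous runs, which is the property that run-length style compression exploits.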

In the next class we will mainly review other reachability computation methods such as IP-labeling [SIGMOD 2014], TF-folding [SIGMOD 2013], and, time permitting, some others such as Chain-cover and Path-tree.