University of Waterloo
Software Engineering

Storing Directed Acyclic Graphs in Relational Databases

Spotify USA Inc
New York, NY, USA

Prepared by
Soheil Koushan
Student ID: 20523416
User ID: skoushan
4A Software Engineering
May 8, 2017

Soheil Koushan
212 Parkview Ave
Toronto, ON, M2N 3Y8

May 8, 2017

Dr. P. Lam, Director
Software Engineering
University of Waterloo
Waterloo, ON, N2L 3G1

Dear Dr. P. Lam:

This report, titled Storing Directed Acyclic Graphs in Relational Databases, is my third work term report. It is based on my experience at Spotify USA Inc during my 4A co-op term. Spotify is a music streaming service. During my co-op, I worked on internal tools focused on making data at Spotify easier for developers to discover and understand. Part of that work was uncovering the dependencies between different datasets. Hence, the need arose to store dependency information in a database that is easy to query. Our team designed a schema that performs much faster than the conventional approach. This report outlines our proposed design.

I would like to thank all my coworkers, as well as my supervisor, for giving me the opportunity to work on this problem. I would also like to thank the authors of the online resources, cited in the References and Acknowledgements sections, whose material helped me reach my conclusions.

I hereby confirm that I have received no help, other than what is mentioned above, in writing this report. I also confirm this report has not been previously submitted for academic credit at this or any other academic institution.

Sincerely,

Soheil Koushan
Student ID: 20523416

Executive Summary

Spotify, a leading music streaming service, has thousands of internal datasets, which are produced by hundreds of thousands of jobs running daily. Mapping the dependencies between these datasets is valuable for understanding what went into a piece of data and who consumes it. This mapping takes the form of a directed acyclic graph (DAG). In this report, I propose a schema for storing directed acyclic graphs in a relational database.

We consider directed acyclic graphs that have well-defined breakpoints. For example, a breakpoint can be a dataset that is recommended for consumption, while non-breakpoints can be intermediary datasets that are used in the final product but should not be consumed by others. We query this data by asking for all nodes downstream of a breakpoint, up until the next breakpoint. Our design criteria are write speed, read speed, and the space needed to store dense graphs using the schema.

The conventional approach is to use a table of nodes and a table of edges with transitive closure. The problem is that for dense graphs, the number of edges that need to be stored grows quadratically with the number of nodes. In addition, performing a database join is costly. Our proposed design stores only a nodes table, but includes an array field containing all the nodes that have a path to a given node, up until the previous breakpoint. This way, we encode in a single column information that would have been spread over many rows. This improves write time and eliminates the need for a join.

The proposed solution performs significantly better on all the design criteria and meets the design constraints. The recommendation is to use the proposed schema instead of the more general, conventional approach, because it is specifically optimized for the types of queries we require.

Table of Contents

Executive Summary
Table of Contents
List of Figures
List of Tables
1 Introduction
2 Problem Specification
  2.1 Problem Statement
  2.2 Design Constraints
  2.3 Design Criteria
3 Design alternatives
  3.1 Conventional design: adjacency list with transitive closure
  3.2 Proposed design: accumulation array
4 Evaluation
  4.1 Experimental Setup
  4.2 Write Speed
  4.3 Read Speed
  4.4 Space Complexity
5 Conclusion
6 Recommendations
References
Acknowledgements

List of Figures

Figure 2-1. An example directed acyclic graph.
Figure 3-1. Downstream query for the conventional design.
Figure 3-2. Downstream query for the proposed design.
Figure 4-1. The type of graph used for measurements. Here, n = 3.
Figure 4-2. Results for the write speed test.
Figure 4-3. Results for the read speed test.
Figure 4-4. Results for the space complexity test.

List of Tables

Table 3-1. An example node table in the adjacency list solution.
Table 3-2. An example edges table in the adjacency list solution.
Table 3-3. An example edges table with transitive closure. C is transitively a descendant of A through node B, hence Hops = 1.
Table 3-4. The nodes table for the proposed design, using the graph in Figure 2-1.

1 Introduction

Much of our data today is graph-structured. Social networks are an obvious example: people are nodes and friendships are edges. Another example is mapping, where nodes are places and edges are transportation options between them. Directed acyclic graphs are a special type of graph in which the edges are directed and no cycles exist. They are often used to model dependencies. At Spotify, this type of dependency information is valuable in understanding the relationships between different datasets.

Because Spotify has tasks running hundreds of thousands of times a day, we need a storage solution that can support writes at this rate. Because of the amount of data that we need to store, it also needs to be space efficient. Most importantly, this data will be presented in a web user interface, so it needs to be quickly queryable. These are the three design criteria for a storage solution. In terms of design constraints, Spotify runs on Google's Cloud Platform, which does not offer an off-the-shelf graph database solution. Hence, we are constrained to traditional relational databases. Also, because the result of the query will be shown in a UI, it needs to run in under 50 ms for graphs with up to 1000 nodes.

This report begins by elaborating on the problem, the design constraints, and the design criteria. It then describes the conventional solution to the problem, followed by our proposed solution. Next is an evaluation of the two designs against our design criteria, followed by conclusions and recommendations. The intended audience is software engineers looking to decide on a database storage solution for directed acyclic graphs. Basic algorithmic knowledge and familiarity with SQL are assumed.

2 Problem Specification

2.1 Problem Statement

The task is to store directed acyclic graphs (DAGs) in a database. We consider directed acyclic graphs with well-defined breakpoints. This is illustrated in Figure 2-1, where breakpoints are depicted by blue squares and non-breakpoints by green circles.

Figure 2-1. An example directed acyclic graph.

We will query the database by specifying a start node. The database should return all nodes downstream of that node, up until the next breakpoint. We call this the downstream query. For example, a query for node A should return nodes B, D, E, and F. The schema should be optimized for queries of this type.

2.2 Design Constraints

There are two design constraints. The first is that Spotify runs on Google's Cloud Platform, which does not offer an off-the-shelf graph database solution. Hence, we are constrained to traditional relational databases. The second is that the downstream query for a graph with 1000 nodes needs to run in under 50 ms. This is to ensure that the UI presenting this data feels responsive and snappy.

2.3 Design Criteria

There are three design criteria. The first is write speed, defined as the amount of time it takes to write a graph into the database. The second is read speed, defined as the amount of time a downstream query takes. The third is space complexity, defined as the amount of space the database needs to store a graph. These three design criteria cover all aspects of performance for the storage solution.

3 Design alternatives

Two design alternatives were considered. The first is the conventional approach for storing graphs in a relational database. The second is a design proposal that is optimized for the types of queries we are interested in.

3.1 Conventional design: adjacency list with transitive closure

One of the most common ways to store graphs in SQL databases is with an adjacency list table ([1], [2]). In this design, there are two tables. The first contains nodes, as shown in Table 3-1.

Name
A
B
C

Table 3-1. An example node table in the adjacency list solution.

The second table contains edges. Each entry in the table represents one edge in the graph. An example is presented in Table 3-2, corresponding to the nodes table presented above.

Parent   Child
A        B
B        C

Table 3-2. An example edges table in the adjacency list solution.

This schema alone is sufficient to get all direct descendants of a node with a single query, but we want all transitive descendants, all the way up until the next breakpoint. This can be achieved using a recursive query, but that can be slow for long chains [1]. A common remedy is to use an adjacency list with transitive closure. This means that at write time, we create an entry in the edges table for each transitive descendant. For example, because C descends from B, and B descends from A, C transitively descends from A. An example is shown in Table 3-3.

Parent   Child   Hops
A        B       0
B        C       0
A        C       1

Table 3-3. An example edges table with transitive closure. C is transitively a descendant of A through node B, hence Hops = 1.

For the types of queries specified in this report, transitive edges only need to be added up until the next breakpoint. Figure 3-1 contains the query that returns all nodes downstream of a given node up until the next breakpoint. A concrete SQL sketch of this schema is given at the end of this section.

SELECT * FROM nodes JOIN edges ON name = child WHERE parent = 'X';

Figure 3-1. Downstream query for the conventional design.

3.2 Proposed design: accumulation array

The proposed solution also applies the idea of transitive closure. It works by accumulating dependencies from parent to child. There is still a table of nodes, but an array field is added. This array field contains all accumulated nodes up until the previous breakpoint, and is called AccumulatedNodes.

The list of accumulated nodes is built by first applying a topological sort; Kahn's algorithm, for example, can be used. This gives an ordering of the nodes such that for each edge from node A to node B, A comes before B in the ordering. We then iterate through this list. At each node, we first add the node itself to AccumulatedNodes. We then iterate through its parents. If a parent is not a breakpoint, we add its AccumulatedNodes too. If it is a breakpoint, we add just the parent node itself. Table 3-4 shows what the data looks like for the DAG in Figure 2-1. Note that we also need a field containing each node's parents; otherwise, we would lose information about the graph.

Node   Parents   AccumulatedNodes
A      -         A
B      A         A, B
D      B         A, B, D
E      A, G      A, E, G
F      E         A, E, F, G
G      -         G

Table 3-4. The nodes table for the proposed design, using the graph in Figure 2-1.

To perform the downstream query for a given node, we search for that node in the AccumulatedNodes column. This query is provided in Figure 3-2. A downstream query for node A returns A, B, D, E, and F, because A is in the AccumulatedNodes column for those rows.

SELECT * FROM nodes WHERE 'X' = ANY(AccumulatedNodes);

Figure 3-2. Downstream query for the proposed design.

The benefit of this approach over the conventional one is that it does not require a join across two tables. All the data needed to find the nodes for the downstream query and to rebuild the graph is stored in one table.
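To make the conventional design of Section 3.1 concrete, the following is a minimal PostgreSQL sketch of the two tables and the downstream query. The column names follow Tables 3-1 to 3-3; the report does not specify data types, constraints, or indexes, so those shown here are assumptions.

-- Conventional design (Section 3.1): adjacency list with transitive closure.
-- Types, foreign keys, and the index are assumptions; the report only names the columns.
CREATE TABLE nodes (
    name text PRIMARY KEY
);

CREATE TABLE edges (
    parent text NOT NULL REFERENCES nodes (name),
    child  text NOT NULL REFERENCES nodes (name),
    hops   integer NOT NULL DEFAULT 0   -- 0 for a direct edge, >0 for a transitive one
);

-- The downstream query filters on parent, so an index there is assumed.
CREATE INDEX edges_parent_idx ON edges (parent);

-- At write time, one row is inserted per direct edge and per transitive
-- descendant up to the next breakpoint, e.g. for the chain A -> B -> C:
INSERT INTO edges (parent, child, hops) VALUES
    ('A', 'B', 0),
    ('B', 'C', 0),
    ('A', 'C', 1);

-- Downstream query from Figure 3-1, here for start node 'A':
SELECT *
FROM nodes
JOIN edges ON name = child
WHERE parent = 'A';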
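Similarly, the proposed design of Section 3.2 can be sketched as a single table with two array columns. This sketch replaces the nodes table above; the two schemas are alternatives, not meant to coexist. The types and the is_breakpoint flag are assumptions (the report distinguishes breakpoints but never names a column for them). The containment form of the downstream query using @> is equivalent to the ANY form of Figure 3-2 and, unlike ANY, can be served by a GIN index on the array column; indexing is not discussed in the report, so treat that part as optional.

-- Proposed design (Section 3.2): one nodes table with an accumulation array.
CREATE TABLE nodes (
    name              text PRIMARY KEY,
    is_breakpoint     boolean NOT NULL DEFAULT false,  -- assumed flag, not named in the report
    parents           text[] NOT NULL DEFAULT '{}',
    accumulated_nodes text[] NOT NULL DEFAULT '{}'
);

-- The nodes table for the graph in Figure 2-1 (Table 3-4):
INSERT INTO nodes (name, is_breakpoint, parents, accumulated_nodes) VALUES
    ('A', true,  '{}',    '{A}'),
    ('G', true,  '{}',    '{G}'),
    ('B', false, '{A}',   '{A,B}'),
    ('D', false, '{B}',   '{A,B,D}'),
    ('E', false, '{A,G}', '{A,E,G}'),
    ('F', false, '{E}',   '{A,E,F,G}');

-- Downstream query from Figure 3-2, here for start node 'A':
SELECT * FROM nodes WHERE 'A' = ANY (accumulated_nodes);

-- Equivalent containment form, which a GIN index can accelerate:
CREATE INDEX nodes_acc_idx ON nodes USING gin (accumulated_nodes);
SELECT * FROM nodes WHERE accumulated_nodes @> ARRAY['A'];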
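The report builds AccumulatedNodes in application code, using a topological sort (for example Kahn's algorithm) and accumulating over each node's parents. As an alternative illustration of the same accumulation rule, and assuming the table sketched above, the array could also be populated inside PostgreSQL with a recursive CTE that walks parent links and stops expanding once it reaches a breakpoint other than the starting node. This is not the report's implementation, only a sketch of the rule expressed in SQL.

-- Populate accumulated_nodes: each node accumulates itself, plus every
-- ancestor reachable through non-breakpoint parents; breakpoint ancestors
-- are included but not expanded further.
WITH RECURSIVE reach (node, ancestor, ancestor_is_breakpoint) AS (
    -- every node reaches itself
    SELECT name, name, is_breakpoint
    FROM nodes
  UNION
    -- walk one parent edge, but only expand from the start node itself
    -- or from ancestors that are not breakpoints
    SELECT r.node, p.name, p.is_breakpoint
    FROM reach r
    JOIN nodes c ON c.name = r.ancestor
    JOIN nodes p ON p.name = ANY (c.parents)
    WHERE r.ancestor = r.node OR NOT r.ancestor_is_breakpoint
)
UPDATE nodes n
SET accumulated_nodes = agg.acc
FROM (
    SELECT node, array_agg(ancestor ORDER BY ancestor) AS acc
    FROM reach
    GROUP BY node
) AS agg
WHERE n.name = agg.node;

For the graph in Figure 2-1, this produces exactly the AccumulatedNodes column of Table 3-4, up to ordering within each array.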

4 Evaluation

The two designs were evaluated by measuring write speed, read speed, and space usage for graphs of various sizes.

4.1 Experimental Setup

A PostgreSQL 9.6.2 database was used, running on macOS 10.12.4 with a 2.6 GHz Intel Core i5 CPU and 8 GB of 1600 MHz DDR3 RAM. The graph used for the experiment resembles a fully connected neural network, with the input layer and the output layer as breakpoints. The number of input nodes and the number of layers were always the same, denoted by n. Figure 4-1 shows the graph for n = 3, which contains n² = 9 nodes. A downstream query was performed on the top-left node.

Figure 4-1. The type of graph used for measurements. Here, n = 3.

The graph used for the experiment is dense. Dense graphs are the most difficult case for a storage solution to handle, so these tests represent the worst-case scenario, and better performance can be expected in the average case.
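The report does not include the code used to generate the test graphs, so the following is only a hypothetical sketch of how the layered graph of Figure 4-1 could be produced for the conventional-design tables sketched at the end of Section 3, using PostgreSQL's generate_series. It builds n layers of n nodes each (n = 3 here), with every node connected to every node in the next layer; the node naming scheme is an assumption.

-- Hypothetical generator for the layered graph of Figure 4-1 (n = 3).
-- Node names like 'L1_2' mean layer 1, position 2.
INSERT INTO nodes (name)
SELECT format('L%s_%s', layer, pos)
FROM generate_series(1, 3) AS layer,
     generate_series(1, 3) AS pos;

-- Fully connect each layer to the next one (direct edges only; the
-- transitive-closure rows would still have to be added on top of these).
INSERT INTO edges (parent, child, hops)
SELECT format('L%s_%s', layer, i), format('L%s_%s', layer + 1, j), 0
FROM generate_series(1, 2) AS layer,
     generate_series(1, 3) AS i,
     generate_series(1, 3) AS j;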

4.2 Write Speed

Figure 4-2 displays the time taken to insert graphs of increasing size into the database.

Figure 4-2. Results for the write speed test.

Figure 4-2 shows that the proposed design performs better for all values of n. Large graphs can take upwards of two minutes to insert into the database with the conventional design. This is because we must insert O(n⁴) edges: there are n² nodes in the graph, and the average node has approximately n²/2 edges due to transitive closure. For this reason, insertion slows down significantly for large values of n.

4.3 Read Speed

Figure 4-3 shows the amount of time the downstream query takes as the size of the graph increases.

Figure 4-3. Results for the read speed test.

Once again, the proposed design performs better for all values of n. The proposed design is faster because it eliminates the need for a join, which is one of the most expensive database operations [3]. This time, however, the difference in performance at n = 30 is only 40 ms, much smaller than the difference observed for write speed.
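The read-speed numbers above were measured externally; one way to see where the time goes, and to confirm that the proposed design avoids the join, is to compare the two query plans directly in PostgreSQL. This was not part of the report's methodology, just an illustration using the queries from Figures 3-1 and 3-2, each run against its respective schema.

-- Plan and timing for the conventional design (join between nodes and edges):
EXPLAIN ANALYZE
SELECT * FROM nodes JOIN edges ON name = child WHERE parent = 'A';

-- Plan and timing for the proposed design (single-table array scan):
EXPLAIN ANALYZE
SELECT * FROM nodes WHERE 'A' = ANY (accumulated_nodes);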

4.4 Space Complexity

Figure 4-4 shows the amount of space taken up by the database as the size of the graph grows.

Figure 4-4. Results for the space complexity test.

Once again, because of the large number of edges that must be inserted for the conventional design, as described in Section 4.2, the amount of space needed grows dramatically.
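The report does not say how the space figures were collected. In PostgreSQL, one straightforward way to measure the on-disk footprint of each schema (table, indexes, and TOAST data together) is pg_total_relation_size; this is shown here only as an assumed measurement method, not the one used for Figure 4-4.

-- Total on-disk size for each schema's tables:
SELECT pg_size_pretty(pg_total_relation_size('nodes'));
SELECT pg_size_pretty(pg_total_relation_size('edges'));  -- conventional design only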

5 Conclusion

In this report, two schemas for storing directed acyclic graphs with breakpoints have been presented. The first is the conventional approach, which uses a nodes table and an edges table with transitive closure. The second has only a nodes table, but adds a field containing the nodes accumulated since the last breakpoint. Through experimentation, the proposed design was shown to perform much better than the conventional design on all the design criteria: write speed, read speed, and space complexity. Write time and space usage are greatly reduced because far fewer rows need to be written. Read speed is improved because we avoid a database join, which is a costly operation [3].

The proposed solution also meets all the design constraints. First, the schema works with almost any relational database. In addition, the constraint of a downstream query taking less than 50 ms for a graph with 1000 nodes was met: this query took 44.2 ms.

6 Recommendations

Based on the conclusions, implementing the proposed solution is highly recommended. The proposed design performs better than the conventional design in all three design criteria (write speed, read speed, and space complexity) and meets all the design constraints.

References

[1] K. Erdogan, "A Model to Represent Directed Acyclic Graphs (DAG) on SQL Databases," CodeProject, 14-Jan-2008. [Online]. Available: https://www.codeproject.com/articles/22824/a-model-to-represent-directed-acyclic-Graphs-DAG-o. [Accessed: 04-May-2017].

[2] J. Horak, "DAG structures in SQL databases," Apache Software Foundation, 19-Sep-2010. [Online]. Available: http://people.apache.org/~dongsheng/horak/100309_dag_structures_sql.pdf. [Accessed: 04-May-2017].

[3] B. A. Johnson, "Joins are slow, memory is fast," Database Science, 28-Nov-2008. [Online]. Available: http://dbscience.blogspot.ca/2007/11/joins-are-slow-memory-is-fast.html. [Accessed: 04-May-2017].

Acknowledgements

I want to acknowledge my coworkers Stephen Enders and Rouzbeh Delavari, who came up with the design of the proposed schema. I want to acknowledge my employer Spotify for giving me the opportunity to work on implementing the solution discussed in this report.