University of Waterloo
Software Engineering

Storing Directed Acyclic Graphs in Relational Databases

Spotify USA Inc
New York, NY, USA

Prepared by
Soheil Koushan
Student ID: 20523416
User ID: skoushan
4A Software Engineering
May 8, 2017

Soheil Koushan
212 Parkview Ave
Toronto, ON, M2N 3Y8

May 8, 2017

Dr. P. Lam, Director
Software Engineering
University of Waterloo
Waterloo, ON, N2L 3G1

Dear Dr. P. Lam:

This report, titled Storing Directed Acyclic Graphs in Relational Databases, is my third work term report. It is based on my experience at Spotify USA Inc during my 4A co-op term. Spotify is a music streaming service. During my co-op, I worked on internal tools focused on making data at Spotify easier for developers to discover and understand. Part of that work was uncovering the dependencies between different datasets. Hence, the need arose to store dependency information in a database that is easy to query. Our team designed a schema that performs much faster than the conventional approach. This report outlines our proposed design.

I would like to thank all my coworkers, as well as my supervisor, for giving me the opportunity to work on this problem. I would also like to thank the authors of the online resources, cited in the References and Acknowledgements sections, whose material helped me reach my conclusions.

I hereby confirm that I have received no help, other than what is mentioned above, in writing this report. I also confirm this report has not been previously submitted for academic credit at this or any other academic institution.

Sincerely,

Soheil Koushan
Student ID: 20523416

Executive Summary

Spotify, a leading music streaming service, has thousands of internal datasets, which are produced by hundreds of thousands of jobs running daily. Mapping the dependencies between these datasets is valuable for understanding what went into a piece of data and who consumes it. This mapping takes the form of a directed acyclic graph (DAG). In this report, I propose a schema for storing directed acyclic graphs in a relational database.

We consider directed acyclic graphs that have well-defined breakpoints. For example, a breakpoint can be a dataset that is recommended for consumption, while non-breakpoints can be intermediary datasets that are used in the final product but should not be consumed by others. We query this data by asking for all nodes downstream of a breakpoint, up until the next breakpoint. Our design criteria are write speed, read speed, and the space needed to store dense graphs using the schema.

The conventional approach is to use a table of nodes and a table of edges with transitive closure. The problem is that for dense graphs, the number of edges that need to be stored grows quadratically with the number of nodes. In addition, performing a database join is costly. Our proposed design stores only a nodes table, but includes an array field containing all the nodes that have a path to a given node, up until the previous breakpoint. This way, we encode in a single column information that would have been spread over many rows. This improves write time and eliminates the need for a join.

The proposed solution performs significantly better on all the design criteria and meets the design constraints. The recommendation is to use the proposed schema instead of the more general, conventional approach, because it is specifically optimized for the types of queries we require.

Table of Contents

Executive Summary
Table of Contents
List of Figures
List of Tables
1 Introduction
2 Problem Specification
  2.1 Problem Statement
  2.2 Design Constraints
  2.3 Design Criteria
3 Design alternatives
  3.1 Conventional design: adjacency list with transitive closure
  3.2 Proposed design: accumulation array
4 Evaluation
  4.1 Experimental Setup
  4.2 Write Speed
  4.3 Read Speed
  4.4 Space Complexity
5 Conclusion
6 Recommendations
References
Acknowledgements

List of Figures

Figure 2-1. An example directed acyclic graph.
Figure 3-1. Downstream query for the conventional design.
Figure 3-2. Downstream query for the proposed design.
Figure 4-1. The type of graph used for measurements. Here, n = 3.
Figure 4-2. Results for the write speed test.
Figure 4-3. Results for the read speed test.
Figure 4-4. Results for the space complexity test.

List of Tables

Table 3-1. An example node table in the adjacency list solution.
Table 3-2. An example edges table in the adjacency list solution.
Table 3-3. An example edges table with transitive closure. C is transitively a descendant of A through node B, hence Hops = 1.
Table 3-4. The nodes table for the proposed design, using the graph in Figure 2-1.

1 Introduction

Much of our data today is graph-structured. Social networks are an obvious example: people are nodes and friendships are edges. Another example is mapping, where nodes are places and edges are transportation options between them. Directed acyclic graphs are a special type of graph in which the edges are directed and no cycles exist. They are often used to model dependencies. At Spotify, this type of dependency information is valuable in understanding the relationships between different datasets.

Because Spotify has tasks running hundreds of thousands of times a day, we need a storage solution that can support writes at this rate. Because of the amount of data that we need to store, it also needs to be space efficient. Most importantly, this data will be presented in a web user interface, so it needs to be quickly queryable. These are the three design criteria for a storage solution. In terms of design constraints, Spotify runs on Google's Cloud Platform, which does not offer an off-the-shelf graph database solution. Hence, we are constrained to traditional relational databases. Also, because the result of the query will be shown in a UI, it needs to run in under 50 ms for graphs with up to 1000 nodes.

This report begins by elaborating on the problem, the design constraints, and the design criteria. It then describes the conventional solution to the problem, followed by our proposed solution. Next is an evaluation of the two designs against our design criteria, followed by conclusions and recommendations. The intended audience is software engineers looking to decide on a database storage solution for directed acyclic graphs. Basic algorithmic knowledge and familiarity with SQL are assumed.

2 Problem Specification

2.1 Problem Statement

The task is to store directed acyclic graphs (DAGs) in a database. We consider directed acyclic graphs with well-defined breakpoints. This is illustrated in Figure 2-1, where breakpoints are depicted by blue squares and non-breakpoints by green circles.

Figure 2-1. An example directed acyclic graph.

We will query the database by specifying a start node. The database should return all nodes downstream of that node, up until the next breakpoint. We call this the downstream query. For example, a query for node A should return nodes B, D, E, and F. The schema should be optimized for queries of this type.

2.2 Design Constraints

There are two design constraints. The first is that Spotify runs on Google's Cloud Platform, which does not offer an off-the-shelf graph database solution. Hence, we are constrained to traditional relational databases. The second is that the downstream query for a graph with 1000 nodes needs to run in under 50 ms. This is to ensure that the UI presenting this data feels responsive and snappy.

2.3 Design Criteria

There are three design criteria. The first is write speed, defined as the amount of time it takes to write a graph into the database. The second is read speed, defined as the amount of time a downstream query takes. The third is space complexity, defined as the amount of space the database needs to store a graph. These three design criteria cover all aspects of performance for the storage solution.

3 Design alternatives

Two design alternatives were considered. The first is the conventional approach for storing graphs in a relational database. The second is a design proposal that is optimized for the types of queries we are interested in.

3.1 Conventional design: adjacency list with transitive closure

One of the most common ways to store graphs in SQL databases is with an adjacency list table ([1], [2]). In this design, there are two tables. The first contains nodes, as shown in Table 3-1.

Name
A
B
C

Table 3-1. An example node table in the adjacency list solution.

The second table contains edges. Each entry in the table represents one edge in the graph. An example is presented in Table 3-2, corresponding to the nodes table presented above.

Parent   Child
A        B
B        C

Table 3-2. An example edges table in the adjacency list solution.

This schema alone is sufficient to get all direct descendants of a node with a single query, but we want all transitive descendants, all the way up until the next breakpoint. This can be achieved using a recursive query, but that can be slow for long chains [1]. A common remedy is to use an adjacency list with transitive closure. This means that at write time, we create an entry in the edges table for each transitive descendant. For example, because C descends from B, and B descends from A, C transitively descends from A. An example is shown in Table 3-3.

Parent   Child   Hops
A        B       0
B        C       0
A        C       1

Table 3-3. An example edges table with transitive closure. C is transitively a descendant of A through node B, hence Hops = 1.

For the types of queries specified in this report, transitive edges only need to be added up until the next breakpoint. Figure 3-1 contains the query that returns all nodes downstream of a given node up until the next breakpoint. A concrete SQL sketch of this schema is given at the end of this section.

SELECT * FROM nodes JOIN edges ON name = child WHERE parent = 'X';

Figure 3-1. Downstream query for the conventional design.

3.2 Proposed design: accumulation array

The proposed solution also applies the idea of transitive closure. It works by accumulating dependencies from parent to child. There is still a table of nodes, but an array field is added. This array field contains all accumulated nodes up until the previous breakpoint, and is called AccumulatedNodes.

The list of accumulated nodes is built by first applying a topological sort; Kahn's algorithm, for example, can be used. This gives an ordering of the nodes such that for each edge from node A to node B, A comes before B in the ordering. We then iterate through this list. At each node, we first add the node itself to AccumulatedNodes. We then iterate through its parents. If a parent is not a breakpoint, we add its AccumulatedNodes too. If it is a breakpoint, we add just the parent node itself. Table 3-4 shows what the data looks like for the DAG in Figure 2-1. Note that we also need a field containing each node's parents; otherwise, we would lose information about the graph.

Node   Parents   AccumulatedNodes
A      -         A
B      A         A, B
D      B         A, B, D
E      A, G      A, E, G
F      E         A, E, F, G
G      -         G

Table 3-4. The nodes table for the proposed design, using the graph in Figure 2-1.

To perform the downstream query for a given node, we search for that node in the AccumulatedNodes column. This query is provided in Figure 3-2. A downstream query for node A returns A, B, D, E, and F, because A is in the AccumulatedNodes column for those rows.

SELECT * FROM nodes WHERE 'X' = ANY(AccumulatedNodes);

Figure 3-2. Downstream query for the proposed design.

The benefit of this approach over the conventional one is that it does not require a join across two tables. All the data needed to find the nodes for the downstream query and to rebuild the graph is stored in one table.
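To make the conventional design of Section 3.1 concrete, the following is a minimal PostgreSQL sketch of the two tables and the downstream query. The column names follow Tables 3-1 to 3-3; the report does not specify data types, constraints, or indexes, so those shown here are assumptions.

-- Conventional design (Section 3.1): adjacency list with transitive closure.
-- Types, foreign keys, and the index are assumptions; the report only names the columns.
CREATE TABLE nodes (
    name text PRIMARY KEY
);

CREATE TABLE edges (
    parent text NOT NULL REFERENCES nodes (name),
    child  text NOT NULL REFERENCES nodes (name),
    hops   integer NOT NULL DEFAULT 0   -- 0 for a direct edge, >0 for a transitive one
);

-- The downstream query filters on parent, so an index there is assumed.
CREATE INDEX edges_parent_idx ON edges (parent);

-- At write time, one row is inserted per direct edge and per transitive
-- descendant up to the next breakpoint, e.g. for the chain A -> B -> C:
INSERT INTO edges (parent, child, hops) VALUES
    ('A', 'B', 0),
    ('B', 'C', 0),
    ('A', 'C', 1);

-- Downstream query from Figure 3-1, here for start node 'A':
SELECT *
FROM nodes
JOIN edges ON name = child
WHERE parent = 'A';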
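Similarly, the proposed design of Section 3.2 can be sketched as a single table with two array columns. This sketch replaces the nodes table above; the two schemas are alternatives, not meant to coexist. The types and the is_breakpoint flag are assumptions (the report distinguishes breakpoints but never names a column for them). The containment form of the downstream query using @> is equivalent to the ANY form of Figure 3-2 and, unlike ANY, can be served by a GIN index on the array column; indexing is not discussed in the report, so treat that part as optional.

-- Proposed design (Section 3.2): one nodes table with an accumulation array.
CREATE TABLE nodes (
    name              text PRIMARY KEY,
    is_breakpoint     boolean NOT NULL DEFAULT false,  -- assumed flag, not named in the report
    parents           text[] NOT NULL DEFAULT '{}',
    accumulated_nodes text[] NOT NULL DEFAULT '{}'
);

-- The nodes table for the graph in Figure 2-1 (Table 3-4):
INSERT INTO nodes (name, is_breakpoint, parents, accumulated_nodes) VALUES
    ('A', true,  '{}',    '{A}'),
    ('G', true,  '{}',    '{G}'),
    ('B', false, '{A}',   '{A,B}'),
    ('D', false, '{B}',   '{A,B,D}'),
    ('E', false, '{A,G}', '{A,E,G}'),
    ('F', false, '{E}',   '{A,E,F,G}');

-- Downstream query from Figure 3-2, here for start node 'A':
SELECT * FROM nodes WHERE 'A' = ANY (accumulated_nodes);

-- Equivalent containment form, which a GIN index can accelerate:
CREATE INDEX nodes_acc_idx ON nodes USING gin (accumulated_nodes);
SELECT * FROM nodes WHERE accumulated_nodes @> ARRAY['A'];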
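The report builds AccumulatedNodes in application code, using a topological sort (for example Kahn's algorithm) and accumulating over each node's parents. As an alternative illustration of the same accumulation rule, and assuming the table sketched above, the array could also be populated inside PostgreSQL with a recursive CTE that walks parent links and stops expanding once it reaches a breakpoint other than the starting node. This is not the report's implementation, only a sketch of the rule expressed in SQL.

-- Populate accumulated_nodes: each node accumulates itself, plus every
-- ancestor reachable through non-breakpoint parents; breakpoint ancestors
-- are included but not expanded further.
WITH RECURSIVE reach (node, ancestor, ancestor_is_breakpoint) AS (
    -- every node reaches itself
    SELECT name, name, is_breakpoint
    FROM nodes
  UNION
    -- walk one parent edge, but only expand from the start node itself
    -- or from ancestors that are not breakpoints
    SELECT r.node, p.name, p.is_breakpoint
    FROM reach r
    JOIN nodes c ON c.name = r.ancestor
    JOIN nodes p ON p.name = ANY (c.parents)
    WHERE r.ancestor = r.node OR NOT r.ancestor_is_breakpoint
)
UPDATE nodes n
SET accumulated_nodes = agg.acc
FROM (
    SELECT node, array_agg(ancestor ORDER BY ancestor) AS acc
    FROM reach
    GROUP BY node
) AS agg
WHERE n.name = agg.node;

For the graph in Figure 2-1, this produces exactly the AccumulatedNodes column of Table 3-4, up to ordering within each array.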

4 Evaluation

The two designs were evaluated by measuring write speed, read speed, and space usage for graphs of various sizes.

4.1 Experimental Setup

A PostgreSQL 9.6.2 database was used, running on macOS 10.12.4 with a 2.6 GHz Intel Core i5 CPU and 8 GB of 1600 MHz DDR3 RAM. The graph used for the experiment resembles a fully connected neural network, with the input layer and the output layer as breakpoints. The number of input nodes and the number of layers were always the same, denoted by n. Figure 4-1 shows the graph for n = 3, which contains n² = 9 nodes. A downstream query was performed on the top-left node.

Figure 4-1. The type of graph used for measurements. Here, n = 3.

The graph used for the experiment is dense. Dense graphs are the most difficult case for a storage solution to handle, so these tests represent the worst-case scenario, and better performance can be expected in the average case.
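The report does not include the code used to generate the test graphs, so the following is only a hypothetical sketch of how the layered graph of Figure 4-1 could be produced for the conventional-design tables sketched at the end of Section 3, using PostgreSQL's generate_series. It builds n layers of n nodes each (n = 3 here), with every node connected to every node in the next layer; the node naming scheme is an assumption.

-- Hypothetical generator for the layered graph of Figure 4-1 (n = 3).
-- Node names like 'L1_2' mean layer 1, position 2.
INSERT INTO nodes (name)
SELECT format('L%s_%s', layer, pos)
FROM generate_series(1, 3) AS layer,
     generate_series(1, 3) AS pos;

-- Fully connect each layer to the next one (direct edges only; the
-- transitive-closure rows would still have to be added on top of these).
INSERT INTO edges (parent, child, hops)
SELECT format('L%s_%s', layer, i), format('L%s_%s', layer + 1, j), 0
FROM generate_series(1, 2) AS layer,
     generate_series(1, 3) AS i,
     generate_series(1, 3) AS j;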

4.2 Write Speed

Figure 4-2 displays the time taken to insert graphs of increasing size into the database.

Figure 4-2. Results for the write speed test.

Figure 4-2 shows that the proposed design performs better for all values of n. Large graphs can take upwards of two minutes to insert into the database with the conventional design. This is because we must insert O(n⁴) edges: there are n² nodes in the graph, and the average node has approximately n²/2 edges due to transitive closure. For this reason, insertion slows down significantly for large values of n.

4.3 Read Speed

Figure 4-3 shows the amount of time the downstream query takes as the size of the graph increases.

Figure 4-3. Results for the read speed test.

Once again, the proposed design performs better for all values of n. The proposed design is faster because it eliminates the need for a join, which is one of the most expensive database operations [3]. This time, however, the difference in performance at n = 30 is only 40 ms, much smaller than the difference observed for write speed.
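The read-speed numbers above were measured externally; one way to see where the time goes, and to confirm that the proposed design avoids the join, is to compare the two query plans directly in PostgreSQL. This was not part of the report's methodology, just an illustration using the queries from Figures 3-1 and 3-2, each run against its respective schema.

-- Plan and timing for the conventional design (join between nodes and edges):
EXPLAIN ANALYZE
SELECT * FROM nodes JOIN edges ON name = child WHERE parent = 'A';

-- Plan and timing for the proposed design (single-table array scan):
EXPLAIN ANALYZE
SELECT * FROM nodes WHERE 'A' = ANY (accumulated_nodes);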

4.4 Space Complexity

Figure 4-4 shows the amount of space taken up by the database as the size of the graph grows.

Figure 4-4. Results for the space complexity test.

Once again, because of the large number of edges that must be inserted for the conventional design, as described in Section 4.2, the amount of space needed grows dramatically.
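The report does not say how the space figures were collected. In PostgreSQL, one straightforward way to measure the on-disk footprint of each schema (table, indexes, and TOAST data together) is pg_total_relation_size; this is shown here only as an assumed measurement method, not the one used for Figure 4-4.

-- Total on-disk size for each schema's tables:
SELECT pg_size_pretty(pg_total_relation_size('nodes'));
SELECT pg_size_pretty(pg_total_relation_size('edges'));  -- conventional design only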

5 Conclusion

In this report, two schemas for storing directed acyclic graphs with breakpoints have been presented. The first is the conventional approach, which uses a nodes table and an edges table with transitive closure. The second has only a nodes table, but adds a field containing the nodes accumulated since the last breakpoint. Through experimentation, the proposed design was shown to perform much better than the conventional design on all the design criteria: write speed, read speed, and space complexity. Write time and space usage are greatly reduced because far fewer rows need to be written. Read speed is improved because we avoid a database join, which is a costly operation [3].

The proposed solution also meets all the design constraints. First, the schema works with almost any relational database. In addition, the constraint of a downstream query taking less than 50 ms for a graph with 1000 nodes was met: this query took 44.2 ms.

6 Recommendations

Based on the conclusions, implementing the proposed solution is highly recommended. The proposed design performs better than the conventional design in all three design criteria (write speed, read speed, and space complexity) and meets all the design constraints.

References

[1] K. Erdogan, "A Model to Represent Directed Acyclic Graphs (DAG) on SQL Databases," CodeProject, 14-Jan-2008. [Online]. Available: https://www.codeproject.com/articles/22824/a-model-to-represent-directed-acyclic-Graphs-DAG-o. [Accessed: 04-May-2017].

[2] J. Horak, "DAG structures in SQL databases," Apache Software Foundation, 19-Sep-2010. [Online]. Available: http://people.apache.org/~dongsheng/horak/100309_dag_structures_sql.pdf. [Accessed: 04-May-2017].

[3] B. A. Johnson, "Joins are slow, memory is fast," Database Science, 28-Nov-2008. [Online]. Available: http://dbscience.blogspot.ca/2007/11/joins-are-slow-memory-is-fast.html. [Accessed: 04-May-2017].

Acknowledgements

I want to acknowledge my coworkers Stephen Enders and Rouzbeh Delavari, who came up with the design of the proposed schema. I want to acknowledge my employer Spotify for giving me the opportunity to work on implementing the solution discussed in this report.