University of Waterloo. Storing Directed Acyclic Graphs in Relational Databases


University of Waterloo
Software Engineering

Storing Directed Acyclic Graphs in Relational Databases

Spotify USA Inc
New York, NY, USA

Prepared by Soheil Koushan
Student ID:
User ID: skoushan
4A Software Engineering
May 8, 2017

Soheil Koushan
212 Parkview Ave
Toronto, ON, M2N 3Y8

May 8, 2017

Dr. P. Lam, Director
Software Engineering
University of Waterloo
Waterloo, ON, N2L 3G1

Dear Dr. P. Lam:

This report, titled Storing Directed Acyclic Graphs in Relational Databases, is my third work term report. It is based on my experience at Spotify USA Inc for my 4A co-op term.

Spotify is a music streaming service. During my co-op, I worked on internal tools with the focus of making data at Spotify easier to discover and understand for developers. Part of that was uncovering the dependencies between different datasets. Hence, the need arose to store dependency information in a database that is easy to query. Our team designed a schema that performs much faster than the conventional approach. This report outlines our proposed design.

I would like to thank all my coworkers, as well as my supervisor, for giving me the opportunity to work on this problem. I would like to thank the numerous online resources, which are cited in the References and Acknowledgement sections, that contain information that helped me reach my conclusions. I hereby confirm that I have received no help, other than what is mentioned above, in writing this report. I also confirm this report has not been previously submitted for academic credit at this or any other academic institution.

Sincerely,
Soheil Koushan
Student ID:

Executive Summary

Spotify, a leading music streaming service, has thousands of internal datasets, which are produced by hundreds of thousands of jobs running daily. Mapping the dependencies between these datasets is valuable for understanding what went into a piece of data and who consumes it. This mapping takes the form of a directed acyclic graph (DAG).

In this report, I propose a schema for storing directed acyclic graphs in a relational database. We consider directed acyclic graphs which have well-defined breakpoints. For example, a breakpoint can be a dataset that is recommended for consumption, and non-breakpoints can be intermediary datasets that are used in the final product but should not be consumed by others. We query this data by asking for all nodes leaving a breakpoint up until the next breakpoint.

Our design criteria are write speed, read speed, and space complexity for storing dense graphs using the schema.

The conventional approach is to use a table of nodes and a table of edges with transitive closure. The problem is that for dense graphs, the number of edges that need to be stored grows quadratically with the number of nodes. In addition, performing a database join is costly.

Our proposed design stores only a nodes table, but includes an array field containing all the nodes that have a path to this node, going back up to the previous breakpoint. This way, we encode in a single column information that would have been spread over many rows. This improves write time and removes the need for a join.

Our proposed solution performs significantly better on all the design criteria and meets the design constraints. The recommendation is to use the proposed schema instead of the more general, conventional approach, because it is specially optimized for the types of queries we require.

Table of Contents

Executive Summary
Table of Contents
List of Figures
List of Tables
1 Introduction
2 Problem Specification
  2.1 Problem Statement
  2.2 Design Constraints
  2.3 Design Criteria
3 Design alternatives
  3.1 Conventional design: adjacency list with transitive closure
  3.2 Proposed design: accumulation array
4 Evaluation
  4.1 Experimental Setup
  4.2 Write Speed
  4.3 Read Speed
  4.4 Space Complexity
5 Conclusion
6 Recommendations
References
Acknowledgements

List of Figures

Figure 2-1. An example directed acyclic graph
Figure 3-1. Downstream query for the conventional design
Figure 3-2. Downstream query for the proposed design
Figure 4-1. The type of graph used for measurements. Here, n = 3
Figure 4-2. Results for the write speed test
Figure 4-3. Results for the read speed test
Figure 4-4. Results for the space complexity test

List of Tables

Table 3-1. An example node table in the adjacency list solution
Table 3-2. An example edges table in the adjacency list solution
Table 3-3. An example edges table with transitive closure. C is transitively a descendent of A through node B, hence hops = 1
Table 3-4. The nodes table for the proposed design, using the graph in Figure 2-1

1 Introduction

Much of our data today is naturally structured as a graph. Social networks are an obvious example: people are nodes and friendships are edges. Another example is mapping, where nodes are places and edges are transportation options between them. Directed acyclic graphs are a special type of graph where the edges are directed and no cycles exist in the graph. They are often used to model dependencies.

At Spotify, this type of dependency information is valuable in understanding relationships between different datasets. Because Spotify has tasks running hundreds of thousands of times a day, we need a storage solution that can support writes at this rate. Because of the amount of data that we need to store, it also needs to be space efficient. Most importantly, this data will be presented in a web user interface, meaning it needs to be quick to query. These are the three design criteria for a storage solution.

In terms of design constraints, Spotify runs on Google's Cloud Platform, which does not offer an off-the-shelf graph database solution. Hence, we are constrained to traditional relational databases. Also, the result of the query will be shown in a UI. Thus, it needs to run in under 50 ms for graphs with up to 1000 nodes.

This report begins by elaborating on the problem, the design constraints, and the design criteria. It then describes the conventional solution to the problem, followed by our proposed solution. Next is an evaluation of the two designs against our design criteria, followed by conclusions and recommendations.

The intended audience is software engineers looking to decide on a database storage solution for directed acyclic graphs. The report assumes basic algorithmic knowledge as well as knowledge of SQL.

2 Problem Specification

2.1 Problem Statement

The task is to store directed acyclic graphs (DAGs) in a database. We consider directed acyclic graphs with well-defined breakpoints. This is illustrated in Figure 2-1, where breakpoints are depicted by blue squares and non-breakpoints by green circles. We will query the database by specifying a start node. The database should return all the nodes leaving that node up until the next breakpoint. We shall call this the downstream query. For example, a query for node A should return nodes B, D, E, and F. The schema should be optimized for queries of this type.

Figure 2-1. An example directed acyclic graph.

2.2 Design Constraints

There are two design constraints. The first is that Spotify runs on Google's Cloud Platform, which does not offer an off-the-shelf graph database solution. Hence, we are constrained to traditional relational databases. The second is that the downstream query for a graph with 1000 nodes needs to run in under 50 ms. This is to ensure that the UI presenting this data feels responsive.

2.3 Design Criteria

There are three design criteria. The first is write speed, defined as the amount of time it takes to write a graph into the database. The second is read speed, defined as the amount of time a downstream query takes. The third is space complexity, defined as the amount of space the database needs to store a graph. Together, these three criteria cover the performance aspects of the storage solution that matter for this use case.

3 Design alternatives

Two design alternatives were considered. The first is the conventional approach for storing graphs in a relational database. The second is a design proposal that is optimized for the types of queries we are interested in.

3.1 Conventional design: adjacency list with transitive closure

One of the most common ways to store graphs in SQL databases is with an adjacency list table ([1], [2]). In this design, there exist two tables. The first contains nodes, as shown in Table 3-1.

    Name
    A
    B
    C

Table 3-1. An example node table in the adjacency list solution.

The second table contains edges. Each entry in the table represents one edge in the graph. An example is presented in Table 3-2, corresponding to the nodes table presented above.

    Parent  Child
    A       B
    B       C

Table 3-2. An example edges table in the adjacency list solution.

This schema alone is sufficient to get all direct descendants of a node with a single query, but we want all transitive descendants, all the way up until the next breakpoint. This can be achieved using a recursive query, but it can be slow for long chains [1]. A common remedy for this is to use an adjacency list with transitive closure. This means that at write time, we create an entry in the edges table for each transitive descendent. For example, because C descends from B, and B descends from A, C transitively descends from A. An example is shown in Table 3-3.

    Parent  Child  Hops
    A       B      0
    B       C      0
    A       C      1

Table 3-3. An example edges table with transitive closure. C is transitively a descendent of A through node B, hence hops = 1.

For the types of queries specified in this report, transitive edges only need to be added up until the next breakpoint. Figure 3-1 contains the query which returns all nodes leaving a given node X up until the next breakpoint.

    SELECT * FROM nodes JOIN edges ON name = child WHERE parent = 'X';

Figure 3-1. Downstream query for the conventional design.

3.2 Proposed design: accumulation array

The proposed solution also applies the idea of transitive closure. It works by accumulating dependencies from parent to child. There still exists a table of nodes, but an array field is added. This array field, called AccumulatedNodes, contains all accumulated nodes up until the previous breakpoint.

This list of accumulated nodes is built by first applying a topological sort. Kahn's algorithm, for example, can be used to do this. From this, we obtain an ordering of the nodes such that for each edge from node A to node B, A comes before B in the ordering. Then, we iterate through this list. At each node, we first add the node itself to AccumulatedNodes. We then iterate through its parents. If the parent is not a breakpoint, we add the parent's AccumulatedNodes too. If it is a breakpoint, we add only the parent node itself.

Table 3-4 shows what the data should look like for the DAG in Figure 2-1. Note that we also need to add a field containing the node's parents. Otherwise, we would lose information about the graph.

    Node  Parents  AccumulatedNodes
    A     -        A
    B     A        A, B
    D     B        A, B, D
    E     A, G     A, E, G
    F     E        A, E, F, G
    G     -        G

Table 3-4. The nodes table for the proposed design, using the graph in Figure 2-1.

To perform the downstream query for a given node X, we search for that node in the AccumulatedNodes column. This query is provided in Figure 3-2. For a downstream query of node A, it should return A, B, D, E, and F, because A is in the AccumulatedNodes column for those nodes (including A itself).

    SELECT * FROM nodes WHERE 'X' = ANY(AccumulatedNodes);

Figure 3-2. Downstream query for the proposed design.

The benefit of this approach over the conventional approach is that it does not require a join across two tables. All the data needed to find the nodes for the downstream query and to build the graph is stored in one table.
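The accumulation procedure of Section 3.2 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used at Spotify: the function names `build_accumulated_nodes` and `downstream` are hypothetical, the parent lists are copied from Table 3-4, and treating A and G as the breakpoints is an assumption, since the table itself does not say which nodes are breakpoints.

```python
from collections import deque


def build_accumulated_nodes(parents, breakpoints):
    """Compute the AccumulatedNodes list of every node in a DAG.

    parents: dict mapping each node to a list of its direct parents.
    breakpoints: set of breakpoint nodes (an assumption of this sketch).
    """
    # Derive child lists and in-degrees, the inputs to Kahn's algorithm.
    children = {node: [] for node in parents}
    indegree = {node: len(ps) for node, ps in parents.items()}
    for node, ps in parents.items():
        for parent in ps:
            children[parent].append(node)

    # Nodes without parents are ready to be processed first.
    ready = deque(sorted(node for node, d in indegree.items() if d == 0))
    accumulated = {}
    while ready:
        node = ready.popleft()
        acc = {node}  # a node always accumulates itself
        for parent in parents[node]:
            if parent in breakpoints:
                acc.add(parent)  # stop at the breakpoint
            else:
                acc.update(accumulated[parent])  # inherit the parent's list
        accumulated[node] = sorted(acc)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(accumulated) != len(parents):
        raise ValueError("graph contains a cycle")
    return accumulated


def downstream(accumulated, start):
    """In-memory mirror of the Figure 3-2 query."""
    return sorted(node for node, acc in accumulated.items() if start in acc)


# Parents column of Table 3-4; A and G as breakpoints is an assumption.
PARENTS = {"A": [], "B": ["A"], "D": ["B"], "E": ["A", "G"], "F": ["E"], "G": []}
TABLE = build_accumulated_nodes(PARENTS, breakpoints={"A", "G"})
```

Under these assumptions, `TABLE` reproduces the AccumulatedNodes column of Table 3-4, and `downstream(TABLE, "A")` selects the rows A, B, D, E, and F, matching the result described for Figure 3-2.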

4 Evaluation

The two designs were evaluated by performing write speed, read speed, and space tests for graphs of various sizes.

4.1 Experimental Setup

A PostgreSQL database was used, running on macOS with a 2.6 GHz Intel Core i5 CPU and 8 GB of 1600 MHz DDR3 RAM. The graph used for the experiment resembles a fully connected neural network, with the input layer and the output layer as breakpoints. The number of input nodes and the number of layers were always the same number, denoted by n. Figure 4-1 contains the graph for n = 3, which contains n^2 = 9 nodes. A downstream query was performed on the top-left node.

Figure 4-1. The type of graph used for measurements. Here, n = 3.

The graph used for the experiment is a dense one. The reason is that dense graphs are the most difficult for storage solutions to deal with. Hence, we are performing tests with the worst-case scenario, and can expect better performance in the average case.

4.2 Write Speed

Figure 4-2 displays the time taken to insert graphs of increasing size into the database.

Figure 4-2. Results for the write speed test.

Figure 4-2 shows that for all values of n, the proposed design performs better. Large graphs can take upwards of two minutes to insert into the database with the conventional design. This is because we must insert O(n^4) edges into the database: there are n^2 nodes in the graph, and the average node has approximately n^2/2 edges due to transitive closure. For this reason, insertion slows down significantly for large values of n.

4.3 Read Speed

Figure 4-3 shows the amount of time the downstream query takes as the size of the graph increases.

Figure 4-3. Results for the read speed test.

Once again, the proposed design performs better for all values of n. The reason the proposed design is faster is that it eliminates the need for joins, which are one of the most expensive database operations [3]. This time, however, the difference in performance at n = 30 is only 40 ms, which is much smaller than the difference for write speed.

4.4 Space Complexity

Figure 4-4 shows the amount of space taken up by the database as the size of the graph grows.

Figure 4-4. Results for the space complexity test.

Once again, because of the large number of edges we need to insert for the conventional method, as described in Section 4.2, the amount of space it needs grows much faster than for the proposed design.
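The O(n^4) edge count from Section 4.2 can be checked with a short Python sketch. It builds the layered test graph of Section 4.1 and counts the rows a transitive closure table would hold. The function names are hypothetical, and the sketch assumes one closure row per (ancestor, descendant) pair, which holds here because the only breakpoints are the first and last layers.

```python
def layered_edges(n):
    """Direct edges of n fully connected layers with n nodes per layer."""
    return [((layer, i), (layer + 1, j))
            for layer in range(n - 1)
            for i in range(n)
            for j in range(n)]


def closure_rows(n):
    """Count (ancestor, descendant) pairs, i.e. rows in the closure table."""
    children = {}
    for parent, child in layered_edges(n):
        children.setdefault(parent, []).append(child)

    total = 0
    for start in [(layer, i) for layer in range(n) for i in range(n)]:
        seen = set()
        stack = [start]
        while stack:  # depth-first walk over everything reachable from start
            node = stack.pop()
            for child in children.get(node, []):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        total += len(seen)
    return total


def node_rows(n):
    """Rows written by the proposed design: one per node."""
    return n * n
```

For n = 3, `closure_rows(3)` counts 27 closure rows against `node_rows(3)` = 9 node rows. In general the count equals n^3(n - 1)/2, which is consistent with the O(n^4) growth stated in Section 4.2.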

5 Conclusion

In this report, two different schemas for storing directed acyclic graphs with breakpoints have been presented. The first is the conventional approach, which involves a nodes table and an edges table with transitive closure. The second approach has only a nodes table, but adds a field containing the accumulated nodes traversed since the last breakpoint.

Through experimentation, the proposed design was shown to perform much better than the conventional design in all the design criteria: write speed, read speed, and space complexity. Write time and space usage are greatly reduced because fewer rows need to be written. Read speed is improved because we avoid a database join, which is a costly operation [3].

The proposed solution also meets all the design constraints. Firstly, the schema works with almost any relational database. In addition, the design constraint of a downstream query taking less than 50 ms for a graph with 1000 nodes was met, as this query took 44.2 ms.

6 Recommendations

Based on the conclusions, implementing the proposed solution is highly recommended. The proposed design performs better than the conventional design in all three design criteria (write speed, read speed, and space complexity) and meets all the design constraints.

References

[1] K. Erdogan, "A Model to Represent Directed Acyclic Graphs (DAG) on SQL Databases," CodeProject, 14-Jan [Online]. Available: Graphs-DAG-o. [Accessed: 04-May-2017].

[2] J. Horak, "DAG structures in SQL databases," Apache Software Foundation, 19-Sep [Online]. Available: [Accessed: 04-May-2017].

[3] B. A. Johnson, "Joins are slow, memory is fast," Database Science, 28-Nov [Online]. Available: [Accessed: 04-May-2017].

Acknowledgements

I want to acknowledge my coworkers Stephen Enders and Rouzbeh Delavari, who came up with the design of the proposed schema. I want to acknowledge my employer Spotify for giving me the opportunity to work on implementing the solution discussed in this report.


More information

Chapter 3. Graphs. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

Chapter 3. Graphs. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved. Chapter 3 Graphs Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved. 1 3.1 Basic Definitions and Applications Undirected Graphs Undirected graph. G = (V, E) V = nodes. E

More information

CSI 604 Elementary Graph Algorithms

CSI 604 Elementary Graph Algorithms CSI 604 Elementary Graph Algorithms Ref: Chapter 22 of the text by Cormen et al. (Second edition) 1 / 25 Graphs: Basic Definitions Undirected Graph G(V, E): V is set of nodes (or vertices) and E is the

More information

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight ESG Lab Review InterSystems Data Platform: A Unified, Efficient Data Platform for Fast Business Insight Date: April 218 Author: Kerry Dolan, Senior IT Validation Analyst Abstract Enterprise Strategy Group

More information

Product Release Notes Alderstone cmt 2.0

Product Release Notes Alderstone cmt 2.0 Alderstone cmt product release notes Product Release Notes Alderstone cmt 2.0 Alderstone Consulting is a technology company headquartered in the UK and established in 2008. A BMC Technology Alliance Premier

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

CGS 3066: Spring 2017 SQL Reference

CGS 3066: Spring 2017 SQL Reference CGS 3066: Spring 2017 SQL Reference Can also be used as a study guide. Only covers topics discussed in class. This is by no means a complete guide to SQL. Database accounts are being set up for all students

More information

Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes

Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes AN UNDER THE HOOD LOOK Databricks Delta, a component of the Databricks Unified Analytics Platform*, is a unified

More information

Analyzing Flight Data

Analyzing Flight Data IBM Analytics Analyzing Flight Data Jeff Carlson Rich Tarro July 21, 2016 2016 IBM Corporation Agenda Spark Overview a quick review Introduction to Graph Processing and Spark GraphX GraphX Overview Demo

More information

Report Exec Enterprise System Specifications

Report Exec Enterprise System Specifications Report Exec Enterprise System Specifications Contents Overview... 2 Technical Support... 2 At a Glance... 2 Report Exec Systems Diagram... 4 Hardware Specifications... 6 SQL Server... 6 RAM... 6 Processor...

More information

CTL.SC4x Technology and Systems

CTL.SC4x Technology and Systems in Supply Chain Management CTL.SC4x Technology and Systems Key Concepts Document This document contains the Key Concepts for the SC4x course, Weeks 1 and 2. These are meant to complement, not replace,

More information

Gradintelligence student support FAQs

Gradintelligence student support FAQs Gradintelligence student support FAQs Account activation issues... 2 I have not received my activation link / I cannot find it / it has expired. Please can you send me a new one?... 2 My account is showing

More information

Qsync. Cross-device File Sync for Optimal Teamwork. Share your life and work

Qsync. Cross-device File Sync for Optimal Teamwork. Share your life and work Qsync Cross-device File Sync for Optimal Teamwork Share your life and work Agenda Users' common issues QNAP NAS specifications recommended by various types of users Usage scenarios and Qsync application

More information

HEARTLAND DEVELOPER CONFERENCE 2017 APPLICATION DATA INTEGRATION WITH SQL SERVER INTEGRATION SERVICES

HEARTLAND DEVELOPER CONFERENCE 2017 APPLICATION DATA INTEGRATION WITH SQL SERVER INTEGRATION SERVICES HEARTLAND DEVELOPER CONFERENCE 2017 APPLICATION DATA INTEGRATION WITH SQL SERVER INTEGRATION SERVICES SESSION ABSTRACT: APPLICATION DATA INTEGRATION WITH SQL SERVER INTEGRATION SERVICES What do you do

More information

Column Stores vs. Row Stores How Different Are They Really?

Column Stores vs. Row Stores How Different Are They Really? Column Stores vs. Row Stores How Different Are They Really? Daniel J. Abadi (Yale) Samuel R. Madden (MIT) Nabil Hachem (AvantGarde) Presented By : Kanika Nagpal OUTLINE Introduction Motivation Background

More information

Figure 1: A directed graph.

Figure 1: A directed graph. 1 Graphs A graph is a data structure that expresses relationships between objects. The objects are called nodes and the relationships are called edges. For example, social networks can be represented as

More information

3.1 Basic Definitions and Applications. Chapter 3. Graphs. Undirected Graphs. Some Graph Applications

3.1 Basic Definitions and Applications. Chapter 3. Graphs. Undirected Graphs. Some Graph Applications Chapter 3 31 Basic Definitions and Applications Graphs Slides by Kevin Wayne Copyright 2005 Pearson-Addison Wesley All rights reserved 1 Undirected Graphs Some Graph Applications Undirected graph G = (V,

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

Oracle HCM Cloud Common Features

Oracle HCM Cloud Common Features Oracle HCM Cloud Common Features Release 11 Release Content Document December 2015 Revised: January 2017 TABLE OF CONTENTS REVISION HISTORY... 3 OVERVIEW... 5 HCM COMMON FEATURES... 6 HCM SECURITY... 6

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Unifying Big Data Workloads in Apache Spark

Unifying Big Data Workloads in Apache Spark Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache

More information

UAccess ANALYTICS. Fundamentals of Reporting. updated v.1.00

UAccess ANALYTICS. Fundamentals of Reporting. updated v.1.00 UAccess ANALYTICS Arizona Board of Regents, 2010 THE UNIVERSITY OF ARIZONA updated 07.01.2010 v.1.00 For information and permission to use our PDF manuals, please contact uitsworkshopteam@listserv.com

More information

Rank Preserving Clustering Algorithms for Paths in Social Graphs

Rank Preserving Clustering Algorithms for Paths in Social Graphs University of Waterloo Faculty of Engineering Rank Preserving Clustering Algorithms for Paths in Social Graphs LinkedIn Corporation Mountain View, CA 94043 Prepared by Ziyad Mir ID 20333385 2B Department

More information

HANA Performance. Efficient Speed and Scale-out for Real-time BI

HANA Performance. Efficient Speed and Scale-out for Real-time BI HANA Performance Efficient Speed and Scale-out for Real-time BI 1 HANA Performance: Efficient Speed and Scale-out for Real-time BI Introduction SAP HANA enables organizations to optimize their business

More information

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin Presented by: Suhua Wei Yong Yu Papers: MapReduce: Simplified Data Processing on Large Clusters 1 --Jeffrey Dean

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Edinburgh Research Explorer

Edinburgh Research Explorer Edinburgh Research Explorer Profiling OGSA-DAI Performance for Common Use Patterns Citation for published version: Dobrzelecki, B, Antonioletti, M, Schopf, JM, Hume, AC, Atkinson, M, Hong, NPC, Jackson,

More information

FIVE BEST PRACTICES FOR ENSURING A SUCCESSFUL SQL SERVER MIGRATION

FIVE BEST PRACTICES FOR ENSURING A SUCCESSFUL SQL SERVER MIGRATION FIVE BEST PRACTICES FOR ENSURING A SUCCESSFUL SQL SERVER MIGRATION The process of planning and executing SQL Server migrations can be complex and risk-prone. This is a case where the right approach and

More information

Evolution of Database Systems

Evolution of Database Systems Evolution of Database Systems Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies, second

More information

Project Overview Distributed Network Traffic Controller

Project Overview Distributed Network Traffic Controller Project Overview Distributed Network Traffic Controller Revision Number: 1.1 Last date of revision: 5/11/05 22c:198 Johnson, Chadwick Hugh 1 Motivation When a limited resource is shared between multiple

More information

University of Maryland. Tuesday, March 2, 2010

University of Maryland. Tuesday, March 2, 2010 Data-Intensive Information Processing Applications Session #5 Graph Algorithms Jimmy Lin University of Maryland Tuesday, March 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Graph Algorithms. Imran Rashid. Jan 16, University of Washington

Graph Algorithms. Imran Rashid. Jan 16, University of Washington Graph Algorithms Imran Rashid University of Washington Jan 16, 2008 1 / 26 Lecture Outline 1 BFS Bipartite Graphs 2 DAGs & Topological Ordering 3 DFS 2 / 26 Lecture Outline 1 BFS Bipartite Graphs 2 DAGs

More information

Big Data Analysis Using Hadoop and MapReduce

Big Data Analysis Using Hadoop and MapReduce Big Data Analysis Using Hadoop and MapReduce Harrison Carranza, MSIS Marist College, Harrison.Carranza2@marist.edu Mentor: Aparicio Carranza, PhD New York City College of Technology - CUNY, USA, acarranza@citytech.cuny.edu

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

CSE 530A. Query Planning. Washington University Fall 2013

CSE 530A. Query Planning. Washington University Fall 2013 CSE 530A Query Planning Washington University Fall 2013 Scanning When finding data in a relation, we've seen two types of scans Table scan Index scan There is a third common way Bitmap scan Bitmap Scans

More information

Efficient and Scalable Friend Recommendations

Efficient and Scalable Friend Recommendations Efficient and Scalable Friend Recommendations Comparing Traditional and Graph-Processing Approaches Nicholas Tietz Software Engineer at GraphSQL nicholas@graphsql.com January 13, 2014 1 Introduction 2

More information

DRYAD: DISTRIBUTED DATA- PARALLEL PROGRAMS FROM SEQUENTIAL BUILDING BLOCKS

DRYAD: DISTRIBUTED DATA- PARALLEL PROGRAMS FROM SEQUENTIAL BUILDING BLOCKS DRYAD: DISTRIBUTED DATA- PARALLEL PROGRAMS FROM SEQUENTIAL BUILDING BLOCKS Authors: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly Presenter: Zelin Dai WHAT IS DRYAD Combines computational

More information

(Refer Slide Time: 05:25)

(Refer Slide Time: 05:25) Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering IIT Delhi Lecture 30 Applications of DFS in Directed Graphs Today we are going to look at more applications

More information

Arkuda Concert. Audio Network Solutions

Arkuda Concert. Audio Network Solutions Arkuda Concert Audio Network Solutions How could manufacturers add value to their products and services? Companies aspire to provide services that meet their client`s needs. They try to offer a smart solution

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Distributed Machine Learning Week #9

Machine Learning for Large-Scale Data Analysis and Decision Making A. Distributed Machine Learning Week #9 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Distributed Machine Learning Week #9 Today Distributed computing for machine learning Background MapReduce/Hadoop & Spark Theory

More information

CHAPTER 18: CLIENT COMMUNICATION

CHAPTER 18: CLIENT COMMUNICATION CHAPTER 18: CLIENT COMMUNICATION Chapter outline When to communicate with clients What modes of communication to use How much to communicate How to benefit from client communication Understanding your

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

Writing Reports with Report Designer and SSRS 2014 Level 1

Writing Reports with Report Designer and SSRS 2014 Level 1 Writing Reports with Report Designer and SSRS 2014 Level 1 Duration- 2days About this course In this 2-day course, students are introduced to the foundations of report writing with Microsoft SQL Server

More information

Computational Optimization ISE 407. Lecture 16. Dr. Ted Ralphs

Computational Optimization ISE 407. Lecture 16. Dr. Ted Ralphs Computational Optimization ISE 407 Lecture 16 Dr. Ted Ralphs ISE 407 Lecture 16 1 References for Today s Lecture Required reading Sections 6.5-6.7 References CLRS Chapter 22 R. Sedgewick, Algorithms in

More information

3.1 Basic Definitions and Applications

3.1 Basic Definitions and Applications Chapter 3 Graphs Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved. 1 3.1 Basic Definitions and Applications Undirected Graphs Undirected graph. G = (V, E) V = nodes. E

More information

MERGE SORT SYSTEM IJIRT Volume 1 Issue 7 ISSN:

MERGE SORT SYSTEM IJIRT Volume 1 Issue 7 ISSN: MERGE SORT SYSTEM Abhishek, Amit Sharma, Nishant Mishra Department Of Electronics And Communication Dronacharya College Of Engineering, Gurgaon Abstract- Given an assortment with n rudiments, we dearth

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

Apache Kylin. OLAP on Hadoop

Apache Kylin. OLAP on Hadoop Apache Kylin OLAP on Hadoop Agenda What s Apache Kylin? Tech Highlights Performance Roadmap Q & A http://kylin.io What s Kylin kylin / ˈkiːˈlɪn / 麒麟 --n. (in Chinese art) a mythical animal of composite

More information

Performance and Scalability with Griddable.io

Performance and Scalability with Griddable.io Performance and Scalability with Griddable.io Executive summary Griddable.io is an industry-leading timeline-consistent synchronized data integration grid across a range of source and target data systems.

More information

CrowdPath: A Framework for Next Generation Routing Services using Volunteered Geographic Information

CrowdPath: A Framework for Next Generation Routing Services using Volunteered Geographic Information CrowdPath: A Framework for Next Generation Routing Services using Volunteered Geographic Information Abdeltawab M. Hendawi, Eugene Sturm, Dev Oliver, Shashi Shekhar hendawi@cs.umn.edu, sturm049@umn.edu,

More information

CHRIS Introduction Guide

CHRIS Introduction Guide 1 Introduction... 3 1.1 The Login screen... 3 1.2 The itrent Home page... 5 1.2.1 Out of Office... 8 1.2.2 Default User Preferences... 9 1.2.3 Bookmarks... 10 1.3 The itrent Screen... 11 The Control Bar...

More information