Graph Analytics using Vertica Relational Database
1 Graph Analytics using Vertica Relational Database
Alekh Jindal* (Microsoft), Samuel Madden (MIT), Malú Castellanos (Vertica), Meichun Hsu (Vertica)
* work done while at MIT
2 Motivation for graphs on DB
Data is already in a DB
- avoid expensive copying
- end-to-end data analysis
- leverage other DB features
Processing involves full scans and joins
- relational engines could run them efficiently
- particularly suited for column stores
Relational algebra/SQL offers powerful declarative syntax
- in fact, we could express Giraph as an operator DAG
- can even express more complex graph analytics
3 5-point Agenda
1. From graph queries to SQL: how do we make the translation?
2. Graph query optimization: can we leverage decades of relational wisdom?
3. Column store backends: why are they a good choice?
4. Comparison with specialized graph systems: how do the numbers look?
5. Extending column stores: can we do better?
4 1. From Graph to SQL
5 Vertex-centric Graph Queries
Popular language for graph analytics
Vertex programs that run in supersteps and communicate via message passing
10 Vertex-centric Graph Queries
Popular language for graph analytics
Vertex programs that run in supersteps and communicate via message passing
Programmer only specifies a vertex program
System takes care of running it in parallel
[Figure: example graph whose vertex values (0 at one vertex, inf elsewhere) are updated across supersteps]
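The division of labour described on this slide can be sketched in a few lines of Python. This is a toy, single-process illustration and not Giraph's API: `sssp_vertex_program` plays the role of the user-written vertex program, while `run_supersteps` is a hypothetical stand-in for the system that schedules supersteps and delivers messages. Unit edge weights are an assumption.

```python
import math
from collections import defaultdict


def sssp_vertex_program(value, messages):
    """User-supplied vertex program: keep the smallest distance seen so far.

    Returns the (possibly unchanged) value and a flag saying whether it improved.
    """
    best = min(messages, default=math.inf)
    if best < value:
        return best, True
    return value, False


def run_supersteps(edges, source, max_supersteps=50):
    """Hypothetical driver standing in for the graph system (single process)."""
    vertices = {v for e in edges for v in e} | {source}
    out_edges = defaultdict(list)
    for src, dst in edges:
        out_edges[src].append(dst)

    values = {v: math.inf for v in vertices}
    inbox = {source: [0]}  # the source receives distance 0 in the first superstep

    for _ in range(max_supersteps):
        if not inbox:  # no messages in flight: every vertex has halted
            break
        outbox = defaultdict(list)
        for v, msgs in inbox.items():
            values[v], changed = sssp_vertex_program(values[v], msgs)
            if changed:
                # Message passing: tell each out-neighbour about the new distance.
                for dst in out_edges[v]:
                    outbox[dst].append(values[v] + 1)  # unit edge weights assumed
        inbox = outbox  # synchronization barrier between supersteps
    return values
```

Only vertices that received a message are woken up in a superstep, which is the behaviour the relational rewrites later in the deck try to reproduce with joins.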
11 The Giraph Plan
Giraph: a popular, open-source graph analytics system on Hadoop
12 The Giraph Plan
The Giraph physical plan: hard-coded physical execution pipeline
[Figure: pipeline of Input, Superstep(s), and Output stages. Input: HDFS scan, record read, split, and shuffle to workers W1-W4. Each superstep: workers W1-W4 with server data (partition store, edge store, message store), shuffle, and synchronization through the master. Output: cleanup and store to HDFS]
13 The Giraph Plan
Giraph logical query plan using relational operators
[Figure: operator tree over the Vertices (V), Edges (E), and Messages (M) relations, with joins on V.id = E.from and V.id = M.to, producing Modified Vertices and New Messages]
14 Rewriting Logical Giraph Plan
1. Giraph logical query plan
2. Pushing down the UDF
3. Replacing ⋈ by ∪
[Figure: the three operator trees over V, E, and M, with join conditions V.id = E.from and V.id = M.to]
15 Rewriting Logical Giraph Plan: Single Source Shortest Path
Operators in the rewritten plan:
- joins: V1.id = E.to and V2.id = E.from
- aggregation: Γ d' = min(V2.d + 1)
- selection: σ d' < V1.d
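One way to read the shortest-path plan on this slide is as a single SQL statement that is re-run until it returns no rows. The sketch below uses Python's built-in `sqlite3` rather than Vertica, so it only illustrates the query shape. The table and column names (`node(id, value)`, `edge(from_node, to_node)`) follow the names visible in the query plan on slide 27; the `INF` sentinel, unit edge weights, and the driver loop are assumptions of this example.

```python
import sqlite3

INF = 10**9  # sentinel for "not reached yet" (an assumption of this sketch)

# One iteration: join each edge with both endpoints, aggregate the best
# candidate distance per target, and keep only targets that improve.
SSSP_STEP = """
SELECT e.to_node AS id, MIN(n1.value + 1) AS value
FROM edge AS e
JOIN node AS n1 ON n1.id = e.from_node
JOIN node AS n2 ON n2.id = e.to_node
GROUP BY e.to_node
HAVING MIN(n1.value + 1) < MIN(n2.value)
"""


def sssp(edges, source):
    """Run the SQL step until no vertex improves; returns {id: distance}."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edge (from_node INTEGER, to_node INTEGER)")
    con.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, value INTEGER)")
    con.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    ids = sorted({v for e in edges for v in e} | {source})
    con.executemany(
        "INSERT INTO node VALUES (?, ?)",
        [(v, 0 if v == source else INF) for v in ids],
    )
    while True:
        improved = con.execute(SSSP_STEP).fetchall()
        if not improved:
            break
        con.executemany(
            "UPDATE node SET value = ? WHERE id = ?",
            [(value, vid) for vid, value in improved],
        )
    result = dict(con.execute("SELECT id, value FROM node"))
    con.close()
    return result
```

The `HAVING` clause corresponds to the selection σ d' < V1.d, and the `GROUP BY` with `MIN` corresponds to the aggregation Γ d' = min(V2.d + 1).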
16 Rewriting Logical Giraph Plan: Connected Components
Operators in the rewritten plan:
- joins: V1.id = E.to and V2.id = E.from
- aggregation: Γ cc' = min(V2.id)
- selection: σ cc' < V1.cc
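The connected-components plan has the same join and aggregate shape as shortest paths. A minimal sketch with Python's `sqlite3` follows; it is illustrative only and not the authors' Vertica implementation. Two assumptions are worth flagging: each undirected edge is stored in both directions, and the step propagates the neighbour's current label `cc` (initialised to the vertex id, so the first iteration matches min(V2.id) from the slide) so that labels can travel more than one hop. Table and column names are hypothetical.

```python
import sqlite3

# One iteration: every vertex looks at its neighbours' labels and adopts
# the smallest one if it is smaller than its own.
CC_STEP = """
SELECT e.to_node AS id, MIN(n1.cc) AS cc
FROM edge AS e
JOIN node AS n1 ON n1.id = e.from_node
JOIN node AS n2 ON n2.id = e.to_node
GROUP BY e.to_node
HAVING MIN(n1.cc) < MIN(n2.cc)
"""


def connected_components(undirected_edges):
    """Label each vertex with the smallest vertex id in its component."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edge (from_node INTEGER, to_node INTEGER)")
    con.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, cc INTEGER)")
    # Store both directions so labels can flow either way along an edge.
    both = list(undirected_edges) + [(b, a) for a, b in undirected_edges]
    con.executemany("INSERT INTO edge VALUES (?, ?)", both)
    ids = sorted({v for e in undirected_edges for v in e})
    con.executemany("INSERT INTO node VALUES (?, ?)", [(v, v) for v in ids])
    while True:
        improved = con.execute(CC_STEP).fetchall()
        if not improved:
            break
        con.executemany(
            "UPDATE node SET cc = ? WHERE id = ?",
            [(cc, vid) for vid, cc in improved],
        )
    result = dict(con.execute("SELECT id, cc FROM node"))
    con.close()
    return result
```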
17 Rewriting Logical Giraph Plan: PageRank
Operators in the rewritten plan:
- joins: V1.id = E.to and V2.id = E.from
- aggregation: Γ V1.r = 0.15/n + 0.85 * sum(V2.r / V2.outd)
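The PageRank aggregation on this slide maps directly onto a `SUM` over incoming edges. The sketch below again uses Python's `sqlite3` for illustration; the schema (`node(id, r, outd)`, `edge(from_node, to_node)`), the fixed iteration count, and the uniform starting rank are assumptions of the example. Vertices without incoming edges fall back to 0.15/n through the `LEFT JOIN` and `COALESCE`.

```python
import sqlite3

# New rank of every vertex: 0.15/n plus 0.85 times the rank mass
# arriving over its incoming edges.
PAGERANK_STEP = """
SELECT n.id, 0.15 / :n + 0.85 * COALESCE(SUM(n2.r / n2.outd), 0.0) AS r
FROM node AS n
LEFT JOIN edge AS e ON e.to_node = n.id
LEFT JOIN node AS n2 ON n2.id = e.from_node
GROUP BY n.id
"""


def pagerank(edges, iterations=20):
    """Run a fixed number of PageRank iterations; returns {id: rank}."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edge (from_node INTEGER, to_node INTEGER)")
    con.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, r REAL, outd INTEGER)")
    con.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    ids = sorted({v for e in edges for v in e})
    n = len(ids)
    outd = {v: 0 for v in ids}
    for src, _dst in edges:
        outd[src] += 1
    con.executemany(
        "INSERT INTO node VALUES (?, ?, ?)",
        [(v, 1.0 / n, outd[v]) for v in ids],
    )
    for _ in range(iterations):
        # Compute all new ranks from the old ones, then write them back.
        rows = con.execute(PAGERANK_STEP, {"n": n}).fetchall()
        con.executemany(
            "UPDATE node SET r = ? WHERE id = ?",
            [(r, vid) for vid, r in rows],
        )
    result = dict(con.execute("SELECT id, r FROM node"))
    con.close()
    return result
```

Because every vertex gets a new rank in every iteration, this is a case where rewriting the whole table is natural, in contrast to shortest paths where few rows change per step.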
18 Rewriting Logical Giraph Plan
- Hash-based shuffling
- No sorting, as opposed to Hadoop
- Intermediate data is not persisted
- Pushing down the UDF as a table UDF
- Replacing join with union
- Unmodified vertex compute program, e.g. SGD
- Two ways to run it: (1) disk-based iterations, (2) in-memory iterations
[Figure: operator trees labelled "Optimized" and "Unmodified Vertex Compute Program"; the table UDF variants sort and partition by pid over the union of V, E, and M; the PageRank aggregation V1.r = 0.15/n + 0.85 * sum(V2.r / V2.outd) is shown as the optimized example]
19 2. Graph Query Optimization
20 Leveraging Relational Query Optimizers
Multiple rule- or cost-based query rewritings are possible; pick the best one using an optimizer
No hard-coded physical execution plan
Several new optimizations proposed:
- update vs replace
- incremental evaluation
- join elimination
- alternate direction graph exploration
21 Inner Join Update
[Figure: the SSSP output (Node, Value) is inner-joined with the input table (Node, Value) to produce the updated input]
Good for small number of updates!
22 Outer Join Replace
[Figure: the SSSP output (Node, Value) is outer-joined with the input table (Node, Value) to produce a new input table that replaces the old one]
Good for bulk updates!
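The two strategies on these slides can be contrasted in a small sketch. It uses Python's `sqlite3`, so it shows the shape of the statements rather than Vertica's behaviour; the table names `node` and `node_update` and the helper `make_db` are hypothetical. `apply_by_update` touches only the rows that have a match in the update set, whereas `apply_by_replace` builds a fresh table from a left outer join and swaps it in.

```python
import sqlite3


def make_db(nodes, updates):
    """Hypothetical helper: node holds current values, node_update the new ones."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, value INTEGER)")
    con.execute("CREATE TABLE node_update (id INTEGER PRIMARY KEY, value INTEGER)")
    con.executemany("INSERT INTO node VALUES (?, ?)", nodes)
    con.executemany("INSERT INTO node_update VALUES (?, ?)", updates)
    return con


def apply_by_update(con):
    """Inner-join semantics: modify only rows that appear in node_update."""
    con.execute(
        """
        UPDATE node
        SET value = (SELECT u.value FROM node_update AS u WHERE u.id = node.id)
        WHERE id IN (SELECT id FROM node_update)
        """
    )


def apply_by_replace(con):
    """Outer-join semantics: write a complete new table and swap it in."""
    con.execute(
        """
        CREATE TABLE node_new AS
        SELECT n.id AS id, COALESCE(u.value, n.value) AS value
        FROM node AS n
        LEFT OUTER JOIN node_update AS u ON u.id = n.id
        """
    )
    con.execute("DROP TABLE node")
    con.execute("ALTER TABLE node_new RENAME TO node")


def snapshot(con):
    """Return the node table as {id: value}."""
    return dict(con.execute("SELECT id, value FROM node"))
```

Both paths end in the same table contents; the difference is cost. In-place updates pay per modified row, while the replace path pays one sequential rewrite regardless of how many rows changed, which matches the "small number of updates" versus "bulk updates" guidance above.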
23 Incremental Computation
[Figure: alongside the full input table (Node, Value), an incremental input holds only the rows that changed; SSSP runs on the incremental input, its output becomes the new incremental input, and an outer join with the input produces the new input]
24 Incremental Computation
[Figure: the next iteration consumes only the incremental input (the rows changed in the previous iteration); its output becomes the new incremental input and is outer-joined with the input to produce the new input]
Faster Iteration Runtime!
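A minimal sketch of incremental evaluation for shortest paths, again with Python's `sqlite3` and hypothetical table names: the `delta` table plays the role of the incremental input and holds only the vertices whose value changed in the previous iteration, so each step joins the edge table against `delta` instead of the full node table. The `INF` sentinel and unit edge weights are assumptions.

```python
import sqlite3

INF = 10**9  # sentinel for "not reached yet" (an assumption of this sketch)

# Only edges leaving recently changed vertices can produce improvements.
INCREMENTAL_STEP = """
SELECT e.to_node AS id, MIN(d.value + 1) AS value
FROM delta AS d
JOIN edge AS e ON e.from_node = d.id
JOIN node AS n ON n.id = e.to_node
GROUP BY e.to_node
HAVING MIN(d.value + 1) < MIN(n.value)
"""


def incremental_sssp(edges, source):
    """Returns ({id: distance}, sizes of the incremental input per iteration)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edge (from_node INTEGER, to_node INTEGER)")
    con.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, value INTEGER)")
    con.execute("CREATE TABLE delta (id INTEGER PRIMARY KEY, value INTEGER)")
    con.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    ids = sorted({v for e in edges for v in e} | {source})
    con.executemany(
        "INSERT INTO node VALUES (?, ?)",
        [(v, 0 if v == source else INF) for v in ids],
    )
    delta = [(source, 0)]  # initially only the source has "changed"
    delta_sizes = []
    while delta:
        delta_sizes.append(len(delta))
        con.execute("DELETE FROM delta")
        con.executemany("INSERT INTO delta VALUES (?, ?)", delta)
        # The improved rows are both the updates and the next incremental input.
        delta = con.execute(INCREMENTAL_STEP).fetchall()
        con.executemany(
            "UPDATE node SET value = ? WHERE id = ?",
            [(value, vid) for vid, value in delta],
        )
    result = dict(con.execute("SELECT id, value FROM node"))
    con.close()
    return result, delta_sizes
```

The returned `delta_sizes` list makes the benefit visible: each iteration reads a handful of changed rows rather than every vertex.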
25 3. Column Store Backends
26 Why could column stores be a good choice?
Modern column stores provide several features
- physical design
- join optimizations
- query pipelining
- intra-query parallelism
For more details, pick your favorite column store papers:
- MonetDB [Database Architecture Evolution: Mammals Flourished long before Dinosaurs became Extinct, Peter A. Boncz et al., PVLDB 2009.]
- C-Store [C-Store: A Column-oriented DBMS, Mike Stonebraker et al., VLDB 2005.]
- Vertica [The Vertica Analytic Database: C-Store 7 Years Later, Andrew Lamb et al., VLDB 2012.]
27 Illustration: Vertica Query Plan for SSSP
- Early filtering using sideways information passing
- Fully pipelined query execution
- Picks the right join strategies, e.g. broadcast
Plan operators:
Root OutBlk=[UncTuple(2)]
NewEENode OutBlk=[UncTuple(2)]
ExprEval: e.to_node, <SVAR>
Recv from: node0,node1,node2,node3
Send to: node0
Join: Hash-Join: using twitter_edge and twitter_node_b0
ScanStep: twitter_edge; SIP2(HashJoin): e.from_node; SIP1(MergeJoin): e.to_node; to_node (not emitted), from_node
FilterStep: (<SVAR> < <SVAR>)
GroupByPipe: keys; Aggs: min((n1.value + 1)), min(n2.value)
StorageMergeStep: twitter_edge; sorted
GroupByPipe: keys; Aggs: min((n1.value + 1)), min(n2.value)
ExprEval: e.to_node, (n1.value + 1), n2.value
Join: Merge-Join: using previous join and twitter_node_b0
Recv from: node0,node1,node2,node3
Send to: node0,node1,node2,node3
StorageMergeStep: twitter_node_b0; sorted
ScanStep: twitter_node_b0; id, value
StorageUnionStep: twitter_node_b0
ScanStep: twitter_node_b0; id, value
28 4. Comparison with Specialized Graph Systems
29 Setup
Systems:
- Vertica
- Giraph
- GraphLab
Datasets:
- directed (Twitter, LiveJournal)
- undirected (Youtube, LiveJournal)
Machines:
- 4 machines (2 cores, 48GB memory, .4TB disk)
Data preparation:
- upload time [Vertica: 96s; GraphLab: 472s; Giraph: 268s]
- disk usage [Vertica: 0GB; GraphLab/Giraph: 73GB]
30 Typical Graph Analytics
Twitter graph: 1.4 billion edges, 41.6 million nodes
[Figure: bar chart of time (seconds) for PR, SSSP, and CC on GraphLab, Giraph, and Vertica]
31 Advanced Graph Analytics
Mixing graph and relational queries; multi-hop neighborhood queries
Twitter graph with synthetic metadata
[Table III. Combining graph and relational analysis. Columns: Query Type, Vertica, Giraph, SpeedUp. Rows: Sub-graph Projection & Selection (PR, SSSP); Graph Analysis Aggregation (PR, SSSP); Graph Joins (PR+SSSP). Cell values not recoverable]
[Table: multi-hop neighborhood queries. Columns: Query, Dataset, Vertica, Giraph. Rows: Strong Overlap and Weak Ties, each on Youtube and LiveJournal-undir. Several Giraph entries read "out of memory"; other cell values not recoverable]
Excerpt shown on the slide: "... those nodes which are either very near (path distance less than a given threshold) or are relatively very important (PageRank greater than a given threshold). We compare against Giraph, which allows users to provide custom input/output formats that could be used to perform the projection and selection. We write additional MapReduce jobs for the aggregation and join. Table III shows the result on the Twitter dataset over 4 nodes. We can see that the performance difference between Vertica ..."
[Figure: Improving I/O Performance in Vertica with In-memory Graph Analysis. (a) Comparing different implementations on Vertica (LiveJournal graph): Vertica (SQL + Disk) vs Vertica (UDF + Shared Memory) for PageRank and Shortest Path. (b) Comparing in-memory computation in GraphLab and Vertica (LiveJournal graph) for PR, SSSP, and CC]
32 Detailed Analysis: Cost Breakdown
Twitter graph: 1.4 billion edges, 41.6 million nodes
[Figure (c) Cost Breakdown (Twitter graph): time (seconds) split into iterations and load/store for PR, SSSP, and CC on GraphLab, Giraph, and Vertica]
33 Detailed Analysis: Memory Footprint
Twitter graph: 1.4 billion edges, 41.6 million nodes
[Figure: three panels showing memory size (GB) for GraphLab, Giraph, and Vertica]
34 Detailed Analysis: I/O Footprint
Twitter graph: 1.4 billion edges, 41.6 million nodes
[Figure: read, write, and total I/O for GraphLab, Giraph, and Vertica; a second plot shows read and write against time (ms)]
35 Problem: significantly high I/O
Can we do better?
36 5. Extending Column Stores
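37 Rewriting Graph Query Plan (Yet again!)
1. Disk-based iterations
2. In-memory iterations
[Figure: two plans. (1) Disk-based iterations: a table UDF over the sorted, pid-partitioned union of V, E, and M. (2) In-memory iterations: a table UDF that runs vertex compute with update and synchronization steps over the same sorted, pid-partitioned input]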
38 Trading Memory for I/O
In-memory iterations:
- Loading and keeping data in main memory: no additional I/Os for each iteration
- All iterations run as a single transaction: reduces overheads such as logging, locking, and buffer lookups
- Unmodified vertex program run via table UDFs
- Communication (message passing) via shared memory
[Figure: in-memory iteration plan with a table UDF performing vertex compute, updates, and synchronization over the sorted, pid-partitioned union of V, E, and M]
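The trade-off on this slide can be illustrated with a single-process analogue: read the graph from the database once, run every iteration on in-memory structures, and write the result back once in one transaction. This is only a sketch with Python's `sqlite3`; it does not use Vertica table UDFs or shared memory, and the helper `load_graph`, the schema, and the PageRank constants are assumptions of the example.

```python
import sqlite3


def load_graph(edges):
    """Hypothetical helper: create node(id, r) and edge(from_node, to_node)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edge (from_node INTEGER, to_node INTEGER)")
    con.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, r REAL)")
    con.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    ids = sorted({v for e in edges for v in e})
    con.executemany(
        "INSERT INTO node VALUES (?, ?)", [(v, 1.0 / len(ids)) for v in ids]
    )
    con.commit()
    return con


def pagerank_in_memory(con, iterations=20):
    """Load once, iterate in memory, store once."""
    # Single read of the graph; no further database I/O inside the loop.
    edges = con.execute("SELECT from_node, to_node FROM edge").fetchall()
    rank = dict(con.execute("SELECT id, r FROM node"))
    n = len(rank)
    outd = dict.fromkeys(rank, 0)
    for src, _dst in edges:
        outd[src] += 1

    for _ in range(iterations):
        incoming = dict.fromkeys(rank, 0.0)
        for src, dst in edges:
            incoming[dst] += rank[src] / outd[src]  # "message" from src to dst
        rank = {v: 0.15 / n + 0.85 * incoming[v] for v in rank}

    # One transaction for the single write-back of the final values.
    with con:
        con.executemany(
            "UPDATE node SET r = ? WHERE id = ?",
            [(r, v) for v, r in rank.items()],
        )
    return rank
```

Compared with issuing one SQL statement per iteration, the loop body here never touches the database, which is the effect the slide attributes to keeping data in main memory.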
39 Comparing Different Implementations in Vertica
LiveJournal graph: 69 million edges, 4.8 million nodes
[Figure: time (seconds) for PageRank and Shortest Path with Vertica (UDF + Disk), Vertica (SQL + Disk), and Vertica (UDF + Shared Memory)]
40 Comparison with GraphLab
LiveJournal graph: 69 million edges, 4.8 million nodes
[Figure: time (seconds), split into algorithm time and load/store time, for PR, SSSP, and CC on GraphLab and Vertica]
41 Scaling to larger graphs
Twitter graph: 1.4 billion edges, 41.6 million nodes
[Figure: time (seconds), split into algorithm time and load/store time, for GraphLab (4 nodes), Giraph (4 nodes), Vertica (SQL, 4 nodes), and Vertica (Mem, 1 node)]
42 Conclusion
Efficient graph analytics possible within column stores such as Vertica
- graph queries can be mapped to SQL
- several query optimizations can be applied
- column stores serve as efficient backends
- could extend column stores to trade memory for I/O
The curious case of relational database re-discovery
- repeatedly emerged as the backend for several new data/applications, e.g., XML, RDF, Spatial, Array, etc.
- cycles of branch-innovate-merge-commit
Next time you have a big data problem, try relational databases!
HAWQ: A Massively Parallel Processing SQL Engine in Hadoop Lei Chang, Zhanwei Wang, Tao Ma, Lirong Jian, Lili Ma, Alon Goldshuv Luke Lonergan, Jeffrey Cohen, Caleb Welton, Gavin Sherry, Milind Bhandarkar
More informationWhat happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques
376a. Database Design Dept. of Computer Science Vassar College http://www.cs.vassar.edu/~cs376 Class 16 Query optimization What happens Database is given a query Query is scanned - scanner creates a list
More informationCPSC 421 Database Management Systems. Lecture 19: Physical Database Design Concurrency Control and Recovery
CPSC 421 Database Management Systems Lecture 19: Physical Database Design Concurrency Control and Recovery * Some material adapted from R. Ramakrishnan, L. Delcambre, and B. Ludaescher Agenda Physical
More informationAnnouncements. Database Systems CSE 414. Why compute in parallel? Big Data 10/11/2017. Two Kinds of Parallel Data Processing
Announcements Database Systems CSE 414 HW4 is due tomorrow 11pm Lectures 18: Parallel Databases (Ch. 20.1) 1 2 Why compute in parallel? Multi-cores: Most processors have multiple cores This trend will
More informationFrom Think Like a Vertex to Think Like a Graph. Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson
From Think Like a Vertex to Think Like a Graph Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson Large Scale Graph Processing Graph data is everywhere and growing
More informationModule 4. Implementation of XQuery. Part 0: Background on relational query processing
Module 4 Implementation of XQuery Part 0: Background on relational query processing The Data Management Universe Lecture Part I Lecture Part 2 2 What does a Database System do? Input: SQL statement Output:
More informationPrinciples of Data Management. Lecture #9 (Query Processing Overview)
Principles of Data Management Lecture #9 (Query Processing Overview) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Today s Notable News v Midterm
More informationEvolution From Shark To Spark SQL:
Evolution From Shark To Spark SQL: Preliminary Analysis and Qualitative Evaluation Xinhui Tian and Xiexuan Zhou Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese
More informationDryadLINQ. by Yuan Yu et al., OSDI 08. Ilias Giechaskiel. January 28, Cambridge University, R212
DryadLINQ by Yuan Yu et al., OSDI 08 Ilias Giechaskiel Cambridge University, R212 ig305@cam.ac.uk January 28, 2014 Conclusions Takeaway Messages SQL cannot express iteration Unsuitable for machine learning,
More informationCMSC424: Database Design. Instructor: Amol Deshpande
CMSC424: Database Design Instructor: Amol Deshpande amol@cs.umd.edu Databases Data Models Conceptual representa1on of the data Data Retrieval How to ask ques1ons of the database How to answer those ques1ons
More informationShark: SQL and Rich Analytics at Scale. Reynold Xin UC Berkeley
Shark: SQL and Rich Analytics at Scale Reynold Xin UC Berkeley Challenges in Modern Data Analysis Data volumes expanding. Faults and stragglers complicate parallel database design. Complexity of analysis:
More informationOutline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012
Outline Database Management and Tuning Johann Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Unit 6 April 12, 2012 1 Acknowledgements: The slides are provided by Nikolaus Augsten
More informationIntroduction to Data Management CSE 344
Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Fall 2016 1 HW8 is out Last assignment! Get Amazon credits now (see instructions) Spark with Hadoop Due next wed CSE 344 - Fall 2016
More informationTI2736-B Big Data Processing. Claudia Hauff
TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Patterns Hadoop Ctd. Graphs Giraph Spark Zoo Keeper Spark Learning objectives Implement
More informationAdministrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments
Administrivia Midterm on Thursday 10/18 CS 133: Databases Fall 2018 Lec 12 10/16 Prof. Beth Trushkowsky Assignments Lab 3 starts after fall break No problem set out this week Goals for Today Cost-based
More informationiihadoop: an asynchronous distributed framework for incremental iterative computations
DOI 10.1186/s40537-017-0086-3 RESEARCH Open Access iihadoop: an asynchronous distributed framework for incremental iterative computations Afaf G. Bin Saadon * and Hoda M. O. Mokhtar *Correspondence: eng.afaf.fci@gmail.com
More informationTime Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix
Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix Carlos Ordonez, Yiqun Zhang Department of Computer Science, University of Houston, USA Abstract. We study the serial and parallel
More informationToday s Papers. Architectural Differences. EECS 262a Advanced Topics in Computer Systems Lecture 17
EECS 262a Advanced Topics in Computer Systems Lecture 17 Comparison of Parallel DB, CS, MR and Jockey October 30 th, 2013 John Kubiatowicz and Anthony D. Joseph Electrical Engineering and Computer Sciences
More informationCSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 8 - Data Warehousing and Column Stores
CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 8 - Data Warehousing and Column Stores Announcements Shumo office hours change See website for details HW2 due next Thurs
More informationEfficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management
More information