Database Learning: Toward a Database that Becomes Smarter Over Time
1 Database Learning: Toward a Database that Becomes Smarter Over Time
Yongjoo Park, Ahmad Shahab Tajik, Michael Cafarella, Barzan Mozafari
University of Michigan, Ann Arbor
7 Today's databases
Users send a query to the database, and the database returns the answer to the query.
After answering queries, the work is gone.
Our goal: reuse the work.
18 Our high-level approach
Without database learning: users send a query Q to the AQP engine, which returns an approximate answer A (10% err, 1 sec).
With database learning: Database Learning sits between the users and the AQP engine and maintains a query synopsis. It passes Q to the AQP engine, receives the raw answer A (10% err), and returns an improved answer Â (2% err) to the users.
[Plot: Error (%) vs. Time (sec), comparing the AQP engine with database learning]
23 Technical challenges
Queries use the data in different columns/rows.
How to leverage those queries for future queries?
31 Our idea
The underlying data is initially unknown (?). Each query and its answer, first (Q1, A1), then (Q2, A2), then more queries and answers, fills in part of the picture.
37 Concrete example
[Figure: SUM(count), 20M to 40M, vs. Week Number, in three panels. Legend: true data; ranges observed by past queries; model (with 95% confidence interval).]
40 Design goals
1 Support a wide class of SQL queries, for example:
select X3, avg(y1) from t where 5 < X1 < 8;
select sum(y2) from t where X2 between Apr and May group by X3;
2 No assumptions about data
3 Lightweight [Plot: latency, BlinkDB vs. DBL]
41 Our Approach
44 Problem statement
Problem: Given past queries (q_1, ..., q_n), a new query (q_{n+1}), and their approximate answers, find the most likely answer to the new query (q_{n+1}) and its estimated error.
Our result: Under a certain model assumption, our answer's error bound ≤ the original answer's error bound (in practice, much more accurate), provided the error bounds give the same probabilistic guarantees.
50 Overview of our technique
Past queries:
select count(y2) from t where 1 < X1 < 2;
select avg(y2) from t where 6 < X1 < 8;
New query:
select sum(y2) from t where 5 < X1 < 8;
1 Random variables (our uncertainty on answers): θ1, θ2, θ3
2 Probability distribution: Pr(θ1, θ2, θ3). Two aggregations involve common values, so their answers are correlated.
3 Estimated answer: Pr(θ3 | θ1, θ2)
56 How to define random variables
We define a random variable θ for every combination of:
Aggregate function
Selection predicates
Example: select sum(y2) from t where 5 < X1 < 8;
What if your query is complex?
select X3, avg(y1), sum(y2) from t where 5 < X1 < 8 and X2 between Apr and May group by X3;
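The slide leaves the complex-query question open. One plausible reading, offered here only as an assumption, is that a complex query is split so that each (aggregate function, selection predicates) pair still gets its own random variable, with each group-by value turned into an extra equality predicate. The sketch below illustrates that reading; `RandomVariableKey`, `keys_for_query`, and the group values 'a' and 'b' are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class RandomVariableKey:
    """Identifies one random variable: an aggregate plus its predicates."""
    aggregate: str
    predicates: Tuple[str, ...]


def keys_for_query(
    aggregates: Sequence[str],
    predicates: Sequence[str],
    group_column: Optional[str] = None,
    group_values: Sequence[str] = (),
) -> List[RandomVariableKey]:
    """Return one key per (aggregate, predicate set) combination.

    Assumption: a group-by is handled by adding one equality predicate
    per group value, so each output cell gets its own random variable.
    """
    base = tuple(sorted(predicates))  # sort so predicate order does not matter
    if group_column is None:
        return [RandomVariableKey(agg, base) for agg in aggregates]
    keys = []
    for agg, value in product(aggregates, group_values):
        preds = tuple(sorted(base + (f"{group_column} = {value!r}",)))
        keys.append(RandomVariableKey(agg, preds))
    return keys


# Simple query from the slide: one aggregate, one predicate, one variable.
simple = keys_for_query(["sum(y2)"], ["5 < X1 < 8"])

# Complex query from the slide, with two hypothetical X3 values.
complex_keys = keys_for_query(
    ["avg(y1)", "sum(y2)"],
    ["5 < X1 < 8", "X2 between 'Apr' and 'May'"],
    group_column="X3",
    group_values=["a", "b"],
)
print(len(simple), len(complex_keys))
```

Because the keys are frozen dataclasses, they are hashable and can index a dictionary of past answers directly.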
64 How to determine the probability distribution
Given statistical info about (θ1, θ2, θ3), the Principle of Maximum Entropy (ME) yields the most-likely Pr(θ1, θ2, θ3).
Low amount of info: simple Pr, fast inference, low fidelity
High amount of info: complex Pr, slow inference, high fidelity
Our choice: (co)variances between pairs of answers
68 Most-likely probability distribution
Statistical information on θ1, θ2, θ3: means, variances, covariances
MaxEnt gives a multivariate normal distribution
Fast inference using a closed form
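The closed form on this slide can be illustrated with the standard conditioning rule for a multivariate normal. This is a minimal sketch, assuming the past answers are observed exactly (real past answers are themselves approximate, which this sketch ignores). The function name `condition_gaussian` and every number below are illustrative and not taken from the talk.

```python
import numpy as np


def condition_gaussian(mean, cov, observed_idx, observed_values, target_idx):
    """Mean and variance of theta[target_idx] given theta[observed_idx].

    Uses the textbook identities for a joint normal:
      E[t | o]   = mu_t + S_to S_oo^{-1} (o - mu_o)
      Var[t | o] = S_tt - S_to S_oo^{-1} S_ot
    """
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    obs = list(observed_idx)
    values = np.asarray(observed_values, dtype=float)

    cov_oo = cov[np.ix_(obs, obs)]   # covariance among observed answers
    cov_to = cov[target_idx, obs]    # covariance of target with observed

    # Solve a linear system instead of forming an explicit inverse.
    weights = np.linalg.solve(cov_oo, cov_to)

    cond_mean = mean[target_idx] + weights @ (values - mean[obs])
    cond_var = cov[target_idx, target_idx] - weights @ cov_to
    return float(cond_mean), float(cond_var)


# Illustrative prior over (theta1, theta2, theta3); theta2 and theta3 are
# correlated because their predicates overlap, theta1 is independent.
prior_mean = [10.0, 20.0, 30.0]
prior_cov = [[4.0, 0.0, 0.0],
             [0.0, 9.0, 6.0],
             [0.0, 6.0, 16.0]]

est, var = condition_gaussian(prior_mean, prior_cov, [0, 1], [12.0, 23.0], 2)
half_width = 1.96 * var ** 0.5  # 95% interval half-width under normality
print(f"{est:.2f} {var:.2f}")   # prints "32.00 12.00"
```

In this example the observation of θ2 moves the estimate of θ3 from 30 to 32 and shrinks its variance from 16 to 12. Conditioning a normal on further observations never increases the variance, which matches the direction of the error-bound result in the problem statement.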
73 Benefits of database learning
1 Little storage overhead (database learning vs. indexing) [Plot: storage vs. database size, Indexing vs. DBL]
2 Without alignment (database learning vs. materialized view) [Plot axis: date]
3 No upfront overhead [Plot: overhead vs. system uptime, DBL vs. view selection]
74 Experiment
77 Experiment setup
1 Two systems:
NoLearn: approximate query processing engine (the longer the runtime, the more accurate the answer)
Verdict: our database learning system (on top of NoLearn)
2 Datasets:
Customer1: 536GB data and query log from a customer
TPC-H: 100GB TPC-H dataset
3 Environment:
5 Amazon EC2 workers (m4.2xlarge) + 1 master
SSD-backed HDFS for Spark's data loading
80 Our experimental claims
1 Verdict supports a large portion of real-world queries
2 Verdict achieves speedup compared to NoLearn
3 Verdict works with small memory and computational overhead
81 Generality of Verdict
Dataset # Analyzed # Supported Percentage
Customer1 3,342 2, %
TPC-H %
Unsupported queries:
1 Nested queries (that cannot be flattened)
2 Textual filters: city like '%arbor%'
83 Runtime-error trade-off
Results on the TPC-H dataset (the paper has the Customer1 results). Number of past queries fixed to 50.
[Figure: error bound (%) vs. runtime for NoLearn and Verdict. (a) Data in memory, runtime in sec; (b) Data on SSD, runtime in min.]
85 Speedup
Results on the Customer1 dataset (the paper has the TPC-H results).
[Figure: speedup (x) vs. target error bound. (a) Data in memory; (b) Data on SSD.]
89 Memory and computational overhead
1 Memory overhead: queries and their answers, some matrices and their inverses
232 KB per query for the Customer1 dataset
158 KB per query for the TPC-H dataset
2 Computational overhead:
          Latency for memory    Latency for SSD
NoLearn   2.083 sec             52.50 sec
Verdict   2.093 sec             52.51 sec
Overhead  0.010 sec (0.48%)     0.010 sec (0.02%)
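The overhead percentages can be checked with simple arithmetic. The sketch below assumes the latencies read as 2.083 s and 2.093 s in memory and 52.50 s and 52.51 s on SSD; the transcript dropped the decimal points, and this reading is the one that reproduces the 0.48% and 0.02% figures on the slide. The helper name `overhead` is illustrative.

```python
def overhead(no_learn_sec: float, verdict_sec: float):
    """Return (extra seconds, extra time as a percentage of NoLearn)."""
    extra = verdict_sec - no_learn_sec
    return extra, 100.0 * extra / no_learn_sec


# Assumed decimal placement; see the note above.
mem_extra, mem_pct = overhead(2.083, 2.093)
ssd_extra, ssd_pct = overhead(52.50, 52.51)

print(f"memory: {mem_extra:.3f} sec ({mem_pct:.2f}%)")  # memory: 0.010 sec (0.48%)
print(f"ssd:    {ssd_extra:.3f} sec ({ssd_pct:.2f}%)")  # ssd:    0.010 sec (0.02%)
```

The absolute cost is the same 0.010 sec in both settings, so the relative overhead shrinks as the underlying query gets slower.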
90 Thank You!
Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There
More informationApproxJoin: Approximate Distributed Joins
Approximate Distributed Joins Do Le Quoc, Istemi Ekin Akkus, Pramod Bhatotia, Spyros Blanas #, Ruichuan Chen, Christof Fetzer, Thorsten Strufe TU Dresden, Germany Nokia Bell Labs, Germany The University
More informationNear-Data Processing for Differentiable Machine Learning Models
Near-Data Processing for Differentiable Machine Learning Models Hyeokjun Choe 1, Seil Lee 1, Hyunha Nam 1, Seongsik Park 1, Seijoon Kim 1, Eui-Young Chung 2 and Sungroh Yoon 1,3 1 Electrical and Computer
More informationDremel: Interac-ve Analysis of Web- Scale Datasets
Dremel: Interac-ve Analysis of Web- Scale Datasets Google Inc VLDB 2010 presented by Arka BhaEacharya some slides adapted from various Dremel presenta-ons on the internet The Problem: Interactive data
More informationCompSci 516: Database Systems
CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and
More informationLiangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison*
Tracking Trends: Incorporating Term Volume into Temporal Topic Models Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison* Dept. of Computer Science and Engineering, Lehigh University, Bethlehem, PA,
More informationRed Fox: An Execution Environment for Relational Query Processing on GPUs
Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia
More informationCreating Probabilistic Databases from Information Extraction Models
Creating Probabilistic Databases from Information Extraction Models Rahul Gupta, Sunita Sarawagi Presented by Guozhang Wang DB Lunch, April 13 rd, 2009 Several slides are from the authors Outline Problem
More informationIV Statistical Modelling of MapReduce Joins
IV Statistical Modelling of MapReduce Joins In this chapter, we will also explain each component used while constructing our statistical model such as: The construction of the dataset used. The use of
More informationEfficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights
Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights Feng Zhang, Jidong Zhai, Xipeng Shen #, Onur Mutlu, Wenguang Chen Renmin University of China Tsinghua University
More informationToday s content. Resilient Distributed Datasets(RDDs) Spark and its data model
Today s content Resilient Distributed Datasets(RDDs) ------ Spark and its data model Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing -- Spark By Matei Zaharia,
More informationAdvanced Data Management Technologies
ADMT 2017/18 Unit 13 J. Gamper 1/42 Advanced Data Management Technologies Unit 13 DW Pre-aggregation and View Maintenance J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Acknowledgements:
More informationSinbad. Leveraging Endpoint Flexibility in Data-Intensive Clusters. Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica. UC Berkeley
Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica UC Berkeley Communication is Crucial for Analytics at Scale Performance Facebook analytics
More informationDesign Tools for HPC SoC Challenges, Opportunities, or Business as Usual?
Design Tools for HPC SoC Challenges, Opportunities, or Business as Usual? X. Sharon Hu Department of Science and Engineering University of Notre Dame To SoC, or not to SoC If HPC does not adopt SoC design,
More informationBased on Big Data: Hype or Hallelujah? by Elena Baralis
Based on Big Data: Hype or Hallelujah? by Elena Baralis http://dbdmg.polito.it/wordpress/wp-content/uploads/2010/12/bigdata_2015_2x.pdf 1 3 February 2010 Google detected flu outbreak two weeks ahead of
More informationSensor Tasking and Control
Sensor Tasking and Control Outline Task-Driven Sensing Roles of Sensor Nodes and Utilities Information-Based Sensor Tasking Joint Routing and Information Aggregation Summary Introduction To efficiently
More informationAn efficient map-reduce algorithm for spatio-temporal analysis using Spark (GIS Cup)
Rensselaer Polytechnic Institute Universidade Federal de Viçosa An efficient map-reduce algorithm for spatio-temporal analysis using Spark (GIS Cup) Prof. Dr. W Randolph Franklin, RPI Salles Viana Gomes
More informationHow to Price a House
How to Price a House An Interpretable Bayesian Approach Dustin Lennon dustin@inferentialist.com Inferentialist Consulting Seattle, WA April 9, 2014 Introduction Project to tie up loose ends / came out
More informationData Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich
Data Modeling and Databases Ch 10: Query Processing - Algorithms Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Transactions (Locking, Logging) Metadata Mgmt (Schema, Stats) Application
More informationInterruptible Tasks: Treating Memory Pressure as Interrupts for Highly Scalable Data-Parallel Programs
Interruptible s: Treating Pressure as Interrupts for Highly Scalable Data-Parallel Programs Lu Fang 1, Khanh Nguyen 1, Guoqing(Harry) Xu 1, Brian Demsky 1, Shan Lu 2 1 University of California, Irvine
More informationAgenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache
Databases on AWS 2017 Amazon Web Services, Inc. and its affiliates. All rights served. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon Web Services,
More informationData Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich
Data Modeling and Databases Ch 9: Query Processing - Algorithms Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Transactions (Locking, Logging) Metadata Mgmt (Schema, Stats) Application
More informationLightweight Streaming-based Runtime for Cloud Computing. Shrideep Pallickara. Community Grids Lab, Indiana University
Lightweight Streaming-based Runtime for Cloud Computing granules Shrideep Pallickara Community Grids Lab, Indiana University A unique confluence of factors have driven the need for cloud computing DEMAND
More informationTime Series Storage with Apache Kudu (incubating)
Time Series Storage with Apache Kudu (incubating) Dan Burkert (Committer) dan@cloudera.com @danburkert Tweet about this talk: @getkudu or #kudu 1 Time Series machine metrics event logs sensor telemetry
More informationShark: SQL and Rich Analytics at Scale. Yash Thakkar ( ) Deeksha Singh ( )
Shark: SQL and Rich Analytics at Scale Yash Thakkar (2642764) Deeksha Singh (2641679) RDDs as foundation for relational processing in Shark: Resilient Distributed Datasets (RDDs): RDDs can be written at
More informationMAPR DATA GOVERNANCE WITHOUT COMPROMISE
MAPR TECHNOLOGIES, INC. WHITE PAPER JANUARY 2018 MAPR DATA GOVERNANCE TABLE OF CONTENTS EXECUTIVE SUMMARY 3 BACKGROUND 4 MAPR DATA GOVERNANCE 5 CONCLUSION 7 EXECUTIVE SUMMARY The MapR DataOps Governance
More information