Database Learning: Toward a Database that Becomes Smarter Over Time

1 Database Learning: Toward a Database that Becomes Smarter Over Time. Yongjoo Park, Ahmad Shahab Tajik, Michael Cafarella, Barzan Mozafari. University of Michigan, Ann Arbor

3 Today's databases: Users send a query to the Database.

5 Today's databases: The Database returns the answer to the query to Users.

7 Today's databases: After answering queries, THE WORK is GONE. Our goal: reuse the work.

10 Our high-level approach: Users send a query Q to the AQP engine and receive an answer A (10% err, 1 sec).

15 Our high-level approach: Database Learning, backed by a Query Synopsis, sits between Users and the AQP engine. It forwards Q to the AQP engine, receives A (10% err), and returns an improved answer Â (2% err) to Users.

18 Our high-level approach: [Plot: Error (%) vs. Time (sec), comparing the AQP engine with Database learning]

23 Technical challenges: Queries use the data in different columns/rows. How to leverage those queries for future queries?

31 Our idea: (Q1, A1); (Q2, A2); more queries and answers

37 Concrete example: [Figure, three panels: SUM(count) (20M to 40M) vs. Week Number. Legend: True data; Ranges observed by past queries; Model (with 95% confidence interval)]

40 Design goals: 1 Support a wide class of SQL queries, e.g. select X3, avg(y1) from t where 5 < X1 < 8; and select sum(y2) from t where X2 between Apr and May group by X3; 2 No assumptions about data. 3 Lightweight. [Plot: latency, BlinkDB vs. DBL]

41 Our Approach

44 Problem statement. Problem: Given past queries (q_1, ..., q_n), a new query (q_{n+1}), and their approximate answers, find the most likely answer to the new query (q_{n+1}) and its estimated error. Our result: Under a certain model assumption, our answer's error bound ≤ original answer's error bound (in practice, much more accurate), if the error bounds provide the same probabilistic guarantees.

50 Overview of our technique: Past queries: select count(y2) from t where 1 < X1 < 2; select avg(y2) from t where 6 < X1 < 8; New query: select sum(y2) from t where 5 < X1 < 8; 1 Random variables (our uncertainty on answers): θ1, θ2, θ3. 2 Probability distribution: Pr(θ1, θ2, θ3). Two aggregations involve common values, hence correlation between answers. 3 Estimated answer: Pr(θ3 | θ1, θ2).
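
Step 3 is ordinary conditioning of the joint distribution on the answers already seen. As a sketch in the slide's notation (treating Pr as a density, and assuming the past answers θ1 and θ2 are the observed quantities):

```latex
\Pr(\theta_3 \mid \theta_1, \theta_2)
  = \frac{\Pr(\theta_1, \theta_2, \theta_3)}{\Pr(\theta_1, \theta_2)},
\qquad
\Pr(\theta_1, \theta_2) = \int \Pr(\theta_1, \theta_2, \theta_3)\, d\theta_3 .
```

The estimated answer and its error would then be read off the center and spread of this conditional distribution.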

56 How to define random variables: We define a random variable θ for every combination of aggregate function and selection predicates, e.g. select sum(y2) from t where 5 < X1 < 8; What if your query is complex? E.g. select X3, avg(y1), sum(y2) from t where 5 < X1 < 8 and X2 between Apr and May group by X3;
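
One way to picture "a random variable for every combination of aggregate function and selection predicates" is to key each variable by that pair. The sketch below is illustrative only: the `Snippet` class, the `decompose` helper, and the group values `'a'`, `'b'`, `'c'` are hypothetical names invented here. Splitting a complex query into one variable per aggregate and per group value is an assumption about how such a query could be handled, not something the slide states.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Snippet:
    """Hypothetical key for one random variable θ."""
    aggregate: str         # e.g. "sum(y2)"
    predicates: frozenset  # selection predicates, as plain strings


def decompose(aggregates, predicates, group_column=None, group_values=()):
    """Return one Snippet per (aggregate, group value) combination.

    Assumption: each group of a GROUP BY is treated as an extra
    equality predicate on the grouping column.
    """
    base = frozenset(predicates)
    if group_column is None:
        return [Snippet(agg, base) for agg in aggregates]
    return [
        Snippet(agg, base | {f"{group_column} = {value!r}"})
        for agg, value in product(aggregates, group_values)
    ]


# The simple query from the slide maps to a single random variable.
simple = decompose(["sum(y2)"], ["5 < X1 < 8"])

# The complex query: two aggregates, two predicates, grouped by X3.
# The three group values are made up for illustration.
complex_query = decompose(
    ["avg(y1)", "sum(y2)"],
    ["5 < X1 < 8", "X2 between Apr and May"],
    group_column="X3",
    group_values=["a", "b", "c"],
)
```

Under these assumptions the complex query yields six variables (two aggregates times three groups), each hashable and usable as a dictionary key.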

64 How to determine the probability distribution: Statistical info of (θ1, θ2, θ3), via the Principle of Maximum Entropy (ME), gives the most-likely Pr(θ1, θ2, θ3). Low amount of info: simple Pr, fast inference, low fidelity. High amount of info: complex Pr, slow inference, high fidelity. Our choice: (co)variances between pairs of answers.

68 Most-likely probability distribution: Statistical information (means, variances, covariances) of θ1, θ2, θ3, under MaxEnt, gives a multivariate normal distribution. Fast inference using a closed form.
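
The closed form mentioned here is presumably the standard conditioning formula for a multivariate normal. The sketch below applies it with NumPy. It is not the authors' implementation: the function name `condition_gaussian`, the zero means, and the covariance values are all made up for illustration, and the past answers are treated as exact observations, which is a simplification.

```python
import numpy as np


def condition_gaussian(mean, cov, observed_idx, observed_values):
    """Condition a multivariate normal on some of its coordinates.

    Returns the mean and covariance of the remaining coordinates,
    using the textbook formulas
        mu_r + S_ro S_oo^{-1} (x_o - mu_o)   and   S_rr - S_ro S_oo^{-1} S_or.
    """
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    obs = np.asarray(observed_idx)
    rest = np.array([i for i in range(len(mean)) if i not in set(observed_idx)])

    s_rr = cov[np.ix_(rest, rest)]
    s_ro = cov[np.ix_(rest, obs)]
    s_oo = cov[np.ix_(obs, obs)]

    # gain = S_ro @ inv(S_oo), computed with a linear solve (S_oo is symmetric).
    gain = np.linalg.solve(s_oo, s_ro.T).T
    diff = np.asarray(observed_values, dtype=float) - mean[obs]

    cond_mean = mean[rest] + gain @ diff
    cond_cov = s_rr - gain @ s_ro.T
    return cond_mean, cond_cov


# Illustrative numbers only: θ3 is correlated with both θ1 and θ2.
prior_mean = [0.0, 0.0, 0.0]
prior_cov = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]

# Observe θ1 = 1 and θ2 = 1, then infer θ3.
cond_mean, cond_cov = condition_gaussian(prior_mean, prior_cov, [0, 1], [1.0, 1.0])
```

With these numbers the conditional mean of θ3 is 1.0 and its variance drops from 1.0 to 0.5. Because the subtracted term is positive semidefinite, conditioning can never increase the variance.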

73 Benefits of database learning: Database learning vs indexing: 1 Little storage overhead [Plot: storage vs. database size, Indexing vs. DBL]. Database learning vs materialized view: 2 Without alignment [Plot over date]; 3 No upfront overhead [Plot: overhead vs. system uptime, DBL vs. view selection].

74 Experiment

77 Experiment setup: 1 Two systems: NoLearn: approximate query processing engine (the longer the runtime, the more accurate the answer); Verdict: our database learning system (on top of NoLearn). 2 Datasets: Customer1: 536GB data and query log from a customer; TPC-H: 100GB TPC-H dataset. 3 Environment: 5 Amazon EC2 workers (m4.2xlarge) + 1 master; SSD-backed HDFS for Spark's data loading.

80 Our experimental claims: 1 Verdict supports a large portion of real-world queries. 2 Verdict achieves speedup compared to NoLearn. 3 Verdict works with small memory and computational overhead.

81 Generality of Verdict: Dataset # Analyzed # Supported Percentage Customer1 3,342 2, % TPC-H % Unsupported queries: 1 Nested queries (that cannot be flattened); 2 Textual filters: city like '%arbor%'

83 Runtime-error trade-off: Results on the TPC-H dataset (the paper has the Customer1 results). Number of past queries fixed to 50. [Plots, NoLearn vs. Verdict: (a) Data in memory, Error bound (%) vs. Runtime (sec); (b) Data on SSD, Error bound (%) vs. Runtime (min)]

85 Speedup: The results on the Customer1 dataset (the paper has the TPC-H results). [Plots: Speedup (x) vs. Target Error Bound; (a) Data in memory, (b) Data on SSD]

89 Memory and computational overhead: 1 Memory overhead: queries and their answers, some matrices and their inverses; 232 KB per query for the Customer1 dataset, 158 KB per query for the TPC-H dataset. 2 Computational overhead: Latency for memory: NoLearn 2.083 sec, Verdict 2.093 sec, overhead 0.010 sec (0.48%). Latency for SSD: NoLearn 52.50 sec, Verdict 52.51 sec, overhead 0.010 sec (0.02%).
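
The overhead percentages can be checked with simple arithmetic. The sketch below reads the latencies as 2.083 s vs. 2.093 s (memory) and 52.50 s vs. 52.51 s (SSD). That decimal placement is an assumption, chosen because it is the reading under which the reported percentages come out consistent. The `overhead` helper is a hypothetical name.

```python
# Latencies in seconds as (NoLearn, Verdict); decimal placement is assumed.
latencies = {
    "memory": (2.083, 2.093),
    "ssd": (52.50, 52.51),
}


def overhead(nolearn_sec, verdict_sec):
    """Return (extra seconds, extra time as a percentage of NoLearn)."""
    extra = verdict_sec - nolearn_sec
    return extra, 100.0 * extra / nolearn_sec


results = {name: overhead(*pair) for name, pair in latencies.items()}

for name, (extra, pct) in results.items():
    print(f"{name}: +{extra:.3f} sec ({pct:.2f}%)")
```

Rounded to two decimals, this gives 0.48% for memory and 0.02% for SSD.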

90 Thank You!


More information

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

Automatic Scaling Iterative Computations. Aug. 7 th, 2012 Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics

More information

Achieving Horizontal Scalability. Alain Houf Sales Engineer

Achieving Horizontal Scalability. Alain Houf Sales Engineer Achieving Horizontal Scalability Alain Houf Sales Engineer Scale Matters InterSystems IRIS Database Platform lets you: Scale up and scale out Scale users and scale data Mix and match a variety of approaches

More information

C=(FS) 2 : Cubing by Composition of Faceted Search

C=(FS) 2 : Cubing by Composition of Faceted Search C=(FS) : Cubing by Composition of Faceted Search Ronny Lempel Dafna Sheinwald IBM Haifa Research Lab Introduction to Multifaceted Search and to On-Line Analytical Processing (OLAP) Intro Multifaceted Search

More information

PowerVault MD3 SSD Cache Overview

PowerVault MD3 SSD Cache Overview PowerVault MD3 SSD Cache Overview A Dell Technical White Paper Dell Storage Engineering October 2015 A Dell Technical White Paper TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS

More information

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters ARTICLE Using Alluxio to Improve the Performance and Consistency of HDFS Clusters Calvin Jia Software Engineer at Alluxio Learn how Alluxio is used in clusters with co-located compute and storage to improve

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

Improving the MapReduce Big Data Processing Framework

Improving the MapReduce Big Data Processing Framework Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM

More information

Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li

Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Raghav Sethi Michael Kaminsky David G. Andersen Michael J. Freedman Goal: fast and cost-effective key-value store Target: cluster-level storage for

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Georgia Institute of Technology: Haicheng Wu, Ifrah Saeed, Sudhakar Yalamanchili LogicBlox Inc.: Daniel Zinn, Martin Bravenboer,

More information

Intelligent Services Serving Machine Learning

Intelligent Services Serving Machine Learning Intelligent Services Serving Machine Learning Joseph E. Gonzalez jegonzal@cs.berkeley.edu; Assistant Professor @ UC Berkeley joseph@dato.com; Co-Founder @ Dato Inc. Contemporary Learning Systems Big Data

More information

Querying Data with Transact SQL

Querying Data with Transact SQL Course 20761A: Querying Data with Transact SQL Course details Course Outline Module 1: Introduction to Microsoft SQL Server 2016 This module introduces SQL Server, the versions of SQL Server, including

More information

Python With Data Science

Python With Data Science Course Overview This course covers theoretical and technical aspects of using Python in Applied Data Science projects and Data Logistics use cases. Who Should Attend Data Scientists, Software Developers,

More information

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao 1 Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao and Shigeru Chiba The University of Tokyo Thanh-Chung Dao 2 Supercomputers Expensive clusters Multi-core

More information

Extracting and Querying Probabilistic Information From Text in BayesStore-IE

Extracting and Querying Probabilistic Information From Text in BayesStore-IE Extracting and Querying Probabilistic Information From Text in BayesStore-IE Daisy Zhe Wang, Michael J. Franklin, Minos Garofalakis 2, Joseph M. Hellerstein University of California, Berkeley Technical

More information

Leveraging Lock Contention to Improve Transaction Applications. University of Washington

Leveraging Lock Contention to Improve Transaction Applications. University of Washington Leveraging Lock Contention to Improve Transaction Applications Cong Yan Alvin Cheung University of Washington 1 Background Database transactions Airline ticket reservation, banking, online shopping...

More information

Programming Systems for Big Data

Programming Systems for Big Data Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There

More information

ApproxJoin: Approximate Distributed Joins

ApproxJoin: Approximate Distributed Joins Approximate Distributed Joins Do Le Quoc, Istemi Ekin Akkus, Pramod Bhatotia, Spyros Blanas #, Ruichuan Chen, Christof Fetzer, Thorsten Strufe TU Dresden, Germany Nokia Bell Labs, Germany The University

More information

Near-Data Processing for Differentiable Machine Learning Models

Near-Data Processing for Differentiable Machine Learning Models Near-Data Processing for Differentiable Machine Learning Models Hyeokjun Choe 1, Seil Lee 1, Hyunha Nam 1, Seongsik Park 1, Seijoon Kim 1, Eui-Young Chung 2 and Sungroh Yoon 1,3 1 Electrical and Computer

More information

Dremel: Interac-ve Analysis of Web- Scale Datasets

Dremel: Interac-ve Analysis of Web- Scale Datasets Dremel: Interac-ve Analysis of Web- Scale Datasets Google Inc VLDB 2010 presented by Arka BhaEacharya some slides adapted from various Dremel presenta-ons on the internet The Problem: Interactive data

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison*

Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison* Tracking Trends: Incorporating Term Volume into Temporal Topic Models Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison* Dept. of Computer Science and Engineering, Lehigh University, Bethlehem, PA,

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia

More information

Creating Probabilistic Databases from Information Extraction Models

Creating Probabilistic Databases from Information Extraction Models Creating Probabilistic Databases from Information Extraction Models Rahul Gupta, Sunita Sarawagi Presented by Guozhang Wang DB Lunch, April 13 rd, 2009 Several slides are from the authors Outline Problem

More information

IV Statistical Modelling of MapReduce Joins

IV Statistical Modelling of MapReduce Joins IV Statistical Modelling of MapReduce Joins In this chapter, we will also explain each component used while constructing our statistical model such as: The construction of the dataset used. The use of

More information

Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights

Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights Feng Zhang, Jidong Zhai, Xipeng Shen #, Onur Mutlu, Wenguang Chen Renmin University of China Tsinghua University

More information

Today s content. Resilient Distributed Datasets(RDDs) Spark and its data model

Today s content. Resilient Distributed Datasets(RDDs) Spark and its data model Today s content Resilient Distributed Datasets(RDDs) ------ Spark and its data model Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing -- Spark By Matei Zaharia,

More information

Advanced Data Management Technologies

Advanced Data Management Technologies ADMT 2017/18 Unit 13 J. Gamper 1/42 Advanced Data Management Technologies Unit 13 DW Pre-aggregation and View Maintenance J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Acknowledgements:

More information

Sinbad. Leveraging Endpoint Flexibility in Data-Intensive Clusters. Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica. UC Berkeley

Sinbad. Leveraging Endpoint Flexibility in Data-Intensive Clusters. Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica. UC Berkeley Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica UC Berkeley Communication is Crucial for Analytics at Scale Performance Facebook analytics

More information

Design Tools for HPC SoC Challenges, Opportunities, or Business as Usual?

Design Tools for HPC SoC Challenges, Opportunities, or Business as Usual? Design Tools for HPC SoC Challenges, Opportunities, or Business as Usual? X. Sharon Hu Department of Science and Engineering University of Notre Dame To SoC, or not to SoC If HPC does not adopt SoC design,

More information

Based on Big Data: Hype or Hallelujah? by Elena Baralis

Based on Big Data: Hype or Hallelujah? by Elena Baralis Based on Big Data: Hype or Hallelujah? by Elena Baralis http://dbdmg.polito.it/wordpress/wp-content/uploads/2010/12/bigdata_2015_2x.pdf 1 3 February 2010 Google detected flu outbreak two weeks ahead of

More information

Sensor Tasking and Control

Sensor Tasking and Control Sensor Tasking and Control Outline Task-Driven Sensing Roles of Sensor Nodes and Utilities Information-Based Sensor Tasking Joint Routing and Information Aggregation Summary Introduction To efficiently

More information

An efficient map-reduce algorithm for spatio-temporal analysis using Spark (GIS Cup)

An efficient map-reduce algorithm for spatio-temporal analysis using Spark (GIS Cup) Rensselaer Polytechnic Institute Universidade Federal de Viçosa An efficient map-reduce algorithm for spatio-temporal analysis using Spark (GIS Cup) Prof. Dr. W Randolph Franklin, RPI Salles Viana Gomes

More information

How to Price a House

How to Price a House How to Price a House An Interpretable Bayesian Approach Dustin Lennon dustin@inferentialist.com Inferentialist Consulting Seattle, WA April 9, 2014 Introduction Project to tie up loose ends / came out

More information

Data Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Data Modeling and Databases Ch 10: Query Processing - Algorithms Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Transactions (Locking, Logging) Metadata Mgmt (Schema, Stats) Application

More information

Interruptible Tasks: Treating Memory Pressure as Interrupts for Highly Scalable Data-Parallel Programs

Interruptible Tasks: Treating Memory Pressure as Interrupts for Highly Scalable Data-Parallel Programs Interruptible s: Treating Pressure as Interrupts for Highly Scalable Data-Parallel Programs Lu Fang 1, Khanh Nguyen 1, Guoqing(Harry) Xu 1, Brian Demsky 1, Shan Lu 2 1 University of California, Irvine

More information

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache Databases on AWS 2017 Amazon Web Services, Inc. and its affiliates. All rights served. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon Web Services,

More information

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Data Modeling and Databases Ch 9: Query Processing - Algorithms Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Transactions (Locking, Logging) Metadata Mgmt (Schema, Stats) Application

More information

Lightweight Streaming-based Runtime for Cloud Computing. Shrideep Pallickara. Community Grids Lab, Indiana University

Lightweight Streaming-based Runtime for Cloud Computing. Shrideep Pallickara. Community Grids Lab, Indiana University Lightweight Streaming-based Runtime for Cloud Computing granules Shrideep Pallickara Community Grids Lab, Indiana University A unique confluence of factors have driven the need for cloud computing DEMAND

More information

Time Series Storage with Apache Kudu (incubating)

Time Series Storage with Apache Kudu (incubating) Time Series Storage with Apache Kudu (incubating) Dan Burkert (Committer) dan@cloudera.com @danburkert Tweet about this talk: @getkudu or #kudu 1 Time Series machine metrics event logs sensor telemetry

More information

Shark: SQL and Rich Analytics at Scale. Yash Thakkar ( ) Deeksha Singh ( )

Shark: SQL and Rich Analytics at Scale. Yash Thakkar ( ) Deeksha Singh ( ) Shark: SQL and Rich Analytics at Scale Yash Thakkar (2642764) Deeksha Singh (2641679) RDDs as foundation for relational processing in Shark: Resilient Distributed Datasets (RDDs): RDDs can be written at

More information

MAPR DATA GOVERNANCE WITHOUT COMPROMISE

MAPR DATA GOVERNANCE WITHOUT COMPROMISE MAPR TECHNOLOGIES, INC. WHITE PAPER JANUARY 2018 MAPR DATA GOVERNANCE TABLE OF CONTENTS EXECUTIVE SUMMARY 3 BACKGROUND 4 MAPR DATA GOVERNANCE 5 CONCLUSION 7 EXECUTIVE SUMMARY The MapR DataOps Governance

More information