Dremel: Interactive Analysis of Web- Scale Datasets

Size: px
Start display at page:

Download "Dremel: Interactive Analysis of Web- Scale Datasets"

Transcription

1 Dremel: Interactive Analysis of Web- Scale Datasets S. Melnik, A. Gubarev, J. Long, G. Romer, S. Shivakumar, M. Tolton Google Inc. VLDB 200 Presented by Ke Hong (slide adapted from Melnik s)

2 Outline Problem Motivation Nested Columnar Storage Query Execution Evaluation Summary 2

3 Interactive Query Run a MapReduce to extract billions of signals from web pages Launch Dremel to execute several interactive commands and extract an irregularity in signal DEFINE TABLE t AS /path/to/data/* SELECT TOP(signal, 00), COUNT(*) FROM t Feed the incoming input data into another MapReduce pipeline 3

4 Problem How to do SQL-like aggregation queries On trillion-row tables On petabytes of data Using thousands of CPUs Executed within seconds Serving thousands of concurrent users 4

5 5 Motivation Analysis of crawled web documents Tracking install data for applications on Android Market Crash reporting for Google products Nested Columnar Storage OCR results from Google Books Spam analysis Debugging of map tiles on Google Maps Tablet migrations in managed Bigtable instances Results of tests run on Google's distributed build system Disk I/O statistics for hundreds of thousands of disks Resource monitoring for jobs run in Google's data centers Symbols and dependencies in Google's codebase

6 Motivation Parallel DBMS Limited scalability Limited interactive speed Not fault tolerant Translating SQL into MapReduce jobs HadoopDB, Pig Latin, Hive One job takes thousands of seconds Hard to support multi-user 6

7 Dremel Data Layout Nested structure τ = dom <A :τ[*?],..., A n :τ[*?]> τ as an atomic or record type * as repeated fields? as optional fields C B * * A D * r E r r r 2 r r 2 7 r 2 row-oriented Read less, cheaper decompression r 2 column-oriented

8 Row vs. Column sec Execution time on 3000 nodes to process an 85 billion record table 87 TB 0.5 TB 0.5 TB 8

9 9 Nested Data Model message Document { multiplicity: required int64 DocId; [,] optional group Links { repeated int64 Backward; [0,*] repeated int64 Forward; } repeated group { repeated group { required string Code; optional string Country; [0,] } optional string Url; } } DocId: 0 Links Forward: 20 Forward: 40 Forward: 60 Code: 'en-us' Country: 'us' Code: 'en' Url: ' Url: ' Code: 'en-gb' Country: 'gb' DocId: 20 Links Backward: 0 Backward: 30 Forward: 80 Url: '

10 Column-Striped Representation DocId.Url Links.Forward Links.Backward value r d value r d value r d value r d NULL NULL Code value r d en-us 0 2 en 2 2 NULL en-gb 2 NULL 0..Country value r d us 0 3 NULL 2 2 NULL gb 3 NULL 0

11 Repetition & Definition Level..Code value r d en-us 0 2 en NULL en-gb NULL r...code: 'en-us' r.. 2.Code: 'en' r. 2 r. 3..Code: 'en-gb' r 2. : common prefix DocId: 0 Links Forward: 20 Forward: 40 Forward: 60 Code: 'en-us' Country: 'us' Code: 'en' Url: ' Url: ' Code: 'en-gb' Country: 'gb' Repetition and definition levels encode the structural delta between the current value and the previous value. r: length of common path prefix d: length of the current path DocId: 20 Links Backward: 0 Backward: 30 Forward: 80 Url: '

12 Repetition & Definition Level..Code value r d en-us 0 2 en 2 2 NULL en-gb NULL r...code: 'en-us' r.. 2.Code: 'en' r. 2 r. 3..Code: 'en-gb' r 2. : common prefix DocId: 0 Links Forward: 20 Forward: 40 Forward: 60 Code: 'en-us' Country: 'us' Code: 'en' Url: ' Url: ' Code: 'en-gb' Country: 'gb' 2 Repetition and definition levels encode the structural delta between the current value and the previous value. r: length of common path prefix d: length of the current path DocId: 20 Links Backward: 0 Backward: 30 Forward: 80 Url: '

13 Repetition & Definition Level..Code value r d en-us 0 2 en 2 2 NULL en-gb NULL r...code: 'en-us' r.. 2.Code: 'en' r. 2 r. 3..Code: 'en-gb' r 2. : common prefix DocId: 0 Links Forward: 20 Forward: 40 Forward: 60 Code: 'en-us' Country: 'us' Code: 'en' Url: ' Url: ' Code: 'en-gb' Country: 'gb' 3 Repetition and definition levels encode the structural delta between the current value and the previous value. r: length of common path prefix d: length of the current path DocId: 20 Links Backward: 0 Backward: 30 Forward: 80 Url: '

14 Repetition & Definition Level..Code value r d en-us 0 2 en 2 2 NULL en-gb 2 NULL r...code: 'en-us' r.. 2.Code: 'en' r. 2 r. 3..Code: 'en-gb' r 2. : common prefix DocId: 0 Links Forward: 20 Forward: 40 Forward: 60 Code: 'en-us' Country: 'us' Code: 'en' Url: ' Url: ' Code: 'en-gb' Country: 'gb' 4 Repetition and definition levels encode the structural delta between the current value and the previous value. r: length of common path prefix d: length of the current path DocId: 20 Links Backward: 0 Backward: 30 Forward: 80 Url: '

15 Repetition & Definition Level..Code value r d en-us 0 2 en 2 2 NULL en-gb 2 NULL 0 r...code: 'en-us' r.. 2.Code: 'en' r. 2 r. 3..Code: 'en-gb' r 2. : common prefix DocId: 0 Links Forward: 20 Forward: 40 Forward: 60 Code: 'en-us' Country: 'us' Code: 'en' Url: ' Url: ' Code: 'en-gb' Country: 'gb' 5 Repetition and definition levels encode the structural delta between the current value and the previous value. r: length of common path prefix d: length of the current path DocId: 20 Links Backward: 0 Backward: 30 Forward: 80 Url: '

16 Record Assembly Transitions labeled with repetition levels DocId 0 0 Links.Backward 0 Links.Forward..Code 0,, 2 2..Country.Url 0 0, 6

17 Record Assembly Simpler FSM to retrieve a subset of fields Cheaper to execute Useful for queries like / N a m e [ 2 ] / L a n g u a g e [ ] / C o u n t r y,2 DocId 0..Country 0 s DocId: 0 Country: 'us' Country: 'gb' 7 DocId: 20 s 2

18 Query Execution SELECT DocId AS Id, COUNT(..Code) WITHIN AS Cnt,.Url + ',' +..Code AS Str FROM t WHERE REGEXP(.Url, '^http') AND DocId < 20; 8 Output table Id: 0 t Cnt: 2 Str: ' Str: ' Cnt: 0 Output schema message QueryResult { required int64 Id; repeated group { optional uint64 Cnt; repeated group { optional string Str; } } }

19 Multi-Level Serving Tree client Parallelize scheduling and aggregation root server Mostly one-pass aggregation Well suited for aggregation queries returning small and medium-sized results (<M records) Common class of interactive queries intermediate servers leaf servers (with local storage) 9 storage layer (e.g., GFS)

20 Multi-Level Serving Tree client Receive a query from client, determine all tablets root server 20

21 Multi-Level Serving Tree client Rewrite the query and route it to intermediate servers root server intermediate servers 2

22 Multi-Level Serving Tree client root server Rewrite the query into a set of partial queries and assign them to leaf servers intermediate servers leaf servers (with local storage) 22

23 Multi-Level Serving Tree client root server intermediate servers Access assigned tablets from the storage layer leaf servers (with local storage) 23 storage layer (e.g., GFS)

24 Example: count() 0 SELECT A, COUNT(B) FROM T GROUP BY A T = {/gfs/, /gfs/2,, /gfs/00000} SELECT A, SUM(c) FROM (R UNION ALL R n ) GROUP BY A 24

25 Example: count() 0 SELECT A, COUNT(B) FROM T GROUP BY A T = {/gfs/, /gfs/2,, /gfs/00000} SELECT A, SUM(c) FROM (R UNION ALL R n ) GROUP BY A R R 2 SELECT A, COUNT(B) AS c FROM T GROUP BY A T = {/gfs/,, /gfs/0000} SELECT A, COUNT(B) AS c FROM T 2 GROUP BY A T 2 = {/gfs/000,, /gfs/20000} 25

26 Example: count() 0 SELECT A, COUNT(B) FROM T GROUP BY A T = {/gfs/, /gfs/2,, /gfs/00000} SELECT A, SUM(c) FROM (R UNION ALL R n ) GROUP BY A R R 2 SELECT A, COUNT(B) AS c FROM T GROUP BY A T = {/gfs/,, /gfs/0000} SELECT A, COUNT(B) AS c FROM T 2 GROUP BY A T 2 = {/gfs/000,, /gfs/20000} 3 SELECT A, COUNT(B) AS c FROM T 3 GROUP BY A T 3 = {/gfs/} 26 Data access ops

27 Multi-Level Serving Tree Scan assigned tablets to return partial results to intermediate servers leaf servers (with local storage) 27 storage layer (e.g., GFS)

28 Multi-Level Serving Tree Parallel aggregation of partial results intermediate servers leaf servers (with local storage) 28 storage layer (e.g., GFS)

29 Multi-Level Serving Tree client Return the aggregate query result to client root server intermediate servers leaf servers (with local storage) 29 storage layer (e.g., GFS)

30 Query Dispatcher Dremel as a multi-user system Schedule queries based on priorities and load balancing Fault tolerance When the number of tablets processed in one query is larger than the number of available execution threads on leaf servers Compute a histogram of tablet processing times Reschedule tablet processing on another server if a tablet takes disproportionally long time Internal execution tree as a physical query execution plan 30

31 execution time (sec) Evaluation Scalability of Dremel 3 On a trillion-row table T4 (05TB, 50 fields): SELECT TOP(aid, 20), COUNT(*) FROM T4 WHERE bid = {value} AND cid = {value2} number of leaf servers

32 percentage of queries Evaluation Interactive speed of Dremel Monthly query workload of one 3000-node Dremel instance 32 Most queries complete within 0 sec execution time (sec)

33 Summary Propose a scenario for interactive queries on web-scale datasets Design a columnar storage for Dremel s nested data model Develop SQL-like Dremel s query language Design a multi-level serving tree for efficient query execution Evaluate Dremel s scalability and interactive speed using web-scale workloads 33

Dremel: Interactive Analysis of Web-Scale Database

Dremel: Interactive Analysis of Web-Scale Database Dremel: Interactive Analysis of Web-Scale Database Presented by Jian Fang Most parts of these slides are stolen from here: http://bit.ly/hipzeg What is Dremel Trillion-record, multi-terabyte datasets at

More information

Dremel: Interactive Analysis of Web-Scale Datasets

Dremel: Interactive Analysis of Web-Scale Datasets Dremel: Interactive Analysis of Web-Scale Datasets Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis Presented by: Sameer Agarwal sameerag@cs.berkeley.edu

More information

Dremel: Interac-ve Analysis of Web- Scale Datasets

Dremel: Interac-ve Analysis of Web- Scale Datasets Dremel: Interac-ve Analysis of Web- Scale Datasets Google Inc VLDB 2010 presented by Arka BhaEacharya some slides adapted from various Dremel presenta-ons on the internet The Problem: Interactive data

More information

Dremel: Interactice Analysis of Web-Scale Datasets

Dremel: Interactice Analysis of Web-Scale Datasets Dremel: Interactice Analysis of Web-Scale Datasets By Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis Presented by: Alex Zahdeh 1 / 32 Overview

More information

Google Dremel. Interactive Analysis of Web-Scale Datasets

Google Dremel. Interactive Analysis of Web-Scale Datasets Google Dremel Interactive Analysis of Web-Scale Datasets Summary Introduction Data Model Nested Columnar Storage Query Execution Experiments Implementations: Google BigQuery(Demo) and Apache Drill Conclusions

More information

Apache Drill. Interactive Analysis of Large-Scale Datasets. Tomer Shiran

Apache Drill. Interactive Analysis of Large-Scale Datasets. Tomer Shiran Apache Drill Interactive Analysis of Large-Scale Datasets Tomer Shiran Latency Matters Ad-hoc analysis with interactive tools Real-time dashboards Event/trend detection Network intrusions Fraud Failures

More information

Dremel: Interactive Analysis of Web-Scale Datasets

Dremel: Interactive Analysis of Web-Scale Datasets Dremel: Interactive Analysis of Web-Scale Datasets Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis Google, Inc. {melnik,andrey,jlong,gromer,shiva,mtolton,theov}@google.com

More information

Large Scale OLAP. Yifu Huang. 2014/11/4 MAST Scientific English Writing Report

Large Scale OLAP. Yifu Huang. 2014/11/4 MAST Scientific English Writing Report Large Scale OLAP Yifu Huang 2014/11/4 MAST612117 Scientific English Writing Report 2014 1 Preliminaries OLAP On-Line Analytical Processing Traditional solutions: data warehouses built by parallel databases

More information

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here 2013-11-12 Copyright 2013 Cloudera

More information

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13 Bigtable A Distributed Storage System for Structured Data Presenter: Yunming Zhang Conglong Li References SOCC 2010 Key Note Slides Jeff Dean Google Introduction to Distributed Computing, Winter 2008 University

More information

April Copyright 2013 Cloudera Inc. All rights reserved.

April Copyright 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on

More information

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI 2006 Presented by Xiang Gao 2014-11-05 Outline Motivation Data Model APIs Building Blocks Implementation Refinement

More information

CSE 444: Database Internals. Lectures 26 NoSQL: Extensible Record Stores

CSE 444: Database Internals. Lectures 26 NoSQL: Extensible Record Stores CSE 444: Database Internals Lectures 26 NoSQL: Extensible Record Stores CSE 444 - Spring 2014 1 References Scalable SQL and NoSQL Data Stores, Rick Cattell, SIGMOD Record, December 2010 (Vol. 39, No. 4)

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida)

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida) DE: A Scalable Framework for Efficient Analytics Florin Rusu (University of California, Merced) Alin Dobra (University of Florida) Big Data Analytics Big Data Storage is cheap ($100 for 1TB disk) Everything

More information

References. What is Bigtable? Bigtable Data Model. Outline. Key Features. CSE 444: Database Internals

References. What is Bigtable? Bigtable Data Model. Outline. Key Features. CSE 444: Database Internals References CSE 444: Database Internals Scalable SQL and NoSQL Data Stores, Rick Cattell, SIGMOD Record, December 2010 (Vol 39, No 4) Lectures 26 NoSQL: Extensible Record Stores Bigtable: A Distributed

More information

BigTable: A Distributed Storage System for Structured Data (2006) Slides adapted by Tyler Davis

BigTable: A Distributed Storage System for Structured Data (2006) Slides adapted by Tyler Davis BigTable: A Distributed Storage System for Structured Data (2006) Slides adapted by Tyler Davis Motivation Lots of (semi-)structured data at Google URLs: Contents, crawl metadata, links, anchors, pagerank,

More information

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

CS November 2018

CS November 2018 Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

CS November 2017

CS November 2017 Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines A programming model in Cloud: MapReduce Programming model and implementation for processing and generating large data sets Users specify a map function to generate a set of intermediate key/value pairs

More information

RESTORE: REUSING RESULTS OF MAPREDUCE JOBS. Presented by: Ahmed Elbagoury

RESTORE: REUSING RESULTS OF MAPREDUCE JOBS. Presented by: Ahmed Elbagoury RESTORE: REUSING RESULTS OF MAPREDUCE JOBS Presented by: Ahmed Elbagoury Outline Background & Motivation What is Restore? Types of Result Reuse System Architecture Experiments Conclusion Discussion Background

More information

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

Page 1. Goals for Today Background of Cloud Computing Sources Driving Big Data CS162 Operating Systems and Systems Programming Lecture 24 Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

Bigtable: A Distributed Storage System for Structured Data. Andrew Hon, Phyllis Lau, Justin Ng

Bigtable: A Distributed Storage System for Structured Data. Andrew Hon, Phyllis Lau, Justin Ng Bigtable: A Distributed Storage System for Structured Data Andrew Hon, Phyllis Lau, Justin Ng What is Bigtable? - A storage system for managing structured data - Used in 60+ Google services - Motivation:

More information

Tutorial Outline. Map/Reduce vs. DBMS. MR vs. DBMS [DeWitt and Stonebraker 2008] Acknowledgements. MR is a step backwards in database access

Tutorial Outline. Map/Reduce vs. DBMS. MR vs. DBMS [DeWitt and Stonebraker 2008] Acknowledgements. MR is a step backwards in database access Map/Reduce vs. DBMS Sharma Chakravarthy Information Technology Laboratory Computer Science and Engineering Department The University of Texas at Arlington, Arlington, TX 76009 Email: sharma@cse.uta.edu

More information

BigTable A System for Distributed Structured Storage

BigTable A System for Distributed Structured Storage BigTable A System for Distributed Structured Storage Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber Adapted

More information

The MapReduce Abstraction

The MapReduce Abstraction The MapReduce Abstraction Parallel Computing at Google Leverages multiple technologies to simplify large-scale parallel computations Proprietary computing clusters Map/Reduce software library Lots of other

More information

Apache Drill: interactive query and analysis on large-scale datasets

Apache Drill: interactive query and analysis on large-scale datasets Apache Drill: interactive query and analysis on large-scale datasets Michael Hausenblas, Chief Data Engineer EMEA, MapR NoSQL matters Training Day, 2013-04-25 Agenda Introduction round (15min) Overview

More information

Shark: SQL and Rich Analytics at Scale. Yash Thakkar ( ) Deeksha Singh ( )

Shark: SQL and Rich Analytics at Scale. Yash Thakkar ( ) Deeksha Singh ( ) Shark: SQL and Rich Analytics at Scale Yash Thakkar (2642764) Deeksha Singh (2641679) RDDs as foundation for relational processing in Shark: Resilient Distributed Datasets (RDDs): RDDs can be written at

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Distributed Computations MapReduce. adapted from Jeff Dean s slides

Distributed Computations MapReduce. adapted from Jeff Dean s slides Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)

More information

MapReduce: A Programming Model for Large-Scale Distributed Computation

MapReduce: A Programming Model for Large-Scale Distributed Computation CSC 258/458 MapReduce: A Programming Model for Large-Scale Distributed Computation University of Rochester Department of Computer Science Shantonu Hossain April 18, 2011 Outline Motivation MapReduce Overview

More information

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability

More information

Microsoft Big Data and Hadoop

Microsoft Big Data and Hadoop Microsoft Big Data and Hadoop Lara Rubbelke @sqlgal Cindy Gross @sqlcindy 2 The world of data is changing The 4Vs of Big Data http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data 3 Common

More information

HANA Performance. Efficient Speed and Scale-out for Real-time BI

HANA Performance. Efficient Speed and Scale-out for Real-time BI HANA Performance Efficient Speed and Scale-out for Real-time BI 1 HANA Performance: Efficient Speed and Scale-out for Real-time BI Introduction SAP HANA enables organizations to optimize their business

More information

BigTable: A Distributed Storage System for Structured Data

BigTable: A Distributed Storage System for Structured Data BigTable: A Distributed Storage System for Structured Data Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) BigTable 1393/7/26

More information

CS 102. Big Data. Spring Big Data Platforms

CS 102. Big Data. Spring Big Data Platforms CS 102 Big Data Spring 2016 Big Data Platforms How Big is Big? The Data Data Sets 1000 2 (5.3 MB) Complete works of Shakespeare (text) 1000 3 (~5-500 GB) Your data 1000 4 (10 TB) Library of Congress (text)

More information

Bigtable. Presenter: Yijun Hou, Yixiao Peng

Bigtable. Presenter: Yijun Hou, Yixiao Peng Bigtable Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. OSDI 06 Presenter: Yijun Hou, Yixiao Peng

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

CMSC424: Database Design. Instructor: Amol Deshpande

CMSC424: Database Design. Instructor: Amol Deshpande CMSC424: Database Design Instructor: Amol Deshpande amol@cs.umd.edu Databases Data Models Conceptual representa1on of the data Data Retrieval How to ask ques1ons of the database How to answer those ques1ons

More information

Distributed Systems. 05r. Case study: Google Cluster Architecture. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems. 05r. Case study: Google Cluster Architecture. Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 05r. Case study: Google Cluster Architecture Paul Krzyzanowski Rutgers University Fall 2016 1 A note about relevancy This describes the Google search cluster architecture in the mid

More information

Multi-indexed Graph Based Knowledge Storage System

Multi-indexed Graph Based Knowledge Storage System Multi-indexed Graph Based Knowledge Storage System Hongming Zhu 1,2, Danny Morton 2, Wenjun Zhou 3, Qin Liu 1, and You Zhou 1 1 School of software engineering, Tongji University, China {zhu_hongming,qin.liu}@tongji.edu.cn,

More information

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable References Bigtable: A Distributed Storage System for Structured Data. Fay Chang et. al. OSDI

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

MapReduce & BigTable

MapReduce & BigTable CPSC 426/526 MapReduce & BigTable Ennan Zhai Computer Science Department Yale University Lecture Roadmap Cloud Computing Overview Challenges in the Clouds Distributed File Systems: GFS Data Process & Analysis:

More information

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu University of California, Merced

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu University of California, Merced GLADE: A Scalable Framework for Efficient Analytics Florin Rusu University of California, Merced Motivation and Objective Large scale data processing Map-Reduce is standard technique Targeted to distributed

More information

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1 MapReduce-II September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate the different kind of processes in the MapReduce framework 2. Explain the information kept in the master 3.

More information

Percolator. Large-Scale Incremental Processing using Distributed Transactions and Notifications. D. Peng & F. Dabek

Percolator. Large-Scale Incremental Processing using Distributed Transactions and Notifications. D. Peng & F. Dabek Percolator Large-Scale Incremental Processing using Distributed Transactions and Notifications D. Peng & F. Dabek Motivation Built to maintain the Google web search index Need to maintain a large repository,

More information

Start Working with Parquet!!!!

Start Working with Parquet!!!! My Goal Tonight. Start Working with Parquet!!!! Parquet Query Performance Origin of Parquet Parquet Storage Query Request Usage with Hadoop Tools Customer Examples Topics Parquet Defined Storage & Encoding

More information

Pig A language for data processing in Hadoop

Pig A language for data processing in Hadoop Pig A language for data processing in Hadoop Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Apache Pig: Introduction Tool for querying data on Hadoop

More information

How to survive the Data Deluge: Petabyte scale Cloud Computing

How to survive the Data Deluge: Petabyte scale Cloud Computing How to survive the Data Deluge: Petabyte scale Cloud Computing Gianmarco De Francisci Morales IMT Institute for Advanced Studies Lucca CSE PhD XXIV Cycle 18 Jan 2010 1 Outline Part 1: Introduction What,

More information

Practice and Applications of Data Management CMPSCI 345. Lecture 18: Big Data, Hadoop, and MapReduce

Practice and Applications of Data Management CMPSCI 345. Lecture 18: Big Data, Hadoop, and MapReduce Practice and Applications of Data Management CMPSCI 345 Lecture 18: Big Data, Hadoop, and MapReduce Why Big Data, Hadoop, M-R? } What is the connec,on with the things we learned? } What about SQL? } What

More information

Big Data Hadoop Stack

Big Data Hadoop Stack Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware

More information

big picture parallel db (one data center) mix of OLTP and batch analysis lots of data, high r/w rates, 1000s of cheap boxes thus many failures

big picture parallel db (one data center) mix of OLTP and batch analysis lots of data, high r/w rates, 1000s of cheap boxes thus many failures Lecture 20 -- 11/20/2017 BigTable big picture parallel db (one data center) mix of OLTP and batch analysis lots of data, high r/w rates, 1000s of cheap boxes thus many failures what does paper say Google

More information

CA485 Ray Walshe Google File System

CA485 Ray Walshe Google File System Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage

More information

Processing a Trillion Cells per Mouse Click

Processing a Trillion Cells per Mouse Click Processing a Trillion Cells per Mouse Click Common Sense 13/01 21.3.2013 Alex Hall, Google Zurich Olaf Bachmann, Robert Buessow, Silviu Ganceanu, Marc Nunkesser Outline of the Talk AdSpam team at Google

More information

MapReduce: Simplified Data Processing on Large Clusters

MapReduce: Simplified Data Processing on Large Clusters MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat OSDI 2004 Presented by Zachary Bischof Winter '10 EECS 345 Distributed Systems 1 Motivation Summary Example Implementation

More information

BigTable: A System for Distributed Structured Storage

BigTable: A System for Distributed Structured Storage BigTable: A System for Distributed Structured Storage Jeff Dean Joint work with: Mike Burrows, Tushar Chandra, Fay Chang, Mike Epstein, Andrew Fikes, Sanjay Ghemawat, Robert Griesemer, Bob Gruber, Wilson

More information

CSE 190D Spring 2017 Final Exam Answers

CSE 190D Spring 2017 Final Exam Answers CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join

More information

Unifying Big Data Workloads in Apache Spark

Unifying Big Data Workloads in Apache Spark Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache

More information

Importing and Exporting Data Between Hadoop and MySQL

Importing and Exporting Data Between Hadoop and MySQL Importing and Exporting Data Between Hadoop and MySQL + 1 About me Sarah Sproehnle Former MySQL instructor Joined Cloudera in March 2010 sarah@cloudera.com 2 What is Hadoop? An open-source framework for

More information

Google File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information

Google File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information Subject 10 Fall 2015 Google File System and BigTable and tiny bits of HDFS (Hadoop File System) and Chubby Not in textbook; additional information Disclaimer: These abbreviated notes DO NOT substitute

More information

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic

More information

Sub-Second Response Times with New In-Memory Analytics in MicroStrategy 10. Onur Kahraman

Sub-Second Response Times with New In-Memory Analytics in MicroStrategy 10. Onur Kahraman Sub-Second Response Times with New In-Memory Analytics in MicroStrategy 10 Onur Kahraman High Performance Is No Longer A Nice To Have In Analytical Applications Users expect Google Like performance from

More information

Distributed File Systems II

Distributed File Systems II Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation

More information

Plumbing the Web. Narayanan Shivakumar. Google Distinguished Entrepreneur & Director

Plumbing the Web. Narayanan Shivakumar. Google Distinguished Entrepreneur & Director Plumbing the Web Narayanan Shivakumar Google Distinguished Entrepreneur & Director Google Developer Day 2007 powered by 3 Copyright 2007, Google Inc Developers and Google AJAX Search API Google Code Project

More information

Shark: Hive (SQL) on Spark

Shark: Hive (SQL) on Spark Shark: Hive (SQL) on Spark Reynold Xin UC Berkeley AMP Camp Aug 21, 2012 UC BERKELEY SELECT page_name, SUM(page_views) views FROM wikistats GROUP BY page_name ORDER BY views DESC LIMIT 10; Stage 0: Map-Shuffle-Reduce

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2008 Quiz II

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2008 Quiz II Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.830 Database Systems: Fall 2008 Quiz II There are 14 questions and 11 pages in this quiz booklet. To receive

More information

New Developments in Spark

New Developments in Spark New Developments in Spark And Rethinking APIs for Big Data Matei Zaharia and many others What is Spark? Unified computing engine for big data apps > Batch, streaming and interactive Collection of high-level

More information

Introduction to Database Services

Introduction to Database Services Introduction to Database Services Shaun Pearce AWS Solutions Architect 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Today s agenda Why managed database services? A non-relational

More information

Georgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong

Georgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong Georgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong Relatively recent; still applicable today GFS: Google s storage platform for the generation and processing of data used by services

More information

10 Million Smart Meter Data with Apache HBase

10 Million Smart Meter Data with Apache HBase 10 Million Smart Meter Data with Apache HBase 5/31/2017 OSS Solution Center Hitachi, Ltd. Masahiro Ito OSS Summit Japan 2017 Who am I? Masahiro Ito ( 伊藤雅博 ) Software Engineer at Hitachi, Ltd. Focus on

More information

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09 Presented by: Daniel Isaacs It all starts with cluster computing. MapReduce Why

More information

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin Presented by: Suhua Wei Yong Yu Papers: MapReduce: Simplified Data Processing on Large Clusters 1 --Jeffrey Dean

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Next-Generation Cloud Platform

Next-Generation Cloud Platform Next-Generation Cloud Platform Jangwoo Kim Jun 24, 2013 E-mail: jangwoo@postech.ac.kr High Performance Computing Lab Department of Computer Science & Engineering Pohang University of Science and Technology

More information

7680: Distributed Systems

7680: Distributed Systems Cristina Nita-Rotaru 7680: Distributed Systems BigTable. Hbase.Spanner. 1: BigTable Acknowledgement } Slides based on material from course at UMichigan, U Washington, and the authors of BigTable and Spanner.

More information

Parallel Computing: MapReduce Jin, Hai

Parallel Computing: MapReduce Jin, Hai Parallel Computing: MapReduce Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology ! MapReduce is a distributed/parallel computing framework introduced by Google

More information

Distributed Transactions in Hadoop's HBase and Google's Bigtable

Distributed Transactions in Hadoop's HBase and Google's Bigtable Distributed Transactions in Hadoop's HBase and Google's Bigtable Hans De Sterck Department of Applied Mathematics University of Waterloo Chen Zhang David R. Cheriton School of Computer Science University

More information

Evolution of Database Systems

Evolution of Database Systems Evolution of Database Systems Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies, second

More information

Sempala. Interactive SPARQL Query Processing on Hadoop

Sempala. Interactive SPARQL Query Processing on Hadoop Sempala Interactive SPARQL Query Processing on Hadoop Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, Georg Lausen University of Freiburg, Germany ISWC 2014 - Riva del Garda, Italy Motivation

More information

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop HAWQ: A Massively Parallel Processing SQL Engine in Hadoop Lei Chang, Zhanwei Wang, Tao Ma, Lirong Jian, Lili Ma, Alon Goldshuv Luke Lonergan, Jeffrey Cohen, Caleb Welton, Gavin Sherry, Milind Bhandarkar

More information

data parallelism Chris Olston Yahoo! Research

data parallelism Chris Olston Yahoo! Research data parallelism Chris Olston Yahoo! Research set-oriented computation data management operations tend to be set-oriented, e.g.: apply f() to each member of a set compute intersection of two sets easy

More information

CSE 124: Networked Services Lecture-17

CSE 124: Networked Services Lecture-17 Fall 2010 CSE 124: Networked Services Lecture-17 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/30/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Advanced Database Technologies NoSQL: Not only SQL

Advanced Database Technologies NoSQL: Not only SQL Advanced Database Technologies NoSQL: Not only SQL Christian Grün Database & Information Systems Group NoSQL Introduction 30, 40 years history of well-established database technology all in vain? Not at

More information

Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment

Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment DEIM Forum 213 F2-1 Adaptive indexing 153 855 4-6-1 E-mail: {okudera,yokoyama,miyuki,kitsure}@tkl.iis.u-tokyo.ac.jp MapReduce MapReduce MapReduce Modeling and evaluation on Ad hoc query processing with

More information

Columnstore and B+ tree. Are Hybrid Physical. Designs Important?

Columnstore and B+ tree. Are Hybrid Physical. Designs Important? Columnstore and B+ tree Are Hybrid Physical Designs Important? 1 B+ tree 2 C O L B+ tree 3 B+ tree & Columnstore on same table = Hybrid design 4? C O L C O L B+ tree B+ tree ? C O L C O L B+ tree B+ tree

More information

MapReduce. U of Toronto, 2014

MapReduce. U of Toronto, 2014 MapReduce U of Toronto, 2014 http://www.google.org/flutrends/ca/ (2012) Average Searches Per Day: 5,134,000,000 2 Motivation Process lots of data Google processed about 24 petabytes of data per day in

More information

Indexing Large-Scale Data

Indexing Large-Scale Data Indexing Large-Scale Data Serge Abiteboul Ioana Manolescu Philippe Rigaux Marie-Christine Rousset Pierre Senellart Web Data Management and Distribution http://webdam.inria.fr/textbook November 16, 2010

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information