Dremel: Interactive Analysis of Web-Scale Datasets

1 Dremel: Interactive Analysis of Web-Scale Datasets
S. Melnik, A. Gubarev, J. Long, G. Romer, S. Shivakumar, M. Tolton (Google Inc.), VLDB 2010
Presented by Ke Hong (slides adapted from Melnik's)

2 Outline
- Problem
- Motivation
- Nested Columnar Storage
- Query Execution
- Evaluation
- Summary

3 Interactive Query
- Run a MapReduce to extract billions of signals from web pages
- Launch Dremel to execute several interactive commands and spot an irregularity in a signal:
    DEFINE TABLE t AS /path/to/data/*
    SELECT TOP(signal, 100), COUNT(*) FROM t
- Feed the incoming input data into another MapReduce pipeline

4 Problem
How to run SQL-like aggregation queries
- on trillion-row tables
- on petabytes of data
- using thousands of CPUs
- executed within seconds
- serving thousands of concurrent users

5 Motivation
- Analysis of crawled web documents
- Tracking install data for applications on Android Market
- Crash reporting for Google products
- OCR results from Google Books
- Spam analysis
- Debugging of map tiles on Google Maps
- Tablet migrations in managed Bigtable instances
- Results of tests run on Google's distributed build system
- Disk I/O statistics for hundreds of thousands of disks
- Resource monitoring for jobs run in Google's data centers
- Symbols and dependencies in Google's codebase

6 Motivation
- Parallel DBMS
    - Limited scalability
    - Limited interactive speed
    - Not fault tolerant
- Translating SQL into MapReduce jobs (HadoopDB, Pig Latin, Hive)
    - One job takes thousands of seconds
    - Hard to support multiple users

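7 Dremel Data Layout
Nested structure: $\tau = \mathbf{dom} \mid \langle A_1 : \tau[*|?], \ldots, A_n : \tau[*|?] \rangle$
- $\tau$ is an atomic type or a record type
- * marks repeated fields
- ? marks optional fields
[Figure: a nested schema with fields A, B, C, D, E and records r1, r2, laid out row-oriented versus column-oriented; the column-oriented layout reads less data and decompresses more cheaply]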

8 Row vs. Column
[Figure: execution time (sec) on 3000 nodes to process an 85-billion-record table; data-volume labels on the chart: 87 TB, 0.5 TB, 0.5 TB]

9 Nested Data Model

Schema (multiplicity shown on the right):

    message Document {
      required int64 DocId;            [1,1]
      optional group Links {
        repeated int64 Backward;       [0,*]
        repeated int64 Forward;
      }
      repeated group Name {
        repeated group Language {
          required string Code;
          optional string Country;     [0,1]
        }
        optional string Url;
      }
    }

Record r1:

    DocId: 10
    Links
      Forward: 20
      Forward: 40
      Forward: 60
    Name
      Language
        Code: 'en-us'
        Country: 'us'
      Language
        Code: 'en'
      Url: '...'
    Name
      Url: '...'
    Name
      Language
        Code: 'en-gb'
        Country: 'gb'

Record r2:

    DocId: 20
    Links
      Backward: 10
      Backward: 30
      Forward: 80
    Name
      Url: '...'

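10 Column-Striped Representation
Each field is stored as its own column of (value, r, d) triples. Columns: DocId, Name.Url, Links.Forward, Links.Backward, Name.Language.Code, Name.Language.Country.

    Name.Language.Code
    value    r  d
    en-us    0  2
    en       2  2
    NULL     1  1
    en-gb    1  2
    NULL     0  1

    Name.Language.Country
    value    r  d
    us       0  3
    NULL     2  2
    NULL     1  1
    gb       1  3
    NULL     0  1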

11 Repetition & Definition Level
Repetition and definition levels encode the structural delta between the current value and the previous value.
- r: length of the common path prefix shared with the previous value (counting only repeated fields)
- d: length of the current path (counting only optional and repeated fields)

Paths contributing to the Name.Language.Code column:

    r1.Name1.Language1.Code: 'en-us'
    r1.Name1.Language2.Code: 'en'
    r1.Name2
    r1.Name3.Language1.Code: 'en-gb'
    r2.Name1

Resulting column:

    Name.Language.Code
    value    r  d
    en-us    0  2
    en       2  2
    NULL     1  1
    en-gb    1  2
    NULL     0  1

For example, 'en' has r = 2 because it shares r1.Name1 with the previous value and repeats at Language, the second repeated field on the path. The NULL that follows has r = 1 and d = 1 because a new Name starts (first repeated field) and only Name is defined on its path.
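The level computation described above can be sketched in a few lines of Python. This is a minimal illustration, not Dremel's implementation: the dict-based record encoding, the `(name, label)` path format, and the function names `stripe` and `stripe_column` are assumptions made for this example, and the URL strings are placeholders because the slide text elides the real values.

```python
# Field paths as (name, label) pairs; label is "required", "optional" or "repeated".
CODE = [("Name", "repeated"), ("Language", "repeated"), ("Code", "required")]
COUNTRY = [("Name", "repeated"), ("Language", "repeated"), ("Country", "optional")]
FORWARD = [("Links", "optional"), ("Forward", "repeated")]

# The two example records; Url values are placeholders (elided in the slides).
r1 = {
    "DocId": 10,
    "Links": {"Forward": [20, 40, 60]},
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "URL_A"},
        {"Url": "URL_B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}
r2 = {
    "DocId": 20,
    "Links": {"Backward": [10, 30], "Forward": [80]},
    "Name": [{"Url": "URL_C"}],
}


def stripe(record, path):
    """Return the (value, r, d) triples one record contributes to a column."""
    out = []

    def walk(node, i, r, d, depth):
        name, label = path[i]
        if label == "repeated":
            items = node.get(name, [])
            depth += 1  # one more repeated field on the path
        else:
            items = [node[name]] if name in node else []
        if not items:
            out.append((None, r, d))  # path ends here: emit NULL
            return
        if label != "required":
            d += 1  # an optional or repeated field is defined
        for k, item in enumerate(items):
            # The first item inherits r; later items repeat at this field.
            cur_r = r if k == 0 else depth
            if i == len(path) - 1:
                out.append((item, cur_r, d))
            else:
                walk(item, i + 1, cur_r, d, depth)

    walk(record, 0, 0, 0, 0)
    return out


def stripe_column(records, path):
    """Concatenate the triples of all records for one column."""
    return [triple for rec in records for triple in stripe(rec, path)]
```

Calling `stripe_column([r1, r2], CODE)` yields the Name.Language.Code triples, with `None` standing in for NULL.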

16 Record Assembly
[Figure: finite state machine over the fields DocId, Links.Backward, Links.Forward, Name.Language.Code, Name.Language.Country, and Name.Url; transitions are labeled with repetition levels]

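17 Record Assembly
- A simpler FSM suffices to retrieve a subset of fields
- Cheaper to execute
- Useful for queries like /Name[2]/Language[1]/Country
[Figure: FSM over DocId and Name.Language.Country only, with transitions labeled by repetition levels; it assembles the partial records s1 (DocId: 10, Country: 'us', Country: 'gb') and s2 (DocId: 20)]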
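To make the role of the levels during assembly concrete, the sketch below rebuilds the Name and Language nesting from the Name.Language.Code column alone. It is deliberately specific to that one column (maximum r = 2, maximum d = 2) and is not the general FSM from the slides; the function name `assemble_codes` and the nested-list output format are assumptions for illustration.

```python
def assemble_codes(column):
    """Rebuild records[name][language] -> code from (value, r, d) triples.

    r says where a new element starts: 0 = new record, 1 = new Name,
    2 = new Language inside the current Name.
    d says how much of the path Name.Language is defined.
    """
    records = []
    for value, r, d in column:
        if r == 0:
            records.append([])  # a new record begins
        if r <= 1 and d >= 1:
            records[-1].append([])  # a new Name group begins
        if d == 2:
            records[-1][-1].append(value)  # a Language with its required Code
    return records


# The Name.Language.Code column of the two example records (None = NULL).
code_column = [
    ("en-us", 0, 2),
    ("en", 2, 2),
    (None, 1, 1),
    ("en-gb", 1, 2),
    (None, 0, 1),
]

assembled = assemble_codes(code_column)
```

Here `assembled` holds one list per record, one inner list per Name, and one code per Language, so an empty inner list marks a Name without any Language.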

18 Query Execution

Query:

    SELECT DocId AS Id,
      COUNT(Name.Language.Code) WITHIN Name AS Cnt,
      Name.Url + ',' + Name.Language.Code AS Str
    FROM t
    WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

Output table t1:

    Id: 10
    Name
      Cnt: 2
      Language
        Str: '...'
        Str: '...'
    Name
      Cnt: 0

Output schema:

    message QueryResult {
      required int64 Id;
      repeated group Name {
        optional uint64 Cnt;
        repeated group Language {
          optional string Str;
        }
      }
    }
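The semantics of this query can be approximated directly over in-memory records. The sketch below is an illustration under assumptions, not Dremel's evaluator: the dict encoding and the function name `run_query` are invented here, the URL strings are placeholders chosen to match '^http' (the slide text elides the real values), and it assumes that a Name group failing the Url filter is pruned while its record is still emitted.

```python
import re

# Only the fields the query touches; Url values are placeholders.
r1 = {
    "DocId": 10,
    "Name": [
        {"Language": [{"Code": "en-us"}, {"Code": "en"}], "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb"}]},
    ],
}
r2 = {"DocId": 20, "Name": [{"Url": "http://C"}]}


def run_query(records):
    """Evaluate the slide's SELECT over dict records (a hedged sketch)."""
    results = []
    for rec in records:
        if not rec["DocId"] < 20:  # WHERE DocId < 20
            continue
        names = []
        for name in rec.get("Name", []):
            url = name.get("Url")
            # WHERE REGEXP(Name.Url, '^http'): prune Names without a match.
            if url is None or not re.search(r"^http", url):
                continue
            languages = name.get("Language", [])
            names.append({
                # COUNT(Name.Language.Code) WITHIN Name; Code is required,
                # so this equals the number of Language groups.
                "Cnt": len(languages),
                "Language": [{"Str": url + "," + lang["Code"]}
                             for lang in languages],
            })
        results.append({"Id": rec["DocId"], "Name": names})
    return results


output = run_query([r1, r2])
```

With these inputs, `output` contains a single result for DocId 10 with two Name groups, the first counting two codes and the second counting none, which mirrors the shape of the output table on the slide.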

19 Multi-Level Serving Tree
Tree levels: client -> root server -> intermediate servers -> leaf servers (with local storage) -> storage layer (e.g., GFS)
- Parallelizes scheduling and aggregation
- Mostly one-pass aggregation
- Well suited for aggregation queries returning small and medium-sized results (<1M records), a common class of interactive queries

20 Multi-Level Serving Tree
Root server: receives a query from the client and determines all tablets.

21 Multi-Level Serving Tree
Root server: rewrites the query and routes it to the intermediate servers.

22 Multi-Level Serving Tree
Intermediate servers: rewrite the query into a set of partial queries and assign them to leaf servers.

23 Multi-Level Serving Tree
Leaf servers (with local storage): access their assigned tablets from the storage layer (e.g., GFS).

24 Example: count()
In the notation below, R(i,j) is the result of the j-th query at level i of the serving tree, and T(i,j) is the set of tablets that query covers.

Level 0 (root server):

    SELECT A, COUNT(B) FROM T GROUP BY A
    T = {/gfs/1, /gfs/2, ..., /gfs/100000}

  is rewritten as

    SELECT A, SUM(c) FROM (R(1,1) UNION ALL ... R(1,n)) GROUP BY A

Level 1 (intermediate servers):

    R(1,1) = SELECT A, COUNT(B) AS c FROM T(1,1) GROUP BY A
    T(1,1) = {/gfs/1, ..., /gfs/10000}
    R(1,2) = SELECT A, COUNT(B) AS c FROM T(1,2) GROUP BY A
    T(1,2) = {/gfs/10001, ..., /gfs/20000}
    ...

Level 3 (leaf servers, data access ops):

    SELECT A, COUNT(B) AS c FROM T(3,1) GROUP BY A
    T(3,1) = {/gfs/1}
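The rewrite works because COUNT decomposes into per-partition counts that are summed higher up the tree. The sketch below mimics that with in-memory tablets; the tablet contents, the two-way grouping of leaves, and the function names `leaf_count` and `merge` are hypothetical and chosen only to illustrate the idea.

```python
def leaf_count(tablet):
    """Leaf step: SELECT A, COUNT(B) AS c FROM tablet GROUP BY A.

    Rows are (A, B) pairs; COUNT(B) ignores rows where B is NULL (None).
    """
    counts = {}
    for a, b in tablet:
        counts[a] = counts.get(a, 0) + (0 if b is None else 1)
    return counts


def merge(partials):
    """Upper levels: SELECT A, SUM(c) FROM (R1 UNION ALL ... Rn) GROUP BY A."""
    totals = {}
    for partial in partials:
        for a, c in partial.items():
            totals[a] = totals.get(a, 0) + c
    return totals


# Four hypothetical tablets, each a list of (A, B) rows.
tablets = [
    [("x", 1), ("y", None)],
    [("x", 2), ("x", None)],
    [("y", 3)],
    [("z", None)],
]

# Leaves scan tablets, two intermediate servers merge two leaves each,
# and the root merges the intermediate results.
leaf_results = [leaf_count(t) for t in tablets]
intermediate = [merge(leaf_results[:2]), merge(leaf_results[2:])]
root_result = merge(intermediate)
```

The root result matches a single-pass count over all rows, which is what makes the one-pass, multi-level aggregation valid for this class of query.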

27 Multi-Level Serving Tree
Leaf servers: scan their assigned tablets and return partial results to the intermediate servers.

28 Multi-Level Serving Tree
Intermediate servers: aggregate the partial results in parallel.

29 Multi-Level Serving Tree
Root server: returns the aggregate query result to the client.

30 Query Dispatcher
- Dremel is a multi-user system
    - Queries are scheduled based on priorities and load balancing
- Fault tolerance
    - Relevant when the number of tablets processed in one query is larger than the number of available execution threads on leaf servers
    - The dispatcher computes a histogram of tablet processing times
    - Tablet processing is rescheduled on another server if a tablet takes disproportionately long
- The internal execution tree serves as the physical query execution plan
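The rescheduling rule above can be illustrated with a small straggler check. The slides do not say how "disproportionately long" is decided, so this sketch makes two assumptions: the median of finished tablet times stands in for the histogram, and a tablet is flagged when its elapsed time exceeds a hypothetical `factor` times that median. The function name `find_stragglers` and all tablet identifiers are invented for the example.

```python
from statistics import median


def find_stragglers(finished_secs, running_secs, factor=3.0):
    """Return ids of running tablets that look disproportionately slow.

    finished_secs: processing times (sec) of tablets already completed.
    running_secs:  mapping of tablet id -> elapsed time (sec) so far.
    factor:        assumed threshold multiplier, not taken from the slides.
    """
    if not finished_secs:
        return []  # nothing to compare against yet
    cutoff = factor * median(finished_secs)
    return sorted(tid for tid, secs in running_secs.items() if secs > cutoff)


# Hypothetical measurements.
finished = [1.0, 1.2, 0.9, 1.1]
running = {"tablet-7": 0.5, "tablet-9": 4.0}

to_reschedule = find_stragglers(finished, running)
```

A dispatcher following this rule would hand the flagged tablets to another leaf server while leaving the rest untouched.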

31 Evaluation: Scalability of Dremel
On a trillion-row table T4 (105 TB, 50 fields):
    SELECT TOP(aid, 20), COUNT(*) FROM T4 WHERE bid = {value1} AND cid = {value2}
[Figure: execution time (sec) versus number of leaf servers]

32 Evaluation: Interactive Speed of Dremel
Monthly query workload of one 3000-node Dremel instance: most queries complete within 10 sec.
[Figure: percentage of queries versus execution time (sec)]

33 Summary
- Propose a scenario for interactive queries on web-scale datasets
- Design a columnar storage format for Dremel's nested data model
- Develop Dremel's SQL-like query language
- Design a multi-level serving tree for efficient query execution
- Evaluate Dremel's scalability and interactive speed using web-scale workloads
