Computations with Bounded Errors and Response Times on Very Large Data

Size: px

Start display at page:

Download "Computations with Bounded Errors and Response Times on Very Large Data"

Bryce Richard
5 years ago
Views:

Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Purnamrita

1 Computations with Bounded Errors and Response Times on Very Large Data Ion Stoica UC Berkeley (joint work with: Sameer Agarwal, Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Purnamrita Sarkar, Michael Jordan, Sam Madden) Paris, May 15, 2013 UC BERKELEY

2 Problem Support interactive ad-hoc exploration queries over very large datasets

3 Why is This a Problem? 100 TB on 1000 cores/disks 1-2 Hours Minutes 1 second? Hard Disks Memory

4 Why is This a Problem? Even if no communication and all data in memory, query may take tens of sec» Just scanning GB RAM may take 10 sec Still slow for interactive queries

5 Why is This a Problem? 60 Projected Growth Increase over Moore's Law Overall Data Particle Accel. DNA Sequencers Data Grows faster than Moore s Law [IDC report, Kathy Yelick, LBNL]

6 Key Insight Computations don t always need exact answers Input often noisy: exact computations do not guarantee exact answers Error often acceptable if small and bounded Best scale ± 200g error Speedometers ± 2.5 % error (edmunds.com) OmniPod Insulin Pump ± 0.96 % error (

7 Approach: Sampling Compute results on samples instead of full data» Typically, error depends on sample size (n) not on original data size, i.e., error α 1/ n Can trade between answer s latency and accuracy Data rapid increase no longer a problem :» Error decreases with Moore s law: halves every 36 months

8 This Talk BlinkDB: approximate query engine for very large data sets using off- line sampling

9 BlinkDB Interface SELECT avg(sessiontime) FROM Table WHERE city= San Francisco AND dt= WITHIN 1 SECONDS ± 15.32

10 BlinkDB Interface SELECT avg(sessiontime) FROM Table WHERE city= San Francisco AND dt= WITHIN 2 SECONDS SELECT avg(sessiontime) FROM Table WHERE city= San Francisco AND dt= ERROR 0.1 CONFIDENCE 95.0% ± ± 4.96

11 BlinkDB Overview TABLE Original Data Sampling Module Offline- sampling: OpHmal set of samples across different dimensions (columns or sets of columns) to support ad- hoc exploratory queries

12 BlinkDB Overview TABLE Original Data Sampling Module Sample Placement: Samples striped over 100s or 1,000s of machines both on disks and in- memory. On- Disk Samples In- Memory Samples

13 BlinkDB Overview SELECT foo (*) FROM TABLE WITHIN 2 HiveQL/SQL Query Query Plan Sample Selection TABLE Original Data Sampling Module On- Disk Samples In- Memory Samples

14 BlinkDB Overview SELECT foo (*) FROM TABLE WITHIN 2 HiveQL/SQL Query TABLE Original Data Sampling Module Query Plan Sample Selection Online sample selection to pick best sample(s) based on query latency and accuracy requirements On- Disk Samples In- Memory Samples

15 BlinkDB Overview SELECT foo (*) FROM TABLE WITHIN 2 HiveQL/SQL Query New Query Plan Sample Selection Error Bars & Confidence Intervals Hadoop/Spark TABLE Original Data Sampling Module On- Disk Samples In- Memory Samples Result ± 5.56 (95% confidence) Parallel query execuhon on mulhple samples striped across mulhple machines.

16 Challenges Which set of samples to build given a storage budget? Which sample to run the query on? How to accurately estimate the error?

17 Challenges Which set of samples to build given a storage budget? Which sample to run the query on? How to accurately estimate the error?

18 Error Latency Profile (ELP) Relative Error Time Taken Relative Error Time Taken (in seconds) Sample Size (in MB)

19 Error Latency Profile (ELP) Relative Error Time Taken Relative Error Time Taken (in seconds) Sample Size (in MB)

20 Error Latency Profile (ELP) Relative Error Time Taken Relative Error Time Taken (in seconds) Sample Size (in MB)

21 Error Latency Profile (ELP) Relative Error Time Taken Relative Error Time Taken (in seconds) Sample Size (in MB)

22 Error Latency Profile (ELP) Relative Error Time Taken Relative Error Error 1/ n Time Taken (in seconds) Sample Size (in MB)

23 Error Latency Profile (ELP) Relative Error Time Taken Relative Error Error 1/ n Latency n Time Taken (in seconds) Sample Size (in MB)

24 Error Latency Profile (ELP) SELECT foo (*) FROM TABLE WITHIN 2

25 Error Latency Profile (ELP) SELECT foo (*) FROM TABLE WITHIN ms

26 Challenges Which set of samples to build given a storage budget? Which sample to run the query on? How to accurately estimate the error?

27 How to Accurately Estimate Error? Close formulas for limited number of operators» E.g., count, mean, percentiles What about user defined functions (UDFs)?

28 Experimental Workload Conviva: 30- day log of media accesses by Conviva users. Raw data 17 TB, partitioned this data across 100 nodes Log of 20,000 queries» 43.6% queries have one or more UDFs Storage budget: 50% of original data» 8 stratified samples

29 Experimental Setting 100 node cluster of EC2 extra large instances:» 800 cores» 6.8TB RAM» 75TB disk Two datasests: 2.5TB, and 7.5TB, respectively» Significantly larger when stored in memory

30 Sampling vs. No Sampling (close formula) Partially Fully Cached

31 Sampling vs. No Sampling (close formula)

32 Sampling vs. No Sampling (close formula) 12x Faster! x Faster!

33 How to Accurately Estimate Error? Close formulas for limited number of operators» E.g., count, mean, percentiles What about user defined functions (UDFs)?

34 Bootstrap Quantify accuracy of a query on a sample table Original Table T Q(T) Q(T) takes too long! sample S = N S Q(S) what is Q(S) s error? sampling with replacement S 1 S k Q (S 1 ) Q (S k ) use Q(S 1 ),, Q(S K ) to compute estimator quality assessment, e.g., stdev(q(s i )) S i = N

35 Also Useful for Close Formulas filter1 sum filter2 sum x, ε 1 y, ε 2 Every error propagation step may div introduce additional error r = x/y, ε = epr(ε 1, ε 2 )

36 Also Useful for Close Formulas S S 1 S k filter1 filter2 filter1 filter2 filter1 filter2 sum x, ε 1 div sum y, ε 2 sum sum x y sum Bootstrapping div only computes div errors on the final result x sum y r = x/y, ε = epr(ε 1, ε 2 ) r 1 r, ε r k

37 Also Useful for Close Formulas 6x

38 Bootstrap Challenges How do you know the bootstrap is working?» Depends on distribution, computation, sample size Overhead

39 How Do You Know Bootstrap is Working? Assumption: f() is Hadamard differentiable» How do you know an UDF is Hadamard differentiable?» Only asymptotic consistency» Sufficient, not necessary condition Developed data- driven diagnostic for Bootstrap» Compare bootstrapping with ground truth for small samples» Check whether error improves as sample size increases

40 Ground Truth (Approximation) T Original Table

41 Ground Truth (Approximation) T Original Table S 1 Q (S 1 ) ξ = stdev(q(s j )) S p Q (S p ) Estimator quality assessment

42 Ground Truth and Bootstrap T Original Table Bootstrap on Individual Samples S 11 Q (S 11 ) S 1 ξ * 1 = stdev(q(s 1 j )) S 1k Q (S 1k ) S p1 Q (S p1 ) S p ξ * p = stdev(q(s pj )) S pk Q (S pk )

43 Ground Truth vs. Bootstrap ξ i = mean(ξ ij ) ξ * i1 = stdev(q(s i1 j )) ξ * = stdev(q(s )) ip ipj RelaHve Error Bootstrap Ground Truth n 1 n 2 n 3 Sample Size

44 Ground Truth vs. Bootstrap ξ i = mean(ξ ij ) ξ * i1 = stdev(q(s i1 j )) ξ * = stdev(q(s )) ip ipj RelaHve Error Bootstrap Ground Truth n 1 n 2 n 3 Sample Size

45 Ground Truth vs. Bootstrap ξ i = mean(ξ ij ) ξ * i1 = stdev(q(s i1 j )) ξ * = stdev(q(s )) ip ipj RelaHve Error Bootstrap Ground Truth n 1 n 2 n 3 Sample Size

46 Bootstrap Diagnosis Expectation test:» Bootstrap results should not deviate too much from ground truth, on average Standard deviation test:» Bootstrap results should not vary too much Confidence interval test:» Most (e.g., 95%) of bootstrap results should be close to ground truth

47 Expectation Test Δ i mean(ξ * i1,...,ξ * ip) ξ i ξ i (Δ i+1 < Δ i ) (Δ i+1 c 1 ) i =1,..., s Δ c 1 n 1 n 2 n 3 Sample Size sample size

48 Standard Deviation Test σ i stddev(ξ * i1,...,ξ * ip ) (σ i+1 < σ i ) (σ i+1 c 2 ), ξ i i =1,..., s σ Test fails! c 2 n 1 n 2 n 3 Sample Size sample size

49 Confidence Interval Test $ & # j 1,..., p : ξ * ij ξ ( i & % c 3 ) '& ξ i *& p α FracHon α of bootstrap results have rel. error of at most c 3 from ground truth (e.g., α = 0.95, c 3 = 0.5)

50 How Well Does it Work in Practice? Evaluated on 268 real- world Conviva Queries of which 113 had custom User- Defined Functions Diagnostic predicted that 207 (77%) queries can be approximated» False Positives: 3 (conditional UDFs)» False Negatives: 18

51 Bootstrap Challenges How do you know the bootstrap is working?» Depends on distribution, computation, sample size Overhead

52 Very, Very Preliminary Results Setting:» 25 EC2 instances with 4 slots and 15GB RAM» Input: 365mil rows, 204GB on disk, > 600GB in memory (deserialized format)» Workload: query computing 95- th percentile Overheads:» Bootstrap to estimate result s error» Bootstrap diagnosis

53 Query Resp. Time & Overhead Operation Computation complexity I/O complexity Full data C(N) O(N) N data size

54 Query Resp. Time & Overhead Operation Full data Sample Computation complexity C(N) C(n) I/O complexity O(N) O(n) N data size n sample size

55 Query Resp. Time & Overhead Operation Full data Sample Bootstrap Computation complexity C(N) C(n) k C(n) I/O complexity O(N) O(n) k O(n) N data size n sample size k # of samples used by bootstrap

56 Query Resp. Time & Overhead Operation Full data Sample Bootstrap Diagnosis Computation complexity C(N) C(n) k C(n) s i=1 p k C(n i ) I/O complexity O(N) O(n) k O(n) s i=1 p k O(n i ) N data size n sample size n i sample size for iteration i of diagnostic k # of samples used by bootstrap p # of samples used by ground truth s # of iteration used by diagnostic

57 Query Response Time seconds 1020 (0.02%) 10x as response time is dominated by I/O Query Response Time 103 (0.07%) (1.1%) (3.4%) (11%) Fraction of full data

58 Query Resp. Time + Bootstrap seconds Bootstrap overhead Query Response Time Bootstrap overhead up to 10x higher than query itself Fraction of full data

59 Bootstrap Query Plan Query plan Table Scan Table 1 Scan sample Table Bootstrap plan sample Table k Scan Filter Filter Filter Perc. Perc. Perc.

60 Detailed Query Plan Query plan Table Scan Filter Perc. Query SELECT percentile (sesstiontimes, 0.95) FROM Table WHERE dt >= start_day AND dt <= end_day AND customerid = customer1 AND sessiontype= type1

61 Detailed Query Plan Typically, filters remove many rows from input table Table Scan Filter Perc. SELECT percentile (sesstiontimes, 0.95) FROM Table WHERE dt >= start_day AND dt <= end_day AND customerid = customer1 AND sessiontype= type1

62 Filter Pushdown Optimization Table Table Scan Scan Filter Typically, much smaller than Table Filter Table f 1 Table f Table f k Perc. Perc. Bootstrap Perc.

Query + Bootstrap seconds 1000 900 800 700 600 500 400 300 200 100 0 (with Filter Pushdown) 202 Up to 2x

63 Query + Bootstrap seconds (with Filter Pushdown) 202 Up to 2x overall improvement Bootstrap overhead Query Response Time Fraction of full data

Query + Bootstrap + Diagnosis seconds 1000 900 800 700 600 500 400

overhead Query Response Time Diagnosis overhead up to 8x higher than

64 Query + Bootstrap + Diagnosis seconds (with Filter Pushdown) 532 Diagnosis overhead Bootstrap overhead Query Response Time Diagnosis overhead up to 8x higher than the rest Fraction of full data

65 Overhead Operation Full data Comp. complexity C(N) Comm. complexity O(N) # of tasks O(d) Sample Bootstrap Diagnosis C(n) k C(n) s i=1 p k C(n i ) O(n) k O(n) s i=1 p k O(n i ) O(d) Can generate millions k O(d) of tasks s p k O(d) N data size n sample size n i sample size for iteration i of diagnostic k # of samples used by bootstrap p # of samples used by ground truth s # of iteration used by diagnostic d degree of parallelism

66 Diagnosis Optimization Problem: too many tasks to launch & schedule» Fixed overhead dominates Solution: task consolidation» Consolidate the computation into d tasks, where d is the number of slots in the system» Caveat: currently a hack; not part of BlinkDB codebase Result:» Reduce diagnosis from 330s to 4.5s» Reduce bootstrap overhead by up to 10x

67 Query + Bootstrap + Diagnosis (with Filter Pushdown and Task Consolidation) sec Diagnosis overhead Bootstrap overhead Query Response Time Less than 70% overhead Fraction of full data

68 Summary Computing error bounds for approximate queries on massive data sets: a hard, important, and exciting problem At the intersection between:» ML: error computation» Databases: query plan optimization, sample creation and selection» Systems: improved parallelism, scheduling Preliminary results encouraging

69 Future Work Improve bootstrap diagnostic:» Provable properties for specific settings Improve coverage for error estimation» E.g., use static analysis to decompose programs in multiple Hadamard differentiable components Improve Bootstrap scalability with Bag of Little Bootstraps [Kleiner et al. 2012]:» No need to distribute query computation even for huge samples (e.g., 100 billion 1KB per record) Scheduling concurrent parallel, dependent jobs

Warehouse- Scale Computing and the BDAS Stack

Warehouse- Scale Computing and the BDAS Stack Ion Stoica UC Berkeley UC BERKELEY Overview Workloads Hardware trends and implications in modern datacenters BDAS stack What is Big Data used For? Reports,