Estimating Quantiles from the Union of Historical and Streaming Data

Size: px

Start display at page:

Download "Estimating Quantiles from the Union of Historical and Streaming Data"

Brendan Watts
5 years ago
Views:

1 Estimating Quantiles from the Union of Historical and Streaming Data Sneha Aman Singh, Iowa State University Divesh Srivastava, AT&T Labs - Research Srikanta Tirthapura, Iowa State University

$fraction of the entire data set falls 50% data < median Normal distribution Fundamental database query.$

2 Quantiles Quantile A statistical measure used to characterize distribution of data sets Indicates a point in a data distribution at or below which a given fraction of the entire data set falls 50% data < median Normal distribution Fundamental database query. An operator in many big data tools, such as R and Apache Spark 2

φ Quantile of a Set Rank of an element x in a set S is the number of

5 9 6 For instance, rank of elements in above set S S (sorted): Ranks

of rank φn, for 0<φ<1 Find 0.5-Quantile: Rank of 0.5 quantile = φn = 0.

3 φ Quantile of a Set Rank of an element x in a set S is the number of elements with value less than or equal to x S: N = For instance, rank of elements in above set S S (sorted): Ranks φ-quantile is an element of rank φn, for 0<φ<1 Find 0.5-Quantile: Rank of 0.5 quantile = φn = 0.5*11 = 5 0-quantile is the minimum and 1-quantile is the maximum of S. 3

Latency Experienced by All the Users Indicator of user experience 0.

4 Application Web Latency Web Latency delay experienced by a user between sending a request to the web server and getting back a response. Latency Experienced by All the Users Indicator of user experience 0.95 Quantiles 0.95 quantile > Threshold latency -- Track poor network performance 4

5 Outline 1. Description of quantile 2. Historical and Streaming Data 3. Contribution 4. High Level Algorithm Description 5. Experimental Evaluation 5

Historical Data Historical Data Data collected

performance User Requests Web Server User

6 Historical Data Historical Data Data collected and stored for later analysis. Web user requests collected at the server end: to analyze user trends / behavior and web site performance User Requests Web Server User Requests Stored for later analysis Batch Processing System Storage 6

7 Streaming Data Data Stream: Continuous sequence of data observed in the form of events, messages, tuples, in a single pass or limited number of pass. User Requests User Requests Sent for storage Web Server Data observed on the fly Stream Processing System Memory 7

8 Stream Processing System Memory (stores stream state) Update state Retrieve state e Retrieve information 43 e 23 e 12 from e 1 2 e 1 e 2 Data stream Stream Processing Engine Query ex: 0.5 quantile (Approximate) Answer to query Continuous update of the state of the data, with every incoming element 8

9 Integrated Processing of Historical and Streaming Data Long term data analysis with higher accuracy in real time Archived Historical data Live Data stream Timestep 1 Timestep 2 Timestep 3 Timestep 4 Timestep 5 D1 D2 D3 D4 D5 Correlation of historical data with live streaming data Predictive analysis 9

10 Historical vs Stream Processing System Batch Processing System Stream Processing System Data stored and can be traversed multiple times Single or limited pass over data Accurate results of user query J Approximate results of user query L Latency due to disk I/O L Fast processing time J Can we have our cake and eat it too? 10

11 Our Focus: (Approximate) Quantile N : size of data stream Computation ε-approximate quantiles Find an element of rank ρ from a dataset of size N, such that, (φ-ε)n ρ (φ+ε)n, 0 < ε << φ < 1 ε : approximation parameter rank of element e in D : {x D x<e} 11

12 Prior Work on Approximate Quantile Computation Space efficient ε-approximate streaming algorithms: Manku et al [MRL98] with space cost O(1/ε log 2 (εn)) Greenwald-Khanna [GK01] with space cost O(1/ε log(εn)) Qdigest [SBAS04] with space cost O(1/ε log(u)) N: size of data stream, U: universe size, ε: approximation parameter, 0<ε<1 12

13 Outline 1. Description of quantile 2. Historical and Streaming Data 3. Contribution 4. High Level Algorithm Description 5. Experimental Evaluation 13

Processing Historical Data + Streaming Data Server Live Data

stream state for current timestep, using state-of-the-art

upon query Batching Processing System maintains batch summary

Coordinator Batch Data loaded in batch at every timestep, ex: 1

14 Processing Historical Data + Streaming Data Server Live Data Stream R Stream Processing System continuously maintains a stream state for current timestep, using state-of-the-art streaming algorithm, which is used to generate a summary S(R) upon query Batching Processing System maintains batch summary S(H) (a set of index for faster lookup) Stream summary S(R) Coordinator Batch Data loaded in batch at every timestep, ex: 1 day Time Step Batch summary S(H) Answer Query H R Data Warehouse Historical Data H 14

streaming algorithms The algorithm is memory efficient requiring O memory log(εn) ε m : size of

15 Contribution An algorithm that returns an element of rank ρ as a φ- quantile from a union of historical and streaming data, such that φn-εm ρ φn+εm, m << N compared to φn-εn ρ φn+εn by streaming algorithms The algorithm is memory efficient requiring O memory log(εn) ε m : size of data stream, rank of an element e in S : {x S x<e} ε : approximation parameter, 0<ε<1 T : no of time steps 15

16 Contribution Small processing time for updating historical data in the warehouse at every timestep, using amortized O((m / B)log(n / B)) disk accesses per timestep, Real time querying using amortized O (log 2 (εm / B)log(T )) disk accesses n : size of historical data B : block size of the data warehouse T : no of time steps ε : approximation parameter, 0<ε<1 16

17 3-way Tradeoff between Accuracy, Disk Accesses & Memory Accuracy Batch + External Memory Pure Streaming Batch + Streaming Disk Accesses Memory

18 Comparison of Our Algorithm with Pure Streaming Algorithms Relative Error GK (streaming) Qdigest (streaming) Our Algorithm ε Memory O! log(εn) $ O! log(u) $ O log(εn) # & # & " ε % " ε % ε ε εm N N: total size of historical and streaming data m: data stream size, m << N T: no of time steps U: universe size 18

19 Outline 1. Description of quantile 2. Historical and Streaming Data 3. Contribution 4. High Level Algorithm Description 5. Experimental Evaluation 19

20 High Level Algorithm Steps Live Data Stream R Update SS Memory Retrieve SS Stream Processing System For each element observed in R, in current time step, update stream state SS (reset SS for new time step) On query, generate summary S(R) using SS S(R) S(H) Update S(H) Memory Retrieve S(H) Batch Processing System Collect data during the time step At the end of each time step, load the collected data in H Update batch summary S(H) upon each load Coordinator When a query is posed over ζ = H R 1.Retrieve S(H) & S(R) and use these to generate approximate answer Approximate Answer 2.Retrieve Access data warehouse to improve accuracy of answer Improved Accuracy Additional disk accesses Data Warehouse Historical Data H 20

21 High Level Algorithm Steps Live Data Stream R Update SS Memory Retrieve SS Stream Processing System For each element observed in R, in current time step, update stream state SS (reset SS for each time step) On query, generate summary S(R) using SS S(R) S(H) Update S(H) Memory Retrieve S(H) Batch Processing System Collect data during the time step At the end of each time step, load the collected data in H Update batch summary S(H) upon each load Coordinator When a query is posed over ζ = H R 1.Retrieve S(H) & S(R) and use these to generate approximate answer Approximate Answer 2.Retrieve Access data warehouse to improve accuracy of answer Improved Accuracy Additional disk accesses Data Warehouse Historical Data H 21

Historical Data Update at Each Time Step: Approach 1 Merge the newly arrived dataset with older historical data at every time step Timestep 1 Timestep 2 Timestep 3 Timestep 4 Timestep 5 D 1 D 2 D 3 D

22 Historical Data Update at Each Time Step: Approach 1 Merge the newly arrived dataset with older historical data at every time step Timestep 1 Timestep 2 Timestep 3 Timestep 4 Timestep 5 D 1 D 2 D 3 D 4 D 1 sorted D 1 U D 2 sorted D sorted 2 sorted D 1 U D 2 U D 3 D sorted 3 Sort D 1 Merge D 1 and D 2 Sort in sorted D order 2 Merge D 3 with DSort 1 U D 23 in sorted order Disadvantages : Large disk access at each timestep i : Sort D i of size n i : O((n i /B)log(n i /B)) Merge with historical data of size n : O(n/B), B- block size Data Warehouse 22

23 Historical Data Update at Each Time Step: Approach 2 Add the new arrived dataset at each time step as a new partition in data warehouse Timestep 1 Timestep 2 Timestep 3 Timestep 4 Timestep 5 D 1 D 2 D 3 D 4 D 5 D 1 sorted D 2 sorted D 3 sorted D sorted 4 Data Warehouse Sort D 1 Sort D 2 Sort D 3 Sort D 4 Advantage : Small update cost at each timestep i, Sort and store D i of size n i : O((n i /B)log(n i /B)) Disadvantage : Maintainence of several partitions Expensive to answer query - O(d*T) d: no of disk access / timestep to answer a query T: no of timesteps 23

Historical Data Update: Our Approach Dataset D i arrives at timestep i Data stored in sorted format in data partitions using levels Merge Threshold = Merge 2 Threshold = Merge 2 Threshold = 2 Level 0

24 Historical Data Update: Our Approach Dataset D i arrives at timestep i Data stored in sorted format in data partitions using levels Merge Threshold = Merge 2 Threshold = Merge 2 Threshold = 2 Level 0 D 2 D 1 D 3 D 4 D 5 D 6 D 7 Level 1 Level 2 D 1 U D 2 D 3 U D 4 D 5 U D 6 Violation of merge threshold! D 1 U D 2 U D 3 U D 4 Data Warehouse Partitions Each level has at most κ data partitions. In the figure, κ=2, κ merge threshold At most log κ T data partitions at the end of T time steps. 24

25 Historical Data Update at Each Time Step: Our Approach D 7 D 5 U D 6 Advantages: Relatively small update cost since the entire historical data is not accessed at each time step Small query time D 1 U D 2 U D 3 U D 4 25

26 High Level Algorithm Steps Live Data Stream R Update SS Memory Retrieve SS Stream Processing System For each element observed in R, in current time step, update stream state SS (reset SS for new time step) On query, generate summary S(R) using SS S(R) S(H) Update S(H) Memory Retrieve S(H) Batch Processing System Collect data during the time step At the end of each time step, load the collected data in H Update batch summary S(H) upon each load Coordinator When a query is posed over ζ = H R 1.Retrieve S(H) & S(R) and use these to generate approximate answer Approximate Answer 2.Retrieve Access data warehouse to improve accuracy of answer Improved Accuracy Additional disk accesses Data Warehouse Historical Data H 26

Batch and Stream Summary Batch Summary S(H i ) ((1/ε 1 )+1) elements and their database index, for each

n-ε 1 n i n i -3 n i -2 n i -1 S(H i ) - sorted 0 ε 1 n i 2ε 1 n i 3ε 1 n i n-ε 1 n i n i -1 Stream

and so on, with a max error of ε 2 m, created when a query is made S(R) - sorted r 0 r 1 r 2 r m-2 r m-1

27 Batch and Stream Summary Batch Summary S(H i ) ((1/ε 1 )+1) elements and their database index, for each disk partition H i of size n i, updated at the end of each time step H i - sorted ε 1 n i 2ε 1 n i n-ε 1 n i n i -3 n i -2 n i -1 S(H i ) - sorted 0 ε 1 n i 2ε 1 n i 3ε 1 n i n-ε 1 n i n i -1 Stream Summary S(R) use stream state SS to get ((1/ε 2 )+1) elements of approximate ranks ε 2 m, 2ε 2 m, 3ε 2 m and so on, with a max error of ε 2 m, created when a query is made S(R) - sorted r 0 r 1 r 2 r m-2 r m-1 rank(r 0, R)=0 ε 2 m rank(r 2, R) 3ε 2 m rank(r m-1, R) = m-1 0<rank(r1, R) 2ε 2 m ε 2 (m-3) rank(rm-2, R) ε 2 (m-1) 27

High Level Algorithm Steps Live Data Stream R Update SS Memory Retrieve SS Stream Processing System For each element observed in R, in current time step, update stream state SS (reset SS for new time

28 High Level Algorithm Steps Live Data Stream R Update SS Memory Retrieve SS Stream Processing System For each element observed in R, in current time step, update stream state SS (reset SS for new time step) On query, generate summary S(R) using SS S(R) S(H) Update S(H) Memory Retrieve S(H) Batch Processing System Collect data during the time step At the end of each time step, load the collected data in H Update batch summary S(H) upon each load Coordinator When a query is posed over ζ = H R 1.Retrieve S(H) & S(R) and use these to generate approximate answer Approximate Answer 2.Retrieve Access data warehouse to improve accuracy of answer Improved Accuracy Additional disk accesses Data Warehouse Historical Data H 28

Answer a Query : Find a Range within which Quantile is Guaranteed to lie H : historical data of size n; R : live data stream of size m; N=(n+m); Find elements u and v from S(H) U S(R), such that the

29 Answer a Query : Find a Range within which Quantile is Guaranteed to lie H : historical data of size n; R : live data stream of size m; N=(n+m); Find elements u and v from S(H) U S(R), such that the quantile is guaranteed to lie between [u, v] in the union of historical and streaming data Filters S(H) U S(R) in sorted order s 1 s 2 s 3 s 4 s 5 u v s 2(1/ε+1) H U R in sorted order e 1 e 2 e 3 e 4 u v e N-1 e N 2εN elements φ quantile guaranteed to lie here 29

30 High Level Algorithm Steps Live Data Stream R Update SS Memory Retrieve SS Stream Processing System For each element observed in R, in current time step, update stream state SS (reset SS for new time step) On query, generate summary S(R) using SS S(R) S(H) Update S(H) Memory Retrieve S(H) Batch Processing System Collect data during the time step At the end of each time step, load the collected data in H Update batch summary S(H) upon each load Coordinator When a query is posed over ζ = H R 1.Retrieve S(H) & S(R) and use these to generate approximate answer Approximate Answer 2.Retrieve Access data warehouse to improve accuracy of answer Improved Accuracy Additional disk accesses Data Warehouse Historical Data H 30

31 Additional Disk Accesses to Improve Accuracy The filter range [u, v] is narrowed down recursively to zero down on error due to batch. At each recursion, Compute rank of z = (u+v)/2 in all data partitions and stream For each partition H i of size n i S(H i ) : (sorted) Step a. Find p and q, such that & q v e 1 e 2 p q e β 4εn i elements β=(1/ε 1 ) + 1 H i : (sorted) Step b: Access p and q in H i using index from S(H i ) e 1 e 2 p q e ni Step c: Find rank of z (using binary search), p z Rank of z = Σ(rank of z in each partition) + rank of z in data stream Choose [u, z] or [z, v] as the new filter for next recursion, depending on which contains the quantile 31

32 Outline 1. Description of quantile 2. Historical and Streaming Data 3. Problem Statement and Motivation 4. Contribution 5. High Level Algorithm Description 6. Experimental Evaluation 32

33 Experimental Setup Datasets Wikipedia Network traces Uniform Random Normal Data Size (GB) Tuples / time step (million) Algorithms Greenwald-Khanna Qdigest Our Performance Metric Accuracy Processing time Querytime Description Measure relative error : φn - r / φn Runtime and no. of disk accesses Runtime and no. of disk accesses 33

34 Dependence On Memory Update time (seconds) Update time vs Memory Load Sort Relative Error Merge Summary Our Algorithm GK Memory (MB) e-05 1e-06 1e-07 1e-08 1e-09 Our Algorithm Greenwald-Khanna Q-Digest Quick Response 1e Memory (MB) (Merge Threshold apple = 10) QD Query RunTime (seconds) Accuracy vs Memory Query time vs Memory Our Algorithm - Runtime Greenwald-Khanna - Runtime Q-Digest - Runtime Our Algorithm - Disk accesses Query Disk Access Memory (MB) (Merge Threshold apple = 10) Normal Dataset (merge threshold: 10) 34

35 Update time vs Merge Threshold Dependence On Merge Threshold Relative Error e-05 1e-06 1e-07 1e-08 1e-09 1e-10 Relative Error in Practice Relative Error in Theory Merge Threshold apple (Memory fixed at 250MB) Accuracy vs Merge Threshold Query time vs Merge Threshold Update time (seconds) Avg. disk access (overall) Avg. disk access for Merge Sort Load Merge Summary Avg no. of disk accesses / time step Query RunTime (seconds) Our Algorithm - Runtime Our Algorithm - Disk accesses Memory (MB) (Merge Threshold apple = 10) Query Disk Access Normal Dataset (memory fixed: 250 MB) 35

36 Dependence On Historical Data Size Update time vs Historical Data Size Relative Error 1e-07 8e-08 6e-08 4e-08 2e Historical Data Size (GB) Accuracy vs Historical Data Size Query time vs Historical Data Size Average Update RunTime (sec) Runtime Avg Disk accesses (overall) Avg Disk accesses (merge) Normal Dataset (streamind data size: 1GB, memory: 250 MB, merge threshold: 10) Historical Data Size (GB) Average Update Disk Accesses Query RunTime (seconds) Runtime Disk accesses Query Disk Accesses Historical Data Size (GB) 36

37 Dependence On Streaming Data Size Update time vs Data Stream Size Relative Error 1e-08 8e-09 6e-09 4e-09 2e Stream Size (MB) Accuracy vs Data Stream Size Query time vs Data Stream Size Average Update RunTime (sec) Runtime Avg Disk accesses (overall) 5000 Avg Disk accesses (merge) Normal Data (historical data size: 100GB, memory: 250 MB, merge threshold: 10) Stream Size (MB) Average Update Disk Accesses Query RunTime (seconds) Runtime Disk accesses Stream Size (MB) 37 Query Disk Accesses

38 Conclusion We present an algorithm that identifies quantile from a union of historical and streaming data in real time with accuracy significantly better than pure streaming algorithms using memory similar to pure streaming algorithms using few additional disk accesses Future Work Analysis on the tradeoff between accuracy and disk accesses Improve the window adaptation of algorithm Parallel methods for processing historical data Other aggregates over a union of historical and streaming data 38

39 Thank You 39

Estimating quantiles from the union of historical and streaming data

Electrical and Computer Engineering Conference Papers, Posters and Presentations Electrical and Computer Engineering 11-216 Estimating quantiles from the union of historical and streaming data Sneha Aman