STRATIFIED SAMPLING MEETS MACHINE LEARNING
KEVIN LANG, KONSTANTIN SHMAKOV, EDO LIBERTY


Apps with Flurry SDK


Queries have the form $\sum_i q(u_i)$, where $q(\cdot)$ is a function of a single record $u_i$.
Examples:
- Number of events of a certain type
- Number of unique users
- Number of unique users on a specific day
- Total time spent in a certain geo
- Average dollars spent, by age

SAMPLING
Challenges:
1. The data is very large. Computing $\sum_i q(u_i)$ exactly is too costly.
2. The function $q(\cdot)$ is user specified and completely unconstrained.
Good news: An approximate answer is acceptable (if the error is small).
Solution: Estimate the answer on a random subset of the records.

NOTATIONS
- $q_i := q(u_i)$, for brevity
- $y := \sum_i q_i$, the exact answer for the query $q$
- $p_i$, the probability of choosing record $i$
- $S$, the set of sampled records, each chosen with probability $p_i$
- $\tilde{y} = \sum_{i \in S} q_i / p_i$, the Horvitz-Thompson estimator for $y$
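The notation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes each record is kept independently of the others (the slide only says each is chosen with probability $p_i$), and the names `sample_records` and `horvitz_thompson`, as well as the values in `q`, are made up for the example.

```python
import random

def sample_records(p, rng):
    """Keep record i independently with probability p[i]; return the kept indices."""
    return [i for i, p_i in enumerate(p) if rng.random() < p_i]

def horvitz_thompson(q, p, sample):
    """Estimate y = sum(q) from the sampled indices by reweighting each q[i] by 1/p[i]."""
    return sum(q[i] / p[i] for i in sample)

rng = random.Random(0)                 # fixed seed so the run is repeatable
q = [1.0, 0.0, 3.0, 2.0, 0.0, 4.0]     # made-up per-record query values q_i
p = [0.5] * len(q)                     # every record kept with probability 1/2
y = sum(q)                             # exact answer, 10.0

# Averaging many independent estimates illustrates unbiasedness.
trials = 20_000
mean_estimate = sum(
    horvitz_thompson(q, p, sample_records(p, rng)) for _ in range(trials)
) / trials
print(y, mean_estimate)                # the average estimate should land close to y
```

A single estimate can be far from $y$; only the average over many samples is expected to match it.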

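PROPERTIES
- $\mathbb{E}[\tilde{y} - y] = 0$: the Horvitz-Thompson estimator is unbiased.
- $\sigma[\tilde{y} - y] \le y \sqrt{1 / (\zeta \cdot \mathrm{card}(q))}$: its standard deviation isn't large.
  Here $\zeta = \min_i p_i$ and $\mathrm{card}(q) := \sum_i |q_i| / \max_i |q_i|$.
- $\Pr[\,|\tilde{y} - y| \ge \varepsilon y\,] \le e^{-O(\varepsilon^2 \zeta \, \mathrm{card}(q))}$: the probability of a large error is small.
- If $\mathrm{card}(q) = \Theta(n)$, then $|S| = O(1/\varepsilon^2)$ samples suffice.
- (Olken, Rotem and Hellerstein 1986, and 1990): application to databases.
- (Acharya, Gibbons, Poosala 2000): uniform sampling is best in the worst case.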
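To get a feel for the numeric cardinality and the standard deviation bound, here is a small Python sketch. It assumes the absolute-value form card(q) = sum |q_i| / max |q_i| and the bound sqrt(1 / (min_i p_i * card(q))) on the relative standard deviation; the function names and the two example queries are hypothetical.

```python
import math

def numeric_cardinality(q):
    """card(q) = sum_i |q_i| / max_i |q_i|; defined as 0 for an all-zero query."""
    peak = max(abs(v) for v in q)
    if peak == 0:
        return 0.0
    return sum(abs(v) for v in q) / peak

def relative_std_bound(min_probability, card):
    """Upper bound on std(estimate - y) / y for the given minimum sampling probability."""
    return math.sqrt(1.0 / (min_probability * card))

# Two made-up indicator queries over 10,000 records.
broad = [1.0] * 10_000                   # matches every record
narrow = [1.0] * 20 + [0.0] * 9_980      # matches only 20 records
min_probability = 0.1                    # smallest sampling probability in use

for name, q in [("broad", broad), ("narrow", narrow)]:
    card = numeric_cardinality(q)
    print(name, card, relative_std_bound(min_probability, card))
```

With the same sampling probabilities, the narrow query has a far weaker guarantee than the broad one, which is the situation stratified sampling is meant to address.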

STRATIFIED SAMPLING
Sample = 100,000 US individuals.
Query = Republicans vs. Democrats in American Samoa?
- American Samoa is 0.02% of the population.
- Only ~20 people from American Samoa are in the sample.
- The survey error is very large!
If $\mathrm{card}(q)$ is small, $|S|$ must be large.
Sample different strata (e.g. US territories) with different probabilities. (Neyman, Jerzy 1934)
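The arithmetic behind this example can be checked with a short Python sketch. It assumes a simple random sample and uses the textbook worst-case standard error of an estimated proportion, 0.5 / sqrt(respondents); the function name is made up for the illustration.

```python
import math

def worst_case_standard_error(respondents):
    """Standard error of an estimated proportion when the true share is 1/2 (the worst case)."""
    return 0.5 / math.sqrt(respondents)

sample_size = 100_000
samoa_share = 0.0002                                # 0.02% of the population
expected_samoa = sample_size * samoa_share          # about 20 respondents

se_samoa = worst_case_standard_error(expected_samoa)
se_national = worst_case_standard_error(sample_size)
print(round(expected_samoa), round(se_samoa, 3), round(se_national, 4))
```

Under these assumptions, the error for the American Samoa question is roughly 11 percentage points, against well under one point for a nationwide question on the same sample.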

DBLP EXAMPLE
Choosing the right strata is hard!
- 2,101,151 papers
- 1000 most populous venues
Query examples:
- title contains "learning" and # authors <= 3
- title contains "mechanism" and year > 2004
What is the right stratification here?
- Stratifying by venue made things worse!
- Stratifying by year was better but still worse than uniform sampling.

SAMPLING, STRATIFICATION, AND DATABASES
Design strata that minimize worst case variance on possible queries
Linearly combine strata based on record features
Combine stratified and uniform sampling: Congressional Sampling
  o Acharya, Gibbons, Poosala 2000
Important idea: consider past queries to the database!
Each stratum is a set of records that agree on all queries
  o Chaudhuri, Das and Narasayya 2007: optimize for the query log
Split into two strata per query. Take linear combinations
  o Joshi, Jermaine 2008: linear combinations of stratified probabilities

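OUR APPROACH
- Assume queries are drawn from a distribution $\mathcal{Q}$.
- Use the query log $Q$ as a training set (assumed to be sampled from $\mathcal{Q}$).
- Allow each record to be sampled with a different probability $p_i$.
- Minimize the risk $\mathbb{E}[(\tilde{y} - y)^2]$.
This translates to minimizing
$$\mathbb{E}_{q \sim \mathcal{Q}} \sum_i q_i^2 (1/p_i - 1),$$
where the distribution $\mathcal{Q}$ is unknown.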
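The step from the risk to the sum over records can be worked out as follows. This derivation is a sketch that assumes records are sampled independently; the indicator variables $X_i$ are introduced here for the illustration and do not appear on the slides.

```latex
% X_i = 1 if record i is sampled (probability p_i), else 0; the X_i are assumed independent.
\begin{align*}
\tilde{y} - y &= \sum_i q_i \left( \frac{X_i}{p_i} - 1 \right),
  \qquad \mathbb{E}\left[ \frac{X_i}{p_i} - 1 \right] = 0, \\
\mathbb{E}\left[ (\tilde{y} - y)^2 \right]
  &= \sum_i q_i^2 \, \frac{\operatorname{Var}(X_i)}{p_i^2}
   = \sum_i q_i^2 \, \frac{p_i (1 - p_i)}{p_i^2}
   = \sum_i q_i^2 \left( \frac{1}{p_i} - 1 \right).
\end{align*}
```

Taking the expectation over queries $q$ then gives the expression on the slide.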

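OUR APPROACH
ERM: minimize over $p$
$$\sum_{q \in Q} \sum_i q_i^2 (1/p_i - 1)$$
where the outer sum runs over the query log, subject to
- Sampling budget: $\sum_i p_i c_i \le B$ (with $\sum_i c_i \ge B$)
- Regularization: $\forall i,\ p_i \in [\zeta, 1]$ (with $\zeta \le B / \sum_i c_i$)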
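The objective and constraints can be evaluated directly, which helps when comparing candidate probability vectors. The Python sketch below is only an illustration: the function names, the tiny query log, the unit costs, and the lower bound on probabilities (written `min_probability`) are all made-up assumptions.

```python
def empirical_risk(query_log, p):
    """Sum over logged queries q and records i of q_i^2 * (1/p_i - 1)."""
    return sum(
        q_i ** 2 * (1.0 / p_i - 1.0)
        for q in query_log
        for q_i, p_i in zip(q, p)
    )

def is_feasible(p, costs, budget, min_probability):
    """Check the regularization bounds and the sampling budget."""
    within_bounds = all(min_probability <= p_i <= 1.0 for p_i in p)
    spend = sum(p_i * c_i for p_i, c_i in zip(p, costs))
    return within_bounds and spend <= budget

# Made-up log of two queries over four records, each record costing 1 to store.
query_log = [[1.0, 0.0, 0.0, 2.0], [1.0, 1.0, 0.0, 0.0]]
costs = [1.0, 1.0, 1.0, 1.0]
budget, min_probability = 2.0, 0.1

uniform = [0.5, 0.5, 0.5, 0.5]
skewed = [0.6, 0.4, 0.1, 0.85]    # favors records that carry more weight in the log

for name, p in [("uniform", uniform), ("skewed", skewed)]:
    print(name, is_feasible(p, costs, budget, min_probability), empirical_risk(query_log, p))
```

In this toy case both vectors fit the budget, and the skewed one has the lower empirical risk because it spends more on the records the logged queries actually touch.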

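OUR APPROACH
Solve with Lagrange multipliers:
$$\max_{\alpha, \beta, \gamma} \Big[ \frac{1}{|Q|} \sum_{q \in Q} \sum_i q_i^2 (1/p_i - 1) - \sum_i \alpha_i (p_i - \zeta) - \sum_i \beta_i (1 - p_i) - \gamma \big( B - \sum_i p_i c_i \big) \Big]$$
The constant factor $1/|Q|$ does not change the minimizer.
By the KKT conditions, for every $i$:
$$p_i = \zeta \quad \text{or} \quad p_i = 1 \quad \text{or} \quad p_i \propto \sqrt{\frac{1}{c_i} \cdot \frac{1}{|Q|} \sum_{q \in Q} q_i^2}$$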
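One way to turn the KKT conditions into probabilities is to clip a scaled version of the square-root term to the allowed range and search for the scale that uses up the budget. The Python sketch below does this with bisection. It is an assumption-laden illustration, not the authors' implementation: the function name, the bisection search, and the toy data are made up, and the lower bound on the probabilities is passed in as `min_probability`.

```python
import math

def erm_probabilities(query_log, costs, budget, min_probability, iterations=100):
    """Clip scale * sqrt(mean(q_i^2) / c_i) to [min_probability, 1]; pick scale by bisection."""
    n = len(costs)
    # Mean of q_i^2 over the query log, one value per record.
    moments = [sum(q[i] ** 2 for q in query_log) / len(query_log) for i in range(n)]
    weights = [math.sqrt(m / c) for m, c in zip(moments, costs)]

    def probabilities(scale):
        return [min(1.0, max(min_probability, scale * w)) for w in weights]

    def spend(scale):
        return sum(p * c for p, c in zip(probabilities(scale), costs))

    # Grow the upper end of the search interval until the budget is reached.
    low, high = 0.0, 1.0
    for _ in range(200):
        if spend(high) >= budget:
            break
        high *= 2.0
    if spend(high) < budget:
        # All probabilities are saturated and the budget still is not exhausted.
        return probabilities(high)

    # spend() is non-decreasing in scale, so bisection finds the largest affordable scale.
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if spend(mid) > budget:
            high = mid
        else:
            low = mid
    return probabilities(low)

# Made-up log of two queries over four records with unit costs.
query_log = [[1.0, 0.0, 0.0, 2.0], [1.0, 1.0, 0.0, 0.0]]
costs = [1.0, 1.0, 1.0, 1.0]
p = erm_probabilities(query_log, costs, budget=2.0, min_probability=0.1)
print([round(p_i, 3) for p_i in p])
```

In this example the third record never appears in the log, so it is kept at the minimum probability, while the remaining budget is split in proportion to the square-root term.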

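OUR APPROACH
$$\mathrm{Risk}(p) \le \mathrm{Risk}(p^*) \left( 1 + O\left( \mathrm{skew} \cdot \sqrt{\log(n/\delta) / |Q|} \right) \right)$$
- $\mathrm{Risk}(p)$: the algorithm's risk
- $\mathrm{Risk}(p^*)$: the best risk
- skew: database "badness"
- $|Q|$: training set size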

RESULTS

RESULTS
[Figure: DBLP Dataset. Expected Error vs. Value of Regularization Parameter Eta (weaker to stronger). Curves: Uniform Sampling and four training-set sizes (Training Queries).]

RESULTS
[Figure: Cube Dataset. Expected Error vs. Value of Regularization Parameter Eta (weaker to stronger). Curves: Uniform Sampling (p = 1/10); 50, 100, 200, 800, and 6400 Training Queries.]

RESULTS
[Figure: YAM+ Dataset. Expected Error vs. Value of Regularization Parameter Eta (weaker to stronger). Legible curve labels: Uniform; 100, 200, 400, and All Training Queries.]

RESULTS
[Figure: YAM+ Dataset. Expected Error vs. Numeric Cardinality of Test Query. Curves: Regularized ERM, Uniform Sampling.]

RESULTS
[Figure: YAM+ Dataset. Axes: Probability (Rescaled), Average Error. Curves: Uniform Sampling, Regularized ERM.]

RESULTS
[Figure: YAM+ Dataset. Expected Error vs. Sampling Rate = Budget / (Total Cost). Curves: Uniform Sampling, Regularized ERM.]

RESULTS
[Figure: Cube Dataset. Expected Error vs. Numeric Cardinality of Test Query. Curves: ERM, Uniform Sampling.]


Stratified Reservoir Sampling over Heterogeneous Data Streams Mohammed Al-Kateb and Byung Suk Lee Department of Computer Science, The University of Vermont, Burlington VT, USA {malkateb,bslee}@cs.uvm.edu