Nonparametric Importance Sampling for Big Data


1 Nonparametric Importance Sampling for Big Data
Abigael C. Nachtsheim
Research Training Group, Spring 2018
Advisor: Dr. Stufken
School of Mathematical and Statistical Sciences

2 Motivation
- Goal: build a model that predicts well over the predictor space
- Massive amounts of data increasingly available
- Big data presents computational challenges
- First step: some method of data reduction

3 Data Reduction Overview
- Our data set consists of n observations, where n is very large
- From the full data, select s observations, with s << n
- The s observations make up the subdata
- Carry out data analysis on the subdata only

4 Data Reduction Overview: Example
- Full data: 1 response, 9 predictors, n = 10,000,000 observations
- Choose s = 5,000
- Subdata: 1 response, 9 predictors, 5,000 observations

5 Data Reduction Overview
[Figure: full data table (columns Obs, Y, X1, ..., X9; 10M rows) next to the subdata table (same columns; 5K rows)]

6 Data Reduction Overview
[Figure: full data table (columns Obs, Y, X1, ..., X9; 10M rows) next to the subdata table (same columns; 5K rows)]
- But how do we choose?

7 Selecting Subdata: Approach 1
- Goal: subdata that is similar to the full data
- Just take a simple random sample
  - Fast
  - Easy
- But this may not be the best sample for prediction
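
A minimal Python sketch of Approach 1, assuming the full data fits in memory as a list of rows. The function name `simple_random_subdata` and the toy sizes are illustrative and do not come from the slides.

```python
import random

def simple_random_subdata(rows, s, seed=0):
    """Return s rows drawn uniformly without replacement from the full data."""
    rng = random.Random(seed)
    return rng.sample(rows, s)

# Toy full data: n = 10,000 rows of (response, predictor); sizes are illustrative.
full = [(2.0 * i, float(i)) for i in range(10_000)]
sub = simple_random_subdata(full, s=50)
print(len(sub))
```

Every row has the same chance of selection, so sparse regions of the predictor space are represented only in proportion to how many rows fall there.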

8 Selecting Subdata: Approach 2
- Goal: select an optimal subsample
- Select subdata carefully to optimize some criterion
  - Determinant of information matrix
  - Mean square error for prediction
- Improves properties of the estimator

9 Approach 2: Some Methods
- Leverage-based subsampling
- Shrinkage leveraging method
- Unweighted leveraging estimator
- Information-Based Optimal Subdata Selection (IBOSS) *
* Wang, H., Yang, M., & Stufken, J. (2017). Information-Based Optimal Subdata Selection for Big Data Linear Regression. Journal of the American Statistical Association.

10 Approach 2 Example: IBOSS
- Goal: maximize determinant of subdata information matrix
- Some nice properties
  - Unbiased estimators
  - Variance of estimators → 0 as n → ∞
  - Computationally efficient
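
To make the determinant criterion concrete, here is a rough Python sketch for a single predictor and the model y = b0 + b1*x. It keeps the smallest and largest predictor values, in the spirit of the selection rule described in the cited IBOSS paper. The function names and the simulated data are assumptions for illustration, not the authors' implementation.

```python
import random

def info_det(xs):
    """det(X'X) for the model y = b0 + b1*x, where X has rows (1, x)."""
    s = len(xs)
    return s * sum(x * x for x in xs) - sum(xs) ** 2

def extremes_subdata(xs, s):
    """Keep the s/2 smallest and s/2 largest predictor values (s even)."""
    ordered = sorted(xs)
    half = s // 2
    return ordered[:half] + ordered[-half:]

rng = random.Random(1)
full = [rng.gauss(0.0, 1.0) for _ in range(100_000)]  # illustrative full data
s = 50
det_extremes = info_det(extremes_subdata(full, s))
det_random = info_det(rng.sample(full, s))
print(det_extremes > det_random)
```

Because det(X'X) here equals s times the sum of squared deviations of x from its mean, pushing the selected points toward the extremes increases it. That also shows why the criterion depends on the assumed linear model.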

11 Approach 2 Example: IBOSS
- Drawback: assumes linear model
- With big data we may not be able to guess the underlying model

12 Another Possibility?
- Nonparametric approach
  - We don't know the underlying model
- Goal: spread the subdata out throughout the full region

13 Today's Plan
1) Consider 2 new methods
  - Clustering
  - Space-filling design
2) Perform a simulation study to evaluate the methods
3) Conclusions

14 k-means Clustering
- Divide dataset into k initial clusters
- Assign each point to the cluster with the nearest mean (Euclidean distance)
- Update means
- Repeat
- Minimizes within-cluster sum of squares
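
The steps above can be sketched in plain Python as follows. This is an illustrative version of Lloyd's algorithm, not code from the talk. It assumes the k initial means are k randomly chosen data points, which is one common choice, and the function name `kmeans` and the toy data are hypothetical.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: returns (means, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # assumed initialization: k random data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest mean under squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
            for p in points
        ]
        # Update step: each mean becomes the centroid of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old mean if a cluster is empty
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return means, labels

# Two well-separated one-dimensional groups (illustrative data).
points = [(i / 10,) for i in range(10)] + [(10 + i / 10,) for i in range(10)]
means, labels = kmeans(points, k=2)
print(sorted(means))
```

Each pass can only lower the within-cluster sum of squares, so the loop stops at a local minimum of that criterion.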

15 Potential Method 1: Clustering
- Cluster full data using k-means
- Choose subsample from clusters based on cluster characteristics
- We consider two clustering sampling strategies

16 Two Possible Strategies
1) Inversely proportional to density of cluster
  - Sparse cluster → sample (proportionally) more points
  - Dense cluster → sample (proportionally) fewer points
2) Equal subsample size from each cluster
  - Take s/k points from each cluster
- Both are attempts at selecting the subsample uniformly from the full sample
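
A small Python sketch of how the two strategies could turn into per-cluster sample sizes. The slides do not define "density", so the sketch assumes density = (points in cluster) / (width of the cluster's range) in one dimension and gives cluster j a share proportional to 1/density. All function names and the example numbers are hypothetical.

```python
def largest_remainder(shares, s):
    """Round nonnegative shares (summing to 1) to integers that sum to s."""
    quotas = [s * p for p in shares]
    alloc = [int(q) for q in quotas]
    by_remainder = sorted(range(len(quotas)), key=lambda j: quotas[j] - alloc[j], reverse=True)
    for j in by_remainder[: s - sum(alloc)]:
        alloc[j] += 1
    return alloc

def equal_allocation(k, s):
    """Strategy 2: s/k points from every cluster."""
    return largest_remainder([1.0 / k] * k, s)

def inverse_density_allocation(sizes, widths, s):
    """Strategy 1 (assumed form): share of cluster j proportional to 1/density_j,
    with density_j = sizes[j] / widths[j]. Does not cap at the cluster size."""
    inv = [w / n for n, w in zip(sizes, widths)]
    total = sum(inv)
    return largest_remainder([v / total for v in inv], s)

sizes = [900, 60, 40]       # points per cluster (illustrative)
widths = [4.0, 6.0, 8.0]    # range covered by each cluster (illustrative)
print(equal_allocation(5, 50))
print(inverse_density_allocation(sizes, widths, 50))
```

A production version would also need to handle clusters that hold fewer points than their allocation.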

17 Space-Filling Designs
- Spread design points through the experimental region
- Used when the form of the underlying model is unknown

18 Some Examples
- Sphere Packing Design
- Uniform Design
- Fast Flexible Filling Design
- Latin Hypercube Design *
* McKay, M., Beckman, R., & Conover, W. (1979). Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. Technometrics, 21(2).

19 Potential Method 2: Design
- Construct Latin hypercube design with k points
- Cluster full data around these points
- Sample equally from each cluster
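
One possible reading of these three steps, sketched in Python. It assumes that "cluster around these points" means assigning each observation to its nearest design point, and that the design region is the bounding box of the data. The slides specify neither, and the function names are hypothetical.

```python
import random

def latin_hypercube(k, bounds, rng):
    """k points in the box bounds = [(lo, hi), ...]; each axis is cut into k
    equal strata and every stratum is used exactly once per axis."""
    columns = []
    for lo, hi in bounds:
        strata = list(range(k))
        rng.shuffle(strata)
        columns.append([lo + (hi - lo) * (c + rng.random()) / k for c in strata])
    return list(zip(*columns))

def design_subdata(points, k, s, seed=0):
    """Group points by nearest design point, then draw s/k from each group."""
    rng = random.Random(seed)
    bounds = [(min(c), max(c)) for c in zip(*points)]  # assumed design region
    seeds = latin_hypercube(k, bounds, rng)
    groups = [[] for _ in range(k)]
    for p in points:
        nearest = min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, seeds[i])))
        groups[nearest].append(p)
    per_group = s // k
    subdata = []
    for g in groups:
        # A group with fewer than s/k points contributes all it has.
        subdata.extend(rng.sample(g, min(per_group, len(g))))
    return subdata

rng = random.Random(2)
pts = [(rng.random(), rng.random()) for _ in range(1000)]  # illustrative data
sub = design_subdata(pts, k=5, s=50)
print(len(sub))
```

Because the seed points are spread over the region regardless of where the data are dense, equal sampling per group tends to even out coverage of the predictor space.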

20 Simulation Study: Generate X
- One dimensional, mixture of Normals, n = 1000
- Z_1 ~ N(-100, 10,000)
- Z_2 ~ N(300, 1)
- w_i ~ Bernoulli(0.1)
- X_i = w_i * Z_1 + (1 - w_i) * Z_2
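
A short Python sketch of this generating process. It assumes the second argument of N(·, ·) is a variance, so Z_1 has standard deviation 100 and Z_2 has standard deviation 1. The function name `generate_x` and the seed are illustrative.

```python
import random

def generate_x(n=1000, seed=0):
    """X_i = w_i*Z1 + (1 - w_i)*Z2 with w_i ~ Bernoulli(0.1),
    Z1 ~ N(-100, 10,000) (standard deviation 100) and Z2 ~ N(300, 1)."""
    rng = random.Random(seed)
    xs = []
    for _ in range(n):
        w = 1 if rng.random() < 0.1 else 0
        z1 = rng.gauss(-100.0, 100.0)  # gauss takes the standard deviation
        z2 = rng.gauss(300.0, 1.0)
        xs.append(w * z1 + (1 - w) * z2)
    return xs

xs = generate_x()
near_300 = sum(1 for x in xs if abs(x - 300.0) < 5.0)
print(near_300 / len(xs))  # fraction of points in the tight component
```

Roughly 90% of the points pile up in a narrow band around 300, while the remaining 10% are scattered widely around -100.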

21 Simulation Study: Generate Y
- E(Y_i | X_i) = -0.… * X_i^2
- Y(X_i) = E(Y_i | X_i) + 30 * ε_i
- ε_i = independent standard normal errors

22 Simulation Study Analysis
- For each of 1000 data sets with n = 1000, select subdata of size s = 50 using each method:
  - Simple random sample
  - IBOSS
  - Cluster with inverse proportional size, k = 5
  - Cluster with equal size, k = 5
  - Space-filling design, k = 5

23 Simulation Study Analysis
- Using subdata only, estimate a model
  - Use OLS
  - Fit quadratic model
- Compute integrated predicted mean squared error
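
A hedged Python sketch of this analysis step: fit a quadratic by OLS, then average the squared gap between the fitted and true mean over a grid. The slides do not define the integrated error, so the grid average is an assumption. The true mean function, the noise level and the predictor range below are hypothetical stand-ins, not the study's settings.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_quadratic(xs, ys):
    """OLS for y = b0 + b1*x + b2*x^2 via the normal equations X'X b = X'y."""
    rows = [(1.0, x, x * x) for x in xs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve(xtx, xty)

def integrated_mse(beta, truth, lo, hi, grid=1001):
    """Average squared gap between fitted and true mean over an even grid on [lo, hi]."""
    total = 0.0
    for g in range(grid):
        x = lo + (hi - lo) * g / (grid - 1)
        pred = beta[0] + beta[1] * x + beta[2] * x * x
        total += (pred - truth(x)) ** 2
    return total / grid

truth = lambda x: 1.0 + 2.0 * x * x  # hypothetical true mean, not the slides' model
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(50)]
ys = [truth(x) + 0.1 * rng.gauss(0.0, 1.0) for x in xs]
beta = fit_quadratic(xs, ys)
print(integrated_mse(beta, truth, -1.0, 1.0))
```

For predictors on the scale of the simulation (values near 300), centering and scaling x before forming the normal equations would be advisable for numerical stability.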

24 Simulation Results
- 10% of the data is here

25 Simulation Results
- 10% of the data is here
- 90% of the data is here

26 Simulation Results
- 10% of the data is here
- 90% of the data is here
- This is the true response: Y = … * X^2

27 Simple Random Sample

28 IBOSS

29 Cluster: Equal Sizes

30 Cluster: Inverse Proportional Sizes

31 Space-filling Design

32 Full Data

33 Toy Example: Results
Method | Predicted RMSE
Simple Random Sample | 59,498
IBOSS |
Cluster: Inverse Prop |
Space-Filling Design | 9.33
Cluster: Equal | 9.31
Full Data |


35 Example with Real Data
- n = 4.2 million
- p = 15
- 1 continuous response
- Used in the IBOSS paper

36 Example with Real Data
- Construct subdata of size s = 2,000
- Consider 4 methods:
  - Simple random sample
  - IBOSS
  - Space-filling design
  - Cluster: Equal

37 Example with Real Data
- Fit two models
  - First-order linear model (as in IBOSS paper)
  - Second-order linear model
- Compute holdout predicted mean squared error

38 Real Data Results: First-Order Model
Method | Predicted MSE (using 2,000 observations)
IBOSS |
Simple random sample |
Cluster: Equal |
Space-filling design |
Predicted MSE from the full data (using 4.2 million observations):

39 Real Data Results: Second-Order Model
Method | Predicted MSE (using 2,000 observations)
IBOSS | 90,545.1
Simple random sample |
Cluster: Equal |
Space-filling design |
Predicted MSE from the full data (using 4.2 million observations):


41 Preliminary Conclusions
- We can spread points uniformly using clustering and space-filling methods
- If the goal is prediction: clustering and space-filling methods are as good as or better than a simple random sample
- Space-filling design method performs best with the quadratic model

42 Future Work
1) More extensive simulation study involving
  - Different sizes of k
  - Different underlying models
2) Explore alternative methods to choose seed points
  - Fast Flexible Filling Design
  - Uniform random sample
3) Nearest neighbor to seed points rather than cluster
4) Consider large sample properties
