COPULA MODELS FOR BIG DATA USING DATA SHUFFLING


Krish Muralidhar, Rathindra Sarathy

Department of Marketing & Supply Chain Management, Price College of Business, University of Oklahoma, Norman, OK
Department of Management Science & Information Systems, Spears School of Business, Oklahoma State University, Stillwater, OK

ABSTRACT

Big data often involves complex relationships among variables. Copula models offer a simple, effective approach for modeling such complex relationships. Traditional implementations of copula models require the identification of the marginal distribution of each variable, which makes it difficult to automate the modeling process. In this study, we provide a non-parametric implementation using a new procedure called Data Shuffling that allows the entire modeling process to be easily automated.

INTRODUCTION

With the recent focus on the use of big data for marketing purposes, the development of models that can be used for prediction and inference with big data has gained importance. While traditional statistical models are useful for big data, there is a need to develop more sophisticated models. When the underlying distribution of the data is not normal and the relationships between the variables are non-linear, traditional statistical models are not appropriate. To overcome this problem, Danaher and Smith (2011) recently introduced the use of copula models for marketing applications. Using different examples from the marketing literature, they also demonstrated the versatility, flexibility, and accuracy of copulas for modeling marketing problems.
The approach proposed by Danaher and Smith (2011) involves the following steps: (1) estimation of the marginal distribution of the individual variables, (2) estimation of the dependence parameters (the correlation matrix), (3) generation of multiple samples using Markov Chain Monte Carlo (MCMC) simulation, and (4) estimation of Bayesian point estimates of the parameters and other metrics of interest from the MCMC-generated samples. The procedure suggested by Danaher and Smith (2011) is an excellent one for the purposes of model construction, validation, and hypothesis testing in the context of marketing. However, in the context of big data, it would be difficult to use this procedure.

Big data has three main characteristics: volume (quantity of data), variety (types of data), and velocity (speed of data collection). When implementing prediction models for big data, it is important that we consider these characteristics. Since the data is high volume, the prediction models must be scalable to large-scale data sets. Since the data has considerable variety, the prediction models must be capable of incorporating both numerical (continuous and discrete) and categorical variables. Since the data has high velocity, the implementation should be automated and require no manual intervention. Unfortunately, the Danaher and Smith (2011) procedure does not adequately satisfy any of these requirements, as we now discuss.

Identification of the marginal distribution

The identification of the marginal distribution of the individual variables is an important step in modeling using copulas. In many cases, there may be a clear theoretical basis for using a particular marginal distribution to model a particular variable. For example, the log-normal distribution is used as the appropriate marginal distribution for duration of visits based on the previous study by Danaher and Smith (2007); the beta binomial distribution is used as the appropriate marginal distribution for the eggs and bacon data and for exposure to magazine advertisements based on prior studies (Chandon 1986, Danaher and Hardie 2005, Rust 1986); and the negative binomial distribution is used for modeling the number of page views by web sites based on prior studies by Danaher (2007) and Huang and Lin (2006). Thus, when there is prior information about the characteristics of the marginal distribution of the individual variables, it would be appropriate to use this information. In situations where no theoretical basis exists for the use of a particular distribution for a particular variable, the decision becomes more difficult.
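When no theoretical basis points to a particular marginal distribution, one common automation strategy is to fit several candidate distributions by maximum likelihood and compare a goodness-of-fit statistic. The sketch below illustrates this with SciPy; the candidate list and the synthetic data are assumptions for illustration, and this is not the procedure used by Danaher and Smith (2011).

```python
# Illustrative sketch: fit candidate marginal distributions by maximum
# likelihood and compare them with the Kolmogorov-Smirnov statistic.
# The candidates and the synthetic data below are made up for the demo.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.lognormal(mean=1.0, sigma=0.5, size=2000)  # e.g., visit durations

candidates = {
    "lognorm": stats.lognorm,
    "gamma": stats.gamma,
    "expon": stats.expon,
}

fits = {}
for name, dist in candidates.items():
    params = dist.fit(data)                           # maximum likelihood fit
    ks = stats.kstest(data, dist.cdf, args=params).statistic
    fits[name] = ks                                   # smaller KS = better fit

best = min(fits, key=fits.get)
```

As the text notes, different goodness-of-fit criteria (KS, Anderson-Darling, AIC, ...) can disagree, which is precisely why fully automating this step is hard.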
The estimation of the marginal distribution of an individual variable has received considerable attention in the statistical literature. As observed by Danaher and Smith (2011), many approaches for estimating the marginal distribution of a variable exist, including maximum likelihood, Bayesian, and method-of-moments approaches. With each of these approaches, there are also multiple criteria for assessing the goodness of fit of the estimated marginal to the observed data. Unfortunately, the diversity of approaches also presents a problem: it is difficult to identify one particular approach as being superior to all others. Even within a given approach, the diversity of criteria used to assess the goodness of fit creates doubt as to which distribution and estimated parameters provide the best fit for the observed data. Hence, for a given dataset, there may be reasonable disagreement as to which marginal distribution is best suited to model a particular variable. The problem is magnified when the problem under consideration has many marginal variables. In the Danaher and Smith (2011) study, the web site page views data consists of 45 variables. If previous studies had not established the beta binomial distribution as the appropriate model for web site page views, then it would be necessary to identify the marginal distribution of each of these variables, which can be a difficult task. In addition, if the variables under consideration were not of similar measures (web site views), but represented completely different characteristics (for example, age, income, etc.), the task of identifying the marginal distribution of each of these different characteristics can be imposing.

Computational complexity

The Danaher and Smith (2011) procedure is a considerable improvement over other procedures previously used for modeling complex relationships in marketing. However, their model requires considerable computational effort to compute estimates due to the use of MCMC simulation. For instance, even for a relatively small problem, implementing their procedure requires approximately 55 minutes of computational time. This is a considerable improvement over other models, which require over 6 hours of computational time, but in the context of big data this computational effort would still be considered burdensome. Thus, we need to evaluate alternative approaches for implementing copula models for big data. In this study, we propose a modified version of the copula approach called data shuffling. However, it should be noted that the Danaher and Smith (2011) procedure should be the preferred approach for modeling in research scenarios. Prior to describing the data shuffling approach, we briefly describe the concept of copulas.

COPULA MODELS

Consider a set of random variables A_1, A_2, ..., A_m with marginal cumulative distribution functions (CDFs) u_i = F_i(a_i), i = 1, 2, ..., m, and joint CDF F(a_1, a_2, ..., a_m). Sklar (1959) showed that the joint CDF can be written as:

F(a_1, a_2, ..., a_m) = C[u_1, u_2, ..., u_m],  (1)

where C[u_1, u_2, ..., u_m] is a joint copula CDF with uniform marginal distributions. In addition, the joint probability density function (pdf) can be written in product form as:

f(a_1, a_2, ..., a_m) = [∏_{i=1}^{m} f_i(a_i)] c[u_1, u_2, ..., u_m],  (2)

where c is called the copula density and f_i(a_i) are the marginal densities of A_i.
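Equation (1) rests on the probability integral transform: for any continuous marginal F, u = F(a) is uniform on [0, 1]. A minimal sketch, using the empirical CDF (scaled ranks) in place of a fitted marginal, as a nonparametric implementation would:

```python
# Sketch of the probability integral transform underlying equation (1):
# u = F(a) is (approximately) uniform on [0, 1], whatever the marginal F.
# Here F is taken as the empirical CDF, which reduces to scaled ranks.
import numpy as np

rng = np.random.default_rng(1)
a = rng.gamma(shape=2.0, scale=3.0, size=5000)  # arbitrary non-normal marginal

# Empirical CDF evaluated at the sample itself = rank / (n + 1).
ranks = a.argsort().argsort() + 1               # 1-based ranks of each value
u = ranks / (len(a) + 1.0)

# A uniform [0, 1] variable has mean 1/2 and variance 1/12.
mean_u, var_u = u.mean(), u.var()
```

Because the u values depend only on ranks, this transform needs no knowledge of the true marginal, which is the property the data shuffling approach described later exploits.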
The joint density shown in (2) also provides the ability to derive the conditional density of one or more of the random variables with respect to the other variables. The primary applications of copulas have been to combine specified (arbitrary) marginal distributions into joint distributions that exhibit certain specified dependence or joint behavior. Joe (1997), Nelsen (1995, 1999), and Schweizer (1991) serve as good introductions to the theory and application of copulas. A wide variety of copula functions have been investigated for combining non-normal distributions, both discrete and continuous. The selection of a copula function depends on the specific problem under consideration. The characteristics of the joint distribution also vary depending on the specific type of copula selected.

In this study, we use the multivariate normal copula for illustration purposes. The normal copula parameterized with product moment correlation matrix ρ can be written as:

C_ρ(u) = Φ^m_ρ(Φ^{-1}(u_1), Φ^{-1}(u_2), ..., Φ^{-1}(u_m)),  (3)

where Φ^m_ρ represents the joint CDF of an m-variate standard multivariate normal distribution with correlation matrix ρ and Φ^{-1} represents the inverse of the CDF of the univariate standard normal distribution. In addition, for the multivariate normal distribution, the relationship between the rank order and product moment correlations can be expressed as follows:

ρ_ij = 2 sin(π r_ij / 6),  (4)

where ρ_ij and r_ij are the product moment and rank order correlations, respectively, between variables (i, j). Hence, the rank order correlation matrix R of the original data can be used to compute the product moment correlation ρ of the transformed normal variables. Thus, the multivariate normal copula model allows us to express the relationship between a set of variables with arbitrary marginal distributions using the relatively simple multivariate normal distribution. The application of copulas as described above requires the identification of the marginal distribution of each of the variables, as described in Danaher and Smith (2011). As discussed earlier, the identification of the marginal distribution is often complicated and requires human involvement in the identification and selection of the best-fit distribution. This may prevent automation of the modeling process, which is essential for big data applications. A nonparametric approach based only on the empirically observed data, called Data Shuffling, may provide a viable approach in these cases.

DATA SHUFFLING

Data shuffling was originally proposed by Muralidhar and Sarathy (2006) in the context of statistical disclosure limitation.
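Equations (3) and (4) can be sketched numerically: draw from an m-variate standard normal whose Pearson correlation is obtained from a target rank order correlation via equation (4), then transform each coordinate by the standard normal CDF to obtain a copula sample with uniform marginals. The target value and sample size here are arbitrary.

```python
# Sketch of equations (3) and (4): sampling from a normal copula whose
# Pearson correlation is derived from a target Spearman correlation.
import numpy as np
from scipy import stats

def spearman_to_pearson(r):
    """Equation (4): rank order -> product moment correlation (normal copula)."""
    return 2.0 * np.sin(np.pi * r / 6.0)

target_spearman = 0.60
rho = spearman_to_pearson(target_spearman)        # ~0.618

rng = np.random.default_rng(2)
cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal(np.zeros(2), cov, size=20000)
u = stats.norm.cdf(z)                             # copula sample: uniform marginals

# The rank order correlation of the sample should be near the target.
r_hat, _ = stats.spearmanr(u[:, 0], u[:, 1])
```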
The purpose of data shuffling was to generate a new data set that preserved the characteristics of the original data without disclosing information about individual records. Data shuffling is implemented as follows: (1) identify the rank order correlation of the original data, (2) construct the copula model using the rank order correlation, (3) generate a new data set using the normalized copula model, and (4) reverse map the original data onto the generated normalized values. Note that the data shuffling process does not require the marginal distribution to be modeled and it does not require MCMC simulation, the two steps that make the implementation of the Danaher and Smith (2011) procedure problematic in the big data scenario. The key to data shuffling is the concept of reverse mapping. It is the ability to reverse map the normalized values back to the original data using ranks that allows the entire procedure to be implemented without having to identify the original marginal distribution. In traditional copula implementations, once the normalized values have been generated, it would be necessary to map these values back to the original distribution via the marginal distribution of the original data. In reverse mapping, we instead treat the generated normalized values as another random realization of the original data, take the original data itself as the empirical marginal distribution, and reverse map the realization back to that empirical marginal distribution. Note that for large data sets, there exists a practically infinite number of potential combinations that result in the same rank order correlation matrix. The only exception to this rule is the scenario where there is complete collinearity in the data (that is, the rank order correlation is either +1 or -1). For brevity, we do not go into the details of the theoretical derivations relating to data shuffling; we refer the interested reader to Muralidhar and Sarathy (2006). Consider the following simple illustration involving two variables and 20 observations with a rank order correlation of 0.60 (see Table 1). We use a multivariate normal copula to model this scenario (although data shuffling is not necessarily limited to this particular copula). The table shows the original data, the generated normalized copula values, and the reverse mapped values. It is important to note that when reverse mapping is performed, the rank order correlations of the normalized copula values and of the reverse mapped data are exactly the same. This is extremely important since the key to the copula model is the preservation of the rank order correlation. To highlight the process of reverse mapping, consider the first normalized copula value (the first value in the Y_1 column). In the traditional copula approach, we would first compute the normal distribution probability of this value.
Then we would use this probability to compute the original value as the inverse of the cumulative probability of the marginal distribution of the variable. Obviously, this would require that we identify the marginal distribution of the variable X_1. With reverse mapping, we avoid this process. We simply find the rank of the normalized copula value (which is 1) and replace this value with the value from X_1 with a rank of 1 (0.2317). We repeat this for every record and every variable. As observed earlier, the rank order correlations of the normalized copula values and of the reverse mapped values are identical, preserving the relationship between the variables. The illustration in Table 1 provides the basic conceptual idea behind data shuffling. In the following section, we provide the results of a comprehensive simulation experiment conducted to assess the effectiveness of data shuffling.

EMPIRICAL EVALUATION OF DATA SHUFFLING

In this section, we describe a simulation experiment conducted to evaluate the effectiveness of data shuffling. To be realistic, we intentionally chose the relationships between the variables to be complex. We consider a data set with three variables. Figures 1-3 show the relationships among these variables and their rank order correlations.
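The four data shuffling steps and the reverse mapping rule described above can be sketched end-to-end as follows. This is an independent re-implementation for illustration, assuming a normal copula; the synthetic two-variable data set and the dependence injected into it are made up.

```python
# Sketch of data shuffling: (1) rank order correlation, (2) copula model
# via rho = 2*sin(pi*r/6), (3) generate normalized values, (4) reverse map.
import numpy as np
from scipy import stats

def shuffle_data(X, rng):
    """Return a shuffled data set with exactly the original marginal values
    and approximately the original rank order correlation."""
    n, m = X.shape
    # Step 1: rank order (Spearman) correlation of the original data.
    R = np.corrcoef(stats.rankdata(X, axis=0), rowvar=False)
    # Step 2: normal copula parameter, equation (4).
    P = 2.0 * np.sin(np.pi * R / 6.0)
    np.fill_diagonal(P, 1.0)
    # Step 3: generate normalized values from the copula model.
    Y = rng.multivariate_normal(np.zeros(m), P, size=n)
    # Step 4: reverse map - the normalized value of rank k in each column
    # is replaced by the original value of rank k in the same column.
    Z = np.empty_like(X)
    for j in range(m):
        ranks = Y[:, j].argsort().argsort()      # 0-based ranks of Y column
        Z[:, j] = np.sort(X[:, j])[ranks]
    return Z

rng = np.random.default_rng(0)
n = 5000
x1 = rng.lognormal(size=n)                       # arbitrary non-normal marginal
x2 = rng.gamma(2.0, size=n) + 0.5 * np.log1p(x1)  # hypothetical dependence
X = np.column_stack([x1, x2])
Z = shuffle_data(X, rng)
```

No marginal distribution is fitted anywhere, and no MCMC is required; because Z reuses the original values column by column, the marginals are preserved exactly while the ranks carry the dependence.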

Table 1. Small illustrative example of data shuffling: the original data set (X_1, X_2), the copula generated normalized data set (Y_1, Y_2) with ranks, and the shuffled data set (Z_1, Z_2) sharing those ranks (Y_1 & Z_1, Y_2 & Z_2), together with the rank order correlation of each data set.

Figure 1. Scatter plot of variables 1 and 2 for the original data (Rank order correlation = 0.70)

Figure 2. Scatter plot of variables 1 and 3 for the original data (Rank order correlation = 0.60)

Figure 3. Scatter plot of variables 2 and 3 for the original data (Rank order correlation = 0.75)

For illustrative purposes, Figures 4-6 provide the values generated using the data shuffling approach for a single set of simulated values. Comparing Figures (1 and 4), (2 and 5), and (3 and 6), we observe that the data generated using the data shuffling procedure closely approximate the original data. This result provides basic visual verification of the effectiveness of the data shuffling procedure. We conducted a simulation experiment to evaluate the performance of the data shuffling procedure. The purpose of the experiment was to assess the extent to which the values generated using the data shuffling approach were able to approximate the original relationship. The process of generating new samples was replicated 100 times. For each sample, we computed the rank order correlation between the variables. Using this rank order correlation, we computed the bias (the difference between the rank order correlation of the generated values and that of the original data).

Figure 4. Scatter plot of variables 1 and 2 for the shuffled data

Figure 5. Scatter plot of variables 1 and 3 for the shuffled data

Figure 6. Scatter plot of variables 2 and 3 for the shuffled data
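The replication experiment described above can be sketched as follows, assuming a normal copula shuffling step (re-implemented inline) and a simple synthetic two-variable data set; the three-variable data used in the paper is not reproduced here.

```python
# Sketch of the replication experiment: shuffle the same data set 100 times
# and record the bias in the rank order correlation of each shuffled set.
import numpy as np
from scipy import stats

def shuffle_once(X, rng):
    """One data shuffling replication (normal copula, reverse mapping)."""
    n, m = X.shape
    R = np.corrcoef(stats.rankdata(X, axis=0), rowvar=False)
    P = 2.0 * np.sin(np.pi * R / 6.0)                 # equation (4)
    np.fill_diagonal(P, 1.0)
    Y = rng.multivariate_normal(np.zeros(m), P, size=n)
    Z = np.empty_like(X)
    for j in range(m):
        Z[:, j] = np.sort(X[:, j])[Y[:, j].argsort().argsort()]
    return Z

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=2000)
r_orig = stats.spearmanr(X[:, 0], X[:, 1])[0]

# Bias = rank order correlation of shuffled data minus that of the original.
biases = []
for _ in range(100):
    Z = shuffle_once(X, rng)
    biases.append(stats.spearmanr(Z[:, 0], Z[:, 1])[0] - r_orig)

mean_bias = float(np.mean(biases))
sd_bias = float(np.std(biases))
```

The mean bias should be near zero and its standard deviation small, mirroring the pattern reported in Table 2 for the paper's data set.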

Table 2 provides the mean and standard deviation of this bias. The results indicate that the mean of this measure is very close to zero, indicating that data shuffling is unbiased in estimating the rank order correlation. The standard deviation of the bias is also very small, indicating that for this data set, data shuffling is very effective in providing a close approximation of the true relationship between the variables. Figure 7 provides the frequency distribution of the bias in the rank order correlation between variables 1 and 2 across all 100 replications. The figure indicates that the differences between the simulated and original values are extremely small and, in most practical scenarios, would be considered negligible. In addition, the frequency distribution also provides the decision maker with the ability to perform simple hypothesis tests regarding the relationship between the variables. In summary, these results provide strong evidence of the effectiveness of data shuffling as a simple but effective procedure for using copula models in a big data environment.

Table 2. Bias in correlation for shuffled data: average bias and standard deviation of the bias for variables 1 and 2, variables 1 and 3, and variables 2 and 3.

Figure 7. Frequency distribution of shuffled rank order correlation for variables 1 and 2

CONCLUSIONS

The characteristics of big data (volume, variety, velocity) make it necessary that existing analytical approaches be modified to suit big data. Specifically, it is necessary that the modeling approaches be automated (requiring little or no human intervention) and scalable. When modeling complex relationships among variables, copulas offer a simple but effective approach. Traditional copula implementations require considerable effort in identifying the marginal distributions, which makes them a difficult option for big data.
In this study, we offer data shuffling as an alternative to the traditional copula models to overcome this problem. Our experimental results indicate that data shuffling is capable of effectively modeling complex relationships. By using reverse mapping, data shuffling eliminates the need to identify the marginal distributions of the variables when implementing copula models. In addition, data shuffling is asymptotically equivalent to the traditional approach: as the size of the data set increases, the difference between the reverse mapping approach and the traditional marginal distribution approach becomes negligibly small. A more comprehensive investigation using more variables, different sample sizes, and different relationships is currently being conducted.

REFERENCES

Chandon, J.-L. J. (1986) A Comparative Study of Media Exposure Models. Garland, New York.
Danaher, P. J. (2007) Modeling page views across multiple websites with an application to Internet reach and frequency prediction. Marketing Science, 26(3).
Danaher, P. J., and Hardie, B. G. S. (2005) Bacon with your eggs? Applications of a new bivariate beta-binomial distribution. American Statistician, 59(4).
Danaher, P. J. and Smith, M. S. (2011) Modeling Multivariate Distributions Using Copulas: Applications in Marketing. Marketing Science, 30(1).
Huang, C.-Y. and Lin, C.-S. (2006) Modeling the audience's banner ad exposure for Internet advertising planning. Journal of Advertising, 35(2).
Joe, H. (1997) Multivariate Models and Dependence Concepts. Chapman & Hall, London.
Muralidhar, K. and Sarathy, R. (2006) Data Shuffling: A New Masking Approach for Numerical Data. Management Science, 52(5).
Nelsen, R. B. (1995) Copulas, Characterization, Correlation and Counterexamples. Mathematics Magazine.
Rust, R. T. (1986) Advertising Media Models: A Practical Guide. Lexington Books, Lexington, MA.
Schweizer, B. (1991) Thirty Years of Copulas. In G. Dall'Aglio, S. Kotz, G. Salinetti (eds.), Advances in Probability Distributions with Given Marginals, Kluwer, Dordrecht, Netherlands.
Sklar, A. (1959) Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris.


More information

A GENERAL GIBBS SAMPLING ALGORITHM FOR ANALYZING LINEAR MODELS USING THE SAS SYSTEM

A GENERAL GIBBS SAMPLING ALGORITHM FOR ANALYZING LINEAR MODELS USING THE SAS SYSTEM A GENERAL GIBBS SAMPLING ALGORITHM FOR ANALYZING LINEAR MODELS USING THE SAS SYSTEM Jayawant Mandrekar, Daniel J. Sargent, Paul J. Novotny, Jeff A. Sloan Mayo Clinic, Rochester, MN 55905 ABSTRACT A general

More information

Monte Carlo Methods and Statistical Computing: My Personal E

Monte Carlo Methods and Statistical Computing: My Personal E Monte Carlo Methods and Statistical Computing: My Personal Experience Department of Mathematics & Statistics Indian Institute of Technology Kanpur November 29, 2014 Outline Preface 1 Preface 2 3 4 5 6

More information

Bootstrapping Method for 14 June 2016 R. Russell Rhinehart. Bootstrapping

Bootstrapping Method for  14 June 2016 R. Russell Rhinehart. Bootstrapping Bootstrapping Method for www.r3eda.com 14 June 2016 R. Russell Rhinehart Bootstrapping This is extracted from the book, Nonlinear Regression Modeling for Engineering Applications: Modeling, Model Validation,

More information

Uniform Fractional Part Algorithm And Applications

Uniform Fractional Part Algorithm And Applications Proceedings of the 2011 International Conference on Industrial Engineering and Operations Management Kuala Lumpur, Malaysia, January 22 24, 2011 Uniform Fractional Part Algorithm And Applications Elham

More information

Why is Statistics important in Bioinformatics?

Why is Statistics important in Bioinformatics? Why is Statistics important in Bioinformatics? Random processes are inherent in evolution and in sampling (data collection). Errors are often unavoidable in the data collection process. Statistics helps

More information

MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER

MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER A.Shabbir 1, 2 and G.Verdoolaege 1, 3 1 Department of Applied Physics, Ghent University, B-9000 Ghent, Belgium 2 Max Planck Institute

More information

Bland-Altman Plot and Analysis

Bland-Altman Plot and Analysis Chapter 04 Bland-Altman Plot and Analysis Introduction The Bland-Altman (mean-difference or limits of agreement) plot and analysis is used to compare two measurements of the same variable. That is, it

More information

BRANDING AND STYLE GUIDELINES

BRANDING AND STYLE GUIDELINES BRANDING AND STYLE GUIDELINES INTRODUCTION The Dodd family brand is designed for clarity of communication and consistency within departments. Bold colors and photographs are set on simple and clean backdrops

More information

MCMC Methods for Bayesian Mixtures of Copulas

MCMC Methods for Bayesian Mixtures of Copulas Ricardo Silva Department of Statistical Science University College London ricardo@stats.ucl.ac.uk Robert B. Gramacy Statistical Laboratory University of Cambridge bobby@statslab.cam.ac.uk Abstract Applications

More information

Approximate Bayesian Computation. Alireza Shafaei - April 2016

Approximate Bayesian Computation. Alireza Shafaei - April 2016 Approximate Bayesian Computation Alireza Shafaei - April 2016 The Problem Given a dataset, we are interested in. The Problem Given a dataset, we are interested in. The Problem Given a dataset, we are interested

More information

Chapter 3. Bootstrap. 3.1 Introduction. 3.2 The general idea

Chapter 3. Bootstrap. 3.1 Introduction. 3.2 The general idea Chapter 3 Bootstrap 3.1 Introduction The estimation of parameters in probability distributions is a basic problem in statistics that one tends to encounter already during the very first course on the subject.

More information

Scalable Multidimensional Hierarchical Bayesian Modeling on Spark

Scalable Multidimensional Hierarchical Bayesian Modeling on Spark Scalable Multidimensional Hierarchical Bayesian Modeling on Spark Robert Ormandi, Hongxia Yang and Quan Lu Yahoo! Sunnyvale, CA 2015 Click-Through-Rate (CTR) Prediction Estimating the probability of click

More information

Time Series Analysis by State Space Methods

Time Series Analysis by State Space Methods Time Series Analysis by State Space Methods Second Edition J. Durbin London School of Economics and Political Science and University College London S. J. Koopman Vrije Universiteit Amsterdam OXFORD UNIVERSITY

More information

Model validation through "Posterior predictive checking" and "Leave-one-out"

Model validation through Posterior predictive checking and Leave-one-out Workshop of the Bayes WG / IBS-DR Mainz, 2006-12-01 G. Nehmiz M. Könen-Bergmann Model validation through "Posterior predictive checking" and "Leave-one-out" Overview The posterior predictive distribution

More information

A MECHANICAL APPROACH OF MULTIVARIATE DENSITY FUNCTION APPROXIMATION

A MECHANICAL APPROACH OF MULTIVARIATE DENSITY FUNCTION APPROXIMATION A MECHANICAL APPROACH OF MULTIVARIATE DENSITY FUNCTION APPROXIMATION László Mohácsi (a), Orsolya Rétallér (b) (a) Department of Computer Science, Corvinus University of Budapest (b) MTA-BCE Lendület Strategic

More information

A Random Number Based Method for Monte Carlo Integration

A Random Number Based Method for Monte Carlo Integration A Random Number Based Method for Monte Carlo Integration J Wang and G Harrell Department Math and CS, Valdosta State University, Valdosta, Georgia, USA Abstract - A new method is proposed for Monte Carlo

More information

In a two-way contingency table, the null hypothesis of quasi-independence. (QI) usually arises for two main reasons: 1) some cells involve structural

In a two-way contingency table, the null hypothesis of quasi-independence. (QI) usually arises for two main reasons: 1) some cells involve structural Simulate and Reject Monte Carlo Exact Conditional Tests for Quasi-independence Peter W. F. Smith and John W. McDonald Department of Social Statistics, University of Southampton, Southampton, SO17 1BJ,

More information

Webinar Parameter Identification with optislang. Dynardo GmbH

Webinar Parameter Identification with optislang. Dynardo GmbH Webinar Parameter Identification with optislang Dynardo GmbH 1 Outline Theoretical background Process Integration Sensitivity analysis Least squares minimization Example: Identification of material parameters

More information

An Interval-Based Tool for Verified Arithmetic on Random Variables of Unknown Dependency

An Interval-Based Tool for Verified Arithmetic on Random Variables of Unknown Dependency An Interval-Based Tool for Verified Arithmetic on Random Variables of Unknown Dependency Daniel Berleant and Lizhi Xie Department of Electrical and Computer Engineering Iowa State University Ames, Iowa

More information

Simulation Modeling and Analysis

Simulation Modeling and Analysis Simulation Modeling and Analysis FOURTH EDITION Averill M. Law President Averill M. Law & Associates, Inc. Tucson, Arizona, USA www. averill-law. com Boston Burr Ridge, IL Dubuque, IA New York San Francisco

More information

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display CURRICULUM MAP TEMPLATE Priority Standards = Approximately 70% Supporting Standards = Approximately 20% Additional Standards = Approximately 10% HONORS PROBABILITY AND STATISTICS Essential Questions &

More information

Descriptive and Graphical Analysis of the Data

Descriptive and Graphical Analysis of the Data Descriptive and Graphical Analysis of the Data Carlo Favero Favero () Descriptive and Graphical Analysis of the Data 1 / 10 The first database Our first database is made of 39 seasons (from 1979-1980 to

More information

Applications of the k-nearest neighbor method for regression and resampling

Applications of the k-nearest neighbor method for regression and resampling Applications of the k-nearest neighbor method for regression and resampling Objectives Provide a structured approach to exploring a regression data set. Introduce and demonstrate the k-nearest neighbor

More information

Probability and Statistics for Final Year Engineering Students

Probability and Statistics for Final Year Engineering Students Probability and Statistics for Final Year Engineering Students By Yoni Nazarathy, Last Updated: April 11, 2011. Lecture 1: Introduction and Basic Terms Welcome to the course, time table, assessment, etc..

More information

Two-dimensional Totalistic Code 52

Two-dimensional Totalistic Code 52 Two-dimensional Totalistic Code 52 Todd Rowland Senior Research Associate, Wolfram Research, Inc. 100 Trade Center Drive, Champaign, IL The totalistic two-dimensional cellular automaton code 52 is capable

More information

Direct Sequential Co-simulation with Joint Probability Distributions

Direct Sequential Co-simulation with Joint Probability Distributions Math Geosci (2010) 42: 269 292 DOI 10.1007/s11004-010-9265-x Direct Sequential Co-simulation with Joint Probability Distributions Ana Horta Amílcar Soares Received: 13 May 2009 / Accepted: 3 January 2010

More information

A noninformative Bayesian approach to small area estimation

A noninformative Bayesian approach to small area estimation A noninformative Bayesian approach to small area estimation Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 glen@stat.umn.edu September 2001 Revised May 2002 Research supported

More information

Multivariate Capability Analysis

Multivariate Capability Analysis Multivariate Capability Analysis Summary... 1 Data Input... 3 Analysis Summary... 4 Capability Plot... 5 Capability Indices... 6 Capability Ellipse... 7 Correlation Matrix... 8 Tests for Normality... 8

More information

Probability Models.S4 Simulating Random Variables

Probability Models.S4 Simulating Random Variables Operations Research Models and Methods Paul A. Jensen and Jonathan F. Bard Probability Models.S4 Simulating Random Variables In the fashion of the last several sections, we will often create probability

More information

Box-Cox Transformation for Simple Linear Regression

Box-Cox Transformation for Simple Linear Regression Chapter 192 Box-Cox Transformation for Simple Linear Regression Introduction This procedure finds the appropriate Box-Cox power transformation (1964) for a dataset containing a pair of variables that are

More information

Monte Carlo Simulations

Monte Carlo Simulations Monte Carlo Simulations DESCRIPTION AND APPLICATION Outline Introduction Description of Method Cost Estimating Example Other Considerations Introduction Most interesting things are probabilistic (opinion)

More information

Recurrent Neural Network Models for improved (Pseudo) Random Number Generation in computer security applications

Recurrent Neural Network Models for improved (Pseudo) Random Number Generation in computer security applications Recurrent Neural Network Models for improved (Pseudo) Random Number Generation in computer security applications D.A. Karras 1 and V. Zorkadis 2 1 University of Piraeus, Dept. of Business Administration,

More information

Statistical techniques for data analysis in Cosmology

Statistical techniques for data analysis in Cosmology Statistical techniques for data analysis in Cosmology arxiv:0712.3028; arxiv:0911.3105 Numerical recipes (the bible ) Licia Verde ICREA & ICC UB-IEEC http://icc.ub.edu/~liciaverde outline Lecture 1: Introduction

More information

Lecture 7: Linear Regression (continued)

Lecture 7: Linear Regression (continued) Lecture 7: Linear Regression (continued) Reading: Chapter 3 STATS 2: Data mining and analysis Jonathan Taylor, 10/8 Slide credits: Sergio Bacallado 1 / 14 Potential issues in linear regression 1. Interactions

More information

ECONOMIC DESIGN OF STATISTICAL PROCESS CONTROL USING PRINCIPAL COMPONENTS ANALYSIS AND THE SIMPLICIAL DEPTH RANK CONTROL CHART

ECONOMIC DESIGN OF STATISTICAL PROCESS CONTROL USING PRINCIPAL COMPONENTS ANALYSIS AND THE SIMPLICIAL DEPTH RANK CONTROL CHART ECONOMIC DESIGN OF STATISTICAL PROCESS CONTROL USING PRINCIPAL COMPONENTS ANALYSIS AND THE SIMPLICIAL DEPTH RANK CONTROL CHART Vadhana Jayathavaj Rangsit University, Thailand vadhana.j@rsu.ac.th Adisak

More information

A Class of Symmetric Bivariate Uniform Distributions

A Class of Symmetric Bivariate Uniform Distributions A Class of Symmetric Bivariate Uniform Distributions Thomas S. Ferguson, 7/8/94 A class of symmetric bivariate uniform distributions is proposed for use in statistical modeling. The distributions may be

More information

On Kernel Density Estimation with Univariate Application. SILOKO, Israel Uzuazor

On Kernel Density Estimation with Univariate Application. SILOKO, Israel Uzuazor On Kernel Density Estimation with Univariate Application BY SILOKO, Israel Uzuazor Department of Mathematics/ICT, Edo University Iyamho, Edo State, Nigeria. A Seminar Presented at Faculty of Science, Edo

More information

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Generalized Additive Model and Applications in Direct Marketing Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Abstract Logistic regression 1 has been widely used in direct marketing applications

More information

Today s outline: pp

Today s outline: pp Chapter 3 sections We will SKIP a number of sections Random variables and discrete distributions Continuous distributions The cumulative distribution function Bivariate distributions Marginal distributions

More information

SPSS Basics for Probability Distributions

SPSS Basics for Probability Distributions Built-in Statistical Functions in SPSS Begin by defining some variables in the Variable View of a data file, save this file as Probability_Distributions.sav and save the corresponding output file as Probability_Distributions.spo.

More information

Large Scale Data Analysis Using Deep Learning

Large Scale Data Analysis Using Deep Learning Large Scale Data Analysis Using Deep Learning Machine Learning Basics - 1 U Kang Seoul National University U Kang 1 In This Lecture Overview of Machine Learning Capacity, overfitting, and underfitting

More information

Statistical Matching using Fractional Imputation

Statistical Matching using Fractional Imputation Statistical Matching using Fractional Imputation Jae-Kwang Kim 1 Iowa State University 1 Joint work with Emily Berg and Taesung Park 1 Introduction 2 Classical Approaches 3 Proposed method 4 Application:

More information

Instability, Sensitivity, and Degeneracy of Discrete Exponential Families

Instability, Sensitivity, and Degeneracy of Discrete Exponential Families Instability, Sensitivity, and Degeneracy of Discrete Exponential Families Michael Schweinberger Pennsylvania State University ONR grant N00014-08-1-1015 Scalable Methods for the Analysis of Network-Based

More information

Quantitative Biology II!

Quantitative Biology II! Quantitative Biology II! Lecture 3: Markov Chain Monte Carlo! March 9, 2015! 2! Plan for Today!! Introduction to Sampling!! Introduction to MCMC!! Metropolis Algorithm!! Metropolis-Hastings Algorithm!!

More information

For the hardest CMO tranche, generalized Faure achieves accuracy 10 ;2 with 170 points, while modied Sobol uses 600 points. On the other hand, the Mon

For the hardest CMO tranche, generalized Faure achieves accuracy 10 ;2 with 170 points, while modied Sobol uses 600 points. On the other hand, the Mon New Results on Deterministic Pricing of Financial Derivatives A. Papageorgiou and J.F. Traub y Department of Computer Science Columbia University CUCS-028-96 Monte Carlo simulation is widely used to price

More information

QQ normality plots Harvey Motulsky, GraphPad Software Inc. July 2013

QQ normality plots Harvey Motulsky, GraphPad Software Inc. July 2013 QQ normality plots Harvey Motulsky, GraphPad Software Inc. July 213 Introduction Many statistical tests assume that data (or residuals) are sampled from a Gaussian distribution. Normality tests are often

More information

Multivariate Standard Normal Transformation

Multivariate Standard Normal Transformation Multivariate Standard Normal Transformation Clayton V. Deutsch Transforming K regionalized variables with complex multivariate relationships to K independent multivariate standard normal variables is an

More information

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview Chapter 888 Introduction This procedure generates D-optimal designs for multi-factor experiments with both quantitative and qualitative factors. The factors can have a mixed number of levels. For example,

More information

Calculus Limits Images in this handout were obtained from the My Math Lab Briggs online e-book.

Calculus Limits Images in this handout were obtained from the My Math Lab Briggs online e-book. Calculus Limits Images in this handout were obtained from the My Math Lab Briggs online e-book. A it is the value a function approaches as the input value gets closer to a specified quantity. Limits are

More information

Subject. Creating a diagram. Dataset. Importing the data file. Descriptive statistics with TANAGRA.

Subject. Creating a diagram. Dataset. Importing the data file. Descriptive statistics with TANAGRA. Subject Descriptive statistics with TANAGRA. The aim of descriptive statistics is to describe the main features of a collection of data in quantitative terms 1. The visualization of the whole data table

More information

Shading II. CITS3003 Graphics & Animation

Shading II. CITS3003 Graphics & Animation Shading II CITS3003 Graphics & Animation Objectives Introduce distance terms to the shading model. More details about the Phong model (lightmaterial interaction). Introduce the Blinn lighting model (also

More information

Analysis of Panel Data. Third Edition. Cheng Hsiao University of Southern California CAMBRIDGE UNIVERSITY PRESS

Analysis of Panel Data. Third Edition. Cheng Hsiao University of Southern California CAMBRIDGE UNIVERSITY PRESS Analysis of Panel Data Third Edition Cheng Hsiao University of Southern California CAMBRIDGE UNIVERSITY PRESS Contents Preface to the ThirdEdition Preface to the Second Edition Preface to the First Edition

More information

Quick Start Guide Jacob Stolk PhD Simone Stolk MPH November 2018

Quick Start Guide Jacob Stolk PhD Simone Stolk MPH November 2018 Quick Start Guide Jacob Stolk PhD Simone Stolk MPH November 2018 Contents Introduction... 1 Start DIONE... 2 Load Data... 3 Missing Values... 5 Explore Data... 6 One Variable... 6 Two Variables... 7 All

More information

Ludwig Fahrmeir Gerhard Tute. Statistical odelling Based on Generalized Linear Model. íecond Edition. . Springer

Ludwig Fahrmeir Gerhard Tute. Statistical odelling Based on Generalized Linear Model. íecond Edition. . Springer Ludwig Fahrmeir Gerhard Tute Statistical odelling Based on Generalized Linear Model íecond Edition. Springer Preface to the Second Edition Preface to the First Edition List of Examples List of Figures

More information

Excel 2010 with XLSTAT

Excel 2010 with XLSTAT Excel 2010 with XLSTAT J E N N I F E R LE W I S PR I E S T L E Y, PH.D. Introduction to Excel 2010 with XLSTAT The layout for Excel 2010 is slightly different from the layout for Excel 2007. However, with

More information

Bayesian Analysis of Extended Lomax Distribution

Bayesian Analysis of Extended Lomax Distribution Bayesian Analysis of Extended Lomax Distribution Shankar Kumar Shrestha and Vijay Kumar 2 Public Youth Campus, Tribhuvan University, Nepal 2 Department of Mathematics and Statistics DDU Gorakhpur University,

More information

The Basics of Graphical Models

The Basics of Graphical Models The Basics of Graphical Models David M. Blei Columbia University September 30, 2016 1 Introduction (These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan.

More information

Kernel Density Estimation (KDE)

Kernel Density Estimation (KDE) Kernel Density Estimation (KDE) Previously, we ve seen how to use the histogram method to infer the probability density function (PDF) of a random variable (population) using a finite data sample. In this

More information

A Bayesian approach to parameter estimation for kernel density estimation via transformations

A Bayesian approach to parameter estimation for kernel density estimation via transformations A Bayesian approach to parameter estimation for kernel density estimation via transformations Qing Liu,, David Pitt 2, Xibin Zhang 3, Xueyuan Wu Centre for Actuarial Studies, Faculty of Business and Economics,

More information

MCMC Diagnostics. Yingbo Li MATH Clemson University. Yingbo Li (Clemson) MCMC Diagnostics MATH / 24

MCMC Diagnostics. Yingbo Li MATH Clemson University. Yingbo Li (Clemson) MCMC Diagnostics MATH / 24 MCMC Diagnostics Yingbo Li Clemson University MATH 9810 Yingbo Li (Clemson) MCMC Diagnostics MATH 9810 1 / 24 Convergence to Posterior Distribution Theory proves that if a Gibbs sampler iterates enough,

More information

SAS High-Performance Analytics Products

SAS High-Performance Analytics Products Fact Sheet What do SAS High-Performance Analytics products do? With high-performance analytics products from SAS, you can develop and process models that use huge amounts of diverse data. These products

More information