Note: In the presentation I should have said "baby registry" instead of "bridal registry," see

Annabelle Fowler
5 years ago

Q-and-A from the Data-Mining Webinar

Note: In the presentation I should have said "baby registry" instead of "bridal registry," see

Q: You mentioned the "Big Data" McKinsey Report. Is that the actual name of the report? Do you know where I could find that data point?
A: The actual title is "Big Data: the next frontier for innovation, competition and productivity," and it is published by the McKinsey Global Institute.

Q: How do you treat missing responses appearing in a data sample?
A: There is a whole domain of statistics called "missing data analysis" that covers ways to impute values where they are missing. If only a small proportion of records (cases) have missing values for one or more variables, you can omit those records and proceed with the analysis. However, if a substantial proportion have missing values, you may need to impute them so that you can retain the rest of the information in those records. If there are a lot of variables, even a low incidence of missing values can knock out a lot of records. In a clinical trial, where a single subject's data might have cost tens of thousands of dollars to obtain, it pays to go to considerable lengths to retain as much information as possible. Predictive modeling, where data are typically plentiful, is not as sensitive to the problem, so fairly rudimentary remedies for missing data (e.g. "replace with the mean for the variable") are often effective.

Q: Can you resample to get the training, validation, and test sets?
A: There is a variant of this called cross-validation, in which the partitioning process is done multiple times and the modeling process is repeated each time. However, in each cross-validation iteration, each partition is disjoint from the others. Classic resampling with replacement will produce duplicate records within and across the partitions, which reduces the utility of the validation partition as an independent check on the model's validity.

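As an illustration of the distinction drawn in the answer above (this sketch is not from the webinar), the following Python snippet contrasts disjoint cross-validation folds with a bootstrap sample drawn with replacement. The function names `k_fold_indices` and `bootstrap_indices`, the 20-record data set, and the choice of 5 folds are all assumptions made for the example.

```python
import random

def k_fold_indices(n_records, k, seed=0):
    """Shuffle the record indices once and deal them into k disjoint folds."""
    rng = random.Random(seed)
    indices = list(range(n_records))
    rng.shuffle(indices)
    # Striding guarantees every index lands in exactly one fold.
    return [indices[i::k] for i in range(k)]

def bootstrap_indices(n_records, seed=0):
    """Draw n_records indices WITH replacement (classic bootstrap)."""
    rng = random.Random(seed)
    return [rng.randrange(n_records) for _ in range(n_records)]

# Cross-validation: each fold takes one turn as the validation partition,
# and it never shares a record with the training partition.
folds = k_fold_indices(n_records=20, k=5)
for i, validation in enumerate(folds):
    training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    assert not set(training) & set(validation)

# Bootstrap: the same record can be drawn more than once, so partitions
# built from such a sample are not independent of one another.
boot = bootstrap_indices(20)
n_duplicates = len(boot) - len(set(boot))
print("duplicated draws in the bootstrap sample:", n_duplicates)
```

The point of the sketch is only the bookkeeping: folds never overlap, whereas a with-replacement sample generally repeats records.
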
Q: Are you aware of any research studies that cover data mining in a time-series context?
A: Yes, the second edition of our book includes several chapters on this, and the lead author has put similar material in her own book, "Practical Time Series Forecasting" (Galit Shmueli).

Q: How can we tackle the problem of having a large pool of features (independent variables) for a specific target?
A: There is no one simple answer. Domain knowledge and common sense can help: if you have dozens or hundreds of predictors, only a minority of them are likely to contribute meaningfully to a model. Many can be eliminated by establishing either that they are highly correlated with one another, or that they are completely uncorrelated with the target variable. Principal components can be used to reduce a profusion of variables to a limited number of weighted multivariate "components." Regression and logistic regression have methods for including and excluding variables, but the software implementation may not be up to the task of dealing with hundreds of variables. More sophisticated automated processes for reducing the number of predictor variables are available, but require software implementation.

Q: What is Affinity/Recommend in slide 4, the predictive analytics circles?
A: The dominant implementation of affinity analysis is Association Rules, or "what goes with what." In Association Rules, or "market basket analysis," each row indicates the items in a particular transaction. The output is in the form of rules like "if day lilies are purchased, tulips are also purchased." These rules, of course, come with quantitative measures, and are translated into recommendations like "since you purchased X, we think you would also like Y."

Q: Which machine learning/statistical method is most widely used in analytics?
A: A recent poll at KDnuggets identified regression (which I believe includes logistic regression) and clustering at the top of the list.

Q: What happens if the model validated didn't do well with the test sub-sample? Do you go back to pick the second best model in the cross validation?
A: Recall that the "test" sample is the second holdout sample (in addition to the training set). The first holdout sample, the validation sample, is used to assess and tune models. The "test" sample is used to provide an unbiased estimate of performance with new data, given that there is likely to be some additional overfitting that occurs with a repeated validation process. It is normal for the selected model to underperform on the "test" sample; you should not then go back and select another model.

Q: Are there data mining programs that support / incorporate dictionaries?
A: Data dictionaries are the information that explains the variables. They are an important part of the documentation of the data mining process. In some software they are supported explicitly (SAS-EM); in others you can add this information as text (e.g. in XLMiner, as an additional worksheet).

Q: How many records should be used to train a regression or classification model?
A: There is no single answer to this question. It depends on how many variables there are, how structured they are, and how much noise there is in the data. There are ad-hoc statistical rules of thumb that relate the number of records to the number of variables, but data mining applications do not usually deal with such small data sets. You can experiment by bootstrapping, to see how variable the estimates are.

Q: In the Target data, can a customer be in the data twice (or more times), once at date 1 and once at date 2?
A: I was presenting hypothetical data, in which each record (row) is a customer, so a customer would be in the data only once. If they purchase at date 1 and again at date 2, this shows up in the same row. For example, a customer might purchase cotton balls 15 days ago and also 90 days ago, and thus record a "1" in the two variables cotton15 and cotton90.

Q: For rare event targets, and the need to partition the data into the 3 training, validation and testing partitions, what is the minimum amount of target data that is required, and if you don't meet that, can you use some random sampling technique to progress the project?
A: If by "random sampling" you mean "oversample the rare cases," the answer is yes. By oversampling we mean that each rare case has a higher probability of being selected than the not-rare cases. In very rare cases we might take all the rare cases, split them up among the various partitions, and then take an equal number of not-rare cases. If by "random sampling" you mean bootstrapping or some other "with replacement" process to re-use the rare cases so they are selected more than once, the answer is probably no. Doing so would not add information.
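
The oversampling procedure described in the answer above (take all the rare cases, split them among the partitions, then add an equal number of not-rare cases) can be sketched in Python as follows. This is a minimal illustration, not the presenter's implementation: the function name `oversample_rare`, the 0/1 label coding, the equal-sized split across three partitions, and the toy data with 10 rare cases in 1,000 records are all assumptions.

```python
import random

def oversample_rare(labels, n_partitions=3, seed=0):
    """Build partitions that contain every rare case exactly once,
    matched one-for-one with randomly chosen not-rare cases.

    labels: list of 0/1 values, where 1 marks a rare case (assumed coding).
    Returns a list of index lists, one per partition.
    Assumes there are at least as many not-rare cases as rare cases.
    """
    rng = random.Random(seed)
    rare = [i for i, y in enumerate(labels) if y == 1]
    not_rare = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(rare)
    # Sample WITHOUT replacement, so no record is re-used.
    matched = rng.sample(not_rare, len(rare))
    # Deal rare and not-rare cases into the partitions in equal numbers.
    return [rare[p::n_partitions] + matched[p::n_partitions]
            for p in range(n_partitions)]

# Toy data: 10 rare cases among 1,000 records.
labels = [1] * 10 + [0] * 990
parts = oversample_rare(labels)
for name, part in zip(("training", "validation", "test"), parts):
    n_rare = sum(labels[i] for i in part)
    print(name, "records:", len(part), "rare:", n_rare)
```

Because the not-rare cases are sampled without replacement and each rare case is used once, no record appears in more than one partition, which is consistent with the caution against "with replacement" re-use.
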

Q: Is it better to simulate data rather than using the test data?
A: I can't see how it would be better. Best to use actual data.

Q: You discussed the "data analysis" part of data mining; how about the "data collection" part? I know something about statistical analysis using some software, but know only a little about writing a program to have a computer "grab" data. Any advice?
A: This is a broad area, because of the many varieties and flavors that data come in, the different programs that house it, and the degree of structure that it has. It's beyond the scope of this webinar, but the issue of taking massive amounts of unstructured data and turning it into analyzable data is actually what consumes most of the time in the data mining process. We do have a course on this at Statistics.com, called data cleaning and preparation, taught by Robert Nisbet, an experienced data mining consultant and author of several books. That course will be taught Oct. 5.

Q: Is data mining more for database analysts or IT rather than statisticians?
A: You can think of data mining as a stool with three legs: (1) IT/database, (2) computer science, and (3) statistics. All good data mining implementations need all 3 legs.

Q: Is data mining different from data warehouse?
A: Yes, the data warehouse is the structure that receives and integrates data from various functional areas in a business (sales, service, etc.). Data mining is one of several functions for which you might extract data from a data warehouse.

Q: Do you think the new "MBA" program under the "Analytics" degree is a good thing, in terms of producing better informed/trained marketing planners?
A: I do. The new Masters in Analytics programs provide updated skills training that reflects the opportunities opened up by the deluge of data. The classic MBA curriculum did not cover these analytics.

Q: Which method is better, machine learning or statistical analysis?
A: In a predictive modeling context, both are used, and one judges their performance by how well they predict on the validation data. If the goal is not simply to predict, but to understand something about the relationship between predictor variables and target variables, then statistical models that produce parameter estimates (like coefficients in a linear regression) are more interpretable than black-box machine-learning models such as neural nets. CART (classification and regression trees), on the other hand, is a machine learning tool that does provide easily interpretable rules that shed light on the role of different predictor variables.

Q: Do you have suggestions for how to minimize/deal with selection bias into the training set? For example, some pregnant mothers may be more likely to sign up for the baby registry than other pregnant mothers.
A: I can't think of any other way in which Target could conclusively determine whether a customer is pregnant. Keep in mind, though, that the purpose here is not to publish a journal article that is a definitive explanation of buying signals that explain pregnancy. Bias would be an issue in such a case, since it might call into question your explanation. In the Target example that was presented, the only thing of interest is whether the model is better than random chance in predicting pregnant or not pregnant, or at least better enough to warrant the expense of the effort.

Q: What are the considerations in setting the sizes of the training, validation and test partitions?
A: The purposes of these partitions are, respectively, to train/fit the model, to tune, assess and select models, and, finally, to determine likely performance on actual data. Intuitively you can see that each of these tasks, in succession, requires less information to perform its job, though I am not aware of any optimal rules. In XLMiner the defaults used to be 50%, 30% and 20%.

Q: Is it safe to say that data mining applications are primarily intended for situations where we have higher tolerance for misclassification or error in predicting a continuous variable than for situations that use traditional methods like designed experiments, simple regression, etc.?
A: Well, note that regression is one (of many) methods that are used in data mining. In data mining, you have large quantities of data that have been collected, usually, for some other purpose. Data mining is an attempt to make use of it, without having to go to great effort and expense to collect "statistically valid" data. A designed experiment is used when you collect data for the purpose of applying treatments and answering a research question; the data are scarce and expensive to collect.
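
Returning to the earlier question on partition sizes: a random split using the 50%, 30% and 20% proportions mentioned there might look like the following Python sketch. The function name `partition`, the fixed seed, and the 1,000-record example are assumptions for illustration; this is not XLMiner's implementation.

```python
import random

def partition(records, fractions=(0.5, 0.3, 0.2), seed=0):
    """Randomly split records into training, validation and test partitions.

    fractions gives the training and validation shares; the test partition
    receives whatever remains, so every record is used exactly once.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = round(fractions[0] * n)
    n_valid = round(fractions[1] * n)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

train, valid, test = partition(range(1000))
print(len(train), len(valid), len(test))
```

The three partitions are disjoint by construction, which matches their roles: fit on the first, tune and select on the second, and estimate performance on the third.
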
