Q-and-A from the Data-Mining Webinar

Note: In the presentation I should have said "baby registry" instead of "bridal registry"; see http://www.target.com/babyregistryportalview

Q: You mentioned the 'Big Data' McKinsey Report. Is that the actual name of the report? Do you know where I could find that data point?

A: The actual title is "Big Data: the next frontier for innovation, competition and productivity," and it is published by the McKinsey Global Institute.

Q: How do you treat missing responses appearing in a data sample?

A: There is a whole domain of statistics called "missing data analysis" that covers ways to impute values where they are missing. If a small proportion of records (cases) have missing values for one or more variables, you can omit those records and proceed with the analysis. However, if a substantial proportion have missing values, you may need to impute the missing values so as to retain the rest of the information in those records. If there are many variables, even a low incidence of missing values can knock out a lot of records. In a clinical trial, where a single subject's data might have cost tens of thousands of dollars to obtain, it pays to go to considerable lengths to retain as much information as possible. Predictive modeling, where data are typically plentiful, is not as sensitive to the problem, so fairly rudimentary remedies for missing data (e.g. "replace with the mean for the variable") are often effective.

Q: Can you resample to get the training, validation, and test sets?

A: There is a variant of this called cross-validation, in which the partitioning process is done multiple times and the modeling process repeated. However, in each cross-validation iteration, each partition is disjoint from the others. Classic resampling with replacement will produce duplication within the partitions, which reduces the utility of the validation partition as an independent check on the model's validity.
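The disjointness point above can be sketched in a few lines. This is a hypothetical helper, not code from the webinar: each record lands in exactly one fold, unlike resampling with replacement, which would duplicate records across partitions.

```python
import random

def k_fold_indices(n_records, k, seed=0):
    """Split record indices into k disjoint folds (a minimal sketch).

    Every record appears in exactly one fold, so each validation
    fold is a genuine holdout for that cross-validation iteration.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds
    return [indices[i::k] for i in range(k)]

folds = k_fold_indices(n_records=10, k=3)
# Each iteration trains on k-1 folds and validates on the remaining one;
# the folds never overlap and together cover all 10 records.
held_out = sorted(i for fold in folds for i in fold)
```

In each of the k modeling passes, one fold serves as the validation partition and the rest form the training partition, so every record is used for validation exactly once.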
Q: Are you aware of any research studies that cover data mining in a time-series context?

A: Yes, the second edition of our book includes several chapters on this, and the lead author has put similar material in her own book, "Practical Time Series Forecasting" (Galit Shmueli) - http://galitshmueli.com/practical-time-series-forecasting

Q: How can we tackle the problem of having a large pool of features (independent variables) for a specific target?

A: There is no one simple answer. Domain knowledge and common sense can help - if you have dozens or hundreds of predictors, only a minority of them are likely to contribute meaningfully to a model. Many can be eliminated by establishing either that they are highly correlated with one another, or completely uncorrelated with the target variable. Principal components can be used to reduce a profusion of variables to a limited number of weighted multivariate "components." Regression and logistic regression have methods for including and excluding variables, but the software implementation may not be up to the task of dealing with hundreds of variables. More sophisticated automated processes for reducing the number of predictor variables are available, but require software implementation.

Q: What is Affinity/Recommend in slide 4, the predictive analytics circles?

A: The dominant implementation of affinity analysis is Association Rules, or "what goes with what." In Association Rules, or "market basket analysis," each row indicates the items in a particular transaction. The output is in the form of rules like "if day lilies are purchased, tulips are also purchased." These rules, of course, come with quantitative measures, and are translated into recommendations like "since you purchased X, we think you would also like Y."

Q: Which machine learning/statistical method is most widely used in analytics?

A: A recent poll at KDnuggets identified regression (which I believe includes logistic regression) and clustering at the top of the list.

Q: What happens if the validated model didn't do well with the test sub-sample? Do you go back and pick the second-best model from the cross-validation?

A: Recall that the "test" sample is the second holdout sample (in addition to the training set). The first holdout sample, the validation sample, is used to assess and tune models. The "test" sample is used to provide an unbiased estimate of performance with new data, given that there is likely to be some additional overfitting that occurs with a repeated validation process. It is normal for the selected model to underperform on the "test" sample; you should not then go back and select another model.

Q: Are there data mining programs that support / incorporate dictionaries?

A: Data dictionaries are the information that explains the variables. They are an important part of the documentation of the data mining process. In some software, they are supported explicitly (SAS-EM); in others you can add this information as text (e.g. in XLMiner, as an additional worksheet).

Q: How many records should be used to train a regression or classification model?

A: There is no single answer to this question. It depends on how many variables there are, how structured they are, and how much noise there is in the data. There are ad-hoc statistical rules of thumb that relate the number of records to the number of variables, but data mining applications do not usually deal with such small data sets. You can experiment by bootstrapping, to see how variable the estimates are.

Q: In the Target data, can a customer be in the data twice (or more times) - once at date 1 and once at date 2?

A: I was presenting hypothetical data, in which each record (row) is a customer, so a customer would be in the data only once. If they purchase at date 1 and again at date 2, this shows up in the same row. For example, a customer might purchase cotton balls 15 days ago and also 90 days ago, thus recording a "1" in the two variables cotton15 and cotton90.

Q: For rare-event targets, given the need to partition the data into the three training, validation, and test partitions, what is the minimum amount of target data that is required? And if you don't meet that, can you use some random sampling technique to move the project forward?

A: If by "random sampling" you mean "oversample the rare cases," the answer is yes. By oversampling we mean that each rare case has a higher probability of being selected than the not-rare cases. In very rare cases we might take all the rare cases, split them up among the various partitions, and then take an equal number of not-rare cases. If by "random sampling" you mean bootstrapping or some other "with replacement" process to re-use the rare cases so they are selected more than once, the answer is probably no. Doing so would not add information.
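The "take all the rare cases, then an equal number of not-rare cases" approach can be sketched as follows. This is a hypothetical helper (names and the 50/30/20 defaults are my own, echoing splits mentioned elsewhere in this Q&A), not code shown in the webinar:

```python
import random

def split_by_fractions(seq, fractions):
    """Cut a list into consecutive pieces sized by the given fractions."""
    pieces, start = [], 0
    for frac in fractions[:-1]:
        end = start + round(frac * len(seq))
        pieces.append(seq[start:end])
        start = end
    pieces.append(seq[start:])  # remainder goes to the last piece
    return pieces

def balanced_partitions(records, is_rare, fractions=(0.5, 0.3, 0.2), seed=0):
    """Sketch of oversampling a rare class while partitioning.

    Keeps every rare case, spreads them across the partitions, and
    samples an equal number of not-rare cases (without replacement).
    """
    rng = random.Random(seed)
    rare = [r for r in records if is_rare(r)]
    common = [r for r in records if not is_rare(r)]
    rng.shuffle(rare)
    rng.shuffle(common)
    common = common[:len(rare)]  # equal number of not-rare cases
    return [r + c for r, c in zip(split_by_fractions(rare, fractions),
                                  split_by_fractions(common, fractions))]
```

Note that this deliberately samples the not-rare cases without replacement; as the answer above says, re-using cases via "with replacement" sampling would not add information.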

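The quantitative measures behind the association rules described earlier ("if day lilies are purchased, tulips are also purchased") are support and confidence. A minimal sketch with made-up baskets; real tools search for rules automatically (e.g. with the Apriori algorithm):

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support and confidence for one rule, e.g. day lilies -> tulips.

    support    = share of all baskets containing both items
    confidence = share of antecedent baskets that also hold the consequent
    """
    n = len(baskets)
    both = sum(1 for b in baskets if antecedent in b and consequent in b)
    ante = sum(1 for b in baskets if antecedent in b)
    support = both / n
    confidence = both / ante if ante else 0.0
    return support, confidence

baskets = [{"day lilies", "tulips"},
           {"day lilies", "tulips", "vase"},
           {"day lilies"},
           {"roses"}]
support, confidence = rule_metrics(baskets, "day lilies", "tulips")
# support = 2/4, confidence = 2/3
```

A recommendation engine would surface only rules whose support and confidence clear chosen thresholds, translating them into "since you purchased X, we think you would also like Y."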
Q: Is it better to simulate data rather than using the test data?

A: I can't see how it would be better. Best to use actual data.

Q: You discussed the "data analysis" part of data mining; how about the "data collection" part? I know something about statistical analysis using software, but little about writing a program to have a computer "grab" data. Any advice?

A: This is a broad area, because of the many varieties and flavors that data come in, the different programs that house it, and the degree of structure that it has. It's beyond the scope of this webinar, but the issue of taking massive amounts of unstructured data and turning it into analyzable data is actually what consumes most of the time in the data mining process. We do have a course in this at Statistics.com, called Data Cleaning and Preparation, taught by Robert Nisbet, an experienced data mining consultant and author of several books. That course will be taught Oct. 5.

Q: Is data mining more for database analysts or IT rather than statisticians?

A: You can think of data mining as a stool with three legs: (1) IT/database, (2) computer science, and (3) statistics. All good data mining implementations need all three legs.

Q: Is data mining different from a data warehouse?

A: Yes, the data warehouse is the structure that receives and integrates data from various functional areas in a business (sales, service, etc.). Data mining is one of several functions for which you might extract data from a data warehouse.

Q: Do you think the new "MBA" program under the "Analytics" degree is a good thing, in terms of producing better informed/trained marketing planners?

A: I do - the new Masters in Analytics programs provide updated skills training that reflects the opportunities opened up by the deluge of data. The classic MBA curriculum did not cover these analytics.
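As a tiny taste of the "grab data" question above: turning semi-structured sources such as HTML into rows and columns is a big part of data preparation. A minimal sketch using only the Python standard library, run here on an inline snippet rather than a live page (a real collector would add an HTTP client, error handling, and a lot of cleaning):

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Toy scraper: pulls the text out of <td> cells in raw HTML."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

# A made-up fragment echoing the cotton15 variable from the Target example
page = "<table><tr><td>cotton15</td><td>1</td></tr></table>"
parser = CellCollector()
parser.feed(page)
# parser.cells now holds ["cotton15", "1"]
```

Even this toy shows why collection and cleaning dominate the effort: the parser only recovers cell text, and everything after that (types, units, missing values) is still manual work.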

Q: Which method is better - machine learning or statistical analysis?

A: In a predictive modeling context, both are used, and one judges their performance by how well they predict on the validation data. If the goal is not simply to predict, but to understand something about the relationship between predictor variables and target variables, then statistical models that produce parameter estimates (like coefficients in a linear regression) are more interpretable than black-box machine-learning models such as neural nets. CART (classification and regression trees), on the other hand, is a machine learning tool that does provide easily interpretable rules that shed light on the role of different predictor variables.

Q: Do you have suggestions for how to minimize/deal with selection bias in the training set? For example, some pregnant mothers may be more likely to sign up for the baby registry than other pregnant mothers.

A: I can't think of any other way in which Target could conclusively determine whether a customer is pregnant. Keep in mind, though, that the purpose here is not to publish a journal article that is a definitive explanation of the buying signals that indicate pregnancy. Bias would be an issue in such a case, since it might call into question your explanation. In the Target example that was presented, the only thing of interest is whether the model is better than random chance in predicting pregnant or not-pregnant - or at least better enough to warrant the expense of the effort.

Q: What are the considerations in setting the sizes of the training, validation and test partitions?

A: The purposes of these partitions are, respectively, to train/fit the model; to tune, assess, and select models; and, finally, to determine likely performance on actual data. Intuitively you can see that each of these tasks, in succession, requires less information to perform its job, though I am not aware of any optimal rules. In XLMiner the defaults used to be 50%, 30%, and 20%.

Q: Is it safe to say that data mining applications are primarily intended for situations where we have a higher tolerance for misclassification, or for error in predicting a continuous variable, than situations that use traditional methods like designed experiments, simple regression, etc.?

A: Well, note that regression is one of many methods used in data mining. In data mining, you have large quantities of data that have been collected, usually, for some other purpose. Data mining is an attempt to make use of those data without going to great effort and expense to collect "statistically valid" data. A designed experiment is used when you collect data for the purpose of applying treatments and answering a research question; there, the data are scarce and expensive to collect.
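The three-way split discussed above (50%/30%/20% as the old XLMiner defaults) amounts to one random shuffle and two cut points. A hypothetical helper, not any tool's actual implementation:

```python
import random

def partition_records(records, fractions=(0.5, 0.3, 0.2), seed=0):
    """Randomly assign records to training/validation/test partitions.

    The 50/30/20 default echoes the defaults mentioned above; the right
    proportions depend on how much data and model tuning you have.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    cut1 = round(fractions[0] * n)
    cut2 = cut1 + round(fractions[1] * n)
    return shuffled[:cut1], shuffled[cut1:cut2], shuffled[cut2:]

train, valid, test = partition_records(range(100))
# 100 records -> partitions of 50, 30, and 20 records
```

Because the partitions are consecutive slices of one shuffled list, they are disjoint and exhaust the data, matching the holdout logic described in the earlier cross-validation answer.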