CS 795/895 Applied Visual Analytics, Spring 2013
Data Mining
Dr. Michele C. Weigle
http://www.cs.odu.edu/~mweigle/cs795-s13/

What is Data Mining?
Many definitions:
- Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
- Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

[Slides: Tan, Steinbach, Kumar, Introduction to Data Mining]
What is (not) Data Mining?
What is not data mining:
- Looking up a phone number in a phone directory
- Querying a Web search engine for information about "Amazon"
What is data mining:
- Discovering that certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area)
- Grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)

Data Mining Tasks
- Prediction methods: use some variables to predict unknown or future values of other variables
  - Ex: classification, regression, deviation detection
- Description methods: find human-interpretable patterns that describe the data
  - Ex: clustering, association rule discovery, sequential pattern discovery
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
Data Mining with WEKA
The following slides are based on the IBM developerWorks article series "Data Mining with WEKA" by Michael Abernethy:
- Part 1: Introduction and regression
- Part 2: Classification and clustering
- Part 3: Nearest neighbor and server-side library
The articles explain the basics and show examples using WEKA
- should be sufficient for our purposes
- for more details, take a Data Mining course or see Introduction to Data Mining by Tan, Steinbach, and Kumar
http://www.ibm.com/developerworks/views/opensource/libraryview.jsp?search_by=data+mining+weka

What is Data Mining?
- Transformation of a large amount of data into meaningful patterns and rules
  - directed: trying to predict a particular data point
  - undirected: trying to create groups of data, or find patterns in existing data
- The ultimate goal is to create a model
  - a major step is determining which technique to use

[Slides: Abernethy, "Data Mining with WEKA"]
Comparison of Techniques
Data: a BMW dealership has information about each person who purchased a BMW, looked at a BMW, or browsed the BMW showroom
- Regression: "How much should we charge for the new BMW M5?"
- Classification: "How likely is person X to buy the newest BMW M5?"
- Clustering: "What age groups like the silver BMW M5?"
- Nearest neighbor: "When people purchase the BMW M5, what other options do they tend to buy at the same time?"

What is WEKA?
- Waikato Environment for Knowledge Analysis
- First implemented in 1997
- GPL (so it's free)
- Written in Java
- Very powerful data mining software
http://www.cs.waikato.ac.nz/ml/weka/
WEKA Examples
- Install and start WEKA
  - the articles use version 3.6.2; the newest version is 3.6.9
- All examples use the "Explorer" application
- Data files are available for download at the end of each IBM article

Data Mining with WEKA
- Part 1: Introduction and regression
- Part 2: Classification and clustering
- Part 3: Nearest neighbor and server-side library
Regression
- Easiest technique, but also least powerful
- Takes a number of independent variables that produce a result: a dependent variable
- The regression model is used to predict the result of an unknown dependent variable, given the values of the independent variables

Regression Example - Pricing a House
- Independent variables: square footage, size of the lot, granite in the kitchen, upgraded bathrooms, etc.
- Dependent variable: house price
Example - Loading Data into WEKA
- WEKA's preferred format is Attribute-Relation File Format (ARFF)
  - define each column and its data type
  - regression is limited to NUMERIC or DATE attributes
  - supply each row of data in comma-delimited form

Regression Example - Pricing a House

House size (sq ft)  Lot size  Bedrooms  Granite  Upgraded bathroom  Selling price
3529                9191      6         0        0                  $205,000
3247                10061     5         1        1                  $224,900
4032                10150     5         0        1                  $197,900
2397                14156     4         1        0                  $189,900
2200                9600      4         0        1                  $195,000
3536                19994     6         1        1                  $325,000
2983                9365      5         0        1                  $230,000
3198                9669      5         1        1                  ????

@RELATION house
@ATTRIBUTE housesize NUMERIC
@ATTRIBUTE lotsize NUMERIC
@ATTRIBUTE bedrooms NUMERIC
@ATTRIBUTE granite NUMERIC
@ATTRIBUTE bathroom NUMERIC
@ATTRIBUTE sellingprice NUMERIC
@DATA
3529,9191,6,0,0,205000
3247,10061,5,1,1,224900
4032,10150,5,0,1,197900
2397,14156,4,1,0,189900
2200,9600,4,0,1,195000
3536,19994,6,1,1,325000
2983,9365,5,0,1,230000
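As a side note, the ARFF header-plus-CSV layout is simple enough to generate programmatically. Below is a minimal sketch (not WEKA code; the make_arff helper is our own) that builds the houses file from the table rows:

```python
# Rows and attribute names from the house-pricing table above.
rows = [
    (3529, 9191, 6, 0, 0, 205000),
    (3247, 10061, 5, 1, 1, 224900),
    (4032, 10150, 5, 0, 1, 197900),
    (2397, 14156, 4, 1, 0, 189900),
    (2200, 9600, 4, 0, 1, 195000),
    (3536, 19994, 6, 1, 1, 325000),
    (2983, 9365, 5, 0, 1, 230000),
]
attributes = ["housesize", "lotsize", "bedrooms", "granite", "bathroom", "sellingprice"]

def make_arff(relation, attributes, rows):
    """Build ARFF text: header declares each NUMERIC column, @DATA lists CSV rows."""
    lines = ["@RELATION " + relation]
    lines += ["@ATTRIBUTE %s NUMERIC" % a for a in attributes]
    lines.append("@DATA")
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines)

arff_text = make_arff("house", attributes, rows)
print(arff_text)
```

Writing arff_text to houses.arff yields exactly the file shown above, ready to open in the Explorer's Preprocess tab.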
Example - House: Loading Data into WEKA
- Preprocess tab
- Open file houses.arff
- Explore the data by choosing attributes and/or Visualize All

Example - House: Create the Model
- Classify tab
- Choose button
  - expand the functions branch
  - select LinearRegression
    - note: "SimpleLinearRegression" only looks at one variable
- Test options
  - Use training set: use the data set we supplied
  - Supplied test set: a different set of data
  - Cross-validation: use subsets of the supplied data and average them out for the final model
  - Percentage split: use a percentage of the supplied data
- Choose (Num) sellingprice as the dependent variable
- Start
Example - House: Interpreting the Model
sellingprice = (-26.6882 * 3198) + (7.0551 * 9669) + (43166.0767 * 5) + (42292.0901 * 1) - 21661.1208
sellingprice = 219,328

Example - Visualize Tab
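The arithmetic above can be checked with a short sketch. The coefficients come from the WEKA output on the slide; granite is absent because the model dropped it:

```python
# Apply the linear model WEKA produced to the unknown house
# (housesize=3198, lotsize=9669, bedrooms=5, bathroom=1).
coefficients = {"housesize": -26.6882, "lotsize": 7.0551,
                "bedrooms": 43166.0767, "bathroom": 42292.0901}
intercept = -21661.1208

def predict(house):
    """Linear regression prediction: weighted sum of attributes plus intercept."""
    return sum(coefficients[k] * v for k, v in house.items()) + intercept

price = predict({"housesize": 3198, "lotsize": 9669, "bedrooms": 5, "bathroom": 1})
print(round(price))  # 219328
```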
Example - House: Observations
sellingprice = (-26.6882 * housesize) + (7.0551 * lotsize) + (43166.0767 * bedrooms) + (42292.0901 * bathroom) - 21661.1208
- Granite doesn't matter: it isn't used in the model
- Bathrooms do matter
- Bigger houses reduce the value? The negative coefficient suggests house size isn't a good independent variable here
  - not a perfect model

Regression Example - Cars
- Classic dataset of vehicles produced 1970-1982
  - often used for parallel coordinates examples
- 398 rows of data
- Independent variables: cylinders, displacement, horsepower, weight, acceleration, model year, origin, car make
- Dependent variable: miles per gallon (MPG), aka the class
Regression - More Information
Keywords to search for:
- least squares
- homoscedasticity
- White tests
- Lilliefors tests
- R-squared
- p-values

Data Mining with WEKA
- Part 1: Introduction and regression
- Part 2: Classification and clustering
- Part 3: Nearest neighbor and server-side library
Classification
- Creates a step-by-step guide for determining the output of a new data instance
  - aka classification trees or decision trees
- Creates a tree where each node represents a spot where a decision must be made based on the input
  - want the tree to be as simple as possible, with as few nodes and leaves as possible
- The model can be used for any unknown data instance

Classification - Training Set
- Data set with known output values, used to build the model
- Take the entire data set and divide it into two parts:
  - 60-80%: the training set, used to create the model
  - the remainder: the test set, used to test the accuracy of the model
- overfitting: if you build the model from too much of the data, it fits that particular set of data perfectly but generalizes poorly to new instances
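The split described above can be sketched in a few lines (the 70% figure is one choice within the 60-80% range on the slide; the data rows here are stand-ins):

```python
# Minimal sketch of a train/test split: shuffle the rows, keep ~70%
# for training, hold out the rest for testing the model's accuracy.
import random

def split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the data and cut it into train and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))       # stand-in for 100 data rows
train, test = split(data)
print(len(train), len(test))  # 70 30
```

Shuffling before cutting matters: if the rows are ordered (e.g., by purchase date), an unshuffled split would train and test on systematically different customers.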
Classification - Confusion Matrix
- false positive: a data instance where the model predicts positive, but the actual value is negative
- false negative: a data instance where the model predicts negative, but the actual value is positive
- The impacts of false positives and false negatives are not always the same
  - Ex: spam - a false positive (real email marked as spam) is more damaging than a false negative (spam marked as not spam)

Classification Example - BMW
- Use the data set from the BMW dealership
- Goal: try to push a two-year extended warranty to past customers
- Attributes:
  - income bracket
  - year/month first BMW bought
  - year/month most recent BMW bought
  - whether they responded to an extended warranty offer in the past
Example - BMW: Accuracy
- Precision: fraction of retrieved instances that are relevant
- Recall: fraction of relevant instances that are retrieved
- F-Measure: combines precision and recall
  - harmonic mean of precision and recall
  - 2 * (precision * recall) / (precision + recall)
[Wikipedia, "Precision and recall"]

Example - BMW: Validation
- Run the test set through the model: bmw-test.arff
- Correctly classified instances:
  - training set: 59.1%
  - test set: 55.7%
- Pretty close (though still not great)
  - hmmm, maybe classification isn't the best method for this data
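The accuracy measures above follow directly from confusion-matrix counts. A small sketch (the tp/fp/fn values are illustrative, not from the BMW run):

```python
# Precision, recall, and F-measure from confusion-matrix counts:
# tp = true positives, fp = false positives, fn = false negatives.
def precision(tp, fp):
    return tp / (tp + fp)         # retrieved instances that are relevant

def recall(tp, fn):
    return tp / (tp + fn)         # relevant instances that are retrieved

def f_measure(p, r):
    return 2 * (p * r) / (p + r)  # harmonic mean of precision and recall

p = precision(tp=40, fp=10)       # 0.8
r = recall(tp=40, fn=20)          # 0.666...
print(round(f_measure(p, r), 3))  # 0.727
```

Note how the harmonic mean pulls the F-measure toward the weaker of the two: it rewards models that balance precision and recall.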
Clustering
- Makes groups of data to determine patterns in the data
- Advantageous when the data set is defined and a general pattern needs to be determined
  - every attribute in the data set is used to analyze the data
- Disadvantage: need to know in advance how many groups to create

Clustering - Basic Math
1. Every attribute in the data set is normalized
2. Given the number of desired clusters, randomly select that number of samples from the data set to serve as initial cluster centers
3. Compute the distance from each data sample to each cluster center
4. Assign each data row to a cluster, based on minimum distance
5. Compute each centroid: the average of each column of data, using only the members of that cluster
6. Calculate the distance from each data sample to the centroids; if the clusters and cluster members don't change, done! Otherwise, repeat from step 4 with the centroids as the new centers.
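The steps above are the classic k-means (Lloyd's algorithm) loop. A compact sketch on 1-D toy data (normalization is skipped since there is only one attribute; WEKA itself would do this via SimpleKMeans over all attributes):

```python
import random

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 2: random initial centers
    while True:
        # steps 3-4: assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        # step 5: recompute each centroid as the mean of its members
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:           # step 6: stop when stable
            return centers, clusters
        centers = new_centers

points = [1.0, 1.1, 0.9, 8.0, 8.2, 7.9]
centers, clusters = kmeans(points, k=2)
print(sorted(round(c, 2) for c in centers))  # [1.0, 8.03]
```

With two well-separated groups like these, any initialization converges to one center per group; real data is messier, which is why the choice of k matters.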
Clustering Example - BMW
- Use the data set from the BMW dealership
- The dealership kept track of how people walk through the dealership and showroom, what cars they look at, and how often they make purchases
- 100 rows of data
- Each column describes the steps that customers reached in their BMW experience

Example - BMW Clusters
- Cluster 0, "Dreamers": wander around the dealership, don't purchase anything
- Cluster 1, "M5 Lovers": walk straight to the M5s, but don't have a high purchase rate
- Cluster 2, "Throw-Aways": small group, not statistically relevant
- Cluster 3, "BMW Babies": always end up purchasing a car and always end up financing it; walk around, then turn to the computer search at the dealership; always buy an M5 or Z4
- Cluster 4, "Starting Out With BMW": always look at the 3-series, never the more expensive M5; walk to the showroom, not the lot; only 32% ultimately finish the transaction
Clustering - More Information
Keywords to search for:
- Lloyd's algorithm
- Manhattan distance
- Chebyshev distance
- sum of squared errors
- cluster centroids

Data Mining with WEKA
- Part 1: Introduction and regression
- Part 2: Classification and clustering
- Part 3: Nearest neighbor and server-side library
Nearest Neighbor
- aka collaborative filtering or instance-based learning
- Uses past data instances, with known output values, to predict the unknown output value of a new data instance
- Different from regression, which can only be used for numerical outputs

Nearest Neighbor - Basic Math
- Take the unknown data point and compute the distance between it and every known data point
- The algorithm can be expanded beyond the single closest match to include any number of closest matches
  - n-nearest neighbors
- Can also be used to predict a Yes/No output
- How many neighbors to use? Need to experiment
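The steps above can be sketched as follows: find the n closest known points to an unknown one and take a majority vote on their Yes/No labels. The toy 2-D points and labels are illustrative, not from the BMW data:

```python
def nearest_neighbors(known, unknown, n):
    """Return the n known (point, label) pairs closest to the unknown point."""
    def distance(a, b):  # Euclidean distance in 2-D
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    return sorted(known, key=lambda pl: distance(pl[0], unknown))[:n]

def predict(known, unknown, n=3):
    """Majority vote over the labels of the n nearest neighbors."""
    labels = [label for _, label in nearest_neighbors(known, unknown, n)]
    return max(set(labels), key=labels.count)

known = [((1, 1), "Yes"), ((1, 2), "Yes"), ((2, 1), "Yes"),
         ((8, 8), "No"), ((8, 9), "No"), ((9, 8), "No")]
print(predict(known, (2, 2)))  # Yes
```

Varying n in predict is exactly the "how many neighbors?" experiment: too few makes the vote noisy, too many drowns out the local pattern.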
Nearest Neighbor Example - BMW
- Use the data set from the BMW dealership
- Goal: try to push a two-year extended warranty to past customers
- 4,500 past sales of extended warranties
- Attributes:
  - income bracket
  - year/month first BMW bought
  - year/month most recent BMW bought
  - whether they responded to an extended warranty offer in the past

Nearest Neighbor - More Information
Keywords to search for:
- distance weighting
- Hamming distance
- Mahalanobis distance
Remember
- Data mining models aren't always simple input-output mechanisms
- The data must be examined to determine the right model to choose
- The output must be analyzed and accurate before you're ready to move on
Server-Side WEKA: we won't cover this, but article 3 introduces how to use the WEKA API from Java.