Non-trivial extraction of implicit, previously unknown and potentially useful information from data


CS 795/895 Applied Visual Analytics, Spring 2013
Data Mining
Dr. Michele C. Weigle
http://www.cs.odu.edu/~mweigle/cs795-s13/

What is Data Mining?
Many definitions:
  Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Slides: Tan, Steinbach, Kumar, Introduction to Data Mining

What is (not) Data Mining?
What is not data mining:
  Looking up a phone number in a phone directory
  Querying a Web search engine for information about "Amazon"
What is data mining:
  Discovering that certain names are more prevalent in certain US locations (O'Brien, O'Rurke, O'Reilly in the Boston area)
  Grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)

Data Mining Tasks
Prediction methods: use some variables to predict unknown or future values of other variables.
  Ex: classification, regression, deviation detection
Description methods: find human-interpretable patterns that describe the data.
  Ex: clustering, association rule discovery, sequential pattern discovery
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996

Data Mining with WEKA
The following slides are based on the IBM developerWorks articles by Michael Abernethy:
  Part 1: Introduction and regression
  Part 2: Classification and clustering
  Part 3: Nearest neighbor and server-side library
The articles explain the basics and show examples using WEKA, which should be sufficient for our purposes.
For more details, take a Data Mining course or see Introduction to Data Mining by Tan, Steinbach, and Kumar.
http://www.ibm.com/developerworks/views/opensource/libraryview.jsp?search_by=data+mining+weka

What is Data Mining?
Transformation of large amounts of data into meaningful patterns and rules
  directed: trying to predict a particular data point
  undirected: trying to create groups of data, or find patterns in existing data
The ultimate goal is to create a model
  a major step is determining which technique to use

Comparison of Techniques
Data: BMW dealership information about each person who purchased a BMW, looked at a BMW, or browsed the BMW showroom
  Regression: "How much should we charge for the new BMW M5?"
  Classification: "How likely is person X to buy the newest BMW M5?"
  Clustering: "What age groups like the silver BMW M5?"
  Nearest neighbor: "When people purchase the BMW M5, what other options do they tend to buy at the same time?"

What is WEKA?
Waikato Environment for Knowledge Analysis
  First implemented in 1997
  Licensed under the GPL (so it's free)
  Written in Java
  Very powerful data mining software
http://www.cs.waikato.ac.nz/ml/weka/

WEKA Examples
Install and start WEKA
  the article uses version 3.6.2
  the newest version is 3.6.9
All examples use the "Explorer" application
Data files are available for download at the end of each IBM article

Regression
Easiest technique, but also the least powerful
Takes a number of independent variables that produce a result, the dependent variable
The regression model is used to predict the value of an unknown dependent variable, given the values of the independent variables

Regression Example - Pricing a House
Independent variables: square footage, size of the lot, granite in the kitchen, bathrooms upgraded, etc.
Dependent variable: house price

Example - Loading Data into WEKA
WEKA's preferred format is the Attribute-Relation File Format (ARFF)
  define each column and its data type
    regression is limited to NUMERIC or DATE
  supply each row of data in comma-delimited form

Regression Example - Pricing a House

House size     Lot size   Bedrooms   Granite   Upgraded    Selling
(square feet)                                  bathroom    price
3529           9191       6          0         0           $205,000
3247           10061      5          1         1           $224,900
4032           10150      5          0         1           $197,900
2397           14156      4          1         0           $189,900
2200           9600       4          0         1           $195,000
3536           19994      6          1         1           $325,000
2983           9365       5          0         1           $230,000
3198           9669       5          1         1           ????

@RELATION house
@ATTRIBUTE housesize NUMERIC
@ATTRIBUTE lotsize NUMERIC
@ATTRIBUTE bedrooms NUMERIC
@ATTRIBUTE granite NUMERIC
@ATTRIBUTE bathroom NUMERIC
@ATTRIBUTE sellingprice NUMERIC
@DATA
3529,9191,6,0,0,205000
3247,10061,5,1,1,224900
4032,10150,5,0,1,197900
2397,14156,4,1,0,189900
2200,9600,4,0,1,195000
3536,19994,6,1,1,325000
2983,9365,5,0,1,230000
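To make the ARFF layout concrete, here is a minimal Python sketch that reads the house file shown on the slide. The function name parse_numeric_arff is hypothetical and it handles only the small subset of ARFF used here (a relation name, NUMERIC attributes, comma-delimited rows); it is not WEKA's loader and ignores quoting, nominal attributes, and sparse data.

```python
def parse_numeric_arff(text):
    """Parse a tiny ARFF subset: @RELATION, @ATTRIBUTE lines, and numeric @DATA rows."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and ARFF comments
            continue
        upper = line.upper()
        if in_data:
            rows.append([float(value) for value in line.split(",")])
        elif upper.startswith("@RELATION"):
            relation = line.split(None, 1)[1]
        elif upper.startswith("@ATTRIBUTE"):
            _, name, _type = line.split(None, 2)
            attributes.append(name)
        elif upper.startswith("@DATA"):
            in_data = True
    return relation, attributes, rows


# The house data from the slide.
HOUSE_ARFF = """\
@RELATION house
@ATTRIBUTE housesize NUMERIC
@ATTRIBUTE lotsize NUMERIC
@ATTRIBUTE bedrooms NUMERIC
@ATTRIBUTE granite NUMERIC
@ATTRIBUTE bathroom NUMERIC
@ATTRIBUTE sellingprice NUMERIC
@DATA
3529,9191,6,0,0,205000
3247,10061,5,1,1,224900
4032,10150,5,0,1,197900
2397,14156,4,1,0,189900
2200,9600,4,0,1,195000
3536,19994,6,1,1,325000
2983,9365,5,0,1,230000
"""

relation, attributes, rows = parse_numeric_arff(HOUSE_ARFF)
print(relation, attributes)
print(len(rows), "rows; first row:", rows[0])
```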

Example - House: Loading Data into WEKA
Preprocess tab
  Open file houses.arff
  Explore the data by choosing attributes and/or Visualize All

Example - House: Create the Model
Classify tab
  Choose button
    Expand the functions branch
    Select LinearRegression
      note: "SimpleLinearRegression" only looks at one variable
  Test options
    Use training set: use the data set we supplied
    Supplied test set: use a different set of data
    Cross-validation: use subsets of the supplied data and average them out for the final model
    Percentage split: use a percentage of the supplied data
  Choose (Num) sellingprice as the dependent variable
  Start
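The cross-validation test option can be illustrated with a small Python sketch that only generates the fold indices. The function name k_fold_indices is hypothetical, and the sketch assumes plain contiguous folds without shuffling or stratification, which may differ from what WEKA does internally.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Seven rows (as in the house data) split into three folds.
folds = list(k_fold_indices(7, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each row appears in exactly one test fold; a model would be built on each train part, scored on the matching test part, and the scores averaged.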

Example - House: Interpreting the Model
sellingprice = (-26.6882 * 3198) + (7.0551 * 9669) + (43166.0767 * 5) + (42292.0901 * 1) - 21661.1208
sellingprice = 219,328

Example - Visualize Tab
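The arithmetic on the "Interpreting the Model" slide can be checked with a short Python sketch. The coefficients and the house values are taken from the slide; the names COEFFICIENTS, INTERCEPT, and predict_selling_price are hypothetical.

```python
# Coefficients reported on the slide; granite does not appear in the model.
COEFFICIENTS = {
    "housesize": -26.6882,
    "lotsize": 7.0551,
    "bedrooms": 43166.0767,
    "bathroom": 42292.0901,
}
INTERCEPT = -21661.1208


def predict_selling_price(house):
    """Apply the linear model: sum of coefficient * value, plus the intercept."""
    return sum(COEFFICIENTS[name] * house[name] for name in COEFFICIENTS) + INTERCEPT


# The house with the unknown price from the data table.
unknown_house = {"housesize": 3198, "lotsize": 9669, "bedrooms": 5, "granite": 1, "bathroom": 1}
price = predict_selling_price(unknown_house)
print(round(price))  # 219328, matching the slide
```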

Example - House: Observations
sellingprice = (-26.6882 * housesize) + (7.0551 * lotsize) + (43166.0767 * bedrooms) + (42292.0901 * bathroom) - 21661.1208
Granite doesn't matter
  it isn't used in the model
Bathrooms do matter
Bigger houses reduce the value
  but house size isn't truly independent of the other variables (it is related to the number of bedrooms)
  not a perfect model

Regression Example - Cars
Classic dataset of vehicles produced 1970-1982
  often used for parallel coordinates examples
  398 rows of data
Independent variables: cylinders, displacement, horsepower, weight, acceleration, model year, origin, car make
Dependent variable: miles per gallon (MPG), aka the class

Regression - More Information
Keywords to search for:
  least squares
  homoscedasticity
  White tests
  Lilliefors tests
  R-squared
  p-values

Classification
Creates a step-by-step guide for how to determine the output of a new data instance
  aka classification trees or decision trees
Creates a tree where each node represents a spot where a decision must be made based on the input
  we want the tree to be as simple as possible, with as few nodes and leaves as possible
The model can be used for any unknown data instance

Classification - Training Set
Data set with known output values, used to build the model
Take the entire set of known data and divide it into two parts:
  60-80% goes in the training set, used to create the model
  the remainder goes in the test set, used to test the accuracy of the model
Overfitting: if you give too much data to the model, the model will fit that data perfectly, but only that set of data
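The training/test division described on the slide can be sketched in a few lines of Python. The function name train_test_split, the 70% default (chosen from within the slide's 60-80% range), and the fixed seed are all assumptions made for illustration.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data: 100 row identifiers standing in for labelled instances.
rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```

The model is built only from the training part; the held-out test part then gives an accuracy estimate that is not inflated by overfitting.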

Classification - Confusion Matrix
False positive: a data instance where the model predicts positive, but the actual value is negative
False negative: a data instance where the model predicts negative, but the actual value is positive
The impact of a false positive and of a false negative is not always the same
  Ex: spam. A false positive (real email marked as spam) is more damaging than a false negative (spam marked as not spam)

Classification Example - BMW
Use data set from BMW dealership
Goal: try to push a two-year extended warranty to its past customers
Attributes:
  income bracket
  year/month first BMW bought
  year/month most recent BMW bought
  whether they responded to the extended warranty offer in the past
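As an illustration of the confusion-matrix definitions, the following Python sketch counts the four outcomes for a binary classifier. The function name confusion_counts and the label lists are hypothetical; they are not the BMW data.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical labels: 1 = positive, 0 = negative.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```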

Example - BMW: Accuracy
Precision: fraction of retrieved instances that are relevant
Recall: fraction of relevant instances that are retrieved
F-Measure: combines precision and recall
  harmonic mean of precision and recall
  2 * (precision * recall) / (precision + recall)
[Figure: relevant vs. not relevant instances, with errors shown in red]
Source: Wikipedia, "Precision and recall"

Example - BMW: Validation
Run the test set through the model
  bmw-test.arff
Correctly Classified Instances
  training set: 59.1%
  test set: 55.7%
Pretty close (though still not great)
  hmmm, maybe classification isn't the best method for this data
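The three measures on the accuracy slide follow directly from confusion-matrix counts. The Python sketch below uses the F-measure formula given on the slide; the function name precision_recall_f and the example counts are hypothetical and unrelated to the BMW results.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # retrieved instances that are relevant
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # relevant instances that are retrieved
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * (precision * recall) / (precision + recall)  # harmonic mean
    return precision, recall, f_measure


# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
p, r, f = precision_recall_f(tp=30, fp=10, fn=20)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```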

Clustering
Make groups of data to determine patterns from the data
Advantage: useful when the data set is defined and a general pattern needs to be determined
Every attribute in the data set will be used to analyze the data
Disadvantage: need to know in advance how many groups to create

Clustering - Basic Math
1. Every attribute in the data set is normalized
2. Given the number of desired clusters, randomly select that number of samples from the data set to serve as the initial test cluster centers
3. Compute the distance from each data sample to each cluster center
4. Assign each data row to a cluster, based on the minimum distance
5. Compute the centroid of each cluster: the average of each column of data, using only the members of that cluster
6. Calculate the distance from each data sample to the new centroids and reassign. If the clusters and cluster members don't change, done! Otherwise, repeat from step 5.
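The steps on the "Basic Math" slide can be sketched in plain Python. This is a minimal illustration, assuming min-max normalization and squared Euclidean distance, neither of which the slide specifies; the function names and the sample points are hypothetical, and WEKA's own implementation may differ.

```python
import random


def normalize(rows):
    """Min-max scale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]  # avoid dividing by zero
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(rows, k, seed=0, max_iterations=100):
    """Return (cluster index per row, final centroids in normalized space)."""
    data = normalize(rows)
    # Randomly pick k samples as the initial cluster centers.
    centers = random.Random(seed).sample(data, k)
    assignment = None
    for _ in range(max_iterations):
        # Assign each row to the cluster with the nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(row, centers[c]))
            for row in data
        ]
        if new_assignment == assignment:  # membership unchanged: done
            break
        assignment = new_assignment
        # Recompute each centroid as the column-wise mean of its members.
        for c in range(k):
            members = [row for row, a in zip(data, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


# Hypothetical two-attribute data with two loose groups.
points = [[1, 1], [1.2, 0.8], [0.9, 1.1], [8, 9], [8.5, 9.2], [9, 8.8]]
assignment, centers = k_means(points, k=2)
print(assignment)
```

Because the starting centers are random, different seeds can produce different cluster numberings and, on harder data, different groupings.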

Clustering Example - BMW
Use data set from BMW dealership
Kept track of how people walk through the dealership and showroom, what cars they look at, and how often they make purchases
  100 rows of data
  Each column describes the steps that customers reached in their BMW experience

Example - BMW: Clusters
Cluster 0 - "Dreamers"
  wander around the dealership, don't purchase anything
Cluster 1 - "M5 Lovers"
  walk straight to the M5s, not a high purchase rate
Cluster 2 - "Throw-Aways"
  small group, not statistically relevant
Cluster 3 - "BMW Babies"
  always end up purchasing a car and always end up financing it
  walk around, then turn to the computer search at the dealership, always buy an M5 or Z4
Cluster 4 - "Starting Out With BMW"
  always look at the 3-series, never the more expensive M5
  walk to the showroom, not the lot
  only 32% ultimately finish the transaction

Clustering - More Information
Keywords to search for:
  Lloyd's algorithm
  Manhattan distance
  Chebyshev distance
  sum of squared errors
  cluster centroids

Nearest Neighbor
aka collaborative filtering or instance-based learning
Use past data instances, with known output values, to predict the unknown output value of a new data instance
Different from regression, as regression can only be used for numerical outputs

Nearest Neighbor - Basic Math
Taking the unknown data point, the distance between it and every known data point is computed
The algorithm can be expanded beyond the closest match to include any number of closest matches
  n-nearest neighbors
Can also be used to predict a Yes/No output
How many neighbors to use?
  need to experiment
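The nearest-neighbor procedure on the "Basic Math" slide can be sketched as follows. The sketch assumes Euclidean distance and a simple majority vote among the n closest points; the function names and the tiny Yes/No data set are hypothetical and are not the BMW data.

```python
from collections import Counter


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def nearest_neighbor_predict(known, query, n=3):
    """Predict a label by majority vote among the n closest known instances.

    known is a list of (features, label) pairs with known output values.
    """
    # Compute the distance from the query to every known point, closest first.
    by_distance = sorted(known, key=lambda item: euclidean(item[0], query))
    votes = Counter(label for _, label in by_distance[:n])
    return votes.most_common(1)[0][0]


# Hypothetical instances with a Yes/No output.
known = [
    ((1, 1), "no"), ((1, 2), "no"), ((2, 1), "no"),
    ((8, 8), "yes"), ((9, 8), "yes"), ((8, 9), "yes"),
]
print(nearest_neighbor_predict(known, (7, 7)))  # yes
print(nearest_neighbor_predict(known, (2, 2)))  # no
```

Changing n changes how many neighbors vote, which is the parameter the slide says needs experimentation.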

Nearest Neighbor Example - BMW
Use data set from BMW dealership
Goal: try to push a two-year extended warranty to its past customers
  4,500 past sales of extended warranties
Attributes:
  income bracket
  year/month first BMW bought
  year/month most recent BMW bought
  whether they responded to the extended warranty offer in the past

Nearest Neighbor - More Information
Keywords to search for:
  distance weighting
  Hamming distance
  Mahalanobis distance

Remember
Data mining models aren't always simple input-output mechanisms
Data must be examined to determine the right model to choose
Output must be analyzed and confirmed to be accurate before you're ready to move on
Server-Side WEKA: we won't cover this, but article 3 introduces how to use the WEKA API for Java.