CS570: Introduction to Data Mining

Similar documents
Data Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs)

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska

9 Classification: KNN and SVM

Information Management course

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.7

DATA MINING LECTURE 10B. Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines

All lecture slides will be available at CSC2515_Winter15.html

Data Mining Classification: Alternative Techniques. Lecture Notes for Chapter 4. Instance-Based Learning. Introduction to Data Mining, 2nd Edition

Data Mining: Concepts and Techniques. (3rd ed.) Chapter 9 Classification: Advanced Methods

Topic 1 Classification Alternatives

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018

k-nearest Neighbor (knn) Sept Youn-Hee Han

Data Mining. 3.5 Lazy Learners (Instance-Based Learners) Fall Instructor: Dr. Masoud Yaghini. Lazy Learners

Naïve Bayes for text classification

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.

Machine Learning - Lecture 15 Support Vector Machines

CS5670: Computer Vision

Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank

Introduction to Support Vector Machines

Support Vector Machines

Lecture 9: Support Vector Machines

CS 229 Midterm Review

12 Classification using Support Vector Machines

Support Vector Machines

Support Vector Machines and their Applications

SUPPORT VECTOR MACHINES

Content-based image and video analysis. Machine learning

CISC 4631 Data Mining

Data Mining. Lecture 03: Nearest Neighbor Learning

Applying Supervised Learning

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010

Statistical Learning Part 2 Nonparametric Learning: The Main Ideas. R. Moeller Hamburg University of Technology

Kernel SVM. Course: Machine Learning MAHDI YAZDIAN-DEHKORDI FALL 2017

Robot Learning. There are generally three types of robot learning: Learning from data. Learning by demonstration. Reinforcement learning

Introduction to object recognition. Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, and others

Rule extraction from support vector machines

SVM Classification in Multiclass Letter Recognition System

CS 584 Data Mining. Classification 1

Support Vector Machines + Classification for IR

Support vector machine (II): non-linear SVM. LING 572 Fei Xia

CSE 573: Artificial Intelligence Autumn 2010

Contents. Preface to the Second Edition

A Dendrogram. Bioinformatics (Lec 17)

Lecture 6 K- Nearest Neighbors(KNN) And Predictive Accuracy

UVA CS 6316/4501 Fall 2016 Machine Learning. Lecture 15: K-nearest-neighbor Classifier / Bias-Variance Tradeoff. Dr. Yanjun Qi. University of Virginia

Data Preprocessing. Supervised Learning

Support Vector Machines for Face Recognition

CS7267 MACHINE LEARNING NEAREST NEIGHBOR ALGORITHM. Mingon Kang, PhD Computer Science, Kennesaw State University

Supervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty)

Topics in Machine Learning

Lecture Notes for Chapter 5

Network Traffic Measurements and Analysis

Announcements. CS 188: Artificial Intelligence Spring Classification: Feature Vectors. Classification: Weights. Learning: Binary Perceptron

K- Nearest Neighbors(KNN) And Predictive Accuracy

Data mining. Classification k-nn Classifier. Piotr Paszek. (Piotr Paszek) Data mining k-nn 1 / 20

Kernel Methods & Support Vector Machines

Discriminative classifiers for image recognition

Announcements. CS 188: Artificial Intelligence Spring Generative vs. Discriminative. Classification: Feature Vectors. Project 4: due Friday.

Classification: Feature Vectors

Supervised vs. Unsupervised Learning. Supervised vs. Unsupervised Learning. Supervised vs. Unsupervised Learning. Supervised vs. Unsupervised Learning

SUPPORT VECTOR MACHINES

7. Nearest neighbors. Learning objectives. Centre for Computational Biology, Mines ParisTech

Support vector machines

CS570: Introduction to Data Mining

Machine Learning Classifiers and Boosting

DM6 Support Vector Machines

Lecture 3. Oct

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology

Supervised Learning Classification Algorithms Comparison

Applied Statistics for Neuroscientists Part IIa: Machine Learning

UVA CS 4501: Machine Learning. Lecture 10: K-nearest-neighbor Classifier / Bias-Variance Tradeoff. Dr. Yanjun Qi. University of Virginia

7. Nearest neighbors. Learning objectives. Foundations of Machine Learning École Centrale Paris Fall 2015

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013

Machine Learning: Think Big and Parallel

Instance-based Learning

Support vector machines. Dominik Wisniewski Wojciech Wawrzyniak

CS570: Introduction to Data Mining

Feature Extractors. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. The Perceptron Update Rule.

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

CS570: Introduction to Data Mining

COMPUTATIONAL INTELLIGENCE SEW (INTRODUCTION TO MACHINE LEARNING) SS18. Lecture 6: k-nn Cross-validation Regularization

Ensemble methods in machine learning. Example. Neural networks. Neural networks

Nearest Neighbor Classification. Machine Learning Fall 2017

Announcements: Rough Plan

Feature scaling in support vector data description

K-Nearest Neighbour Classifier. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Supervised vs unsupervised clustering

6.034 Quiz 2, Spring 2005

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

Practice EXAM: SPRING 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE

ECG782: Multidimensional Digital Signal Processing

Ensemble Learning. Another approach is to leverage the algorithms we have via ensemble methods

K Nearest Neighbor Wrap Up K- Means Clustering. Slides adapted from Prof. Carpuat

CS 8520: Artificial Intelligence. Machine Learning 2. Paula Matuszek Fall, CSC 8520 Fall Paula Matuszek

Neural Networks and Deep Learning

SOCIAL MEDIA MINING. Data Mining Essentials

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

Statistics 202: Statistical Aspects of Data Mining

Support Vector Machines

Classification by Support Vector Machines

Transcription:

CS570: Introduction to Data Mining. Classification Advanced. Reading: Chapters 8 & 9 (Han), Chapters 4 & 5 (Tan). Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D.; 2011 Han, Kamber & Pei, Data Mining, Morgan Kaufmann; and 2006 Tan, Steinbach & Kumar, Introduction to Data Mining, Pearson Addison Wesley. September 19, 2013 1

Classification and Prediction. Last lecture: overview; decision tree induction; Bayesian classification. Today: training (learning) Bayesian networks; kNN classification and collaborative filtering; Support Vector Machines (SVM). Upcoming lectures: rule-based methods; neural networks; regression; ensemble methods; model evaluation. 2

Training Bayesian Networks. Several scenarios: (1) given both the network structure and all variables observable, learn only the CPTs; (2) network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning; (3) network structure unknown, all variables observable: search through the model space to reconstruct the network topology; (4) unknown structure, all hidden variables: no good algorithms known for this purpose. Ref.: D. Heckerman, Bayesian Networks for Data Mining. 3

Training Bayesian Networks. Scenario: given both the network structure and all variables observable, learn only the CPTs (similar to naïve Bayes). 4

Training Bayesian Networks. Scenario: network structure known, some variables hidden. Use a gradient descent (greedy hill-climbing) method, i.e., search for a solution along the steepest descent of a criterion function (similar to neural network training). Example optimization function: likelihood of observing the data. Weights are initialized to random probability values; at each iteration the method moves towards what appears to be the best solution at the moment, without backtracking; weights are updated at each iteration and converge to a local optimum. 5

Training Bayesian Networks. Scenario: network structure unknown, all variables observable. Search through the model space to reconstruct the network topology: define a total order of the variables; construct sequences and, for each sequence, remove the variables that do not affect the current variable; create arcs using the remaining dependencies. 6

Classification and Prediction. Last lecture: overview; decision tree induction; Bayesian classification. Today: training (learning) Bayesian networks; kNN classification and collaborative filtering; Support Vector Machines (SVM). Upcoming lectures: rule-based methods; neural networks; regression; ensemble methods; model evaluation. 7

Lazy vs. Eager Learning. Lazy learning (e.g., instance-based learning) stores the training data (with only minor processing) and waits until it receives test data; eager learning (e.g., decision trees, Bayesian classifiers) constructs a classification model before receiving test data. Efficiency: lazy learning spends less time in training but more in predicting; eager learning spends more time in training but less in predicting. Accuracy: lazy learning effectively uses a richer hypothesis space, combining many local linear functions to form its global approximation to the target function; eager learning must commit to a single hypothesis that covers the entire instance space. 8

Lazy Learner: Instance-Based Methods. Typical approaches: the k-nearest neighbor approach (1950s), where instances are represented as points in a Euclidean space, and locally weighted regression, which constructs a local approximation. 9

Nearest Neighbor Classifiers. Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck. (Figure: compute the distance from the test record to the training records and choose the k nearest records.) 10

Nearest-Neighbor Classifiers. Algorithm: compute the distance from the (unknown) test record to all training records; identify the k nearest neighbors; use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote). 11
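A minimal sketch of this algorithm in Python (NumPy only); the toy arrays X_train, y_train, and the test point are illustrative placeholders, not data from the lecture.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify x_test by majority vote among its k nearest training records."""
    # Euclidean distance from the test record to every training record
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Indices of the k nearest neighbors
    nn_idx = np.argsort(dists)[:k]
    # Majority vote over the neighbors' class labels
    votes = Counter(y_train[i] for i in nn_idx)
    return votes.most_common(1)[0][0]

# Toy example (hypothetical data)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [9.0, 8.5]])
y_train = np.array(["duck", "duck", "goose", "goose"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "duck"
```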

Nearest Neighbor Classification. Compute the distance between two points, e.g., Euclidean distance: $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$. Problem with the Euclidean measure: for high-dimensional data (curse of dimensionality) it can produce counter-intuitive results. Example: the binary vectors (1,1,1,1,1,1,1,1,1,1,1,0) and (0,1,1,1,1,1,1,1,1,1,1,1) have d = 1.4142, and so do (1,0,0,0,0,0,0,0,0,0,0,0) and (0,0,0,0,0,0,0,0,0,0,0,1), even though the first pair is far more alike. Solution: normalize the vectors to unit length. 12
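A quick numerical check of the example above, assuming the two pairs of 12-dimensional binary vectors shown on the slide; after normalizing to unit length, the intuitively similar pair becomes much closer than the dissimilar pair.

```python
import numpy as np

p1 = np.array([1] * 11 + [0], dtype=float)
q1 = np.array([0] + [1] * 11, dtype=float)
p2 = np.array([1] + [0] * 11, dtype=float)
q2 = np.array([0] * 11 + [1], dtype=float)

dist = lambda p, q: np.linalg.norm(p - q)
print(dist(p1, q1), dist(p2, q2))      # both 1.4142..., which is counter-intuitive

unit = lambda v: v / np.linalg.norm(v)  # normalize to unit length
print(dist(unit(p1), unit(q1)))         # ~0.426  (the similar pair)
print(dist(unit(p2), unit(q2)))         # 1.4142  (the dissimilar pair)
```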

k-Nearest Neighbor Classification. Determine the class from the nearest-neighbor list: take the majority vote of class labels among the k nearest neighbors, optionally weighing each vote according to distance, e.g., with weight factor $w = 1/d^2$. 13
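A minimal sketch of the distance-weighted variant, reusing the NumPy setup from the sketch above; each neighbor votes with weight $1/d^2$ (a small epsilon guards against division by zero for an exact match).

```python
import numpy as np
from collections import defaultdict

def knn_predict_weighted(X_train, y_train, x_test, k=3, eps=1e-12):
    """Distance-weighted kNN: each of the k neighbors votes with weight 1/d^2."""
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nn_idx = np.argsort(dists)[:k]
    scores = defaultdict(float)
    for i in nn_idx:
        scores[y_train[i]] += 1.0 / (dists[i] ** 2 + eps)
    return max(scores, key=scores.get)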

Nearest Neighbor Classification. Choosing the value of k: if k is too small, the classifier is sensitive to noise points; if k is too large, the neighborhood may include points from other classes. 14

Nearest Neighbor Classification. Scaling issues: attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. Example: the height of a person may vary from 1.5 m to 1.8 m, the weight from 90 lb to 300 lb, and the income from $10K to $1M. Solution: normalize each attribute (e.g., to [0, 1]) before computing distances. kNN can also make a real-valued prediction for a given unknown tuple by returning the mean value of its k nearest neighbors. 15
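A sketch of min-max scaling and of kNN used for real-valued prediction; the attribute ranges follow the slide's example, and the helper names are my own.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each attribute (column) to [0, 1] so no attribute dominates the distance."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def knn_regress(X_train, y_train, x_test, k=3):
    """Real-valued prediction: mean target value of the k nearest neighbors."""
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nn_idx = np.argsort(dists)[:k]
    return y_train[nn_idx].mean()

# Height (m), weight (lb), income ($): wildly different ranges before scaling
X = np.array([[1.5, 90, 10_000], [1.8, 300, 1_000_000], [1.7, 180, 50_000]], dtype=float)
print(min_max_scale(X))
```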

Collaborative Filtering: knn in action Customers who bought this book also bought: Data Preparation for Data Mining: by Dorian Pyle (Author) The Elements of Statistical Learning: by T. Hastie, et al Data Mining: Introductory and Advanced Topics: by Margaret H. Dunham Mining the Web: Analysis of Hypertext and Semi Structured Data 16

Collaborative Filtering: kNN Classification/Prediction in Action. User perspective: lots of online products, books, movies, etc.; "reduce my choices, please." Manager perspective: "If I have 3 million customers on the web, I should have 3 million stores on the web." CEO of Amazon.com [SCH01] 17

Basic Approaches for Recommendation. Collaborative Filtering (CF): look at users' collective behavior, look at the active user's history, and combine the two. Content-based Filtering: recommend items based on keywords; more appropriate for information retrieval. 18

How Does it Work? Each user has a profile. Users rate items explicitly (a score from 1 to 5) or implicitly (web usage mining: time spent viewing the item, navigation path, etc.). The system does the rest. How? Collaborative filtering (based on kNN!) 19

Collaborative Filtering: A Framework. Items $I = \{i_1, \dots, i_n\}$, users $U = \{u_1, \dots, u_m\}$, and a sparse ratings matrix with known entries $r_{ij}$ (e.g., 3, 1.5, 2, ...) and many unknown ones ($r_{ij} = ?$). Unknown function $f: U \times I \to R$. The task: Q1: find the unknown ratings. Q2: which items should we recommend to this user? 20

Collaborative Filtering User-User Methods Identify like-minded users Memory-based (heuristic): knn Model-based: Clustering, Bayesian networks Item-Item Method Identify buying patterns Correlation Analysis Linear Regression Belief Network Association Rule Mining 21

User-User Similarity: Intuition. (Figure: a target customer and candidate neighbors.) Q1: How to measure similarity? Q2: How to select neighbors? Q3: How to combine their ratings? 22

How to Measure Similarity? Pearson correlation coefficient over the commonly rated items: $w_p(a, i) = \frac{\sum_{j \in \text{common}} (r_{aj} - \bar{r}_a)(r_{ij} - \bar{r}_i)}{\sqrt{\sum_{j \in \text{common}} (r_{aj} - \bar{r}_a)^2}\,\sqrt{\sum_{j \in \text{common}} (r_{ij} - \bar{r}_i)^2}}$. Cosine measure, treating users as vectors in product-dimension space: $w_c(a, i) = \frac{r_a \cdot r_i}{\lVert r_a \rVert_2 \, \lVert r_i \rVert_2}$. 23
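A sketch of the two similarity measures over the commonly rated items, assuming each user's ratings are stored as a dict mapping item id to rating (a hypothetical layout, not the lecture's notation).

```python
import math

def pearson_sim(ratings_a, ratings_i):
    """Pearson correlation over the items both users have rated."""
    common = set(ratings_a) & set(ratings_i)
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[j] for j in common) / len(common)
    mean_i = sum(ratings_i[j] for j in common) / len(common)
    num = sum((ratings_a[j] - mean_a) * (ratings_i[j] - mean_i) for j in common)
    den = math.sqrt(sum((ratings_a[j] - mean_a) ** 2 for j in common)) * \
          math.sqrt(sum((ratings_i[j] - mean_i) ** 2 for j in common))
    return num / den if den else 0.0

def cosine_sim(ratings_a, ratings_i):
    """Cosine of the two users' rating vectors (missing ratings treated as 0)."""
    common = set(ratings_a) & set(ratings_i)
    num = sum(ratings_a[j] * ratings_i[j] for j in common)
    den = math.sqrt(sum(v * v for v in ratings_a.values())) * \
          math.sqrt(sum(v * v for v in ratings_i.values()))
    return num / den if den else 0.0
```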

Nearest Neighbor Approaches [SAR00a]. Offline phase: do nothing, just store transactions. Online phase: identify users highly similar to the active one (the best K, or all with a similarity measure greater than a threshold). Prediction: $\hat{r}_{aj} = \bar{r}_a + \frac{\sum_i w(a, i)\,(r_{ij} - \bar{r}_i)}{\sum_i w(a, i)}$, i.e., user a's neutral rating $\bar{r}_a$ plus user a's estimated deviation, a similarity-weighted average of each neighbor i's deviation $r_{ij} - \bar{r}_i$. 24
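A sketch of the prediction step, reusing the hypothetical pearson_sim and dict-of-ratings layout above; neighbors is the list of the active user's nearest neighbors.

```python
def predict_rating(active, neighbors, item, sim=pearson_sim):
    """Predict the active user's rating for `item`: the user's own mean rating plus a
    similarity-weighted average of each neighbor's deviation from its own mean."""
    mean_active = sum(active.values()) / len(active)
    num, den = 0.0, 0.0
    for nb in neighbors:
        if item not in nb:          # neighbor has not rated this item
            continue
        w = sim(active, nb)
        mean_nb = sum(nb.values()) / len(nb)
        num += w * (nb[item] - mean_nb)
        den += abs(w)
    return mean_active if den == 0 else mean_active + num / den
```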

Challenges for Recommender Systems Collaborative systems Scalability Quality of recommendations Dealing with new users (no history available) Content-based systems False negatives False positives Limited by the features used to describe the items 25

Classification and Prediction. Last lecture: overview; decision tree induction; Bayesian classification. Today: training (learning) Bayesian networks; kNN classification and collaborative filtering; Support Vector Machines (SVM). Upcoming lectures: rule-based methods; neural networks; regression; ensemble methods; model evaluation. 26

Support Vector Machines: Overview. A relatively new classification method for both separable and non-separable data. Features: a sound mathematical foundation; training time can be slow, but efficient methods are being developed; robust and accurate, less prone to overfitting. Applications: handwritten digit recognition, speaker identification, etc. 27

Support Vector Machines: History. Vapnik and colleagues (1992); groundwork from Vapnik-Chervonenkis theory (1960-1990). Problems driving the initial development of SVM: bias-variance tradeoff, capacity control, overfitting. Basic idea: accuracy on the training set vs. capacity. See: A Tutorial on Support Vector Machines for Pattern Recognition, Burges, Data Mining and Knowledge Discovery, 1998. 28

Linear Support Vector Machines. Problem: find a linear hyperplane (decision boundary) that best separates the data. (Figure: two classes of points and a candidate hyperplane B2.) 29

Linear Support Vector Machines. Which line is better, B1 or B2? How do we define "better"? 30

Support Vector Machines. Find the hyperplane that maximizes the margin. (Figure: hyperplanes B1 and B2 with their margin boundaries b11, b12 and b21, b22.) 31

Support Vector Machines: Illustration. A separating hyperplane can be written as $W \cdot X + b = 0$, where $W = \{w_1, w_2, \dots, w_n\}$ is a weight vector and $b$ a scalar (bias). In 2-D it can be written as $w_0 + w_1 x_1 + w_2 x_2 = 0$. The hyperplanes defining the sides of the margin are $H_1: w_0 + w_1 x_1 + w_2 x_2 = 1$ and $H_2: w_0 + w_1 x_1 + w_2 x_2 = -1$. Any training tuples that fall on hyperplanes $H_1$ or $H_2$ (i.e., the sides defining the margin) are support vectors. 32
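A small illustration with a hand-picked weight vector and bias (hypothetical numbers, not from the slides): points are classified by the sign of $W \cdot X + b$, and points lying exactly on $H_1$ or $H_2$ are the support vectors.

```python
import numpy as np

w = np.array([1.0, 1.0])   # hypothetical weight vector
b = -3.0                   # hypothetical bias

X = np.array([[1.0, 1.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]])
scores = X @ w + b                           # W.X + b for every point
labels = np.sign(scores)                     # which side of the hyperplane
on_margin = np.isclose(np.abs(scores), 1.0)  # lies on H1 or H2 -> support vector

for x, s, sv in zip(X, scores, on_margin):
    print(x, "score=%+.1f" % s, "support vector" if sv else "")
```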

Support Vector Machines. The decision boundary is $w \cdot x + b = 0$, with margin hyperplanes $w \cdot x + b = 1$ and $w \cdot x + b = -1$. The constraints are $w \cdot x_i + b \geq 1$ for $y_i = 1$ and $w \cdot x_i + b \leq -1$ for $y_i = -1$; equivalently, for all training points, $y_i(w \cdot x_i + b) - 1 \geq 0$. The margin is $\frac{2}{\lVert w \rVert}$. 33

Support Vector Machines. We want to maximize the margin $\frac{2}{\lVert w \rVert}$, which is equivalent to minimizing $\frac{\lVert w \rVert^2}{2}$, subject to the constraints $y_i(w \cdot x_i + b) - 1 \geq 0$. This is a constrained optimization problem, handled via a Lagrange reformulation. 34
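For concreteness, a sketch of the Lagrange reformulation under the hard-margin assumptions above (standard material, not reproduced verbatim from the slides):

```latex
% Primal: minimize (1/2)||w||^2 subject to y_i (w . x_i + b) - 1 >= 0
\min_{w,b}\ \tfrac{1}{2}\lVert w\rVert^2
\quad\text{s.t.}\quad y_i(w\cdot x_i + b) - 1 \ge 0

% Lagrangian with multipliers alpha_i >= 0
L(w,b,\alpha) = \tfrac{1}{2}\lVert w\rVert^2
  - \sum_i \alpha_i\bigl[y_i(w\cdot x_i + b) - 1\bigr]

% Setting the derivatives w.r.t. w and b to zero gives
w = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0

% Substituting back yields the dual (only dot products of the inputs appear)
\max_{\alpha \ge 0}\ \sum_i \alpha_i
  - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, (x_i\cdot x_j)
\quad\text{s.t.}\quad \sum_i \alpha_i y_i = 0
```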

Support Vector Machines. What if the problem is not linearly separable? Introduce slack variables $\xi_i \geq 0$ into the constraints: $w \cdot x_i + b \geq 1 - \xi_i$ for $y_i = 1$ and $w \cdot x_i + b \leq -1 + \xi_i$ for $y_i = -1$. The sum $\sum_i \xi_i$ is an upper bound on the number of training errors. 35
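A rough sketch of a soft-margin linear SVM trained by sub-gradient descent on the hinge loss, one common way to handle the slack formulation; this is an illustrative implementation under my own choice of optimizer and toy data, not the solver the lecture assumes.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Soft-margin linear SVM: minimize ||w||^2 / 2 + C * sum of hinge losses.
    Labels y must be in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:                       # point violates the margin (slack > 0)
                w -= lr * (w - C * y[i] * X[i])
                b += lr * C * y[i]
            else:                                # correct with margin >= 1: only regularize
                w -= lr * w
    return w, b

# Hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.5], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))   # should reproduce y on this toy set
```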

Nonlinear Support Vector Machines. What if the decision boundary is not linear? Transform the data into a higher-dimensional space and search for a hyperplane in the new space; then convert the hyperplane back to the original space. 36

SVM Kernel Functions. Instead of computing the dot product on the transformed data tuples, it is mathematically equivalent to apply a kernel function $K(X_i, X_j)$ to the original data, i.e., $K(X_i, X_j) = \Phi(X_i) \cdot \Phi(X_j)$. Typical kernel functions include the polynomial kernel $(X_i \cdot X_j + 1)^h$, the Gaussian radial basis function $e^{-\lVert X_i - X_j \rVert^2 / 2\sigma^2}$, and the sigmoid kernel $\tanh(\kappa X_i \cdot X_j - \delta)$. SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters). 37
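A quick numerical check of the kernel identity $K(X_i, X_j) = \Phi(X_i) \cdot \Phi(X_j)$ for a degree-2 polynomial kernel on 2-D inputs, where the explicit map is $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ (a standard textbook example, chosen here for illustration).

```python
import numpy as np

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x . z)^2 computed on the original inputs."""
    return float(x @ z) ** 2

def phi(x):
    """Explicit feature map for the same kernel in 2-D."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(poly2_kernel(x, z))        # 16.0
print(float(phi(x) @ phi(z)))    # 16.0 -- same value, without ever building the transformed space
```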

Support Vector Machines: Comments and Research Issues. Robust and accurate, with nice generalization properties. Effective even in high dimensions: complexity is characterized by the number of support vectors rather than the dimensionality. Scalability in training: while speed in the test phase is largely solved, training on very large datasets is an unsolved problem. Extensions to regression analysis and to multiclass SVM are still under research. Kernel selection is still a research issue. 38

SVM Related Links. SVM web sites: www.kernel-machines.org, www.kernel-methods.net, www.support-vector.net, www.support-vector-machines.org. Representative implementations: LIBSVM, an efficient implementation of SVM with multi-class classification; SVM-light, simpler but its performance is not better than LIBSVM's, supporting only binary classification and only the C language; SVM-torch, another recent implementation, also written in C. 39

SVM Introduction Literature. Statistical Learning Theory by Vapnik: extremely hard to understand, and it contains many errors. C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 2(2), 1998: better than Vapnik's book, but still too hard for an introduction, and the examples are not intuitive. An Introduction to Support Vector Machines by N. Cristianini and J. Shawe-Taylor: also hard as an introduction, but its explanation of Mercer's theorem is better than in the above references. The neural networks book by Haykin contains one nice introductory chapter on SVM. 40

Classification and Prediction. Last lecture: overview; decision tree induction; Bayesian classification. Today: training (learning) Bayesian networks; kNN classification and collaborative filtering; Support Vector Machines (SVM). Upcoming lectures: rule-based methods; neural networks; regression; ensemble methods; model evaluation. 41