Data Mining: Practical Machine Learning Tools and Techniques. Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank


Implementation: Real machine learning schemes
- Decision trees: From ID3 to C4.5 (pruning, numeric attributes, ...)
- Classification rules: From PRISM to RIPPER and PART (pruning, numeric data, ...)
- Extending linear models: Support vector machines and neural networks
- Instance-based learning: Pruning examples, generalized exemplars, distance functions
- Numeric prediction: Regression/model trees, locally weighted regression
- Clustering: Hierarchical, incremental, probabilistic
- Bayesian networks: Learning and prediction, fast data structures for learning

Industrial-strength algorithms
For an algorithm to be useful in a wide range of real-world applications it must:
- Permit numeric attributes
- Allow missing values
- Be robust in the presence of noise
- Be able to approximate arbitrary concept descriptions (at least in principle)
Basic schemes need to be extended to fulfill these requirements

Decision trees
Extending ID3:
- To permit numeric attributes: straightforward
- To deal sensibly with missing values: trickier
- Stability for noisy data: requires a pruning mechanism
End result: C4.5 (Quinlan)
- Best-known and (probably) most widely used learning algorithm
- Commercial successor: C5.0

Numeric attributes
Standard method: binary splits (e.g. temp < 45)
Unlike nominal attributes, every numeric attribute has many possible split points
Solution is a straightforward extension:
- Evaluate info gain (or another measure) for every possible split point of the attribute
- Choose the best split point
- Info gain for the best split point is the info gain for the attribute
Computationally more demanding
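As a rough illustration (my own sketch, not the book's WEKA implementation), the following Python code evaluates information gain at every candidate split point of a numeric attribute and picks the best one; the attribute values and labels are invented toy data.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split_point(values, labels):
    """Evaluate info gain at every candidate split of a numeric attribute
    (midpoints between consecutive distinct sorted values) and return the
    best split point together with its gain."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    base = entropy(labels)
    best_gain, best_split = -1.0, None
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue  # no boundary between equal values
        split = (values[i] + values[i - 1]) / 2.0
        left, right = labels[:i], labels[i:]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if gain > best_gain:
            best_gain, best_split = gain, split
    return best_split, best_gain

# Toy example: temperature values with play / don't-play labels
temps = np.array([64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85], dtype=float)
play = np.array(["y", "n", "y", "y", "y", "n", "n", "y", "n", "y", "y", "n"])
print(best_split_point(temps, play))
```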

Pruning
Prevent overfitting to noise in the data: prune the decision tree
Two strategies:
- Postpruning: take a fully grown decision tree and discard unreliable parts
- Prepruning: stop growing a branch when information becomes unreliable
Postpruning is preferred in practice because prepruning can stop too early

Subtree replacement
Bottom-up: consider replacing a tree only after considering all its subtrees

Subtree raising
Delete a node and redistribute its instances
Slower than subtree replacement (worthwhile?)

Extending linear classification
Linear classifiers can't model nonlinear class boundaries
Simple trick: map attributes into a new space consisting of combinations of attribute values
E.g.: all products of n factors that can be constructed from the attributes
Example with two attributes and n = 3:
$x = w_1 a_1^3 + w_2 a_1^2 a_2 + w_3 a_1 a_2^2 + w_4 a_2^3$
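To make the trick concrete, here is a small sketch (my own illustration, not from the slides) that expands two attributes into all products of three factors and fits the weights w1..w4 with an ordinary least-squares solve as a stand-in for whatever linear learner one would actually use; the toy data is invented.

```python
import numpy as np

def cubic_products(a1, a2):
    """Explicit map of two attributes to all products of 3 factors:
    a1^3, a1^2*a2, a1*a2^2, a2^3."""
    return np.column_stack([a1**3, a1**2 * a2, a1 * a2**2, a2**3])

# Toy data with a nonlinear class boundary
rng = np.random.default_rng(0)
A = rng.uniform(-1, 1, size=(200, 2))
y = np.where(A[:, 0]**3 + A[:, 1]**3 > 0, 1, -1)

# Linear fit in the expanded space gives a nonlinear boundary in the original space
X = cubic_products(A[:, 0], A[:, 1])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = np.sign(X @ w)
print("training accuracy:", np.mean(pred == y))
```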

Problems with this approach
1st problem: speed
- 10 attributes and n = 5 give more than 2000 coefficients
- Use linear regression with attribute selection
- Run time is cubic in the number of attributes
2nd problem: overfitting
- The number of coefficients is large relative to the number of training instances
- The curse of dimensionality kicks in

Support vector machines
Support vector machines are algorithms for learning linear classifiers
Resilient to overfitting because they learn a particular linear decision boundary: the maximum margin hyperplane
Fast in the nonlinear case:
- Use a mathematical trick to avoid creating "pseudo-attributes"
- The nonlinear space is created implicitly

The maximum margin hyperplane
The instances closest to the maximum margin hyperplane are called support vectors

Support vectors
The support vectors define the maximum margin hyperplane
All other instances can be deleted without changing its position and orientation
This means the hyperplane $x = w_0 + w_1 a_1 + w_2 a_2$ can be written as
$x = b + \sum_{i \text{ is supp. vector}} \alpha_i y_i \, \mathbf{a}(i) \cdot \mathbf{a}$

Finding support vectors
$x = b + \sum_{i \text{ is supp. vector}} \alpha_i y_i \, \mathbf{a}(i) \cdot \mathbf{a}$
Support vector: training instance for which $\alpha_i > 0$
How to determine $\alpha_i$ and $b$? A constrained quadratic optimization problem
- Off-the-shelf tools exist for solving these problems
- However, special-purpose algorithms are faster
- Example: Platt's sequential minimal optimization algorithm (implemented in WEKA)
Note: all this assumes separable data!
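As an illustration only (using scikit-learn's solver rather than the WEKA implementation mentioned above), the sketch below trains a linear SVM on separable toy data and reads off the support vectors, the products $\alpha_i y_i$, and $b$.

```python
import numpy as np
from sklearn.svm import SVC

# Separable toy data in two dimensions
rng = np.random.default_rng(1)
X_pos = rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))
X_neg = rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6)  # a very large C approximates the hard-margin (separable) case
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("alpha_i * y_i:", clf.dual_coef_)   # nonzero only for support vectors
print("b:", clf.intercept_)
```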

Nonlinear SVMs
"Pseudo-attributes" represent attribute combinations
Overfitting is not a problem because the maximum margin hyperplane is stable:
- There are usually few support vectors relative to the size of the training set
Computation time is still an issue:
- Each time the dot product is computed, all the pseudo-attributes must be included

A mathematical trick
Avoid computing the pseudo-attributes
Compute the dot product before doing the nonlinear mapping
Example:
$x = b + \sum_{i \text{ is supp. vector}} \alpha_i y_i \, (\mathbf{a}(i) \cdot \mathbf{a})^n$
Corresponds to a map into the instance space spanned by all products of n attributes

Other kernel functions
The mapping is called a "kernel function"
Polynomial kernel:
$x = b + \sum_{i \text{ is supp. vector}} \alpha_i y_i \, (\mathbf{a}(i) \cdot \mathbf{a})^n$
We can use others:
$x = b + \sum_{i \text{ is supp. vector}} \alpha_i y_i \, K(\mathbf{a}(i), \mathbf{a})$
Only requirement: $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$ for some mapping $\Phi$
Examples:
$K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^d$
$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{(\mathbf{x}_i - \mathbf{x}_j)^2}{2\sigma^2}\right)$
$K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\beta \, \mathbf{x}_i \cdot \mathbf{x}_j + b^*)$
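A small numerical check of the kernel trick (my own illustration): for two attributes and n = 2, computing the dot product first and then raising it to the power n gives the same value as an explicit dot product over all products of 2 attributes.

```python
import numpy as np

def phi2(v):
    """Explicit map of a 2-attribute vector to all products of 2 factors:
    (a1^2, sqrt(2)*a1*a2, a2^2)."""
    a1, a2 = v
    return np.array([a1**2, np.sqrt(2) * a1 * a2, a2**2])

x = np.array([3.0, 1.0])
z = np.array([2.0, 4.0])

lhs = np.dot(x, z) ** 2          # kernel: dot product first, then raise to the power n = 2
rhs = np.dot(phi2(x), phi2(z))   # pseudo-attributes first, then dot product
print(lhs, rhs)                  # both print 100.0
```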

Noise
We have assumed that the data is separable (in the original or the transformed space)
SVMs can be applied to noisy data by introducing a "noise" parameter C
C bounds the influence of any one training instance on the decision boundary
- Corresponding constraint: $0 \leq \alpha_i \leq C$
Still a quadratic optimization problem
C has to be determined by experimentation
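One common way to "determine C by experimentation" is a cross-validated grid search over C. The sketch below is one hedged way to do it, using scikit-learn (not mentioned in the slides), an RBF kernel chosen only for illustration, and a synthetic noisy dataset.

```python
from sklearn.datasets import make_classification  # toy, noisy data for illustration
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, flip_y=0.1, random_state=0)

# Search a coarse logarithmic grid of C values; each C enforces 0 <= alpha_i <= C
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_["C"], "cv accuracy:", grid.best_score_)
```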

Sparse data
SVM algorithms speed up dramatically if the data is sparse (i.e. many values are 0)
Why? Because they compute lots and lots of dot products
With sparse data, dot products can be computed very efficiently:
- Iterate only over the non-zero values
SVMs can process sparse datasets with tens of thousands of attributes
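A minimal sketch of a sparse dot product (not from the slides): each vector is stored as an index-to-value map, and the loop touches only the non-zero entries, so the cost depends on the number of non-zeros rather than the number of attributes.

```python
def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as {index: value} dicts.
    Iterates only over the non-zero entries of the smaller vector."""
    if len(u) > len(v):
        u, v = v, u
    return sum(val * v[idx] for idx, val in u.items() if idx in v)

# Vectors with tens of thousands of attributes but only a handful of non-zeros
u = {3: 1.5, 40_000: 2.0, 99_999: -1.0}
v = {3: 4.0, 17: 0.5, 99_999: 2.0}
print(sparse_dot(u, v))  # 1.5*4.0 + (-1.0)*2.0 = 4.0
```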

Applications
Machine vision: e.g. face identification
- Outperforms alternative approaches (1.5% error)
Handwritten digit recognition: USPS data
- Comparable to the best alternative (0.8% error)
Bioinformatics: e.g. prediction of protein secondary structure
Text classification
The SVM technique can be modified for numeric prediction problems

Support vector regression
The maximum margin hyperplane only applies to classification
However, the idea of support vectors and kernel functions can be used for regression
The basic method is the same as in linear regression: we want to minimize error
- Difference A: ignore errors smaller than ε and use absolute error instead of squared error
- Difference B: simultaneously aim to maximize the flatness of the function
The user-specified parameter ε defines a "tube"

More on SVM regression
If there are tubes that enclose all the training points, the flattest of them is used
- E.g.: the mean is used if 2ε > range of target values
The model can be written as:
$x = b + \sum_{i \text{ is supp. vector}} \alpha_i \, \mathbf{a}(i) \cdot \mathbf{a}$
Support vectors: points on or outside the tube
The dot product can be replaced by a kernel function
Note: the coefficients $\alpha_i$ may be negative
No tube that encloses all the training points?
- Requires a trade-off between error and flatness
- Controlled by an upper limit C on the absolute value of the coefficients $\alpha_i$
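For a concrete feel of the ε-tube, here is a small sketch using scikit-learn's SVR on invented 1-D data (an implementation choice of mine; the slides refer to WEKA). Widening the tube leaves fewer points on or outside it, so the number of support vectors drops.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy 1-D regression problem (toy data)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# epsilon sets the width of the tube; C limits |alpha_i| when no tube fits all points
for eps in (0.5, 0.1):
    model = SVR(kernel="rbf", epsilon=eps, C=1.0).fit(X, y)
    print(f"epsilon={eps}: {len(model.support_)} support vectors "
          "(points on or outside the tube)")
```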

Examples
[Three plots in the original slides: SVM regression fits with ε = 2, ε = 1, and ε = 0.5]