Machine Learning Techniques for Data Mining


Machine Learning Techniques for Data Mining. Eibe Frank, University of Waikato, New Zealand.

PART VII. Moving on: Engineering the input and output

Applying a learner is not all. Already discussed: scheme/parameter selection. Important: the selection process should be treated as part of the learning process. Modifying the input: attribute selection, discretization, data cleansing, transformations. Modifying the output: combining classification models to improve performance, via bagging, boosting, stacking, error-correcting output codes (and Bayesian model averaging).

Attribute selection. Adding a random (i.e. irrelevant) attribute can significantly degrade C4.5's performance. Problem: attribute selection further down the tree is based on smaller and smaller amounts of data. IBL is also very susceptible to irrelevant attributes: the number of training instances required increases exponentially with the number of irrelevant attributes. Naïve Bayes doesn't have this problem. Relevant attributes can also be harmful.

Scheme-independent selection. Filter approach: assessment is based on general characteristics of the data. One method: find the smallest subset of attributes that is sufficient to separate all the instances. Another method: use a different learning scheme (e.g. C4.5, 1R) to select attributes. IBL-based attribute weighting techniques can also be used (but they can't find redundant attributes). CFS: uses correlation-based evaluation of attribute subsets.
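
A minimal sketch of CFS-style correlation-based subset evaluation in Python. The merit formula is the standard CFS heuristic; using NumPy and Pearson correlation on numeric attributes is an assumption for illustration (Weka's CFS uses symmetrical uncertainty for nominal data):

import numpy as np

def cfs_merit(X, y, subset):
    # Merit = k*r_cf / sqrt(k + k*(k-1)*r_ff) for the given attribute subset
    k = len(subset)
    if k == 0:
        return 0.0
    # mean attribute-class correlation
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    # mean attribute-attribute correlation
    if k == 1:
        r_ff = 0.0
    else:
        pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

Subsets with higher merit are preferred: the numerator rewards attributes correlated with the class, while the denominator penalizes attributes correlated with each other (redundancy).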

Attribute subsets for weather data (figure).

Searching the attribute space. The number of possible attribute subsets is exponential in the number of attributes. Common greedy approaches: forward selection and backward elimination. More sophisticated strategies: bidirectional search; best-first search (can find the optimum solution); beam search (an approximation to best-first search); genetic algorithms.

Scheme-specific selection. Wrapper approach: attribute selection implemented as a wrapper around the learning scheme. Evaluation criterion: cross-validation performance. Time consuming: adds a factor of k^2 even for greedy approaches with k attributes. Linearity in k requires prior ranking of the attributes. Scheme-specific attribute selection is essential for learning decision tables. It can be done efficiently for decision tables and Naïve Bayes.
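
A minimal sketch of wrapper-style forward selection in Python, assuming a scikit-learn classifier and cross-validation accuracy as the evaluation criterion (the function name and stopping rule are illustrative, not from the slides). The k^2 factor shows up as up to k rounds, each evaluating up to k candidate attributes:

import numpy as np
from sklearn.model_selection import cross_val_score

def forward_selection(estimator, X, y, cv=10):
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # evaluate adding each remaining attribute to the current subset
        scores = [(cross_val_score(estimator, X[:, selected + [j]], y, cv=cv).mean(), j)
                  for j in remaining]
        score, j = max(scores)
        if score <= best_score:       # stop when no candidate improves performance
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected, best_score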

Discretizing numeric attributes. Can be used to avoid making the normality assumption in Naïve Bayes and clustering. A simple discretization scheme is used in 1R. C4.5 performs local discretization. Global discretization can be advantageous because it's based on more data. The learner can be applied to the discretized attribute, or to binary attributes coding the cut points in the discretized attribute.

Unsupervised discretization. Unsupervised discretization generates intervals without looking at class labels. It is the only possibility when clustering (no class labels are available). Two main strategies: equal-interval binning and equal-frequency binning (also called histogram equalization). Inferior to supervised schemes in classification tasks.
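
A minimal sketch of the two binning strategies in Python/NumPy; the three-bin setting and the temperature values are illustrative:

import numpy as np

def equal_interval_bins(x, k):
    # split the attribute's range into k intervals of equal width
    edges = np.linspace(x.min(), x.max(), k + 1)
    return np.digitize(x, edges[1:-1])       # bin index 0..k-1 for each instance

def equal_frequency_bins(x, k):
    # choose cut points so that each bin holds roughly the same number of instances
    edges = np.quantile(x, np.linspace(0, 1, k + 1))
    return np.digitize(x, edges[1:-1])

temperature = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])
print(equal_interval_bins(temperature, 3))
print(equal_frequency_bins(temperature, 3))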

Entropy-based discretization. Supervised method that builds a decision tree with pre-pruning on the attribute being discretized. Entropy is used as the splitting criterion and MDLP as the stopping criterion. State-of-the-art discretization method. Application of MDLP: the "theory" is the splitting point (log2(N-1) bits) plus the class distribution in each subset. The description length before and after adding the splitting point is compared.

Example: temperature attribute (figure).

Formula for MDLP. N instances, k classes, and entropy E in the original set; k1 classes and entropy E1 in the first subset; k2 classes and entropy E2 in the second subset. The split is accepted only if gain > log2(N-1)/N + [log2(3^k - 2) - k*E + k1*E1 + k2*E2]/N. This doesn't result in any discretization intervals for the temperature attribute.
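
A minimal sketch of this stopping criterion in Python, assuming NumPy arrays of class labels for the original set and the two subsets produced by a candidate split:

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mdlp_accepts_split(labels, left, right):
    # accept the split only if its information gain exceeds the MDL cost
    N = len(labels)
    k = len(np.unique(labels))
    k1, k2 = len(np.unique(left)), len(np.unique(right))
    E, E1, E2 = entropy(labels), entropy(left), entropy(right)
    gain = E - (len(left) / N) * E1 - (len(right) / N) * E2
    delta = np.log2(3**k - 2) - k * E + k1 * E1 + k2 * E2
    return gain > (np.log2(N - 1) + delta) / N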

Other discretization methods. The top-down procedure can be replaced by a bottom-up method. MDLP can be replaced by a chi-squared test. Dynamic programming can be used to find the optimum k-way split for a given additive criterion. This requires time quadratic in the number of instances if entropy is used as the criterion, but can be done in linear time if error rate is used as the evaluation criterion.

Error-based vs. entropy-based discretization (figure).

The converse of discretization. Scheme used by IB1: indicator attributes. This doesn't make use of potential ordering information. M5 generates an ordering of the nominal values and codes the ordering using binary attributes. This strategy can be used for any attribute whose values are ordered. It avoids the problem of using an integer attribute to code the ordering, which would imply a metric. In general: subsets of values are coded as binary attributes.
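
A minimal sketch of this coding in Python: each nominal value is replaced by k-1 binary attributes indicating whether it lies above each point in the ordering. The example values and the ordering are illustrative assumptions; M5 derives the ordering from the data:

import numpy as np

def ordered_binary_coding(values, ordering):
    # map each nominal value to k-1 binary "above this point in the ordering" attributes
    rank = {v: i for i, v in enumerate(ordering)}
    k = len(ordering)
    return np.array([[1 if rank[v] >= t else 0 for t in range(1, k)]
                     for v in values])

print(ordered_binary_coding(["cool", "mild", "hot"], ["cool", "mild", "hot"]))
# cool -> [0, 0], mild -> [1, 0], hot -> [1, 1]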

Automatic data cleansing. Improving decision trees: relearn the tree with misclassified instances removed. A better strategy (of course): let a human expert check the misclassified instances. When systematic noise is present it's better not to modify the data. Also, attribute noise should be left in the training set, while (unsystematic) class noise in the training set should be eliminated if possible.

Robust regression. Statistical methods that address the problem of outliers are called robust. Possible ways of making regression more robust: minimize absolute error instead of squared error; remove outliers (e.g. the 10% of points farthest from the regression plane); minimize the median instead of the mean of the squares (copes with outliers in both the x and y directions). The least median of squares line finds the narrowest strip covering half the observations.
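
A minimal sketch of least median of squares for simple linear regression in Python. The brute-force search over lines through pairs of points is an illustrative assumption, not the algorithm used in practice:

import itertools
import numpy as np

def lms_line(x, y):
    best, best_median = None, np.inf
    for i, j in itertools.combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        median = np.median((y - (slope * x + intercept)) ** 2)   # median, not mean, of squares
        if median < best_median:
            best, best_median = (slope, intercept), median
    return best   # (slope, intercept) of the least-median-of-squares line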

Example: least median of squares (figure).

Detecting anomalies. Visualization is the best way of detecting anomalies (but it often can't be done). Automatic approach: a committee of different learning schemes, e.g. a decision tree, a nearest-neighbor learner, and a linear discriminant function. Conservative approach: only delete instances that are incorrectly classified by all of them. Problem: this might sacrifice instances of small classes.
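
A minimal sketch of such a committee-based filter in Python, assuming scikit-learn models mirroring the slide's example; using cross-validated predictions (so each model is judged on instances it was not trained on) is an added assumption:

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def consensus_filter(X, y, cv=10):
    committee = [DecisionTreeClassifier(), KNeighborsClassifier(),
                 LinearDiscriminantAnalysis()]
    wrong_by_all = np.ones(len(y), dtype=bool)
    for model in committee:
        preds = cross_val_predict(model, X, y, cv=cv)
        wrong_by_all &= (preds != y)          # misclassified by every member so far
    # keep only the instances that at least one committee member classifies correctly
    return X[~wrong_by_all], y[~wrong_by_all]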

Combining multiple models. Basic idea of meta learning schemes: build different experts and let them vote. Advantage: often improves predictive performance. Disadvantage: produces output that is very hard to analyze. Schemes we will discuss: bagging, boosting, stacking, and error-correcting output codes. The first three can be applied to both classification and numeric prediction problems.

Bagging. Employs the simplest way of combining predictions: voting/averaging, where each model receives equal weight. Idealized version of bagging: sample several training sets of size n (instead of just having one training set of size n), build a classifier for each training set, and combine the classifiers' predictions. This improves performance in almost all cases if the learning scheme is unstable (e.g. decision trees).

Bias-variance decomposition. A theoretical tool for analyzing how much the specific training set affects the performance of a classifier. Assume we have an infinite number of classifiers built from different training sets of size n. The bias of a learning scheme is the expected error of the combined classifier on new data. The variance of a learning scheme is the expected error due to the particular training set used. Total expected error: bias + variance.

More on bagging. Bagging reduces variance by voting/averaging, thus reducing the overall expected error. In the case of classification there are pathological situations where the overall error might increase. Usually, the more classifiers the better. Problem: we only have one dataset! Solution: generate new datasets of size n by sampling with replacement from the original dataset. Bagging can help a lot if the data is noisy.

Bagging classifiers

Model generation:
  Let n be the number of instances in the training data.
  For each of t iterations:
    Sample n instances with replacement from the training set.
    Apply the learning algorithm to the sample.
    Store the resulting model.

Classification:
  For each of the t models:
    Predict the class of the instance using the model.
  Return the class that has been predicted most often.
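
A minimal Python sketch of this procedure, assuming scikit-learn decision trees as the unstable base learner and integer class labels:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, t=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)          # sample n instances with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.array([m.predict(X) for m in models])
    # for each instance, return the class predicted most often
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])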

Boosting. Also uses voting/averaging, but models are weighted according to their performance. Iterative procedure: new models are influenced by the performance of previously built ones. Each new model is encouraged to become an expert for instances classified incorrectly by earlier models. Intuitive justification: models should be experts that complement each other. There are several variants of this algorithm.

AdaBoost.M1

Model generation:
  Assign equal weight to each training instance.
  For each of t iterations:
    Apply the learning algorithm to the weighted dataset and store the resulting model.
    Compute the error e of the model on the weighted dataset and store the error.
    If e is equal to zero, or e is greater than or equal to 0.5:
      Terminate model generation.
    For each instance in the dataset:
      If the instance is classified correctly by the model:
        Multiply the weight of the instance by e / (1 - e).
    Normalize the weights of all instances.

Classification:
  Assign a weight of zero to all classes.
  For each of the t (or fewer) models:
    Add -log(e / (1 - e)) to the weight of the class predicted by the model.
  Return the class with the highest weight.
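
A minimal Python sketch of AdaBoost.M1 as described above, assuming scikit-learn decision stumps (which accept instance weights via sample_weight) as the base learner:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_fit(X, y, t=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                     # equal initial weights
    models, alphas = [], []
    for _ in range(t):
        model = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = model.predict(X)
        e = w[pred != y].sum() / w.sum()        # weighted error
        if e == 0 or e >= 0.5:
            break                               # terminate model generation
        w[pred == y] *= e / (1 - e)             # down-weight correctly classified instances
        w /= w.sum()                            # normalize weights
        models.append(model)
        alphas.append(-np.log(e / (1 - e)))     # voting weight of this model
    return models, alphas

def adaboost_m1_predict(models, alphas, x):
    scores = {}
    for model, alpha in zip(models, alphas):
        c = model.predict(x.reshape(1, -1))[0]
        scores[c] = scores.get(c, 0.0) + alpha
    return max(scores, key=scores.get)          # class with the highest total weight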

More on boosting. Boosting can be applied without weights by resampling with probability determined by the weights. Disadvantage: not all instances are used. Advantage: resampling can be repeated if the error exceeds 0.5. Boosting stems from computational learning theory. Theoretical result: the training error decreases exponentially. It also works if the base classifiers are not too complex and their error doesn't become too large too quickly.

A bit more on boosting. Puzzling fact: the generalization error can decrease long after the training error has reached zero. This seems to contradict Occam's Razor! However, the problem disappears if the margin (confidence) is considered instead of the error. Margin: the difference between the estimated probability for the true class and that of the most likely other class (between -1 and 1). Boosting works with weak learners: the only condition is that their error doesn't exceed 0.5. LogitBoost: a more sophisticated boosting scheme.

Stacking. Hard to analyze theoretically: "black magic". Uses a meta learner instead of voting to combine the predictions of base learners. The predictions of the base learners (level-0 models) are used as input for the meta learner (level-1 model). The base learners are usually different learning schemes. Predictions on the training data can't be used to generate the data for the level-1 model! A cross-validation-like scheme is employed.

More on stacking. If the base learners can output probabilities, it's better to use those as input to the meta learner. Which algorithm should be used to generate the meta learner? In principle, any learning scheme can be applied. David Wolpert: use a "relatively global, smooth" model, because the base learners do most of the work and this reduces the risk of overfitting. Stacking can also be applied to numeric prediction (and density estimation).
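
A minimal Python sketch of stacking with cross-validated level-0 predictions, assuming scikit-learn base learners and logistic regression as the level-1 learner, and using predicted probabilities as the meta features as suggested above:

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def stacking_fit(X, y, cv=10):
    level0 = [DecisionTreeClassifier(), KNeighborsClassifier()]
    # level-1 training data must come from predictions on held-out folds,
    # never from predictions on the data the base learners were trained on
    meta_X = np.hstack([cross_val_predict(m, X, y, cv=cv, method="predict_proba")
                        for m in level0])
    level1 = LogisticRegression().fit(meta_X, y)
    fitted0 = [m.fit(X, y) for m in level0]     # refit base learners on all the data
    return fitted0, level1

def stacking_predict(fitted0, level1, X):
    meta_X = np.hstack([m.predict_proba(X) for m in fitted0])
    return level1.predict(meta_X)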

Error-correcting output codes. A very elegant method of transforming a multiclass problem into several two-class problems. Simple scheme: as many binary class attributes as original classes, using one-per-class coding:

class  class vector
a      1000
b      0100
c      0010
d      0001

Idea: use error-correcting codes instead.

More on ECOCs. Example:

class  class vector
a      1111111
b      0000111
c      0011001
d      0101010

What's the true class if the base classifiers predict 1011111? We want code words for which the minimum Hamming distance d between any pair of words is large. Up to (d-1)/2 single-bit errors can then be corrected.
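
A minimal Python sketch of ECOC decoding for this code table: each output bit is predicted by a separate binary classifier, and the class whose code word is closest in Hamming distance to the predicted bit string is returned:

CODE = {"a": "1111111", "b": "0000111", "c": "0011001", "d": "0101010"}

def hamming(u, v):
    return sum(b1 != b2 for b1, b2 in zip(u, v))

def decode(predicted_bits):
    # return the class whose code word is nearest to the predicted bit string
    return min(CODE, key=lambda c: hamming(CODE[c], predicted_bits))

print(decode("1011111"))   # 'a': its code word is at Hamming distance 1, the others at 3 or more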

A bit more on ECOCs. Two criteria for error-correcting output codes: row separation (minimum distance between rows) and column separation (minimum distance between columns, and between columns and their complements). Why? Because if columns are identical, the base classifiers will make the same errors, and error correction is weakened if errors are correlated. ECOCs only work for problems with more than 3 classes: for 3 classes there are only 2^3 possible columns.

Exhaustive ECOCs. With few classes, exhaustive codes can be built (like the one on the earlier slide). Exhaustive code for k classes: the columns comprise every possible k-string, except for complements and the all-zero/all-one strings. Each code word contains 2^(k-1) - 1 bits. The code word for the 1st class is all ones; the 2nd class is 2^(k-2) zeroes followed by 2^(k-2) - 1 ones; the i-th class consists of alternating runs of 2^(k-i) zeroes and ones, the last run being one short.
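
A minimal Python sketch that generates the exhaustive code from this run-length description; for k = 4 it reproduces the code table shown two slides earlier:

def exhaustive_ecoc(k):
    length = 2 ** (k - 1) - 1
    rows = ["1" * length]                       # 1st class: all ones
    for i in range(2, k + 1):
        run = 2 ** (k - i)                      # alternating runs of zeroes and ones
        pattern = ("0" * run + "1" * run) * length
        rows.append(pattern[:length])           # truncation leaves the last run one short
    return rows

print(exhaustive_ecoc(4))   # ['1111111', '0000111', '0011001', '0101010']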

One last slide on ECOCs. With more classes, exhaustive codes are infeasible: the number of columns increases exponentially. Random code words have good error-correcting properties on average! More sophisticated methods exist for generating ECOCs using a small number of columns. ECOCs don't work with a nearest-neighbor classifier, but they do work if different attribute subsets are used to predict each output bit.