Data Warehousing and Machine Learning


Data Warehousing and Machine Learning: Introduction. Thomas D. Nielsen, Aalborg University, Department of Computer Science, Spring 2008. DWML Spring 2008 1 / 47

What is Data Mining? Introduction DWML Spring 2008 2 / 47

What is Data Mining? Definitions
- Knowledge Discovery is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [Frawley, Piatetsky-Shapiro, Matheus 1991].
- Data Mining is a step in the KDD process consisting of applying computational techniques that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data [Fayyad, Piatetsky-Shapiro, Smyth 1996].
- Data Mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner [Hand, Mannila, Smyth 2001].
- The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience [Mitchell, 1997].
Data Mining vs. Machine Learning
- Different roots: information extraction vs. intelligent machines
- Today a very large overlap of techniques and applications
- Some remaining differences: emphasis on large datasets (DM), theoretical analysis of learnability (ML), ...
- For this course: Data Mining = Machine Learning
Introduction DWML Spring 2008 3 / 47

What is Data Mining? Data Mining in practice
Real-life data is preprocessed; an off-the-shelf algorithm is adapted to it; then evaluate and iterate. The algorithm side contributes general algorithmic methods, the data side data/domain-specific operations.
Introduction DWML Spring 2008 4 / 47

What is Data Mining? Background
CRISP-DM was developed by a four-member consortium in an EU project. Members of the consortium:
- Teradata (NCR)
- SPSS (statistical software)
- DaimlerChrysler
- OHRA (insurance and banking)
The consortium was supported by a special interest group composed of over 300 organizations involved in data mining projects.
Aim, from http://www.crisp-dm.org/: "The CRISP-DM project has developed an industry- and tool-neutral Data Mining process model. [...] this project defined and validated a data mining process that is applicable in diverse industry sectors. This will make large data mining projects faster, cheaper, more reliable and more manageable. Even small scale data mining investigations will benefit from using CRISP-DM."
Introduction DWML Spring 2008 5 / 47

What is Data Mining? Phases of the CRISP-DM Process Model (illustration from www.crisp-dm.org) Introduction DWML Spring 2008 6 / 47

What is Data Mining? Business/Data understanding
Vision: Data Mining extracts whatever interesting hidden information there is in the data.
Reality: Data Mining techniques solve several types of well-defined tasks.
Reality: The data used must support the task at hand.
Reality: The data miner must understand the background of the data in order to select an appropriate data mining technique.
Introduction DWML Spring 2008 7 / 47

What is Data Mining? Our Focus Introduction DWML Spring 2008 8 / 47

What is Data Mining? Selecting the Modeling Technique
The tool(s) selected result from narrowing down: the universe of techniques (defined by the tool), the techniques appropriate for the problem, political requirements (management, understandability), and constraints (time, data characteristics, staff training/knowledge).
Introduction DWML Spring 2008 9 / 47

Types of Tasks and Models
Prediction (supervised learning). Task: predict some (unobserved) target variable based on observed values of attribute variables; regression if the target is continuous, classification if the target is discrete. Models e.g.: decision trees, neural networks, Bayesian (classification) networks, ...
Clustering. Task: identify coherent subgroups in data. Models e.g.: k-means, hierarchical clustering, ...
Association analysis. Task: identify patterns of co-occurrence of attribute values. Models: Apriori and extensions.
Visualization (exploratory data analysis). Task: find intelligible visualizations of relevant data properties. Models: graphs, plots, ...
Types of tasks and models DWML Spring 2008 10 / 47

Example: Regression Nutritional rating of cereals Data: nutritional information and ratings for 77 cereals. Task: find best linear approximation of the dependency of rating on sugars. Types of tasks and models DWML Spring 2008 11 / 47
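Finding the best linear approximation is ordinary least squares; a minimal sketch in Python, with made-up numbers standing in for the 77-cereal data (the real ratings are not reproduced here):

```python
import statistics

def linear_fit(xs, ys):
    """Least-squares line y = a*x + b: the 'best linear approximation'."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    return a, my - a * mx

# Made-up (sugars, rating) points, illustrating a rating that drops with sugar:
sugars = [0, 3, 6, 9, 12, 15]
ratings = [60, 52, 45, 38, 30, 22]
a, b = linear_fit(sugars, ratings)   # rating is approximated by a*sugars + b
```

On real cereal data one would expect a negative slope a, i.e. higher sugar content predicting a lower rating.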

Example: Classification Text Categorization
The Association for Computing Machinery (ACM) maintains a subject classification scheme for computer science research papers. Part of the subject hierarchy (1998 version):
I. Computing Methodologies
  I.2 Artificial Intelligence
    I.2.6 Learning: Analogies; Concept learning; Connectionism and neural nets; Induction; Knowledge acquisition; Language acquisition; Parameter learning
Papers are manually classified by authors or editors. Data: collection of classified papers (full text or abstracts). Task: build a classifier that automatically assigns a subject index to new, unclassified papers.
Types of tasks and models DWML Spring 2008 12 / 47

Example: Classification Spam Filtering Spam filtering in Mozilla: user trains the mail reader to recognize spam by manually labeling incoming mails as spam/no spam. Data: collection of user-classified emails (full text). Task: build a classifier that automatically categorizes an incoming email as spam/no spam Types of tasks and models DWML Spring 2008 13 / 47

Example: Classification Character Recognition
An example of a pattern recognition problem (pattern recognition is an older discipline than data mining, but can now also be seen as a sub-area of data mining). Data: collection of handwritten characters, correctly labeled. Task: build a classifier that identifies new handwritten characters.
Types of tasks and models DWML Spring 2008 14 / 47

Example: Classification Credit Rating From existing customer data predict whether a person applying for a new loan will repay or default on the loan. Data: existing customer records with attributes like age, employment type, income,... and information on payback history. Task: build a classifier that predicts whether a new customer will repay the loan. Types of tasks and models DWML Spring 2008 15 / 47

Examples: Clustering Web Mining
Automatically detect similarity between web pages (e.g. to support search engines or automatic construction of internet directories). Data: the WWW. Task: construct a (similarity) model for pages on the WWW.
Types of tasks and models DWML Spring 2008 16 / 47

Examples: Clustering Bioinformatics: Phylogenetic Trees
From biological data construct a model of evolution. (Figure: phylogenetic tree over Lactococcus Lactis, Caulobacter Crescentus, Bacillus Halodurans, Bacillus Subtilis, Rattus Norvegicus, Pan Troglodytes, Homo Sapiens.) Data: e.g. genome sequences of different animal species. Task: construct a hierarchical model of similarity between the species.
Types of tasks and models DWML Spring 2008 17 / 47

Examples: Association Analysis Association Rules
Data: transaction data. Task: infer association rules.
Transaction | Items bought
1           | Beer, Soap, Milk, Butter
2           | Beer, Chips, Butter
3           | Milk, Spaghetti, Butter, Tomatos
...         | ...
Example rules: {Beer} => {Chips}, {Spaghetti, Tomatos} => {Wine}, ...
Types of tasks and models DWML Spring 2008 18 / 47
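The support and confidence computations behind such rules can be sketched in a few lines of Python; this is a hand-rolled illustration over the three transactions above, not the Apriori algorithm itself:

```python
# The three transactions from the table above.
transactions = [
    {"Beer", "Soap", "Milk", "Butter"},
    {"Beer", "Chips", "Butter"},
    {"Milk", "Spaghetti", "Butter", "Tomatos"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of the rule lhs => rhs: support(lhs | rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

confidence({"Beer"}, {"Chips"})   # support 1/3 divided by support 2/3, i.e. 0.5
```

Apriori prunes the exponential space of candidate itemsets by keeping only those whose subsets are all frequent; the scoring above stays the same.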

Tools
WEKA: free open-source Java toolbox (www.cs.waikato.ac.nz/ml/weka/); many methods, good interface.
Clementine: commercial system, Windows only; many methods, good interface, integrated use of MS SQL Server.
For all toolboxes: easy use of methods can be dangerous; correct interpretation of results requires understanding of the methods. Documentation is essential (and often a weak point...)!
Types of tasks and models DWML Spring 2008 19 / 47

Data Warehousing and Machine Learning Decision trees Thomas D. Nielsen Aalborg University Department of Computer Science Spring 2008 Types of tasks and models DWML Spring 2008 20 / 47

Classification A high-level view
A classifier maps observed attribute values to a class label. Spam filtering: SubAllCap (yes/no), TrustSend (yes/no), InvRet (yes/no), Body adult (yes/no), Body zambia (yes/no) -> Classifier -> Spam (yes/no).
Character recognition: Cell-1 (1..64), Cell-2 (1..64), Cell-3 (1..64), ..., Cell-324 (1..64) -> Classifier -> Symbol (A..Z, 0..9).
Classification DWML Spring 2008 21 / 47

Classification Labeled Data
Rows are instances (cases, examples); columns are attributes (features, predictor variables) plus the class variable (target variable).
SubAllCap | TrustSend | InvRet | ... | B zambia | Spam
y         | n         | n      | ... | n        | y
n         | n         | n      | ... | n        | n
n         | y         | n      | ... | n        | y
n         | n         | n      | ... | n        | n
...       | ...       | ...    | ... | ...      | ...
Cell-1 | Cell-2 | Cell-3 | ... | Cell-324 | Symbol
1      | 1      | 4      | ... | 12       | B
1      | 1      | 1      | ... | 3        | 1
34     | 37     | 43     | ... | 22       | Z
1      | 1      | 1      | ... | 7        | 0
...    | ...    | ...    | ... | ...      | ...
(In principle, any attribute can become the designated class variable.)
Classification DWML Spring 2008 22 / 47

Classification Attribute Types
Each attribute (including the class variable) has associated with it a set of possible values or states. E.g. States(A) = {yes, no}, States(A) = {red, blue, green}, States(A) = {010100, 020100, ..., 311299}, States(A) = R.
States(A) finite: A is called discrete.
States(A) = R: A is called continuous or numeric.
States(A) = N: A can be interpreted as continuous (N is a subset of R), or made discrete by replacing N e.g. with {1, 2, ..., 100, > 100} (few data mining methods are specifically adapted to integer-valued attributes).
Classification DWML Spring 2008 23 / 47

Classification Complete/Incomplete Data
Name          | Gender | DoB    | Income | Customer since | Last Purchase
Thomas Jensen | m      | 050367 | 190000 | 010397         | 250504
Jens Nielsen  | m      | 171072 | 250000 | 051103         | 040204
Lene Hansen   | f      | 021159 | 140000 | 300300         | 250105
Ulla Sørensen | f      | 220879 | 210000 | 180998         | 031099
...
The same data with missing values (incomplete data):
Name          | Gender | DoB    | Income | Customer since | Last Purchase
Thomas Jensen | m      | 050367 | 190000 | 010397         | 250504
Jens Nielsen  | m      | ?      | ?      | 051103         | 040204
Lene Hansen   | f      | 021159 | ?      | 300300         | 250105
Ulla Sørensen | f      | ?      | ?      | 180998         | 031099
...
Classification DWML Spring 2008 24 / 47

Classification Classification
Classification data in general:
- Attributes: variables A_1, A_2, ..., A_n (discrete or continuous)
- Class variable: variable C, always discrete: States(C) = {c_1, ..., c_l} (the set of class labels)
A (complete-data) classifier is a mapping C : States(A_1, ..., A_n) -> States(C).
A classifier able to handle incomplete data provides mappings C : States(A_i1, ..., A_ik) -> States(C) for subsets {A_i1, ..., A_ik} of {A_1, ..., A_n}.
A classifier partitions the attribute-value space (also: instance space) into subsets labelled with class labels.
Classification DWML Spring 2008 25 / 47

Classification Iris dataset
Measurements of petal width/length and sepal width/length for 150 flowers of 3 different species of Iris. First reported in: Fisher, R.A., "The use of multiple measurements in taxonomic problems", Annual Eugenics, 7 (1936).
SL  | SW  | PL  | PW  | Species
5.1 | 3.5 | 1.4 | 0.2 | Setosa
4.9 | 3.0 | 1.4 | 0.2 | Setosa
6.3 | 2.9 | 6.0 | 2.1 | Virginica
6.3 | 2.5 | 4.9 | 1.5 | Versicolor
... | ... | ... | ... | ...
Classification DWML Spring 2008 26 / 47

Classification Labeled data in instance space: the classifier defines a partition of the instance space into regions labelled Virginica, Versicolor and Setosa. Classification DWML Spring 2008 27 / 47

Classification Decision Regions
Axis-parallel linear: e.g. Decision Trees. Piecewise linear: e.g. Naive Bayes. Nonlinear: e.g. Neural Network.
Classification DWML Spring 2008 28 / 47

Classification Classifiers differ in...
- Model space: the types of partitions and their representation
- How they compute the class label corresponding to a point in instance space (the actual classification task)
- How they are learned from data
Some important types of classifiers: decision trees, the Naive Bayes classifier, other probabilistic classifiers (TAN, ...), neural networks, k-nearest neighbors.
Classification DWML Spring 2008 29 / 47

Decision Trees Example
Attributes: height in [0, 2.5], sex in {m, f}. Class labels: {tall, short}.
Partition of the instance space and its representation by a decision tree:
sex = m: height < 1.8 -> short, height >= 1.8 -> tall
sex = f: height < 1.7 -> short, height >= 1.7 -> tall
Decision tree structure DWML Spring 2008 30 / 47

Decision Trees A decision tree is a tree
- whose internal nodes are labeled with attributes,
- whose leaves are labeled with class labels,
- whose edges going out from a node labeled with attribute A are labeled with subsets of States(A), such that all labels combined form a partition of States(A).
Possible partitions e.g.:
States(A) = R: [-inf, 2.3[, [2.3, inf] or [-inf, 1.9[, [1.9, 3.5[, [3.5, inf]
States(A) = {a, b, c}: {a}, {b}, {c} or {a, b}, {c}
Decision tree structure DWML Spring 2008 31 / 47

Decision Trees Decision tree classification
Each point in the instance space is sorted into a leaf by the decision tree and is classified according to the class label at that leaf. For the example tree (split on sex, then on height with threshold 1.8 for m and 1.7 for f), the instance [m, 1.85] is sorted into the leaf labelled tall: C([m, 1.85]) = tall.
Decision tree classification DWML Spring 2008 32 / 47
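The sorting of [m, 1.85] into a leaf can be traced with a small sketch of the example tree; the nested-tuple encoding is just one hypothetical choice of representation:

```python
# The example tree from the slides: the root tests sex; each branch then tests
# height against a threshold (1.8 for m, 1.7 for f); below -> short, else tall.
tree = ("sex", {
    "m": ("height", 1.8, "short", "tall"),
    "f": ("height", 1.7, "short", "tall"),
})

def classify(instance, node=tree):
    """Sort the instance into a leaf and return the leaf's class label."""
    attr, branches = node
    a, threshold, below, at_or_above = branches[instance[attr]]
    return below if instance[a] < threshold else at_or_above

classify({"sex": "m", "height": 1.85})   # C([m, 1.85]) = tall
```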

Decision Trees How to learn a decision tree? Given a dataset:
Id | Savings | Assets | Income ($ 1000s) | Credit Risk
1  | Medium  | High   | 75               | Good
2  | Low     | Low    | 50               | Bad
3  | High    | Medium | 25               | Bad
4  | Medium  | Medium | 50               | Good
5  | Low     | Medium | 100              | Good
6  | High    | High   | 25               | Good
7  | Low     | Low    | 25               | Bad
8  | Medium  | Medium | 75               | Good
We want to build a decision tree that is small and has high classification accuracy.
Decision tree learning DWML Spring 2008 33 / 47

Decision Trees Some simple candidate trees:
Savings: L -> {2,5,7} (G:1, B:2); M -> {1,4,8} (G:3, B:0); H -> {3,6} (G:1, B:1)
Assets: L -> {2,7} (G:0, B:2); M -> {3,4,5,8} (G:3, B:1); H -> {1,6} (G:2, B:0)
Income: <= 50 -> {2,3,4,6,7} (G:2, B:3); > 50 -> {1,5,8} (G:3, B:0)
Income: <= 25 -> {3,6,7} (G:1, B:2); > 25 -> {1,2,4,5,8} (G:4, B:1)
Decision tree learning: selecting a root DWML Spring 2008 34 / 47

Decision Trees How accurate are these trees? Accurate trees have pure class label distributions at the leaves:
pure: (2,0) (0,2) (3,0)    impure: (1,2) (3,1) (2,3) (2,2) (1,1)
Entropy, a measure of impurity: for S = (x_1, x_2, ..., x_n) with x = sum_{i=1}^n x_i:
Entropy(S) = - sum_{i=1}^n (x_i/x) log2(x_i/x)
Entropy(2,0) = Entropy(0,2) = Entropy(3,0) = -(1 log2(1) + 0 log2(0)) = 0 + 0 = 0
Entropy(3,1) = -(0.75 log2(0.75) + 0.25 log2(0.25)) = 0.311 + 0.5 = 0.811
Entropy(2,3) = -(0.4 log2(0.4) + 0.6 log2(0.6)) = 0.528 + 0.442 = 0.97
Entropy(2,2) = Entropy(1,1) = -(0.5 log2(0.5) + 0.5 log2(0.5)) = 0.5 + 0.5 = 1.0
(using the convention 0 log2(0) = 0)
Decision tree learning: selecting a root DWML Spring 2008 35 / 47
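These entropy values are easy to check in Python; a direct transcription of the formula, with the convention 0 log2(0) = 0:

```python
from math import log2

def entropy(*counts):
    """Impurity -sum (x_i/x) * log2(x_i/x) of a class-count vector."""
    x = sum(counts)
    # Zero counts are skipped: by convention 0 * log2(0) = 0.
    return -sum(x_i / x * log2(x_i / x) for x_i in counts if x_i > 0)
```

Pure leaves such as (2,0) give entropy 0; the maximally impure (1,1) gives 1; (3,1) and (2,3) give roughly 0.811 and 0.971.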

Decision Trees Information Gain
Consider splitting 12 cases (9 positive, 3 negative) on a binary attribute A (true/false) or a ternary attribute B (L/M/H):
A: true -> (8,2), false -> (1,1); leaf entropies 0.722 and 1.0
B: L -> (2,0), M -> (5,1), H -> (2,2); leaf entropies 0.0, 0.65 and 1.0
Expected entropy:
A: (10/12) * 0.722 + (2/12) * 1.0 = 0.768
B: (2/12) * 0.0 + (6/12) * 0.65 + (4/12) * 1.0 = 0.658
Data entropy: Entropy(9,3) = 0.811
Information gain:
A: 0.811 - 0.768 = 0.043
B: 0.811 - 0.658 = 0.153
Decision tree learning: selecting a root DWML Spring 2008 36 / 47
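The A-versus-B comparison can be reproduced with a few lines of Python (entropy defined as on the previous slide):

```python
from math import log2

def entropy(counts):
    x = sum(counts)
    return -sum(c / x * log2(c / x) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the expected entropy of the children."""
    n = sum(parent)
    expected = sum(sum(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - expected

gain_a = information_gain([9, 3], [[8, 2], [1, 1]])           # about 0.043
gain_b = information_gain([9, 3], [[2, 0], [5, 1], [2, 2]])   # about 0.153
```

B has the higher gain, so a greedy learner such as ID3 would split on B first.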

Decision Trees Expected entropies:
Savings: L -> (1,2), M -> (3,0), H -> (1,1): (3/8) * 0.918 + (3/8) * 0.0 + (2/8) * 1.0 = 0.594
Assets: L -> (0,2), M -> (3,1), H -> (2,0): (2/8) * 0.0 + (4/8) * 0.811 + (2/8) * 0.0 = 0.405
Income (<= 50 / > 50): (2,3), (3,0): (5/8) * 0.97 + (3/8) * 0.0 = 0.606
Income (<= 25 / > 25): (1,2), (4,1): (3/8) * 0.918 + (5/8) * 0.722 = 0.795
Information gains are Entropy(5,3) = 0.954 minus these expected entropies; Assets has the lowest expected entropy and hence the highest gain.
Decision tree learning: selecting a root DWML Spring 2008 37 / 47

Decision Trees After the second (and final) ID3 iteration:
Assets: L -> {2,7} (G:0, B:2); M -> Savings; H -> {1,6} (G:2, B:0)
Savings (under Assets = M): L -> {5} (G:0, B:1); M -> {4,8} (G:2, B:0); H -> {3} (G:0, B:1)
Decision tree learning DWML Spring 2008 38 / 47

Decision Trees The same tree with class labels at the leaves:
Assets: L -> bad; M -> Savings (L -> bad, M -> good, H -> bad); H -> good
Decision tree learning DWML Spring 2008 38 / 47

Decision Trees Splitting continuous attributes
- Sort the continuous values in increasing order
- Candidate split points are midpoints between adjacent values
- Define a new attribute based on the candidate split point with highest gain
Example
Income: 25   25   25   50   50   75   75   100
Class:  Bad  Good Bad  Good Bad  Good Good Good
Split:  12.5   37.5   62.5   87.5   112.5
Gain:   0      0.1589 0.3476 0.0923 0
Entropy(S) = -(3/8) log2(3/8) - (5/8) log2(5/8) = 0.9544
Gain(S, I = 12.5) = 0.9544 - [(0/8) Entropy(S, I <= 12.5) + (8/8) Entropy(S, I > 12.5)] = 0
Gain(S, I = 37.5) = 0.9544 - [(3/8) Entropy(S, I <= 37.5) + (5/8) Entropy(S, I > 37.5)] = 0.1589
Gain(S, I = 62.5) = 0.9544 - [(5/8) Entropy(S, I <= 62.5) + (3/8) Entropy(S, I > 62.5)] = 0.3476
Gain(S, I = 87.5) = 0.9544 - [(7/8) Entropy(S, I <= 87.5) + (1/8) Entropy(S, I > 87.5)] = 0.0923
Gain(S, I = 112.5) = 0.9544 - [(8/8) Entropy(S, I <= 112.5) + (0/8) Entropy(S, I > 112.5)] = 0
Thus, we get an attribute with states <= 62.5 and > 62.5.
Decision tree learning: continuous attributes DWML Spring 2008 39 / 47
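The gains at the candidate split points can be verified with a short sketch; it matches the slide's row 0, 0.1589, 0.3476, 0.0923, 0 up to rounding:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_gain(values, labels, threshold):
    """Information gain of the binary split value <= threshold vs. > threshold."""
    classes = sorted(set(labels))
    def dist(side):
        return [side.count(c) for c in classes]
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    expected = sum(len(s) / n * entropy(dist(s)) for s in (left, right) if s)
    return entropy(dist(labels)) - expected

income = [25, 25, 25, 50, 50, 75, 75, 100]   # already sorted
risk = ["Bad", "Good", "Bad", "Good", "Bad", "Good", "Good", "Good"]
gains = {t: split_gain(income, risk, t) for t in (12.5, 37.5, 62.5, 87.5, 112.5)}
```

The maximum is attained at 62.5, confirming the derived attribute with states <= 62.5 and > 62.5.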

Decision Trees ID3 algorithm for decision tree learning
- Determine the attribute A with highest information gain (for continuous attributes: also determine the split value)
- Construct a decision tree with root A and one leaf for each value of A (two leaves if A is continuous)
- For a non-pure leaf L: determine the attribute B with highest information gain for the data sorted into L; replace L with a subtree consisting of root B and one leaf for each value of B (two leaves if B is continuous)
- Continue until all leaves are pure, or some other termination condition applies (e.g. all possible information gains below a given threshold)
- Label each leaf with the class label that is most frequent among the data sorted into the leaf
Decision tree learning: continuous attributes DWML Spring 2008 40 / 47
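Under the simplifying assumption of discrete attributes only (the continuous Income column and its split-point search are left out), ID3 as described above fits in a short recursive sketch over the slides' credit data:

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain of splitting the rows on attr."""
    n = len(labels)
    expected = 0.0
    for v in set(r[attr] for r in rows):
        sub = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        expected += len(sub) / n * entropy(sub)
    return entropy(labels) - expected

def id3(rows, labels, attrs):
    """Grow the tree until leaves are pure or no attributes remain."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]   # leaf: most frequent label
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    branches = {}
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        branches[v] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                          [a for a in attrs if a != best])
    return (best, branches)

# The credit data from the slides, discrete attributes only.
rows = [{"Savings": s, "Assets": a} for s, a in
        [("Medium", "High"), ("Low", "Low"), ("High", "Medium"),
         ("Medium", "Medium"), ("Low", "Medium"), ("High", "High"),
         ("Low", "Low"), ("Medium", "Medium")]]
risk = ["Good", "Bad", "Bad", "Good", "Good", "Good", "Bad", "Good"]
tree = id3(rows, risk, ["Savings", "Assets"])   # root splits on Assets
```

The root split agrees with the expected-entropy comparison above: Assets wins, its Low and High branches are already pure, and only the Medium branch needs a further split on Savings.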

Decision Trees Pros and Cons
+ Easy to interpret
+ Efficient learning methods
- Difficulties with handling missing data
Decision tree learning: continuous attributes DWML Spring 2008 41 / 47

Overfitting The problem
The learned tree (Assets: L -> bad, M -> Savings with L -> bad, M -> good, H -> bad; H -> good) makes the predictions:
Assets=M, Savings=M -> Risk=good
Assets=M, Savings=H -> Risk=bad
The training data contained a single case with Assets=M, Savings=H, and this case had the (uncharacteristic?) class label Risk=bad. The model is overfitted to the training data: with the prediction Assets=M, Savings=H -> Risk=good we will likely obtain a higher accuracy on future cases.
Overfitting DWML Spring 2008 42 / 47

Overfitting The general problem
Complex models represent properties of the training data very precisely. The training data may contain some peculiar properties that are not representative for the domain. The model will then not perform optimally in classifying future instances.
(Figure: classification error against model size, for training data and for future data.)
Overfitting DWML Spring 2008 43 / 47

Overfitting Decision Tree Pruning
To prevent overfitting, extensions of ID3 (C4.5, C5.0) add a pruning step after tree construction:
- Data is split into a training set and a test set
- The decision tree is learned using the training data only
- Pruning: for an internal node A, replace the subtree rooted at A with a leaf if this reduces the classification error on the test set
Overfitting DWML Spring 2008 44 / 47

Overfitting Example
Full tree: Assets: L -> bad, M -> Savings (L -> bad, M -> good, H -> bad), H -> good.
Pruned tree: Assets: L -> bad, M -> good, H -> good.
Test data (showing only cases with Assets=M):
Id | Savings | Assets | Income | Risk
9  | High    | Medium | 50     | Good
10 | Low     | Medium | 50     | Bad
11 | High    | Medium | 75     | Good
12 | Medium  | Medium | 50     | Good
Accuracy of the full tree on the test data: 50%. Accuracy of the pruned tree on the test data: 75%. Hence prune the Savings node.
Overfitting DWML Spring 2008 45 / 47
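The two accuracies can be checked directly; a tiny sketch over the four Assets=M test cases, with the Savings subtree hard-coded as in the full tree:

```python
# The four Assets=M test cases from the slide: (Savings, actual Risk).
cases = [("High", "Good"), ("Low", "Bad"), ("High", "Good"), ("Medium", "Good")]

def full_tree(savings):
    """Savings subtree under Assets=M: Low -> bad, Medium -> good, High -> bad."""
    return "Good" if savings == "Medium" else "Bad"

def pruned_tree(savings):
    """Subtree replaced by the single leaf 'good'."""
    return "Good"

def accuracy(predict):
    return sum(predict(s) == r for s, r in cases) / len(cases)

accuracy(full_tree), accuracy(pruned_tree)   # 0.5 and 0.75, as on the slide
```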

Overfitting Model Tuning with a Test Set
The data is split into a training set and a test set; a model is learned from the training set and applied to the test set; the tuning-parameter setting found this way is then used to learn the final model from all the data.
- Models can be adjusted or tuned (e.g. pruning subtrees, setting model parameters)
- Tuning can be an iterative process that requires repeated evaluations on the test set
- A final model is learned using all the data
- Problem: part of the data is wasted as test set
Overfitting DWML Spring 2008 46 / 47

Overfitting Cross Validation
- Partition the data into n subsets or folds (typically n = 10)
- For each setting of the tuning parameter: for i = 1 to n, learn a model using folds 1, ..., i-1, i+1, ..., n as training data and measure performance on fold i; model performance = average performance on the n test sets
- Choose the parameter setting with the best performance
- Learn the final model with the chosen parameter setting using the whole available data
Cross Validation is also used for the final evaluation of a learned model.
Overfitting DWML Spring 2008 47 / 47
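The inner loop for one parameter setting can be sketched generically; `learn` and `performance` are hypothetical caller-supplied functions, not part of any library:

```python
def cross_validate(data, n_folds, learn, performance):
    """n-fold cross validation: average performance over the n held-out folds.
    learn(train) -> model and performance(model, test) -> score are supplied
    by the caller."""
    folds = [data[i::n_folds] for i in range(n_folds)]   # n disjoint subsets
    scores = []
    for i in range(n_folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(performance(learn(train), folds[i]))
    return sum(scores) / n_folds
```

Run this once per candidate parameter setting, pick the setting with the best average score, and learn the final model from all the data with that setting.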