INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

Similar documents
ECLT 5810 Evaluation of Classification Quality

CS145: INTRODUCTION TO DATA MINING

Machine Learning and Bioinformatics 機器學習與生物資訊學

9. Conclusions. 9.1 Definition KDD

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

CS249: ADVANCED DATA MINING

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

Lecture 6 K- Nearest Neighbors(KNN) And Predictive Accuracy

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?

CS4491/CS 7265 BIG DATA ANALYTICS

K- Nearest Neighbors(KNN) And Predictive Accuracy

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

INTRODUCTION TO MACHINE LEARNING. Measuring model performance or error

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Weka ( )

Evaluating Classifiers

Artificial Intelligence. Programming Styles

Model s Performance Measures

Chapter 3: Supervised Learning

Data Mining Concepts

Large Scale Data Analysis Using Deep Learning

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

Classification. Instructor: Wei Ding

Machine Learning Techniques for Data Mining

Data Mining Concepts & Techniques

Information Management course

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Evaluating Classifiers

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Network Traffic Measurements and Analysis

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

Contents. Preface to the Second Edition

Lecture 25: Review I

Chapter 1, Introduction

How do we obtain reliable estimates of performance measures?

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

Evaluating Classifiers

Data Mining and Knowledge Discovery Practice notes 2

Classification and Regression

Basic Data Mining Technique

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

Data Mining and Knowledge Discovery: Practice Notes

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 5 of Data Mining by I. H. Witten, E. Frank and M. A.

An Improved Apriori Algorithm for Association Rules

Data Mining: Classifier Evaluation. CSCI-B490 Seminar in Computer Science (Data Mining)

Question Bank. 4) It is the source of information later delivered to data marts.

I211: Information infrastructure II

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. for each element of the dataset we are given its class label.

SOCIAL MEDIA MINING. Data Mining Essentials

Evaluating Machine-Learning Methods. Goals for the lecture

DATA MINING INTRODUCTION TO CLASSIFICATION USING LINEAR CLASSIFIERS

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures

K Nearest Neighbor Wrap Up K- Means Clustering. Slides adapted from Prof. Carpuat

Evaluation. Evaluate what? For really large amounts of data... A: Use a validation set.

A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995)

Data Warehousing and Machine Learning

Data Mining and Knowledge Discovery: Practice Notes

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Introduction to Artificial Intelligence

Evaluating Machine Learning Methods: Part 1

Table Of Contents: xix Foreword to Second Edition

Data Mining Course Overview

Association Rule Mining and Clustering

Knowledge Discovery in Data Bases

10 Classification: Evaluation

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

DATA MINING Introductory and Advanced Topics Part I

Data Mining An Overview ITEV, F /18

The k-means Algorithm and Genetic Algorithm

A Two Stage Zone Regression Method for Global Characterization of a Project Database

INF 4300 Classification III Anne Solberg The agenda today:

Introduction to Data Mining and Data Analytics

劉介宇, Graduate Institute of Nurse-Midwifery, National Taipei University of Nursing and Health Sciences / Associate Professor, Center for General Education, and Head of the Teacher Evaluation Group, Center for Faculty Development. Nov 19, 2012

Data Set. What is Data Mining? Data Mining (Big Data Analytics) Illustrative Applications. What is Knowledge Discovery?

Classification: Basic Concepts, Decision Trees, and Model Evaluation

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

PROBLEM 4

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

Statistics 202: Data Mining. c Jonathan Taylor. Outliers Based in part on slides from textbook, slides of Susan Holmes.

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering

Data Preprocessing. Slides by: Shree Jaswal

Data Mining and Machine Learning: Techniques and Algorithms

3 Virtual attribute subsetting

Probabilistic Classifiers DWML, /27

Performance Analysis of Data Mining Classification Techniques

Non-trivial extraction of implicit, previously unknown and potentially useful information from data

Classification Part 4

Stages of (Batch) Machine Learning

What is Data Mining? Data Mining. Data Mining Architecture. Illustrative Applications. Pharmaceutical Industry. Pharmaceutical Industry

Slides for Data Mining by I. H. Witten and E. Frank

What is Data Mining? Data Mining. Data Mining Architecture. Illustrative Applications. Pharmaceutical Industry. Pharmaceutical Industry

DATA MINING LECTURE 11. Classification Basic Concepts Decision Trees Evaluation Nearest-Neighbor Classifier


Outline: Knowledge Discovery in Datasets; Model Representation; Types of models (supervised, unsupervised); Evaluation. (Acknowledgement: Jesús Aguilar-Ruiz for some slides.)

Knowledge Discovery in Datasets. KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data. Given a large amount of data: which data mining algorithms? What knowledge can be found? How should it be represented and used?

Knowledge Discovery in Datasets. "The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" - Fayyad, Piatetsky-Shapiro, Smyth (1996). The slide annotates each term: non-trivial process (a multi-step process), valid (justified patterns/models), novel (previously unknown), useful (can be used), understandable (by humans and machines).

Knowledge Discovery in Datasets. Manual data analysis is impractical. [Slide diagram: an Expert and a Knowledge Engineer process INFORMATION into KNOWLEDGE, drawing on a Knowledge Base, external resources, and Data Mining.]

KDD: Potential Applications
Business information: marketing and sales data analysis, investment analysis, loan approval, fraud detection, etc.
Scientific information: sky survey cataloging, biosequence databases, geosciences (Quakefinder), etc.
Manufacturing information: controlling and scheduling, network management, experiment result analysis, etc.
Personal information.

KDD Process

KDD Process
1. Data integration and collection
2. Selection, cleaning and transformation → Data
3. Machine learning (data mining) → Patterns (e.g., classifiers, rules, etc.)
4. Evaluation and interpretation → Knowledge
5. Decision making

KDD Process: Integration & Collection. Data warehousing: databases and data repositories from different, heterogeneous sources. We usually deal with a plain, flat matrix of data. [Slide diagram: data sources in London, Toronto and Seville feed a data warehouse; a query tool produces a flat customer file with fields Cust-ID, name, address, age, income, credit-info, e.g. C1, Smith, Sandy, 5463 E Hasting, Burnaby BC V5A 459, Canada, 21, $27000, 1.]

KDD Process: Selection, Cleaning and Transformation
- Removing outliers
- Data sampling (if there is too much data)
- Handling missing values
- Removing redundant and irrelevant attributes (feature selection)
- Deriving new attributes from existing ones, e.g., population density from population and area
- Discretisation, normalisation, etc.
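Two of the transformations listed above can be sketched in a few lines. This is a minimal illustration, not a full preprocessing pipeline; the function names and data values are made up for the example.

```python
def population_density(population, area_km2):
    """Derive a new attribute from two existing ones."""
    return population / area_km2

def min_max_normalise(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(population_density(1_000_000, 500))   # 2000.0
print(min_max_normalise([2, 4, 6, 10]))     # [0.0, 0.25, 0.5, 1.0]
```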

KDD Model Classification. DM algorithms are traditionally divided into: supervised learning, which aims to discover knowledge for classification or prediction (predictive); and unsupervised learning, which refers to induction that extracts interesting knowledge from data (descriptive). Newer approaches also consider semi-supervised learning, where the goal is classification but the input contains both unlabeled and labeled data. Subgroup discovery approaches, which generate descriptive rules, are also halfway between descriptive and predictive techniques.

Supervised learning - classifiers. A classifier resembles a function in the sense that it attaches a value (or a range, or a description) to a set of attribute values. The task is to induce a classification model, given a database with m instances (samples) characterized by n predictor attributes, A_1, ..., A_n, and the class variable, C (it is possible to have more than one class):

A_1     ...  A_n     | C
a_{1,1} ...  a_{1,n} | c_1
...
a_{m,1} ...  a_{m,n} | c_m

Supervised Techniques
- Decision trees: trees where each leaf indicates a class and internal nodes specify some test to be carried out. There are many tree-building algorithms, such as C4.5 (Quinlan) or ID3.
- Rule induction: "if condition then class label"; "if ... then ... else if ..." (hierarchical).
- Lazy techniques store previous instances and search for similar ones when classifying new instances. k-nearest neighbour (k-NN) is a method for classifying objects based on the closest training examples in the feature space.
- Numeric prediction: regression techniques.
- Neural networks are composed of a set of nodes (units, neurons, processing elements), where each node has inputs and outputs and performs a simple computation given by its node function.
- Statistical techniques: Bayesian network classifiers assign a set of attributes A_1, A_2, ..., A_n to the class C_j such that P(C_j | A_1, A_2, ..., A_n) is maximal.
- Meta-techniques.
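The k-NN idea described above fits in a few lines of pure Python. This is a minimal sketch with made-up 2-D training points, not code from the slides: find the k closest training examples and take a majority vote.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training examples (Euclidean distance). `train` is a list of
    (feature_vector, class_label) pairs."""
    neighbours = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Illustrative training set (two numeric features, two classes).
train = [((1, 1), "healthy"), ((1, 2), "healthy"),
         ((3, 3), "cancerous"), ((3, 4), "cancerous")]
print(knn_predict(train, (1, 1.5), k=3))  # healthy
```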

Unsupervised Techniques
- Association: association rules, e.g., rules among supermarket items; the APRIORI algorithm.
- Clustering: tree clustering joins objects (e.g., animals) into successively larger clusters, using some measure of similarity or distance. Algorithms: k-means, EM (Expectation Maximization).
There is no class attribute, or the class is considered as just another attribute:

A_1     ...  A_n
a_{1,1} ...  a_{1,n}
...
a_{m,1} ...  a_{m,n}
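The k-means loop mentioned above can be sketched in pure Python. The 2-D points and starting centroids here are illustrative, not from the slides: alternately assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points.

```python
import math

def k_means(points, centroids, iterations=10):
    """Minimal k-means sketch: assignment step + update step."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster
        # (keep the old centroid if the cluster is empty).
        centroids = [tuple(sum(c) / len(c) for c in zip(*cluster))
                     if cluster else centroids[i]
                     for i, cluster in enumerate(clusters)]
    return centroids

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
print(k_means(points, centroids=[(0, 0), (10, 10)]))
# [(1.0, 1.5), (8.5, 8.0)]
```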

KDD Representation (Models). Dataset: cancerous and healthy cells (H1-H4 healthy, C1-C4 cancerous).

      colour  #nuclei  #tails  class
H1    light   1        1       healthy
H2    dark    1        1       healthy
H3    light   1        2       healthy
H4    light   2        1       healthy
C1    dark    1        2       cancerous
C2    dark    2        1       cancerous
C3    light   2        2       cancerous
C4    dark    2        2       cancerous

Decision Rules. Mining with decision rules:
If colour = light and #nuclei = 1 then cell = healthy
If #nuclei = 2 and colour = dark then cell = cancerous
(and 4 more rules)

Decision Trees. Mining with decision trees:

#nuclei?
├─ 1 → colour?
│      ├─ light → healthy
│      └─ dark → #tails?
│                ├─ 1 → healthy
│                └─ 2 → cancerous
└─ 2 → colour?
       ├─ light → #tails?
       │          ├─ 1 → healthy
       │          └─ 2 → cancerous
       └─ dark → cancerous

Hierarchical Decision Rules. Mining with hierarchical decision rules:
If colour = light and #nuclei = 1 then cell = healthy
Else if #nuclei = 2 and colour = dark then cell = cancerous
Else if #tails = 1 then cell = healthy
Else cell = cancerous
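The hierarchical rules above translate directly into an if/elif chain, and we can check them against the eight-cell dataset from the representation slide (a small sketch; the function name is ours):

```python
def classify_cell(colour, nuclei, tails):
    """The hierarchical decision rules from the slide, as code."""
    if colour == "light" and nuclei == 1:
        return "healthy"
    elif nuclei == 2 and colour == "dark":
        return "cancerous"
    elif tails == 1:
        return "healthy"
    else:
        return "cancerous"

# The eight cells from the example dataset: (colour, #nuclei, #tails)
cells = {"H1": ("light", 1, 1), "H2": ("dark", 1, 1),
         "H3": ("light", 1, 2), "H4": ("light", 2, 1),
         "C1": ("dark", 1, 2), "C2": ("dark", 2, 1),
         "C3": ("light", 2, 2), "C4": ("dark", 2, 2)}
for name, attrs in cells.items():
    print(name, classify_cell(*attrs))  # every H* -> healthy, C* -> cancerous
```

The four rules classify all eight training cells correctly, which is exactly what the slide's decision tree encodes in a different form.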

Association Rules. Mining with association rules: an association between A and B means that the presence of A in a record implies the presence of B in the same record (Apriori algorithm, Agrawal 1993).
If colour = light and #nuclei = 1 then #tails = 1 (support = 25%; confidence = 50%)
If #nuclei = 2 and cell = cancerous then #tails = 2 (support = 3/8; confidence = 2/3)
Support: the proportion of records to which the rule applies. Confidence: the proportion of those records for which the rule is correct.
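Support and confidence can be recomputed over the eight-cell dataset to verify the numbers on the slide. A small sketch using the slide's definition of support (proportion of records matching the rule's antecedent); the helper name is ours:

```python
cells = [  # (colour, nuclei, tails, class)
    ("light", 1, 1, "healthy"), ("dark", 1, 1, "healthy"),
    ("light", 1, 2, "healthy"), ("light", 2, 1, "healthy"),
    ("dark", 1, 2, "cancerous"), ("dark", 2, 1, "cancerous"),
    ("light", 2, 2, "cancerous"), ("dark", 2, 2, "cancerous"),
]

def support_confidence(antecedent, consequent, records):
    """support = |antecedent matches| / |records|;
    confidence = |antecedent and consequent| / |antecedent matches|."""
    matches = [r for r in records if antecedent(r)]
    correct = [r for r in matches if consequent(r)]
    return len(matches) / len(records), len(correct) / len(matches)

# Rule: if colour = light and #nuclei = 1 then #tails = 1
s, c = support_confidence(lambda r: r[0] == "light" and r[1] == 1,
                          lambda r: r[2] == 1, cells)
print(s, c)  # 0.25 0.5 -- i.e., support 25%, confidence 50%
```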

Neural Networks. Mining with neural networks. [Slide diagram: a network whose inputs are colour = dark, #nuclei = 1 and #tails = 2, and whose outputs are healthy and cancerous.]

Evaluation. Once we obtain the model from the training data, we need to evaluate it on some new data (testing data). We cannot use the same data for training and testing. E.g., evaluating a student with exercises previously solved: the student's marks will be optimistic, and we learn nothing about the student's ability to generalise the learned concepts.
The holdout approach consists of dividing the dataset into training (approx. 2/3 of the data) and testing (approx. 1/3 of the data). Problem: if divided randomly, the data can be skewed, classes can be missing, etc. Stratification ensures that each class is represented with approximately equal proportions in both sets. E.g., if the data contains approx. 45% positive cases, the training and testing datasets should maintain a similar proportion of positive cases.
The holdout estimate can be made more reliable by repeating the process with different subsamples (repeated holdout method); the error rates of the different iterations are averaged into an overall error rate.
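A stratified holdout split can be sketched by splitting each class separately, so both partitions keep roughly the original class proportions. This is a minimal illustration with made-up data, not a production splitter:

```python
import random

def stratified_holdout(instances, labels, train_fraction=2/3, seed=0):
    """Split each class separately so train and test keep roughly
    the original class proportions (stratification)."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in set(labels):
        group = [x for x, y in zip(instances, labels) if y == cls]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train += [(x, cls) for x in group[:cut]]
        test += [(x, cls) for x in group[cut:]]
    return train, test

data = list(range(12))
labels = ["pos"] * 6 + ["neg"] * 6
train, test = stratified_holdout(data, labels)
print(len(train), len(test))  # 8 4  (both splits are 50/50 pos/neg)
```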

Cross-Validation. Cross-validation (CV) avoids overlapping test sets.
First step: split the dataset D into k subsets of equal size, C_1, ..., C_k.
Second step: for each i, construct a dataset D_i = D - C_i used for training, and test the accuracy of the classifier f_{D_i} on the subset C_i.
Having done this for all k, we estimate the accuracy of the method by averaging the accuracy over the k cross-validation trials. This is called k-fold cross-validation; usually k = 10. Subsets are generally stratified before the CV is performed, and the error estimates are averaged to yield an overall error estimate.
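The two steps above amount to generating k disjoint test folds and pairing each with the rest of the data for training. A minimal index-level sketch (without stratification, which would split each class separately as in stratified holdout):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k test folds; each fold pairs a
    test subset C_i with the remaining data D - C_i for training."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for test in folds:
        train = [j for j in range(n) if j not in set(test)]
        splits.append((train, test))
    return splits

for train, test in k_fold_indices(n=10, k=5):
    print(test)  # each index appears in exactly one test fold
```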

Evaluation Measures. From the confusion matrix, we can obtain multiple measures of the goodness of a classifier. E.g., for a binary problem:

                  Predicted positive                  Predicted negative
Actual positive   TP (true positive)                  FN (false negative, Type II error)
Actual negative   FP (false positive, Type I error)   TN (true negative)

TPrate = TP/(TP+FN)   (sensitivity, recall)
TNrate = TN/(FP+TN)   (specificity)
PPV = TP/(TP+FP)      positive predictive value (confidence, precision)
NPV = TN/(FN+TN)      negative predictive value
Accuracy = (TP+TN)/(TP+TN+FP+FN)
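The measures above are one-liners given the four confusion-matrix counts. A small sketch with illustrative counts (the function name and numbers are ours):

```python
def classifier_metrics(tp, fn, fp, tn):
    """Standard measures derived from a binary confusion matrix."""
    return {
        "TPrate (sensitivity, recall)": tp / (tp + fn),
        "TNrate (specificity)": tn / (fp + tn),
        "PPV (precision)": tp / (tp + fp),
        "NPV": tn / (fn + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

m = classifier_metrics(tp=40, fn=10, fp=5, tn=45)
print(m["accuracy"])  # 0.85
```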

Evaluation Measures. We often need to combine TP and FP to estimate the goodness of a classifier. For example, with imbalanced data, the accuracy of a classifier needs to exceed the proportion of the majority class: in a binary problem with a 50/50 class distribution we need accuracy above 50%, but if the distribution is 90/10, accuracy needs to be above 90%.
The F-measure is the harmonic mean of precision and recall:

F-measure = 2·TP / (2·TP + FP + FN)

AUC (Area Under the ROC Curve) is another combined measure.
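The two views of the F-measure, as a harmonic mean of precision and recall and as the closed form 2·TP/(2·TP + FP + FN), are algebraically the same; a quick sketch confirming it on illustrative counts:

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall; both forms agree."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    harmonic = 2 * precision * recall / (precision + recall)
    direct = 2 * tp / (2 * tp + fp + fn)
    assert abs(harmonic - direct) < 1e-12  # same quantity, two forms
    return direct

print(f_measure(tp=40, fp=5, fn=10))  # 80/95, about 0.842
```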

Evaluation: Underfitting vs. Overfitting. Underfitting: the model is too simple. Overfitting: the model is too complex.

Overfitting: decision tree example. Increasing the tree size initially decreases both the training and testing errors. However, beyond some point of tree complexity, the training error keeps decreasing but the testing error increases. Many algorithms have parameters to control model complexity (e.g., the pruning parameter in decision trees).

Evaluation: Cost. In many cases, failing in one way is not the same as failing in the other (Type I error vs. Type II error). Cost can be considered when inducing the model from the data, using cost-sensitive classifiers, e.g., MetaCost (Domingos, 1999). E.g., cost matrices by default (left) and considering cost (right):

                  Predicted positive  Predicted negative  |  Predicted positive  Predicted negative
Actual positive   0                   1                   |  0                   15
Actual negative   1                   0                   |  1                   0
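Comparing the two cost matrices is a matter of weighting each confusion-matrix cell by its cost and summing. A small sketch with an illustrative confusion matrix (the counts are ours, the cost matrices are the slide's):

```python
def total_cost(confusion, cost):
    """Total misclassification cost: element-wise product of the
    confusion-matrix counts and the cost matrix, summed.
    Both matrices have rows = actual class, columns = predicted."""
    return sum(confusion[i][j] * cost[i][j]
               for i in range(2) for j in range(2))

confusion = [[40, 10],  # actual positive: 40 TP, 10 FN
             [5, 45]]   # actual negative: 5 FP, 45 TN
default_cost = [[0, 1], [1, 0]]
fn_heavy_cost = [[0, 15], [1, 0]]   # false negatives cost 15x more
print(total_cost(confusion, default_cost))   # 15
print(total_cost(confusion, fn_heavy_cost))  # 155
```

Under the default matrix, cost just counts errors; under the cost-sensitive matrix, the same classifier looks much worse, which is what pushes cost-sensitive learners to trade false positives for fewer false negatives.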

Thanks! Questions?