Basic Data Mining Technique

Chapter 4
- What is classification? What is prediction?
- Supervised and unsupervised learning
- Decision trees
- Association rules
- K-nearest neighbor classifier
- Case-based reasoning
- Genetic algorithm
- Rough set approach
- Fuzzy set approaches

Classification:
- predicts categorical class labels
- constructs a model from the training set and the values (class labels) of a classifying attribute, then uses that model to classify new data

Prediction:
- models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications
- credit approval
- target marketing
- medical diagnosis
- treatment effectiveness analysis

Classification process (two steps)
1. Model construction
2. Model usage

1. Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae

Training data (input to the classification algorithm)

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (model) produced by the algorithm:
IF rank = professor OR years > 6 THEN tenured = yes

Classification process
2. Model usage: classifying future or unknown objects
- Estimate the accuracy of the model
  - The known label of each test sample is compared with the classified result from the model
  - The accuracy rate is the percentage of test set samples that are correctly classified by the model
  - The test set is independent of the training set

Testing data (input to the classifier)

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4). Tenured?
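The two steps can be traced by hand on the tenure example. The sketch below hard-codes the rule from the slide and measures its accuracy on the training and testing tables; the function names `classify` and `accuracy` are illustrative choices, not part of the slides.

```python
# Rule from the slide: IF rank = professor OR years > 6 THEN tenured = yes
def classify(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# (name, rank, years, tenured) rows copied from the training and testing tables
training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]
testing = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def accuracy(rows):
    """Fraction of rows whose known label matches the rule's output."""
    correct = sum(1 for _, rank, years, label in rows
                  if classify(rank, years) == label)
    return correct / len(rows)

print("training accuracy:", accuracy(training))
print("test accuracy:", accuracy(testing))
print("Jeff tenured?", classify("Professor", 4))
```

The rule fits every training row but misclassifies Merlisa in the test set, which is why accuracy is estimated on data that is independent of the training set.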

Prediction is similar to classification
1. Construct a model
2. Use the model to predict an unknown value
- The major method for prediction is regression
  - Linear and multiple regression
  - Non-linear regression

Prediction is different from classification
- Classification predicts categorical class labels
- Prediction models continuous-valued functions
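As a minimal sketch of the construct-then-use pattern for prediction, the code below fits a straight line by ordinary least squares and uses it to predict a new value. The data points and the names `fit_line` and `predict` are made up for illustration.

```python
def fit_line(xs, ys):
    """Step 1: construct the model (least-squares slope and intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Step 2: use the model to predict a continuous value."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
model = fit_line(xs, ys)
print(model, predict(model, 10))
```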

1. Data preparation
2. Evaluating classification methods

Data cleaning
- Preprocess data in order to reduce noise and handle missing values

Relevance analysis (feature selection)
- Remove irrelevant or redundant attributes

Data transformation
- Generalize and/or normalize data
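The slides do not name specific cleaning or normalization techniques. As one possible illustration, the sketch below fills missing values with the attribute mean and rescales the result to [0, 1] with min-max normalization; both choices, and the function names, are assumptions.

```python
def fill_missing(values):
    """Replace None with the mean of the known values (one simple strategy)."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical attribute column with one missing value
years = [2, None, 4]
cleaned = fill_missing(years)
print(cleaned, min_max(cleaned))
```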

- Predictive accuracy
- Speed and scalability
  - time to construct the model
  - time to use the model
- Robustness
  - handling noise and missing values
- Scalability
  - efficiency in disk-resident databases
- Interpretability
  - understanding and insight provided by the model
- Goodness of rules
  - decision tree size
  - compactness of classification rules

Supervised learning (classification)
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- New data is classified based on the training set

Unsupervised learning (clustering)
- The class labels of the training data are unknown
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Decision tree
- A flow-chart-like tree structure
- An internal node denotes a test on an attribute
- A branch represents an outcome of the test
- Leaf nodes represent class labels or class distributions

Use of a decision tree: classifying an unknown sample
- Test the attribute values of the sample against the decision tree

Decision tree generation consists of two phases
1. Tree construction
   - At the start, all the training examples are at the root
   - Partition the examples recursively based on selected attributes
2. Tree pruning
   - Identify and remove branches that reflect noise or outliers

This follows an example from Quinlan's ID3.

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Resulting decision tree for buys_computer:

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31..40 -> yes
  >40    -> credit rating?
              excellent -> no
              fair      -> yes
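The slides do not show how the splitting attribute is selected. ID3 chooses the attribute with the highest information gain, and the sketch below computes that gain for each attribute of the buys_computer table. The helper names are illustrative, not taken from the slides.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["age", "income", "student", "credit_rating"]
# Rows of the buys_computer table; the last field is the class label
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def entropy(labels):
    """Expected information (in bits) needed to identify a class label."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, index):
    """Entropy reduction obtained by partitioning rows on one attribute."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[index], []).append(row[-1])
    remainder = sum(len(p) / len(rows) * entropy(p)
                    for p in partitions.values())
    return entropy([row[-1] for row in rows]) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
print(gains, best)
```

On this table the attribute age has the largest gain, which is consistent with age being tested at the root of the tree.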

Association rule mining
- Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories

Applications
- Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
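As a rough illustration of finding frequent patterns in basket data, the sketch below counts how often each pair of items occurs together and keeps the pairs that reach a minimum count. The baskets, the threshold, and the name `frequent_pairs` are all hypothetical; the slides do not specify an algorithm.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_count):
    """Return item pairs that co-occur in at least min_count transactions."""
    counts = Counter()
    for items in transactions:
        # Sort so that each pair has one canonical ordering
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Hypothetical market baskets
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]
print(frequent_pairs(baskets, min_count=2))
```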

Instance-based learning
- Store the training examples and delay the processing ("lazy evaluation") until a new instance must be classified

Typical approaches
- k-nearest neighbor approach
  - Instances are represented as points in a Euclidean space
- Case-based reasoning
  - Uses symbolic representations and knowledge-based inference

- All instances correspond to points in n-dimensional space.
- The nearest neighbors are defined in terms of Euclidean distance.
- The target function can be discrete- or real-valued.
- For a discrete-valued target, k-NN returns the most common value among the k training examples nearest to the query point xq.
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

[Figure: training examples labeled + and - around a query point xq]
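The discrete-valued case can be sketched in a few lines: sort the stored examples by Euclidean distance to the query point and take a majority vote among the k nearest. The training points and the name `knn_classify` are made up for illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+

def knn_classify(training, query, k):
    """Return the most common label among the k examples nearest to query."""
    neighbors = sorted(training, key=lambda ex: dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training examples: (point, label)
training = [
    ((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"),
    ((5, 5), "-"), ((6, 5), "-"), ((5, 6), "-"),
]
print(knn_classify(training, (2, 2), k=3))
print(knn_classify(training, (5, 4), k=3))
```

No model is built in advance; all the work happens when a query arrives, which is the lazy evaluation described for instance-based learning.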

Case-based reasoning
- Also uses lazy evaluation and analyzes similar instances
- Difference: instances are not points in a Euclidean space

Methodology
- Instances are represented by rich symbolic descriptions (e.g., function graphs)
- Multiple retrieved cases may be combined

Genetic algorithm (GA): based on an analogy to biological evolution
- Each rule is represented by a string of bits
  - e.g., IF A1 AND NOT A2 THEN C2 can be encoded as 100
- An initial population is created, consisting of randomly generated rules
- Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
- The fitness of a rule is represented by its classification accuracy on a set of training examples
- Offspring are generated by crossover and mutation
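The sketch below illustrates these steps on 3-bit rules. It assumes the first two bits stand for A1 and A2 and the last bit for the class, with 0 meaning C2 as in the 100 example and 1 meaning C1. The training examples, the mutation rate, the keep-the-better-half selection scheme, and all function names are assumptions made for illustration.

```python
import random

def fitness(rule, examples):
    """Accuracy of the rule on the training examples it covers (0 if none)."""
    covered = [e for e in examples if e[:2] == rule[:2]]
    if not covered:
        return 0.0
    return sum(1 for e in covered if e[2] == rule[2]) / len(covered)

def crossover(a, b, point):
    """Single-point crossover: swap the tails of two bit strings."""
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(rule, index):
    """Flip one bit of the rule."""
    flipped = "1" if rule[index] == "0" else "0"
    return rule[:index] + flipped + rule[index + 1:]

def next_generation(population, examples, rng, mutation_rate=0.1):
    """Keep the fitter half and refill the population with their offspring."""
    ranked = sorted(population, key=lambda r: fitness(r, examples), reverse=True)
    survivors = ranked[: max(2, len(ranked) // 2)]
    offspring = []
    while len(survivors) + len(offspring) < len(population):
        a, b = rng.sample(survivors, 2)
        child, _ = crossover(a, b, rng.randrange(1, len(a)))
        if rng.random() < mutation_rate:
            child = mutate(child, rng.randrange(len(child)))
        offspring.append(child)
    return survivors + offspring

# Hypothetical training examples encoded as A1, A2, class
examples = ["100", "100", "101", "010", "011", "011", "111"]
rng = random.Random(0)
initial = ["".join(rng.choice("01") for _ in range(3)) for _ in range(6)]
population = list(initial)
for _ in range(5):
    population = next_generation(population, examples, rng)
print(population)
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next.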

Rough sets are used to approximately, or "roughly", define equivalence classes.

A rough set for a given class C is approximated by two sets:
1. a lower approximation (objects certain to be in C)
2. an upper approximation (objects that cannot be described as not belonging to C)

Finding the minimal subsets of attributes (for feature reduction) is NP-hard.
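A small sketch of the two approximations: objects with identical attribute values are indistinguishable and form an equivalence class. The lower approximation collects the equivalence classes that lie entirely inside C, and the upper approximation collects those that overlap C at all. The objects, attributes, and the name `approximations` are hypothetical.

```python
def approximations(objects, attributes, target):
    """Return (lower, upper) approximations of the target set of object names."""
    # Group objects that have identical values on the chosen attributes
    classes = {}
    for name, values in objects.items():
        key = tuple(values[a] for a in attributes)
        classes.setdefault(key, set()).add(name)

    lower, upper = set(), set()
    for members in classes.values():
        if members <= target:   # entirely inside C: certainly in C
            lower |= members
        if members & target:    # overlaps C: cannot be ruled out
            upper |= members
    return lower, upper

# Hypothetical data: o1 and o2 are indistinguishable, but only o1 is in C
objects = {
    "o1": {"headache": "yes", "temp": "high"},
    "o2": {"headache": "yes", "temp": "high"},
    "o3": {"headache": "no", "temp": "high"},
    "o4": {"headache": "no", "temp": "normal"},
}
C = {"o1", "o3"}
print(approximations(objects, ["headache", "temp"], C))
```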

Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (for example, by using a fuzzy membership graph).

[Figure: fuzzy membership graph for Income, with fuzzy sets Low, Medium, and High; marked points: "somewhat low" and "baseline high"]

- Attribute values are converted to fuzzy values
  - e.g., income is mapped into the discrete categories {low, medium, high}, with fuzzy values calculated
- For a given new sample, more than one fuzzy value may apply
- Each applicable rule contributes a vote for membership in the categories
- Typically, the truth values for each predicted category are summed
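The sketch below walks through these steps for one attribute. The membership breakpoints (income in thousands), the three rules, and the class names are invented for illustration; the slides do not give concrete membership functions.

```python
def triangle(x, left, peak, right):
    """Triangular membership: 0 outside (left, right), 1 at the peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def fuzzify_income(income):
    """Map an income to degrees of membership in low, medium, and high."""
    return {
        "low": max(0.0, min(1.0, (40 - income) / 20)),   # 1 up to 20, 0 from 40
        "medium": triangle(income, 20, 40, 60),
        "high": max(0.0, min(1.0, (income - 40) / 20)),  # 0 up to 40, 1 from 60
    }

# Hypothetical rules: income category -> predicted class
RULES = [("low", "reject"), ("medium", "approve"), ("high", "approve")]

def vote(income):
    """Sum the truth values contributed by each applicable rule, per class."""
    memberships = fuzzify_income(income)
    totals = {}
    for category, predicted in RULES:
        totals[predicted] = totals.get(predicted, 0.0) + memberships[category]
    return totals

print(fuzzify_income(50), vote(50))
print(fuzzify_income(30), vote(30))
```

An income of 50 belongs partly to medium and partly to high, so two rules apply and their truth values are summed for the class they both predict.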

Data Mining: Concepts and Techniques (Chapter 7 for textbook), Jiawei Han and Micheline Kamber, Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada