Data Analysis in the Internet of Things


Data Analysis in the Internet of Things, Part 3: Classification in Data Streams. w3.gazi.edu.tr/~suatozdemir

Supervised vs. Unsupervised Learning (1)
Supervised learning (classification):
- Supervision: the training data (observations or measurements) are accompanied by labels indicating the classes they belong to.
- New data is classified based on the model built from the training set.
Training data with class labels, "Who buys a computer?":

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

[Figure: training instances feed a model-learning step; the learned model then predicts a positive/negative label for each test instance.]

Supervised vs. Unsupervised Learning (2)
Unsupervised learning (clustering):
- The class labels of the training data are unknown.
- Given a set of observations or measurements, the aim is to establish the possible existence of classes or clusters in the data.

Prediction Problems: Classification vs. Numeric Prediction
Classification:
- Predicts categorical class labels (discrete or nominal).
- Constructs a model from the training set and the class labels (the values of the classifying attribute), then uses it to classify new data.
Numeric prediction:
- Models continuous-valued functions, i.e., predicts unknown or missing numeric values.
Typical applications of classification:
- Credit/loan approval
- Medical diagnosis: is a tumor cancerous or benign?
- Fraud detection: is a transaction fraudulent?
- Web page categorization: which category does a page belong to?

Classification: Model Construction, Validation and Testing
Model construction:
- Each sample is assumed to belong to a predefined class, given by its class label.
- The set of samples used for model construction is the training set.
- The model is represented as decision trees, rules, mathematical formulas, or other forms.
Model validation and testing:
- Test: estimate the accuracy of the model. The known label of each test sample is compared with the label predicted by the model.
- Accuracy: the percentage of test-set samples that are correctly classified by the model.
- The test set must be independent of the training set.
- Validation: if a held-out set is used to select or refine models, it is called a validation (or development) set.
Model deployment:
- If the accuracy is acceptable, use the model to classify new data.
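As a minimal illustration of this accuracy measure, the sketch below (Python; "model" is assumed to be any object exposing a predict(x) method, an illustrative interface rather than a fixed API) computes the percentage of correctly classified test samples:

```python
# A minimal sketch of the accuracy measure defined above. The model
# interface (a predict(x) method) is an assumption for the example.
def accuracy(model, test_set):
    """Percentage of test samples whose predicted label matches the known label."""
    correct = sum(1 for x, label in test_set if model.predict(x) == label)
    return 100.0 * correct / len(test_set)
```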

Decision Tree Induction: An Example
Decision tree construction is a top-down, recursive, divide-and-conquer process.
Training data set: the "Who buys a computer?" table above.
Resulting tree:
- age <=30: test student? (no -> Don't buy; yes -> Buy)
- age 31..40: Buy
- age >40: test credit_rating? (excellent -> Don't buy; fair -> Buy)
How do we decide which attribute to test at each node?
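The slides do not prescribe any software; as a hedged sketch, a similar tree can be fit to the table above with scikit-learn (an assumed dependency). Note that the ordinal encoding yields threshold splits rather than the multiway splits drawn on the slide:

```python
# A minimal sketch (assumed tooling: scikit-learn) fitting a decision tree
# to the "buys_computer" training table above, using the information-gain
# heuristic (criterion="entropy").
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# The 14 training rows: (age, income, student, credit_rating) -> buys_computer
rows = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
X = [r[:4] for r in rows]
y = [r[4] for r in rows]

enc = OrdinalEncoder()                  # encode categorical attributes as integers
X_enc = enc.fit_transform(X)

tree = DecisionTreeClassifier(criterion="entropy").fit(X_enc, y)
print(export_text(tree, feature_names=["age", "income", "student", "credit_rating"]))
```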

From Entropy to Info Gain: A Brief Review of Entropy
Entropy (information theory): a measure of the uncertainty associated with a random variable.
Calculation: for a discrete random variable Y taking m distinct values {y_1, y_2, ..., y_m} with probabilities p_i = P(Y = y_i),
H(Y) = -\sum_{i=1}^{m} p_i \log_2 p_i
Interpretation: higher entropy means higher uncertainty; lower entropy means lower uncertainty.
Conditional entropy averages the entropy of Y over the values of another variable X:
H(Y \mid X) = \sum_{x} P(X = x) \, H(Y \mid X = x)
(For m = 2, entropy is maximal at 1 bit when both values are equally likely.)
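A small self-contained sketch of these formulas in plain Python (function names are illustrative):

```python
# Entropy and information gain computed from empirical label distributions.
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum_i p_i * log2(p_i) over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(Y; X) = H(Y) - H(Y|X) for a discrete attribute X."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    h_cond = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - h_cond

print(entropy(["yes", "no"]))             # m = 2, equally likely -> 1.0 bit
print(entropy(["yes"] * 9 + ["no"] * 5))  # ~0.940: the buys_computer class entropy
```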

Classification Models
- Neural networks
- Statistical models: linear/quadratic discriminants
- Decision trees
- Genetic models

Why the Decision Tree Model?
- Relatively fast compared to other classification models.
- Achieves similar, and sometimes better, accuracy than other models.
- Simple and easy to understand.
- Can be converted into simple, easy-to-understand classification rules.

Data Streams
- Data arrive continuously, possibly very fast.
- Data size is extremely large, potentially infinite.
- We cannot possibly store all the data.

Issues
- Disk/memory-resident algorithms require the data to fit on disk or in memory, and they may need to scan the data multiple times.
- We need algorithms that read the data only once and require only a small amount of time to process each example: incremental learning methods.
Goal: design decision tree learners that read each example at most once and use a small constant time to process it.

Incremental Learning Methods
Previous incremental learning methods:
- Some are efficient but do not produce accurate models.
- Some produce accurate models but are very inefficient.
An algorithm that is both efficient and accurate: VFDT, the Hoeffding tree algorithm.
- Given a stream of examples, use the first ones to choose the root attribute.
- Once the root attribute is chosen, successive examples are passed down to the corresponding leaves and used to choose the attributes there, and so on recursively.

VFDT: Hoeffding Tree Algorithm
- Calculate the information gain of the candidate attributes and determine the best two.
- At each node, check a split condition:
  - If the condition is satisfied, create child nodes based on the test at the node.
  - If not, stream in more examples and repeat the calculation until the condition is satisfied.

VFDT: Hoeffding Tree Algorithm (cont.)
The algorithm constructs the tree using the same procedure as ID3: it calculates the information gain of the attributes and determines the best two. It is sufficient to consider only a small subset of the training examples that pass through a node to find the best split there; for example, use the first few examples to choose the split at the root.
Problem: how many examples are necessary? The Hoeffding bound! Use the Hoeffding bound to decide how many examples are enough at each node.

Hoeffding Bound
- Independent of the probability distribution generating the observations.
- Consider a real-valued random variable r whose range is R, and n independent observations of r with observed mean r̂.
- The Hoeffding bound states that, with probability at least 1 − δ (for a small δ), the true mean r̄ satisfies r̄ ≥ r̂ − ε, where
\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}
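A numeric illustration of the bound (the values of δ and n below are assumptions for the example; for information gain over c classes, the range is R = log₂ c):

```python
# epsilon = sqrt(R^2 * ln(1/delta) / (2n)) shrinks as more examples arrive.
from math import log, log2, sqrt

def hoeffding_epsilon(R, delta, n):
    return sqrt(R * R * log(1.0 / delta) / (2.0 * n))

R = log2(2)        # two-class information gain: range is 1 bit
delta = 1e-7       # desired failure probability (assumed value)
for n in (100, 1_000, 10_000):
    print(n, round(hoeffding_epsilon(R, delta, n), 3))
# epsilon shrinks with n: 0.284 at n=100, 0.090 at n=1000, 0.028 at n=10000
```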

Hoeffding Bound (cont.)
Let G(X_i) be the heuristic measure used to choose the split, where X_i is a discrete attribute. After seeing n examples, let X_a and X_b be the attributes with the highest and second-highest observed G(), respectively, and let
\Delta G = G(X_a) - G(X_b) \geq 0

Hoeffding Bound (cont.)
Given a desired δ, if ΔG > ε then the Hoeffding bound guarantees that, with probability 1 − δ, the true difference satisfies ΔḠ ≥ ΔG − ε > 0. Hence Ḡ(X_a) > Ḡ(X_b), i.e., X_a is the best attribute to split on with probability 1 − δ.
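Putting the pieces together, a hedged sketch of the split test at a leaf (function names are illustrative; the helper repeats the ε formula above):

```python
# Split on the best attribute X_a as soon as its observed gain advantage
# over the runner-up X_b exceeds the Hoeffding epsilon.
from math import log, sqrt

def hoeffding_epsilon(R, delta, n):
    return sqrt(R * R * log(1.0 / delta) / (2.0 * n))

def should_split(gains, n, R=1.0, delta=1e-7):
    """gains: observed G() values of the candidate attributes at this leaf."""
    if n == 0 or len(gains) < 2:
        return False
    best, second = sorted(gains, reverse=True)[:2]
    return best - second > hoeffding_epsilon(R, delta, n)   # Delta-G > epsilon?

print(should_split([0.40, 0.25], n=1_000))   # True:  0.15 > ~0.090, safe to split
print(should_split([0.40, 0.38], n=1_000))   # False: keep streaming in examples
```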

VFDT Example
[Figure: a stream of examples first grows a root split on "Age < 30?"; once the observed G(Car Type) − G(Gender) at a leaf exceeds ε, that leaf is split on "Car Type = Sports Car?".]

VFDT: Issues
VFDT assumes the training data is a sample drawn from a stationary distribution; most large databases and data streams violate this assumption.
Concept drift: data is generated by a time-changing concept function, e.g. seasonal effects or economic cycles.
Goal: mine continuously changing data streams, and scale well.

VFDT: Issues (cont.)
Common approach: when a new example arrives, reapply a traditional learner to a sliding window of the w most recent examples (see the sketch below). This is sensitive to the window size:
- If w is small relative to the rate of concept drift, the window assures the availability of a model reflecting the current concept.
- Too small a w may leave insufficient examples to learn the concept.
- If examples arrive at a rapid rate or the concept changes quickly, the computational cost of repeatedly reapplying the learner may be prohibitively high.
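A sketch of this naive sliding-window baseline (Python; the learner, w, and the retraining period are illustrative choices, and feature vectors are assumed numeric):

```python
# Keep the w most recent examples and periodically retrain a conventional
# learner from scratch; the full retraining pass is the costly step.
from collections import deque
from sklearn.tree import DecisionTreeClassifier

def window_learner(stream, w=1000, retrain_every=100):
    window = deque(maxlen=w)          # deque drops the oldest example automatically
    model = None
    for i, (x, y) in enumerate(stream, 1):
        window.append((x, y))
        if i % retrain_every == 0:    # repeated full retraining on the window
            X = [ex for ex, _ in window]
            labels = [lab for _, lab in window]
            model = DecisionTreeClassifier().fit(X, labels)
        yield model                   # the model currently in use (None until first fit)
```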

CVFDT
CVFDT (Concept-adapting Very Fast Decision Tree learner):
- Extends VFDT.
- Maintains VFDT's speed and accuracy.
- Detects and responds to changes in the example-generating process.

CVFDT Observations
- With a time-changing concept, the current splitting attribute of some nodes may no longer be the best.
- An outdated subtree may still be better than the best single leaf, particularly if it is near the root.
- Therefore: grow an alternate subtree with the new best attribute at its root when the old attribute seems out of date.
- Periodically use a batch of samples to evaluate the quality of the trees, and replace the old subtree when the alternate one becomes more accurate.

CVFDT Algorithm
- Alternate trees for each node in HT start empty.
- Process examples from the stream indefinitely. For each example (x, y):
  - Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes (x, y) passes through.
  - Add (x, y) to the sliding window of examples.
  - If the sliding window overflows, remove the oldest example and forget its effect.
  - Call CVFDTGrow.
  - Call CheckSplitValidity if f examples have been seen since the last check of the alternate trees.
- Return HT.
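A structural sketch of this loop in Python (the subroutines below are stubs named after the slide's steps, and ht.max_node_id is an assumed attribute; this is not the authors' reference implementation):

```python
# CVFDT outer loop: route, window, forget, grow, periodically re-check splits.
from collections import deque

def sort_to_leaves(ht, x):
    """Route x to a leaf of HT and of every alternate tree along its path."""

def cvfdt_grow(ht, x, y):
    """Update statistics along x's path and possibly split a leaf (next slide)."""

def check_split_validity(ht):
    """Re-examine internal splits; start alternate trees for new winners."""

def forget_example(ht, x, y, max_id):
    """Undo (x, y)'s effect at nodes with ID <= max_id (see the forgetting slide)."""

def cvfdt(stream, ht, w=100_000, f=10_000):
    window = deque()                   # sliding window of the w most recent examples
    since_check = 0
    for x, y in stream:                # process examples indefinitely
        sort_to_leaves(ht, x)
        window.append((x, y, ht.max_node_id))
        if len(window) > w:            # overflow: forget the oldest example
            old_x, old_y, max_id = window.popleft()
            forget_example(ht, old_x, old_y, max_id)
        cvfdt_grow(ht, x, y)
        since_check += 1
        if since_check >= f:           # periodic alternate-tree check
            check_split_validity(ht)
            since_check = 0
    return ht
```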

CVFDT algorithm: process each example
[Flowchart: read new example -> pass it down to the leaves and add it to the sliding window -> if the window overflows, forget the oldest example -> CVFDTGrow -> if f examples since the last check, CheckSplitValidity.]

CVFDTGrow
For each node reached by the example in HT:
- Increment the corresponding statistics at the node.
- For each alternate tree T_alt of the node, call CVFDTGrow on T_alt.
If enough examples have been seen at the leaf in HT that the example reaches:
- Choose the attribute with the highest average value of the attribute evaluation measure (information gain or Gini index).
- If the best attribute is not the null attribute, create a child node for each possible value of this attribute.

Forgetting an Old Example
- Maintain the sufficient statistics at every node in HT to monitor the validity of its previous decisions (VFDT maintains such statistics only at leaves).
- HT might have grown or changed since the example was initially incorporated, so each node is assigned a unique, monotonically increasing ID at creation time.
ForgetExample(HT, example, maxID):
- For each node reached by the old example whose node ID is no larger than maxID (the maximum ID of the leaves the example originally reached):
  - Decrement the corresponding statistics at the node.
  - For each alternate tree T_alt of the node, call ForgetExample(T_alt, example, maxID).
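A concrete sketch of the forgetting step with the node-ID guard (the Node fields are simplifications chosen for the example; CVFDT actually keeps per-attribute-value, per-class counts at each node):

```python
# Undo an old example's contribution along its path, skipping nodes that
# were created after the example was first incorporated.
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    node_id: int                        # unique, monotonically increasing ID
    split_attr: Optional[int] = None    # attribute index tested here; None at a leaf
    children: dict = field(default_factory=dict)     # attribute value -> child Node
    alternates: list = field(default_factory=list)   # alternate subtrees at this node
    class_counts: Counter = field(default_factory=Counter)

def forget_example(node, x, y, max_id):
    """Decrement the old example's statistics at every node it reaches whose
    ID is <= max_id; younger nodes never counted it, so they are skipped."""
    if node is None or node.node_id > max_id:
        return
    node.class_counts[y] -= 1                        # undo the old increment
    for alt in node.alternates:                      # alternate trees saw it too
        forget_example(alt, x, y, max_id)
    if node.split_attr is not None:                  # follow the split downward
        forget_example(node.children.get(x[node.split_attr]), x, y, max_id)
```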

CheckSplitValidity
- Periodically scans the internal nodes of HT.
- Starts a new alternate tree when a new winning attribute is found.
- Uses tighter criteria to avoid excessive alternate-tree creation, and limits the total number of alternate trees.

Smoothly Adjusting to Concept Drift
- Alternate trees are grown the same way HT is.
- Periodically, each node with non-empty alternate trees enters a testing mode: M training examples are used to compare the accuracy of the node's subtree with that of its alternate trees.
- Prune alternate trees whose accuracy does not increase over time; replace the subtree if an alternate tree is more accurate.
[Figure: an example replacement, with candidate tests "Married?", "Car Type = Sports Car?", "Age < 30?", and "Experience < 1 year?".]

Adjusting to Concept Drift (2)
Dynamically change the window size:
- Shrink the window when many nodes become questionable or the data rate changes rapidly.
- Increase the window size when few nodes are questionable.

CVFDT (cont.)
[Figure: experimental results; not recoverable from the transcription.]

Possible Readings
- A similarity-based approach for data stream classification
- Adaptive random forests for evolving data stream classification
- Online data stream classification with incremental semi-supervised learning
- Online neural network model for non-stationary and imbalanced data stream classification