Data Analysis in the Internet of Things
Chapter 3: Classification in Data Streams
w3.gazi.edu.tr/~suatozdemir
Supervised vs. Unsupervised Learning (1)
Supervised learning (classification)
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the classes to which they belong
- New data is classified based on the model built from the training set

Training data with class labels:

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

[Figure: a model is learned from the training instances, then used to predict Positive/Negative for the test instances]
Supervised vs. Unsupervised Learning (2)
Unsupervised learning (clustering)
- The class labels of the training data are unknown
- Given a set of observations or measurements, establish the possible existence of classes or clusters in the data
Prediction Problems: Classification vs. Numeric Prediction
Classification
- Predicts categorical class labels (discrete or nominal)
- Constructs a model from the training set and the class labels (the values of a classifying attribute), then uses it to classify new data
Numeric prediction
- Models continuous-valued functions (i.e., predicts unknown or missing values)
Typical applications of classification
- Credit/loan approval
- Medical diagnosis: is a tumor cancerous or benign?
- Fraud detection: is a transaction fraudulent?
- Web page categorization: which category does a page belong to?
Classification Model Construction, Validation and Testing
Model construction
- Each sample is assumed to belong to a predefined class (given by its class label)
- The set of samples used for model construction is the training set
- The model is represented as decision trees, rules, mathematical formulas, or other forms
Model validation and testing
- Test: estimate the accuracy of the model
- The known label of each test sample is compared with the model's prediction
- Accuracy: the percentage of test-set samples that the model classifies correctly (see the sketch below)
- The test set is independent of the training set
- Validation: if a held-out set is used to select or refine models, it is called a validation (or development) set
Model deployment
- If the accuracy is acceptable, use the model to classify new data
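As a concrete illustration (a sketch, not from the slides): computing accuracy over a held-out test set, with model standing for any trained classifier exposed as a callable.

```python
def accuracy(model, test_set):
    """Fraction of test samples whose known label matches the model's prediction."""
    correct = sum(1 for x, y in test_set if model(x) == y)
    return correct / len(test_set)

# Hypothetical usage: a toy "model" and a few labeled test samples
toy_model = lambda x: "yes" if x >= 0 else "no"
print(accuracy(toy_model, [(3, "yes"), (-1, "no"), (2, "no")]))  # 0.666...
```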
Decision Tree Induction: An Example
Training data set: Who buys a computer? (the buys_computer table shown earlier)
Decision tree construction: a top-down, recursive, divide-and-conquer process
Resulting tree:

age?
  <=30   -> student?
              no  -> Don't buy
              yes -> Buy
  31..40 -> Buy
  >40    -> credit_rating?
              excellent -> Don't buy
              fair      -> Buy

How do we decide which attribute to test at each node?
From Entropy to Info Gain: A Brief Review of Entropy
Entropy (information theory)
- A measure of the uncertainty associated with a random variable
- Calculation: for a discrete random variable Y taking m distinct values {y_1, y_2, ..., y_m} with p_i = P(Y = y_i):
  H(Y) = -\sum_{i=1}^{m} p_i \log_2 p_i
Interpretation
- Higher entropy: higher uncertainty
- Lower entropy: lower uncertainty
Conditional entropy
- H(Y \mid X) = \sum_{x} P(X = x) \, H(Y \mid X = x)
[Figure: binary entropy (m = 2) as a function of p, maximal at p = 0.5]
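To make the definitions concrete, here is a minimal Python sketch (not from the slides) that computes entropy and information gain over the buys_computer training data shown earlier; running it reproduces the classic result that age has the highest gain and is therefore the root split.

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_i p_i * log2(p_i) over the label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    """Gain(Y, X) = H(Y) - H(Y | X) for a discrete attribute."""
    n = len(rows)
    cond = 0.0
    for value in set(row[attr_index] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attr_index] == value]
        cond += (len(subset) / n) * entropy(subset)
    return entropy(labels) - cond

# Columns: age, income, student, credit_rating; labels: buys_computer
rows = [
    ("<=30", "high", "no", "fair"), ("<=30", "high", "no", "excellent"),
    ("31..40", "high", "no", "fair"), (">40", "medium", "no", "fair"),
    (">40", "low", "yes", "fair"), (">40", "low", "yes", "excellent"),
    ("31..40", "low", "yes", "excellent"), ("<=30", "medium", "no", "fair"),
    ("<=30", "low", "yes", "fair"), (">40", "medium", "yes", "fair"),
    ("<=30", "medium", "yes", "excellent"), ("31..40", "medium", "no", "excellent"),
    ("31..40", "high", "yes", "fair"), (">40", "medium", "no", "excellent"),
]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

for i, name in enumerate(["age", "income", "student", "credit_rating"]):
    print(f"Gain({name}) = {info_gain(rows, i, labels):.3f}")
# age wins with gain ~0.246
```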
Classification Models
- Neural networks
- Statistical models: linear/quadratic discriminants
- Decision trees
- Genetic models
Why Decision Tree Models?
- Relatively fast compared to other classification models
- Often similar, and sometimes better, accuracy compared to other models
- Simple and easy to understand
- Can be converted into simple, easy-to-understand classification rules
Data Streams
- Data arrive continuously, possibly very fast
- Data size is extremely large, potentially infinite
- We cannot possibly store all the data
Issues
- Disk/memory-resident algorithms require the data to fit on disk or in memory, and may need to scan the data multiple times
- We need algorithms that read each data item only once and spend only a small amount of time processing it: incremental learning methods
Goal
- Design decision tree learners that read each example at most once and use a small constant time to process it
Incremental Learning Methods
Previous incremental learning methods
- Some are efficient, but do not produce accurate models
- Some produce accurate models, but are very inefficient
An algorithm that is both efficient and accurate: VFDT, the Hoeffding tree algorithm
- Given a stream of examples, use the first ones to choose the root attribute
- Once the root attribute is chosen, pass successive examples down to the corresponding leaves, use them to choose the attributes there, and so on recursively
VFDT: The Hoeffding Tree Algorithm
- Calculate the information gain of the attributes and determine the best two
- At each node, check a split condition
- If the condition is satisfied, create child nodes based on the test at the node
- If not, stream in more examples and repeat the calculation until the condition is satisfied
VFDT: The Hoeffding Tree Algorithm (cont.)
- The algorithm constructs the tree using the same procedure as ID3: it calculates the information gain of the attributes and determines the best two
- It is sufficient to consider only a small subset of the training examples that pass through a node to find the best split
- For example, use the first few examples to choose the split at the root
- Problem: how many examples are necessary? The Hoeffding bound!
- Use the Hoeffding bound to decide how many examples are enough at each node
Hoeffding Bound
- Independent of the probability distribution generating the observations
- Consider a real-valued random variable r whose range is R, and n independent observations of r with observed mean \bar{r}
- The Hoeffding bound states that P(r \ge \bar{r} - \epsilon) \ge 1 - \delta, where r is the true mean, \delta is a small number, and
  \epsilon = \sqrt{\dfrac{R^2 \ln(1/\delta)}{2n}}
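A minimal sketch of the bound as a function; the choice R = 1 below matches information gain with binary labels and is an assumption for the example, not part of the slide.

```python
import math

def hoeffding_epsilon(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)): with probability >= 1 - delta,
    the true mean is within epsilon below the observed mean of n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# epsilon shrinks as more examples n are observed at a leaf
for n in (100, 1000, 10000):
    print(n, round(hoeffding_epsilon(1.0, 1e-6, n), 4))
```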
Hoeffding Bound (cont.)
- Let G(X_i) be the heuristic measure used to choose the split, where X_i is a discrete attribute
- Let X_a and X_b be the attributes with the highest and second-highest observed \bar{G}(\cdot) after seeing n examples
- Let \Delta\bar{G} = \bar{G}(X_a) - \bar{G}(X_b) \ge 0
Hoeffding Bound (cont.)
- Given a desired \delta, if \Delta\bar{G} > \epsilon, the Hoeffding bound guarantees that P(\Delta G \ge \Delta\bar{G} - \epsilon > 0) \ge 1 - \delta
- \Delta G > 0 \iff G(X_a) - G(X_b) > 0 \iff G(X_a) > G(X_b)
- So X_a is the best attribute to split on with probability 1 - \delta
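Putting the pieces together, a hypothetical try_split helper for a leaf (the function name, dict layout, and numbers are illustrative, not the paper's pseudocode): split only when the observed gap between the two best attributes exceeds \epsilon.

```python
import math

def try_split(gains, value_range, delta, n):
    """Decide whether to split a leaf after n examples.
    gains: dict mapping attribute name -> observed average G() at this leaf."""
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    (best_attr, g_a), (_, g_b) = ranked[0], ranked[1]
    eps = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
    if g_a - g_b > eps:
        # With probability 1 - delta the true gap is positive, so best_attr
        # really is the best attribute: it is safe to split now.
        return best_attr
    return None  # not confident yet; keep accumulating examples

print(try_split({"age": 0.25, "income": 0.03, "student": 0.15}, 1.0, 1e-6, 5000))
# -> 'age' (gap 0.10 exceeds eps ~0.037)
```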
VFDT Example
[Figure: examples from the data stream accumulate at a leaf below the root test Age<30?; once \bar{G}(Car Type) - \bar{G}(Gender) > \epsilon, the leaf is split on the test Car Type = Sports Car?]
VFDT: Issues
- VFDT assumes the training data is a sample drawn from a stationary distribution
- Most large databases and data streams violate this assumption
- Concept drift: data is generated by a time-changing concept function, e.g.
  - Seasonal effects
  - Economic cycles
Goal
- Mine continuously changing data streams
- Scale well
VFDT: Issues (cont.)
Common approach: when a new example arrives, reapply a traditional learner to a sliding window of the w most recent examples (see the sketch below)
- Sensitive to the window size
  - If w is small relative to the concept shift rate, the model reflects the current concept
  - But too small a w may leave insufficient examples to learn the concept
- If examples arrive at a rapid rate or the concept changes quickly, the computational cost of reapplying the learner may be prohibitively high
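A sketch of this naive sliding-window approach for contrast (retrain is a stand-in for any batch learner; the point is that it runs once per arriving example):

```python
from collections import deque

def sliding_window_learner(stream, w, retrain):
    """Keep the w most recent examples and retrain a batch model from
    scratch on every arrival. Accurate under drift if w is well tuned, but
    the per-example retraining cost is what motivates CVFDT instead."""
    window = deque(maxlen=w)   # the oldest example falls off automatically
    model = None
    for example in stream:
        window.append(example)
        model = retrain(list(window))   # full relearning on each arrival
    return model
```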
CVFDT
CVFDT (Concept-adapting Very Fast Decision Tree learner)
- Extends VFDT
- Maintains VFDT's speed and accuracy
- Detects and responds to changes in the example-generating process
CVFDT: Observations
- With a time-changing concept, the current splitting attribute of some nodes may no longer be the best
- An outdated subtree may still be better than the best single leaf, particularly if it is near the root
- So: grow an alternate subtree with the new best attribute at its root when the old attribute seems out of date
- Periodically use a batch of samples to evaluate the quality of the trees
- Replace the old subtree when the alternate becomes more accurate
CVFDT: Algorithm
- Alternate trees for each node in HT start empty
- Process examples from the stream indefinitely; for each example (x, y):
  - Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes that (x, y) passes through
  - Add (x, y) to the sliding window of examples
  - If the sliding window overflows, remove the oldest example and forget its effect
  - CVFDTGrow
  - If f examples have been seen since the last check of the alternate trees, run CheckSplitValidity
- Return HT (a schematic rendering of this loop follows)
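A schematic Python rendering of the driver loop above (a sketch; the grow, forget, and check_validity callables stand for the CVFDTGrow, forgetExample, and CheckSplitValidity routines detailed on the following slides, and the default window_size and f values are assumptions):

```python
from collections import deque

def cvfdt(stream, ht, grow, forget, check_validity,
          window_size=100000, f=10000):
    """Driver loop: HT is the main Hoeffding tree; alternate trees hang off
    its nodes and are updated inside grow/forget/check_validity."""
    window = deque()
    since_check = 0
    for x, y in stream:                   # process examples indefinitely
        window.append((x, y))
        if len(window) > window_size:     # sliding window overflow:
            forget(ht, window.popleft())  # undo the oldest example's effect
        grow(ht, x, y)                    # update stats, possibly split
        since_check += 1
        if since_check >= f:              # every f examples,
            check_validity(ht)            # re-evaluate internal splits
            since_check = 0
    return ht
```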
CVFDT algorithm: processing each example

Read new example
  -> pass the example down to the leaves
  -> add the example to the sliding window
  -> window overflow? if so, forget the oldest example
  -> CVFDTGrow
  -> f examples since the last check? if so, CheckSplitValidity
CVFDTGrow
- For each node reached by the example in HT:
  - Increment the corresponding statistics at the node
  - For each alternate tree T_alt of the node, call CVFDTGrow on T_alt
- If enough examples have been seen at the leaf in HT that the example reaches:
  - Choose the attribute with the highest average value of the attribute evaluation measure (information gain or Gini index)
  - If the best attribute is not the null attribute, create a child node for each possible value of this attribute
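A sketch of this procedure in Python, assuming a hypothetical Node interface (update_stats, alternate_trees, is_leaf, examples_seen, best_split_attribute, split_on, child_for); none of these names come from the paper.

```python
def cvfdt_grow(node, x, y, n_min=200):
    """Sketch of CVFDTGrow: update statistics along the path of (x, y),
    recurse into alternate trees, and attempt a split at the reached leaf."""
    node.update_stats(x, y)              # sufficient statistics at every node
    for alt in node.alternate_trees:     # alternate trees learn in parallel
        cvfdt_grow(alt, x, y, n_min)
    if node.is_leaf():
        if node.examples_seen % n_min == 0:     # re-check only periodically
            best = node.best_split_attribute()  # info gain or Gini winner
            if best is not None:                # None plays the null attribute
                node.split_on(best)             # one child per attribute value
    else:
        cvfdt_grow(node.child_for(x), x, y, n_min)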
Forgetting an Old Example
- Maintain sufficient statistics at every node in HT to monitor the validity of its previous decisions (VFDT maintains such statistics only at leaves)
- HT may have grown or changed since the example was initially incorporated
- Assign each node a unique, monotonically increasing ID at creation time
forgetExample(HT, example, maxID)
- For each node reached by the old example whose node ID is no larger than the maximum leaf ID the example reaches:
  - Decrement the corresponding statistics at the node
  - For each alternate tree T_alt of the node, call forgetExample(T_alt, example, maxID)
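A sketch of forgetExample under the same hypothetical Node interface as above (node_id, decrement_stats, alternate_trees, is_leaf, child_for are assumed names):

```python
def forget_example(node, x, y, max_leaf_id):
    """Sketch of forgetExample: decrement the statistics this example once
    incremented, but only at nodes that already existed when it arrived
    (node IDs increase monotonically with creation time)."""
    if node.node_id > max_leaf_id:
        return                            # node was created after this example
    node.decrement_stats(x, y)            # undo the example's contribution
    for alt in node.alternate_trees:
        forget_example(alt, x, y, max_leaf_id)
    if not node.is_leaf():
        forget_example(node.child_for(x), x, y, max_leaf_id)
```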
CheckSplitValidity
- Periodically scans the internal nodes of HT
- Starts a new alternate tree when a new winning attribute is found
- Uses tighter criteria to avoid excessive alternate tree creation
- Limits the total number of alternate trees
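A sketch of the scan, again against the assumed Node interface (split_attribute, new_subtree, children are illustrative names, and the cap of 5 alternates is an assumption, not the paper's value):

```python
def check_split_validity(node, max_alternates=5):
    """Sketch of CheckSplitValidity: scan internal nodes and, where the
    current split attribute has lost to a new winner, start an alternate
    subtree rooted at that winner (capped to avoid excessive creation)."""
    if node.is_leaf():
        return
    new_best = node.best_split_attribute()    # recomputed from current stats
    if (new_best is not None
            and new_best != node.split_attribute
            and len(node.alternate_trees) < max_alternates):
        node.alternate_trees.append(node.new_subtree(new_best))
    for child in node.children:
        check_split_validity(child, max_alternates)
```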
Smoothly Adjusting to Concept Drift
- Alternate trees are grown the same way HT is
- Periodically, each node with non-empty alternate trees enters a testing mode: M training examples are used to compare accuracies
- Alternate trees whose accuracy does not increase over time are pruned
- If an alternate tree is more accurate, it replaces the current subtree
[Figure: a subtree rooted at Age<30? being challenged by alternate subtrees rooted at Married?, Car Type = Sports Car?, and Experience < 1 year?]
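A sketch of the testing mode under the same assumed interface (predict, replace_subtree_with, and best_acc_seen are illustrative names):

```python
def test_and_replace(node, recent_examples):
    """Compare each alternate tree's accuracy with the current subtree on
    the M most recent examples; promote a better alternate, and prune
    alternates whose accuracy has stopped increasing."""
    def acc(tree):
        return sum(1 for x, y in recent_examples
                   if tree.predict(x) == y) / len(recent_examples)

    current = acc(node)
    for alt in list(node.alternate_trees):
        alt_acc = acc(alt)
        if alt_acc > current:
            node.replace_subtree_with(alt)    # the alternate takes over
            return
        if alt_acc <= alt.best_acc_seen:      # accuracy not increasing:
            node.alternate_trees.remove(alt)  # prune the stale alternate
        else:
            alt.best_acc_seen = alt_acc
```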
Adjusting to Concept Drift (2)
- Dynamically change the window size
  - Shrink the window when many nodes become questionable or the data rate changes rapidly
  - Increase the window size when few nodes are questionable
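A self-contained sketch of this heuristic (the 20% and 1% thresholds and the size bounds are illustrative assumptions, not values from the paper):

```python
def adjust_window(window_size, questionable_nodes, total_nodes,
                  min_size=1000, max_size=100000):
    """Shrink the window when many splits look questionable (likely drift),
    grow it when the concept seems stable."""
    frac = questionable_nodes / max(total_nodes, 1)
    if frac > 0.20:
        return max(min_size, window_size // 2)   # likely drift: forget faster
    if frac < 0.01:
        return min(max_size, window_size * 2)    # stable: keep more history
    return window_size

print(adjust_window(50000, questionable_nodes=30, total_nodes=100))  # 25000
```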
CVFDT (cont.)
[Figure: experimental results]
Possible Reading
- A similarity-based approach for data stream classification
- Adaptive random forests for evolving data stream classification
- Online data stream classification with incremental semi-supervised learning
- Online neural network model for non-stationary and imbalanced data stream classification