New ensemble methods for evolving data streams

Similar documents
MOA: {M}assive {O}nline {A}nalysis.

Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data

Accurate Ensembles for Data Streams: Combining Restricted Hoeffding Trees using Stacking

Tutorial 1. Introduction to MOA

Efficient Data Stream Classification via Probabilistic Adaptive Windows

Massive data mining using Bayesian approach

CD-MOA: Change Detection Framework for Massive Online Analysis

Novel Class Detection Using RBF SVM Kernel from Feature Evolving Data Streams

Cache Hierarchy Inspired Compression: a Novel Architecture for Data Streams

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER

Adaptive Parameter-free Learning from Evolving Data Streams

A Practical Approach

An Adaptive Framework for Multistream Classification

1 INTRODUCTION 2 RELATED WORK. Usha.B.P ¹, Sushmitha.J², Dr Prashanth C M³

Lecture 7. Data Stream Mining. Building decision trees

WEKA: Practical Machine Learning Tools and Techniques in Java. Seminar A.I. Tools WS 2006/07 Rossen Dimov

WEKA A Machine Learning Workbench for Data Mining

On evaluating stream learning algorithms

Use of Ensembles of Fourier Spectra in Capturing Recurrent Concepts in Data Streams

Self-Adaptive Ensemble Classifier for Handling Complex Concept Drift

Towards Real-Time Feature Tracking Technique using Adaptive Micro-Clusters

Mining Contextual Preferences in Data Streams

Random Forests of Very Fast Decision Trees on GPU for Mining Evolving Big Data Streams

CHAPTER 6 EXPERIMENTS

Mining Recurrent Concepts in Data Streams using the Discrete Fourier Transform

Low-latency Multi-threaded Ensemble Learning for Dynamic Big Data Streams

arxiv: v1 [cs.lg] 8 Sep 2017

IMPACT OF ONLINE STREAM CLUSTERING IN BANDWIDTH- CONSTRAINED MOBILE VIDEO ENVIRONMENT

Naïve Bayes for text classification

TUBE: Command Line Program Calls

User Guide Written By Yasser EL-Manzalawy

A Novel Ensemble Classification for Data Streams with Class Imbalance and Concept Drift

Change Detection in Categorical Evolving Data Streams

Efficient Multi-label Classification

PAPER An Efficient Concept Drift Detection Method for Streaming Data under Limited Labeling

More Learning. Ensembles Bayes Rule Neural Nets K-means Clustering EM Clustering WEKA

WEKA: A Dynamic Software Suit for Machine Learning & Exploratory Data Analysis

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers

Dr. Prof. El-Bahlul Emhemed Fgee Supervisor, Computer Department, Libyan Academy, Libya

Basic Concepts Weka Workbench and its terminology

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Optimizing the Computation of the Fourier Spectrum from Decision Trees

IEE 520 Data Mining. Project Report. Shilpa Madhavan Shinde

Overview. Non-Parametrics Models Definitions KNN. Ensemble Methods Definitions, Examples Random Forests. Clustering. k-means Clustering 2 / 8

BoostVHT: Boosting Distributed Streaming Decision Trees

Weka ( )

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER

Learning from Time-Changing Data with Adaptive Windowing

A Comparative Study of Selected Classification Algorithms of Data Mining

ONLINE ALGORITHMS FOR HANDLING DATA STREAMS

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Hyperspectral Image Change Detection Using Hopfield Neural Network

International Journal of Emerging Technologies in Computational and Applied Sciences (IJETCAS)

The Impacts of Data Stream Mining on Real-Time Business Intelligence

Prototype-based Learning on Concept-drifting Data Streams

Data Mining With Weka A Short Tutorial

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

DATA ANALYSIS WITH WEKA. Author: Nagamani Mutteni Asst.Professor MERI

Data Stream-based Intrusion Detection System for Advanced Metering Infrastructure in Smart Grid: A Feasibility Study

A Two-level Learning Method for Generalized Multi-instance Problems

ECLT 5810 Evaluation of Classification Quality

WEKA homepage.

Comparative Study on Classification Meta Algorithms

Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction

KNN Classifier with Self Adjusting Memory for Heterogeneous Concept Drift

Détection de concept drift

Using Machine Learning to Identify Security Issues in Open-Source Libraries. Asankhaya Sharma Yaqin Zhou SourceClear

Recovery Analysis for Adaptive Learning from Non-Stationary Data Streams: Experimental Design and Case Study

Memory Models for Incremental Learning Architectures. Viktor Losing, Heiko Wersing and Barbara Hammer

International Journal of Advanced Research in Computer Science and Software Engineering

Chapter 8 The C 4.5*stat algorithm

Prototyping DM Techniques with WEKA and YALE Open-Source Software

Individualized Error Estimation for Classification and Regression Models

A Wrapper for Reweighting Training Instances for Handling Imbalanced Data Sets

Effective Classifiers for Detecting Objects

Fine-Grained, Unsupervised, Context-based

Machine Learning Techniques for Data Mining

Analyzing HTTP requests for web intrusion detection

Data Stream-based Intrusion Detection System for Advanced Metering Infrastructure in Smart Grid: A Feasibility Study

Review on Gaussian Estimation Based Decision Trees for Data Streams Mining Miss. Poonam M Jagdale 1, Asst. Prof. Devendra P Gadekar 2

Incremental Learning Algorithm for Dynamic Data Streams

Contents. ACE Presentation. Comparison with existing frameworks. Technical aspects. ACE 2.0 and future work. 24 October 2009 ACE 2

Lab Exercise Three Classification with WEKA Explorer

A Data Mining Approach for Intrusion Detection System Using Boosted Decision Tree Approach

Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels

A Study of Random Forest Algorithm with implemetation using Weka

Ensembles of Nested Dichotomies with Multiple Subset Evaluation

Optimized Very Fast Decision Tree with Balanced Classification Accuracy and Compact Tree Size

StreamDM: Advanced Data Mining in Spark Streaming

Solving Problems of Imperfect Data Streams by Incremetnal Decision Trees

Performance Analysis of Data Mining Classification Techniques

CHAPTER 4 METHODOLOGY AND TOOLS

Community edition(open-source) Enterprise edition

Data Mining Based Online Intrusion Detection

Heuristic Updatable Weighted Random Subspaces for Non-stationary Environments

FPSMining: A Fast Algorithm for Mining User Preferences in Data Streams

Classification of Concept-Drifting Data Streams using Optimized Genetic Algorithm

WEKA Explorer User Guide for Version 3-4

Jue Wang (Joyce) Department of Computer Science, University of Massachusetts, Boston Feb Outline

Weka: Practical machine learning tools and techniques with Java implementations

Transcription:

New ensemble methods for evolving data streams A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavaldà Laboratory for Relational Algorithmics, Complexity and Learning LARCA UPC-Barcelona Tech, Catalonia University of Waikato Hamilton, New Zealand Paris, 29 June 2009 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2009

New Ensemble Methods For Evolving Data Streams Outline a new experimental data stream framework for studying concept drift two new variants of Bagging: ADWIN Bagging Adaptive-Size Hoeffding Tree (ASHT) Bagging. an evaluation study on synthetic and real-world datasets 2 / 25

Outline 1 MOA: Massive Online Analysis 2 Concept Drift Framework 3 New Ensemble Methods 4 Empirical evaluation 3 / 25

What is MOA? {M}assive {O}nline {A}nalysis is a framework for online learning from data streams. It is closely related to WEKA It includes a collection of offline and online as well as tools for evaluation: boosting and bagging Hoeffding Trees with and without Naïve Bayes classifiers at the leaves. 4 / 25

WEKA Waikato Environment for Knowledge Analysis Collection of state-of-the-art machine learning algorithms and data processing tools implemented in Java Released under the GPL Support for the whole process of experimental data mining Preparation of input data Statistical evaluation of learning schemes Visualization of input data and the result of learning Used for education, research and applications Complements Data Mining by Witten & Frank 5 / 25

WEKA: the bird 6 / 25

MOA: the bird The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct. 7 / 25

MOA: the bird The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct. 7 / 25

MOA: the bird The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct. 7 / 25

Data stream classification cycle 1 Process an example at a time, and inspect it only once (at most) 2 Use a limited amount of memory 3 Work in a limited amount of time 4 Be ready to predict at any point 8 / 25

Experimental setting Evaluation procedures for Data Streams Holdout Interleaved Test-Then-Train Environments Sensor Network: 100Kb Handheld Computer: 32 Mb Server: 400 Mb 9 / 25

Experimental setting Data Sources Random Tree Generator Random RBF Generator LED Generator Waveform Generator Function Generator 9 / 25

Experimental setting Classifiers Naive Bayes Decision stumps Hoeffding Tree Hoeffding Option Tree Bagging and Boosting Prediction strategies Majority class Naive Bayes Leaves Adaptive Hybrid 9 / 25

Hoeffding Option Tree Hoeffding Option Trees Regular Hoeffding tree containing additional option nodes that allow several tests to be applied, leading to multiple Hoeffding trees as separate paths. 10 / 25

Design of a MOA classifier void resetlearningimpl () void trainoninstanceimpl (Instance inst) double[] getvotesforinstance (Instance i) void getmodeldescription (StringBuilder out, int indent) 11 / 25

Outline 1 MOA: Massive Online Analysis 2 Concept Drift Framework 3 New Ensemble Methods 4 Empirical evaluation 12 / 25

Extension to Evolving Data Streams New Evolving Data Stream Extensions New Stream Generators New UNION of Streams New Classifiers 13 / 25

Extension to Evolving Data Streams New Evolving Data Stream Generators Random RBF with Drift LED with Drift Waveform with Drift Hyperplane SEA Generator STAGGER Generator 13 / 25

Concept Drift Framework f (t) 1 f (t) 0.5 α α t t 0 W Definition Given two data streams a, b, we define c = a W t 0 b as the data stream built joining the two data streams a and b Pr[c(t) = b(t)] = 1/(1 + e 4(t t0)/w ). Pr[c(t) = a(t)] = 1 Pr[c(t) = b(t)] 14 / 25

Concept Drift Framework f (t) 1 f (t) 0.5 α α t t 0 W Example (((a W 0 t 0 b) W 1 t 1 c) W 2 t 2 d)... (((SEA 9 W t 0 SEA 8 ) W 2t 0 SEA 7 ) W 3t 0 SEA 9.5 ) CovPokElec = (CoverType 5,000 581,012 Poker) 5,000 1,000,000 ELEC2 14 / 25

Extension to Evolving Data Streams New Evolving Data Stream Classifiers Adaptive Hoeffding Option Tree DDM Hoeffding Tree EDDM Hoeffding Tree OCBoost FLBoost 15 / 25

Outline 1 MOA: Massive Online Analysis 2 Concept Drift Framework 3 New Ensemble Methods 4 Empirical evaluation 16 / 25

Ensemble Methods New ensemble methods: Adaptive-Size Hoeffding Tree bagging: each tree has a maximum size after one node splits, it deletes some nodes to reduce its size if the size of the tree is higher than the maximum value ADWIN bagging: When a change is detected, the worst classifier is removed and a new classifier is added. 17 / 25

Adaptive-Size Hoeffding Tree T 1 T 2 T 3 T 4 Ensemble of trees of different size smaller trees adapt more quickly to changes, larger trees do better during periods with little change diversity 18 / 25

Adaptive-Size Hoeffding Tree 0,3 0,28 0,29 0,28 0,275 0,27 0,27 0,26 Error 0,25 Error 0,265 0,24 0,23 0,26 0,22 0,255 0,21 0,2 0 0,1 0,2 0,3 0,4 0,5 0,6 0,25 0,1 0,12 0,14 0,16 0,18 0,2 0,22 0,24 0,26 0,28 0,3 Kappa Kappa Figure: Kappa-Error diagrams for ASHT bagging (left) and bagging (right) on dataset RandomRBF with drift, plotting 90 pairs of classifiers. 19 / 25

ADWIN Bagging ADWIN An adaptive sliding window whose size is recomputed online according to the rate of change observed. ADWIN has rigorous guarantees (theorems) On ratio of false positives and negatives On the relation of the size of the current window and change rates ADWIN Bagging When a change is detected, the worst classifier is removed and a new classifier is added. 20 / 25

Outline 1 MOA: Massive Online Analysis 2 Concept Drift Framework 3 New Ensemble Methods 4 Empirical evaluation 21 / 25

Empirical evaluation Figure: Accuracy and size on dataset LED with three concept drifts. 22 / 25

Empirical evaluation SEA W = 50 Time Acc. Mem. NaiveBayes 5.32 83.87 0.00 HT 6.96 84.89 0.34 HT DDM 8.30 88.27 0.17 HT EDDM 8.56 87.97 0.18 HOT5 11.46 84.92 0.38 HOT50 22.54 85.20 0.84 AdaHOT5 11.46 84.94 0.38 AdaHOT50 22.70 85.35 0.86 Bag10 HT 31.06 85.45 3.38 BagADWIN 10 HT 54.51 88.58 1.90 Bag10 ASHT W+R 33.20 88.89 0.84 Bag5 ASHT W+R 19.78 88.55 0.01 OzaBoost 39.40 86.28 4.03 OCBoost 59.12 87.21 2.41 23 / 25

Empirical evaluation SEA W = 50000 Time Acc. Mem. NaiveBayes 5.52 83.87 0.00 HT 7.20 84.87 0.33 HT DDM 7.88 88.07 0.16 HT EDDM 8.52 87.64 0.06 HOT5 12.46 84.91 0.37 HOT50 22.78 85.18 0.83 AdaHOT5 12.48 84.94 0.38 AdaHOT50 22.80 85.30 0.84 Bag10 HT 30.88 85.34 3.36 BagADWIN 10 HT 53.15 88.53 0.88 Bag10 ASHT W+R 33.56 88.30 0.84 Bag5 ASHT W+R 20.00 87.99 0.05 OzaBoost 39.97 86.17 4.00 OCBoost 60.33 86.97 2.44 24 / 25

Summary http://www.cs.waikato.ac.nz/ abifet/moa/ Conclusions Extension of MOA to evolving data streams MOA is easy to use and extend New ensemble bagging methods: Adaptive-Size Hoeffding Tree bagging ADWIN bagging Future Work Extend MOA to more data mining and learning methods. 25 / 25