From Ensemble Methods to Comprehensible Models

From Ensemble Methods to Comprehensible Models Cèsar Ferri, José Hernández-Orallo, M.José Ramírez-Quintana {cferri, jorallo, mramirez}@dsic.upv.es Dep. de Sistemes Informàtics i Computació, Universitat Politècnica de València, Valencia, Spain The 5th International Conference on Discovery Science Lübeck, 24-26 November 2002

Introduction. Machine learning techniques that construct a model/hypothesis (e.g. ANN, DT, SVM, ...) are usually devoted to obtaining one single model that is as accurate as possible (close to the target model); other (presumably less accurate) models are discarded. An old alternative has recently been popularised: every consistent hypothesis should be taken into account. But how?

Ensemble Methods (1/3). Ensemble methods (multi-classifiers) generate multiple (and possibly heterogeneous) models and then combine them through voting or other fusion methods. For instance, if ten classifiers predict a, a, b, c, a, c, c, c, c, b for the same example, majority voting yields the combined prediction c. Ensembles give much better results (in terms of accuracy) than single models when the number and variety of classifiers is high.
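
As a small illustration of the fusion step (a sketch, not code from the paper), majority voting over the member predictions can be written as:

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse the class labels predicted by the ensemble members by majority voting."""
    return Counter(predictions).most_common(1)[0][0]

# The example above: ten classifiers predict these labels for one instance.
print(majority_vote(["a", "a", "b", "c", "a", "c", "c", "c", "c", "b"]))  # -> c
```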

Ensemble Methods (2/3). Ensemble methods (multi-classifiers) admit different topologies: simple combination, stacking, cascading, ... [Diagrams: in the simple combination, the data (attributes a_1, a_2, ..., a_m) feed classifiers C_1 (decision tree), C_2 (neural net), ..., C_n (SVM), whose predictions are fused into a combined prediction; in the stacking topology, the outputs of the first-level classifiers are fed, together with the attributes, to further classifiers before the final combination.] Different generation policies: boosting, bagging, randomisation, ... Different fusion methods: majority voting, average, maximum, ...

Ensemble Methods (3/3). Main drawbacks: Computational cost: huge amounts of memory and time are required to obtain and store the set of hypotheses (the ensemble). Throughput: the application of the combined model is slow. Comprehensibility: the combined model behaves like a black box. Solving these drawbacks would boost the applicability of ensemble methods in machine learning applications.

Archetype (1/2). The problem is to reduce the combination of m hypotheses to one single hypothesis without losing too much accuracy. One possibility is to select one hypothesis according to its semantic similarity to the combined hypothesis.

Archetype (2/2). The intuitive idea is to select the component of the ensemble that is closest to the combined hypothesis.

Hypotheses Similarity. Measures of the similarity between hypotheses must be considered. Given two classifiers and an unlabelled dataset of n examples with C classes, we can construct a C x C confusion (contingency) matrix M, where M_{i,j} counts the examples assigned to class i by the first classifier and to class j by the second. From M we can derive, among others:

θ = (1/n) · Σ_{i=1}^{C} M_{i,i}    (observed agreement)

κ = (θ_1 − θ_2) / (1 − θ_2)    (agreement corrected by the agreement θ_2 expected by chance)

Q_f = Σ_{i=1}^{C} M_{i,i} / ( Σ_{i=1}^{C} M_{i,i} + Σ_{i,j=1, i≠j}^{C} M_{i,j} )    (agreements over agreements plus disagreements)
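
A minimal sketch (not taken from the paper or from SMILES) of how such a contingency matrix and the agreement-based measures can be computed from two classifiers' predictions on unlabelled data; the κ form below is the usual chance-corrected agreement:

```python
import numpy as np

def contingency_matrix(pred_a, pred_b, classes):
    """C x C matrix M where M[i, j] counts examples labelled class i by
    classifier A and class j by classifier B (no true labels needed)."""
    index = {c: k for k, c in enumerate(classes)}
    M = np.zeros((len(classes), len(classes)))
    for a, b in zip(pred_a, pred_b):
        M[index[a], index[b]] += 1
    return M

def theta(M):
    """Observed agreement: fraction of examples on which both classifiers coincide."""
    return np.trace(M) / M.sum()

def kappa(M):
    """Chance-corrected agreement (theta_1 - theta_2) / (1 - theta_2), where
    theta_2 is the agreement expected from the two classifiers' marginals."""
    n = M.sum()
    theta1 = np.trace(M) / n
    theta2 = float((M.sum(axis=1) * M.sum(axis=0)).sum()) / n ** 2
    return (theta1 - theta2) / (1 - theta2)

M = contingency_matrix(["a", "b", "a", "c"], ["a", "b", "b", "c"], classes=["a", "b", "c"])
print(theta(M), kappa(M))   # 0.75 and the corresponding kappa
```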

Random Invented Dataset. We require a dataset on which to establish the similarity between the hypotheses. We could employ a subset of the training dataset as a validation dataset. A better possibility is to randomly generate an unlabelled invented dataset.
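
The slides do not specify the exact generation scheme, so the following is only an assumed sketch: sample each numeric attribute uniformly within the range observed in the training data to obtain an unlabelled invented dataset.

```python
import numpy as np

def invented_dataset(X_train, size=10000, seed=None):
    """Unlabelled 'invented' dataset: each numeric attribute is sampled uniformly
    within the range seen in the training data (assumed scheme, for illustration)."""
    rng = np.random.default_rng(seed)
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    return rng.uniform(lo, hi, size=(size, X_train.shape[1]))
```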

Ensembles of Decision Trees. Decision tree: each internal node represents a condition; each leaf assigns a class to the examples that fall under it. Forest: several decision trees can be constructed. Many trees have common parts, and traditional ensemble methods repeat those parts, wasting memory and time.

Decision Tree Shared Ensembles. Shared ensemble: common parts are shared in an AND/OR tree structure. Construction space and time resources are highly reduced, and throughput is also improved by this technique.
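
As a rough sketch (the structure is assumed for illustration, not taken from SMILES), the shared ensemble can be represented with two node types: OR-nodes that hold the alternative splits explored for the same data partition, and AND-nodes that hold one concrete split whose branches must all be kept. The later sketches in this transcript reuse these classes.

```python
from dataclasses import dataclass, field

@dataclass
class OrNode:
    """Decision point shared by all trees that agree above it; it either holds
    alternative splits (AND-node children) or acts as a leaf via class_vector."""
    splits: list = field(default_factory=list)   # alternative AndNode children
    class_vector: list = None                    # class distribution when it is a leaf

@dataclass
class AndNode:
    """One concrete split (a test on an attribute); its children are the
    OR-nodes reached through each outcome of the test."""
    test: object = None
    children: list = field(default_factory=list)  # one OrNode per branch
```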

Decision Tree Shared Ensembles. Previous work: Multiple Decision Trees (Kwok & Carter 1990), Option Decision Trees (Buntine 1992). The AND/OR tree structure is populated (partially) breadth-first. Combination has been performed using weighted combination (Buntine 1992) and using majority voting (Kohavi & Kunz 1997). Different conclusions on where alternatives are especially beneficial: at the bottom of the tree (Buntine), the trees are quite similar, so the accuracy improvement is low; at the top of the tree (Kohavi & Kunz), the trees share few parts, so space resources are exhausted as in other non-shared ensembles (boosting, bagging, ...).

Multi-tree Construction. A new way of populating the AND/OR tree: the first tree is constructed in the classical eager way, and the discarded alternative splits are stored in a list. Then, repeat n times: once a tree is finished, the best alternative split (according to a wakening criterion) is chosen, and the branch is completed in the classical eager way. This amounts to a beam search and yields an anytime algorithm. Extensions and alternatives can appear at any part of the tree (top or bottom), the populating strategy can be easily changed, the fusion strategy can also be flexibly modified, and the desired size of the AND/OR tree can be specified quite precisely.
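
A hedged sketch of this population strategy, reusing the OrNode class above. Both grow and wakening_score are hypothetical helpers (not from the paper): grow(or_node, data) adds the next-best unused split under that OR-node, finishes the subtree below it in the classical eager way, and returns the (or_node, data) alternatives it discarded along the way.

```python
def build_multitree(data, n_alternatives, grow, wakening_score):
    """Anytime, beam-search-like population of the AND/OR tree (sketch)."""
    root = OrNode()
    pending = list(grow(root, data))                # first tree, built eagerly
    for _ in range(n_alternatives):                 # anytime loop: stop whenever desired
        if not pending:
            break
        idx = max(range(len(pending)),              # wake the best discarded alternative
                  key=lambda i: wakening_score(pending[i]))
        or_node, node_data = pending.pop(idx)
        pending += grow(or_node, node_data)         # finish that branch eagerly as well
    return root
```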

Fusion Methods. Combination on the multi-tree: the number of trees grows exponentially w.r.t. the number of alternative OR-nodes explored. Advantage: ensembles are now much bigger with only a constant increase of resources, so presumably the combination will be more accurate. Disadvantage: combination at the top is infeasible; global fusion techniques would be prohibitive.

Local Fusion. First stage, classical top-down: each example to be predicted is distributed top-down into many alternative leaves, and the example is labelled in each leaf (class vector). Second stage, the fusion goes bottom-up: whenever an OR-node is found, the (possibly) inconsistent predictions of its children are combined through a local fusion method. In this way, the fusion of millions or billions of trees can be performed efficiently.
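
A hedged sketch of the two-stage prediction over the multi-tree (the attribute names and the follow_branch helper are assumptions, not the paper's code): the example is routed top-down through every alternative split, and each OR-node fuses its children's class vectors locally on the way back up.

```python
import numpy as np

def predict_multitree(or_node, x, follow_branch, fuse=lambda vs: np.mean(vs, axis=0)):
    """Two-stage local fusion (sketch): leaves return their class vector; inner
    OR-nodes fuse the vectors coming from their alternative splits.
    follow_branch(and_node, x) must return the child OR-node that x falls into."""
    if not or_node.splits:                          # leaf: label the example here
        return np.asarray(or_node.class_vector, dtype=float)
    votes = [predict_multitree(follow_branch(and_node, x), x, follow_branch, fuse)
             for and_node in or_node.splits]        # top-down through every alternative
    return fuse(votes)                              # bottom-up: local fusion at the OR-node
```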

Selection of an Archetype (1/3). Due to the huge number of hypotheses, it may not be feasible to compute the similarity of each hypothesis with respect to the combined hypothesis. Instead, we can compute the similarity at each internal node and then extract the most similar solution.

Selection of an Archetype (2/3). 1. We label the invented dataset with the combined hypothesis. 2. We fill a contingency matrix M in each leaf of the multi-tree according to the labelled invented dataset. 3. We propagate the contingency matrices upwards: for the AND-nodes, we accumulate the contingency matrices of their children, M = M_1 + M_2 + ... + M_k; for the OR-nodes, we compute a similarity measure for each child, the M of the child with the highest similarity is copied upwards, and the selected child is marked.
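
A hedged sketch of this upward propagation, reusing the node classes and the kappa similarity from the earlier sketches; it assumes every leaf OR-node already stores M_leaf, its contingency matrix against the combined hypothesis on the invented examples that reach it, and records the chosen alternative in an ad-hoc marked_split attribute (both names are illustrative).

```python
def propagate(or_node, similarity):
    """Bottom-up propagation of contingency matrices (sketch).
    AND-nodes: sum the matrices of all their children (branches).
    OR-nodes: keep the matrix of the most similar child and mark that child."""
    if not or_node.splits:                           # leaf
        return or_node.M_leaf
    candidates = []
    for and_node in or_node.splits:
        M = sum(propagate(child, similarity) for child in and_node.children)
        candidates.append((similarity(M), M, and_node))
    score, best_M, best_and = max(candidates, key=lambda t: t[0])
    or_node.marked_split = best_and                  # the selected alternative is marked
    return best_M
```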

Selection of an Archetype (3/3)

Archetype Technique. 1. Multi-tree generation: a multi-tree is generated from the training dataset. 2. Invented dataset: an unlabelled invented dataset is created by random generation. 3. Multi-tree combination: the invented dataset is labelled by the combination of the shared ensemble. 4. Calculation and propagation of contingency matrices: a contingency matrix is assigned to each node of the multi-tree, using the labelled invented dataset and a similarity metric. 5. Selection of a solution: an archetype hypothesis is extracted from the multi-tree by descending through the marked nodes. A code sketch wiring these steps together follows.
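
As noted above, a sketch wiring the five steps together using the earlier sketches; every helper parameter (grow, wakening_score, follow_branch, fill_leaf_matrices) is a hypothetical stand-in, and the whole thing is illustrative rather than the SMILES implementation.

```python
def extract_archetype(or_node):
    """Step 5 (sketch): descend through the marked splits to read off the single
    archetype tree, here as plain nested dicts."""
    if not or_node.splits:
        return {"class_vector": or_node.class_vector}
    chosen = or_node.marked_split
    return {"test": chosen.test,
            "children": [extract_archetype(child) for child in chosen.children]}

def archetype(X_train, n_alternatives, invented_size,
              grow, wakening_score, follow_branch, fill_leaf_matrices,
              similarity=kappa):
    """End-to-end sketch of the archetype technique (steps 1-5 of this slide)."""
    root = build_multitree(X_train, n_alternatives, grow, wakening_score)   # 1
    X_inv = invented_dataset(X_train, size=invented_size)                   # 2
    y_inv = [int(predict_multitree(root, x, follow_branch).argmax())        # 3
             for x in X_inv]
    fill_leaf_matrices(root, X_inv, y_inv)                                  # 4: leaf matrices
    propagate(root, similarity)                                             # 4: propagate upwards
    return extract_archetype(root)                                          # 5
```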

Experiments (1/4) Experimental setting: 13 datasets from the UCI repository. Multi-tree implemented in the SMILES system. Splitting criterion: GainRatio (C4.5). Second node selection criterion (wakening criterion): random.

Experiments (2/4) Evaluating Similarity Metrics

Experiments (3/4) Influence of the Size of the Invented Dataset

Experiments (4/4) Combination Resources compared to other Ensemble Methods

Archetype as a Hybrid Method

Conclusions. Archetyping is a method to obtain comprehensible solutions from an ensemble method: the use of multi-trees permits the extraction of a hypothesis from an exponential number of hypotheses; an invented dataset avoids sacrificing part of the training evidence as a validation dataset; and the archetype solution can also be considered an explanation of the combined hypothesis.

Conclusions. Some further improvements: the experimental study of the archetype as a hybrid method, and the study of methods that could select the archetype solution analytically, without the need for an invented dataset. SMILES is freely available at: http://www.dsic.upv.es/~flip/smiles/