DECISION TREES & RANDOM FORESTS X CONVOLUTIONAL NEURAL NETWORKS


DECISION TREES & RANDOM FORESTS X CONVOLUTIONAL NEURAL NETWORKS Covering "Deep Neural Decision Forests" (Microsoft Research Cambridge UK, ICCV 2015) and "Decision Forests, Convolutional Networks and the Models in-between" (Microsoft Research Technical Report, arXiv, 3 Mar. 2016). Presented by Meir Dalal and Or Gorodissky.

OVERVIEW OF THE PRESENTATION Motivation, decision trees, random forests, decision trees vs. CNNs, combining decision trees & CNNs.

MOTIVATION Combining CNNs' feature learning with random forests' classification capabilities.

DECISION TREE - WHAT IS IT A supervised learning algorithm used for classification. An inductive learning task: use particular facts to reach more general conclusions. A predictive model based on a branching series of tests; each of these smaller tests is less complex than a one-stage classifier (divide & conquer). Another way to look at it: each node either predicts the answer or passes the problem on to a different node. An example follows.
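To make the "branching series of tests" concrete, here is a minimal sketch (my own, not from the slides) that fits a small tree with scikit-learn, a library choice I am assuming rather than one the presenters name, and prints the learned test rules.

```python
# Minimal sketch (not from the slides): fit a small decision tree and print
# its branching series of tests.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Each internal node tests one attribute against a threshold (divide & conquer).
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# The printed rules make the "node predicts or passes the problem on" view explicit.
print(export_text(tree, feature_names=load_iris().feature_names))
```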

DECISION TREES - TYPICAL (NAIVE) PROBLEM [Table: training examples, with columns Example, Attributes, Target]

DECISION TREES - TYPICAL (NAIVE) PROBLEM (CONT.) [Figures: the decision tree built for the example problem]

DECISION TREES - HOW TO CONSTRUCT When to stop: all the instances have the same target class; there are no more instances; there are no more attributes; a pre-defined maximum depth is reached. How to split: decision trees are usually constructed top-down, choosing splits by Gini impurity or information gain (a split-criterion sketch follows).
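As a rough illustration of the two split criteria named above, the following sketch (my own, not from the slides; the toy labels and the particular split are assumptions) computes Gini impurity and information gain for a candidate binary split.

```python
# Sketch of the two split criteria above on an assumed toy split.
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum_c p_c^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Shannon entropy in bits: -sum_c p_c * log2(p_c)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Entropy of the parent minus the size-weighted entropy of the children.
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

# Toy example: a split that separates the two classes reasonably well.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:3], parent[3:]          # hypothetical split
print(gini(parent), information_gain(parent, left, right))
```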

DECISION TREES - TERMINOLOGY Root node, decision node, splitting, prediction node (leaf).

DECISION TREES - STOCHASTIC ROUTING Input space X, output space Y. Decision nodes n ∈ N, each with a routing function d_n(·; Θ). Prediction nodes (leaves) l ∈ L, each holding a distribution π_l over Y. Θ is the decision-node parameterization. Until now, d_n was binary and the routing was deterministic; the leaf prediction is denoted π_l. Stochastic routing: d_n(·; Θ): X → [0, 1], and the routing decision is the outcome of a Bernoulli random variable with mean d_n(x; Θ). Each leaf node contains a probability for each class.
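A minimal sketch (my own construction; the linear decision functions and leaf distributions are assumed toy parameters) of this stochastic routing on a small complete binary tree: each decision node sends the sample left with probability d_n(x), and the leaf that is reached returns its class distribution.

```python
# Sketch of stochastic routing (assumed toy parameterization, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

depth = 2                                   # complete binary tree: 3 decision nodes, 4 leaves
n_classes = 3
theta = rng.normal(size=(2 ** depth - 1, 4))              # one linear decision function per node
pi = rng.dirichlet(np.ones(n_classes), size=2 ** depth)   # one class distribution per leaf

def d(node, x):
    # Routing probability d_n(x; Theta) = sigmoid(f_n(x; Theta)).
    return 1.0 / (1.0 + np.exp(-theta[node] @ x))

def route(x):
    # Walk from the root; at each node flip a Bernoulli coin with mean d_n(x).
    node = 0
    for _ in range(depth):
        go_left = rng.random() < d(node, x)
        node = 2 * node + (1 if go_left else 2)   # heap-style child indices
    leaf = node - (2 ** depth - 1)
    return pi[leaf]                                # class probabilities stored at the leaf

print(route(rng.normal(size=4)))
```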

DECISION TREE - ENSEMBLE METHODS If a decision tree is fully grown, it may lose some generalization capability (overfitting). How to solve it? Ensemble methods: combine a group of predictive models to achieve better accuracy and model stability.

RANDOM FOREST When you can't think of any algorithm, use a random forest! Algorithm (bootstrap aggregation), sketched in code below: (1) Grow K different decision trees: pick a random subset of the training examples (sampled with replacement), pick d << D random attributes to split on, and grow each tree to the largest extent possible with no pruning. (2) Given a new data point x: classify x using each of the trees T_1, ..., T_K and aggregate their predictions (majority vote for classification, average for regression). [Figure: a decision forest averaging all the trees' predictions]
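A condensed sketch of the bagging procedure above, using scikit-learn's decision trees (my choice of library, not the presenters'); `max_features` plays the role of the d << D random attributes, and the dataset is only a stand-in.

```python
# Sketch of bootstrap aggregation (bagging) with K trees; assumes scikit-learn.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
K = 25

trees = []
for _ in range(K):
    # 1. Bootstrap sample: draw N examples with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # 2. Random attribute subset at every split (d << D), no depth limit / no pruning.
    t = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(t.fit(X[idx], y[idx]))

# 3. Aggregate: majority vote over the K trees for a new point x.
x = X[0:1]
votes = np.array([t.predict(x)[0] for t in trees])
print(np.bincount(votes).argmax())
```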

DECISION TREES X CONV NEURAL NETS DT: levels; divide & conquer; only ~log2(N) of the parameters are used at test time (at most); no features are learned; training is done layer-wise; high efficiency. CNN: layers; high dimensionality; all the parameters are used at test time; feature learning integrated with classification; trained end-to-end with (S)GD; state-of-the-art accuracy. How can we efficiently combine DT/RF with CNN?

DECISION TREE BY CNN FEATURES - ARCHITECTURE [Figures: a CNN whose learned features feed either a softmax layer or a random forest (RF); the decision forest averages all the trees' predictions]

DECISION TREE BY CNN FEATURES - ARCHITECTURE Decision nodes: d_n(x; Θ) = σ(f_n(x; Θ)), where σ(x) = 1 / (1 + e^(-x)) is the sigmoid function and f_n(·; Θ): X → R. Prediction probability for a sample x: p_T(y | x, Θ, π) = Σ_{l∈L} π_{l,y} · μ_l(x | Θ), where π_{l,y} is the probability that a sample reaching leaf l takes class y, μ_l(x | Θ) is the probability that sample x reaches leaf l, and Σ_{l∈L} μ_l(x | Θ) = 1. Forest of decision trees: deliver a prediction for a sample x by averaging the output of each tree, P_F(y | x) = (1/K) Σ_{h=1}^{K} P_{T_h}(y | x), where K is the number of decision trees in the forest.
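A small numeric sketch (mine, with an assumed toy tree of depth 2 standing in for the CNN outputs) of these formulas: μ_l is the product of the sigmoid routing probabilities along the path to leaf l, and the tree prediction mixes the leaf distributions π_l with weights μ_l.

```python
# Sketch of p_T(y|x) = sum_l pi_{l,y} * mu_l(x); toy depth-2 tree with assumed parameters.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth, n_classes = 2, 3
rng = np.random.default_rng(1)
theta = rng.normal(size=(2 ** depth - 1, 4))             # f_n(x; Theta) as a linear map per node
pi = rng.dirichlet(np.ones(n_classes), size=2 ** depth)  # pi_l: class distribution per leaf

def tree_predict(x):
    d = sigmoid(theta @ x)            # d_n(x; Theta) for nodes 0, 1, 2
    # mu_l(x): product of routing probabilities along the root-to-leaf path.
    mu = np.array([
        d[0] * d[1],                  # leaf 0: left, left
        d[0] * (1 - d[1]),            # leaf 1: left, right
        (1 - d[0]) * d[2],            # leaf 2: right, left
        (1 - d[0]) * (1 - d[2]),      # leaf 3: right, right
    ])
    return mu @ pi                    # p_T(y|x) = sum_l mu_l(x) * pi_{l,y}

# A forest prediction P_F(y|x) would average tree_predict over K independently
# parameterized trees; here we evaluate a single toy tree.
x = rng.normal(size=4)
p = tree_predict(x)
print(p, p.sum())                     # a proper distribution over the 3 classes
```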

TWO-STEP OPTIMIZATION STRATEGY Objective function: the risk R(Θ, π; T) over the training set T. (1) Learning the decision nodes: min_Θ R(Θ, π; T). (2) Learning the prediction nodes: min_π R(Θ, π; T). Notation: η > 0 is the learning rate, B ⊂ T is a random subset (mini-batch), Z_l^t is a normalization factor, and the starting values π_{l,y}^0 are arbitrary (> 0).

LEARNING TREE BY BACK PROPAGATION (1) Θ: update the decision-node parameters by back-propagation, randomly selecting one tree in the forest for each mini-batch. (2) π: update the prediction nodes in each tree independently, since each tree has its own set of leaf predictions. (A sketch of the alternating loop follows.)
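The following sketch is my reconstruction of the two-step strategy on an assumed toy depth-2 tree and synthetic data: an SGD step on Θ alternates with a normalized, multiplicative update of the leaf distributions π. For compactness the Θ gradient is computed numerically here, whereas the slides and paper use back-propagation; the update rule for π is my reading of the paper, not code taken from it.

```python
# Sketch (my reconstruction, toy setting) of the alternating two-step training:
# (1) a gradient step on the decision-node parameters Theta, (2) a pi update.
import numpy as np

rng = np.random.default_rng(0)
depth, n_classes, dim = 2, 3, 4
n_leaves = 2 ** depth
theta = rng.normal(size=(n_leaves - 1, dim))
pi = np.full((n_leaves, n_classes), 1.0 / n_classes)   # pi_{l,y}^0: arbitrary positive start

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mu(x, th):
    # mu_l(x): product of routing probabilities along each root-to-leaf path.
    d = sigmoid(th @ x)
    return np.array([d[0] * d[1], d[0] * (1 - d[1]),
                     (1 - d[0]) * d[2], (1 - d[0]) * (1 - d[2])])

# Synthetic data: three classes with shifted Gaussian features.
X = rng.normal(size=(90, dim)) + np.repeat(np.arange(n_classes), 30)[:, None]
y = np.repeat(np.arange(n_classes), 30)

eta = 0.1
for epoch in range(30):
    # (1) Theta step: gradient of the negative log-likelihood on a random mini-batch B
    #     (numerical gradient here purely for brevity; the paper back-propagates).
    B = rng.choice(len(X), size=16, replace=False)
    def nll(th):
        return -np.mean([np.log(mu(X[i], th) @ pi[:, y[i]] + 1e-12) for i in B])
    grad, eps, base = np.zeros_like(theta), 1e-4, nll(theta)
    for j in np.ndindex(theta.shape):
        t2 = theta.copy()
        t2[j] += eps
        grad[j] = (nll(t2) - base) / eps
    theta = theta - eta * grad

    # (2) pi step: pi_{l,y} <- (1/Z_l) * sum_{i: y_i=y} pi_{l,y} * mu_l(x_i) / p_T(y_i|x_i).
    new_pi = np.zeros_like(pi)
    for i in range(len(X)):
        m = mu(X[i], theta)
        p = m @ pi[:, y[i]] + 1e-12
        new_pi[:, y[i]] += pi[:, y[i]] * m / p
    pi = new_pi / new_pi.sum(axis=1, keepdims=True)     # Z_l^t normalizes each leaf

preds = np.array([np.argmax(mu(x, theta) @ pi) for x in X])
print("train accuracy:", np.mean(preds == y))
```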

LEARNING AND ENTROPY How can we quantify what the network has learned? Measure the decision uncertainty for a given sample x. Decision nodes: as the certainty of routing a sample increases, the sample is routed through only a small subset of the available decision nodes with reasonably high probability. [Figure: histogram counts of the d_n output values on the validation set after 100, 500, and 1K epochs]

LEARNING AND ENTROPY How can we quantify what the network has learned? Leaf entropy: measure the leaf posterior distributions; highly peaked distributions for the leaf predictors lead to low entropy. [Figure: average leaf entropy (in bits) vs. number of training epochs]

RESULTS 1 Algorithms compared: ADF - a state-of-the-art, stand-alone, off-the-shelf forest ensemble; sNDF - a shallow neural decision forest with one fully connected layer and no hidden layers.

RESULTS 2 Architectures compared: GoogLeNet* - the GoogLeNet implementation from the Distributed (Deep) Machine Learning Common (DMLC) library; dNDF.NET - GoogLeNet* with each softmax layer replaced by a random forest of 10 trees.

CONCLUSIONS A novel algorithm for learning a random forest: sNDF (shallow neural decision forest). A model unifying representation learning with a random-forest classifier: dNDF.NET (deep neural decision forest). dNDFs are trained with a two-step stochastic gradient descent over the prediction function and the routing function. No dramatic improvement in accuracy compared to regular GoogLeNet.

RECAP Before: decision trees and random forests are efficient classifiers; CNNs are state of the art at feature extraction and classification. In "Deep Neural Decision Forests" (ICCV 2015): all softmax layers of a GoogLeNet variation are used to derive a random forest; a two-step SGD is defined for finding both the decision and prediction functions; trained end-to-end, it achieved (slightly) better results. [Peter Kontschieder] Now: in "Decision Forests, Convolutional Networks and the Models in-between" (Microsoft Research Technical Report, arXiv, 3 Mar. 2016): generalize DTs and CNNs as Conditional Networks using routers; improve the compute cost of state-of-the-art architectures while maintaining accuracy. [Yani Ioannou]

SAVE THE PLANET / YOUR PHONE (MOTIVATION) A single VGG16 forward pass uses ~30 GFLOPs. A top-ranking energy-efficient supercomputer (HPC) delivers ~10 GFLOPS / Watt (https://www.top500.org/green500/). If 100,000,000 US users search for an image on the cloud, that is ~300 MWatt; after one hour, the energy is equivalent to ~45 tons of coal (https://www.euronuclear.org/info/encyclopedia/coalequivalent.htm). Nomophobia (from Wikipedia, the free encyclopedia) is a proposed name for the phobia of being out of mobile phone contact. It is, however, arguable that the word "phobia" is misused and that in the majority of cases it is another form of anxiety disorder. Although nomophobia does not appear in the current Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5), it has been proposed as a "specific phobia", based on definitions given in the DSM-IV.

MOTIVATION Neural networks are becoming deeper and more complex, carrying a quickly growing computational cost. We would like to make neural networks more efficient by introducing ideas from decision trees: decide on the fly how accurate / efficient you want your prediction to be (a trade-off). [Figure: top-1 accuracy on ImageNet vs. number of operations (GFLOPs); marker size indicates the number of parameters. https://arxiv.org/abs/1605.07678]

DECISION TREES X DEEP NEURAL NETS - TAKING A CLOSER LOOK DT (more efficient) vs. CNN (more accurate): decision nodes ↔ ReLU; random forest ↔ ensembles; prediction nodes ↔ softmax; deactivating branches ↔ dropout. Actually they are similar, but how do we combine them? Generalize both as Conditional Networks.

POC - FROM NET TO TREE Take 2 consecutive layers from a trained CNN (VGG). Calculate the cross-correlation matrix between the two fully connected layers. Rearrange it as a block matrix (grouping the higher cross-correlation values). Decorrelate by zeroing the off-diagonal blocks. Re-plot the net with the resulting branched structure. (A sketch of this procedure follows.)
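A rough numpy sketch (my own; randomly generated activations stand in for the two VGG layers) of this proof of concept: correlate the two layers' unit activations, reorder units into blocks, and zero the off-diagonal blocks so the connectivity becomes branched (tree-like).

```python
# Sketch of the net-to-tree proof of concept; activations are simulated, not taken from VGG.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_a, n_b = 1000, 8, 8

# Simulated activations of two consecutive layers, built so that the first half of
# layer B depends on the first half of layer A, and likewise for the second halves.
A = rng.normal(size=(n_samples, n_a))
B = rng.normal(size=(n_samples, n_b))
B[:, : n_b // 2] += A[:, : n_a // 2]
B[:, n_b // 2 :] += A[:, n_a // 2 :]

# Cross-correlation matrix between the units of the two layers.
A0 = (A - A.mean(0)) / A.std(0)
B0 = (B - B.mean(0)) / B.std(0)
C = (A0.T @ B0) / n_samples

# Rearrange rows so that units correlating with the same half of B are grouped
# together (a crude stand-in for the block rearrangement on the slide).
row_group = np.argmax(np.abs(C).reshape(n_a, 2, n_b // 2).sum(axis=2), axis=1)
C_blocked = C[np.argsort(row_group)]

# Decorrelate: zero the off-diagonal blocks, leaving two independent branches (a tree).
mask = np.zeros_like(C_blocked)
mask[: n_a // 2, : n_b // 2] = 1
mask[n_a // 2 :, n_b // 2 :] = 1
print(np.round(C_blocked * mask, 2))
```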

FAST NOTATION [Figure: the notation used for conditional networks in the following slides]

INTRODUCING THE ROUTER NODE [Figure: a split node whose data router emits routing weights r(1), r(2) toward its child branches; the slide also shows the router's partial derivative] The router is implemented here as a perceptron, though other choices are possible. It outputs real-valued weights that affect data routing: Explicit routing - data is sent conditionally to a single route or to multiple routes. Implicit routing - data is sent unconditionally but selectively to all child nodes. Hard routing - binary weights on the branches (on/off). Soft routing - real-valued weights on the branches. Quiz: where are DTs? Decision trees correspond to hard, explicit routing. The generalization is called a Conditional Network. (A toy router is sketched below.)
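To ground the taxonomy above, here is a small sketch (my own, not the paper's code; the perceptron weights and the softmax over router scores are assumptions) of a router that produces soft routing weights, from which hard routing (thresholding) and explicit routing (sending data only along selected routes) can be derived.

```python
# Toy perceptron router illustrating soft/hard and explicit/implicit routing (assumed setup).
import numpy as np

rng = np.random.default_rng(0)
dim, n_routes = 4, 2
W = rng.normal(size=(n_routes, dim))      # perceptron router parameters
b = np.zeros(n_routes)

def soft_route(x):
    # Soft routing: real-valued weights over the branches (softmax over router scores).
    s = W @ x + b
    e = np.exp(s - s.max())
    return e / e.sum()

def hard_route(x):
    # Hard routing: binary on/off weights on the branches.
    r = soft_route(x)
    return (r == r.max()).astype(float)

x = rng.normal(size=dim)
r = soft_route(x)

# Implicit routing: send x unconditionally (but weighted) to all child branches.
implicit = [r[i] * x for i in range(n_routes)]

# Explicit routing: send x only along the selected branch, skipping the rest entirely.
selected = int(np.argmax(hard_route(x)))
explicit = {selected: x}

print(r, selected)
```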

EXPERIMENT - CONDITIONAL GOOGLE-NET An ensemble / random-forest-style architecture based on two GoogLeNets: a regular one and one with 10x oversampling. This time we learn an explicit router based on a simple CNN; the router is trained jointly to predict the accuracy of each route for each image.

EXPERIMENT - CONDITIONAL GOOGLE-NET Purple dots: the original networks' accuracies. Dashed line: accuracy when choosing each network at random. Green line: amortized cost-to-accuracy curve on the validation set. Green point: an operating point where we achieve almost the accuracy of the 10x oversampled CNN with less than half the computational cost. We can decide at test time what accuracy we require.

EFFICIENCY BENEFITS OF IMPLICIT ROUTING Top: a standard CNN (one route). Bottom: a two-route implicit architecture. The larger boxes denote feature maps, the smaller ones the filters. Due to branching, the depth of the second set of kernels (in yellow) changes between the two architectures, yielding a lower computational cost. (A worked count is given below.)
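A back-of-the-envelope sketch (my own numbers, chosen only for illustration and not taken from the paper) of why branching reduces cost: splitting a convolution's channels into two routes shrinks the kernel depth seen by the second layer, roughly halving its multiply-adds.

```python
# Illustrative multiply-add count for an unbranched vs. a two-route convolution layer.
# All sizes are assumed for illustration; they are not taken from the paper.

def conv_madds(h, w, c_in, c_out, k=3):
    # Multiply-adds of a k x k convolution producing an h x w x c_out feature map.
    return h * w * c_out * c_in * k * k

H = W = 28
C = 256

# Standard CNN: the second layer sees all C input channels.
single_route = conv_madds(H, W, C, C)

# Two-route implicit architecture: each branch sees (and produces) C/2 channels,
# so the kernel depth of the second set of filters is halved.
two_routes = 2 * conv_madds(H, W, C // 2, C // 2)

print(single_route, two_routes, two_routes / single_route)   # -> ratio 0.5
```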

EXPERIMENT - CONDITIONAL VGG11 Split the features into 2 routes. Based on VGG11 with an additional global max pooling layer after the last convolutional layer. Implemented as a DAG.

EXPERIMENT - CONDITIONAL VGG11 Matches the original VGG11 top-5 error with less than half the compute (45%) and almost one-fifth (21%) of the parameters. Training from scratch took twice as many epochs, but the overall time remained the same due to the decrease in computation.

TL;DR Decision trees are efficient and CNNs are accurate; Conditional Networks are the generalization of both. Trade-off: we try to find the sweet spot combining the two. By using implicit routing we can achieve a ~50% reduction in computational and memory cost; by using explicit routing we can achieve a ~50% reduction in computational cost at the same accuracy. Decide on the fly how accurate / costly you want to be. *If you aren't more accurate, maybe you're more efficient.