DECISION TREES & RANDOM FORESTS X CONVOLUTIONAL NEURAL NETWORKS
Deep Neural Decision Forests. Microsoft Research Cambridge, UK. ICCV 2015.
Decision Forests, Convolutional Networks and the Models in-between. Microsoft Research Technical Report, arXiv, 3 Mar. 2016.
Meir Dalal, Or Gorodissky
OVERVIEW OF THE PRESENTATION
Motivation
Decision trees
Random forests
Decision trees vs. CNN
Combining decision trees & CNN
MOTIVATION
Combining the CNN's feature learning with the random forest's classification capacities.
DECISION TREE - WHAT IS IT
A supervised learning algorithm used for classification.
An inductive learning task: use particular facts to draw more general conclusions.
A predictive model based on a branching series of tests.
Each of these smaller tests is less complex than a one-stage classifier (divide & conquer).
A different way to look at it: each node either predicts the answer or passes the problem to a different node.
[Figure: example decision tree]
DECISION TREES - TYPICAL (NAIVE) PROBLEM
[Table: training examples, with columns Example, Attributes, Target]
DECISION TREES - TYPICAL (NAIVE) PROBLEM CONT.
DECISION TREES - HOW TO CONSTRUCT
When to stop?
All the instances have the same target class.
There are no more instances.
There are no more attributes.
The pre-defined maximum depth is reached.
How to split?
Decision trees are usually constructed top-down, choosing at each node the split that scores best under a criterion such as:
Gini impurity
Information gain
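The two split criteria named on this slide can be sketched in a few lines. This is a minimal illustration using the textbook definitions; the function names and the toy labels are made up for the example.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# Hypothetical toy node: a balanced two-class node split perfectly in two.
parent = ["yes", "yes", "no", "no"]
left, right = ["yes", "yes"], ["no", "no"]
print(gini(parent))                           # 0.5
print(information_gain(parent, left, right))  # 1.0
```

A perfect split of a balanced binary node gains exactly one bit, which is the largest possible gain for two classes.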
DECISION TREES - TERMINOLOGY
[Figure: a tree annotated with the terms root node, decision node, splitting, prediction node]
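DECISION TREES - STOCHASTIC ROUTING
Input space $\mathcal{X}$, output space $\mathcal{Y}$.
Decision nodes: each $n \in \mathcal{N}$ holds a routing function $d_n(\cdot; \Theta)$, where $\Theta$ is the decision node parameterization.
Prediction nodes: each leaf $\ell \in \mathcal{L}$ holds a distribution $\pi_\ell$ over $\mathcal{Y}$. The leaf predictions are denoted $\pi_\ell \in \pi$.
Until now, $d_n$ was binary and the routing was deterministic.
Stochastic routing function: $d_n(\cdot; \Theta) : \mathcal{X} \to [0, 1]$.
The routing decision is the output of a Bernoulli random variable with mean $d_n(x; \Theta)$.
Each leaf node contains a probability for each class.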
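A minimal sketch of stochastic routing, assuming a full binary tree stored in heap order, a scalar input, and a hypothetical parameterization theta[n] = (w, b) with d_n(x) = sigmoid(w * x + b) read as the probability of going left. None of these choices come from the slides.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def route(x, theta, depth, rng):
    """Sample the leaf reached by scalar input x.

    Children of node n are stored at 2n + 1 (left) and 2n + 2 (right).
    """
    n = 0
    for _ in range(depth):
        w, b = theta[n]
        go_left = rng.random() < sigmoid(w * x + b)   # Bernoulli with mean d_n(x)
        n = 2 * n + 1 if go_left else 2 * n + 2
    return n - (2 ** depth - 1)                       # leaf index in 0 .. 2**depth - 1

rng = random.Random(0)
theta = [(4.0, 0.0), (4.0, 2.0), (4.0, -2.0)]         # depth-2 tree: 3 decision nodes
leaves = [route(1.5, theta, depth=2, rng=rng) for _ in range(1000)]
counts = [leaves.count(i) for i in range(4)]
print(counts)
```

Because the decision outputs are close to 1 for this input, almost all samples end in the leftmost leaf, but every leaf keeps a nonzero probability.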
DECISION TREE - ENSEMBLE METHODS
If a decision tree is fully grown, it may lose some generalization capability: overfitting.
How to solve it? Ensemble methods.
They combine a group of predictive models to achieve better accuracy and model stability.
RANDOM FOREST
"When you can't think of any algorithm, use random forest!"
Algorithm (bootstrap aggregation):
1. Grow K different decision trees:
   1. Pick a random subset of the training examples (sampled with replacement).
   2. Pick d << D random attributes to split the data.
   3. Grow each tree to the largest extent possible, with no pruning.
2. Given a new data point x:
   1. Classify x using each of the trees T_1, ..., T_K.
   2. Predict by aggregating the predictions of the trees (i.e., majority vote for classification, average for regression).
Forest decision: averaging all the trees' predictions.
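The algorithm above can be sketched with the standard library alone. This is an illustrative toy, not a reference implementation: the threshold search, the Gini criterion, the tuple-based tree encoding and the data set are all assumptions made for the example.

```python
import random
from collections import Counter

def gini(rows):
    n = len(rows)
    return 1.0 - sum((c / n) ** 2 for c in Counter(y for _, y in rows).values())

def grow(rows, d, rng):
    """Grow an unpruned tree; rows are (features, label) pairs."""
    labels = [y for _, y in rows]
    if len(set(labels)) == 1:
        return labels[0]                                  # pure leaf
    best = None
    for f in rng.sample(range(len(rows[0][0])), d):       # d << D random attributes
        for t in sorted({x[f] for x, _ in rows})[1:]:     # candidate thresholds
            left = [r for r in rows if r[0][f] < t]
            right = [r for r in rows if r[0][f] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:                                      # no usable split: majority leaf
        return Counter(labels).most_common(1)[0][0]
    _, f, t, left, right = best
    return (f, t, grow(left, d, rng), grow(right, d, rng))

def predict_tree(tree, x):
    while isinstance(tree, tuple):                        # internal nodes are tuples
        f, t, left, right = tree
        tree = left if x[f] < t else right
    return tree

def grow_forest(rows, K, d, seed=0):
    rng = random.Random(seed)
    # Each tree sees a bootstrap sample: len(rows) draws with replacement.
    return [grow([rng.choice(rows) for _ in rows], d, rng) for _ in range(K)]

def predict_forest(forest, x):
    votes = Counter(predict_tree(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]                     # majority vote

# Hypothetical toy data: features 0 and 1 separate the classes, feature 2 is noise.
rows = [((0.1, 0.2, 0.0), 0), ((0.3, 0.1, 1.0), 0), ((0.2, 0.4, 0.0), 0),
        ((0.9, 0.8, 1.0), 1), ((0.8, 0.7, 0.0), 1), ((0.7, 0.9, 1.0), 1)]
forest = grow_forest(rows, K=25, d=2)
print(predict_forest(forest, (0.05, 0.05, 0.0)), predict_forest(forest, (0.95, 0.95, 1.0)))
```

Individual trees can be wrong (for instance when a bootstrap sample happens to contain a single class), which is exactly why the forest aggregates many of them.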
DECISION TREES X CONV NEURAL NETS
DT:
Levels
Divide & conquer
Only $\log_2 N$ parameters used at test time
No features learned (at most)
Training is done layer-wise
High efficiency
CNN:
Layers
High dimensionality
Uses all the parameters at test time!
Feature learning integrated with classification
Trained end-to-end (E2E) with S/GD
State-of-the-art accuracy
How to efficiently combine DT/RF with CNN?
DECISION TREE BY CNN FEATURES - ARCHITECTURE
[Figures: architecture diagrams with CNN, RF and softmax blocks. Forest decision: averaging all the trees' predictions.]
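DECISION TREE BY CNN FEATURES - ARCHITECTURE
Decision nodes:
$$d_n(x; \Theta) = \sigma(f_n(x; \Theta)), \qquad \sigma(x) = (1 + e^{-x})^{-1} \text{ (sigmoid function)}, \qquad f_n(\cdot; \Theta) : \mathcal{X} \to \mathbb{R}$$
Prediction nodes: the probability prediction for a sample $x$ is
$$P_T[y \mid x, \Theta, \pi] = \sum_{\ell \in \mathcal{L}} \pi_{\ell y} \, \mu_\ell(x \mid \Theta)$$
where $\pi_{\ell y}$ is the probability that a sample reaching leaf $\ell$ takes class $y$, $\mu_\ell(x \mid \Theta)$ is the probability that sample $x$ reaches leaf $\ell$, and $\sum_{\ell \in \mathcal{L}} \mu_\ell(x \mid \Theta) = 1$.
Forest of decision trees: deliver a prediction for a sample $x$ by averaging the output of each tree,
$$P_F[y \mid x] = \frac{1}{K} \sum_{h=1}^{K} P_{T_h}[y \mid x]$$
where $K$ is the number of decision trees in the forest.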
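The prediction formulas on this slide can be traced numerically. The sketch below assumes a full binary tree in heap (level) order and reads d_n(x) as the probability of routing left; the function names and the toy numbers are made up for the illustration.

```python
def leaf_probabilities(d):
    """mu_l for a full binary tree, given decision outputs d in heap order.

    d[n] is d_n(x), the probability of routing left at node n;
    routing right has probability 1 - d[n]. Expects len(d) == 2**depth - 1.
    """
    mu = [1.0]                                   # probability of reaching the root
    n = 0
    while n < len(d):
        level = []
        for p in mu:                             # split each node's mass between its children
            level += [p * d[n], p * (1.0 - d[n])]
            n += 1
        mu = level
    return mu

def tree_predict(d, pi):
    """P_T[y | x] = sum_l pi[l][y] * mu_l(x)."""
    mu = leaf_probabilities(d)
    return [sum(m * leaf[y] for m, leaf in zip(mu, pi)) for y in range(len(pi[0]))]

def forest_predict(trees):
    """P_F[y | x] = (1 / K) * sum_h P_{T_h}[y | x]; trees is a list of (d, pi) pairs."""
    preds = [tree_predict(d, pi) for d, pi in trees]
    return [sum(col) / len(preds) for col in zip(*preds)]

# Hypothetical depth-2 tree: 3 decision nodes, 4 leaves, 2 classes.
d = [0.5, 1.0, 0.0]
pi = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
mu = leaf_probabilities(d)      # mass 0.5 on the first leaf, 0.5 on the last
p = tree_predict(d, pi)
print(mu, p)
```

Since the mu values sum to one and every leaf holds a distribution, the tree output is itself a distribution over classes, and so is the forest average.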
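TWO-STEP OPTIMIZATION STRATEGY
Objective function: the risk $R(\Theta, \pi; \mathcal{T})$ over the training set $\mathcal{T}$.
(1) Learning decision nodes. Our goal: $\min_\Theta R(\Theta, \pi; \mathcal{T})$.
(2) Learning prediction nodes. Our goal: $\min_\pi R(\Theta, \pi; \mathcal{T})$.
Notation: $\eta > 0$ is the learning rate, $\mathcal{B} \subseteq \mathcal{T}$ is a random subset (mini-batch), $Z_\ell^{(t)}$ is a normalization factor, and the starting point $\pi_{\ell y}^{(0)} > 0$ is arbitrary.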
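The update formulas themselves did not survive on this slide. As far as the ICCV 2015 paper can be recalled here, step (1) is a mini-batch gradient step on the log-loss, $\Theta^{(t+1)} = \Theta^{(t)} - \frac{\eta}{|\mathcal{B}|} \sum_{(x,y) \in \mathcal{B}} \partial L / \partial \Theta$, and step (2) is the fixed-point iteration $\pi_{\ell y}^{(t+1)} = \frac{1}{Z_\ell^{(t)}} \sum_{(x,y') \in \mathcal{T}} \mathbb{1}[y = y'] \, \pi_{\ell y}^{(t)} \mu_\ell(x \mid \Theta) / P_T[y \mid x, \Theta, \pi^{(t)}]$. The sketch below implements only step (2) under that assumption; the function name, argument layout and toy numbers are hypothetical.

```python
def update_leaves(pi, mus, labels, iterations=20):
    """Iterate the leaf update for fixed decision nodes.

    pi[l][y]  : current leaf distributions (all entries > 0 at the start)
    mus[i][l] : mu_l(x_i), probability that sample i reaches leaf l
    labels[i] : class index of sample i
    """
    n_leaves, n_classes = len(pi), len(pi[0])
    for _ in range(iterations):
        new = [[0.0] * n_classes for _ in range(n_leaves)]
        for mu, y in zip(mus, labels):
            p_y = sum(pi[l][y] * mu[l] for l in range(n_leaves))   # P_T[y | x]
            for l in range(n_leaves):
                new[l][y] += pi[l][y] * mu[l] / p_y
        # Z_l normalizes each leaf so that its class probabilities sum to one.
        pi = [[v / sum(row) for v in row] for row in new]
    return pi

# Hypothetical toy problem: 2 leaves, 2 classes, 4 samples.
mus = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = [0, 0, 1, 1]
pi0 = [[0.5, 0.5], [0.5, 0.5]]
pi = update_leaves(pi0, mus, labels)
print(pi)
```

Starting from uniform leaves, the first leaf drifts towards class 0 and the second towards class 1, matching where the samples of each class tend to land.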
LEARNING TREE BY BACK PROPAGATION
(1) $\Theta$: randomly select a tree in the forest for each mini-batch.
(2) $\pi$: update the prediction nodes in each tree independently, since each tree has its own set of leaf predictions.
LEARNING AND ENTROPY
How can we quantify the network's learning process?
Decision nodes: measure the decision uncertainty for a given sample $x$.
As the certainty of routing a sample increases, the sample is routed with reasonably high probability to only a small subset of the available decision nodes.
[Figure: histograms of the $d_n$ responses on the validation set after 100, 500 and 1K epochs; x-axis: $d_n$ output values, y-axis: histogram counts]
LEARNING AND ENTROPY
How can we quantify the network's learning process?
Leaf entropy: measure the leaf posterior distributions.
Highly peaked distributions for the leaf predictors lead to low entropy.
[Figure: average leaf entropy during training; x-axis: number of training epochs, y-axis: average leaf entropy in bits; a flat distribution has higher entropy H than a peaked one]
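A short sketch of the quantity plotted on this slide, using the standard Shannon entropy in bits. The two example distributions are hypothetical.

```python
import math

def entropy_bits(p):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def average_leaf_entropy(pi):
    """Mean entropy over all leaf distributions pi[l]."""
    return sum(entropy_bits(leaf) for leaf in pi) / len(pi)

flat = [0.25, 0.25, 0.25, 0.25]      # untrained leaf: no class preference
peaked = [0.97, 0.01, 0.01, 0.01]    # trained leaf: one dominant class
print(entropy_bits(flat))                          # 2.0
print(entropy_bits(peaked) < entropy_bits(flat))   # True
```

A falling average leaf entropy over the epochs therefore indicates that the leaves are specializing on particular classes.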
RESULTS 1
Algorithms:
ADF: state-of-the-art stand-alone, off-the-shelf forest ensemble.
sNDF: 1 fully connected layer, no hidden layers.
RESULTS 2
Architecture:
GoogLeNet*: the GoogLeNet implementation from the Distributed (Deep) Machine Learning Common (DMLC) library.
dNDF.NET: each softmax layer in GoogLeNet* is replaced with a random forest consisting of 10 trees.
CONCLUSIONS
A novel algorithm for learning random forests: sNDF (shallow neural decision forest).
A model that unifies representation learning and classification using a random forest: dNDF.NET (deep neural decision forest).
Training dNDFs: two-step stochastic gradient descent over the prediction function and the routing function.
No dramatic improvement in accuracy compared to the regular GoogLeNet.
RECAP
Before:
Decision trees and random forests are efficient classifiers.
CNNs are state of the art as feature extractors and classifiers.
In "Deep Neural Decision Forests", ICCV 2015 (Peter Kontschieder):
All softmax layers of a GoogLeNet variation are replaced by random forests.
A two-step SGD is defined for finding both the decision and the prediction functions.
Trained end-to-end (E2E), it achieved (slightly) better results.
Now:
In "Decision Forests, Convolutional Networks and the Models in-between", Microsoft Research Technical Report, arXiv, 3 Mar. 2016 (Yani Ioannou):
Generalize DT and CNN as conditional networks using routers.
Reduce the compute cost of state-of-the-art architectures while maintaining accuracy.
SAVE THE PLANET / YOUR PHONE (MOTIVATION)
A single VGG16 forward pass uses ~30 GFLOPs.
The top-ranking energy-efficient supercomputer (HPC) delivers ~10 GFLOPS per watt. https://www.top500.org/green500/
100,000,000 users searching for an image on the cloud (assuming one search per second each) draw ~300 MW.
After one hour: energy equivalent to ~45 tons of coal. https://www.euronuclear.org/info/encyclopedia/coalequivalent.htm
Nomophobia (from Wikipedia, the free encyclopedia) is a proposed name for the phobia of being out of mobile phone contact. It is, however, arguable that the word "phobia" is misused and that in the majority of cases it is another form of anxiety disorder. Although nomophobia does not appear in the current Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5), it has been proposed as a "specific phobia", based on definitions given in the DSM-IV.
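The back-of-the-envelope numbers on this slide can be rechecked as follows. Two inputs are assumptions that the slide does not state: one search per user per second, and a coal heating value of 24 MJ/kg, chosen because it reproduces the ~45 tons. With the standard coal equivalent of about 29.3 MJ/kg, the same energy corresponds to roughly 37 tons.

```python
flops_per_pass = 30e9        # ~30 GFLOPs per VGG16 forward pass (from the slide)
flops_per_joule = 10e9       # ~10 GFLOPS per watt = 10 GFLOPs per joule (from the slide)
users = 100_000_000          # from the slide
searches_per_second = 1.0    # assumption: one search per user per second
coal_mj_per_kg = 24.0        # assumption: heating value that reproduces ~45 t

joules_per_pass = flops_per_pass / flops_per_joule             # 3 J per search
power_watts = users * searches_per_second * joules_per_pass    # 3e8 W = 300 MW
energy_joules = power_watts * 3600                             # one hour of searches
coal_tonnes = energy_joules / (coal_mj_per_kg * 1e6) / 1000

print(power_watts / 1e6, "MW")
print(round(coal_tonnes), "tonnes of coal")
```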
MOTIVATION
Neural networks are becoming deeper and more complex, carrying a quickly growing computational cost.
We would like to make neural networks more efficient by introducing ideas from decision trees.
Decide on the fly how accurate or how efficient you want your prediction to be (trade-off).
[Figure: top-1 accuracy on ImageNet vs. number of operations (GFLOPs); marker size is the number of parameters. Source: https://arxiv.org/abs/1605.07678]
DECISION TREES X DEEP NEURAL NETS - TAKING A CLOSER LOOK
DT (more efficient) / CNN (more accurate):
Decision nodes / ReLU
Random forest / Ensembles
Prediction nodes / Softmax
Deactivating branches / Dropout
Actually, they are similar. But how do we combine them? Generalize both as conditional networks.
POC - FROM NET TO TREE
1. Take 2 consecutive layers from a trained CNN (VGG).
2. Calculate the two layers' cross-correlation matrix, as for a fully connected neural network.
3. Rearrange it as a block matrix (grouping the higher cross-correlation values into blocks).
4. Decorrelate by zeroing the elements outside the diagonal blocks.
5. Replot the net with the resulting branched structure.
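The decorrelation step can be illustrated on a small matrix. This sketch assumes the rearrangement into blocks has already been done and that each unit carries a group label; the matrix values and the grouping are hypothetical.

```python
def zero_off_diagonal_blocks(matrix, row_groups, col_groups):
    """Keep entry (i, j) only if row i and column j belong to the same group."""
    return [
        [v if row_groups[i] == col_groups[j] else 0.0 for j, v in enumerate(row)]
        for i, row in enumerate(matrix)
    ]

# Hypothetical cross-correlation between 4 units of one layer (rows) and
# 4 units of the next layer (columns), already rearranged so that strongly
# correlated units sit next to each other.
corr = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.9, 0.0, 0.1],
    [0.1, 0.0, 0.8, 0.9],
    [0.0, 0.2, 0.9, 0.7],
]
groups = [0, 0, 1, 1]            # two routes of two units each
branched = zero_off_diagonal_blocks(corr, groups, groups)
kept = sum(v != 0.0 for row in branched for v in row)
print(kept, "of 16 connections kept")
```

The two surviving diagonal blocks no longer interact, so the layer pair can be redrawn as two independent branches.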
FAST NOTATION 30
INTRODUCING THE ROUTER NODE
[Figure: a split node acting as a data router, with router outputs r(1) and r(2)]
The router is implemented here as a perceptron, though other choices are possible.
It outputs real-valued weights that affect data routing.
[Equation: partial derivative, not recoverable from the slide]
Explicit routing: data is sent conditionally to a single route or to multiple routes.
Implicit routing: data is sent unconditionally, but selectively, to all child nodes.
Hard routing: binary weights on branches (on/off).
Soft routing: real-valued weights on branches.
Quiz: where are DTs? In the explicit, hard cell of the explicit/implicit x hard/soft grid.
The generalization is called a conditional network.
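A minimal sketch of an explicit perceptron router with soft and hard weights. The softmax normalization, the winner-takes-all hardening and the two stand-in routes are assumptions made for this illustration and are not taken from the slides.

```python
import math

def router_weights(x, W, b):
    """Perceptron router: one linear score per route, softmax-normalized (assumption)."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b_k for w, b_k in zip(W, b)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def harden(weights):
    """Hard routing: binary weights, only the strongest route stays on."""
    best = max(range(len(weights)), key=weights.__getitem__)
    return [1.0 if k == best else 0.0 for k in range(len(weights))]

def conditional_output(x, routes, weights):
    """Explicit routing: routes whose weight is zero are never evaluated."""
    out, evaluated = 0.0, 0
    for route, w in zip(routes, weights):
        if w == 0.0:
            continue
        out += w * route(x)
        evaluated += 1
    return out, evaluated

# Hypothetical stand-ins for two sub-networks.
routes = [lambda v: sum(v), lambda v: max(v)]
x = [1.0, 2.0]
soft = router_weights(x, W=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0])
soft_out, soft_evaluated = conditional_output(x, routes, soft)
hard_out, hard_evaluated = conditional_output(x, routes, harden(soft))
print(soft_evaluated, hard_evaluated)   # 2 1
```

With soft weights every route is computed and blended; with hard weights only one route runs, which is the decision-tree corner of the grid and the source of the compute savings.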
EXPERIMENT - CONDITIONAL GOOGLENET
Ensemble/random forest architecture.
Based on two GoogLeNets: a regular one and one with 10x oversampling.
This time we learn an explicit router based on a simple CNN.
The router is trained jointly to predict the accuracy of each route for each image.
EXPERIMENT - CONDITIONAL GOOGLENET
Purple dots: the original networks' accuracies.
Dashed line: accuracy when choosing between the networks at random.
Green line: amortized cost-to-accuracy curve on the validation set.
Green point: the operating point where we achieve almost the accuracy of the 10x oversampled CNN with less than half the computational cost.
We can decide at test time what accuracy we require.
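The dashed baseline in this plot is a straight line between the two networks, which the sketch below reproduces. The relative costs and the accuracies are hypothetical placeholders, not the values from the experiment.

```python
def amortized(costs, accuracies, fractions):
    """Average cost and accuracy when a fixed fraction of images takes each route.

    Assumes each route's accuracy does not depend on which images it receives,
    which is what choosing a network at random amounts to.
    """
    cost = sum(f * c for f, c in zip(fractions, costs))
    acc = sum(f * a for f, a in zip(fractions, accuracies))
    return cost, acc

costs = [1.0, 10.0]          # hypothetical relative cost: regular vs. 10x oversampled
accuracies = [0.89, 0.91]    # hypothetical top-5 accuracies
for share in (0.0, 0.5, 1.0):
    print(share, amortized(costs, accuracies, [1.0 - share, share]))
```

A trained router can sit above this line because it sends the expensive route only the images that are likely to benefit from it, which is what the green curve shows.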
EFFICIENCY BENEFITS OF IMPLICIT ROUTING
Top: a standard CNN (one route). Bottom: a two-routed implicit architecture.
The larger boxes denote feature maps, the smaller ones the filters.
Due to branching, the depth of the second set of kernels (in yellow) changes between the two architectures, yielding a lower computational cost.
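The saving can be made concrete by counting multiply-accumulates, assuming the routes are realized as filter groups so that each route sees only its own share of the input channels. The layer dimensions are hypothetical.

```python
def conv_macs(height, width, kernel, c_in, c_out, routes=1):
    """Multiply-accumulates of one conv layer split into `routes` filter groups.

    Each route sees c_in / routes input channels and produces c_out / routes
    output channels, so the kernel depth shrinks by a factor of `routes`.
    """
    assert c_in % routes == 0 and c_out % routes == 0
    per_route = height * width * kernel * kernel * (c_in // routes) * (c_out // routes)
    return routes * per_route

standard = conv_macs(56, 56, 3, 128, 128)               # one route
two_routed = conv_macs(56, 56, 3, 128, 128, routes=2)   # two implicit routes
print(two_routed / standard)   # 0.5
```

The number of kernel weights shrinks by the same factor, so in this idealized count two routes halve both the compute and the parameters of the layer.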
EXPERIMENT - CONDITIONAL VGG11
Based on VGG11, with an additional global max pooling layer after the last convolutional layer.
The features are split into 2.
Implemented as a DAG.
EXPERIMENT - CONDITIONAL VGG11
Matches the original VGG11 top-5 error with less than half the compute (45%) and almost one-fifth (21%) of the parameters.
Training from scratch took twice as many epochs, but the overall time remained the same due to the decrease in computation.
TL;DR
Decision trees are efficient and CNNs are accurate.
Conditional networks are the generalization of both.
Trade-off: we try to find the sweet spot combining the two.
Using implicit routing, we could achieve a 50% reduction of computational and memory cost.
Using explicit routing, we could achieve a 50% reduction of computational cost at almost the same accuracy.
Decide on the fly how accurate and how costly we want to be.
*If you aren't more accurate, maybe you're more efficient.