Ensemble methods in machine learning. Neural networks


Ensemble methods in machine learning

Bootstrap aggregating (bagging): train an ensemble of models on randomly resampled versions of the training set, then take a majority vote.

Example: what if you used the output of two linear classifiers as the input to another linear classifier?

Boosting: iteratively build an ensemble by training each new model to emphasize the parts of the training set that the previous models struggled with.

Stacking: train a learning algorithm to combine the predictions of other learning algorithms.

Neural networks

The example above is a particular example of a neural network. Formally, a (single-hidden-layer) neural network is expressed mathematically via

    f(x) = \psi\Big( \sum_{j=1}^{M} w_j^{(2)} \, \phi\Big( \sum_{k=1}^{d} w_{jk}^{(1)} x_k + b_j^{(1)} \Big) + b^{(2)} \Big),

where \phi and \psi are fixed activation functions and the weights w^{(1)}, w^{(2)} and offsets b^{(1)}, b^{(2)} are parameters to be learned from the data. The previous example fits this model: the x_k form the input layer, the M nonlinear units \phi(\cdot) form the hidden layer, and \psi(\cdot) produces the output layer.
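As a concrete illustration of this single-hidden-layer model, here is a minimal numpy sketch of the forward computation. The sigmoid hidden activation, the identity output, the sizes, and the function names are illustrative assumptions, not something specified in the lecture.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W1, b1, w2, b2, phi=sigmoid, psi=lambda z: z):
        hidden = phi(W1 @ x + b1)      # hidden layer: M nonlinear units
        return psi(w2 @ hidden + b2)   # output layer: linear map, then psi

    # illustrative sizes: d = 3 inputs, M = 4 hidden units
    rng = np.random.default_rng(0)
    d, M = 3, 4
    W1, b1 = rng.normal(size=(M, d)), rng.normal(size=M)
    w2, b2 = rng.normal(size=M), rng.normal()
    print(forward(rng.normal(size=d), W1, b1, w2, b2))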

Typical neural networks

In general, learning the parameters of a neural network can be quite difficult. To make life easier, it is nice to choose the activation functions \phi and \psi to be differentiable. A historically common choice for \phi is the logistic/sigmoid function

    \phi(z) = \frac{1}{1 + e^{-z}}.

The choice of the output activation \psi depends somewhat on the application:
- Regression: \psi(z) = z (the identity)
- Binary classification: \psi(z) = 1/(1 + e^{-z}) (the sigmoid)
- Multiclass classification: \psi is the softmax function

Remarks

Like SVMs, neural networks fit a linear model in a nonlinear feature space. Unlike SVMs, those nonlinear features are learned. Unfortunately, training involves nonconvex optimization.

Neural networks were originally conceived as models for the brain: nodes are neurons and edges are synapses. If \phi is a step function, a hidden unit represents a neuron firing when the total incoming signal exceeds a certain threshold. Note that our choices for \phi are not exactly step functions but smooth approximations; this also means our output is continuous (and must be quantized in the case of classification).

Training neural networks

Given training data (x_1, y_1), ..., (x_n, y_n) and the parameterization above, we would like to estimate the parameters that make the network fit the data.
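For concreteness, the three output-activation choices listed above can be sketched as follows; the function names and test vector are my own, only the formulas come from the lecture.

    import numpy as np

    def identity(z):               # regression: leave the output unchanged
        return z

    def sigmoid(z):                # binary classification: squash into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):                # multiclass: nonnegative outputs summing to 1
        e = np.exp(z - np.max(z))  # subtract the max for numerical stability
        return e / e.sum()

    z = np.array([2.0, -1.0, 0.5])
    print(identity(z), sigmoid(z), softmax(z))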

Training neural networks: simplifying the notation

Note that we can always augment a 1 to our input data (increasing the dimension from d to d + 1) and constrain the corresponding weights so that the offsets are absorbed into the weight matrices. If we do this, we may omit b^{(1)} and b^{(2)} and arrive at the simpler formulation

    f(x) = \psi\Big( \sum_{j=1}^{M} w_j^{(2)} \, \phi\Big( \sum_{k=1}^{d+1} w_{jk}^{(1)} x_k \Big) \Big).

Letting W^{(1)} denote the matrix having rows w_j^{(1)\top} and W^{(2)} the matrix having rows given by the output weights, we can also write this as

    f(x) = \psi\big( W^{(2)} \phi( W^{(1)} x ) \big),

where \phi is applied element-wise.

Loss functions

Given training data (x_1, y_1), ..., (x_n, y_n) and parameters \theta = (W^{(1)}, W^{(2)}), we would like to estimate \theta. To emphasize the dependence of our network on the parameters, we will write the output of the network when given input x_i as f(x_i; \theta), or f_k(x_i; \theta) for vector-valued regression problems. We would like to choose \theta to ensure that f(x_i; \theta) \approx y_i. We can quantify this by choosing a loss function L(\theta), which we will seek to minimize by picking \theta appropriately. For regression or binary classification, a natural choice is the squared error

    L(\theta) = \sum_{i=1}^{n} \big( y_i - f(x_i; \theta) \big)^2.
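A quick sketch of the bias trick and the matrix-form forward pass, with the squared error loss evaluated on random data; the sizes, the sigmoid choice, and all names are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(X_aug, W1, w2):
        # X_aug carries a constant column of ones, so the offsets live inside W1
        Z = sigmoid(X_aug @ W1.T)      # hidden activations, shape (n, M)
        return Z @ w2                  # identity output activation (regression)

    def squared_error(preds, y):
        return np.sum((y - preds) ** 2)

    rng = np.random.default_rng(0)
    n, d, M = 5, 3, 4
    X = rng.normal(size=(n, d))
    X_aug = np.hstack([X, np.ones((n, 1))])    # augment a 1 to each input
    W1 = rng.normal(size=(M, d + 1))
    w2 = rng.normal(size=M)
    y = rng.normal(size=n)
    print(squared_error(forward(X_aug, W1, w2), y))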

Loss functions: general classification

In the case of classification with K classes, we can define the labels y_i as indicator vectors in R^K (e.g., y_i = (0, ..., 0, 1, 0, ..., 0), with the 1 marking the correct class). In this case a natural loss function is

    L(\theta) = - \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log f_k(x_i; \theta).

This is called the cross entropy. If we interpret the outputs of our neural network as class-conditional probabilities, this is computing the negative log-likelihood of \theta given the training data, and it is hence a natural quantity to minimize.

Nonconvex optimization

Because of the complex interactions between the parameters in \theta, these objective functions do not lead to convex optimization problems. A local minimum is not necessarily a global minimum; the best we can hope for is to find a local minimum, and hope that this gives us good performance in practice.

Aside: this is currently a very active area of research, but there is some preliminary evidence that the classic picture of a badly nonconvex landscape is not really an accurate depiction of what typically arises in practice. It may be that most or all local minima are roughly equally good! (Highly speculative.)

Gradient descent

A simple way to try to find the minimum of our objective function is to iteratively "roll downhill": from the current estimate \theta^{(t)}, step in the direction of the negative gradient,

    \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla L(\theta^{(t)}),

where \alpha > 0 is the step size.
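The sketch below illustrates both pieces: the cross entropy for a single one-hot label, and plain gradient descent on a one-dimensional nonconvex function, where the starting point determines which minimum you land in. The toy function g and all numbers are my own illustration, not from the lecture.

    import numpy as np

    def cross_entropy(probs, y_onehot):
        # negative log-likelihood of the predicted class probabilities
        return -np.sum(y_onehot * np.log(probs))

    print(cross_entropy(np.array([0.7, 0.2, 0.1]),
                        np.array([1.0, 0.0, 0.0])))   # = -log(0.7)

    # gradient descent on a nonconvex toy function g(t) = t^4 - 3 t^2 + t
    g      = lambda t: t**4 - 3*t**2 + t
    g_grad = lambda t: 4*t**3 - 6*t + 1

    t, alpha = 2.0, 0.01                 # starting point and step size
    for _ in range(500):
        t = t - alpha * g_grad(t)        # step along the negative gradient
    # starting from t = 2.0 this settles near the local minimum t ~ 1.13,
    # not the global minimum near t ~ -1.30
    print(t, g(t))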

Squared error loss

Today, we will focus on how to train the network to minimize the squared error loss

    L(\theta) = \sum_{i=1}^{n} \big( y_i - f(x_i; \theta) \big)^2.

Our task is to compute \nabla L(\theta) for some given \theta.

Deriving the gradient: second layer

For a scalar output, write the hidden activations as z_i = \phi(W^{(1)} x_i) and the output as f(x_i; \theta) = \psi(u_i) with u_i = w^{(2)\top} z_i. By the chain rule,

    \frac{\partial L}{\partial w_j^{(2)}} = \sum_{i=1}^{n} -2 \big( y_i - f(x_i; \theta) \big) \, \psi'(u_i) \, z_{ij}.

Set \delta_i = -2 \big( y_i - f(x_i; \theta) \big) (the prediction error) and \tilde{\delta}_i = \delta_i \, \psi'(u_i) (the weighted prediction error). Collecting these into vectors \delta and \tilde{\delta} = \delta \odot \psi'(u), where \odot denotes the Hadamard product (element-wise multiplication), we can more compactly write

    \nabla_{w^{(2)}} L = \sum_{i=1}^{n} \tilde{\delta}_i \, z_i = Z^\top \tilde{\delta},

where Z is the matrix whose rows are the z_i^\top.
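A small numerical sanity check of this second-layer gradient, under the same assumptions as the derivation (sigmoid hidden units, identity output, squared error); the finite-difference comparison and all names are my own addition.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def loss(W1, w2, X, y):
        Z = sigmoid(X @ W1.T)            # hidden activations
        return np.sum((y - Z @ w2) ** 2)

    def grad_w2(W1, w2, X, y):
        Z = sigmoid(X @ W1.T)
        delta = -2.0 * (y - Z @ w2)      # prediction error (psi' = 1 here)
        return Z.T @ delta               # = sum_i delta_i * z_i

    rng = np.random.default_rng(0)
    n, d, M = 6, 3, 4
    X, y = rng.normal(size=(n, d)), rng.normal(size=n)
    W1, w2 = rng.normal(size=(M, d)), rng.normal(size=M)

    # finite-difference check of the first coordinate of the gradient
    eps = 1e-6
    e0 = np.zeros(M); e0[0] = 1.0
    fd = (loss(W1, w2 + eps * e0, X, y) - loss(W1, w2 - eps * e0, X, y)) / (2 * eps)
    print(grad_w2(W1, w2, X, y)[0], fd)  # should agree closely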

Deriving the gradient: first layer

Pushing the chain rule back one more layer, with a_{ij} = \sum_k w_{jk}^{(1)} x_{ik} denoting the hidden pre-activations,

    \frac{\partial L}{\partial w_{jk}^{(1)}} = \sum_{i=1}^{n} \tilde{\delta}_i \, w_j^{(2)} \, \phi'(a_{ij}) \, x_{ik}.

We can express this in matrix form as \nabla_{W^{(1)}} L = S^\top X, where S is the matrix with entries s_{ij} = \tilde{\delta}_i \, w_j^{(2)} \, \phi'(a_{ij}) and X is the matrix whose rows are the inputs x_i^\top.

Gradient descent update

The update at iteration t is

    \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla L(\theta^{(t)}),

applied to both W^{(1)} and W^{(2)}.

Computing the gradient

Notice that given W^{(1)}, we can compute the hidden activations z_i; given W^{(2)}, we can then compute the outputs f(x_i; \theta), and hence \delta_i and \tilde{\delta}_i. The weights appearing in our formulas for the partial derivatives are those of the previous parameter estimate. Because of the special structure imposed by the network, we can calculate this gradient very efficiently: this suggests an efficient two-pass algorithm for computing the gradient at each iteration.
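And the matching sketch for the first-layer gradient in matrix form, again under the sigmoid/identity/squared-error assumptions, with a finite-difference check (my own illustration).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def loss(W1, w2, X, y):
        return np.sum((y - sigmoid(X @ W1.T) @ w2) ** 2)

    def grad_W1(W1, w2, X, y):
        Z = sigmoid(X @ W1.T)                    # hidden activations, (n, M)
        delta = -2.0 * (y - Z @ w2)              # prediction error (psi' = 1)
        S = (delta[:, None] * w2[None, :]) * Z * (1.0 - Z)   # entries s_ij
        return S.T @ X                           # shape (M, d), matches W1

    rng = np.random.default_rng(0)
    n, d, M = 6, 3, 4
    X, y = rng.normal(size=(n, d)), rng.normal(size=n)
    W1, w2 = rng.normal(size=(M, d)), rng.normal(size=M)

    eps = 1e-6
    E = np.zeros((M, d)); E[0, 0] = 1.0
    fd = (loss(W1 + eps * E, w2, X, y) - loss(W1 - eps * E, w2, X, y)) / (2 * eps)
    print(grad_W1(W1, w2, X, y)[0, 0], fd)       # should agree closely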

Backpropagation

Forward pass: using the current weights, feed each x_i into the network and compute the hidden activations z_i and the outputs f(x_i; \theta).

Backward pass: now compute the output-layer errors \tilde{\delta}_i (and hence \nabla_{W^{(2)}} L). Next, back-propagate these values to the previous layer to obtain the quantities s_{ij} (and hence \nabla_{W^{(1)}} L).
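Putting the two passes together gives a complete, if bare-bones, training loop. Everything here (toy data, sizes, step size, iteration count) is an illustrative assumption; the structure, forward pass then backward pass then a gradient step, is the point.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    n, d, M = 50, 3, 8
    X = rng.normal(size=(n, d))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)   # toy regression targets

    W1 = 0.5 * rng.normal(size=(M, d))
    w2 = 0.5 * rng.normal(size=M)
    alpha = 0.001                                    # step size

    def loss(W1, w2):
        return np.sum((y - sigmoid(X @ W1.T) @ w2) ** 2)

    print("initial loss:", loss(W1, w2))
    for _ in range(5000):
        # forward pass: feed each x_i through the network with current weights
        Z = sigmoid(X @ W1.T)              # hidden activations, (n, M)
        f = Z @ w2                         # outputs (identity output activation)

        # backward pass: propagate prediction errors back through the layers
        delta = -2.0 * (y - f)             # dL/df for the squared error loss
        grad_w2 = Z.T @ delta              # second-layer gradient
        S = (delta[:, None] * w2[None, :]) * Z * (1.0 - Z)
        grad_W1 = S.T @ X                  # first-layer gradient

        # gradient descent update
        w2 = w2 - alpha * grad_w2
        W1 = W1 - alpha * grad_W1
    print("final loss:", loss(W1, w2))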

Convergence speed

In backpropagation, each hidden unit passes and receives information only to and from the units to which it is connected. This is a much faster way to compute the gradient than a naïve approach, and it lends itself naturally to parallel implementations. Gradient descent can still be pretty slow to converge, however.

Initialization

Initial conditions make a big difference. To avoid saturation, it is often a good idea to pre-process the data to have zero mean and unit variance, or to be scaled to lie in a fixed interval such as [-1, 1]. Since our objective function will have many local minima, it is generally a good idea to try several random starting values. A common heuristic for initializing the weights in a fully connected layer mapping m inputs to k outputs is to sample each weight from a distribution whose scale shrinks with the number of inputs m.

Multi-layer neural networks

This framework, along with the backpropagation algorithm, can easily be extended to networks with multiple hidden layers. It has had a resurgence in recent years under the name deep learning.
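A short sketch of the preprocessing and initialization heuristics mentioned above. The exact initialization distribution is not given in the slides; the fan-in-scaled uniform choice below is one common heuristic, and the layer sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)

    def standardize(X):
        # preprocess inputs to zero mean and unit variance, per feature
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def init_layer(fan_in, fan_out):
        # one common heuristic: uniform on [-1/sqrt(fan_in), 1/sqrt(fan_in)]
        bound = 1.0 / np.sqrt(fan_in)
        return rng.uniform(-bound, bound, size=(fan_out, fan_in))

    X = standardize(rng.normal(loc=5.0, scale=3.0, size=(100, 10)))
    W1 = init_layer(10, 32)   # first hidden layer
    W2 = init_layer(32, 32)   # the same recipe stacks for multiple hidden layers
    W3 = init_layer(32, 1)    # output layer
    print(X.mean(axis=0).round(6), W1.shape, W2.shape, W3.shape)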