Improving SGD
Hantao Zhang
Deep Learning with Python
Reading: http://neuralnetworksanddeeplearning.com/index.html Chapter 2

SGD: Stochastic Gradient Descent
Main Idea: Given a set of input/output examples D = { (x, y) }:
- Define the network as a function f(w, x) on weights w and input x.
- Define the cost, say C = ½ ∑_{(x,y)∈D} ‖a(x) − y‖² / |D|, and try to minimize it.
For each epoch, repeat the following:
1. compute a(x) = f(w, x) and C = ½ ∑_{(x,y)∈D} ‖a(x) − y‖² / |D|
2. compute ∂C/∂w
3. update w by w = w − η (∂C/∂w) to decrease C
For large datasets this is expensive: we don't want to load all the data D into memory, and the gradient depends on all the data.
An alternative: pick a small subset of examples, called a mini-batch B, with |B| << |D|:
- approximate the gradient using C = ½ ∑_{(x,y)∈B} ‖a(x) − y‖² / |B|
- on average ∂C/∂w is the right direction
- take a step in that direction
- repeat
|B| = 1 (one example per update) is a very popular choice, called online update.
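To make the loop concrete, here is a minimal sketch of one epoch of mini-batch SGD, assuming a helper gradient(w, xb, yb) that returns ∂C/∂w averaged over the batch (the helper name and signature are assumptions, not part of the course code):

import numpy as np

def sgd_epoch(w, X, Y, eta, batch_size, gradient):
    # shuffle once per epoch so each mini-batch is a random sample of D
    idx = np.random.permutation(len(X))
    X, Y = X[idx], Y[idx]
    for start in range(0, len(X), batch_size):
        xb = X[start:start + batch_size]
        yb = Y[start:start + batch_size]
        w = w - eta * gradient(w, xb, yb)   # step against the mini-batch gradient
    return w

With batch_size = 1 this reduces to online update; with batch_size = len(X) it is full-batch gradient descent.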
Batch Update
With on-line (stochastic) update, we update the weights after every pattern.
With batch update, we accumulate the changes for each weight over a batch and update the weights at the end of the batch.
Batch update usually gives a correct direction of the gradient for the entire data set, while on-line update can make some weight updates in directions quite different from the average gradient:
- individual instances are noisy, and
- a specific instance will generally not represent the average gradient.
Size of mini-batch? Another hyperparameter to choose through experiments and experience.

Stochastic Gradient Descent
Since the true gradient is only approximated, the loss will not always decrease (locally), because the training points in each mini-batch are chosen at random. Training still converges over time.
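A sketch contrasting the two update schedules, again assuming a hypothetical per-example gradient helper grad(w, x, y):

def online_update(w, X, Y, eta, grad):
    for x, y in zip(X, Y):
        w = w - eta * grad(w, x, y)             # step after every pattern
    return w

def batch_update(w, X, Y, eta, grad):
    g = sum(grad(w, x, y) for x, y in zip(X, Y))
    return w - eta * g / len(X)                 # one step with the accumulated gradient

Online update takes len(X) small, noisy steps per pass; batch update takes one step in (approximately) the average gradient direction.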
Computation in a General NN
There are L layers, ℓ ∈ {1, 2, ..., L}, plus the input layer (ℓ = 0); each layer is fully connected to the next.
An example of a 4-layer NN:
- 4 weight matrices: W_0, W_1, W_2, W_3, where W_i[j,k] = link weight from the j-th neuron in layer i to the k-th neuron in layer i+1
- 5 activations: y_0, y_1, y_2, y_3, y_4, of which 4 are layer outputs (excluding y_0 = x, the input)
- 4 weighted sums: z_0, z_1, z_2, z_3, where z_i = y_i W_i + b_i and y_{i+1} = a(z_i)
- 4 pairs of δ_i and θ_i: δ_0/θ_1, δ_1/θ_2, δ_2/θ_3, δ_3/θ_4

Let the cost be C ≡ ½ ‖y_L − y‖², and define θ_i ≡ ∂C/∂y_i and δ_i ≡ ∂C/∂z_i. Then:
- θ_L = (y_L − y)
- δ_i = θ_{i+1} ⊙ a′(z_i)
- θ_i = W_i δ_i
- ∂C/∂W_i[j,k] = y_i[j] δ_i[k], i.e. ∂C/∂W_i = y_iᵀ δ_i, the outer product of y_i (as an n×1 column) and δ_i (as a 1×m row)
For a mini-batch B, define ∇_i = ∑_B (∂C/∂W_i).
Update W_i by W_i = W_i − η ∇_i, or W_i = W_i − η ∇_i / |B|.

The code below improves readability; note that negative list indices were changed to positive (the loop runs from num_layers-1 down to 0).

def backprop(self, x, y):
    # feedforward computation
    y_act = x
    y_acts = [x]     # list to store all the activations
    z_sums = []      # list to store all the z vectors
    for b, w in zip(self.biases, self.weights):
        z = np.dot(y_act, w) + b
        z_sums.append(z)
        y_act = self.activation(z)
        y_acts.append(y_act)
    # backward propagation
    theta = self.cost_derivative(y_act, y)    # theta_L = (y_L - y)
    for i in range(self.num_layers-1, -1, -1):
        ad = self.activation_derivative(z_sums[i], y_acts[i+1])
        delta = theta * ad                    # delta_i = theta_{i+1} * a'(z_i)
        self.delta_b[i] = delta
        y_hat = y_acts[i][:, np.newaxis]      # y_i as a column vector
        delta_hat = delta[np.newaxis, :]      # delta_i as a row vector
        self.delta_w[i] = np.dot(y_hat, delta_hat)   # outer product: dC/dW_i
        if i > 0:
            theta = np.dot(self.weights[i], delta)   # theta_i = W_i delta_i
    return (self.delta_b, self.delta_w)
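One way to sanity-check a backprop implementation like this is a finite-difference gradient check. A minimal sketch, assuming a net object with weights, the backprop above, and a hypothetical cost(x, y) method returning C:

import numpy as np

def gradient_check(net, x, y, eps=1e-5):
    _, delta_w = net.backprop(x, y)                # analytic gradients dC/dW_i
    for i, W in enumerate(net.weights):
        j, k = 0, 0                                # spot-check one entry per layer
        old = W[j, k]
        W[j, k] = old + eps; c_plus = net.cost(x, y)
        W[j, k] = old - eps; c_minus = net.cost(x, y)
        W[j, k] = old                              # restore the weight
        numeric = (c_plus - c_minus) / (2 * eps)   # central difference
        print(i, numeric, delta_w[i][j, k])        # the two should agree closely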
Improve performance
Before: backprop processes one example at a time.

def update_batch(self, batch_x, batch_y, eta):
    # batch_x, batch_y: a mini-batch of examples
    # eta: learning rate
    nabla_b = np.array([np.zeros(b.shape) for b in self.biases])
    nabla_w = np.array([np.zeros(w.shape) for w in self.weights])
    for x, y in zip(batch_x, batch_y):
        delta_b, delta_w = self.backprop(x, y)
        nabla_w = nabla_w + delta_w
        nabla_b = nabla_b + delta_b
    self.weights -= eta * nabla_w
    self.biases -= eta * nabla_b

Improve performance
Now: a whole batch of examples is fed to backprop at once.

def update_batch(self, batch_x, batch_y, eta):
    # batch_x, batch_y: a mini-batch of examples
    # eta: learning rate
    nabla_w, nabla_b = self.backprop(batch_x, batch_y)
    for i in range(self.num_layers):
        self.weights[i] -= eta * np.sum(nabla_w[i], axis=0)
        self.biases[i] -= eta * np.sum(nabla_b[i], axis=0)

In both versions, for a mini-batch B we form ∇_i = ∑_B (∂C/∂W_i) and update W_i = W_i − η ∇_i (or W_i = W_i − η ∇_i / |B|).
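A rough way to see the payoff of batching is to time both versions on the same mini-batches. A sketch, assuming a net object exposing the two update methods under the hypothetical names update_batch_loop and update_batch_vec:

import time

def time_updates(net, batches, eta):
    for name, fn in [("loop", net.update_batch_loop),
                     ("vectorized", net.update_batch_vec)]:
        t0 = time.time()
        for bx, by in batches:
            fn(bx, by, eta)                 # same data, same learning rate
        print(name, time.time() - t0, "seconds")

The vectorized version wins because NumPy moves the per-example loop into optimized C code.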
Improve performance
Now a batch of examples is fed to backprop; x holds one example per row.

def backprop(self, x, y):
    # feedforward; x is a batch of examples
    y_act = x
    y_acts = [x]
    z_wsums = []
    for b, w in zip(self.biases, self.weights):
        z = np.dot(y_act, w) + b
        z_wsums.append(z)
        y_act = self.activation(z)
        y_acts.append(y_act)
    # backward propagation
    theta = self.cost_derivative(y_act, y)
    for i in range(self.num_layers-1, -1, -1):
        ad = self.activation_derivative(z_wsums[i], y_acts[i+1])
        delta = np.multiply(theta, ad)            # delta_i = theta_{i+1} * a'(z_i)
        y_hat = y_acts[i][:, :, np.newaxis]       # shape (batch, n, 1)
        delta_hat = delta[:, np.newaxis, :]       # shape (batch, 1, m)
        self.nabla_w[i] = np.multiply(y_hat, delta_hat)  # one outer product per example
        self.nabla_b[i] = delta
        if i > 0:
            theta = np.dot(delta, np.transpose(self.weights[i]))  # theta_i = delta_i W_i^T
    return (self.nabla_w, self.nabla_b)

MNIST: Database of handwritten digits
yann.lecun.com/exdb/mnist/, by Yann LeCun's team at NYU.
It has a training set of 60K examples (6K examples for each digit) and a test set of 10K examples.
Each digit is a 28 x 28 pixel grey-level image. The digit itself occupies the central 20 x 20 pixels, and its center of mass lies at the center of the box.
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
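The np.newaxis trick above computes one outer product per example via broadcasting. A small self-contained check that it matches an explicit per-example loop (all names are local to this example):

import numpy as np

batch, n, m = 4, 3, 2
y_i = np.random.randn(batch, n)       # activations, one row per example
delta = np.random.randn(batch, m)     # deltas, one row per example

vec = y_i[:, :, np.newaxis] * delta[:, np.newaxis, :]               # shape (batch, n, m)
loop = np.stack([np.outer(y_i[b], delta[b]) for b in range(batch)])
print(np.allclose(vec, loop))         # True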
MNIST: Database of handwritten digits
MNIST also keeps a performance record of image recognition programs, for example:
- LeCun's Convolutional Neural Network variations (0.8%, 0.6%, and 0.4% error on MNIST)
- Tangent Distance (Simard, LeCun & Denker: 2.5%)
- Randomized Decision Trees (Amit, Geman & Wilder: 0.8%)
- K-NN based shape context/TPS matching (Belongie, Malik & Puzicha: 0.6%)
- SVM on orientation histograms (Maji & Malik: 0.8%)
Network.py's performance:
- architecture = [784, 30, 10], epochs = 30: 4.6% error, 70 seconds
- architecture = [784, 60, 30, 10], epochs = 30: 4.2% error, 100 seconds

mnist.py

import gzip
import pickle
import numpy as np

def vectorized_digit(j):
    """Return a 10-dimensional unit vector with a 1.0 in the j-th
    position and zeroes elsewhere. This is used to convert a digit
    in (0...9) into a corresponding desired output from the neural
    network."""
    e = np.zeros((10))
    e[j] = 1.0
    return e

f = gzip.open('mnist.pkl.gz', 'rb')
train_data, valid_data, test_data = pickle.load(f, encoding="latin1")
f.close()
# train_data[0].shape = (50000, 784); 784 = 28x28, one flattened image per row
# train_data[1].shape = (50000,), the digit labels
# train_y has shape (50000, 10) after vectorizing
train_x = train_data[0]
train_y = [vectorized_digit(y) for y in train_data[1]]
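A quick usage check of vectorized_digit, using a few hand-picked digits in place of train_data[1]:

import numpy as np

sample_labels = [5, 0, 4]
sample_y = np.array([vectorized_digit(y) for y in sample_labels])
print(sample_y.shape)   # (3, 10)
print(sample_y[0])      # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]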
mnist.py

import time
start_time = time.time()

import mlnnsgd as mlnn
net = mlnn.network([784, 60, 10])
print('creating Network =', net.sizes)
print('weight shapes:', [w.shape for w in net.weights])
net.sgd(train_x, train_y, epochs=10, batch_size=100, eta=0.1, test_data=test_data)
print("run time: %s seconds" % (time.time() - start_time))

# in mlnn.py:
def evaluate(self, test_data):
    """Return the number of test inputs for which the neural
    network outputs the incorrect result."""
    digits = np.argmax(self.feedforward(test_data[0]), axis=1)
    return np.count_nonzero(digits - test_data[1])
    # The vectorized version above replaces this per-example loop:
    # s = 0
    # for x, y in zip(test_data[0], test_data[1]):
    #     if np.argmax(self.feedforward(x)) != y:
    #         s = s + 1
    # return s

Backpropagation Observations
The procedure is (relatively) efficient.
All computations are local: each node uses only its own inputs and outputs.
What is good enough?
- We rarely reach the target (0 or 1) outputs exactly.
- Typically, train until every output is within 0.1 of its target.
How can we further improve performance?
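A sketch of that stopping criterion, assuming a net with a batched feedforward as in evaluate above (the helper name and the one-hot targets Y are assumptions; the 0.1 tolerance is from the slide):

import numpy as np

def close_enough(net, X, Y, tol=0.1):
    # stop training once every output unit is within tol of its 0/1 target
    out = net.feedforward(X)
    return np.max(np.abs(out - Y)) < tol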
Hyperparameter Selection
Learning rate: pick a small value, e.g. 0.1, as a starting point.
Connectivity: typically fully connected between layers.
Number of hidden nodes:
- Too many nodes make learning slower and can overfit, though too many hidden nodes is usually OK if using a reasonable stopping criterion.
- Too few will underfit.
Number of layers: 1 (common) or 2 hidden layers are usually sufficient for good results; with more layers, attenuation makes learning very slow. Modern deep learning approaches show significant improvement using many layers.
Manually setting hyperparameters means trial-and-error runs:
- Often sequential, or binary, search: find a value for one hyperparameter with the others held constant, freeze it, find the next hyperparameter, etc.
- Random search is empirically the most consistently effective: typically each hyperparameter is chosen for each trial from a uniform distribution on a log scale (a sketch appears at the end of this slide).
- Hyperparameters can also be learned by the learning algorithm, in which case you must take care not to overfit the training data.

Performance of NN Training
Convergence of backpropagation:
- Goal: let gradient descent find a local minimum quickly.
- What affects convergence? NN size and training-set size, learning rate, initial weight values, derivative values.
Avoiding overfitting:
- Goal: generalize well and work better on unseen cases.
- What affects overfitting? NN architecture, weight values (kept small using weight decay through regularization), and stopping earlier (early stopping).
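A minimal sketch of log-scale random search for two of the hyperparameters above; the search ranges are assumptions chosen for illustration:

import numpy as np

def sample_hyperparams(rng):
    # draw each hyperparameter log-uniformly, so every order of magnitude
    # is equally likely to be tried
    eta = 10 ** rng.uniform(-4, 0)             # learning rate in [1e-4, 1]
    batch_size = int(2 ** rng.uniform(3, 8))   # batch size in [8, 256]
    return eta, batch_size

rng = np.random.default_rng(0)
trials = [sample_hyperparams(rng) for _ in range(20)]  # then train and keep the best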
Hidden Nodes
Typically one fully connected hidden layer.
A common initial choice is 2n or 2·log n hidden nodes, where n is the number of inputs.
In practice, train with a small number of hidden nodes, then keep doubling, etc., until there is no more significant improvement on the test sets (a sketch appears below).
All output and hidden nodes should have bias weights.
Hidden nodes discover new higher-order features which are fed into the output layer.
[Figure: network diagram with input nodes i, hidden nodes j, and output nodes k.]

Local Minima
SGD in general has more difficulty with simple tasks than with more complex tasks.
Good news with MLPs:
- Many dimensions make for many descent options.
- Local minima are more common with very simple/toy problems, and very rare with larger problems and larger nets.
- Even if there are occasional local-minima problems, one can simply train multiple nets and pick the best.
- Some algorithms add noise to the updates to escape local minima.
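A sketch of the doubling search for the hidden-layer size, assuming a hypothetical train_and_score(hidden) helper that trains a [784, hidden, 10] net and returns its test-set accuracy:

def grow_hidden(train_and_score, start=8, max_hidden=1024, min_gain=0.005):
    best_acc, best_h = 0.0, None
    h = start
    while h <= max_hidden:
        acc = train_and_score(h)           # train a net with h hidden nodes
        if acc - best_acc < min_gain:      # no significant improvement: stop
            break
        best_acc, best_h = acc, h
        h *= 2                             # keep doubling
    return best_h, best_acc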
Local Minima and Neural Networks
Neural networks can get stuck in local minima in small networks, but for most large networks (many weights) local minima rarely occur in practice.
This is because with so many weight dimensions it is unlikely that we are at a minimum in every dimension simultaneously; there is almost always a way down.

Backpropagation Summary
Excellent empirical results.
Scaling is the pleasant surprise: local minima become very rare as problem and network complexity increase.
It is the most common neural network approach, and many other styles of neural networks exist.
User-defined parameters are usually handled by multiple experiments.
Many variants:
- Regression: typically linear output nodes, normal hidden nodes.
- Adaptive parameters: many different learning-algorithm approaches.
- Higher-order gradient descent (Newton, conjugate gradient, etc.).
- Recurrent networks.
- Deep networks!
Still an active research area.