Global Optimality in Neural Network Training
1 Global Optimality in Neural Network Training. Benjamin D. Haeffele and René Vidal, Johns Hopkins University, Center for Imaging Science, Baltimore, USA.
2 Questions in Deep Learning Three themes: architecture design, optimization, and generalization.
3 Questions in Deep Learning Are there principled ways to design networks? How many layers? Size of layers? Choice of layer types? How does architecture impact expressiveness? [1] [1] Cohen, et al., On the expressive power of deep learning: A tensor analysis. COLT. (2016)
4 Questions in Deep Learning How to train neural networks? The problem is non-convex. What does the loss surface look like? [1] Are there any guarantees for network training? [2] How can we guarantee optimality? When will local descent succeed? [1] Choromanska, et al., "The loss surfaces of multilayer networks." Artificial Intelligence and Statistics. (2015) [2] Janzamin, et al., "Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods." arXiv (2015).
8 Questions in Deep Learning Performance guarantees? How do networks generalize? How should networks be regularized? How to prevent overfitting?
9 Interrelated Problems Architecture, optimization, and generalization/regularization interact: optimization can impact generalization [1]; architecture has a strong effect on the generalization of networks [2]; and some architectures could be easier to optimize than others. [1] Neyshabur, et al., In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning. ICLR workshop. (2015). [2] Zhang, et al., Understanding deep learning requires rethinking generalization. ICLR. (2017).
10 Today's Talk: The Questions Are there properties of the network architecture that allow efficient optimization? Positive homogeneity; parallel subnetwork structure. Are there properties of the regularization that allow efficient optimization? Positive homogeneity; adapting the network architecture to the data [1]. [1] Bengio, et al., Convex neural networks. NIPS. (2005)
14 Today's Talk: The Results A local minimum such that one subnetwork is all zero is a global minimum. Once the size of the network becomes large enough, local descent can reach a global minimum from any initialization.
17 Outline Architecture 1. Network properties that allow efficient optimization Positive Homogeneity Parallel Subnetwork Structure Generalization/ Regularization Optimization 2. Network size from regularization 3. Theoretical guarantees Sufficient conditions for global optimality Local descent can reach global minimizers
18 Key Property 1: Positive Homogeneity Start with a network mapping Φ(X, W) from inputs X and network weights W to the network outputs. Scale the weights by a non-negative constant α: the network output scales by that constant raised to some power, Φ(X, αW) = α^p Φ(X, W), where p is the degree of positive homogeneity.
24 Most Modern Networks Are Positively Homogeneous Example: Rectified Linear Units (ReLUs). For α ≥ 0, max(αz, 0) = α max(z, 0): scaling by a non-negative constant doesn't change the rectification pattern, so the ReLU is positively homogeneous of degree 1.
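The ReLU identity above extends to entire networks. The following is a minimal numeric sketch (the toy architecture, dimensions, and variable names are illustrative, not from the talk): a network with two weight layers and a ReLU in between is positively homogeneous of degree 2 in its weights.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def net(x, W1, W2):
    # Two weight layers with a ReLU in between: degree of homogeneity p = 2.
    return W2 @ relu(W1 @ x)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))

alpha = 2.5
# Scaling ALL the weights by alpha scales the output by alpha**p with p = 2.
assert np.allclose(net(x, alpha * W1, alpha * W2), alpha**2 * net(x, W1, W2))
```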
30 Most Modern Networks Are Positively Homogeneous Simple network: Input → Conv + ReLU → Conv + ReLU → Max Pool → Linear → Out. Typically each weight layer increases the degree of homogeneity by 1.
42 Most Modern Networks Are Positively Homogeneous Some common positively homogeneous layers: fully connected + ReLU, convolution + ReLU, max pooling, linear layers, mean pooling, max out, and many other possibilities. Sigmoids, however, are not positively homogeneous.
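Why sigmoids fail the property can be checked directly against the scaling identity (a small illustrative check, not from the talk):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-1.0, 0.5, 2.0])
alpha = 3.0

# ReLU satisfies max(alpha*z, 0) = alpha * max(z, 0) for alpha >= 0 ...
assert np.allclose(np.maximum(alpha * z, 0.0), alpha * np.maximum(z, 0.0))
# ... but the sigmoid does not scale this way: it is not positively homogeneous.
assert not np.allclose(sigmoid(alpha * z), alpha * sigmoid(z))
```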
44 Outline Architecture 1. Network properties that allow efficient optimization Positive Homogeneity Parallel Subnetwork Structure Generalization/ Regularization Optimization 2. Network regularization 3. Theoretical guarantees Sufficient conditions for global optimality Local descent can reach global minimizers
45 Key Property 2: Parallel Subnetworks Subnetworks with identical architecture connected in parallel. Simple example: a single-hidden-layer network, where each subnetwork is one ReLU hidden unit.
49 Key Property 2: Parallel Subnetworks Any positively homogeneous subnetwork can be used; for example, a subnetwork with multiple ReLU layers.
50 Key Property 2: Parallel Subnetworks Example: parallel AlexNets [1], where each subnetwork is a copy of AlexNet mapping the shared input to the output. [1] Krizhevsky, Sutskever, and Hinton. "Imagenet classification with deep convolutional neural networks." NIPS, 2012.
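For the single-hidden-layer case, the parallel structure is easy to verify numerically: the network output is exactly the sum of the outputs of its one-hidden-unit subnetworks. A sketch (dimensions and variable names are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
n, d, c, r = 5, 4, 3, 6   # samples, input dim, output dim, number of subnetworks
X = rng.normal(size=(n, d))
U = rng.normal(size=(d, r))  # hidden-layer weights, one column per subnetwork
V = rng.normal(size=(r, c))  # output weights, one row per subnetwork

# Full network output ...
full = relu(X @ U) @ V
# ... equals the sum over parallel subnetworks, each a single ReLU hidden unit.
parallel = sum(np.outer(relu(X @ U[:, i]), V[i]) for i in range(r))
assert np.allclose(full, parallel)
```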
51 Outline Architecture 1. Network properties that allow efficient optimization Positive Homogeneity Parallel Subnetwork Structure Generalization/ Regularization Optimization 2. Network regularization 3. Theoretical guarantees Sufficient conditions for global optimality Local descent can reach global minimizers
52 Basic Regularization: Weight Decay Weight decay penalizes the sum of squared norms of the network weights, which is positively homogeneous of degree 2 regardless of the network's depth. When the degrees of positive homogeneity of the network and the regularizer don't match, bad things happen. Proposition: There will always exist non-optimal local minima.
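The mismatch is concrete: a network with three weight layers is positively homogeneous of degree 3 in its weights, while weight decay is always degree 2. A small check (the toy architecture is illustrative, not from the talk):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def net(x, Ws):
    # K = 3 weight layers => the network is positively homogeneous of degree 3.
    a = x
    for W in Ws[:-1]:
        a = relu(W @ a)
    return Ws[-1] @ a

def weight_decay(Ws):
    # Sum of squared norms: positively homogeneous of degree 2, at any depth.
    return sum(np.sum(W**2) for W in Ws)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
Ws = [rng.normal(size=(6, 4)), rng.normal(size=(5, 6)), rng.normal(size=(3, 5))]

alpha = 2.0
assert np.allclose(net(x, [alpha * W for W in Ws]), alpha**3 * net(x, Ws))
assert np.isclose(weight_decay([alpha * W for W in Ws]),
                  alpha**2 * weight_decay(Ws))
# The degrees (3 vs. 2) don't match, which is the source of the problem.
```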
59 Adapting the size of the network via regularization Start with a positively homogeneous network with parallel structure. Take the weights of one subnetwork and define a regularization function on them that is non-negative and positively homogeneous with the same degree as the network mapping. Example: the product of the norms of the subnetwork's weights.
66 Adapting the size of the network via regularization Sum over all the subnetworks, and allow the number of subnetworks to vary: adding a subnetwork is penalized by an additional term in the sum, which acts to constrain the number of subnetworks.
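As a sketch for the single-hidden-layer case, using the product-of-norms regularizer on each subnetwork (function and variable names are illustrative):

```python
import numpy as np

def theta(u, v):
    # Per-subnetwork regularizer: product of norms. This is degree 2,
    # matching the degree-2 homogeneity of a single-hidden-layer network.
    return np.linalg.norm(u) * np.linalg.norm(v)

def Theta(U, V):
    # Sum over all subnetworks; the number of subnetworks r may vary.
    r = U.shape[1]
    return sum(theta(U[:, i], V[i]) for i in range(r))

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 6))
V = rng.normal(size=(6, 3))

small = Theta(U, V)
# Adding a nonzero subnetwork in parallel adds one more term to the sum,
# so the penalty strictly grows -- this is what constrains the network size.
U_big = np.hstack([U, rng.normal(size=(4, 1))])
V_big = np.vstack([V, rng.normal(size=(1, 3))])
assert Theta(U_big, V_big) > small
```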
71 Outline Architecture 1. Network properties that allow efficient optimization Positive Homogeneity Parallel Subnetwork Structure Generalization/ Regularization Optimization 2. Network regularization 3. Theoretical guarantees Sufficient conditions for global optimality Local descent can reach global minimizers
72 Our problem The non-convex problem we're interested in: minimize, over the network weights and the number of subnetworks, the loss between the labels and the network outputs plus the regularizer. Loss function: assumed convex and once differentiable in the network outputs. Examples: cross-entropy, least-squares.
76 Why do all this? The regularizer induces a convex function on the network outputs. The resulting convex problem provides an achievable lower bound for the non-convex network training problem, and the induced convex function can be used as an analysis tool to study the non-convex problem.
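A known special case makes the induced convex function concrete: for a linear single-hidden-layer network (i.e., matrix factorization) with the product-of-norms regularizer, the induced function is the nuclear norm. A numeric sketch (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))

# Balanced factorization A = U @ V.T built from the SVD: the product-of-norms
# penalty over its subnetworks equals the sum of singular values (nuclear norm).
Us, s, Vst = np.linalg.svd(A, full_matrices=False)
U = Us * np.sqrt(s)
V = Vst.T * np.sqrt(s)
prod_of_norms = sum(np.linalg.norm(U[:, i]) * np.linalg.norm(V[:, i])
                    for i in range(len(s)))
assert np.isclose(prod_of_norms, s.sum())

# Any other factorization of the same A costs at least as much, so the
# nuclear norm is an achievable lower bound over all factorizations.
R = rng.normal(size=(4, 4))
U2, V2 = U @ R, V @ np.linalg.inv(R).T   # still satisfies U2 @ V2.T == A
pon2 = sum(np.linalg.norm(U2[:, i]) * np.linalg.norm(V2[:, i]) for i in range(4))
assert pon2 >= prod_of_norms - 1e-9
```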
84 Sufficient Conditions for Global Optimality Theorem: A local minimum such that one subnetwork is all zero is a global minimum. Intuition: such a local minimum satisfies the optimality conditions for the convex problem.
87 Global Minima from Local Descent Theorem: If the size of the network is large enough (has enough subnetworks), then a global minimum can always be reached by local descent from any initialization. Meta-algorithm: If not at a local minimum, perform local descent. At a local minimum, test whether the first theorem is satisfied. If not, add a subnetwork in parallel and continue. The maximum number of subnetworks needed is guaranteed to be bounded by the dimensions of the network output.
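The meta-algorithm's control flow can be sketched as follows. This is a structural sketch only: `local_descent` is a placeholder for whatever local-descent method one uses, and the single-hidden-layer parameterization and helper names are illustrative, not the authors' implementation.

```python
import numpy as np

def local_descent(U, V):
    # Placeholder: a real version would descend the regularized training
    # objective until reaching a local minimum.
    return U, V

def has_all_zero_subnetwork(U, V, tol=1e-8):
    # Certificate from the first theorem: some subnetwork is entirely zero.
    return any(np.linalg.norm(U[:, i]) + np.linalg.norm(V[i]) < tol
               for i in range(U.shape[1]))

def add_subnetwork(U, V):
    # Grow the network by one parallel subnetwork, initialized at zero.
    return (np.hstack([U, np.zeros((U.shape[0], 1))]),
            np.vstack([V, np.zeros((1, V.shape[1]))]))

def meta_algorithm(U, V, max_subnetworks):
    U, V = local_descent(U, V)
    # If the certificate fails at a local minimum, add a subnetwork and retry;
    # the number of subnetworks needed is bounded (here, by max_subnetworks).
    while not has_all_zero_subnetwork(U, V) and U.shape[1] < max_subnetworks:
        U, V = add_subnetwork(U, V)
        U, V = local_descent(U, V)
    return U, V

rng = np.random.default_rng(0)
U0, V0 = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
U1, V1 = meta_algorithm(U0, V0, max_subnetworks=10)
assert has_all_zero_subnetwork(U1, V1)  # loop exits with the certificate met
```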
90 Conclusions Network size matters: optimize network weights AND network size. Current: size = number of parallel subnetworks. Future: size = number of layers, neurons per layer, etc. Regularization design matters: match the degrees of positive homogeneity between the network and the regularization; regularization can control the size of the network. Not done yet: several practical and theoretical limitations remain.
91 Thank You Vision Lab, Johns Hopkins University; Center for Imaging Science, Johns Hopkins University. Work supported by NSF grants.
Deep Indian Delicacy: Classification of Indian Food Images using Convolutional Neural Networks Shamay Jahan 1, Shashi Rekha 2. H, Shah Ayub Quadri 3 1 M. Tech., semester 4, DoS in CSE, Visvesvaraya Technological
More informationScalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA
Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School
More informationMachine learning for vision. It s the features, stupid! cathedral. high-rise. Winter Roland Memisevic. Lecture 2, January 26, 2016
Winter 2016 Lecture 2, Januar 26, 2016 f2? cathedral high-rise f1 A common computer vision pipeline before 2012 1. 2. 3. 4. Find interest points. Crop patches around them. Represent each patch with a sparse
More informationCOMP9444 Neural Networks and Deep Learning 5. Geometry of Hidden Units
COMP9 8s Geometry of Hidden Units COMP9 Neural Networks and Deep Learning 5. Geometry of Hidden Units Outline Geometry of Hidden Unit Activations Limitations of -layer networks Alternative transfer functions
More informationNVIDIA FOR DEEP LEARNING. Bill Veenhuis
NVIDIA FOR DEEP LEARNING Bill Veenhuis bveenhuis@nvidia.com Nvidia is the world s leading ai platform ONE ARCHITECTURE CUDA 2 GPU: Perfect Companion for Accelerating Apps & A.I. CPU GPU 3 Intro to AI AGENDA
More informationEnsemble methods in machine learning. Example. Neural networks. Neural networks
Ensemble methods in machine learning Bootstrap aggregating (bagging) train an ensemble of models based on randomly resampled versions of the training set, then take a majority vote Example What if you
More informationNeural Networks with Input Specified Thresholds
Neural Networks with Input Specified Thresholds Fei Liu Stanford University liufei@stanford.edu Junyang Qian Stanford University junyangq@stanford.edu Abstract In this project report, we propose a method
More informationCS 523: Multimedia Systems
CS 523: Multimedia Systems Angus Forbes creativecoding.evl.uic.edu/courses/cs523 Today - Convolutional Neural Networks - Work on Project 1 http://playground.tensorflow.org/ Convolutional Neural Networks
More informationMinimizing Computation in Convolutional Neural Networks
Minimizing Computation in Convolutional Neural Networks Jason Cong and Bingjun Xiao Computer Science Department, University of California, Los Angeles, CA 90095, USA {cong,xiao}@cs.ucla.edu Abstract. Convolutional
More informationINTRODUCTION TO DEEP LEARNING
INTRODUCTION TO DEEP LEARNING CONTENTS Introduction to deep learning Contents 1. Examples 2. Machine learning 3. Neural networks 4. Deep learning 5. Convolutional neural networks 6. Conclusion 7. Additional
More informationLearning Deep Features for Visual Recognition
7x7 conv, 64, /2, pool/2 1x1 conv, 64 3x3 conv, 64 1x1 conv, 64 3x3 conv, 64 1x1 conv, 64 3x3 conv, 64 1x1 conv, 128, /2 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv,
More informationHidden Units. Sargur N. Srihari
Hidden Units Sargur N. srihari@cedar.buffalo.edu 1 Topics in Deep Feedforward Networks Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation
More informationKnow your data - many types of networks
Architectures Know your data - many types of networks Fixed length representation Variable length representation Online video sequences, or samples of different sizes Images Specific architectures for
More informationarxiv: v3 [cs.cv] 29 May 2017
RandomOut: Using a convolutional gradient norm to rescue convolutional filters Joseph Paul Cohen 1 Henry Z. Lo 2 Wei Ding 2 arxiv:1602.05931v3 [cs.cv] 29 May 2017 Abstract Filters in convolutional neural
More informationReal-time convolutional networks for sonar image classification in low-power embedded systems
Real-time convolutional networks for sonar image classification in low-power embedded systems Matias Valdenegro-Toro Ocean Systems Laboratory - School of Engineering & Physical Sciences Heriot-Watt University,
More informationMachine Learning 13. week
Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of
More informationLecture 7: Neural network acoustic models in speech recognition
CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 7: Neural network acoustic models in speech recognition Outline Hybrid acoustic modeling overview Basic
More informationConvolutionalNN's... ConvNet's... deep learnig
Deep Learning ConvolutionalNN's... ConvNet's... deep learnig Markus Thaler, TG208 tham@zhaw.ch www.zhaw.ch/~tham Martin Weisenhorn, TB427 weie@zhaw.ch 20.08.2018 1 Neural Networks Classification: up to
More informationIndex. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning,
A Acquisition function, 298, 301 Adam optimizer, 175 178 Anaconda navigator conda command, 3 Create button, 5 download and install, 1 installing packages, 8 Jupyter Notebook, 11 13 left navigation pane,
More informationComparing Dropout Nets to Sum-Product Networks for Predicting Molecular Activity
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More information2. Neural network basics
2. Neural network basics Next commonalities among different neural networks are discussed in order to get started and show which structural parts or concepts appear in almost all networks. It is presented
More informationElastic Neural Networks for Classification
Elastic Neural Networks for Classification Yi Zhou 1, Yue Bai 1, Shuvra S. Bhattacharyya 1, 2 and Heikki Huttunen 1 1 Tampere University of Technology, Finland, 2 University of Maryland, USA arxiv:1810.00589v3
More informationTiny ImageNet Visual Recognition Challenge
Tiny ImageNet Visual Recognition Challenge Ya Le Department of Statistics Stanford University yle@stanford.edu Xuan Yang Department of Electrical Engineering Stanford University xuany@stanford.edu Abstract
More informationSpatial Localization and Detection. Lecture 8-1
Lecture 8: Spatial Localization and Detection Lecture 8-1 Administrative - Project Proposals were due on Saturday Homework 2 due Friday 2/5 Homework 1 grades out this week Midterm will be in-class on Wednesday
More informationNotes on Multilayer, Feedforward Neural Networks
Notes on Multilayer, Feedforward Neural Networks CS425/528: Machine Learning Fall 2012 Prepared by: Lynne E. Parker [Material in these notes was gleaned from various sources, including E. Alpaydin s book
More informationDeep Learning With Noise
Deep Learning With Noise Yixin Luo Computer Science Department Carnegie Mellon University yixinluo@cs.cmu.edu Fan Yang Department of Mathematical Sciences Carnegie Mellon University fanyang1@andrew.cmu.edu
More informationDynamic Routing Between Capsules
Report Explainable Machine Learning Dynamic Routing Between Capsules Author: Michael Dorkenwald Supervisor: Dr. Ullrich Köthe 28. Juni 2018 Inhaltsverzeichnis 1 Introduction 2 2 Motivation 2 3 CapusleNet
More informationPart Localization by Exploiting Deep Convolutional Networks
Part Localization by Exploiting Deep Convolutional Networks Marcel Simon, Erik Rodner, and Joachim Denzler Computer Vision Group, Friedrich Schiller University of Jena, Germany www.inf-cv.uni-jena.de Abstract.
More informationMind the regularized GAP, for human action classification and semi-supervised localization based on visual saliency.
Mind the regularized GAP, for human action classification and semi-supervised localization based on visual saliency. Marc Moreaux123, Natalia Lyubova1, Isabelle Ferrane 2 and Frederic Lerasle3 1 Softbank
More informationDeepIndex for Accurate and Efficient Image Retrieval
DeepIndex for Accurate and Efficient Image Retrieval Yu Liu, Yanming Guo, Song Wu, Michael S. Lew Media Lab, Leiden Institute of Advance Computer Science Outline Motivation Proposed Approach Results Conclusions
More informationROB 537: Learning-Based Control
ROB 537: Learning-Based Control Week 6, Lecture 1 Deep Learning (based on lectures by Fuxin Li, CS 519: Deep Learning) Announcements: HW 3 Due TODAY Midterm Exam on 11/6 Reading: Survey paper on Deep Learning
More information