MLPQNA-LEMON Multi Layer Perceptron neural network trained by Quasi Newton or Levenberg-Marquardt optimization algorithms

1 Introduction

In supervised Machine Learning (ML) we have a set of data points or observations for which we know the desired output, expressed in terms of categorical classes, numerical or logical variables, or as a generic observed description of any real problem. The desired output provides some level of supervision, in that the learning model uses it to adjust its parameters or make decisions, allowing it to predict the correct output for new data. When the algorithm is able to correctly assign observations to classes, we call it a classifier. Some classifiers can also provide results in a more probabilistic sense, i.e. the probability that a data point belongs to a class. When the predicted output is instead a continuous numerical value, we usually refer to the model behavior as regression.

A typical workflow for supervised learning is shown in the diagram of Figure 1.

Figure 1 The typical workflow based on supervised machine learning.

The process is based on the following main steps:

- Pre-processing of data. First we need to build input patterns that are appropriate for feeding into our supervised learning algorithm. This includes scaling and preparation of data;
- Create data sets for training and evaluation. This is done by randomly splitting the universe of data patterns. The training set is made of the data used by the classifier to learn the internal correlations among their features, whereas the evaluation set is used to validate the already trained model, in order to get an error rate (or other validation measures) that helps to assess the performance and accuracy of the classifier. Typically, you will use more training data than validation data;
- Training of the model. We execute the model on the training data set. The output result consists of a model that (in the successful case) has learned how to predict the outcome when new unknown data are submitted;
- Validation.
After we have created the model, we must of course test its performance in terms of accuracy, completeness and contamination (or its dual, the purity). It is particularly crucial to do this on data that the model has not seen yet. This is the main reason why, in the previous steps, we separated the data set into training patterns and a subset of data not used for training. We intend to verify and measure the generalization capabilities of the model. It is very easy to learn every single combination of input vectors and their mappings to the output as observed in the training data, and we can achieve a very low error in doing so, but how do the very same rules or mappings perform on new data that may have different input-to-output mappings? If the classification error on the validation set is substantially higher than the training error, then we have to go back and adjust the model parameters. The reason could be that the model has essentially memorized the answers seen in the training data, losing its generalization capabilities. This is the typical behavior in case of overfitting, and there are various techniques for overcoming it;
- Use (Run). If validation was successful, the model has correctly learned the underlying real problem. We can then proceed to use the model to classify/predict new data.

As suggested by its name, the tool MLPQNA-LEMON refers to two kinds of supervised ML models. Both are based on the same topological architecture (namely the Multi Layer Perceptron neural network) and differ at the level of the backward learning phase, where it is possible to choose between the QNA and the LEMON rule.

The QNA (Quasi Newton Algorithm) learning rule belongs to the family of Newton's methods, aimed at finding the stationary point of a function through a statistical approximation of the Hessian of the training error, obtained by a cyclic gradient calculation. MLPQNA makes use of the well-known L-BFGS algorithm (Limited-memory Broyden-Fletcher-Goldfarb-Shanno), originally designed for problems with a wide parameter space.

LEMON (LEvenberg-Marquardt Optimization Network) is based on the modified Levenberg-Marquardt method, which makes use of the exact Hessian of the error function (and not of its linearized approximation). For networks with up to several hundreds of weights this algorithm is comparable with the QNA (often faster).
But its main advantage is that it does not require stopping criteria. This method almost always converges exactly to one of the minima of a function.

2 Implementation information

The implementation is in C++. It has been tested and validated on 64-bit machines running MS Windows 7/8/10 or Linux Ubuntu/SL. It does not require external packages (it embeds some third-party packages as local libraries), except for some DLLs (Dynamic Link Libraries) in the case of Windows OS. It is executed through a set of command-line parameters, organized into options, use cases and functional use cases.

The MLPQNA rule is based on the L-BFGS algorithm (limited-memory BFGS), a quasi-Newton method with fixed iteration cost O(Npatterns · Weights) and moderate memory requirements O(Weights). This algorithm is ideally suited to solving large-scale problems, and it also performs well on problems of average and small dimensions.

LEMON is based on the modified Levenberg-Marquardt method, using the exact Hessian of the error function (NOT its linearized approximation). For networks with up to several hundreds of weights this algorithm is comparable with L-BFGS (and often faster). Its main advantage is that it does not require any stopping criteria to be specified (the two QNA parameters wstep and iterations are not used). This method will almost always converge exactly to one of the minima of a function. Nevertheless, it is at a disadvantage when solving large-scale problems: it has a high iteration cost, equal to O(Npatterns · Weights^2), and high memory requirements, equal to O(Weights^2).

3 Learning rules

The learning algorithm for an MLP must update the network weights in order to minimize the error function by following either of the mentioned rules (QNA or LEMON).

Among the various learning parameters, the most important concerns regularization. The implemented MLPQNA and LEMON models use Tikhonov regularization (also known as weight decay). When the regularization factor is accurately chosen, the generalization error of the trained neural network can be improved and training can be accelerated. However, the best decay parameter cannot be chosen a priori, since it strongly depends on the specific problem; it must be selected through a heuristic trial-and-error process. Therefore, if it is unknown what decay value to choose (as is usual), one can experiment with values in the range from 0.001 (weak regularization) up to 100 (very strong regularization). The search should start from the minimum and increase the Decay value by a factor of 3 to 10 at each step, while checking the network's generalization error by cross-validation. Note that if the specified Decay value is too small (less than 0.001), it is automatically increased up to the permissible minimum: the MLPQNA+LEMON package always applies at least a minimum regularization.

In order to implement the weight decay rule, we minimize a more complex merit function:

$$f = E + \frac{\lambda}{2} S$$

Here E is the training set error, S is the sum of squares of the network weights, and the decay coefficient λ controls the amount of smoothing applied to the network. Optimization is performed from the initial point until the optimizer stops successfully.

The figure below shows a spectrum of neural networks trained with different values of λ, from zero (no regularization) to infinitely large λ. It can be seen that we control the tendency to overfit by continuously changing λ. Zero λ corresponds to an overfitted network. Infinitely large λ gives an underfitted network, equal to some constant.
Between these extreme values there is a range of networks which reproduce the dataset with different degrees of precision and smoothness. Again, as shown, the perfect network is outside of this range. We can choose a good neural network by tuning the weight decay coefficient λ. The optimal λ can be selected by using a test set or cross-validation (in the latter case the whole dataset can be used for training).

Figure 2 Learning trend by optimization as conditioned by the decay parameter.

For classification problems we use an MLP network with a linear output layer and SOFTMAX normalization as output evaluation. The network outputs are nonnegative and their sum (over all neurons of the output layer) is strictly equal to one, which permits using them as the probabilities that the input vector belongs to each of the classes (in the limit, the outputs of the trained network converge to these probabilities). The number of outputs in such a network is always at least two (a restriction imposed by elementary logic).
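The SOFTMAX normalization mentioned above can be illustrated with a minimal Python sketch. It is not taken from the package: the function name and the sample values are hypothetical and only show how the raw values of a linear output layer become nonnegative numbers that sum to one.

```python
import math


def softmax(linear_outputs):
    """Turn the raw values of a linear output layer into class probabilities."""
    # Subtracting the maximum does not change the result,
    # but it avoids overflow inside exp() for large inputs.
    shift = max(linear_outputs)
    exps = [math.exp(v - shift) for v in linear_outputs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of a 3-class network for one input pattern.
probs = softmax([2.0, 1.0, -1.0])
print(probs, sum(probs))
```

Every component is strictly positive and the components sum to one (up to rounding), so the vector can be read as a set of class probabilities.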

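The trial-and-error search for the decay value described earlier in this section can be sketched as follows. This is only an illustration under stated assumptions: `decay_candidates`, `select_decay` and `toy_cv_error` are hypothetical names, and the toy error function stands in for a real cross-validated training run of the network.

```python
def decay_candidates(start=0.001, stop=100.0, factor=10.0):
    """List decay values from start to stop, multiplying by factor at each step."""
    if not 3.0 <= factor <= 10.0:
        raise ValueError("the text suggests a growth factor between 3 and 10")
    values = []
    step = 0
    while True:
        value = start * factor ** step
        # Small tolerance so that rounding does not drop the last value.
        if value > stop * (1.0 + 1e-9):
            break
        values.append(value)
        step += 1
    return values


def select_decay(candidates, cv_error):
    """Return the candidate with the lowest cross-validation error."""
    return min(candidates, key=cv_error)


def toy_cv_error(decay):
    # Stand-in for a real cross-validation estimate (hypothetical shape).
    return (decay - 1.0) ** 2


grid = decay_candidates()
best = select_decay(grid, toy_cv_error)
print(grid, best)
```

In a real experiment `cv_error` would train the network with the given decay and return the generalization error measured by cross-validation.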
There are two basic views commonly held in statistics on what the solution of a classification problem should look like. The first viewpoint is that any object belongs to one and only one of the classes. For example, in email classification the "spam" and "non-spam" classes can be distinguished. There can be some uncertainty in the classification (an email can be somewhat similar to spam), but only the final decision, whether it is spam or non-spam, is returned.

The second approach, the one used by the MLPQNA+LEMON package, consists in obtaining a vector of posterior probabilities, that is, a vector whose components are the probabilities that the object belongs to each class. The algorithm does not take any decision on the classification of an email. It just reports the probability that a particular email is spam and the probability that it is not. The decision based on this information is left to the user.

The second approach is more flexible than the first one, and it is more reasonable: the classification algorithm cannot know the priorities of the user. In some cases it is necessary to minimize the error made in one of the classes, e.g. the misclassification of a legitimate email as spam. Then the email is classified as spam only if there is a very low probability (e.g. less than 0.05%) that it is NOT spam. In other cases all classes are equally important, and the class with the maximum conditional probability can simply be chosen. Therefore, the outcome of any classification algorithm of our package is a posterior probability vector, instead of the class to which an object is assigned.

After the model is built, the error on a test (or training) set needs to be estimated.
To estimate regression results, three measures of error can be used: the root-mean-square error, the average error and the average relative error (the latter being calculated over the records with a nonzero value of the dependent variable). These three measures of error are commonly known and need not be discussed.

For a classification problem, five measures of error can be used. The first and best-known is the classification error (the number or percentage of incorrectly classified cases). The second, equally well known, is the cross-entropy. The MLPQNA+LEMON package uses the average cross-entropy per record, estimated in bits (base 2 logarithm). Using the average cross-entropy (instead of the total cross-entropy) yields comparable estimates for different test sets.

The remaining three error measures are again the root-mean-square error, the average error and the average relative error. However, as opposed to the regression task, here they characterize the error in the posterior probability vector: how much the probability vector calculated by the classification algorithm differs from the vector obtained from the test set (whose components are equal to 0 or 1, depending on the class the object belongs to). The meaning of the root-mean-square error and of the average error is straightforward: it is the error in the approximation of the conditional probabilities, averaged over all probabilities. The average relative error is the average error in approximating the probability that an object is correctly classified (it is the same as the average error for binary tasks).
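The five classification measures described above can be sketched in a few lines of Python. This is one reading of the definitions in the text, not the package code: the function name, the dictionary keys and the small clamp that protects the logarithm are assumptions introduced for illustration.

```python
import math


def classification_errors(probs, targets):
    """Compute the five error measures from posterior vectors and 0/1 targets."""
    n = len(probs)
    true_idx = [t.index(1) for t in targets]
    # 1. Fraction of records whose most probable class is not the true one.
    wrong = sum(1 for p, k in zip(probs, true_idx) if p.index(max(p)) != k)
    # 2. Average cross-entropy per record in bits (the clamp avoids log2(0)).
    ce_bits = sum(-math.log2(max(p[k], 1e-300)) for p, k in zip(probs, true_idx)) / n
    # 3-4. RMS and average error over all probability components.
    diffs = [pj - tj for p, t in zip(probs, targets) for pj, tj in zip(p, t)]
    rms = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    avg = sum(abs(d) for d in diffs) / len(diffs)
    # 5. Average error on the probability assigned to the correct class.
    avg_rel = sum(abs(1.0 - p[k]) for p, k in zip(probs, true_idx)) / n
    return {
        "classification_error": wrong / n,
        "avg_cross_entropy_bits": ce_bits,
        "rms_error": rms,
        "avg_error": avg,
        "avg_rel_error": avg_rel,
    }


# Hypothetical binary task: the first record is classified correctly, the second is not.
measures = classification_errors([[0.75, 0.25], [0.25, 0.75]], [[1, 0], [1, 0]])
print(measures)
```

On this binary example the average error and the average relative error coincide, in agreement with the remark in the text.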
Taking into account these considerations, the MLP error functions implemented in the MLPQNA-LEMON package (calculated at the end of each batch Hessian cycle) are the following.

REGRESSION ERROR

Least Square error + Tikhonov regularization:

$$E = \sum_{i=1}^{N_{patterns}} \frac{(y_i - t_i)^2}{2} + \|W\|^2 \frac{\lambda}{2}$$

where y and t are, respectively, the output and the target for each pattern, while W is the weight matrix of the MLP.
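As a worked illustration of the regression error above, the following Python sketch evaluates the least-square term and the Tikhonov penalty on a handful of made-up numbers. The function name and the flat list of weights are assumptions; the clamp of the decay to 0.001 mirrors the minimum regularization mentioned in Section 3.

```python
def regression_error(outputs, targets, weights, decay):
    """Least-square error plus Tikhonov (weight decay) penalty."""
    # Per the manual, a decay below 0.001 is raised to the permissible minimum.
    decay = max(decay, 0.001)
    data_term = sum((y - t) ** 2 for y, t in zip(outputs, targets)) / 2.0
    penalty = decay * sum(w * w for w in weights) / 2.0
    return data_term + penalty


# Hypothetical values: two patterns and two network weights.
error = regression_error([1.0, 2.0], [0.0, 2.0], [1.0, -2.0], decay=0.5)
print(error)
```

Here the data term is 0.5 and the penalty is 0.5 · 5 / 2 = 1.25, so the total error is 1.75.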

CLASSIFICATION ERROR

1. Cross entropy enabled: cross-entropy per record estimated in bits (logarithm);

$$E = \sum_{i=1}^{N_{patterns}} \ln\left(\frac{1}{y_i}\right)$$

2. Cross entropy disabled: percentage of misclassified patterns at each cycle.

From a theoretical point of view, it is possible to write down the (extremely complex) expression of the regression function that calculates the output (for example the zphot value) of an MLP network (trained by MLPQNA+LEMON) with two hidden layers. Given:

- $i = 1 \ldots N$, index of the N input neurons (features, for example related to magnitudes/colors) for a single input pattern $x = \{x_i\}_N$;
- $h_1 = 1 \ldots H_1$, index of the $H_1$ neurons of the first hidden layer;
- $h_2 = 1 \ldots H_2$, index of the $H_2$ neurons of the second hidden layer;
- $y$, index of the single output neuron;
- $w_{h_1 i}$, $w_{h_2 h_1}$, $w_{y h_2}$, weights among network layers (respectively, hidden1-input, hidden2-hidden1 and output-hidden2);
- $\theta_{h_1}$, $\theta_{h_2}$, $\theta_y$, biases of the different neuron layers (respectively, hidden 1, hidden 2 and output);
- the activation function $f(x) = \tanh(x)$, used by each neuron of all layers;

We obtain:

$$zphot(x) = \tanh\left( \sum_{h_2=1}^{H_2} w_{y h_2} \tanh\left( \sum_{h_1=1}^{H_1} w_{h_2 h_1} \tanh\left( \sum_{i=1}^{N} w_{h_1 i} x_i - \theta_{h_1} \right) - \theta_{h_2} \right) - \theta_y \right)$$

where each hyperbolic tangent expands into exponentials as

$$\tanh(a - \theta) = \frac{e^{2a} - e^{2\theta}}{e^{2a} + e^{2\theta}}$$

Substituting this identity at each of the three levels gives the fully explicit exponential form of zphot(x).
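The nested expression for the network output can be checked numerically with a short Python sketch. All weights and biases below are hypothetical, the function names are illustrative, and the sign convention of the biases (subtracted from the weighted sum) is an assumption based on the exponential form of the formula.

```python
import math


def layer(inputs, weights, thetas):
    """One tanh layer: weighted sum of the inputs minus the bias of each neuron."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) - theta)
        for row, theta in zip(weights, thetas)
    ]


def zphot(x, w_h1, th_h1, w_h2, th_h2, w_y, th_y):
    """Output of an MLP with two hidden layers and a single output neuron."""
    h1 = layer(x, w_h1, th_h1)
    h2 = layer(h1, w_h2, th_h2)
    return layer(h2, [w_y], [th_y])[0]


def tanh_exponential(a, theta):
    """Exponential form of tanh(a - theta)."""
    return (math.exp(2 * a) - math.exp(2 * theta)) / (math.exp(2 * a) + math.exp(2 * theta))


# Hypothetical network: 2 inputs, 2 neurons in hidden layer 1, 1 neuron in hidden layer 2.
out = zphot(
    x=[0.5, -1.0],
    w_h1=[[0.1, 0.2], [-0.3, 0.4]], th_h1=[0.05, -0.05],
    w_h2=[[0.7, -0.6]], th_h2=[0.1],
    w_y=[1.5], th_y=0.2,
)
print(out)
```

Since the output neuron also applies tanh, the result always lies strictly between -1 and 1.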

4 Main commands

In terms of command-line interface, we distinguish among three kinds of interfaces:

- Command-line options (e.g. prompt>> mlpqna-lemon [option]):
  o help: list of command lines valid for all functional use cases;
  o version: code version information;
  o author: author information;
- Command-line use cases (e.g. prompt>> mlpqna-lemon [use_case]):
  o Classification: multivariate classification;
  o Regression: non-linear regression;
- Command-line functional use cases (e.g. prompt>> mlpqna-lemon [use_case] [function]):
  o QNA TRAIN: training using QNA learning rule;
  o LEMON TRAIN: training using LEMON learning rule;
  o TEST: testing;
  o RUN: execution of trained/tested model on arbitrary data.

In order to select the specific use case at command-line level, the user must provide a sequence of parameters, which differs according to the specific use case. The list of these command-line parameters is provided in the following sections.

4.1 TRAINING command line

This is the exact composition of the command-line parameter list in the case of training:

>> mlpqna_lemon use_case function decay restarts wstep iterations ninp nout nhidlay nhid1 nhid2 CE input_path CV k 1 W_init W_name exp_dir

Here is the description of each parameter:

1. mlpqna_lemon: name of the program;
2. use_case: [integer] code for the use case (10 classification, 20 regression);
3. function: [integer] functional use case (3 QNA train, 7 LEMON train);
4. decay: [float] decay parameter of the QNA/LEMON weight updating law. Weight decay constant (>=0.001). The decay term 'Decay*||Weights||^2' is added to the error function. Default value = 0.001;
5. restarts: [integer] max number of random gradient calculations approximating the Hessian;
6. wstep: [integer] max error on Hessian approximation (fixed to 0 for LEMON);
7. iterations: [integer] max number of iterations for each restart (fixed to 0 for LEMON);
8. ninp: [integer] number of input nodes (input data features);
9. nout: [integer] number of output nodes (1 for regression, arbitrary for classification);
10. nhidlay: [integer] number of hidden layers of the network (1 or 2);
11. nhid1: [integer] number of neurons of the first hidden layer;
12. nhid2: [integer] number of neurons of the second hidden layer;
13. 1: fixed (internal reasons);
14. input_path: [string] relative/absolute pathname of the input data file;
15. CV: [1/0] flag to enable/disable k-fold cross validation;
16. k: [integer] number of cross validation folds (irrelevant if CV is disabled);
17. CE: [1/0] flag to enable/disable Cross Entropy (used for classification only);

18. W_init: [integer] weight initialization choice. It specifies how to initialize the network weights: 702 for RANDOM initialization in [-1, +1], or 704 for FROM_FILE, to be used to resume a previous training phase;
19. W_name: [string] name of the weight file (with full relative path if loaded from a different directory) used to initialize the network weights. To be used when parameter 18 is set to FROM_FILE; it is ignored if parameter 18 is RANDOM;
20. exp_dir: [string] relative pathname for the output (DIR must exist and ex/re or ex/cl must be created within it).

4.2 TEST/RUN command line

This is the exact composition of the command-line parameter list in the case of test/run:

>> mlpqna_lemon use_case function ninp nout nhidlay nhid1 nhid2 input_path 1 trainedweightsfile trainedparamsfile exp_dir

Here is the description of each parameter:

1. mlpqna_lemon: name of the program;
2. use_case: [integer] code for the use case (10 classification, 20 regression);
3. function: [integer] functional use case (4 test, 5 run);
4. ninp: [integer] number of input nodes (input data features);
5. nout: [integer] number of output nodes (1 for regression, arbitrary for classification);
6. nhidlay: [integer] number of hidden layers of the network (1 or 2);
7. nhid1: [integer] number of neurons of the first hidden layer;
8. nhid2: [integer] number of neurons of the second hidden layer;
9. input_path: [string] relative/absolute pathname of the input data file;
10. 1: fixed (internal reasons);
11. trainedweightsfile: [string] relative pathname of the model trained weights file;
12. trainedparamsfile: [string] relative pathname of the model trained parameters file;
13. exp_dir: [string] relative pathname for the output.

5 Input/Output

The following sections describe the type and contents of the input/output interface, depending on the functional use case.
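Returning to the command lines of Section 4, the following Python sketch assembles the two parameter lists as argument vectors, following the order of the two synopsis lines. It does not execute the program. The helper names, all file names and all numeric settings are hypothetical; in particular, passing trainedweights.txt and frozen_train_net.txt as the two trained-model files, and the placeholder used for W_name with RANDOM initialization, are assumptions.

```python
def build_training_command(use_case, function, decay, restarts, wstep, iterations,
                           ninp, nout, nhidlay, nhid1, nhid2, ce, input_path,
                           cv, k, w_init, w_name, exp_dir, program="mlpqna_lemon"):
    """Return the training argument list in the order of the Section 4.1 synopsis."""
    if function == 7:
        # wstep and iterations are fixed to 0 for LEMON.
        wstep, iterations = 0, 0
    decay = max(decay, 0.001)  # minimum decay accepted by the package
    args = [program, use_case, function, decay, restarts, wstep, iterations,
            ninp, nout, nhidlay, nhid1, nhid2, ce, input_path, cv, k, 1,
            w_init, w_name, exp_dir]
    return [str(a) for a in args]


def build_test_run_command(use_case, function, ninp, nout, nhidlay, nhid1, nhid2,
                           input_path, weights_file, params_file, exp_dir,
                           program="mlpqna_lemon"):
    """Return the test/run argument list in the order of the Section 4.2 synopsis."""
    args = [program, use_case, function, ninp, nout, nhidlay, nhid1, nhid2,
            input_path, 1, weights_file, params_file, exp_dir]
    return [str(a) for a in args]


# Hypothetical regression experiment trained with LEMON (function 7),
# random weight initialization (702) and no cross validation.
train_cmd = build_training_command(
    use_case=20, function=7, decay=0.001, restarts=20, wstep=0, iterations=0,
    ninp=4, nout=1, nhidlay=2, nhid1=9, nhid2=3, ce=0,
    input_path="train.csv", cv=0, k=0, w_init=702, w_name="none", exp_dir="myexp")

# Matching test command (function 4) on a hypothetical test file.
test_cmd = build_test_run_command(
    use_case=20, function=4, ninp=4, nout=1, nhidlay=2, nhid1=9, nhid2=3,
    input_path="test.csv", weights_file="trainedweights.txt",
    params_file="frozen_train_net.txt", exp_dir="myexp")

print(" ".join(train_cmd))
print(" ".join(test_cmd))
```

The resulting lists could be handed to a process launcher such as `subprocess.run` once the real paths are known.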
5.1 Input

Input data must be a file in CSV format, without metadata headers. For classification, the class target columns associated with each input pattern must be represented in binary codification (e.g. 100, 010, 001 for 3-class labels).

5.2 Training Output

When executed under the training use case, the output is composed of the following files, stored in a predefined directory sub-tree. This sub-tree starts from the execution directory and branches into two different sub-trees, depending on the functionality domain of the current execution:

- ./ex/cl for the classification case
- ./ex/re for the regression case

In one of such directories the following output files are automatically generated at the end of execution:

- errorlog.txt: error report file, containing details about any incorrect condition or exception that caused the abnormal exit from the execution. This file is not created if the program ends normally;
- trainlog.txt: log file with detailed information about experiment configuration, main results and parameter setup;
- trainpartialerror.txt: ASCII (space separated) file with partial values at each training iteration of the QNA algorithm. Useful to obtain a graphical view of the learning process. Each row is composed of three columns:
  o training step;
  o number of iterations of the current step (number of Hessian approximations <= MaxIts);
  o current step batch error (MSE, or Cross Entropy value if selected in classification mode);
- trainedweights.txt: final network weights frozen at the end of batch training. It can be used in a new training experiment to restore the old one;
- frozen_train_net.txt: internal network node values as frozen at the end of training, to be given as network input file in test/run cases;
- traintestoutlog.txt: output values as calculated after training, with respective target values. It can be used to evaluate the network output for each input pattern. It corresponds to an embedded test session done by submitting the training dataset as test dataset;
- train_output.txt: ASCII file with network outputs and related targets for all input patterns (simplified, not verbose, version of traintestoutlog.txt, for internal use only);
- traintestconfmatrix.txt: confusion matrix calculated at the end of training. It results from the values stored in the traintestoutlog.txt file. Useful to obtain a simple statistical evaluation of the whole training results.
In the case of regression it is an adapted version.

5.3 Test Output

When executed under the test use case, the output is composed of the following files, stored in a predefined directory sub-tree. This sub-tree starts from the execution directory and branches into two different sub-trees, depending on the functionality domain of the current execution:

- ./ex/cl for the classification case
- ./ex/re for the regression case

In one of such directories the following output files are automatically generated at the end of execution:

- errorlog.txt: error report file, containing details about any incorrect condition or exception that caused the abnormal exit from the execution. This file is not created if the program ends normally;
- testoutlog.txt: output values as calculated after test, with respective target values. It can be used to evaluate the network output for each input pattern;
- test_output.txt: ASCII file with network outputs and related targets for all input patterns (simplified, not verbose, version of testoutlog.txt, for internal use only);

- testconfmatrix.txt: confusion matrix calculated at the end of test. It results from the values stored in the testoutlog.txt file. Useful to obtain a simple statistical evaluation of the whole test results. In the case of regression it is an adapted version.

5.4 Run Output

When executed under the run use case, the output is composed of the following files, stored in a predefined directory sub-tree. This sub-tree starts from the execution directory and branches into two different sub-trees, depending on the functionality domain of the current execution:

- ./ex/cl for the classification case
- ./ex/re for the regression case

In one of such directories the following output files are automatically generated at the end of execution:

- errorlog.txt: error report file, containing details about any incorrect condition or exception that caused the abnormal exit from the execution. This file is not created if the program ends normally;
- run_output.txt: output values as calculated by the trained network for each input pattern submitted in the run.
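As a closing illustration of the input format of Section 5.1, the Python sketch below writes a headerless CSV in which each pattern carries its class in binary codification. The helper names are hypothetical, and two layout details are assumptions not stated in this manual: the target columns are placed after the feature columns, and the comma is used as separator.

```python
import csv
import io


def one_hot(class_index, n_classes):
    """Binary codification of a class label, e.g. class 0 of 3 becomes 1,0,0."""
    return [1 if j == class_index else 0 for j in range(n_classes)]


def to_csv_text(features, class_indices, n_classes):
    """Build headerless CSV rows: features first, then one-hot targets (assumed layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row, k in zip(features, class_indices):
        writer.writerow(list(row) + one_hot(k, n_classes))
    return buffer.getvalue()


# Two hypothetical patterns with two features each, in a 3-class problem.
text = to_csv_text([[0.1, 0.2], [0.3, 0.4]], [0, 2], 3)
print(text)
```

For a real experiment the same rows would be written to the file passed as input_path.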