
CSC 2515 Introduction to Machine Learning Assignment 2
Zhongtian Qiu (1002274530)

Problem 1

See the attached scan files for question 1.

2. Neural Network

2.1 Examine the statistics and plots of training error and validation error (generalization).

The Matlab version of this problem runs the training 5 times at 100 epochs each; in Python we can either train for more than 500 epochs in a single run or save and reload the model between runs, and I prefer the first option. I set eps = 0.1, kept momentum at 0, and trained for 5000 epochs. The final cross entropy and classification rates are:

Step 4999 Train CE 0.00541 Validation CE 0.05453 mean_classification_error 0.00000
Error: Train 0.00541 Validation 0.05453 Test 0.07188
fr_rate: Train 0.00000 Validation 0.03000 Test 0.03000

Above are the final results after 5000 epochs. From the detailed printout I observed the following:

1. Within the 5000 iterations the training cross entropy keeps going down, while the validation cross entropy reaches its lowest point (0.0520) at about 2600 epochs and then starts to rise again due to overfitting.
2. The training set is fit perfectly (100% classification rate) within the 5000 iterations, starting at roughly 2400 epochs. The validation classification error, however, reaches its lowest point (0.015) at around 800 epochs and then starts to rise again due to overfitting.
3. Since the cross entropy changes very little near its optimum, the validation classification error is the more useful signal, which makes about 800 epochs a reasonable place to stop. Because the cross entropy and the classification error reach their optima at roughly 2600 and 800 epochs respectively, it is still worth plotting the curves over a longer range, out to 5000 epochs, to see how the whole curve behaves.
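For reference, the experiment above can be reproduced in miniature without the starter code. The following is a minimal, self-contained NumPy sketch (not the assignment's nn code): a one-hidden-layer sigmoid network trained by batch gradient descent with momentum 0, recording the validation cross entropy every epoch so the early-stopping point can be read off. The synthetic data and layer sizes are placeholders, not the assignment's digit data.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cross_entropy(y, t):
        tiny = 1e-12  # avoid log(0)
        return -np.mean(t * np.log(y + tiny) + (1 - t) * np.log(1 - y + tiny))

    # Synthetic binary data standing in for the 2s-vs-3s digit features.
    def make_data(n):
        x = rng.normal(size=(n, 2))
        t = (x[:, 0] + x[:, 1] > 0).astype(float).reshape(-1, 1)
        return x, t

    x_train, t_train = make_data(200)
    x_valid, t_valid = make_data(100)

    num_hidden, eps, num_epochs = 10, 0.1, 500   # eps = learning rate, momentum = 0
    W1 = 0.01 * rng.normal(size=(2, num_hidden)); b1 = np.zeros(num_hidden)
    W2 = 0.01 * rng.normal(size=(num_hidden, 1)); b2 = np.zeros(1)

    valid_ce = []
    for epoch in range(num_epochs):
        # Forward pass.
        h = sigmoid(x_train @ W1 + b1)
        y = sigmoid(h @ W2 + b2)
        # Backward pass (batch gradient descent).
        dz2 = (y - t_train) / len(x_train)          # gradient of mean CE w.r.t. output logit
        dW2, db2 = h.T @ dz2, dz2.sum(0)
        dz1 = (dz2 @ W2.T) * h * (1 - h)
        dW1, db1 = x_train.T @ dz1, dz1.sum(0)
        W2 -= eps * dW2; b2 -= eps * db2
        W1 -= eps * dW1; b1 -= eps * db1
        # Record validation cross entropy for early stopping.
        y_valid = sigmoid(sigmoid(x_valid @ W1 + b1) @ W2 + b2)
        valid_ce.append(cross_entropy(y_valid, t_valid))

    print("validation CE is lowest at epoch", int(np.argmin(valid_ce)))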

2.2 Classification Error

Let's set eps = 0.1 and momentum = 0.0, and keep the number of epochs at 3000, which is enough to see convergence. The results are:

Step 0 Train CE 0.69316 Validation CE 0.69315 mean_classification_error 0.50000
Step 100 Train CE 0.68543 Validation CE 0.68560 mean_classification_error 0.14167
Step 200 Train CE 0.48862 Validation CE 0.49274 mean_classification_error 0.05167
Step 2999 Train CE 0.01102 Validation CE 0.05210 mean_classification_error 0.00000
Error: Train 0.01102 Validation 0.05210 Test 0.06818
fr_rate: Train 0.00000 Validation 0.02000 Test 0.03000

2.3 Learning rate

The last experiment showed that some models need more than 2000 iterations, so I set the number of epochs to 5000 to see how the curves continue to change, and I record the optimal accuracy and classification error rate at the lowest point of each curve. First I use a controlled-variable approach: keep either eps or momentum fixed and vary the other to see its effect on the curves. Then I try every combination of eps and momentum to look for more patterns and the best parameters.

2.3.1 Controlled-variable experiments (eps and momentum separately)

2.3.1.1 eps

I keep momentum at 0.0 and the number of hidden units at 10, and run 5000 epochs to see the whole picture of the curves.

Above is when eps = 0.01.

Above is when eps = 0.2.

Above is when eps = 0.5.

Conclusions: 1) The larger the learning rate (eps), the fewer epochs it takes to converge, for both the classification rate and the cross entropy. 2) Except for eps = 0.01, where the curves have not converged yet, the lowest points before overfitting are very close across the other settings; there is no big difference.

2.3.1.2 momentum

I set eps = 0.5, which converged fastest in the previous experiment, and keep the other parameters unchanged to study momentum (since eps = 0.5 with momentum = 0 was already shown above, I skip that combination):

Above is when momentum = 0.5.

Above is when momentum = 0.9.

Conclusions: 1) As with eps, the larger the momentum, the faster the curves converge, for both cross entropy and classification rate. 2) At the best point of each curve, there is no big difference in the best cross entropy or classification rate.

2.3.2 Trying all combinations

I ran all 9 combinations of eps = {0.01, 0.2, 0.5} and momentum = {0, 0.5, 0.9}. The overall results are below:

eps    mom    Cross Entropy    error rate    optimal iter for CE    optimal iter for error rate    notes
0.01   0      0.10198          0.02          5000                   5000                           everything is still going down
0.01   0.5    0.06139          0.015         5000                   4300
0.01   0.9    0.05211          0.015         800                    2700
0.2    0      0.05243          0.015         1300                   400
0.2    0.5    0.05298          0.015         700                    200
0.2    0.9    0.05292          0.02          200                    100
0.5    0      0.0524           0.02          500                    100
0.5    0.5    0.05266          0.02          300                    100
0.5    0.9    0.05894          0.015         100                    400                            goes up and down many times at the beginning

From this table it is clear that the larger eps or momentum is, the faster the curves converge, for both classification rate and cross entropy. In addition, the best points of the different combinations are not very different. There is, however, an interesting combination: eps = 0.01 with momentum = 0.9, whose classification-rate curve is shown below. There are fluctuations at the beginning of the curve, meaning the algorithm bounces around early in training. In such a case, if we do not choose the number of epochs carefully, training may stop at a false optimum before it has really converged. To avoid this, I would suggest keeping eps and momentum reasonably close to each other.

Conclusions: 1) Larger eps and momentum help the curves converge more quickly. 2) Do not set eps and momentum too far apart from each other, which may produce false optima. 3) If we only care about the speed of convergence, I would set the parameters as large as possible, because there is no big difference in the final classification rate or cross entropy.
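The table above comes from a plain double loop over the two grids. A sketch of that loop is below; train_nn is a hypothetical wrapper around the assignment's training code, assumed to return the per-epoch validation cross entropy and classification error, and is not part of the starter code.

    import numpy as np

    # train_nn(eps, momentum, num_epochs) is a hypothetical wrapper, assumed to
    # return two per-epoch arrays: validation cross entropy and validation
    # classification error. It is an assumption, not the starter code.
    def run_grid(train_nn, num_epochs=5000):
        for eps in (0.01, 0.2, 0.5):
            for momentum in (0.0, 0.5, 0.9):
                valid_ce, valid_err = train_nn(eps, momentum, num_epochs)
                valid_ce, valid_err = np.asarray(valid_ce), np.asarray(valid_err)
                print(f"eps={eps} mom={momentum}: "
                      f"best CE {valid_ce.min():.5f} at iter {valid_ce.argmin()}, "
                      f"best err {valid_err.min():.3f} at iter {valid_err.argmin()}")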

2.4 Number of hidden units

Since we have already seen the influence of the other parameters over 5000 epochs, I use 1000 epochs here, which is enough to see convergence.

Above is when number of hidden units = 2.

Above is when number of hidden units = 5.

Above is when number of hidden units = 10.

Above is when number of hidden units = 30.

Above is when number of hidden units = 100.

The effect of this modification on the convergence properties:
- The larger the number of hidden units, the more quickly the curves converge.
- The larger the number of hidden units, the lower the cross entropy after training for the same number of epochs.
- The error fraction (fr_rate) at the end of training does not differ much across different numbers of hidden units.

2.5 Compare k-NN and Neural Networks

when k = 1, classification rate of valid is 0.985, of test is 0.985
when k = 3, classification rate of valid is 0.990, of test is 0.988
when k = 5, classification rate of valid is 0.980, of test is 0.988
when k = 7, classification rate of valid is 0.985, of test is 0.990
when k = 9, classification rate of valid is 0.985, of test is 0.988
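The kNN numbers above need very little code. Below is a minimal, self-contained NumPy sketch of the evaluation, with synthetic data standing in for the digit features used in the assignment.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic binary data standing in for the 2s-vs-3s digit features.
    def make_data(n):
        x = rng.normal(size=(n, 2))
        t = (x[:, 0] + x[:, 1] > 0).astype(int)
        return x, t

    x_train, t_train = make_data(600)
    x_valid, t_valid = make_data(200)

    def knn_predict(x_query, x_train, t_train, k):
        # Squared Euclidean distance from every query point to every training point.
        d = ((x_query[:, None, :] - x_train[None, :, :]) ** 2).sum(-1)
        nearest = np.argsort(d, axis=1)[:, :k]         # indices of the k nearest
        votes = t_train[nearest]                       # their labels
        return (votes.mean(axis=1) > 0.5).astype(int)  # majority vote

    for k in (1, 3, 5, 7, 9):
        pred = knn_predict(x_valid, x_train, t_train, k)
        print(f"k = {k}: validation classification rate = {(pred == t_valid).mean():.3f}")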

The kNN results are shown above. The classification rate of kNN is slightly better than that of the neural network: roughly 0.99 versus 0.985 on the validation set and 0.99 versus 0.97 on the test set. Interestingly, the classification rates of kNN are very close to one another (within about 0.01) across the different values of k, which means the classification of this data is fairly insensitive to the choice of k.

- Efficiency: In this case kNN runs a little faster than the neural network (10 hidden units). However, as the dataset grows larger, the neural network will clearly classify faster, since it only needs a forward pass while kNN must compare each query against the whole training set.
- Non-parametric vs parametric: kNN has no learned parameters, while the neural network requires many parameters to implement the algorithm.
- Model sophistication: The neural network involves an input layer, an output layer, possibly many hidden layers, and back-propagation at every iteration, so it is inevitably a complicated model. kNN, on the other hand, is easy to implement and only takes distances into account.
- Accuracy: In this case kNN is slightly better than the neural network on the test set and at least matches it on the validation set. However, from what I have read, neural networks generally perform better.

3 Mixture of Gaussians

3.2 Training

3.2.1 randConst & iter

I tried randConst = {0.01, 0.1, 1, 2, 5, 10} and found that the choice of randConst makes little difference. I wrote a new function, q2_1, which trains with each randConst more than 100 times and returns the largest log-likelihood reached; the best randConst turns out to differ from run to run. I therefore ran 100 trials and, in each trial, recorded which value in {0.01, 0.1, 1, 2, 5, 10} produced the highest log-likelihood. The win counts are:

randConst = 0.01: 0
randConst = 0.1: 4
randConst = 1: 22
randConst = 2: 32
randConst = 5: 24
randConst = 10: 18

Over these 100 runs there is no big difference among values of randConst larger than 1. Since randConst = 2 performs best among these settings, I choose 2 as my randConst. After several runs I also found that iter = 10 is not enough to see the curves converge: the model for the 2s usually converges within 6-15 iterations, while the model for the 3s converges within 15-20 iterations. I therefore set iter = 20 for this question.
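For reference, the training loop being tuned here is plain EM for a mixture of diagonal-covariance Gaussians. Below is a minimal, self-contained NumPy sketch of it; the role given to randConst (scaling the random spread of the initial means) is my reading of the parameter, not the starter code itself, and the names p, mu, vary simply mirror the notation used in this report.

    import numpy as np

    def mog_em(x, K, iters=20, randConst=2.0, min_vary=0.01, seed=0):
        rng = np.random.default_rng(seed)
        N, D = x.shape
        p = np.full(K, 1.0 / K)                      # mixing proportions
        mu = x.mean(0) + randConst * rng.normal(size=(K, D)) * x.std(0)
        vary = np.tile(x.var(0), (K, 1)) + min_vary  # diagonal variances
        log_likelihoods = []
        for _ in range(iters):
            # E-step: responsibilities r[n, k] = P(component k | x_n).
            log_pdf = -0.5 * (np.log(2 * np.pi * vary).sum(1)
                              + (((x[:, None, :] - mu) ** 2) / vary).sum(2))
            a = np.log(p) + log_pdf
            m = a.max(1, keepdims=True)
            log_px = m + np.log(np.exp(a - m).sum(1, keepdims=True))
            r = np.exp(a - log_px)
            log_likelihoods.append(float(log_px.sum()))
            # M-step: re-estimate proportions, means, and diagonal variances.
            Nk = r.sum(0)
            p = Nk / N
            mu = (r.T @ x) / Nk[:, None]
            vary = np.maximum((r.T @ x ** 2) / Nk[:, None] - mu ** 2, min_vary)
        return p, mu, vary, log_likelihoods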

3.2.2 Mean vectors and variance vectors

I extracted the training examples for the 2s and the 3s separately and plotted the mean and variance of each class after training the mixture of Gaussians.

Above is the mean and variance for the 2s.

Above is the mean and variance for the 3s.

3.2.4 Mixing proportions

For the 2s, the mixing proportions are 0.51664302 and 0.48335698. For the 3s, the mixing proportions are 0.55233519 and 0.44766481.

3.2.5 LogP(Training Data)

I ran my code for the 2s and the 3s separately, and both runs converge within 15 iterations. For the model of the 2s, the final log probability is -3868.40634; for the model of the 3s, the final log probability is 2355.06382. The plots are shown below.

Above is the plot for the 2s.

Above is the plot for the 3s.

3.3 Initializing a mixture of Gaussians with k-means

3.3.1 Speed of convergence

I tried the usual random initialization and the k-means initialization, and printed the log probability for each. The plots are below:

Above is the run that uses k-means initialization.

Above is the run without k-means initialization.

You can see that the run without k-means initialization takes more than 15 iterations of EM to converge, while the run with k-means converges quickly, at around 6 to 10 iterations of EM. This demonstrates that the mixture of Gaussians can indeed be made to converge faster by using k-means to initialize mu. It also makes sense intuitively: running k-means as a form of preprocessing starts the Gaussians off in a more reasonable and sensible place than random initialization.
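A minimal sketch of this initialization step is below: run a few iterations of k-means and hand the resulting centers to EM as the initial mu. The helper is self-contained; only the idea of replacing the random mu with these centers is taken from the assignment.

    import numpy as np

    def kmeans_centers(x, K, num_iters=5, seed=0):
        rng = np.random.default_rng(seed)
        centers = x[rng.choice(len(x), size=K, replace=False)]
        for _ in range(num_iters):
            # Assign each point to its nearest center.
            d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            assign = d.argmin(axis=1)
            # Move each center to the mean of its assigned points.
            for k in range(K):
                if np.any(assign == k):
                    centers[k] = x[assign == k].mean(axis=0)
        return centers

    # Usage: mu_init = kmeans_centers(train_data, K) would then replace the
    # random mu initialization passed to the EM routine.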

3.3.2 Final log-prob

Using the traditional (random) initialization:

Iter 0 logprob -51780.78153
Iter 1 logprob -7881.57381
Iter 2 logprob 16113.54780
Iter 3 logprob 21412.38686
Iter 4 logprob 22911.00821
Iter 5 logprob 23656.62129
Iter 6 logprob 24106.71679
Iter 7 logprob 24655.18689
Iter 8 logprob 24817.27402
Iter 9 logprob 24964.65708
Iter 10 logprob 25058.68011
Iter 11 logprob 25066.77599
Iter 12 logprob 25077.79039
Iter 13 logprob 25088.22694
Iter 14 logprob 25088.23409
Iter 15 logprob 25088.23409
Iter 16 logprob 25088.23410
Iter 17 logprob 25088.23410
Iter 18 logprob 25088.23410
Iter 19 logprob 25088.23410
Logprob : Train 41.813723 Valid 20.572443 Test 19.843835

Initialized with k-means, iter = 20:

Iter 0 logprob -35209.90540
Iter 1 logprob 26052.44447
Iter 2 logprob 27984.23275
Iter 3 logprob 28400.52485
Iter 4 logprob 28582.26903
Iter 5 logprob 28616.27712
Iter 6 logprob 28728.35964
Iter 7 logprob 28754.39175
Iter 8 logprob 28792.53961
Iter 9 logprob 28823.78598
Iter 10 logprob 28839.07842
Iter 11 logprob 28839.08860
Iter 12 logprob 28839.08860
Iter 13 logprob 28839.08860
Iter 14 logprob 28839.08860
Iter 15 logprob 28839.08860
Iter 16 logprob 28839.08860
Iter 17 logprob 28839.08860
Iter 18 logprob 28839.08860
Iter 19 logprob 28839.08860
Logprob : Train 48.065148 Valid 20.868113 Test 19.633125

From this output we can see that the run initialized with k-means ends up with a higher log probability. Initializing the parameters with k-means therefore not only speeds up convergence, but also gives a better maximum of the log probability.

3.4 Classification using MoGs

I assume the priors of the two models are each 0.5. Since both P(x) and the priors are constant across the two classes, we can classify using P(x|d) alone, because P(d|x) is proportional to P(x|d).
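A minimal, self-contained sketch of this decision rule is below. The parameter names p, mu, vary follow the notation used earlier in this report; in practice the values would come from the two trained mixtures.

    import numpy as np

    def mog_log_likelihood(x, p, mu, vary):
        # log P(x) under a diagonal-covariance MoG, for each row of x.
        # x: (N, D), p: (K,) mixing proportions, mu: (K, D), vary: (K, D).
        log_pdf = -0.5 * (np.log(2 * np.pi * vary).sum(axis=1)
                          + (((x[:, None, :] - mu[None, :, :]) ** 2)
                             / vary[None, :, :]).sum(axis=2))
        # log sum_k p_k N(x | mu_k, vary_k), computed stably with log-sum-exp.
        a = np.log(p)[None, :] + log_pdf
        m = a.max(axis=1, keepdims=True)
        return (m + np.log(np.exp(a - m).sum(axis=1, keepdims=True))).ravel()

    def classify(x, model2, model3):
        # model2 and model3 are (p, mu, vary) tuples for the two trained MoGs.
        # Equal priors, so compare class-conditional log-likelihoods directly.
        ll2 = mog_log_likelihood(x, *model2)
        ll3 = mog_log_likelihood(x, *model3)
        return np.where(ll2 >= ll3, 2, 3)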

I tried numbers of mixture components in {2, 3, 5, 10, 15, 20, 25, 30}.

3.4.1 You should find that the error rates on the training sets generally decrease as the number of clusters increases. Explain why.

With more clusters, it becomes easier to assign each training example to the right cluster after training. It is similar to adding more variables to a logistic regression model: a model with more parameters is more likely to fit the training data within a given number of iterations.

3.4.2 Examine the error rate curve for the test set and discuss its properties. Explain the trends that you observe.

I ran the script several times and the resulting plots are not exactly the same each time. Sometimes K = 25 gives the lowest average error rate, as in the plot above, and sometimes K = 20 is the best point, as in the plot below.

I think the reason is that the model fits better as we increase the number of mixture components at first, which gives a more accurate clustering of the data. Beyond a certain number of mixture components, however, say 20 or 25 in this problem, the model becomes relatively overfit, so while the training classification error keeps going down, the test classification error starts to go up.

3.4.3 If you wanted to choose a particular model from your experiments as the best, how would you choose it? If your aim is to achieve the lowest error rate possible on the new images your system will receive, which model (number of clusters) would you select? Why?

I would choose it by observing and comparing the error rates. If I could choose the number of mixture components freely, I would probably pick 20, based on running my code more than 10 times. However, if I could only choose from the values given in 3.4, I would pick 15. When the number of mixture components is too small, the training set still has room (by adding more clusters) to improve accuracy, but when it is too large, the model becomes overfit, which hurts the validation and test sets. K = 15 sits in between: it trains the training data well without overfitting the model. The plot I made below illustrates this:

3.5 Bonus Question: Mixture of Gaussians vs Neural Network

3.5.1 Visualize and compare

Above are the last 3 of the 30 subplots of the neural network's input-layer weights. Since 15 components was our best choice for the 2s and for the 3s respectively, I suppose we should use 30 hidden units (i.e. K = 30) for the neural network, because it uses the whole 600-example dataset directly. With 30 units each subplot is quite small (so I only plot the last three, to see them more clearly), but it is enough to see that the pixels inside each subplot are blurrier and rougher than the MoG ones. I see several reasons for this:

- In MoG, we trained the 2s and the 3s separately, which should be more accurate than the neural network training on both classes together.
- In MoG, we plotted the mean and variance, which describe the trained model directly. In the neural network, the input-to-hidden weights are only part of the model's properties, and the rest live in the hidden layer; the plot of W1 therefore only tells part of the story.
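A sketch of how such weight images can be produced is below. The 16x16 image size is an assumption about the digit data, and W1 here is a random placeholder standing in for the trained input-to-hidden weights.

    import numpy as np
    import matplotlib.pyplot as plt

    # Each column of W1 (or each MoG mean vector) has one entry per input pixel,
    # so it can be reshaped back to the image grid and displayed.
    num_pixels, num_hidden = 16 * 16, 30
    W1 = np.random.default_rng(0).normal(size=(num_pixels, num_hidden))

    fig, axes = plt.subplots(1, 3, figsize=(9, 3))
    for ax, j in zip(axes, range(num_hidden - 3, num_hidden)):  # last three units
        ax.imshow(W1[:, j].reshape(16, 16), cmap="gray")
        ax.set_title(f"hidden unit {j}")
        ax.axis("off")
    plt.show()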

As for the classification rate, the results are below. The classification error rates on the validation and test sets are 0.025 and 0.0325 respectively, slightly worse than MoG but still good.

CE: Train 0.00735 Validation 0.05332 Test 0.06959
fr_rate: Train 0.00000 Validation 0.02500 Test 0.03250

                   Neural Networks (30 hidden units)    MoG (K=15)
Training Set       1 [0/600]                            1 [0/600]
Validation Set     0.975 [5/200]                        0.995 [1/200]
Test Set           0.9675 [13/400]                      0.985 [6/400]

3.5.2 Visualize the input to hidden weights as images to see what your network has learned

I only picked the last three subplots so we can see them more clearly.

This is the random initialization of the input-layer weights.

This is after 30 iterations.

This is the end of 60 iterations.

After 90 iterations the curves have converged, and the pixel plots likewise do not change much after that. As shown above, the plots get clearer as training runs for more iterations: in the beginning the pixels are randomly dispersed, while by the end every plot has some clearly bright and dark regions.

3.5.3 Compare hidden unit weights versus mixture components

Intuitively, both the probabilities in the mixture of Gaussians and the hidden-unit weights in the neural network are supposed to represent features. In MoG, the probabilities can never be negative; a value close to zero pushes the result toward 0, i.e. toward a 2, and the other way around toward a 3. In the neural network, the weights can be negative, which tends to push the final output toward 0 (a 2), or positive, which tends to push the final output toward 1 (a 3). I printed the last three entries of w2 and plotted the corresponding last three subplots, as below:

The corresponding w2 values are:

[-0.18654953]
[ 2.0379637 ]
[ 1.59050723]

- You can see it clearly from the pictures. If we take the bright part of each picture to represent the features of either a 2 or a 3, the first picture mostly resembles a 2 and the other two clearly resemble 3s. I also examined all the other subplots and found that negative weights correspond to features of 2s while positive weights correspond to features of 3s.
- This makes sense: if the weight is negative, the final output is pushed toward 0, which makes the subplot resemble a 2, and vice versa.
- It is in some way similar to the mixing proportion (the probability p in the code): when it is close to 0 it pushes the final result toward 0, which makes it resemble a 2, and vice versa as well.
- I would say the hidden units are more like features than clusters. In MoG, K (the number of mixture components) is the number of clusters and p (the mixing proportion) is the probability of belonging to each cluster. In the neural network, by contrast, each weight vector tries to stand for a part of the 2 or 3 model, a feature or pattern that recognizes a particular part of the digit.
- The way the neural network works is to learn those features part by part.
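As a tiny numeric illustration of this point (a simplification that ignores the bias and the other hidden units), the three w2 values reported above push a sigmoid output toward 0 or 1 as described:

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = 1.0  # an active hidden unit
    for w2 in (-0.19, 2.04, 1.59):  # the three weights reported above, rounded
        # Negative w2 pushes the output toward 0 (a 2), positive toward 1 (a 3).
        print(f"w2 = {w2:+.2f} -> output {sigmoid(w2 * h):.2f}")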