CSC 2515 Introduction to Machine Learning Assignment 2



Zhongtian Qiu ( )

Problem 1

See attached scan files for question 1.

2. Neural Network

2.1 Examine the statistics and plots of training error and validation error (generalization)

The Matlab version of this problem runs the training script 5 times with 100 epochs each. In Python we either train for more than 500 epochs in a single run or save the model between runs; I prefer the first option. I set eps = 0.1, keep momentum at 0, and train for 5000 epochs. The cross entropy and classification rate are as below:

Step 4999 Train CE Validation CE mean_classification_error
Error: Train Validation Test
fr_rate: Train Validation Test

Above are the final results after 5000 epochs. From the detailed printout, I found that:

1. Within 5000 epochs, the cross entropy of the training set keeps going down. The cross entropy of the validation set, on the other hand, reaches its lowest point (0.0520) at about 2600 epochs, after which the curve starts to go up again due to overfitting.
2. The training set is classified 100% correctly from approximately 2400 epochs onward. However, the validation classification error reaches its lowest point (0.015) at around 800 epochs, after which the curve starts to go up again due to overfitting.
3. Since the cross entropy doesn't change much, we should focus more on the accuracy on the validation set, which means 800 epochs might be a good point to observe the local optimum of the curve. Also, because the cross entropy and the accuracy reach their optima at around 2600 and 800 epochs respectively, we should observe the curves over a longer range, probably extending to 5000 epochs, to see how the whole curve behaves.
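The comparison in points 1 to 3 amounts to finding, for each validation curve, the epoch at which it is lowest. A minimal sketch of that bookkeeping is below; the helper name `best_epoch` and the toy curves are hypothetical and are not the values from the runs above.

```python
import numpy as np

def best_epoch(values):
    """Return (epoch index, value) at the minimum of a per-epoch curve."""
    values = np.asarray(values, dtype=float)
    idx = int(np.argmin(values))
    return idx, float(values[idx])

# Hypothetical toy curves for illustration only, NOT the report's numbers.
valid_ce = [0.60, 0.30, 0.12, 0.08, 0.06, 0.07, 0.09]
valid_err = [0.20, 0.05, 0.02, 0.03, 0.03, 0.04, 0.04]

ce_epoch, ce_min = best_epoch(valid_ce)
err_epoch, err_min = best_epoch(valid_err)
print("lowest validation CE at epoch", ce_epoch, "value", ce_min)
print("lowest validation error at epoch", err_epoch, "value", err_min)
```

As in the report, the two criteria need not agree on the same epoch, which is why both curves are worth tracking over the full run.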

2.2 Classification Error

Let's set eps = 0.1, mom = 0.0 and keep the number of epochs at 3000, because that many epochs is enough to see convergence. The results are as below:

Step 0 Train CE Validation CE mean_classification_error
Step 100 Train CE Validation CE mean_classification_error
Step 200 Train CE Validation CE mean_classification_error
Step 2999 Train CE Validation CE mean_classification_error
Error: Train Validation Test
fr_rate: Train Validation Test

2.3 Learning rate

Since the last experiment showed that some models need over 2000 iterations, I set 5000 epochs to see further changes in the curves. I will record the best cross entropy and classification error rate at the lowest point of each curve. First, I vary one parameter at a time: I keep either eps or momentum fixed and look at how the other one affects the curves. Then I try all combinations of eps and momentum to look for more patterns and the best parameters.

Control variable method (for eps and mom respectively)

eps

I keep momentum at 0.0 and the hidden layer at 10 units, and train for 5000 epochs to see the whole picture of the curves.
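For reference, the two hyperparameters varied in this section enter the weight update roughly as in the sketch below. This is an illustration on a toy one-dimensional loss, not the assignment's network code; the function name `epochs_to_converge` and the tolerance are assumptions made for the example.

```python
def epochs_to_converge(eps, momentum, max_epochs=200, tol=1e-3):
    """Run gradient descent with momentum on the toy loss f(w) = 0.5 * w**2.

    Returns the first epoch at which |w| < tol, or None if that never
    happens within max_epochs. With momentum > 0 the iterate can oscillate,
    so this is only the first crossing, not a guarantee of staying there.
    """
    w, velocity = 1.0, 0.0
    for epoch in range(1, max_epochs + 1):
        grad = w                                     # df/dw for f = 0.5 * w**2
        velocity = momentum * velocity - eps * grad  # momentum keeps part of the last step
        w += velocity
        if abs(w) < tol:
            return epoch
    return None

# Vary eps with momentum fixed at 0, mirroring the first experiment.
for eps in (0.01, 0.2, 0.5):
    print("eps =", eps, "->", epochs_to_converge(eps, momentum=0.0))
```

On this toy loss with momentum 0, a larger eps reaches the tolerance in fewer epochs, and eps = 0.01 does not reach it within the 200-epoch budget. A real network's loss surface is not a quadratic, so this only illustrates the direction of the effect.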

[Figure: curves for eps = 0.01]

[Figure: curves for eps = 0.2]

[Figure: curves for eps = 0.5]

Conclusion:
1) The larger the learning rate (eps), the fewer epochs it takes to converge, for both classification rate and cross entropy.
2) Except for eps = 0.01, where the curves have not converged yet, the lowest points reached before overfitting are very close for the different values of eps.

momentum

I set eps = 0.5, which converged fastest in the last case, and keep the other parameters unchanged to look at momentum (since eps = 0.5 with momentum = 0 was already shown in the last case, I skip that one):

[Figure: curves for momentum = 0.5]

[Figure: curves for momentum = 0.9]

Conclusion:
1) As with eps, the larger the momentum, the faster the curves converge, for both cross entropy and classification rate.
2) At the best point of each curve, there is no big difference in the best cross entropy or classification rate.

Try all combinations

I ran all 9 combinations of eps = {0.01, 0.2, 0.5} and momentum = {0, 0.5, 0.9}. The overall results are as below:

eps | momentum | cross entropy | error rate | optimal iter for CE | optimal iter for error rate

Notes recorded in the table: "everything still goes down"; "it goes up and down many times in the beginning".

From this table it is clear that the larger eps or momentum is, the faster the curves converge, for both classification rate and cross entropy. In addition, the best points of the different combinations differ very little. However, there is one very interesting combination: eps = 0.01 with momentum = 0.9. Its classification rate curve is as below:

[Figure: classification rate for eps = 0.01, momentum = 0.9]

There are some fluctuations at the beginning of the curve, which means the algorithm bounces around early in training. In this case, if we don't choose the number of epochs well, we may stop at a false optimum before the curve has really converged. To avoid this, I would suggest setting the two parameters as close to each other as possible.

Conclusion:
1) Larger eps and momentum help the curves converge more quickly.
2) Don't set eps and momentum far apart from each other, which may produce false optima.
3) If we only care about the speed of convergence, I would suggest setting the parameters as large as possible, because there is no big difference in the resulting classification rate and cross entropy.

2.4 Number of hidden units

Since we have already seen the influence of the other parameters over 5000 epochs, I use 1000 epochs here, which is enough to see convergence.

[Figure: curves for 2 hidden units]

[Figure: curves for 5 hidden units]

[Figure: curves for 10 hidden units]

[Figure: curves for 30 hidden units]

[Figure: curves for 100 hidden units]

The effect of this modification on the convergence properties:
• The larger the number of hidden units, the faster the curves converge.
• The larger the number of hidden units, the lower the cross entropy after training for the same number of epochs.
• The classification rate (fr_rate) at the end of training doesn't differ much across different numbers of hidden units.

2.5 Compare k-NN and Neural Networks

when k = 1, classification rate of valid is 0.985, of test is
when k = 3, classification rate of valid is 0.990, of test is
when k = 5, classification rate of valid is 0.980, of test is
when k = 7, classification rate of valid is 0.985, of test is
when k = 9, classification rate of valid is 0.985, of test is

The k-NN results are generally as above. The classification rate of k-NN is slightly better than that of the neural network: on the test set k-NN reaches 0.99 against 0.97 for the neural network, and on the validation set k-NN reaches 0.99. Interestingly, the classification rates of k-NN for different values of k are very close to each other (within 0.01), which means the classification of this data is fairly insensitive to k.

• Efficiency: In this case k-NN seems to run a little faster than the neural network (hidden units = 10). However, k-NN has to compare every query against all stored training cases, so as the sample size grows the neural network will clearly be faster at classification.
• Non-parametric vs parametric: k-NN learns no parameters and simply stores the training data, while the neural network needs many weights to be trained.
• Model sophistication: Since the neural network involves an input layer, an output layer and possibly many hidden layers, and each iteration requires back-propagation, it is inevitably a complicated model. k-NN, on the other hand, is very easy to implement and only takes distances into account.
• Accuracy: In this case k-NN is slightly better than the neural network on the test set and reaches exactly the same accuracy on the validation set. However, the papers I read say that neural networks perform better in general.
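To make the efficiency and simplicity points concrete, here is a minimal k-NN sketch for binary labels. It is not the assignment's provided code; the function name `knn_predict` and the toy data are hypothetical. Note that every prediction computes a distance to every training row, which is why the cost grows with the training set.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k):
    """Classify each query row by majority vote among its k nearest training rows.

    Assumes binary labels in {0, 1} and an odd k so that votes cannot tie.
    """
    # Squared Euclidean distances, shape (n_query, n_train).
    diffs = query_x[:, None, :] - train_x[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]   # indices of the k closest rows
    votes = train_y[nearest]                     # their labels, shape (n_query, k)
    return (votes.mean(axis=1) > 0.5).astype(int)

# Hypothetical toy data: two well-separated groups.
train_x = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
train_y = np.array([0, 0, 0, 1, 1, 1])
query_x = np.array([[0.2, 0.1], [5.5, 5.2]])

for k in (1, 3):
    print("k =", k, "->", knn_predict(train_x, train_y, query_x, k))
```

There is no training step at all; the whole "model" is the stored `train_x` and `train_y`, which matches the non-parametric point above.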

3 Mixture of Gaussians

3.2 Training

3.2.1 RandConst & iter

I tried randconst = {0.01, 0.1, 1, 2, 5, 10} and found that the choice of randconst makes no big difference. I wrote a new function called q2_1 that tries each randconst more than 100 times and returns the largest log-likelihood it gets. It turns out that the best randconst is different every time. Therefore, I ran it 100 times and recorded which randconst within {0.01, 0.1, 1, 2, 5, 10} gave the highest log-likelihood each time. The results are as follows:

1: 22
2: 32
5: 24
10: [count missing]
0.01: 0
0.1: 4

From these 100 runs, we can see there is no big difference among randconst values of 1 or larger. Since randconst = 2 performs best among these values, I will generally choose 2 for randconst.

After several runs, I found that iter = 10 cannot show the whole curve converging. Usually the curve for the 2's model converges within 6-15 iterations, while the curve for the 3's model converges within [range missing] iterations. Therefore, I set iter = 20 in this question.

Mean vectors and variance vectors

I extracted the samples of 2's and 3's separately and show the mean and variance of each class after training a mixture of Gaussians, as below:

[Figure: mean and variance for 2's]

[Figure: mean and variance for 3's]

Mixing proportions

For 2's, the mixing proportions are [value missing] and [value missing]. For 3's, the mixing proportions are [value missing] and [value missing].

LogP(Training Data)

I ran my code for 2's and 3's separately and both converged within 15 iterations. For the 2's model, the final log probability is [value missing]. For the 3's model, the final log probability is [value missing]. The plots are as below:

[Figure: log probability curve for 2's]

[Figure: log probability curve for 3's]

3.3 Initializing a mixture of Gaussians with k-means

Speed of convergence

I tried the usual initialization and k-means initialization and printed logp for each. The plots are as below:

[Figure: log probability with k-means initialization]

[Figure: log probability without k-means initialization]

The run without k-means initialization takes more than 15 EM iterations to converge, while the run with k-means converges quickly, at around 6 to 10 EM iterations. This demonstrates that initializing mu with k-means speeds up the convergence of the mixture of Gaussians. This makes sense intuitively, since running k-means as a form of preprocessing starts the Gaussians off in a more reasonable place than initializing them randomly.

Final log-prob

using traditional initialization
Iter 0 logprob
...
Iter 19 logprob
Logprob: Train Valid Test
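As an illustration of the initialization being compared, the sketch below runs a few k-means iterations and returns the cluster centres, which could then be used as the initial mu for EM. It is a hedged sketch, not the assignment's provided k-means routine; the function name `kmeans_means`, the `init_idx` parameter and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans_means(x, k, n_iter=10, init_idx=None, seed=0):
    """Return k cluster centres after n_iter rounds of plain k-means.

    init_idx optionally fixes which rows of x are the starting centres;
    otherwise k distinct rows are drawn at random.
    """
    x = np.asarray(x, dtype=float)
    if init_idx is None:
        rng = np.random.default_rng(seed)
        init_idx = rng.choice(len(x), size=k, replace=False)
    centres = x[np.asarray(init_idx)].copy()
    for _ in range(n_iter):
        # Assign every point to its nearest centre.
        dists = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Move each centre to the mean of its points (skip empty clusters).
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:
                centres[j] = members.mean(axis=0)
    return centres

# Hypothetical toy data: two tight groups of four points each.
toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
mu_init = kmeans_means(toy, k=2, init_idx=[0, 1])
print(mu_init)
```

Even when both starting centres are taken from the same group, as in this call, the centres end up at the two group means, which is the kind of sensible starting point for mu described above. k-means can still get stuck for unlucky starts, so this does not guarantee a better EM solution.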

initialize with kmeans and iter=20
Iter 0 logprob
...
Iter 19 logprob
Logprob: Train Valid Test

From the printout we can see that the run that starts from k-means ends up with a higher log probability. This means initializing the parameters with k-means not only converges faster but also reaches a better maximum of the log probability.

3.4 Classification using MoGs

I assume that the prior of each model is 0.5. Since both P(x) and the priors are the same for both classes, we can just use P(x|d) to classify, since P(d|x) ∝ P(x|d). I tried the number of mixture components in {2, 3, 5, 10, 15, 20, 25, 30}.

You should find that the error rates on the training sets generally decrease as the number of clusters increases. Explain why.

This is because more clusters give the model more parameters and therefore more flexibility, so it can fit the training data more closely. This is like adding more variables to a logistic regression: a model with more variables is more likely to fit the training data within a given number of iterations.

Examine the error rate curve for the test set and discuss its properties. Explain the trends that you observe.

I ran the script several times and the resulting graphs are not exactly the same each time.

[Figure: error rate versus number of mixture components, run in which k = 25 is best]

[Figure: error rate versus number of mixture components, run in which k = 20 is best]

Sometimes k = 25 gives the lowest average error rate, as in the first figure, and sometimes k = 20 is the optimal point, as in the second.

I think the reason is that increasing the number of mixture components at first lets the model fit better, which gives a more accurate clustering of the data. However, beyond a certain number of mixture components, say 20 or 25 in this problem, the model starts to overfit: the classification error rate keeps going down on the training data while it starts to go up on the test data.

If you wanted to choose a particular model from your experiments as the best, how would you choose it? If your aim is to achieve the lowest error rate possible on the new images your system will receive, which model (number of clusters) would you select? Why?

I would choose it by observing and comparing the error rates. If I could choose the number of mixture components from the values I tried myself, I would probably choose 20, based on more than 10 runs of my code. However, if I could only choose from the values given in 3.4, I would choose 15. This is because when the number of mixture components is too small, the training set still has room to improve accuracy (by adding more clusters), but when it is too large, the model overfits, which hurts the validation and test sets. K = 15 sits in between: it fits the training data perfectly without overfitting. The graph I plotted shows this clearly.

[Figure: error rate versus number of mixture components]
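The decision rule used throughout section 3.4 (equal priors, so pick the class whose mixture gives the larger likelihood) can be sketched as follows. This assumes diagonal-covariance Gaussians and uses hypothetical names (`mog_log_likelihood`, `classify`) and toy parameters; it is not the assignment's provided code.

```python
import numpy as np

def mog_log_likelihood(x, pi, mu, var):
    """log p(x) for each row of x under a diagonal-covariance mixture of Gaussians.

    x: (n, d) data; pi: (k,) mixing proportions; mu, var: (k, d).
    """
    x, pi, mu, var = (np.asarray(a, dtype=float) for a in (x, pi, mu, var))
    diff = x[:, None, :] - mu[None, :, :]                        # (n, k, d)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * var), axis=1)  # (k,)
    log_comp = log_norm[None, :] - 0.5 * np.sum(diff ** 2 / var[None, :, :], axis=2)
    weighted = np.log(pi)[None, :] + log_comp                    # (n, k)
    # Stable log-sum-exp over the components.
    top = weighted.max(axis=1, keepdims=True)
    return (top + np.log(np.exp(weighted - top).sum(axis=1, keepdims=True)))[:, 0]

def classify(x, model_2, model_3):
    """Return 2 or 3 per row: with equal priors, compare log p(x | d) directly."""
    ll_2 = mog_log_likelihood(x, *model_2)
    ll_3 = mog_log_likelihood(x, *model_3)
    return np.where(ll_2 >= ll_3, 2, 3)

# Hypothetical toy models (pi, mu, var), two components each, in 2 dimensions.
model_2 = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], np.ones((2, 2)))
model_3 = ([0.5, 0.5], [[5.0, 5.0], [6.0, 6.0]], np.ones((2, 2)))
queries = np.array([[0.5, 0.5], [5.5, 5.5]])
print(classify(queries, model_2, model_3))
```

To choose the number of components, the same `classify` call would be repeated for each candidate K and the error rates compared on held-out data, as discussed above.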

3.5 Bonus Question: Mixture of Gaussians vs Neural Network

Visualize and compare

[Figure: last 3 of the 30 subplots of the neural network's input layer weights]

Since 15 is our best number of components for each of the 2's and 3's models, I think we should use 30 hidden units (i.e. K = 30) for the neural network, because it is trained directly on the whole dataset of 600 cases. With 30 subplots each image is quite small (so I only plot the last three to see them more clearly), but it is enough to see that the pixels in each subplot are blurry and rough compared with the MoG ones. I think there are several reasons for this:
• In MoG, we trained the 2's and 3's separately, which should be more accurate than training them together in the neural network.
• In MoG, we plotted the means and variances, which describe the trained model directly. In the neural network, the input-to-hidden weights are only part of the model, and the rest is in the later layers. So the plot of W1 tells only a small part of the story.

As for the classification rate, the results are as below. The neural network misclassifies 5 of 200 validation cases and 13 of 400 test cases, which is slightly worse than MoG but still good enough.

CE: Train Validation Test
fr_rate: Train Validation Test

                 Neural Network (30 hidden units)   MoG (K=15)
Training Set     1 [0/600]                          1 [0/600]
Validation Set   [5/200]                            [1/200]
Test Set         [13/400]                           [6/400]

Visualize the input to hidden weights as images to see what your network has learned

I picked only the last three subplots so that we can see them better.

[Figure: randomly initialized weights of the input layer]

[Figure: weights after 30 iterations]

[Figure: weights at the end of 60 iterations]

After 90 iterations the curves have converged, and the pixel plots likewise change little from then on. As shown above, the plots get clearer as training proceeds. In the beginning the pixels are scattered randomly, while in the end every plot shows some clearly bright and clearly dark regions.

Compare hidden unit weights versus mixture component

Intuitively, both the probabilities in the mixture of Gaussians and the hidden unit weights in the neural network are supposed to represent some features. In MoG, a probability can never be negative, so a value close to zero pushes the result towards zero, which corresponds to 2, and a value at the other end pushes it towards 3. In the neural network, the hidden-to-output weights can be negative, which tends to push the final output towards 0 (a 2), or positive, which tends to push the final output towards 1 (a 3). I printed the last three values of w2 and plotted the corresponding last three subplots, as below:

What I got from w2 accordingly is as below:
[-0.18654953]
[ 2.0379637 ]
[ 1.59050723]

• You can see it clearly from the pictures. If we take the bright part of each picture to represent the feature of either 2 or 3, the first picture mostly resembles a 2 and the other two resemble a 3. I also examined all the other subplots and found that negative weights correspond to features of 2's while positive weights correspond to features of 3's.
• It makes sense. If a weight is negative, it pushes the final output towards 0, so the subplot resembles a 2, and vice versa.
• It is in some way similar to the mixing proportion, i.e. the probability in the code. When it is close to 0 it pushes the final result towards 0, so it resembles a 2, and vice versa.
• I would say hidden units are more like features than clusters. In MoG, K (the number of mixture components) is the number of clusters and p (the mixing proportion) is the probability of belonging to each cluster. In the neural network, by contrast, the weights of each hidden unit stand for a part of the 2's or 3's, a feature or pattern that recognizes a particular part of the digit.
• The neural network works by learning these features part by part.
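The sign argument above can be checked with a small sketch using the three w2 values printed in the report. It makes strong simplifying assumptions that are not confirmed by the source: a sigmoid output unit, one hidden unit fully active (activation 1) with all others silent, zero output bias, and targets 0 for "2" and 1 for "3". The function name `digit_favoured` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def digit_favoured(w2_value, hidden_activation=1.0, bias=0.0):
    """Digit the output leans towards when only this hidden unit is active.

    ASSUMPTION: sigmoid output, target 0 means '2' and target 1 means '3'.
    """
    output = sigmoid(w2_value * hidden_activation + bias)
    return 3 if output > 0.5 else 2

# The three hidden-to-output weights printed in the report.
w2 = [-0.18654953, 2.0379637, 1.59050723]
print([digit_favoured(w) for w in w2])
```

Under these assumptions the first unit leans towards 2 and the other two towards 3, which agrees with the reading of the three subplots above.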
