Three supervised learning methods on pen digits character recognition dataset


Chris Fleizach
Department of Computer Science and Engineering
University of California, San Diego
San Diego, CA 92093
cflezac@cs.ucsd.edu

Satoru Fukushima
Department of Computer Science and Engineering
University of California, San Diego
San Diego, CA 92093
sfukush@cs.ucsd.edu

1 Introduction

Supervised learning is a broad field that encompasses a number of methods, which can generally be classified into two categories: parametric and nonparametric. In parametric methods, it is assumed that the forms of the underlying density functions are known, so the problem of estimating unknown functions reduces to estimating the values of some parameters. In contrast, nonparametric methods make no assumption about the form of the underlying densities. The parametric category is divided further into two subcategories: generative and discriminative. In the generative approach, we estimate P(X|Y), which describes how to generate X given Y, while in the discriminative approach we directly estimate P(Y|X). Our goal is to compare the classification results and characteristics of learning methods in these different categories. Bayesian classification with a mixture of Gaussians, logistic regression, and k nearest-neighbor classification were implemented, and their results on a pen digits character recognition dataset are analyzed.

2 Bayesian classification with a mixture of Gaussians

Using mixtures of probability density functions to estimate likelihoods is a practical technique when a class is not easily described by one probability density function. Such a case may arise when a class contains two centers of concentrated activity, such as a bimodal distribution. Combining distributions can offer a better approximation of the true modeling function. In this project, we used a mixture of Gaussians to model each class. To generate the Gaussians' parameters, the expectation maximization (EM) process was employed, further tempered by deterministic annealing. The first step in generating the probability distribution was to create a covariance matrix and a mean for each class of data in the training set. We assumed that the deviation would not change, and so were able to leave the covariance matrix fixed once created.
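As a concrete illustration of this first step, the following minimal Python/NumPy sketch (not the authors' original Matlab code; the names X, y and the small ridge term added for numerical stability are our own assumptions) computes the per-class mean and covariance that the mixture components share:

```python
import numpy as np

def class_gaussian_stats(X, y):
    """For each class label, compute the sample mean and covariance of its
    training examples. The covariance is estimated once and, as described in
    the text, held fixed; only the component means are later refined by EM."""
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]                      # all training rows of class c
        mean = Xc.mean(axis=0)              # per-class mean vector (length p)
        cov = np.cov(Xc, rowvar=False)      # p x p covariance matrix
        cov += 1e-6 * np.eye(Xc.shape[1])   # small ridge, our addition, for stability
        stats[c] = (mean, cov)
    return stats
```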

The mean, though, was modified to better fit the data through the iterative process of EM. Initially, a different set of mean values was created for each Gaussian component by random perturbation around the per-class mean. The expectation step then took each training example from that class and calculated its probability using all the Gaussians in the mixture model. Each probability was also annealed by taking its tenth root, which aimed to lengthen the iterative process in order to achieve a truer representation of the actual model. Each Gaussian had a weight associated with it, and that weight determined the degree of participation the Gaussian would have in determining the probability for the example. The maximization step then calculated new weights and new means for each Gaussian by determining how much participation each had in the final answer. The process was repeated iteratively until the values of the weights and the means converged to within 0.01 of the previous iteration. This threshold was chosen since all data values were between 0 and 100, so the convergence limit amounted to approximately a 0.01% difference in many cases.

With the converged values of the weights and means for each Gaussian, we applied them to the test data to obtain the probability of a testing example under each class's mixture of Gaussians. That was not the final class prediction, though, as Bayes' theorem was used to find the posterior probability. The value from the mixture was used as the P(X|Y) component in Bayes' rule, where X was the testing example and Y was the class. The other factor in the numerator, P(Y), was calculated by counting classes. The denominator, P(X), is the same for all classes, so we merely needed to compute P(Y)*P(X|Y) and compare it across all classes; the maximum was used to predict which class the example lay within.

The most important question we faced was how many component Gaussians should be used to build the density function. To determine this value, we used ten-fold cross validation on our training set and averaged the accuracy results as the number of Gaussians ranged from one to ten. The results are plotted in Figure 1. Interestingly, there is very little difference in accuracy when more than two Gaussians are used. This may be explained by the fact that two Gaussians cover almost all the data in a class, and additional Gaussians do not find other centers of data to fit, effectively contributing zero to the total probability for almost all data. Even though a larger number of Gaussians showed a slightly higher accuracy, we chose to use two Gaussians since more than two might overfit the data.

Figure 1: Accuracy vs. number of Gaussians for mixture modeling

Using two Gaussians, the accuracy for the test set was 95.88%.
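A minimal NumPy/SciPy sketch of this per-class EM loop and the final Bayes-rule prediction is given below. It is not the authors' Matlab implementation: the function and variable names are ours, the perturbation scale is a guess, and the deterministic annealing is interpreted as raising the component likelihoods to the power 1/10 (the tenth root) before normalizing, which is one plausible reading of the procedure described above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_class_mixture(Xc, cov, n_components=2, anneal_power=0.1, tol=0.01,
                      rng=np.random.default_rng(0)):
    """EM for one class's mixture of Gaussians with a shared, fixed covariance.
    Component means start as random perturbations of the class mean; the E-step
    responsibilities are annealed by the tenth root (anneal_power = 0.1)."""
    n, p = Xc.shape
    class_mean = Xc.mean(axis=0)
    means = class_mean + rng.normal(scale=1.0, size=(n_components, p))  # scale is assumed
    weights = np.full(n_components, 1.0 / n_components)
    while True:
        # E-step: weighted likelihood of each example under each component,
        # tempered by the annealing power before normalizing.
        lik = np.array([w * multivariate_normal.pdf(Xc, mean=m, cov=cov)
                        for w, m in zip(weights, means)]).T          # shape (n, k)
        resp = lik ** anneal_power
        resp /= resp.sum(axis=1, keepdims=True) + 1e-300
        # M-step: new weights and means from the (annealed) responsibilities.
        new_weights = resp.mean(axis=0)
        new_means = (resp.T @ Xc) / resp.sum(axis=0)[:, None]
        done = (np.abs(new_weights - weights).max() < tol and
                np.abs(new_means - means).max() < tol)               # 0.01 threshold
        weights, means = new_weights, new_means
        if done:
            break
    return weights, means

def predict(x, mixtures, priors, cov_by_class):
    """Bayes rule: pick the class maximizing P(Y) * P(x | Y), where P(x | Y) is
    the class's mixture density; P(x) is common to all classes and dropped."""
    scores = {}
    for c, (weights, means) in mixtures.items():
        px_given_y = sum(w * multivariate_normal.pdf(x, mean=m, cov=cov_by_class[c])
                         for w, m in zip(weights, means))
        scores[c] = priors[c] * px_given_y
    return max(scores, key=scores.get)
```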

The time complexity of applying the mixture model was quite small, since the parameters had already been generated and needed only to be applied. In the case of two Gaussians, the Gaussian density was evaluated twice for each class, for each example of data. The testing time can therefore be considered O(c*p), where c is the number of classes and p is the number of features. Generating the model, though, takes longer, since it depends on a convergence process to produce the values needed for each Gaussian. Once again, each Gaussian must be evaluated for every training example, for each class, but, importantly, the values for each set of Gaussians must be iteratively recalculated until they converge. In practice, the convergence process took about ten iterations on average, but that number clearly depended on the data involved and on the choice of starting means and weights.

3 Logistic regression

Logistic regression is a parametric, discriminative classification algorithm that directly estimates P(Y|X). For this project, we used a binary classifier for each digit. In the training phase, each classifier was given a label of 1 for samples of its digit and 0 for all other digits. In the testing phase, each classifier output the probability that the sample represented its digit, which was multiplied by the prior of that digit; the digit whose classifier produced the highest value was chosen as the final prediction. We used the function fminsearch from Matlab's optimization toolbox to maximize the conditional log likelihood over the weight coefficients W,

l(W) = \sum_{n} \Big[ Y^{n} \big( w_0 + \sum_{i=1}^{p} w_i X_i^{n} \big) - \ln\big( 1 + \exp\big( w_0 + \sum_{i=1}^{p} w_i X_i^{n} \big) \big) \Big]

where p is the number of features and X^{n} is the feature vector of the n-th training sample. The classification accuracy when the classifier was trained on the whole training dataset and tested on the testing set was 81.85%. To improve its accuracy, we exploited regularization, which reduces overfitting by penalizing large values of W. The revised log likelihood we used was

l(W) = \sum_{n} \Big[ Y^{n} \big( w_0 + \sum_{i=1}^{p} w_i X_i^{n} \big) - \ln\big( 1 + \exp\big( w_0 + \sum_{i=1}^{p} w_i X_i^{n} \big) \big) \Big] - \lambda \|W\|_2^2

with the same notation as above. To figure out which value of λ works well, we conducted 2-fold cross validation with λ values 2^{-9}, 2^{-8}, ..., 2^{-3}. The reason we chose 2-fold rather than 10-fold cross validation, and examined only these values, was the time constraint; 10-fold cross validation might have produced a more accurate estimate. As Figure 2 shows, the resulting accuracy was maximized when λ was 2^{-4}, so we used 2^{-4} as the value of λ. However, the accuracy on the whole testing set deteriorated to 79.56% from the 81.85% obtained without any regularization. To find the best value on the testing set, we further evaluated the same λ values on the testing set, and as Figure 3 shows, the best accuracy, 82.42%, was produced when λ was 2^{-8}.

There are possible reasons for the relatively poor accuracy. The first is that we used 2-fold rather than 10-fold cross validation, which may have degraded the estimate. The second is that the range of λ values examined was limited; other λ values might have produced more accurate predictions. Both of these issues were caused mainly by the inefficiency of the fminsearch algorithm, that is, it took too much time to converge. In addition, the fact that we needed to terminate before fminsearch converged is another possible reason for this result.
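The sketch below shows one way to set this up in Python, using SciPy's Nelder-Mead optimizer as an analogue of Matlab's fminsearch (the same derivative-free simplex method). It is a hedged illustration, not the authors' code: the function names, the exclusion of the bias from the penalty, and the zero initialization are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def neg_reg_log_likelihood(w, X, y, lam):
    """Negative of the regularized conditional log likelihood l(W) above.
    w[0] is the bias w_0; w[1:] are the feature weights."""
    z = w[0] + X @ w[1:]
    # l(W) = sum_n [ y_n * z_n - ln(1 + exp(z_n)) ] - lam * ||w||^2  (bias excluded here)
    ll = np.sum(y * z - np.logaddexp(0.0, z)) - lam * np.dot(w[1:], w[1:])
    return -ll

def train_binary_lr(X, y, lam=2.0 ** -4, max_fun_evals=3400):
    """One binary (one digit vs. the rest) classifier fit with Nelder-Mead;
    max_fun_evals mirrors the MaxFunEvals cap mentioned later in the text."""
    w0 = np.zeros(X.shape[1] + 1)
    res = minimize(neg_reg_log_likelihood, w0, args=(X, y, lam),
                   method='Nelder-Mead', options={'maxfev': max_fun_evals})
    return res.x

def predict_digit(x, weights_by_digit, priors):
    """Each binary classifier outputs P(digit | x); multiply by the digit's
    prior and return the arg max, as described above."""
    best, best_score = None, -np.inf
    for d, w in weights_by_digit.items():
        p = 1.0 / (1.0 + np.exp(-(w[0] + x @ w[1:])))
        score = priors[d] * p
        if score > best_score:
            best, best_score = d, score
    return best
```

With 16 features plus a bias, the simplex method is searching a 17-dimensional space, which helps explain why a cap of a few thousand function evaluations is reached well before convergence, as discussed above.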

Figure 2: Accuracy vs. different values of λ in cross validation

4 K nearest-neighbor classification

K nearest-neighbor classification algorithms explicitly ignore parametric modeling when deciding which class a data point lies within. This has the effect of performing a hard classification on each data point, without the ability to nuance and massage parameters to tune to specific problems. The basic idea in k nearest neighbors is that for an instance x of the testing set, the distance to every training point is calculated. The distance function is defined as the Euclidean distance, which has the convenient property of working for data of any dimension. In our case, the data had 16 features, so the difference between each pair of feature values was squared, the squares were summed over all features, and the square root was taken. With the distance to each training point calculated, the class plurality is taken over the closest k neighbors. Although the complexity of the algorithm is quite limited, it is remarkably accurate for certain sets of data, depending on the value of k that is chosen.

The time complexity of the algorithm is an unfortunate drawback of using k nearest neighbors. Although there is no training phase per se, each data point from the test set must be compared separately with every value in the training set. That number might be reduced through sampling if the training set is too large, but that may not be desirable in many situations. If we say n is the training-set size, p is the number of features, and k is the number of neighbors, then the running time is O(n*p*k + k). The first term, n*p*k, is the time required to calculate the distance to every training example, each of which then has to be compared against the current top k neighbors to determine whether it is closer than an existing neighbor. The last term is the time required to count the class that has a plurality among the k neighbors. As k is usually quite small, the bound might better be written as O(n*p).

To effectively choose which k should be employed, 10-fold cross validation was used on the training set. Each tenth of the training set was used as a test set in turn, while the remaining portion was used to determine class membership, and the accuracy was averaged over the cross-validation runs. This was done for all k from 1 to 20. The best results were obtained when k = 1.
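A compact NumPy sketch of this classifier and of the cross-validation loop used to pick k is shown below. It is our own illustration under assumed names (knn_predict, cv_choose_k, X, y), not the authors' implementation.

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=1):
    """Classify one test vector by Euclidean distance to every training point,
    then a plurality vote over the k closest neighbors."""
    # Squared differences per feature, summed, then square-rooted: O(n * p).
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]        # class with the plurality

def cv_choose_k(X, y, ks=range(1, 21), n_folds=10):
    """10-fold cross validation over k = 1..20, averaging fold accuracy,
    mirroring the selection procedure described above."""
    folds = np.array_split(np.random.default_rng(0).permutation(len(y)), n_folds)
    acc = {}
    for k in ks:
        correct = 0
        for i, test_idx in enumerate(folds):
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            correct += sum(knn_predict(X[t], X[train_idx], y[train_idx], k) == y[t]
                           for t in test_idx)
        acc[k] = correct / len(y)
    return max(acc, key=acc.get), acc
```

Note that np.argsort sorts all n distances rather than tracking only the running top k as in the O(n*p*k + k) analysis above; for small k and moderate n the difference is negligible, and the code stays simpler.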

Figure 3: Accuracy vs. different values of λ on the testing set

There was a noticeable decline in accuracy as k increased, indicating that more neighbors were not better, most likely because the classes were relatively close to each other in the Euclidean sense. When more neighbors were used, more classes were brought into the vote, which affected the overall prediction. Figure 4 demonstrates the deteriorating quality as k increased.

Figure 4: Accuracy vs. number of neighbors for k nearest neighbors

Using k = 1 on the entire training set and the testing set resulted in an accuracy of 97.86%.

5 Discussion

5.1 Accuracy

Table 1 shows the comparison between the three classification algorithms when they were trained on the whole training dataset and tested on the entire testing set. In terms of accuracy, the k nearest neighbor algorithm produced the best result of the three.

Bayesian classification with a mixture of Gaussians was close behind, and logistic regression was the worst. A likely reason for the poor performance of logistic regression is the numerical optimization method used, the fminsearch function. It tried to find the coefficient values that minimized the objective, but it did not converge within a small number of iterations. So, with the relatively limited time we had, we needed to terminate it at the maximum number of function evaluations, 3400, the default value of MaxFunEvals in the options for fminsearch in Matlab. If it had run longer, its accuracy could have been improved.

                      Bayesian w/ mix. Gauss.   Logistic regression   K nearest neighbor
  Accuracy            95.88%                    82.42%                97.86%
  Time for training   N.A.                      N.A.                  0
  Time for testing    O(c*p)                    O(c*p)                O(p*n)
  Space               O(c*p^2)                  O(c*p)                O(p*n)

Table 1: Comparison of the three classification methods (c is the number of classes, p is the number of features, and n is the number of training examples.)

5.2 Time complexity

The k nearest neighbor algorithm does not require a training phase, but takes a long time in its testing phase since it needs to examine all data points. On the other hand, Bayesian classification with a mixture of Gaussians and logistic regression both must be trained, but they can then conduct testing much faster than k nearest neighbor. While the estimated parameter values converged relatively quickly in the training of the Bayesian classifier, the training of the logistic regression took much longer because fminsearch, the Matlab function used for numerical optimization, was not efficient. A more efficient method such as iteratively reweighted least squares would reduce its time complexity. Time complexities are shown in Table 1, where c is the number of classes, p is the number of features, and n is the number of training examples. For logistic regression and Bayesian classification with a mixture of Gaussians we could not provide a closed-form big-O bound on training time, due to the convergence properties of both algorithms.

5.3 Space complexity

Since the k nearest neighbor algorithm needs to examine all data points when a new example is classified, all of the data must be stored, so its space complexity is O(p*n), where p is the number of features and n is the number of training examples. On the other hand, the Bayesian method and logistic regression only need to store a handful of parameter values. For the Bayesian method, the space complexity was O(c*(j*p + p^2)), where j is the number of Gaussians: the p term accounts for the array of mean values of each Gaussian and the p^2 term for the covariance matrix of each class. For logistic regression, the space complexity was O(c*p), where c is the number of classes and p is the number of features. The space complexity of the latter two algorithms should always be much smaller than O(p*n), the space complexity of k nearest neighbors, since the size of the training dataset should be much larger than the other parameters.
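As a quick sanity check on these space complexities, the short calculation below plugs in the values used in this project (c = 10 digit classes, p = 16 features, j = 2 Gaussians per class); the training-set size n is a hypothetical round number chosen only to illustrate the gap, since the report does not state it.

```python
# Rough parameter counts for the space complexities in Section 5.3.
c, p, j = 10, 16, 2          # classes, features, Gaussians per class (from this project)
n = 7000                     # hypothetical training-set size, for illustration only

bayes_mixture = c * (j * p + p ** 2)   # means per component plus one covariance per class
logistic      = c * p                  # one weight vector per class
knn           = p * n                  # the entire training set must be stored

print(bayes_mixture, logistic, knn)    # 2880, 160, 112000 for these values
```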

5.4 Characteristics of the classifiers

As mentioned in the subsection on accuracy above, the inferior performance of logistic regression might be caused by early termination of the numerical optimization function before it had truly converged. With this in mind, we discuss several characteristics of the classification algorithms.

The accuracy of k nearest neighbor was the best among the three. This observation can best be explained by the flexibility the algorithm has in examining other neighbors. While choosing which value of k worked best, we noticed that its accuracy deteriorated as k increased, as seen in Figure 4. This indicates that, for this dataset, the best predictor of an example's class was the class of another example with nearly the same values for each feature. In contrast, the other two classification algorithms do not have this flexibility and are forced into using parameters aimed at covering the entire range of examples. Even in the case of the mixture model, selecting the closest example may do better for a variety of reasons; for example, the influence of nearby Gaussians from other classes may override a Gaussian component that has little weight within its own class. Although the k nearest neighbor algorithm performed the best, its main drawback is that it takes much longer to classify a test example than the other two methods. This characteristic prevents the k nearest neighbor algorithm from being used in certain kinds of applications that require classification in real time.

Between the two parametric methods, the numbers of parameters differ: the Bayesian classifier has O(c*(j*p + p^2)) parameters, while logistic regression has only O(c*p). Hence, when the number of features p is large, logistic regression may be preferred.

6 Conclusion

Experimentation with the three methods of classification revealed a number of insights. As the number of features increases, the classification problem becomes intractable in many respects. The previous dataset we worked with had nearly 800 features and could not be used in many formulas without overflow or underflow; for logistic regression in particular, the convergence time was too long to be useful. The power of the conceptually simple k nearest neighbor model was a surprise and demonstrated that, for many datasets, a simpler approach may be just as valid as a parametric approach. Even more interesting was that accuracy actually decreased as more neighbors were used. One might have expected that as more data was examined, the accuracy would rise correspondingly, since a better informed judgement could be made.