Three supervised learning methods on pen digits character recognition dataset


Chris Fleizach
Department of Computer Science and Engineering
University of California, San Diego
San Diego, CA 92093
cflezac@cs.ucsd.edu

Satoru Fukushima
Department of Computer Science and Engineering
University of California, San Diego
San Diego, CA 92093
sfukush@cs.ucsd.edu

1 Introduction

Supervised learning is a broad field that encompasses a number of methods, which can generally be classified into two categories: parametric and nonparametric. In parametric methods, it is assumed that the forms of the underlying density functions are known, so the problem of estimating unknown functions reduces to estimating the values of some parameters. In contrast, nonparametric methods make no assumption about the form of the underlying densities. The parametric category is divided further into two subcategories: generative and discriminative. In the generative approach, we estimate P(X|Y), which describes how to generate X given Y, while in the discriminative approach we directly estimate P(Y|X). Our goal is to compare the classification results and characteristics of learning methods in these different categories. Bayesian classification with a mixture of Gaussians, logistic regression, and k nearest-neighbor classification were implemented, and their results on a pen digits character recognition dataset are analyzed.

2 Bayesian classification with a mixture of Gaussians

Using mixtures of probability density functions to estimate likelihoods is a practical technique when a class is not easily described by one probability density function. Such a case may arise when a class contains two centers of concentrated activity, such as a bimodal distribution. Combining distributions can offer a better approximation of the true modeling function. In this project, we used a mixture of Gaussians to model each class. To generate the Gaussians' parameters, the expectation maximization (EM) process was employed, further tempered by deterministic annealing. The first step in generating the probability distribution was to create a covariance matrix and a mean for each class of data in the training set. We assumed that the deviation would not change, and so were able to leave the covariance matrix fixed once created.
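As a concrete illustration of this first step, the following minimal Python/NumPy sketch (not the authors' original Matlab code; the names X, y and the small ridge term added for numerical stability are our own assumptions) computes the per-class mean and covariance that the mixture components share:

```python
import numpy as np

def class_gaussian_stats(X, y):
    """For each class label, compute the sample mean and covariance of its
    training examples. The covariance is estimated once and, as described in
    the text, held fixed; only the component means are later refined by EM."""
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]                      # all training rows of class c
        mean = Xc.mean(axis=0)              # per-class mean vector (length p)
        cov = np.cov(Xc, rowvar=False)      # p x p covariance matrix
        cov += 1e-6 * np.eye(Xc.shape[1])   # small ridge, our addition, for stability
        stats[c] = (mean, cov)
    return stats
```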

The mean, though, was modified to better fit the data through the iterative process of EM. Initially, a different set of mean values was created for each Gaussian component by random perturbation around the per-class mean. The expectation step then took each training example from that class and calculated its probability using all the Gaussians in the mixture model. Each probability was also annealed by taking its tenth root, which aimed to lengthen the iterative process in order to achieve a truer representation of the actual model. Each Gaussian had a weight associated with it, and that weight determined the degree of participation the Gaussian would have in determining the probability for the example. The maximization step then calculated new weights and new means for each Gaussian by determining how much participation each had in the final answer. The process was repeated iteratively until the values of the weights and the means converged to within 0.01 of the previous iteration. This threshold was chosen since all data values were between 0 and 100, so the convergence limit amounted to approximately a 0.01% difference in many cases.

With the converged values of the weights and means for each Gaussian, we applied them to the test data to obtain the probability of a testing example under each class's mixture of Gaussians. That was not the final class prediction, though, as Bayes' theorem was used to find the posterior probability. The value from the mixture was used as the P(X|Y) component in Bayes' rule, where X was the testing example and Y was the class. The other factor in the numerator, P(Y), was calculated by counting classes. The denominator, P(X), is the same for all classes, so we merely needed to compute P(Y)*P(X|Y) and compare it across all classes; the maximum was used to predict which class the example lay within.

The most important question we faced was how many component Gaussians should be used to build the density function. To determine this value, we used ten-fold cross validation on our training set and averaged the accuracy results as the number of Gaussians ranged from one to ten. The results are plotted in Figure 1. Interestingly, there is very little difference in accuracy when more than two Gaussians are used. This may be explained by the fact that two Gaussians cover almost all the data in a class, and additional Gaussians do not find other centers of data to fit, effectively contributing zero to the total probability for almost all data. Even though a larger number of Gaussians showed a slightly higher accuracy, we chose to use two Gaussians since more than two might overfit the data.

Figure 1: Accuracy vs. number of Gaussians for mixture modeling

Using two Gaussians, the accuracy for the test set was 95.88%.
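A minimal NumPy/SciPy sketch of this per-class EM loop and the final Bayes-rule prediction is given below. It is not the authors' Matlab implementation: the function and variable names are ours, the perturbation scale is a guess, and the deterministic annealing is interpreted as raising the component likelihoods to the power 1/10 (the tenth root) before normalizing, which is one plausible reading of the procedure described above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_class_mixture(Xc, cov, n_components=2, anneal_power=0.1, tol=0.01,
                      rng=np.random.default_rng(0)):
    """EM for one class's mixture of Gaussians with a shared, fixed covariance.
    Component means start as random perturbations of the class mean; the E-step
    responsibilities are annealed by the tenth root (anneal_power = 0.1)."""
    n, p = Xc.shape
    class_mean = Xc.mean(axis=0)
    means = class_mean + rng.normal(scale=1.0, size=(n_components, p))  # scale is assumed
    weights = np.full(n_components, 1.0 / n_components)
    while True:
        # E-step: weighted likelihood of each example under each component,
        # tempered by the annealing power before normalizing.
        lik = np.array([w * multivariate_normal.pdf(Xc, mean=m, cov=cov)
                        for w, m in zip(weights, means)]).T          # shape (n, k)
        resp = lik ** anneal_power
        resp /= resp.sum(axis=1, keepdims=True) + 1e-300
        # M-step: new weights and means from the (annealed) responsibilities.
        new_weights = resp.mean(axis=0)
        new_means = (resp.T @ Xc) / resp.sum(axis=0)[:, None]
        done = (np.abs(new_weights - weights).max() < tol and
                np.abs(new_means - means).max() < tol)               # 0.01 threshold
        weights, means = new_weights, new_means
        if done:
            break
    return weights, means

def predict(x, mixtures, priors, cov_by_class):
    """Bayes rule: pick the class maximizing P(Y) * P(x | Y), where P(x | Y) is
    the class's mixture density; P(x) is common to all classes and dropped."""
    scores = {}
    for c, (weights, means) in mixtures.items():
        px_given_y = sum(w * multivariate_normal.pdf(x, mean=m, cov=cov_by_class[c])
                         for w, m in zip(weights, means))
        scores[c] = priors[c] * px_given_y
    return max(scores, key=scores.get)
```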

The time complexity of applying the mixture model was quite small, since the parameters had already been generated and needed only to be applied. In the case of two Gaussians, the Gaussian density was evaluated twice for each class, for each example of data. The testing time can therefore be considered O(c*p), where c is the number of classes and p is the number of features. Generating the model, though, takes longer, since it depends on a convergence process to produce the values needed for each Gaussian. Once again, each Gaussian must be evaluated for every training example, for each class, but, importantly, the values for each set of Gaussians must be iteratively recalculated until they converge. In practice, the convergence process took about ten iterations on average, but that number clearly depended on the data involved and on the choice of starting means and weights.

3 Logistic regression

Logistic regression is a parametric, discriminative classification algorithm that directly estimates P(Y|X). For this project, we used a binary classifier for each digit. In the training phase, each classifier was given a label of 1 for samples of its digit and 0 for all other digits. In the testing phase, each classifier output the probability that the sample represented its digit, which was multiplied by the prior of that digit; the digit whose classifier produced the highest value was chosen as the final prediction. We used the function fminsearch from Matlab's optimization toolbox to maximize the conditional log likelihood over the weight coefficients W,

l(W) = \sum_{n} \Big[ Y^{n} \big( w_0 + \sum_{i=1}^{p} w_i X_i^{n} \big) - \ln\big( 1 + \exp\big( w_0 + \sum_{i=1}^{p} w_i X_i^{n} \big) \big) \Big]

where p is the number of features and X^{n} is the feature vector of the n-th training sample. The classification accuracy when the classifier was trained on the whole training dataset and tested on the testing set was 81.85%. To improve its accuracy, we exploited regularization, which reduces overfitting by penalizing large values of W. The revised log likelihood we used was

l(W) = \sum_{n} \Big[ Y^{n} \big( w_0 + \sum_{i=1}^{p} w_i X_i^{n} \big) - \ln\big( 1 + \exp\big( w_0 + \sum_{i=1}^{p} w_i X_i^{n} \big) \big) \Big] - \lambda \|W\|_2^2

with the same notation as above. To figure out which value of λ works well, we conducted 2-fold cross validation with λ values 2^{-9}, 2^{-8}, ..., 2^{-3}. The reason we chose 2-fold rather than 10-fold cross validation, and examined only these values, was the time constraint; 10-fold cross validation might have produced a more accurate estimate. As Figure 2 shows, the resulting accuracy was maximized when λ was 2^{-4}, so we used 2^{-4} as the value of λ. However, the accuracy on the whole testing set deteriorated to 79.56% from the 81.85% obtained without any regularization. To find the best value on the testing set, we further evaluated the same λ values on the testing set, and as Figure 3 shows, the best accuracy, 82.42%, was produced when λ was 2^{-8}.

There are possible reasons for the relatively poor accuracy. The first is that we used 2-fold rather than 10-fold cross validation, which may have degraded the estimate. The second is that the range of λ values examined was limited; other λ values might have produced more accurate predictions. Both of these issues were caused mainly by the inefficiency of the fminsearch algorithm, that is, it took too much time to converge. In addition, the fact that we needed to terminate before fminsearch converged is another possible reason for this result.
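The sketch below shows one way to set this up in Python, using SciPy's Nelder-Mead optimizer as an analogue of Matlab's fminsearch (the same derivative-free simplex method). It is a hedged illustration, not the authors' code: the function names, the exclusion of the bias from the penalty, and the zero initialization are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def neg_reg_log_likelihood(w, X, y, lam):
    """Negative of the regularized conditional log likelihood l(W) above.
    w[0] is the bias w_0; w[1:] are the feature weights."""
    z = w[0] + X @ w[1:]
    # l(W) = sum_n [ y_n * z_n - ln(1 + exp(z_n)) ] - lam * ||w||^2  (bias excluded here)
    ll = np.sum(y * z - np.logaddexp(0.0, z)) - lam * np.dot(w[1:], w[1:])
    return -ll

def train_binary_lr(X, y, lam=2.0 ** -4, max_fun_evals=3400):
    """One binary (one digit vs. the rest) classifier fit with Nelder-Mead;
    max_fun_evals mirrors the MaxFunEvals cap mentioned later in the text."""
    w0 = np.zeros(X.shape[1] + 1)
    res = minimize(neg_reg_log_likelihood, w0, args=(X, y, lam),
                   method='Nelder-Mead', options={'maxfev': max_fun_evals})
    return res.x

def predict_digit(x, weights_by_digit, priors):
    """Each binary classifier outputs P(digit | x); multiply by the digit's
    prior and return the arg max, as described above."""
    best, best_score = None, -np.inf
    for d, w in weights_by_digit.items():
        p = 1.0 / (1.0 + np.exp(-(w[0] + x @ w[1:])))
        score = priors[d] * p
        if score > best_score:
            best, best_score = d, score
    return best
```

With 16 features plus a bias, the simplex method is searching a 17-dimensional space, which helps explain why a cap of a few thousand function evaluations is reached well before convergence, as discussed above.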

Figure 2: Accuracy vs. different values of λ in cross validation

4 K nearest-neighbor classification

K nearest-neighbor classification algorithms explicitly ignore parametric modeling when deciding which class a data point lies within. This has the effect of performing a hard classification on each data point, without the ability to nuance and massage parameters to tune to specific problems. The basic idea in k nearest neighbors is that for an instance x of the testing set, the distance to every training point is calculated. The distance function is defined as the Euclidean distance, which has the convenient property of working for data of any dimension. In our case, the data had 16 features, so the difference between each pair of feature values was squared, the squares were summed over all features, and the square root was taken. With the distance to each training point calculated, the class plurality is taken over the closest k neighbors. Although the complexity of the algorithm is quite limited, it is remarkably accurate for certain sets of data, depending on the value of k that is chosen.

The time complexity of the algorithm is an unfortunate drawback of using k nearest neighbors. Although there is no training phase per se, each data point from the test set must be compared separately with every value in the training set. That number might be reduced through sampling if the training set is too large, but that may not be desirable in many situations. If we say n is the training-set size, p is the number of features, and k is the number of neighbors, then the running time is O(n*p*k + k). The first term, n*p*k, is the time required to calculate the distance to every training example, each of which then has to be compared against the current top k neighbors to determine whether it is closer than an existing neighbor. The last term is the time required to count the class that has a plurality among the k neighbors. As k is usually quite small, the bound might better be written as O(n*p).

To effectively choose which k should be employed, 10-fold cross validation was used on the training set. Each tenth of the training set was used as a test set in turn, while the remaining portion was used to determine class membership, and the accuracy was averaged over the cross-validation runs. This was done for all k from 1 to 20. The best results were obtained when k = 1.
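A compact NumPy sketch of this classifier and of the cross-validation loop used to pick k is shown below. It is our own illustration under assumed names (knn_predict, cv_choose_k, X, y), not the authors' implementation.

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=1):
    """Classify one test vector by Euclidean distance to every training point,
    then a plurality vote over the k closest neighbors."""
    # Squared differences per feature, summed, then square-rooted: O(n * p).
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]        # class with the plurality

def cv_choose_k(X, y, ks=range(1, 21), n_folds=10):
    """10-fold cross validation over k = 1..20, averaging fold accuracy,
    mirroring the selection procedure described above."""
    folds = np.array_split(np.random.default_rng(0).permutation(len(y)), n_folds)
    acc = {}
    for k in ks:
        correct = 0
        for i, test_idx in enumerate(folds):
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            correct += sum(knn_predict(X[t], X[train_idx], y[train_idx], k) == y[t]
                           for t in test_idx)
        acc[k] = correct / len(y)
    return max(acc, key=acc.get), acc
```

Note that np.argsort sorts all n distances rather than tracking only the running top k as in the O(n*p*k + k) analysis above; for small k and moderate n the difference is negligible, and the code stays simpler.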

Figure 3: Accuracy vs. different values of λ on the testing set

There was a noticeable decline in accuracy as k increased, indicating that more neighbors were not better, most likely because the classes were relatively close to each other in the Euclidean sense. When more neighbors were used, more classes were brought into the vote, which affected the overall prediction. Figure 4 demonstrates the deteriorating quality as k increased.

Figure 4: Accuracy vs. number of neighbors for k nearest neighbors

Using k = 1 on the entire training set and the testing set resulted in an accuracy of 97.86%.

5 Discussion

5.1 Accuracy

Table 1 shows the comparison between the three classification algorithms when they were trained on the whole training dataset and tested on the entire testing set. In terms of accuracy, the k nearest neighbor algorithm produced the best result of the three.

Bayesian classification with a mixture of Gaussians was close behind, and logistic regression was the worst. A likely reason for the poor performance of logistic regression is the numerical optimization method used, the fminsearch function. It tried to find the coefficient values that minimized the objective, but it did not converge within a small number of iterations. So, with the relatively limited time we had, we needed to terminate it at the maximum number of function evaluations, 3400, the default value of MaxFunEvals in the options for fminsearch in Matlab. If it had run longer, its accuracy could have been improved.

                      Bayesian w/ mix. Gauss.   Logistic regression   K nearest neighbor
  Accuracy            95.88%                    82.42%                97.86%
  Time for training   N.A.                      N.A.                  0
  Time for testing    O(c*p)                    O(c*p)                O(p*n)
  Space               O(c*p^2)                  O(c*p)                O(p*n)

Table 1: Comparison of the three classification methods (c is the number of classes, p is the number of features, and n is the number of training examples.)

5.2 Time complexity

The k nearest neighbor algorithm does not require a training phase, but takes a long time in its testing phase since it needs to examine all data points. On the other hand, Bayesian classification with a mixture of Gaussians and logistic regression both must be trained, but they can then conduct testing much faster than k nearest neighbor. While the estimated parameter values converged relatively quickly in the training of the Bayesian classifier, the training of the logistic regression took much longer because fminsearch, the Matlab function used for numerical optimization, was not efficient. A more efficient method such as iteratively reweighted least squares would reduce its time complexity. Time complexities are shown in Table 1, where c is the number of classes, p is the number of features, and n is the number of training examples. For logistic regression and Bayesian classification with a mixture of Gaussians we could not provide a closed-form big-O bound on training time, due to the convergence properties of both algorithms.

5.3 Space complexity

Since the k nearest neighbor algorithm needs to examine all data points when a new example is classified, all of the data must be stored, so its space complexity is O(p*n), where p is the number of features and n is the number of training examples. On the other hand, the Bayesian method and logistic regression only need to store a handful of parameter values. For the Bayesian method, the space complexity was O(c*(j*p + p^2)), where j is the number of Gaussians: the p term accounts for the array of mean values of each Gaussian and the p^2 term for the covariance matrix of each class. For logistic regression, the space complexity was O(c*p), where c is the number of classes and p is the number of features. The space complexity of the latter two algorithms should always be much smaller than O(p*n), the space complexity of k nearest neighbors, since the size of the training dataset should be much larger than the other parameters.
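As a quick sanity check on these space complexities, the short calculation below plugs in the values used in this project (c = 10 digit classes, p = 16 features, j = 2 Gaussians per class); the training-set size n is a hypothetical round number chosen only to illustrate the gap, since the report does not state it.

```python
# Rough parameter counts for the space complexities in Section 5.3.
c, p, j = 10, 16, 2          # classes, features, Gaussians per class (from this project)
n = 7000                     # hypothetical training-set size, for illustration only

bayes_mixture = c * (j * p + p ** 2)   # means per component plus one covariance per class
logistic      = c * p                  # one weight vector per class
knn           = p * n                  # the entire training set must be stored

print(bayes_mixture, logistic, knn)    # 2880, 160, 112000 for these values
```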

5.4 Characteristics of the classifiers

As mentioned in the subsection on accuracy above, the inferior performance of logistic regression might be caused by early termination of the numerical optimization function before it had truly converged. With this in mind, we discuss several characteristics of the classification algorithms.

The accuracy of k nearest neighbor was the best among the three. This observation can best be explained by the flexibility the algorithm has in examining other neighbors. While choosing which value of k worked best, we noticed that its accuracy deteriorated as k increased, as seen in Figure 4. This indicates that, for this dataset, the best predictor of an example's class was the class of another example with nearly the same values for each feature. In contrast, the other two classification algorithms do not have this flexibility and are forced into using parameters aimed at covering the entire range of examples. Even in the case of the mixture model, selecting the closest example may do better for a variety of reasons; for example, the influence of nearby Gaussians from other classes may override a Gaussian component that has little weight within its own class. Although the k nearest neighbor algorithm performed the best, its main drawback is that it takes much longer to classify a test example than the other two methods. This characteristic prevents the k nearest neighbor algorithm from being used in certain kinds of applications that require classification in real time.

Between the two parametric methods, the numbers of parameters differ: the Bayesian classifier has O(c*(j*p + p^2)) parameters, while logistic regression has only O(c*p). Hence, when the number of features p is large, logistic regression may be preferred.

6 Conclusion

Experimentation with the three methods of classification revealed a number of insights. As the number of features increases, the classification problem becomes intractable in many respects. The previous dataset we worked with had nearly 800 features and could not be used in many formulas without overflow or underflow; for logistic regression in particular, the convergence time was too long to be useful. The power of the conceptually simple k nearest neighbor model was a surprise and demonstrated that, for many datasets, a simpler approach may be just as valid as a parametric approach. Even more interesting was that accuracy actually decreased as more neighbors were used. One might have expected that as more data was examined, the accuracy would rise correspondingly, since a better informed judgement could be made.