Classification and clustering using SVM


Lucian Blaga University of Sibiu
Hermann Oberth Engineering Faculty
Computer Science Department

Classification and clustering using SVM

2nd PhD Report

Thesis Title: Data Mining for Unstructured Data
Author: Daniel MORARIU, MSc
PhD Supervisor: Professor Lucian N. VINTAN

SIBIU, 2005

Contents

1 Introduction
2 Feature Extraction
  2.1 The Dataset
  2.2 Data Reduction
    2.2.1 Entropy and Information Gain
    2.2.2 Mutual Information
  2.3 Training/Testing Files Structure
3 Support Vector Machine in Classification/Clustering Problems
  3.1 SVM Technique for Binary Classification
  3.2 Multiclass Classification
  3.3 Clustering using Support Vector Machine
  3.4 SMO - Sequential Minimal Optimization
  3.5 Probabilistic Outputs for SVM
4 Experimental Research
  Background Work
  Experimental Data Sets and Feature Selection
  Application's Parameters
  Types of Kernels Used
  LibSvm Graphical Interpretation
  SVM Classification
  One-Class SVM
  Feature Subset Selection Using SVM
  Classifying using Support Vector Machine. Implementation Aspects and Results
  Binary Classification
  Polynomial Kernel
  Gaussian Kernel (Radial Basis Function - RBF)
  Feature Subset Selection. A Comparative Approach
  LibSvm versus UseSvm
  Multi-class Classification. Quantitative Aspects
  Clustering using SVM. Quantitative Aspects
5 Conclusions and Further Work
References

1 Introduction

While more and more textual information is available online, effective retrieval is difficult without good indexing and summarization of document content. Document categorization is one solution to this problem: it is the task of classifying natural language documents into a set of predefined categories. A growing number of classification methods and machine learning techniques have been applied to this task in recent years.

Documents are typically represented as sparse vectors in the feature space. Each word in the vocabulary represents a dimension of the feature space. The number of occurrences of a word in a document represents the value of the corresponding component in the document's characteristic vector. This high dimensionality of the feature space is a major problem of text categorization. The native feature space consists of the unique terms that occur in the documents, which can be tens or hundreds of thousands of terms even for a moderately sized text collection. Much time and memory are needed for training a classifier on a large collection of documents. This is why we try various methods for reducing the feature space and the response time. As we will see, the results are better when we work with a lower dimension of the space: as the space grows, the accuracy of the classifier doesn't grow significantly; it actually decreases when we work with a higher dimensional feature space.

This report is a comparative study of feature selection methods (Information Gain, Mutual Information and Support Vector Machine) and of types of input data representation in statistical learning of text categorization. I will also present a technique used with great success in the last years in classification problems with nonlinearly separable input data. I will present the application for processing documents and creating the vector of features, and I will continue by presenting the application implemented for the classification and clustering parts, using techniques based on support vectors and kernels. I have used Text Mining as an application of data mining techniques to extract the signature of each document (the feature vector). Starting with a set of d documents and t terms (words belonging to documents), we can model each document as a vector v in the t-dimensional space R^t. In the classification phase, I have used the Support Vector Machine, which is a powerful technique for nonlinearly separable input sets. A great advantage of this technique is that it can use large input sets; thus we can easily test the influence of the number of features on the classification and clustering accuracy. I implemented this classification for two types of kernels: the polynomial kernel and the Gaussian kernel (Radial Basis Function - RBF). I will present results for two-class classification and for multi-class classification using the SVM technique. For two-class classification I took into account only the documents in one class versus the rest of the documents from the set. For multi-class classification I repeated the two-class classification for each topic (the category where the document is classified), obtaining more decision functions. I have also modified this technique so that it can be used as a method of feature selection in the text mining step. I will present a graphical visualization of classification and clustering results using this method. I will use different types of kernels and different types of input data representation, trying to find the best parameters for better classification accuracy. I tried to find a simplified form of the kernels, without reducing the performance (actually increasing it), using more intuitive parameters.
Input data are represented in different formats, and I analyzed the influence of those representations on each kernel type. I have used three types of representation. The first is the Binary format, where the attributes are represented using the values 0 or 1 (0 if the word doesn't occur in the document and 1 if it occurs, without being interested in the number of occurrences). The second format is the Nominal format, where each attribute stores the number of occurrences of the word in the frequency vector, normalized with the usual norm. The last format used is the Cornell SMART format, where each attribute stores the number of occurrences of the word in the frequency vector, normalized by another formula.
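As an illustration, the following minimal Java sketch computes the three representations for one attribute. The binary and nominal forms follow the description above; the Cornell SMART formula used here, 1 + log(1 + log(n)), is the variant common in the literature and is an assumption, since the report does not spell the formula out.

// A minimal sketch of the three input-data representations described above.
// The Cornell SMART formula used here is an assumption (the common variant),
// not necessarily the exact formula used in the application.
public final class TermWeighting {

    // Binary representation: 1 if the term occurs, 0 otherwise.
    public static double binary(int count) {
        return count > 0 ? 1.0 : 0.0;
    }

    // Nominal representation: term count normalized by the total number
    // of term occurrences in the document (normalized term frequency).
    public static double nominal(int count, int totalCountsInDocument) {
        return totalCountsInDocument == 0 ? 0.0
                : (double) count / totalCountsInDocument;
    }

    // Assumed Cornell SMART representation: 0 for absent terms,
    // 1 + ln(1 + ln(n)) for a term occurring n times.
    public static double smart(int count) {
        return count == 0 ? 0.0 : 1.0 + Math.log(1.0 + Math.log(count));
    }

    public static void main(String[] args) {
        // A term occurring 4 times in a document with 16 term occurrences in total.
        System.out.println(binary(4));       // 1.0
        System.out.println(nominal(4, 16));  // 0.25
        System.out.println(smart(4));        // ~1.87
    }
}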

Section 2 describes the term selection method and the details of constructing the training and testing datasets that are used in the report. I will give a brief overview of feature selection methods in the context of the proposed training strategy. In Section 3 I will describe the classifier and the clustering algorithm based on the Support Vector Machine technique, and I will also present details of the implementation of those algorithms. In Section 4 we illustrate how to use the application to improve the accuracy of the classifiers, and the parameters of the application. I will give a description of the experiments in which we compare the effectiveness of the presented methods, and I will present the results of the experiments. In the final section, I will present concluding remarks and future work.

Acknowledgments

Besides my parents there are a lot of people that deserve my gratitude. I cannot mention them all here, but I want to thank all of them. I would like to thank a few people, some of them my teachers, that have guided me in this project. First of all I would like to express my sincere gratitude to my PhD supervisor, Professor Lucian VINŢAN, for his responsible scientific coordination, for providing stimulating discussions focused on my PhD work and for all his valuable support. I would also like to thank the ones that guided me from the beginning of my PhD studies: Prof. Ioana MOISIL, Prof. Boldur BĂRBAT, Prof. Daniel VOLOVICI, Dr. Dorin SIMA and Dr. Macarie BREAZU for their valuable and generous professional support. I would also like to thank SIEMENS AG, CT IC MUNCHEN, Germany, especially Vice-President Dr. h. c. mat. Hartmut RAFFLER, for his very useful professional suggestions and for the financial support that he has provided. I want to thank my tutor from SIEMENS, Dr. Volker TRESP, Senior Principal Research Scientist in Neural Computation, for the scientific support provided and for his valuable guidance in this wide and interesting research domain. I also want to thank Dr. Kai Yu for his useful input in the development of my ideas. Last but not least, I want to thank all those who supported me in the preparation of this technical report.

2 Feature Extraction

A substantial fraction of the available information is stored in text or document databases, which consist of large collections of documents from various sources such as news articles, research papers, books, web pages, etc. Data stored in text format is considered semi-structured, in that it is neither completely unstructured nor completely structured, because a document may contain a few structured fields such as title, authors, publication date, etc. Unfortunately those fields usually are not filled in; the majority of people don't lose time completing them. Some researchers suggest that the information should be organized at creation time, using some planning rules. This is pointless, because most people will not respect them. This is considered a characteristic of the unordered egalitarianism of the Internet [Der00]: any attempt to impose the same organizing rules will determine the users to leave. The result, for the time being, is that most information needs to be organized after it has been generated, and searching and organizing tools need to work together.

Typically, only a small fraction of the many available documents will be relevant to a given individual user. Without knowing what could be in a document, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users need tools to compare different documents and to rank the importance and relevance of documents. Thus text data mining has become increasingly important. Text mining goes one step beyond keyword-based and similarity-based information retrieval and discovers knowledge from semi-structured text data, using methods such as keyword-based association and document classification. In this context, traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data. Information retrieval is concerned with the organization and retrieval of information from a large number of text-based documents. Most information retrieval systems accept keyword-based and/or similarity-based retrieval [Cro99]. Those ideas were presented in detail in my first PhD technical report [Mor05]. I only want to give a brief theoretical background as a support for the presentation of the text mining step. The text mining step is the first step of my application.

In a keyword-based information retrieval system, a document is represented by a string, which can be identified by a set of keywords. A similarity-based retrieval system finds similar documents based on a set of common keywords. The output of such retrieval should be based on the degree of relevance, where relevance is measured based on the closeness of the keywords in the document, actually the relative frequency of the keywords. In modern information retrieval systems, the keywords for document representation are automatically extracted from the document. Normally, this implies removing high frequency words (stopwords), stripping the suffixes, and detecting equivalent stems. After determining the document representation in this manner, each term is assigned a weight. In the literature this method of document representation is called bag-of-words [Cha00].

Let's consider a set of d documents and a set of t terms for modeling information retrieval. We can model each of the documents as a vector v in the t-dimensional space. The i-th coordinate of v (v_i) is a number that measures the association of the i-th term with respect to the given document: it is generally defined as 0 if the document does not contain the term, and nonzero otherwise. There are many ways to define the term weighting for the nonzero entries in such a vector [Mit97].
For example, it can simply be defined as v_i = 1 as long as the i-th term occurs in the document, or v_i can be the term frequency, or the normalized term frequency. The term frequency is the number of occurrences of the i-th term in the document. The normalized term frequency is the term frequency divided by the total number of occurrences of all terms in the document. I have tested various term weights in my application.

Once the terms representing the documents and their weights are determined, we can form a document-term matrix for the entire set of documents. This matrix can be used to compute the pair-wise dependences of terms. Reasonable measures to compute those dependences are odds ratio [Mla99], information gain, mutual information [Mla99], or support vector machines [Mla02]. There are some problems in determining term dependences based on this document representation [Bha00]:

- Too many terms in the description. The total number of distinct terms is quite large even for a small collection. A large ratio of these terms, when intellectually evaluated, seems irrelevant for the description of the documents.
- Inability to expand a query with good quality terms. It is well known that good quality terms are those that are more prevalent in relevant documents. Dependence based on all documents may not help in adding good quality terms to the query.
- Mismatch between query and document terms. Usually the users' vocabulary differs from that of the authors or indexers. This leads to some query terms not being assigned to any of the documents. Such terms will not form a node in the dependence tree constructed by the earlier studies.

2.1 The Dataset

For experiments I used the Reuters-2000 collection [Reu2000], which includes a total of 806,791 documents, with all news stories published by Reuters Press covering the period from 20th August 1996 through 19th August 1997. Each news item is stored as an XML file (xml version="1.0" encoding="utf-8"). The files are grouped by date in 365 zip files. All those files are grouped according to three criteria. According to the industry criterion the articles are grouped in 870 categories. According to the region the article refers to there are 366 categories, and according to topics there are 160 distinct topics. The definition of the metadata element of Reuters Experimental NewsML allows the attachment of systematically organized information about the news item. The metadata contains information about the date when the article was published, the language of the article, the title, the place, etc. The title is marked using title markups like <title> </title>. Then the text of the article follows, marked by <text> and </text>. After the content of the article there is a structure based on a scheme for "Reuters Web", in the following format:

<metadata>
<codes class="bip:countries:1.0"> ... </codes>
<codes class="bip:topics:1.0"> ... </codes>
<codes class="bip:industry:1.0"> ... </codes>
<!ENTITY % dc.elements "(dc.title dc.creator.name dc.creator.title dc.creator.location dc.creator.location.city dc.creator.location.sublocation dc.creator.location.stateorprovince dc.creator.location.country.code dc.creator.location.country.name dc.creator.phone dc.creator.email dc.creator.program dc.date.created dc.date.lastmodified dc.date.converted dc.publisher dc.publisher.provider dc.publisher.contact.name dc.publisher.contact.email dc.publisher.contact.phone dc.publisher.contact.title dc.publisher.location dc.publisher.graphic dc.coverage.start dc.coverage.end dc.coverage.period dc.relation.obsoletes dc.relation.includes dc.relation.references dc.date.published dc.date.live dc.date.statuschanges dc.date.expires dc.source dc.source.contact dc.source.provider dc.source.location dc.source.graphic dc.source.date.published dc.source.identifier dc.contributor.editor.name dc.contributor.captionwriter.name)">
</metadata>
</newsitem>

In the entries above, information on region, topic and industry is included using codes. There are separate files where the correspondence between the codes and the complete names can be found. In my application I used only the name of the news story, the content of the news, the topic proposed by Reuters for classifying, and the industry. Thus from each file I extract the words from the title and the content of the news, and I create a vector that characterizes that document.

A text retrieval system often associates a stop list with a set of documents. A stop list is a set of words that are deemed irrelevant for a set of documents [Ja01] [Ian00]. In my application I have used a general stopwords list from the package ir.jar [IR] from Texas University. I wanted to use a general list in order to eliminate the non-relevant words. The stopword list includes 509 different words which are considered to be irrelevant for the content of the text. For each word that remained after the elimination of stopwords I have extracted the root of the word (stemming). If after stemming I obtained a root with a length smaller than or equal to 2 (the word obtained has only two or one characters) and the original length of the word was 2 or 1, I eliminated those words; if the original length was greater than three, I kept the word in the original format, without stemming. Up to this moment I didn't deal with the case in which two different words have the same root after stemming. Afterwards I counted the occurrences of every remaining root. Thus I have created, for each document from Reuters, an array with the remaining roots (called tokens later on). This array is the individual characterization of each document.
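The token extraction step just described can be summarized by the following minimal Java sketch. The Stemmer interface is a hypothetical stand-in for the stemming routine in ir.jar (the report does not name it), and the tokenizer and short-root rule follow the description above in a simplified form.

import java.util.*;

// A minimal sketch of the token extraction step: stopword removal,
// stemming, the short-root elimination rule, and counting the occurrences
// of every remaining root. The Stemmer below is a hypothetical stand-in
// for the routine in ir.jar; the real application may differ in detail.
public class TokenExtractor {

    public interface Stemmer {
        String stem(String word);
    }

    private final Set<String> stopwords;
    private final Stemmer stemmer;

    public TokenExtractor(Set<String> stopwords, Stemmer stemmer) {
        this.stopwords = stopwords;
        this.stemmer = stemmer;
    }

    // Returns the map root -> number of occurrences for one document
    // (title and content concatenated).
    public Map<String, Integer> extract(String titleAndBody) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : titleAndBody.toLowerCase().split("[^a-z]+")) {
            if (word.isEmpty() || stopwords.contains(word)) continue;
            String root = stemmer.stem(word);
            if (root.length() <= 2) {
                if (word.length() <= 2) continue;   // drop very short words
                root = word;                        // keep longer words unstemmed
            }
            counts.merge(root, 1, Integer::sum);
        }
        return counts;
    }
}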

For example, the vector that characterizes a small news item is represented in the following format:

4 : singapor:3 pore:1 middl:4 distil:4 stock:4 highest:2 apr:1 weekl:1 april:2 trade:1 develop:1 board:1 statist:1 show:1 thursda:1 week:2 end:2 august:1 regist:1 on:2 barrel:2 newsroom:1 - c1:1 ccat:1 m14:1 m143:1 mcat:1

The number 4 represents the number of the indexed document. A root and the number of its occurrences are separated by a ":" (in a document, words occurring in different contexts can have the same stem). The "-" symbol at the end of the line introduces the Reuters classification topics. This representation of documents is called in the literature the bag-of-words approach.

For training and testing data in two classes, I built a subset of the data, referred to here as Subset-c152. From all the documents I selected those documents that are grouped by Reuters under System Software (I33020) as industry code. After this selection I obtained 7083 documents. In the resulting set there are 63 different topics for classifying according to Reuters. For multiclass classification I used all 63 topics. For binary classification I chose the topic c152, which means Comments/Forecasts according to the Reuters codes. I grouped those 7083 articles into a training set and a testing set randomly, assuring that the training set is smaller than the testing set. The algorithm that I will present is based on the fact that the data is grouped in only two classes: it is in fact a binary classification algorithm where the labels can take only two values. Therefore in this phase I took into consideration only one class of documents, and the documents that do not belong to that class are considered to be in another class. In the multiclass classification I take into consideration all topics and I learn separately for each topic using the SVM "one versus the rest" technique: for each topic, I chose the samples that have that specific topic versus the rest of the samples.

Afterwards I created a large frequency vector with all unique tokens and the number of occurrences of each token in all documents (in order to further apply the SVM method). I use a vector with all tokens to memorize each document in the set. If a token appears in the document, then I store the number of occurrences in the new vector; otherwise I store 0 for that token (sparse vector [sparse]). By doing so all vectors have the same size and the tokens are arranged in the same order, so that all available data is organized in the same way. For the Subset-c152 I obtained in this way the dictionary of distinct tokens used in what follows.
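The following minimal Java sketch illustrates this construction: the global token list built from all per-document counts, and the fixed-order frequency vector with zeros for absent tokens. It is an illustration only, not the application's actual code; the simple count-threshold elimination mentioned in the Data Reduction section below could be applied to the same structures.

import java.util.*;

// A minimal sketch of building the global token list and the fixed-order
// frequency vectors described above (0 is stored for absent tokens).
public class FrequencyVectors {

    // Builds the sorted global vocabulary from all per-document counts,
    // so that every vector uses the same token order.
    public static List<String> vocabulary(List<Map<String, Integer>> docs) {
        SortedSet<String> vocab = new TreeSet<>();
        for (Map<String, Integer> doc : docs) vocab.addAll(doc.keySet());
        return new ArrayList<>(vocab);
    }

    // Represents one document as a vector aligned to the vocabulary:
    // the value is the token count, or 0 if the token does not occur.
    public static int[] toVector(Map<String, Integer> doc, List<String> vocab) {
        int[] v = new int[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            v[i] = doc.getOrDefault(vocab.get(i), 0);
        }
        return v;
    }
}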

2.2 Data Reduction

In the learning phase the data needs to be stored in memory in order to compute on it, and so the learning time is considerably larger. Due to the size of the token vector, the accuracy of learning is also diminished (as we will further see in this report). I have applied some techniques to reduce the dimension of this large frequency vector. For doing so I have used Information Gain [Yan97], Mutual Information [Cov91] and Support Vector Machines [NEL00][Dou04]. There is also another type of method, used for feature induction, that automatically creates nonlinear combinations of existing features as additional input features to improve classification accuracy, like the method proposed by [Jim04]. All methods start by eliminating the tokens which occur fewer times than a predefined threshold.

2.2.1 Entropy and Information Gain

Entropy and information gain are functions of the probability distribution that underlies the process of communication. The entropy is a measure of the uncertainty of a random variable. Let X be a discrete random variable with alphabet S and probability mass function p(x) = Pr{X = x}, x in S. We denote the probability mass function by p(x) rather than p_X(x) for convenience. More information about Entropy and Information Gain, and a complete example, was presented in my first PhD report [Mor05]. There I used only two values for an attribute (attributes could take only the values 0 and 1). Here I want to present another aspect, in which the attributes can take more than just two values (the attributes can take any of the values in their domain).

Note that the entropy is a function of the distribution of X. It does not depend on the actual values taken by the random variable X, but only on the probabilities. Thus if X ~ p(x), then the expected value of the random variable g(X) is written

    E_p g(X) = \sum_{x \in S} g(x) p(x)    (2.1)

or more simply Eg(X) when the probability mass function is understood from the context. The entropy of X can also be interpreted as the expected value of log(1/p(X)), where X is drawn according to the probability mass function p(x). Thus

    E(X) = E_p \log \frac{1}{p(X)}    (2.2)

As seen, this definition of entropy is related to the definition of entropy in thermodynamics. It is possible to derive the definition of entropy axiomatically, by defining certain properties that the entropy of a random variable must satisfy. The concept of entropy in information theory is closely connected with the concept of entropy in statistical mechanics. If we draw a sequence of n independent and identically distributed random variables, it can be shown that the probability of a typical sequence is about 2^{-nE(X)} and that there are about 2^{nE(X)} such typical sequences. This property (known as the asymptotic equipartition property) is the basis of many of the proofs in information theory.

In [For04] the author showed that Information Gain failed to produce good results on an industrial text classification problem. The author argues that a large class of feature scoring methods suffers from a pitfall: they can be blinded by a surplus of strongly predictive features for some classes, while largely ignoring features needed to discriminate difficult classes.
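A minimal Java sketch of these two quantities follows, for an attribute that can take more than two values, as discussed above. It is an illustration of the formulas, not the application's actual routine; classCounts[v][k] is assumed to count the documents with attribute value v and class k.

// A minimal sketch of computing entropy and information gain for one
// discrete attribute over a labeled document set.
public final class InfoGain {

    // Entropy of a discrete distribution given by counts (base-2 logarithm).
    static double entropy(int[] counts, int total) {
        double e = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            e -= p * (Math.log(p) / Math.log(2));
        }
        return e;
    }

    // IG(class; attribute) = E(class) - sum_v p(v) * E(class | attribute = v).
    // classCounts[v][k]: number of documents with attribute value v and class k.
    static double informationGain(int[][] classCounts) {
        int total = 0;
        int numClasses = classCounts[0].length;
        int[] classTotals = new int[numClasses];
        for (int[] row : classCounts)
            for (int k = 0; k < numClasses; k++) {
                classTotals[k] += row[k];
                total += row[k];
            }
        double gain = entropy(classTotals, total);
        for (int[] row : classCounts) {
            int rowTotal = 0;
            for (int c : row) rowTotal += c;
            if (rowTotal > 0)
                gain -= ((double) rowTotal / total) * entropy(row, rowTotal);
        }
        return gain;
    }
}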

2.2.2 Mutual Information

Entropy is the uncertainty of a single random variable. We can define the conditional entropy, which is the entropy of a random variable given another random variable. The reduction in uncertainty due to another random variable is called the mutual information. For two random variables X and Y this reduction is:

    I(X;Y) = Entropy(X) - Entropy(X|Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x) p(y)}    (2.3)

The mutual information I(X;Y) is a measure of the dependence between the two random variables. It is symmetric in X and Y and always non-negative. Thus the mutual information I(X;Y) is the reduction in the uncertainty of X due to the knowledge of Y. From the formula we can observe that the mutual information of a random variable with itself is the entropy of the random variable. This is the reason why the entropy is sometimes referred to as self-information. Between entropy and mutual information there is the relationship presented in the Venn diagram of Figure 2.1. Notice that the mutual information I(X;Y) corresponds to the intersection of the information in X with the information in Y.

[Figure 2.1. Relationship between entropy and mutual information: within the joint entropy E(X,Y), the circles E(X) and E(Y) overlap in I(X;Y), with E(X|Y) and E(Y|X) as the non-overlapping parts.]

In my application I also use another method for feature selection, based on the concept of Support Vector Machine, which is presented in detail in the next chapter. For this I have used the linear kernel, and I have taken into consideration only those attributes that have a weight greater than a certain threshold; a sketch of this selection step follows.
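// A minimal sketch of feature selection using the weight vector of a
// linear SVM, as described above: after training, keep only the attributes
// whose weight exceeds a threshold. Illustration only; the trained weight
// vector w is assumed to come from the SMO training step of chapter 3, and
// the use of the absolute value is an assumption (the report says simply
// "weight greater than a certain threshold").
import java.util.*;

public final class SvmFeatureSelection {

    // Returns the indices of the attributes with |w_i| > threshold.
    public static List<Integer> selectFeatures(double[] w, double threshold) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < w.length; i++) {
            if (Math.abs(w[i]) > threshold) kept.add(i);
        }
        return kept;
    }
}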

2.3 Training/Testing Files Structure

It is useful to eliminate the words that occur in all or in many documents from the set, because these words can't characterize the documents; they act like stopwords for these sets. I have considered that this new array characterizes all documents. After that I have modified the arrays of each document in order to have the same dimension as the large frequency vector. In the new array the entries are equal to zero if the word doesn't occur in the document. We consider this to be the signature of each document in the document set.

In the first phase the developed application performs the text mining on the Reuters files chosen as above. The large frequency vectors are stored, after creation, in four different files. Two of these files contain the data necessary for binary training and testing classification. The other two files contain the data necessary for multiclass training and testing classification. The file structure format is presented below. This format is almost identical to the format presented by Ian in [Ian00] and used by the Weka application that can be found at [Weka]. The files have a part containing attributes, a part containing topics and a part containing data. In the attributes part, all attributes (tokens) that characterize the frequency vectors are specified; each attribute is specified using the @attribute marker followed by the name of the attribute. In the topic part, all the topics for this set, according to the Reuters classification, are specified; each topic is specified using the @topic marker followed by the name of the topic and the number of samples that contain that topic. The data section is marked with @data and contains all the large frequency vectors, each followed by ":" and all the topics for that entry, according to the Reuters classification. For the output files where we have all the topics (for multi-class classification) the large frequency vectors are ordered by their occurrence in the Reuters files. For the output files where we have only one topic (for one-versus-the-rest classification) the large frequency vectors are ordered by topic: initially we put the large frequency vectors that belong to the class, and after that we put the large frequency vectors that don't belong to the class.

The structure presented below is obtained as a result of text mining on the Reuters database. In this step I have used some classes taken from the package ir.jar [IR] from Texas University. In order to test the influence of feature selection on classification accuracy I used several levels of threshold for Information Gain, Mutual Information and SVM feature selection. I will present later the exact values of the thresholds and the numbers of resulting attributes. The training and the testing file need to have the same structure. This means that we need to have the same number of attributes and topics, and both attributes and topics have to be in the same order in the training and in the testing file. In the following I present a sample (the data section only, for the topic c152):

3,2,3,2,0,0,0,3,1,0,0,8,0,1:1
1,5,1,5,0,1,0,9,0,0,0,2,0,3:1
4,1,1,3,0,4,0,0,1,0,0,0,0,1:1
2,9,7,2,0,2,0,6,0,0,0,11,0,3:1
3,2,17,3,0,14,0,0,13,0,0,0,0,1:1
0,0,9,0,5,2,5,0,0,1,1,6,8,0:-1
3,0,2,0,1,2,0,2,0,5,1,0,2,0:-1
1,0,2,5,3,0,3,9,7,2,0,0,1,0:-1
2,0,0,4,1,0,6,0,0,2,0,0,1,0:-1
4,0,2,1,1,0,6,2,1,1,0,0,4,0:-1
3,0,1,3,2,0,2,0,0,4,0,0,1,0:-1
1,0,0,2,2,0,1,7,1,2,0,0,4,0:-1
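One line of the data section above can be read as in the following minimal Java sketch: comma-separated attribute values, then ":" and the topic label (1 or -1 in the binary case shown above). It is a hedged illustration of the file format, not the application's actual reader.

// A minimal sketch of parsing one line of the binary data section.
public final class DataLine {
    public final double[] values;
    public final int label;

    private DataLine(double[] values, int label) {
        this.values = values;
        this.label = label;
    }

    public static DataLine parse(String line) {
        String[] parts = line.split(":");        // e.g. "3,2,3,...,1" and "1"
        String[] fields = parts[0].split(",");
        double[] v = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            v[i] = Double.parseDouble(fields[i]);
        }
        return new DataLine(v, Integer.parseInt(parts[1]));
    }
}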

3 Support Vector Machine in Classification/Clustering Problems

In this chapter I will present some theoretical aspects of the Support Vector Machine technique and some aspects of the implementation of this technique for classifying and clustering documents. Classifying using SVM is a supervised learning technique that uses a labeled dataset for training and tries to find a decision function that best classifies the training data. This technique is inherently a two-class classifier, but there are methods for extending it to classification in more than two classes. At the end of this chapter I will present some changes that need to be made to the classification algorithm so that it can work with unlabeled documents (clustering).

3.1 SVM Technique for Binary Classification

The Support Vector Machine (SVM) is a classification technique based on statistical learning theory [SCK02], [NEL00]. It successfully uses mathematical results that are a few centuries old to solve optimization problems on sets that are large both in the number of articles and in the number of their characteristics (features). The purpose of the algorithm is to find a hyperplane (in an n-dimensional space, the hyperplane is a subspace of dimension n-1) that optimally splits the training set (a practical view can be found in [Ch03]). Actually the algorithm consists in determining the parameters of the general equation of this plane. Looking at the two-dimensional problem, we actually want to find the line that best separates the points in the positive class (points that are in the class) from the points in the negative class (the remaining points). The hyperplane is characterized by a decision function of the form f(x) = sgn(<w, Phi(x)> + b), where w is the weight vector, orthogonal to the hyperplane, b is a scalar offset, x is the current sample being tested, Phi(x) is a function that transforms the input data into a higher dimensional feature space, and <.,.> represents the dot product. Sgn is the signum function, which returns 1 if the value is greater than 0 and -1 otherwise. If w has unit length, then <w, Phi(x)> is the length of Phi(x) along the direction of w; generally w will be scaled by ||w||. In the training part the algorithm needs to find the normal vector w that leads to the largest margin of the hyperplane.

For a better understanding, let's consider that we have data separated into two classes (circles and squares), as in Figure 3.1. The problem that we want to solve consists in finding the optimal line that separates those two classes. The problem seems very easy to solve, but we have to keep in mind that the optimal classification line should classify correctly all the elements generated by the same given distribution. There are a lot of hyperplanes that meet the classification requirements, but the algorithm tries to determine the optimum one.
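To make the decision function concrete, the following minimal Java sketch evaluates it in the linear case (Phi the identity); w and b are assumed to come from training.

// A minimal sketch of evaluating f(x) = sgn(<w, x> + b) in the linear case.
public final class LinearDecision {

    public static int classify(double[] w, double b, double[] x) {
        double sum = b;
        for (int i = 0; i < w.length; i++) {
            sum += w[i] * x[i];             // <w, x>
        }
        return sum > 0 ? 1 : -1;            // sgn
    }
}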

[Figure 3.1. The optimal hyperplane with normal vector w and offset b: the margin hyperplanes <w,x> + b = -1 and <w,x> + b = +1 pass through the closest points of the two classes (y_i = -1 and y_i = +1), with the separating hyperplane <w,x> + b = 0 between them.]

Let x, y in R^n, where x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n). We define the dot product of the two vectors as the real value:

    <x, y> = x_1*y_1 + x_2*y_2 + ... + x_n*y_n    (3.1)

We say that x is orthogonal to y if their dot product equals 0, that is, if:

    <x, y> = 0    (3.2)

If w has norm equal to 1, then <w, Phi(x)> is equal to the length of Phi(x) along the direction of w. Generally w will be scaled by ||w|| in order to obtain a unit vector. By ||w|| we mean the norm of w, defined as ||w|| = sqrt(<w, w>) and also called the Euclidean norm. Throughout this presentation we will use normalized vectors.

This learning algorithm can be performed in a dot product space, for data which is separable by a hyperplane, by constructing f from empirical data. It is based on two facts. First, among all the hyperplanes separating the data, there is a unique optimal hyperplane, distinguished by the maximum margin of separation between any training point and the hyperplane. Second, the capacity of the hyperplane class decreases with the increase of the margin. These are two antagonistic features which transform the solution of this problem into an exercise of compromise:

    maximize_{w in H, b in R}  min { ||x - x_i|| : x in H, <w, x> + b = 0, i = 1, ..., m }    (3.3)

For training data which is not separable by a hyperplane in the input space, the idea of SVM is to map the training data into a higher-dimensional feature space via Phi, and to construct a separating hyperplane with the maximum margin there. This yields a non-linear decision boundary in the input space. By the use of a kernel function it is possible to compute the separating hyperplane <w, Phi(x)> without explicitly carrying out the map into the feature space [SCK02]. In order to find the optimal hyperplane, distinguished by the maximum margin, we need to solve the following objective function:

    minimize_{w in H, b in R}  tau(w) = (1/2)||w||^2    (3.4)
    subject to  y_i(<w, x_i> + b) >= 1  for all i = 1, ..., m

The constraints ensure that f(x_i) will be +1 for y_i = +1 and -1 for y_i = -1. This problem is computationally attractive because it can be solved as a quadratic programming problem, for which there are efficient algorithms. The function tau is called the objective function; together with the inequality constraints it forms a so-called primal optimization problem.

Let's see why we need to minimize the length of w. If ||w|| were 1, then the left side of the constraint would equal the distance between x_i and the hyperplane. In general we need to divide y_i(<w, x_i> + b) by ||w|| in order to turn it into a distance. From now on, if we can meet the constraints for all i = 1, ..., m with a w of minimum length, then the total margin will be maximal. Problems like this one are the subject of optimization theory and are in general difficult to solve, because they imply great computational costs. To solve this type of problem it is more convenient to deal with the dual problem, which according to mathematical demonstrations leads to the same results. In order to introduce the dual problem we have to introduce the Lagrange multipliers alpha_i >= 0 and the Lagrangian [SCK02], which lead to the so-called dual optimization problem:

    L(w, b, alpha) = (1/2)||w||^2 - \sum_{i=1}^m alpha_i (y_i(<x_i, w> + b) - 1)    (3.5)

with Lagrange multipliers alpha_i >= 0. The Lagrangian L must be maximized with respect to the dual variables alpha_i, and minimized with respect to the primal variables w and b (in other words, a saddle point has to be found). Note that the constraints are embedded in the second part of the Lagrangian and need not be applied separately.

We will try to give an intuitive explanation of this constrained optimization problem. If a constraint is violated, then y_i(<w, x_i> + b) - 1 < 0, in which case L can be increased by increasing the corresponding alpha_i. At the same time, w and b will have to change such that L diminishes. In order that alpha_i(y_i(<w, x_i> + b) - 1) doesn't become an arbitrarily large negative number, the changes in w and b will ensure that, for a separable problem, the constraints will finally be met. Similarly, for all constraints that are not met as equalities (that is, for which y_i(<w, x_i> + b) - 1 > 0), the corresponding alpha_i must be 0; this is the value of alpha_i that maximizes L. The latter is the second complementarity condition of optimization theory, given by Karush, Kuhn and Tucker (also known as the KKT conditions). At the saddle point, the partial derivatives of L with respect to the primal variables need to be 0:

    w = \sum_{i=1}^m alpha_i y_i x_i  and  \sum_{i=1}^m alpha_i y_i = 0    (3.6)

The solution vector thus has an expansion in terms of a subset of the training samples, namely those samples with non-zero alpha_i, called Support Vectors. Note that although the solution w is unique (due to the strict convexity of the primal optimization problem), the coefficients alpha_i need not be. According to the Karush-Kuhn-Tucker (KKT) theorem, only the Lagrange multipliers alpha_i that are non-zero at the saddle point correspond to constraints which are precisely met. Formally, for all i = 1, ..., m, we have

    alpha_i [y_i(<x_i, w> + b) - 1] = 0    (3.7)

The patterns x_i for which alpha_i > 0 are called Support Vectors. This terminology is related to corresponding terms in the theory of convex sets, related to convex optimization.

According to the KKT conditions they lie exactly on the margin. All remaining training samples are irrelevant. By eliminating the primal variables w and b from the Lagrangian we arrive at the so-called dual optimization problem, which is the problem that one usually solves in practice:

    maximize_{alpha in R^m}  W(alpha) = \sum_{i=1}^m alpha_i - (1/2) \sum_{i,j=1}^m alpha_i alpha_j y_i y_j <x_i, x_j>    (3.8)

(this is called the target function), subject to

    alpha_i >= 0 for all i = 1, ..., m  and  \sum_{i=1}^m alpha_i y_i = 0    (3.9)

Thus the hyperplane can be written in the dual optimization problem as:

    f(x) = sgn( \sum_{i=1}^m y_i alpha_i <x, x_i> + b )    (3.10)

where b is computed using the KKT conditions. The structure of the optimization problem is very similar to the one that occurs in the mechanical Lagrange formulation. In solving the dual problem, it frequently happens that only a subset of the constraints becomes active. For example, if we want to keep a ball in a box, then it will usually roll into a corner. The constraints corresponding to the walls that are not touched by the ball are irrelevant and can be eliminated.

In practice, a separating hyperplane may not exist, for instance if the classes are overlapped. To take into consideration the samples that can possibly violate the constraints, we can introduce the slack variables xi_i >= 0 for all i = 1, ..., m. Using the slack variables, the constraints become y_i(<w, x_i> + b) >= 1 - xi_i for all i = 1, ..., m. The optimal hyperplane can now be found both by controlling the classification capacity (via ||w||) and by controlling the sum of the slack variables. We notice that the latter provides an upper bound on the number of training errors. This is called soft margin classification. The objective function then becomes:

    tau(w, xi) = (1/2)||w||^2 + C \sum_{i=1}^m xi_i    (3.11)

with the same constraints, where the constant C > 0 determines the trade-off between maximizing the margin and minimizing the number of training errors. If we rewrite this in terms of Lagrange multipliers, we get the same maximization problem with an extra restriction:

    0 <= alpha_i <= C for all i = 1, ..., m  and  \sum_{i=1}^m alpha_i y_i = 0    (3.12)

Everything so far was formulated in a dot product space, which we think of as the feature space. The patterns in this space need not coincide with the input patterns; they can equally well be the results of mapping the original input patterns into a higher dimensional feature space using the function Phi. Maximizing the target function and evaluating the decision function then require the computation of dot products <Phi(x), Phi(x_i)> in a high dimensional space. These expensive calculations are reduced significantly by using a positive definite kernel k,

such that k(x, x') = <Phi(x), Phi(x')>. This substitution, which is sometimes referred to as the kernel trick, is used to extend hyperplane classification to nonlinear Support Vector Machines. The kernel trick can be applied since all feature vectors only occur in dot products. The weight vector then becomes an expression in the feature space, and therefore Phi will be the function through which we represent the input vector in the new space. Thus we obtain the decision function:

    f(x) = sgn( \sum_{i=1}^m y_i alpha_i k(x, x_i) + b )    (3.13)

The main advantage of this algorithm is that it doesn't require transposing all data into a higher dimensional space explicitly, so there is no expensive calculation as in neural networks. We also get a smaller set of testing data, as in the testing phase we only consider the support vectors, which are usually few. Another advantage of this algorithm is that it allows the usage of training data with as many features as needed, without increasing the processing time exponentially. This is not true for neural networks; for instance, the back-propagation algorithm has trouble dealing with a lot of features. The only problem of the support vector algorithm is the resulting number of support vectors: as their number increases, the response time increases linearly too.

3.2 Multiclass Classification

Most real life problems require classification in more than two classes. There are several methods for dealing with multiple classes that use binary classification. One of these methods consists in classifying "one versus the rest", in which the elements that belong to a class are differentiated from the others. In this case we calculate an optimal hyperplane that separates each class from the rest of the elements. As the output of the algorithm we choose the maximum value obtained over all decision functions. To get a classification in M classes, we construct a set of binary classifiers, where each one is trained to separate one class from the rest. Then we combine them by doing the multi-class classification according to the maximal output before applying the sign function; that is, by taking

    argmax_{j=1,...,M} g_j(x),  where  g_j(x) = \sum_{i=1}^m y_i alpha_i^j k(x, x_i) + b_j    (3.14)

and

    f_j(x) = sgn(g_j(x))    (3.15)

This approach has linear complexity: for M classes we compute M hyperplanes. Another method for multiclass classification consists in classifying in pairs. In this method we choose two classes and compute a hyperplane for them; we do this for each pair of classes in the training set. For M classes we compute (M-1)*M/2 hyperplanes, so this method has polynomial complexity. The advantage of this method is that usually the resulting hyperplanes have smaller dimensions (fewer support vectors). A sketch of the "one versus the rest" prediction step follows.
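// A minimal sketch of "one versus the rest" prediction (equations 3.14
// and 3.15): evaluate the decision function of each binary classifier and
// choose the class with the maximal output before applying the sign.
// Hedged illustration; BinaryModel stands in for one trained classifier.
public final class OneVersusRest {

    public interface BinaryModel {
        double decisionValue(double[] x);   // g_j(x), before sgn
    }

    public static int predict(BinaryModel[] models, double[] x) {
        int best = 0;
        double bestValue = models[0].decisionValue(x);
        for (int j = 1; j < models.length; j++) {
            double v = models[j].decisionValue(x);
            if (v > bestValue) {            // argmax over g_j(x)
                bestValue = v;
                best = j;
            }
        }
        return best;                        // index of the winning class
    }
}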

3.3 Clustering using Support Vector Machine

Further on we will designate by classification the process of supervised learning on labeled data, and by clustering the process of unsupervised learning on unlabeled data. The algorithm above only uses labeled training data. Vapnik presents in [VAP01] an alteration of the classical algorithm that uses unlabeled training data. Here, finding the hyperplane becomes finding a sphere of minimal radius and minimal cost that groups the most resembling data (presented also in [Ja00]). This approach will be presented in what follows. In [SCK02] we can find a different clustering algorithm, based mostly on probabilities. For document clustering we will use the terms defined above, and we will mention the necessary changes for running the algorithm on unlabeled data.

There are several types of kernels that are usually used in the decision function. The most frequently used are the linear kernel, the polynomial kernel, the Gaussian kernel and the sigmoid kernel. We can choose the kernel according to the type of data that we are using: the linear and the polynomial kernels work best when the data is well separated, while the Gaussian and the sigmoid kernels work best when the data is overlapped (but the number of support vectors also increases). A sketch of the polynomial and Gaussian kernels is given below.

For clustering, the training data will be mapped into a higher dimensional feature space using the Gaussian kernel. In this space we will try to find the smallest sphere that includes the image of the mapped data. This is possible, as the data is generated by a given distribution, and when mapped into a higher dimensional feature space it will group in a cluster. After computing the dimensions of the sphere, the sphere is remapped into the original space. The boundary of the sphere is transformed into one or more boundaries that contain the classified data. The resulting boundaries can be considered as margins of the clusters in the input space. Points belonging to the same cluster will have the same boundary. As the width parameter of the Gaussian kernel decreases, the number of unconnected boundaries increases. When the width parameter increases, there will be overlapping clusters. We will use the following form of the Gaussian kernel:

    k(x, x') = e^{-||x - x'||^2 / sigma^2}    (3.16)

Note that if sigma decreases, the exponent increases in absolute value, and so the value of the kernel will tend to 0. This will map the data into a smaller dimensional space instead of a higher dimensional space. Mathematically speaking, the algorithm tries to find the valleys of the generating distribution of the training data.
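The following minimal Java sketch gives the two kernels used in this report. The Gaussian kernel follows equation (3.16); the polynomial form used here, (<x, x'> + 1)^d, is the textbook variant and is an assumption, since the report introduces its own simplified kernel forms only later.

// A minimal sketch of the kernel functions discussed above.
public final class Kernels {

    static double dot(double[] x, double[] y) {
        double s = 0.0;
        for (int i = 0; i < x.length; i++) s += x[i] * y[i];
        return s;
    }

    // Polynomial kernel of degree d (assumed textbook form).
    public static double polynomial(double[] x, double[] y, int d) {
        return Math.pow(dot(x, y) + 1.0, d);
    }

    // Gaussian kernel k(x, x') = exp(-||x - x'||^2 / sigma^2), as in (3.16).
    public static double gaussian(double[] x, double[] y, double sigma) {
        double dist2 = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            dist2 += diff * diff;
        }
        return Math.exp(-dist2 / (sigma * sigma));
    }
}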

If there are outliers, the algorithm above will determine a sphere of very high cost. To avoid excessive costs and misclassification, there is a soft margin version of the algorithm, similar to the one in section 3.1. The soft margin algorithm uses a constant so that distant points (high cost points) won't be included in the sphere. In other words, we choose a limit on the cost of introducing an element into the sphere.

We will now present the clustering algorithm in detail, using some mathematical aspects to justify some of the assertions. Let X be the input space, {x_i} a subset of X with the N samples to be classified, R the radius of the searched sphere, and k the Gaussian kernel presented in (3.16). We consider a to be the center of the sphere. Given this, the problem becomes:

    ||Phi(x_i) - a||^2 <= R^2,  i = 1, ..., N    (3.17)

and by replacing the norm with the corresponding dot product we get:

    <Phi(x_i) - a, Phi(x_i) - a> <= R^2,  i = 1, ..., N    (3.18)

Equation (3.17) represents the primal optimization problem for clustering. For the soft margin case we use the slack variables xi_i, and formulas (3.17) and (3.18) become:

    ||Phi(x_i) - a||^2 <= R^2 + xi_i,  i = 1, ..., N    (3.19)

and respectively:

    <Phi(x_i) - a, Phi(x_i) - a> <= R^2 + xi_i,  i = 1, ..., N    (3.20)

The equations above are two equivalent forms of the same primal optimization problem using soft margin. The primal optimization problem is very difficult, if not impossible, to solve in practice. This is why we introduce the Lagrangian and the dual optimization problem; as above, the solution of the dual problem is the same as the solution of the primal problem:

    L = R^2 - \sum_i (R^2 + xi_i - ||Phi(x_i) - a||^2) beta_i - \sum_i xi_i mu_i + C \sum_i xi_i    (3.21)

where beta_i >= 0 and mu_i >= 0 are the Lagrange multipliers, C is a constant and C*sum(xi_i) is the penalty term. In this case we have two sets of Lagrange multipliers, as there are more constraints to be met. Setting the partial derivatives of L with respect to R, a and xi_i (the primal variables) to 0, we get:

    dL/dR = 2R - 2R \sum_i beta_i = 0  =>  \sum_i beta_i = 1    (3.22)

    dL/da = -2 \sum_i beta_i (Phi(x_i) - a) = 0  =>  a = \sum_i beta_i Phi(x_i)    (3.23)

    dL/dxi_i = -beta_i - mu_i + C = 0  =>  beta_i = C - mu_i    (3.24)

Then the KKT conditions are:

    xi_i mu_i = 0    (3.25)
    (R^2 + xi_i - ||Phi(x_i) - a||^2) beta_i = 0    (3.26)

Analyzing the problem and the restrictions above, we notice the following cases:

- If xi_i > 0 and beta_i > 0, then ||Phi(x_i) - a||^2 > R^2, which means that Phi(x_i) lies outside the sphere; in this case mu_i = 0, so beta_i = C, and x_i is a bounded support vector (BSV).
- If xi_i = 0 and beta_i = 0, then ||Phi(x_i) - a||^2 <= R^2, which means that these elements belong to the interior of the sphere; they will be classified when we remap them into the original space.
- If xi_i = 0 and 0 < beta_i < C, then ||Phi(x_i) - a||^2 = R^2, which means that the points lie on the sphere. These points are actually the support vectors that will determine the clusters and that will then be used to classify new elements.

Thus the support vectors (SVs) lie exactly on the surface of the sphere and the bounded support vectors (BSVs) lie outside the sphere. We can easily see that when C >= 1 there are no BSVs, so all the points in the input space will be classified (the hard margin case).

[Figure 3.2. Mapping the unlabeled data from the input space into a higher dimensional feature space via Phi and determining the sphere that includes all points (hard margin).]

In Figure 3.2 we presented the geometrical interpretation of the first part of the optimization problem: mapping the unlabeled data from the input space into a higher dimensional feature space via the Gaussian kernel function. Thus we can determine the sphere that contains all the points to be classified. We considered the hard margin case. We will now present the mapping of the sphere back into the input space, to see the way we create the boundaries.

[Figure 3.3. Grouping the data into classes.]

In Figure 3.3 we present the way data is grouped after mapping the boundary of the sphere in Figure 3.2 back into the input space. This is very difficult to implement as an algorithm. For an easier implementation we eliminate R, a and mu_i, so that the Lagrangian is turned into its Wolfe dual form. The Wolfe dual obtained in this way is an optimization problem in beta. Replacing formulas (3.23) and (3.24) in (3.21), we have:

    W = R^2 - \sum_i (R^2 + xi_i - ||Phi(x_i) - a||^2) beta_i - \sum_i xi_i (C - beta_i) + C \sum_i xi_i

    W = R^2 (1 - \sum_i beta_i) + \sum_i beta_i ||Phi(x_i) - a||^2

and, using \sum_i beta_i = 1 and a = \sum_j beta_j Phi(x_j),

    W = \sum_i beta_i <Phi(x_i), Phi(x_i)> - 2 \sum_{i,j} beta_i beta_j <Phi(x_i), Phi(x_j)> + \sum_{i,j} beta_i beta_j <Phi(x_i), Phi(x_j)>

    W = \sum_i beta_i <Phi(x_i), Phi(x_i)> - \sum_{i,j} beta_i beta_j <Phi(x_i), Phi(x_j)>    (3.27)

The restrictions on mu_i are replaced by:

    0 <= beta_i <= C,  i = 1, ..., N    (3.28)

We use the support vector technique of representing the dot product <Phi(x_i), Phi(x_j)> by a Gaussian kernel:

    K(x_i, x_j) = e^{-q ||x_i - x_j||^2}    (3.29)

with the width parameter q = 1/sigma^2. In these conditions the Lagrangian becomes:

    W = \sum_i beta_i K(x_i, x_i) - \sum_{i,j} beta_i beta_j K(x_i, x_j)    (3.30)

We denote, for each point x, the distance from its image in the feature space to the center of the sphere as:

    R^2(x) = ||Phi(x) - a||^2    (3.31)

and by using (3.23) this becomes:

    R^2(x) = <Phi(x) - \sum_i beta_i Phi(x_i), Phi(x) - \sum_j beta_j Phi(x_j)>

which, expanded in terms of the kernel, is:

    R^2(x) = K(x, x) - 2 \sum_i beta_i K(x_i, x) + \sum_{i,j} beta_i beta_j K(x_i, x_j)    (3.32)

The radius is then:

    R = { R(x_i) | x_i in SV }    (3.33)

that is:

    R = { R(x_i) | xi_i = 0 and 0 < beta_i < C }

The boundaries that include points belonging to the same cluster are defined by the following set:

    { x | R(x) = R }    (3.34)

Another issue to be discussed is the cluster assignment: how do we decide which points are in which class. Geometrically speaking, this means determining the interiors of the boundaries. When two points are in different clusters, the line segment that unites them is transposed outside of the sphere in the higher dimensional space. Such a segment will contain a point y for which R(y) > R. In order to determine the existence of such points y, it is necessary to consider a number of test points and to check, for each, whether it goes outside the sphere or not. Let M be the number of test points considered for testing whether the segment belongs to the sphere or not. The n-th test point will be:

    y_n = x_i + (n / (M + 1)) (x_j - x_i),  n = 1, ..., M    (3.35)

The results of these tests are stored in an adjacency matrix A, with an entry for each pair of points:

    A_ij = 1 if R(y) <= R for all test points y between x_i and x_j, and 0 otherwise    (3.36)

The clusters are then defined by the connected components of the graph induced by the matrix A. When we use the soft margin and thus have BSVs, these will not be classified in any of the classes by this procedure; they can be assigned to the closest class or they can remain unclassified. A sketch of this cluster labeling step is given below.
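// A minimal sketch of the cluster assignment step (equations 3.32-3.36):
// compute R^2(x) from the trained beta coefficients, sample M test points
// on the segment between two samples, and decide their adjacency; the
// connected components of the resulting graph are the clusters. Hedged
// illustration only; beta, the samples and the squared radius r2 are
// assumed to come from solving the dual problem.
public final class ClusterLabeling {

    final double[][] x;      // the N input samples
    final double[] beta;     // Lagrange multipliers from the dual problem
    final double q;          // Gaussian kernel width parameter, q = 1/sigma^2
    final double r2;         // squared radius R^2, taken at a support vector

    ClusterLabeling(double[][] x, double[] beta, double q, double r2) {
        this.x = x; this.beta = beta; this.q = q; this.r2 = r2;
    }

    double kernel(double[] a, double[] b) {
        double d2 = 0.0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; d2 += d * d; }
        return Math.exp(-q * d2);
    }

    // R^2(y) = K(y,y) - 2 sum_i beta_i K(x_i,y) + sum_ij beta_i beta_j K(x_i,x_j)
    double radius2(double[] y) {
        double s = kernel(y, y);                  // equals 1 for the Gaussian kernel
        double cross = 0.0, constant = 0.0;
        for (int i = 0; i < x.length; i++) {
            cross += beta[i] * kernel(x[i], y);
            for (int j = 0; j < x.length; j++)
                constant += beta[i] * beta[j] * kernel(x[i], x[j]);
        }
        return s - 2.0 * cross + constant;
    }

    // A_ij = 1 when all M test points on the segment x_i -> x_j stay inside the sphere.
    boolean connected(int i, int j, int m) {
        for (int n = 1; n <= m; n++) {
            double t = (double) n / (m + 1);
            double[] y = new double[x[i].length];
            for (int d = 0; d < y.length; d++)
                y[d] = x[i][d] + t * (x[j][d] - x[i][d]);
            if (radius2(y) > r2) return false;
        }
        return true;
    }
}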

3.4 SMO - Sequential Minimal Optimization

This represents one of the methods that implement the quadratic programming problem; it was introduced by Platt [Pla99]. The strategy of SMO is to break the constrained problem up into the smallest optimization groups possible. Note that it is not possible to modify the variables alpha_i individually without violating the sum constraint (the KKT conditions). We therefore generate a generic convex constrained optimization problem for pairs of variables. Thus, we consider the optimization over two variables alpha_i and alpha_j with all other variables considered fixed, optimizing the target function with respect to them. Here x_i and x_j are sample i and sample j from the training set. The exposition proceeds as follows: first we solve the generic optimization problem in two variables and subsequently determine the values of the placeholders of the generic problem; then we adjust b properly and determine how patterns can be selected to ensure speedy convergence.

The problem presented above thus implies solving a linear analytical optimization problem in two variables; the only problem that remains is the order in which the training samples are chosen, for speedy convergence. By using two elements we have the advantage of dealing with a linear decision function, which is easier to solve than a quadratic programming optimization problem. Another advantage of the SMO algorithm is that not all training data must be simultaneously in memory, but only the samples for which we optimize the decision function. This fact allows the algorithm to work with very high dimensional samples.

We begin with a generic convex constrained optimization problem in two variables. Using the shorthand K_ij := k(x_i, x_j), the quadratic problem becomes:

    minimize_{alpha_i, alpha_j}  (1/2)[alpha_i^2 K_ii + alpha_j^2 K_jj] + s alpha_i alpha_j K_ij + c_i alpha_i + c_j alpha_j
    subject to  s alpha_i + alpha_j = gamma,  0 <= alpha_i <= C_i  and  0 <= alpha_j <= C_j    (3.37)

where alpha_i and alpha_j are the Lagrange multipliers of the two samples, K is the kernel and s is +1 or -1. Here c_i, c_j, gamma in R and the entries of K are chosen suitably to take into account the effect of the m-2 variables that are kept fixed. The constants C_i represent pattern dependent regularization parameters. To obtain the minimum we use the constraint s alpha_i + alpha_j = gamma. This formula allows us to reduce the actual optimization problem to an optimization problem in alpha_i, by eliminating alpha_j. As a shorthand we use the notation chi := K_ii + K_jj - 2sK_ij.

In the implementation of SMO I take the value of the constants C_i equal to 1, because I have chosen the hard margin primal form of SVM. The constants c_i, c_j and gamma absorb the fixed variables and are defined thus:

    c_i := y_i \sum_{l != i,j} alpha_l y_l K_il,  c_j := y_j \sum_{l != i,j} alpha_l y_l K_jl  and  gamma := -y_j \sum_{l != i,j} alpha_l y_l    (3.38)

As the initial condition I have considered the weight vector equal to zero and the Lagrange multipliers alpha_i for all samples equal to zero. The dimension of the weight vector is equal to the number of attributes, and the dimension of the alpha vector is equal to the number of samples. To solve the minimization problem in the objective function I have created five sets that split the input data according to the KKT conditions, as follows:

    I_0 = { i : alpha_i in (0, C) }
    I_{+,0} = { i : alpha_i = 0, y_i = +1 }
    I_{-,0} = { i : alpha_i = 0, y_i = -1 }
    I_{+,C} = { i : alpha_i = C, y_i = +1 }
    I_{-,C} = { i : alpha_i = C, y_i = -1 }    (3.39)

From now on we will use the following notations:
- m_i0 - all the training samples for which the Lagrange multipliers alpha_i are between 0 and C
- m_i0_p - all the samples for which alpha_i = 0 and the topic is positive
- m_i0_n - all the samples for which alpha_i = 0 and the topic is negative
- m_ic_p - all the samples for which alpha_i = C and the topic is positive
- m_ic_n - all the samples for which alpha_i = C and the topic is negative

I have also defined:

    m_bup := min_{i in m_i0 U m_i0_p U m_ic_n} (f(x_i) - y_i)    (3.40)
    m_blow := max_{i in m_i0 U m_i0_n U m_ic_p} (f(x_i) - y_i)

where m_bup and m_blow are the superior and the inferior limits for the margin of the hyperplane. As an improved estimation of b I have used the formula m_b = (m_bup + m_blow)/2. The stopping criterion is no longer that the KKT conditions are violated by less than some tolerance Tol, but that m_blow <= m_bup holds with some tolerance: m_blow - m_bup <= 2*Tol.

In the training part, the samples with alpha_i != 0 are considered support vectors and are the only ones relevant in the objective function; they are the only ones that need to be kept from the training set. For this, I have created a set called m_supportVectors that stores all the samples that have alpha_i != 0. In the application I have used a member m_data that keeps all the training samples from the file. Each sample has a specified position in m_data; this position is the same in all the sets used, to make searching easier. If a set doesn't contain the sample at a specified position, the entry for that position contains a null value. Considering the above, at the beginning I put all the samples in the two sets m_i0_p and m_i0_n, because initially I consider all alpha_i and the weight vector equal to zero. Then I examine all the samples, renewing all the sets. If during this process there is no change, we can consider that the training is over; otherwise I repeat the process again, as presented below.

In the minimization problem, in order to find the minimum, we use s alpha_i + alpha_j = gamma. Modifying alpha depends on the difference between the values f(x_i) and f(x_j). There are several algorithms that can be used for choosing the indices i and j; the larger the difference between them, the larger the step toward the minimum. I have chosen the two-loops approach to maximize the objective function. The outer loop iterates over all patterns violating the KKT conditions, or possibly over those where the threshold condition on m_b is violated. Usually we first loop over those with Lagrange multipliers neither on the upper nor on the lower boundary. Once all of these are satisfied, we loop over all patterns violating the KKT conditions, to ensure self-consistency on the entire dataset. This solves the problem of choosing the index i. It is sometimes useful, especially when dealing with noisy data, to iterate over the complete KKT-violating dataset before complete self-consistency on the subset has been achieved; otherwise considerable computational resources are spent making subsets. The trick is to perform a sweep through the whole data only when less than, say, 10% of the non-bound variables change.

Now to select j: to make a large step towards the minimum, one looks for large steps in alpha_i. Since it is computationally expensive to compute K_ii + K_jj - 2sK_ij for all possible pairs (i, j), one chooses a heuristic to maximize the change in the Lagrange multipliers and thus maximize the absolute value of the numerator in the expression for the new alpha. This means that we are looking for patterns with a large difference in their relative errors f(x_i) - y_i and f(x_j) - y_j. The index j corresponding to the maximum absolute value is chosen for this purpose. If this heuristic happens to fail (if little progress is made by this choice), all other indices are looked at in the following way (a second choice): all indices corresponding to non-bound examples are looked at, searching for an example on which to make progress. In the case that the first heuristic was also unsuccessful, all other samples are analyzed until an example is found where progress can be made. If both previous steps fail, SMO proceeds to the next index i.

In the problem of computing the offset b of the decision function, another stopping criterion occurs. This new criterion occurs because, unfortunately, during training not all Lagrange multipliers will be optimal; if they were, we would already have obtained the solution. Hence, obtaining b by exploiting the KKT conditions is not accurate (because the margin must be exactly 1 for the Lagrange multipliers for which the box constraints are inactive). Using the limits of the margin presented in (3.40), and since the KKT conditions have to hold for a solution, we can check that this corresponds to m_bup >= 0 >= m_blow. For m_i0 I have already used this fact in the decision function. After that, using the KKT conditions, I have renewed the thresholds m_bup and m_blow (the superior and inferior limits for the margin of the hyperplane). I have also chosen an interval from which to choose the second sample, which I renew by renewing m_up and m_low. According to the class of the first sample I choose the second training sample as follows: if the first sample is in the set with positive topic, I choose the second sample from the set with negative topic. After choosing the samples for training I check if there is real progress in the algorithm. The real benefit comes from the fact that we may use m_low and m_up to choose the patterns to focus on. The largest contribution to the discrepancy between m_up and m_low stems from those pairs of patterns (i, j) for which the discrepancy

    discrepancy(i, j) := (f(x_i) - y_i) - (f(x_j) - y_j),
    where i in I_0 U I_{+,0} U I_{-,C} and j in I_0 U I_{-,0} U I_{+,C}    (3.41)

is the largest. In order to have a real benefit, the discrepancy needs to have a high value. For the first sample I have computed the superior and inferior limits of alpha_i using:

    alpha_i = L if zeta > 0, H otherwise    (3.42)

where

    L = max(0, s^{-1}(gamma - C_j)) if s > 0, and L = max(0, s^{-1} gamma) otherwise
    H = min(C_i, s^{-1} gamma) if s > 0, and H = min(C_i, s^{-1}(gamma - C_j)) otherwise

and

    zeta := s c_j - c_i + gamma s (K_jj - K_ij),  chi := K_ii + K_jj - 2sK_ij    (3.43)

After that I can compute the new alpha_i using the formula:

    alpha_i = alpha_i^{old} + chi^{-1} delta   if chi > 0
    alpha_i = L                                if chi = 0 and delta > 0
    alpha_i = H                                if chi = 0 and delta < 0    (3.44)

where delta := y_i((f(x_i) - y_i) - (f(x_j) - y_j)), and alpha_i here is the value of the new alpha (the solution without the box restriction). After that I have computed the new alpha_i for the first sample, and I checked and forced it to hold the KKT conditions (to be between 0 and C, i.e., clipped to [L, H]). Then I have computed the new alpha_j for the second sample using the formula alpha_j = s(alpha_i^{old} - alpha_i) + alpha_j^{old}. I have modified all the sets using the new alpha values. Thus, for the first sample, if the new alpha_i is greater than zero it is included in the m_supportVectors set, otherwise it is eliminated if it is in that set; if alpha_i is between 0 and C and the topic is positive, the sample is included in m_i0_p, otherwise it is eliminated from that set if it is there; and so on. We repeat this with the second sample. A compact sketch of this two-variable update is given below.
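The following minimal Java sketch condenses the analytic step above into code, in the common Platt-style form (update via chi and the error difference, then clip to [L, H]). The set bookkeeping and the heuristics for choosing i and j are omitted, so this is an illustration of the update, not the application's full training loop; the errors err[i] = f(x_i) - y_i and the kernel values are assumed to be given.

// A minimal sketch of the analytic two-variable SMO step.
public final class SmoStep {

    // Returns true if the pair (i, j) produced a meaningful change.
    public static boolean update(double[] alpha, int[] y, double[] err,
                                 double kii, double kjj, double kij,
                                 int i, int j, double c) {
        double s = y[i] * y[j];
        double gamma = s * alpha[i] + alpha[j];   // conserved by the constraint

        // Bounds on alpha_i so that both multipliers stay in [0, C].
        double lo = s > 0 ? Math.max(0, s * (gamma - c)) : Math.max(0, s * gamma);
        double hi = s > 0 ? Math.min(c, s * gamma) : Math.min(c, s * (gamma - c));
        if (lo >= hi) return false;

        double chi = kii + kjj - 2 * kij;         // second derivative
        if (chi <= 0) return false;               // degenerate case skipped here

        double delta = y[i] * (err[j] - err[i]);  // unconstrained step
        double newAi = alpha[i] + delta / chi;
        newAi = Math.max(lo, Math.min(hi, newAi)); // clip to [L, H]
        if (Math.abs(newAi - alpha[i]) < 1e-12) return false;

        alpha[j] = alpha[j] + s * (alpha[i] - newAi); // keep s*a_i + a_j = gamma
        alpha[i] = newAi;
        return true;
    }
}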

After that I have updated the weight vector for all the attributes using the formula:

w = sum_{i=1..m} alpha_i y_i x_i

Using the new objective function we can compute the classifier error for all samples, the inferior and superior limits of the margin of the hyperplane (used for b), and the inferior and superior limits m_up and m_low of the domain from which the samples are chosen.

To speed up the computation, I have used in my program a large matrix called m_store in which I store the outputs of the kernels computed between two samples. The next time a value is needed, I take it from the matrix; a value is put into the matrix right after the kernel is calculated for the first time. I exploit the fact that the value of the kernel for two samples is constant with respect to the evolution of the algorithm's variables.

There is a different approach in the algorithm for the linear kernel than for the other kernels. In the case of the linear kernel, the objective of the algorithm is to compute the weight vector of the decision function, taking into consideration the primal optimization problem f(x) = sgn(<w, x> + b). The linear kernel is a particular case of the polynomial kernel, with degree 1. We take into consideration only the weight vector because, with this type of kernel, the data is kept in the original space. In almost all cases this type of kernel, presented by some researchers separately, obtains poor classification results; it is used especially when the support vector algorithm serves for feature selection, because it produces a separate weight for each attribute. The other types of kernels used, such as the polynomial kernel with degree greater than 1 and the Gaussian kernel, use the decision function

f(x) = sgn( sum_{i=1..m} y_i alpha_i k(x, x_i) + b )

that comes from the dual optimization problem (where I compute only the Lagrange multipliers alpha_i).
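The m_store caching described above amounts to lazy evaluation of a symmetric kernel matrix; a minimal Java sketch (illustrative names, not the exact code of the application):

    // Minimal sketch of the m_store kernel cache. Kernel values never
    // change during training, so each pair (i, j) is computed at most once.
    class KernelCache {
        interface Kernel { double eval(double[] x, double[] y); }

        private final double[][] store;   // Double.NaN marks "not yet computed"
        private final double[][] data;    // training samples, one row per sample

        KernelCache(double[][] data) {
            this.data = data;
            this.store = new double[data.length][data.length];
            for (double[] row : store) java.util.Arrays.fill(row, Double.NaN);
        }

        double get(int i, int j, Kernel k) {
            if (Double.isNaN(store[i][j])) {
                double v = k.eval(data[i], data[j]);
                store[i][j] = v;
                store[j][i] = v;          // kernel values are symmetric
            }
            return store[i][j];
        }
    }

The price of this trick is the memory for the full matrix, which is why the report speaks of "a large matrix".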

3.5 Probabilistic Outputs for SVM

The standard SVM does not produce an output that measures the confidence of the classification. Platt, in [Pla99_2], presents a method for transforming the standard SVM output into a posterior probability output. Posterior probabilities are also required when a classifier makes only a small part of an overall decision and the classification outputs must be combined for the overall decision. Thus the author presents a method for extracting probabilities P(class | input) from the output of the SVM, which can be used for post-processing the classification. This method leaves the SVM error function unchanged. For classification into multiple classes, the class is chosen based on the maximal posterior probability over all classes. The author follows the idea of [Wah99] for producing probabilistic outputs from a kernel machine.

The Support Vector Machine produces an uncalibrated value that is not a probability; if the signum function is applied, it yields only the two values -1 and 1:

f(x) = sgn( sum_{i=1..m} y_i alpha_i k(x, x_i) + b )   (3.45)

The unthresholded output lies in a Reproducing Kernel Hilbert Space. Training an SVM minimizes an error function that penalizes an approximation of the training misclassification rate plus a penalty term; minimizing this error function also minimizes a bound on the test misclassification rate, which is a desirable goal as well. An additional advantage of this error function is that its minimization produces a sparse machine, where only a subset of the possible kernels is used in the final machine.

P(class | input) = P(y = 1 | x) = p(x) = 1 / (1 + exp(-f(x)))   (3.46)

where f(x) is the output function of the SVM. Instead of estimating the class-conditional densities p(f | y), the authors suggest using a parametric model to fit the posterior probability P(y = 1 | f) directly. The parameters of the model are adapted to give the best probability outputs. Analyzing empirical data, the authors observe that the densities are very far from Gaussian: there are discontinuities in the derivatives of both densities at the positive margin f = 1 and the negative margin f = -1. These discontinuities occur because the cost function also has discontinuities at the margins. The class-conditional densities between the margins are apparently exponential, so the authors suggest the usage of a parametric form of the sigmoid:

P(y = 1 | f) = 1 / (1 + exp(A*f(x) + B))   (3.47)

This sigmoid model is equivalent to assuming that the output of the SVM is proportional to the log odds of a positive example. The model has two parameters, which are trained independently in an extra post-processing step using the regularized binomial likelihood. As long as A < 0, the monotonicity of (3.47) is assured. The parameters A and B of (3.47) are found using maximum likelihood estimation on the training set (f_i, y_i). The complete algorithm for computing the parameters A and B is presented in [Pla99_2]. This method was also used by Ka in [Ka03], using only a single parameter.
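A minimal sketch of how such a probabilistic output can be computed from a trained decision function (A and B are assumed to be already fitted, e.g. by Platt's algorithm; names are illustrative):

    // Sketch: mapping the raw SVM output f to a posterior probability
    // with the sigmoid (3.47). A must be negative for monotonicity.
    static double posterior(double f, double A, double B) {
        double z = A * f + B;
        if (z >= 0) {                      // guard against overflow of exp
            double e = Math.exp(-z);
            return e / (1.0 + e);
        }
        return 1.0 / (1.0 + Math.exp(z));
    }

For multi-class classification, as described above, the class with the maximal posterior over all binary classifiers is then selected.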

4 Experimental Research

4.1 Background Work

4.1.1 Experimental Data Sets and Feature Selection

In this section, I will describe the process of extracting the useful data that I will work with, as well as the developed applications. For the experiments I have used the Reuters-2000 collection [Reu2000], which includes a very large number of news stories. The Reuters collection is commonly used in text categorization research. Due to the huge dimension of the database I have chosen only a subset of the data to work upon, referred to as Subset-c152. Thus, from all the documents, I have selected those grouped by Reuters in System Software (I33020) by industry. After this selection there are 7083 documents left to work on. In the resulting documents there are 63 different topics for classification, according to Reuters. For binary classification I chose the topic c152, meaning Comment/Forecasts according to the Reuters codes. I have chosen topic c152 because only about 30% of the data belongs to that class, so the algorithm can actually learn. Those 7083 articles are divided randomly into a training set of 4722 documents and a testing set of 2361 documents (samples). Most researchers take into consideration only the title and the abstract of the article, or only the first 200 words of each piece of news. In my evaluation I take into consideration the entire news story together with its title in order to create the characteristic vector. I have used the bag-of-words approach for representing the documents, as presented in section 2.

Text categorization is the problem of automatically assigning predefined categories to free-text documents. The major difficulty is the high dimensionality of the feature space. Automatic feature selection methods include the removal of non-informative terms according to corpus statistics, and the construction of new features that combine lower-level features into higher-level orthogonal dimensions. In feature selection the focus is on an aggressive dimensionality reduction. In Subset-c152 we have 19038 distinct attributes, which represent in fact the roots of the words from the documents. A training algorithm with a vector of this dimension is time consuming, and in almost all cases the accuracy of the learning is low due to the noise in the data.

For feature selection I have evaluated three methods, including term selection based on document frequency. The first method I used is Information Gain with different thresholds. By varying the threshold I have obtained a number of features ranging from 41 for the threshold equal to 0.1 up to 7999 for the smallest threshold. The method was presented in section 2.2.1. Another method used is the Support Vector Machine algorithm with linear kernel, as in [Mla02]. For this I used thresholds starting at 0.1, obtaining a number of features between 47 and 1936. As we will see, a relatively small number of features makes the classifier work better than many features do. The same conclusions were drawn in [Gab04].

For the text mining step I have created a Java package, ReadReuters, that uses a few classes from the package ir.jar [IR] in order to extract the data (the root of each word and its number of appearances) from the documents. This package generates the four files presented in section 2.1. For the learning and testing steps I have implemented a Java application using the algorithm presented in [SCK02], based on the support vector technique. To this algorithm I have added the idea proposed by Platt in [Pla99_2] for the probabilistic output of the SVM, presented in section 3.5 of this report. I have tested this algorithm using Subset-c152 for classification into two classes, trying different kernels and kernel degrees for the SVM algorithm.
I have also analyzed different methods of feature selection, such as Information Gain and feature selection using the Support Vector Machine.
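As an illustration, the characteristic vector produced for one document by the text mining step can be sketched as follows (illustrative names; vocabulary maps a word stem to its dimension in the feature space):

    // Sketch of building a document's characteristic vector (bag-of-words):
    // component t holds the number of appearances of stem t in the document.
    static double[] characteristicVector(String[] stems,
                                         java.util.Map<String, Integer> vocabulary) {
        double[] v = new double[vocabulary.size()];
        for (String s : stems) {
            Integer t = vocabulary.get(s);
            if (t != null) v[t] += 1.0;
        }
        return v;
    }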

When using feature selection with different thresholds, there can be cases when all the attributes (features) of a vector are zero (the document does not contain any of the selected attributes); in such a case we eliminate that vector. Such documents are no longer represented by any vector, so their information is lost, and we can thus end up with a smaller number of samples in the training or the testing set. The number of documents presented above is obtained for the minimum threshold.

4.1.2 Application's Parameters

All applications run from the command line and accept their parameters there. We have done this because we are more interested in the results of the learning process than in a graphical interface; all learning results are stored in a file. Even so, we have created a simple graphical interface to see how the classes look in the two-dimensional model.

The application for the data mining step has the following command line:

java supportvector.readreuters -t <folder name> [-T <threshold value>] [-f <feature selection method>]

where the options are:
- -t <folder name> - the name of the folder where the Reuters files are located (or other files in the Reuters format). This option is compulsory.
- -T <threshold value> - the value of the threshold, i.e. the number of occurrences of the word in all documents. If this option is omitted, the default value is 0.
- -f <feature selection method> - the method used for feature selection. You can choose between Information Gain, Average entropy, Mutual Information, or no method of feature selection.

If the compulsory option is omitted, or if the application is called with the option -?, the application returns a message listing all these options.

The learning application has the general command line:

SupportVector.UseSVM -t <training file> -v <testing file> [-k <kernel type>] [-c <degree or constant>] [-d <type of attributes>] [-f <SVM feature selection>]

where the options are:
- -t <training file> - the name of the file containing the training set; this option is compulsory.
- -v <testing file> - the name of the file containing the testing set; this option is compulsory.
- -k <kernel type> - the type of kernel. To test using the polynomial kernel we write -k POL; for the radial basis function kernel we write -k RBF. If this option is omitted, the default value is the polynomial kernel.
- -c <degree or constant> - for the polynomial kernel it represents the degree of the polynomial, and for the RBF kernel it represents the constant of the denominator. Any double value is admitted. If this option is omitted, the default value is 1.
- -d <type of attributes> - the type of attribute representation in the training and testing sets. Possible values are BIN for binary representation, NOM for nominal representation, and SMART for Connell SMART representation. The default value is BIN.
- -f <SVM feature selection> - used when we want to use the algorithm as a feature selection algorithm. The double value represents the value of the threshold.

If a compulsory option is omitted, or if the application is called with the option -?, the application returns a message with all the options above.
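For example, a binary classification run with a degree-2 polynomial kernel over the Connell SMART representation would be launched with a command of the following form (the file names here are only illustrative):

    SupportVector.UseSVM -t train_c152.dat -v test_c152.dat -k POL -c 2 -d SMART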

The Java applet has all the parameters of the learning application except the training and testing files, as its data is taken from the interface: the training data consists of the points given by the user, and the testing data consists of all the points of the applet.

4.1.3 Types of Kernels Used

The purpose of the kernel is to transpose the training data from the input space into a higher-dimensional feature space and to separate the data in this new space. The idea of the kernel is to compute the norm of the difference between two vectors (the cosine of the angle between the two vectors) in a higher-dimensional space without explicitly representing those vectors in the new space (the kernel trick). In order to use the kernel trick we need to express the kernel in terms of dot products. In practice we can see that by adding a scalar constant to the kernel we can get better classification and clustering results. In this report I test a new idea: correlating this scalar with the dimension of the space where the data will be represented, because I consider that those two parameters (the degree and the scalar) need to be correlated. I will present the results for different kernels and for different parameters of each kernel. For the polynomial kernel I change the degree of the polynomial, and for the Gaussian kernel I change the constant C, according to the following formulas:

polynomial: k(x, x') = (2d + <x, x'>)^d, with d being the parameter to be modified;

Gaussian (radial basis function, RBF): k(x, x') = exp(-||x - x'||^2 / (n*C)), with C being the parameter to be modified and n being the number of elements greater than 0 in the input vectors,

where x and x' are the input vectors. For the linear kernel I used the polynomial kernel with degree 1. The linear kernel was also used for feature selection.
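A direct Java sketch of these two kernels (illustrative code, not the exact implementation of the application; in particular, the reading of n as counting the strictly positive components of the two inputs is an assumption):

    // Sketch of the two kernels of section 4.1.3; names are illustrative.
    static double polynomialKernel(double[] x, double[] y, int d) {
        double dot = 0.0;
        for (int i = 0; i < x.length; i++) dot += x[i] * y[i];
        return Math.pow(2.0 * d + dot, d);               // k(x,x') = (2d + <x,x'>)^d
    }

    static double gaussianKernel(double[] x, double[] y, double C) {
        double dist2 = 0.0;
        int n = 0;                                       // elements greater than 0 (assumed reading)
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            dist2 += diff * diff;
            if (x[i] > 0 || y[i] > 0) n++;
        }
        return Math.exp(-dist2 / (Math.max(n, 1) * C));  // k(x,x') = exp(-||x-x'||^2/(n*C))
    }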

4.1.4 LibSvm

Almost all researchers present their results using an implementation of the support vector machine technique called LibSvm, available at [LibSvm]. LibSvm is simple, easy-to-use and efficient software for SVM classification and regression. In almost all cases I try to present the classification results of my application versus the results obtained on the same set using LibSvm. LibSvm is a command-line application and allows the user to select different SVM algorithms, different types of kernels and different parameters for each kernel. LibSvm implements five types of SVM (C-SVM, nu-SVM, one-class SVM, epsilon-SVR and nu-SVR) and has 4 types of kernels (linear, polynomial, Gaussian and sigmoid). The forms of the kernels that we used are:

polynomial: k(x, x') = (gamma*<x, x'> + coef0)^d, where gamma, d and coef0 are variables;

radial basis function (RBF): k(x, x') = exp(-gamma*||x - x'||^2), where gamma is a variable.

The default value of gamma is 1/k, where k is the number of attributes; the default value of d is 3, and the default value of coef0 is 0. I will use for comparisons only C-SVM and one-class SVM, with different parameters: C-SVM for supervised classification and one-class SVM for the unsupervised case (clustering).

4.2 Graphical Interpretation

The developed Java applet offers the possibility of a graphical visualization of the results of the algorithm in a bidimensional space. The applet offers functions such as saving and loading training data files containing points belonging to one or more classes. My application can work with up to 7 classes, but I will present results only for three classes, as the LibSvm application can only work with up to 3 classes. I will present results for different types of kernels and different degrees, for my application and in comparison with LibSvm using the same characteristics. I have called my application UseSVM_toy.

4.2.1 SVM Classification

I have tested the graphical interface for two types of kernels: polynomial and Gaussian. I will present first the results for the polynomial kernel. In order to have an equivalent comparison I have chosen gamma = 1 and coef0 = 2*d in LibSvm.

Figure 4.1 - Polynomial kernel with degree 2. Left - LibSvm; Right - UseSVM_toy

For degree 2, LibSvm has 3 errors in the testing part and UseSVM_toy has 4 errors. The number of support vectors (sv) for LibSvm (7) is smaller than the number of sv for UseSVM_toy (40). Next I will present the results for the degree equal to 4; for this degree both LibSvm and UseSVM_toy work better, and they also work better for degrees greater than 4.
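For reference, the LibSvm side of such a comparison corresponds to a call of the standard svm-train tool of the following form (file names are illustrative; -s 0 selects C-SVM, -t 1 the polynomial kernel, -d the degree, -g gamma and -r coef0):

    svm-train -s 0 -t 1 -d 4 -g 1 -r 8 points.train points.model

Here coef0 = 8 follows the rule coef0 = 2*d for degree 4.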

Figure 4.2 - Polynomial kernel with degree 4.

In the case of the polynomial kernel, if we have mixed data (classes that are very close to each other) or a clump of data surrounded by other data, the results are very poor for small degrees of the kernel. We present these cases in Figure 4.3. We need to use a degree greater than five to obtain good results with LibSvm; for UseSVM_toy the results are good for degrees greater than or equal to 4. Using the default parameters of LibSvm leads to poor results: with degree 2 it can still find the 3 classes, but for higher degrees it groups all the data into a single class. I have noticed that even though many researchers use no bias, the learning process is improved if we use a value of the bias equal to twice the value of the degree. It is logical that the bias and the kernel degree are interconnected, as changing one leads to a shift in the points. I think these parameters need to be connected, because when changing the representation space the bias needs to be modified accordingly in the new space.

Figure 4.3 - Polynomial kernel used for mixed data

The main advantage of the polynomial kernel is that it needs a smaller number of support vectors; as a consequence, the testing time is reduced. However, in order for mixed data to be classified correctly we need a greater kernel degree, which implies a longer training time. In almost all cases we do not know what the data looks like, so we either need to use a greater value of the degree to obtain better results, or we need to try to find the optimal degree.

I will now present the visual results for the Gaussian kernel while varying the constant C. The Gaussian kernel is known to work better than the polynomial kernel on both separated and overlapped data; this is why I will present only the results for overlapped data. One of the disadvantages of the Gaussian kernel is that the number of support vectors is greater than in the case of the polynomial kernel. In order to establish the equivalence between the two applications we need to use gamma = 1/(n*C), where n needs to be 2 (the dimensionality of the applet data).

Figure 4.4 - RBF kernel with degree 0.05 and 2

4.2.2 One-Class SVM

In this section I will present some visual results obtained using clustering based on Support Vector Machine methods. The mathematical part of this method was presented in section 3.3. For clustering I will present results only for the Gaussian kernel, because it obtains the best results in comparison with the other kernels and is consistently used in the literature for clustering. The parameters that need to be modified for this application are C, the constant of the Gaussian kernel, and v, the percentage of data that is initially chosen. To begin with, I will present results using a Java applet that offers the possibility of a graphical visualization of the results; the results obtained using testing files will be presented in a later section. I will present first the influence of the constant C on the number of clusters.

Figure 4.5 - Influence of the constant C on the number of clusters

By modifying the constant C we actually modify the accuracy: a small C leads to smaller and more disconnected clusters. As C decreases, the dimension of the feature space where the data is mapped increases, and the border of the clusters becomes more accurate (it fits the data better). Next I will present the influence of the initially chosen data (considered as a percentage of the data) on the results obtained in clustering.

Figure 4.6 - Influence of the initial percentage of data v on the accuracy of learning

We can see that the percentage of initially chosen data has a substantial influence on the evolution of the algorithm for the same value of the constant C. When we have a large amount of initial data (a great value of v), we obtain one or more large classes together with several smaller classes. When we have a small value of v, we obtain a smaller class and only a small number of classified data. I have also noticed that for values of v greater than 0.5, the algorithm classifies all the data into the same cluster.

4.3 Feature Subset Selection Using SVM

In [Mla02], Mladenic et al. present a method for selecting features based on the linear support vector machine. The authors compare it with more traditional feature selection methods, such as odds ratio and Information Gain, in achieving the desired tradeoff between vector sparseness and classification performance. The results indicate that, at the same level of sparseness, feature selection based on the SVM normal (the hyperplane weight vector) yields better classification performance. First, the authors train the linear SVM on a subset of the training data and retain only those features that correspond to highly weighted components (in the absolute value sense) of the resulting hyperplane separating the positive and negative samples. The reduced feature space is then used to train a classifier over a larger training set, because more documents now fit into the same amount of memory. This idea was also presented in [Dou04]. In [Jeb00][Jeb04] the advantages of using the same method in the feature selection step and in the learning step are explained.

Following this idea, I have used my SVM algorithm with linear kernel in the feature selection step. Thus the text mining step becomes a learning step that trains over all the features (attributes; in my case 19038) and tries to find the hyperplane that splits the positive and the negative samples. The resulting weight vector of the decision function has the same dimension as the feature space. Using this weight vector, we select only the features whose weight exceeds, in absolute value, a specified threshold. The resulting set has a smaller dimension and is then split randomly into the training and testing sets, which are then used in the learning step of the algorithm. For the threshold I used different values starting from 0.1. For the smallest threshold I have obtained 1936 attributes; this number of attributes is reduced to 47 for the largest threshold.
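A minimal sketch of this selection step (illustrative names; w is the weight vector of the linear SVM trained over all attributes):

    // Sketch of feature subset selection with the linear SVM (section 4.3).
    static int[] selectByWeight(double[] w, double threshold) {
        return java.util.stream.IntStream.range(0, w.length)
                .filter(t -> Math.abs(w[t]) > threshold)   // keep terms with |w_t| > threshold
                .toArray();
    }

    // Rebuild a document's characteristic vector over the retained features.
    static double[] project(double[] doc, int[] kept) {
        double[] out = new double[kept.length];            // reduced characteristic vector
        for (int k = 0; k < kept.length; k++) out[k] = doc[kept[k]];
        return out;
    }

Raising the threshold shrinks the retained vocabulary, which is how the attribute counts quoted above were obtained.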
