CAN COMPUTERS LEARN FASTER?
Seyda Ertekin
Computer Science & Engineering, The Pennsylvania State University


ABSTRACT

Ever since computers were invented, mankind has wondered whether they might be made to learn. In the field known as data mining, text categorization has been extensively studied by the machine learning community, as it is a classic example of supervised learning by machines. In supervised learning, computers learn a categorization function which is calculated using the information from the labeled training data provided by the supervisor. They can then classify any unlabeled document into predefined categories. It is known that the learner's approximation of the function improves with the amount of training data supplied to it. However, supplying training data to the learning machine is expensive in terms of time and money, since it is generally done by humans. In this paper, our motivation is to decrease the number of training examples needed by the computers in order to reduce human intervention as much as possible. This paper describes a very efficient method for selecting the training data from the unlabeled data pool to train the classifier machines. We present experimental results showing that with carefully selected training data, the need for labeled data can be significantly reduced while the learner's classification performance is preserved, and even increased in some cases.

1. INTRODUCTION

Machine learning is the study of computer algorithms that automatically improve a machine's ability to act through experience. Applications of machine learning range from design principles for a new generation of robots to intelligent search engines, which have become indispensable tools in almost everybody's daily and professional lives. Engineers who work on designing intelligent machines have turned to machine learning methods because they are more effective and more practical than having to write computer code for every scenario a machine might encounter. In recent years, there has been an explosion in computation and information technologies.
Moreover, the amount of textual information available in electronic format has increased significantly due to the wide use of computers and improved storage facilities. For example, e-businesses are challenged with understanding and finding useful patterns in the terabytes of customer information that they collect. In addition to stored data, the World Wide Web has become an indispensable source of information in people's lives. The vast amount of news data accumulating from the daily online news portals has already attracted the attention of data miners. The need to access the right and relevant information in the shortest time, and the desire to reach organized and categorized text, have increased tremendously. As a consequence, there is increasing interest in machine learning methods that perform various natural language processing tasks in order to make efficient use of these large amounts of information. One way for computers to learn to categorize documents automatically is supervised learning. In supervised learning, it is generally accepted that the almost certain way of improving categorization performance is to increase the size of the training dataset. The reason, in the text domain, is that the sparsity and variety of language make it impossible to construct a training dataset that covers all possible cases.

The primary motivation for active learning comes from the difficulty and cost of obtaining labeled training examples, since those samples have to be labeled by human experts. In some domains, such as industrial process modeling, a single training example may require several days and cost thousands of dollars to generate. In other domains, like text categorization or email filtering, obtaining examples is not expensive, but it may require the user to spend hours on the tedious work of labeling them. In order to avoid the excessive time and money spent on creating labeled training data, active learning is used to find the most informative samples among the available unlabeled data.
In active learning, instead of randomly picking documents and manually labeling them for our training set, we have the option of more carefully choosing (or querying) documents from the unlabeled pool. We can choose our next sample from the unlabeled data pool based upon the answers to our previous queries [9], [7].

Figure 1. Passive Learner

Experimental results show that, by carefully selecting the training data, computers need less training data to learn how to classify documents into predefined categories. Less training data means less time spent by human supervisors on data labeling, and the learning function is calculated faster, since training time is directly proportional to the number of training examples. In the literature, there has been a great deal of research on active learning. For example, Cohn et al. [2] minimize the variance component of the estimated generalization error. Freund et al. [4] employ a committee of classifiers and query a point whenever the committee members disagree. We use a well-known machine learning algorithm, called support vector machines (SVMs), to select the most informative documents for active learning from the unlabeled pool of documents, without knowing their labels. Similar work by Tong et al. [9] also uses SVMs to select queries that minimize the version space size, as we do.

The remainder of the paper is organized as follows. Sections 2 and 3 describe supervised learning and active learning, respectively. Section 4 briefly introduces the SVM algorithm. Section 5 provides the basics of using SVMs for active learning. Experiments on real text data are provided in Section 6. Section 7 discusses the experimental results and future work. We conclude by summarizing the contribution of the paper to the literature, society and industry, with possible applications of the method to emerging applications such as the hierarchically structured sets of categories used by Yahoo and Google.

2. SUPERVISED LEARNING

The automated categorization (or classification) of texts into topical categories has a long history, dating back at least to the early 1960s. In the beginning, the most effective approach to the problem seemed to be manually building automatic classifiers by means of knowledge engineering techniques.
Those techniques consist of manually defining a set of rules encoding expert knowledge on how to classify documents under a given set of categories. However, manually defined rules are generally domain dependent and perform poorly on newly encountered patterns in unseen data. In the 1990s, with the booming production and availability of online documents, automated text categorization witnessed smarter methods such as supervised learning. In supervised learning, a general inductive process, called the learner, automatically builds a classifier by learning the characteristics of the categories from a set of previously classified documents. For this purpose, first, a training dataset is formed by gathering a significant quantity of data that is randomly sampled from the underlying population distribution. In traditional supervised learning, these large numbers of training examples have to be prepared in advance, and the documents' categories are generally labeled manually. With the labeled training data, we use a learner to generate a mapping from documents to topics. We call this mapping a classifier. We can then use the classifier to label new, unseen documents. This methodology is called passive supervised learning [9]. A passive learner (Fig. 1) receives a data set from the world and then outputs a classifier (or model). Often, the most time-consuming and costly task in these applications is the gathering of data. In many cases, we have limited resources for collecting such data. On the other hand, in some cases, such as web page categorization, we can collect the necessary data easily, but it can take a tremendous amount of time to label it. Hence, it is particularly valuable to determine ways in which we can make the most of these resources.

Figure 2. Active Learner

3. ACTIVE LEARNING

Active learning (Fig. 2) differs from passive learning (Fig. 1) in that the learning algorithm itself attempts to select the most informative examples for training.
Since supervised labeling of data is expensive, active learning attempts to reduce the human effort needed to learn an accurate result by selecting only the most informative examples for labeling. In [1], active learning is defined as any form of learning in which the learning algorithm has some control over which part of the input space it receives information from. An active learning strategy therefore allows the learner* to dynamically select training examples, during training, from a candidate set received from the supervisor**. The learner capitalizes on currently attained knowledge to select those examples from the candidate training set that are most likely to solve the problem, or that will lead to a maximum decrease in error. Rather than passively accepting training patterns from the supervisor, the system is allowed to have some deterministic control over which examples to accept, and to guide the search for the most informative patterns. In pool-based active learning for classification [6], the learner has access to a pool of unlabeled data and can request the true class label for a certain number of instances in the pool. In many domains this is a reasonable approach, since a large quantity of unlabeled data is readily available. The main issue in active learning is finding a way to choose good requests, or queries, from the pool.

* The learner can also be thought of as a computer or a machine.
** The supervisor can be thought of as a human who provides labeled training data to the computer.

Figure 3. Possible hyperplanes in the binary classification setting

4. SUPPORT VECTOR MACHINES

We employ Support Vector Machines (SVMs) as our base learning algorithm because of their effectiveness in many learning tasks, particularly those involving text classification [3], [5]. We consider SVMs in the binary classification setting. As can be seen from Figure 3, when two differently labeled data sets are linearly separable, one can find an infinite number of hyperplanes that separate them. SVMs find the hyperplane that separates the training data by a maximal margin (Fig. 4). All vectors lying on one side of the hyperplane are labeled −1 (negative), and all vectors lying on the other side are labeled +1 (positive). The training instances that lie closest to the hyperplane are called support vectors. A formal definition of SVMs is as follows. We are given training data {x_1, ..., x_n} that are vectors in some space X ⊆ R^d, together with their labels {y_1, ..., y_n}, where y_i ∈ {−1, 1}. SVMs allow us to project the original training data in space X to a higher dimensional feature space F via a Mercer kernel operator K. Generally, this transformation is done when the data is linearly inseparable in the original lower dimensional space.

Figure 4. Support vector machine hyperplanes separate the training data by a maximal margin
In other words, we consider the set of classifiers of the form:

    f(x) = Σ_{i=1..n} α_i K(x_i, x)

If f(x) ≥ 0 we classify x as +1; otherwise we classify x as −1. When K satisfies Mercer's condition we can write K(u, v) = Φ(u) · Φ(v), where Φ : X → F and "·" denotes an inner product. We can then rewrite f as:

    f(x) = w · Φ(x),  where  w = Σ_{i=1..n} α_i Φ(x_i)        (1)

In equation (1), w is the normal vector of the hyperplane that separates the data with the maximal margin. The SVM computes the α_i that correspond to the maximal margin hyperplane in F. By choosing different kernel functions we can implicitly project the training data from X into different feature spaces in which we can perform linear classification. (A hyperplane in F maps to a more complex decision boundary in the original space X.) Support vector machine classifiers have met with significant success in numerous real-world classification tasks. We present an algorithm for performing active learning with SVMs, apply it to the text categorization domain, and show that it can significantly reduce the need for training data.

5. ACTIVE LEARNING WITH SVMS

Support vector machines are generally applied using a randomly selected training set classified in advance. The theoretical advantages and empirical success of SVMs make them an attractive choice as a learning method to use with active learning.
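As a concrete illustration of the classifier form above, the following sketch evaluates f(x) = Σ_i α_i K(x_i, x) and takes its sign. It is a minimal sketch, not the paper's implementation; the RBF kernel choice and the toy coefficients are assumptions for illustration only.

```python
import math

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) Mercer kernel: K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def decision(x, support_vectors, alphas, kernel=rbf_kernel):
    """Evaluate f(x) = sum_i alpha_i * K(x_i, x) over the support vectors."""
    return sum(a * kernel(sv, x) for a, sv in zip(alphas, support_vectors))

def classify(x, support_vectors, alphas, kernel=rbf_kernel):
    """Classify x as +1 when f(x) >= 0, and as -1 otherwise."""
    return 1 if decision(x, support_vectors, alphas, kernel) >= 0 else -1
```

For example, with two toy support vectors at 0 and 2 on the real line and coefficients +1 and −1, points near 0 fall on the positive side and points near 2 on the negative side.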

Given an unlabeled pool U, an active learner has three components: (f, q, X). The first component is a classifier, f : X → {−1, +1}, trained on the current set of labeled data X. The second component, q(X), is the querying function that, given a current labeled set X, decides which instance in U to query next. The active learner can return a classifier f after each query or after some fixed number of queries. This brings us to the issue of how to choose the next unlabeled instance to query. We perform our experiments in three settings:

Random pick: This method simply chooses the next query point at random from the unlabeled pool. It reflects what happens in the regular passive learning setting.

Simple margin active learning: This method chooses the document closest to the current hyperplane and asks for its label. Clearly, selecting unlabeled instances far away from the hyperplane is not helpful, since their class memberships are certain. The most informative instances for refining the hyperplane are the unlabeled instances near the hyperplane, within the margin. The documents that the simple margin algorithm focuses on lie in the yellow region (band) in Figure 4.

Simple random active learning: This final method selects the next query in the same way as the simple margin method. However, instead of examining the whole unlabeled pool to see which instance is closest to the hyperplane, it examines only a small constant number of randomly chosen examples. The randomized search first samples M training examples and selects the best one among them. We choose M = 50 in our experiments because it can be proved mathematically that the best among 50 randomly drawn training examples has a 95% chance of belonging to the best 5% of examples in the whole training set. The proof is beyond the scope of this paper; it can be found in [8].

6. EXPERIMENTS

The empirical evaluation is done on a collection of news stories, the Reuters corpus, which is the standard real-world benchmark for natural language processing, information retrieval and machine learning systems.
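The three query settings described in Section 5 can be sketched as follows. This is a minimal Python sketch, not the paper's implementation; the `distance` argument is assumed to be supplied by the learner and stands for the signed distance of a document's vector to the current SVM hyperplane.

```python
import random

def random_pick(pool, rng=random):
    """Random pick: choose the next query uniformly at random (passive baseline)."""
    return rng.choice(pool)

def simple_margin(pool, distance):
    """Simple margin: query the unlabeled instance closest to the hyperplane."""
    return min(pool, key=lambda x: abs(distance(x)))

def simple_random(pool, distance, m=50, rng=random):
    """Simple random: examine only M randomly sampled candidates and
    pick the one closest to the hyperplane among them."""
    candidates = rng.sample(pool, min(m, len(pool)))
    return min(candidates, key=lambda x: abs(distance(x)))
```

In use, the query returned by one of these functions would be labeled by the supervisor, added to the training set, and the SVM retrained before the next query is issued.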
The first step in text categorization is to transform documents from strings of characters into a representation suitable for the learning algorithm and the classification task. Information retrieval research suggests that word stems work well as representation units. After preprocessing, the training corpus contains around 9000 distinct terms. This number corresponds to the dimension of the original training data space X mentioned in Section 4. Each document is represented as a vector in this space, with each distinct word it contains as a vector component. Active learning is used in the context of text classification throughout our experiments. The Reuters dataset consists of text documents which are already categorized into predefined categories. We act as if we do not know their labels, perform automatic text classification with SVMs using active learning, and compare our results with the actual labels of the documents. In each experiment, we consider the documents of one category as the positive class and all remaining documents as the negative class, which gives a binary classification setting for the SVMs. The dataset contains 9603 training and 3299 test documents. We start with ten labeled documents, 5 positive and 5 negative. This implies that our unlabeled data pool initially consists of 9593 documents. This number goes down gradually as documents from the pool are labeled and added to the training data. At each step, a query is selected from the unlabeled data pool according to the methods described in Section 5. This query is included in the training set with its label. Then the model is built with the SVM using the current training set, followed by the prediction step, in which the labels of the 3299 test documents are predicted by the classifier. By comparing the predicted labels with the actual labels we obtain classification accuracy values. An accuracy measure called the Precision-Recall Breakeven Point (PRBEP) is used throughout the experiments.
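The document representation described above can be sketched as a simple bag-of-words vectorizer. This sketch makes simplifying assumptions (whitespace tokenization and no stemming or stop-word removal, unlike the stem-based preprocessing used in the experiments):

```python
def build_vocabulary(docs):
    """Assign each distinct term one dimension of the document space X."""
    vocab = {}
    for doc in docs:
        for term in doc.lower().split():
            vocab.setdefault(term, len(vocab))
    return vocab

def to_vector(doc, vocab):
    """Represent a document as a vector of term counts in the vocabulary space."""
    vec = [0] * len(vocab)
    for term in doc.lower().split():
        if term in vocab:
            vec[vocab[term]] += 1
    return vec
```

Two toy documents with five distinct terms between them give a five-dimensional space; in the experiments above, the corresponding dimension is around 9000.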
This performance measure is chosen because it is well suited to cases where the positive and negative sets are unbalanced in size. The accuracy results for four of the ten most populated categories of the Reuters dataset are presented in Figure 5. The random pick and simple random curves are averages over ten runs. Our main interest is in the curves belonging to the active learning methods; the random pick method just reflects what happens in the regular passive learning setting. The curves reached their peak values with the simple margin active learning method after labeling only 250 documents on average. With the simple random active learning method, the peak values are reached with 350 labeled documents on average. In practice this means that users only have to label 250 documents with the simple margin method, or 350 documents with the simple random method, instead of 9603. The simple random method is evidently faster than the simple margin method: its curve reaches its peak value 6 times faster than the simple margin curve. Since the simple random method does not have to examine every document in the unlabeled pool but deals with only 50 documents, its computational complexity is much lower.
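The breakeven point used above can be computed by ranking the test documents by classifier score and taking precision at the cutoff equal to the number of true positives, where precision and recall coincide by construction. This is a standard formulation; the exact procedure is not spelled out in the text, so the sketch below is an assumption.

```python
def prbep(scores, labels):
    """Precision-Recall Breakeven Point.

    Rank documents by decreasing classifier score and take precision at
    the cutoff k = number of positives; there precision (hits / k) and
    recall (hits / n_pos) are equal, since k == n_pos.
    """
    n_pos = sum(1 for y in labels if y == 1)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = sum(1 for _, y in ranked[:n_pos] if y == 1)
    return hits / n_pos
```

A perfect ranking (all positives scored above all negatives) gives a breakeven point of 1.0; a ranking that places only half of the positives in the top positions gives 0.5.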

7. DISCUSSIONS

Figure 5. Classification accuracy values in percentages for categories a) Grain, b) Money-fx, c) Trade, d) Crude

As we can see in Figure 5, in the active learning setups, after a certain number of labeled training examples have been added to the training set, the accuracy curves saturate at some value. In other words, adding more training data does not increase the accuracy of the classifier after some point. The reason is that, as the training set grows, the learner's knowledge about large regions of the input space becomes more and more confident, so additional samples from these regions are basically redundant; hence they do not contribute considerably to an improvement in its generalization ability. With the active learning methods, the most informative documents are already added to the training set in the earlier steps; the remaining documents in the pool are not useful, since they do not give extra information to the system. In other words, their addition to the training set does not change the already produced hyperplane, which is a well-built separator between the two classes. With the random pick method, since candidate documents for labeling are not selected wisely, it is not guaranteed that the informative instances will be added first. Therefore, in this case, the generalization ability of the classifier tends to increase more slowly as new data arrives. In the previous section, we also observed an unusual phenomenon in some of the learning curves. When training examples were added at random, generalization increased monotonically until all available examples were added. When training examples were added by the active learning method, generalization peaked at a level above the point achieved by the learner when all data had finally been added. We showed that, for some categories (Fig. 5a and 5b), better performance can be achieved from a small subset of the data than can be achieved using all available data.
Based on our observations, this situation occurs if either or both of the positive and negative sets in the classification are noisy. Noise, in classification systems, corresponds to falsely labeled documents. The peak condition in the curves will be investigated in detail in our future work. Future work can also involve determining, at run time, the point at which this peak performance is achieved, so that the system automatically stops adding examples. In this way, we can get higher classification accuracy by labeling and using only the most informative documents. We believe that the informative documents lie between the support vectors of each data set, which is represented as the yellow area in Figure 4. Thus, if we can detect the moment when the documents between the current support vectors are exhausted, we can make the system stop searching for new queries. Since documents are represented as vectors in the text domain, we can compare a newly selected document's coordinates and its distance to the hyperplane with those of the existing support vectors.
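The distance comparison just described suggests a simple stopping test on distances to the hyperplane. The sketch below is a hypothetical heuristic illustrating the future-work idea, not something implemented in the paper:

```python
def should_stop(pool_distances, sv_distances):
    """Heuristic stopping test: stop querying when no unlabeled document
    lies closer to the hyperplane than the farthest current support
    vector, i.e. the band between the support vectors is exhausted."""
    band = max(abs(d) for d in sv_distances)
    return all(abs(d) > band for d in pool_distances)
```

For instance, if the current support vectors lie at distance 1 from the hyperplane, the test fires only once every remaining pool document is farther than that.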

Considering the running times of the two active learning methods, the simple random method has an evident superiority over the simple margin method. Since it does not have to examine the whole unlabeled data pool to select a query, it seems to be a very promising active learning method for very large databases.

8. CONCLUSIONS

Collecting training data in machine learning can be difficult, expensive and time consuming. This paper advocates an approach which facilitates collecting training data. With the help of active learning, classifiers can work with less training data without any loss of accuracy. Furthermore, computers learn faster with less, but highly informative, training data. It is also worthwhile to highlight several contributions of the active learning method described in this paper. The simple margin and simple random results imply that active learning decreases the number of training documents needed by the learner machine. Considering time issues, the advantage of using less training data is twofold: first, the need for manual labeling of documents decreases, so the effort and money used for this purpose can be channeled in other directions; second, the training times of the classifier decrease, since the training time depends heavily on the number of training instances. Faster machines are always desired in every part of life. Besides, this speed increase makes it possible to deal with very large databases, which would otherwise be impossible due to the computation times. Classical active learning methods consider all of the unlabeled documents in the pool; there is a prevalent belief that all the unlabeled documents must be evaluated by the active learning process in order to select the best document to ask the user to label. The simple random method shows the contrary: regardless of the size of the whole unlabeled training data pool, at each iteration of the active learning step we can randomly select a small group of instances and find the best document to ask about within this small group. This also makes active learning applicable to very large databases.
Most importantly, it can be used in emerging applications such as the hierarchically structured sets of categories provided by companies like Yahoo and Google. Those search engine companies also work on producing directory structures for the whole World Wide Web. Their crawlers download millions of web pages to their servers every day. With the method presented in this paper, they can create the training set for their classifiers by labeling only a small number of web pages. Once they create a reliable classifier, the rest of the Web can be categorized into the predefined categories. In this paper, active learning algorithms are discussed in the context of text categorization. Active learning is not restricted to the text domain; it can be applied to other domains as well. Image retrieval (or categorization), handwritten digit recognition, protein classification and recommendation systems are just a few examples of areas where active learning can be used. Experiments on text categorization indicate that the active learning method can be highly effective in making computers learn faster.

9. ACKNOWLEDGEMENTS

Special thanks go to my advisor, Prof. Lee Giles, for his advice and helpful comments throughout my studies.

REFERENCES

[1] Cohn, D. A., Atlas, L., Ladner, R., Improving generalization with active learning, Machine Learning, 15, 1994.
[2] Cohn, D. A., Ghahramani, Z., Jordan, M. I., Active learning with statistical models, Journal of Artificial Intelligence Research, 4, 1996.
[3] Dumais, S. T., Platt, J., Heckerman, D., Sahami, M., Inductive learning algorithms and representations for text categorization, In Proceedings of the Seventh International Conference on Information and Knowledge Management, ACM Press, 1998.
[4] Freund, Y., Seung, H. S., Shamir, E., Tishby, N., Selective sampling using the query by committee algorithm, Machine Learning, 28, 1997.
[5] Joachims, T., Text categorization with support vector machines, In Proceedings of the European Conference on Machine Learning, Springer-Verlag, 1998.
[6] Lewis, D., Gale, W., A sequential algorithm for training text classifiers, In Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Springer-Verlag, 1994.
[7] Schohn, G., Cohn, D., Less is more: Active learning with support vector machines, In Proceedings of the International Conference on Machine Learning.
[8] Scholkopf, B., Smola, A. J., Learning with Kernels, MIT Press, Cambridge, MA.
[9] Tong, S., Koller, D., Support vector machine active learning with applications to text classification, Journal of Machine Learning Research, 2001.

The Research of Support Vector Machine in Agricultural Data Classification

The Research of Support Vector Machine in Agricultural Data Classification The Research of Support Vector Machne n Agrcultural Data Classfcaton Le Sh, Qguo Duan, Xnmng Ma, Me Weng College of Informaton and Management Scence, HeNan Agrcultural Unversty, Zhengzhou 45000 Chna Zhengzhou

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

Incremental Learning with Support Vector Machines and Fuzzy Set Theory

Incremental Learning with Support Vector Machines and Fuzzy Set Theory The 25th Workshop on Combnatoral Mathematcs and Computaton Theory Incremental Learnng wth Support Vector Machnes and Fuzzy Set Theory Yu-Mng Chuang 1 and Cha-Hwa Ln 2* 1 Department of Computer Scence and

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET 1 BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET TZU-CHENG CHUANG School of Electrcal and Computer Engneerng, Purdue Unversty, West Lafayette, Indana 47907 SAUL B. GELFAND School

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Decson surface s a hyperplane (lne n 2D) n feature space (smlar to the Perceptron) Arguably, the most mportant recent dscovery n machne learnng In a nutshell: map the data to a predetermned

More information

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law)

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law) Machne Learnng Support Vector Machnes (contans materal adapted from talks by Constantn F. Alfers & Ioanns Tsamardnos, and Martn Law) Bryan Pardo, Machne Learnng: EECS 349 Fall 2014 Support Vector Machnes

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines A Modfed Medan Flter for the Removal of Impulse Nose Based on the Support Vector Machnes H. GOMEZ-MORENO, S. MALDONADO-BASCON, F. LOPEZ-FERRERAS, M. UTRILLA- MANSO AND P. GIL-JIMENEZ Departamento de Teoría

More information

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification Introducton to Artfcal Intellgence V22.0472-001 Fall 2009 Lecture 24: Nearest-Neghbors & Support Vector Machnes Rob Fergus Dept of Computer Scence, Courant Insttute, NYU Sldes from Danel Yeung, John DeNero

More information

Deep Classification in Large-scale Text Hierarchies

Deep Classification in Large-scale Text Hierarchies Deep Classfcaton n Large-scale Text Herarches Gu-Rong Xue Dkan Xng Qang Yang 2 Yong Yu Dept. of Computer Scence and Engneerng Shangha Jao-Tong Unversty {grxue, dkxng, yyu}@apex.sjtu.edu.cn 2 Hong Kong

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Query Clustering Using a Hybrid Query Similarity Measure

Query Clustering Using a Hybrid Query Similarity Measure Query clusterng usng a hybrd query smlarty measure Fu. L., Goh, D.H., & Foo, S. (2004). WSEAS Transacton on Computers, 3(3), 700-705. Query Clusterng Usng a Hybrd Query Smlarty Measure Ln Fu, Don Hoe-Lan

More information

Classification / Regression Support Vector Machines
Jeff Howbert, Introduction to Machine Learning, Winter 04. Topics: SVM classifiers for linearly separable classes; SVM classifiers for non-linearly separable classes; SVM

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION
3.1 INTRODUCTION. The raw microarray data is basically an image with different colors indicating hybridization (Xue

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers
IOSR Journal of Electronics and Communication Engineering (IOSR-JECE), e-ISSN: 2278-2834, p-ISSN: 2278-8735, Volume 9, Issue, Ver. IV (Mar.-Apr. 2014)

Machine Learning 9. week
Mapping Concept, Radial Basis Functions (RBF), RBF Networks. Mapping: It is probably the best scenario for the classification of two datasets to separate them linearly. As you see in the below

Meta-heuristics for Multidimensional Knapsack Problems
2012 4th International Conference on Computer Research and Development, IPCSIT vol. 39 (2012), IACSIT Press, Singapore. Zhibao Man, Computer Science Department,

Using Neural Networks and Support Vector Machines in Data Mining
RICHARD A. WASNIOWSKI, Computer Science Department, California State University Dominguez Hills, Carson, CA 90747, USA. Abstract: Multivariate data analysis

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example. 4/14/2011. CS 376 Lecture 22
Discriminative classifiers for image recognition. Wednesday, April 13. Kristen Grauman, UT-Austin. Last time: window-based generic object detection, basic pipeline, face detection with boosting as case study. Today:

User Authentication Based On Behavioral Mouse Dynamics Biometrics
Chee-Hyung Yoon, Daniel Donghyun Kim. Department of Computer Science, Stanford University, Stanford, CA

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision
Javier Civera, University of Zaragoza; Andrew J. Davison, Imperial College London; J.M.M. Montiel, University of Zaragoza. josemari@unizar.es, jcivera@unizar.es,

GSLM Operations Research II Fall 13/14
GSLM 58 Operations Research II, Fall 13/14. 6. Separable Programming. Consider a general NLP: min f(x) s.t. g_j(x) <= b_j, j = 1, ..., m. Definition 6.1. The NLP is a separable program if its objective function and all constraints are

Keywords - Web page classification; bag of words model; topic model; hierarchical classification; Support Vector Machines
(IJCSIS) International Journal of Computer Science and Information Security. Hierarchical Web Page Classification Based on a Topic Model and Neighboring Pages Integration. Wongkot Sriurai, Phayung Meesad, Choochart Haruechaiyasak

Mathematics 256 a course in differential equations for engineering students
Chapter 5. More efficient methods of numerical solution. Euler's method is quite inefficient. Because the error is essentially proportional to the

For instance, ...; the five basic number-sets are increasingly more inclusive: A ⊆ B & B ⊆ A ⟺ A = B (1)
Section 1.2 Subsets and the Boolean operations on sets. If every element of the set A is an element of the set B, we say that A is a subset of B, or that A is contained in B, or that B contains A, and we write A
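The subset law excerpted above, that mutual inclusion (A ⊆ B and B ⊆ A) forces A = B, can be illustrated with a short, self-contained Python sketch; the sets below are made-up examples, not taken from the excerpted course notes.

```python
# Mutual inclusion implies equality: A ⊆ B and B ⊆ A  ⟺  A == B.
A = {0, 1, 2}
B = {n for n in range(3)}      # the same three elements, built differently

assert A <= B and B <= A       # A ⊆ B and B ⊆ A ...
assert A == B                  # ... hence A = B

# Python's set operators mirror the Boolean operations on sets:
union = A | {3}                # A ∪ {3}  → {0, 1, 2, 3}
intersection = A & {1, 2, 3}   # A ∩ {1, 2, 3}  → {1, 2}
difference = A - {0}           # A \ {0}  → {1, 2}
print(union, intersection, difference)
```

Here `<=` is Python's subset test (`issubset`), so the equality check in the assertions is exactly the two-sided inclusion in equation (1).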

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters
Hakan S. KUTOGLU, Turkey. Key words: coordinate systems; transformation; estimation; reliability. SUMMARY: Advances in technologies and

Three supervised learning methods on pen digits character recognition dataset
Chris Fleizach, Department of Computer Science and Engineering, University of California, San Diego, San Diego, CA 92093. cfleizac@cs.ucsd.edu. Satoru

Classifier Selection Based on Data Complexity Measures *
Edith Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trinidad. National Institute for Astrophysics, Optics and Electronics, Luis Enrique Erro No. 1, Sta.

Announcements. Supervised Learning
See Chapter 5 of Duda, Hart, and Stork. Tutorial by Burges linked to on web page. Supervised Learning: classification with labeled examples. Images: vectors in high-d space. Labeled examples

Fast Feature Value Searching for Face Detection
Vol., No. 2, Computer and Information Science. Yunyang Yan, Department of Computer Engineering, Huaiyin Institute of Technology, Huai'an 22300, China. E-mail: areyyyke@163.com

CLASSIFICATION OF ULTRASONIC SIGNALS
The 8th International Conference of the Slovenian Society for Non-Destructive Testing, »Application of Contemporary Non-Destructive Testing in Engineering«, September 1-3, 2005, Portorož, Slovenia, pp. 7-33.

Performance Evaluation of Information Retrieval Systems
Why System Evaluation? Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE), who in turn adapted them from Prof. Dik Lee (Univ. of Science

Compiler Design. Spring 2014. Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz
Compiler Design, Spring 2014: Register Allocation, Sample Exercises and Solutions. Prof. Pedro C. Diniz, USC / Information Sciences Institute, 4676 Admiralty Way, Suite 1001, Marina del Rey, California 90292. pedro@isi.edu. Register

Improvement of Spatial Resolution Using Block-Matching Based Motion Estimation and Frame Integration
Daniya Suga and Takayuki Hamamoto, Graduate School of Engineering, Tokyo University of Science, 6-3-1, Niijuku, Katsushika-ku,

FEATURE EXTRACTION. Dr. K. Vijayarekha. Associate Dean, School of Electrical and Electronics Engineering, SASTRA University, Thanjavur
Dr. K. Vijayarekha, Associate Dean, School of Electrical and Electronics Engineering, SASTRA University, Thanjavur 613 41. Joint Initiative of IITs and IISc, Funded by MHRD. Table of Contents

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour
Problem 1. We reduce vertex cover to MAX-SAT with weights, such that the

A Statistical Model Selection Strategy Applied to Neural Networks
Joaquín Pizarro, Elsa Guerrero, Pedro L. Galindo. joaquin.pizarro@uca.es, elsa.guerrero@uca.es, pedro.galindo@uca.es. Dpto. Lenguajes y Sistemas Informáticos

Collaboratively Regularized Nearest Points for Set Based Recognition
Academic Center for Computing and Media Studies, Kyoto University. Yang Wu, Michihiko Minoh, Masayuki Mukunoki, Kyoto University. 9/1/2013, BMVC 2013 @ Bristol,

Intelligent Information Acquisition for Improved Clustering
Duy Vu, University of Texas at Austin, duyvu@cs.utexas.edu. Mikhail Bilenko, Microsoft Research, mbilenko@microsoft.com. Prem Melville, IBM T.J. Watson Research Center

Multi-Criteria-based Active Learning for Named Entity Recognition
Dan Shen, Jie Zhang, Jian Su, Guodong Zhou, Chew-Lim Tan. Institute for Infocomm Technology, 21 Heng Mui Keng Terrace, Singapore 119613. Department of Computer

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique
Outline and Reading: The Greedy Method Technique, Fractional Knapsack Problem, Task Scheduling, Minimum Spanning Trees. Change Money Problem. Greedy

Efficient Text Classification by Weighted Proximal SVM *
Dong Zhuang, Benyu Zhang, Qiang Yang, Jun Yan, Zheng Chen, Ying Chen. Computer Science and Engineering, Beijing Institute of Technology, Beijing 100081, China

CS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets, Jure Leskovec, Stanford University. http://cs246.stanford.edu. 2/19/2013. Perceptron: y = sgn(w · x). How to find
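The perceptron rule quoted in the entry above, y = sgn(w · x), can be sketched in a few lines of Python with the classic mistake-driven update; the weights, learning rate, and toy data here are illustrative placeholders, not material from the CS246 slides.

```python
# Minimal perceptron: predict with y = sgn(w · x), update only on mistakes.
def sgn(v):
    return 1 if v >= 0 else -1

def predict(w, x):
    return sgn(sum(wi * xi for wi, xi in zip(w, x)))

def train(data, dim, epochs=10, eta=1.0):
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            if predict(w, x) != y:                      # mistake-driven update
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

# Toy linearly separable data; each x ends with a constant 1 for the bias term.
data = [([1, 2, 1], 1), ([2, 3, 1], 1), ([-1, -2, 1], -1), ([-2, -1, 1], -1)]
w = train(data, dim=3)
print([predict(w, x) for x, _ in data])                 # → [1, 1, -1, -1]
```

Because the data are linearly separable, the updates stop once every training point sits on the correct side of the hyperplane w · x = 0.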

Adaptive Transfer Learning
Bin Cao, Sinno Jialin Pan, Yu Zhang, Dit-Yan Yeung, Qiang Yang. Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong. {caobin, sinnopan, zhangyu, dyyeung, qyang}@cse.ust.hk

Related-Mode Attacks on CTR Encryption Mode
International Journal of Network Security, Vol. 4, No. 3, pp. 282-287, May 2007. Dayin Wang, Dongdai Lin, and Wenling Wu (Corresponding author: Dayin Wang). Key Laboratory

A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION
THE PUBLISHING HOUSE OF THE ROMANIAN ACADEMY. Proceedings of the Romanian Academy, Series A, Volume 4, Number 2/2003, pp. 000-000. Tudor BARBU, Institute

Parallel matrix-vector multiplication
Appendix A. The reduced transition matrix of the three-dimensional cage model for gel electrophoresis, described in Section 3.2, becomes excessively large for polymer lengths more

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z.
Muradaliyev A.Z., Azerbaijan Scientific-Research and Design-Prospecting Institute of Energetic, AZ1012, Ave. H.Zardabi-94. E-mail: aydin_murad@yahoo.com. Importance of

Advanced in Control Engineering and Information Science
Available online at www.sciencedirect.com. Procedia Engineering 15 (2011) 1642-1646. www.elsevier.com/locate/procedia

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields
17th European Symposium on Computer Aided Process Engineering, ESCAPE17. V. Plesu and P.S. Agachi (Editors). 2007 Elsevier B.V. All rights reserved.

Private Information Retrieval (PIR)
Levente Buttyán. Problem formulation: Alice wants to obtain information from a database, but she does not want the database to learn which information she wanted; e.g., Alice is an investor querying a stock-market

CS434a/541a: Pattern Recognition. Prof. Olga Veksler. Lecture 15
Today, new topic: Unsupervised Learning. Supervised vs. unsupervised learning; unsupervised learning. Next time: parametric unsupervised learning. Today: nonparametric

S1 Note. Basis functions.
Contents: types of basis functions; the Fourier basis; B-spline basis; power and type I error rates with different numbers of basis functions. Table S1. Simulation results of type

Hermite Splines in Lie Groups as Products of Geodesics
Ethan Eade. Updated May 28, 2017. 1 Introduction. 1.1 Goal. This document defines a curve in the Lie group G parametrized by time and by structural parameters in the

Learning to Classify Documents with Only a Small Positive Training Set
Xiao-Li Li, Bing Liu, and See-Kiong Ng. Institute for Infocomm Research, Heng Mui Keng Terrace, 119613, Singapore; Department of Computer

Support Vector Machines. CS534 - Machine Learning
Perceptron Revisited: Linear Separators. Binary classification can be viewed as the task of separating classes in feature space: f(x) = sgn(w · x + b), with w · x + b > 0 on one side of the separator w · x + b = 0 and w · x + b < 0 on the other. Linear Separators

Problem Set 3 Solutions
Introduction to Algorithms, October 4, 2002. Massachusetts Institute of Technology, 6.046J/18.410J. Professors Erik Demaine and Shafi Goldwasser. Handout 14. Problem Set 3 Solutions (Exercises were not to be turned in,

Deep Classifier: Automatically Categorizing Search Results into Large-Scale Hierarchies
Dikan Xing, Gui-Rong Xue, Qiang Yang, Yong Yu. Shanghai Jiao Tong University, Shanghai, China. {xiaobao, grxue, yyu}@apex.sjtu.edu.cn

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION
SHI-LIANG SUN, HONG-LEI SHI. Department of Computer Science and Technology, East China Normal University, 500 Dongchuan Road, Shanghai 200241, P. R. China. E-mail: slsun@cs.ecnu.edu.cn,

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data
Malaysian Journal of Mathematical Sciences 11(S) April: 35-46 (2017). Special Issue: The 2nd International Conference and Workshop on Mathematical Analysis (ICWOMA 2016)

An Anti-Noise Text Categorization Method based on Support Vector Machines *
Chen Lin, Huang Jie and Gong Zheng-Hu. School of Computer Science, National University of Defense Technology, Changsha, 410073, China. chenlin@nudt.edu.cn,

A Taxonomy Fuzzy Filtering Approach
JOURNAL OF AUTOMATIC CONTROL, UNIVERSITY OF BELGRADE, VOL. 13(1):25-29, 2003. S. Vrettos and A. Stafylopatis. Abstract: Our work proposes the use of topic taxonomies as part

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance
Chong Long, Minlie Huang, Xiaoyan Zhu. State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for

Wishing you all a Total Quality New Year!
Total Quality Management and Six Sigma, Post Graduate Program 2014-15, Session 4. Vinay Kumar Kalakbandi, Assistant Professor, Operations & Systems Area. Wishing you all a Total Quality New Year! Hope you achieve Six Sigma

CSCI 5417 Information Retrieval Systems Jim Martin!
Lecture 11, 9/29/2011. Today: classification, Naïve Bayes classification, unigram LM. Where we are: basics of ad hoc retrieval, indexing, term weighting/scoring, cosine

An Entropy-Based Approach to Integrated Information Needs Assessment
Distribution Statement A: Approved for public release; distribution is unlimited. June 8, 2004. William J. Farrell, Lockheed Martin Advanced Technology

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach
D.R. Ramesh Babu, Piyush M. Kumat, Mahesh D. Dhannawat. PES Institute of Technology Research

A Novel Term_Class Relevance Measure for Text Categorization
D.S. Guru, Mahamad Suhil. Department of Studies in Computer Science, University of Mysore, Mysore, India. Abstract: In this paper, we introduce a new measure

A Background Subtraction for a Vision-based User Interface *
Dongpyo Hong and Woontack Woo. KJIST U-VR Lab. {dhon, wwoo}@kjist.ac.kr. Abstract: In this paper, we propose a robust and efficient background subtraction

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization
B. Liu, Q. Chen and Q. Zhang, J.J. Liang, P.N. Suganthan, B.Y. Qu. Department of Computing, Glyndwr University, UK. Faculty

Learning-Based Top-N Selection Query Evaluation over Relational Databases
Liang Zhu, Weiyi Meng. School of Mathematics and Computer Science, Hebei University, Baoding, Hebei 071002, China, zhu@mail.hbu.edu.cn

Random Kernel Perceptron on ATTiny2313 Microcontroller
Nemanja Djuric, Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA. nemanja.djuric@temple.edu. Slobodan Vucetic, Department

General Vector Machine. Hong Zhao Department of Physics, Xiamen University
Hong Zhao (zhaoh@xmu.edu.cn), Department of Physics, Xiamen University. The support vector machine (SVM) is an important class of learning machines for function approach, pattern recognition, and

UB at GeoCLEF 2006. Department of Geography. Abstract
Miguel E. Ruiz (1), Stuart Shapiro (2), June Abbas (1), Silvia B. Southwick (1) and David Mark (3). State University of New York at Buffalo: (1) Department of Library and Information Studies, (2) Department

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data
Borko Furht and Pornvit Saksobhavivat. NSF Multimedia Laboratory, Florida Atlantic University, Boca Raton, Florida 3343. ABSTRACT: In this paper,

Feature Selection as an Improving Step for Decision Tree Construction
2009 International Conference on Machine Learning and Computing, IPCSIT vol. 3 (2011), IACSIT Press, Singapore. Mahdi Esmaeili, Fazekas Gabor

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK
Li-qing Qiu, Yong-quan Liang, Jing Chen. College of Information Science and Technology, Shandong University of Science and Technology,

Detection of hand grasping an object from complex background based on machine learning co-occurrence of local image feature
Shinya Morioka, Yasuhiro Hiramoto, Nobutaka Shimada, Tadashi Matsuo, Yoshiaki Shirai. Ritsumeikan

CSE 326: Data Structures Quicksort Comparison Sorting Bound
Steve Seitz, Winter 2009. Quicksort uses a divide and conquer strategy, but does not require the O(N) extra space that MergeSort does. Here is the
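The claim in the entry above, that quicksort divides and conquers without MergeSort's O(N) auxiliary array, can be sketched with an in-place partition; this is a generic illustration (Hoare-style partition, middle-element pivot), not the CSE 326 course code.

```python
# In-place quicksort: divide and conquer with only O(log N) expected extra
# space (the recursion stack), unlike MergeSort's O(N) auxiliary array.
def quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return a
    pivot = a[(lo + hi) // 2]
    i, j = lo, hi
    while i <= j:                        # Hoare-style partition, in place
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]      # swap elements on the wrong side
            i += 1
            j -= 1
    quicksort(a, lo, j)                  # conquer the left part ...
    quicksort(a, i, hi)                  # ... and the right part
    return a

print(quicksort([5, 3, 8, 1, 9, 2]))     # → [1, 2, 3, 5, 8, 9]
```

The list is rearranged in place around the pivot, so the only memory beyond the input is the recursion stack, which stays logarithmic on average.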

3D vector computer graphics
Paolo Varagnolo: freelance engineer. Padova, April 2016. Private practice. 1. Introduction. Vector 3D model representation in computer graphics requires

Face Detection with Deep Learning
Yu Shen (yus122@ucsd.edu, A13227146), Kuan-Wei Chen (kuc010@ucsd.edu, A99045121), Yizhou Hao (y3hao@ucsd.edu, A98017773), Min Hsuan Wu (mhwu@ucsd.edu, A92424998). Abstract: The project here

Complex Numbers. Now we also saw that if a and b were both positive then √(ab) = √a·√b. For a second let's forget that restriction and do the following.
The last topic in this section is not really related to most of what we've done in this chapter, although it is somewhat related to the radicals section as we will see. We also won't need the material

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION
Paulo Quintiliano & Antonio Santa-Rosa. Federal Police Department, Brasília, Brazil. E-mails: quintiliano.pqs@dpf.gov.br and

Reliable Negative Extracting Based on kNN for Learning from Positive and Unlabeled Examples
JOURNAL OF COMPUTERS, VOL. 4, NO. 1, JANUARY 2009, p. 94. Bangzuo Zhang, College of Computer Science and Technology, Jilin University,

Smoothing Spline ANOVA for variable screening
A useful tool for metamodels training and multi-objective optimization. L. Ricco, E. Rigoni, A. Turco. Outline: RSM introduction, possible coupling, test case, MOO, MOO with Game Theory

Wavelets and Support Vector Machines for Texture Classification
Kashif Mahmood Rajpoot, Faculty of Computer Science & Engineering, Ghulam Ishaq Khan Institute, Topi, PAKISTAN, kmr@giki.edu.pk. Nasir Mahmood Rajpoot, Department

On Some Entertaining Applications of the Concept of Set in Computer Science Course
Krasimir Yordzhev, Hristina Kostadinova. Associate Professor Krasimir Yordzhev, Ph.D., Faculty of Mathematics and Natural Sciences,

Array transposition in CUDA shared memory
Mike Giles, February 19, 2014. Abstract: This short note is inspired by some code written by Jeremy Appleyard for the transposition of data through shared memory. I had some

Cordial and 3-Equitable Labeling for Some Star Related Graphs
International Mathematical Forum, 4, 2009, no. 31, 1543-1553. S.K. Vaidya, Department of Mathematics, Saurashtra University, Rajkot - 360005, Gujarat,

Web Spam Detection Using Multiple Kernels in Twin Support Vector Machine
Seyed Hamid Reza Mohammadi, Mohammad Ali Zare Chahooki. Yazd University, Yazd, Iran. mohammad_6468@stu.yazd.ac.ir, chahooki@yazd.ac.ir. ABSTRACT: Search

Optimizing Document Scoring for Query Retrieval
Brent Ellwein, baellwe@cs.stanford.edu. Abstract: The goal of this project was to automate the process of tuning a document query engine. Specifically, I used machine learning

A Lazy Ensemble Learning Method to Classification
IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 5, September 2010. ISSN (Online): 1694-0814, p. 344. Haleh Homayouni, Sattar Hashemi and Ali