Multi-Criteria-based Active Learning for Named Entity Recognition


Dan Shen¹, Jie Zhang, Jian Su, Guodong Zhou, Chew-Lim Tan
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore
Department of Computer Science, National University of Singapore, 3 Science Drive 2, Singapore
{shendan,zhangjie,sujian,zhougd}@i2r.a-star.edu.sg
{shendan,zhangjie,tancl}@comp.nus.edu.sg

Abstract

In this paper, we propose a multi-criteria-based active learning approach and effectively apply it to named entity recognition. Active learning aims to minimize human annotation effort by selecting the examples to be labeled. To maximize the contribution of the selected examples, we consider multiple criteria: informativeness, representativeness and diversity, and we propose measures to quantify them. More comprehensively, we incorporate all the criteria using two selection strategies, both of which result in less labeling cost than single-criterion-based methods. The results for named entity recognition on both MUC-6 and GENIA show that the labeling cost can be reduced by at least 80% without degrading performance.

1 Introduction

In machine learning approaches to natural language processing (NLP), models are generally trained on a large annotated corpus. However, annotating such a corpus is expensive and time-consuming, which makes it difficult to adapt an existing model to a new domain. In order to overcome this difficulty, active learning (sample selection) has been studied in more and more NLP applications, such as POS tagging (Engelson and Dagan 1999), information extraction (Thompson et al. 1999), text classification (Lewis and Catlett 1994; McCallum and Nigam 1998; Schohn and Cohn 2000; Tong and Koller 2000; Brinker 2003), statistical parsing (Thompson et al. 1999; Tang et al. 2002; Steedman et al. 2003), noun phrase chunking (Ngai and Yarowsky 2000), etc.

Active learning is based on the assumption that a small number of annotated examples and a large number of unannotated examples are available. This assumption is valid in most NLP tasks. Different from supervised learning, in which the entire corpus is labeled manually, active learning selects the most useful examples for labeling and adds the labeled examples to the training set to retrain the model. This procedure is repeated until the model achieves a certain level of performance. Practically, a batch of examples is selected at a time, called batch-based sample selection (Lewis and Catlett 1994), since it is time-consuming to retrain the model if only one new example is added to the training set. Much existing work in the area focuses on two approaches to selecting the most informative examples, those for which the current model is most uncertain: certainty-based methods (Thompson et al. 1999; Tang et al. 2002; Schohn and Cohn 2000; Tong and Koller 2000; Brinker 2003) and committee-based methods (McCallum and Nigam 1998; Engelson and Dagan 1999; Ngai and Yarowsky 2000).

Being the first piece of work on active learning for the named entity recognition (NER) task, we target minimizing human annotation effort while still reaching the same level of performance as a supervised learning approach. For this purpose, we make a more comprehensive consideration of the contribution of individual examples and, more importantly, maximize the contribution of a batch based on three criteria: informativeness, representativeness and diversity. First, we propose three scoring functions to quantify the informativeness of an example, which can be used to select the most uncertain examples. Second, a representativeness measure is further proposed to choose the examples representing the majority.
Third, we propose two diversity considerations (global and local) to avoid repetition among the examples of a batch. Finally, two combination strategies using the above three criteria are proposed to reach the maximum effectiveness of active learning for NER.

¹ Current address of the first author: Universität des Saarlandes, Computational Linguistics Dept., Saarbrücken, Germany. dshen@coli.uni-sb.de

We build our NER model using Support Vector Machines (SVM). The experiments show that our active learning methods achieve promising results on this NER task: on both MUC-6 and GENIA, the amount of labeled training data can be reduced by at least 80% without degrading the quality of the named entity recognizer. The contributions come not only from the above measures but also from the two sample selection strategies, which effectively incorporate the informativeness, representativeness and diversity criteria. To our knowledge, this is the first work to consider these three criteria all together for active learning. Furthermore, such measures and strategies can easily be adapted to other active learning tasks as well.

2 Multi-criteria for NER Active Learning

Support Vector Machines (SVM) is a powerful machine learning method that has been applied successfully to NER tasks (Kazama et al. 2002; Lee et al. 2003). In this paper, we apply active learning methods to a simple and effective SVM model that recognizes one class of names at a time, such as protein names or person names. In NER, the SVM classifies a word into the positive class 1, indicating that the word is part of an entity, or the negative class -1, indicating that it is not. Each word is represented as a high-dimensional feature vector including surface word information, orthographic features, a POS feature and semantic trigger features (Shen et al. 2003). The semantic trigger features consist of special head nouns for an entity class, which are supplied by users. Furthermore, a window (size = 7), which represents the local context of the target word w, is also used to classify w (see the sketch at the end of this section introduction).

However, for active learning in NER, it is not reasonable to ask a human to label a single word without context. Even if we required a human to label a single word, he would have to make an additional effort to refer to the context of the word. In our active learning process, we therefore select a word sequence which consists of a machine-annotated named entity and its context, rather than a single word. Accordingly, all of the measures we propose for active learning are applied to machine-annotated named entities, and we have to study how to extend the word-level measures to named entities. Thus, active learning in SVM-based NER is more complex than in simple classification tasks, such as the text classification on which most SVM active learning work has been conducted (Schohn and Cohn 2000; Tong and Koller 2000; Brinker 2003). In the next part, we introduce the informativeness, representativeness and diversity measures for SVM-based NER.
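The paper does not include code for this representation. Purely as an illustration, the sketch below shows how a target word plus its size-7 context window could be mapped to a sparse feature dictionary for a binary SVM; the feature templates, names and the extract_features helper are our assumptions, simplified stand-ins for the surface, orthographic, POS and trigger features described above.

    # Illustrative sketch (not the authors' code): window-based features
    # for classifying one word as inside (+1) / outside (-1) an entity.

    def word_shape(tok):
        """Crude orthographic feature: digit/capitalization pattern."""
        if tok.isdigit():
            return "DIGITS"
        if tok[0].isupper():
            return "CAP"
        return "LOWER"

    def extract_features(words, i, window=7):
        """Features for target word words[i] from the window i-3 .. i+3."""
        half = window // 2
        feats = {}
        for offset in range(-half, half + 1):
            j = i + offset
            tok = words[j] if 0 <= j < len(words) else "<PAD>"
            feats["w[%d]=%s" % (offset, tok.lower())] = 1.0       # surface word
            shape = word_shape(tok) if tok != "<PAD>" else "PAD"  # orthography
            feats["shape[%d]=%s" % (offset, shape)] = 1.0
        return feats

    # Usage, e.g. with scikit-learn:
    #   from sklearn.feature_extraction import DictVectorizer
    #   X = DictVectorizer().fit_transform(extract_features(s, i) for s, i in samples)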
2.1 Informativeness

The basic idea of the informativeness criterion is similar to the certainty-based sample selection methods used in much previous work. In our task, we use a distance-based measure to evaluate the informativeness of a word, and we extend it to a measure for an entity using three scoring functions. We prefer examples with a high informativeness degree, for which the current model is most uncertain.

2.1.1 Informativeness Measure for Word

In the simplest linear form, training an SVM means finding a hyperplane that separates the positive and negative examples in the training set with maximum margin. The margin is defined by the distance from the hyperplane to the nearest positive and negative examples. The training examples closest to the hyperplane are called support vectors. In an SVM, only the support vectors are useful for classification, which is different from statistical models. SVM training obtains these support vectors and their weights from the training set by solving a quadratic programming problem. The support vectors can later be used to classify the test data.

Intuitively, we consider the informativeness of an example to be the effect it can have on the support vectors when added to the training set. An example may be informative for the learner if the distance of its feature vector to the hyperplane is less than that of the support vectors to the hyperplane (equal to 1). This intuition is also justified by (Schohn and Cohn 2000; Tong and Koller 2000) based on a version space analysis: they state that labeling an example that lies on or close to the hyperplane is guaranteed to have an effect on the solution. In our task, we use this distance to measure the informativeness of an example. The distance of a word's feature vector to the hyperplane is computed as follows:

    Dist(w) = | ∑_{i=1}^{N} α_i y_i k(s_i, w) + b |

where w is the feature vector of the word; α_i, y_i and s_i are the weight, the class label and the feature vector of the ith support vector, respectively; and N is the number of support vectors in the current model. We select the example with minimal Dist, which indicates that it comes closest to the hyperplane in feature space. This example is considered the most informative for the current model.
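As a concrete illustration (not the authors' implementation): Dist(w) is exactly the absolute value of the SVM decision function, which scikit-learn exposes via decision_function. Under that assumption, a word-level informativeness score could be sketched as follows.

    # Sketch: word informativeness as distance to the SVM hyperplane.
    # For a kernel SVM, decision_function(x) = sum_i alpha_i*y_i*k(s_i, x) + b,
    # matching Dist(w) above before the absolute value is taken.
    import numpy as np
    from sklearn.svm import SVC

    model = SVC(kernel="linear")   # any kernel works for this measure
    # model.fit(X_train, y_train)  # labeled word vectors, labels in {+1, -1}

    def word_informativeness(model, X_words):
        """Smaller Dist = closer to the hyperplane = more informative."""
        return np.abs(model.decision_function(X_words))

    # Selecting the most uncertain words:
    #   order = np.argsort(word_informativeness(model, X_unlabeled))
    #   most_informative = X_unlabeled[order[:batch_size]]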

2.1.2 Informativeness Measure for Named Entity

Based on the above informativeness measure for a word, we compute the overall informativeness degree of a named entity NE. We propose the following three scoring functions. Let NE = w_1 ... w_N, in which w_i is the feature vector of the ith word of NE.

Info_Avg: The informativeness of NE is scored by the average distance of the words in NE to the hyperplane:

    Info(NE) = 1 − (1/N) ∑_{w_i ∈ NE} Dist(w_i)

Info_Min: The informativeness of NE is scored by the minimal distance of the words in NE:

    Info(NE) = 1 − Min_{w_i ∈ NE} Dist(w_i)

Info_S/N: If the distance of a word to the hyperplane is less than a threshold α (= 1 in our task), the word is considered to have a short distance. We compute the proportion of words with short distance to the total number of words in the named entity, and use this proportion to quantify the informativeness of the named entity:

    Info(NE) = NUM_{w_i ∈ NE}(Dist(w_i) < α) / N

In Section 4.3, we will evaluate the effectiveness of these scoring functions.

2.2 Representativeness

In addition to the most informative example, we also prefer the most representative example. The representativeness of an example can be evaluated based on how many examples are similar or near to it. Examples with a high representativeness degree are less likely to be outliers, and adding them to the training set will affect a large number of unlabeled examples. Only a few works have considered this selection criterion (McCallum and Nigam 1998; Tang et al. 2002), and both are specific to their tasks, viz. text classification and statistical parsing. In this section, we compute the similarity between words using a general vector-based measure, extend this measure to the named entity level using a dynamic time warping algorithm, and quantify the representativeness of a named entity by its density.

2.2.1 Similarity Measure between Words

In the general vector space model, the similarity between two vectors may be measured by computing the cosine of the angle between them. The smaller the angle is, the more similar the vectors are. This measure, called the cosine-similarity measure, has been widely used in information retrieval tasks (Baeza-Yates and Ribeiro-Neto 1999). In our task, we also use it to quantify the similarity between two words. In particular, the calculation in the SVM is projected into a higher-dimensional space by a kernel function k(w_i, w_j), so we adapt the cosine-similarity measure to the SVM as follows:

    Sim(w_i, w_j) = k(w_i, w_j) / sqrt( k(w_i, w_i) · k(w_j, w_j) )

where w_i and w_j are the feature vectors of words i and j. This calculation is also supported by (Brinker 2003)'s work. Furthermore, if we use the linear kernel k(w_i, w_j) = w_i · w_j, the measure equals the traditional cosine similarity cos θ = (w_i · w_j) / (‖w_i‖ ‖w_j‖) and may be regarded as a general vector-based similarity measure. A small sketch of the scoring functions above and of this similarity measure is given below.
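For concreteness, the following sketch (our own, not from the paper) restates the three entity-level scoring functions and the kernel-space cosine similarity; dists is assumed to be a precomputed array of Dist(w_i) values for the words of one entity, and k any kernel function.

    # Sketch: entity-level informativeness scores and kernel cosine similarity.
    import numpy as np

    def info_avg(dists):
        """1 - average distance of the entity's words to the hyperplane."""
        return 1.0 - np.mean(dists)

    def info_min(dists):
        """1 - minimal distance among the entity's words."""
        return 1.0 - np.min(dists)

    def info_sn(dists, alpha=1.0):
        """Proportion of words closer to the hyperplane than alpha."""
        return np.mean(dists < alpha)

    def kernel_cosine(k, wi, wj):
        """Sim(wi, wj) = k(wi, wj) / sqrt(k(wi, wi) * k(wj, wj));
        with a linear kernel this is ordinary cosine similarity."""
        return k(wi, wj) / np.sqrt(k(wi, wi) * k(wj, wj))

    # Example with a linear kernel:
    #   lin = lambda a, b: float(np.dot(a, b))
    #   kernel_cosine(lin, np.array([1., 0.]), np.array([1., 1.]))  # ~0.707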
2.2.2 Similarity Measure between Named Entities

In this part, we compute the similarity between two machine-annotated named entities given the similarities between words. Regarding an entity as a word sequence, this task is analogous to the alignment of two sequences. We employ the dynamic time warping (DTW) algorithm (Rabiner et al. 1978) to find an optimal alignment between the words in the two sequences which maximizes the accumulated similarity between the sequences, adapting it to our task as follows.

Let NE_1 = w_11 w_12 ... w_1n ... w_1N (n = 1, ..., N) and NE_2 = w_21 w_22 ... w_2m ... w_2M (m = 1, ..., M) denote the two word sequences to be matched, consisting of N and M words respectively, with NE_1(n) = w_1n and NE_2(m) = w_2m. A similarity value Sim(w_1n, w_2m) is known for every pair of words (w_1n, w_2m) within NE_1 and NE_2. The goal of DTW is to find a path m = map(n) which maps each n onto a corresponding m such that the accumulated similarity Sim* along the path is maximized:

    Sim* = Max_{map(·)} ∑_{n=1}^{N} Sim(NE_1(n), NE_2(map(n)))

A dynamic programming method is used to determine the optimal path map(n). The accumulated similarity Sim_A at any grid point (n, m) can be calculated recursively as

    Sim_A(n, m) = Sim(w_1n, w_2m) + Max_{q ≤ m} Sim_A(n−1, q)

with, finally, Sim* = Sim_A(N, M). Certainly, the overall similarity measure Sim* has to be normalized, as longer sequences normally give a higher similarity value. So the similarity between the two sequences NE_1 and NE_2 is calculated as

    Sim(NE_1, NE_2) = Sim* / Max(N, M)
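The recurrence above translates directly into a small dynamic program. The following is our own illustrative transcription, not the authors' implementation; sim can be, for instance, the kernel cosine of Section 2.2.1.

    # Sketch: DTW-style alignment similarity between two entities,
    # implementing Sim_A(n, m) = Sim(w1[n], w2[m]) + max_{q <= m} Sim_A(n-1, q),
    # normalized by the longer sequence length.
    import numpy as np

    def entity_similarity(ne1, ne2, sim):
        """ne1, ne2: lists of word feature vectors; sim(wi, wj) -> float."""
        N, M = len(ne1), len(ne2)
        A = np.zeros((N, M))
        for m in range(M):
            A[0, m] = sim(ne1[0], ne2[m])
        for n in range(1, N):
            best_prev = -np.inf
            for m in range(M):
                best_prev = max(best_prev, A[n - 1, m])  # max over q <= m
                A[n, m] = sim(ne1[n], ne2[m]) + best_prev
        return A[N - 1, M - 1] / max(N, M)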

2.2.3 Representativeness Measure for Named Entity

Given a set of machine-annotated named entities NESet = {NE_1, ..., NE_N}, the representativeness of a named entity NE_i in NESet is quantified by its density. The density of NE_i is defined as the average similarity between NE_i and all the other entities NE_j in NESet:

    Density(NE_i) = ( ∑_{j ≠ i} Sim(NE_i, NE_j) ) / (N − 1)

If NE_i has the largest density among all the entities in NESet, it can be regarded as the centroid of NESet and also as the most representative example in NESet.

2.3 Diversity

The diversity criterion aims to maximize the training utility of a batch. We prefer a batch in which the examples have high variance with respect to each other. For example, given a batch size of 5, we try not to select five repetitious examples at a time. To our knowledge, there is only one work (Brinker 2003) exploring this criterion. In our task, we propose two methods, global and local, to make the examples in a batch diverse enough.

2.3.1 Global Consideration

For a global consideration, we cluster all named entities in NESet based on the similarity measure proposed in Section 2.2.2. The named entities in the same cluster may be considered similar to each other, so we select named entities from different clusters at one time. We employ a K-means clustering algorithm (Jelinek 1997), shown in Figure 1.

    Given: NESet = {NE_1, ..., NE_N}; the number of clusters K
    Initialization: randomly and equally partition {NE_1, ..., NE_N} into K initial clusters C_j (j = 1, ..., K)
    Loop until the number of changes for the centroids of all clusters is less than a threshold:
        Find the centroid of each cluster C_j (j = 1, ..., K):
            NECent_j = argmax_{NE_i ∈ C_j} ∑_{NE_k ∈ C_j} Sim(NE_i, NE_k)
        Repartition {NE_1, ..., NE_N} into K clusters: NE_i is assigned to cluster C_j if
            Sim(NE_i, NECent_j) ≥ Sim(NE_i, NECent_w) for all w ≠ j

    Figure 1: Global Consideration for Diversity: K-Means Clustering algorithm

In each round, we need to compute the pairwise similarities within each cluster to get its centroid, and then the similarities between each example and all centroids to repartition the examples, so the algorithm is time-consuming. Under the assumption that the N examples are uniformly distributed among the K clusters, the time complexity of the algorithm is about O(N²/K + NK) (Tang et al. 2002). In one of our experiments, with K equal to 50, the time complexity is about O(10^6). For efficiency, we may filter the entities in NESet before clustering them, which is further discussed in Section 3.

2.3.2 Local Consideration

When selecting a machine-annotated named entity, we compare it with all previously selected named entities in the current batch. If the similarity between them is above a threshold β, the example is not allowed into the batch. The order in which examples are considered is based on some measure, such as the informativeness measure, the representativeness measure or their combination. This local selection method is shown in Figure 2. In this way, we avoid selecting examples that are too similar to each other (similarity value ≥ β) within a batch. The threshold β may be set to the average similarity between the examples in NESet.

    Given: NESet = {NE_1, ..., NE_N}; BatchSet with maximal size K
    Initialization: BatchSet = ∅
    Loop until BatchSet is full:
        Select NE_i from NESet based on some measure
        RepeatFlag = false
        Loop over each NE_j in BatchSet:
            If Sim(NE_i, NE_j) ≥ β Then RepeatFlag = true; stop the loop
        If RepeatFlag == false Then add NE_i into BatchSet
        Remove NE_i from NESet

    Figure 2: Local Consideration for Diversity
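Putting the density measure of Section 2.2.3 together with the diversity filter of Figure 2, a greedy batch selector might be sketched as follows. This is assumed code, not the paper's: S is a precomputed pairwise entity similarity matrix, and the default threshold β is the average pairwise similarity, as suggested above.

    # Sketch: density-based representativeness plus the Figure 2 filter.
    import numpy as np

    def densities(S):
        """Density = average similarity of each entity to all others."""
        N = S.shape[0]
        return (S.sum(axis=1) - S.diagonal()) / (N - 1)

    def select_batch(S, scores, K, beta=None):
        """Greedily take up to K entities by descending score, skipping any
        candidate too similar (>= beta) to an already selected one."""
        if beta is None:
            beta = S[~np.eye(len(S), dtype=bool)].mean()  # avg pairwise sim
        batch = []
        for i in np.argsort(-scores):                     # best-scored first
            if all(S[i, j] < beta for j in batch):
                batch.append(i)
            if len(batch) == K:
                break
        return batch

    # Example: order candidates purely by representativeness.
    #   batch = select_batch(S, densities(S), K=50)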
This consideration only requires O(NK + K²) computational time; in one of our experiments (K = 50), the time complexity is about O(10^5). It is therefore more efficient than the clustering algorithm described in Section 2.3.1, which is sketched in code below for comparison.
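The following is a minimal sketch (ours, not the authors') of the similarity-based K-means of Figure 1. It uses medoid-style centroids over a precomputed similarity matrix S and stops when the assignments stabilize, a slight simplification of the centroid-change threshold in Figure 1.

    # Sketch: K-means over entities with a similarity matrix (Figure 1).
    import numpy as np

    def kmeans_entities(S, K, max_iter=100, seed=0):
        """S: NxN pairwise similarity matrix. Returns cluster labels."""
        N = S.shape[0]
        rng = np.random.default_rng(seed)
        labels = rng.permutation(N) % K          # random equal partition
        for _ in range(max_iter):
            centroids = []
            for c in range(K):
                members = np.flatnonzero(labels == c)
                if len(members) == 0:
                    continue                     # drop emptied clusters
                # medoid: member with maximal summed within-cluster similarity
                within = S[np.ix_(members, members)].sum(axis=1)
                centroids.append(members[np.argmax(within)])
            centroids = np.array(centroids)
            new_labels = np.argmax(S[:, centroids], axis=1)  # most similar centroid
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        return labels

    # Picking one entity (e.g. the medoid) per cluster then yields a diverse batch.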

3 Sample Selection Strategies

In this section, we study how to combine, and strike a proper balance between, these criteria (viz. informativeness, representativeness and diversity) to reach the maximum effectiveness of NER active learning. We build two strategies that combine the measures proposed above. The strategies differ in the priorities given to the criteria and in the degrees to which each criterion is satisfied.

Strategy 1: We first consider the informativeness criterion. We choose the M examples with the highest informativeness scores from NESet and put them into an intermediate set called INTERSet. This pre-selection makes the later steps of the selection process faster, since the size of INTERSet is much smaller than that of NESet. Then we cluster the examples in INTERSet and choose the centroid of each cluster for a batch called BatchSet. The centroid of a cluster is the most representative example in that cluster, since it has the largest density; furthermore, the examples in different clusters may be considered diverse from each other. By this means, we consider the representativeness and diversity criteria at the same time. This strategy is shown in Figure 3. One limitation of this strategy is that the clustering result may not reflect the distribution of the whole sample space, since for efficiency we only cluster INTERSet. The other is that the representativeness of an example is evaluated only within its cluster; if the cluster size is too small, the most representative example in that cluster may not be representative in the whole sample space.

    Given: NESet = {NE_1, ..., NE_N}; BatchSet with maximal size K; INTERSet with maximal size M
    Steps:
        BatchSet = ∅, INTERSet = ∅
        Select the M entities with the highest Info scores from NESet into INTERSet
        Cluster the entities in INTERSet into K clusters
        Add the centroid entity of each cluster to BatchSet

    Figure 3: Sample Selection Strategy 1

Strategy 2 (Figure 4): We combine the informativeness and representativeness criteria using the function

    λ·Info(NE) + (1 − λ)·Density(NE)

in which the Info and Density values of NE are normalized first. The individual importance of each criterion in this function is adjusted by the tradeoff parameter λ (0 ≤ λ ≤ 1), set to 0.6 in our experiments. First, we select the candidate example NE with the maximum value of this function from NESet. Second, we apply the diversity criterion using the local method of Section 2.3.2: the candidate NE is added to the batch only if it is different enough from every previously selected example in the batch. The threshold β is set to the average pairwise similarity of the entities in NESet. A sketch of this strategy in code follows Figure 4.

    Given: NESet = {NE_1, ..., NE_N}; BatchSet with maximal size K
    Initialization: BatchSet = ∅
    Loop until BatchSet is full:
        Select NE_i = argmax_{NE ∈ NESet} ( λ·Info(NE) + (1 − λ)·Density(NE) )
        RepeatFlag = false
        Loop over each NE_j in BatchSet:
            If Sim(NE_i, NE_j) ≥ β Then RepeatFlag = true; stop the loop
        If RepeatFlag == false Then add NE_i into BatchSet
        Remove NE_i from NESet

    Figure 4: Sample Selection Strategy 2
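The sketch below (ours, not the authors') expresses Strategy 2 in code, reusing the densities() and select_batch() helpers from the sketch after Figure 2; info holds the per-entity Info scores, and lam defaults to 0.6 as in the experiments.

    # Sketch: Strategy 2 = normalized combination of Info and Density,
    # followed by the diversity-filtered greedy selection of Figure 2/4.
    import numpy as np

    def normalize(x):
        """Min-max normalize a score vector (constant vectors map to 0)."""
        lo, hi = x.min(), x.max()
        return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)

    def strategy2(info, S, K, lam=0.6):
        """info: Info scores per entity; S: pairwise similarity matrix."""
        score = lam * normalize(info) + (1 - lam) * normalize(densities(S))
        return select_batch(S, score, K)  # beta defaults to avg pairwise sim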
4 Experimental Results and Analysis

4.1 Experiment Settings

In order to evaluate the effectiveness of our selection strategies, we apply them to recognize protein (PRT) names in the biomedical domain, using the GENIA corpus V1.1 (Ohta et al. 2002), and person (PER), location (LOC) and organization (ORG) names in the newswire domain, using the MUC-6 corpus. First, we randomly split each corpus into three parts: an initial training set to build an initial model, a test set to evaluate the performance of the model, and an unlabeled set from which to select examples. The size of each data set is shown in Table 1. Then, iteratively, we select a batch of examples following the proposed selection strategies, require human experts to label them, and add them to the training set. The batch size K = 50 in GENIA and 10 in MUC-6. Each example is defined as a machine-recognized named entity and its context words (the previous 3 words and the next 3 words).

    Domain      Class  Corpus  Initial Training Set  Test Set               Unlabeled Set
    Biomedical  PRT    GENIA   (277 words)           900 sent. (26K words)  8004 sent. (223K words)
    Newswire    PER    MUC-6   5 sent. (131 words)   602 sent. (14K words)  7809 sent. (157K words)
    Newswire    LOC    MUC-6   5 sent. (130 words)   602 sent. (14K words)  7809 sent. (157K words)
    Newswire    ORG    MUC-6   5 sent. (113 words)   602 sent. (14K words)  7809 sent. (157K words)

    Table 1: Experiment settings for active learning using GENIA V1.1 (PRT) and MUC-6 (PER, LOC, ORG)

The goal of our work is to minimize the human annotation effort needed to learn a named entity recognizer with the same performance level as supervised learning. The performance of our model is evaluated using precision/recall/F-measure.

4.2 Overall Result in GENIA and MUC-6

In this section, we evaluate our selection strategies by comparing them with a random selection method, in which a batch of examples is randomly selected in each iteration, on the GENIA and MUC-6 corpora. Table 2 shows the amount of training data needed to achieve the performance of supervised learning using the various selection methods, viz. Random, Strategy1 and Strategy2. In GENIA, we find:

- The model achieves 63.3 F-measure using 223K words in supervised learning.
- The best performer is Strategy2 (31K words), which requires less than 40% of the training data that Random (83K words) does and 14% of the training data that supervised learning does.
- Strategy1 (40K words) performs slightly worse than Strategy2, requiring 9K more words, probably because Strategy1 cannot avoid selecting outliers if a cluster is too small.
- Random (83K words) requires about 37% of the training data that supervised learning does. This indicates that only the words in and around a named entity are useful for classification, and that words far from the named entity may not be helpful.

    Class  Supervised     Random  Strategy1  Strategy2
    PRT    223K (F=63.3)  83K     40K        31K
    PER    157K (F=90.4)  11.5K   4.2K       3.5K
    LOC    157K (F=73.5)  13.6K   3.5K       2.1K
    ORG    157K (F=86.0)  20.2K   9.5K       7.8K

    Table 2: Overall Result in GENIA and MUC-6

Furthermore, when we apply our model in the newswire domain (MUC-6) to recognize person, location and organization names, Strategy1 and Strategy2 show even more promising results compared with supervised learning and Random, as shown in Table 2. On average, about 95% of the data can be saved while achieving the same performance as supervised learning on MUC-6. This is probably because NER in the newswire domain is much simpler than in the biomedical domain (Shen et al. 2003), and because named entities are fewer and much more sparsely distributed in newswire texts than in biomedical texts.

4.3 Effectiveness of Informativeness-based Selection Method

In this section, we investigate the effectiveness of the informativeness criterion for the NER task. Figure 5 plots training data size versus the F-measure achieved by the informativeness-based measures of Section 2.1.2 (Info_Avg, Info_Min and Info_S/N) as well as by Random, on the GENIA corpus. In Figure 5, the horizontal line is the performance level (63.3 F-measure) achieved by supervised learning (223K words). We find that the three informativeness-based measures perform similarly and that each of them outperforms Random. Table 3 highlights the data sizes needed to achieve the peak performance using these selection methods: Random (83K words) on average requires over 1.5 times as much data as the informativeness-based selection methods (52K words) to achieve the same performance.

    Figure 5: Active learning curves (F-measure vs. K words): effectiveness of the three informativeness-criterion-based selections compared with the Random selection.

    Supervised  Random  Info_Avg  Info_Min  Info_S/N
    223K        83K     52.0K     51.9K     52.3K

    Table 3: Training data sizes for the various selection methods to achieve the same performance level as supervised learning

4.4 Effectiveness of Two Sample Selection Strategies

In addition to the informativeness criterion, we further incorporate the representativeness and diversity criteria into active learning using the two strategies described in Section 3.
By comparing the two strategies with Info_Min, the best of the single-criterion-based selection methods, we aim to show that representativeness and diversity are also important factors for active learning. Figure 6 shows the learning curves of the various methods: Strategy1, Strategy2 and Info_Min. In the early iterations (F-measure < 60), the three methods perform similarly, but as the training set grows, the greater efficiency of Strategy1 and Strategy2 becomes evident. Table 4 highlights the final results of the three methods. In order to reach the performance of supervised learning, Strategy1 (40K words) and Strategy2 (31K words) require about 80% and 60%, respectively, of the data that Info_Min (51.9K words) does. We therefore believe that effective combinations of informativeness, representativeness and diversity help to learn the NER model more quickly and at lower annotation cost.

    Figure 6: Active learning curves (F-measure vs. K words): effectiveness of the two multi-criteria-based selection strategies compared with the informativeness-criterion-based selection (Info_Min).

    Info_Min  Strategy1  Strategy2
    51.9K     40K        31K

    Table 4: Comparison of training data sizes for the multi-criteria-based selection strategies and the informativeness-criterion-based selection (Info_Min) to achieve the same performance level as supervised learning

5 Related Work

Since there is no previous study on active learning for the NER task, we only introduce general active learning methods here. Many existing active learning methods select the most uncertain examples using various measures (Thompson et al. 1999; Schohn and Cohn 2000; Tong and Koller 2000; Engelson and Dagan 1999; Ngai and Yarowsky 2000). Our informativeness-based measure is similar to these works; however, these works follow a single criterion only. (McCallum and Nigam 1998; Tang et al. 2002) are the only two works considering the representativeness criterion in active learning. (Tang et al. 2002) use density information to weight the selected examples, while we use it to select examples. Moreover, the representativeness measure we use is relatively general and easy to adapt to other tasks in which the selected example is a sequence of words, such as text chunking, POS tagging, etc. On the other hand, (Brinker 2003) first incorporated diversity in active learning for text classification. Their work is similar to our local consideration in Section 2.3.2; however, they did not further explore how to avoid selecting outliers for a batch. So far, we have not found any previous work integrating informativeness, representativeness and diversity all together.

6 Conclusion and Future Work

In this paper, we study active learning in a more complex NLP task, named entity recognition. We propose a multi-criteria-based approach to select examples based on their informativeness, representativeness and diversity, which are incorporated all together by two strategies (local and global). Experiments show that, on both MUC-6 and GENIA, both strategies combining the three criteria outperform the single criterion (informativeness). The labeling cost can be significantly reduced, by at least 80%, compared with supervised learning. To our best knowledge, this is not only the first work to report empirical results of active learning for NER, but also the first work to incorporate the three criteria all together for selecting examples.

Although the current experimental results are very promising, some parameters in our experiments, such as the batch size K and the λ in the function of Strategy 2, are decided by our experience in the domain. In practical applications, the optimal values of these parameters should be decided automatically based on the training process. Furthermore, we will study how to overcome the limitation of Strategy 1 discussed in Section 3 by using a more effective clustering algorithm. Another interesting direction is to study when to stop active learning.

References

R. Baeza-Yates and B. Ribeiro-Neto. 1999. Modern Information Retrieval. Addison-Wesley.

K. Brinker. 2003. Incorporating Diversity in Active Learning with Support Vector Machines. In Proceedings of ICML 2003.

S. A. Engelson and I. Dagan. 1999. Committee-Based Sample Selection for Probabilistic Classifiers. Journal of Artificial Intelligence Research.

F. Jelinek. 1997. Statistical Methods for Speech Recognition. MIT Press.

J. Kazama, T. Makino, Y. Ohta and J. Tsujii. 2002. Tuning Support Vector Machines for Biomedical Named Entity Recognition. In Proceedings of the ACL 2002 Workshop on NLP in Biomedicine.

K. J. Lee, Y. S. Hwang and H. C. Rim. 2003. Two-Phase Biomedical NE Recognition based on SVMs. In Proceedings of the ACL 2003 Workshop on NLP in Biomedicine.
D. D. Lewis and J. Catlett. 1994. Heterogeneous Uncertainty Sampling for Supervised Learning. In Proceedings of ICML 1994.

A. McCallum and K. Nigam. 1998. Employing EM in Pool-Based Active Learning for Text Classification. In Proceedings of ICML 1998.

G. Ngai and D. Yarowsky. 2000. Rule Writing or Annotation: Cost-efficient Resource Usage for Base Noun Phrase Chunking. In Proceedings of ACL 2000.

T. Ohta, Y. Tateisi, J. Kim, H. Mima and J. Tsujii. 2002. The GENIA corpus: An annotated research abstract corpus in the molecular biology domain. In Proceedings of HLT 2002.

L. R. Rabiner, A. E. Rosenberg and S. E. Levinson. 1978. Considerations in Dynamic Time Warping Algorithms for Discrete Word Recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-26, No. 6.

D. Schohn and D. Cohn. 2000. Less is More: Active Learning with Support Vector Machines. In Proceedings of the 17th International Conference on Machine Learning.

D. Shen, J. Zhang, G. D. Zhou, J. Su and C. L. Tan. 2003. Effective Adaptation of a Hidden Markov Model-based Named Entity Recognizer for the Biomedical Domain. In Proceedings of the ACL 2003 Workshop on NLP in Biomedicine.

M. Steedman, R. Hwa, S. Clark, M. Osborne, A. Sarkar, J. Hockenmaier, P. Ruhlen, S. Baker and J. Crim. 2003. Example Selection for Bootstrapping Statistical Parsers. In Proceedings of HLT-NAACL 2003.

M. Tang, X. Luo and S. Roukos. 2002. Active Learning for Statistical Natural Language Parsing. In Proceedings of ACL 2002.

C. A. Thompson, M. E. Califf and R. J. Mooney. 1999. Active Learning for Natural Language Parsing and Information Extraction. In Proceedings of ICML 1999.

S. Tong and D. Koller. 2000. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research.

V. Vapnik. 1998. Statistical Learning Theory. New York: John Wiley.


More information

An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback

An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback An enhanced representaton of tme seres whch allows fast and accurate classfcaton, clusterng and relevance feedback Eamonn J. Keogh and Mchael J. Pazzan Department of Informaton and Computer Scence Unversty

More information

A New Approach For the Ranking of Fuzzy Sets With Different Heights

A New Approach For the Ranking of Fuzzy Sets With Different Heights New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays

More information

Spam Filtering Based on Support Vector Machines with Taguchi Method for Parameter Selection

Spam Filtering Based on Support Vector Machines with Taguchi Method for Parameter Selection E-mal Spam Flterng Based on Support Vector Machnes wth Taguch Method for Parameter Selecton We-Chh Hsu, Tsan-Yng Yu E-mal Spam Flterng Based on Support Vector Machnes wth Taguch Method for Parameter Selecton

More information

Three supervised learning methods on pen digits character recognition dataset

Three supervised learning methods on pen digits character recognition dataset Three supervsed learnng methods on pen dgts character recognton dataset Chrs Flezach Department of Computer Scence and Engneerng Unversty of Calforna, San Dego San Dego, CA 92093 cflezac@cs.ucsd.edu Satoru

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally

More information

Using Neural Networks and Support Vector Machines in Data Mining

Using Neural Networks and Support Vector Machines in Data Mining Usng eural etworks and Support Vector Machnes n Data Mnng RICHARD A. WASIOWSKI Computer Scence Department Calforna State Unversty Domnguez Hlls Carson, CA 90747 USA Abstract: - Multvarate data analyss

More information

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints Australan Journal of Basc and Appled Scences, 2(4): 1204-1208, 2008 ISSN 1991-8178 Sum of Lnear and Fractonal Multobjectve Programmng Problem under Fuzzy Rules Constrants 1 2 Sanjay Jan and Kalash Lachhwan

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

An AAM-based Face Shape Classification Method Used for Facial Expression Recognition

An AAM-based Face Shape Classification Method Used for Facial Expression Recognition Internatonal Journal of Research n Engneerng and Technology (IJRET) Vol. 2, No. 4, 23 ISSN 2277 4378 An AAM-based Face Shape Classfcaton Method Used for Facal Expresson Recognton Lunng. L, Jaehyun So,

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces Range mages For many structured lght scanners, the range data forms a hghly regular pattern known as a range mage. he samplng pattern s determned by the specfc scanner. Range mage regstraton 1 Examples

More information

A Novel Term_Class Relevance Measure for Text Categorization

A Novel Term_Class Relevance Measure for Text Categorization A Novel Term_Class Relevance Measure for Text Categorzaton D S Guru, Mahamad Suhl Department of Studes n Computer Scence, Unversty of Mysore, Mysore, Inda Abstract: In ths paper, we ntroduce a new measure

More information

Related-Mode Attacks on CTR Encryption Mode

Related-Mode Attacks on CTR Encryption Mode Internatonal Journal of Network Securty, Vol.4, No.3, PP.282 287, May 2007 282 Related-Mode Attacks on CTR Encrypton Mode Dayn Wang, Dongda Ln, and Wenlng Wu (Correspondng author: Dayn Wang) Key Laboratory

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information