Meta-Prediction for Collective Classification

McDowell, L.K., Gupta, K.M., & Aha, D.W. (2010). Meta-prediction in collective classification. To appear in Proceedings of the Twenty-Third Florida Artificial Intelligence Research Society Conference. Daytona Beach, FL: AAAI Press.

Meta-Prediction for Collective Classification

Luke K. McDowell (1), Kalyan Moy Gupta (2), and David W. Aha (3)
(1) Dept. of Computer Science; U.S. Naval Academy; Annapolis, MD
(2) Knexus Research Corp.; Springfield, VA
(3) Navy Center for Applied Research in Artificial Intelligence; Naval Research Laboratory (Code 5514); Washington, DC
lmcdowel@usna.edu | kalyan.gupta@knexusresearch.com | david.aha@nrl.navy.mil

Copyright 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

When data instances are inter-related, as are nodes in a social network or hyperlink graph, algorithms for collective classification (CC) can significantly improve accuracy. Recently, an algorithm for CC named Cautious ICA (ICA_C) was shown to improve accuracy compared to the popular ICA algorithm. ICA_C improves performance by initially favoring its more confident predictions during collective inference. In this paper, we introduce ICA_MC, a new algorithm that outperforms ICA_C when the attributes that describe each node are not highly predictive. ICA_MC learns a meta-classifier that identifies which node label predictions are most likely to be correct. We show that this approach significantly increases accuracy on a range of real and synthetic data sets. We also describe new features for the meta-classifier and demonstrate that a simple search can identify an effective feature set that increases accuracy.

Introduction

In many classification tasks, the instances to be classified (such as web pages or people in a social network) are related in some way. Collective classification (CC) is a methodology that jointly classifies such instances (or nodes). CC algorithms can attain higher accuracies than non-collective methods when nodes are interrelated (Neville and Jensen 2000; Taskar, Abbeel, and Koller 2002). Several CC algorithms have been studied, including relaxation labeling (Chakrabarti, Dom, and Indyk 1998), the Iterative Classification Algorithm (ICA) (Sen et al. 2008), loopy belief propagation (LBP) (Taskar et al. 2002), and Gibbs sampling (Jensen, Neville, and Gallagher 2004). We focus on ICA because it is a popular and computationally efficient algorithm that has good classification performance (Sen et al. 2008). It makes initial label predictions for each node v, then iteratively recomputes them based on the predictions for every node that links to v.

Recently, a variant of ICA named Cautious ICA (ICA_C) (McDowell et al. 2007, 2009) was shown to often attain higher accuracies than ICA. ICA_C is based on the observation that, since some label predictions will be incorrect, ICA's use of all predictions may sometimes decrease accuracy. To counter this effect, ICA_C instead initially uses only some label predictions. By cautiously choosing only those predictions that appear more likely to be correct, ICA_C can increase accuracy vs. ICA.

In this paper, we introduce Meta-Cautious ICA (ICA_MC), which is exactly like ICA_C except in how it selects the set of predicted labels to use during classification. In particular, ICA_MC learns a meta-classifier to predict the likelihood that a label prediction is correct. By carefully constructing a meta-training set from the original training set, ICA_MC can learn this classifier and use it to select more reliable predicted labels than ICA_C, increasing accuracy.

Our contributions are as follows. First, we present ICA_MC, a novel algorithm that can significantly increase accuracy compared to ICA_C, especially when the attributes that describe each node are not very predictive.
Second, we introduce a technique to improve accuracy by generating more training examples for ICA_MC's meta-classifier. Third, we describe new features for the meta-classifier and demonstrate that, while the most effective meta-features for ICA_MC are task-dependent, a simple search identifies an effective set that increases accuracy. Empirical evaluations using real and synthetic datasets support our claims.

We next review CC and the ICA and ICA_C algorithms. Then we introduce ICA_MC. Finally, we present our experimental evaluation and discuss future research issues.

Collective Classification

Assume we are given a graph G = (V, E, X, Y, C), where V is a set of nodes, E is a set of (possibly directed) edges, each x_i ∈ X is an attribute vector for node v_i ∈ V, each Y_i ∈ Y is a label variable for v_i, and C is the set of possible labels. We are also given a set of known label values Y^K for nodes V^K ⊆ V, so that Y^K = {y_i | v_i ∈ V^K}. Finally, assume that we are given a training graph G^Tr, which is defined similarly to G except that every node in G^Tr is a known node. Then the task is to infer Y^U = Y - Y^K, which are the values of Y_i for the nodes in G whose labels are unknown. For each node v_i, let y_i be the true label and ŷ_i be the predicted label.
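To make this notation concrete, the following minimal Python sketch (our illustration, not part of the paper; all names are hypothetical) shows the inputs a CC algorithm receives and the output it must produce:

    from dataclasses import dataclass

    @dataclass
    class Graph:
        """A partially labeled graph for collective classification (illustrative)."""
        nodes: list       # V: node ids
        edges: list       # E: (possibly directed) links as (i, j) pairs
        attrs: dict       # X: attribute vector x_i, keyed by node id
        known: dict       # Y^K: known labels, keyed by node id (V^K = its keys)
        classes: tuple    # C: the set of possible labels

    def unknown_nodes(g):
        """V^U: the nodes whose labels must be inferred."""
        return [v for v in g.nodes if v not in g.known]

    # The CC task: given g (plus a fully labeled training graph G^Tr), predict
    # a label y_hat in g.classes for every node in unknown_nodes(g).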

For example, consider the task of predicting whether a web page belongs to a professor or a student. Conventional supervised learning approaches ignore the links and classify each page using attributes derived from its content (e.g., words present in the page). In contrast, a technique for relational classification explicitly uses the links to construct additional features for classification (e.g., for each page, include as features the words from hyperlinked pages). These relational features can increase classification accuracy, though not always (Chakrabarti et al. 1998). Alternatively, even greater (and usually more reliable) increases can occur when the class labels of the linked pages are used to derive relevant relational features (Jensen et al. 2004). However, using features based on these labels is challenging because some or all of these labels are initially unknown. Thus, their labels must first be predicted (without using relational features) and then re-predicted in some manner (using all features). This process of jointly inferring the labels of interrelated nodes is known as collective classification (CC).

We next describe two existing collective inference algorithms, ICA and ICA_C, and then introduce ICA_MC. Each algorithm relies on a given node classifier (M_AR) that predicts a node's label using both attributes and relations.

ICA: Inference using all predicted labels

Figure 1 shows pseudocode for ICA, ICA_C, and ICA_MC (depending on AlgType). Step 1 is a bootstrap step that predicts the class label ŷ_i for each node in V^U using only attributes (conf_i records the confidence of this prediction, but ICA does not use it). ICA then iterates (step 2). During each iteration, it selects all available label predictions (step 3), computes the relational feature values based on them (step 4), and then re-predicts the class label of each node using both attributes and relational features (step 5). Step 6 is ignored for ICA. After iterating, step 7 returns the final set of predicted class labels and their confidence values.

    ICA_classify(V, E, X, Y^K, M_AR, M_A, M_M, n, AlgType, pi) =
      // V=nodes; E=edges; X=attr. vectors; Y^K=labels of known nodes
      // M_AR=node classifier (uses attrs. & relations); M_A=classifier
      //   that uses attrs. only; M_M=meta-classifier (predicts correctness)
      // n=# iters; AlgType=ICA, ICA_C, or ICA_MC; pi=est. class distr.
      1 for each node v_i ∈ V^U do                  // Bootstrap
          (ŷ_i, conf_i) <- M_A(x_i)
      2 for h = 0 to n do
      3   // Select node labels for computing relational feat. values
          if (AlgType = ICA)           // Use all labels: known or predicted
            Y <- Y^K ∪ {ŷ_i | v_i ∈ V^U}
          else                         // ICA_C or ICA_MC: use known and the
            m <- |V^U| * (h/n)         //   m most confident predicted labels
            Y <- Y^K ∪ {ŷ_i | v_i ∈ V^U and rank(conf_i) <= m}
      4   for each node v_i ∈ V^U do   // Use labels selected above
            f_i <- calcRelatFeats(v_i, E, Y)  //   to compute feat. values
      5   for each node v_i ∈ V^U do   // Re-predict labels using
            (ŷ_i, conf_i) <- M_AR(x_i, f_i)   //   attributes and features
      6   if (AlgType = ICA_MC)
            Z^U <- {(ŷ_i, conf_i) | v_i ∈ V^U}
            for each node v_i ∈ V^U do        // Compute meta-features; use
              mf_i <- calcMetaFeatures(i, V, E, X, Y^K, Y, Z^U, M_A, pi)
              conf_i <- M_M(mf_i)             //   to re-estimate conf. values
      7 // Return most likely label (and conf. estimate) for each node
        return {(ŷ_i, conf_i) | v_i ∈ V^U}

Figure 1: Pseudocode for ICA, ICA_C, or ICA_MC. Based on prior work (McDowell et al. 2009), we set n=10 iterations.

ICA_C: Inference using some predicted labels

In steps 3-4 of Figure 1, ICA assumes that the predicted node labels are all equally likely to be correct. When AlgType is instead ICA_C, the inference becomes more cautious by only considering more confident predictions. Specifically, step 3 commits into Y only the most confident m of the currently predicted labels; other labels are considered missing and are ignored. Step 4 computes the relational features using only the committed labels, and step 5 performs classification using this information. Step 3 gradually increases the fraction of predicted labels that are committed per iteration. Node label assignments committed in an iteration h are not necessarily committed again in future iterations (and may in fact change).

ICA_C requires a confidence measure (conf_i in Figure 1) to rank the current label predictions. As with prior work (Neville and Jensen 2000; McDowell et al. 2007), we set conf_i to be the posterior probability of the most likely class for each node v_i. This is computed by the node classifier M_AR based on the attributes and relational features of v_i.

ICA_C performs well on a variety of real and synthetic data, and attains higher accuracies than ICA and similar accuracies as more time-consuming algorithms such as Gibbs sampling or LBP (McDowell et al. 2009). However, ICA_C's ability to select the best predicted labels depends entirely on the confidence value estimates from the node classifier. Accuracy may decrease if a misclassified node is nonetheless assigned a high confidence value.
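The following Python sketch (our rendering of Figure 1's ICA and ICA_C cases, not the authors' code) shows the iterative structure; clf_a and clf_ar stand in for M_A and M_AR and are assumed to be callables returning a (label, confidence) pair, while rel_feats computes relational feature values from the committed labels:

    def ica_classify(nodes_u, attrs, rel_feats, known, clf_a, clf_ar,
                     n_iters=10, cautious=True):
        """Sketch of Figure 1 for ICA (cautious=False) and ICA_C (cautious=True).
        clf_a(x) and clf_ar(x, f) each return a (label, confidence) pair."""
        pred = {}
        for v in nodes_u:                       # Step 1: bootstrap, attributes only
            pred[v] = clf_a(attrs[v])
        for h in range(n_iters + 1):            # Step 2
            committed = dict(known)             # Step 3: known labels always used
            if cautious:                        # ICA_C: commit the m most confident
                m = int(len(nodes_u) * h / n_iters)
                top = sorted(nodes_u, key=lambda v: pred[v][1], reverse=True)[:m]
                committed.update({v: pred[v][0] for v in top})
            else:                               # ICA: commit every current prediction
                committed.update({v: pred[v][0] for v in nodes_u})
            for v in nodes_u:                   # Steps 4-5: recompute relational
                f = rel_feats(v, committed)     #   features, then re-predict
                pred[v] = clf_ar(attrs[v], f)
        return pred                             # Step 7: final labels and confidences

For ICA_MC, step 6 of Figure 1 would additionally overwrite each confidence in pred with the meta-classifier's output before the next iteration's commit.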
Improving ICA_C with Meta-Caution

To address this potential problem with ICA_C, we created ICA_MC. They are identical except that ICA_MC uses a separate meta-classifier to predict how likely each prediction ŷ_i is to be correct. Below we describe ICA_MC's use of this meta-classifier, methods for generating its training data, and methods for constructing its features.

ICA_MC: Inference using predicted correct labels

Figure 1 shows that ICA_MC changes ICA_C only in step 6. In particular, after using the node classifier to predict the label ŷ_i (and associated confidence conf_i) for every node, ICA_MC computes the meta-feature values and then uses the meta-classifier M_M to predict how likely ŷ_i is to be correct. These predictions serve as the new confidence values that are then used in step 3 of the next iteration to select the committed set Y.

If the meta-classifier's confidence predictions more accurately identify those nodes whose labels are correctly predicted (compared to ICA_C's simple confidence values), then accuracy should increase.

Generating meta-training data

Learning the meta-classifier requires constructing appropriate meta-training data, which we represent as a set of vectors. Figure 2 shows the pseudocode for this task, whose algorithm employs a holdout graph (a subset of the training set) with nodes V, edges E, attributes X, and true labels Y. For each of T trials, step 3 randomly selects lp% of the nodes to be known; this value is chosen to replicate the fraction of known labels that are present in the test set. It then executes ICA_C on the graph, given the known nodes (step 4). This yields the set Z^U, which contains the label predictions and associated confidence values for each node in V^U. Using these and the expected class distribution (from the training set), it then generates a meta-training vector per node (steps 5-7). This vector includes eight meta-features (described later) and a Boolean value that indicates whether prediction ŷ_i is correct. This training data is later used to learn the meta-classifier that predicts the correctness of the ŷ_i estimates given the values of the meta-features.

We set T=10 to conduct ten trials with different known nodes each time. The goal is to reduce the bias that might otherwise occur due to the particular selection of Y^K in step 3. We later compare this with the one-trial approach (T=1).

    generateMetaVectors(V, E, X, Y, M_AR, M_A, n, T, lp, pi) =
      // V=nodes; E=edges; X=attribute vectors; Y=node labels
      // M_AR=node classifier (uses attrs. & relations); M_A=classifier
      //   (attrs. only); n=# ICA_C iters.; T=# randomized trials to use
      // lp=labeled proportion; pi=expected class distribution
      1 MetaTrainVecs <- {}
      2 for i = 1 to T do
      3   // Randomly select some nodes to be known
          Y^K <- randomSelectSomeNodes(V, Y, lp)
          V^U <- {v_i | y_i ∈ Y - Y^K}   // Nodes used for prediction
      4   // Run ICA_C to predict labels and compute confidences
          Z^U <- ICA_classify(V, E, X, Y^K, M_AR, M_A, -, n, ICA_C, pi)
      5   for each v_i ∈ V^U do          // Calc. and store meta-feature vectors
      6     mf_i <- calcMetaFeatures(i, V, E, X, Y^K, Y, Z^U, M_A, pi)
      7     MetaTrainVecs <- MetaTrainVecs ∪ {mf_i}
      8 return MetaTrainVecs

Figure 2: Pseudocode to generate training vectors for the meta-classifier used by ICA_MC.
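A compact Python rendering of Figure 2 follows (our sketch, not the authors' code; ica_classify is the routine from the earlier sketch, and calc_meta_features, holdout.true_labels, and holdout.rel_feats are hypothetical names):

    import random

    def generate_meta_vectors(holdout, clf_a, clf_ar, n_iters=10,
                              trials=10, lp=0.10):
        """Sketch of Figure 2: build meta-training vectors from a holdout
        graph whose true labels are all known."""
        meta_vectors = []
        for _ in range(trials):
            # Mark lp% of the holdout nodes as "known", matching the test set
            known_ids = random.sample(holdout.nodes,
                                      int(lp * len(holdout.nodes)))
            known = {v: holdout.true_labels[v] for v in known_ids}
            unknown = [v for v in holdout.nodes if v not in known]
            # Run ICA_C to get a prediction and confidence for each node
            preds = ica_classify(unknown, holdout.attrs, holdout.rel_feats,
                                 known, clf_a, clf_ar, n_iters, cautious=True)
            for v in unknown:
                label, conf = preds[v]
                mf = calc_meta_features(v, holdout, known, preds, clf_a)
                # Target: was ICA_C's prediction for v actually correct?
                meta_vectors.append((mf, label == holdout.true_labels[v]))
        return meta_vectors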
and store meta-feature vectors 6 mf calcmetafeatures(,v,e,x,y,y,z U,M A,) 7 MetaTranVecs MetaTranVecs mf 8 return MetaTranVecs // return all vectors of meta-features Fgure 2: Pseudocode to generate tranng vectors for the meta classfer used by ICA MC. heurstc s captured by usng, for each v, M A s confdence value for the ŷ that was predcted by M AR. nown nodes are assumed to be fully correct (score of ), though ths could be reduced to account for possble nose: P( Y yˆ x ) v V lf v V 2. Relatonal score: If a node s surrounded by nodes whose predctons are more lkely (e.g., have hgh lf scores), then ts predcton s also more lkely: rf lf N 3. Global score: Let Pror(c) be the fracton of tranng nodes wth class label c, and Posteror(c) be the fracton of test set labels predcted as c by the CC algorthm. If Posteror(c) s much hgher than Pror(c), then many nodes wth predcted label c may be ncorrect. Thus, the global score measures whether class y s over or underrepresented n the posteror dstrbuton: Pror( yˆ ) Posteror( yˆ ) gf 2 4. Node confdence: If the node classfer s confdent n some predcton ŷ (hgh posteror probablty), then ths suggests that ŷ s more lkely to be correct: cf conf v N v V v V If only ths feature s used, ICA MC devolves to ICA C. 5. Neghbor confdence: As wth the relatonal score, more confdent neghbor predctons suggest that a node s predcton s more lkely to be correct: nf cf N v N

6. Neighbor agreement: If most of node v_i's neighbors have the same predicted label, this may indicate that ŷ_i is more likely to be correct. Let count_1(N_i) and count_2(N_i) indicate the counts of the two most frequent label predictions in N_i. If the former value is large and the latter is small, then neighbor agreement is high:

    naf_i = (count_1(N_i) - count_2(N_i)) / |N_i|

7. Known neighbors: Having many known neighbors increases the chances that a node's prediction is correct:

    knf_i = |N_i ∩ V^K|

8. Known vicinity: A node's prediction may also be influenced by known nodes that are linked to it by one or more intervening nodes. We use a simple measure that favors direct known neighbors, then counts (with reduced weight) any known nodes reached via one additional node v_j:

    kvf_i = |N_i ∩ V^K| + (1/2) * SUM over v_j ∈ N_i of |N_j ∩ V^K|

Each of these eight features may not be useful for every dataset. However, ICA_MC needs only some of the features to be useful; the meta-classifier (we use logistic regression) will learn appropriate parameters for each feature based on their predictive accuracy on the meta-training data. Also, features that provide no benefit are discarded by the feature search process described later.
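To illustrate, a few of these meta-features might be computed as follows (our sketch with hypothetical names: nbrs is N_i, preds maps every node, known or committed, to its current label, lf maps nodes to local scores, and prior/posterior are the class-fraction dictionaries from feature 3):

    from collections import Counter

    def meta_features(v, y_hat, nbrs, preds, lf, prior, posterior, known):
        """Sketch of meta-features 2, 3, 6, and 7 for one node v."""
        # 2. Relational score: mean local score of the committed neighbors
        rf = sum(lf[u] for u in nbrs) / len(nbrs) if nbrs else 0.0
        # 3. Global score: is y_hat over- or under-represented in predictions?
        gf = (prior[y_hat] - posterior[y_hat]) / 2.0
        # 6. Neighbor agreement: margin between the two most common neighbor labels
        counts = Counter(preds[u] for u in nbrs).most_common(2)
        c1 = counts[0][1] if counts else 0
        c2 = counts[1][1] if len(counts) > 1 else 0
        naf = (c1 - c2) / len(nbrs) if nbrs else 0.0
        # 7. Known neighbors: committed neighbors whose labels are known
        knf = sum(1 for u in nbrs if u in known)
        return [rf, gf, naf, knf]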
Evaluation

Hypotheses. By default, ICA_MC uses feature search and ten randomized training data trials. This ICA_MC attains higher accuracies than ICA_C (Hypothesis #1), ICA_MC without such trials (#2), ICA_MC without feature search (#3), and ICA_MC with just the three features used by Bilgic and Getoor (#4).

Data Sets. We used the following data sets (see Table 1):

1. Cora (see Sen et al. 2008): A collection of machine learning papers categorized into seven classes.
2. CiteSeer (see Sen et al. 2008): A collection of research papers drawn from the CiteSeer collection.
3. WebKB (see Neville and Jensen 2007): A collection of web pages from four computer science departments.
4. Synthetic: We generate synthetic data using Sen et al.'s (2008) graph generator. Similar to their defaults, we use a degree of homophily of 0.7 and a link density of 0.4.

Table 1: Data sets summary. (For Cora, CiteSeer, WebKB, and the synthetic data, the table reports the total nodes, average number of nodes per test set, average links per node, number of class labels, number of non-relational features available and used, number of relational features used, and number of folds.)

Feature Representation. Our node representation includes relational features and non-relational attributes, as described below.

Non-relational (content) attributes: The real datasets are all textual. We use a bag-of-words representation for the textual content of each node, where the feature corresponding to a word is assigned true if it occurs in the node and false otherwise. Our version of the WebKB dataset has 100 words available. For Cora and CiteSeer, we used information gain to select the 100 highest-scoring words, based on McDowell et al. (2007), which reported that using more did not improve performance. Our focus is on the case where relatively few attributes are available (or the attributes are not very predictive), as may occur in large real-world networks (cf. Macskassy and Provost 2007; Gallagher et al. 2008). Thus, for most of our experiments we randomly select 10 of the 100 available words to use as attributes. We also briefly discuss results when using 100 attributes. For the synthetic data, ten binary attributes are generated using the technique described by McDowell et al. (2009). This model has a parameter ap (attribute predictiveness) that ranges from 0.0 to 1.0; it indicates how strongly predictive the attributes are of the class label. We evaluate ap using the values {0.2, 0.4, 0.6}.

Relational features: Each relational feature value is a multiset. For instance, a possible feature value is {3 of A, 2 of B, 1 missing}, which indicates that a node links to 3 other nodes whose predicted label is A, 2 nodes whose prediction is B, and 1 node labeled missing. During inference, each label in the multiset (excluding missing labels) is separately used to update the probability that a node has label c. This is the "independent value" approach that was introduced by Neville et al. (2003), used by Neville and Jensen (2007), and shown to be superior to "count" or "proportion" features by McDowell et al. (2009). See Neville et al. (2003) for more details.

For Cora and CiteSeer, we compute a multiset feature using only incoming links, and a separate such feature using only outgoing links. For WebKB, we also compute one such feature using co-citation links (a co-citation link exists between nodes i and j if some node k links to both of them). For the synthetic data, the links are undirected, so there is a single relational feature.
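The independent-value treatment can be sketched as a naive Bayes update in which each committed neighbor label contributes one factor; this is our reading of Neville et al. (2003), with hypothetical names, not the authors' code:

    import math

    def relational_log_score(c, neighbor_labels, cond_prob):
        """Sketch of the independent-value approach: each non-missing label
        in the multiset of committed neighbor labels independently updates
        the log-probability that the node has class c. cond_prob[c][lab] is
        P(neighbor label = lab | node class = c), learned on the training
        graph (illustrative name)."""
        score = 0.0
        for lab in neighbor_labels:      # one factor per linked node
            if lab is not None:          # 'missing' labels are skipped
                score += math.log(cond_prob[c][lab])
        return score

    # A node linking to 3 predicted-A and 2 predicted-B neighbors, with one
    # neighbor uncommitted (missing):
    # relational_log_score('A', ['A', 'A', 'A', 'B', 'B', None], cond_prob)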

Classifiers. For the node classifier, we used a naïve Bayes classifier. McDowell et al. (2009) reported that, using multiset features, it attained higher accuracies than did alternatives such as logistic regression. For the meta-classifier, we used logistic regression, as did Bilgic and Getoor (2008). Future work should consider other choices.

Test Procedure. We conducted an n-fold cross-validation study for each tested algorithm. For WebKB, we treated each of the four schools as a separate fold. For Cora and CiteSeer, we created five disjoint test sets by using similarity-driven snowball sampling (McDowell et al. 2009). This is similar to the approach of Sen et al. (2008). For all 3 datasets we tested on one graph, trained on two others, and used the remaining two (one for WebKB) as a holdout set for learning the meta-classifier and performing the meta-feature search. For the synthetic data, we performed 25 separate trials. For each trial we generated three disjoint graphs: one test set, one training set, and one holdout set. We randomly selected lp=10% of each test set to form V^K (nodes with known labels). This is a sparsely labeled task, which is common in real data (Gallagher et al. 2008).

To search for which of the eight meta-features to use with ICA_MC, we use the simple, greedy Backwards Sequential Elimination (BSE) algorithm (Kittler 1986). It evaluates accuracy on the holdout set with ICA_MC, then recursively eliminates any meta-feature whose removal increases accuracy. To increase robustness, accuracy is averaged over ten executions of ICA_MC, each time using a different set of initial known labels (as done for T=10 in Figure 2). The final set of meta-features is used for testing.
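In outline, BSE behaves as follows; this is one greedy rendering of Kittler's (1986) procedure, assuming an evaluate(features) helper (not from the paper) that runs ICA_MC on the holdout set ten times and returns mean accuracy:

    def backwards_sequential_elimination(all_features, evaluate):
        """Greedy BSE sketch for meta-feature search: repeatedly drop the
        feature whose removal most improves holdout accuracy; stop when no
        single removal helps."""
        selected = list(all_features)
        best = evaluate(selected)
        improved = True
        while improved and len(selected) > 1:
            improved = False
            for f in list(selected):
                candidate = [g for g in selected if g != f]
                acc = evaluate(candidate)
                if acc > best:           # removing f helps: eliminate it
                    best, selected, improved = acc, candidate, True
                    break
        return selected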
Tested Algorithms. We tested ICA, ICA_C, and ICA_MC. In addition, to assess the utility of ICA_MC's design decisions, we also tested three of its ablated variants:

1. "1 trial instead of 10": this uses only one randomized trial to collect meta-training data (i.e., T=1 in Figure 2) and only one evaluation trial for the meta-feature search.
2. "No meta-feature search": this skips search and uses all eight meta-features that were previously described.
3. "Only Bilgic meta-feats": this uses just features #1, #2, and #3, the set used by Bilgic and Getoor (2008).

Performance Measure. We compared all the algorithms on their average classification error rate on the test sets.

Analysis. We performed independent analyses for each prediction task and joint analyses by pooling the observations, either for all the real data sets or for all the synthetic data conditions shown. Our analysis uses one-tailed paired t-tests accepted at the 95% confidence level.

Results. Table 2 displays the classification error rates averaged over all the folds for each algorithm. For each (data set, algorithm) pair, the best result is shown in bold.

Table 2: Average % classification error rate. (For each of the real datasets (Cora, CiteSeer, WebKB) and the synthetic data (ap=0.2, 0.4, 0.6), the table reports error rates for the core algorithms ICA, ICA_C, and ICA_MC; the gain from meta-caution (ICA_C to ICA_MC); and the three ablated variants of ICA_MC: "1 trial instead of 10," "No meta-feat. search," and "Only Bilgic meta-feats." Results significantly worse than ICA_MC are marked.)

Result 1: ICA_MC significantly outperforms ICA_C and ICA when attribute predictiveness is low. Comparing ICA_MC with ICA_C, we find that ICA_MC reduces classification error on every real data set, and by 1.9-6.9% for the synthetic data. This improvement is significant in every case (p < .03 for the real data and p < .045 for the synthetic data). In addition, the pooled analyses found significant gains for both the real and synthetic data. Therefore, we accept Hypothesis #1.

For the synthetic data, the gains clearly decrease as attribute predictiveness (ap) increases. This is consistent with the results of McDowell et al. (2009), who report that the cautious use of relational information is more important for CC algorithms when ap and/or the number of attributes is small. Since ICA_MC is even more cautious than ICA_C, ICA_MC has larger gains over ICA_C when ap is small (the same relative trend exists between ICA_C and the non-cautious ICA). Nonetheless, ICA_MC continues to provide a small gain even when ap is high: a gain of 0.9% when ap=0.8 (results not shown).

For the real data, ICA_MC provides gains for all three datasets, where the largest gain is with WebKB. WebKB has more complex and numerous linking patterns (Macskassy and Provost 2007). For this reason, ICA_MC's careful selection of which neighboring labels to use for prediction may be especially important with WebKB.

We repeated these experiments with real data using 2, 5, or 20 attributes (instead of 10) and found similar results. In every case pooled analyses found a significant gain for ICA_MC over ICA_C, with the largest gains occurring with WebKB. As with the synthetic data, these gains diminish when the attributes are more predictive. For instance, when 100 attributes are used the gains of ICA_MC remained but were small (0.2-1.0%) and statistically insignificant. These results suggest that ICA_MC is especially helpful when the attributes alone are not very predictive, and at least does no harm otherwise.

Result 2: ICA_MC with randomized trials and meta-feature search outperforms simpler variants. The bottom of Table 2 shows results with the variants of ICA_MC that do not use multiple randomized trials or that do less or no meta-feature search. ICA_MC outperforms the "1 trial instead of 10" and "Only Bilgic meta-feats" variants, often significantly, and pooled analyses find that ICA_MC outperforms both, for the real and for the synthetic data. Thus, we accept Hypotheses #2 and #4. ICA_MC also significantly outperforms the variant that uses all eight meta-features ("No meta-feat. search") for the real data, but not for the synthetic data (perhaps because simpler, undirected linking patterns were used in the synthetic data). Thus, we reject Hypothesis #3. Despite the rejection of one hypothesis, ICA_MC always outperformed all three variants (or lagged by at most 0.2%) and significantly outperformed all three variants on the real datasets. Some of the variants that simplify ICA_MC's search process sometimes performed notably worse than even ICA_C. Together, these results suggest that the complete ICA_MC, with randomized trials and feature search, is the most robust performer.

Discussion

ICA_MC increased accuracy compared to ICA and ICA_C. However, why does ICA_MC's meta-classifier more effectively identify reliable predictions than does ICA_C's node classifier? First, the meta-classifier's task is simpler: choosing between two values (correct or incorrect) vs. between all possible class labels. Second, the meta-classifier can use additional information, such as the number of known labels, which has no obvious utility for predicting a particular label, but does help estimate the correctness of the resultant prediction. Finally, using two different classifiers helps to reduce the bias due to using the naïve Bayes node classifier alone.

Meta-feature search often significantly increased ICA_MC's accuracy. However, is the same set of features almost always chosen? On average, the global score and node confidence features were selected most often, and known neighbors least often. This varied substantially, however, with some features selected 90% of the time for one dataset and never for another. These results, combined with the results from Table 2, suggest that search is essential to make ICA_MC robust across different data, even if the default set of meta-features is further refined.

We are not aware of any other work that uses a meta-classifier to improve the operation of a CC inference algorithm, although Bilgic and Getoor (2008) did use a similar predictor to identify the least likely CC label predictions (in order to purchase the correct labels for them). In contrast, we seek the most likely predictions (to favor them for inference). They considered three features for this different task, which our search algorithm selected for ICA_MC 62%, 67%, and 91% of the time, respectively. Thus, their features are also useful for our task, although the results of the previous section show that using only those features leads to very poor performance for ICA_MC.

Compared to ICA_C, ICA_MC requires additional computation: to execute ICA_C when collecting meta-training data, to execute ICA_MC for feature selection, and to train the meta-classifier for each combination of meta-features that is considered. However, in many real-world graphs each node links to at most k other nodes, in which case each of these steps is linear in the number of nodes. In addition, once the meta-classifier is learned, ICA_MC requires little additional time for inference compared to ICA_C (i.e., it needs only one additional execution of the meta-classifier per iteration).

Conclusion

We demonstrated that Meta-Cautious ICA (ICA_MC) significantly outperforms ICA_C for many tasks. Moreover, we showed that aspects of ICA_MC, in particular its use of multiple randomized training data trials and its use of search for selecting meta-features, were essential to achieving performance that was robust across a range of datasets. Since ICA_C has already been shown to be a very effective CC algorithm, these results suggest that ICA_MC should be seriously considered for CC applications, particularly when attributes alone do not yield high predictive accuracy. Further work is needed to confirm our results using other datasets, meta-features, and classifiers, and to consider how meta-caution might be extended to other CC algorithms. In addition, we intend to consider techniques for further reducing the time complexity of ICA_MC compared to ICA_C.

Acknowledgements

We thank the Naval Research Laboratory for supporting this work, and Brian Gallagher, Gita Sukthankar, and the anonymous reviewers for their helpful comments.

References
Bilgic, M., and Getoor, L. (2008). Effective label acquisition for collective classification. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 43-51). Las Vegas, NV: ACM.

Chakrabarti, S., Dom, B., and Indyk, P. (1998). Enhanced hypertext categorization using hyperlinks. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 307-318). Seattle, WA: ACM.

Gallagher, B., Tong, H., Eliassi-Rad, T., and Faloutsos, C. (2008). Using ghost edges for classification in sparsely labeled networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 256-264). Las Vegas, NV: ACM.

Jensen, D., Neville, J., and Gallagher, B. (2004). Why collective inference improves relational classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 593-598). Seattle, WA: ACM.

Kittler, J. (1986). Feature selection and extraction. In T.Y. Young & K.S. Fu (Eds.), Handbook of Pattern Recognition and Image Processing. San Diego, CA: Academic Press.

Macskassy, S., and Provost, F. (2007). Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research, 8, 935-983.

McDowell, L., Gupta, K.M., and Aha, D.W. (2007). Cautious inference in collective classification. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence (pp. 596-601). Vancouver, Canada: AAAI.

McDowell, L., Gupta, K.M., and Aha, D.W. (2009). Cautious collective classification. Journal of Machine Learning Research, 10, 2777-2836.

Neville, J., and Jensen, D. (2000). Iterative classification in relational data. In L. Getoor and D. Jensen (Eds.), Learning Statistical Models from Relational Data: Papers from the AAAI Workshop (Technical Report WS-00-06). Austin, TX: AAAI.

Neville, J., and Jensen, D. (2007). Relational dependency networks. Journal of Machine Learning Research, 8, 653-692.

Neville, J., Jensen, D., and Gallagher, B. (2003). Simple estimators for relational Bayesian classifiers. In Proceedings of the Third IEEE International Conference on Data Mining (pp. 609-612). Melbourne, FL: IEEE.

Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., and Eliassi-Rad, T. (2008). Collective classification in network data. AI Magazine, 29(3), 93-106.

Taskar, B., Abbeel, P., and Koller, D. (2002). Discriminative probabilistic models for relational data. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (pp. 485-492). Edmonton, Canada: Morgan Kaufmann.


Incremental Learning with Support Vector Machines and Fuzzy Set Theory

Incremental Learning with Support Vector Machines and Fuzzy Set Theory The 25th Workshop on Combnatoral Mathematcs and Computaton Theory Incremental Learnng wth Support Vector Machnes and Fuzzy Set Theory Yu-Mng Chuang 1 and Cha-Hwa Ln 2* 1 Department of Computer Scence and

More information

NetCycle: Collective Evolution Inference in Heterogeneous Information Networks

NetCycle: Collective Evolution Inference in Heterogeneous Information Networks : Collectve Evoluton Inference n Heterogeneous Informaton Networks Yzhou Zhang,2 Yun Xong,2 Xangnan Kong 3 Yangyong Zhu,2 Shangha Key Laboratory of Data Scence, Shangha, Chna 2 School of Computer Scence,

More information

Adaptive Transfer Learning

Adaptive Transfer Learning Adaptve Transfer Learnng Bn Cao, Snno Jaln Pan, Yu Zhang, Dt-Yan Yeung, Qang Yang Hong Kong Unversty of Scence and Technology Clear Water Bay, Kowloon, Hong Kong {caobn,snnopan,zhangyu,dyyeung,qyang}@cse.ust.hk

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

Learning to Classify Documents with Only a Small Positive Training Set

Learning to Classify Documents with Only a Small Positive Training Set Learnng to Classfy Documents wth Only a Small Postve Tranng Set Xao-L L 1, Bng Lu 2, and See-Kong Ng 1 1 Insttute for Infocomm Research, Heng Mu Keng Terrace, 119613, Sngapore 2 Department of Computer

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

Comparing High-Order Boolean Features

Comparing High-Order Boolean Features Brgham Young Unversty BYU cholarsarchve All Faculty Publcatons 2005-07-0 Comparng Hgh-Order Boolean Features Adam Drake adam_drake@yahoo.com Dan A. Ventura ventura@cs.byu.edu Follow ths and addtonal works

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

GA-Based Learning Algorithms to Identify Fuzzy Rules for Fuzzy Neural Networks

GA-Based Learning Algorithms to Identify Fuzzy Rules for Fuzzy Neural Networks Seventh Internatonal Conference on Intellgent Systems Desgn and Applcatons GA-Based Learnng Algorthms to Identfy Fuzzy Rules for Fuzzy Neural Networks K Almejall, K Dahal, Member IEEE, and A Hossan, Member

More information

SHAPE RECOGNITION METHOD BASED ON THE k-nearest NEIGHBOR RULE

SHAPE RECOGNITION METHOD BASED ON THE k-nearest NEIGHBOR RULE SHAPE RECOGNITION METHOD BASED ON THE k-nearest NEIGHBOR RULE Dorna Purcaru Faculty of Automaton, Computers and Electroncs Unersty of Craoa 13 Al. I. Cuza Street, Craoa RO-1100 ROMANIA E-mal: dpurcaru@electroncs.uc.ro

More information

Self-tuning Histograms: Building Histograms Without Looking at Data

Self-tuning Histograms: Building Histograms Without Looking at Data Self-tunng Hstograms: Buldng Hstograms Wthout Lookng at Data Ashraf Aboulnaga Computer Scences Department Unversty of Wsconsn - Madson ashraf@cs.wsc.edu Surajt Chaudhur Mcrosoft Research surajtc@mcrosoft.com

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z.

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z. TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS Muradalyev AZ Azerbajan Scentfc-Research and Desgn-Prospectng Insttute of Energetc AZ1012, Ave HZardab-94 E-mal:aydn_murad@yahoocom Importance of

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Contrary to Popular Belief Incremental Discretization can be Sound, Computationally Efficient and Extremely Useful for Streaming Data

Contrary to Popular Belief Incremental Discretization can be Sound, Computationally Efficient and Extremely Useful for Streaming Data Contrary to Popular Belef Incremental Dscretzaton can be Sound, Computatonally Effcent and Extremely Useful for Streamng Data Geoffrey I. Webb Faculty of Informaton Technology, Monash Unversty, Vctora,

More information

An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback

An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback An enhanced representaton of tme seres whch allows fast and accurate classfcaton, clusterng and relevance feedback Eamonn J. Keogh and Mchael J. Pazzan Department of Informaton and Computer Scence Unversty

More information

A Semi-parametric Regression Model to Estimate Variability of NO 2

A Semi-parametric Regression Model to Estimate Variability of NO 2 Envronment and Polluton; Vol. 2, No. 1; 2013 ISSN 1927-0909 E-ISSN 1927-0917 Publshed by Canadan Center of Scence and Educaton A Sem-parametrc Regresson Model to Estmate Varablty of NO 2 Meczysław Szyszkowcz

More information