Deep Classifier: Automatically Categorizing Search Results into Large-Scale Hierarchies

Size: px
Start display at page:

Download "Deep Classifier: Automatically Categorizing Search Results into Large-Scale Hierarchies"

Transcription

Dikan Xing 1, Gui-Rong Xue 1, Qiang Yang 2, Yong Yu 1
1 Shanghai Jiao Tong University, Shanghai, China {xiaobao,grxue,yyu}@apex.sjtu.edu.cn
2 Hong Kong University of Science and Technology, Hong Kong, China qyang@cse.ust.hk

ABSTRACT
Organizing Web search results into hierarchical categories facilitates users' browsing of Web search results, especially for ambiguous queries whose potential results are mixed together. Previous methods for search-result classification are usually based on pre-training a classification model on fixed and shallow hierarchical categories, where only the top two levels of a Web taxonomy are used. Such classification may be too coarse for users to browse, since most search results are classified into only two or three shallow categories. Instead, a deep hierarchical classifier must provide many more categories. However, the performance of such classifiers is usually limited because their classification effectiveness can deteriorate rapidly at the third or fourth level of a hierarchy. In this paper, we propose a novel algorithm, known as Deep Classifier, to classify search results into detailed hierarchical categories with higher effectiveness than previous approaches. Given the search results in response to a query, the algorithm first prunes a wide-ranging hierarchy into a narrow one with the help of Web directories. Different strategies are proposed to select the training data by utilizing the hierarchical structure. Finally, a discriminative naïve Bayesian classifier is developed to perform efficient and effective classification. As a result, the algorithm can provide more meaningful and specific class labels for search-result browsing than shallow classification. We conduct experiments to show that the Deep Classifier can achieve significant improvement over state-of-the-art algorithms. In addition, with sufficient off-line preparation, the efficiency of the proposed algorithm is suitable for online application.
Categories and Subject Descriptors
H.4.m [Information Systems Applications]: Miscellaneous; I.5.4 [Pattern Recognition]: Applications - Text processing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WSDM'08, February 11-12, 2008, Palo Alto, California, USA. Copyright 2008 ACM /08/ $5.00.

General Terms
Algorithms, Experimentation

Keywords
Deep Classifier, Search Result Mining, Hierarchical Classification, Hierarchy Pruning

1. INTRODUCTION
With the increasing prevalence of Web technologies, Web search has become essential in everyday life. Current search engines typically return a long list of search results in response to a user-issued query. Although the most authoritative results may be ranked high among all results under a proper ranking algorithm (e.g., PageRank [20]), it remains a question whether the most authoritative pages are what the user actually wants. In particular, when a query is inherently ambiguous and the first pages are not the intended results, users may have difficulty in subsequently browsing the rest of the pages. For example, if a user wants to find the benefits of the apple fruit and issues the query "apple", all results on the first page focus on the computer company Apple, rather than the intended topic of the fruit. This is because the Apple company is so well known that Web pages about that sense are more likely to be well linked and thus ranked higher. To make matters worse, the results about the fruit are scattered widely throughout the list of search results.
For the fruit sense of "apple", the results are placed at positions 14, 37, 38, 62, 63, 66, 67, 71, 75, 76, 82, 84 and 88 among the top 100 search results from the Google search engine. A solution is to perform classification on the search results [5, 8]. As described in [5, 8], automatic category organization brings many advantages for users: the search results are automatically compiled into a hierarchy according to their different potential meanings. As pointed out in these two papers, users preferred the category interface much more than the list interface; in fact, they were 50% faster in finding information that was organized into categories. Generally, there are two types of models for classification: shallow classification and deep classification. Shallow classification [5, 8] trains the model on the top level or the top two levels of a hierarchy, while deep classification [15] learns the classification models on an entire large-scale hierarchy. However, the shallow classification scheme may be too coarse for the given search results when all the resulting items are placed in only two or three classes. In contrast, placing a search result in a category of a deep hierarchy provides more content information about the result than a shallow hierarchy can. Among previous research, Liu et al. [15] introduced a top-down classification strategy on a deep but wide target hierarchy. The width leads to the large size of the hierarchy, resulting in a performance decline as the hierarchical depth increases: the Macro-F1 measure was near 20% at the second level, and decreased to near 10% at the fifth. In addition to the accuracy decrease, training the model on a global hierarchy may not be a good choice and can yield poor performance. For example, the classification results of the query "apple" are expected to be distributed among categories such as Health and Computers (if these categories are available), while results of the query "Saturn" are expected to be distributed among categories like Science, Games and Cars. A classifier over the two categories Health and Computers is expected to outperform a classifier over more categories such as Health, Computers, Science, Games and Cars on search results for the query "apple". This implies that it is desirable to employ an adaptive method for creating classifiers, one that uses different target categories for different user queries.

According to the above observations, in this paper we propose a novel algorithm, known as Deep Classifier, to classify search results into a large and deep target hierarchy adaptively by utilizing existing hierarchies on the Web. We do this as follows. We first prune a large hierarchy into one with a smaller size while keeping the original hierarchical structure. This is accomplished by first querying an online Web directory with the specified query and then retrieving all ancestors of the returned deep categories. This results in a deep but narrow target hierarchy derived from the originally large and wide one.
In this way, a different target hierarchy is created adaptively for each user query. The leaf nodes in such a hierarchy are taken as the category candidates for search-result categorization. Based on the narrow and deep hierarchy, we propose different strategies for training data selection with the help of the hierarchical structure. The classification models are learned online using an efficient implementation of the discriminative naïve Bayesian classifier. Experimental results show that our Deep Classifier algorithm can achieve significant improvement over state-of-the-art algorithms. Furthermore, with sufficient off-line preprocessing, the efficiency of the proposed algorithm is suitable for online application. As a result, the entire algorithm can provide more meaningful and specific class labels for search-result browsing than its shallow counterparts. It is worth mentioning that online Web directories such as ODP and the Yahoo! Directory are straightforward solutions to the problem described above. Searching these Web directories, users receive search results attached with category attributes, which have already been assigned by human editors and are of course hierarchical. However, the obvious deficiency of these solutions is that the number of assembled Web pages is limited in any human-maintained Web directory. The pages one can search in these Web directories are far fewer than those indexed by a search engine.

[Figure 1: Overview of Deep Classifier. The query is sent both to ODP, which yields a pruned hierarchy and training data selection, and to a search engine, which yields search results; the classification step combines the two to produce classified results.]

Our solution combines Web directories and search engines. From the former, we gain a rich (large and deep) concept hierarchy, and from the latter, we are able to search among the billions of pages indexed by the search engine. The remainder of this paper is organized as follows. Section 2 gives a brief overview of the Deep Classifier algorithm. In Section 3, we propose different strategies for training data selection. In Section 4, a discriminative naïve Bayesian classifier is presented.
The implementation issues are described in Section 5. We report and analyze the results of a series of experiments on our proposed algorithm in Section 6. Related work is discussed in Section 7. In Section 8, we give a conclusion and discuss future work.

2. OVERVIEW OF DEEP CLASSIFIER
In this section, we give an overview of our Deep Classifier algorithm. Figure 1 illustrates the flowchart of our system. A user first issues a query, which is submitted to an online Web directory (e.g., the Open Directory Project (ODP)) and a search engine (e.g., Google) simultaneously, to get the category information and the search results, respectively. The online Web directory responds with a list of categories that are relevant to the query. For example, when the query "Saturn" is issued to ODP, ODP returns five categories (the categories in bold font in the right part of Figure 2). After creating the pruned hierarchy from these five categories, only twenty-four ancestors remain. Compared with the entire hierarchy shown on the left of Figure 2, this narrowing-down procedure greatly reduces the number of target category candidates. The leaf nodes in the pruned hierarchy are regarded as our target category candidates. These nodes may have offspring in the original large hierarchy, but those offspring are now pruned away. In other words, we may also pick internal nodes of the original large hierarchy, instead of only leaf nodes, as our target category candidates.

[Figure 2: Pruning the large and deep hierarchy down to a decent size. The left side shows the full ODP top level (Arts, Business, Computers, Games, Health, Home, Kids and Teens, News, Recreation, Reference, Science, Shopping, Society, Sports, with their document counts). The right side shows the hierarchy pruned for the query "Saturn": five candidate categories in bold remain, such as Recreation/Autos/Makes and Models/Saturn, a Saturn category under Science (planets), one under Games (Sega Saturn), and Arts/Animation/.../Sailor Saturn, together with their retained ancestors.]

Next, based on the structure of the pruned hierarchy, three training data selection strategies are proposed in Section 3 by utilizing the hierarchical structure. This step is important since the labeled Web pages under one ODP category are usually too few to train a reliable classifier. Then, based on the training data, we perform classification model learning. Since our algorithm runs online, it is important for it to be efficient. To satisfy this goal, we propose a simple classifier based on naïve Bayes, which is described in Section 4. We will also compare our results with a Support Vector Machine in the experiments, to see how long one would have to wait to obtain a possibly better classification from an SVM. Finally, we classify the search results and present them through a hierarchical search-result interface. Compared with a top-level categorization or a top-two-level categorization proposed in [5, 8], such as

Arts; Games; Kids and Teens; Recreation; Science

or

Arts/Animation; Games/Video Games; Kids and Teens/School Time; Recreation/Autos; Science/Astronomy

our proposed Deep Classifier can (1) provide an interface with more meaningful class labels, shown in the right part of Figure 2, than the top-level or top-two-level style, and (2) classify search results more delicately.
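The narrowing-down step described above can be sketched in a few lines. The sketch below is illustrative only (the function name and path representation are our own, not the paper's implementation); it keeps each candidate category returned by the Web directory plus all of its ancestors, and treats the leaves of the retained set as the target category candidates.

```python
# Sketch of hierarchy pruning: retain each candidate category and all
# of its ancestors; the leaves of the retained set become the targets.

def prune_hierarchy(candidate_paths):
    """candidate_paths: list of category paths from the root, e.g.
    [["Top", "Science", "Astronomy", "Planets", "Saturn"], ...].
    Returns (retained nodes, leaf nodes); each node is its full path."""
    retained = set()
    for path in candidate_paths:
        # Add the candidate and every ancestor along its path.
        for depth in range(1, len(path) + 1):
            retained.add(tuple(path[:depth]))
    # A retained node is a leaf if no other retained node extends it.
    leaves = {node for node in retained
              if not any(other[:len(node)] == node and len(other) > len(node)
                         for other in retained)}
    return retained, leaves

retained, leaves = prune_hierarchy([
    ["Top", "Science", "Astronomy", "Planets", "Saturn"],
    ["Top", "Recreation", "Autos", "Makes and Models", "Saturn"],
])
# The two candidate paths share only "Top", so 9 nodes are retained
# and the 2 deepest nodes become the category candidates.
```

For the query "Saturn" in Figure 2, the same procedure applied to the five ODP candidates yields the twenty-four retained ancestors mentioned above.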
The second point is not very obvious in this case, since the number of category candidates does not change between shallow and deep classification, but we will show examples of this kind in the experiment section later (in Table 1).

3. TRAINING DATA SELECTION
In this section, we discuss strategies for training data selection, a very important issue in our task. In Figure 2, the number after each category name is the number of training documents attached to the category (and its offspring). One can see that some category candidates contain very few labeled Web pages, which is insufficient to build a reliable classifier. We propose three different strategies for selecting the training data; the best one is used in our final design of the Deep Classifier.

3.1 Flat Strategy
The flat strategy is a simple method for training data selection, in which we transform the hierarchical classification task into a flat classification task. From the viewpoint of hierarchical classification, this strategy places all the category candidates (those in bold font in Figure 3) directly at the root, as shown in Figure 4. Then we directly classify the search results into the category candidates with a flat classifier. As shown in Figure 4, we can directly train the model using the data from categories 44, 85, 205, 66, 874, 902, 42, 677 and 707. This method is simple to use, but it does not consider the hierarchical structure of the Web directory.

3.2 Hierarchical Strategy
Since the hierarchy is pruned to a manageable size, the existing top-down style can be tried even for an online application.

[Figure 3: An example of a pruned hierarchy.]

We now describe the algorithm according to the tree structure shown in Figure 3. The category candidates are shown in bold font, and the hierarchical strategy first classifies (estimates the probabilities of) the search results into the categories marked 17, 23, 66 and 27. Thus a classifier is created by selecting training data under these nodes (and their offspring). The estimated probability of each non-candidate category is propagated to its candidate offspring.
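This top-down scoring, in which per-level conditional probabilities are multiplied along the path from the root to a candidate, can be sketched as follows. The helpers `children` (node to list of child nodes in the pruned tree) and `cond_prob` (the per-level classifier's estimate of P(child | parent, x)) are hypothetical stand-ins for the per-level classifiers described in the text.

```python
# Sketch of the hierarchical strategy: P(c|x) is the product of the
# per-level conditional probabilities down the path from the root.

def path_probability(path, x, cond_prob, children):
    """path: [c_1, ..., c_l] from the root to a candidate; returns P(c|x)."""
    prob = 1.0  # P(c_1|x) = 1 at the root
    for parent, child in zip(path, path[1:]):
        if len(children[parent]) == 1:
            # Only one child in the pruned tree: probability 1 is
            # assigned directly, no classifier needs to be trained.
            continue
        prob *= cond_prob(parent, child, x)
    return prob
```

A search result x is then assigned to the candidate whose path maximizes this product, as formalized below.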
The rest of the algorithm is similar to that at the top level. For example, another classifier is created by selecting the training data from categories 102 and 203 and then classifying the search results into them.

[Figure 4: Flat strategy.]

[Figure 5: Ancestor-Assistant strategy.]

Formally, we represent the category candidate c together with its ancestors, denoted by c_1, c_2, ..., c_l, where l is the length of the path from the root to the category candidate c; c_1 is the root node and c_l is the category candidate itself. The probability P(c_1 | x) equals 1. We formalize the probability P(c | x) that a test case x belongs to c as:

P(c | x) = P(c_1, c_2, ..., c_l | x)
         = P(c_l | x, c_1, c_2, ..., c_{l-1}) P(c_1, ..., c_{l-1} | x)
         = ...
         = ∏_{k=1}^{l} P(c_k | x, c_1, ..., c_{k-1})

Each conditional probability in the product is estimated by a classifier at the k-th level under the category c_k. If there is only one child under c_k in the pruned tree, the probability 1 is assigned directly, without the cost of training a classifier. Search results are classified by finding the c that maximizes the a posteriori probability P(c | x). This strategy differs from the other two in that it requires learning more than one classifier before making a final classification decision. Several state-of-the-art algorithms work in a similar hierarchical way [5, 8, 15, 3]; presenting this strategy allows us to compare against such work.

3.3 Ancestor-Assistant Strategy
The ancestor-assistant strategy is guided by the following two considerations. First, the training data from the category candidate itself may be insufficient in size, especially for a deep category, so we need to obtain more data elsewhere. Second, the training data from its high ancestors may be too general to reflect the characteristics of the deep category candidate. In a word, we want to borrow some training data from the ancestors, but should not go too high up. Hence, we propose a trade-off between the hierarchical strategy and the flat strategy: we combine the training data from the category candidate itself with those from its ancestors and siblings.
More precisely, we find the farthest ancestor of the category candidate that is not an ancestor of any other category candidate, and pick up the Web pages at this ancestor and all of its offspring. In Figure 3, the training data for category 874 come from category 834 and all offspring of 834, while the training data for category 902 come from category 854 and all offspring of 854, since the common ancestor is category 24. From the viewpoint of hierarchical classification, we attach to the root the maximal subtree containing one and only one category candidate. The tree in Figure 5 illustrates this strategy.

4. CLASSIFICATION MODEL
Due to the adaptive nature of the problem, a classifier that is fast to learn is preferred, because the classifiers are trained online at query time. If a classifier such as an SVM were employed, the long training time might prevent us from delivering the results in a timely manner. To this end, we prefer the naïve Bayesian classifier.

4.1 Event Model
There are different event models for the inputs of a naïve Bayesian text classifier. [16] mentioned two: the multi-variate Bernoulli model and the multinomial model. These two models regard each document as a whole, and the joint distribution of whole documents and target categories is considered. In this paper, we use the event model introduced in [17], which makes it easy to interpret our later modification of the model. We regard each document as a sequence of random variables A_1, ..., A_n, where n is the length of the sequence (document). Each A_i corresponds to the i-th word in the document. These random variables are independently, identically distributed, as the naïve assumption says. This means that an observation at any position in a document can be used in the estimates for all positions. The support of these variables is the set of distinct word IDs. In the later discussion, we do not explicitly attach a position subscript to A. We then model the joint distribution not between documents and categories directly, but between words and categories, and a document is the intersection of a sequence of word events.
In other words, if one takes an N-faced die, where N denotes the vocabulary size, and tosses it a finite number of times, this process generates a document by recording the result of each toss.

4.2 Naïve Bayesian Classifier
Under the above event model, the classifier estimates P(A = w_j | c_i) for all i and j, where j iterates over all words and i iterates over all target candidates. The classifier estimates the probability that a document (a sequence of word events) belongs to a category by computing

P(c_i | v) ∝ P(v | c_i) P(c_i) = P(c_i) ∏_j P(A = w_j | c_i)^{v_j}    (1)

where c_i is a category, v is the document, N is the vocabulary size, w_j is the j-th word in the vocabulary, and v_j is the corresponding count of word w_j in v. P(c_i) is the a priori probability that a document v belongs to category c_i, and it is usually estimated by counting the training data of the different categories. However, in our setting, the training data are taken from manually created hierarchies such as ODP, while the test examples come from search engines such as Google. The obtained top 100 (or 200) search results (test cases) may be incomplete, which means the distribution of categories may not be the same as that in a manually created hierarchy. We observed this inconsistency for most of our evaluated queries. Therefore, we believe it is preferable to weaken the a priori probabilities and remove the P(c_i) term from the formulation. This can also be regarded as the assumption that the a priori probability of each category is flattened to the same value, 1/n, where n is the number of categories, according to the maximum entropy principle. Besides, [12] assumed that the a priori probability did not actually have a great impact.

4.3 Discriminative Naïve Bayesian Classifier
Applying Bayes' theorem, we obtain

P(A = w_j | c_i) = P(A = w_j, c_i) / P(c_i) = P(c_i | A = w_j) P(A = w_j) / P(c_i).

Under the assumption that all P(c_i) are equal, we can rewrite

∏_j P(A = w_j | c_i)^{v_j} = ∏_j ( P(c_i | A = w_j) P(A = w_j) / P(c_i) )^{v_j}
                           ∝ ∏_j P(c_i | A = w_j)^{v_j}

since all the P(A = w_j) are identical across different categories given the document. We thus arrive at

P(c_i | v) ∝ ∏_j P(c_i | A = w_j)^{v_j}    (2)

This form facilitates another way of estimating the parameters. In the form of (1), each term P(A = w_j | c_i)^{v_j} in the score function indicates that if w_j is common in a category c_i, then a test case with a high occurrence count of w_j will receive a high score. Likewise, if w_j occurs very rarely in a category c_i, then a test case with a low occurrence count of w_j will receive little penalty from the occurrence of w_j. Thus, we can regard each P(A = w_j | c_i) as a vote on the similarity between the category c_i and the test case v. The ultimate score is the combination of the votes on similarity from all occurring words.
In the form of (2), we can instead regard each word as a vote on discrimination. If a word occurs in only one category, it can be regarded as a discriminative word for that category: it contributes the maximum vote, 1, regardless of how many other words occur in that category, whereas in the standard naïve Bayesian classifier discussed in Section 4.2 those other words would influence or weaken the vote from this word. Therefore, we refer to this classifier as the discriminative naïve Bayesian classifier.

4.4 Parameter Estimation
If we denote the number of occurrences of the event (A = w_j, c_i) in the training data as δ_ij, then for (1) a maximum likelihood estimate yields

P(A = w_j | c_i) = δ_ij / Σ_j δ_ij

and for (2),

P(c_i | A = w_j) = δ_ij / Σ_i δ_ij

So, for the standard naïve Bayesian classifier, the discriminant function for c_i is

f_i(v) = Σ_j v_j ln( (δ_ij + ε) / Σ_j (δ_ij + ε) )

and for the discriminative naïve Bayesian classifier, the discriminant function for c_i is

g_i(v) = Σ_j v_j ln( (δ_ij + ε) / Σ_i (δ_ij + ε) )

where ε is used for smoothing.

5. IMPLEMENTATION
We employ the ODP hierarchy in this work. This directory contains 17 top-level categories such as Business, Computers, Games, Health, etc. There are in total 712,548 categories and 4,800,870 Web pages in our dumped version. To reduce the whole hierarchy to a manageable size and limit ourselves to English content only, the three top-level categories Adult, Regional and World are not used in our system; if category candidates under these three categories are returned from ODP for a query, we simply ignore them. Thus 1,946,361 documents and 170,198 categories remain.

5.1 Off-line Cache
To create classifiers efficiently for a given set of categories, we first try to download all the Web pages registered in ODP. The crawled Web pages are somewhat fewer than those registered in the ODP hierarchy, since some network errors occurred during crawling. In total we crawled 1,297,222 Web pages distributed over 157,927 categories. From the crawled pages we then build a vocabulary list. The original vocabulary contains 6,387,537 distinct words.
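The two discriminant functions above can be computed directly from the cached counts. The following is a minimal sketch only, not the paper's implementation: the dictionary-of-dictionaries layout `delta[i][j]` and the function names are our own, and for simplicity the denominator of f sums over the words stored for category i rather than the full vocabulary.

```python
import math

# Sketch of the two discriminant functions of Section 4.4, over
# precomputed counts delta[i][j] = occurrences of word j in category i.

def f_standard(delta, i, v, eps=1.0):
    # f_i(v) = sum_j v_j * ln( (delta_ij + eps) / sum_j' (delta_ij' + eps) )
    # Denominator sums over the words observed for category i.
    denom = sum(count + eps for count in delta[i].values())
    return sum(v_j * math.log((delta[i].get(j, 0.0) + eps) / denom)
               for j, v_j in v.items())

def g_discriminative(delta, i, v, eps=1.0):
    # g_i(v) = sum_j v_j * ln( (delta_ij + eps) / sum_i' (delta_i'j + eps) )
    # Denominator sums over categories for each word j instead.
    score = 0.0
    for j, v_j in v.items():
        denom = sum(delta[k].get(j, 0.0) + eps for k in delta)
        score += v_j * math.log((delta[i].get(j, 0.0) + eps) / denom)
    return score
```

Classification then assigns a snippet (its word-count dictionary v) to the category with the highest discriminant score; note that g weights a word occurring almost exclusively in one category highly regardless of how many other words that category contains.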
We perform a global feature selection in which words with low occurrence counts are removed (here a word means a sequence of letters, numbers and hyphens, neither starting nor ending with a hyphen): words with fewer than 4 occurrences over the whole hierarchy are removed from our vocabulary. This reduces the vocabulary by 80%, down to 1,287,715 words. We pre-count and save δ_ij for all words in all categories, so that a classifier can be quickly created by reading the δ_ij of the involved words in the involved categories. We use the Ancestor-Assistant strategy and the discriminative naïve Bayesian classifier in the final design of the Deep Classifier, which will be shown later to be the best among the alternatives.
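The global feature-selection step can be sketched as a single counting pass. The function name and threshold parameter below are illustrative; the paper's threshold is 4 occurrences over the whole hierarchy.

```python
from collections import Counter

# Sketch of global feature selection: drop words whose total occurrence
# count over the whole collection falls below a threshold.

def prune_vocabulary(documents, min_occurrences=4):
    """documents: iterable of token lists; returns the retained vocabulary."""
    totals = Counter()
    for tokens in documents:
        totals.update(tokens)
    return {word for word, count in totals.items() if count >= min_occurrences}

docs = [["apple", "fruit", "apple"], ["apple", "mac"], ["apple", "fruit"]]
vocab = prune_vocabulary(docs, min_occurrences=2)
# "apple" (4 occurrences) and "fruit" (2) survive; "mac" (1) is dropped.
```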

[Figure 6: Number of queries vs. number of top-level categories their results are distributed over.]

5.2 Space and Time Complexity
The Deep Classifier requires a two-dimensional table of δ_ij. Each row corresponds to a category and each column to a word in the vocabulary. The space complexity is O(M N), where N is the size of the vocabulary and M is the number of categories. Not all columns are loaded at initialization, for space reasons; the Second Chance strategy [22] is employed to swap rows in and out. When testing a search result, only the words occurring in the snippet are considered, so the time complexity for testing is O(n log N + n m + K), where n is the length of the snippet, m is the number of category candidates and N is the size of the whole vocabulary. The first term is the time to convert the snippet into word IDs, and the second term is the time to classify. K is the time for memory swapping, incurred only when a required column has not been loaded previously, so it is optional.

6. EXPERIMENTS
In this section, we first show the collected statistics about the category distributions of queries. Then we introduce the data sets used for evaluation, and validate on Dataset I our statement in Section 3 that training data are usually insufficient at deep categories. Finally, the performance of the different strategies for collecting training data proposed in Section 3 and of the different classifier models proposed in Section 4 is compared on the two datasets.

6.1 Category Distribution
We reported in the Introduction that the results of a query are usually distributed among a few concentrated categories. We collected 1000 popular queries from a well-known search engine and counted how many top-level categories the ODP results of these queries are distributed over. The resulting distribution is shown in Figure 6. About 94.7% of the queries are distributed over fewer than six categories, and about 74.2% of the queries over three or fewer. The two most widely distributed queries are "games" (in 14 top-level categories) and "books" (in 12 top-level categories).
This justifies two claims: (1) directly classifying search results into top-level categories may be too coarse for many queries, and arranging search results into deep categories can separate them into more distinct categories; (2) adaptively learning a classifier over different categories for different queries is preferable to a universal classifier over all categories, since the results of a query are not distributed over too many categories. This is especially obvious at the top level.

[Figure 7: Distribution of documents over the different levels, for the three collections.]

6.2 Data Set
To the best of our knowledge, there is no standard data set for our evaluation. In order to evaluate the performance of the Deep Classifier, we prepared two datasets, a large one and a small one. In each dataset, we collected the top 100 search results for each query.

6.2.1 Dataset I: Randomly Selected 100 Queries
The ODP data used in our Deep Classifier implementation (discussed in Section 5) for classifier training amounts to about 1.3 million pages; another roughly 0.6 million pages are not included. We used half of them (the 0.6 million pages are those which could not be downloaded the first time; this half is those successfully downloaded on a second try) to build a search engine indexed with Lucene. This engine was built to simulate the role of a search engine like Google. We randomly selected about 100 queries from query logs and collected the top 100 search results of each query from this simulated search engine. Since each search result comes from ODP, its category attribute is already present. This greatly saved the human labor of annotating a large number of search results with categories, and made it possible to prepare such a large dataset with limited time and labor. We also compared the distributions of the three collections: the full 1.9 million pages registered in ODP, the 1.3 million pages mentioned in Section 5 used for Deep Classifier training, and the 0.3 million pages of the simulated search engine. Both the percentage of documents and the percentage of categories at each level are close across these three collections. The comparison of document numbers is shown in Figure 7; that of category numbers is similar. This ensures that we do not introduce a significant bias of search results towards some levels of categories, deep or shallow. Likewise, it also ensures that the network failures mentioned in Section 5 did not cause the distribution of Web pages and categories to become greatly inconsistent with the full version. On this dataset, we also validated the assumption that training data at deep categories are very few: the flat strategy obtained only 21.6 training examples per query and category candidate on average, and this figure increased significantly when the Ancestor-Assistant strategy was employed.

6.2.2 Dataset II: Ambiguous Queries
Ambiguous queries are limited in number, and queries randomly selected in the manner described above are mostly unambiguous. So the first dataset can only reflect the performance of our system on facet categorization of unambiguous queries. To make up for this deficiency, we asked users to label the top 100 search results from the real Google for 9 selected ambiguous queries. Each search result was labeled by two users, and if their labeled categories did not agree, we ignored that search result for that query. Table 1 lists all the queries used in our evaluation. For each query, we also list the number of category candidates returned from ODP, the number of distinct top-level categories, and the maximal depth of the category candidates.

[Table 1: Selected ambiguous queries (ajax, apple, dell, jaguar, java, saturn, subway, trec, ups), with the number of category candidates, the number of top-level categories, and the maximum depth for each; the numeric values are not recoverable from this copy.]

The difference between the second and third columns supports the claim in Section 2 that classifying over a large and deep hierarchy can be more delicate.

6.3 Overall Performance
In this section, we introduce the experiments conducted to validate three hypotheses:
1. The Ancestor-Assistant strategy may outperform the hierarchical strategy and the flat strategy.
2. The discriminative naïve Bayesian classifier may outperform the standard one.
3. The discriminative naïve Bayesian classifier is expected to be much faster than an SVM, although some accuracy may be lost.

6.3.1 Strategies for Selecting Training Data
In this experiment, we fix the classifier to the discriminative naïve Bayesian classifier, which will be shown below to achieve higher performance, and vary the strategy for selecting the training data. The Micro-F1, Macro-Precision, Macro-Recall and Macro-F1 [23] on Dataset I, averaged over all queries, are reported in Figure 8. For each individual query in Dataset II, we report these measures separately in Table 2. As shown, the Ancestor-Assistant strategy for training data selection achieves the highest performance. There is about 11.3%, 3.4%, 13.4% and 15.7% improvement on Dataset I, and about 53.3%, 52.4%, 43.1% and 60.0% improvement on Dataset II, over the hierarchical strategy on the measures of Micro-F1, Macro-Precision, Macro-Recall and Macro-F1, respectively.

[Figure 8: Dataset I: different strategies for training data selection (flat, hierarchical, Ancestor-Assistant), measured by Micro-F1, Macro-Precision, Macro-Recall and Macro-F1.]

[Figure 9: Dataset I: different classifiers (standard NB vs. discriminative NB), measured by Micro-F1, Macro-Precision, Macro-Recall and Macro-F1.]

There is about 21.3%, 10.3%, 19.5% and 24.7% improvement on Dataset I, and about 14.1%, 14.3%, 68.8% and 67.7% improvement on Dataset II, over the flat strategy on these measures. The performance of the proposed Ancestor-Assistant strategy improves significantly over both the hierarchical and flat strategies. This validates the first hypothesis: the Ancestor-Assistant strategy, which employs training data from both the category candidate itself and its ancestors and siblings, and converts hierarchical classification into a flat mode, is the best. One can also see that the flat strategy is not stable. When the training data from the category candidate itself are very insufficient, its performance is very poor, especially on Macro-Recall; but when they are sufficient, it achieves performance comparable to the Ancestor-Assistant strategy. We attribute the low performance of the hierarchical strategy, which is similar to what is adopted in some existing work [8, 5, 15, 3], to the following two factors: The top-down scheme accumulates the error rates at each level, which gradually reach an unbearable amount at some deep level of the hierarchy. This problem is overcome in our flat and Ancestor-Assistant strategies, where classification is performed in a flat mode.

[Table 2: Dataset II: different strategies for training data selection. Per-query Micro-F1, Macro-Precision, Macro-Recall and Macro-F1 for the flat, hierarchical and Ancestor-Assistant strategies on the queries ajax, apple, dell, jaguar, java, saturn, subway, trec and ups, with averages; the numeric values are not recoverable from this copy.]

[Table 3: Dataset II: different classifiers. Per-query Micro-F1, Macro-Precision, Macro-Recall and Macro-F1 for the standard and discriminative naïve Bayesian classifiers on the same queries, with averages; the numeric values are not recoverable from this copy.]

In TAPER [3], the lengths of all paths from the root to the leaves are equal, but this does not hold in general. The probability estimated by this strategy is a product of different numbers of conditional probabilities, which necessarily favors leaves at shallow levels.

6.3.2 Standard vs. Discriminative Naïve Bayesian Classifier
In this section, we conduct experiments to validate the second hypothesis. Since the Ancestor-Assistant strategy has been validated as the best, we only report the subsequent experiments with this strategy. We compared the performance of the standard naïve Bayesian classifier and the proposed discriminative naïve Bayesian classifier. The results are shown in Figure 9 for Dataset I and Table 3 for Dataset II. The proposed discriminative naïve Bayesian classifier achieves about 2.8%, 2.3%, 3.9% and 4.0% improvement on Dataset I, and about 7.3%, 28.8%, 25.6% and 24.0% improvement on Dataset II, over the standard naïve Bayesian classifier on the measures of Micro-F1, Macro-Precision, Macro-Recall and Macro-F1, respectively. We attribute the higher performance of the discriminative naïve Bayesian classifier to the following observation: when two categories deep in the hierarchy need to be distinguished and they also have a common ancestor deep in the hierarchy (e.g., categories 874 and 902 in Figure 3 and their common ancestor 24), the two categories share many common words in their main content and have only a few discriminative words capable of distinguishing them.
In such a situation, in the standard naïve Bayesian classifier the contribution of these discriminative words is shrunk by a large denominator, while the discriminative naïve Bayesian classifier assigns higher weights to them. The experimental results thus also validate the second hypothesis: the discriminative naïve Bayesian classifier outperforms the standard one.

Naïve Bayesian Classifier vs. SVM

We also compared the naïve Bayesian classifiers with the more sophisticated SVM on Dataset I to validate our third hypothesis. We use the LIBSVM package [4] for the SVM implementation; it employs the one-against-rest strategy to support multi-class classification. We failed to observe statistically significant improvements by SVM, but the time used by SVM was about 20 times the time used by the naïve Bayesian classifiers. The time recorded for SVM included the time to fetch training examples in vector form via disk I/O, model training, and classification. For each query, we cleared the memory cache before classification, so the time for the naïve Bayesian classifiers also included the I/O cost of fetching the saved δ_ij; in other words, this is the maximum time needed by the naïve Bayesian classifiers. If the time spent waiting for responses from the ODP and the search engine is excluded, the time used by the discriminative naïve Bayesian classifier is on average less than 1 second, including any optional I/O cost, which makes it suitable for online application.

Significance Test

We performed one-tailed paired t-tests to check whether the improvements mentioned above are statistically significant. The p-values are shown in Table 4.

[Table 4: P-values of t-tests — for each dataset (I and II) and each pair (dis vs. std, aa vs. hie, aa vs. flat), the p-values on Micro-F1, Macro-Precision, Macro-Recall and Macro-F1]

7. RELATED WORK

In this section, we discuss the main research related to our problem, including search result categorization, classification on hierarchical categories, Web directory based applications, and classification models.

7.1 Search Result Categorization and Clustering

Although it is a retrieval system, TAPER [3] also included search result classification. The classification was performed off-line rather than at query time. The standard naïve Bayesian classifier combined with the hierarchical strategy for training data selection proposed in Section 3.2 is very similar to the classification model employed in TAPER [3]. Search result classification algorithms have been proposed in [5, 8], which correspond to the shallow classification algorithms in this work. Dumais and Chen [8] developed a system called SWISH in which the search results were automatically categorized based on the top two levels of the LookSmart directory, and Web pages in the same category were clustered together.
As described in their work, the category organization brings significant advantages for Web users: participants liked the category interface much better than the list interface, and they were 50% faster at finding information that was organized into categories. In [13], researchers used the top-level ODP categories as the training examples. For the end user, the most notable consequence of using a classification technique is the quality of the category names. Because the documents are classified into an existing taxonomy, the class names are predefined and can be carefully selected to convey the intended meanings optimally. Thus the naming problem associated with clustering methods is avoided. However, the classification scheme may be too coarse for the given data set, resulting in a top-level categorization where all data items are placed in one or two classes. Clustering is a way to show the results in more detailed topics, and many algorithms for search result clustering have been proposed, e.g. Vivisimo. As mentioned in [11], besides the lack of quality of the resulting clusters, the disadvantages include the lack of predictability of the clustering results and the diverse mix of the obtained sub-cluster hierarchies. This problem exists even in state-of-the-art search engine systems such as Vivisimo.

7.2 Classification on Hierarchical Categories

Several other studies have investigated the problem of classification over hierarchical taxonomies [2, 10, 21, 24]. Most of their findings were over testing data sets that numbered only in the hundreds, or at most a few thousand, categories. Liu et al. [15] conducted a large-scale analysis on the entire Yahoo! categories to show the performance of classification on a large-scale taxonomy. As stated in that paper, there are about 246,279 categories in the Yahoo! Directory. The performance of classification on the top-level categories is about 72% on the Micro-F1 measure. However, when classifying documents into deeper categories, the performance decreases quickly.
As shown in [15], the performance is about 30% lower on the Micro-F1 measure at the 4th level and deeper. Directly building large-scale taxonomy classifiers therefore cannot solve this problem, due to the poor classification performance.

7.3 Web Directory Application

Web directories, such as Yahoo!, the Open Directory Project (ODP), Gimpsy and others, ask human editors to review Web sites for inclusion in the directories. The most notable Web directory is the Open Directory Project, which is one of the largest efforts to manually annotate Web pages. Over 70,430 editors are engaged in keeping the directory reasonably up to date. As a result, the ODP now provides access to over 4 million Web pages in its catalogs. These Web directories are organized in a tree structure, and each category is labeled with an understandable name. The ODP has been successfully utilized in much related work, such as classification [8, 5, 18], personalized Web search [6, 14], and evaluation [9, 1].

7.4 Classifier Model

The naïve Bayesian classifier has been used for a few decades. Several works (e.g. [7]) have tried to explain its success, although the naïve independence assumption usually does not hold. Although the names share the word discriminative, our discriminative naïve Bayesian classifier is different from the discriminative models discussed in [19], which compared discriminative and generative classifiers; our method focuses on discriminative features rather than on the classifier itself. The reversed naïve Bayesian classifier mentioned in [12] is very similar to our discriminative naïve Bayesian classifier, but the event models are different, and we also propose a discrimination-based explanation that was not presented in [12].

In that paper, the author claimed that the reversed version and the original version are the same under the assumption that each category has the same number of training examples, which is not true in our situation.

8. CONCLUSIONS AND FUTURE WORK

The existing algorithms for search result classification are based on shallow levels of topic hierarchies; therefore, they are too coarse to provide the information needed for user browsing. Classification algorithms on large and deep hierarchies can provide much more detailed categories, but have traditionally suffered from low classification performance. In this paper, we proposed a novel algorithm known as Deep Classifier to classify search results into deep topic hierarchies in an adaptive manner. By pruning the original hierarchy to a reasonable size while preserving its hierarchical structure and depth, we introduced different strategies for collecting sufficient training data in order to build reliable classifiers. Discriminative naïve Bayesian classifiers are employed for efficiency and effectiveness. Experimental results showed that the Ancestor-Assistant strategy achieves the highest performance compared to the other strategies, while the discriminative naïve Bayesian classifier achieves higher performance than the standard naïve Bayesian classifier. Furthermore, the search results are presented in a deep hierarchical style, which provides more informative class labels for user browsing.

Considering that using the full page content rather than only the snippet may further improve performance, in the future we will verify the performance on the whole content of Web pages instead of snippets. As mentioned in Section 6, the number of top-level categories corresponding to a query is usually less than five. We omitted queries with no categories returned from the ODP, due to the limited coverage of the human-labeled Web directory. In such a situation, the method in this paper will have difficulties, and further research is required to handle it.
For example, query expansion may be a solution to this. At present, we simply employ a flat classifier on all top-level categories to classify the search results for such queries.

9. REFERENCES

[1] S. M. Beitzel, E. C. Jensen, A. Chowdhury, D. Grossman and O. Frieder. Using manually-built web directories for automatic evaluation of known-item retrieval. In Proc. of SIGIR, Toronto, Canada.
[2] L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In Proc. of CIKM, pages 78-87.
[3] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. VLDB Journal, 7(3).
[4] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[5] H. Chen and S. T. Dumais. Bringing order to the web: Automatically categorizing search results. In Proc. of CHI '00, August 2000.
[6] P. Chirita, W. Nejdl, R. Paiu and C. Kohlschuetter. Using ODP metadata to personalize search. In Proc. of SIGIR, Salvador, August.
[7] P. Domingos and M. J. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3).
[8] S. T. Dumais and H. Chen. Hierarchical classification of web content. In Proc. of SIGIR, August.
[9] P. Ganesan, H. Garcia-Molina and J. Widom. Exploiting hierarchical domain structure to compute similarity. Technical report, Stanford Computer Science Dept., June.
[10] M. Granitzer. Hierarchical text classification using methods from machine learning. Master's Thesis, Graz University of Technology.
[11] M. Hearst. Clustering versus faceted categories for information exploration. Communications of the ACM, 49(4):59-61.
[12] A. Juan and H. Ney. Reversing and smoothing the multinomial naive Bayes text classifier. In Proc. of PRIS, Alicante, Spain.
[13] B. Kules and B. Shneiderman. Categorized graphical overviews for web search results: An exploratory study using U.S. government agencies as a meaningful and stable structure. Technical report HCIL, CS-TR-4715, UMIACS-TR, ISR-TR.
[14] F. Liu, C. Yu and W. Meng. A personalized web search by mapping user queries to categories. In Proc. of CIKM.
[15] T.-Y. Liu, Y.-M. Yang, H. Wan, H.-J. Zeng, Z. Chen and W.-Y. Ma. Support vector machines classification with a very large-scale taxonomy. SIGKDD Explorations, 7(1):36-43.
[16] A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998.
[17] T. M. Mitchell. Machine Learning. McGraw Hill.
[18] D. Mladenic. Turning Yahoo into an automatic web page classifier. In Proc. of ECAI, Brighton, UK.
[19] A. Ng and M. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In NIPS 14, Cambridge, MA: MIT Press.
[20] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7).
[21] A. Sun and E. Lim. Hierarchical text classification and evaluation. In Proc. of ICDM.
[22] A. S. Tanenbaum. Modern Operating Systems (second edition), Section 4.4.4. New Jersey: Prentice-Hall.
[23] Y. Yang. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1(1/2):67-88.
[24] Y. Yang, J. Zhang and B. Kisiel. A scalability analysis of classifiers in text categorization. In Proc. of SIGIR.

Deep Classification in Large-scale Text Hierarchies

Deep Classification in Large-scale Text Hierarchies Deep Classfcaton n Large-scale Text Herarches Gu-Rong Xue Dkan Xng Qang Yang 2 Yong Yu Dept. of Computer Scence and Engneerng Shangha Jao-Tong Unversty {grxue, dkxng, yyu}@apex.sjtu.edu.cn 2 Hong Kong

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! CSCI 5417 Informaton Retreval Systems Jm Martn! Lecture 11 9/29/2011 Today 9/29 Classfcaton Naïve Bayes classfcaton Ungram LM 1 Where we are... Bascs of ad hoc retreval Indexng Term weghtng/scorng Cosne

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

Classifier Selection Based on Data Complexity Measures *

Classifier Selection Based on Data Complexity Measures * Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Query Clustering Using a Hybrid Query Similarity Measure

Query Clustering Using a Hybrid Query Similarity Measure Query clusterng usng a hybrd query smlarty measure Fu. L., Goh, D.H., & Foo, S. (2004). WSEAS Transacton on Computers, 3(3), 700-705. Query Clusterng Usng a Hybrd Query Smlarty Measure Ln Fu, Don Hoe-Lan

More information

The Research of Support Vector Machine in Agricultural Data Classification

The Research of Support Vector Machine in Agricultural Data Classification The Research of Support Vector Machne n Agrcultural Data Classfcaton Le Sh, Qguo Duan, Xnmng Ma, Me Weng College of Informaton and Management Scence, HeNan Agrcultural Unversty, Zhengzhou 45000 Chna Zhengzhou

More information

Biostatistics 615/815

Biostatistics 615/815 The E-M Algorthm Bostatstcs 615/815 Lecture 17 Last Lecture: The Smplex Method General method for optmzaton Makes few assumptons about functon Crawls towards mnmum Some recommendatons Multple startng ponts

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

UB at GeoCLEF Department of Geography Abstract

UB at GeoCLEF Department of Geography   Abstract UB at GeoCLEF 2006 Mguel E. Ruz (1), Stuart Shapro (2), June Abbas (1), Slva B. Southwck (1) and Davd Mark (3) State Unversty of New York at Buffalo (1) Department of Lbrary and Informaton Studes (2) Department

More information

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z.

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z. TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS Muradalyev AZ Azerbajan Scentfc-Research and Desgn-Prospectng Insttute of Energetc AZ1012, Ave HZardab-94 E-mal:aydn_murad@yahoocom Importance of

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Implementation Naïve Bayes Algorithm for Student Classification Based on Graduation Status

Implementation Naïve Bayes Algorithm for Student Classification Based on Graduation Status Internatonal Journal of Appled Busness and Informaton Systems ISSN: 2597-8993 Vol 1, No 2, September 2017, pp. 6-12 6 Implementaton Naïve Bayes Algorthm for Student Classfcaton Based on Graduaton Status

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Keywords - Wep page classification; bag of words model; topic model; hierarchical classification; Support Vector Machines

Keywords - Wep page classification; bag of words model; topic model; hierarchical classification; Support Vector Machines (IJCSIS) Internatonal Journal of Computer Scence and Informaton Securty, Herarchcal Web Page Classfcaton Based on a Topc Model and Neghborng Pages Integraton Wongkot Srura Phayung Meesad Choochart Haruechayasak

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Web Document Classification Based on Fuzzy Association

Web Document Classification Based on Fuzzy Association Web Document Classfcaton Based on Fuzzy Assocaton Choochart Haruechayasa, Me-Lng Shyu Department of Electrcal and Computer Engneerng Unversty of Mam Coral Gables, FL 33124, USA charuech@mam.edu, shyu@mam.edu

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET 1 BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET TZU-CHENG CHUANG School of Electrcal and Computer Engneerng, Purdue Unversty, West Lafayette, Indana 47907 SAUL B. GELFAND School

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Query classification using topic models and support vector machine

Query classification using topic models and support vector machine Query classfcaton usng topc models and support vector machne Deu-Thu Le Unversty of Trento, Italy deuthu.le@ds.untn.t Raffaella Bernard Unversty of Trento, Italy bernard@ds.untn.t Abstract Ths paper descrbes

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

A User Selection Method in Advertising System

A User Selection Method in Advertising System Int. J. Communcatons, etwork and System Scences, 2010, 3, 54-58 do:10.4236/jcns.2010.31007 Publshed Onlne January 2010 (http://www.scrp.org/journal/jcns/). A User Selecton Method n Advertsng System Shy

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Available online at Available online at Advanced in Control Engineering and Information Science

Available online at   Available online at   Advanced in Control Engineering and Information Science Avalable onlne at wwwscencedrectcom Avalable onlne at wwwscencedrectcom Proceda Proceda Engneerng Engneerng 00 (2011) 15000 000 (2011) 1642 1646 Proceda Engneerng wwwelsevercom/locate/proceda Advanced

More information

An Image Fusion Approach Based on Segmentation Region

An Image Fusion Approach Based on Segmentation Region Rong Wang, L-Qun Gao, Shu Yang, Yu-Hua Cha, and Yan-Chun Lu An Image Fuson Approach Based On Segmentaton Regon An Image Fuson Approach Based on Segmentaton Regon Rong Wang, L-Qun Gao, Shu Yang 3, Yu-Hua

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

Impact of a New Attribute Extraction Algorithm on Web Page Classification

Impact of a New Attribute Extraction Algorithm on Web Page Classification Impact of a New Attrbute Extracton Algorthm on Web Page Classfcaton Gösel Brc, Banu Dr, Yldz Techncal Unversty, Computer Engneerng Department Abstract Ths paper ntroduces a new algorthm for dmensonalty

More information

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University CAN COMPUTERS LEARN FASTER? Seyda Ertekn Computer Scence & Engneerng The Pennsylvana State Unversty sertekn@cse.psu.edu ABSTRACT Ever snce computers were nvented, manknd wondered whether they mght be made

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

Pruning Training Corpus to Speedup Text Classification 1

Pruning Training Corpus to Speedup Text Classification 1 Prunng Tranng Corpus to Speedup Text Classfcaton Jhong Guan and Shugeng Zhou School of Computer Scence, Wuhan Unversty, Wuhan, 430079, Chna hguan@wtusm.edu.cn State Key Lab of Software Engneerng, Wuhan

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Arabic Text Classification Using N-Gram Frequency Statistics A Comparative Study

Arabic Text Classification Using N-Gram Frequency Statistics A Comparative Study Arabc Text Classfcaton Usng N-Gram Frequency Statstcs A Comparatve Study Lala Khresat Dept. of Computer Scence, Math and Physcs Farlegh Dcknson Unversty 285 Madson Ave, Madson NJ 07940 Khresat@fdu.edu

More information

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

Unsupervised Learning

Unsupervised Learning Pattern Recognton Lecture 8 Outlne Introducton Unsupervsed Learnng Parametrc VS Non-Parametrc Approach Mxture of Denstes Maxmum-Lkelhood Estmates Clusterng Prof. Danel Yeung School of Computer Scence and

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

Alignment Results of SOBOM for OAEI 2010

Alignment Results of SOBOM for OAEI 2010 Algnment Results of SOBOM for OAEI 2010 Pegang Xu, Yadong Wang, Lang Cheng, Tany Zang School of Computer Scence and Technology Harbn Insttute of Technology, Harbn, Chna pegang.xu@gmal.com, ydwang@ht.edu.cn,

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17th European Symposium on Computer Aided Process Engineering ESCAPE17 V. Plesu and P.S. Agachi (Editors) 2007 Elsevier B.V. All rights reserved.

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching Zhiqiang Hou hou_zhiq@sohu.com Chongzhao Han czhan@mail.xjtu.edu.cn Lin Zheng Abstract: A fast visual tracking algorithm based on circle pixels matching

Investigating the Performance of Naïve- Bayes Classifiers and K- Nearest Neighbor Classifiers

Investigating the Performance of Naïve-Bayes Classifiers and K-Nearest Neighbor Classifiers Journal of Convergence Information Technology Volume 5, Number 2, April 2010 Mohammed J. Islam *, Q. M. Jonathan Wu,

A Knowledge Management System for Organizing MEDLINE Database

A Knowledge Management System for Organizing MEDLINE Database Hyunki Kim, Su-Shing Chen Computer and Information Science Engineering Department, University of Florida, Gainesville, Florida 32611, USA With the explosion

Web-supported Matching and Classification of Business Opportunities

Web-supported Matching and Classification of Business Opportunities DIRO Université de Montréal C.P. 628, succursale Centre-ville Montréal, Québec, H3C 3J7, Canada Jing Bai, François Paradis,2, Jian-Yun Nie {baijing, paradifr,

Private Information Retrieval (PIR)

Private Information Retrieval (PIR) Levente Buttyán Problem formulation Alice wants to obtain information from a database, but she does not want the database to learn which information she wanted e.g., Alice is an investor querying a stock-market

News. Recap: While Loop Example. Reading. Recap: Do Loop Example. Recap: For Loop Example

News. Recap: While Loop Example. Reading. Recap: Do Loop Example. Recap: For Loop Example University of British Columbia CPSC, Intro to Computation Jan-Apr Tamara Munzner News Assignment corrections to ASCIIArtiste.java posted definitely read WebCT bboards Arrays Lecture, Tue Feb based on slides by Kurt

Audio Content Classification Method Research Based on Two-step Strategy

Audio Content Classification Method Research Based on Two-step Strategy (IJACSA) International Journal of Advanced Computer Science and Applications, Sumei Liang Department of Computer Science and Technology Chongqing

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision Javier Civera, University of Zaragoza Andrew J. Davison, Imperial College London J.M.M. Montiel, University of Zaragoza. josemari@unizar.es, jcivera@unizar.es,

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communications, 2015, 3, 88-93 Published Online May 2015 in SciRes. http://www.scirp.org/journal/jcc http://dx.doi.org/10.4236/jcc.2015.350 Lin Lu Lin Lu, School

Selecting Query Term Alterations for Web Search by Exploiting Query Contexts

Selecting Query Term Alterations for Web Search by Exploiting Query Contexts Guihong Cao Stephen Robertson Jian-Yun Nie Dept. of Computer Science and Operations Research Microsoft Research at Cambridge Dept. of Computer

Fast Feature Value Searching for Face Detection

Fast Feature Value Searching for Face Detection Vol., No. 2 Computer and Information Science Yunyang Yan Department of Computer Engineering Huaiyin Institute of Technology Huai an 22300, China E-mail: areyyyke@163.com

Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search

Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search Jiannan Wang Guoliang Li Jianhua Feng Department of Computer Science and Technology, Tsinghua National Laboratory for Information

Improvement of Spatial Resolution Using Block-Matching Based Motion Estimation and Frame Integration

Improvement of Spatial Resolution Using Block-Matching Based Motion Estimation and Frame Integration Danya Suga and Takayuki Hamamoto Graduate School of Engineering, Tokyo University of Science, 6-3-1, Niijuku, Katsushika-ku,

Problem Set 3 Solutions

Problem Set 3 Solutions Introduction to Algorithms October 4, 2002 Massachusetts Institute of Technology 6.046J/18.410J Professors Erik Demaine and Shafi Goldwasser Handout 14 (Exercises were not to be turned in,

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines H. GOMEZ-MORENO, S. MALDONADO-BASCON, F. LOPEZ-FERRERAS, M. UTRILLA-MANSO AND P. GIL-JIMENEZ Departamento de Teoría
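For reference, the classic (unmodified) median filter that this paper builds on replaces each sample with the median of its neighborhood, so isolated impulse outliers rarely survive. A 1-D stdlib sketch (the paper's SVM-based modification is not reproduced here; the function name is illustrative):

```python
import statistics

def median_filter_1d(signal, window=3):
    """Plain sliding-window median filter; edges are padded by
    replicating the first and last samples."""
    half = window // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    # Each output sample is the median of its window in the padded signal.
    return [statistics.median(padded[i:i + window])
            for i in range(len(signal))]
```

A single impulse spike in an otherwise flat signal is removed entirely, which is exactly the behavior that motivates median (rather than mean) filtering for impulse noise.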

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Science (5): 8-, 7 ISSN 59-66 7 Science Publications Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahliyyah Amman University,

Enhancement of Infrequent Purchased Product Recommendation Using Data Mining Techniques

Enhancement of Infrequent Purchased Product Recommendation Using Data Mining Techniques Noraswaliza Abdullah, Yue Xu, Shlomo Geva, and Mark Looi Discipline of Computer Science Faculty of Science and Technology Queensland

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17th European Symposium on Computer Aided Process Engineering ESCAPE17 V. Plesu and P.S. Agachi (Editors) 2007 Elsevier B.V. All rights reserved.

SHAPE RECOGNITION METHOD BASED ON THE k-nearest NEIGHBOR RULE

SHAPE RECOGNITION METHOD BASED ON THE k-nearest NEIGHBOR RULE Dorina Purcaru Faculty of Automation, Computers and Electronics University of Craiova 13 Al. I. Cuza Street, Craiova RO-1100 ROMANIA E-mail: dpurcaru@electronics.ucv.ro
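The k-nearest neighbor rule named in this title can be stated in a few lines: classify a query point by majority vote among its k closest training samples. A toy Python version (the names and the squared-Euclidean metric are assumptions for illustration, not taken from the paper):

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """k-NN rule: majority vote among the k nearest training samples.
    train: list of (feature_vector, label) pairs."""
    # Squared Euclidean distance; the square root does not change ordering.
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda fl: dist(fl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

With two well-separated groups of training points, queries near each group receive that group's label.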

Support Vector Machines

Support Vector Machines Decision surface is a hyperplane (line in 2D) in feature space (similar to the Perceptron) Arguably, the most important recent discovery in machine learning In a nutshell: map the data to a predetermined

Three supervised learning methods on pen digits character recognition dataset

Three supervised learning methods on pen digits character recognition dataset Chris Fleizach Department of Computer Science and Engineering University of California, San Diego San Diego, CA 92093 cfleizac@cs.ucsd.edu Satoru

Learning to Classify Documents with Only a Small Positive Training Set

Learning to Classify Documents with Only a Small Positive Training Set Xiao-Li Li 1, Bing Liu 2, and See-Kiong Ng 1 1 Institute for Infocomm Research, Heng Mui Keng Terrace, 119613, Singapore 2 Department of Computer

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc.

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc. ISSN : 0974-7435 Volume 10 Issue BioTechnology 2014 An Indian Journal FULL PAPER BTAIJ 10() 2014 [684-689] Review on China's sports industry financing market based on market-oriented

CE 221 Data Structures and Algorithms

CE 221 Data Structures and Algorithms Chapter 4: Trees BST Text: Read Weiss, 4.3 Izmir University of Economics 1 The Search Tree ADT Binary Search Trees An important application of binary trees is in searching. Let us assume
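The binary search tree ADT this excerpt introduces supports insertion and search by comparing keys and descending left or right, so each comparison discards an entire subtree. A minimal Python sketch (illustrative only; a CE 221 course would typically use C++):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Insert key into a BST; smaller keys go left, larger go right."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root                      # duplicate keys are ignored

def contains(root, key):
    """Search: one comparison per level discards a whole subtree."""
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False
```

On a reasonably balanced tree this gives O(log n) search, which is the point of the BST application discussed in the chapter.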

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Chapter 5. More efficient methods of numerical solution Euler's method is quite inefficient, because the error is essentially proportional to the
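The excerpt's claim that Euler's method error is essentially proportional to the step size is easy to check numerically: halving the step size should roughly halve the error. A small Python experiment (the test equation y' = -y is my choice for illustration, not taken from the course notes):

```python
import math

def euler(f, y0, t0, t1, steps):
    """Fixed-step Euler method for the initial value problem y' = f(t, y)."""
    h, t, y = (t1 - t0) / steps, t0, y0
    for _ in range(steps):
        y += h * f(t, y)
        t += h
    return y

# y' = -y with y(0) = 1 has the exact solution y(t) = e^{-t}.
f = lambda t, y: -y
err_coarse = abs(euler(f, 1.0, 0.0, 1.0, 100) - math.exp(-1))
err_fine = abs(euler(f, 1.0, 0.0, 1.0, 200) - math.exp(-1))
# First-order method: doubling the number of steps roughly halves the error.
```

The ratio err_coarse / err_fine comes out close to 2, consistent with a first-order (error proportional to h) method.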

A Novel Term_Class Relevance Measure for Text Categorization

A Novel Term_Class Relevance Measure for Text Categorization D S Guru, Mahamad Suhil Department of Studies in Computer Science, University of Mysore, Mysore, India Abstract: In this paper, we introduce a new measure

A Deflected Grid-based Algorithm for Clustering Analysis

A Deflected Grid-based Algorithm for Clustering Analysis NANCY P. LIN, CHUNG-I CHANG, HAO-EN CHUEH, HUNG-JEN CHEN, WEI-HUA HAO Department of Computer Science and Information Engineering Tamkang University 5 Ying-chuan

Concurrent Apriori Data Mining Algorithms

Concurrent Apriori Data Mining Algorithms Vassil Halatchev Department of Electrical Engineering and Computer Science York University, Toronto October 8, 2015 Outline Why it is important Introduction to Association Rule Mining

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Giorgio Valentini e-mail: valentini@dsi.unimi.it Clustering of Microarray Data. Clustering of gene expression profiles (rows) => discovery of co-regulated and functionally
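Agglomerative hierarchical clustering as described here — merging co-expressed profiles bottom-up — can be sketched with single linkage on scalar expression values. A deliberately tiny illustration (not the microarray pipeline from the slides; single linkage and the 1-D distance are assumptions):

```python
def single_linkage(points, k):
    """Agglomerative clustering: start with each point as its own
    cluster, repeatedly merge the two closest clusters (single
    linkage = minimum pairwise distance) until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-linkage distance between clusters i and j.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters
```

Cutting the merge process at k clusters corresponds to cutting the dendrogram at a given height, which is how co-regulated groups are read off in expression analysis.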

Synthesizer 1.0. User's Guide. A Varying Coefficient Meta-Analytic Tool. Z. Krizan. Employing Microsoft Excel 2007

Synthesizer 1.0 A Varying Coefficient Meta-Analytic Tool Employing Microsoft Excel 2007 User's Guide Z. Krizan 2009 Table of Contents 1. Introduction and Acknowledgments 3. Operational Functions

Course Introduction. Algorithm. 8/31/2017. COSC 320 Advanced Data Structures and Algorithms

Course Introduction Course Topics Exams, Labs, Projects A quick look at a few algorithms 1 Advanced Data Structures and Algorithms Description: We are going to discuss algorithm complexity analysis, algorithm design techniques

Chapter 6 Programming the finite element method I now turn to the main subject of this book: the implementation of the finite element algorithm in computer programs. In order to make my discussion as straightforward

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design Spring 2014 Register Allocation Sample Exercises and Solutions Prof. Pedro C. Diniz USC / Information Sciences Institute 4676 Admiralty Way, Suite 1001 Marina del Rey, California 90292 pedro@isi.edu Register

Using Ambiguity Measure Feature Selection Algorithm for Support Vector Machine Classifier

Using Ambiguity Measure Feature Selection Algorithm for Support Vector Machine Classifier Saket S.R. Mengle Information Retrieval Lab Computer Science Department Illinois Institute of Technology Chicago, Illinois, U.S.A

Parallel Implementation of Classification Algorithms Based on Cloud Computing Environment

Parallel Implementation of Classification Algorithms Based on Cloud Computing Environment TELKOMNIKA, Vol.10, No.5, September 2012, pp. 1087~1092 e-ISSN: 2087-278X accredited by DGHE (DIKTI), Decree No: 51/Dikti/Kep/2010 1087

Semantic Image Retrieval Using Region Based Inverted File

Semantic Image Retrieval Using Region Based Inverted File Dengsheng Zhang, Md Monirul Islam, Guojun Lu and Jin Hou 2 Gippsland School of Information Technology, Monash University Churchill, VIC 3842, Australia E-mail:

Brave New World Pseudocode Reference

Brave New World Pseudocode Reference Pseudocode is a way to describe how to accomplish tasks using basic steps like those a computer might perform. In this week's lab, you'll see how a form of pseudocode can be

The Codesign Challenge

The Codesign Challenge ECE 4530 Fall 2007 Hardware/Software Codesign Objectives In the codesign challenge, your task is to accelerate a given software reference implementation as fast as possible.

Data Mining: Model Evaluation

Data Mining: Model Evaluation April 16, 2013 1 Issues: Evaluating Classification Methods Accuracy classifier accuracy: predicting class label predictor accuracy: guessing value of predicted attributes Speed time to construct
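Classifier accuracy as defined in this excerpt — how well the model predicts the class label — is simply the fraction of predictions that match the true labels. A minimal Python helper (hypothetical names, for illustration):

```python
def accuracy(y_true, y_pred):
    """Classifier accuracy: fraction of predicted class labels
    that match the true labels."""
    assert len(y_true) == len(y_pred), "label lists must align"
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)
```

Accuracy is only one of the evaluation criteria the excerpt lists; speed (time to construct and apply the model) is assessed separately.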

CMPS 10 Introduction to Computer Science Lecture Notes

CMPS 10 Introduction to Computer Science Lecture Notes Chapter : Algorithm Design How should we present algorithms? Natural languages like English, Spanish, or French which are rich in interpretation and meaning are not

Federated Search of Text-Based Digital Libraries in Hierarchical Peer-to-Peer Networks

Federated Search of Text-Based Digital Libraries in Hierarchical Peer-to-Peer Networks Jie Lu School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 jielu@cs.cmu.edu Jamie Callan School of Computer

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Hakan S. KUTOGLU, Turkey Key words: Coordinate systems; transformation; estimation, reliability. SUMMARY Advances in technologies and

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedings of 2012 4th International Conference on Machine Learning and Computing IPCSIT vol. 25 (2012) IACSIT Press, Singapore Hai-Li Liang 1+, Xian-Min