Deep Classification in Large-scale Text Hierarchies


Gui-Rong Xue 1, Dikan Xing 1, Qiang Yang 2, Yong Yu 1
1 Dept. of Computer Science and Engineering, Shanghai Jiao-Tong University, {grxue, dkxng, yyu}@apex.sjtu.edu.cn
2 Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, Hong Kong, qyang@cs.ust.hk

ABSTRACT
Most classification algorithms are best at categorizing Web documents into a few categories, such as the top two levels in the Open Directory Project. Such a classification method does not give very detailed topic-related class information for the user, because the first two levels are often too coarse. However, classification on a large-scale hierarchy is known to be intractable when there are many target categories with cross-link relationships among them. In this paper, we propose a novel deep-classification approach to categorize Web documents into categories in a large-scale taxonomy. The approach consists of two stages: a search stage and a classification stage. In the first stage, a category-search algorithm is used to acquire the category candidates for a given document. Based on the category candidates, we prune the large-scale hierarchy so as to focus our classification effort on a small subset of the original hierarchy. As a result, the classification model is trained on this small subset before being applied to assign the category for a new document. Since the category candidates are sufficiently close to each other in the hierarchy, a statistical-language-model based classifier using n-gram features is exploited. Furthermore, the structure of the taxonomy can be utilized in this stage to improve the performance of classification. We demonstrate the performance of our proposed algorithms on the Open Directory Project with over 130,000 categories. Experimental results show that our proposed approach can reach 51.8% on the measure of Mi-F1 at the 5th level, which is a 77.7% improvement over top-down based SVM classification algorithms.

Categories and Subject Descriptors: H.4.m [Information Systems]: Miscellaneous; I.5.4 [Pattern Recognition]: Applications - Text processing

General Terms: Algorithms, Performance, Experimentation.

Keywords: Deep Classification, Large Scale Hierarchy, Hierarchical Classification.

1. INTRODUCTION
Text classification is at the heart of Web page classification, which finds many applications ranging from Web personalization to targeted advertisements [1] on Web pages. In text classification, our aim is to categorize a given text document into predefined classes, where the main techniques used are machine learning methods such as support vector machines (SVM). However, most machine learning methods confine themselves to classifying a document into two or a few predefined categories. As such, the power of Web-page classification is severely limited. In this paper, we take the first step in exploring how to scale up the target categories from a few to hundreds of thousands, in hierarchies of classes such as the Open Directory Project (ODP) and Yahoo! Directories, thus elevating text classification to a new, practical level. Three main difficulties prevent traditional approaches to classification from being applied. The first is the sheer size of the taxonomy of categories.
Our experiments show that as the number of classes increases to a moderate level, the predictive accuracy dramatically decreases to a level that renders the classifiers unusable. The second difficulty caused by the large size of the taxonomy is that a very long training time is required by traditional methods; traditional methods become intractable for large-scale hierarchies [12][13]. The third difficulty lies in the fact that in practice, categories are usually organized as a hierarchical structure. As a result, complex relationships, such as parent-child relations, often exist among the target classes. However, the categories of a large-scale hierarchy are assumed to be independent by most previous works. Thus, these methods cannot utilize the structure information; moreover, the failure of this assumption may even mislead these methods and decrease their performance. Hence, it is important to utilize the structure of the taxonomy in order to obtain satisfactory performance.

Previous methods for the hierarchical classification problem can be classified according to the strategies used in classification [18]. These methods can generally be divided into two types: big-bang approaches and top-down level-based approaches. In big-bang approaches, a single classifier is trained on the entire target hierarchy. Big-bang methods may allow the classification model to consider the hierarchical structure of classes. Examples are hierarchical SVM [2] and Rocchio-like classifiers [10]. However, it is shown in [12][13] that it is infeasible to directly build a classifier for a large-scale hierarchy. A second approach is the top-down approach, which constructs classifiers at each level of the category tree, where each classifier works as a flat classifier at that level. A document is first classified by the classifier at the root level. It is then classified by the classifiers trained at the lower-level categories until it reaches a final category [6]. In order to classify a document into a category correctly, it must be classified perfectly at all the ancestors. As a result, a potential problem of the top-down approach is that misclassification at a parent or ancestor category may force a document to be excluded from the child categories before it can be examined by the classifiers of those child categories. Moreover, the classifications over high-level categories may fail easily, since some of these categories are too general and thus harder to discriminate, as we show in the experiments. In this case, the performance of the top-down approach is significantly impaired. This indicates that the approach makes very restrictive assumptions on the hierarchies.

Liu et al. [12] evaluated a hierarchical SVM classification algorithm on the Yahoo! hierarchy, which contains 132,199 categories. The results show that the performance of classification on the hierarchy drops quickly as the level of the categories increases. Generally, text classification on large-scale target hierarchies remains an unsolved problem.

In this paper, we propose a novel method that can overcome those difficulties and consequently improve the performance of classification in large text hierarchies. In particular, we present a two-stage approach for large-scale hierarchical classification; we call our method deep classification. In the first stage, we organize the hierarchy into flat categories, where we perform a search process on the large-scale hierarchy by retrieving the categories related to a given document. We rank the categories and take the most related ones as category candidates. Thus, the large-scale hierarchy is pruned into a much smaller but focused one. In the second stage, we train a classification model on this small subset of the original hierarchy and classify the given document within that subset. During this stage, we propose several strategies for training classifiers, and the structure of the original hierarchy is utilized to improve the classification performance. To evaluate our deep classification approach, we have conducted several experiments on the Open Directory Project, which contains more than 130,000 categories. We test the effectiveness of the proposed deep classification algorithm by comparing it to state-of-the-art hierarchical classification algorithms. Experimental results show that our proposed approach can reach 51.8% on the measure of Mi-F1 at the 5th level, which is a 77.7% improvement over the top-down based SVM classification algorithm.

The rest of the paper is organized as follows. In Section 2, we give a brief overview of related work. In Section 3, we describe the framework of the proposed algorithms. In Sections 4 and 5, we focus on the different strategies at each stage. The evaluation results are shown in Section 6. Section 7 concludes with a summary and suggestions for future work.

2. RELATED WORK
2.1 Traditional Text Classification
In traditional text classification, many algorithms [17][22] have been proposed, such as Support Vector Machines (SVM), k-Nearest Neighbor (kNN), Naive Bayes (NB) and so on. Empirical evaluations on benchmark datasets such as Reuters-21578 [8] and RCV1 [11] have shown that most of these methods are effective in traditional text classification applications. In Web applications, most classification methods, such as SVM and NB, adapted these text classification methods to Web documents by introducing many novel features related to Web documents, such as anchor text, metadata and link structure, to optimize performance. As reported in [12], flat classification based on SVM generally performs worse than top-down based SVM for large-scale hierarchical classification. As the first work to investigate performance on a large-scale hierarchy, Liu et al. conducted a large-scale analysis on the entire Yahoo! categories and reported that the performance of flat SVM is about 30% lower on measures of Micro-F1 at the 4th level and deeper. A recall system [13] was proposed for performing large-scale flat classification, in which a simple feature-based intermediate filtering is used to reduce the potential categories for an instance to a small, manageable set. However, that system did not investigate the rich structure among the hierarchical categories. Our experimental results in Section 6.3.4 show that higher performance is achieved by considering such structure information.
2.2 Hierarchical Text Classification
There are generally two approaches adopted by the existing hierarchical classification methods [18], namely, the big-bang approach and the top-down approach.

2.2.1 Big-bang Approach
As described in [18], in the big-bang approach only a single classifier is used, built by considering the hierarchical structure of the categories. Given a document, the classifier assigns it to one or more categories in the category tree. The big-bang approach has been designed using SVM [2], Rocchio-like classifiers [10], rule-based classifiers [16] and association rules [19]. Assuming the distribution of hierarchical categories follows the power law, Yang et al. [24] gave a theoretical analysis of the scalability of text classification with flat and hierarchical methods. As reported in their work, the time cost of big-bang classification is larger than that of top-down hierarchical classification. In [2], a modified SVM version is applied on the whole hierarchy. In [4], a search-based approach is proposed to find the top K most similar categories for further search-result filtering. In [14], McCallum et al. proposed a hierarchical classification approach using shrinkage, in which the smoothed parameter estimate of a data-sparse child node is combined with that of its parent node in order to obtain robust parameter estimates; an EM algorithm is used to estimate the interpolating parameters. However, it is very difficult to conduct this process in our problem setting due to the large number of categories. Furthermore, in most previous works, experiments were conducted with at most a few thousand categories. The task of building even a single classifier for a large-scale hierarchy is known to be intractable [12]. In contrast, as we show in this paper, our method is scalable in handling large text hierarchies with hundreds of thousands of categories.

2.2.2 Top-down Approach
Top-down level-based classification has been designed based on multiple Bayesian classifiers in [9] and SVM classifiers in [5] and [6]. In [5] and [6], Dumais and Chen proposed a classifier on the top two levels of the LookSmart categories, with 163 categories in total. A top-down based SVM is applied to a very large-scale hierarchy in [12]. As reported in that work, the performance is about 40% lower on measures of Micro-F1 at the 5th level and deeper on the Yahoo! directory. Directly building top-down classifiers cannot work well in a large-scale hierarchy due to the problem of error propagation. TAPER [3] is a system for large-scale hierarchical classification using naive Bayes and feature selection at the different category levels; TAPER also performs top-down classification on the whole hierarchy. In [20], a search-result classification system was developed that classifies search results into deep hierarchies by using category candidates retrieved by the query. However, that work focused on the analysis of search results through the query, and did not directly solve the document classification issue. This paper proposes a new algorithm for document classification on deep hierarchies.

3. DEEP CLASSIFICATION
In this section, we propose a deep-classification algorithm for large-scale category hierarchies. Our algorithm works as follows. For a given document, the entire set of categories can be divided into two kinds according to their similarity to the document: categories related to the document and categories unrelated to it.

For a very large-scale hierarchy, the number of categories related to a document is much smaller than the number of unrelated categories. Traditional hierarchical classification algorithms focus on building a global classification model that optimizes the performance over all categories, despite the fact that most of the categories may not be related to a given document. Our deep classification approach utilizes this property and thus focuses on the categories related to the document. We first extract a small subset of related categories from the large-scale hierarchy. We then perform classification over these extracted categories, utilizing the structure of the original hierarchy.

Figure 1. Flowchart of Deep Classification

The algorithm is shown in Figure 1, where we present a two-stage algorithm consisting of a search stage and a classification stage. In the search stage, we try to find a subset of categories from the large-scale hierarchy related to the given document; as a result, the large-scale hierarchy is pruned into a small one. Then, in the classification stage, we train the classifier on this small hierarchy. It is intuitive that the classification performance on a few categories will be better than that on a larger set of categories. Moreover, structure information from the original hierarchy is applied in this stage to enhance the classification results.

In the search stage, a search-based algorithm is used to find the category candidates for the given document. We begin with a set of categories and a pre-classified training set of pages. One can obtain the training set from taxonomies like ODP or Yahoo!, or from other resources, depending on the desired application. Compared with the entire hierarchy, this narrowing-down procedure helps reduce the number of target category candidates. The details of this part are discussed in Section 4. Next, based on the structure of the pruned hierarchy, a classifier is trained and used to categorize the document. In this stage, by considering the pruned hierarchical structure, three training data selection strategies that utilize the hierarchical structure are proposed in Section 5.1. Then, based on the selected training data, we perform classification for the given document. Since the classification model needs to be built instantly, it is important for the algorithm to be efficient in order to make our method scalable. To satisfy this goal, we compare different classifiers and propose a lightweight classifier based on the naive Bayes classifier, described in Section 5.2.

4. STRATEGIES IN THE SEARCH STAGE
In the search stage, we propose two strategies to find the category candidates for a given document: a document-based search strategy and a category-based search strategy.

4.1 Document-based Strategy
The document-based strategy compares the relevance between the given document and the documents in the training set. The documents in the training set and the given document to be classified are both represented as normalized term-frequency vectors, and the comparison is done using the cosine similarity measure. The top N most similar documents are selected as documents related to the given document, and their categories are taken as the category candidates.

4.2 Category-based Strategy
With the category-based strategy, we represent each category by the Web pages in that category and then compute the similarity between the categories and the given document. From the pre-classified pages in the categories, we build a vector of term frequencies for each category. The given document is also represented by its term-frequency vector. Then, we compute the cosine similarity between the vector of the given document and the vector of each category, as sketched below.
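As a concrete illustration, the category-based search can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: each category is assumed to be represented by a term-frequency vector aggregated offline over its training pages, all names are illustrative, and top_n follows the candidate count tuned later in Section 6.3.2.

import math
from collections import Counter

def tf_vector(text):
    # Normalized term-frequency vector of a piece of text.
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {t: v / norm for t, v in counts.items()} if norm else {}

def cosine(u, v):
    # Both vectors are unit-normalized, so the dot product is the cosine.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def category_candidates(doc_text, category_vectors, top_n=10):
    # Rank all categories by similarity to the document and keep the
    # top-N as the candidate set that will define the pruned hierarchy.
    d = tf_vector(doc_text)
    ranked = sorted(category_vectors,
                    key=lambda c: cosine(d, category_vectors[c]),
                    reverse=True)
    return ranked[:top_n]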
Based on the search stage, we acquire the related categories, which can be either leaf nodes or internal nodes of the hierarchy. In the next step, we classify the given document into these category candidates.

5. STRATEGIES IN THE CLASSIFICATION STAGE
Based on the related category candidates, the large hierarchy is pruned into a narrow one. A category is kept if the category or one of its child categories is among the candidates; the remaining categories are removed from the hierarchy. An example of a pruned hierarchy is shown in Figure 2, where nine categories, shown in bold font, are the categories related to the given document, acquired in the related-categories search stage. Then, we perform classification on the pruned hierarchy. Since the pruned hierarchy still retains the relationship links among the categories, we wish to use these relations to enhance the results of classification. We apply classification with different strategies in this stage. Below, we consider the steps of this stage in detail; a sketch of the pruning rule follows.
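The pruning rule just described can be made concrete with a short sketch. This is an illustrative reading of the rule, assuming the hierarchy is given as a child-to-parent map; it keeps every candidate together with all of its ancestors and discards everything else.

def prune_hierarchy(parent_of, candidates):
    # parent_of: dict mapping each category to its parent (root maps to None).
    # A category survives iff it is a candidate or an ancestor of one,
    # i.e. iff it or one of its descendants is among the candidates.
    kept = set()
    for cat in candidates:
        node = cat
        while node is not None and node not in kept:
            kept.add(node)
            node = parent_of.get(node)
    return kept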

5.1 Strategies for Training Data Selection
5.1.1 Flat Strategy
The flat strategy is a simple strategy for training data selection in which we just consider the category candidates as a flat structure, without considering the category information of their ancestors. From the viewpoint of hierarchical classification, this strategy places all the category candidates directly under the root, as shown in Figure 3. Then, we directly train the classifier on the Web pages in the candidate categories.

5.1.2 Pruned Top-down Strategy
Considering the tree structure of the pruned hierarchy, we can use the pruned top-down strategy to train the classifiers. The pruned top-down strategy can be taken as a specific type of the top-down classification method proposed in [6][12], obtained by first simplifying the large hierarchy into a narrow one. A document is first classified by the classifier at the root level; it is then classified by the classifiers of the lower-level categories until it reaches a final category.

Figure 2. Pruned Hierarchy
Figure 3. Flat Strategy
Figure 4. Ancestor-Assistant Strategy

5.1.3 Ancestor-assistant Strategy
The structure of the hierarchy is largely ignored by the previous two strategies. However, as discussed in Section 1, an ideal strategy for training data selection should take this structural information into account. Thus, we propose the ancestor-assistant strategy to utilize this information. This strategy is guided by two observations. First, the training data from the category candidate itself may be insufficient in size, especially for a deep category; thus, we need to obtain more data elsewhere. Second, although the training data from ancestors high up in the hierarchy may be too general to reflect the characteristics of a deep category candidate, we can still borrow data from nearer ancestors; we should not do this for ancestors that are too high up. Hence, we propose a trade-off between the hierarchical strategy and the flat strategy by combining the training data from the category candidate itself with the training data from its ancestors, as long as those ancestors are not common ancestors shared with other category candidates. By considering the structure of the hierarchy, the scarcity of training data for deep categories can be alleviated. In addition, we include the training data from the node itself to preserve the characteristics of the category, so that the training data will not be dominated by data from its ancestors. As shown in Figure 2, since the common ancestor is category 24, the training data for category 874 come from categories 834, 875 and 874, while the training data for category 92 come from categories 854 and 92. The tree in Figure 4 further clarifies this strategy. If a node were allowed to go up to a higher level, too much training data would be involved; large amounts of borrowed training data may make the data unbalanced and degrade performance. In this work, we limit the borrowing height to two levels above the node itself when applying this method; a sketch of this selection rule follows.
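Under the stated constraints (borrow from at most two levels up, and never from an ancestor shared with another candidate), the ancestor-assistant selection can be sketched as below. The helper names and data layout are assumptions for illustration, not the authors' code.

def ancestors(parent_of, node):
    # Yield the ancestors of a node, from its parent upward to the root.
    node = parent_of.get(node)
    while node is not None:
        yield node
        node = parent_of.get(node)

def ancestor_assistant_data(cand, all_candidates, parent_of, pages_of, max_up=2):
    # Ancestors that also sit above another candidate are off limits:
    # borrowing from them would blur the distinction between candidates.
    forbidden = set()
    for other in all_candidates:
        if other != cand:
            forbidden.add(other)
            forbidden.update(ancestors(parent_of, other))
    data = list(pages_of.get(cand, []))   # always keep the node's own pages
    node = cand
    for _ in range(max_up):               # climb at most two levels
        node = parent_of.get(node)
        if node is None or node in forbidden:
            break                         # stop before any shared ancestor
        data.extend(pages_of.get(node, []))
    return data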
5.2 Strategies for Classifier Selection
For a given document, we need to train a specific classifier. Thus, it is preferable to employ a lightweight classifier that does not require much training time, because a classifier over a different collection of categories may be required in response to each different document. If a classifier such as SVM is employed, the long training time might prevent us from delivering the results to the user in a timely manner. To this end, we prefer the Naive Bayes Classifier (NBC), considering that the probability estimates of NB can be acquired offline. In the experimental part, we also give the experimental results for SVM and compare the efficiency and effectiveness of the two.

5.2.1 Standard NBC
Standard NBC estimates the probability that a test example belongs to a category by computing:

$$P(c_i \mid d) \propto P(d \mid c_i)\,P(c_i) = P(c_i)\prod_{j=1}^{N} P(t_j \mid c_i)^{d_j} \quad (1)$$

where $c_i$ is a category, $d$ is the test example, $N$ is the vocabulary size, $t_j$ is the $j$-th term of the vocabulary, and $d_j$ is the corresponding value in $d$ for term $t_j$ (usually its term frequency). During the classification stage, the classifier assigns a category to the given document according to:

$$c^* = \arg\max_{c_i \in C} \{P(c_i \mid d)\} = \arg\max_{c_i \in C} \{P(c_i)\,P(d \mid c_i)\} = \arg\max_{c_i \in C} \Big\{P(c_i)\prod_{j=1}^{N} P(t_j \mid c_i)^{d_j}\Big\} \quad (2)$$

It is clear that the probability $P(d \mid c_i)$ for each category $c_i$ can be computed from estimates acquired offline. NBC takes less training time than the SVM algorithm on the pruned hierarchies; thus, it is a kind of lightweight classifier.

5.2.2 N-Gram Language Models for Classifiers
In NBC, terms are considered independent of each other given the category. However, in our situation, most of the candidate categories are very close to each other, and it is difficult for NBC to distinguish them based on features of independent terms. In our work, we propose to use a Markov n-gram language model to perform the classification over the candidate categories by considering the Markov dependency between adjacent terms [7][15]. For a term sequence $t_1 t_2 \cdots t_T$, the probability of the sequence is written as:

$$P(t_1 t_2 \cdots t_T) = \prod_{i=1}^{T} P(t_i \mid t_1 \cdots t_{i-1}) \quad (3)$$

An n-gram model approximates this probability by assuming that the only terms relevant to predicting $P(t_i \mid t_1 \cdots t_{i-1})$ are the previous $n-1$ terms; that is, it makes the Markov n-gram independence assumption

$$P(t_i \mid t_1 \cdots t_{i-1}) = P(t_i \mid t_{i-n+1} \cdots t_{i-1})$$

We make a straightforward maximum likelihood estimate of the n-gram probabilities from a corpus using the observed frequencies. We note that different smoothing strategies have been proposed and evaluated in [15]. Using n-gram features for text classification, our prediction is:

$$c^* = \arg\max_{c_i \in C} \{P(c_i \mid d)\} = \arg\max_{c_i \in C} \Big\{P(c_i)\prod_{i=1}^{T} P(t_i \mid t_{i-n+1} \cdots t_{i-1}, c_i)\Big\} \quad (4)$$
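A compact sketch of decision rules (2) and (4), computed in log space, might look as follows. The smoothed log-probability tables are assumed to have been estimated offline, as described above, and the unseen-n-gram penalty is an illustrative stand-in for a real smoothing scheme. With n = 1 this is the standard NBC rule; with n = 3 it is the 3-gram classifier.

import math

def ngrams(tokens, n):
    # Pad the history so the first terms also have n-1 predecessors.
    padded = ["<s>"] * (n - 1) + tokens
    return [tuple(padded[i:i + n]) for i in range(len(tokens))]

def classify(tokens, log_prior, log_cond, n=3, unseen=-20.0):
    # arg max over categories of log P(c) + sum_i log P(t_i | history, c).
    best_cat, best_score = None, -math.inf
    for c, prior in log_prior.items():
        score = prior + sum(log_cond[c].get(g, unseen)
                            for g in ngrams(tokens, n))
        if score > best_score:
            best_cat, best_score = c, score
    return best_cat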

In this work, we use a 3-gram model for our classification, based on the result reported in [15], which states that 3-grams often give the best performance for text classification.

6. EXPERIMENTS
6.1 Experimental Setup
6.1.1 Dataset

Figure 5. Documents Distribution on Different Levels
Figure 6. Categories Distribution on Different Levels

To evaluate the performance of our algorithm, experiments are conducted using a set of classified Web pages extracted from the Open Directory Project (ODP) (http://dmoz.org/). ODP has about 4,810,187 Web pages and 172,548 categories, in which each Web page is classified by human experts into 17 top-level categories (Arts, Business and Economy, Computers and Internet, Games, Health, Home, Kids and Teens, News, Recreation, Reference, Regional, Science, Shopping, Society, Sports, Adult and World). Because the Web pages in the Regional category are also included in other categories, and because many Web pages in the World category are not written in English, these two categories are removed in our experiments. Accordingly, 15 categories in all are used in the experiments. After downloading from the Web, we obtain about 1.3 million Web documents in all. The data are divided into a training set and a testing set. The distribution of these Web pages over the 130,000 categories is shown in Figure 5: about 76.8% of the documents belong to categories in the top six levels, and about 68.6% of the documents belong to fourth-to-sixth-level categories. The distribution of the 130,000 categories is shown in Figure 6: about 67.8% of the categories are in the top six levels, and about 64.1% of the categories are fourth-to-sixth-level categories. This shows that classifying Web pages into deep categories is very important.

As we mentioned in Section 1, the number of related categories for a given document is small. Here we present statistics on the number of categories per document. As shown in Table 1, about 93.46% of the documents belong to one category; only 6.54% of the documents have two or more categories. It is thus reasonable to select a small subset of the large-scale hierarchy on which to perform the classification for this dataset.

Table 1. Distribution of the Number of Categories per Document
Number of Categories | Number of Documents | Percentage
1 | 1,214,977 | 93.46%
2 | 74,237 | 5.71%
3 | 11,024 | 0.85%
>=4 | 1,950 | 0.15%

Since the whole data set is too large, we take 30,000 documents out of the 1.3 million as the testing data. Furthermore, in order to tune the performance of the different strategies, 20,000 additional documents are randomly selected, called the validation data. The remaining data are taken as the training data. We build the document index and the category index for the related-categories search stage.

6.1.2 Evaluation Metrics
In typical classification experiments, the number of documents is usually an order of magnitude greater than the number of categories. However, the number of target categories in our tests exceeds 130,000, and conducting experiments with 30K or even more testing documents is very time-consuming. To avoid the undefined-value problem of Ma-F1 measurements over so many categories, we use the Mi-F1 metric [21], as described in [12], measured at each level. The process of evaluation is as follows. First, we classify a document into the whole deep hierarchy; for example, a Web page p can be classified into the category Top/Computers/Programming/Languages/JavaScript/W3C_DOM.
Then, we evaluate the performance at each level of the hierarchy according to the classified category. That is, when evaluating the performance at level one, we judge whether p belongs to the category Top/Computers; when evaluating the performance at level 2, we judge whether the Web page p belongs to Top/Computers/Programming. Hence, this differs from the traditional method that trains the classifier at level 1 or level 2 by aggregating the data of children nodes into their parent category and only evaluating the performance at that level. A sketch of this level-wise judgment follows.
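To make the protocol concrete, here is a minimal sketch of the level-wise judgment. It assumes single-label documents with paths written like the example above; under that assumption, level-wise Mi-F1 reduces to level-wise accuracy.

def path_prefix(path, level):
    # First `level` labels below the "Top" root of a path such as
    # "Top/Computers/Programming/Languages/JavaScript/W3C_DOM".
    return path.split("/")[1:level + 1]

def levelwise_accuracy(pairs, max_level):
    # pairs: list of (predicted_path, true_path) strings.
    scores = {}
    for level in range(1, max_level + 1):
        # Judge only documents whose true category is at least this deep.
        eligible = [(p, t) for p, t in pairs
                    if len(t.split("/")) - 1 >= level]
        if eligible:
            hits = sum(path_prefix(p, level) == path_prefix(t, level)
                       for p, t in eligible)
            scores[level] = hits / len(eligible)
    return scores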

6.2 Overall Performance
Three algorithms are compared in this work:
- Hierarchical SVM: Top-down classification is an efficient algorithm. In this work, we employ hierarchical SVM as a representative algorithm for top-down classification.
- Search-based Strategy: As described in our deep classification algorithm, we can take the most similar category as the category for the given document, which is similar to a nearest-neighbor approach.
- Deep Classification: This is our proposed algorithm. As mentioned, there are several strategies for each step; we tune these strategies in Section 6.3 and take the ones that achieve the highest performance. The top 10 categories are taken as category candidates, and category-based search, the ancestor-assistant strategy and the 3-gram language model classifier are taken as the setting for deep classification.

Each algorithm is tuned to achieve its highest performance on the validation data. The overall performance of the three algorithms is shown in Figure 7.

Figure 7. Performance on Different Levels

As shown in Figure 7, our proposed deep classification algorithm achieves a consistent improvement over the other algorithms at the different levels of the hierarchy. The performance of our proposed algorithm reaches 51.8% at level 5, while hierarchical SVM achieves only 29.2% at the same level; that is, our algorithm obtains about a 77.4% improvement over the top-down approach at level 5. By using the two-stage scheme, our algorithm can make accurate classifications on a pruned hierarchy. Since hierarchical SVM is conducted in a top-down manner, as discussed above, the structure of the hierarchy is not properly utilized, so errors at higher levels are propagated to deeper levels; as a result, deep-level classification cannot achieve good performance. Another reason is that hierarchical SVM cannot construct training sets of sufficient size when learning deep categories of the hierarchy, so its performance is significantly reduced over the deep-level categories. Furthermore, as shown in Figure 7, the deep classification algorithm also achieves higher performance than the search-based strategy. This result shows that the classification stage of the deep classification algorithm is indeed necessary, leading to more precise results on the deep hierarchy.

6.3 Strategy Selection
In this section, we evaluate the different strategies used in each stage of the proposed deep classification algorithm. The algorithms are tested on 2,000 randomly chosen documents from the validation data. We tune the strategies one by one, fixing the other strategies while tuning each one.

6.3.1 Search Strategy
As proposed in Section 4, there are two strategies for finding the category candidates for a new document: the document-based strategy and the category-based strategy. Here we evaluate which strategy produces higher performance. The NB classifier is used for its simplicity, and the top 10 categories are used. The experimental results are shown in Figure 8. As shown in Figure 8, the category-based strategy produces higher performance than the document-based strategy at every level. At level 5, the category-based strategy achieves a 69.2% improvement over the document-based strategy on the measure of Mi-F1. We explain this observation by the fact that the similarity scores between a few retrieved documents of a category and a given document cannot represent the similarity between the whole category and the document; the category provides more information than an individual document within it. Furthermore, the time cost of the category-based strategy is much lower than that of the document-based strategy. Thus, we use the category-based strategy in the search stage of the deep classification algorithm.

Figure 8. Performance on Different Search Strategies

6.3.2 Candidate Category Number Selection
In the search stage, the system can return different numbers of category candidates. We try to decide how many top-ranked categories should be used so that the category candidates are adequate. If we choose only one category, the two-stage method degenerates to the search-based strategy alone. We perform this evaluation on the tuning data; the experimental results are reported in Figure 9.
As shown in Figure 9, the more categories chosen in the search stage, the more likely we are to find the correct target category in the classification stage. However, too many categories also aggravate the training-time burden in the classification stage.

Figure 9. Performance on Different Numbers of Category Candidates

As shown in the figure, the performance on the top three levels is reduced, although only slightly, as the number of candidate categories is increased from 1 to 10. In deeper levels, however, the performance increases significantly and tends to become stable near 10 categories. Thus, the number of category candidates is set to 10, considering the trade-off between time complexity and performance. In the following experiments, we use the category-based search strategy and take the top 10 categories as the category candidates.

6.3.3 Feature Selection
Based on the search stage, category candidates for a new document are found, reducing the large hierarchy to a small one. In our problem, the number of features is very large in most situations. To address this, we carry out feature selection and show the performance obtained with different numbers of features. We perform CHI-square feature selection, which is verified as the best feature selection method for text classification in [23]. Two different learning methods are evaluated: hierarchical SVM and naive Bayes (NB). As shown in Figure 10, the performance with 2,000 selected features is similar to that with all features, and it is an obvious advantage that fewer features reduce training and testing time. Therefore, in this work, the number of features is limited to 2,000, selected by CHI-square feature selection (the criterion is sketched below).

Figure 10. Performance on Feature Selection
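For reference, the CHI-square criterion used here can be sketched as follows. This follows the standard chi-square statistic from [23]; the counts are assumed to come from single-label training documents, and the variable names are illustrative.

def chi_square(A, B, C, D):
    # A: docs in c containing t      B: docs not in c containing t
    # C: docs in c without t         D: docs not in c without t
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def select_features(doc_freq, cat_sizes, n_docs, k=2000):
    # doc_freq[t][c] = number of documents of category c containing term t.
    # Score each term by its maximum chi-square over categories; keep top k.
    scores = {}
    for t, per_cat in doc_freq.items():
        df = sum(per_cat.values())        # total docs containing t
        best = 0.0
        for c, A in per_cat.items():
            B = df - A
            C = cat_sizes[c] - A
            D = n_docs - df - C
            best = max(best, chi_square(A, B, C, D))
        scores[t] = best
    return sorted(scores, key=scores.get, reverse=True)[:k]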

6.3.4 Training Data Selection
Based on the pruned hierarchy, we considered three strategies of training data selection for the subsequent classification. To show the performance of the different strategies, we conduct an experiment on the small hierarchy generated from the category candidates, using the naive Bayes classifier. The experimental results are shown in Figure 11. As shown in the figure, the Ancestor-Assistant strategy for training data selection achieves the highest performance: about 3.6% and 9.5% improvement over the hierarchical strategy and the flat strategy, respectively, on the measure of Mi-F1 at level 5.

Figure 11. Performance on Different Strategies for Training Data Selection

As shown in the figure, the performance of the flat strategy is lower than that of the Ancestor-Assistant strategy, since the flat strategy ignores the structure of the hierarchy. Thus it cannot acquire enough training data in some cases, because the information from the ancestors is not used to enhance the classifier. The information from the ancestors is vitally important when the training data from the category candidate itself are insufficient; the performance of the flat strategy is very poor in this case. This experiment also shows that using the rich structure of hierarchical categories can enhance the performance of large-scale classification, which is largely ignored in [13]. The low performance of the top-down strategy is due to two factors: (1) In the top-down scheme, error rates accumulate at each level, gradually reaching an unbearable amount at some deep level of the hierarchy. This problem is overcome in our flat and Ancestor-Assistant strategies, where the classification is performed using a flat classifier. (2) The training data from an ancestor may be too general and cannot characterize the category candidates. In other words, this method improperly utilizes the structure information and thus introduces noise when supplementing the training examples. For example, in Figure 2, training data from categories 834 and 854 are used to train the classifier when classifying the documents of categories 874 and 92, respectively. Our Ancestor-Assistant strategy overcomes this problem, since both generalized information from the structure and specific information from the category itself are employed together.

6.3.5 Classifier Selection
Classifier selection is a key step in obtaining the final category for a new document. Since the model is trained instantly when a document is given, NB and 3-gram NB are proposed for use, considering their efficiency. Here we conduct experiments to show the performance of the two algorithms and also compare them to the SVM algorithm. We also show the performance of SVM with the features generated by the 3-gram language model, which we call 3-gram SVM. As shown in Figure 12, our proposed 3-gram based classification method achieves higher performance than traditional NB. Since the candidate categories are very similar to each other, it is difficult for NB to distinguish them without considering the dependency between words. Another explanation is that, since the category candidates are acquired based on independent term features, if we still rely on such features to do the classification, the effectiveness of the classifiers will be decreased.
The 3-gram classifier takes associated terms into account, and thus more discriminative features are used than in the NBC method. As a result, the 3-gram classifier achieves higher performance.

Figure 12. Performance on Different Classifier Selection

Generally, the SVM and 3-gram SVM based algorithms achieve higher performance than the NB algorithm and the 3-gram NB algorithm, respectively. However, the second stage of deep classification needs an efficient classifier because of the online computation; if we used the 3-gram based SVM, training the model in the online step would be very time-consuming. Hence, in this work, 3-gram NB is taken as the second-stage classifier because of its combination of high performance and efficiency.

Figure 13. Performance for Different Classifiers on Far-Distance Categories

We also conducted additional experiments to validate this conclusion. We randomly picked three groups of deep categories. Each group contains three categories that are far apart from one another (they differ at the first level). We then ran the 3-gram classifier, NB, 3-gram SVM and SVM with a linear kernel on the same training and testing data under each category group. As shown in Figure 13, these classifiers achieve comparable performance to each other. Furthermore, SVM and 3-gram SVM achieve better performance than NB and the 3-gram classifier, respectively.

6.3.6 Time Complexity
The indexing process and the training of the NB classifier and the 3-gram language model for classification are conducted offline. The time complexity of the online computation is as follows. As estimated in [24], the average time costs of document-based search and category-based search are $O(n \cdot l_n^2 / |V|) + O(n)$ and $O(m \cdot l_n^2 / |V|) + O(m)$, respectively, where $l_n$ is the average length of new documents, $|V|$ is the vocabulary size, and $m$ and $n$ are the numbers of categories and training documents, respectively. Since $n$ is much larger than $m$, the testing time for category-based search is less than that of document-based search. In the classification stage, we perform the classification only on a narrow hierarchy. Assuming we have $m'$ categories, which is a constant, the time cost is about $O(l_d \cdot m' + m' \log m')$ for NBC and about $O(l_d^3 \cdot m' + m' \log m')$ for the 3-gram language model. Therefore, the online time complexity is acceptable, which indicates that our algorithm is scalable and can handle very large hierarchies efficiently.

7. CONCLUSION AND FUTURE WORK
In this paper, we have proposed a novel algorithm for Web document classification over a large-scale text hierarchy. A two-stage algorithm is presented, consisting of a search stage and a classification stage. The search stage prunes the original large hierarchy into a small and tractable one, and the structure of the original hierarchy is considered when we train a classifier in the classification stage. As a result, our method is both efficient and effective in handling very large-scale hierarchies. Experimental results showed that our proposed algorithm achieves a 77.7% improvement over the top-down based SVM classification algorithm on accuracy at the 5th level of large-scale hierarchies. As future work, we will extend the deep classification algorithm to different kinds of applications, such as online advertisement classification. Another direction is to improve the efficiency of the search stage of deep classification; we will develop more effective indexing algorithms to improve the classification performance.

8. REFERENCES
[1] Broder, A., Fontoura, M., Josifovski, V., and Riedel, L. A Semantic Approach to Contextual Advertising. In Proc. of ACM SIGIR '07, pp. 559-566, 2007.
[2] Cai, L. and Hofmann, T. Hierarchical Document Categorization with Support Vector Machines. In Proc. of CIKM 2004, pp. 78-87, 2004.
[3] Chakrabarti, S., Dom, B., Agrawal, R., and Raghavan, P. Scalable Feature Selection, Classification and Signature Generation for Organizing Large Text Databases into Hierarchical Topic Taxonomies. The VLDB Journal, vol. 7, no. 3, pp. 163-178, 1998.
[4] Chekuri, C., Goldwasser, M., Raghavan, P., and Upfal, E. Web Search Using Automatic Classification. In Proc. of WWW-96, San Jose, US, 1996.
[5] Chen, H. and Dumais, S. Bringing Order to the Web: Automatically Categorizing Search Results. In Proc. of CHI 2000, pp. 145-152, 2000.
[6] Dumais, S. and Chen, H. Hierarchical Classification of Web Content. In Proc. of the 23rd ACM SIGIR, pp. 256-263, 2000.
[7] Gao, J. F., Nie, J. Y., Wu, G. Y. and Cao, G. H. Dependence Language Model for Information Retrieval. In Proc. of the 27th ACM SIGIR, pp. 170-177, 2004.
[8] http://www.daviddlewis.com/resources/testcollections/reuters21578/.
[9] Koller, D. and Sahami, M. Hierarchically Classifying Documents Using Very Few Words. In Proc. of the 14th ICML, 1997.
[10] Labrou, Y. and Finin, T. W. Yahoo! as an Ontology: Using Yahoo! Categories to Describe Documents. In Proc. of the 8th ACM CIKM, pp. 180-187, 1999.
[11] Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, Vol. 5, pp. 361-397, 2004.
[12] Liu, T.-Y., Yang, Y.-M., Wan, H., Zeng, H.-J., Chen, Z. and Ma, W.-Y. Support Vector Machines Classification with a Very Large-scale Taxonomy. SIGKDD Explorations, 7(1): pp. 36-43, 2005.
[13] Madani, O., Greiner, W., Kempe, D., and Salavatipour, M. Recall Systems: Efficient Learning and Use of Category Indices. In Proc. of AISTATS, 2007.
[14] McCallum, A., Rosenfeld, R., Mitchell, T. and Ng, A. Improving Text Classification by Shrinkage in a Hierarchy of Classes. In Proc. of ICML-98, 1998.
[15] Peng, F. C., Schuurmans, D. and Wang, S. J. Augmenting Naive Bayes Text Classifiers with Statistical Language Models. Information Retrieval, 7(3-4), pp. 317-345, 2004.
[16] Sasaki, M. and Kita, K. Rule-based Text Categorization Using Hierarchical Categories. In Proc. of the IEEE Int. Conf. on Systems, Man, and Cybernetics, pp. 2827-2830, 1998.
[17] Sebastiani, F. Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47, 2002.
[18] Sun, A. and Lim, E.-P. Hierarchical Text Classification and Evaluation. In Proc. of IEEE ICDM, pp. 521-528, 2001.
[19] Wang, K., Zhou, S., and He, Y. Hierarchical Classification of Real Life Documents. In Proc. of the 1st SIAM Int. Conf. on Data Mining, Chicago, 2001.
[20] Xing, D.-K., Xue, G.-R., Yang, Q. and Yu, Y. Deep Classifier: Automatically Categorizing Search Results into Large-Scale Hierarchies. In Proc. of ACM WSDM 2008, pp. 139-148, 2008.
[21] Yang, Y. An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval, Vol. 1, No. 1/2, pp. 67-88, 1999.
[22] Yang, Y. and Liu, X. A Re-examination of Text Categorization Methods. In Proc. of ACM SIGIR '99, pp. 42-49, 1999.
[23] Yang, Y. and Pedersen, J. P. A Comparative Study on Feature Selection in Text Categorization. In Proc. of the 14th ICML, pp. 412-420, 1997.
[24] Yang, Y., Zhang, J. and Kisiel, B. A Scalability Analysis of Classifiers in Text Categorization. In Proc. of ACM SIGIR '03, pp. 96-103, 2003.