
Contents

Contents
List of Tables
List of Figures
1. Introduction
    1.1. Internet Information
        1.1.1. Internet Information Retrieval
            Document Indexing
            Document Retrieval
            Document Ranking
        1.1.2. Internet Information Management
    1.2. Using Machine Learning to Retrieve and Manage Information
2. Information Retrieval
    2.1. Information Retrieval Systems
        2.1.1. Term Analysis and Weighting
            Applying Stoplists
            Stemming
            Term Weighting
                Term Frequency (TF)
                Inverted Document Frequency (IDF)
                TFxIDF
    2.2. Document Representation and Index
        Boolean Model
        Vector Space Model (VSM)
        Extended Boolean Model
    2.3. Retrieving and Ranking Algorithms
        Boolean Retrieval
        Vector Space Model (VSM)
        Extended Boolean Model
            MMM Model
        Other Retrieval Models
    2.4. User Relevance Feedback
    2.5. Evaluation of IR Systems
        Retrieval effectiveness
        Retrieval efficiency
        Storage efficiency
3. Internet Search Engine
    3.1. Challenges of Search Engines
        3.1.1. The Internet Document
            The size of the document set
            Unstructured and redundant documents
            Quality of documents
            Different languages of document content
            The boundary of the Internet documents
        3.1.2. The Internet User
    3.2. The Implementation of Search Engines
        3.2.1. Web Crawler
            HTTP module
            HTML Parser
            Consistency Checker
            Exclusive partition coordinator
        3.2.2. Index Engine
        3.2.3. Query Engine
            Query Representation
            Query Result
    3.3. Meta Search Engines
    3.4. Measures of Search Engines
        Size
        Freshness
        Features
            Crawling
            HTML Extraction
            Indexing
            Ranking
        Response Time
4. Internet Information Management
    The Web Directory Service
    Advantages of Directory Service
    4.3. Disadvantages of Directory Service
    Open Directory
5. Document Management and Machine Learning
    Machine Learning
        Version Space Learning
        ID3
        Learning from Databases
        Our Learning Method
    Automatic Categorization of Documents
        Document Clustering
            Hierarchical Agglomerative Clustering (HAC)
            Iterative Clustering
        Document Classification
            Mining Association Rules
            Classification
            Learning from Databases
Automatic Generation of Concept Hierarchies - ARCH
    Concept Hierarchy (CH) of numerical attributes
    CH of symbolic attributes
    CH of objectified attributes
        No Sharing case
        Sharing case
        Inheritance
Estimating of Optimal Generalization Level - OGL
    The cost model of learning rule's complexity
    OGL method
    Simulation Results
Learning with Attribute Selection by Entropy - ASE
    Attribute Selection by Entropy (ASE)
    An Example
    The accuracy of rules induced from ASE
    Discussion
7. The Overview and Design of ACIRD
    An Overview
        Document Operation
        Term Operation
    Conceptual Model
    File Structure
    Query Operation
    Hardware
    Document and Knowledge Representation
    The Document Classification Learning
        Preprocessing Process and Knowledge Representation
            HTML Parser
            Term Parser
        Feature Selection Process
        Learning Classification Knowledge
        Mining Term Association
            Granularity of mining associations
            Domain of generating association rules
        Refinement of Classification Knowledge
            Perfect Term Support (PTS) Algorithm
            Effects of Knowledge Refinement Process
    The Retrieval System
        Search Engine
            Web Crawler
            Fulltext Index Engine
            Ranking Algorithm
        Two-Phase Search
            Analysis on User Query Log
            Two-Phase Search Method
8. The Implementation of ACIRD
    The Search Engine
        The Crawling Modules
            URL Pruner
            URL Unifier
            MD5 Unifier
        Document Processing Modules
            HTML Parser
            Language Detector
            En-BIG5 Splitter
            Term Parser & Extractor
        Indexer
        Searcher
    Document Classification Learning
        Similarity Measurement
        Experiment Results
    8.3. Two-Phase Search Engine
        Examples of Two-Phase Search
9. Conclusions and Future Work
    Contributions
    Future Work
References
Appendix
    Proof of Heuristics used in the Perfect Term Support Algorithm
        Proof of PTS
        An Example of PTS
        Completeness of PTS

List of Tables

Table 3-1. The percentages of bad links found in each search engine
Table 3-2. Search Engine Features
Table 4-1. The chart comparing the size of directories at various services, along with other key data
Table 7-1. The effectiveness of the PTS algorithm on classification knowledge
Table 7-2. Total reference counts (recall rate) vs. the number of keywords (index rate)
Table 8-1. The distribution of training objects in the most general classes
Table 8-2. Initial supports of terms in the class 海報攝影 (Placard & Photography)
Table 8-3. Loops of PTS for the class 海報攝影 (Placard & Photography)

List of Figures

Figure 2-1. The processes of IRS
Figure 2-2. Effects of search on total document space
Figure 2-3. Achievable precision/recall graph
Figure 3-1. The size of each search engine's index
Figure 3-2. Search engine sizes over time
Figure 3-3. Search Engine Sizes (NEC Research Institute)
Figure 3-4. Coverage of search engines (April 1998)
Figure 3-5. Coverage of search engines (May 1999)
Figure 4-1. The comparison of directory size in Open Directory and Yahoo!
Figure 5-1. The concept of Version Space
Figure 5-2. The learned features of a class hierarchy
Figure 6-1. The trapezoidal membership function
Figure 6-2. The fuzzy concept hierarchy and membership functions
Figure 6-3. The translation from Gaussian function to trapezoidal membership function
Figure 6-4. The concept hierarchy of objectified attribute Engine
Figure 6-5. A concept hierarchy generated from the class hierarchy
Figure 6-6. An example for describing the concept of OGL
Figure 6-7. The simulation result of ARCH + OGL + ID3 (the rule complexity)
Figure 6-8. The simulation result of ARCH + OGL + ID3 (the rule accuracy)
Figure 6-9. The comparison of Version Space, modified ID3, and AGE
Figure 6-10. Relationship between entropy and probability (example# = 100)
Figure 6-11. Self-organizing OODM according to learning rules
Figure 7-1. Two-phase query process in ACIRD
Figure 7-2. Conceptual model and systematic knowledge representation of ACIRD
Figure 7-3. The document classification learning of ACIRD
Figure 7-4. The distribution of term supports of training data
Figure 7-5. The distribution of term supports of all the classes of the training data
Figure 7-6. Construction of Term Semantic Network
Figure 7-7. PTS Refinement on TSN
Figure 7-8. Additional modules to implement the Chinese search engine
Figure 7-9. Processing flow of Two-Phase Search
Figure 8-1. The architecture of the search engine
Figure 8-2. The classification accuracy of assigning 8,855 testing objects to the 386 most specific classes based on Top N
Figure 8-3. The classification accuracy of assigning 8,855 testing objects to the 12 most general classes
Figure 8-4. The classification accuracy of well-trained classes
Figure 8-5. Two-phase search provided by the ACIRD query interface
Figure 8-6. Class-level search query result: matched classes
Figure 8-7. Search all objects under a class query result: list all objects of a class
Figure 8-8. Object-level search query result: list direct objects of a class
Figure 8-9. Object-level search query result: search objects in designated classes
Figure 8-10. Search all objects query result: search all objects
Figure 8-11. TSN of the class 海報攝影 (Placard & Photography)

1. Introduction

The explosive growth of the Internet has dramatically revolutionized the technology of information access and communication, and has changed the way of working in all walks of life. Based on a survey performed by Yam, the most popular activities on the Internet are search (29.8%), software downloading (11%), e-mail (11%), and browsing directories (10.6%). Software downloading is also a kind of information finding; thus, a full half of all activities on the Internet are information finding. Obviously, search engine service (information retrieval) and directory service (information management) are major topics concerning the Internet.

Using search engines, such as AltaVista, Infoseek, and HotBot, is currently very popular on the Internet. In the search environment, a user describes his or her information needs by a query, which is a sequence of terms concatenated by operators defined by the search engine. The query is then matched against document indices to retrieve documents that are related to the user's request, and the result is presented as a ranked list of documents. The approach is efficient in retrieving related documents, based on the technology of the inverted index [24]. Unfortunately, since the approach always retrieves too many undesired results, it is inadequate for the enormous number of Internet documents. For example, AltaVista retrieved 332,507 results for the query "information retrieval" (the test was done on April 28). Based on the statistics of [86], the average query length is 1.3 words. According to this observation, search engines will retrieve ever more documents per query as the Internet grows.

Another approach, the directory service, classifies documents into directories according to subjects or topics, organizing them as a hierarchical tree or lattice; examples are Yahoo!, Magellan, and Yam. Users can discover the information they need by navigating the hierarchy or by restricting the search domain to some directories. To accurately classify millions of documents into a large lattice, hundreds of well-trained staff work diligently assigning new documents to the lattice. In the case of Yam, more than 70,000 web pages have been collected and manually categorized into Yam's lattice. However, the measure based on the full-text search engine of Yam indicates that there are far more web pages in Taiwan. Therefore, the directory service suffers from the problem of information deficit, since manually classifying all these documents is slow. As a result, the manually classified directory service will never catch up with the rate of growth of the Internet.

We first define the scope of Internet information used in this dissertation. The differences between document and information are also pointed out. Related work on information retrieval and management is also introduced in this chapter.

1.1. Internet Information

People depend on information to carry out the activities of daily life. In recent years, more and more people have come to depend on the Internet to fill a part of their information needs. Most notably, with the birth of the World Wide Web [8], the Internet has created a revolution in the Information Age. In February 2000, a joint study published by Inktomi and the NEC Research Institute estimated that there were one billion indexable pages on the Web. Discovering useful information from over one billion pages is a daunting task.

What is information? There is no fully satisfactory answer. In the computer world, information is composed of text (including numeric and date data), images, audio, video, and other multimedia objects. In this dissertation, the text object is the only data type that is processed so as to contain useful information.

The other types are regarded as highly informative sources and have been categorized under content-based retrieval, which is not within the scope of this dissertation.

If we restrict the definition of information to the domain of the Web, Internet-based information can be defined as a set of data that has been matched to a particular information need and can be accessed from the Internet. Of course, Internet data consist of any digitized media presented by Web browsers. If the Internet had never been created, most of the data in the world would not be as useful, because it would be hard to access and organize. For the same reason, if there were no search engines and directory services, Web pages would also just be data that is hard to discover and use.

In the past few years, the Web has gained popularity and has become a convenient and primary way of publishing information on the Internet. However, as the number of sites increases many fold each year, it becomes impossible to discover desired documents by following a large number of hyperlinks. This situation has stimulated research and development on Internet Resource Discovery, which assists users in locating desired information efficiently. In [79], Schwartz mentions two basic problems involved with Internet Resource Discovery: distributing this information flexibly and efficiently, and characterizing the resource of interest using name/attribute descriptions. The former corresponds to the topic of Information Retrieval, and the latter to Information Management.

Throughout this dissertation, we regard the Internet and the World Wide Web (or the Web) as the same concept. Similarly, text data refers to Internet documents or Web pages. Information can be regarded as the abstraction of a set of related pages, such as the result of a search or a directory.

1.1.1. Internet Information Retrieval

Information Retrieval (IR) was originally developed to help manage the huge scientific literature that has developed since the 1940s. Many universities and public libraries now use automatic IR systems to provide access to books, journals, and other documents. In brief, IR is concerned with three processes: preprocessing and indexing documents into files or databases, retrieving related documents from storage, and summarizing retrieved documents as desired information. These three processes are referred to as indexing, retrieving, and ranking of documents.

Document Indexing

A document is a series of sentences, i.e., a long string of text. Without preprocessing the string, a linear scan is the only way to retrieve information from the document; obviously, the search time is very high when retrieving information from a large number of documents. To retrieve information efficiently from a large number of documents, each document must be decomposed into terms (or phrases), which are filtered against stopwords and stemmed before indexing. That is, a document is represented as a set of terms. To rapidly retrieve related documents by query terms, all the terms in a document are indexed as term identifiers (TIDs), which correspond to document identifiers (DIDs). This is the widely used inverted index.

In the indexing phase, the representation of a document, called the retrieval model, is determined. The subsequent retrieval and ranking processes also depend on the retrieval model. One of the most popular models is the vector space model, which describes both the query and the document as vectors in the term space. A measure of the similarity between the two vectors is computed to indicate the relevance between documents and the query.

Document Retrieval

Since a query is a short string, compared with the long string of a document, similar processes are applied to decompose the query into terms. Based on matching criteria provided by the IR system, documents related to a term or phrase are retrieved for scoring and ranking. The most obvious criterion is exact match, which, because of its efficiency, is widely used to retrieve document text. In some cases where the text is not well adapted to exact matching, approximate matching is utilized to match texts with similar meaning. Combining both matching criteria, retrieval systems may respond to a query with an approximate match, then limit the query with exact matching.

A user often thinks of needed information in one of two forms: a question, or a list of terms. The first form is related to natural language processing (NLP) and is outside the scope of this dissertation. The second, a list of terms, is simple but ambiguous. One standard model for interpreting the list of terms is as a Boolean query, which is based on concepts of logic, or Boolean algebra; each term of the list is concatenated by AND, OR, or NOT. In a query based on the vector space model, the list of terms is a vector composed of terms and weights.
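To make these indexing and retrieval steps concrete, the following minimal sketch builds an inverted index that maps terms to document identifiers (DIDs) and answers a conjunctive Boolean query by intersecting posting lists. The corpus, tokenizer, and stoplist here are illustrative placeholders, not the components of any particular system.

```python
from collections import defaultdict

# Toy corpus: document identifier (DID) -> text.
docs = {
    1: "information retrieval systems index documents",
    2: "search engines retrieve internet documents",
    3: "document indexing enables fast retrieval",
}

stoplist = {"the", "a", "an", "of", "and"}

def terms(text):
    # Word breaking plus stoplist filtering; stemming is omitted here.
    return [w for w in text.lower().split() if w not in stoplist]

# Inverted index: term -> set of DIDs containing the term.
index = defaultdict(set)
for did, text in docs.items():
    for t in terms(text):
        index[t].add(did)

def boolean_and(query):
    # Intersect the posting lists of all query terms.
    postings = [index.get(t, set()) for t in terms(query)]
    return sorted(set.intersection(*postings)) if postings else []

print(boolean_and("retrieval documents"))  # -> [1]
```

Note that without stemming, document 3 ("document", singular) does not match the query term "documents"; this is exactly the recall problem that stemming, discussed in the next chapter, is meant to address.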

Document Ranking

After retrieving related documents for a query, some measures are computed so as to present the most likely desired documents to the user. Since a Boolean query retrieves documents that satisfy the query logic without any measure of term weight, the ranking policy can be to sort results by several attributes, such as document identifier, date, size, length, etc. For vector queries, the most popular ranking method is computing the cosine between the query and the document.

Conventional IR systems are efficient in handling a large collection of documents. However, documents collected from the Internet are extremely numerous and diverse. In such a situation, for a query with one or two terms submitted to a search engine (the Internet IR system) using similarity-based algorithms, the search engine usually retrieves thousands of documents. Ranking a large number of documents using only a few terms is not likely to produce a set of documents meeting the user's preference. Consequently, the user must navigate through many uninteresting documents before obtaining the desired information. Several search engines have applied relevance feedback [76] to expand or refine the query based on documents selected by the user. However, relevance feedback based on selected documents may not be effective, since it is difficult to grasp the user's intention from the feedback.

1.1.2. Internet Information Management

The conceptual gap between document developers and document users widens the difference between retrieval results and user expectations. Due to the richness of language and culture on the Internet, document developers and users may use different terms to represent the same concept, or use the same term to describe different meanings. Therefore, documents retrieved by search engines often do not reflect users' desires. As a result, search engines generally retrieve thousands of documents when only a few are desired; conversely, desired documents may not be retrieved at all. For instance, search engines do not match the term "airline schedule" in documents against the query term "flight schedule", even though both terms are considered to have the same meaning. Consulting a thesaurus may resolve the latter problem, but a thesaurus still cannot resolve the problem that a term may have different meanings in different contexts, such as the term "bank". A solution for this problem is to build different thesauri for different domains. Furthermore, due to the diversity and dynamic nature of the Internet environment, no static thesaurus can handle the shifting semantics of terms found in Internet documents. How does one construct different thesauri on the Internet? Research is still ongoing.

Ideally, search engines should be able to retrieve relevant documents effectively and efficiently, and present the documents in accordance with the user's expectations. For efficient retrieval, the search engine should shrink the document search space to reduce both the search time and the size of the retrieved result. In addition, as a side benefit, the reduction of the search space allows the indices, or their critical parts, to be stored in physical memory, reducing retrieval time.

Assigning classes to documents is essential to the efficient management and retrieval of knowledge [40]. It also provides a framework for structured query processing and presentation. Directory service is a general case of managing Internet information. Based on the hierarchical classification of the directory service, the concept of a class forms a specific domain. If we can construct a thesaurus for the class domain, queries in this class can be greatly refined. The largest directory service, Yahoo!, has more than one hundred editors assigned to register web links in the directory. However, in comparison with a full-text search engine like AltaVista, the set of manually categorized links of Yahoo! is tiny. That is, the manual construction of a directory cannot match the growth of the Internet. Hence, many studies on machine learning (or data mining) for text classification have been proposed to automatically construct a directory service.

1.2. Using Machine Learning to Retrieve and Manage Information

As noted previously, conventional techniques of information retrieval (search service) and management (directory service) cannot cope with the explosion of information on the Internet, even though the two services address different requirements of Internet users. The major problems are:

- How to automatically construct the directory used by the directory service?
- How to effectively shrink the search domain based on the directory while searching?

Up to now, no automatic text classification system has been good enough to replace the manual categorization process. Research on text classification involves understanding the text, finding classification algorithms for text classification, and building the class hierarchy (or lattice). In this dissertation, we propose a system, ACIRD (Automatic Classifier for Internet Resource Discovery) [53, 55, 56], to facilitate Internet document organization and information discovery. The system is capable of learning classification knowledge from already classified documents. It also mines the rules of association among terms to explore implicit term semantics, and uses term

associations to refine the knowledge of classes in a class lattice. To assist the user in discovering Internet information, the system implements a two-phase search mechanism that presents a hierarchical search result. Based on the discovered classification knowledge and the given class lattice, a two-phase search can effectively shrink the search domain, and it can be integrated with full-text search techniques.

The remainder of this dissertation explores information retrieval research, the Internet search engine, information management, document classification, and machine learning. The remaining chapters present an overview and implementation of our system for information retrieval and management. Finally, we summarize the core technical contributions of this dissertation and point out future research.

2. Information Retrieval

As computers became readily available, they became candidates for the storage and retrieval of text. Early Database Management Systems (DBMS) provided an ideal platform for storing data and for easily retrieving information. Libraries followed the paradigm of their catalogs and references to migrate their hardcopy information into structured databases. These requirements played a major role in the available Information Retrieval Systems (IRS). Academic IR research was long constrained by the lack of computing power to handle gigabyte-sized text databases. In the 1990s, the information published on the Internet and the growth of computing power reactivated research on IRS. The growth of the Internet and the availability of enormous volumes of digitized data have attracted researchers to assist the user in locating data of interest. The Digital Library effort is also progressing, with the goal of migrating from the traditional book environment to a digital library environment based on the Internet.

In the world of the Internet, an IRS is a system that is capable of storing, retrieving, and managing large amounts of information. It consists of software programs that aid a user in finding the information he or she needs. The system may use standard computer hardware or specialized hardware to support these functions. A major objective of an IRS is to minimize the human resources required to find needed information. Thus, a measure of information systems is how well they can minimize the overhead for a user to find his or her needed information. Overhead, from a user's perspective, is the time required to find information, excluding the time spent actually reading the relevant data; time spent searching for and reading non-relevant items is all part of the user's overhead.

In this chapter, we first introduce the implementation of IRS; the evaluation of IRS is also described as criteria for implementations. Extending IRS to the Internet search engine is illustrated in the following chapter.

2.1. Information Retrieval Systems

Most IR systems are implemented to facilitate rapid retrieval of documents for diverse users based on the term-based indexing approach. Terms are extracted from a document, then pruned, stored, and indexed in a database by applying various indexing algorithms [24, 72, 78]. When the user issues a query, the query is processed into a sequence of terms, which are then matched against the terms already indexed for documents, based on the TFxIDF algorithm or similar ones [80]. In order to indicate the degree of relevance of the documents to the query, the retrieved documents are presented as a ranked list.

Another approach regards the features of a document as a set of strings and sub-strings instead of terms. It is particularly useful for applications of string matching (e.g., address matching) and for searching character-based languages (e.g., Oriental languages such as Chinese and Japanese) that require searching for arbitrary-length strings. Compared with term-based indexing, the storage requirement of the string-based indexing approach is much higher; in addition, its complicated data structures take more time for retrieval. Some string-based indexing technologies, such as the PAT tree [16] or signature files [19], have been proposed for improving the performance of various search functions, such as prefix searching, proximity searching, range searching, longest repetition searching, most significant and most frequent searching, and regular expression searching [24]. However, these search functions are rarely used in general IR systems. In this dissertation, we focus on term-based indexing approaches.

Figure 2-1. The processes of IRS (documents are broken into words, filtered against a stoplist, optionally stemmed and term-weighted, and stored with document identifiers in the database; queries are parsed into query terms, matched via Boolean operations against the indexed documents, and the retrieved document set is optionally ranked, with relevance judgments fed back through the query interface).

Figure 2-1 shows the processes of a term-based IRS. We briefly partition the IRS into four processes: term analysis, document representation and indexing, retrieving (matching) and ranking, and user feedback. In the following sections, we briefly present these processes.

2.1.1. Term Analysis and Weighting

To index raw documents for efficient retrieval, each document is assigned an identity and processed into words by several steps such as word breaking, stoplist filtering, stemming, term weighting, etc. Documents are represented as vectors of binary or numeric features with weights derived from the words in the documents.

Applying Stoplists

Stoplists (or negative dictionaries) are commonly found in systems and consist of words whose frequency of occurrence makes them of no value as searchable tokens. For example, a word found in almost every document, e.g., "the", "a", "an", etc., would have no discrimination value during search. Such words make up a large fraction of the text of most documents: the ten most frequently occurring words in English typically account for 20 to 30 percent of the tokens in a document [25]. Eliminating such words provides a huge saving of index space and improves the effectiveness of retrieval.

Intuitively, stoplists are supposed to include the most frequently occurring words that are of no value for indexing. However, some stop words are important as index terms: for example, the 200 most frequently occurring words in English literature include "time", "war", "home", "life", "water", and "world". On the other hand, specialized databases should also exclude words that are normally infrequent but very common in the specialized field. For example, a computer literature database probably need not index terms like "computer", "program", "source", "machine", and "language".

Numerous studies show that if the words in a document are ranked in order of decreasing frequency, they follow a relationship known as Zipf's law [88], which says that the product of word frequency and rank is close to constant:

rank x frequency ~ constant

If the law held strictly, we could make two observations. First, high-frequency words are not desirable because they produce too many results for such queries. Second, rare words are not desirable because of their inability to retrieve documents. Thus, a system can filter out useless words based on a high threshold for stop words and a low threshold for rare words.
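As a minimal sketch of this frequency-based filtering, the following function keeps only terms between a low absolute-count cutoff for rare words and a high relative-frequency cutoff for stop words; both threshold values are arbitrary placeholders, not recommendations.

```python
from collections import Counter

def useful_terms(token_lists, high=0.1, low=2):
    """token_lists: one list of tokens per document in the collection."""
    # Collection-wide term frequencies.
    freq = Counter(t for tokens in token_lists for t in tokens)
    total = sum(freq.values())
    # Drop terms above a high relative-frequency threshold (stop words)
    # and below a low absolute-count threshold (rare words).
    return {t for t, n in freq.items()
            if n >= low and n / total <= high}
```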

21 desirable beause of their inability to retrieve douments. Thus, a system an filter out useless words based on a high threshold for stop words and a low threshold for rare words Stemming The goal of stemming is to improve performane and redue the need for storage by reduing the number of unique words. It maps multiple representations of a word as a single stemmed term to provide signifiant ompression, with assoiated saving in storage and proessing. Another major use of stemming is to improve reall. Through the generalization of a set of words to a onsistent stem, potentially relevant douments an be retrieved. However, the preision measure in based on minimizing the retrieval of non-relevant douments. Stemming may redue the retrieval preision beause of its effet of inreasing reall. The trade-off between reall and preision should be onsidered when stemming is involved. The most ommon stemming algorithm removes suffixes and prefixes to derive the final stem. The Porter algorithm [66] is based on a set of onditions of the stem, suffix and prefix and assoiated ations given the ondition. Other stemming tehniques are based on table lookup or ditionaries. The Kstem [47] algorithm used in the INQUERY system ombines a set of simple stemming rules with a ditionary to determine proessing tokens. It tries to avoid generalizing words with different meanings into the same root. For example, memorial and memorize redue to memory. But memorial and memorize are not synonyms and have different meanings. Frakes summarized studies of various stemming studies [24] Term Weighting After breaking words, removing stop words, and stemming, the remaining words are terms. Terms indiate the attribute of a doument. The attribute s value is orresponding to the term weight estimated by the weighting approah. There are two approahes to assigning a value to a term, binary (non-weighted) and weighted. For the binary attribute, a doument ontains a term with a value of one (true) or zero (false). As for the weight approah, the term weight is usually normalized to the range [0, 1]. There are several weighting approahes. Term Frequeny (TF) In a statistial system, it is trivial to alulate the frequeny of ourrene of a term within a doument; this is alled the term frequeny (TF). The simplest approah enounters problems of 11

Term Weighting

After word breaking, stop word removal, and stemming, the remaining words are terms. Terms indicate the attributes of a document, and an attribute's value corresponds to the term weight estimated by the weighting approach. There are two approaches to assigning a value to a term: binary (non-weighted) and weighted. For a binary attribute, a document contains a term with a value of one (true) or zero (false). For the weighted approach, the term weight is usually normalized to the range [0, 1]. There are several weighting approaches.

Term Frequency (TF)

In a statistical system, it is trivial to calculate the frequency of occurrence of a term within a document; this is called the term frequency (TF). The simplest approach encounters problems of normalization: the longer a document is, the more often a term may occur within it. Thus, the weight should be normalized before it is assigned to the term. The normalized TF is defined as:

$TF_{i,j} = \dfrac{n_{i,j}}{n_{\max,j}}$  (EQ 2.1)

where $n_{i,j}$ is the number of occurrences of term $t_i$ in document $d_j$, and $n_{\max,j}$ is the maximal number of occurrences of any term in $d_j$.

Inverted Document Frequency (IDF)

To improve on TF, IDF assigns the term weight based on the distribution of a term within the document collection. There are several commonly used IDF measures [24]:

$IDF_i = \log\dfrac{N}{n_i} + 1$,  $IDF_i = \log\dfrac{TF_{\max}}{n_i} + 1$,  $IDF_i = \log\dfrac{N - n_i}{n_i}$  (EQ 2.2)

where $N$ is the number of documents in the collection, $n_i$ is the total number of occurrences of $t_i$ in the collection, and $TF_{\max}$ is the maximum frequency of any term in the collection.

TFxIDF

Combining the within-document frequency (TF) with the IDF weight often provides even more improvement; this is called TFxIDF term weighting [74, 80]. The term weights for the document and the query are as follows:

$w_{i,j} = TF_{i,j} \times IDF_i$,  $w_{i,q} = \left(0.5 + 0.5\,\dfrac{TF_{i,q}}{TF_{\max,q}}\right) \times IDF_i$  (EQ 2.3)

where $TF_{i,q}$ is the frequency of term $i$ in query $q$, and $TF_{\max,q}$ is the maximum frequency of any term in the query.

In [75], Salton and Buckley suggest reducing the query weight $w_{i,q}$ to only the within-query term frequency $TF_{i,q}$ for long queries containing multiple occurrences of terms.
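Tying the three equations together, the following sketch computes the normalized TF of EQ 2.1 and the first IDF variant of EQ 2.2, then combines them per EQ 2.3. It assumes a natural-log base and the definition of $n_i$ given above (total collection occurrences); the function names are mine, not from any IR library.

```python
import math
from collections import Counter

def tfidf_weights(doc_tokens):
    """doc_tokens: one list of tokens per document (DID = list position)."""
    N = len(doc_tokens)
    # n_i: total occurrences of t_i in the collection, per the text above.
    coll = Counter(t for tokens in doc_tokens for t in tokens)
    weights = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        n_max = max(tf.values())  # n_max,j of EQ 2.1
        weights.append({
            t: (n / n_max) * (math.log(N / coll[t]) + 1)  # EQ 2.1 times EQ 2.2
            for t, n in tf.items()
        })
    return weights  # weights[j][t] is w_t,j of EQ 2.3 (document side)

def query_weight(tf_iq, tf_maxq, idf_i):
    # EQ 2.3, query side: augmented term frequency scaled by IDF.
    return (0.5 + 0.5 * tf_iq / tf_maxq) * idf_i
```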

It is also feasible to add extra weight to terms according to the document structure, such as higher weights for terms appearing in the title or abstract of the document.

2.2. Document Representation and Index

The elementary data of an IRS are the documents that are stored and retrieved, and the queries that describe information needs. The major tasks of an IRS are to interpret the information need and to identify documents that are likely to satisfy that need. These tasks are affected by the data structures used in the IRS. In this dissertation, a document is limited to text data.

There are several data structures for interpreting a document. One regards a document as a vector of terms with weights that are stored in databases. A second considers a document to be a sequence of strings stored in a file. The most common example is to represent a document as a set of features and to store the weighted terms in an inverted file [20]. A document representation model, such as the Boolean model, the vector space model (VSM) [24], or the probabilistic model [70], is then applied to represent the document. Another model might describe a document as attributes with semantics or meanings. For example, a document markup language, such as HTML [9] or SGML, is used to mark up web documents; these languages define tags, such as title, anchor, or headings, to describe the semantics of document contents. Moreover, natural language representations can present documents in a form understandable to users without special training. However, such representations are generally employed to address specific problems and are infeasible for application to IRS. In this dissertation, we discuss models that present the document as a set of features.

Boolean Model

The Boolean model is a simple representation model based on set theory and Boolean algebra. It assumes that terms are either present or absent in a document; that is, term weights are assumed to be binary (0, 1). The query is composed of terms concatenated by NOT, AND, and OR. The main advantages of the Boolean model are its clean formalism and simplicity. However, the binary prediction of relevant or non-relevant may lead to retrieving too few or too many documents. The vector space model, described next, improves upon these drawbacks.

Vector Space Model (VSM)

VSM recognizes that the use of binary weights in the Boolean model is too limiting, and proposes a framework in which partial matching is possible. This is accomplished by assigning non-binary weights to index terms of documents and queries [5]. The idea of VSM is that the meaning of a document is conveyed by its words: if one can represent the words in the document by a vector, it is possible to compare documents with queries to determine their similarity. The definition of VSM is given below. Each term $t_i$ of a document $d_j$ is associated with a non-binary weight $w_{i,j}$. In the N-dimensional vector space, the document vector is $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{N,j})$. In the same way, each term $t_i$ of a query $q$ is associated with a non-binary weight $w_{i,q}$, and the query vector is $q = (w_{1,q}, w_{2,q}, \ldots, w_{N,q})$.

Extended Boolean Model

Boolean retrieval is simple and elegant. However, since there is no provision for term weighting, no ranking of the answer set is generated; as a result, the size of the output might be too large or too small [5]. Because of these problems, modern IR systems are no longer based on the pure Boolean model. In fact, most new systems adopt VSM, since it is simple, fast, and has better performance. One alternative approach is to extend the Boolean model with the functionality of partial matching and term weighting. The Extended Boolean model [77] is based on the combination of Boolean query formulations and characteristics of VSM. There are several models for retrieving ranked documents from Boolean queries; we introduce one such model in the next section.

2.3. Retrieving and Ranking Algorithms

A query is much shorter than a document, and an individual term is likely to occur only once or twice in a query. Consequently, term-frequency statistics are meaningless within a query. The user must represent his information need in a form interpretable by the IR system, such as a Boolean expression over keywords, a natural language description, an example for query-by-example, etc. The IRS will process the query, elicit important words or features from it, and translate them into the system's representative form, such as vectors of keywords.

Given a string or a set of terms, the query is first processed by the same term analysis method. The IRS then retrieves related documents according to the query and computes the degree of relevance for the retrieved documents. By modeling documents and queries as vectors of words, the relevance between the query and a document is determined according to the cosine value between the two vectors. The following sections introduce several retrieving and ranking algorithms.

Boolean Retrieval

Boolean retrieval decides that each document is either relevant or non-relevant. Queries specified as Boolean expressions have precise semantics. Boolean retrieval was widely used in many of the early commercial bibliographic systems because of its simplicity as well as its formalism. Unfortunately, the Boolean model suffers from this very simplicity, in that it is a binary decision criterion without any notion of gradation: a document is determined to be either relevant or non-relevant. This makes it especially difficult to apply to Internet IR.

Vector Space Model (VSM)

There are several similarity measures for comparing a query vector with a document vector. The most common of these is the cosine of the angle between the query and document vectors:

$SC(q, d_j) = \dfrac{q \cdot d_j}{|q|\,|d_j|} = \dfrac{\sum_{i=1}^{N} w_{i,q}\, w_{i,j}}{\sqrt{\sum_{i=1}^{N} (w_{i,q})^2}\, \sqrt{\sum_{i=1}^{N} (w_{i,j})^2}}$  (EQ 2.4)

VSM has the following advantages for retrieval:
- Term weighting improves the effectiveness of retrieval.
- The partial matching strategy allows the query to be approximated.
- Document ranking is enhanced by the degree of similarity between the query and documents.

In conclusion, VSM is simple, fast, and yields good retrieval performance.
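A direct transcription of EQ 2.4 into code, plus a small ranking helper. Representing vectors as sparse dicts from term to weight is an implementation choice of this sketch, not anything prescribed by the model.

```python
import math

def cosine(q, d):
    """q, d: sparse vectors as dicts mapping term -> weight (EQ 2.4)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

def rank(query_vec, doc_vecs):
    # Present the retrieved documents as a ranked list, best match first.
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [did for did, _ in scored]
```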

Extended Boolean Model

Many retrieval models have been proposed as alternatives to the Boolean model, such as Mixed Min and Max (MMM), Paice, and P-norm [24]. We introduce the simplest one, MMM, in this section.

MMM Model

This model is based on the concept of fuzzy sets introduced by Zadeh [87] and was proposed by Fox and Sharat [23]. The weight of a document with respect to an index term A is considered to be the degree of membership of the document in the fuzzy set associated with A. The degrees of membership for intersection and union are defined as follows in fuzzy set theory:

$d_{A \cap B} = \min(d_A, d_B)$,  $d_{A \cup B} = \max(d_A, d_B)$  (EQ 2.5)

According to fuzzy set theory, the query A OR B is associated with the union of the two sets, $\max(d_A, d_B)$, and A AND B corresponds to the intersection of the two sets, $\min(d_A, d_B)$. The MMM model tries to soften the Boolean operators by taking a linear combination of the max and min measures.

Given a document d with the vector $(d_1, d_2, \ldots, d_N)$ for terms $t_1, t_2, \ldots, t_N$, and the queries $Q_{OR} = (t_1\ \mathrm{OR}\ t_2\ \mathrm{OR} \cdots \mathrm{OR}\ t_N)$ and $Q_{AND} = (t_1\ \mathrm{AND}\ t_2\ \mathrm{AND} \cdots \mathrm{AND}\ t_N)$, the similarity between the query and the document is:

$SC(Q_{OR}, d) = C_{OR1} \max(d_1, d_2, \ldots, d_N) + C_{OR2} \min(d_1, d_2, \ldots, d_N)$
$SC(Q_{AND}, d) = C_{AND1} \min(d_1, d_2, \ldots, d_N) + C_{AND2} \max(d_1, d_2, \ldots, d_N)$  (EQ 2.6)

$C_{OR1}$ and $C_{OR2}$ are softness coefficients for the OR operator, and $C_{AND1}$ and $C_{AND2}$ for AND. Generally, we have $C_{OR1} > C_{OR2}$ and $C_{AND1} > C_{AND2}$, since we would like to give the maximum of the document weights more importance when considering an OR query and the minimum more importance when considering an AND query. It is generally assumed that $C_{OR2} = 1 - C_{OR1}$ and $C_{AND2} = 1 - C_{AND1}$ for simplicity.
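A minimal sketch of EQ 2.6 under the usual simplification $C_{OR2} = 1 - C_{OR1}$ and $C_{AND2} = 1 - C_{AND1}$; the softness coefficient 0.8 is an arbitrary example value, not one recommended by the model's authors.

```python
def mmm_or(doc_weights, c_or1=0.8):
    # SC(Q_OR, d) of EQ 2.6: the maximum weight dominates for OR queries.
    return c_or1 * max(doc_weights) + (1 - c_or1) * min(doc_weights)

def mmm_and(doc_weights, c_and1=0.8):
    # SC(Q_AND, d) of EQ 2.6: the minimum weight dominates for AND queries.
    return c_and1 * min(doc_weights) + (1 - c_and1) * max(doc_weights)

# Example: a document's weights for query terms t1, t2, t3.
d = [0.9, 0.4, 0.1]
print(mmm_or(d), mmm_and(d))  # -> 0.74 0.26
```

With $C_{OR1} = C_{AND1} = 1$ the model degenerates to the pure fuzzy-set operators of EQ 2.5, so the coefficients literally interpolate between strict fuzzy logic and a softer blend of the two extremes.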

Other Retrieval Models

There are many variants of the Boolean model and VSM. We briefly describe them here; details are given in [30].

Probabilistic Retrieval [26]: A probability based on the likelihood that a term will appear in a relevant document is computed for each term in the collection. For terms that match between a query and a document, the similarity measure is computed as the combination of the probabilities of each of the matching terms.

Inference Networks: A Bayesian network is used to infer the relevance of a document to a query. This is based on the evidence in a document that allows an inference to be made about its relevance; the strength of this inference is used as the similarity coefficient.

Neural Network: A sequence of neurons, or nodes in a network, fire when activated by a query, triggering links to documents. The strength of each link in the network is transmitted to the document and collected to form a similarity coefficient between the query and the document. Networks are trained by adjusting the weights on links in response to predetermined relevant and irrelevant documents.

Fuzzy Set Retrieval: A document is mapped to a fuzzy set (a set that contains not only the elements but also, for each element, a number indicating the strength of membership). Boolean queries are mapped into fuzzy set intersection, union, and complement operations, resulting in a strength of membership associated with each document that is relevant to the query. The strength is used as the similarity coefficient.

2.4. User Relevance Feedback

Due to imperfect representation of document contents and user information requirements, the user is rarely satisfied with the initial retrieval. Many IR systems interact with users to modify their queries in order to achieve acceptable results; this is called query reformulation. The most popular query reformulation strategy is relevance feedback. The user is presented with a list of retrieved documents and, after examination, marks those that are relevant. In practice, only the top 10 or 20 ranked documents need to be examined. The query reformulation program then selects important terms (or expressions) from the marked documents and enhances the importance of these terms in a new query formulation. The interaction can be done explicitly by the user; alternatively, the IRS can judge the desirability of each retrieved document by monitoring the user's behavior and then automatically update the query.

Two processes, query expansion and term reweighting, are used to reformulate queries. In the vector space model, query expansion borrows new terms from important non-query terms contained in relevant documents, or from a thesaurus; this enhances the ability to find relevant documents by searching for similar terms. Term reweighting modifies term weights based on the importance of terms in relevant documents. The retrieved documents are partitioned into relevant and non-relevant groups according to the user's feedback. Terms from the relevant documents are used to increase the weight of the same terms in the query, while terms of the non-relevant set are used to decrease term weights. The result of this reformulation is that the query and the relevant documents become closer in vector space. The SMART system [73] has shown good improvements in precision for small test collections when relevance feedback is used.
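As a concrete illustration of query expansion and term reweighting combined, here is a sketch of the classic Rocchio reformulation, which originated with the SMART line of systems; the coefficient values are conventional placeholders, and the sparse-dict vector representation is an assumption of this sketch, not something taken from this dissertation.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """All vectors are dicts mapping term -> weight."""
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms |= set(doc)  # non-query terms enter here: query expansion
    new_query = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:       # pull the query toward the relevant centroid
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:    # push it away from the non-relevant centroid
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        if w > 0:
            new_query[t] = w  # negative weights are conventionally dropped
    return new_query
```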

2.5. Evaluation of IR Systems

Interest in evaluation techniques for IRS has increased significantly with the commercial use of IR technologies in the daily life of Internet users. Evaluations used to focus on the effectiveness of search algorithms and were done primarily by using a few small, but well-known, corpora of test documents or even smaller test databases. The creation of the annual Text Retrieval Evaluation Conference (TREC), sponsored by the Defense Advanced Research Projects Agency (DARPA) and the National Institute of Standards and Technology (NIST), changed the standard process of evaluating information systems. The conference provides academic researchers and commercial companies with a standard database, consisting of gigabytes of test data, search statements, and the expected results from the searches, for use in testing their systems.

Retrieval effectiveness

A successful IRS must be effective in returning documents in response to the user's information needs. However, such a vague statement of effectiveness is hard to measure, and several measures have been proposed in past studies. The most popular measures commonly associated with IRS are precision and recall. When a user initiates a search for information, the database is logically divided into four segments, as shown in Figure 2-2. Precision and recall are defined below [48].

Precision = Relevant Retrieved / Total Retrieved
Recall = Relevant Retrieved / Total Relevant

Figure 2-2. Effects of search on total document space (the four segments: relevant retrieved, non-relevant retrieved, relevant not retrieved, and non-relevant not retrieved).

Given a query, precision is the ratio of retrieved relevant documents to total retrieved documents; recall is the ratio of retrieved relevant documents to the total number of relevant documents in the database. It is possible to calculate precision values associated with queries, assuming that the user provides relevance judgments. However, in an operational system it is unrealistic to calculate recall, because there is no reasonable method to determine the totality of relevant documents; the assumption that all relevant data in the database are known is possible only with very small databases. To test a search against a large database, two approaches have been suggested. The first is to use a sampling technique across the database, performing relevance judgments on the returned documents; this forms the basis for an estimate of the total relevant information in the database [27]. The other approach is to apply different search strategies to the same database for the same query; an assumption is then made that all relevant documents in the database will be found in the aggregate of all the searches [42]. The latter technique is what is applied in the TREC experiments. In this controlled environment, it is possible to create precision/recall graphs by reviewing the retrieved results in ranked order and recording the changes in precision and recall as each retrieved document is judged. A realistic graph is shown in Figure 2-3.

Figure 2-3. Achievable precision/recall graph.
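A small sketch computing both measures for a single query, given a ranked list of retrieved document identifiers and an (assumed known) set of relevant ones; the identifiers are illustrative.

```python
def precision_recall(retrieved, relevant):
    """retrieved: ranked list of DIDs; relevant: set of relevant DIDs."""
    hits = [d for d in retrieved if d in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved documents are relevant; 6 relevant documents exist.
print(precision_recall([1, 2, 3, 4], {1, 2, 3, 7, 8, 9}))  # -> (0.75, 0.5)
```

Evaluating this function after each judged document in a ranked list yields exactly the precision/recall curve of Figure 2-3.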

Retrieval efficiency

Response time is a metric frequently collected to determine the efficiency of retrieval. It is defined as the time to execute the query. The beginning is always measured from when the user submits the query; however, differing definitions of the end time can cause ambiguity in the response time. The end time can be determined either from the user's view or from the system's view. From a user's perspective, a search could be complete when the first result is available for the user to review. From a system perspective, system resources are in use until the system has determined all matches. To guarantee the quality of measurement, the system perspective is usually adopted to estimate response time, since the system should be able to serve many users simultaneously with tolerable response time. For example, if the user's tolerable response time is one second, the search efficiency can be evaluated based on the number of successful user queries per second.

Storage efficiency

The measure of storage efficiency is the ratio of the size of the index to the size of the document set:

Index Ratio = size(index) / size(documents)

Many indexes can be equal in size to the original document set. Hence, the storage requirement can be double that needed for storing the original documents.

3. Internet Search Engine

The development of the Internet presents a significant challenge to Information Retrieval systems. Over the span of several days, millions of published documents and thousands of new web sites become available on the Internet. Search engines store inverted indexes on the order of tens to hundreds of gigabytes and serve millions of queries per day. In this vast sea of information, the challenge is to retrieve relevant documents in response to almost any query on the Internet; it is a demanding task.

There are hundreds of Internet information retrieval systems (search engines) available on the Internet; AltaVista, HotBot, Lycos, and Yahoo! are just the better known. These engines are popular because of their rich indexes and fast response. However, most search engines do not suffer from producing no matches, but rather from producing too many. In this chapter, we present studies of search engines to point out the challenges they face.

3.1. Challenges of Search Engines

To understand the challenges facing Internet search engines, characterizing the Internet, and in particular the Web, is the first step. As of February 2000, there are over one billion indexable pages, and the total number of pages is probably much greater. The most popular format for Web documents is HTML, followed by GIF and JPG, ASCII text, and PostScript, in that order [5]. HTML is an instance of SGML, but most HTML pages do not comply with all of the HTML specifications. Typically there are several links in a page, most of them local (that is, within the same Web server hierarchy). Considering these links, the Internet pages form a large document network.

We can explore the challenges of search engines along two facets: the document set (the index system) and the user (the retrieval system).

3.1.1. The Internet Document

Several characteristics of Internet documents make them hard to index and retrieve.

The size of the document set

The exponential growth of the Internet poses the most difficult issue: scalability. This factor influences both the database size and the retrieval performance of search engines.

Unstructured and redundant documents

An HTML page is not well structured; it is a semi-structured document. Without obvious structure, search engines have difficulty identifying the major content of a page. For example, there are from one to several advertisement banners in dot-com pages; the code block for ad banners is indexed by search engines even though this information is really noise. Many web developers design multiple pages of the same site using duplicate TOC (table of contents) links; such information is semantically redundant and hard to detect. Moreover, many pages are either repeated or very similar due to copies, mirrors, or default pages on web servers. Approximately 30% of pages are duplicates [13]; semantic redundancy is even larger.

Quality of documents

HTML and the Web can be considered the most popular publishing medium. However, there is no serious or standardized publishing process, so web pages can contain invalid formatting, contain bad links, or be poorly written with public HTML editors.

Different languages of document content

The Internet connects more than 200 countries with many different languages. Most languages are English-like with small alphabets, but some, such as Chinese and Japanese Kanji, have very large character sets. Most search engines just index pages in their local language. Unfortunately, many pages are written without language information in the META tag CHARSET, making it difficult to identify the language of a page.

The boundary of the Internet documents

No search engine can crawl and index all pages on the Internet. How to define a boundary on the Internet is also difficult.

3.1.2. The Internet User

From the perspective of users, two problems are important: the query and the result. Users usually specify their queries in one or two words, while the result might be thousands of pages. How does one select the pages that are really interesting? Some search engines provide user relevance feedback; however, only relatively few users are patient enough to interact with search engines to refine their queries.

3.2. The Implementation of Search Engines

According to the above descriptions, the main difference between a conventional IRS and a search engine relates to the nature of the Web: the large, unstructured, and unqualified document set in hundreds of languages. In this dissertation, we do not discuss the multilingual problems of the Web. With regard to the other factors, we review different implementations of search engines and IRS in terms of the following processes: the document collector (Web crawler), the index engine, and the query engine.

3.2.1. Web Crawler

A Web crawler is a program that recursively traverses Web pages by following links. It is also called a spider, robot, or wanderer. Crawlers embedding special domain knowledge are called agents or knowbots. Crawlers collect documents from the Web instead of from files or databases. There are several important functions of the crawler.

HTTP module

First, the crawler follows the specification of HTTP to crawl the Web. An HTTP module based on the network socket layer is needed, and the following functions should be implemented in this module.

HTTP HEAD method

To reduce the probability of network congestion, the crawler first verifies the modification status of a page by using the HEAD method. If the page was modified, HTTP GET is used to get the new content. However, some web servers are not fully HTTP-compliant and make no provision for the HEAD method; the module should detect this situation and use the GET method instead.

HTTP 1.1 support

More and more sites support only HTTP 1.1 in order to enhance server performance. Thus, the crawler should support not only HTTP 1.0 but also HTTP 1.1 [37].

Handling redirection

A URL referring to another, new page due to movement of the old page is called a redirection. There are two ways to specify the redirect information: one is using the META tag HTTP-EQUIV=REFRESH in the HTML file; the other is sending an HTTP redirection (302) through the HTTP server. Using this, the system can get the real content page. However, some pages are moved without notification through redirection information, and it is difficult for the crawler to detect this automatically.

Robot exclusion

Some Web sites do not like crawlers, since badly behaved crawlers visit a site with several threads based on a depth-first search approach. Obviously, the quality of service at the site degrades due to large and frequent requests from a bad crawler. If a site has a file named robots.txt, the crawler should follow the rules defined in the file while visiting the site. However, compliance is optional and depends on the crawler's manners; the manners of robots are described in [46].

Meta Robots Tag

This is a special meta tag that allows site owners to specify that a page should not be indexed, for example, <META NAME="ROBOTS" CONTENT="NOINDEX">. It is ideal for those who cannot create a robots.txt file.
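A rough sketch of these conventions using only the Python standard library: the crawler consults robots.txt and issues a HEAD request before committing to a full GET. The user-agent string and timeout are placeholders, the parsed robots.txt is not cached, and per-site politeness delays and HTTP 1.1 specifics are omitted.

```python
import urllib.error
import urllib.request
import urllib.robotparser
from urllib.parse import urljoin, urlparse

AGENT = "toy-crawler"  # hypothetical user-agent string

def allowed(url):
    # Honor the robot exclusion file at the site root, if one exists.
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(root, "/robots.txt"))
    try:
        rp.read()
    except OSError:
        return True  # robots.txt unreachable: treat crawling as permitted
    return rp.can_fetch(AGENT, url)

def fetch_if_modified(url, last_modified=None):
    """HEAD first; GET only when the page changed since the last crawl."""
    if not allowed(url):
        return None
    head = urllib.request.Request(url, method="HEAD",
                                  headers={"User-Agent": AGENT})
    try:
        reply = urllib.request.urlopen(head, timeout=10)
        lm = reply.headers.get("Last-Modified")
        if lm is not None and lm == last_modified:
            return None  # unchanged since the last crawl: skip the download
    except urllib.error.HTTPError:
        pass  # server has no provision for HEAD: fall back to GET
    get = urllib.request.Request(url, headers={"User-Agent": AGENT})
    # urlopen transparently follows HTTP 302 redirections.
    return urllib.request.urlopen(get, timeout=10).read()
```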

HTML Parser

Second, Web pages written in HTML must be parsed into text content and links, so an HTML parser is needed in the crawler. Based on HTML tags, the parser identifies the meaningful content of a page and the links to follow for navigation. The HTML parsers of some search engines add weights to the content according to HTML tags. As for the processing of links, the parser should determine the absolute URL according to the owning page's URL and the parsed relative URL.

Consistency Checker

Since there are so many redundant pages on the Web, the crawler should detect redundancy before pushing pages into the index engine. URL uniqueness and MD5 checksums of pages are often applied to guarantee consistency. Due to the dynamic nature of the Web, pages will sometimes be invalid (permission denied, not found, connection timeout, or server not responding). The crawler should write back the status code to avoid crawling, indexing, and searching those invalid pages.

Exclusive partition coordinator

To avoid the crawler visiting a site with many threads at the same time, there should be a coordinator to negotiate the data set of each crawler, for example, partitioning the Web pages and distributing them to crawlers so that no crawlers visit the same site within several seconds. Breadth-first navigation is a good approach to distributing the navigation across different sites.

3.2.2. Index Engine

Most indexes of search engines use variants of the inverted file: a list of sorted terms, each with a set of pointers to the pages where the term occurs. Some search engines use stopwords to reduce the size of the index. Other operations may include removing punctuation, reducing multiple spaces to a single space between terms, converting letters from upper case to lower case, etc. Some indexing techniques can reduce the size of an inverted file to about 30% of the size of the text; by applying compression techniques, the index size can then be reduced to 10% of the text [83].

To efficiently find the query position in a page (a service called "See Matched Lines"), some search engines modify the pointers of the inverted file to point to positions of terms instead of pages. Glimpse [58] reduces the index size by pointing to logical blocks rather than pages. Making all blocks the same size reduces the variation of page size. In this way, it not only reduces the size of pointers but also reduces the number of pointers, because terms have locality of reference.

Most search engines use a centralized crawl-index-query approach. To cope with the large scale of the Internet, the hardware must be very powerful. In 1998, the AltaVista system was running on 20 multi-processor machines, each of them having more than 130 GB of RAM and over 500 GB of disk space; the query engine alone used more than 75% of these resources. The biggest problem is the gathering of Web pages, because of the widely distributed nature of web pages, saturated communication links, and the load on the centralized server. In fact, this approach may not be able to cope with the growth of the Web.

There are several variants of the centralized approach; the most important is Harvest [10]. Harvest uses a distributed approach to gather and distribute data. The main drawback is that Harvest requires the coordination of several servers; without a centralized coordinator, distributed servers might crawl through the same pages, increasing the load on servers and networks. Another approach used by Harvest is topic-specific indexing, which tries to focus the index contents, thereby avoiding indexing too many terms.

3.2.3. Query Engine

There are two important aspects of the query engine: the representation and the result of the query.

Query Representation

Most search engines provide a simple query representation, a sequence of terms; however, the interpretation by each search engine is different. A user would expect that a given sequence of terms represents the same query in all search engines; it does not. For example, a sequence of terms refers to the union of all pages having at least one of the terms in AltaVista, while it refers to all pages having all the terms in HotBot. Some search engines remove stop words, some do stemming, and some are case sensitive. This makes for different results across search engines. Almost all search engines provide complex queries, such as Boolean operators, phrase search, proximity search, restricted search, and wild cards, to help expert users find their information needs accurately.

Query Result

All search engines present the search result as a ranked list of pages. As with searching, ranking (weighting) must be performed without accessing the page; only an index search is done. There is not much public information about the specific ranking algorithms used by current search engines, due to commercial privacy. It is also hard to compare differences among search engines because of their different query representations and crawling data sets. Hence, considering only the ranking (weighting) mechanism, there is little difference between search engines and IR systems.

Some of the newer ranking algorithms also use the link information of a page. For example, Google [12] adds a term weight derived from link text to the referred page. This is a significant difference between search engines and general IR systems. Moreover, some search engines keep track of the clicked references of pages in query results; the reference count indicates the popularity of a page and is applied to rank the query result. Some use the number of link references to a page as the page's ranking value, which is called the citation count. To interact with users, some search engines also provide relevance feedback to let users refine their search results. These criteria make search engines different from IR systems.

Meta Search Engines

A meta search engine is an integrated service of several search engines, since no single service fully covers all Web pages. It sends a given query to several search engines, directory engines, and other Web databases, collects the answers, and then unifies or summarizes the answers as a new result. The most famous meta search engines are MetaCrawler and SavvySearch. Their main advantage is the ability to combine the top results of many search engines from a single interface; thus, meta search is also called all-in-one search. It should be mentioned that meta search engines attract users who are dissatisfied with standalone search engines. This is especially true when articles appear that discuss how little of the Web each search engine covers, or how different the results can be from engine to engine.
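A minimal sketch of how a meta search engine might unify member results is given below: each member engine contributes a rank-based score, so pages retrieved by several engines float to the top. The scoring scheme and result lists are hypothetical; real services use their own undisclosed merging rules.

    def merge_results(result_lists):
        # Borda-style merge: a page earns more from a high rank in any engine,
        # and earns from every engine that returns it.
        scores = {}
        for results in result_lists:
            for rank, url in enumerate(results):
                scores[url] = scores.get(url, 0) + (len(results) - rank)
        return sorted(scores, key=scores.get, reverse=True)

    engine_a = ['u1', 'u2', 'u3']
    engine_b = ['u2', 'u4']
    print(merge_results([engine_a, engine_b]))   # 'u2' first: found by both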

Meta search engines provide several improvements:

A page retrieved by more than one search engine should contain more relevant information.

The result can be sorted by different attributes such as site, date, or relevance score, if the member search engines provide such information.

However, meta search engines could ultimately cause the member search engines they depend on to cut off the licensing agreements that most of the major meta search sites establish in order to avoid legal problems. For now, though, the major search engines generally state that meta search sites pose little burden, and even provide them with some beneficial exposure. Consequently, meta search engines currently produce a win-win situation. Slower response is another drawback of meta search engines, since the response time is at least that of the slowest member search engine.

Measures of Search Engines

It is hard to measure the quality and performance of search engines because of the unpredictable data set, network conditions, and users of the Internet. We summarize the main issues as guidelines.

Size

Figure 3-1. The size of each search engine's index. (INK=Inktomi, FAST=FAST, AV=AltaVista, NL=Northern Light, EX=Excite, GG=Google, Go=Go (Infoseek), LY=Lycos.)

The number of indexed pages is the main guideline by which a search engine is evaluated. The larger the index, the more likely the search engine is to have a comprehensive record of the Web. In Figure 3-1, SearchEngineWatch reports the index size of each search engine as of April 11, 2000. Intuitively, looking for unusual information in search engines with a large index makes sense, because those search engines cover more of the Web, so users have a greater chance of finding what they are looking for. However, for general queries about popular topics, a large index does not necessarily give better results, because too much data is returned.

Figure 3-2. Search engine sizes over time.

Considering the trend of each search engine's index, the problem of information explosion seems severe. For example, in Figure 3-2, the size of AltaVista increased about tenfold from December 1995 (fewer than 25 million pages) to March 2000 (250 million). When AltaVista appeared in December 1995, its index size was much larger than that of other search engines; competition then forced most search engines to keep increasing their sizes. Google is a special case. It has indexed 200 million pages, but because of the way Google uses link data, it can actually return listings for sites it has never visited. This gives it coverage of up to 350 million pages, hence the extended bar in the figure.

Figure 3-3. Search engine sizes (NEC Research Institute).

Researchers at the NEC Research Institute ran the same 575 queries on HotBot, AltaVista, Northern Light, Excite, Infoseek, and Lycos. They then counted the matching pages under a variety of constraints: duplicate pages weren't counted, the maximum query limits for each engine were not exceeded, and other controls were used to normalize across services. These experiments estimate the coverage rate based only on the cross-references among search engines. The comparison with the size reported by each search engine is shown in Figure 3-3. Some search engines seem to overestimate their index size. Even though all of them compete in size wars, not one covers even half of the Internet.

Based on the reports from SearchEngineWatch shown in Figures 3-4 and 3-5, the coverage rate of all engines has actually declined. That is, no engine can follow the growth of the Internet. According to a September 6, 1999 study by the NEC Research Institute, the Web at that time contained about 800 million pages, up from 320 million in December 1997. During this time, however, the total search engine coverage fell from 60 to 42 percent, with no single engine indexing more than 16 percent of the Web, according to the study.

Figure 3-4. Coverage of search engines (April 1998). (INK=Inktomi, FAST=FAST, AV=AltaVista, NL=Northern Light, EX=Excite, LY=Lycos, IS=InfoSeek, WC=WebCrawler.)

Figure 3-5. Coverage of search engines (May 1999). (INK=Inktomi, FAST=FAST, AV=AltaVista, NL=Northern Light, EX=Excite, LY=Lycos, IS=InfoSeek, WC=WebCrawler.)

As no search engine is able to cover all Internet pages, should we believe that bigger is better? If you are looking for specific information, for example ACIRD, it is indeed helpful to have a big index. In contrast, a large index is not necessarily helpful for general queries such as Internet, since retrieving more than ten thousand hits is almost worthless. Recalling the notions of recall and precision mentioned above, bigger means a higher recall rate, but precision always decreases due to the increased noise. In summary, none of today's search engines is perfect, but using the right one at the right time can make a substantial difference.

Freshness

Another important measure of a search engine is the freshness of its index. Freshness matters both because it saves people from wasting time and because it shows that the search engine reflects the current information available on the Web. Freshness can be estimated because it is inversely related to the percentage of bad links found in a search service. Such an evaluation is shown in Table 3-1. Due to the dynamic nature of the Internet, it is difficult to maintain freshness while at the same time maintaining an immense index; the results illustrate this side effect.

Table 3-1. The percentage of bad links found in each search engine.

Lycos    Excite   AltaVista   Infoseek   Northern Light   HotBot
1.6%     2.0%     2.5%        2.6%       5.0%             5.3%

The main factor determining freshness is the frequency with which the crawler revisits a page. If the crawler is capable of crawling pages frequently, then most retrieved pages are probably still available. However, freshness is not always achievable for a centralized search engine, since its crawlers cannot reach some sites; this problem probably originates from failure points in networks rather than from a lack of response from those sites. Thus, freshness tends to be better for search engines with distributed architectures. Unfortunately, we were unable to find a freshness report for Harvest (a distributed architecture) in the SearchEngineWatch reports.

Another freshness measure corresponds to the dynamic nature of the Web: the time difference between a page's last modified date and the crawler's last crawling date.
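This last measure can be sketched as follows: for each indexed page, take the gap between the crawler's last visit and the page's Last-Modified date, and average over the index. The dates are hypothetical.

    from datetime import date

    def average_staleness(pages):
        # Mean gap, in days, between the indexed copy and the live page.
        gaps = [(p['last_crawled'] - p['last_modified']).days for p in pages]
        return sum(gaps) / len(gaps)

    pages = [{'last_modified': date(2000, 3, 1), 'last_crawled': date(2000, 4, 1)},
             {'last_modified': date(2000, 1, 15), 'last_crawled': date(2000, 4, 1)}]
    print(average_staleness(pages))   # 54.0 days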

Features

The following features can be used as guidelines for measuring the quality of search engines. We categorize these features according to the components of search engines described in the previous sections.

Crawling

This subsection covers factors related to how well search engines crawl through web sites.

Deep Crawling. Search engines that perform deep navigation explore more pages from a web site, even if those pages are not explicitly submitted to the search engine. In general, the larger a search engine's index, the deeper it crawls.

Robot Exclusion. Whether the crawler honors the robots.txt file and the meta robots tag of a site.

Intelligent Crawling. Some search engines can learn how often pages change; pages that change often may be visited more frequently. This factor affects freshness.

Instant Indexing. At a search engine that does instant indexing, any page you submit will usually appear in the index within one or two days of submission, depending on the crawling period for newly submitted pages. This also affects the freshness of the index.

URL Status Check. Indicates the availability of a page.

HTML Extraction

Frame Support. If a search engine cannot follow frame links, some links will be missed, affecting the crawling depth and, as a result, the size of the search engine.

Image Maps. Some links are specified as a collection of hot spots for a client-side image map. This also affects the crawling depth.

Meta Refresh. Some webmasters create pages that automatically take visitors to different pages; the meta refresh tag is one typical way of doing this. Some search engines will refuse to index a page with a high meta refresh rate. For example, Go will not index pages with any redirection whatsoever.

Indexing

This subsection explains what content gets indexed.

Full Body Text. Some search engines index just the title or URL of a page instead of the text embedded in the BODY tag.

Stop Words or Stemming. Most search engines support both functions.

Descriptions, Keywords, and Classifications in the META tag. Valuable information may be specified in a META tag with attributes such as Description, Keywords, and Classification. However, few pages are designed to contain such meta-information.

ALT Text. ALT text can be associated with images or text; it is also informative.

Comments. Most search engines skip comments.

Stop Spam. Some search engines penalize sites that attempt to spam the engines in order to improve their position. One common technique is stacking or stuffing words on a page, where a word is repeated many times in a row. If a search engine spots a spamming technique, it may downgrade a page's ranking or exclude it from the listings altogether. In addition, some pages spam search engines with hot but invisible words that are not relevant to the page content, for example by writing words like sex, mp3, or White House in the same color as the page's background. Some search engines do not index invisible text.

Ranking

Most search engines use the location (tag position) and frequency of keywords on a page as the basis for ranking. In addition to location and frequency, some engines may give a page a relevancy boost based on other factors.

Meta Tags Boost Ranking. Some search engines that support the meta description and keywords tags will also give pages an extra boost if search terms appear in these areas. Not all search engines that support the tags give a ranking boost.

Directories Boost Ranking. Some search engines also review pages (usually the roots of a site) listed in an associated directory and give them a ranking boost.

Link Popularity Boost Ranking. Search engines can determine the popularity of a page by analyzing how many links point to it from other pages and, based on this, boost the page's ranking.

Direct Hit Boost Ranking. Direct Hit is a system that measures what users click on in search results in order to refine relevancy rankings. For example, HotBot integrates this popularity into its top results.
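As noted above, the exact ranking formulas are not public; the following is only a minimal illustrative sketch of the location-and-frequency idea, in which occurrences of a query term in the title receive a larger, hypothetical weight than occurrences in the body.

    def score(page, query_terms):
        s = 0.0
        for term in query_terms:
            s += 3.0 * page['title'].lower().split().count(term)   # location boost
            s += 1.0 * page['body'].lower().split().count(term)    # raw frequency
        return s

    page = {'title': 'Internet crawler design', 'body': 'a crawler visits pages'}
    print(score(page, ['crawler']))   # 3.0 + 1.0 = 4.0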

SearchEngineWatch summarizes most of the above features as shown in Table 3-2.

Table 3-2. Search engine features. This chart is as of Feb. 2. (INK=Inktomi, FAST=FAST, AV=AltaVista, NL=Northern Light, EX=Excite, LY=Lycos, Go=Go, HB=HotBot, GG=Google.)

Crawling                     Yes                            No                              Notes
Deep Crawling                AV, FAST, GG, INK, NL          EX, Go, LY
robots.txt                   All                            n/a
Meta Robots Tag              All                            n/a                             GG may not support
Intelligent Crawling         AV, Go                         EX, FAST, GG, INK, LY, NL
Instant Indexing             AV (pages appear within days)  EX, FAST, GG, Go, INK, LY, NL

HTML Extraction              Yes                            No
Frames Support               AV, FAST, GG, NL               EX, INK, Go, LY
Image Maps                   AV, Go, NL                     EX, FAST, GG, INK, LY
Meta Refresh                 AV, Go, LY                     EX, FAST, GG, INK, NL

Indexing                     Yes                            No                              Notes
Full Body Text               All                            n/a                             Some stop words may not be indexed
Stop Words & Stemming        AV, EX, INK, LY, GG            FAST, Go, NL
Meta Descriptions            All but...                     FAST, GG, LY, NL
Meta Keywords                All but...                     EX, FAST, GG, LY, NL
ALT Text                     AV, Go, LY                     EX, FAST, GG, INK, NL
Comments                     INK                            Others

Spam                         Yes                            No
Invisible Text               Others                         EX, FAST, GG

Ranking                      Yes                            No                              Notes
Meta Tag Boost Ranking       Go, INK                        AV, EX, FAST, GG, LY, NL
Directories Boost Ranking    Go                             AV, EX, FAST, GG, INK, LY, NL
Link Popularity Boost        AV, EX, FAST, GG, Go,          NL                              Very important at GG
Ranking                      INK, LY
Direct Hit Boost Ranking     HB, LY                         Others

Response Time

Response time corresponds to the retrieval efficiency of IR, as described previously. The system response time is often used as the measure, since the query response time seen by the user may be delayed by network congestion. Only a few search engines, such as FAST and Google, present the system response time in their search results.

4. Internet Information Management

Organizing documents into hierarchically structured directories is a common method for managing document information. The concept of hierarchical directories is widely used in phone books, address books, libraries, file systems, etc. It is the most natural way for humans to organize information as knowledge. Using specific attributes to construct the directory is easy, even if the number of objects is very large. For example, in an address book it is easy to follow the attributes country, city, province or county, government region, road, and street to construct a world-wide address book. However, if there are no such reference attributes in the directory objects, as is the case for documents, how can they be managed as a directory hierarchy? The first step is to extract the concept behind the object (document).

Managing documents as directories is workable if the number of documents is not massive; for example, the approach works in libraries. On the Internet, however, more than one million pages are published within a span of several days. One of the biggest challenges is the dynamic nature and diverse cultures of the Internet environment. In this dissertation, we focus on managing information on the Internet, i.e., the Internet directory service. In this chapter, we review the currently available directory services and point out their advantages and disadvantages. Finally, the new concept proposed by Open Directory is introduced.

4.1. The Web Directory Service

Directories are usually human-compiled guides to the Web, where sites or pages are organized by category. The most popular and oldest Web directory is Yahoo!. Other Web directories include

LookSmart, Snap, Magellan, etc. Most Web directories nowadays also provide search services; for example, Yahoo! and LookSmart are powered by Inktomi. Conversely, most search engines, e.g., AltaVista, HotBot, InfoSeek, etc., also provide directories.

Usually, pages are submitted to the Web directory, where they are reviewed by the staff and, if they are accepted, classified into one or several categories of the directory hierarchy. Most directory hierarchies are trees with cross-references among nodes. Compared to the size of a search engine, the size of a Web directory is smaller. Directory engines focus on how to assign Internet documents to the most appropriate directories; that is, they are more concerned with the quality of the directory than with the size and retrieval performance of a search engine.

In Table 4-1, SearchEngineWatch presents the directory size and page increment of popular search engines. In this report, they compare the services based on the following criteria:

Type: shows whether a service is primarily a directory (D) or a search engine (SE). AskJeeves is an answer service (AS) and is more like a directory, since human beings compile its listings.

Editors: shows how many people are involved in producing the listings. More is not necessarily better, as some services claim that technology helps them do more. However, a large number of editors is a good sign that a quality directory is being built and kept up with the growth of the Web.

Cats: shows how many categories are in each directory.

Links: shows how many individual links exist in the directory. Either entire sites or individual web pages may be listed, depending on their exact content. Some sites may be listed in more than one category.

Per Day: shows how many links are added per day.

As Of: shows how current the information is, usually drawn from recent interviews. Some data in the table is updated with new information as it becomes available.

Table 4-1. The chart compares the size of directories at various services, along with other key data.

Service          Type   Editors   Cats      Links               As Of
Yahoo!           D      100+      ?         1.8 million         4/2000
Open Directory   D      15,...    ...       ... million         4/2000
LookSmart        D      200       6,000     1 million           9/1999
Go (Infoseek)    SE     100,000   50,...    ...,000             1/2000
Open Directory   D      10,200    70,...    ...,000             5/1999
Snap             D      ...       ...       ...,000             11/1999
AskJeeves        AS     30        n/a       7 million answers   11/1998
AltaVista        SE     See LookSmart
Excite           SE     See LookSmart
HotBot           SE     See Open Directory
Lycos            D      See Open Directory
MSN Search       SE     See LookSmart
Netscape         SE     See Open Directory

As the table shows, these services spend a great deal of manual effort collecting and categorizing web sites and pages to keep up with the growth of the Internet. For instance, the largest and most popular directory service, Yahoo!, employs more than one hundred editors to add new sites to its directory. In comparison with the largest search engine, Inktomi (500 million pages), Yahoo! (over 1.2 million pages) is the loser in the game of who is biggest. However, Yahoo! is still the largest portal site in terms of number of visitors. Based on the success of Yahoo!, we review the advantages and disadvantages of directory services vis-a-vis search services.

4.2. Advantages of Directory Service

Yahoo!'s strength and success as a guide come from its reliance on human editors. Information organized into directories by humans can be regarded as knowledge of high quality. Thus, directory services have the following advantages:

Hierarchically structured navigation is the only feasible and acceptable information architecture, since it is easy to understand and to represent in the user interface [62].

Human editors provide the brainpower and intuition to extract information from documents, resulting in a higher quality of content for directory services.

4.3. Disadvantages of Directory Service

Oftentimes, advantages from one viewpoint are drawbacks from another, as with the large scale of the Internet: massive numbers of pages are published by friendly and unfriendly webmasters alike, and it is hard to prevent someone from intentionally spamming the directory service. For example, someone might submit a site for kids to Yahoo!, and the editor adds the site to the class about kids. A month later, that person changes the site into a sexual content service, producing a wrong classification for the site. (This is a real case from YamKid, a portal site with a directory service for kids in Taiwan.) This is not such a big problem for search engines, since they simply re-index the site a few days later. In summary, directory services have the following disadvantages:

The topic hierarchy gets more difficult to navigate as it gets larger. Even though cross-reference links are used to good effect to alleviate this problem, usability will ultimately suffer unless better classification methods are discovered [62]. One approach to cope with this problem is to provide personal directory services, as MyYahoo! does.

The dynamic nature of Web pages causes collected pages to go bad or fall under the wrong classification. Bad links are easy to eliminate via crawling techniques; as for wrong classification, extra editorial effort is necessary to deal with misclassified or spamming pages.

It is difficult to achieve instant adding in directory services. By contrast, a page directly submitted to search engines like AltaVista, InfoSeek, and HotBot will be indexed within minutes to two days, depending on the service, assuming the page is not a spamming attempt.

As the Web grows, directories with small editorial staffs will be unable to cope with the increasing volume of web sites. The Open Directory project tries to resolve this problem with a vast army of volunteer editors. Automatic classification of documents is an alternative solution; however, no current document classification system works without involving manual effort.

4.4. Open Directory

The goal of the Open Directory Project is to produce the most comprehensive directory of the Web by relying on a vast army of volunteer editors. Instead of fighting the explosive growth of the Internet, Open Directory provides a means for the Internet to organize itself. As the Internet grows, so does the number of net-citizens. These citizens can each organize a small portion of the Web and present it to the rest of the population, culling out the bad and useless and preserving only the best. Open Directory proposed the concept of The Republic of the Web in June 1998 and won the directory size war within two years. Figure 4-1 illustrates the trend in directory size of Yahoo! and Open Directory.

Figure 4-1. The comparison of directory sizes of Open Directory and Yahoo!

Several portal sites and search engines use Open Directory data, such as Netscape, AltaVista, Lycos, HotBot, and Google. The data is free and can be downloaded at DP_Data/.

5. Document Management and Machine Learning

According to the definition of information in chapter one, a well-organized architecture of documents can itself be useful information. In this chapter, we introduce machine learning techniques that are applied to document management, namely document clustering and classification.

Machine Learning

In knowledge-based systems (expert systems), the cost of knowledge acquisition is typically high; it may require many experts and knowledge engineers to collaborate with each other. Inductive learning from existing data (or instances) is one approach to reducing the effort of knowledge acquisition. There are two types of learning: clustering and classification.

Clustering. Given a set of data, clustering methods attempt to partition the data into several sets (clusters) and then assign the data to those clusters. Hence, clustering is also called unsupervised learning. Most clustering methods are based on statistics [39], such as AutoClass, which applies a Bayesian classification method on the basis of distance measurement [15].

Classification. Given a set of data labeled with classes, classification learning tries to learn rules (or knowledge in another representation) from the given training set. These rules are then applied to classify new data, the testing set. Thus, classification learning is also called supervised learning. Among the various approaches, Version Space [17, 62, 63] and ID3 [60, 33, 67, 11, 82] are the two most commonly used methods.

We first introduce these two well-known inductive learning methods, Version Space and ID3. Based on these learning methods, several learning systems have been proposed to learn rules from databases.

By describing their advantages and disadvantages, we then propose our learning method, which retains the advantages and overcomes the disadvantages.

Version Space Learning

In Version Space [17, 62, 63], the training set is divided into positive and negative instances, used to generalize the S-set and specialize the G-set, respectively, during induction. At the beginning of inductive learning, the S-set is initialized as the most specific hypothesis (all the positive training instances), while the G-set is the most general hypothesis U (the universe). In the learning process, the S-set is generalized (expanded) to include positive instances, while the G-set is specialized (shrunk) to exclude negative instances. After the generalization/specialization of the S-set/G-set, the subset contained in both sets is the inductive result (the Rule Space), as shown in Figure 5-1. By the definition of Version Space, the S-set/G-set includes/excludes all positive/negative instances; the intersection of the S-set and G-set thus yields the inductive rules.

Figure 5-1. The concept of Version Space.

In Mitchell's Version Space, U is the Cartesian product V1 × V2 × ... × Vn, where Vi is the set of distinct values of attribute Ai. Since there are two classes (positive and negative), the domain of the hypothesis language becomes enormous (2^|U|, where |U| is the cardinality of U [32]). Thus, Version Space-based learning systems [14, 85, 31] use generalization/specialization over a concept hierarchy to increase the common values of an attribute and reduce the hypothesis domain, i.e., to reduce the complexity of the inductive rules. The hierarchy is normally constructed based on user (expert) experience.
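To illustrate the S-set half of the induction, a minimal Find-S style sketch is given below: S starts as the most specific hypothesis (the first positive instance) and is generalized just enough to cover each further positive instance. The G-set specialization against negative instances is omitted for brevity, and the attributes and data are hypothetical.

    def generalize_s(hypothesis, positive):
        # Replace each attribute value that disagrees with the positive
        # instance by the wildcard '?', the minimal generalization.
        return tuple(h if h == v else '?' for h, v in zip(hypothesis, positive))

    def find_s(instances):
        positives = [x for x, label in instances if label]
        s = positives[0]                  # most specific: the first positive
        for p in positives[1:]:
            s = generalize_s(s, p)
        return s

    # (manufacture, body) -> is it a sports car?
    data = [(('BMW', 'streamline'), True),
            (('BMW', 'box'), False),
            (('Ferrari', 'streamline'), True)]
    print(find_s(data))                   # ('?', 'streamline')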

Constructing the hierarchy manually in this way is inconvenient because it involves human resources, and the generalization/specialization is processed instance-by-instance, which ultimately degrades system performance. The complexity of the inductive rules also increases in those methods, because they exhaustively consider all (learning) attributes, and the accuracy may decrease because some weaker attributes (regarded as noisy attributes [52]) are involved. However, a characteristic of the data-driven (bottom-up) learning strategy [31, 33, 34, 4], namely that learned training instances are never re-accessed from the database, allows Version Space-based methods to be applied to very large databases (provided they can restrict the domain of the hypothesis language well).

ID3

ID3 [60, 33, 67, 11, 82] uses an attribute's entropy to obtain an optimal order of attributes for classification, instead of using a concept hierarchy of each attribute for generalization/specialization. ID3 generates a decision tree by calculating the entropy of each attribute, selecting the one with minimal entropy as the attribute of the root node, and partitioning the instances according to this attribute. It then recalculates the entropy of the remaining attributes within each partitioned subset recursively, until no node contains instances of more than one class. Compared with Version Space, which generates rules using all attributes, ID3 generates its rules from paths of the decision tree; consequently, some rules do not involve all attributes (depending on the depth of the related decision path). Thus, by applying the concept of entropy, ID3 reduces the size of the learned rules.

However, when the training set is too large to fit in memory, some instances must be retrieved from disk into memory more than once, decreasing the performance of ID3. Moreover, ID3 also suffers from the massive computation of entropy for each attribute under this condition, and subsequently generates an extremely complex decision tree, because the concept of generalization/specialization is not applied to reduce the number of distinct values of an attribute. Obviously, a concept hierarchy per attribute could be added to ID3; however, the learning system cannot determine which generalization/specialization level should be applied to each attribute. Thus, ID3 is useful only when the training set is small.
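The attribute-selection step of ID3 can be sketched as follows: compute the class entropy of each attribute, weighted over the attribute's values, and split on the attribute with the minimal entropy. The toy records are hypothetical.

    from collections import Counter, defaultdict
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def attribute_entropy(rows, labels, attr):
        # Weighted entropy after partitioning the rows by the values of attr.
        parts = defaultdict(list)
        for row, label in zip(rows, labels):
            parts[row[attr]].append(label)
        n = len(labels)
        return sum(len(ls) / n * entropy(ls) for ls in parts.values())

    rows = [('sunny', 'hot'), ('sunny', 'mild'), ('rain', 'mild'), ('rain', 'hot')]
    labels = ['no', 'yes', 'yes', 'no']
    best = min(range(2), key=lambda a: attribute_entropy(rows, labels, a))
    print(best)   # 1: the second attribute separates the classes perfectly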

Learning from Databases

Accumulating a large amount of training data for a problem domain in order to induce knowledge is tedious work; obtaining the data from existing databases associated with the problem domain is a more feasible approach. Since the first commercial relational database products, databases have become increasingly popular in computer applications. A database is normally used to manage the information for a specific area (problem domain); attributes are defined on the objects of the database for easy management. Thus, a database can be applied to inductive learning without any modification. If rules can be induced from databases in a systematic manner, the system can acquire knowledge automatically, and the cost of knowledge acquisition is reduced, since the otherwise necessary large set of training instances can be replaced directly by the database.

Before inductive rules become evident and applicable, a large number of training instances must be processed to increase the confidence of the rules. Unfortunately, the learning time of these algorithms is an exponential function of the size of the training set [4, 69]. Thus, many heuristics are applied to existing learning algorithms to reduce their complexity [14, 31]. Since experts or knowledge engineers usually devise those heuristics, erroneous assessments may cause incorrectly learned rules; hence, extra effort is often required to validate the heuristics.

Many inductive learning systems over relational databases have been proposed [14, 85, 31] based on Mitchell's Version Space method. They improve the performance of the learning algorithms by adding concept hierarchies of attributes and by proposing heuristics to eliminate useless learning attributes or instances. If users are willing to apply those heuristics and input meta-knowledge (information needed during learning) to the learning system, the learning performance (the quality of the learned rules and the learning time) can be adequate. However, users typically want to discover the knowledge embedded in databases without engaging in complex processes or inputting meta-knowledge before learning.

Our Learning Method

To resolve the above drawbacks, we propose a learning system that learns inductive rules from databases in the object-oriented data model [36, 54]. Based on the entropy of attributes and the object-oriented data model, we first propose a method, Automatic Generation of Concept Hierarchies (ARCH), to generate the concept hierarchy of an attribute and thereby overcome the restriction of Version Space-based methods. In the second method, estimation of the Optimal Generalization Level (OGL), entropy is also used to determine the optimal generalization/specialization level of each concept hierarchy, such that ID3 can be used to induce rules from large training sets. Finally, a learning method, learning with Attribute Selection by Entropy (ASE), is proposed to induce rules with the most informative attributes, based on their entropy, and to estimate the accuracy of the rules from the entropy. All of the learning methods rest on the closed-world assumption: all data in the database are correct. Therefore, training instances can be partitioned into classes according to the target concept (attributes of instances) we want to learn.

Our learning system is not only developed on the basis of databases, but also follows the objective of learning precise rules efficiently without other meta-knowledge from users, such as the concept hierarchy of each learning attribute. The concept of entropy from information theory is used to obtain an optimal generalization level for each attribute and to discover the minimal number of features satisfying a given accuracy for the induced rules. The learning algorithm's complexity is linear in the number of training instances and is not limited by the size of main memory. The learning system is described in the following chapter.

Automatic Categorization of Documents

Many approaches are available to categorize documents, in two main camps: manual classification and automatic classification. Manual classification of documents, as done by Yahoo! for example, is time consuming and expensive, and is infeasible given the current explosive growth of Internet documents. For automatic classification, classification knowledge can be acquired from domain experts or learned automatically from training documents [3]. Knowledge acquired from domain experts, while relatively effective, is expensive in terms of time and knowledge maintenance. Furthermore, the acquired knowledge may be incomplete, requiring complicated models and theory to apply it. In contrast, classification knowledge learned automatically from training documents is efficient to obtain, but its accuracy is constrained by the employed learning model and training data.

Many text categorization studies have been undertaken in information retrieval [3, 22, 40, 41, 43, 49, 50, 51, 84]. Herein, the term document categorization is used instead of text categorization, since this work focuses on Internet HTML documents rather than general texts. Document categorization involves the automatic grouping of documents. Many studies have addressed this issue by adopting similarity-based document retrieval [84], relevance feedback [76], text filtering [61], text categorization [3, 49, 51], and text clustering [28, 50]. For example, SIFTER [61] uses the vector space model for document representation, unsupervised learning for document classification, and reinforcement learning for user modeling to filter documents based on content and user-specific interests. Goldszmidt and Sahami proposed document clustering based on the probabilistic overlap between documents and document clusters [28]. ExpNet [84] uses similarity measurement as the category ranking method to determine the best category for an input document. INQUERY [49] employs three different classification techniques: a k-nearest-neighbor (kNN)

approach using the belief scores as the distance metric, Bayesian independence classifiers, and relevance feedback.

Conventional machine learning methods are applied to databases, in which each record (row or tuple) has attributes (columns) regarded as its features. However, a text document has no such explicit features. Thus, characterizing documents is the most important task when applying machine learning to document categorization. Analogous to the two types of machine learning methods, document categorization falls into two types: document clustering and document classification.

Document Clustering

Document clustering tries to discover, from a given document set, the clusters (categories) to which documents should be assigned. Documents in the set are then assigned to one or several of those clusters, depending on whether the clustering is exclusive or overlapping. Once a cluster of documents has been identified, the problem of denoting the cluster concept arises; most studies use the centroid document of the cluster to represent its concept. The formal definition of document clustering is:

A document set D contains documents d1, d2, ..., dn. Document clustering is an algorithm that partitions D into m clusters C1, C2, ..., Cm. A document di can be assigned to one or multiple clusters. In the case of multiple assignment, di is assigned to cluster Cj with a membership grade corresponding to the degree of the assignment.

Thus, there are two processes in document clustering: finding the clusters and assigning the documents. Usually, the document clustering process is interactive with users. First, users give the number of clusters m; the clustering system tries to partition the document set into the given number of clusters. The process can be iterated until the convergence state is reached, or continued recursively with the users to construct a hierarchy of clusters. Second, the system assigns documents to the discovered clusters. Two clustering methods are commonly employed to make the clustering process converge.

Hierarchical Agglomerative Clustering (HAC)

The term agglomerative means that the clustering process starts with unclustered documents and performs pairwise similarity measurements to determine the clusters. The algorithm is as follows.

1. Place each document into a distinct cluster.

2. Compute the pairwise similarities between all such clusters, and merge the two closest clusters into a new cluster. The new cluster is the parent of the two merged clusters in the hierarchy. The choice of similarity measure between two clusters is what differentiates HAC algorithms: there are single link, complete link, group average link, and Ward's method [24, 30].

3. Go to step 2 unless the converged state is reached. For example, the converged state can be k clusters, or a minimum of n documents per cluster.

Obviously, clustering methods are computationally intensive. For example, if clustering is based on a similarity measure, there are n(n-1)/2 similarity measures to be computed for a set of n documents. Many algorithms begin with a matrix that contains the similarity of each document with every other document; for a collection containing 1,000,000 documents, the matrix has on the order of 1,000,000²/2 elements [30].

As larger clusters are formed, the clusters that were merged together are tracked and form a hierarchy; that is, the cluster structure resulting from a HAC method is often displayed as a tree [24, 30]. The order of pairwise coupling of the objects in the data set is shown in the tree paths. This is a useful representation when considering retrieval from a clustered set of documents, since it indicates the paths that the retrieval process may follow. The objectives of creating a hierarchy of clusters are [48]:

Reduce the overhead of search.

Provide a visual representation of the information space.

Expand the retrieval of relevant documents.

The centroid and median methods are used in HAC to represent a cluster [24]. In the centroid method, each cluster is represented by the coordinates of its group centroid as it is formed; at each stage in the clustering, the pair of clusters with the most similar mean centroids is merged. The median method is similar, but the centroid of the two merging clusters is not weighted proportionally to the sizes of the clusters.
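A minimal HAC sketch with single-link similarity is given below. The documents are stood in for by hypothetical 2-D points, and similarity is negative squared distance; a real IR system would use term vectors and, for example, cosine similarity.

    def similarity(a, b):
        # Single link: the similarity of the closest pair across two clusters.
        return max(-((x1 - x2) ** 2 + (y1 - y2) ** 2)
                   for (x1, y1) in a for (x2, y2) in b)

    def hac(points, k):
        clusters = [[p] for p in points]      # step 1: one cluster per document
        while len(clusters) > k:              # step 3: stop at k clusters
            # step 2: find and merge the two closest clusters
            i, j = max(((i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))),
                       key=lambda ij: similarity(clusters[ij[0]], clusters[ij[1]]))
            clusters[i] += clusters[j]
            del clusters[j]
        return clusters

    print(hac([(0, 0), (0, 1), (5, 5), (6, 5)], k=2))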

Iterative Clustering

The iterative clustering method [28], also referred to as the reallocation method, selects some initial partition of the data set and then moves documents from cluster to cluster to obtain an improved partition. The general algorithm is as follows.

1. Initialize the k clusters.

2. For each document, compute the similarity to each cluster.

3. Assign each document to the cluster to which it is most similar, and recalculate the cluster centroids.

4. Go to step 2 unless the converged state is reached.

Note that the initialization in step 1 affects the convergence of the algorithm. Comparing random selection against HAC as the method for finding the initial clusters, the results of the former were often comparable to, but in some cases worse than, those of the latter [28]. There are several ways to select the similarity function; for example, by regarding each document or cluster as a multi-dimensional distribution over a set of terms, the similarity measure of the vector space model can be applied as the similarity between two documents or between a document and a cluster.

Many methods for finding clusters have been proposed and implemented as packages. Some are based on statistical techniques; others use graph theory as their basis. Some generate exclusive clusters, in which a document cannot belong to more than one cluster; others generate overlapping clusters. Some approaches, which interact with users, are top-down; others are bottom-up.

Document Classification

In contrast to clustering, document classification tries to assign documents to one or multiple pre-defined classes (categories). Given a set of classes (or a class hierarchy) of manually categorized documents, the classification process first learns the classification knowledge; the knowledge is then applied to categorize new documents automatically. The formal definition of document classification is:

A document set D contains documents d1, d2, ..., dn, and a set C of classes C1, C2, ..., Cm (in a hierarchical or lattice structure). Each document is associated with one or more classes. Document classification is an algorithm that learns classification knowledge CK from D and C. Given a set D' of new documents d'1, d'2, ..., d'k, CK is applied to categorize each new document into the classes C1, C2, ..., Cm.

Previous machine learning studies developed many algorithms that have been well tested and perform well in many fields, such as medicine and finance. The widely used algorithms, including ID3 [67], C4.5 [68], CN2 [18], and the AQ algorithm [59], are applied to structured training data, rather than the unstructured textual data of the document classification problem. Correspondingly, many approaches to document classification use a feature set to characterize documents and apply algorithms such as Bayesian independence classifiers [50], the k-nearest-neighbor method [22, 26], rule-based induction algorithms [3], and mixed approaches (e.g., INQUERY [49]) to classify documents.

For example, given a hierarchy of classes (subjects) and a set of documents, the system learns features of the classes as shown in Figure 5-2; in the ID3 algorithm, this is called a decision tree. The extended method of ID3, C4.5, applies a greedy divide-and-conquer approach to assign documents to the decision tree. Given a document related to Internet crawlers, the algorithm tries to find the target class by selecting, at each level of the tree, the class with the maximal gain value of the associated feature. The gain function is what differentiates the effectiveness of the learning algorithms: some methods base the gain function on entropy (mutual information) from information theory, and some base it on the relevance between a class and the document features.

Figure 5-2. The learned features of a class hierarchy.

While concentrating on the document classification process and the learning algorithms, those systems neglect the diversity of the documents in the usage of terms and their semantics. In many learning applications, the characterizing feature is an attribute-value pair that is assumed to carry the same semantics in every class. However, the semantics of a feature vary across domains.

For example, the document feature apple has different meanings in the domains computer and food. In this dissertation, we apply the mining of association rules to explore the semantics of the features in a document.

Mining Association Rules

Mining association rules [1, 2, 81] was originally applied to discover important associations among items in transactions. Such knowledge is useful, for example, for finding an optimal item arrangement in a supermarket so that customers can collect what they need rapidly. According to the definition of association rules in [1], the elements of the problem are items, transactions, and the database. Let I = {i1, i2, ..., im} be a set of items. Let D be a set of transactions (the transaction database), where each transaction T is a set of items such that T ⊆ I. An association rule is an implication of the form X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = φ. The rule X → Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. The rule X → Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y. Confidence denotes the strength of the implication rule, and support indicates the frequency of the patterns in the rule. Rules with high confidence and strong support are referred to as strong rules in [1].

The essential task of mining association rules is to discover strong rules in large databases. The task is decomposed into two processes [65, 1]:

Discover the large itemsets, i.e., the itemsets whose transaction support is above a predetermined minimum support s.

Use the large itemsets to generate the association rules for the database.

Obviously, the main cost of mining association rules lies in the first process; the algorithms Apriori [2] and DHP [65] were proposed to reduce this cost. We follow this definition and map the problem of mining term associations onto the same specification. Therefore, mining association rules can be directly applied in our system to solve problems of current IR systems.
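Following the definitions above, support and confidence can be computed directly, as in the minimal sketch below; a real miner such as Apriori would first discover the large itemsets rather than evaluate candidate rules one by one. The toy transactions are hypothetical.

    transactions = [{'html', 'http', 'crawler'},
                    {'html', 'http'},
                    {'html', 'agent'},
                    {'http', 'crawler'}]

    def support(itemset):
        # Fraction of transactions containing every item of the itemset.
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(x, y):
        # Fraction of transactions containing X that also contain Y.
        return support(x | y) / support(x)

    x, y = {'http'}, {'crawler'}
    print(support(x | y))     # 0.5: the support s of the rule X -> Y
    print(confidence(x, y))   # 0.666...: the confidence c of the rule X -> Y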

6. Classification Learning from Databases

In our previous study, we proposed a classification learning method to learn classification rules from databases [36]. The object-oriented database (OODB) was selected as the test bed for the training instances, rather than the relational database (RDB), since the semantics of the object-oriented data model (OODM) are stronger than those of the entity-relationship data model (ERDM) [21, 6, 7, 44]; consequently, more information can be learned from an OODB/OODM. The ERDM only describes the relations among data, so the user must input the concept hierarchy of each attribute to generate more general concepts [14, 85], whereas the class hierarchy of the OODM is precisely the generalization/specialization concept.

Automatic Generation of Concept Hierarchies - ARCH

A systematic method (ARCH) is proposed here to automatically construct the concept hierarchy of each attribute in an OODB. There are three kinds of attributes in an OODB: numerical (continuous or discrete), symbolic (unstructured), and objectified attributes [21, 6, 7, 44]. The first two kinds of attributes are defined with primitive types, whose concept hierarchies can be constructed using fuzzy theory [45]. The last kind of attribute refers to non-primitive types, whose concept hierarchies are constructed according to their relations (association or aggregation).

Concept Hierarchy (CH) of a numerical attribute

Regardless of whether the value of a numerical attribute is a continuous function or a discrete event, the attribute can be fuzzified according to its range [min, max], average, and the distribution of message events (or events), which refer to the values of the attribute in its domain. In the system, the trapezoidal

function is used, denoted by (a, b, c, d), to represent the fuzzy membership function. Figure 6-1 shows the fuzzy concept Medium represented by (a, b, c, d):

Medium = (a, b, c, d), where min ≤ a ≤ b ≤ c ≤ d ≤ max

Figure 6-1. The trapezoidal membership function.

The trapezoidal function is selected for its easy representation in a computer program and its approximation to the frequently used π functions. In particular, the triangular function is a special case of it (b = c), and it is easy to defuzzify by computing the trapezoid's center. The characteristic of numerical events is the ordering among them. Thus, three ordered fuzzy linguistic symbols (FLS) and six fuzzy truth values (FTV) are defined here to represent the ordering.

FLS: Low (L), Medium (M), and High (H)

FTV: Most, More, Least, Less, Very, and Fairly

The FTVs also have their own membership functions and are used to qualify the two FLSs Low and High. The effect of an FTV is to move the two FLSs away from, or closer to, the third FLS, Medium. Among the FTVs, Most/More and Least/Less are complements of each other. FTVs can be used recursively as desired, such as Very Very High; hence, any number of levels of a partially ordered fuzzy concept hierarchy can be generated. Figure 6-2 shows an example of a fuzzy concept hierarchy and the corresponding fuzzy membership functions with three levels, i.e., only one level of FTV is used. Notably, subsumption relations arise between FLSs; the most specific fuzzy concept is selected when membership grades are equal. For instance, if µ(Most-L) = µ(Very-L) = µ(More-L) = 1, then the most specific fuzzy concept, Most-L, is selected. The partial ordering is as follows:

min < Most-L < Very-L < More-L < L < Fairly-L < M < Fairly-H < H < More-H < Very-H < Most-H < max
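The trapezoidal membership function (a, b, c, d) introduced above is straightforward to compute, as the minimal sketch below shows: the grade rises linearly from a to b, is 1 between b and c, and falls linearly from c to d. The numeric range is hypothetical.

    def trapezoid(x, a, b, c, d):
        # Membership grade of event x under the fuzzy concept (a, b, c, d).
        if b <= x <= c:
            return 1.0
        if x <= a or x >= d:
            return 0.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)

    medium = lambda x: trapezoid(x, 20.0, 40.0, 60.0, 80.0)
    print(medium(30), medium(50), medium(70))   # 0.5 1.0 0.5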

Figure 6-2. The fuzzy concept hierarchy and membership functions.

Following the above fuzzification process, deciding how many levels of FTVs should be used is an NP problem. Therefore, the number of fuzzy concept nodes that should be generated at the leaf level of the hierarchy is first calculated by the Expansion Process. For an attribute, the process expands the range of a fuzzy concept node for some class by starting from an event of the class and including its neighbors step by step. Due to the expansion, events of other classes may be included, decreasing the accuracy; hence, a majority for a class is defined by the user to evaluate the termination of the expansion. For instance, a majority of 80% means the expansion must include at least 80% of the events for the class, i.e., it cannot include more than 20% of events of the other classes. The algorithm is implemented based on the concept of binary search.

Algorithm: Expansion Process for an Attribute
Assume: N is the number of training objects, and δ is the constant step size of the expansion.

For each attribute with events (e1, ..., eN) {
    Sort all events of the attribute into an Array; each element holds an event and its related class.
    Invoke the recursive function: Expand(1, N)
}

Function: Expand
Expand(left, right) {   // left, right: the current boundary

    if (left = 0 OR right = 0) return;
    mid = (right + left) / 2;
    class = Array[mid];        // get the class of the starting expansion event
    accuracy = 100%;           // initialized as 100%
    expansion = 0;
    While (accuracy >= majority AND expansion <= (right - mid)) {
        // maximal expansion: right - mid or mid - left
        Include δ neighbors and calculate the accuracy according to class.
        expansion = expansion + δ
    } // End of While
    expansion = expansion - δ  // backtrack to the previous state
    Generate a Gaussian function according to the current range of events.
    Expand(left, mid - expansion);
    Expand(mid + expansion, right);
} // End of Expand

The maximal number of expansion steps in the function Expand is less than N/δ (δ is the expansion constant), and in such a case Expand is invoked once. On the other hand, the maximal number of invocations of Expand is N, each with one expansion in its While loop. Hence, the worst-case complexity of the recursive function Expand is O(N), and the complexity of the Expansion Process is O(N log N + N), since the complexity of the sorting is O(N log N). Factoring in the number of attributes M, the total complexity of ARCH is O(M N log N).

After expanding a range field, a normal distribution (Gaussian) function [71]

    f(x) = e^(-(x - u)² / (2σ²)), 0 < f(x) ≤ 1    (EQ 6.1)

is used as a criterion to approximately translate the range field into a fuzzy trapezoidal membership function (a, b, c, d). The mean u refers to the center of the top of the trapezoid, and the variance σ corresponds to the length of the top. The minimal and maximal values (a, d) of the trapezoid can be solved by the following equations, illustrated in Figure 6-3:

    u = (b + c) / 2, σ = expansion = c - b, and solve (a, d) by

    (u - σ - a) / (b - a) = f(u - σ), (d - (u + σ)) / (d - c) = f(u + σ).    (EQ 6.2)

Figure 6-3. The translation from a Gaussian function to a trapezoidal membership function.

The partial order of those membership functions can easily be identified according to their corresponding sorted Gaussian means. The remaining task is to determine the number of levels of the concept hierarchy by using

    Leaf(l - 1) < G ≤ Leaf(l), for l ≥ 2,    (EQ 6.3)

where Leaf(l) is the number of nodes at level l of the tree shown in Figure 6-2 and G is the number of clusters. If G < Leaf(l), the leftmost and rightmost fuzzy nodes, i.e., the Most-Low and Most-High categories, are omitted, generating a concept hierarchy similar to Figure 6-2. After determining the level of the concept hierarchy, the final task is to transform the membership functions (a, b, c, d) of the Low/High categories into (a, a, c, d) and (a, b, d, d), respectively. Thus, the hierarchy shown in Figure 6-2 can be generated completely and automatically.

During the generalization over the fuzzy concept hierarchy, an attribute's value is represented by pairs of an FLS and a membership grade µ, written (FLS, µ). Moreover, the FLS is regarded as the message event when calculating the entropy, and µ is used to calculate the total membership grade of events by using the fuzzy union operator (s-norm).

CH of a symbolic attribute

In contrast to numerical values, there is no ordering among symbolic values. Thus, the concept hierarchy of such an attribute should be constructed according to indications from the user. We omit this investigation, since it is beyond the scope of the dissertation.

CH of an objectified attribute

If an attribute is related (by association, aggregation, or inheritance) to a class in the OODM, it is objectified (non-primitive). Since the related class may recursively contain primitive or non-primitive attributes, an increasing number of attributes would be included in the learning process, causing the complexity to increase exponentially [4, 69]. Hence, the learning attributes are partitioned according to such an attribute. The criterion for processing those partitioned attributes is determined by whether the objects referred to by the objectified attribute are shared or not.

No Sharing case

Suppose the relation (association or aggregation) between the relating-class (the class which contains the target concept) and the related-class (the class referred to by the relating-class) is 1-to-1 or 1-to-N. Then the objects belonging to the related-class can be completely classified into the same number of categories according to the classification of the relating-class. For instance, suppose the target concept of learning from Car is partitioned into three groups, Sedan, SportsCar, and Bus, and Engine is a related-class of Car with a 1-to-1 relation. Objects of Engine can then be partitioned into three categories: Sedan-Engine, SportsCar-Engine, and Bus-Engine. Thus, the training instances of Car and Engine are induced individually. Moreover, the classification rules of the two classes can be obtained such that the rules of Engine are embedded in the rules of Car. For instance, suppose the rule for SportsCar induced from Car is

IF (manufacture = BMW AND body = streamline) THEN SportsCar.

And the rule for SportsCar-Engine induced from the Engine objects classified by the target concept of Car is:

IF (power = High AND (C.C. = Medium OR High)) THEN SportsCar-Engine.

The rule for SportsCar should then be modified as:

IF (manufacture = BMW AND body = streamline AND Engine = SportsCar-Engine) THEN SportsCar.

This implies the inductive rule is partitioned into two rules. In this manner, the attribute Engine in the class Car has a concept hierarchy and is generalized at the level SportsCar-Engine, as in Figure 6-4. Consequently, this approach yields the following advantages:

By partitioning the training attributes, the learning algorithm's complexity is reduced.

The influence among (noisy) attributes is minimized. If all attributes were involved in a single learning process, attributes from different classes would cause more divergence.

The rules of Engine can be used efficiently in two classes. In addition, the complex rule of the composite class Car is divided into many smaller rules, Car + Engine, which is efficient for inference.

Figure 6-4. The concept hierarchy of the objectified attribute Engine.

Heuristic: The concept hierarchy (generalization level) of a 1-to-N (which includes 1-to-1) objectified attribute corresponds to the induced rules of the related-class, obtained by partitioning on the classification events of the relating-class.

Description: We illustrate this with the previous example. Since all instances of SportsCar-Engine are included by (or associated with) SportsCar and excluded by the other classes through the attribute Engine, the entropy associated with the event SportsCar-Engine of the attribute Engine in the class Car is equal to zero (if there are no identical patterns with different categories). The entropy values of the events Sedan-Engine and Bus-Engine are likewise zero. Thus, the entropy of the attribute Engine, generalized by the concept hierarchy shown in Figure 6-4, is zero, and the hierarchy can be used well.

However, the rule IF SportsCar-Engine THEN SportsCar cannot be concluded here, since the attributes of the relating-class itself are also important. Of course, the induced entropy of the related-class may be greater than zero; in that case we combine it with the attributes of the relating-class to find a lower entropy.

Sharing case

Regarding the case of M-to-N or N-to-1 relations, objects of the related-class cannot be completely classified, since they may be shared by the objects of the relating-class; hence, the category of a related-class instance cannot be determined. There are two approaches for such relations:

Regard such an objectified attribute as noise and discard it. However, the knowledge embedded in the data may then not be extracted completely.

Generate a new category to isolate the shared objects. If only some objects of the related-class are shared by some relating objects, new categories should be generated to capture such knowledge. The new category can easily be generated from the intersection of the classifications described previously. For example, the set intersected from objects with the same patterns in Sedan-Engine and SportsCar-Engine is a new category, meaning the engine can be used in both sedans and sports cars.

Which approach should be applied depends on the desired accuracy and complexity of the rules. The first approach generates rules with lower complexity, since it regards the attribute as noise. The second produces rules with a higher degree of accuracy, since it never excludes any attribute (which attribute should be excluded is determined according to its entropy), and more knowledge can be induced; however, its relative complexity is high.

Inheritance

As for inheritance, the relation of class/subclass is 1-to-1. Thus, the learning attributes are partitioned into two sets according to whether an attribute is defined in the superclass or the subclass. For instance, Employee(Eno, Dno, Salary, Position) inherits Person(Name, Age, Degree, IQ, Sex, BirthPlace). The learning attributes are partitioned into two sets, Employee and Person. Thus, "What kind of person is suitable to be a manager?" can be discovered if the classification rule is learned according to the attribute Employee(Position). In the ERDM, however, such rules are never found, since it lacks a semantically stronger data model like the OODM.

If the superclass is referred to by another class through an aggregation/association attribute, the class hierarchy rooted at the superclass is simply the concept hierarchy of that attribute. For example, the class hierarchy of Vehicle in Figure 6-5 can be regarded as a concept hierarchy.

Figure 6-5. A concept hierarchy generated from the class hierarchy (Vehicle splits into Land, Water, and Air; Land into Sedan, SportsCar, and Bus; Water into Boat and Submarine; Air into Airliner, Jet Plane, and Space Shuttle).

6.2. Estimating of Optimal Generalization Level - OGL

As mentioned above, the limitation of ID3 is that it never applies the concept of generalization and therefore cannot be used when the training set is very large. Although a concept hierarchy can be constructed for each attribute, the level at which an attribute should be generalized cannot be determined, since ID3 never generalizes attributes. Of course, each hierarchy can be generalized level by level and ID3 applied to find an optimal solution; however, ID3 must then be executed $L_1 \times L_2 \times \cdots \times L_n$ times, where $L_i$ is the number of levels in the hierarchy of attribute $A_i$. If the generalization level of each attribute can be determined individually, ID3 can be applied to learn from large training sets with good performance and fewer attributes than Version Space. This section proposes a method to determine the optimal generalization level (OGL) of each attribute. A cost model is first defined according to the accuracy and complexity of the learned rules. Our approach applied to ID3 (called modified ID3) is compared with Version Space, using the cost model as the simulation criterion, to verify OGL. In the next section, this approach is incorporated into our learning algorithm, Attribute Selection by Entropy (ASE), to learn rules with the fewest attributes.

The cost model of learning rules' complexity

The simulation aims to determine whether OGL is optimal in most cases. Therefore, considering the run time of modified ID3 and Version Space is unnecessary; what matters is the quality of the learned rules, which is determined by two factors: accuracy and complexity.

The accuracy of inductive rules becomes an issue after applying OGL, since some instances may occur with the same pattern while supporting different classes (called ambiguity) due to the generalization. As mentioned previously, if ID3 and Version Space perform an exhaustive search to obtain 100% accurate rules, the complexity of the algorithms is extremely intractable. Indeed, [14, 85, 31, 52] modify Version Space and induce rules with loose accuracy to reduce the learning complexity and discover more compact rules. Thus, OGL can be used if users can accept uncertain rules.

The complexity of inductive rules determines whether the CPU can infer results from those rules efficiently, i.e., the space required to store them. Thus, the following arguments must be considered:

NC: the number of learning classes (concepts), i.e., the number of categories partitioned by the target concept (attribute).
NR_i: the number of inductive rules for class i (i = 1..NC).
NA_j: the number of attributes in rule j, i.e., the number of terms in the rule's condition part (j = 1..NR_i for class i).
SA_k: the size of attribute A_k in rule j (k = 1..NA_j); it depends on the type (int, string, etc.) of the attribute.
RPC_i: the reference (or access) probability of class i (i = 1..NC); all rules of a concept class share the same RPC.

If the first four arguments are considered, the complexity of the learned rules is

$$\sum_{i=1}^{NC}\sum_{j=1}^{NR_i}\sum_{k=1}^{NA_j} SA_k.$$

If we also consider the reference probability of the concept, the cost model becomes

$$\sum_{i=1}^{NC}\left(RPC_i\cdot\sum_{j=1}^{NR_i}\sum_{k=1}^{NA_j} SA_k\right).$$
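To make the cost model concrete, here is a minimal sketch in Python; the representation of rules as lists of attribute sizes, and all names, are our assumptions rather than part of the dissertation.

```python
# A minimal sketch of the rule-complexity cost model, assuming a
# hypothetical representation: each class maps to a list of rules, and
# each rule is a list of attribute sizes SA_k in bytes.
def rule_complexity(rules_per_class):
    """Sum of SA_k over all classes, rules, and rule attributes."""
    return sum(sa for rules in rules_per_class.values()
                  for rule in rules
                  for sa in rule)

def weighted_cost(rules_per_class, ref_prob):
    """Cost model weighted by each class's reference probability RPC_i."""
    return sum(ref_prob[cls] * sum(sa for rule in rules for sa in rule)
               for cls, rules in rules_per_class.items())

# Two classes; attribute sizes are two-byte integers as in the simulation.
rules = {"Manager": [[2, 2]], "NonManager": [[2, 2], [2]]}
print(rule_complexity(rules))                                # 10 bytes
print(weighted_cost(rules, {"Manager": 0.3, "NonManager": 0.7}))
```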

OGL method

OGL is based on the entropy of information theory [29]. It can be used to calculate the entropy of a message event of some attribute from positive (+) and negative (−) instances:

$$Entropy_e = -\sum_{i=+,-}\frac{NI_{eC_i}}{NI_e}\log_2\frac{NI_{eC_i}}{NI_e},\qquad 0\le Entropy_e\le 1 \qquad (EQ\ 6.4)$$

where $NI_e$ is the number of instances supporting Class + or Class − through event $e$, and $NI_{eC_i}$ is the number of instances among $NI_e$ supporting class $i$, i.e., $NI_e=\sum_{i=+,-}NI_{eC_i}$. The equation gives the entropy of a given event. The value is minimized if all instances with event $e$ support only one class. Conversely, if the instances support all classes equally, the entropy of the event is maximized, since no class is strongly supported.

The entropy of an attribute is the sum of the entropy values of all its events which, after normalization, equals

$$\sum_{e}\left(Entropy_e\cdot\frac{NI_e}{NI}\right),\qquad 0\le\cdot\le 1 \qquad (EQ\ 6.5)$$

where $NI$ is the total number of instances and $NI_e$ is the number of instances with event $e$.

When the concept hierarchy is considered, the number of events decreases during the generalization process, since the number of nodes shrinks as we approach the root of the hierarchy. On the other hand, the range covered by an event increases with generalization, so the entropy of the generalized event increases. Thus, an attribute's event number moves opposite to its entropy, and OGL attempts to discover a trade-off between the two. Lower attribute entropy suggests that each class can be categorized more precisely by the attribute. However, keeping the generalization level at the leaf nodes is undesirable: there would be too many events, and the rule complexity would be unacceptable. Conversely, generalizing an attribute to the root level is also useless, since the entropy is maximized and the accuracy of the rules is lost (they support nothing). Simulating the relation between event number and entropy reveals chaotic behavior; it is difficult to derive a cost function of event number and entropy that can be minimized analytically.

Thus, the maximal reduction of event number together with the minimal increase of entropy must be found during the generalization process. Unfortunately, the units of event number ([1, NI]) and entropy ([0, 1]) differ. Hence, the solution is obtained as follows. Define a graph using entropy as the x-axis and event number as the y-axis, and draw a base line through the points of leaf-level and root-level generalization, as shown in Figure 6-6. There are two kinds of points in the graph: E, the entropy calculated from the training instances of the attribute, and Eg, the entropy guessed according to the base line. If the leaf-level and root-level points are (x0, y0) and (x1, y1), and a point calculated from the training instances is (x', y'), then

$$\text{base line: } x=\frac{x_1-x_0}{y_1-y_0}(y-y_0)+x_0,\qquad E=x',\qquad E_g=x\Big|_{y=y'}=\frac{x_1-x_0}{y_1-y_0}(y'-y_0)+x_0,\qquad \Delta E=E-E_g \qquad (EQ\ 6.6)$$

Figure 6-6. An example describing the concept of OGL (entropy on the x-axis and event number on the y-axis; the base line joins the leaf-level and root-level points, with the 1-, 2-, and 3-level points and their E and Eg values between them; delta = E − Eg).

By calculating the delta-entropy ΔE of each level (except the leaf and root levels) in the attribute's concept hierarchy, the level with the minimal ΔE is clearly the desired OGL.
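The following sketch illustrates EQ 6.4 through EQ 6.6 under assumed inputs: per-event class counts for the entropy, and hypothetical (entropy, event-number) points for each generalization level. It is an illustration of the method, not the dissertation's implementation.

```python
import math

def entropy(event_counts):
    """Entropy of one event from its per-class instance counts (EQ 6.4)."""
    total = sum(event_counts)
    return -sum(n / total * math.log2(n / total) for n in event_counts if n)

def attribute_entropy(events):
    """Normalized attribute entropy (EQ 6.5); events is a list of per-class count tuples."""
    ni = sum(sum(e) for e in events)
    return sum(entropy(e) * sum(e) / ni for e in events)

def ogl(levels):
    """Pick the generalization level with minimal delta-entropy (EQ 6.6).

    `levels` maps a level index to (entropy, event_number); index 0 is the
    leaf level and the last index is the root level.
    """
    (x0, y0), (x1, y1) = levels[0], levels[-1]   # leaf / root points
    slope = (x1 - x0) / (y1 - y0)                # base line in the (entropy, events) plane
    def delta(point):
        e, events = point
        return e - (slope * (events - y0) + x0)  # delta E = E - Eg
    # exclude leaf and root; keep the level whose point falls farthest below the base line
    inner = list(enumerate(levels))[1:-1]
    return min(inner, key=lambda kv: delta(kv[1]))[0]

# Hypothetical hierarchy: entropy rises and event count falls toward the root.
levels = [(0.05, 64), (0.10, 30), (0.15, 14), (0.18, 6), (0.90, 1)]
print(ogl(levels))  # -> 3, mirroring the "3-level" choice of Figure 6-6
```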

In Figure 6-6, the 3-level is the optimal solution, since it dramatically decreases the event number while only slightly increasing the entropy. Since the entropy is calculated log N times (the number of levels in the concept hierarchy) and the worst case of calculating a level's entropy occurs at the leaf level with N events, the complexity of OGL is O(N log N) per attribute. Combined with ARCH, the total complexity is also O(MN log N).

Simulation Results

Modified ID3 (OGL + ID3) is used here to simulate the performance of ARCH + OGL, which is then compared with Version Space according to the accuracy and complexity of the induced rules. We neglect ambiguities in modified ID3 while generating the decision tree and use the cardinality of the ambiguity to calculate the certainty factor (CF) of the rules. The simulation is performed under the following assumptions: the class has ten numerical attributes (two-byte integers) and no attributes of other types; the events of each attribute are randomized as two-byte integers, and the class a pattern belongs to is also randomized as 0 or 1; the concept hierarchy of each attribute is generated by ARCH; and the simulation tests training sets of 500, 1000, 1500, ... examples, with each case executed fifty times and the results averaged.

Using the prior cost model and comparing the complexity of the rules learned by modified ID3 and Version Space, the following assertions are derived from the observations of Figures 6-7 and 6-8. The knowledge complexity of modified ID3 decreases dramatically while the uncertainty increases only slightly. The rule complexity is much less than the size of the training set (for the case of 4000 instances, that size is 4000 × 5 × 2 = 40000 bytes). Hence, a large amount of data can be induced into compact knowledge through our methods.

Figure 6-7. The simulation result of ARCH + OGL + ID3: rule complexity in bytes versus the size of the training set, for Version Space and modified ID3 (ARCH + OGL + ID3).

Figure 6-8. The simulation result of ARCH + OGL + ID3: rule accuracy versus the size of the training set, for Version Space and modified ID3 (ARCH + OGL + ID3).

However, for larger training sets the accuracy of the induced knowledge may fall below 0.8, according to Figure 6-8. Thus, in the next section, another approach, ASE (Attribute Selection by Entropy), is proposed to balance the two extremities of Version Space and ID3.

6.3. Learning with Attribute Selection by Entropy - ASE

Before employing the learning method, heuristics proposed in our previous study [52] can be used to eliminate trivially redundant attributes, which would degrade learning performance if included.

Remove the key attribute: If an attribute of a class is a key or part of a composite key, it should be removed before learning. A key attribute never contributes to the learned classification rule, even though its entropy equals zero, since all of its values are distinct. (Most OODBs regard OIDs the way RDBs regard keys; however, users are permitted to define meaningful key attributes in an OODB such as our intelligent object-oriented database system, SLOODS [35, 54].)

Remove attributes with a default value: In an OODB, a default value can be defined for an attribute of a class. Such an attribute should not take part in inductive learning, since all objects in the class share the same value for it. The situation is detected when computing the attribute's entropy, which is larger than that of any other attribute.

After eliminating these meaningless attributes, the learning system must induce accurate (approaching 100%) rules using the previously calculated entropy values of the attributes. According to the definition of entropy, if an attribute's entropy is close to zero, we can classify precisely by this attribute. When we learn classification rules from databases, however, this kind of attribute is usually the one used to determine the instance's target concept. For instance, if we induce the classification rules for Manager or Engineer over Employee objects according to the target attribute Position, the entropy of that attribute is equal to zero. Inductive learning attempts to discover a combination of other features whose entropy equals or approaches the entropy of the target attribute, i.e., zero entropy. Following this concept, a method, Attribute Selection by Entropy (ASE), is first proposed to find attributes that can be combined to replace the target attribute with the least error. Next, a curve-fitting method is used to find a function mapping entropy values in [0, 1] to probability values, i.e., the degree of accuracy of the induced rules.

Attribute Selection by Entropy (ASE)

In this study, greedy and hill-climbing approaches (guided by the entropy of combined attributes) are used to find optimal classification rules. The method described previously is used to calculate

entropy values of all attributes and to obtain the optimal generalization level. Next, the attributes are sorted in ascending order of entropy. The two attributes with the smallest entropy values are selected and combined into a new attribute. If the entropy of the new attribute is smaller than that of the original two, the new one substitutes for them; otherwise, the third-smallest attribute is selected instead of the second, and so on. Hence, "greedy" means the smallest-entropy attribute is always the first choice, and "hill climbing" refers to never letting the entropy value increase.

Algorithm: Attribute Selection by Entropy (ASE)

Let {a1, a2, ..., aM} be the set of attributes with entropy values {e1, e2, ..., eM}, where M is the number of attributes.

1. Generate instances according to ARCH and OGL; the entropy of each attribute is calculated in those stages.
2. Sort the attributes in ascending order of entropy into the Attribute-Entropy-List (AEL), indexed 1 to M, so that e1 < e2 < ... < eM.
3. Remove (a1, e1) from AEL and let (A, E) = (a1, e1).  // greedy
4. While (AEL is not empty AND E > ε) {  // ε is close to zero
       (a, e) = remove the least element from AEL;  // greedy
       (tempA, tempE) = Combine(A, E, a, e);
       IF tempE < E THEN (A, E) = (tempA, tempE);  // hill climbing
       ELSE do nothing;  // i.e., (a, e) is discarded from AEL
   }

Since the entropy values are reused from the earlier OGL and ARCH stages, the remaining complexity of ASE lies in its while-loop. The worst case of the loop equals (M−1)!, where M is the number of attributes. Because the work inside the loop is constant and (M−1)! is less than N for M << N, the complexity of ASE can be omitted. Therefore, the total complexity of AGE (ARCH + OGL + ASE) is also O(MN log N).
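A minimal Python sketch of the ASE loop follows. The `combine` callback, which must evaluate the entropy of a merged attribute over the training instances, is assumed to be supplied by the ARCH/OGL stages; its signature is our invention, not the dissertation's.

```python
# A minimal sketch of Attribute Selection by Entropy (ASE), assuming the
# per-attribute entropies from ARCH + OGL are already available and that
# `combine(combo, e, attr, e2)` returns (merged_attribute, merged_entropy).
def ase(attributes, entropies, combine, eps=1e-6):
    """Greedily merge attributes while the combined entropy keeps falling."""
    ael = sorted(zip(attributes, entropies), key=lambda ae: ae[1])
    combo, e = ael.pop(0)                 # greedy: start from the least entropy
    while ael and e > eps:
        cand_attr, cand_e = ael.pop(0)    # next least-entropy attribute
        merged, merged_e = combine(combo, e, cand_attr, cand_e)
        if merged_e < e:                  # hill climbing: keep only improvements
            combo, e = merged, merged_e
        # otherwise the candidate is simply discarded from the AEL
    return combo, e
```

With the employee example of the next subsection, combining the two least-entropy attributes already reaches entropy 0, so the loop would stop after a single merge.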

An Example

The algorithm is illustrated with an example from our previous study [52], shown in the following table. A comparison is then made of the complexity of the inductive rules generated by ASE, modified ID3, and [52] (a Version Space based approach). In this case, the concept of a manager in a company is learned from the attribute Position.

Eno | Salary | Position | Sex | Age | Name | Degree | IQ | BirthPlace
… | … | Manager | M | 50 | Tom | Master | 130 | Tainan
… | … | Secretary | F | 25 | Mary | Bachelor | 110 | L.A.
… | … | Salesman | M | 30 | John | Bachelor | 120 | L.A.
… | … | Manager | M | 54 | Bill | Master | 135 | Tainan
… | … | Engineer | F | 32 | Jan | Bachelor | 130 | Taichung
… | … | Engineer | M | 28 | Jack | Master | 140 | Taipei
… | … | Manager | M | 40 | Taid | Ph.D. | 170 | Taipei
… | … | Secretary | F | 27 | Jany | Bachelor | 100 | Kaohsiung
… | … | Service | F | 30 | Tony | Bachelor | 110 | Taichung
… | … | Manager | M | 60 | Lin | Master | 130 | Tainan
… | … | Service | M | 32 | Lee | Bachelor | 100 | Taipei
… | … | Engineer | F | 35 | Cray | Bachelor | 120 | L.A.

The instances are classified by the target attribute Position, and the trivial attributes are eliminated by the prior heuristics. The attribute BirthPlace is an associated attribute; thus the objects of the class Address, to which BirthPlace points, are partitioned into the following tables.

Position: Manager
Salary | Sex | Age | Degree | IQ | BirthPlace
… | M | 50 | Master | 130 | Tainan
… | M | 54 | Master | 135 | Tainan
… | M | 40 | Ph.D. | 170 | Taipei
… | M | 60 | Master | 130 | Tainan

Position: Non-Manager
Salary | Sex | Age | Degree | IQ | BirthPlace
… | F | 25 | Bachelor | 110 | L.A.
… | F | 27 | Bachelor | 100 | Kaohsiung
… | F | 35 | Bachelor | 120 | L.A.
… | F | 32 | Bachelor | 130 | Taichung
… | M | 28 | Master | 140 | Taipei
… | F | 30 | Bachelor | 110 | Taichung
… | M | 32 | Bachelor | 100 | Taipei
… | M | 30 | Bachelor | 120 | L.A.

The optimal generalization level of each attribute is first derived by the OGL method, giving the generalized tables below. Since the event number of the symbolic attribute Degree is very small, its concept hierarchy is trivial and is not constructed here. Next, the entropy of each event in an attribute is calculated by (EQ 6.4), and then all entropy values are normalized and accumulated by (EQ 6.5) as follows:

Entropy(Salary) = 0.325. Since $E_{Sa=H} = 0$ over 3 instances, $E_{Sa=M} = \frac{1}{6}\log_2 6 + \frac{5}{6}\log_2\frac{6}{5} = 0.650$ over 6 instances, and $E_{Sa=L} = 0$ over 3 instances, the normalized total entropy is $(3\cdot 0 + 6\cdot 0.650 + 3\cdot 0)/12 = 0.325$.

Entropy(Sex) = 0.575. Entropy(Age) = 0.325. Entropy(Degree) = 0.270. Entropy(IQ) = 0.459. Entropy(BirthPlace) = 0.743.

Attributes are sorted by their entropy values and the following sequence is obtained. By assuming size(Salary) > size(Age), Age is selected before Salary, although they have the same entropy value.

Position: Manager
Degree | Age | Salary | IQ | Sex | BirthPlace
Master | H | H | H | M | Taiwan
Master | H | H | H | M | Taiwan
Ph.D. | M | M | H | M | Taiwan
Master | H | H | H | M | Taiwan

Position: Non-Manager
Degree | Age | Salary | IQ | Sex | BirthPlace
Bachelor | L | L | L | F | Foreign
Bachelor | L | L | L | F | Taiwan
Bachelor | M | M | M | F | Foreign
Bachelor | M | M | H | F | Taiwan
Master | L | M | H | M | Taiwan
Bachelor | M | M | L | F | Taiwan
Bachelor | M | L | L | M | Taiwan
Bachelor | M | M | M | M | Foreign

Select (Degree, Age) and calculate its combination entropy (CE). CONCEPT Manager: E(MS, H) = 0, E(Ph.D., M) = 0. CONCEPT Non-Manager: E(BA, L) = 0, E(BA, M) = 0, E(MS, L) = 0.

The CE equals 0, so we reformulate the following rules. We also estimate the complexity of the rules (assuming all attributes have the same data size) and compare the result with modified ID3 and the Version Space of our previous work. All of them induce rules with 100% accuracy, while the complexities of the learned rules differ.

ASE:
IF (Degree, Age) = (Ph.D., M) OR (MS, H) THEN Manager
IF (Degree, Age) = (MS, L) OR (BA, M) OR (BA, L) THEN Non-Manager
Complexity = 5 × 2 (attribute, event) pairs = 10 A-Es

OGL + ID3:
IF Degree = Ph.D. THEN Manager
IF (Degree, Age) = (MS, H) THEN Manager
IF (Degree, Age) = (MS, L) OR (BA, M) OR (BA, L) THEN Non-Manager
Complexity = 9 A-Es

Version Space (attributes are removed according to the heuristics in [52]):
IF (Salary, Sex, Age, Degree, IQ, BirthPlace) = (H, M, H, MS, H, Tw) OR (M, M, M, Ph.D., H, Tw) THEN Manager
IF (Salary, Age, Degree) = (L, L, BA) OR (M, M, BA) OR (M, L, MS) OR (L, M, BA) THEN Non-Manager

Complexity = 6 × 2 + 3 × 4 = 24 A-Es >> complexity of ASE or ID3

Thus, the rule complexity induced by ASE is close to that of modified ID3 but better than that of Version Space. We use ASE instead of applying OGL to ID3 (modified ID3) because modified ID3 cannot revise the uncertainty caused by ambiguities from the generalization of OGL, as shown in Figure 6-8, when the training set is enormous. Furthermore, calculating entropy values and partitioning instances recursively degrades the performance of ID3. ASE, in contrast, uses the entropy of combined attributes to induce knowledge with uncertainty: if any ambiguities occur from applying OGL, their effect acts upon the entropy values used in the subsequent ASE steps. Thus, ASE can compensate for the uncertainty caused by OGL.

The above cases are simulated again employing our learning method AGE, and the results are shown in Figure 6-9. In this simulation, the entropy of the induced rules is kept under a threshold (ε = 0.469), i.e., the accuracy of the induced rules is close to 0.9. The calculation of the accuracy from the entropy is described in the following section.

Figure 6-9. The comparison of Version Space, modified ID3 (ARCH + OGL + ID3), and AGE (left panel: rule complexity in bytes versus size of training set; right panel: rule accuracy versus size of training set).

The accuracy of rules induced from ASE

The results induced by ASE are rules associated with a least entropy. However, users want to know the accuracy of an inductive rule rather than its entropy. Both entropy and probability range over [0, 1], but they are not equal, so the mapping function between them must be investigated. We simulate it with the following concept. Assume there are two classes C1 and C2, and we calculate the entropy of an Event-11 from C1 with the pair (N_C1, N_C2), meaning that N_C1 and N_C2 instances support Event-11 in C1 and C2 respectively. If N_C1 > N_C2, we can conclude that the probability of the classification rule for C1 induced from Event-11 is

$$\frac{N_{C1}}{N_{C1}+N_{C2}}.$$

Thus we know both the entropy of Event-11 and the probability of the rule induced from Event-11. We use the event points (N_C1, N_C2) = (100, 0), (99, 1), (98, 2), ..., (50, 50), which correspond to (entropy, probability) = (0, 1.00), (0.081, 0.99), ..., (1, 0.500). Those fifty points draw the graph shown in Figure 6-10, and we continue to simulate the other cases such as (1000, 0) - (500, 500) and (10000, 0) - (5000, 5000); the resulting graphs are close to Figure 6-10. Hence, the curve-fitting method of numerical analysis is used to find a function, Probability = f(Entropy), from the points (100, 0) - (50, 50); the function is then used to calculate the relative error of the other cases. The fitted function is a polynomial of degree eight,

$$f(E) = c_8E^8 + c_7E^7 + \cdots + c_1E + c_0. \qquad (EQ\ 6.7)$$

Figure 6-10. Relationship between entropy and probability (number of examples = 100; the probability is 1 at entropy 0 and 0.5 at entropy 1).
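The simulation of this mapping can be sketched as follows; NumPy's polynomial fit stands in for "the curve fitting method of Numerical Analysis" (the dissertation does not name a tool), and degree 8 matches the shape of EQ 6.7.

```python
import numpy as np

def entropy2(p, n):
    """Binary entropy of an event supported by p and n instances."""
    probs = [c / (p + n) for c in (p, n) if c]
    return -sum(q * np.log2(q) for q in probs)

# event points (N_C1, N_C2) = (100, 0), (99, 1), ..., (50, 50)
pairs = [(100 - k, k) for k in range(51)]
E = np.array([entropy2(p, n) for p, n in pairs])
P = np.array([p / (p + n) for p, n in pairs])   # rule probability
coeffs = np.polyfit(E, P, 8)                    # f(E), cf. EQ 6.7
f = np.poly1d(coeffs)
print(abs(P - f(E)).max())                      # fit error on the sample points
```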

We calculate the relative error by (EQ 6.8) and simulate the cases with instance numbers equal to 10, 100, 1000, 10000, and 100000:

$$RelativeError\ (RE) = \frac{|P - f(E)|}{P} \qquad (EQ\ 6.8)$$

The results are shown in the following table.

Instance Number | Average RE | RE > 0.01 | RE > 0.02 | RE > 0.03 | RE > 0.04 | Maximum RE
10 | … | 20% | 10% | 10% | 10% | …
100 | … | 2% | 1% | 1% | … | …
1,000 | … | 1.2% | 0.6% | 0.1% | … | …
10,000 | … | 1.14% | 0.55% | 0.04% | … | …
100,000 | … | 1.135% | 0.545% | 0.031% | … | …

We find that f(E) is close to the actual probability: the average relative error is less than 0.5%, and when the training set is large the maximum relative error approaches 4% in only about 0.04% (close to 0%) of the instances. Thus, f(E) can be used to calculate the accuracy (certainty factor) of an inductive rule without distortion.

Discussion

Since an inductive rule merely represents the concept of a category, it can be used to evolve the schema of an OODB, i.e., the OODM. For the earlier example of Figure 6-4, we can enhance the semantics of the schema as in Figure 6-11 by the following steps in our system [54]; after the schema evolution, the induced knowledge is embedded in the new OODM. Sedan, SportsCar, and Bus inherit Car. SportsCar overrides the attribute body = {Streamline}; Sedan and Bus override it with body = {~Streamline}. Sedan-Engine, SportsCar-Engine, and Bus-Engine inherit Engine. SportsCar-Engine overrides attributes by setting power = {High} and C.C. = {Medium, High}.

Figure 6-11. Self-organizing OODM according to the learned rules (old schema: class Car with attributes manufacture = {BMW, ...}, type = {Sports, Sedan, Bus}, body = {Streamline, ~SL}, aggregating class Engine with attributes power and C.C.; self-organized schema: Sedan, SportsCar, and Bus inherit Car, and Sedan-Engine, SportsCar-Engine, and Bus-Engine inherit Engine).

7. The Overview and Design of ACIRD

Web search engines use a full-text index for fast document retrieval and matching, representing documents as sets of indexed terms with weights. Given a query with one or more terms, most search engines are capable of retrieving the documents that contain the query terms within one second; however, the result is often thousands of retrieved documents. For example, submitting the query "crawler" to AltaVista returned 157,675 pages (test performed on May 8). On the other hand, Web directories classify links (pages) into a topic hierarchy developed and maintained by directory editors. In addition, most directory services provide search functions with the same indexing processes applied to the descriptions of directories and links. For instance, there are fourteen most-general subject categories and numerous sub-categories in Yahoo!; searching for "crawler" in Yahoo! returned one category and 39 links (test performed on May 8). That is, search engines suffer from retrieving too many documents while, in contrast, directory services worry about retrieving too few.

In this dissertation, using machine learning techniques, we develop a system for information retrieval and management in the Internet environment. The system, Automatic Classifier for Internet Resource Discovery (ACIRD) [53, 55, 56, 57], tries to integrate search and directory services to resolve the problems of search engines and directory services. In this chapter, we first briefly describe an overview of the system. The document and knowledge (information) representation is introduced in the next section. Based on this representation, the document classification system and the search engine are designed and implemented. The classification system is then described by illustrating the proposed classification method. We also apply mining of association rules to learn term semantics in different classes. The term semantics is

applied to refine the classification system. Finally, two-phase search is designed to integrate the directory and search services based on the classification knowledge learned by the classification system.

An Overview

ACIRD is implemented to automatically categorize documents, collected from Web pages or sites, into hierarchical categories. It is motivated by the poor throughput of Yam's manual categorization, which adds no more than 500 documents to classes per day. Based on Yam's manually collected training document set (the Web pages in Taiwan), ACIRD first learns classification knowledge that can be applied to classify new pages and to represent the indexes of categories. Currently, Yam's categories form a graph with a hierarchically generalized/specialized structure, called the ACIRD Lattice in this dissertation.

Based on the proposed document classification method, the classification process is decomposed into two phases: a training phase and a testing phase. In the first phase, the currently categorized documents are employed as the training set to learn the classification knowledge of the classes in the ACIRD Lattice. In the second phase, recently collected documents manually categorized by the staff of Yam are used to verify and refine the classification knowledge of the lattice. Using the classification knowledge, we implement the ACIRD Classifier to automatically classify Internet documents into the proper classes of the lattice. In the same way, the Two-Phase Search Engine is implemented to improve search performance by shrinking the search domain: it performs a class-level search to retrieve relevant classes in the first phase, and an object-level search then refines the retrieval of objects within these relevant classes.

In the rest of the section, we give an overview of the system based on the IR system issues in [24]: document operation, conceptual model, term operation, file structure, query operation, and hardware. The following terminology is used throughout the dissertation.

Class corresponds to a category of Yam, i.e., a node of the ACIRD Lattice. The classification knowledge of a class is represented by a group of terms with supports, or keywords with membership grades.

Object corresponds to a document (a Web page). The object content is indexed with a set of terms with supports, which denotes the object knowledge.

Term is a word or phrase extracted from objects by the term parser or generalized into classes by the learning process.

Support is the degree of importance of a term in supporting some object or class. The value is normalized to [0, 1].

Keyword corresponds to a representative term, i.e., its information quality is better than that of an ordinary term.

Membership grade (MG) is the supporting degree of a keyword to some object or class, just as support is the supporting degree of a term. It captures the deviation between term and keyword.

Document Operation

Each object is assigned a unique OID (object ID) in the database. An object can be instantiated from several classes, according to the current categorization of Yam (the ACIRD Lattice). Before the classification process, objects are preprocessed into terms with supports to form each object's feature vector (knowledge). In this dissertation, we concentrate on Chinese HTML documents, i.e., documents whose HTML CHARSET is BIG5. Thus, we assume the object content is written in English or Chinese (BIG5).

Term Operation

The object content is divided into English and BIG5 strings. English strings are decomposed into terms by removing stop words, stemming, and term weighting, as proposed in conventional IR. Chinese terms are extracted from BIG5 strings by a Chinese term segmentation process. The process is based on a term base, a database originating from the analysis of past query logs of Yam. Term operation is involved in both Document Operation and Query Operation.

Conceptual Model

Within a computer, a document must be represented as symbols made up of bytes or bits. For efficiency, representing documents as a sequence of bytes is the fastest way to retrieve and compute without further interpretation. Among the different interpretation models of documents, the vector space model with weights is the best choice and is applied in the system.

For each object, terms are extracted from its content. The objects categorized by Yam are the training set for classification. By learning terms from the training set of a class, terms with supports to the

class are used to represent the classification knowledge of the class. Based on a predefined class threshold θ_C, the terms of a class are divided into two types: a representative term, whose support is equal to or larger than θ_C and which is regarded as a keyword; and a non-representative term, whose support is less than θ_C.

For each class, the terms are also used to mine association rules between terms; these rules are named term associations. Based on the digraph constructed from term associations and term supports to a class, an inference model of terms to the class is formed that refines the semantics of a term by promoting non-representative terms to representative ones.

File Structure

In the system, the term-based inverted indexes of keywords and terms are stored in a relational database system (SQL Server 7.0) to represent the class and object knowledge. Given the terms of a query, relevant classes (indicated by CID) or objects (indicated by OID) can be retrieved efficiently.

Query Operation

A query is formulated as a sequence of terms separated by spaces, other delimiters, or Boolean operators. Similarity matching based on the vector space model is applied to retrieve relevant objects or classes. Applying the classification knowledge of a class (represented by a set of terms or keywords), two-phase search is used in the system. In the first phase, the class-level search, query terms are used to find qualified classes; those classes form a shrunken view of the ACIRD Lattice that narrows the search domain of the subsequent object-level search. In the second phase, if the user wants to refine the search within some class, the query terms are employed to retrieve relevant objects of that class. The domain of retrieved documents is restricted to the class, so the search domain is shrunk. By integrating qualified classes and matched objects associated with classes, a tree (or lattice) with probabilities (relevance scores) on the branches from classes to objects is presented to the user. Thus, the two-phase search mechanism shrinks the search domain and presents a hierarchical view to users instead of a ranked document list.

Hardware

Currently, the system is implemented on a Pentium II 266 machine with 256 MB SDRAM, running NT Server 4.0 and SQL Server 7.0.

In Figure 7-1, we summarize the overview of the system. First, the Document Collection and Class Lattice are constructed from Yam. Then, terms with supports are extracted from documents to generalize the Classification Knowledge. Finally, the knowledge, a set of terms, is refined to a subset of keywords, the Refined Classification Knowledge, based on mining Term Association Rules in each class. The two-phase search module then parses a Query into a term-based Query Representation that can be matched against the Refined Classification Knowledge and the Object Knowledge.

Figure 7-1. Two-phase query process in ACIRD (a Query is parsed into a term-based Query Representation, which is matched against the Refined Classification Knowledge derived from the Class Lattice via Term Association Rules, and against the Object Knowledge of the Document Collection stored in the database).

Document and Knowledge Representation

In this section, we define the terminology used herein and introduce the conceptual model and knowledge representation of ACIRD. An entity is denoted by a lower-case letter, and a set or series of entities by an upper-case letter; for example, c denotes a class and C represents a set of

classes. In the following, we describe the system entities, with their notations in parentheses, from the higher-level concepts to the lower-level ones.

ACIRD Lattice ($L_{ACIRD}(C, R)$) is the given class lattice that consists of a set of classes C as nodes and a set of relations R as edges connecting pairs of nodes in C. The parent node of an edge is a superset of its child node. The total number of classes is denoted by |L|.

Class (c) is a class node of $L_{ACIRD}$ that possesses the knowledge generalized from its subclasses and the objects directly in the class. The number of immediate subclasses and objects is |c|. A class can be a child of one or more classes.

Object (o) is an HTML document consisting of paragraphs (pg) enclosed by HTML tags. An object o belongs to one or several classes in $L_{ACIRD}(C, R)$.

Paragraph (pg) consists of a series of sentences (S), which in turn consist of terms. A paragraph pg is informative if it is enclosed by informative HTML tags, which are defined later.

Term (t) is a stemmed word, excluding stop words, extracted from the sentences of an informative paragraph. Each term has a support value for each object in which it appears. The support ($sup_{t,o}$) of t to o is calculated from the term frequency and the weights of HTML tags, and quantifies the importance of t to o.

Object Knowledge ($Know_o$) is a set of selected terms (T) with supports to the object. $Know_o$ can be represented by the Term Support Graph ($TSG(T, o, E)$), in which each directed edge in E from $t_i$ (in T) to o has the label $sup_{t_i,o}$. The number of extracted terms in $Know_o$ is denoted by $|Know_o|$.

Classification Knowledge of class c ($Know_c$) is a set of terms T, in which each term $t_i$ has a support value $sup_{t_i,c}$ to c. $Know_c$ is generalized from the $Know_o$ of the class's direct objects and the classification knowledge of its child classes. Similar to $Know_o$, $Know_c$ can be represented as a graph $TSG(T, c, E)$ in which each directed edge in E from $t_i$ (in T) to c is labeled with $sup_{t_i,c}$. The number of terms in $Know_c$ is denoted by $|Know_c|$.

For each class, mining of association rules is applied to find associations among the terms of $Know_c$; the mined rules are called term associations. For each pair of terms $t_i$ and $t_j$, there is

a corresponding confidence ($conf_{t_i \to t_j}$). A strongly connected graph, the Term Association Graph ($TAG(T, E)$), can be generated by taking the terms of T as nodes and the term associations as edges labeled with $conf_{t_i \to t_j}$.

For each class, the Term Semantics Network ($TSN(T, c, E)$) is constructed as the union of $TSG(T, c, E)$ and $TAG(T, E)$. TSN represents the semantics of the class and the relations among the terms in the class.

The Perfect Term Support (PTS) algorithm [55] is applied to promote the $sup_{t_i,c}$ labels of the edges in $TSN(T, c, E)$. The algorithm obtains an optimal path ($p^*_{t,c}$) from t to c, where $p^*_{t,c}$ is the path with the maximum value in the set of all possible paths ($P_{t,c}$) from t to c in $TSN(T, c, E)$. The value of a path $p_{t,c}$ is the product of the confidence values of its edges and the support to c of the term $t_z$ at the end of the path, i.e., $conf_{t \to t_j}\cdot conf_{t_j \to t_k}\cdots conf_{t_y \to t_z}\cdot sup_{t_z,c}$. The optimal support of t to c (denoted $sup^*_{t,c}$) is defined as the value of $p^*_{t,c}$.

A keyword (k) is a term that passes the Filtering Process, which filters out terms whose $sup^*_{t,c}$ is less than the specified threshold θ_C. For a keyword, its $sup^*_{t,c}$ is defined as the membership grade ($MG_{t,c}$) of t to c.

Applying PTS and the Filtering Process refines $Know_c$ into the Refined Classification Knowledge ($Know^*_c$). $Know^*_c$ forms the knowledge base employed by the two-phase search engine and the automatic classifier of ACIRD.

The left-hand side of Figure 7-2 illustrates the top-down view of these abstractions, and the right-hand side shows the systematic knowledge representation.

Figure 7-2. Conceptual model and systematic knowledge representation of ACIRD (left: the lattice constructed by Yam's manual categorization is decomposed level by level, from lattice to class, object, paragraph, sentence, and term; right: object knowledge Know_o (TSG) is generalized by classification learning and term association mining into Know_c (TSN), which the Perfect Term Support algorithm refines into Know*_c (TSN*)).

The Document Classification Learning

In this section, we describe the document classification learning of ACIRD in detail. ACIRD adopts supervised learning techniques and treats previously classified documents as the training objects. In the training phase, the objects categorized in the ACIRD Lattice are the training documents, and the classes associated with those objects in the hierarchy are their target classes. ACIRD applies machine learning techniques to learn the classification knowledge, as displayed in Figure 7-3. The testing phase, i.e., the document classifier, is described in the next chapter.

With the bottom-up approach, the learning processes are applied to each class of the ACIRD Lattice from the most specific classes to the most general ones [53]. Each class's training objects are preprocessed into weighted term vectors by the Preprocessing Process. The dimension of each vector is then reduced by the Feature Selection Process to lower the complexity of learning. Then, the Classification Knowledge Learner induces the classification knowledge of the class, $Know_c$, based on the class's training objects. For the most specific classes, $Know_c$ is generalized from the terms of all

objects ($Know_o$) in the class. $Know_c$ can be represented by the Term Support Graph (TSG). For classes other than the most specific ones, the learning process is the same except that the initial weighted term vectors originate from the class's subclasses and direct objects. A mining association algorithm is then applied to mine associations among the terms in the TSG; the obtained term associations can be represented by the Term Association Graph (TAG). Combining TSG and TAG yields the Term Semantic Network (TSN). TSN can be further optimized into TSN*, representing the refined classification knowledge $Know^*_c$ of the class. In each iteration of the refinement process, some terms may be promoted; as the promotion of some terms may in turn promote other terms, the promotion process is applied recursively until a stable state is reached.

Figure 7-3. The document classification learning of ACIRD (documents pass through the Preprocessing Process into term vectors and through the Feature Selection Process into object knowledge (TSG); the Classification Knowledge Learner produces classification knowledge (TSG), the Term Association Miner yields stable classification knowledge (TSN), and the Classification Knowledge Refiner outputs refined classification knowledge (TSN*)).

Preprocessing Process and Knowledge Representation

As described in the previous section, an object is decomposed into a set of paragraphs; a paragraph is a set of sentences, which corresponds to a set of terms. In the Preprocessing Process, an object is hierarchically decomposed and represented by $Know_o$, as shown in Figure 7-2. The Preprocessing Process consists of two parsers, HTML Parser and Term Parser. HTML Parser parses

an object into paragraphs and determines their weights by judging their associated HTML tags. Term Parser partitions the paragraphs into sentences and extracts terms from the sentences. Term Parser also calculates term supports using the weights assigned by HTML Parser and the term frequencies.

HTML Parser

An HTML document consists of paragraphs whose associated HTML tags [9] indicate their importance and provide meta-level information. Web developers highlight content using HTML tags such as the title or headings (Hn). In addition, the META tag allows developers to add extra information, such as CLASSIFICATION and KEYWORDS, to the document. The implications of tags should clearly be considered while indexing the documents. In ACIRD, these tags are classified into four types:

Informative. Paragraphs enclosed by tags such as CLASSIFICATION and KEYWORD in META, TITLE, Hn, B, I, and U are either meta-knowledge of the document or significant content presented to users. Thus, these tags have higher weights than the others.

Skippable. Tags such as BR and P do not affect the semantics of the document and are omitted.

Uninformative. Content enclosed by tags such as AREA, COL, SCRIPT, and COMMENT is invisible to users. Thus, these tags and their corresponding content are excluded.

Statistical. Content enclosed by tags such as !DOCTYPE, APPLET, OBJECT, SCRIPT, etc., is stored in the database for statistical purposes.

HTML Parser is implemented with two stacks: one for HTML tags and the other for paragraphs. The algorithm executes in one document scan; its computational complexity is $O(|Know_o|)$.

Term Parser

Term Parser partitions each paragraph into sentences, extracts terms from the sentences, and counts the term frequency (TF) of each term. Although designed to handle multi-lingual documents, ACIRD currently considers only English and Chinese (BIG5). The En-BIG5 Splitter first divides a sentence into two strings, English and BIG5. For an English string, it is easy to extract terms according to separators; each extracted term passes through stemming and a stoplist [74]. For a BIG5 string, a sentence must be segmented into meaningful multi-character terms. As there are no explicit separators for Chinese terms, Term Parser uses a collection of multi-character terms, i.e., the term base collected from Yam's historical queries, to match and extract meaningful terms. In addition, some rules may be needed to resolve possible conflicts while extracting candidate terms.

After a term t is extracted from an object o, the support value $sup_{t,o}$ is measured from TF and the HTML weight, as defined in (EQ 7.1). The value, normalized to the range [0, 1], indicates the importance of a term in representing the object:

$$sup'_{t_i,o} = \sum_{j:\ t_i \in T_j \text{ in } o} tf_{ij}\cdot w_{T_j}, \qquad sup_{t_i,o} = \frac{sup'_{t_i,o}}{\max_j\{sup'_{t_j,o}\}} \qquad (EQ\ 7.1)$$

where $tf_{ij}$ is the term frequency of $t_i$ in sentence j highlighted by tags $T_j$, $w_{T_j}$ is the maximum tag weight in $T_j$, and $sup_{t_i,o}$ is normalized to [0, 1].

Since a sentence (or a paragraph) may be embedded in more than one tag, the maximum weight among the tags is used to calculate the term support. In ACIRD, TF and the maximum tag weight are used to calculate the term support rather than the TFxIDF weighting approach. Inverted Document Frequency (IDF), designed to enhance the discriminating capability of high-frequency terms, is not critical in our hierarchical learning model and two-phase search discovery model. In ACIRD, a high-frequency term is considered to represent its class, and may be generalized into the classification knowledge of its parent class, rather than being used as a discriminator of objects in the class. For example, the weight of the term "www" may be decreased by IDF, yet it is an important term that can be generalized into the concept of a higher-level class.

Currently, Term Parser extracts Chinese terms using the longest-term-first heuristic to resolve ambiguity: of two candidate terms where one is part of the other, Term Parser chooses the longer as the term in the sentence. The pre-constructed term base is structured as a B-tree [6] to allow prompt access. In addition, heuristics for Chinese term segmentation handle the ambiguity between conflicting candidate terms. The complexity of term extraction is $O(n^2)$, where n is the length of the input sentence, which is approximately the number of extracted object terms $|Know_o|$. Including the linear time complexity of HTML Parser, the complexity of the Preprocessing Process is $O(|Know_o|^2)$.
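A minimal sketch of EQ 7.1 follows. The input format, a list of (terms, tag weight) pairs per sentence with the maximum enclosing tag weight precomputed, is our assumption, as are the sample weights.

```python
# A minimal sketch of EQ 7.1: raw support is term frequency times the
# (precomputed) maximum enclosing tag weight, normalized by the largest
# raw support in the object.
from collections import defaultdict

def term_supports(sentences):
    raw = defaultdict(float)
    for terms, tag_weight in sentences:
        for t in set(terms):
            raw[t] += terms.count(t) * tag_weight   # tf_ij * w_Tj
    top = max(raw.values())
    return {t: v / top for t, v in raw.items()}     # normalize to [0, 1]

doc = [(["acird", "search"], 3.0),                  # e.g. TITLE: weight 3 (assumed)
       (["search", "engine", "search"], 1.0)]       # plain text: weight 1
print(term_supports(doc))
# {'acird': 0.6, 'search': 1.0, 'engine': 0.2}
```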

Feature Selection Process

After HTML Parser and Term Parser have executed, the obtained object knowledge can be represented as a vector of attribute-value pairs, $o = \{(t_1, sup_{t_1,o}), (t_2, sup_{t_2,o}), \ldots, (t_n, sup_{t_n,o})\}$. Theoretically, the induction process could be applied immediately to learn the classification knowledge from the object knowledge. In practice, the complexity of a learning process increases exponentially with the vector size. The Feature Selection Process is designed to decrease the vector size and thereby reduce the complexity of the learning process. For an object, a predefined support threshold θ_s is used to discard less important terms; the remaining terms represent the object knowledge $Know_o$. In this manner, the problem of feature selection is shifted to the selection of θ_s.

Figure 7-4. The distribution of term supports of the training data (number of terms per support range, for the ranges [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0), and [1.0, 1.0]).

A higher θ_s discards more terms, so the remaining terms may not suffice to represent $Know_o$; in contrast, a low θ_s has only a slight effect in the feature selection process. In ACIRD, the selection of θ_s is adapted to empirical experiments. For instance, analysis of the distribution of term supports in the training data, shown in Figure 7-4, reveals that more than half of the term supports fall in the range [0, 0.2). If we choose θ_s = 0.2 to filter out terms with low supports, the average number of terms in an object is reduced from … to …. The computational cost of feature selection in a class is clearly dominated by grouping the terms from all objects.

Learning Classification Knowledge

The Classification Knowledge Learner first generalizes the object knowledge $Know_o$ into the knowledge of the most specific classes $Know_c$ by induction learning, and then generalizes the obtained class knowledge to their super classes. The induction process is applied from the most specific to the most general classes.

In conventional learning methods, the feature values of training objects are either TRUE or FALSE; i.e., the learning algorithms generalize a term $t_i$ to a class c based on the class objects containing $t_i$. Restated, this assumes all terms are equally important, so the degrees of term support to the object or class are neglected. To amend this shortcoming, we define the support of t to c, denoted $sup_{t,c}$, in (EQ 7.2). As in (EQ 7.1), $sup_{t,c}$ is normalized to [0, 1]. The equation is applied to the most specific classes:

$$sup'_{t_i,c} = \sum_{o_j \in c} sup_{t_i,o_j}, \qquad sup_{t_i,c} = \frac{sup'_{t_i,c}}{\max_j\{sup'_{t_j,c}\}} \qquad (EQ\ 7.2)$$

where $sup_{t_i,o_j}$ is the term support of $t_i$ to object $o_j$ in c, and $sup_{t_i,c}$ is normalized to [0, 1].

The support of a term to more general classes is obtained from the term supports of the class's direct objects and child classes, as shown in (EQ 7.3). Note that the number of objects in a child class affects the contribution of that class to its super class:

$$sup'_{t_i,c} = \sum_{j} sup_{t_i,c_j}\cdot|c_j| + \sum_{j} sup_{t_i,o_j}, \qquad sup_{t_i,c} = \frac{sup'_{t_i,c}}{\max_j\{sup'_{t_j,c}\}} \qquad (EQ\ 7.3)$$

where $c_j$ is a child class of class c, $|c_j|$ is the number of objects in $c_j$, $o_j$ is a direct object of c, and $sup_{t_i,c}$ is normalized to [0, 1].

The algorithm of the Classification Knowledge Learner is as follows, with a sketch after this list.

1. From the most specific classes to the most general classes, perform the preprocessing and feature selection processes on the objects.
2. For each most specific class, calculate the term supports to the class by (EQ 7.2). The computational complexity is that of grouping the terms from all objects of the class, i.e., the sorting complexity $O(|Know_c|\log|Know_c|)$.
3. For all other (general) classes, calculate the term supports to the class by (EQ 7.3). Regarding child classes as direct objects, the complexity is again the cost of grouping the terms of child classes and objects, $O(|Know_c|\log|Know_c|)$. Thus, the complexity of the specific-to-general learning process is $O(|L|\cdot|Know_c|\log|Know_c|)$.
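The sketch below illustrates EQ 7.2 and EQ 7.3 under an assumed layout in which an object is a term-support dict (as produced by EQ 7.1) and a child class is a (supports, object count) pair; the names and sample values are ours.

```python
# A sketch of the support generalization of EQ 7.2 (direct objects) and
# EQ 7.3 (child classes weighted by their object counts |c_j|).
from collections import defaultdict

def normalize(raw):
    top = max(raw.values())
    return {t: v / top for t, v in raw.items()}

def class_supports(direct_objects, children=()):
    """children: list of (child_supports, child_object_count) pairs."""
    raw = defaultdict(float)
    for obj in direct_objects:                  # EQ 7.2 term
        for t, s in obj.items():
            raw[t] += s
    for child, size in children:                # EQ 7.3: weight by |c_j|
        for t, s in child.items():
            raw[t] += s * size
    return normalize(raw)

leaf = class_supports([{"art": 1.0}, {"art": 0.5, "gallery": 0.8}])
root = class_supports([], [(leaf, 2)])
print(leaf, root)
```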

Due to the diversity of Internet documents, the number of terms in a class is large and their term supports are generally low. In Figure 7-5, each line represents the distribution of term supports of one class. From the learning results, there are about 472 terms per class on average, and the supports are low: most term supports lie in the low-support range (e.g., [0, 0.3)). Therefore, a feature selection process is necessary to remove the low-support terms from $Know_c$. Given a threshold θ = 0.1, on average 47 terms per class remain after filtering; 24 terms remain for θ = 0.2, and 20 for θ = 0.3. However, a filtering process may remove meaningful but low-support terms that are aliases of, or closely related to, high-support terms. To alleviate this problem, we propose a method using a mining association technique to locate the term associations in a class, and then apply the result to raise the supports of the otherwise filtered-out terms.

Figure 7-5. The distribution of term supports of all the classes of the training data.

Mining Term Association

The feature selection process at the class level is more sophisticated than at the object level. First, $Know_c$ is generally larger than $Know_o$, since $Know_c$ is generalized from many pieces of classification

knowledge and object knowledge. Second, the terms in an object are more consistent, in both semantics and representation, than the terms in a class. Since an object is typically written by one web developer, a simple filtering method using a threshold value performs well. In contrast, the objects in a class are collected from many web servers and written by a variety of web developers, which adds diversity to the term wordings and usage; that is, the generalized terms in a class are diverse. Directly filtering $Know_c$ with a threshold value θ may remove many representative terms that happen to have low support values. With only a few concepts remaining, the recall rate on $Know_c$ is likely to be low. Therefore, the system must identify and consolidate terms related to the important concepts before applying the filtering process. In ACIRD, we apply a mining term association technique and propose the Perfect Term Support algorithm [55] to promote low-support terms to representative ones.

Following the definition of association rules in [1], the association rule problem is defined as follows. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items and D be a set of transactions (i.e., the transaction database), in which each transaction T is a set of items such that $T \subseteq I$. An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$, and $X \cap Y = \emptyset$. The rule $X \Rightarrow Y$ holds in the transaction set D with confidence c if c% of the transactions that contain X also contain Y. The rule $X \Rightarrow Y$ has support s in D if s% of the transactions contain $X \cup Y$. Herein, this definition is followed, and the problem of mining term associations is mapped onto the specification of mining association rules. Two critical issues must be addressed before implementing the data mining process: (a) the granularity (i.e., the transaction of [1]) used to mine associations, and (b) the domain used to generate association rules, which corresponds to the transaction database defined in [1].

Granularity of mining associations

In [38], the authors propose restricting the granularity of association generation to 3-10 sentences per paragraph in order to reduce the computational complexity. This restriction is impractical for web documents, since a paragraph may have hundreds of meaningful sentences. Moreover, the importance of a sentence in a web document depends on its associated HTML tags, not on its position. Therefore, the granularity of mining term associations is the whole informative HTML paragraph.

Domain of generating association rules

As Internet documents are published by diverse web developers, it is common for a term to carry different meanings, its semantics depending on both the developer and the context. For example, when a document mentions "apple computer" in a paragraph, the semantics of "apple" is unlikely to be the fruit; most likely the phrase indicates Macintosh, in the class Computer. Similarly, "apple" and "pie" imply the fruit in the class Food. This observation supports restricting the domain of mining term associations to the boundary of a class. On the other hand, it is also common to observe one meaning with many forms of representation, making such associations promising candidates for mining. For these reasons, ACIRD applies the mining association rules process to mine term associations under the following mapping: a term corresponds to an item; an informative HTML paragraph corresponds to a transaction; a class corresponds to the transaction database.

Concentrating on the objects of one class instead of all classes also has the merit of a small database size, as the complexity of mining associations increases exponentially with the size of the database. When the database is not large, a simple mining association algorithm such as Apriori [2] can be applied efficiently. In this dissertation, we consider only one-to-one term associations. The cost of mining the term associations of a class is $O(|Know_c|^2)$.

We define the confidence (conf) and support (sup) of a term association $t_i \to t_j$ as follows:

$$conf_{t_i \to t_j} = \frac{df(t_i \wedge t_j)}{df(t_i)}, \qquad sup_{t_i \to t_j} = \frac{df(t_i \wedge t_j)}{|D_c|} \qquad (EQ\ 7.4)$$

where $df(t_i)$ stands for the number of documents that contain term $t_i$, $df(t_i \wedge t_j)$ indicates the number of documents that contain both $t_i$ and $t_j$, and $|D_c|$ stands for the number of documents in class c.

Confidence is taken as the degree of association between terms and is employed by the Classification Knowledge Refiner to refine $Know_c$ into $Know^*_c$. Support is the percentage of transactions supporting the associated rule, and is taken as a metric of the correctness of the rule.
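EQ 7.4 can be sketched as follows, with each document (or, at the finer granularity described above, each informative paragraph) represented as a set of terms; the sample class and values are hypothetical.

```python
# A minimal sketch of EQ 7.4: confidence and support of a term
# association t_i -> t_j within one class's document set.
def term_association(docs, ti, tj):
    df_i = sum(1 for d in docs if ti in d)
    df_ij = sum(1 for d in docs if ti in d and tj in d)
    conf = df_ij / df_i if df_i else 0.0        # conf(ti -> tj)
    sup = df_ij / len(docs)                     # sup(ti -> tj)
    return conf, sup

art_class = [{"exhibition", "art"}, {"art", "gallery"},
             {"exhibition", "art", "museum"}, {"painting"}]
print(term_association(art_class, "exhibition", "art"))  # (1.0, 0.5)
```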

For example, $Know_c$ of the class Art contains the term supports $sup_{art,Art} = 1$ and $sup_{exhibition,Art} = 0.13$; the term "exhibition" would likely be filtered out of $Know_c$ for its low support value. Assume that a rule with at least 10% support is useful. After mining the term associations of the class Art, ACIRD identifies the term association exhibition → art, with sufficient support and with $conf_{exhibition \to art} = 0.826$. Following the definition of $sup^*$ in the previous section, $sup^*_{exhibition,Art}$ is increased from 0.13 to 0.826 (i.e., $sup^*_{exhibition,Art} = conf_{exhibition \to art}\cdot sup_{art,Art} = 0.826 \times 1 = 0.826$). The inference process thus promotes the support value of "exhibition" enough to pass the filter.

After mining the term associations of a class, the TSN is obtained, as shown in Figure 7-6. TSG denotes the term supports of a class, TAG represents the term associations in the class, and TSN is the union of TSG and TAG, i.e., $TSN(T, c, E) = TSG(T, c, E) \cup TAG(T, E)$.

Figure 7-6. Construction of the Term Semantic Network (terms are extracted from documents; learning classification knowledge yields the Term Support Graph (TSG) from terms to the class, mining association rules yields the Term Association Graph (TAG) among terms, and their union forms the Term Semantic Network (TSN)).

Refinement of Classification Knowledge

As term associations are asymmetric, both TAG and TSN form strongly connected digraphs. To determine the $sup^*$ of a term, all possible paths from the term to the class must be considered. For a TSN, the number of all possible paths from a term to the class is $\sum_{i=1}^{n-1}P^{n-1}_i$, where n is the number of terms and $P^{n-1}_i = \frac{(n-1)!}{(n-1-i)!}$. In ACIRD, the average number of terms in a class is 472, making an exhaustive search infeasible. Although the support value of a term association can be employed as a filter to remove rarely used terms, the computation remains expensive even for a small number of terms; for instance, a class with ten terms creates about $10^6$ possible paths. Therefore, an efficient algorithm is deemed necessary.

Herein, we present a novel PTS algorithm that locates $sup^*_{t_i,c}$ for all terms in a class in polynomial time. According to the definition, $sup^*_{t,c} = \max\{conf_{t \to t_j}\cdot conf_{t_j \to t_k}\cdots conf_{t_y \to t_z}\cdot sup_{t_z,c}\}$, and since conf and sup range over [0, 1], more edges in the path $p(t, t_j, t_k, \ldots, t_y, t_z, c)$ imply a smaller product. Restated, any sub-path of an optimal path must itself be an optimal path. The proposed greedy heuristics and algorithm follow.

Heuristics: Divide the terms in the TSN into two groups, T and T*. Initially, T contains all the terms and T* is empty. Each time a term t with the maximum $sup^*_{t,c}$ is found in T, t is moved from T to T*. The heuristic is applied repeatedly until T is empty. The proof can be found in the Appendix.

Perfect Term Support (PTS) Algorithm

1. [Initial state: this step initializes every $sup^*_{t_j,c}$ and partitions the terms into two groups: T* contains the term with maximum $sup^*$ and T contains all the others.]
   Let $sup^*_{t_j,c} \leftarrow sup_{t_j,c}$ for all $t_j$;
   Let $sup^*_{t_{last},c} \leftarrow \max_i\{sup^*_{t_i,c}\}$;
   Let $T^* \leftarrow \{(t_{last}, sup^*_{t_{last},c})\}$; $T \leftarrow T - \{t_{last}\}$;

2. [This step updates every $sup^*_{t_j,c}$ in T, if necessary; $t_{last}$ denotes the latest term added to T*.]
   If T is not empty, then for each $t_j \in T$ such that edge($t_j \to t_{last}$) ∈ E:
   if $conf_{t_j \to t_{last}}\cdot sup^*_{t_{last},c} > sup^*_{t_j,c}$, then $sup^*_{t_j,c} \leftarrow conf_{t_j \to t_{last}}\cdot sup^*_{t_{last},c}$;
   continue with Steps 2 and 3. Otherwise, stop.

3. [This step locates the term with maximum $sup^*_{t,c}$ in T and inserts it into T*.]
   Let $t_{last} \leftarrow t_k$, where $t_k \in T$ and $sup^*_{t_k,c} = \max\{sup^*_{t_j,c},\ t_j \in T\}$;
   $T \leftarrow T - \{t_{last}\}$; $T^* \leftarrow T^* + \{t_{last}\}$;
   Output ($t_{last}$, $sup^*_{t_{last},c}$).

In [55], we have proved that the PTS algorithm always obtains the optimal solution, with computational complexity $O(|Know_c|^2)$. PTS can efficiently promote non-representative terms by exploring their associations with representative terms. Figure 7-7 illustrates the effect of PTS: on the left-hand side, there are four non-representative terms in the TSN; after refinement by PTS, on the right-hand side, three of them are promoted to representative terms through their associations with the representative term. All other non-representative terms and their corresponding associations are eliminated to reduce the learning complexity.

Figure 7-7. PTS refinement on a TSN (left: the original TSN with term supports and membership grades; right: the optimized TSN, in which promoted representative terms are retained and the remaining non-representative terms and associations are removed).
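The PTS iteration can be sketched as a Dijkstra-style pass that maximizes products instead of minimizing sums (valid because every confidence and support factor lies in [0, 1], so extending a path never increases its value). The input maps are assumptions about representation, and the sample values replay the exhibition → art example.

```python
# A sketch of PTS: `sup` maps term -> sup(t, c), and `conf` maps
# (ti, tj) -> conf(ti -> tj). Returns sup*(t, c) for every term.
def perfect_term_support(sup, conf):
    best = dict(sup)                       # sup*(t, c), initialized to sup(t, c)
    done = {}                              # the finished set T*
    while best:
        # move the term with maximal sup* from T into T*
        t_last = max(best, key=best.get)
        done[t_last] = best.pop(t_last)
        # relax every edge t_j -> t_last for terms still in T
        for t_j in best:
            cand = conf.get((t_j, t_last), 0.0) * done[t_last]
            if cand > best[t_j]:
                best[t_j] = cand
    return done

sup = {"art": 1.0, "exhibition": 0.13}
conf = {("exhibition", "art"): 0.826}
print(perfect_term_support(sup, conf))
# {'art': 1.0, 'exhibition': 0.826}
```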

Effects of Knowledge Refinement Process

An experiment is designed to compare Know and Know* against Know^U, whose class keywords were selected by ten human experts. For every class, the ten experts select terms they consider representative of the class from a given set of Internet documents. The selected terms are sorted by how many times they were selected and are used as the basis for judging the quality of the learned knowledge. Let |Know| denote the number of keywords of Know; precision and recall are defined as follows.

Precision of Know = |Know ∩ Know^U| / |Know|.
Recall of Know = |Know ∩ Know^U| / |Know^U|. (EQ 7.5)

Experimental results indicate that the Knowledge Refining Process indeed refines the contents of the classification knowledge. The trade-off between precision and recall under different feature selection criteria is demonstrated as well. Two types of criteria are used to evaluate the experimental outcomes.

Top n. All the sup*_{t,c} are sorted in descending order. The first n terms are selected as the keywords of Know*.

Threshold = θ. This criterion selects terms with sup*_{t,c} ≥ θ.

Table 7-1 summarizes the experimental results. Before applying PTS, the lowest precision is 0.76, owing to the high selection standards; however, recall is low for the same reason. This observation implies that the Induction Process does not learn the implicit associations among terms, although it generalizes the knowledge of objects to classes. In contrast to the case without PTS, PTS increases both precision and recall under the Top n criterion, as it promotes important but non-representative terms at the cost of removing less important ones (the Top n criterion selects a fixed number of terms). Under the Threshold = θ criterion, PTS dramatically increases recall while decreasing precision, since it increases the number of keywords when promoting terms. The experimental results confirm that the Induction Process and the Knowledge Refining Process discover the hidden semantics among terms. Our results further demonstrate that an acceptable compromise between precision and recall can be achieved with a carefully chosen selection criterion.
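EQ 7.5 translates directly into set operations. A minimal sketch, assuming the keyword sets are available as Python sets (the function name is illustrative):

    def precision_recall(know, know_u):
        # know:   keywords of the learned knowledge (a set of terms)
        # know_u: keywords selected by the human experts (a set of terms)
        common = know & know_u
        precision = len(common) / len(know) if know else 0.0
        recall = len(common) / len(know_u) if know_u else 0.0
        return precision, recall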

Since each component of ACIRD can be executed in polynomial time and the knowledge refinement can be achieved in a finite number of iterations, the complexity of ACIRD is also polynomial.

Table 7-1. The effectiveness of the PTS algorithm on classification knowledge.

                        Before PTS Algorithm        After PTS Algorithm
Selection Criterion     Precision     Recall        Precision     Recall
Top …                   …             …             …             …
Top …                   …             …             …             …
Threshold = …           …             …             …             …
Threshold = …           …             …             …             …

The Retrieval System

Search Engine

By sharing components of the Document Classification Learning, the search engine of ACIRD is produced by adding two modules, the Web Crawler and the Fulltext Index Engine, as shown in Figure 7-8.

Figure 7-8. Additional modules to implement the Chinese search engine. (The figure shows the Web Crawler feeding Web documents to the Preprocessing Process and the Fulltext Index Engine; term vectors pass through the Feature Selection Process into the object knowledge (TSG), which the Term Association Miner, Classification Knowledge Learner, and Classification Knowledge Refiner turn into stable classification knowledge (TSN = TSG + TAG) and refined classification knowledge (TSN*), stored in the DB.)

Web Crawler

In fact, the web crawler is also used by the document classification learning process: it collects Web pages as the training documents. There are four modules, based on the introduction in Chapter 3.

HTTP module implements HTTP on top of network sockets. Currently, the module provides HTTP HEAD and GET and supports protocol versions 1.0 and 1.1.

HTML Parser supports most HTML tags and is based on the same HTML Parser as the document classification learning process. It also extracts links from HTML pages for crawling deeper into the Web.

Crawling scheduler uses a breadth-first crawling approach. By recording the latest crawling time, the latest crawling status, and the crawling history, the Web Crawler avoids visiting pages too frequently by predicting their modification period from those data.

Consistency checker detects redundancy before pushing pages into the index engine that follows. Page redundancy is determined by the uniqueness of the page's URL (scheme + host + port number + path) and the MD5 checksum of the page content.

Fulltext Index Engine

Most search engines apply an inverted index in their index engines for its efficiency, and some employ compression techniques to reduce the index size. At present, Excite uses a 50 GB index to retrieve about 50 million pages. In contrast to term indexing for English, there is no trivial delimiter in Chinese, so term-based inverted indexing cannot be applied directly to Chinese documents. Based on the term parsing process used in document classification, extracted terms can be used to build the inverted index. However, many new terms may not yet be collected in the term base, so the term parser cannot extract them. Hence, the index engine has to index all possible Chinese string patterns. In the system, we partition each Chinese sentence into fixed-length string patterns. These patterns are regarded as terms and stored in a dictionary list. Given a Chinese query string, the relevant patterns are retrieved from the dictionary, and then the relevant pages are retrieved. A sketch of this pattern generation follows.
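A minimal sketch of the fixed-length fallback, with ACIRD's six-character window as a default parameter. The text says the sentence is partitioned, so this sketch emits consecutive chunks; overlapping windows would be a plausible alternative design, and the function name is illustrative.

    def fixed_length_patterns(sentence, length=6):
        # Partition a Chinese sentence into consecutive fixed-length
        # string patterns (six BIG5 characters in ACIRD); the final
        # chunk keeps whatever characters remain.
        return [sentence[i:i + length] for i in range(0, len(sentence), length)]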

Ranking Algorithm

We use the vector space model (the cosine value) for relevance ranking of retrieved pages.

Two-Phase Search

Most search engines answer a query with a list of ranked documents. Users usually submit queries of one or two terms, and the results run to thousands of documents. Navigating so many documents to satisfy an information need is time-consuming and inconvenient. ACIRD provides a two-phase search service that allows users to perform both class-level search and object-level search. Using the two-phase search, ACIRD first applies class-level search to associate the user's information needs with some classes in the ACIRD Lattice. The user then focuses on those classes, navigating and retrieving the needed documents by performing object-level search within a class. This procedure can be repeated until the user obtains the desired information. Ensuring the effectiveness of the two-phase search requires that the terms of user queries occur in the Know* of some classes, so that the class-level search on Know* can return the relevant classes. The conjecture that most query terms are in Know* is investigated by the following analysis of the user query log.

Analysis on User Query Log

Herein, the query behavior of Internet users is analyzed based on Yam's query log collected in October. Terms are extracted from the queries, and their frequencies are counted as well. There are 9,644 distinct Chinese terms in the user log, denoted by the set CT_log, with 648,006 references in total. By regarding each term as an information need and its frequency as the reference count, the information needs that users are interested in can be discovered. If a keyword of Know* occurs in CT_log, it is assigned the same reference count as the term in CT_log; otherwise, its reference count is 0. The reference count is used to measure the recall rate. The number of retained keywords of Know* measures the index rate, i.e., the coverage rate of Know* over CT_log. Taking the references and query terms of the query log as the baseline, the recall rate and index rate of each test are defined as:

Recall rate = total reference counts of Know* / total references of CT_log.

Index rate = number of keywords in Know* / number of query terms in CT_log.

To adjust the index rate of Know*, each keyword selection uses a different threshold, denoted Th = x.x. With different thresholds, a series of examinations of the reference counts of keywords in Know* is performed to verify the above conjecture. Table 7-2 lists the reference counts (recall rate) and the number of keywords (index rate) of each test. From the table, in the case Th = 0 (i.e., no keyword is eliminated), the indexed keywords cover 96.92% of the information needs, with an index about twice the size of the query-term set. When Th = 0.5, the remaining keywords cover 69.89% of the information needs with an index rate of 30.91%. With a sufficiently high recall rate, the two-phase search is capable of shrinking the search domain to a reduced class lattice for efficient and effective search.

Table 7-2. Total reference counts (recall rate) vs. the number of keywords (index rate).

Baseline: Query Terms in User Log     648,006 (100%)        9,644 (100%)

Filter Threshold for Know*    Total Referenced Needs    Number of Keywords
Th = 0.0                      628,065 (96.92%)          18,076 (187.43%)
Th = 0.1                      477,906 (73.75%)          3,775 (39.14%)
Th = 0.2                      470,277 (72.57%)          3,446 (35.73%)
Th = 0.3                      465,396 (71.82%)          3,260 (33.8%)
Th = 0.4                      458,468 (70.75%)          3,090 (32.04%)
Th = 0.5                      452,897 (69.89%)          2,981 (30.91%)
Th = 0.6                      440,661 (68.00%)          2,723 (28.24%)
Th = 0.7                      421,649 (65.07%)          2,498 (25.90%)
Th = 0.8                      404,615 (62.44%)          2,277 (23.61%)
Th = 0.9                      389,439 (60.10%)          2,015 (20.89%)
Th = 1.0                      378,249 (58.37%)          1,905 (19.75%)
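The two rates translate directly into counting. A minimal sketch, assuming the query log is available as a term-to-count mapping and Know* as a set of keywords; both names are illustrative.

    def recall_and_index_rate(know_star, ct_log):
        # ct_log:    {query term: reference count} from the user query log
        # know_star: the keyword set of Know* retained under some threshold Th
        total_refs = sum(ct_log.values())
        covered = sum(cnt for term, cnt in ct_log.items() if term in know_star)
        recall_rate = covered / total_refs
        index_rate = len(know_star) / len(ct_log)
        return recall_rate, index_rate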

Two-Phase Search Method

Figure 7-9. Processing flow of Two-Phase Search. (The flow parses the query string into a sequence of query terms, performs a class-level search returning ranked classes that match the query, and then an object-level search within the class selected by the user; subclass search and a conventional search over all objects serve as fallbacks when the needed information is not found in a class.)

The above analysis points out that performing a class-level search in the first phase is likely to lose little information. Besides class-level search over a structured presentation, ACIRD also provides conventional search approaches, including object-level and all-objects search, as an escape for users. After briefly reviewing the findings of a class-level search, the user can navigate down or up the class lattice or choose object-level search on a particular class. Figure 7-9 illustrates the block diagram of the two-phase search. The operations are described below, and a control-flow sketch follows them.

Process query string: parse the query string into a sequence of terms (keywords).

Perform a class-level search: retrieve the classes associated with the query terms, calculate the relevance scores, and sort the classes by score in descending order. Generate and present the result in HTML format.

Execute an object-level search in a class: retrieve the objects in the designated class associated with the query terms, calculate each object's relevance score, and sort the objects by score in descending order. Generate and present the findings in HTML format.

Search all objects (the user can choose the conventional search method to search for all associated objects): retrieve all objects related to the query terms, calculate each object's relevance score, and sort the objects by score in descending order. Present the findings in HTML format.
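A control-flow sketch of this procedure, with the three search back ends passed in as callables. The names are placeholders rather than ACIRD's interfaces, and the user's interactive class selection is stood in by taking the top-ranked class.

    def two_phase_search(query, parse, class_search, object_search, search_all):
        terms = parse(query)                  # process the query string
        classes = class_search(terms)         # phase 1: ranked matching classes
        if not classes:
            return search_all(terms)          # escape: conventional search
        chosen = classes[0]                   # stand-in for the user's class choice
        found = object_search(terms, chosen)  # phase 2: search inside the class
        return found if found else search_all(terms)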

8. The Implementation of ACIRD

HTML and the Web can be considered the most popular publishing medium. However, there is no standardized publishing process, so pages can contain invalid markup or bad links, or can be poorly written by public HTML editors. Additionally, the Internet connects more than 200 countries with different languages. Most languages resemble English in having small alphabets, but some, such as Chinese and Japanese Kanji, use very large alphabets. Most search engines index only pages written in their local language. Unfortunately, many pages are written without language information in the META tag CHARSET, making it difficult to identify the language of a page without a CHARSET specification. In this chapter, we describe the implementation of ACIRD, which includes the search engine, the document classification system, and the two-phase search engine.

The Search Engine

The following modules are implemented to carry out the search engine service.

The Crawling Modules

The implementation of an HTTP module is tedious work when following the hypertext transfer protocol; we omit this detail in this dissertation.

URL Pruner

We design the following mechanism to define the Internet boundary that the search engine crawls.

Positive hosts specify sites within the crawling range; for example, a listed host is included in the range, and pages under the host (path) will be crawled. This rule has the highest priority.

Negative hosts indicate sites excluded by the crawler, for example a search engine whose search results should not be crawled. This rule has the second highest priority.

Domain list restricts the crawling range to the domains specified in the list, e.g., .tw, .hinet.net, .cht.net, etc. Using a domain list is a convenient way to include large ranges of the Internet. It has the lowest priority.

URL Unifier

This is the first level of filtering out duplicate Web pages. A URL is composed of scheme, host, port number, and path. To avoid adding redundant data and further visiting redundant pages, URL uniqueness is checked. This property is enforced by the unique constraint of the database.

MD5 Unifier

Many pages present the same content under different URLs because of mirror sites, default pages of web servers, redirections to the same error handler, etc. A simple hash (MD5) of each page's content is computed while the page is being crawled. This property is likewise enforced by the unique constraint of the database. Together, the URL and MD5 unifiers guarantee that no duplicate pages are crawled and retrieved by the search engine. According to the statistics of our search engine (measured on May 8), there are 3,012,295 crawled pages with 243,036 duplicates (redundant URLs and contents), or about 8% duplicate pages. Removing content redundancy improves not only the efficiency of indexing but also the effectiveness of retrieval. A sketch of the pruning and uniqueness checks is given at the end of this subsection.

Document Processing Modules

The HTML page is not well structured; it is only semi-structured. Without an obvious structure, search engines have difficulty identifying the major content of a page. For example, dot-com pages carry one or more advertising banners, yet the code block for an ad banner is also indexed by search engines; this information is just noise. Moreover, many pages are repeated or very similar, being duplicate copies, mirrors, or default pages of Web servers. Approximately 30% of pages are duplicates [13], and the semantic redundancy is even larger.
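A minimal sketch of the pruning priorities and the two uniqueness keys described above, assuming the positive and negative hosts are given as URL prefix lists; the helper names are illustrative, not ACIRD's code.

    import hashlib
    from urllib.parse import urlsplit

    def in_crawl_range(url, positive_prefixes, negative_prefixes, domain_list):
        # Priority 1: positive hosts (paths) are always crawled.
        if any(url.startswith(p) for p in positive_prefixes):
            return True
        # Priority 2: negative hosts are always excluded.
        if any(url.startswith(p) for p in negative_prefixes):
            return False
        # Priority 3: otherwise the host must fall under a listed domain.
        host = urlsplit(url).hostname or ""
        return any(host.endswith(d) for d in domain_list)

    def page_keys(url, content):
        # The two uniqueness keys: the URL reduced to scheme + host +
        # port + path, and the MD5 checksum of the page content (bytes).
        parts = urlsplit(url)
        norm = (parts.scheme, parts.hostname, parts.port, parts.path)
        return norm, hashlib.md5(content).hexdigest()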

HTML Parser

The HTML Parser is implemented with two stacks: one for HTML tags and the other for the paragraphs embedded in tags. It pushes a paragraph onto the stack when it encounters a begin tag and pops it when it encounters the matching end tag. The complexity is the cost of one document scan. Each paragraph is processed by the following two processes.

Language Detector

As noted previously, many pages are published without CHARSET information or with wrong CHARSET information, so the language used cannot be determined. If a search engine is to serve Chinese BIG5 and English pages, the document processing module has to detect the languages and determine whether a page should be included in the database. The idea of language detection is based on usage statistics for the BIG5 character set, following the 80/20 rule: 80% of the content uses only 20% of the characters. By extracting the BIG5 characters of a sufficient number of BIG5 documents, we can determine the 20% most frequently used BIG5 characters; this is the popular BIG5 set. If a document is written in BIG5, more than half of its characters, excluding English characters, should fall within the popular BIG5 set. The same concept extends easily to a detector for any language, as long as its corresponding popular character set is available.

En-BIG5 Splitter

It is easy to split English and BIG5 strings based on their corresponding character sets.

Term Parser & Extractor

A Chinese term parser is implemented based on a term base and the heuristic that the longest term is selected first (sketched below). The term base originated from the query log of Yam's search engine, which currently serves about one million queries per day. Terms extracted by the Chinese term parser can be regarded as keywords, since the term base collects popular terms. For other non-popular or rare terms, the search engine indexes fixed-length string patterns into the term dictionary to support fulltext search. We call this module a fulltext term extractor.
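A minimal sketch of the greedy longest-term-first heuristic, assuming the term base is a set of strings and the lookahead is bounded; unmatched characters fall through as single characters for the fulltext extractor to handle. The names and the lookahead bound are assumptions.

    def parse_terms(text, term_base, max_len=6):
        # At each position, take the longest substring found in the
        # term base; otherwise emit a single character and move on.
        i, terms = 0, []
        while i < len(text):
            for l in range(min(max_len, len(text) - i), 0, -1):
                if l == 1 or text[i:i + l] in term_base:
                    terms.append(text[i:i + l])
                    i += l
                    break
        return terms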

Currently, the fixed pattern length is six Chinese BIG5 characters (12 bytes). Obviously, the term dictionary built by the extractor is a superset of the term base used by the Chinese term parser. Term weighting is based on the TFxIDF weighting strategy.

Indexer

The Indexer stores term identifiers (TID) and object identifiers (OID), together with their associated weights, in the database. Each row of the database table is: Index(TID, OID, weight). A centralized architecture is applied to implement the Indexer.

Searcher

By parsing the query into TIDs through the same processes as the Term Parser, the Searcher efficiently retrieves the relevant OIDs and weights. It then calculates the relevance score between the query and the OIDs according to the similarity function, which is the cosine measure of their two representative vectors. The score is used to sort the ranking when retrieving page information stored in the database.

Figure 8-1. The architecture of the search engine. (The Web documents flow through the HTTP Module, URL Pruner, URL Unifier, and MD5 Unifier into the positive database (+DB); the HTML Parser, Language Detector, En-BIG5 Splitter, and Term Parser & Extractor feed the Indexer; the Searcher answers queries, and rejected pages go to the negative database (-DB).)

The architecture of the search engine is shown in Figure 8-1. We partition the database into a positive and a negative database. Pages are moved to the negative database when they violate the rules defined in the domain knowledge, for example pages that are outside the defined Internet range, redundant, or not written in the local language.
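A sketch of the Searcher's scoring over Index(TID, OID, weight) rows, assuming binary query weights and object norms precomputed at index time; both assumptions go beyond what the text specifies, and the names are illustrative.

    from collections import defaultdict
    from math import sqrt

    def search(query_tids, index_rows, object_norms):
        # index_rows:   iterable of (TID, OID, weight) rows from the Index table
        # object_norms: {OID: vector norm}, assumed precomputed
        query_tids = set(query_tids)
        scores = defaultdict(float)
        for tid, oid, weight in index_rows:
            if tid in query_tids:
                scores[oid] += weight            # accumulate the dot product
        q_norm = sqrt(len(query_tids))           # query weights taken as 1
        ranked = [(oid, s / (q_norm * object_norms[oid])) for oid, s in scores.items()]
        return sorted(ranked, key=lambda r: r[1], reverse=True)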

Document Classification Learning

In this section, we describe the testing phase of the learning process. Based on the learned knowledge, the ACIRD Classifier automatically categorizes newly collected Internet objects. For each object, the classifier assigns one or more classes, which are compared with the classes assigned by human experts to evaluate the classification accuracy in the testing phase. A series of experiments and analyses reveals that Know* provides high-quality suggestions for classifying Internet documents.

Similarity Measurement

The ACIRD Classifier uses the conventional similarity measurement, the cosine value of the feature vectors of a document and a class, defined by the following equation:

sim(o, c) = (Σ_i sup_{t_i,o} · mg_{t_i,c}) / (‖o‖ · ‖c‖), (EQ 8.1)

where t_i is a common term of o and c; sup_{t_i,o} is the support of t_i in o; mg_{t_i,c} is the membership grade of t_i in c, i.e., sup*_{t_i,c}; ‖o‖² = sup²_{t_1,o} + sup²_{t_2,o} + ... is the squared norm of the object; and ‖c‖² = mg²_{t_1,c} + mg²_{t_2,c} + ... is the squared norm of the class.

Owing to the imprecise nature of the concept of a class, the class assignment of an object cannot be exactly true or false. In addition, categorizing an object into only one class would be impractical, since an object may be conceptually related to several classes. Therefore, for an input object, the ACIRD Classifier returns the best N classes that are closest to the intention of the object. This assignment criterion is called Top N matches. Classification accuracy is estimated by whether the expert-assigned class of a testing object is located in the set of the best N matched classes. In this investigation, only the most specific classes and the most general classes are considered, for simplicity.
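EQ 8.1 above in code, assuming sparse term-to-weight mappings for the object and the class; the function name is illustrative.

    from math import sqrt

    def sim(obj_sup, class_mg):
        # obj_sup:  {term: sup(t, o)} for the object
        # class_mg: {term: mg(t, c) = sup*(t, c)} for the class
        dot = sum(w * class_mg[t] for t, w in obj_sup.items() if t in class_mg)
        norm_o = sqrt(sum(w * w for w in obj_sup.values()))
        norm_c = sqrt(sum(w * w for w in class_mg.values()))
        return dot / (norm_o * norm_c) if norm_o and norm_c else 0.0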

Experiment Results

The target class set of this experiment consists of the 512 classes of the ACIRD Lattice, 386 of which are most specific classes. There are 9,778 training objects and 8,855 testing objects distributed over those classes, all manually classified into the ACIRD Lattice; the training set and the testing set are disjoint. Before the learning process, ten human experts extracted the keywords of each class as the classification knowledge benchmark, denoted Know^U. (We hired ten part-time students, majoring in different domains, as knowledge engineers. They collected terms from the pages of the classes, and the major terms selected by most students were extracted as the classification knowledge. The experiment was sponsored by CCL, IIS, Academia Sinica.) The testing process runs on Know^U, Know*, and Know, which are marked as 10 Users, With PTS, and Without PTS respectively. The result of a naive Bayes model from Rainbow, marked as Naive Bayes, is also included for comparison, as it is widely used in text classification. Figure 8-2 summarizes the results. According to the results, Know* is on par with the manually extracted classification knowledge Know^U in terms of the accuracy of class assignment, and both Know* and Know^U are slightly better than the naive Bayes model.

Figure 8-2. The classification accuracy of assigning 8,855 testing objects to 386 most specific classes based on Top N. (The chart plots average accuracy for 10 Users, With PTS, Without PTS, and Naive Bayes against the threshold Top N: the target class is in the best N matched classes, N = 1, 2, ..., 10.)
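The Top N matches criterion can be scored as follows. A minimal sketch, assuming a ranked class list per object and a set of expert-assigned classes per object; the names are illustrative.

    def top_n_accuracy(predictions, truth, n):
        # predictions: {object: list of classes ranked by sim(o, c)}
        # truth:       {object: set of expert-assigned classes}
        hits = sum(1 for o, ranked in predictions.items()
                   if truth[o] & set(ranked[:n]))
        return hits / len(predictions)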

However, the classification accuracy in all the cases is unsatisfactorily low. Closely examining the training and testing sets revealed that many classes contain insufficient training objects, and that some training and testing objects contain very few keywords because they are non-text or link-only pages. Thus, another experiment is designed herein to circumvent this situation. The same testing process is performed on the twelve most general classes of the ACIRD Lattice (Arts, Humanities, Social Sciences, Society and Culture, Natural Sciences, Computer and Internet, Health, News and Information, Education, Government and State, Companies, and Entertainment and Recreation), and the resulting classification accuracy is shown in Figure 8-3. The Top 1 accuracy of Know* is increased from … to …. Such an increase is due to the sufficient training objects in the testing classes and to the reduction of the total number of testing classes from 512 to 12.

Figure 8-3. The classification accuracy of assigning 8,855 testing objects to 12 most general classes. (The chart plots accuracy for 10 Users, With PTS, and Without PTS over Top 1 through Top 6.)

Table 8-1 lists the number of objects and keywords of the twelve most general classes. As the table shows, the distribution of the numbers of objects is skewed, and some classes still exhibit the problem of insufficient training objects and keywords. To investigate the problem further, we perform another set of experiments on classes with sufficient training objects and keywords only.

Table 8-1. The distribution of training objects in the most general classes.

Class Name                      Objects           Keywords
Companies                       2,702 (27.88%)    950 (22.83%)
Entertainment and Recreation    2,577 (26.59%)    1,084 (26.05%)
Computer and Internet           1,199 (12.37%)    471 (11.32%)
Education                       1,169 (12.06%)    589 (14.15%)
Society and Culture             502 (5.18%)       226 (5.43%)
Government and State            384 (3.96%)       241 (5.79%)
News and Information            288 (2.97%)       162 (3.89%)
Health                          280 (2.89%)       180 (4.32%)
Arts                            223 (2.30%)       115 (2.76%)
Social Science                  208 (2.15%)       92 (2.21%)
Natural Science                 106 (1.09%)       44 (1.06%)
Humanities                      53 (0.55%)        8 (0.19%)

In this experiment, every testing class has at least forty training objects. The number of testing classes is thereby reduced from 512 to 48, without considering the general/specific relationships among classes. Since these classes contain a sufficient number of training objects, they are referred to herein as well-trained classes, and their refined classification knowledge as well-trained classification knowledge. In addition, the classification knowledge generated by the ten human experts is compared again to evaluate the quality of the well-trained classification knowledge. According to Figure 8-4, the Top N classification accuracy is markedly increased. The figure also reveals that, when the classes contain a sufficient number of training objects, our learning model can learn classification knowledge that is more accurate than that of the human experts. Intuitively, the number of classes is a factor that affects the classification accuracy. However, according to the results of Figures 8-3 and 8-4, the Top 1 values of the twelve most general classes and the forty-eight well-trained classes are … and 0.486, and the Top 6 values are … and …. This result is interesting, since it shows that the number of classes is not the only major factor.

Figure 8-4. The classification accuracy of well-trained classes. (The chart plots accuracy for Well-Trained Class, 10 Users, With PTS, and Without PTS over the threshold Top N, N = 1, 2, ..., 10.)

Two-Phase Search Engine

Based on the architectures shown in Figures 7-9 and 8-1, the implementation of the two-phase search engine is straightforward. In this section, we simply demonstrate an example of this kind of search service.

Examples of Two-Phase Search

Users can search for desired objects in ACIRD by submitting query strings. (The current version of ACIRD provides a Chinese interface only; the figures shown in this example are English translations.) For example, in Figure 8-5, the user selects the query mode Two-Phase Search and issues the query "interesting technical magazine". The search interface, search results, and crawled pages are written in Chinese (BIG5); we have manually translated these Chinese examples into English.

Figure 8-5. Two-phase search provided by the ACIRD query interface.

Figure 8-6. Class-level search query result: matched classes.

Figure 8-6 summarizes the query findings of the class-level search. In the figure, "Refined Search in Class" presents the class name; the user can resume the same query on that class by clicking on the class name. "Objects In Class" shows two links, All and Direct: the former lists all objects in the class (including objects of its subclasses), as shown in Figure 8-7, while the latter lists only the class's direct objects, as shown in Figure 8-8. MG (membership grade) indicates the normalized relevance score. Figure 8-6 shows 8 matched classes; if no class interests the user, the link at the bottom performs a search over all objects, as in a conventional search engine.

Figure 8-7. Search all objects under a class query result: all objects of a class.

Figure 8-8. Object-level search query result: direct objects of a class.

If the user focuses on the class Technical Journal and clicks on it to perform an object-level search, the findings are as shown in Figure 8-9. "In Class" presents the one or several classes that an object belongs to; a class hyperlink can be clicked to list all objects in that class.

Figure 8-9. Object-level search query result: objects in designated classes.

Figure 8-10. Search all objects query result: all objects.

Figure 8-10 summarizes the results when the user follows the link to perform Search All Objects. This example returns a total of 746 relevant objects. In comparison with the eight matched classes of the class-level search, navigating such a large flat result list is far less convenient, which again illustrates the benefit of the two-phase search.


More information

Introductory Programming, IMM, DTU Systematic Software Test. Software test (afprøvning) Motivation. Structural test and functional test

Introductory Programming, IMM, DTU Systematic Software Test. Software test (afprøvning) Motivation. Structural test and functional test Introdutory Programming, IMM, DTU Systemati Software Test Peter Sestoft a Programs often ontain unintended errors how do you find them? Strutural test Funtional test Notes: Systemati Software Test, http://www.dina.kvl.dk/

More information

Fast Distribution of Replicated Content to Multi- Homed Clients Mohammad Malli Arab Open University, Beirut, Lebanon

Fast Distribution of Replicated Content to Multi- Homed Clients Mohammad Malli Arab Open University, Beirut, Lebanon ACEEE Int. J. on Information Tehnology, Vol. 3, No. 2, June 2013 Fast Distribution of Repliated Content to Multi- Homed Clients Mohammad Malli Arab Open University, Beirut, Lebanon Email: mmalli@aou.edu.lb

More information

INTERPOLATED AND WARPED 2-D DIGITAL WAVEGUIDE MESH ALGORITHMS

INTERPOLATED AND WARPED 2-D DIGITAL WAVEGUIDE MESH ALGORITHMS Proeedings of the COST G-6 Conferene on Digital Audio Effets (DAFX-), Verona, Italy, Deember 7-9, INTERPOLATED AND WARPED -D DIGITAL WAVEGUIDE MESH ALGORITHMS Vesa Välimäki Lab. of Aoustis and Audio Signal

More information

A New RBFNDDA-KNN Network and Its Application to Medical Pattern Classification

A New RBFNDDA-KNN Network and Its Application to Medical Pattern Classification A New RBFNDDA-KNN Network and Its Appliation to Medial Pattern Classifiation Shing Chiang Tan 1*, Chee Peng Lim 2, Robert F. Harrison 3, R. Lee Kennedy 4 1 Faulty of Information Siene and Tehnology, Multimedia

More information

COMBINATION OF INTERSECTION- AND SWEPT-BASED METHODS FOR SINGLE-MATERIAL REMAP

COMBINATION OF INTERSECTION- AND SWEPT-BASED METHODS FOR SINGLE-MATERIAL REMAP Combination of intersetion- and swept-based methods for single-material remap 11th World Congress on Computational Mehanis WCCM XI) 5th European Conferene on Computational Mehanis ECCM V) 6th European

More information

EXODUS II: A Finite Element Data Model

EXODUS II: A Finite Element Data Model SAND92-2137 Unlimited Release Printed November 1995 Distribution Category UC-705 EXODUS II: A Finite Element Data Model Larry A. Shoof, Vitor R. Yarberry Computational Mehanis and Visualization Department

More information

- 1 - S 21. Directory-based Administration of Virtual Private Networks: Policy & Configuration. Charles A Kunzinger.

- 1 - S 21. Directory-based Administration of Virtual Private Networks: Policy & Configuration. Charles A Kunzinger. - 1 - S 21 Diretory-based Administration of Virtual Private Networks: Poliy & Configuration Charles A Kunzinger kunzinge@us.ibm.om - 2 - Clik here Agenda to type page title What is a VPN? What is VPN Poliy?

More information

Improved Circuit-to-CNF Transformation for SAT-based ATPG

Improved Circuit-to-CNF Transformation for SAT-based ATPG Improved Ciruit-to-CNF Transformation for SAT-based ATPG Daniel Tille 1 René Krenz-Bååth 2 Juergen Shloeffel 2 Rolf Drehsler 1 1 Institute of Computer Siene, University of Bremen, 28359 Bremen, Germany

More information