1.1 Introduction

In the recent past, the World Wide Web has witnessed explosive growth. All the leading web search engines, such as Google, Yahoo and Ask Jeeves, vie with each other to provide the user with appropriate content in response to a query. In most cases, the user is flooded with thousands of web pages in response to a query, and it is common knowledge that few users go past the first few result pages. Despite the multitude of pages returned, the average user often does not find what he or she is looking for in the first few pages examined. It is debatable how useful or meaningful it is for a search engine to return thousands of web pages in response to a single query: in spite of the sophisticated page-ranking algorithms employed by search engines, the pages the user actually needs may get lost in the huge amount of information returned. Since most users of the web are not experts, grouping the returned web pages into categories helps them navigate quickly to the category they are actually interested in, and subsequently to the specific web page, greatly reducing the search space. Web page categorization is the main focus of this thesis. It is strongly felt that the experience of a person using a web search engine is enhanced manifold if the results are neatly categorized
as against the case where the results are displayed in a structureless, flat manner.

1.2 Information Retrieval

Web search owes its roots to traditional information retrieval systems, even though it differs from them in some ways. Information retrieval systems work in controlled environments, where it is relatively easy to build effective systems. Web search engines, on the other hand, have to work in a relatively uncontrolled environment, where the content is dynamic in nature, unstructured, and stored in many different formats. Information Retrieval (IR) is the problem of selecting relevant information from a document database in response to search queries given by a user [1], [6]. Information Retrieval Systems (IRSs) deal with document databases that usually consist of textual information, and process user queries to provide the user with access to relevant information within a reasonably acceptable time interval. An IRS consists of three main components: A document collection, which stores the actual documents and the representation of their information contents. Text documents are typically represented in terms of index terms, which act as the content identifiers of the documents.
A query subsystem, which allows the users to formulate their queries and presents the relevant documents retrieved by the system to them. To this end, it includes a query language that embodies the rules for generating legitimate queries, together with procedures for selecting the relevant documents. A matching mechanism, which evaluates the degree to which a document representation matches the requirements expressed in the query, and retrieves those documents that are deemed relevant to it. The retrieval model of most commercially available IRSs is the Boolean one, which is a robust and well-formulated model, albeit with some limitations. For example, it does not consider partial relevance and is not able to rank the retrieved documents by relevance. For this reason, several paradigms have been designed to extend this retrieval model and overcome these problems, with the vector space model [6] being the most popular.

1.3 Web Page Retrieval

Although the traditional IR techniques are a few decades old, they still constitute the base of modern Web search engines. The popularity of the Web has transformed traditional IRSs into newer and more powerful search tools for locating content on the Internet. However, there are several differences due to the special characteristics of the World Wide
Web environment. The problem of searching the Web has become far more complex than it was in the past [4], mainly due to the increase in the size of the search space by several orders of magnitude, and due to the multimedia nature of Web documents. The main differences between Web page retrieval and traditional IR are as follows [3]:
(1) The HTML-based nature of Web documents, which gives them a structure defined by the HTML tags.
(2) The diversity of Web documents in terms of (i) length, structure and writing style, (ii) language and domains, and (iii) existing information formats.
(3) The dynamic nature of many Web pages, in the sense that their content may keep changing, which makes them difficult to retrieve.
These aspects clearly show how Web retrieval has to extend traditional IR in order to deal with the special nature of Web documents. However, this usually makes Web engines focus more on the efficiency of the response than on the retrieval efficacy.
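As background for what follows, the vector space model mentioned in Section 1.2 can be sketched in a few lines. This is only a minimal illustration over a hypothetical toy corpus, using simple tf-idf weighting and cosine similarity; it is not the exact representation developed later in the thesis.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse tf-idf vector (term -> weight) for each tokenized document."""
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# hypothetical three-document corpus
docs = [
    "web search engine ranks web pages".split(),
    "search engine indexes documents".split(),
    "cooking recipes for pasta".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # thematically related pair: similarity > 0
print(cosine(vecs[0], vecs[2]))  # no shared terms: similarity 0.0
```

Documents are thus points in a term space, and retrieval or clustering reduces to comparing angles between them, which is what lets the model rank documents by degree of relevance rather than by a strict Boolean match.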
1.4 Classification and Clustering

Classification and clustering are two tasks which have traditionally been carried out by human beings who are experts in the given domain. But in this electronic age, with the explosion in the amount of information available on the net, it is becoming increasingly difficult for human experts to classify or cluster all the documents available on the World Wide Web. Hence, it is increasingly evident that machine learning techniques should be used instead of human experts to carry out the tasks of web document classification and clustering, as part of the activity of categorizing them.

1.4.1 Document Classification

Given a document, the task of a classifier is to identify the class (category), from a set of predefined classes, to which the document belongs. This is an example of supervised learning. In order to carry out text classification, a classifier first needs to be trained with a sufficient number of representative samples from each of the classes. The samples, or examples as they are often called, are usually picked by human experts. One important application of text classification is in building vertical search engines, which restrict searches to a particular topic. For example, the query computer science on a vertical search engine for the topic India would return a list of Indian computer science departments with higher precision and recall than the query
computer science India on a general-purpose search engine. Some other applications include several of the preprocessing steps necessary for indexing:
- detecting a document's encoding (ASCII, Unicode, UTF-8)
- word segmentation
- identifying the language of a document
- the automatic detection of spam pages
Many classification tasks have traditionally been solved manually. But manual classification is expensive to scale to problems of larger size. A second approach is to carry out classification by the use of rules, which are often equivalent to Boolean expressions. A rule captures a certain combination of keywords that indicates a class. Hand-coded rules have good scaling properties, but creating the rules and maintaining them over time is labor-intensive. A technically skilled person can create rule sets that are very accurate, but it can often be very difficult to find someone with this specialized skill. A third approach to text classification is based on machine learning, in which the set of rules or, more generally, the decision criterion of the text classifier is learned automatically from training data. This approach is also called statistical text classification if the learning method is statistical. In statistical text classification, a number of good example documents from each predefined class are required for training the classifier. The human element is still not
eliminated, since the training documents come from a person who is an expert in the area.

1.4.2 Document Clustering

Document clustering is the process of grouping a set of documents into subsets or clusters. The goal is to create clusters that are coherent internally but different from each other. In other words, documents within a cluster should be as similar as possible, and documents in one cluster should be as dissimilar as possible from documents in other clusters. Clustering is the most common form of unsupervised learning. No supervision means that, unlike in classification, there is no human expert who has assigned documents to classes; in clustering, it is the data distribution that determines the cluster membership. The key input to a clustering algorithm is the similarity measure. In document clustering, the similarity or distance measure is usually the vector space similarity or distance. Different similarity or distance measures give rise to different cluster formations; thus, the measure chosen is very critical to the outcome of clustering. The cluster hypothesis states the fundamental assumption that is made when using clustering in information retrieval:

Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs.

The hypothesis states that if there is a document from a cluster that is relevant to a search
request, then it is likely that other documents from the same cluster are also relevant. This is because clustering puts together documents that share many words.

1.5 Meta Search Engines

A meta search engine does not crawl or index a document collection. Instead, it takes a query from the user, submits it to several search engines, and then organizes the outputs received from all those search engines before displaying the results to the user. Since any given search engine covers, at best, only a part of the entire expanse of the World Wide Web, it seems like a good idea to collect the responses from several search engines for a given user query. This way, a bigger part of the World Wide Web can be accessed. Also, by collating the results from several search engines, a much better ranking of the web pages [176] can be provided as they are displayed in the browser for the benefit of the user.
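The collation step described above can be sketched in miniature. The sketch below uses one simple scheme, a Borda-style rank aggregation over hypothetical result lists; it is meant only to illustrate the idea of merging, not the particular ranking method of [176].

```python
from collections import defaultdict

def merge_results(rankings):
    """Merge ranked result lists from several engines with a simple
    Borda count: a URL earns more points the higher an engine ranks it."""
    scores = defaultdict(int)
    for ranking in rankings:
        for position, url in enumerate(ranking):
            scores[url] += len(ranking) - position
    # sort URLs by total score, highest first (ties broken alphabetically)
    return sorted(scores, key=lambda u: (-scores[u], u))

# hypothetical result lists from three engines for the same query
engine_a = ["page1", "page2", "page3"]
engine_b = ["page2", "page1", "page4"]
engine_c = ["page2", "page3", "page1"]
merged = merge_results([engine_a, engine_b, engine_c])
print(merged)  # page2 comes first: it scores highest across the three engines
```

A page that appears near the top of several lists thus outranks one that a single engine happens to rank first, which is the intuition behind collated rankings.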
Figure 1.1 Clustered search results shown by Clusty of Vivísimo

The Vivísimo Clustering Engine [177], known by the name Clusty, uses clustering as its core technology. Its document clustering method never needs to touch or know about the larger collection from which the search results are taken, nor undergo any other pre-processing steps. The organization of the search results occurs just before the user is shown the long list of results. The final output is a hierarchy (or tree) on the left of a split screen, with the search results on the right. The interaction is based on the familiar Microsoft Explorer style of interacting with a file system.
1.6 Motivation for Clustering the Search Results

It is found that people will not search for long on the web. There is a limited average time they will spend before giving up, or becoming very upset with the search technology available to them. On average, "Webrage is uncaged (sic) after twelve minutes of fruitless searching, although about seven percent of the 566 people surveyed by Roper Starch Worldwide say ire starts rising within three minutes" [179]. When users become frustrated while searching the Internet, they either try another search engine, re-formulate the search, or quit. For users, seeing clustered search results has four benefits [178]:
- It brings into easy view those search results that would otherwise remain invisible because they are far down the list.
- It allows users to examine nearly double the number of relevant documents compared with the result lists of commercial search engines.
- It leads to effortless knowledge discovery, as the user learns the types or subtopics of available information relating to the query.
- It provides context by placing related documents within a single folder for joint viewing.
All of these factors have a significant impact on a user's search productivity.
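The grouping of search results described above can be illustrated with a bare-bones k-means sketch over toy snippet vectors. The snippets, the choice of two clusters, the bag-of-words weighting and the deterministic centroid seeding are all hypothetical simplifications for this example; the thesis's own approach to choosing K and representing pages is developed in later chapters.

```python
from collections import Counter

def vectorize(snippets, vocab):
    """Bag-of-words count vectors over a fixed vocabulary."""
    return [[Counter(s.split())[w] for w in vocab] for s in snippets]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain k-means; centroids are seeded deterministically (spread over the
    data) so that this toy example is reproducible."""
    step = max(1, (len(points) - 1) // (k - 1)) if k > 1 else 1
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        labels = [min(range(k), key=lambda c: sq_dist(pt, centroids[c]))
                  for pt in points]
        # update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [pt for pt, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# hypothetical result snippets for the ambiguous query "jaguar / python"
snippets = [
    "python programming language tutorial",
    "learn python programming",
    "jaguar speed wild cat",
    "jaguar cat habitat",
]
vocab = sorted(set(w for s in snippets for w in s.split()))
labels = kmeans(vectorize(snippets, vocab), k=2)
print(labels)  # the two programming snippets and the two jaguar snippets group together
```

Even on this toy input the grouping surfaces the subtopics of the query, which is exactly the knowledge-discovery benefit listed above.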
1.7 Objectives of the Research Work

Categorization of the search results returned by a search engine is the main focus of this thesis. The following are the main objectives of the work reported herein.
- To find a method that automatically determines, with reasonable accuracy, the number of natural clusters (K) present in a set of web search results. Once clustering is done, that is, once it is known which search results (web pages) belong together as a group, each web page can be labeled with the corresponding cluster number. This means that the individual web pages are automatically annotated with a label instead of a human expert doing that job.
- When the popular vector space model is used to represent web pages, the resulting dimensionality is usually very high, running to a few thousand terms even for a moderately large data collection. In that context, feature selection for the purpose of dimensionality reduction becomes essential. As rough set reducts are known to uncover data dependencies very well, another main objective of this work is to investigate their applicability, in their current form, to the domain of web page categorization, and to suggest modifications where necessary.
- Classification of instances of a dataset is carried out by a classifier after it learns a model from a training dataset. The training data
generally consists of instances which are labeled by a human expert. The labels are the classes into which the instances of the dataset are divided, and are fixed by the human expert. The essence is that human intervention is required in the form of preparing the training data. Clustering of large datasets is universally accepted as a difficult problem, since it tries to group instances together without the helping hand of a human supervisor. Also, algorithms such as K-Medoids have a very high time complexity, with runtimes that are unacceptably high even for moderately large datasets. The third objective of the thesis is therefore to integrate clustering and classification in the context of web page categorization. In the proposed approach, clustering is used instead of a human expert to prepare the training data for the classifier.

1.8 Organization of the Thesis

The contents of the thesis are organized and presented in the following manner.

Chapter 2 presents the literature review on the existing approaches to web page categorization.

In chapter 3, a review of the vector space model is presented. The vector space model is a popular document representation model and has been used in this thesis. Further, a brief introduction is given to
some of the existing similarity measures and clustering methods that are generally used in web page categorization.

Chapter 4 presents a brief overview of rough set theory and the QuickReduct algorithm, which can be used for dimensionality reduction through feature selection. A review of the existing work related to QuickReduct is also presented.

Chapter 5 presents the Find-K algorithm, which is one of the main contributions of this thesis. Along with the description of the algorithm, results and discussion are presented. A graph-based mathematical model is also presented here.

In chapter 6, a modification to the existing rough set based QuickReduct algorithm is proposed.

Chapter 7 presents a new approach to web page categorization, called the Integrated Machine Learning Approach (IMLA). This method puts clustering and classification together by integrating the work reported in chapters 5 and 6, thereby proposing a novel approach to web page categorization.

Chapter 8 presents discussion and conclusions, limitations of the current work, and future scope.