1.1 Introduction

In the recent past, the World Wide Web has been witnessing explosive growth. All the leading web search engines, such as Google, Yahoo and Ask Jeeves, are vying with each other to provide the web user with the appropriate content in response to his or her query. In most cases, the user is flooded with thousands of web pages in response to a query, and it is common knowledge that not many users go past the first few pages. In spite of the multitude of pages returned, most of the time the average user does not find what he or she is looking for in the first few pages he or she manages to examine. It is debatable how useful or meaningful it is for any search engine to return thousands of web pages in response to a user query. In spite of the sophisticated page ranking algorithms employed by the search engines, the pages the user actually needs may get lost in the huge amount of information returned.

Since most users of the web are not experts, grouping the web pages into categories helps them navigate quickly to the category they are actually interested in, and subsequently to the specific web page. This reduces the search space for the user to a great extent. Web page categorization is the main focus of this thesis. It is strongly believed that the experience of a person using a web search engine is enhanced manifold if the results are neatly categorized, as against the case where the results are displayed in a structureless, flat manner.

1.2 Information Retrieval

Web search owes its roots to traditional information retrieval systems, even though it differs from them in some ways. Information retrieval systems work in controlled environments, where it is relatively easy to build effective systems. Web search engines, on the other hand, have to work in a relatively uncontrolled environment, where the content is dynamic in nature, stored in different formats, and unstructured.

Information Retrieval is the problem of selecting relevant information from a document database in response to search queries given by a user [1], [6]. Information Retrieval Systems (IRSs) deal with document databases that usually consist of textual information, and they process user queries to provide the user with access to relevant information within a reasonably acceptable time interval. An IRS consists of three main components:

- A document collection, which stores the actual documents and the representation of their information contents. Text documents are typically represented in terms of index terms, which act as the content identifiers of the documents.

- A query subsystem, which allows the users to formulate their queries and presents the relevant documents retrieved by the system to them. To do so, it includes a query language that collects the rules to generate legitimate queries, and procedures to select the relevant documents.

- A matching mechanism, which evaluates the degree to which the document representations match the requirements expressed in the query, and retrieves those documents that are deemed to be relevant to it.

The retrieval model of most of the commercially available IRSs is the Boolean one, which is a robust and well formulated model, albeit with some limitations. For example, it does not consider partial relevance and is not able to rank the retrieved documents by relevance. Due to this fact, some paradigms have been designed to extend this retrieval model and overcome these problems, with the vector space model [6] being the most popular.
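To make the vector space model concrete, the following minimal sketch represents documents as term-weight vectors using a simple tf-idf scheme and ranks them against a query by cosine similarity. It illustrates how such a model supports graded, partial relevance, unlike the Boolean model. The toy documents, the query and the exact weighting are invented for the example and are not drawn from this thesis.

```python
import math
from collections import Counter

# Toy collection; documents and query are invented for illustration.
docs = [
    "web search engines index web pages",
    "clustering groups similar web pages",
    "information retrieval selects relevant documents",
]
query = "retrieval of relevant web pages"

def tokenize(text):
    return text.lower().split()

# Inverse document frequency over the toy collection.
N = len(docs)
df = Counter(term for d in docs for term in set(tokenize(d)))
idf = {term: math.log(N / df[term]) for term in df}

def tfidf_vector(text):
    """Represent a text as a dict of term -> tf*idf weight."""
    tf = Counter(tokenize(text))
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

# Rank documents by decreasing similarity to the query.
q = tfidf_vector(query)
for d in sorted(docs, key=lambda d: cosine(q, tfidf_vector(d)), reverse=True):
    print(round(cosine(q, tfidf_vector(d)), 3), d)
```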

1.3 Web Page Retrieval

Although the traditional IR techniques are a few decades old, they still constitute the base of modern web search engines. The popularity of the Web has transformed traditional IRSs into newer and more powerful search tools for locating content on the Internet. However, there are several differences due to the special characteristics of the World Wide Web environment. The problem of searching the Web has become far more complex than it was in the past [4], mainly due to the increase in the size of the search space by several orders of magnitude, and due to the multimedia nature of web documents. The main differences between web page retrieval and traditional IR are as follows [3]:

(1) The HTML-based nature of web documents, which makes them present a structure defined by the HTML tags.

(2) The diversity of web documents in terms of: (i) length, structure and writing style, (ii) language and domains, and (iii) existing information formats.

(3) The dynamic nature of many web pages, in the sense that the content may keep changing, which makes them difficult to retrieve.

These aspects clearly show how web retrieval has to extend traditional IR in order to deal with the special nature of web documents. However, this usually makes web search engines focus more on the efficiency of the response than on the retrieval efficacy.

1.4 Classification and Clustering

Classification and clustering are two tasks which have traditionally been carried out by human beings who are experts in the given domain. But in this electronic age, with the explosion in the amount of information available on the net, it is becoming increasingly difficult for human experts to classify or cluster all the documents available on the World Wide Web. Hence, it is increasingly evident that machine learning techniques should be used instead of human experts to carry out the tasks of web document classification and clustering, as part of the activity of categorizing them.

1.4.1 Document Classification

Given a document, the task of a classifier is to identify the class (category) to which the document belongs, from a set of predefined classes. It is an example of supervised learning. In order to carry out text classification, a classifier first needs to be trained with a sufficient number of representative samples from each of the classes. The samples, or examples as they are often called, are usually picked by human experts.

One important application of text classification is to help build a vertical search engine. Vertical search engines restrict searches to a particular topic. For example, the query "computer science" on a vertical search engine for the topic India would return a list of Indian computer science departments with higher precision and recall than the query "computer science India" on a general purpose search engine.

Some other applications include several of the preprocessing steps necessary for indexing:

- detecting a document's encoding (ASCII, Unicode, UTF-8)
- word segmentation
- identifying the language of a document

Another application is the automatic detection of spam pages.

Many classification tasks have traditionally been solved manually, but manual classification is expensive to scale to problems of larger size. A second approach is to carry out classification by the use of rules, which are often equivalent to Boolean expressions. A rule captures a certain combination of keywords that indicates a class. Hand-coded rules have good scaling properties, but creating the rules and maintaining them over time is labor-intensive. A technically skilled person can create rule sets that are very accurate, but it can often be very difficult to find someone with this specialized skill.

A third approach to text classification is based on machine learning. In machine learning, the set of rules or, more generally, the decision criterion of the text classifier is learned automatically from training data. This approach is also called statistical text classification if the learning method is statistical. In statistical text classification, a number of good example documents from each predefined class are required for training the classifier. The human element is still not eliminated, since the training documents come from a person who is an expert in the area.
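As a concrete illustration of statistical text classification, the sketch below trains a Naive Bayes classifier on a handful of labeled example documents and then classifies an unseen one. It assumes the scikit-learn library; the training documents and class labels are invented for the example and would, in practice, come from a human expert as discussed above.

```python
# A minimal sketch of statistical text classification using scikit-learn.
# The training documents and labels are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "forward scores a goal in the league match",
    "team wins the championship final",
    "parliament passes the new budget bill",
    "minister announces election date",
]
train_labels = ["sports", "sports", "politics", "politics"]

# Represent documents as term-count vectors.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Learn the decision criterion from the labeled examples.
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

# Classify an unseen document.
X_new = vectorizer.transform(["the team lost the final match"])
print(classifier.predict(X_new))  # -> ['sports']
```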

1.4.2 Document Clustering

Document clustering is the process of grouping a set of documents into subsets or clusters. The goal is to create clusters that are coherent internally, but different from each other. In other words, documents within a cluster should be as similar as possible, and documents in one cluster should be as dissimilar as possible from documents in other clusters.

Clustering is the most common form of unsupervised learning. No supervision means that there is no human expert who has assigned documents to classes, unlike in classification. In clustering, it is the data distribution that determines the cluster membership.

The key input to a clustering algorithm is the similarity measure. In document clustering, the similarity or distance measure is usually the vector space similarity or distance. Different similarity or distance measures give rise to different cluster formations. Thus, the similarity or distance measure is very critical to the outcome of clustering.

The cluster hypothesis states the fundamental assumption that is made when using clustering in information retrieval:

Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs.

The hypothesis states that if there is a document from a cluster that is relevant to a search request, then it is likely that other documents from the same cluster are also relevant. This is because clustering puts together documents that share many words.
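The following sketch illustrates clustering as unsupervised learning: no class labels are supplied, and the grouping follows purely from the vector space representation of the documents. It assumes scikit-learn and uses tf-idf vectors with k-means, a common baseline rather than the specific method proposed later in this thesis; the documents are invented for the example.

```python
# A minimal sketch of document clustering, assuming scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stock markets fall on rate fears",
    "investors react to interest rate hike",
    "new vaccine shows promise in trials",
    "clinical study reports strong vaccine results",
]

# Documents -> tf-idf vectors in the vector space model.
X = TfidfVectorizer().fit_transform(docs)

# Group the vectors into two clusters; no labels are given,
# so cluster membership comes from the data distribution alone.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for doc, label in zip(docs, km.labels_):
    print(label, doc)
```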

1.5 Meta Search Engines

A meta search engine does not crawl or index a document collection. Instead, it takes a query from the user, submits it to several search engines, and then organizes the outputs received from all those search engines before displaying the results to the user. Since a given search engine covers, at best, only a part of the entire expanse of the World Wide Web, it is a good idea to collect the responses from several search engines for a given user query. This way, a bigger part of the World Wide Web can be accessed. Also, by collating the results from several search engines, a much better ranking of the web pages [176] can be provided as they are displayed in the browser for the benefit of the user.
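One simple way to collate ranked lists from several engines is a Borda-style count, in which each engine awards more points to its higher-ranked results and the merged list is ordered by total points. The sketch below uses invented engine names and URLs, and it is only one of many possible rank aggregation schemes; it is not claimed to be the scheme of [176].

```python
# A minimal sketch of merging ranked result lists by Borda count.
# Engine names and URLs are invented for illustration.
from collections import defaultdict

results_per_engine = {
    "engine_a": ["url1", "url2", "url3"],
    "engine_b": ["url2", "url4", "url1"],
    "engine_c": ["url2", "url3", "url5"],
}

scores = defaultdict(float)
for ranking in results_per_engine.values():
    n = len(ranking)
    for position, url in enumerate(ranking):
        scores[url] += n - position   # top rank earns the most points

merged = sorted(scores, key=scores.get, reverse=True)
print(merged)  # url2 first: all three engines rank it highly
```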

Figure 1.1: Clustered search results shown by Clusty of Vivisimo

The Vivisimo Clustering Engine [177], known by the name Clusty, uses clustering as its core technology. The document clustering methods never need to touch or know about the larger collection from which the search results are taken, nor do the results undergo any other pre-processing steps. Organizing the search results occurs just before the user is shown the long list of search results. The final output is a hierarchy (or tree) on the left of a split screen, with the search results on the right. The interaction is based on the familiar Microsoft Explorer style of interacting with a file system.

1.6 Motivation for Clustering the Search Results

It is found that people will not search for long on the web. There is a limited average time they will spend before giving up, or becoming very upset with the search technology available to them. On average, web rage is uncaged (sic) after twelve minutes of fruitless searching, although about seven percent of the 566 people surveyed by Roper Starch Worldwide say ire starts rising within three minutes [179]. When users become frustrated while searching the internet, they either try another search engine, re-formulate the search, or quit.

For users, seeing clustered search results has several benefits [178]:

- It brings into easy view those search results that would otherwise remain invisible because they are far down the list.
- It allows users to examine nearly double the number of relevant documents, compared with the result lists of commercial search engines.
- It leads to effortless knowledge discovery, as the user learns the types or subtopics of available information relating to the query.
- It provides context by placing related documents within a single folder for joint viewing.

All of these factors have a significant impact on a user's search productivity.

1.7 Objectives of the Research Work

Categorization of the search results returned by a search engine is the main focus of this thesis. The following are the main objectives of the work reported herein.

- To find a method that automatically determines the number of natural clusters (K) present in a set of web search results with reasonable accuracy. Once clustering is done, which is essentially to know which search results (web pages) belong together as a group, each web page can be labeled with the corresponding cluster number. This means that the individual web pages are automatically annotated with a label, instead of a human expert doing that job. (A common baseline for estimating K is sketched after this list.)

- When the popular vector space model is used to represent web pages, the resulting dimensionality is usually very high, a few thousand terms, even for a moderately large data collection. In that context, feature selection for the purpose of dimensionality reduction becomes essential. As rough set reducts are known to uncover data dependencies very well, another main objective of this work is to investigate their applicability, in their current form, to the domain of web page categorization, and to suggest modifications if found necessary.

- Classification of the instances of a dataset is carried out by a classifier after it learns a model from a training dataset. The training data generally consists of instances which are labeled by a human expert. The labels are the classes into which the instances of the dataset are divided, and they are fixed by the human expert. The essence is that human intervention is required in the form of preparing the training data. Clustering of large datasets is universally accepted as a difficult problem, since it tries to group instances together without the helping hand of a human supervisor. Also, algorithms such as K-Medoids have a very high time complexity, with runtimes that are unacceptably high even for moderately large datasets. The third objective of this thesis is to integrate both clustering and classification in the context of web page categorization. In the proposed approach, clustering is used instead of a human expert in preparing the training data for the classifier.
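For context, a common baseline for choosing K, quite distinct from the Find-K algorithm proposed in Chapter 5, is the elbow method: run k-means for a range of K values and look for the point where the within-cluster error stops improving sharply. A minimal sketch, assuming scikit-learn and synthetic two-dimensional data:

```python
# The elbow-method baseline for choosing K (not the Find-K algorithm
# this thesis proposes); the data below is synthetic, for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic Gaussian blobs, so the natural K is 3.
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                  for c in ((0, 0), (5, 5), (0, 5))])

for k in range(1, 7):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
    print(k, round(inertia, 1))   # inertia drops sharply until k = 3
```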

1.8 Organization of the Thesis

The contents of the thesis are organized and presented in the following manner.

Chapter 2 presents a literature review of the existing approaches to web page categorization.

In Chapter 3, a review of the vector space model is presented. The vector space model is a popular document representation model, and it has been used in this thesis. Further, a brief introduction is given to some of the existing similarity measures and clustering methods that are generally used in web page categorization.

Chapter 4 presents a brief overview of rough set theory and the QuickReduct algorithm, which can be used for dimensionality reduction through feature selection. A review of the existing work related to QuickReduct is also presented.

Chapter 5 presents the Find-K algorithm, which is one of the main contributions of this thesis. Along with the description of the algorithm, results and discussion are presented. A graph based mathematical model is also presented here.

In Chapter 6, a modification to the existing rough set based QuickReduct algorithm is proposed.

Chapter 7 presents a new approach to web page categorization, called the Integrated Machine Learning Approach (IMLA). This method puts together clustering and classification by integrating the work reported in Chapters 5 and 6, and thereby proposes a novel approach to web page categorization.

Chapter 8 presents the discussion and conclusions, the limitations of the current work, and the future scope.