

Chapter 1
Web-Mining and Information Retrieval

1.1 Introduction

The World Wide Web, or simply the web, may be seen as a huge collection of documents freely produced and published by a very large number of people, without any solid editorial control. It is probably the most democratic and anarchic widespread medium for anyone to express feelings, comments, convictions and ideas, independently of ethnicity, sex, religion or any other characteristic of human societies. The web constitutes a comprehensive, dynamic, permanently up-to-date repository of information covering most areas of human knowledge (Hu, 2002), and it supports an increasingly important part of commercial, artistic, scientific and personal transactions, which gives rise to very strong interest from individuals as well as institutions, at a universal scale. However, the web also exhibits characteristics that hinder the process of collecting information from it to satisfy specific needs: the large volume of data it contains, its dynamic nature, its largely unstructured or semi-structured data, content and format heterogeneity, and irregular data quality. End-users introduce additional difficulties into the retrieval process: information needs are often imprecisely defined, generating a semantic gap between user needs and their specification. The satisfaction of a specific information need on the web is supported by search engines and other tools aimed at helping users gather information from the web. The user is usually not assisted in the subsequent tasks of organizing, analyzing and exploring the answers produced. These answers are usually flat lists of large sets of web pages which demand significant user effort to explore. Satisfying information needs on the web is usually seen as an ephemeral one-step process of information search (the traditional search engine paradigm).
Given these characteristics, it is highly demanding to satisfy private or institutional information needs on the web. The web itself, and the interests it promotes, are growing and changing rapidly, at a global scale, both as a medium of

divulgation and dissemination and also as a source of generic and specialized information. Web users have already realized the potential of this huge information source and use it for many purposes, mainly to satisfy specific information needs. Simultaneously, the web provides a ubiquitous environment for executing many activities, regardless of place and time.

1.2 Web Mining

Web mining is a very active research topic which combines two lively research areas: Data Mining and the World Wide Web. Web mining research relates to several research communities, such as Databases, Information Retrieval and Artificial Intelligence [1]. Web mining is defined by [Coo97] as the discovery and analysis of useful information from the WWW. It is used to extract interesting and potentially useful patterns and implicit information from artefacts or activity related to the WWW. Web mining in relation to other forms of data mining and retrieval is illustrated in Figure 1.1. The diagram demonstrates that web mining is performed on an unstructured source, i.e. web sites.

Figure 1.1: Web mining in relation to other forms of data mining and retrieval

1.2.1 Web Content Mining

Web content mining is the automatic search of information resources available online [Coo97]. As a process, web content mining goes beyond keyword extraction, since web documents present no machine-readable semantics. The two groups of web content mining approaches concentrate on different aspects: the agent-based approach directly mines document contents, while the database approach improves the search strategy of the search engine with regard to the database it uses.

1.2.2 Web Structure Mining

While web content mining focuses on the internal structure of a web document, web structure mining tries to discover the link structure of the hyperlinks at the inter-document level.

1.2.3 Web Usage Mining

Web usage mining is defined as the discovery of user access patterns from web servers. Web servers record and accumulate user interaction data each time a user makes a request for resources. Analyzing these web access logs can reveal patterns in a user's browsing habits through the web server [2].

Figure 1.2: Taxonomy of Web Mining

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services. Web mining

methodologies can generally be classified into one of three distinct categories: Web structure, Web content, and Web usage mining. The goal of Web structure mining is to categorize Web pages and generate information such as the similarity and relationships between them, taking advantage of their hyperlink topology. In recent years, the area of Web structure mining has focused on the identification of authorities, i.e. pages that are considered important sources of information by many people in the Web community. Web content mining has to do with the retrieval of information (content) available on the Web into more structured forms, as well as its indexing for easy tracking of information locations. Web content may be unstructured (plain text), semi-structured (HTML documents), or structured (extracted from databases into dynamic Web pages). Such dynamic data cannot be indexed and constitute what is called the hidden Web. A research area closely related to content mining is text mining. Web content mining is nowadays strongly interrelated with Web structure mining, since both are usually used in combination for extracting and organizing information from the Web. Web content mining provides methods enabling the automated discovery, retrieval, organization, and management of the vast amount of information and resources available on the Web. Cooley et al. [CMS97] categorize the main research efforts in the area of content mining into two approaches: the Information Retrieval (IR) approach and the Database (DB) approach. The IR approach involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize Web-based information. Web usage mining is the process of identifying browsing patterns by analyzing the user's navigational behavior. It takes as input the usage data, i.e. the data residing in the Web server logs, recording the visits of users to a Web site.
Extensive research in the area of Web usage mining has led to the appearance of a related research area, that of Web personalization. Web personalization utilizes the results produced by Web usage mining in order to dynamically provide recommendations to each user.

Web mining is moving the World Wide Web toward a more useful environment in which users can quickly and easily find the information they need. It includes the discovery and analysis of data, documents, and multimedia from the World Wide Web. Web mining uses document content, hyperlink structure, and usage statistics to assist users in meeting their information needs. The Web itself and search engines contain relationship information about documents. Web mining is the discovery of these relationships and is accomplished within three sometimes overlapping areas. Content mining is the first. Search engines define content by keywords. Finding a page's keywords, and finding the relationship between a Web page's content and a user's query content, is content mining. Hyperlinks provide information about other documents on the Web thought to be important to a given document. These links add depth to the document, providing the multi-dimensionality that characterizes the Web. Mining this link structure is the second area of Web mining. Finally, there are relationships to other documents on the Web that are identified by previous searches. These relationships are recorded in logs of searches and accesses. Mining these logs is the third area of Web mining. Understanding the user is also an important part of Web mining. Analysis of the user's previous sessions, preferred display of information, and expressed preferences may influence the Web pages returned in response to a query. Web mining is interdisciplinary in nature, spanning fields such as information retrieval, natural language processing, information extraction, machine learning, databases, data mining, data warehousing, user interface design, and visualization. Techniques for mining the Web have practical applications in m-commerce, e-commerce, e-government, e-learning, distance learning, organizational learning, virtual organizations, knowledge management, and digital libraries.
1.3 Web Mining and Information Retrieval

Web IR is the application of IR to the web. In classical IR, users specify queries, in some query language, representing their information needs. The

system selects the set of documents in its collection that seem most relevant to the query and presents them to the user. Users may then refine their queries to improve the answer. In the web environment, user intents are not as static and stable as they usually are in traditional IR. On the web, the information need is associated with a given task (Broder, 2002) that is not known in advance and may differ greatly from user to user, even if the query specification is the same. The identification of this task, and the mental process of deriving a query from an information need, are crucial aspects of web IR. Web IR is related to web mining, the automatic discovery of interesting and valuable information from the web (Chakrabarti, 2003). It is generally accepted that web mining is currently developing along three main research directions, related to the type of data mined: web content mining, web structure mining and web usage mining (Kosala et al., 2000). Recently, another type of data (document change, page age and information recency) has been generating research interest: it relates to a temporal dimension and allows for analyzing the growth and dynamics of the Web over time (Baeza-Yates, 2003; Cho et al., 2000; Lim et al., 2001). This categorization is merely conceptual; these areas are not mutually exclusive, and some techniques dedicated to one may use data typically associated with the others. Web content mining concerns the discovery of useful information from web page content, which is available in many different formats (Baeza-Yates, 2003): text, metadata, links, multimedia objects, hidden and dynamic pages, and semantic data. Web structure mining tries to infer knowledge from the link structure of the web (Chakrabarti et al., 1999a). Web documents typically point at related documents through links, forming a social network.
This network can be represented by a directed graph where nodes represent documents and arcs represent the links between them. The analysis of this graph is the main goal of web structure mining (Donato et al., 2000; Kumar et al., 2000). In this field, two algorithms which rank web pages according to their relevance have received special attention: PageRank (Brin et al., 1998) and Hyperlink-Induced Topic Search, or HITS (Kleinberg, 1998).
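As an illustration of the link-analysis idea, the following is a minimal sketch of PageRank computed by power iteration; the four-page graph and the damping factor of 0.85 are invented for illustration, and a production implementation would additionally handle dangling pages and test for convergence.

```python
# Simplified PageRank via power iteration on a toy link graph.
# links maps each page to the pages it points to (invented example).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start uniform
    for _ in range(iterations):
        # Every page keeps a (1 - d)/n "teleport" share ...
        new_rank = {p: (1 - d) / n for p in pages}
        # ... and distributes d * rank equally along its out-links.
        for p, outlinks in links.items():
            share = rank[p] / len(outlinks)
            for q in outlinks:
                new_rank[q] += d * share
        rank = new_rank
    return rank

ranks = pagerank(links)
# Page "C" is linked to by three pages and receives the highest score.
```

The iteration converges to the stationary distribution of a random surfer who follows links with probability d and jumps to a random page otherwise; HITS differs in computing two mutually reinforcing scores (hubs and authorities) per page.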

Web usage mining tries to explore user behavior on the web by analyzing data originating from user interaction, automatically recorded in web server logs. Applications of web usage mining usually aim to learn user profiles or navigation patterns. Web usage mining is essentially aimed at predicting the next user request based on the analysis of previous requests. Markov models are very common in modeling user requests or user paths within a site (Borges, 2000). Association rules and other standard data mining and OLAP techniques are also explored. (Cooley et al., 1997) presents an overview of the most relevant work in web usage mining [3]. IR is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant ones as possible (Rijsbergen, 1979). Some have claimed that resource or document discovery (IR) on the Web is an instance of Web content mining, while others associate web mining with intelligent IR. IR has the primary goals of indexing text and searching for useful documents in a collection, and nowadays research in IR includes modeling, document classification and categorization, user interfaces, data visualization, filtering, etc. (Baeza-Yates & Ribeiro-Neto, 1999). The task that can be considered an instance of Web mining is Web document classification or categorization, which could be used for indexing. Viewed in this respect, Web mining is part of the (Web) IR process (Kosala & Blockeel, 2000) [4].

1.4 Web Mining and Information Extraction

IE has the goal of transforming a collection of documents, usually with the help of an IR system, into information that is more readily digested and analyzed (Cowie & Lehnert, 1996). IE aims to extract relevant facts from the documents, while IR aims to select relevant documents (Pazienza, 1997). While IE is interested in the structure or representation of a document, IR views the text in a document just as a bag of unordered words (Wilks, 1997).
Thus, in general, IE works at a finer granularity level than IR does on the documents. Building IE systems manually is not feasible or scalable for such a dynamic and diverse medium as web content (Muslea, Minton & Knoblock, 1998). Due to this

nature of the Web, most IE systems focus on specific web sites to extract from. Others use machine learning or data mining techniques to learn the extraction patterns or rules for Web documents semi-automatically or automatically (Kushmerick, 1999). In this view, Web mining is used to improve Web IE (Web mining is part of IE) (Kosala & Blockeel, 2000). An example of IE without Web mining is the work of (El-Beltagy, Rafea & Abdelhamid), who built a model for automatically augmenting segmented documents with metadata, using dynamically acquired background domain knowledge, in order to assist users in easily locating information within these documents through a structured front end [5].

Web mining can be divided into four subtasks:

Information Retrieval/Resource Discovery (IR): find all relevant documents on the web. The goal of IR is to automatically find all relevant documents while at the same time filtering out the non-relevant ones. Search engines are a major tool people use to find web information. Search engines use keywords as the index to perform queries, giving users more control in searching web content. Automated programs such as crawlers and robots are used to search the web; such programs traverse the web to recursively retrieve all relevant documents. A search engine consists of three components: a crawler, which visits web sites; an index, which is updated when the crawler finds a site; and a ranking algorithm, which orders the relevant web sites. However, current search engines have a major problem, low precision, which often manifests in the irrelevance of search results.

Information Extraction (IE): automatically extract specific fragments of a document from the web resources retrieved in the IR step. Building a uniform IE system is difficult because web content is dynamic and diverse. Most IE systems use the "wrapper" technique [33] to extract specific information from a particular site.
Machine learning techniques are also used to learn the extraction rules.

Generalization: discover information patterns at the retrieved web sites. The purpose of this task is to study users' behavior and interests. Data mining

techniques such as clustering and association rules are utilized here. Several problems exist in this task: because web data are heterogeneous, imprecise and vague, it is difficult to apply conventional clustering and association rule techniques directly to the raw web data.

Analysis/Validation: analyze, interpret and validate the potential information from the discovered patterns. The objective of this task is to discover knowledge from the information provided by the former tasks. Based on web data, one can build models to simulate and validate web information [6].

1.5 Information Retrieval and the Web

The meaning of the term information retrieval can be very broad. Just getting a credit card out of your wallet so that you can type in the card number is a form of information retrieval. However, as an academic field of study, information retrieval might be defined thus: information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Defined this way, information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar professional searchers. Now the world has changed, and hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email. Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style searching. IR can also cover other kinds of data and information problems beyond those specified in the core definition above. The term unstructured data refers to data which does not have clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database, of the sort companies usually use to maintain product inventories and personnel records.
In reality, almost no data are truly unstructured. This is definitely true of all text data if you count the latent linguistic structure of human languages. But even accepting that the intended notion of structure is overt

structure, most text has structure, such as headings, paragraphs and footnotes, which is commonly represented in documents by explicit markup (such as the coding underlying web pages). IR is also used to facilitate semi-structured search, such as finding a document where the title contains Java and the body contains threading. The field of information retrieval also covers supporting users in browsing or filtering document collections, or further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents. It is similar to arranging books on a bookshelf according to their topic. Given a set of topics, standing information needs, or other categories (such as suitability of texts for different age groups), classification is the task of deciding which class(es), if any, each of a set of documents belongs to. It is often approached by first manually classifying some documents and then attempting to classify new documents automatically. Information retrieval systems can also be distinguished by the scale at which they operate, and it is useful to distinguish three prominent scales. In web search, the system has to provide search over billions of documents stored on millions of computers. Distinctive issues are the need to gather documents for indexing, the ability to build systems that work efficiently at this enormous scale, and handling particular aspects of the web, such as the exploitation of hypertext and not being fooled by site providers who manipulate page content in an attempt to boost their search engine rankings, given the commercial importance of the web [7].

1.6 The Web

The web is a public service constituted by a set of applications aimed at extracting documents from computers accessible on the Internet (the Internet is a network of computer networks).
One can also describe the web as an information repository distributed over millions of computers interconnected through the Internet (Baldi et al., 2003). The W3C defines the web in a broad way: the World Wide Web is the universe of network-accessible information, an embodiment of human

knowledge. Due to its comprehensiveness, with content related to most subjects of human activity, and its global public acceptance, at both a personal and an institutional level, the web is widely explored as an information source. The web's dimension and dynamic nature become serious drawbacks when it comes to retrieving information. Another relevant characteristic of the web is the absence of any global editorial control over its content and format. This contributes largely to the web's success, but also to a high degree of heterogeneity in content, language, structure, correctness and validity. Although the problems raised by the size of the web, around 11.5 billion pages (Gulli et al., 2005), and by its dynamics require special treatment, it seems that the major difficulties in processing web documents are generated by the lack of editorial rules and the lack of a common ontology, which would allow for unambiguous document specification and interpretation. In the absence of such normative rules, each document has to be treated as unique. In this scenario, document processing cannot be based on any underlying structure. Although HTML already involves some structure, its use is not mandatory. Therefore, the highest level of abstraction that may assure compatibility with a generic web document is the common bag-of-words (Chakrabarti, 2003). This low abstraction level is not very helpful for automatic processing and requires significant computational cost. The web is a vast and popular repository, containing information related to almost all human activities and being used to perform an ever growing set of distinct activities (bank transactions, shopping, chatting, government transactions, weather reports and geographic directions, to name just a few).
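The bag-of-words abstraction mentioned above can be sketched in a few lines; the naive lowercase/whitespace tokenization below is an assumption for illustration, not a prescribed method.

```python
from collections import Counter

def bag_of_words(text):
    # Order, markup and structure are discarded; only term
    # frequencies remain, so any document can be represented.
    tokens = text.lower().split()
    return Counter(tokens)

doc = "Web mining combines data mining and the Web"
bow = bag_of_words(doc)
# bow["mining"] == 2 and bow["web"] == 2; word order is lost.
```

Because every document, however irregular, reduces to such a multiset of terms, this is the representation that remains compatible with a generic web document.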
Despite the difficulties this medium poses to automatic as well as non-automatic processing, it has been increasingly explored and has been motivating efforts, from both academia and industry, to facilitate this exploration. Currently the web is a repository of documents, the majority of them HTML documents, that can be automatically presented to users but that lack a base model that computers might use to acquire semantic information about the objects being manipulated. The semantic web is a formal attempt by the W3C to transform the web into a huge database that might be easier to process automatically than our current syntactic

web. However, despite many initiatives on the semantic web (Lu et al., 2002), the web has its own dynamics, and web citizens are pushing the web toward the social plane. Collaborative systems, radical trust and participation are the main characteristics of Web 2.0, a new paradigm emerging since 2004 (O'Reilly, 2004).

1.7 A Retrospective View of Web Information Retrieval

In the early 1950s, technical librarianship faced a crisis. The scientific boom sparked by the Second World War had released a flood of publications, approaching a million new articles each year. Scientists could no longer stay abreast of current research by general reading alone. Papers relevant to a new project, but not previously known to the researcher, had to be retrieved at the project's outset, and the librarian had to facilitate this retrieval. A variety of cataloguing schemes had been suggested as tools for retrieval, but none had been rigorously tested for effectiveness, and all were labour-intensive to implement. In responding to technical information's rapid growth, librarians and information scientists developed the field of information retrieval. The defining discovery of the field was that complex schemes for organizing and cataloguing information into hierarchical taxonomies did little better than simply indexing the plain words occurring in the text: the crucial part of information retrieval lay in the process of retrieval. The finding that taxonomy was redundant was little short of scandalous; after all, Western information science had since Aristotle been founded on subdividing knowledge by genus and species. But the effect was liberating. Word occurrences are readily indexed by computer, and retrieval technology could be constructed on top of such indexes without having to solve deep problems in human language analysis and semantics.
Significantly, the sufficiency of word-occurrence indexing was not argued theoretically (which, after centuries of such theoretical dispute, would hardly have had an impact), but demonstrated empirically, through careful evaluation. In the mid-1990s, users of the newly-emerged web faced a crisis. The number of web sites was growing rapidly, and finding information by following a trail of links from a few popular central sites was no longer an adequate access

method. Manually curated directories such as that of Yahoo! were popular, but manual curation was expensive and scaled poorly. Experienced users could not keep up with the growth in the number of sites, even in areas of personal interest to them; and, for novice users, the task of finding useful information on the web was daunting. Faced with the mushrooming growth of the web in the second half of the 1990s, a new kind of service provider turned to the decades-old technology of information retrieval, producing the web search engine. Web search transformed information retrieval from the rarefied activity of librarians, researchers, journalist fact-checkers, and intelligence analysts, to the daily activity of almost the entire computer-enabled population. In doing so, search providers finally bridged a long-established gap between theory and practice. As early as the 1960s, researchers had developed statistical techniques for effectively retrieving and ranking documents against plain keyword queries. The retrieval technology deployed in practice, though, used logical, Boolean query languages that relied upon the patience and expertise of the querier to formulate complex query expressions precisely specifying their information need. But web users had little expertise, and less patience, for constructing complex queries. Search engines therefore turned to simple queries and sophisticated retrieval, finally deploying, on a massive scale, the techniques developed three decades earlier, so creating the modern search engine. To the surprise once more of some search technologists, simple keyword search simply worked. In an increasingly competitive search market, though, how could a provider verify the effectiveness of its search results, and compare its offering with that of its competitors?
Search technology connects simple queries with unannotated documents, relieving both the producer and the consumer of information from the complexity of matching information resources to information needs. The result is tools that allow neophyte users to find relevant information, across billions of web documents, in a fraction of a second. But in doing away with complex, formal information representations in favour of rough approximations, statistical information retrieval introduced an important problem. It is not possible to

objectively and deterministically state that an information object matches an information request, even in the terms in which the request is formulated. One can say that a document has been manually assigned a certain classification under a hierarchical taxonomy; one can even say that a document contains a Boolean combination of terms; but one cannot conclusively say that an uncategorized document meets a user's information need as expressed by a handful of keywords. The contemporary retrieval system sits at the interface between computational formalism on the one hand, and the ambiguity of human cognition on the other. There is uncertainty in what the retrieval system should do, and therefore in how correct a set of results is. The ambiguity of the retrieval task makes the question of retrieval effectiveness a crucial and contested one. Methods for evaluating effectiveness are therefore essential, in both research and deployment. Retrieval evaluation relies fundamentally on human assessment of result quality. The non-computability of effectiveness makes information retrieval a deeply empirical discipline, closer to natural or even social science than to formal computational theory. The complex, interlocked relations that connect imprecise queries, uncurated documents, and inchoate information needs are not given, but must be hypothesized and tested against observed search behavior. The importance of empirical evaluation in information retrieval has been recognized since the field began; the initial work that established the primacy of retrieval over indexing gained much of its impact from the meticulous and painstaking experimental work on which it was based. But the same scale of data that makes retrieval technology necessary also makes manual assessment costly.
While result quality can be measured by directly assessing user satisfaction with, or utility gained from, retrieval results, such direct measurement of results lists as a whole is neither reusable nor reliably repeatable. Assessing the results of any single system is time-consuming, and there are many competing retrieval algorithms, each tuned by numerous parameters. A parameter change that takes a few minutes to decide upon, and a few seconds to run, could take days to assess manually. Moreover, if each

research group produces its own, independent assessments of retrieval quality, then not only is much effort duplicated, but reproducibility is impaired and the potential for bias is introduced. And tuning nowadays is often performed automatically through machine learning; fitting a manual review stage into each learning iteration would be unworkable. The need for scale and automatability, plus the desire for repeatability and objectivity, has led the information retrieval community to develop hybrid evaluation technologies, part manual, part automated. The most important of the evaluation tools is the test collection: a corpus of documents, with a set of queries (known as topics) to run against the corpus, and judgments of which documents are (independently) relevant to each query. These relevance judgments must be formed manually, but once made, the test collection can in principle be reused indefinitely for fully automated evaluation. The result is an automated and reusable evaluation method, based on a simplified model of retrieval. Test collection evaluation has been the bedrock of retrieval research for half a century. Collection-based experimentation has grown even more in importance since the arrival, beginning in the early 1990s, of large-scale, collaboratively developed, and readily obtainable test collections. And (to judge from publicly available information) the test collection method is also core to the quality assurance and improvement methods of commercial web search engines. The practice of retrieval evaluation, though, has run well ahead of the theory. It was only at the end of the 1990s that the reliability, efficiency, and interpretability of evaluation results began to be formally investigated. The delay was in part because it was only after large-scale collaborative experiments had been running for several years that the datasets needed for a critical investigation of evaluation became available.
Initial enquiries, while foundational, tended either to be ad hoc or to apply statistical methodology developed in other areas to retrieval evaluation without considering the field's distinctive features. These omissions are currently being remedied by the research community. It is in the context of this effort for greater reliability, accuracy, robustness, and efficiency in collection-based retrieval evaluation that this thesis is presented.

Building on the foundational work in the area, and employing the large evaluation datasets now available, major advances can be made in the accuracy and comparability of evaluation scores, in the design of efficient and reliable experiments, in the extensibility of test collections in dynamic evaluation environments, and in the measurement of retrieval similarity without relevance assessment. Technical contributions can also be offered with awareness of the wider context of evaluation and of the necessity of combining experimental rigour with research innovation. The need to store and retrieve written information became increasingly important over the centuries, especially with inventions like paper and the printing press. Soon after computers were invented, people realized that they could be used for storing and mechanically retrieving large amounts of information. In 1945 Vannevar Bush published a groundbreaking article titled "As We May Think" that gave birth to the idea of automatic access to large amounts of stored knowledge [8]. In the 1950s, this idea materialized into more concrete descriptions of how archives of text could be searched automatically. Several works emerged in the mid-1950s that elaborated upon the basic idea of searching text with a computer. One of the most influential methods was described by H.P. Luhn in 1957, in which (put simply) he proposed using words as indexing units for documents and measuring word overlap as a criterion for retrieval [9]. Several key developments in the field happened in the 1960s. Most notable were the development of the SMART system by Gerard Salton and his students, first at Harvard University and later at Cornell University [10], and the Cranfield evaluations done by Cyril Cleverdon and his group at the College of Aeronautics in Cranfield [11]. The Cranfield tests developed an evaluation methodology for retrieval systems that is still in use by IR systems today.
The SMART system, on the other hand, allowed researchers to experiment with ideas to improve search quality. A system for experimentation coupled with a good evaluation methodology allowed rapid progress in the field, and paved the way for many critical developments.

The 1970s and 1980s saw many developments built on the advances of the 1960s. Various models for document retrieval were developed, and advances were made along all dimensions of the retrieval process. These new models and techniques were experimentally proven to be effective on the small text collections (several thousand articles) available to researchers at the time. However, due to the lack of large text collections, the question of whether these models and techniques would scale to larger corpora remained unanswered. This changed in 1992 with the inception of the Text Retrieval Conference, or TREC [12]. TREC is a series of evaluation conferences sponsored by various US Government agencies under the auspices of NIST, which aims at encouraging research in IR over large text collections. With large text collections available under TREC, many old techniques were modified, and many new techniques were developed (and are still being developed) for effective retrieval over large collections [13]. The evolution of IR systems may be organized into four distinct periods, with significant differences among the methods applied and the sources used during each one. During an initial period, up to the 50s, the indexing and searching processes were handled manually. Indexes were based on taxonomies or alphabetical lists of previously specified concepts. During this phase, IR systems were mainly used by librarians and scientists. During a second period, between around 1950 and the advent of the web in the early 90s, the pressure on the field and the evolution of computer and database technology allowed for significant improvements. The process went from manual to automated annotation of documents; however, indexes were still built from restricted descriptions of documents (mainly abstracts and document titles). IR was viewed as finding the right information in text databases. Operating IR systems frequently required specific training.
IR system utilization was expensive and available only to restricted groups. During a third period, covering the 90s, the processes of indexing and searching became fully automated. Full-text indexes were built; web mining evolved to explore not only content but also structure and usage. IR systems became unrestricted, cheap, widely available and widely used. From around 2000 on, in the fourth and current period, other sources of evidence have been explored in an attempt to improve system performance. Searching and browsing are the two basic IR paradigms on the web (Baeza-Yates et al., 1999). Three approaches to IR seem to have emerged (Broder et al., 2005):
- The search-centric approach argues that free search has become so good, and the search user interface so common, that users can satisfy all their needs through simple queries. Search engines follow this approach.
- The taxonomy-navigation approach claims that users have difficulty expressing their information needs; organizing information in a hierarchical structure might help them find relevant information. Directory search systems follow this approach.
- The meta-data-centric approach advocates the use of meta-data for narrowing large sets of results (multi-faceted search); third-generation search engines are trying to improve the quality of their answers by merging several sources of evidence.
IR systems also have to solve problems related to their sources and how to build their databases and indexes. Several crawling algorithms have been explored in order to overcome the problems of scale arising from the dimension of the web, such as focused crawling (Chakrabarti et al., 1999b), intelligent crawling (Aggarwal et al., 2001) and collaborative crawling (Aggarwal et al., 2004), which exploits user behavior registered in server logs.
Other approaches have also been proposed. Meta-search exploits the small overlap among search engines' indexes, sending the same query to a set of search engines and merging their answers; a few specific problems arise from this approach (Wang et al., 2003). Dynamic search engines try to deal with web dynamics; such search engines do not keep any permanent index but instead crawl for their answers at query time (Hersovici et al., 1998). Interactive search (Bruza et al., 2000) wraps a general-purpose search engine in an interface that allows users to navigate towards their goal through a query-by-navigation process. At present, IR research seems to be focused on high-quality retrieval, the integration of several sources of evidence, and multimedia retrieval [3]. TREC has also branched IR into related but important fields like retrieval of spoken information, non-English language retrieval, information filtering, user interactions with a retrieval system, and so on.

1.7 Basic Processes of Information Retrieval

There are three basic processes an information retrieval system has to support: the representation of the content of the documents, the representation of the user's information need, and the comparison of the two representations. The processes are visualized in Figure 1.3 (Croft 1993). In the figure, squared boxes represent data and rounded boxes represent processes.

Figure 1.3: Information Retrieval Process (Croft 1993)

Representing the documents is usually called the indexing process. The process takes place off-line; that is, the end user of the information retrieval system is not directly involved. The indexing process results in a formal representation of the document: the index representation or document representation. Often, full-text retrieval systems use a rather trivial algorithm to derive the index representations, for instance an algorithm that identifies the words in an English text and puts them in lower case. The indexing process may include the actual storage of the document in the system, but often documents are only stored partly, for instance only the title and abstract, plus information about the actual location of the document. The process of representing the information problem or need is often referred to as the query formulation process. The resulting formal representation is the query. In a broad sense, query formulation might denote the complete interactive dialogue between system and user, leading not only to a suitable query but possibly also to a better understanding by the user of his or her information need. In this thesis, however, query formulation generally denotes the automatic formulation of the query when there are no previously retrieved documents to guide the search, that is, the formulation of the initial query. The automatic formulation of successive queries is called relevance feedback in this thesis. The user and the system communicate the information need by queries and retrieved sets of documents, respectively. This is not the most natural form of communication: humans would use natural language to communicate an information need to one another. Such a natural language statement of the information need is called a request. Automatic query formulation takes the request as input and outputs an initial query. In practice, this means that some or all of the words in the request are converted to query terms, for instance by the rather trivial algorithm that puts words in lower case. Relevance feedback takes as input a query or a request together with some previously retrieved relevant and non-relevant documents, and outputs a successive query. The comparison of the query against the document representations is also called the matching process. The matching process results in a ranked list of relevant documents. Users will walk down this document list in search of the information they need.
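The three basic processes described above can be sketched end to end. The indexing algorithm here is the "rather trivial" one mentioned in the text (identify words and put them in lower case); the matching step uses plain term overlap, which is one simple possibility rather than a prescribed method. The documents and the request are invented.

```python
# A sketch of the three basic processes: indexing, query formulation,
# and matching.
import re

def index(text):
    """Indexing: derive a document representation as a set of lowercase words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def formulate_query(request):
    """Initial query formulation: convert the request's words to query terms."""
    return index(request)

def match(query, doc_reps):
    """Matching: rank documents by the number of query terms they contain."""
    return sorted(doc_reps, key=lambda d: len(query & doc_reps[d]), reverse=True)

docs = {"d1": "Family entertainment for all ages",
        "d2": "Corporate tax law"}
reps = {d: index(t) for d, t in docs.items()}   # off-line indexing
query = formulate_query("family entertainment") # from a natural-language request
print(match(query, reps))  # ['d1', 'd2']
```

Relevance feedback would fit in as a fourth step that takes this query plus judged documents and emits a successive query.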
Ranked retrieval will hopefully put the relevant documents somewhere near the top of the ranked list, minimizing the time the user has to invest in reading documents. Simple but effective ranking algorithms use the frequency distribution of terms over documents. For instance, the words "family" and "entertainment" mentioned in the first section occur relatively infrequently in the whole book, which indicates that this book should not receive a top ranking for the request "family entertainment". Ranking algorithms based on statistical approaches can easily halve the time the user has to spend reading documents.

Basic Models of Information Retrieval: A Brief Overview

A mathematical model of information retrieval guides the implementation of information retrieval systems. In traditional information retrieval systems, which are usually operated by professional searchers, only the matching process is automated; indexing and query formulation are manual processes. For these systems, mathematical models of information retrieval therefore only have to model the matching process. In practice, traditional information retrieval systems use the Boolean model of information retrieval.

The Boolean model

The Boolean model is an exact-matching model; that is, it either retrieves documents or it does not, without ranking them. The model supports the use of structured queries, which contain not only query terms but also relations between the terms defined by the query operators AND, OR and NOT. In modern information retrieval systems, which are usually operated by non-professional users, query formulation is automated as well. However, candidate mathematical models for these systems still only model the matching process. There are many candidate models for the matching process of ranked retrieval systems. These models are so-called approximate-matching models; that is, they use the frequency distribution of terms over documents to compute the ranking of the retrieved sets. Each of these models has its own advantages and disadvantages. However, there are two classical candidate models for approximate matching: the vector space model and the probabilistic model. They are classical models not only because they were introduced as early as the early 70s, but also because they represent classical problems in information retrieval.
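As a minimal sketch of the Boolean model's exact matching, the query operators AND, OR and NOT can be realized as set operations over an inverted index: documents are either retrieved or not, with no ranking. The tiny corpus is invented.

```python
# Boolean exact matching over an inverted index: each term maps to the
# set of documents containing it, and the operators AND, OR, NOT become
# set intersection, union, and difference.
postings = {
    "web":    {"d1", "d2"},
    "mining": {"d2", "d3"},
    "java":   {"d3"},
}

# Structured query: web AND mining AND NOT java
result = (postings["web"] & postings["mining"]) - postings["java"]
print(sorted(result))  # ['d2']

# Structured query: web OR java
result_or = postings["web"] | postings["java"]
print(sorted(result_or))  # ['d1', 'd2', 'd3']
```

Note that every document in `result` is retrieved on an equal footing; the ordering applied above is alphabetical, not a relevance ranking.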

The vector space model

The vector space model represents the problem of ranking the documents given the initial query. The vector model, probably the most commonly used, assigns real non-negative weights to index terms in documents and queries. In this model, documents are represented by vectors in a multi-dimensional Euclidean space. Each dimension in this space corresponds to a term/word contained in the document collection. The degree of similarity of a document with regard to a query is evaluated as the correlation between the vectors representing the document and the query, which can be, and usually is, quantified by the cosine of the angle between the two vectors. In the vector model, index term weights are usually obtained as a function of two factors: the term frequency factor (TF), a measure of intra-cluster similarity, computed as the number of times the term occurs in the document, normalized so as to make it independent of document length; and the inverse document frequency factor (IDF), a measure of inter-cluster dissimilarity, which weights each term according to its discriminative power in the entire collection. This model's main advantages are improvements in retrieval performance due to term weighting, and partial matching, which allows retrieval of documents that approximate the query conditions. The index term independence assumption is probably its main disadvantage.

The probabilistic model

The probabilistic model represents the problem of ranking the documents after some feedback is gathered. Probabilistic models compute the similarity between documents and queries as the odds of a document being relevant to a query. Index term weights are binary. This model ranks documents in decreasing order of their probability of being relevant, which is an advantage.
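The vector-space ranking just described can be sketched with TF-IDF weights and cosine similarity. The weighting below uses raw term frequency times log(N/df), which is one common variant rather than the only formulation; the corpus is invented.

```python
# Vector-space ranking: TF-IDF document and query vectors compared by
# the cosine of the angle between them.
import math
from collections import Counter

docs = {
    "d1": "family entertainment family fun",
    "d2": "corporate tax entertainment",
    "d3": "tax law tax court",
}

def tfidf_vector(terms, df, n_docs):
    """Weight each term by its frequency (TF) times log(N/df) (IDF)."""
    tf = Counter(terms)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df[t]}

tokenized = {d: text.split() for d, text in docs.items()}
# Document frequency: in how many documents each term appears.
df = Counter(t for terms in tokenized.values() for t in set(terms))

vectors = {d: tfidf_vector(terms, df, len(docs)) for d, terms in tokenized.items()}

def cosine(q, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in q.items())
    norm = (math.sqrt(sum(w * w for w in q.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

query = tfidf_vector("family entertainment".split(), df, len(docs))
ranking = sorted(docs, key=lambda d: cosine(query, vectors[d]), reverse=True)
print(ranking[0])  # d1
```

Partial matching falls out naturally: d2 shares only "entertainment" with the query, yet still receives a non-zero score and a rank above d3.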
The probabilistic model's main disadvantages are the need to guess the initial separation of documents into relevant and non-relevant sets, the binary weights, and the assumption that index terms are independent. From a practical point of view, the Boolean model, the vector space model and the probabilistic model represent three classical problems of information retrieval: structured queries, initial term weighting, and relevance feedback, respectively. The Boolean model provides the query operators AND, OR and NOT to formulate structured queries. The vector space model was used by Salton and his colleagues for hundreds of term weighting experiments in order to find algorithms that predict which documents the user will find relevant given the initial query (Salton and Buckley 1988). The probabilistic model provides a theory of optimum ranking if examples of relevant documents are available [14].

Evaluation of Information Retrieval Systems

Evaluation studies investigate the degree to which stated goals or expectations have been achieved, or the degree to which they can be achieved. The three major purposes given for evaluating an information retrieval system were the need for measures with which to make merit comparisons within a single test situation, the need for measures with which to make comparisons between results obtained in different test situations, and the need for assessing the merit of a real-life system. A number of studies have been conducted to measure the performance of information retrieval systems. Some criteria have been proposed by several researchers for the evaluation of information retrieval systems [CC66, LFW68, SG83]. These criteria include: coverage of the system, form of presentation of the search output, user effort, the response time of the system, and recall and precision. Retrieval effectiveness is defined in terms of retrieving relevant documents and not retrieving non-relevant documents. Two traditional measures of effectiveness are recall and precision.

Evaluation criteria

Recall indicates the ability of a system to present all relevant items or documents. In reality it may not be possible to retrieve all the relevant items from a collection, especially when the collection is large. A system may be able to retrieve only a proportion of the total relevant documents.
Thus, the performance of a system is often measured by the recall ratio, which denotes the percentage of relevant items retrieved in a given situation.

Precision implies the ability of a system to present only relevant items or documents, and therefore not to retrieve non-relevant ones. This factor, that is, how far the system is able to withhold unwanted items in a given situation, is measured in terms of the precision ratio. These two measures are denoted by the following formulas:

Recall = (number of relevant items retrieved) / (total number of relevant items in the collection)

Precision = (number of relevant items retrieved) / (total number of items retrieved)
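As a minimal sketch, the two ratios can be computed directly from a retrieved set and a set of relevance judgments; both sets here are invented for illustration.

```python
# Recall and precision computed as set ratios.
def recall(retrieved, relevant):
    """Relevant items retrieved / all relevant items in the collection."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Relevant items retrieved / all items retrieved."""
    return len(retrieved & relevant) / len(retrieved)

relevant = {"d1", "d2", "d3", "d4"}   # judged relevant in the collection
retrieved = {"d1", "d2", "d5"}        # what the system returned

print(recall(retrieved, relevant))             # 0.5
print(round(precision(retrieved, relevant), 3))  # 0.667
```

Here the system found two of the four relevant documents (recall 0.5), and two of its three answers were relevant (precision about 0.667), illustrating how the two measures capture complementary aspects of effectiveness.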


International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014.

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014. A B S T R A C T International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Information Retrieval Models and Searching Methodologies: Survey Balwinder Saini*,Vikram Singh,Satish

More information

Data Warehousing. Ritham Vashisht, Sukhdeep Kaur and Shobti Saini

Data Warehousing. Ritham Vashisht, Sukhdeep Kaur and Shobti Saini Advance in Electronic and Electric Engineering. ISSN 2231-1297, Volume 3, Number 6 (2013), pp. 669-674 Research India Publications http://www.ripublication.com/aeee.htm Data Warehousing Ritham Vashisht,

More information

A Study of Future Internet Applications based on Semantic Web Technology Configuration Model

A Study of Future Internet Applications based on Semantic Web Technology Configuration Model Indian Journal of Science and Technology, Vol 8(20), DOI:10.17485/ijst/2015/v8i20/79311, August 2015 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 A Study of Future Internet Applications based on

More information

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky The Chinese University of Hong Kong Abstract Husky is a distributed computing system, achieving outstanding

More information

Taccumulation of the social network data has raised

Taccumulation of the social network data has raised International Journal of Advanced Research in Social Sciences, Environmental Studies & Technology Hard Print: 2536-6505 Online: 2536-6513 September, 2016 Vol. 2, No. 1 Review Social Network Analysis and

More information

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining. About the Tutorial Data Mining is defined as the procedure of extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data. The tutorial starts

More information

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms A Framework for adaptive focused web crawling and information retrieval using genetic algorithms Kevin Sebastian Dept of Computer Science, BITS Pilani kevseb1993@gmail.com 1 Abstract The web is undeniably

More information

Competitive Intelligence and Web Mining:

Competitive Intelligence and Web Mining: Competitive Intelligence and Web Mining: Domain Specific Web Spiders American University in Cairo (AUC) CSCE 590: Seminar1 Report Dr. Ahmed Rafea 2 P age Khalid Magdy Salama 3 P age Table of Contents Introduction

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

Part I: Future Internet Foundations: Architectural Issues

Part I: Future Internet Foundations: Architectural Issues Part I: Future Internet Foundations: Architectural Issues Part I: Future Internet Foundations: Architectural Issues 3 Introduction The Internet has evolved from a slow, person-to-machine, communication

More information

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia Empowering People with Knowledge the Next Frontier for Web Search Wei-Ying Ma Assistant Managing Director Microsoft Research Asia Important Trends for Web Search Organizing all information Addressing user

More information

Dynamic Visualization of Hubs and Authorities during Web Search

Dynamic Visualization of Hubs and Authorities during Web Search Dynamic Visualization of Hubs and Authorities during Web Search Richard H. Fowler 1, David Navarro, Wendy A. Lawrence-Fowler, Xusheng Wang Department of Computer Science University of Texas Pan American

More information

Multimedia Information Systems

Multimedia Information Systems Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive

More information

CE4031 and CZ4031 Database System Principles

CE4031 and CZ4031 Database System Principles CE431 and CZ431 Database System Principles Course CE/CZ431 Course Database System Principles CE/CZ21 Algorithms; CZ27 Introduction to Databases CZ433 Advanced Data Management (not offered currently) Lectures

More information

Adaptable and Adaptive Web Information Systems. Lecture 1: Introduction

Adaptable and Adaptive Web Information Systems. Lecture 1: Introduction Adaptable and Adaptive Web Information Systems School of Computer Science and Information Systems Birkbeck College University of London Lecture 1: Introduction George Magoulas gmagoulas@dcs.bbk.ac.uk October

More information

Automatic Identification of User Goals in Web Search [WWW 05]

Automatic Identification of User Goals in Web Search [WWW 05] Automatic Identification of User Goals in Web Search [WWW 05] UichinLee @ UCLA ZhenyuLiu @ UCLA JunghooCho @ UCLA Presenter: Emiran Curtmola@ UC San Diego CSE 291 4/29/2008 Need to improve the quality

More information

Generalized Document Data Model for Integrating Autonomous Applications

Generalized Document Data Model for Integrating Autonomous Applications 6 th International Conference on Applied Informatics Eger, Hungary, January 27 31, 2004. Generalized Document Data Model for Integrating Autonomous Applications Zsolt Hernáth, Zoltán Vincellér Abstract

More information

Data Mining and Warehousing

Data Mining and Warehousing Data Mining and Warehousing Sangeetha K V I st MCA Adhiyamaan College of Engineering, Hosur-635109. E-mail:veerasangee1989@gmail.com Rajeshwari P I st MCA Adhiyamaan College of Engineering, Hosur-635109.

More information

RETRACTED ARTICLE. Web-Based Data Mining in System Design and Implementation. Open Access. Jianhu Gong 1* and Jianzhi Gong 2

RETRACTED ARTICLE. Web-Based Data Mining in System Design and Implementation. Open Access. Jianhu Gong 1* and Jianzhi Gong 2 Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2014, 6, 1907-1911 1907 Web-Based Data Mining in System Design and Implementation Open Access Jianhu

More information

Semantic Clickstream Mining

Semantic Clickstream Mining Semantic Clickstream Mining Mehrdad Jalali 1, and Norwati Mustapha 2 1 Department of Software Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran 2 Department of Computer Science, Universiti

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm K.Parimala, Assistant Professor, MCA Department, NMS.S.Vellaichamy Nadar College, Madurai, Dr.V.Palanisamy,

More information

Managing Change and Complexity

Managing Change and Complexity Managing Change and Complexity The reality of software development Overview Some more Philosophy Reality, representations and descriptions Some more history Managing complexity Managing change Some more

More information

The main website for Henrico County, henrico.us, received a complete visual and structural

The main website for Henrico County, henrico.us, received a complete visual and structural Page 1 1. Program Overview The main website for Henrico County, henrico.us, received a complete visual and structural overhaul, which was completed in May of 2016. The goal of the project was to update

More information

INFORMATION TECHNOLOGY COURSE OBJECTIVE AND OUTCOME

INFORMATION TECHNOLOGY COURSE OBJECTIVE AND OUTCOME INFORMATION TECHNOLOGY COURSE OBJECTIVE AND OUTCOME CO-1 Programming fundamental using C The purpose of this course is to introduce to students to the field of programming using C language. The students

More information

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models

More information

is easing the creation of new ontologies by promoting the reuse of existing ones and automating, as much as possible, the entire ontology

is easing the creation of new ontologies by promoting the reuse of existing ones and automating, as much as possible, the entire ontology Preface The idea of improving software quality through reuse is not new. After all, if software works and is needed, just reuse it. What is new and evolving is the idea of relative validation through testing

More information

Ontology Based Prediction of Difficult Keyword Queries

Ontology Based Prediction of Difficult Keyword Queries Ontology Based Prediction of Difficult Keyword Queries Lubna.C*, Kasim K Pursuing M.Tech (CSE)*, Associate Professor (CSE) MEA Engineering College, Perinthalmanna Kerala, India lubna9990@gmail.com, kasim_mlp@gmail.com

More information

Information Retrieval May 15. Web retrieval

Information Retrieval May 15. Web retrieval Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

An Analysis of Image Retrieval Behavior for Metadata Type and Google Image Database

An Analysis of Image Retrieval Behavior for Metadata Type and Google Image Database An Analysis of Image Retrieval Behavior for Metadata Type and Google Image Database Toru Fukumoto Canon Inc., JAPAN fukumoto.toru@canon.co.jp Abstract: A large number of digital images are stored on the

More information

Component-Based Software Engineering TIP

Component-Based Software Engineering TIP Component-Based Software Engineering TIP X LIU, School of Computing, Napier University This chapter will present a complete picture of how to develop software systems with components and system integration.

More information

Challenges of Analyzing Parametric CFD Results. White Paper Published: January

Challenges of Analyzing Parametric CFD Results. White Paper Published: January Challenges of Analyzing Parametric CFD Results White Paper Published: January 2011 www.tecplot.com Contents Introduction... 3 Parametric CFD Analysis: A Methodology Poised for Growth... 4 Challenges of

More information

Chapter 6 Evaluation Metrics and Evaluation

Chapter 6 Evaluation Metrics and Evaluation Chapter 6 Evaluation Metrics and Evaluation The area of evaluation of information retrieval and natural language processing systems is complex. It will only be touched on in this chapter. First the scientific

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

MULTIMEDIA TECHNOLOGIES FOR THE USE OF INTERPRETERS AND TRANSLATORS. By Angela Carabelli SSLMIT, Trieste

MULTIMEDIA TECHNOLOGIES FOR THE USE OF INTERPRETERS AND TRANSLATORS. By Angela Carabelli SSLMIT, Trieste MULTIMEDIA TECHNOLOGIES FOR THE USE OF INTERPRETERS AND TRANSLATORS By SSLMIT, Trieste The availability of teaching materials for training interpreters and translators has always been an issue of unquestionable

More information

Peer-to-Peer Systems. Chapter General Characteristics

Peer-to-Peer Systems. Chapter General Characteristics Chapter 2 Peer-to-Peer Systems Abstract In this chapter, a basic overview is given of P2P systems, architectures, and search strategies in P2P systems. More specific concepts that are outlined include

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Effective Knowledge Navigation For Problem Solving. Using Heterogeneous Content Types

Effective Knowledge Navigation For Problem Solving. Using Heterogeneous Content Types From: AAAI Technical Report WS-97-09. Compilation copyright 1997, AAAI (www.aaai.org). All rights reserved. Effective Navigation For Problem Solving Using Heterogeneous Content Types Ralph Barletta and

More information

Patent documents usecases with MyIntelliPatent. Alberto Ciaramella IntelliSemantic 25/11/2012

Patent documents usecases with MyIntelliPatent. Alberto Ciaramella IntelliSemantic 25/11/2012 Patent documents usecases with MyIntelliPatent Alberto Ciaramella IntelliSemantic 25/11/2012 Objectives and contents of this presentation This presentation: identifies and motivates the most significant

More information

Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical application of VOD

Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical application of VOD World Transactions on Engineering and Technology Education Vol.13, No.3, 2015 2015 WIETE Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical

More information

An Empirical Evaluation of User Interfaces for Topic Management of Web Sites

An Empirical Evaluation of User Interfaces for Topic Management of Web Sites An Empirical Evaluation of User Interfaces for Topic Management of Web Sites Brian Amento AT&T Labs - Research 180 Park Avenue, P.O. Box 971 Florham Park, NJ 07932 USA brian@research.att.com ABSTRACT Topic

More information

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing

More information

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes CS630 Representing and Accessing Digital Information Information Retrieval: Retrieval Models Information Retrieval Basics Data Structures and Access Indexing and Preprocessing Retrieval Models Thorsten

More information

Information Discovery, Extraction and Integration for the Hidden Web

Information Discovery, Extraction and Integration for the Hidden Web Information Discovery, Extraction and Integration for the Hidden Web Jiying Wang Department of Computer Science University of Science and Technology Clear Water Bay, Kowloon Hong Kong cswangjy@cs.ust.hk

More information

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur 603 203 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER CS6007-INFORMATION RETRIEVAL Regulation 2013 Academic Year 2018

More information

Mission-Critical Customer Service. 10 Best Practices for Success

Mission-Critical  Customer Service. 10 Best Practices for Success Mission-Critical Email Customer Service 10 Best Practices for Success Introduction When soda cans and chocolate wrappers start carrying email contact information, you know that email-based customer service

More information

6 TOOLS FOR A COMPLETE MARKETING WORKFLOW

6 TOOLS FOR A COMPLETE MARKETING WORKFLOW 6 S FOR A COMPLETE MARKETING WORKFLOW 01 6 S FOR A COMPLETE MARKETING WORKFLOW FROM ALEXA DIFFICULTY DIFFICULTY MATRIX OVERLAP 6 S FOR A COMPLETE MARKETING WORKFLOW 02 INTRODUCTION Marketers use countless

More information

Thanks to our Sponsors

Thanks to our Sponsors Thanks to our Sponsors A brief history of Protégé 1987 PROTÉGÉ runs on LISP machines 1992 PROTÉGÉ-II runs under NeXTStep 1995 Protégé/Win runs under guess! 2000 Protégé-2000 runs under Java 2005 Protégé

More information

2 The IBM Data Governance Unified Process

2 The IBM Data Governance Unified Process 2 The IBM Data Governance Unified Process The benefits of a commitment to a comprehensive enterprise Data Governance initiative are many and varied, and so are the challenges to achieving strong Data Governance.

More information

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE Ms.S.Muthukakshmi 1, R. Surya 2, M. Umira Taj 3 Assistant Professor, Department of Information Technology, Sri Krishna College of Technology, Kovaipudur,

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

Web Usage Mining using ART Neural Network. Abstract

Web Usage Mining using ART Neural Network. Abstract Web Usage Mining using ART Neural Network Ms. Parminder Kaur, Lecturer CSE Department MGM s Jawaharlal Nehru College of Engineering, N-1, CIDCO, Aurangabad 431003 & Ms. Ruhi M. Oberoi, Lecturer CSE Department

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information