

Chapter 1
Web-Mining and Information Retrieval

1.1 Introduction

The World Wide Web, or simply the web, may be seen as a huge collection of documents freely produced and published by a very large number of people, without any solid editorial control. It is probably the most democratic and anarchic widespread medium for anyone to express feelings, comments, convictions and ideas, independently of ethnicity, sex, religion or any other characteristic of human societies. The web constitutes a comprehensive, dynamic, permanently up-to-date repository of information covering most areas of human knowledge (Hu, 2002), and it supports an increasingly important part of commercial, artistic, scientific and personal transactions, which gives rise to very strong interest from individuals as well as institutions, at a universal scale. However, the web also exhibits characteristics that hinder the process of collecting information from it to satisfy specific needs: the large volume of data it contains, its dynamic nature, its largely unstructured or semi-structured data, content and format heterogeneity, and irregular data quality. End-users introduce additional difficulties into the retrieval process: information needs are often imprecisely defined, generating a semantic gap between user needs and their specification. The satisfaction of a specific information need on the web is supported by search engines and other tools aimed at helping users gather information from the web. The user is usually not assisted in the subsequent tasks of organizing, analyzing and exploring the answers produced. These answers are usually flat lists of large sets of web pages which demand significant user effort to explore. Satisfying information needs on the web is usually seen as an ephemeral one-step process of information search (the traditional search engine paradigm).
Given these characteristics, it is highly demanding to satisfy private or institutional information needs on the web. The web itself, and the interests it promotes, are growing and changing rapidly, at a global scale, both as a medium of

divulgation and dissemination and also as a source of generic and specialized information. Web users have already realized the potential of this huge information source and use it for many purposes, mainly to satisfy specific information needs. Simultaneously, the web provides a ubiquitous environment for executing many activities, regardless of place and time.

1.2 Web Mining

Web mining is a very active research topic which combines two lively research areas: Data Mining and the World Wide Web. Web mining research relates to several research communities, such as Databases, Information Retrieval and Artificial Intelligence [1]. Web mining is defined by [Coo97] as the discovery and analysis of useful information from the WWW. It is used to extract interesting and potentially useful patterns and implicit information from artefacts or activity related to the WWW. Web mining in relation to other forms of data mining and retrieval is illustrated in Figure 1.1. The diagram demonstrates that web mining is performed on an unstructured source, i.e. web sites.

Figure 1.1: Web mining in relation to other forms of data mining and retrieval

1.2.1 Web Content Mining

Web content mining is the automatic search of information resources available online [Coo97]. As a process, web content mining goes beyond keyword extraction, since web documents present no machine-readable semantics. The two groups of web content mining approaches concentrate on different aspects: the agent-based approach directly mines document contents, while the database approach improves the search strategy of the search engine with regard to the database it uses.

1.2.2 Web Structure Mining

While web content mining focuses on the internal structure of a web document, web structure mining tries to discover the link structure of the hyperlinks at the inter-document level.

1.2.3 Web Usage Mining

Web usage mining is defined as the discovery of user access patterns from web servers. Web servers record and accumulate user interaction data each time a user makes a request for resources. Analyzing these web access logs can reveal patterns in a user's browsing habits through the web server [2].

Figure 1.2: Taxonomy of Web Mining

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services. Web mining

methodologies can generally be classified into one of three distinct categories: Web structure, Web content, and Web usage mining. The goal of Web structure mining is to categorize Web pages and generate information such as the similarity and relationships between them, taking advantage of their hyperlink topology. In recent years, the area of Web structure mining has focused on the identification of authorities, i.e. pages that are considered important sources of information by many people in the Web community. Web content mining has to do with the retrieval of information (content) available on the Web into more structured forms, as well as its indexing for easy tracking of information locations. Web content may be unstructured (plain text), semi-structured (HTML documents), or structured (extracted from databases into dynamic Web pages). Such dynamic data cannot be indexed and constitute what is called the hidden Web. A research area closely related to content mining is text mining. Web content mining is nowadays strongly interrelated with Web structure mining, since both are usually used in combination for extracting and organizing information from the Web. Web content mining provides methods enabling the automated discovery, retrieval, organization, and management of the vast amount of information and resources available on the Web. Cooley et al. [CMS97] categorize the main research efforts in the area of content mining into two approaches: the Information Retrieval (IR) approach and the Database (DB) approach. The IR approach involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize Web-based information. Web usage mining is the process of identifying browsing patterns by analyzing the user's navigational behavior. It takes as input the usage data, i.e. the data residing in the Web server logs, recording the visits of users to a Web site.
Extensive research in the area of Web usage mining has led to the appearance of a related research area, that of Web personalization. Web personalization utilizes the results produced by Web usage mining in order to dynamically provide recommendations to each user.

Web mining is moving the World Wide Web toward a more useful environment in which users can quickly and easily find the information they need. It includes the discovery and analysis of data, documents, and multimedia from the World Wide Web. Web mining uses document content, hyperlink structure, and usage statistics to assist users in meeting their information needs. The Web itself and search engines contain relationship information about documents. Web mining is the discovery of these relationships and is accomplished within three sometimes overlapping areas. Content mining is the first. Search engines define content by keywords. Finding a page's keywords, and finding the relationship between a Web page's content and a user's query content, is content mining. Hyperlinks provide information about other documents on the Web thought to be important to a given document. These links add depth to the document, providing the multi-dimensionality that characterizes the Web. Mining this link structure is the second area of Web mining. Finally, there are relationships to other documents on the Web that are identified by previous searches. These relationships are recorded in logs of searches and accesses. Mining these logs is the third area of Web mining. Understanding the user is also an important part of Web mining. Analysis of the user's previous sessions, preferred display of information, and expressed preferences may influence the Web pages returned in response to a query. Web mining is interdisciplinary in nature, spanning fields such as information retrieval, natural language processing, information extraction, machine learning, databases, data mining, data warehousing, user interface design, and visualization. Techniques for mining the Web have practical applications in m-commerce, e-commerce, e-government, e-learning, distance learning, organizational learning, virtual organizations, knowledge management, and digital libraries.
1.3 Web Mining and Information Retrieval

Web IR is the application of IR to the web. In classical IR, users specify queries, in some query language, representing their information needs. The

system selects the set of documents in its collection that seem most relevant to the query and presents them to the user. Users may then refine their queries to improve the answer. In the web environment, user intents are not as static and stable as they usually are in traditional IR. On the web, the information need is associated with a given task (Broder, 2002) that is not known in advance and may differ greatly from user to user, even if the query specification is the same. The identification of this task, and the mental process of deriving a query from an information need, are crucial aspects of web IR. Web IR is related to web mining, the automatic discovery of interesting and valuable information from the web (Chakrabarti, 2003). It is generally accepted that web mining is currently developing along three main research directions, related to the type of data mined: web content mining, web structure mining and web usage mining (Kosala et al., 2000). Recently, another type of data (document change, page age and information recency) has been generating research interest: it relates to a temporal dimension and allows for analyzing the growth and dynamics of the Web over time (Baeza-Yates, 2003; Cho et al., 2000; Lim et al., 2001). This categorization is merely conceptual; these areas are not mutually exclusive, and some techniques dedicated to one may use data typically associated with the others. Web content mining concerns the discovery of useful information from web page content, which is available in many different formats (Baeza-Yates, 2003): text, metadata, links, multimedia objects, hidden and dynamic pages, and semantic data. Web structure mining tries to infer knowledge from the link structure of the web (Chakrabarti et al., 1999a). Web documents typically point at related documents through links, forming a social network.
This network can be represented by a directed graph where nodes represent documents and arcs represent the links between them. The analysis of this graph is the main goal of web structure mining (Donato et al., 2000; Kumar et al., 2000). In this field, two algorithms which rank web pages according to their relevance have received special attention: PageRank (Brin et al., 1998) and Hyperlink-Induced Topic Search, or HITS (Kleinberg, 1998).
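As an illustration of the link-analysis idea, the following is a minimal sketch of PageRank computed by power iteration; the four-page graph and the damping factor of 0.85 are invented for illustration, and a production implementation would additionally handle dangling pages and test for convergence.

```python
# Simplified PageRank via power iteration on a toy link graph.
# links maps each page to the pages it points to (invented example).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start uniform
    for _ in range(iterations):
        # Every page keeps a (1 - d)/n "teleport" share ...
        new_rank = {p: (1 - d) / n for p in pages}
        # ... and distributes d * rank equally along its out-links.
        for p, outlinks in links.items():
            share = rank[p] / len(outlinks)
            for q in outlinks:
                new_rank[q] += d * share
        rank = new_rank
    return rank

ranks = pagerank(links)
# Page "C" is linked to by three pages and receives the highest score.
```

The iteration converges to the stationary distribution of a random surfer who follows links with probability d and jumps to a random page otherwise; HITS differs in computing two mutually reinforcing scores (hubs and authorities) per page.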

Web usage mining tries to explore user behavior on the web by analyzing data originating from user interaction, automatically recorded in web server logs. Applications of web usage mining usually aim to learn user profiles or navigation patterns. Web usage mining is essentially aimed at predicting the next user request based on the analysis of previous requests. Markov models are very common in modeling user requests or user paths within a site (Borges, 2000). Association rules and other standard data mining and OLAP techniques are also explored. (Cooley et al., 1997) presents an overview of the most relevant work in web usage mining [3]. IR is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant ones as possible (Rijsbergen, 1979). Some have claimed that resource or document discovery (IR) on the Web is an instance of Web content mining, while others associate web mining with intelligent IR. IR has the primary goals of indexing text and searching for useful documents in a collection, and nowadays research in IR includes modeling, document classification and categorization, user interfaces, data visualization, filtering, etc. (Baeza-Yates & Ribeiro-Neto, 1999). The task that can be considered an instance of Web mining is Web document classification or categorization, which could be used for indexing. Viewed in this respect, Web mining is part of the (Web) IR process (Kosala & Blockeel, 2000) [4].

1.4 Web Mining and Information Extraction

IE has the goal of transforming a collection of documents, usually with the help of an IR system, into information that is more readily digested and analyzed (Cowie & Lehnert, 1996). IE aims to extract relevant facts from the documents, while IR aims to select relevant documents (Pazienza, 1997). While IE is interested in the structure or representation of a document, IR views the text in a document just as a bag of unordered words (Wilks, 1997).
Thus, in general, IE works at a finer granularity level than IR does on the documents. Building IE systems manually is not feasible or scalable for such a dynamic and diverse medium as web content (Muslea, Minton & Knoblock, 1998). Due to this

nature of the Web, most IE systems focus on specific web sites to extract from. Others use machine learning or data mining techniques to learn the extraction patterns or rules for Web documents semi-automatically or automatically (Kushmerick, 1999). In this view, Web mining is used to improve Web IE (Web mining is part of IE) (Kosala & Blockeel, 2000). An example of IE without Web mining is the work of (El-Beltagy, Rafea & Abdelhamid), who built a model for automatically augmenting segmented documents with metadata, using dynamically acquired background domain knowledge, in order to assist users in easily locating information within these documents through a structured front end [5].

Web mining can be divided into four subtasks:

Information Retrieval/Resource Discovery (IR): find all relevant documents on the web. The goal of IR is to automatically find all relevant documents while at the same time filtering out the non-relevant ones. Search engines are a major tool people use to find web information. Search engines use keywords as the index to perform queries, giving users more control in searching web content. Automated programs such as crawlers and robots are used to search the web; such programs traverse the web to recursively retrieve all relevant documents. A search engine consists of three components: a crawler, which visits web sites; an index, which is updated when the crawler finds a site; and a ranking algorithm, which orders the relevant web sites. However, current search engines have a major problem, low precision, which often manifests in the irrelevance of search results.

Information Extraction (IE): automatically extract specific fragments of a document from the web resources retrieved in the IR step. Building a uniform IE system is difficult because web content is dynamic and diverse. Most IE systems use the "wrapper" technique [33] to extract specific information from a particular site.
Machine learning techniques are also used to learn the extraction rules.

Generalization: discover information patterns at the retrieved web sites. The purpose of this task is to study users' behavior and interests. Data mining

techniques such as clustering and association rules are utilized here. Several problems exist in this task: because web data are heterogeneous, imprecise and vague, it is difficult to apply conventional clustering and association rule techniques directly to the raw web data.

Analysis/Validation: analyze, interpret and validate the potential information from the discovered patterns. The objective of this task is to discover knowledge from the information provided by the former tasks. Based on web data, one can build models to simulate and validate web information [6].

1.5 Information Retrieval and the Web

The meaning of the term information retrieval can be very broad. Just getting a credit card out of your wallet so that you can type in the card number is a form of information retrieval. However, as an academic field of study, information retrieval might be defined thus: information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Defined this way, information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar professional searchers. Now the world has changed, and hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email. Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style searching. IR can also cover other kinds of data and information problems beyond those specified in the core definition above. The term unstructured data refers to data which does not have clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database, of the sort companies usually use to maintain product inventories and personnel records.
In reality, almost no data are truly unstructured. This is definitely true of all text data if you count the latent linguistic structure of human languages. But even accepting that the intended notion of structure is overt

structure, most text has structure, such as headings, paragraphs and footnotes, which is commonly represented in documents by explicit markup (such as the coding underlying web pages). IR is also used to facilitate semi-structured search, such as finding a document where the title contains Java and the body contains threading. The field of information retrieval also covers supporting users in browsing or filtering document collections, or further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents. It is similar to arranging books on a bookshelf according to their topic. Given a set of topics, standing information needs, or other categories (such as suitability of texts for different age groups), classification is the task of deciding which class(es), if any, each of a set of documents belongs to. It is often approached by first manually classifying some documents and then attempting to classify new documents automatically. Information retrieval systems can also be distinguished by the scale at which they operate, and it is useful to distinguish three prominent scales. In web search, the system has to provide search over billions of documents stored on millions of computers. Distinctive issues are the need to gather documents for indexing, the ability to build systems that work efficiently at this enormous scale, and handling particular aspects of the web, such as the exploitation of hypertext and not being fooled by site providers who manipulate page content in an attempt to boost their search engine rankings, given the commercial importance of the web [7].

1.6 The Web

The web is a public service constituted by a set of applications aimed at extracting documents from computers accessible on the Internet (the Internet is a network of computer networks).
One can also describe the web as an information repository distributed over millions of computers interconnected through the Internet (Baldi et al., 2003). The W3C defines the web in a broad way: the World Wide Web is the universe of network-accessible information, an embodiment of human

knowledge. Due to its comprehensiveness, with content related to most subjects of human activity, and its global public acceptance, at both a personal and an institutional level, the web is widely explored as an information source. The web's dimension and dynamic nature become serious drawbacks when it comes to retrieving information. Another relevant characteristic of the web is the absence of any global editorial control over its content and format. This contributes largely to the web's success, but also to a high degree of heterogeneity in content, language, structure, correctness and validity. Although the problems raised by the size of the web, around 11.5 billion pages (Gulli et al., 2005), and by its dynamics require special treatment, it seems that the major difficulties in processing web documents are generated by the lack of editorial rules and the lack of a common ontology, which would allow for unambiguous document specification and interpretation. In the absence of such normative rules, each document has to be treated as unique. In this scenario, document processing cannot be based on any underlying structure. Although HTML already involves some structure, its use is not mandatory. Therefore, the highest level of abstraction that may assure compatibility with a generic web document is the common bag-of-words (Chakrabarti, 2003). This low abstraction level is not very helpful for automatic processing and requires significant computational cost. The web is a vast and popular repository, containing information related to almost all human activities and being used to perform an ever growing set of distinct activities (bank transactions, shopping, chatting, government transactions, weather reports and geographic directions, to name just a few).
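The bag-of-words abstraction mentioned above can be sketched in a few lines; the naive lowercase/whitespace tokenization below is an assumption for illustration, not a prescribed method.

```python
from collections import Counter

def bag_of_words(text):
    # Order, markup and structure are discarded; only term
    # frequencies remain, so any document can be represented.
    tokens = text.lower().split()
    return Counter(tokens)

doc = "Web mining combines data mining and the Web"
bow = bag_of_words(doc)
# bow["mining"] == 2 and bow["web"] == 2; word order is lost.
```

Because every document, however irregular, reduces to such a multiset of terms, this is the representation that remains compatible with a generic web document.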
Despite the difficulties this medium poses to automatic as well as non-automatic processing, it has been increasingly explored and has been motivating efforts, from both academia and industry, to facilitate this exploration. Currently the web is a repository of documents, the majority of them HTML documents, that can be automatically presented to users but that lack a base model that computers might use to acquire semantic information about the objects being manipulated. The semantic web is a formal attempt by the W3C to transform the web into a huge database that might be easier to process automatically than our current syntactic

web. However, despite many initiatives on the semantic web (Lu et al., 2002), the web has its own dynamics, and web citizens are pushing the web toward the social plane. Collaborative systems, radical trust and participation are the main characteristics of Web 2.0, a new paradigm emerging since 2004 (O'Reilly, 2004).

1.7 A Retrospective View of Web Information Retrieval

In the early 1950s, technical librarianship faced a crisis. The scientific boom sparked by the Second World War had released a flood of publications, approaching a million new articles each year. Scientists could no longer stay abreast of current research by general reading alone. Papers relevant to a new project, but not previously known to the researcher, had to be retrieved at the project's outset, and the librarian had to facilitate this retrieval. A variety of cataloguing schemes had been suggested as tools for retrieval, but none had been rigorously tested for effectiveness, and all were labour-intensive to implement. In responding to technical information's rapid growth, librarians and information scientists developed the field of information retrieval. The defining discovery of the field was that complex schemes for organizing and cataloguing information into hierarchical taxonomies did little better than simply indexing the plain words occurring in the text: the crucial part of information retrieval lay in the process of retrieval. The finding that taxonomy was redundant was little short of scandalous; after all, Western information science had since Aristotle been founded on subdividing knowledge by genus and species. But the effect was liberating. Word occurrences are readily indexed by computer, and retrieval technology could be constructed on top of such indexes without having to solve deep problems in human language analysis and semantics.
Significantly, the sufficiency of word-occurrence indexing was not argued theoretically (which, after centuries of such theoretical dispute, would hardly have had an impact), but demonstrated empirically, through careful evaluation. In the mid-1990s, users of the newly-emerged web faced a crisis. The number of web sites was growing rapidly, and finding information by following a trail of links from a few popular central sites was no longer an adequate access

method. Manually curated directories such as that of Yahoo! were popular, but manual curation was expensive and scaled poorly. Experienced users could not keep up with the growth in the number of sites, even in areas of personal interest to them; and, for novice users, the task of finding useful information on the web was daunting. Faced with the mushrooming growth of the web in the second half of the 1990s, a new kind of service provider turned to the decades-old technology of information retrieval, producing the web search engine. Web search transformed information retrieval from the rarefied activity of librarians, researchers, journalist fact-checkers, and intelligence analysts, to the daily activity of almost the entire computer-enabled population. In doing so, search providers finally bridged a long-established gap between theory and practice. As early as the 1960s, researchers had developed statistical techniques for effectively retrieving and ranking documents against plain keyword queries. The retrieval technology deployed in practice, though, used logical, Boolean query languages that relied upon the patience and expertise of the querier to formulate complex query expressions precisely specifying their information need. But web users had little expertise, and less patience, for constructing complex queries. Search engines therefore turned to simple queries and sophisticated retrieval, finally deploying, on a massive scale, the techniques developed three decades earlier, so creating the modern search engine. To the surprise once more of some search technologists, simple keyword search simply worked. In an increasingly competitive search market, though, how could a provider verify the effectiveness of its search results, and compare its offering with that of its competitors?
Search technology connects simple queries with unannotated documents, relieving both the producer and the consumer of information from the complexity of matching information resources to information needs. The result is tools that allow neophyte users to find relevant information, across billions of web documents, in a fraction of a second. But in doing away with complex, formal information representations in favour of rough approximations, statistical information retrieval introduced an important problem. It is not possible to

objectively and deterministically state that an information object matches an information request, even in the terms in which the request is formulated. One can say that a document has been manually assigned a certain classification under a hierarchical taxonomy; one can even say that a document contains a Boolean combination of terms; but one cannot conclusively say that an uncategorized document meets a user's information need as expressed by a handful of keywords. The contemporary retrieval system sits at the interface between computational formalism on the one hand, and the ambiguity of human cognition on the other. There is uncertainty in what the retrieval system should do, and therefore in how correct a set of results is. The ambiguity of the retrieval task makes the question of retrieval effectiveness a crucial and contested one. Methods for evaluating effectiveness are therefore essential, in both research and deployment. Retrieval evaluation relies fundamentally on human assessment of result quality. The non-computability of effectiveness makes information retrieval a deeply empirical discipline, closer to natural or even social science than to formal computational theory. The complex, interlocked relations that connect imprecise queries, uncurated documents, and inchoate information needs are not given, but must be hypothesized and tested against observed search behavior. The importance of empirical evaluation in information retrieval has been recognized since the field began; the initial work that established the primacy of retrieval over indexing gained much of its impact from the meticulous and painstaking experimental work on which it was based. But the same scale of data that makes retrieval technology necessary also makes manual assessment costly.
While result quality can be measured by directly assessing user satisfaction with, or utility gained from, retrieval results, such direct measurement of results lists as a whole is neither reusable nor reliably repeatable. Assessing the results of any single system is time-consuming, and there are many competing retrieval algorithms, each tuned by numerous parameters. A parameter change that takes a few minutes to decide upon, and a few seconds to run, could take days to assess manually. Moreover, if each

research group produces its own, independent assessments of retrieval quality, then not only is much effort duplicated, but reproducibility is impaired and the potential for bias is introduced. And tuning nowadays is often performed automatically through machine learning; fitting a manual review stage into each learning iteration would be unworkable. The need for scale and automatability, plus the desire for repeatability and objectivity, has led the information retrieval community to develop hybrid evaluation technologies, part manual, part automated. The most important of the evaluation tools is the test collection: a corpus of documents, with a set of queries (known as topics) to run against the corpus, and judgments of which documents are (independently) relevant to each query. These relevance judgments must be formed manually, but once made, the test collection can in principle be reused indefinitely for fully automated evaluation. The result is an automated and reusable evaluation method, based on a simplified model of retrieval. Test collection evaluation has been the bedrock of retrieval research for half a century. Collection-based experimentation has grown even more in importance since the arrival, beginning in the early 1990s, of large-scale, collaboratively developed, and readily obtainable test collections. And (to judge from publicly available information) the test collection method is also core to the quality assurance and improvement methods of commercial web search engines. The practice of retrieval evaluation, though, has run well ahead of the theory. It was only at the end of the 1990s that the reliability, efficiency, and interpretability of evaluation results began to be formally investigated. The delay was in part because it was only after large-scale collaborative experiments had been running for several years that the datasets needed for a critical investigation of evaluation became available.
Initial enquiries, while foundational, tended either to be ad hoc or to apply statistical methodology developed in other areas to retrieval evaluation without considering the field's distinctive features. These omissions are currently being remedied by the research community. It is in the context of this effort for greater reliability, accuracy, robustness, and efficiency in collection-based retrieval evaluation that this thesis is presented.

Building on the foundational work in the area, and employing the large evaluation datasets now available, major advances can be made in the accuracy and comparability of evaluation scores, in the design of efficient and reliable experiments, in the extensibility of test collections in dynamic evaluation environments, and in the measurement of retrieval similarity without relevance assessment. Technical contributions can also be offered with awareness of the wider context of evaluation and of the necessity of combining experimental rigour with research innovation. The need to store and retrieve written information became increasingly important over the centuries, especially with inventions like paper and the printing press. Soon after computers were invented, people realized that they could be used for storing and mechanically retrieving large amounts of information. In 1945 Vannevar Bush published a groundbreaking article titled "As We May Think" that gave birth to the idea of automatic access to large amounts of stored knowledge [8]. In the 1950s, this idea materialized into more concrete descriptions of how archives of text could be searched automatically. Several works emerged in the mid-1950s that elaborated upon the basic idea of searching text with a computer. One of the most influential methods was described by H.P. Luhn in 1957, in which (put simply) he proposed using words as indexing units for documents and measuring word overlap as a criterion for retrieval [9]. Several key developments in the field happened in the 1960s. Most notable were the development of the SMART system by Gerard Salton and his students, first at Harvard University and later at Cornell University [10], and the Cranfield evaluations done by Cyril Cleverdon and his group at the College of Aeronautics in Cranfield [11]. The Cranfield tests developed an evaluation methodology for retrieval systems that is still in use by IR systems today.
The SMART system, on the other hand, allowed researchers to experiment with ideas to improve search quality. A system for experimentation coupled with a good evaluation methodology allowed rapid progress in the field, and paved the way for many critical developments.

The 1970s and 1980s saw many developments built on the advances of the 1960s. Various models for document retrieval were developed, and advances were made along all dimensions of the retrieval process. These new models and techniques were experimentally proven to be effective on the small text collections (several thousand articles) available to researchers at the time. However, due to the lack of large text collections, the question of whether these models and techniques would scale to larger corpora remained unanswered. This changed in 1992 with the inception of the Text Retrieval Conference, or TREC [12]. TREC is a series of evaluation conferences sponsored by various US Government agencies under the auspices of NIST, which aims at encouraging research in IR over large text collections. With large text collections available under TREC, many old techniques were modified, and many new techniques were developed (and are still being developed) for effective retrieval over large collections [13]. The evolution of IR systems may be organized into four distinct periods, with significant differences among the methods applied and the sources used during each one. During an initial period, up to the 50s, the indexing and searching processes were handled manually. Indexes were based on taxonomies or alphabetical lists of previously specified concepts. During this phase, IR systems were mainly used by librarians and scientists. During a second period, between around 1950 and the advent of the web in the early 90s, the pressure on the field and the evolution of computer and database technology allowed for significant improvements. The process went from manual to automated annotation of documents; however, indexes were still built from restricted descriptions of documents (mainly abstracts and document titles). IR was viewed as finding the right information in text databases. Operating IR systems frequently required specific training.
IR system utilization was expensive and available only to restricted groups. During a third period, covering the 90s, the processes of indexing and searching became fully automated. Full-text indexes were built; web mining evolved to explore not only content but also structure and usage. IR systems became unrestricted, cheap, widely available and widely used. From around 2000 on, in the fourth and current period, other sources of evidence have been explored in an attempt to improve system performance. Searching and browsing are the two basic IR paradigms on the web (Baeza-Yates et al., 1999). Three approaches to IR seem to have emerged (Broder et al., 2005):
- The search-centric approach argues that free search has become so good, and the search user interface so common, that users can satisfy all their needs through simple queries. Search engines follow this approach.
- The taxonomy-navigation approach claims that users have difficulty expressing their information needs; organizing information in a hierarchical structure might help them find relevant information. Directory search systems follow this approach.
- The meta-data-centric approach advocates the use of meta-data for narrowing large sets of results (multi-faceted search); third-generation search engines are trying to improve the quality of their answers by merging several sources of evidence.
IR systems also have to solve problems related to their sources and how to build their databases and indexes. Several crawling algorithms have been explored in order to overcome the problems of scale arising from the dimension of the web, such as focused crawling (Chakrabarti et al., 1999b), intelligent crawling (Aggarwal et al., 2001) and collaborative crawling (Aggarwal et al., 2004), which exploits user behavior registered in server logs.
Other approaches have also been proposed. Meta-search exploits the small overlap among search engines' indexes, sending the same query to a set of search engines and merging their answers; a few specific problems arise from this approach (Wang et al., 2003). Dynamic search engines try to deal with web dynamics; such search engines do not keep any permanent index but instead crawl for their answers at query time (Hersovici et al., 1998). Interactive search (Bruza et al., 2000) wraps a general-purpose search engine in an interface that allows users to navigate towards their goal through a query-by-navigation process. At present, IR research seems to be focused on high-quality retrieval, the integration of several sources of evidence, and multimedia retrieval [3]. TREC has also branched IR into related but important fields like retrieval of spoken information, non-English language retrieval, information filtering, user interactions with a retrieval system, and so on.

1.7 Basic Processes of Information Retrieval

There are three basic processes an information retrieval system has to support: the representation of the content of the documents, the representation of the user's information need, and the comparison of the two representations. The processes are visualized in Figure 1.3 (Croft 1993). In the figure, squared boxes represent data and rounded boxes represent processes.

Figure 1.3: Information Retrieval Process (Croft 1993)

Representing the documents is usually called the indexing process. The process takes place off-line; that is, the end user of the information retrieval system is not directly involved. The indexing process results in a formal representation of the document: the index representation or document representation. Often, full-text retrieval systems use a rather trivial algorithm to derive the index representations, for instance an algorithm that identifies the words in an English text and puts them in lower case. The indexing process may include the actual storage of the document in the system, but often documents are only stored partly, for instance only the title and abstract, plus information about the actual location of the document. The process of representing the information problem or need is often referred to as the query formulation process. The resulting formal representation is the query. In a broad sense, query formulation might denote the complete interactive dialogue between system and user, leading not only to a suitable query but possibly also to a better understanding by the user of his or her information need. In this thesis, however, query formulation generally denotes the automatic formulation of the query when there are no previously retrieved documents to guide the search, that is, the formulation of the initial query. The automatic formulation of successive queries is called relevance feedback in this thesis. The user and the system communicate the information need by queries and retrieved sets of documents, respectively. This is not the most natural form of communication: humans would use natural language to communicate an information need to one another. Such a natural language statement of the information need is called a request. Automatic query formulation takes the request as input and outputs an initial query. In practice, this means that some or all of the words in the request are converted to query terms, for instance by the rather trivial algorithm that puts words in lower case. Relevance feedback takes as input a query or a request together with some previously retrieved relevant and non-relevant documents, and outputs a successive query. The comparison of the query against the document representations is also called the matching process. The matching process results in a ranked list of relevant documents. Users will walk down this document list in search of the information they need.
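The three basic processes described above can be sketched end to end. The indexing algorithm here is the "rather trivial" one mentioned in the text (identify words and put them in lower case); the matching step uses plain term overlap, which is one simple possibility rather than a prescribed method. The documents and the request are invented.

```python
# A sketch of the three basic processes: indexing, query formulation,
# and matching.
import re

def index(text):
    """Indexing: derive a document representation as a set of lowercase words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def formulate_query(request):
    """Initial query formulation: convert the request's words to query terms."""
    return index(request)

def match(query, doc_reps):
    """Matching: rank documents by the number of query terms they contain."""
    return sorted(doc_reps, key=lambda d: len(query & doc_reps[d]), reverse=True)

docs = {"d1": "Family entertainment for all ages",
        "d2": "Corporate tax law"}
reps = {d: index(t) for d, t in docs.items()}   # off-line indexing
query = formulate_query("family entertainment") # from a natural-language request
print(match(query, reps))  # ['d1', 'd2']
```

Relevance feedback would fit in as a fourth step that takes this query plus judged documents and emits a successive query.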
Ranked retrieval will hopefully put the relevant documents somewhere near the top of the ranked list, minimizing the time the user has to invest in reading documents. Simple but effective ranking algorithms use the frequency distribution of terms over documents. For instance, the words "family" and "entertainment" mentioned in the first section occur relatively infrequently in the whole book, which indicates that this book should not receive a top ranking for the request "family entertainment". Ranking algorithms based on statistical approaches can easily halve the time the user has to spend reading documents.

Basic Models of Information Retrieval: A Brief Overview

A mathematical model of information retrieval guides the implementation of information retrieval systems. In traditional information retrieval systems, which are usually operated by professional searchers, only the matching process is automated; indexing and query formulation are manual processes. For these systems, mathematical models of information retrieval therefore only have to model the matching process. In practice, traditional information retrieval systems use the Boolean model of information retrieval.

The Boolean model

The Boolean model is an exact-matching model; that is, it either retrieves documents or it does not, without ranking them. The model supports the use of structured queries, which contain not only query terms but also relations between the terms defined by the query operators AND, OR and NOT. In modern information retrieval systems, which are usually operated by non-professional users, query formulation is automated as well. However, candidate mathematical models for these systems still only model the matching process. There are many candidate models for the matching process of ranked retrieval systems. These models are so-called approximate-matching models; that is, they use the frequency distribution of terms over documents to compute the ranking of the retrieved sets. Each of these models has its own advantages and disadvantages. However, there are two classical candidate models for approximate matching: the vector space model and the probabilistic model. They are classical models not only because they were introduced as early as the early 70s, but also because they represent classical problems in information retrieval.
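As a minimal sketch of the Boolean model's exact matching, the query operators AND, OR and NOT can be realized as set operations over an inverted index: documents are either retrieved or not, with no ranking. The tiny corpus is invented.

```python
# Boolean exact matching over an inverted index: each term maps to the
# set of documents containing it, and the operators AND, OR, NOT become
# set intersection, union, and difference.
postings = {
    "web":    {"d1", "d2"},
    "mining": {"d2", "d3"},
    "java":   {"d3"},
}

# Structured query: web AND mining AND NOT java
result = (postings["web"] & postings["mining"]) - postings["java"]
print(sorted(result))  # ['d2']

# Structured query: web OR java
result_or = postings["web"] | postings["java"]
print(sorted(result_or))  # ['d1', 'd2', 'd3']
```

Note that every document in `result` is retrieved on an equal footing; the ordering applied above is alphabetical, not a relevance ranking.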

The vector space model

The vector space model represents the problem of ranking the documents given the initial query. The vector model, probably the most commonly used, assigns real non-negative weights to index terms in documents and queries. In this model, documents are represented by vectors in a multi-dimensional Euclidean space. Each dimension in this space corresponds to a term/word contained in the document collection. The degree of similarity of a document with regard to a query is evaluated as the correlation between the vectors representing the document and the query, which can be, and usually is, quantified by the cosine of the angle between the two vectors. In the vector model, index term weights are usually obtained as a function of two factors: the term frequency factor (TF), a measure of intra-cluster similarity, computed as the number of times the term occurs in the document, normalized so as to make it independent of document length; and the inverse document frequency factor (IDF), a measure of inter-cluster dissimilarity, which weights each term according to its discriminative power in the entire collection. This model's main advantages are improvements in retrieval performance due to term weighting, and partial matching, which allows retrieval of documents that approximate the query conditions. The index term independence assumption is probably its main disadvantage.

The probabilistic model

The probabilistic model represents the problem of ranking the documents after some feedback is gathered. Probabilistic models compute the similarity between documents and queries as the odds of a document being relevant to a query. Index term weights are binary. This model ranks documents in decreasing order of their probability of being relevant, which is an advantage.
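The vector-space ranking just described can be sketched with TF-IDF weights and cosine similarity. The weighting below uses raw term frequency times log(N/df), which is one common variant rather than the only formulation; the corpus is invented.

```python
# Vector-space ranking: TF-IDF document and query vectors compared by
# the cosine of the angle between them.
import math
from collections import Counter

docs = {
    "d1": "family entertainment family fun",
    "d2": "corporate tax entertainment",
    "d3": "tax law tax court",
}

def tfidf_vector(terms, df, n_docs):
    """Weight each term by its frequency (TF) times log(N/df) (IDF)."""
    tf = Counter(terms)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df[t]}

tokenized = {d: text.split() for d, text in docs.items()}
# Document frequency: in how many documents each term appears.
df = Counter(t for terms in tokenized.values() for t in set(terms))

vectors = {d: tfidf_vector(terms, df, len(docs)) for d, terms in tokenized.items()}

def cosine(q, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in q.items())
    norm = (math.sqrt(sum(w * w for w in q.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

query = tfidf_vector("family entertainment".split(), df, len(docs))
ranking = sorted(docs, key=lambda d: cosine(query, vectors[d]), reverse=True)
print(ranking[0])  # d1
```

Partial matching falls out naturally: d2 shares only "entertainment" with the query, yet still receives a non-zero score and a rank above d3.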
The probabilistic model's main disadvantages are the need to guess the initial separation of documents into relevant and non-relevant sets, the binary weights, and the assumption that index terms are independent. From a practical point of view, the Boolean model, the vector space model and the probabilistic model represent three classical problems of information retrieval: structured queries, initial term weighting, and relevance feedback, respectively. The Boolean model provides the query operators AND, OR and NOT to formulate structured queries. The vector space model was used by Salton and his colleagues for hundreds of term weighting experiments in order to find algorithms that predict which documents the user will find relevant given the initial query (Salton and Buckley 1988). The probabilistic model provides a theory of optimum ranking if examples of relevant documents are available [14].

Evaluation of Information Retrieval Systems

Evaluation studies investigate the degree to which stated goals or expectations have been achieved, or the degree to which they can be achieved. The three major purposes given for evaluating an information retrieval system were the need for measures with which to make merit comparisons within a single test situation, the need for measures with which to make comparisons between results obtained in different test situations, and the need for assessing the merit of a real-life system. A number of studies have been conducted to measure the performance of information retrieval systems. Some criteria have been proposed by several researchers for the evaluation of information retrieval systems [CC66, LFW68, SG83]. These criteria include: coverage of the system, form of presentation of the search output, user effort, the response time of the system, and recall and precision. Retrieval effectiveness is defined in terms of retrieving relevant documents and not retrieving non-relevant documents. Two traditional measures of effectiveness are recall and precision.

Evaluation criteria

Recall indicates the ability of a system to present all relevant items or documents. In reality it may not be possible to retrieve all the relevant items from a collection, especially when the collection is large. A system may be able to retrieve only a proportion of the total relevant documents.
Thus, the performance of a system is often measured by the recall ratio, which denotes the percentage of relevant items retrieved in a given situation.

Precision implies the ability of a system to present only relevant items or documents, and therefore not to retrieve non-relevant ones. This factor, that is, how far the system is able to withhold unwanted items in a given situation, is measured in terms of the precision ratio. These two measures are denoted by the following formulas:

Recall = (number of relevant items retrieved) / (total number of relevant items in the collection)

Precision = (number of relevant items retrieved) / (total number of items retrieved)
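As a minimal sketch, the two ratios can be computed directly from a retrieved set and a set of relevance judgments; both sets here are invented for illustration.

```python
# Recall and precision computed as set ratios.
def recall(retrieved, relevant):
    """Relevant items retrieved / all relevant items in the collection."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Relevant items retrieved / all items retrieved."""
    return len(retrieved & relevant) / len(retrieved)

relevant = {"d1", "d2", "d3", "d4"}   # judged relevant in the collection
retrieved = {"d1", "d2", "d5"}        # what the system returned

print(recall(retrieved, relevant))             # 0.5
print(round(precision(retrieved, relevant), 3))  # 0.667
```

Here the system found two of the four relevant documents (recall 0.5), and two of its three answers were relevant (precision about 0.667), illustrating how the two measures capture complementary aspects of effectiveness.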


International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014.

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014. A B S T R A C T International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Information Retrieval Models and Searching Methodologies: Survey Balwinder Saini*,Vikram Singh,Satish

More information

Data Warehousing. Ritham Vashisht, Sukhdeep Kaur and Shobti Saini

Data Warehousing. Ritham Vashisht, Sukhdeep Kaur and Shobti Saini Advance in Electronic and Electric Engineering. ISSN 2231-1297, Volume 3, Number 6 (2013), pp. 669-674 Research India Publications http://www.ripublication.com/aeee.htm Data Warehousing Ritham Vashisht,

More information

A Study of Future Internet Applications based on Semantic Web Technology Configuration Model

A Study of Future Internet Applications based on Semantic Web Technology Configuration Model Indian Journal of Science and Technology, Vol 8(20), DOI:10.17485/ijst/2015/v8i20/79311, August 2015 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 A Study of Future Internet Applications based on

More information

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky The Chinese University of Hong Kong Abstract Husky is a distributed computing system, achieving outstanding

More information

Taccumulation of the social network data has raised

Taccumulation of the social network data has raised International Journal of Advanced Research in Social Sciences, Environmental Studies & Technology Hard Print: 2536-6505 Online: 2536-6513 September, 2016 Vol. 2, No. 1 Review Social Network Analysis and

More information

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining. About the Tutorial Data Mining is defined as the procedure of extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data. The tutorial starts

More information

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms A Framework for adaptive focused web crawling and information retrieval using genetic algorithms Kevin Sebastian Dept of Computer Science, BITS Pilani kevseb1993@gmail.com 1 Abstract The web is undeniably

More information

Competitive Intelligence and Web Mining:

Competitive Intelligence and Web Mining: Competitive Intelligence and Web Mining: Domain Specific Web Spiders American University in Cairo (AUC) CSCE 590: Seminar1 Report Dr. Ahmed Rafea 2 P age Khalid Magdy Salama 3 P age Table of Contents Introduction

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

Part I: Future Internet Foundations: Architectural Issues

Part I: Future Internet Foundations: Architectural Issues Part I: Future Internet Foundations: Architectural Issues Part I: Future Internet Foundations: Architectural Issues 3 Introduction The Internet has evolved from a slow, person-to-machine, communication

More information

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia Empowering People with Knowledge the Next Frontier for Web Search Wei-Ying Ma Assistant Managing Director Microsoft Research Asia Important Trends for Web Search Organizing all information Addressing user

More information

Dynamic Visualization of Hubs and Authorities during Web Search

Dynamic Visualization of Hubs and Authorities during Web Search Dynamic Visualization of Hubs and Authorities during Web Search Richard H. Fowler 1, David Navarro, Wendy A. Lawrence-Fowler, Xusheng Wang Department of Computer Science University of Texas Pan American

More information

Multimedia Information Systems

Multimedia Information Systems Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive

More information

CE4031 and CZ4031 Database System Principles

CE4031 and CZ4031 Database System Principles CE431 and CZ431 Database System Principles Course CE/CZ431 Course Database System Principles CE/CZ21 Algorithms; CZ27 Introduction to Databases CZ433 Advanced Data Management (not offered currently) Lectures

More information

Adaptable and Adaptive Web Information Systems. Lecture 1: Introduction

Adaptable and Adaptive Web Information Systems. Lecture 1: Introduction Adaptable and Adaptive Web Information Systems School of Computer Science and Information Systems Birkbeck College University of London Lecture 1: Introduction George Magoulas gmagoulas@dcs.bbk.ac.uk October

More information

Automatic Identification of User Goals in Web Search [WWW 05]

Automatic Identification of User Goals in Web Search [WWW 05] Automatic Identification of User Goals in Web Search [WWW 05] UichinLee @ UCLA ZhenyuLiu @ UCLA JunghooCho @ UCLA Presenter: Emiran Curtmola@ UC San Diego CSE 291 4/29/2008 Need to improve the quality

More information

Generalized Document Data Model for Integrating Autonomous Applications

Generalized Document Data Model for Integrating Autonomous Applications 6 th International Conference on Applied Informatics Eger, Hungary, January 27 31, 2004. Generalized Document Data Model for Integrating Autonomous Applications Zsolt Hernáth, Zoltán Vincellér Abstract

More information

Data Mining and Warehousing

Data Mining and Warehousing Data Mining and Warehousing Sangeetha K V I st MCA Adhiyamaan College of Engineering, Hosur-635109. E-mail:veerasangee1989@gmail.com Rajeshwari P I st MCA Adhiyamaan College of Engineering, Hosur-635109.

More information

RETRACTED ARTICLE. Web-Based Data Mining in System Design and Implementation. Open Access. Jianhu Gong 1* and Jianzhi Gong 2

RETRACTED ARTICLE. Web-Based Data Mining in System Design and Implementation. Open Access. Jianhu Gong 1* and Jianzhi Gong 2 Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2014, 6, 1907-1911 1907 Web-Based Data Mining in System Design and Implementation Open Access Jianhu

More information

Semantic Clickstream Mining

Semantic Clickstream Mining Semantic Clickstream Mining Mehrdad Jalali 1, and Norwati Mustapha 2 1 Department of Software Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran 2 Department of Computer Science, Universiti

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm K.Parimala, Assistant Professor, MCA Department, NMS.S.Vellaichamy Nadar College, Madurai, Dr.V.Palanisamy,

More information

Managing Change and Complexity

Managing Change and Complexity Managing Change and Complexity The reality of software development Overview Some more Philosophy Reality, representations and descriptions Some more history Managing complexity Managing change Some more

More information

The main website for Henrico County, henrico.us, received a complete visual and structural

The main website for Henrico County, henrico.us, received a complete visual and structural Page 1 1. Program Overview The main website for Henrico County, henrico.us, received a complete visual and structural overhaul, which was completed in May of 2016. The goal of the project was to update

More information

INFORMATION TECHNOLOGY COURSE OBJECTIVE AND OUTCOME

INFORMATION TECHNOLOGY COURSE OBJECTIVE AND OUTCOME INFORMATION TECHNOLOGY COURSE OBJECTIVE AND OUTCOME CO-1 Programming fundamental using C The purpose of this course is to introduce to students to the field of programming using C language. The students

More information

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models

More information

is easing the creation of new ontologies by promoting the reuse of existing ones and automating, as much as possible, the entire ontology

is easing the creation of new ontologies by promoting the reuse of existing ones and automating, as much as possible, the entire ontology Preface The idea of improving software quality through reuse is not new. After all, if software works and is needed, just reuse it. What is new and evolving is the idea of relative validation through testing

More information

Ontology Based Prediction of Difficult Keyword Queries

Ontology Based Prediction of Difficult Keyword Queries Ontology Based Prediction of Difficult Keyword Queries Lubna.C*, Kasim K Pursuing M.Tech (CSE)*, Associate Professor (CSE) MEA Engineering College, Perinthalmanna Kerala, India lubna9990@gmail.com, kasim_mlp@gmail.com

More information

Information Retrieval May 15. Web retrieval

Information Retrieval May 15. Web retrieval Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

An Analysis of Image Retrieval Behavior for Metadata Type and Google Image Database

An Analysis of Image Retrieval Behavior for Metadata Type and Google Image Database An Analysis of Image Retrieval Behavior for Metadata Type and Google Image Database Toru Fukumoto Canon Inc., JAPAN fukumoto.toru@canon.co.jp Abstract: A large number of digital images are stored on the

More information

Component-Based Software Engineering TIP

Component-Based Software Engineering TIP Component-Based Software Engineering TIP X LIU, School of Computing, Napier University This chapter will present a complete picture of how to develop software systems with components and system integration.

More information

Challenges of Analyzing Parametric CFD Results. White Paper Published: January

Challenges of Analyzing Parametric CFD Results. White Paper Published: January Challenges of Analyzing Parametric CFD Results White Paper Published: January 2011 www.tecplot.com Contents Introduction... 3 Parametric CFD Analysis: A Methodology Poised for Growth... 4 Challenges of

More information

Chapter 6 Evaluation Metrics and Evaluation

Chapter 6 Evaluation Metrics and Evaluation Chapter 6 Evaluation Metrics and Evaluation The area of evaluation of information retrieval and natural language processing systems is complex. It will only be touched on in this chapter. First the scientific

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

MULTIMEDIA TECHNOLOGIES FOR THE USE OF INTERPRETERS AND TRANSLATORS. By Angela Carabelli SSLMIT, Trieste

MULTIMEDIA TECHNOLOGIES FOR THE USE OF INTERPRETERS AND TRANSLATORS. By Angela Carabelli SSLMIT, Trieste MULTIMEDIA TECHNOLOGIES FOR THE USE OF INTERPRETERS AND TRANSLATORS By SSLMIT, Trieste The availability of teaching materials for training interpreters and translators has always been an issue of unquestionable

More information

Peer-to-Peer Systems. Chapter General Characteristics

Peer-to-Peer Systems. Chapter General Characteristics Chapter 2 Peer-to-Peer Systems Abstract In this chapter, a basic overview is given of P2P systems, architectures, and search strategies in P2P systems. More specific concepts that are outlined include

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Effective Knowledge Navigation For Problem Solving. Using Heterogeneous Content Types

Effective Knowledge Navigation For Problem Solving. Using Heterogeneous Content Types From: AAAI Technical Report WS-97-09. Compilation copyright 1997, AAAI (www.aaai.org). All rights reserved. Effective Navigation For Problem Solving Using Heterogeneous Content Types Ralph Barletta and

More information

Patent documents usecases with MyIntelliPatent. Alberto Ciaramella IntelliSemantic 25/11/2012

Patent documents usecases with MyIntelliPatent. Alberto Ciaramella IntelliSemantic 25/11/2012 Patent documents usecases with MyIntelliPatent Alberto Ciaramella IntelliSemantic 25/11/2012 Objectives and contents of this presentation This presentation: identifies and motivates the most significant

More information

Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical application of VOD

Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical application of VOD World Transactions on Engineering and Technology Education Vol.13, No.3, 2015 2015 WIETE Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical

More information

An Empirical Evaluation of User Interfaces for Topic Management of Web Sites

An Empirical Evaluation of User Interfaces for Topic Management of Web Sites An Empirical Evaluation of User Interfaces for Topic Management of Web Sites Brian Amento AT&T Labs - Research 180 Park Avenue, P.O. Box 971 Florham Park, NJ 07932 USA brian@research.att.com ABSTRACT Topic

More information

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing

More information

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes CS630 Representing and Accessing Digital Information Information Retrieval: Retrieval Models Information Retrieval Basics Data Structures and Access Indexing and Preprocessing Retrieval Models Thorsten

More information

Information Discovery, Extraction and Integration for the Hidden Web

Information Discovery, Extraction and Integration for the Hidden Web Information Discovery, Extraction and Integration for the Hidden Web Jiying Wang Department of Computer Science University of Science and Technology Clear Water Bay, Kowloon Hong Kong cswangjy@cs.ust.hk

More information

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur 603 203 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER CS6007-INFORMATION RETRIEVAL Regulation 2013 Academic Year 2018

More information

Mission-Critical Customer Service. 10 Best Practices for Success

Mission-Critical  Customer Service. 10 Best Practices for Success Mission-Critical Email Customer Service 10 Best Practices for Success Introduction When soda cans and chocolate wrappers start carrying email contact information, you know that email-based customer service

More information

6 TOOLS FOR A COMPLETE MARKETING WORKFLOW

6 TOOLS FOR A COMPLETE MARKETING WORKFLOW 6 S FOR A COMPLETE MARKETING WORKFLOW 01 6 S FOR A COMPLETE MARKETING WORKFLOW FROM ALEXA DIFFICULTY DIFFICULTY MATRIX OVERLAP 6 S FOR A COMPLETE MARKETING WORKFLOW 02 INTRODUCTION Marketers use countless

More information

Thanks to our Sponsors

Thanks to our Sponsors Thanks to our Sponsors A brief history of Protégé 1987 PROTÉGÉ runs on LISP machines 1992 PROTÉGÉ-II runs under NeXTStep 1995 Protégé/Win runs under guess! 2000 Protégé-2000 runs under Java 2005 Protégé

More information

2 The IBM Data Governance Unified Process

2 The IBM Data Governance Unified Process 2 The IBM Data Governance Unified Process The benefits of a commitment to a comprehensive enterprise Data Governance initiative are many and varied, and so are the challenges to achieving strong Data Governance.

More information

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE Ms.S.Muthukakshmi 1, R. Surya 2, M. Umira Taj 3 Assistant Professor, Department of Information Technology, Sri Krishna College of Technology, Kovaipudur,

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

Web Usage Mining using ART Neural Network. Abstract

Web Usage Mining using ART Neural Network. Abstract Web Usage Mining using ART Neural Network Ms. Parminder Kaur, Lecturer CSE Department MGM s Jawaharlal Nehru College of Engineering, N-1, CIDCO, Aurangabad 431003 & Ms. Ruhi M. Oberoi, Lecturer CSE Department

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information