Chapter 1. Web-Mining and Information Retrieval

1.1 Introduction

The World Wide Web, or simply the web, may be seen as a huge collection of documents freely produced and published by a very large number of people, without any solid editorial control. It is probably the most democratic and anarchic widespread means for anyone to express feelings, comments, convictions and ideas, independently of ethnicity, sex, religion or any other characteristic of human societies. The web constitutes a comprehensive, dynamic, permanently up-to-date repository of information covering most areas of human knowledge (Hu, 2002), and it supports an increasingly important part of commercial, artistic, scientific and personal transactions, which gives rise to very strong interest from individuals as well as institutions, at a universal scale.

However, the web also exhibits characteristics that are adverse to the process of collecting information from it in order to satisfy specific needs: the large volume of data it contains, its dynamic nature, its being mainly constituted by unstructured or semi-structured data, content and format heterogeneity, and irregular data quality. End-users introduce additional difficulties into the retrieval process: information needs are often imprecisely defined, generating a semantic gap between user needs and their specification.

The satisfaction of a specific information need on the web is supported by search engines and other tools aimed at helping users gather information from the web. The user is usually not assisted in the subsequent tasks of organizing, analyzing and exploring the answers produced. These answers are usually flat lists of large sets of web pages which demand significant user effort to explore. Satisfying information needs on the web is thus usually seen as an ephemeral, one-step process of information search (the traditional search engine paradigm).
Given these characteristics, it is highly demanding to satisfy private or institutional information needs on the web. The web itself, and the interests it promotes, are growing and changing rapidly, at a global scale, both as a means of divulgation and dissemination and as a source of generic and specialized information. Web users have already realized the potential of this huge information source and use it for many purposes, mainly to satisfy specific information needs. Simultaneously, the web provides a ubiquitous environment for executing many activities, regardless of place and time.

A Study of Web Mining Tools for Query Optimization Page 1

1.2 Web Mining

Web mining is a very active research topic which combines two major research areas: Data Mining and the World Wide Web. Web mining research relates to several research communities, such as Databases, Information Retrieval and Artificial Intelligence [1]. Web mining is defined by [Coo97] as the discovery and analysis of useful information from the WWW. It is used to extract interesting and potentially useful patterns and implicit information from artefacts or activity related to the WWW. Web mining in relation to other forms of data mining and retrieval is illustrated in Figure 1.1. The diagram demonstrates that web mining is performed on an unstructured source, i.e. web sites.

Figure 1.1: Web mining in relation to other forms of data mining and retrieval
1.2.1 Web Content Mining

Web content mining is the automatic search of information resources available online [Coo97]. As a process, web content mining goes beyond keyword extraction, since web documents present no machine-readable semantics. The two groups of web content mining approaches concentrate on different aspects: the agent-based approach directly mines document contents, while the database approach improves the search strategy of the search engine with regard to the database it uses.

1.2.2 Web Structure Mining

While web content mining focuses on the internal structure of a web document, web structure mining tries to discover the link structure of the hyperlinks at the inter-document level.

1.2.3 Web Usage Mining

Web usage mining is defined as the discovery of user access patterns from web servers. Web servers record and accumulate user interaction data each time a user makes a request for resources. Analyzing these web access logs can reveal patterns in a user's browsing habits through the web server [2].

Figure 1.2: Taxonomy of Web Mining

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services. Web mining methodologies can generally be classified into one of three distinct categories: Web structure, Web content and Web usage mining. The goal of Web structure mining is to categorize Web pages and generate information such as the similarity and relationships between them, taking advantage of their hyperlink topology. In recent years, the area of Web structure mining has focused on the identification of authorities, i.e. pages that are considered important sources of information by many people in the Web community. Web content mining has to do with the retrieval of information (content) available on the Web into more structured forms, as well as its indexing for easy tracking of information locations. Web content may be unstructured (plain text), semi-structured (HTML documents), or structured (extracted from databases into dynamic Web pages). Such dynamic data cannot be indexed and constitute what is called the hidden Web. A research area closely related to content mining is text mining. Web content mining is nowadays strongly interrelated with Web structure mining, since the two are usually used in combination for extracting and organizing information from the Web. Web content mining provides methods enabling the automated discovery, retrieval, organization, and management of the vast amount of information and resources available on the Web. Cooley et al. [CMS97] categorize the main research efforts in the area of content mining into two approaches: the Information Retrieval (IR) approach and the Database (DB) approach. The IR approach involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize Web-based information. Web usage mining is the process of identifying browsing patterns by analyzing the user's navigational behavior. It takes as input the usage data, i.e. the data residing in the Web server logs, recording the visits of users to a Web site.
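The Web server logs just mentioned are the raw material of usage mining. As a minimal illustration, the following sketch groups requests from a log into per-host sessions; the log lines, the simplified Common Log Format pattern, and the 30-minute idle timeout are illustrative assumptions, not data from this chapter:

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# A minimal Common Log Format pattern: host, timestamp, requested path.
LOG_RE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "GET (\S+) HTTP/[\d.]+"')

def sessions(log_lines, timeout=timedelta(minutes=30)):
    """Group requests by host into sessions split on long idle gaps."""
    by_host = defaultdict(list)
    for line in log_lines:
        m = LOG_RE.match(line)
        if m:
            host, ts, path = m.groups()
            when = datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S")
            by_host[host].append((when, path))
    result = []
    for host, hits in by_host.items():
        hits.sort()
        current = [hits[0][1]]
        for (prev, _), (now, path) in zip(hits, hits[1:]):
            if now - prev > timeout:      # idle gap: close this session
                result.append((host, current))
                current = []
            current.append(path)
        result.append((host, current))
    return result

log = [
    '1.2.3.4 - - [10/Oct/2024:13:55:36] "GET /index.html HTTP/1.1"',
    '1.2.3.4 - - [10/Oct/2024:13:56:10] "GET /products.html HTTP/1.1"',
    '1.2.3.4 - - [10/Oct/2024:15:30:00] "GET /contact.html HTTP/1.1"',
]
print(sessions(log))  # two sessions: the third request comes after a long gap
```

Sessions reconstructed this way are the usual input to the pattern-discovery techniques (Markov models, association rules) discussed later in this chapter.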
Extensive research in the area of Web usage mining led to the appearance of a related research area, that of Web personalization. Web personalization utilizes the results produced by Web usage mining in order to dynamically provide recommendations to each user.
Web mining is moving the World Wide Web toward a more useful environment in which users can quickly and easily find the information they need. It includes the discovery and analysis of data, documents, and multimedia from the World Wide Web. Web mining uses document content, hyperlink structure, and usage statistics to assist users in meeting their information needs. The Web itself and search engines contain relationship information about documents; Web mining is the discovery of these relationships, accomplished within three sometimes overlapping areas. The first is content mining. Search engines define content by keywords; finding a content's keywords, and finding the relationship between a Web page's content and a user's query, is content mining. Hyperlinks provide information about other documents on the Web thought to be important to a given document. These links add depth to the document, providing the multi-dimensionality that characterizes the Web. Mining this link structure is the second area of Web mining. Finally, there are relationships to other documents on the Web that are identified by previous searches; these relationships are recorded in logs of searches and accesses, and mining these logs is the third area of Web mining. Understanding the user is also an important part of Web mining: analysis of the user's previous sessions, preferred display of information, and expressed preferences may influence the Web pages returned in response to a query. Web mining is interdisciplinary in nature, spanning fields such as information retrieval, natural language processing, information extraction, machine learning, databases, data mining, data warehousing, user interface design, and visualization. Techniques for mining the Web have practical application in m-commerce, e-commerce, e-government, e-learning, distance learning, organizational learning, virtual organizations, knowledge management, and digital libraries.
1.3 Web Mining and Information Retrieval

Web IR is the application of IR to the web. In classical IR, users specify queries, in some query language, representing their information needs. The system selects the set of documents in its collection that seem the most relevant to the query and presents them to the user. Users may then refine their queries to improve the answer. In the web environment, user intents are not as static and stable as they usually are in traditional IR. On the web, the information need is associated with a given task (Broder, 2002) that is not known in advance and may differ considerably from user to user, even if the query specification is the same. The identification of this task, and the mental process of deriving a query from an information need, are crucial aspects of web IR.

Web IR is related to web mining, the automatic discovery of interesting and valuable information from the web (Chakrabarti, 2003). It is generally accepted that web mining is currently developing in three main research directions, related to the type of data mined: web content mining, web structure mining and web usage mining (Kosala et al., 2000). Recently, another type of data (document change, page age and information recency) has been generating research interest: it relates to a temporal dimension and allows for analyzing the growth and dynamics of the Web over time (Baeza-Yates, 2003; Cho et al., 2000; Lim et al., 2001). This categorization is merely conceptual; the areas are not mutually exclusive, and some techniques dedicated to one may use data that is typically associated with the others. Web content mining concerns the discovery of useful information from web page content, which is available in many different formats (Baeza-Yates, 2003): textual, metadata, links, multimedia objects, hidden and dynamic pages, and semantic data. Web structure mining tries to infer knowledge from the link structure of the web (Chakrabarti et al., 1999a). Web documents typically point at related documents through links, forming a social network.
This network can be represented by a directed graph where nodes represent documents and arcs represent the links between them. The analysis of this graph is the main goal of web structure mining (Donato et al., 2000; Kumar et al., 2000). In this field, two algorithms which rank web pages according to their relevance have received special attention: PageRank (Brin et al., 1998) and Hyperlink-Induced Topic Search, or HITS (Kleinberg, 1998).
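To make the graph view concrete, here is a brief power-iteration sketch of PageRank on a toy link graph. The four-page graph and the damping factor of 0.85 are illustrative assumptions, not data from this chapter; the iteration itself follows the standard formulation of Brin et al.:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a dict mapping page -> out-links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a (1 - damping) share of uniform "teleport" rank.
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:  # distribute this page's rank over its out-links
                share = rank[p] / len(outs)
                for q in outs:
                    new[q] += damping * share
            else:     # dangling page: spread its rank uniformly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# A toy web of four pages; C is linked to by every other page.
toy = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(toy)
print(max(scores, key=scores.get))  # C accumulates the most rank
```

The scores form a probability distribution (they sum to 1), which matches the random-surfer interpretation of the algorithm.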
Web usage mining tries to explore user behavior on the web by analyzing data originating from user interaction and automatically recorded in web server logs. Applications of web usage mining usually aim to learn user profiles or navigation patterns, and the field is essentially aimed at predicting the next user request based on the analysis of previous requests. Markov models are very common for modeling user requests or user paths within a site (Borges, 2000). Association rules and other standard data mining and OLAP techniques are also explored. Cooley et al. (1997) present an overview of the most relevant work in web usage mining [3].

IR is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible (Rijsbergen, 1979). Some have claimed that resource or document discovery (IR) on the Web is an instance of Web content mining, while others associate web mining with intelligent IR. IR has the primary goals of indexing text and searching for useful documents in a collection, and research in IR nowadays includes modeling, document classification and categorization, user interfaces, data visualization, filtering, etc. (Baeza-Yates & Ribeiro-Neto, 1999). The task that can be considered an instance of Web mining is Web document classification or categorization, which could be used for indexing. Viewed in this respect, Web mining is part of the (Web) IR process (Kosala & Blockeel, 2000) [4].

1.4 Web Mining and Information Extraction

IE has the goal of transforming a collection of documents, usually with the help of an IR system, into information that is more readily digested and analyzed (Cowie & Lehnert, 1996). IE aims to extract relevant facts from the documents, while IR aims to select relevant documents (Pazienza, 1997). While IE is interested in the structure or representation of a document, IR views the text in a document just as a bag of unordered words (Wilks, 1997).
Thus, in general, IE works at a finer level of granularity on the documents than IR does. Building IE systems manually is not feasible or scalable for as dynamic and diverse a medium as the web (Muslea, Minton & Knoblock, 1998). Due to this nature of the Web, most IE systems focus on extracting from specific web sites. Others use machine learning or data mining techniques to learn the extraction patterns or rules for Web documents semi-automatically or automatically (Kushmerick, 1999). Within this view, Web mining is used to improve Web IE (Web mining is part of IE) (Kosala & Blockeel, 2000). An example of IE without Web mining is the work of El-Beltagy, Rafea & Abdelhamid on building a model for automatically augmenting document segments with metadata, using dynamically acquired background domain knowledge, in order to assist users in easily locating information within these documents through a structured front end [5].

Web mining can be divided into four subtasks:

Information Retrieval/Resource Discovery (IR): find all relevant documents on the web. The goal of IR is to automatically find all relevant documents, while at the same time filtering out the non-relevant ones. Search engines are a major tool people use to find web information. Search engines use keywords as the index to perform queries, giving users some control over the search through their choice of terms. Automated programs such as crawlers and robots are used to search the web; such programs traverse the web to recursively retrieve all relevant documents. A search engine consists of three components: a crawler which visits web sites, an index which is updated when the crawler finds a site, and a ranking algorithm which orders the relevant web sites. However, current search engines have a major problem: low precision, which often manifests itself in the irrelevance of search results.

Information Extraction (IE): automatically extract specific fragments of a document from the web resources retrieved in the IR step. Building a uniform IE system is difficult because web content is dynamic and diverse, so most IE systems use the "wrapper" [33] technique to extract specific information from a particular site.
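The wrapper idea can be illustrated with a site-specific extraction rule. The HTML snippet and the patterns below are invented for illustration; a real wrapper would be hand-written or induced for each target site's template:

```python
import re

# A hand-written "wrapper" for one hypothetical product-page layout.
# It only works for this exact template -- the limitation that
# machine-learned wrapper induction tries to overcome.
NAME_RULE = re.compile(r'<h2 class="product">([^<]+)</h2>')
PRICE_RULE = re.compile(r'<span class="price">\$([\d.]+)</span>')

def extract(page_html):
    """Pull (name, price) records out of a page matching the template."""
    names = NAME_RULE.findall(page_html)
    prices = [float(p) for p in PRICE_RULE.findall(page_html)]
    return list(zip(names, prices))

page = """
<h2 class="product">Blue Widget</h2><span class="price">$9.99</span>
<h2 class="product">Red Widget</h2><span class="price">$12.50</span>
"""
print(extract(page))  # [('Blue Widget', 9.99), ('Red Widget', 12.5)]
```

The extracted tuples are structured records, in contrast to the whole-document results that the IR step returns.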
Machine learning techniques are also used to learn the extraction rules.

Generalization: discover information patterns at the retrieved web sites. The purpose of this task is to study users' behavior and interests. Data mining techniques such as clustering and association rules are utilized here. Several problems arise during this task: because web data are heterogeneous, imprecise and vague, it is difficult to apply conventional clustering and association rule techniques directly to the raw web data.

Analysis/Validation: analyze, interpret and validate the potential information in the discovered patterns. The objective of this task is to discover knowledge from the information provided by the former tasks. Based on web data, one can build models to simulate and validate web information [6].

1.5 Information Retrieval and the Web

The meaning of the term information retrieval can be very broad. Just getting a credit card out of your wallet so that you can type in the card number is a form of information retrieval. However, as an academic field of study, information retrieval might be defined thus: information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

As defined in this way, information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar professional searchers. Now the world has changed, and hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email. Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style searching. IR can also cover other kinds of data and information problems beyond that specified in the core definition above. The term unstructured data refers to data which does not have a clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database, of the sort companies usually use to maintain product inventories and personnel records.
In reality, almost no data are truly unstructured. This is definitely true of all text data if you count the latent linguistic structure of human languages. But even accepting that the intended notion of structure is overt structure, most text has structure, such as headings, paragraphs and footnotes, which is commonly represented in documents by explicit markup (such as the coding underlying web pages). IR is also used to facilitate semi-structured search, such as finding a document where the title contains "Java" and the body contains "threading". The field of information retrieval also covers supporting users in browsing or filtering document collections, or further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents; it is similar to arranging books on a bookshelf according to their topic. Given a set of topics, standing information needs, or other categories (such as suitability of texts for different age groups), classification is the task of deciding which class(es), if any, each of a set of documents belongs to. It is often approached by first manually classifying some documents and then attempting to classify new documents automatically. Information retrieval systems can also be distinguished by the scale at which they operate, and it is useful to distinguish three prominent scales. In web search, the system has to provide search over billions of documents stored on millions of computers. Distinctive issues are the need to gather documents for indexing, the ability to build systems that work efficiently at this enormous scale, and the handling of particular aspects of the web, such as the exploitation of hypertext and not being fooled by site providers who manipulate page content in an attempt to boost their search engine rankings, given the commercial importance of the web [7].

1.6 The Web

The web is a public service constituted by a set of applications aimed at retrieving documents from computers accessible on the Internet; the Internet, in turn, is a network of computer networks.
One can also describe the web as an information repository distributed over millions of computers interconnected through the Internet (Baldi et al., 2003). The W3C defines the web in a broad way: the World Wide Web is the universe of network-accessible information, an embodiment of human knowledge. Due to its comprehensiveness, with contents related to most subjects of human activity, and its global public acceptance at both the personal and institutional level, the web is widely explored as an information source. The web's dimension and dynamic nature, however, become serious drawbacks when it comes to retrieving information. Another relevant characteristic of the web is the absence of any global editorial control over its content and format. This contributes largely to the web's success, but it also contributes to a high degree of heterogeneity in content, language, structure, correctness and validity. Although the problems raised by the size of the web, around 11.5 × 10^9 pages (Gulli et al., 2005), and by its dynamics require special treatment, it seems that the major difficulties concerning the processing of web documents are generated by the lack of editorial rules and the lack of a common ontology, which would allow for unambiguous document specification and interpretation. In the absence of such normative rules, each document has to be treated as unique, and document processing cannot be based on any underlying structure. Although HTML already involves some structure, its use is not mandatory. Therefore, the highest level of abstraction that can assure compatibility with a generic web document is the common bag-of-words (Chakrabarti, 2003). This low abstraction level is not very helpful for automatic processing and imposes significant computational costs. The web is a vast and popular repository, containing information related to almost all human activities and being used to perform an ever-growing set of distinct activities (bank transactions, shopping, chatting, government transactions, weather reports and geographic directions, to name just a few).
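The bag-of-words abstraction mentioned above simply records which terms occur in a document, and how often, discarding word order and markup entirely. A minimal sketch (the sample sentence is invented):

```python
import re
from collections import Counter

def bag_of_words(text):
    """Reduce a document to term counts; order and markup are discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

doc = "The Web is a repository: the Web contains documents."
print(bag_of_words(doc))  # 'the' and 'web' each occur twice
```

Because it assumes nothing about a document beyond the presence of words, this representation is compatible with any web page, which is exactly why it serves as the lowest common denominator for web document processing.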
Despite the difficulties this medium poses to automatic as well as non-automatic processing, it has been increasingly explored and has been motivating efforts, from both academia and industry, that aim to facilitate this exploration. Currently the web is a repository of documents, the majority of them HTML documents, that can be automatically presented to users but that have no base model that computers might use to acquire semantic information about the objects being manipulated. The semantic web is a formal attempt by the W3C to transform the web into a huge database that might be easier to process automatically than our current syntactic web. However, despite many initiatives on the semantic web (Lu et al., 2002), the web has its own dynamics, and web citizens are pushing the web toward the social plane. Collaborative systems, radical trust and participation are the main characteristics of Web 2.0, a new paradigm emerging since 2004 (O'Reilly, 2004).

1.7 A Retrospective View of Web Information Retrieval

In the early 1950s, technical librarianship faced a crisis. The scientific boom sparked by the Second World War had released a flood of publications, approaching a million new articles each year. Scientists could no longer stay abreast of current research by general reading alone. Papers relevant to a new project, but not previously known to the researcher, had to be retrieved at the project's outset, and the librarian had to facilitate this retrieval. A variety of cataloguing schemes had been suggested as tools for retrieval, but none had been rigorously tested for effectiveness, and all were labour-intensive to implement.

In responding to technical information's rapid growth, librarians and information scientists developed the field of information retrieval. The defining discovery of the field was that complex schemes for organizing and cataloguing information into hierarchical taxonomies did little better than simply indexing the plain words occurring in the text: the crucial part of information retrieval lay in the process of retrieval. The finding that taxonomy was redundant was little short of scandalous; after all, Western information science had since Aristotle been founded on subdividing knowledge by genus and species. But the effect was liberating. Word occurrences are readily indexed by computer, and retrieval technology could be constructed on top of such indexes without having to solve deep problems in human language analysis and semantics.
Significantly, the sufficiency of word occurrence indexing was not argued theoretically (which, after centuries of such theoretical dispute, would hardly have had an impact), but demonstrated empirically, through careful evaluation.

In the mid 1990s, users of the newly emerged web faced a crisis. The number of web sites was growing rapidly, and finding information by following a trail of links from a few popular central sites was no longer an adequate access method. Manually curated directories such as that of Yahoo! were popular, but manual curation was expensive and scaled poorly. Experienced users could not keep up with the growth in the number of sites, even in areas of personal interest to them; and, for novice users, the task of finding useful information on the web was daunting. Faced with the mushrooming growth of the web in the second half of the 1990s, a new kind of service provider turned to the decades-old technology of information retrieval, producing the web search engine. Web search transformed information retrieval from the rarefied activity of librarians, researchers, journalist fact-checkers, and intelligence analysts into the daily activity of almost the entire computer-enabled population. In doing so, search providers finally bridged a long-established gap between theory and practice. As early as the 1960s, researchers had developed statistical techniques for effectively retrieving and ranking documents against plain keyword queries. The retrieval technology deployed in practice, though, used logical, Boolean query languages that relied upon the patience and expertise of the querier to formulate complex query expressions precisely specifying their information need. But web users had little expertise, and less patience, for constructing complex queries. Search engines therefore turned to simple queries and sophisticated retrieval, finally deploying on a massive scale the techniques developed three decades earlier, and so creating the modern search engine. To the surprise, once more, of some search technologists, simple keyword search simply worked. In an increasingly competitive search market, though, how could a provider verify the effectiveness of their search results and compare their offering with that of their competitors?
Search technology connects simple queries with unannotated documents, relieving both the producer and the consumer of information from the complexity of matching information resources to information needs. The result is tools that allow neophyte users to find relevant information, across billions of web documents, in a fraction of a second. But in doing away with complex, formal information representations in favour of rough approximations, statistical information retrieval introduced an important problem. It is not possible to objectively and deterministically state that an information object matches an information request, even in the terms in which the request is formulated. One can say that a document has been manually assigned a certain classification under a hierarchical taxonomy; one can even say that a document contains a Boolean combination of terms; but one cannot conclusively say that an uncategorized document meets a user's information need as expressed by a handful of keywords. The contemporary retrieval system sits at the interface between computational formalism on the one hand and the ambiguity of human cognition on the other. There is uncertainty in what the retrieval system should do, and therefore in how correct a set of results is. The ambiguity of the retrieval task makes the question of retrieval effectiveness a crucial and contested one. Methods for evaluating effectiveness are therefore essential, in both research and deployment. Retrieval evaluation relies fundamentally on human assessment of result quality. The non-computability of effectiveness makes information retrieval a deeply empirical discipline, closer to natural or even social science than to formal computational theory. The complex, interlocking relations that connect imprecise queries, uncurated documents, and inchoate information needs are not given, but must be hypothesized and tested against observed search behavior. The importance of empirical evaluation in information retrieval has been recognized since the field began; the initial work that established the primacy of retrieval over indexing gained much of its impact from the meticulous and painstaking experimental work on which it was based. But the same scale of data that makes retrieval technology necessary also makes manual assessment costly.
While result quality can be measured by directly assessing user satisfaction with, or utility gained from, retrieval results, such direct measurement of the user's satisfaction with the result lists as a whole is neither reusable nor reliably repeatable. Assessing the results of any single system is time-consuming, and there are many competing retrieval algorithms, each tuned by numerous parameters. A parameter change that takes a few minutes to decide upon, and a few seconds to run, could take days to assess manually. Moreover, if each research group produces its own independent assessments of retrieval quality, then not only is much effort duplicated, but reproducibility is impaired and the potential for bias is introduced. And tuning nowadays is often performed automatically through machine learning; fitting a manual review stage into each learning iteration would be unworkable. The need for scale and automatability, plus the desire for repeatability and objectivity, has led the information retrieval community to develop hybrid evaluation technologies, part manual, part automated. The most important of the evaluation tools is the test collection: a corpus of documents, a set of queries (known as topics) to run against the corpus, and judgments of which documents are (independently) relevant to each query. These relevance judgments must be formed manually, but once made, the test collection can in principle be reused indefinitely for fully automated evaluation. The result is an automated and reusable evaluation method, based on a simplified model of retrieval. Test collection evaluation has been the bedrock of retrieval research for half a century. Collection-based experimentation has grown even more in importance since the arrival, beginning in the early 1990s, of large-scale, collaboratively developed, and readily obtainable test collections. And (to judge from publicly available information) the test collection method is also core to the quality assurance and improvement methods of commercial web search engines. The practice of retrieval evaluation, though, has run well ahead of the theory. It was only at the end of the 1990s that the reliability, efficiency, and interpretability of evaluation results began to be formally investigated. The delay was in part because it was only after large-scale collaborative experiments had been running for several years that the datasets needed for a critical investigation of evaluation became available.
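In code, test-collection evaluation reduces to comparing a system's ranked result list against the stored relevance judgments. The sketch below computes precision at k and average precision for one topic; the document identifiers and judgments are invented for illustration:

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents judged relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision(ranking, relevant):
    """Mean of precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

# One topic from a hypothetical test collection: a ranked run plus judgments.
run = ["d3", "d7", "d1", "d9", "d4"]
qrels = {"d3", "d9", "d5"}
print(precision_at_k(run, qrels, 5))   # 2 relevant in the top 5 -> 0.4
print(average_precision(run, qrels))   # (1/1 + 2/4) / 3 = 0.5
```

Because the judgments are stored rather than collected afresh, scoring a new run over the same topics is fully automatic, which is precisely what makes the test collection reusable.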
Initial enquiries, while foundational, tended either to be ad hoc or to apply statistical methodology developed in other areas to retrieval evaluation without considering the field's distinctive features. These omissions are currently being remedied by the research community. It is in the context of this effort for greater reliability, accuracy, robustness, and efficiency in collection-based retrieval evaluation that this thesis is presented.
Building on the foundational work in the area, and employing the large evaluation datasets now available, major advances can be made in the accuracy and comparability of evaluation scores, in the design of efficient and reliable experiments, in the extensibility of test collections in dynamic evaluation environments, and in the measurement of retrieval similarity without relevance assessment. Technical contributions can also be offered with awareness of the wider context of evaluation, and of the necessity of combining experimental rigour with research innovation.

The need to store and retrieve written information became increasingly important over the centuries, especially with inventions like paper and the printing press. Soon after computers were invented, people realized that they could be used for storing and mechanically retrieving large amounts of information. In 1945, Vannevar Bush published a groundbreaking article titled "As We May Think" that gave birth to the idea of automatic access to large amounts of stored knowledge [8]. In the 1950s, this idea materialized into more concrete descriptions of how archives of text could be searched automatically. Several works emerged in the mid-1950s that elaborated upon the basic idea of searching text with a computer. One of the most influential methods was described by H.P. Luhn in 1957, in which (put simply) he proposed using words as indexing units for documents and measuring word overlap as a criterion for retrieval [9]. Several key developments in the field happened in the 1960s. Most notable were the development of the SMART system by Gerard Salton and his students, first at Harvard University and later at Cornell University [10], and the Cranfield evaluations done by Cyril Cleverdon and his group at the College of Aeronautics in Cranfield [11]. The Cranfield tests developed an evaluation methodology for retrieval systems that is still in use in IR research today.
The SMART system, on the other hand, allowed researchers to experiment with ideas to improve search quality. A system for experimentation coupled with a good evaluation methodology allowed rapid progress in the field, and paved the way for many critical developments.
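Luhn's proposal, as summarized above, uses words as indexing units and the degree of word overlap between query and document as the retrieval criterion. A minimal sketch, with made-up documents:

```python
# A sketch of word-overlap retrieval in the spirit of Luhn (1957):
# each text is reduced to its set of (lower-cased) words, and documents
# are ranked by how many words they share with the query.

def words(text):
    return set(text.lower().split())

def overlap_score(query, document):
    """Number of index words shared by the query and the document."""
    return len(words(query) & words(document))

docs = [
    "automatic indexing of scientific text",
    "manual cataloguing in libraries",
    "automatic text searching with computers",
]
query = "automatic text searching"

# Rank documents by word overlap with the query, best first.
ranking = sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)
print(ranking[0])
```

Crude as it is, this is recognizably the ancestor of the term-matching models discussed later in this chapter.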
The 1970s and 1980s saw many developments built on the advances of the 1960s. Various models for document retrieval were developed, and advances were made along all dimensions of the retrieval process. These new models and techniques were experimentally proven to be effective on the small text collections (several thousand articles) available to researchers at the time. However, due to the lack of large text collections, the question of whether these models and techniques would scale to larger corpora remained unanswered. This changed in 1992 with the inception of the Text Retrieval Conference, or TREC [12]. TREC is a series of evaluation conferences sponsored by various US Government agencies under the auspices of NIST, which aims at encouraging research in IR over large text collections. With large text collections available under TREC, many old techniques were modified, and many new techniques were developed (and are still being developed) to perform effective retrieval over large collections [13].

The evolution of IR systems may be organized into four distinct periods, with significant differences among the methods applied and the sources used during each one. During an initial period, up to the 1950s, the indexing and searching processes were handled manually. Indexes were based on taxonomies or alphabetical lists of previously specified concepts. During this phase, IR systems were mainly used by librarians and scientists. During a second period, between around 1950 and the advent of the web in the early 1990s, the pressure on the field and the evolution of computer and database technology allowed for significant improvements. The process went from manual to automated annotation of documents; however, indexes were still built from restricted descriptions of documents (mainly abstracts and document titles). IR was viewed as finding the right information in text databases. Operating IR systems frequently required specific training.
IR system utilization was expensive and available only to restricted groups. During a third period, covering the 1990s, the processes of indexing and searching became fully automated. Full-text indexes were built, and web mining evolved to explore not only content but also structure and usage. IR systems became unrestricted, cheap, widely available and widely used. From around 2000 on, in the fourth and current period, other sources of evidence are explored in an attempt to improve systems' performance.

Searching and browsing are the two basic IR paradigms on the web (Baeza-Yates et al., 1999). Three approaches to IR seem to have emerged (Broder et al., 2005):

The search-centric approach argues that free search has become so good, and the search user interface so common, that users can satisfy all their needs through simple queries. Search engines follow this approach.

The taxonomy navigation approach claims that users have difficulties expressing their information needs; organizing information in a hierarchical structure might help them find relevant information. Directory search systems follow this approach.

The meta-data centric approach advocates the use of meta-data for narrowing large sets of results (multi-faceted search); third-generation search engines are trying to improve the quality of their answers by merging several sources of evidence.

IR systems also have to solve problems related to their sources and to how they build their databases and indexes. Several crawling algorithms have been explored in order to overcome the problems of scale arising from the web's dimension, such as focused crawling (Chakrabarti et al., 1999b), intelligent crawling (Aggarwal et al., 2001) and collaborative crawling (Aggarwal et al., 2004), which explores user behavior registered in server logs.
Other approaches have also been proposed. Meta-search exploits the small overlap among search engines' indexes, sending the same query to a set of search engines and merging their answers; a few specific problems arise from this approach (Wang et al., 2003). Dynamic search engines try to deal with web dynamics; such search engines do not keep any permanent index but instead crawl for their answers at query time (Hersovici et al., 1998). Interactive search (Bruza et al., 2000) wraps a general-purpose search engine in an interface that allows users to navigate towards their goal through a query-by-navigation process. At present, IR research seems to be focused on high-quality retrieval, the integration of several sources of evidence, and multimedia retrieval [3]. TREC has also branched IR into related but important fields like retrieval of spoken information, non-English language retrieval, information filtering, user interactions with a retrieval system, and so on.

1.7 Basic Processes of Information Retrieval

There are three basic processes an information retrieval system has to support: the representation of the content of the documents, the representation of the user's information need, and the comparison of the two representations. The processes are visualized in figure 1.3 (Croft 1993). In the figure, squared boxes represent data and rounded boxes represent processes.

Figure 1.3: Information Retrieval Process (Croft 1993)

Representing the documents is usually called the indexing process. The process takes place off-line; that is, the end user of the information retrieval system is not directly involved. The indexing process results in a formal representation of the document: the index representation or document representation. Often, full-text retrieval systems use a rather trivial algorithm to derive the index representations, for instance an algorithm that identifies words in an English text and puts them into lower case. The indexing process may include the actual storage of the document in the system, but often documents are only stored partly, for instance only the title and abstract, plus information about the actual location of the document.

The process of representing the information problem or need is often referred to as the query formulation process. The resulting formal representation is the query. In a broad sense, query formulation might denote the complete interactive dialogue between system and user, leading not only to a suitable query but possibly also to a better understanding by the user of his or her information need. In this thesis, however, query formulation generally denotes the automatic formulation of the query when there are no previously retrieved documents to guide the search, that is, the formulation of the initial query. The automatic formulation of successive queries is called relevance feedback in this thesis.

The user and the system communicate the information need by queries and retrieved sets of documents, respectively. This is not the most natural form of communication: humans would use natural language to communicate an information need to each other. Such a natural language statement of the information need is called a request. Automatic query formulation takes the request as input and outputs an initial query. In practice, this means that some or all of the words in the request are converted to query terms, for instance by the rather trivial algorithm that puts words into lower case. Relevance feedback takes as input a query or a request together with some previously retrieved relevant and non-relevant documents, and outputs a successive query.

The comparison of the query against the document representations is called the matching process. The matching process results in a ranked list of relevant documents. Users will walk down this list in search of the information they need.
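The three basic processes just described, indexing, query formulation and matching, can be sketched together in a few lines; the "rather trivial" lowercase algorithm mentioned above serves for both document and query representation. All documents and requests here are illustrative:

```python
# A minimal sketch of the three basic IR processes: indexing,
# query formulation, and matching.

def index(document):
    """Indexing: derive the document representation (a set of terms)."""
    return set(document.lower().split())

def formulate(request):
    """Query formulation: turn a natural-language request into a query."""
    return set(request.lower().split())

def match(query, doc_reps):
    """Matching: rank documents by how many query terms they contain."""
    scores = {doc_id: len(query & terms) for doc_id, terms in doc_reps.items()}
    return sorted(scores, key=scores.get, reverse=True)

collection = {
    "d1": "Family Entertainment at home",
    "d2": "Industrial retrieval systems",
}
# Indexing is done off-line, once per document.
doc_reps = {doc_id: text for doc_id, text in collection.items()}
doc_reps = {doc_id: index(text) for doc_id, text in collection.items()}

# Query formulation and matching happen at search time.
query = formulate("family entertainment")
print(match(query, doc_reps))  # d1 ranked above d2
```

The ranked list returned by `match` is what the user then walks down, which is why the ranking quality discussed next matters so much.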
Ranked retrieval will hopefully put the relevant documents somewhere near the top of the ranked list, minimizing the time the user has to invest in reading the documents. Simple but effective ranking algorithms use the frequency distribution of terms over documents. For instance, the words family and entertainment mentioned in the first section occur relatively infrequently in the whole book, which indicates that this book should not receive a top ranking for the request "family entertainment". Ranking algorithms based on statistical approaches can easily halve the time the user has to spend on reading documents.

Basic Models of Information Retrieval: A Brief Overview

A mathematical model of information retrieval guides the implementation of information retrieval systems. In traditional information retrieval systems, which are usually operated by professional searchers, only the matching process is automated; indexing and query formulation are manual processes. For these systems, mathematical models of information retrieval therefore only have to model the matching process. In practice, traditional information retrieval systems use the Boolean model of information retrieval.

The Boolean model

The Boolean model is an exact-matching model; that is, it either retrieves documents or it does not, without ranking them. The model supports the use of structured queries, which contain not only query terms but also relations between the terms, defined by the query operators AND, OR and NOT.

In modern information retrieval systems, which are usually operated by non-professional users, query formulation is automated as well. However, candidate mathematical models for these systems still only model the matching process. There are many candidate models for the matching process of ranked retrieval systems. These are so-called approximate-matching models; that is, they use the frequency distribution of terms over documents to compute the ranking of the retrieved sets. Each of these models has its own advantages and disadvantages. However, there are two classical candidate models for approximate matching: the vector space model and the probabilistic model. They are classical models, not only because they were introduced as early as the early 1970s, but also because they represent classical problems in information retrieval.
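The Boolean model's exact matching described above can be sketched with set operations over postings lists (the sets of documents containing each term). Terms and document identifiers here are illustrative:

```python
# A minimal sketch of the Boolean model: each term maps to its postings
# set, and structured queries are evaluated with set operations.
# Documents are retrieved or not; there is no ranking.

postings = {
    "web":       {1, 2, 4},
    "mining":    {2, 3, 4},
    "retrieval": {1, 3},
}
all_docs = {1, 2, 3, 4}

# Query: (web AND mining) NOT retrieval
print(sorted((postings["web"] & postings["mining"]) - postings["retrieval"]))

# Query: web OR retrieval
print(sorted(postings["web"] | postings["retrieval"]))

# Query: NOT mining (complement against the whole collection)
print(sorted(all_docs - postings["mining"]))
```

Note that AND and OR map directly onto intersection and union, while NOT requires the set of all documents, which is one reason pure negation is expensive in practice.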
The vector space model

The vector space model represents the problem of ranking the documents given the initial query. The vector model, probably the most commonly used, assigns real non-negative weights to index terms in documents and queries. In this model, documents are represented by vectors in a multi-dimensional Euclidean space. Each dimension in this space corresponds to a term/word contained in the document collection. The degree of similarity of a document with regard to a query is evaluated as the correlation between the vectors representing the document and the query, which can be, and usually is, quantified by the cosine of the angle between the two vectors. In the vector model, index term weights are usually obtained as a function of two factors: the term frequency, TF, a measure of intra-cluster similarity, computed as the number of times the term occurs in the document, normalized so as to make it independent of document length; and the inverse document frequency, IDF, a measure of inter-cluster dissimilarity, which weights each term according to its discriminative power in the entire collection. This model's main advantages are improvements in retrieval performance due to term weighting, and partial matching, which allows the retrieval of documents that only approximate the query conditions. The index term independence assumption is probably its main disadvantage.

The probabilistic model

The probabilistic model represents the problem of ranking the documents after some feedback is gathered. Probabilistic models compute the similarity between documents and queries as the odds of a document being relevant to a query. Index term weights are binary. This model ranks documents in decreasing order of their probability of being relevant, which is an advantage.
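Returning to the vector model above, its two weighting factors and the cosine comparison can be sketched briefly. This is a minimal illustration with made-up documents, not the thesis's own implementation; TF is normalized by document length and IDF by a logarithm, one common choice among many:

```python
# A sketch of vector-model ranking: TF-IDF term weights and cosine
# similarity between the query vector and each document vector.
import math
from collections import Counter

docs = {
    "d1": "web mining mining tools",
    "d2": "information retrieval on the web",
    "d3": "library cataloguing practice",
}

def tf_idf_vectors(texts):
    tokenized = {d: t.lower().split() for d, t in texts.items()}
    n = len(tokenized)
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        # Normalize TF by document length so long documents are not favoured.
        vectors[d] = {t: (tf[t] / len(toks)) * idf[t] for t in tf}
    return vectors, idf

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vectors, idf = tf_idf_vectors(docs)
query = Counter("web mining".split())
qvec = {t: (query[t] / sum(query.values())) * idf.get(t, 0.0) for t in query}

ranking = sorted(docs, key=lambda d: cosine(qvec, vectors[d]), reverse=True)
print(ranking)
```

The partial-matching advantage is visible here: d2 shares only one query term yet still receives a non-zero score and a rank, something the exact-matching Boolean model cannot provide.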
The probabilistic model's main disadvantages are the need to guess the initial separation of documents into relevant and non-relevant ones, the binary weights, and the assumption that index terms are independent.

From a practical point of view, the Boolean model, the vector space model and the probabilistic model represent three classical problems of information retrieval: structured queries, initial term weighting, and relevance feedback, respectively. The Boolean model provides the query operators AND, OR and NOT to formulate structured queries. The vector space model was used by Salton and his colleagues for hundreds of term weighting experiments in order to find algorithms that predict which documents the user will find relevant given the initial query (Salton and Buckley 1988). The probabilistic model provides a theory of optimum ranking if examples of relevant documents are available [14].

Evaluation of Information Retrieval Systems

Evaluation studies investigate the degree to which stated goals or expectations have been achieved, or the degree to which they can be achieved. The three major purposes given for evaluating an information retrieval system were: the need for measures with which to make merit comparisons within a single test situation; the need for measures with which to make comparisons between results obtained in different test situations; and the need to assess the merit of a real-life system. A number of studies have been conducted to measure the performance of information retrieval systems. Several criteria have been proposed by researchers for the evaluation of information retrieval systems [CC66, LFW68, SG83]. These criteria include: coverage of the system, form of presentation of the search output, user effort, response time of the system, and recall and precision. Retrieval effectiveness is defined in terms of retrieving relevant documents and not retrieving non-relevant documents. The two traditional measures of effectiveness are recall and precision.

Evaluation criteria

Recall indicates the ability of a system to present all relevant items or documents. In reality it may not be possible to retrieve all the relevant items from a collection, especially when the collection is large. A system may only be able to retrieve a proportion of the total relevant documents.
Thus, the performance of a system is often measured by the recall ratio, which denotes the percentage of relevant items retrieved in a given situation.
Precision denotes the ability of a system to present only relevant items or documents, and therefore not to retrieve non-relevant ones. This factor, that is, how far the system is able to withhold unwanted items in a given situation, is measured in terms of the precision ratio. These two measures are denoted by the following formulas:
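The standard definitions of these two ratios are: recall is the number of relevant documents retrieved divided by the total number of relevant documents in the collection, and precision is the number of relevant documents retrieved divided by the total number of documents retrieved. A minimal sketch with illustrative data:

```python
# Recall and precision as set ratios.

def recall(retrieved, relevant):
    """Share of all relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Share of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}

print("recall =", recall(retrieved, relevant))       # 2 of 3 relevant found
print("precision =", precision(retrieved, relevant)) # 2 of 4 retrieved relevant
```

The two measures pull in opposite directions: retrieving more documents can only raise recall, but usually lowers precision.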
More informationDynamic Visualization of Hubs and Authorities during Web Search
Dynamic Visualization of Hubs and Authorities during Web Search Richard H. Fowler 1, David Navarro, Wendy A. Lawrence-Fowler, Xusheng Wang Department of Computer Science University of Texas Pan American
More informationMultimedia Information Systems
Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive
More informationCE4031 and CZ4031 Database System Principles
CE431 and CZ431 Database System Principles Course CE/CZ431 Course Database System Principles CE/CZ21 Algorithms; CZ27 Introduction to Databases CZ433 Advanced Data Management (not offered currently) Lectures
More informationAdaptable and Adaptive Web Information Systems. Lecture 1: Introduction
Adaptable and Adaptive Web Information Systems School of Computer Science and Information Systems Birkbeck College University of London Lecture 1: Introduction George Magoulas gmagoulas@dcs.bbk.ac.uk October
More informationAutomatic Identification of User Goals in Web Search [WWW 05]
Automatic Identification of User Goals in Web Search [WWW 05] UichinLee @ UCLA ZhenyuLiu @ UCLA JunghooCho @ UCLA Presenter: Emiran Curtmola@ UC San Diego CSE 291 4/29/2008 Need to improve the quality
More informationGeneralized Document Data Model for Integrating Autonomous Applications
6 th International Conference on Applied Informatics Eger, Hungary, January 27 31, 2004. Generalized Document Data Model for Integrating Autonomous Applications Zsolt Hernáth, Zoltán Vincellér Abstract
More informationData Mining and Warehousing
Data Mining and Warehousing Sangeetha K V I st MCA Adhiyamaan College of Engineering, Hosur-635109. E-mail:veerasangee1989@gmail.com Rajeshwari P I st MCA Adhiyamaan College of Engineering, Hosur-635109.
More informationRETRACTED ARTICLE. Web-Based Data Mining in System Design and Implementation. Open Access. Jianhu Gong 1* and Jianzhi Gong 2
Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2014, 6, 1907-1911 1907 Web-Based Data Mining in System Design and Implementation Open Access Jianhu
More informationSemantic Clickstream Mining
Semantic Clickstream Mining Mehrdad Jalali 1, and Norwati Mustapha 2 1 Department of Software Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran 2 Department of Computer Science, Universiti
More informationCHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES
70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically
More informationEnhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm
Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm K.Parimala, Assistant Professor, MCA Department, NMS.S.Vellaichamy Nadar College, Madurai, Dr.V.Palanisamy,
More informationManaging Change and Complexity
Managing Change and Complexity The reality of software development Overview Some more Philosophy Reality, representations and descriptions Some more history Managing complexity Managing change Some more
More informationThe main website for Henrico County, henrico.us, received a complete visual and structural
Page 1 1. Program Overview The main website for Henrico County, henrico.us, received a complete visual and structural overhaul, which was completed in May of 2016. The goal of the project was to update
More informationINFORMATION TECHNOLOGY COURSE OBJECTIVE AND OUTCOME
INFORMATION TECHNOLOGY COURSE OBJECTIVE AND OUTCOME CO-1 Programming fundamental using C The purpose of this course is to introduce to students to the field of programming using C language. The students
More informationInternational Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani
LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models
More informationis easing the creation of new ontologies by promoting the reuse of existing ones and automating, as much as possible, the entire ontology
Preface The idea of improving software quality through reuse is not new. After all, if software works and is needed, just reuse it. What is new and evolving is the idea of relative validation through testing
More informationOntology Based Prediction of Difficult Keyword Queries
Ontology Based Prediction of Difficult Keyword Queries Lubna.C*, Kasim K Pursuing M.Tech (CSE)*, Associate Professor (CSE) MEA Engineering College, Perinthalmanna Kerala, India lubna9990@gmail.com, kasim_mlp@gmail.com
More informationInformation Retrieval May 15. Web retrieval
Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically
More informationInformation Retrieval. (M&S Ch 15)
Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion
More informationAn Analysis of Image Retrieval Behavior for Metadata Type and Google Image Database
An Analysis of Image Retrieval Behavior for Metadata Type and Google Image Database Toru Fukumoto Canon Inc., JAPAN fukumoto.toru@canon.co.jp Abstract: A large number of digital images are stored on the
More informationComponent-Based Software Engineering TIP
Component-Based Software Engineering TIP X LIU, School of Computing, Napier University This chapter will present a complete picture of how to develop software systems with components and system integration.
More informationChallenges of Analyzing Parametric CFD Results. White Paper Published: January
Challenges of Analyzing Parametric CFD Results White Paper Published: January 2011 www.tecplot.com Contents Introduction... 3 Parametric CFD Analysis: A Methodology Poised for Growth... 4 Challenges of
More informationChapter 6 Evaluation Metrics and Evaluation
Chapter 6 Evaluation Metrics and Evaluation The area of evaluation of information retrieval and natural language processing systems is complex. It will only be touched on in this chapter. First the scientific
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationMULTIMEDIA TECHNOLOGIES FOR THE USE OF INTERPRETERS AND TRANSLATORS. By Angela Carabelli SSLMIT, Trieste
MULTIMEDIA TECHNOLOGIES FOR THE USE OF INTERPRETERS AND TRANSLATORS By SSLMIT, Trieste The availability of teaching materials for training interpreters and translators has always been an issue of unquestionable
More informationPeer-to-Peer Systems. Chapter General Characteristics
Chapter 2 Peer-to-Peer Systems Abstract In this chapter, a basic overview is given of P2P systems, architectures, and search strategies in P2P systems. More specific concepts that are outlined include
More informationImproving the Efficiency of Fast Using Semantic Similarity Algorithm
International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year
More informationEffective Knowledge Navigation For Problem Solving. Using Heterogeneous Content Types
From: AAAI Technical Report WS-97-09. Compilation copyright 1997, AAAI (www.aaai.org). All rights reserved. Effective Navigation For Problem Solving Using Heterogeneous Content Types Ralph Barletta and
More informationPatent documents usecases with MyIntelliPatent. Alberto Ciaramella IntelliSemantic 25/11/2012
Patent documents usecases with MyIntelliPatent Alberto Ciaramella IntelliSemantic 25/11/2012 Objectives and contents of this presentation This presentation: identifies and motivates the most significant
More informationIntelligent management of on-line video learning resources supported by Web-mining technology based on the practical application of VOD
World Transactions on Engineering and Technology Education Vol.13, No.3, 2015 2015 WIETE Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical
More informationAn Empirical Evaluation of User Interfaces for Topic Management of Web Sites
An Empirical Evaluation of User Interfaces for Topic Management of Web Sites Brian Amento AT&T Labs - Research 180 Park Avenue, P.O. Box 971 Florham Park, NJ 07932 USA brian@research.att.com ABSTRACT Topic
More informationHome Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit
Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing
More informationInformation Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes
CS630 Representing and Accessing Digital Information Information Retrieval: Retrieval Models Information Retrieval Basics Data Structures and Access Indexing and Preprocessing Retrieval Models Thorsten
More informationInformation Discovery, Extraction and Integration for the Hidden Web
Information Discovery, Extraction and Integration for the Hidden Web Jiying Wang Department of Computer Science University of Science and Technology Clear Water Bay, Kowloon Hong Kong cswangjy@cs.ust.hk
More informationVALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER
VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur 603 203 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER CS6007-INFORMATION RETRIEVAL Regulation 2013 Academic Year 2018
More informationMission-Critical Customer Service. 10 Best Practices for Success
Mission-Critical Email Customer Service 10 Best Practices for Success Introduction When soda cans and chocolate wrappers start carrying email contact information, you know that email-based customer service
More information6 TOOLS FOR A COMPLETE MARKETING WORKFLOW
6 S FOR A COMPLETE MARKETING WORKFLOW 01 6 S FOR A COMPLETE MARKETING WORKFLOW FROM ALEXA DIFFICULTY DIFFICULTY MATRIX OVERLAP 6 S FOR A COMPLETE MARKETING WORKFLOW 02 INTRODUCTION Marketers use countless
More informationThanks to our Sponsors
Thanks to our Sponsors A brief history of Protégé 1987 PROTÉGÉ runs on LISP machines 1992 PROTÉGÉ-II runs under NeXTStep 1995 Protégé/Win runs under guess! 2000 Protégé-2000 runs under Java 2005 Protégé
More information2 The IBM Data Governance Unified Process
2 The IBM Data Governance Unified Process The benefits of a commitment to a comprehensive enterprise Data Governance initiative are many and varied, and so are the challenges to achieving strong Data Governance.
More informationWEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE
WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE Ms.S.Muthukakshmi 1, R. Surya 2, M. Umira Taj 3 Assistant Professor, Department of Information Technology, Sri Krishna College of Technology, Kovaipudur,
More informationDomain-specific Concept-based Information Retrieval System
Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical
More informationWeb Usage Mining using ART Neural Network. Abstract
Web Usage Mining using ART Neural Network Ms. Parminder Kaur, Lecturer CSE Department MGM s Jawaharlal Nehru College of Engineering, N-1, CIDCO, Aurangabad 431003 & Ms. Ruhi M. Oberoi, Lecturer CSE Department
More informationPart I: Data Mining Foundations
Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?
More information