CHAPTER THREE INFORMATION RETRIEVAL SYSTEM


Emil McCarthy
5 years ago

3.1 INTRODUCTION

A search engine is one of the most effective and prominent means of finding information online. Searching for information in fields such as business, entertainment and research has become an essential part of life for almost everyone. To understand search engines and how they work, it is important to understand the basic mechanism of information retrieval. According to Lang, [...] searching within a document collection for a particular information need which is [...]. According to Baeza-Yates and Ribeiro-Neto, information retrieval deals with the representation, storage, organization of, and access to information items, in order to give the user the possibility to easily access the desired [...].

There is a clear-cut difference between classical information retrieval and web information retrieval. Classical information retrieval is the search of restricted collections that are not linked [122]. The documents in a classical information retrieval system are stored in physical form, as when searching for an item in a book. Nowadays, however, documents are in computerized form and are retrieved with the help of special tools or techniques known as information retrieval models. Web information retrieval, on the other hand, is the search of a globally large collection of documents, such as a search through search engines like Bing, Yahoo and Google [122].

3.2 INFORMATION SYSTEM

An information system processes the data and information in a given organization, which may include manual processes and automated [...]. There are four important computer based information systems:

1. Management Information Systems [77]
2. Database Management Systems [115]
3. Question-Answering systems
4. Information Retrieval systems

Figure 3.1: Overlap among Information System Types
Source: Introduction to Modern Information Retrieval, TMH

The input information is generally taken in the form of natural language texts, documents or abstracts. The output is the response to search requests [159]. There is a significant overlap of information retrieval with the other information systems. Figure 3.1 depicts the working and overlap of each information system.

3.3 FUNCTIONAL APPROACH TO IR

Figure 3.2 shows the functional approach of the information retrieval system. There are three major components of the information retrieval system:

1. A set of information items
2. A set of requests
3. A set of mapping mechanisms

Figure 3.2: Functional Overview of Information Retrieval
Source: Introduction to Modern Information Retrieval, TMH

3.4 SEARCHING PROCESS

A typical search process is shown in Figure 3.3. It involves various steps showing an optimal method of searching for an item in the database [123]. Boolean search methods [35] are usually used in web information retrieval systems. The main task in the search process is to coordinate the terms to formulate the actual search statement. The whole search process depends mainly on the effective combination of the search terms.
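The coordination of search terms in a Boolean search can be illustrated with a small sketch. It is only a minimal example under stated assumptions: the toy collection DOCS, the helper names (build_index, boolean_and, boolean_or, boolean_not) and the whitespace tokenization are all hypothetical and are not taken from the cited sources.

```python
from collections import defaultdict

# Hypothetical toy collection; IDs and texts are illustrative only.
DOCS = {
    1: "information retrieval system",
    2: "web search engine",
    3: "web information retrieval",
}


def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Documents that contain every one of the given terms."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def boolean_or(index, *terms):
    """Documents that contain at least one of the given terms."""
    sets = [index.get(term, set()) for term in terms]
    return set.union(*sets) if sets else set()


def boolean_not(index, all_ids, term):
    """Documents that do not contain the given term."""
    return set(all_ids) - index.get(term, set())


INDEX = build_index(DOCS)

# Search statement "web AND retrieval"
print(boolean_and(INDEX, "web", "retrieval"))
```

In this sketch a search statement is formed by combining the set operations, for example intersecting the result of boolean_or with the result of boolean_not, which mirrors how the effectiveness of the search depends on the combination of terms.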

Figure 3.3: Optimal Searching Process

3.5 RETRIEVAL MODELS

Several retrieval models exist to improve the retrieval process. Information retrieval models are classified into the following categories [59]:

1. User centric or cognitive models
2. System centric models
3. Alternative models

In addition to the retrieval mechanisms used in matching queries, the user centric model also considers the ways in which the query is formulated from the user's information needs, the human computer interaction during the search process [46], the environment in which the search is carried out, and the way in which the information is used to meet a specific information need.

The system centric model is based on logical and mathematical principles such as the probabilistic model, Boolean search and vector processing models [59]. In the probabilistic model, the search is carried out by comparing the relevance probabilities of the documents [143], while in the Boolean search model queries are compared with the terms used to represent the documents. In the vector processing model, the global similarity between queries and the set of documents is compared.

Best match searching and relevance feedback model

The purpose of best match searching [60] is to create a ranked output, which requires calculating the relative significance of the retrieved items, which in turn requires weighting the search terms in one way or another. A similarity measure consists of two main components:

1. A term weighting scheme that indicates the significance of a term by assigning numerical values to each index term in the document or query.
2. A similarity coefficient which uses these weights to compute the similarity between the query and the retrieved item.

In the best match search technique, each query term is compared against each term in the database, the measure of similarity is calculated between the terms in the document and the query, and finally all the items retrieved so far are sorted by decreasing similarity value [58]. The ranking of the documents involves some sort of quantitative measurement [170]. Various weighting schemes, such as term frequency and collection frequency, are used to produce the best results [170].

3.6 WORLD WIDE WEB

The World Wide Web [...] access information via the World Wide Web. The number of Internet host machines was 147,344,723 in January 2002 and increased to 908,585,739 in July 2012 [89], a large percentage increase in ten years, which in turn indicates an enormous increase in the number of websites. The CommerceNet survey indicates that the total number of users was about 490 million in the year 2002 and increased to 2,405,518,376 in June 2012 [87].

It was estimated in one study [1] that the number of indexable web pages was about 11.5 billion in [...]. A recent survey estimates the number of indexable web pages at [...] billion [173]. Extracting information from such a large source as the World Wide Web would not have been possible without powerful tools [114]. Four main methods of finding information on the Web are identified by [135]:

1. Using a known URL
2. Using hypertext links to navigate from one web page to another
3. Narrowcast services or portals, which push web pages to users according to their particular profiles
4. Search engines, which allow users to search the web using traditional and advanced information retrieval techniques

It was estimated by [168] that 85% of Internet users exploit search engines to locate information. In another study, by Jansen and Pooch [24], it was estimated that 71% of web users use search engines to find other websites. Search engines are the most essential tools for searching the web. Advanced information retrieval techniques are used by search engines to extract information from the web [135].

Classical Information Retrieval

Words such as "a", "of" and "is" do not contain semantic information. These words are called stop words and are usually not used for document representation. The remaining words are content words and can be used to represent the document. Variations of the same word may be mapped to the same term. For example, the words "beauty", "beautiful" and "beautify" can be denoted by the term "beaut". This can be achieved by a stemming program. After removing stop words and stemming, each document can be logically represented by a vector of n terms [181], where n is the total number of distinct terms in the set of all documents in a document collection. Let us consider that the document d is represented by the vector (d1, ..., di, ..., dn), where di is a weight assigned to the i-th term in the document d. If a term is present in the document, its weight is assigned on the basis of the following two factors.

1. The term frequency, represented by tf, of a term in a document is the number of times the term occurs in the document. The higher the term frequency of a term, the more important the term is in representing the contents of the document. As a consequence, the term frequency weight (tfw) of the term in the document is usually a monotonically increasing function of its term frequency.
2. The document frequency, represented by df, is the number of documents containing the term. The higher the document frequency of a term, the less important the term is in discriminating documents containing the term from documents not containing it. Thus, the weight of a term based on its document frequency is usually a monotonically decreasing function of df and is known as the inverse document frequency weight, represented by idfw [20].

The product of the inverse document frequency weight and the term frequency weight represents the weight of a term in a document.

Precision and recall are the two common parameters used to measure the effectiveness of information retrieval [10]. Recall and precision can be calculated by the following formulae:

Recall = (number of relevant documents retrieved) / (total number of relevant documents)
Precision = (number of relevant documents retrieved) / (total number of documents retrieved)

A set of test queries is used to evaluate the effectiveness of the retrieval system. The set of relevant documents is identified for each query. The values of precision and recall are calculated from the above formulae. The average precision-recall curve is drawn from the set of precision and recall values over the set of queries. This curve is used to determine the effectiveness of the system. Both the precision and the recall of an ideal information retrieval system should be equal to one, i.e., the system retrieves all the relevant documents and nothing else.
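The weighting and evaluation steps described above can be sketched in a few lines. This is only one possible instance, not the scheme of any cited work: it assumes raw term frequency as tfw and log(N / df) as idfw, uses cosine similarity as the similarity coefficient, and omits stemming. The collection DOCS, the stop word list and all function names are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "of", "is", "the"}  # tiny illustrative list (assumption)

# Hypothetical toy collection.
DOCS = {
    1: "web search engine",
    2: "search of the web",
    3: "classical retrieval of a book",
}


def tokenize(text):
    """Lowercase, split on whitespace and drop stop words (no stemming)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return (vectors, idf): one sparse weight vector per document."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))  # count each term once per document
    # Assumed idfw: log(N / df), which decreases as df grows.
    idf = {term: math.log(len(docs) / count) for term, count in df.items()}
    vectors = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)  # assumed tfw: raw term frequency
        vectors[doc_id] = {t: f * idf[t] for t, f in tf.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, vectors, idf):
    """Return document IDs sorted by decreasing similarity to the query."""
    q_tf = Counter(tokenize(query))
    q_vec = {t: f * idf.get(t, 0.0) for t, f in q_tf.items()}
    scores = {d: cosine(q_vec, vec) for d, vec in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


VECTORS, IDF = tfidf_vectors(DOCS)
print(rank("web search", VECTORS, IDF))
print(precision_recall([1, 2], [2, 3, 4, 5]))
```

Other monotonic functions of tf and df could be substituted without changing the overall structure; only the two lines marked as assumed weights would differ.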

WEB BASED SEARCH ENGINES

Search engines are used as information retrieval systems to retrieve web pages [40]. The HTML and XML (Extensible Markup Language) tags present in web pages convey rich information. This tag information is used by many search engines, such as AltaVista and Google, to determine the importance of a term. Higher weights may be allocated to a term because of its position in the web page or a special font [13].

The web pages in the World Wide Web are extensively linked. The links between pages provide much useful information:

1. A link indicates a good likelihood that the contents of the two pages are related.
2. A link indicates that the author of a page values the contents of another page. This linkage information has been used to compute the global importance, i.e. the PageRank, of web pages based on whether a page is pointed to by many pages and/or by important pages [117].
3. The linkage information has also been used to compute the authority, or degree of importance, of web pages with respect to a given topic [103]. For example, IBM's Clever Project aims to develop a search engine that employs the technique of computing the authorities of web pages for a given query [165].
4. Linkage information can also be utilized in another way. When a page A has a link to page B, a set of terms known as anchor terms is usually associated with the link. The purpose of the anchor terms is to provide information about the contents of page B to facilitate navigation by human users. The anchor terms often provide related terms or synonyms to the terms used to index page B. To utilize such valuable information, several search engines, such as Google [162], have recommended using anchor terms to represent linked pages.

An Internet survey conducted by Manning, Raghavan, & Schutze, 2009 [39] through Pew Internet shows that 92% of Internet users find the Web a good place for getting information [113].
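The idea that a page is important when it is pointed to by many pages and/or by important pages can be sketched as a simple iterative computation. The following is a minimal illustration in the spirit of PageRank, not the algorithm as deployed by any search engine. The damping value of 0.85, the fixed number of iterations, the even redistribution of rank from pages without outgoing links, and the example graph LINKS are all assumptions made for this sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate page importance from a link graph.

    links maps each page to the list of pages it links to.
    damping and iterations are assumed values for illustration.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: A and B both link to C, and C links back to A.
LINKS = {"A": ["C"], "B": ["C"], "C": ["A"]}
RANKS = pagerank(LINKS)
for page in sorted(RANKS, key=RANKS.get, reverse=True):
    print(page, round(RANKS[page], 3))
```

In this example C ranks highest because two pages point to it, and A ranks above B because it is pointed to by the important page C, while nothing points to B.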
The web search engine is easy to use and convenient, which makes it a successful tool for web search. Other reasons are that it is readily available to Internet users, can be used anytime and anywhere, and is easily accessible. A lot of effort is put into search engine optimization to maximize search engine visibility [27]. One study shows that consumer demand to purchase a product is five times higher through a website than through a banner or other advertisement [163], which can be time and cost effective.

A study [101] shows that search engines came into existence in 1994 in the form of research projects by faculty and graduate students. There were around 2000 searching tools by the end of December 1997 and around 25 general purpose search engines by the end of [...]. There are more than 900 search engines [26] as indexed by Big Search Engine Index in [...]. One search engine can be distinguished from another on the basis of the following criteria:

1. Size: the number of sites or pages indexed
2. Speed: how fast the engine can find the information requested
3. [...] the actual query
4. Update rate: how current the information contained in their databases is

There are three major components of a search engine:

1. Robot (spider), which crawls the web and captures new web pages
2. Database, which includes serial files, indexes and inverted files for the captured web pages
3. Agent, which performs the search process

Search engines build their databases by crawling web pages periodically and indexing the web pages that are suitable to be added to the database. When a user submits a query to a search engine, the search engine matches the query terms against the database with the help of complicated searching algorithms. The searching algorithms vary from one search engine to another. Lastly, the retrieved documents are ranked according to their relevance to the query. Query term frequency was the key method for ranking web pages, as pointed out by Kamfai Wong [84]. Later, Kleinberg [103] did significant work on page link analysis, which became a very dominant technique for ranking web pages and other hyperlinked documents.

The exact ranking algorithms are not disclosed by the search engines, as they are commercial secrets, but some general information and techniques for retrieving web pages have been published [107]. H. Vernon Leighton [80] showed that the relevancy score of a document increases if the query words appear in the page title or a heading.

A single search engine does not fulfil the user's needs, as information is scattered over disjoint sets of databases or search engines. Also, the rapid rate of information explosion on the web makes a single search engine incapable of indexing all web pages [128]. The Meta search engine is a solution to this problem. A Meta search engine is a search engine that generally does not maintain its own index of documents. Instead, it queries several participating search engines, aggregates their individual results into a unified result set, and re-ranks the results returned by the search engines based on the [...]. Meta search engines are based on data fusion techniques and require three major steps:

1. Selection of the most comprehensive databases
2. Ranking the selected databases properly and combining the retrieved results
3. Merging the results into a single unified list of documents using the most suitable merging algorithm

There are many advantages of using a Meta search engine [68]:

1. It increases the search coverage of the web pages
2. It facilitates the invocation of multiple search engines
3. It solves the extendibility issues in searching for information
4. It improves information retrieval effectiveness

There are several design and technical issues [169] to be considered in building an efficient Meta search engine. Some of them are listed below:

1. Selection of the database or component engine selection
2. Query analysis
3. Query scheduling or dispatch
4. Rank aggregation
5. Result merging, which is a key component of the Meta search system. The effectiveness of a Meta search system is directly related to the result merging algorithm.

It is not certain that Meta search engines provide a complete solution for searching the widespread Web. The quality of the results retrieved from a Meta search engine depends largely on the underlying component search engines, which constantly undergo changes, such as changes in their output format, the content of their index and their ranking algorithms.
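The result merging step can be illustrated with a simple positional voting scheme. The text does not say which merging algorithm is meant, so the sketch below swaps in a Borda-count style aggregation purely as an example. The function name borda_merge, the example result lists and the alphabetical tie-breaking rule are hypothetical.

```python
from collections import defaultdict


def borda_merge(result_lists):
    """Merge ranked result lists from several engines into one list.

    Each list is ordered best first. A document in position p of a list
    of length n receives n - p points from that engine, so documents
    ranked highly by several engines accumulate the most points.
    """
    scores = defaultdict(float)
    for results in result_lists:
        n = len(results)
        for position, doc in enumerate(results):
            scores[doc] += n - position
    # Decreasing score; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical ranked results from two component search engines.
ENGINE_A = ["u1", "u2", "u3"]
ENGINE_B = ["u2", "u4", "u1"]

MERGED = borda_merge([ENGINE_A, ENGINE_B])
print(MERGED)
```

Because this scheme uses only rank positions, it does not need the component engines to expose comparable relevance scores, which is convenient when their ranking algorithms are undisclosed. A document returned by only one engine still appears in the merged list, with a lower score.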
