Information Retrieval and Search Engine

1 Information Retrieval and Search Engine Dr. Choochart Haruechaiyasak, Ph.D. Research and Development on Information (RDI), National Electronics and Computer Technology Center (NECTEC)

2 Lecture Outline Information Retrieval (IR) History and overview System functionalities: Indexing: Inverted file Retrieval: Vector space model Search Engine (SE) Web Crawler Link analysis and search result ranking Open-Source IR Platform Lucene: Open-source IR library Sansarn Look!: A system for developing search engines and IR systems for Thai language

3 Information Retrieval (IR)

4 What is Information Retrieval? An information retrieval system is a device interposed between a potential user of information and the information collection itself. For a given information problem, the purpose of the system is to capture wanted items and to filter out unwanted items. Information retrieval deals with the representation, storage, and access to documents or representatives of documents.

5 Data Retrieval VS. Information Retrieval

6 History 1960-70s: Initial exploration of text retrieval systems for small corpora of scientific abstracts, and law and business documents. Development of the basic Boolean and vector-space models of retrieval. Prof. Salton and his students at Cornell University are the leading researchers in the area. 1980s: Large document database systems, many run by companies: Lexis-Nexis / MEDLINE

7 History (cont'd) 1990s: Searching FTPable documents on the Internet: Archie / WAIS. Searching the World Wide Web: Lycos / Yahoo / Altavista. Organized competitions: NIST TREC. Recommender systems: Amazon.com / CDnow.com / imdb.com. Automated text categorization & clustering.

8 History (cont'd) 2000s: Link analysis for Web search: Google. Automated information extraction: Whizbang / Fetch / Burning Glass. Question answering: TREC Q/A track. Multimedia IR: image / video / audio and music. Cross-language IR. Document summarization.

9 Related Research Areas Database Management Library and Information Science Artificial Intelligence (AI) Natural Language Processing (NLP) Machine Learning (ML)

10 Database Management Focused on structured data stored in relational tables rather than free-form text. Focused on efficient processing of well-defined queries in a formal language (SQL). Clearer semantics for both data and queries. Recent move towards semi-structured data (XML) brings it closer to IR.

11 Library and Information Science Focused on the human user aspects of information retrieval (human-computer interaction, user interface, visualization). Concerned with effective categorization of human knowledge. Concerned with citation analysis and bibliometrics (structure of information). Recent work on digital libraries brings it closer to CS & IR.

12 Artificial Intelligence (AI) Focused on the representation of knowledge, reasoning, and intelligent action. Formalisms for representing knowledge and queries: First-order Predicate Logic Bayesian Networks Recent work on web ontologies and intelligent information agents brings it closer to IR.

13 Natural Language Processing (NLP) Focused on the syntactic, semantic, and pragmatic analysis of natural language text and discourse. Ability to analyze syntax (phrase structure) and semantics could allow retrieval based on meaning rather than keywords.

14 NLP: IR Directions Methods for determining the sense of an ambiguous word based on context (Word-Sense Disambiguation) Methods for identifying specific pieces of information in a document (Information Extraction) Methods for answering specific natural-language questions from document corpora. (Question & Answering)

15 Machine Learning (ML) Focused on the development of computational systems that improve their performance with experience. Automated classification of examples based on learning concepts from labeled training examples (supervised learning). Automated methods for clustering unlabeled examples into meaningful groups (unsupervised learning).

16 ML : IR Directions Text Categorization Automatic hierarchical classification (Yahoo). Adaptive filtering/routing/recommending. Automated spam filtering. Text Clustering Clustering of IR query results. Automatic formation of hierarchies (Yahoo). Learning for Information Extraction Text Mining

17 Overview of IR IR is concerned with indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent killer application. Concerned firstly with retrieving relevant documents to a query. Concerned secondly with retrieving from large sets of documents efficiently.

18 Relevance Relevance is a subjective judgement and may include: Being on the proper subject. Being timely (recent information). Being authoritative (from a trusted source). Satisfying the goals of the user and his/her intended use of the information (information need).

19 Keyword Search Simplest notion of relevance is that the query string appears verbatim in the document. Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

20 Problems with Keywords May not retrieve relevant documents that include synonymous terms. restaurant vs. café PRC vs. China May retrieve irrelevant documents that include ambiguous terms. bat (baseball vs. mammal) Apple (company vs. fruit) bit (unit of data vs. act of eating)

21 Intelligent IR Taking into account the meaning of the words used. Taking into account the order of words in the query. Adapting to the user based on direct or indirect feedback. Taking into account the authority of the source.

22 Typical IR Task Given: A corpus of textual natural-language documents. A user query in the form of a textual string. Find: A ranked set of documents that are relevant to the query.

23 IR Functional Components Two IR main functions: (1) Indexing: (system perspective) - Text processing - Index construction (2) Retrieval: (user perspective) - User interface - Query processing - Searching from index (index lookup) - Search result ranking

24 IR System: (1) Indexing

25 IR System: (2) Retrieval

26 Text Processing Strip unwanted characters/markup e.g. HTML tags, punctuation, numbers, etc. Break into tokens (keywords) on whitespace Stem tokens to root words (Porter stemming) computational -> comput Remove common stopwords (e.g. a, the, it, etc.). Detect common phrases and named entities possibly by using a domain specific dictionary
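The text-processing steps on this slide can be sketched in a few lines of Python. This is only an illustration: the stopword list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix stripper standing in for the Porter stemmer, not an implementation of it.

```python
import re

# Assumption: a tiny illustrative stopword list, not a standard one.
STOPWORDS = {"a", "an", "the", "it", "is", "of", "and", "to", "in"}

def strip_markup(text):
    """Replace HTML tags, punctuation, and digits with spaces."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"[^A-Za-z\s]", " ", text)

def crude_stem(token):
    """Strip one common suffix. A rough stand-in, NOT the Porter algorithm."""
    for suffix in ("ational", "ation", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def process(text):
    """Strip markup, tokenize on whitespace, drop stopwords, then stem."""
    tokens = strip_markup(text).lower().split()
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

tokens = process("<p>The computational cost of indexing 3 documents.</p>")
print(tokens)
```

Phrase and named-entity detection are left out; they would need a dictionary or a trained model.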

27 Indexing Issues Index is a data structure constructed from the text to speed up the search. Indexing issues are: How to structure the index How to create the index (storage, time) How to store the index (storage, compression) How to process the index (storage, time) How to update the index (storage, time) The most critical issue is logarithmic or constant-time access to token information. Less important issue is the storage space.

28 Indexing: Inverted File Inverted file structure is composed of two elements: Vocabulary: A set of all different words or terms in the text Occurrence: A list of all documents including text positions where the word appears The word-to-document index can be implemented as a hash table, a sorted array, or a tree-based data structure (such as trie, B-tree). Inverted file is the most widely adopted indexing approach for implementing IR systems and search engines.
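A minimal sketch of an inverted file, assuming a hash table (a Python dict) for the vocabulary and whitespace tokenization. The three sample documents are invented for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to its occurrences: {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

# Hypothetical sample collection.
docs = {
    1: "hotel in Phuket Thailand",
    2: "Pattaya hotel near beach hotel",
    3: "Thailand travel guide",
}
index = build_inverted_index(docs)
vocabulary = sorted(index)      # the set of all distinct terms
print(index["hotel"])           # occurrences of "hotel" with positions
```

A dict gives expected constant-time lookup per term; a sorted array or B-tree would give logarithmic lookup but also support prefix and range queries.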

29 Indexing: Inverted File

30 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion of relevance can be either Binary: yes or no Continuous: sorted by score

31 Classes of Retrieval Models Boolean models (set theoretic) Extended Boolean Vector space models (statistical/algebraic) Generalized Latent Semantic Indexing Probabilistic models Bayesian networks Inference network model Belief network model

32 Retrieval: Boolean Model A document is represented as a set of words. Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope. ((Pattaya OR Phuket) AND Thailand AND hotel AND NOT Hilton) Output: Document is relevant or not. No partial matches or ranking.
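Because each document is a set of words, the Boolean operators map directly onto set operations over posting sets: AND is intersection, OR is union, AND NOT is set difference. The sketch below evaluates the slide's example query over four invented documents.

```python
# Hypothetical sample collection.
docs = {
    1: "Phuket Thailand hotel beach",
    2: "Pattaya Thailand hotel Hilton",
    3: "Thailand travel guide",
    4: "Pattaya Thailand hotel pool",
}

# Posting sets: term -> set of ids of documents containing the term.
postings = {}
for doc_id, text in docs.items():
    for term in set(text.lower().split()):
        postings.setdefault(term, set()).add(doc_id)

def docs_with(term):
    """Return the ids of all documents containing the term."""
    return postings.get(term.lower(), set())

# (Pattaya OR Phuket) AND Thailand AND hotel AND NOT Hilton
result = (
    (docs_with("Pattaya") | docs_with("Phuket"))
    & docs_with("Thailand")
    & docs_with("hotel")
) - docs_with("Hilton")
print(sorted(result))
```

The result is an unranked set: every matching document satisfies the query equally, which is the limitation the following slides discuss.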

33 Retrieval: Boolean Model Popular retrieval model because: Easy to understand for simple queries. Clean formalism. Boolean models can be extended to include ranking. (Extended Boolean model) Reasonably efficient implementations possible for normal queries.

34 Retrieval: Boolean Model - Problems Very rigid: AND means all; OR means any. Difficult to express complex user requests. Difficult to control the number of documents retrieved. All matched documents will be returned. Difficult to rank output. All matched documents logically satisfy the query. Difficult to perform relevance feedback. If a document is identified by the user as relevant or irrelevant, how should the query be modified?

35 Retrieval: Vector Space Model (VSM) Statistical-based model A document is typically represented by a bag of words (unordered words with frequencies). Note: Bag = set that allows multiple occurrences of the same element. User specifies a set of desired terms with optional weights: Weighted query terms: Q = < database 0.5; text 0.8; information 0.2 > Unweighted query terms: Q = < database; text; information > No Boolean conditions specified in the query

36 Retrieval: Vector Space Model (VSM) Retrieval based on similarity between query and documents. Output documents are ranked according to similarity to query. Similarity based on occurrence frequencies of keywords in query and document. Automatic relevance feedback can be supported: Relevant documents added to query. Irrelevant documents subtracted from query.
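The relevance-feedback idea on this slide (add relevant documents to the query, subtract irrelevant ones) can be sketched with sparse vectors stored as dicts. Dropping terms whose weight falls to zero or below is an assumption made here for tidiness; the slide does not specify it, and weighted variants such as Rocchio's method scale each part by a tunable constant.

```python
from collections import defaultdict

def feedback(query, relevant, irrelevant):
    """Return query + sum(relevant docs) - sum(irrelevant docs)."""
    updated = defaultdict(float)
    for term, weight in query.items():
        updated[term] += weight
    for doc in relevant:
        for term, weight in doc.items():
            updated[term] += weight
    for doc in irrelevant:
        for term, weight in doc.items():
            updated[term] -= weight
    # Assumption: terms with non-positive weight are dropped from the query.
    return {term: w for term, w in updated.items() if w > 0}

query = {"database": 1.0, "text": 1.0}
new_query = feedback(
    query,
    relevant=[{"text": 2, "retrieval": 1}],
    irrelevant=[{"database": 3, "sql": 1}],
)
print(new_query)
```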

37 Retrieval: Vector Space Model (VSM) How to determine important words in a document? Word sense? Word n-grams (and phrases, idioms, ...) -> terms. How to determine the degree of importance of a term within a document and within the entire collection? How to determine the degree of similarity between a document and the query? In the case of the web, what is a collection and what are the effects of links, formatting information, etc.?

38 Retrieval: Vector Space Model (VSM) Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary. These orthogonal terms form a vector space. Dimension = t = |vocabulary|. Each term, i, in a document or query, j, is given a real-valued weight, $w_{ij}$. Both documents and queries are expressed as t-dimensional vectors: $d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$

39 Retrieval: Vector Space Model (VSM) Graphic Representation Example: D_1 = 2T_1 + 3T_2 + 5T_3; D_2 = 3T_1 + 7T_2 + T_3; Q = 0T_1 + 0T_2 + 2T_3. [Figure: D_1, D_2, and Q drawn as vectors on the axes T_1, T_2, T_3.] Is D_1 or D_2 more similar to Q? How to measure the degree of similarity? Distance? Angle? Projection?

40 Retrieval: VSM document collection A collection of n documents can be represented in the vector space model by a term-document matrix. An entry in the matrix corresponds to the weight of a term in the document; zero means the term has no significance in the document or it simply doesn't exist in the document.

        T_1    T_2    ...   T_t
D_1     w_11   w_21   ...   w_t1
D_2     w_12   w_22   ...   w_t2
:       :      :            :
D_n     w_1n   w_2n   ...   w_tn

41 Retrieval: VSM Term-frequency More frequent terms in a document are more important, i.e. more indicative of the topic. $f_{ij}$ = frequency of term i in document j. May want to normalize term frequency (tf) by the frequency of the most common term in the same document: $tf_{ij} = f_{ij} / \max_i\{f_{ij}\}$

42 Retrieval: VSM Inverse Document Frequency Terms that appear in many different documents are less indicative of overall topic. $df_i$ = document frequency of term i = number of documents containing term i. $idf_i$ = inverse document frequency of term i = $\log_2(N / df_i)$ (N: total number of documents). An indication of a term's discrimination power. Log used to dampen the effect relative to tf.

43 Retrieval: VSM TF-IDF weighting A typical combined term importance indicator is tf-idf weighting: $w_{ij} = tf_{ij} \times idf_i = tf_{ij} \times \log_2(N / df_i)$. A term occurring frequently in the document but rarely in the rest of the collection is given high weight. Many other ways of determining term weights have been proposed. Experimentally, tf-idf has been found to work well.

44 Retrieval: VSM TF-IDF weighting Given a document containing terms with given frequencies: A(3), B(2), C(1). Assume the collection contains 10,000 documents and the document frequencies of these terms are: A(50), B(1300), C(250). Then, using the natural logarithm (a different base only rescales every idf by the same constant): A: tf = 3/3; idf = ln(10000/50) = 5.3; tf-idf = 5.3. B: tf = 2/3; idf = ln(10000/1300) = 2.0; tf-idf = 1.4. C: tf = 1/3; idf = ln(10000/250) = 3.7; tf-idf = 1.2.
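The worked example can be checked with a short Python sketch. It assumes tf is normalized by the largest term frequency in the document, as the 3/3, 2/3, 1/3 values imply. The logarithm is a parameter: the natural log gives an idf of about 5.3 for term A, while base 2 gives about 7.6; the ranking of the terms is the same either way.

```python
import math

def tf_idf(freqs, doc_freqs, n_docs, log=math.log):
    """Return {term: tf * idf}, with tf = f / max f in this document."""
    max_f = max(freqs.values())
    return {
        term: (f / max_f) * log(n_docs / doc_freqs[term])
        for term, f in freqs.items()
    }

freqs = {"A": 3, "B": 2, "C": 1}             # term frequencies in the document
doc_freqs = {"A": 50, "B": 1300, "C": 250}   # document frequencies
weights = tf_idf(freqs, doc_freqs, n_docs=10000)
weights_log2 = tf_idf(freqs, doc_freqs, n_docs=10000, log=math.log2)
for term in freqs:
    print(term, round(weights[term], 2), round(weights_log2[term], 2))
```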

45 Retrieval: Similarity Measure A similarity measure is a function that computes the degree of similarity between two vectors. Using a similarity measure between the query and each document: It is possible to rank the retrieved documents in the order of presumed relevance. It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.

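46 Retrieval: Cosine Similarity Measure Cosine similarity measures the cosine of the angle between two vectors.

$$\mathrm{CosSim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2}\;\sqrt{\sum_{i=1}^{t} w_{iq}^2}}$$

Example, with Q = 0T_1 + 0T_2 + 2T_3: D_1 = 2T_1 + 3T_2 + 5T_3 gives CosSim(D_1, Q) = 10 / sqrt((4+9+25)(0+0+4)) = 0.81. D_2 = 3T_1 + 7T_2 + 1T_3 gives CosSim(D_2, Q) = 2 / sqrt((9+49+1)(0+0+4)) = 0.13. [Figure: D_1, D_2, and Q on the axes t_1, t_2, t_3, with θ_1 the angle between D_1 and Q and θ_2 the angle between D_2 and Q.]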
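A small sketch that computes the cosine measure for the example vectors. Returning 0.0 when either vector has zero length is a convention chosen here, not something the slide specifies.

```python
import math

def cos_sim(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0   # assumption: zero vector -> 0.0

D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]
print(round(cos_sim(D1, Q), 2), round(cos_sim(D2, Q), 2))
```

Because the measure divides by both vector lengths, a long document is not favored merely for containing more words.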

47 Retrieval: VSM Drawbacks Missing semantic information (e.g. word sense). Missing syntactic information (e.g. phrase structure, word order, proximity information). Assumption of term independence (e.g. ignores synonymy). Lacks the control of a Boolean model (e.g., requiring a term to appear in a document). Given a two-term query A B, may prefer a document containing A frequently but not B, over a document that contains both A and B, but both less frequently.

48 Retrieval: VSM Summary Simple, mathematically based approach. Considers both local (tf) and global (idf) word occurrence frequencies. Provides partial matching and ranked results. Tends to work quite well in practice despite obvious weaknesses. Allows efficient implementation for large document collections. (scalable)

49 IR System Evaluation Coverage of the text collection: extent to which the system includes relevant material Response time: average interval between the time request is made and the time answer is given Presentation of the output (user interface) Effort involved by user in obtaining answers to request Recall: proportion of relevant documents retrieved Precision: proportion of the retrieved documents that are actually relevant
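Recall and precision follow directly from the set of retrieved documents and the set of relevant documents. The document ids below are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4}        # what the system returned
relevant = {2, 4, 5, 6, 7}      # what a human judged relevant
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).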

50 Search Engine (SE)

51 Search Engine VS. Librarian

52 Search Engine : An Overview

53 Web Challenges for IR Distributed Data: Documents spread over millions of different web servers. Volatile Data: Many documents change or disappear rapidly (e.g. dead links). Large Volume: Billions of separate documents. Unstructured and Redundant Data: No uniform structure, HTML errors, up to 30% (near) duplicate documents. Quality of Data: No editorial control, false information, poor quality writing, typos, etc. Heterogeneous Data: Multiple media types (images, video, VRML), languages, character sets, etc.

54 What is Web search? Access to heterogeneous, distributed information Heterogeneous in creation Heterogeneous in motives Heterogeneous in accuracy Multi-billion dollar business Source of new opportunities in marketing Strains the boundaries of trademark and intellectual property laws A source of unending technical challenges

55 Search Engine Market Share

56 Search Engine's Components Web crawler (robot /spider) builds corpus Collects web pages recursively Additional pages from direct submissions & other sources The indexer creates inverted indexes Various policies on which words are indexed, capitalization, support for Unicode, stemming, support for phrases, etc. Query processor serves query results Front end query reformulation, word stemming, capitalization, optimization of Booleans, etc. Back end finds matching documents and ranks them

57 Web Crawlers Start with a comprehensive set of root URLs from which to start the search. Follow all links on these pages recursively to find additional pages. Index all newly found pages in an inverted index as they are encountered. May allow users to directly submit pages to be indexed (and crawled from).

58 Web Crawlers Search Strategies Two possible crawling strategies: Breadth-First Search (BFS) Depth-First Search (DFS) BFS explores uniformly outward from the root page but requires memory of all nodes on the previous level (exponential in depth). Standard spidering method. DFS requires memory of only depth times branching-factor (linear in depth) but gets lost pursuing a single thread.
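The two strategies differ only in which end of the frontier the next page is taken from. The sketch below crawls a small in-memory link graph rather than the real web; the graph and the `max_pages` limit are assumptions for illustration, and the `visited` set prevents fetching the same page twice.

```python
from collections import deque

def crawl(graph, roots, depth_first=False, max_pages=100):
    """Return pages in visiting order. A queue gives BFS, a stack gives DFS."""
    visited = set(roots)
    frontier = deque(roots)
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.pop() if depth_first else frontier.popleft()
        order.append(url)
        for link in graph.get(url, []):
            if link not in visited:     # the web is a graph, not a tree
                visited.add(link)
                frontier.append(link)
    return order

# Hypothetical link graph: page -> pages it links to.
graph = {"a": ["b", "c"], "b": ["d", "a"], "c": ["d"], "d": []}
bfs_order = crawl(graph, ["a"])
dfs_order = crawl(graph, ["a"], depth_first=True)
print(bfs_order, dfs_order)
```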

59 Web Crawlers Summary Must avoid page duplication - Web is a graph not a tree Must obey page-owner restrictions - Robot exclusion protocol - robots.txt Crawling is a time-consuming process. - Distributed crawlers - Update strategy Many open-source crawlers are available from sourceforge.net
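Python's standard library includes a parser for the robot exclusion protocol. The sketch below feeds it rules from a list instead of fetching a real robots.txt; the rules, the crawler name `examplebot`, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, given as lines.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

allowed = parser.can_fetch("examplebot", "http://example.com/index.html")
blocked = parser.can_fetch("examplebot", "http://example.com/private/data.html")
print(allowed, blocked)
```

A crawler would call `can_fetch` before requesting each URL and skip any page for which it returns False.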

60 Search Result Ranking Most search engines use two components in determining their ranking algorithms: (1) On-Page Factors Domain name, file name/url, image file name, title tag, meta tags, image ALT tag, header/font tags, body/visible text (2) Off-Page Factors Link popularity, link reputation

61 Search Result Ranking: On-page factor Domain name Directory and File name / URL /keyword1-keyword2/ Image file name

62 Search Result Ranking: On-page factor Title Tag The title tag is one of the most important factors in achieving high search engine rankings. <HEAD> <TITLE>keyword phrase1, keyword phrase2... </TITLE> </HEAD> Meta Description Tag <META NAME="Description" CONTENT="..."> The meta description tag is a snippet of HTML code that belongs inside the <HEAD> </HEAD> section of a Web page.

63 Search Result Ranking: On-page factor Meta Keywords Tag <meta name="keywords" content="keyword phrase1, keyword phrase2, keyword phrase3"> Image Alt Tags (IMG ALT) <img src="../../images/keyword.gif" alt="sentence including keyword that defines an image."> Body Copy / Body Text / Visible Text Include the keyword phrase at least once in the following tags: headers (H1, H2), bold, italic, text link

64 Search Result Ranking: Off-page factor Link Popularity Number of incoming links to your Web page Link Reputation Anchor text & description Quality of the Web pages in the same category that links to your Web page Authority Sites: Links from highly trusted Web sites such as Yahoo!, Dmoz, etc.

65 Link Analysis: Overview Two well-known link analysis approaches: HITS (Hypertext Induced Topic Selection) by Jon Kleinberg; PageRank by Brin & Page of Google

66 Link Analysis: HITS Hubs are index pages that provide lots of useful links to relevant content pages. Web portal or directories behave like hub pages for various topics. Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic. In-degree (number of pointers to a page) is one simple measure of authority.

67 Link Analysis: PageRank Just measuring in-degree (citation count) ignores the authority of the pages that provide the links.

$$R(p) = c \sum_{q:\, q \to p} \frac{R(q)}{N_q}$$

$N_q$ is the total number of out-links from page q. A page, q, gives an equal fraction of its authority to all the pages it points to (e.g. p). c, or damping factor, is a normalizing constant set so that the rank of all pages always sums to 1.

68 Link Analysis: Summary Link analysis uses information about the structure of the web graph to aid search. It is one of the major innovations in web search. It is the primary reason for Google's success.
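The rank equation on this slide can be solved by repeated substitution. The sketch below is a simplified version under stated assumptions: every page appears as a key in `links` and has at least one out-link, and the normalizing constant c is recomputed each round so that the ranks sum to 1. It omits the random-jump term that production PageRank uses to handle pages without out-links.

```python
def simple_pagerank(links, iterations=50):
    """Iterate R(p) = c * sum of R(q) / N_q over the pages q linking to p."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for q, out_links in links.items():
            share = rank[q] / len(out_links)   # q splits its rank equally
            for p in out_links:
                new_rank[p] += share
        total = sum(new_rank.values())
        rank = {p: r / total for p, r in new_rank.items()}  # c = 1 / total
    return rank

# Hypothetical web graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = simple_pagerank(links)
print({p: round(r, 3) for p, r in rank.items()})
```

In this graph C receives links from both A and B, and passes all of its rank to A, so A and C end up with equal rank (0.4 each) and B with 0.2.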

69 Open-Source IR Platform

70 Lucene: An Open Source IR Library Lucene is a high-performance, scalable Information Retrieval (IR) library. Facts: created by Doug Cutting; free, open-source, originally implemented in Java; a member of the popular Apache Jakarta projects; licensed under the Apache Software License; not a complete application; also ported to Perl, Python, C++, .NET; active user and developer communities; widely used, e.g., by IBM and Microsoft.

71 Lucene: A Typical Application Integration

72 Lucene: Architecture

73 Lucene: API org.apache.lucene.document A text is organized as a document which represents a collection of fields. org.apache.lucene.analysis Analysis contains classes for performing the text processing. org.apache.lucene.index Index package contains classes responsible for indexing process. org.apache.lucene.search Search package contains classes related to search operations.

74 Lucene: Analyzer Lucene provides classes for various text analyzing approaches: WhitespaceAnalyzer: divides text at whitespace. SimpleAnalyzer: divides text at non-letter characters and normalizes token text to lower case. StopAnalyzer: normalizes token text to lower case and removes stop words from a token stream. StandardAnalyzer: combines all of the above tasks, suitable for English texts. These classes are designed for the English language. To make the analysis applicable to a particular language, a language-specific analyzer package must be developed to segment or tokenize texts into a set of words.

75 Lucene: Analyzer Analyzer Demo 1: "The quick brown fox jumped over the lazy dogs" WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs] SimpleAnalyzer: [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs] StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dogs] StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

76 Lucene: Analyzer Analyzer Demo 2: "The XY&Z Corporation: - xyz@example.com" WhitespaceAnalyzer: [The] [XY&Z] [Corporation:] [-] [xyz@example.com] SimpleAnalyzer: [the] [xy] [z] [corporation] [xyz] [example] [com] StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com] StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
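The behavior of the three simpler analyzers in the two demos can be imitated in plain Python. This is not Lucene code: the function names and the short stopword list are assumptions made for illustration, and StandardAnalyzer's grammar-based tokenization (which keeps `xy&z` and the e-mail address whole) is not reproduced.

```python
import re

# Assumption: a tiny stopword list; Lucene's own list is longer.
STOP_WORDS = {"the", "a", "an", "and", "of"}

def whitespace_analyze(text):
    """Split at whitespace only, keeping case and punctuation."""
    return text.split()

def simple_analyze(text):
    """Split at non-letter characters and lowercase each token."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def stop_analyze(text):
    """Like simple_analyze, then remove stop words."""
    return [t for t in simple_analyze(text) if t not in STOP_WORDS]

demo = "The XY&Z Corporation: - xyz@example.com"
for analyze in (whitespace_analyze, simple_analyze, stop_analyze):
    print(analyze.__name__, analyze(demo))
```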

77 Lucene: ThaiAnalyzer ThaiAnalyzer(th) package contains the following classes: LexTo.java : Lexeme Tokenizer (Dictionary-based word segmentation) ParseTree.java : Construct parse tree while tokenizing texts lexitron.txt : Thai dictionary VTrie.java : Trie data structure for efficient dictionary storage and retrieval ThaiTokenizer.java : calls LexTo for word segmentation ThaiFilter.java : check for special characters ThaiAnalyzer.java : calls ThaiTokenizer, ThaiFilter, StandardFilter, LowerCaseFilter and StopFilter, respectively.
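Thai text is written without spaces between words, which is why a dictionary and a trie are needed for tokenization. The sketch below shows one common dictionary-based approach, greedy longest matching over a trie. It is an assumption-laden illustration: the slides do not describe LexTo's actual algorithm, the `Trie` and `segment` names are hypothetical, and an unspaced English string stands in for Thai text with a toy dictionary in place of lexitron.txt.

```python
class Trie:
    """Character trie storing dictionary words for prefix lookup."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = True   # multi-character key cannot clash with a char

    def longest_prefix(self, text, start):
        """Length of the longest dictionary word starting at text[start]."""
        node, best = self.root, 0
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if "$end" in node:
                best = i + 1 - start
        return best

def segment(text, trie):
    """Greedy longest-match segmentation; unknown characters stand alone."""
    tokens, i = [], 0
    while i < len(text):
        length = trie.longest_prefix(text, i) or 1
        tokens.append(text[i:i + length])
        i += length
    return tokens

# Toy dictionary standing in for a real lexicon.
trie = Trie(["in", "form", "information", "retrieval", "system"])
print(segment("informationretrievalsystem", trie))
```

Greedy matching can pick a long word that leaves the remainder unsegmentable, which is one reason real segmenters explore alternative parses.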

78 Sansarn Look! : A system platform for developing search engines and IR systems for Thai language

79 Sansarn Look!: Architecture

80 Sansarn-Lucene API

81 Sansarn Look! : Key Features Built on a widely used open-source IR library Lucene Support full-text and field-text search Support local files and Web pages Support multiple file formats Support English-Thai texts User-friendly via Browser Interface Configurability Additional features on Thai languages: Query suggestion Query spell-check

82 References
1) R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, Addison Wesley.
2) W. B. Frakes and R. Baeza-Yates, eds., Information Retrieval: Data Structures & Algorithms, Prentice Hall.
3) E. Hatcher and O. Gospodnetić, Lucene in Action, Manning Publications.
4) Jon M. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM, 46(5).
5) L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, 7th Int. World Wide Web Conference.
6) The Lucene web site.
7) Sansarn Look!
