Information Retrieval and Web Search

Information Retrieval and Web Search: Text Processing
Instructor: Rada Mihalcea
(Note: Some of the slides in this slide set were adapted from an IR course taught by Prof. Ray Mooney at UT Austin)

[Figure: IR System Architecture. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager, Index (inverted file), Text Database. Data flows: User Need, User Feedback, Query, Logical View, Retrieved Docs, Ranked Docs.]

Text Processing Pipeline
Documents to be indexed OR user query: Friends, Romans, countrymen.
=> Tokenizer => token stream: Friends Romans Countrymen
=> Linguistic modules => modified tokens: friend roman countryman
=> Indexer => inverted index:
   friend: 2, 4
   roman: 1, 2
   countryman: 13, 16

From Text to Tokens to Terms
Tokenization = segmenting text into tokens:
- token = a sequence of characters, in a particular document at a particular position.
- type = the class of all tokens that contain the same character sequence.
  Example: "... to be or not to be ... so be it, he said ..." contains 3 tokens of 1 type (be).
- term = a (normalized) type that is included in the IR dictionary.
Example:
  text = I slept and then I dreamed
  tokens = I, slept, and, then, I, dreamed
  types = I, slept, and, then, dreamed
  terms = sleep, dream (after stopword removal)
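The token/type/term distinction above can be sketched in a few lines of Python. This is only an illustration: the stopword set and the lemma table below are hypothetical and were chosen solely to reproduce the slide's example.

```python
import re

# Hypothetical resources, just large enough for the slide's example.
STOPWORDS = {"i", "and", "then"}
LEMMAS = {"slept": "sleep", "dreamed": "dream"}

def tokens_types_terms(text):
    # Tokens: every occurrence, in document order.
    tokens = re.findall(r"[A-Za-z]+", text)
    # Types: distinct character sequences, in first-seen order.
    types = list(dict.fromkeys(tokens))
    # Terms: normalized types that survive stopword removal.
    terms = [LEMMAS.get(t.lower(), t.lower())
             for t in types if t.lower() not in STOPWORDS]
    return tokens, types, terms

tokens, types, terms = tokens_types_terms("I slept and then I dreamed")
print(len(tokens), len(types), terms)  # 6 5 ['sleep', 'dream']
```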

Simple Tokenization
Analyze text into a sequence of discrete tokens (words).
Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token. Frequently, however, they are not.
Simplest approach: ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens.
More careful approach:
- Separate ? ! ; : [ ] ( ) < >
- Care with . and - (why? when?)
- Care with ??
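A minimal sketch of the "simplest approach" described above, assuming ASCII letters only (the character class `[a-z]` is a simplification and would drop accented letters). The example shows what this tokenizer loses: the hyphen in "e-mail" and the number 1999.

```python
import re

def simple_tokenize(text):
    """Case-insensitive unbroken runs of alphabetic characters."""
    return re.findall(r"[a-z]+", text.lower())

sample = simple_tokenize("Friends, Romans, countrymen: e-mail me in 1999!")
print(sample)
# ['friends', 'romans', 'countrymen', 'e', 'mail', 'me', 'in']
```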

Tokenization
Apostrophes are ambiguous:
- possessive constructions: the book's cover => the book s cover
- contractions: he's happy => he is happy; aren't => are not
- quotations: 'let it be' => let it be
Whitespace in proper names or collocations:
- San Francisco => San_Francisco
- How do we determine that it should be a single token?

Tokenization
Hyphenation:
- co-education => co-education
- state-of-the-art => state of the art? state_of_the_art?
- lowercase, lower-case, lower case => lower_case
- Hewlett-Packard => Hewlett_Packard? Hewlett Packard?
Periods:
- Abbreviations: Mr., Dr.
- Acronyms: U.S.A.
- File names: a.out

Tokenization
Numbers:
- 3/12/91
- Mar. 12, 1991
- 55 B.C.
- B-52
- 100.2.86.144
Unusual strings that should be recognized as tokens:
- C++, C#, B-52, C4.5, M*A*S*H

Tokenization: Tokenizing HTML
Should text in HTML commands not typically seen by the user be included as tokens?
- Words appearing in URLs.
- Words appearing in meta text of images.
Simplest approach: exclude all HTML tag information (between < and >) from tokenization.
Note: it's important to use the same tokenization rules for the queries and the documents.
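The "simplest approach" for HTML can be sketched as follows. This is a rough illustration, not a robust HTML parser: it assumes tags are well formed and does not treat comments, scripts, or entities specially. The function names are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]*>")  # everything between < and >

def strip_tags(html):
    # Replace each tag with a space so adjacent words do not merge.
    return TAG.sub(" ", html)

def tokenize_html(html):
    # Same alphabetic tokenizer for documents and queries.
    return re.findall(r"[a-z]+", strip_tags(html).lower())

page = '<a href="http://example.com/cats">Cute <b>cats</b></a>'
print(tokenize_html(page))  # ['cute', 'cats']
```

Note that the word "cats" inside the URL is discarded along with the tag, which is exactly the trade-off the slide asks about.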

Tokenization is Language Dependent
Need to know the language of the document/query:
- Language identification, based on classifiers trained on short character subsequences as features, is highly effective.
French (reduced definite article, postposed clitic pronouns):
- l'ensemble, un ensemble, donne-moi.
German (compound nouns) needs a compound splitter:
- Computerlinguistik
- Lebensversicherungsgesellschaftsangestellter (life insurance company employee)
Compound splitting for German: usually implemented by finding segments that match against dictionary entries.
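Dictionary-based compound splitting, as described above, might look like the following sketch. The lexicon is a tiny hypothetical one, and the handling of the German linking "s" between segments is an added assumption, not something the slide specifies.

```python
from functools import lru_cache

# Hypothetical mini-lexicon; a real splitter would use a full dictionary.
LEXICON = frozenset({"leben", "versicherung", "gesellschaft",
                     "angestellter", "computer", "linguistik"})

def split_compound(word, lexicon=LEXICON):
    """Split a compound into lexicon entries; return [word] if no split exists."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(word):
            return ()
        # Try the longest dictionary segment first.
        for end in range(len(word), start, -1):
            if word[start:end] not in lexicon:
                continue
            # Continue right after the segment, or after an optional linking "s".
            for nxt in (end, end + 1):
                if nxt == end + 1 and word[end:nxt] != "s":
                    continue
                rest = solve(nxt)
                if rest is not None:
                    return (word[start:end],) + rest
        return None

    parts = solve(0)
    return list(parts) if parts is not None else [word]

print(split_compound("Lebensversicherungsgesellschaftsangestellter"))
# ['leben', 'versicherung', 'gesellschaft', 'angestellter']
```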

Tokenization is Language Dependent
East Asian languages need a word segmenter:
- 莎拉波娃现在居住在美国东南部的佛罗里达
- A unique tokenization is not always guaranteed.
- Complicated in Japanese, with multiple alphabets intermingled.
Word segmentation for Chinese:
- ML sequence tagging models trained on manually segmented text: Logistic Regression, HMMs, Conditional Random Fields.
- Multiple segmentations are possible: the characters 和尚 can either be treated as one word (monk), or as a sequence of two words, 和 (and) and 尚 (still).

Tokenization is Language Dependent
Agglutinative languages:
Tusaatsiarunnanngittualuujunga = "I can't hear very well."
This long word is composed of a root word tusaa- 'to hear' followed by five suffixes:
  -tsiaq-   well
  -junnaq-  be able to
  -nngit-   not
  -tualuu-  very much
  -junga    1st pers. singular present indicative non-specific

Tokenization is Language Dependent
Arabic and Hebrew:
- Written right to left, but with certain items like numbers written left to right.
- Words are separated, but letter forms within a word form complex ligatures.
[Figure: example sentence with a "start" marker showing the reading direction.] Translation: Algeria achieved its independence in 1962 after 132 years of French occupation.

Language Identification
Simplest (and often most effective) approach: calculate the likelihood of each candidate language, based on training data.
Given a sequence of words (or characters) $w_1^n = w_1 \dots w_n$, for each candidate language calculate:
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$, with $w_0 = \langle\text{start}\rangle$
  $P(w_n \mid w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})$
- where $C(w_{n-1} w_n)$ and $C(w_{n-1})$ are simple frequency counts collected from that language's training data.
Choose the language that maximizes the probability.

Language Identification: Avoiding Zero Counts
What if a word / character (or a pair of words / characters) never occurred in the training data?
- A single zero count makes the whole product zero.
Use smoothing: simple techniques that add a (very) small quantity to every count:
  $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$
- where $C(w_{n-1} w_n)$ and $C(w_{n-1})$ are simple frequency counts collected from that language's training data.
- $V$ is the vocabulary size, i.e., the total number of unique words (or characters) in the training data.
A tip: sum $\log P$ values instead of multiplying probabilities, to avoid very small numbers.
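The bigram model with add-one smoothing and log probabilities can be sketched as below. This is an illustrative character-level version; the class name, the two toy training strings, and the choice to compute V from the training text are all assumptions made for the example.

```python
import math
from collections import Counter

START = "<start>"

class CharBigramModel:
    """Character bigram model with add-one smoothing."""

    def __init__(self, training_text):
        symbols = [START] + list(training_text)
        self.unigrams = Counter(symbols)                   # C(w_{n-1})
        self.bigrams = Counter(zip(symbols, symbols[1:]))  # C(w_{n-1} w_n)
        self.vocab_size = len(set(training_text))          # V

    def log_prob(self, text):
        symbols = [START] + list(text)
        total = 0.0
        for prev, cur in zip(symbols, symbols[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.unigrams[prev] + self.vocab_size
            total += math.log(num / den)  # sum logs instead of multiplying
        return total

def identify(text, models):
    # Choose the language whose model gives the highest log-probability.
    return max(models, key=lambda lang: models[lang].log_prob(text))

# Toy training data (hypothetical; far too small for real use).
models = {
    "english": CharBigramModel("the cat sat on the mat and then the dog ran"),
    "french": CharBigramModel("le chat est sur le tapis et le chien court"),
}
print(identify("the", models))       # english
print(identify("le chien", models))  # french
```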

Exercise
French: ainsi de la même manière que l'on peut dire que le Royaume-Uni est un pays, on peut dire que l'angleterre est un pays.
English: the same way that we can say that United Kingdom is a country, we can also say that England is a country.
Unknown (?): this
Identify the language of the unknown string. Assume a character-level bigram model. Apply add-one smoothing. Assume V for French is 27, V for English is 23.

Stopwords
It is typical to exclude high-frequency words (e.g., function words: a, the, in, to; pronouns: I, he, she, it).
Stopwords are language dependent.
For efficiency, store the stopword strings in a hash table so that they can be recognized in constant time, e.g., a simple Python dictionary.
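A minimal sketch of hash-based stopword filtering. It uses a Python `frozenset`, which is hash-based like a dictionary, so each membership test takes expected constant time. The stopword list is just the handful of words named on the slide.

```python
# Stopwords taken from the slide's examples; real lists are longer.
STOPWORDS = frozenset({"a", "the", "in", "to", "i", "he", "she", "it"})

def remove_stopwords(tokens):
    # Each lookup is an expected O(1) hash probe.
    return [t for t in tokens if t.lower() not in STOPWORDS]

kept = remove_stopwords(["He", "went", "to", "the", "market"])
print(kept)  # ['went', 'market']
```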

Exercise
How to determine a list of stopwords?
- For English? May use existing lists of stopwords, e.g., SMART's common-word list (~400 words) or the WordNet stopword list.
- For Spanish? Bulgarian?

Stopwords
The trend is away from using them:
- From large stop lists (200-300), to small stop lists (7-12), to none.
- Good compression techniques (or cheap hardware) mean the cost of including stop words in a system is very small.
- Good query optimization techniques mean you pay little at query time for including stop words.
You need them for:
- Phrase queries: "King of Denmark"
- Various song titles, etc.: "Let it be", "To be or not to be"
- Relational queries: "flights to London"

Normalization
Token normalization = reducing multiple tokens to the same canonical term, such that matches occur despite superficial differences.
1. Create equivalence classes, named after one member of the class:
   - {anti-discriminatory, antidiscriminatory}
   - {U.S.A., USA}
2. Maintain relations between unnormalized tokens:
   - can be extended with lists of synonyms (car, automobile).
   a. Index unnormalized tokens; a query term is expanded into a disjunction of multiple postings lists.
   b. Perform expansion during index construction.
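Both strategies can be illustrated with a short sketch. The rule for forming equivalence classes (dropping periods and hyphens) and the postings and synonym tables are hypothetical, chosen to match the slide's examples.

```python
def class_name(token):
    """Strategy 1: map a token to its equivalence class
    by dropping periods and hyphens."""
    return token.replace(".", "").replace("-", "")

# Strategy 2a: index unnormalized tokens, expand the query at search time.
# Hypothetical postings lists (term -> sorted document IDs).
POSTINGS = {"car": [1, 3], "automobile": [2, 3]}
SYNONYMS = {"car": ["car", "automobile"]}

def search_with_expansion(term):
    # Disjunction (union) of the postings lists of all variants.
    doc_ids = set()
    for variant in SYNONYMS.get(term, [term]):
        doc_ids.update(POSTINGS.get(variant, []))
    return sorted(doc_ids)

print(class_name("U.S.A."), class_name("anti-discriminatory"))
# USA antidiscriminatory
print(search_with_expansion("car"))  # [1, 2, 3]
```

Because the synonym table here lists an expansion only for "car", a query for "automobile" is not expanded; relations maintained this way do not have to be symmetric.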

Normalization
Accents and diacritics in French: résumé vs. resume.
Umlauts in German: Tuebingen vs. Tübingen.
Most important criterion: how do users like to write their queries for these words?
Even in languages that standardly have accents, users often do not type them.
Often best to normalize to a de-accented term:
- Tuebingen, Tübingen, Tubingen => Tubingen
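De-accenting can be sketched with the standard library's `unicodedata` module. The second function, which collapses the German transliterations "ae", "oe", "ue", is a deliberately crude assumption added only to reproduce the Tuebingen example; applied blindly it would damage words such as "Quelle".

```python
import unicodedata

def strip_diacritics(token):
    # Decompose characters (é -> e + combining accent), then drop the marks.
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def fold_german(token):
    # Crude, hypothetical rule: treat ae/oe/ue as transliterated umlauts.
    folded = strip_diacritics(token)
    for digraph, vowel in (("ae", "a"), ("oe", "o"), ("ue", "u")):
        folded = folded.replace(digraph, vowel)
    return folded

print(strip_diacritics("résumé"))  # resume
print({fold_german(t) for t in ("Tuebingen", "Tübingen", "Tubingen")})
# {'Tubingen'}
```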

Normalization
Case-folding = reduce all letters to lower case:
- allows Automobile at the beginning of a sentence to match automobile.
- allows user-typed ferrari to match Ferrari in documents.
- but may lead to unintended matches: the Fed vs. fed; Bush, Black, General Motors, Associated Press, ...
Heuristic = lowercase only some tokens:
- words at the beginning of sentences.
- all words in a title where most words are capitalized.
Truecasing = use a classifier to decide when to fold:
- trained on many heuristic features.

Normalization
British vs. American spellings: colour vs. color.
Multiple formats for dates, times: 09/30/2013 vs. Sep 30, 2013.
Asymmetric expansion:
  Enter: window   => Search: window, windows
  Enter: windows  => Search: Windows, windows, window
  Enter: Windows  => Search: Windows
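The asymmetric expansion table above can be represented directly as a lookup, sketched here under the assumption that the index is case-sensitive and that unlisted terms are searched as typed.

```python
# Expansion table copied from the slide; lookups are case-sensitive.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}

def expand(query_term):
    # Terms without an entry are searched as typed (an assumption).
    return EXPANSIONS.get(query_term, [query_term])

print(expand("windows"))  # ['Windows', 'windows', 'window']
print(expand("Windows"))  # ['Windows']
```

The asymmetry is visible in the table: "window" does not reach "Windows", while "windows" does.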

Lemmatization
Reduce inflectional/variant forms to base form.
Direct impact on vocabulary size.
E.g.:
- am, are, is => be
- car, cars, car's, cars' => car
- the boy's cars are different colors => the boy car be different color
How to do this?
- Need a list of grammatical rules + a list of irregular words.
- Children => child, spoken => speak
Practical implementation: use WordNet's morphstr function.
- Perl: WordNet::QueryData (first returned value from validforms function)
- Python: NLTK.stem
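The "rules + irregular words" recipe can be sketched as follows. This toy lemmatizer is an assumption-laden illustration, not WordNet's algorithm: the irregular table holds only the slide's examples, and the two rules (strip possessive marker, strip plural "s") will mis-handle many real words.

```python
# Irregular forms from the slide's examples (hypothetical, incomplete table).
IRREGULAR = {"am": "be", "are": "be", "is": "be",
             "children": "child", "spoken": "speak"}

def lemmatize(token):
    word = token.lower()
    if word in IRREGULAR:            # irregular words take priority
        return IRREGULAR[word]
    if word.endswith("'s"):          # rule 1: possessive 's
        word = word[:-2]
    elif word.endswith("'"):         # rule 1b: plural possessive s'
        word = word[:-1]
    # Rule 2: plural -s, with a few guards against over-stripping.
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        word = word[:-1]
    return word

sentence = "the boy's cars are different colors"
print(" ".join(lemmatize(t) for t in sentence.split()))
# the boy car be different color
```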

Stemming
Reduce tokens to the root form of words to recognize morphological variation.
- computer, computational, computation are all reduced to the same token, compute.
Correct morphological analysis is language specific and can be complex.
Stemming blindly strips off known affixes (prefixes and suffixes) in an iterative fashion.
Example:
  for example compressed and compression are both accepted as equivalent to compress.
  => for exampl compres and compres are both accept as equival to compres.

Porter Stemmer
Simple procedure for removing known affixes in English without using a dictionary.
- Can produce unusual stems that are not English words: computer, computational, computation are all reduced to the same token, comput.
- May conflate (reduce to the same token) words that are actually distinct.
- May not recognize all morphological derivations.

Typical rules in Porter
- sses => ss
- ies => i
- ational => ate
- tional => tion
See the class website for a link to the official Porter stemmer site, which provides ready-to-use Python implementations.
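The four rules listed above can be applied with a simple suffix-rewriting loop. This sketch is not the Porter stemmer: the real algorithm runs several ordered steps and checks conditions on the remaining stem before rewriting, none of which is modeled here.

```python
# The four suffix rules from the slide. "ational" is listed before
# "tional" so that the longer suffix is tried first.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_rules(word):
    """Rewrite the first matching suffix; leave other words unchanged."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

for w in ("caresses", "ponies", "relational", "conditional"):
    print(w, "=>", apply_rules(w))
# caresses => caress
# ponies => poni
# relational => relate
# conditional => condition
```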

Porter Stemmer Errors
Errors of commission (distinct words conflated):
- organization, organ => organ
- police, policy => polic
- arm, army => arm
Errors of omission (related words not conflated):
- cylinder, cylindrical
- create, creation
- Europe, European

Other stemmers
Other stemmers exist, e.g., the Lovins stemmer: http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
- Single-pass, longest suffix removal (about 250 rules).
Stemming is language- and often application-specific: open source and commercial plug-ins.
Does it improve IR performance?
- Mixed results for English: improves recall, but hurts precision, e.g., operative (dentistry) => oper.
- Definitely useful for languages with richer morphology: Spanish, German, Finnish (30% gains).