Boolean Queries. Keywords combined with Boolean operators:

Similar documents
Boolean Queries. Keywords combined with Boolean operators:

Information Retrieval. Γλώσσες Επερώτησης Query Languages

Query Languages. Berlin Chen Reference: 1. Modern Information Retrieval, chapter 4

Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5

Information Retrieval. (M&S Ch 15)

DVA337 HT17 - LECTURE 4. Languages and regular expressions

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Adding Source Code Searching Capability to Yioop

Languages and Strings. Chapter 2

Indexing and Searching

Indexing and Searching

Recap of the previous lecture. This lecture. A naïve dictionary. Introduction to Information Retrieval. Dictionary data structures Tolerant retrieval

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems

Indexing and Searching

Ges$one Avanzata dell Informazione Part A Full- Text Informa$on Management. Full- Text Indexing

Tolerant Retrieval. Searching the Dictionary Tolerant Retrieval. Information Retrieval & Extraction Misbhauddin 1

CS 6320 Natural Language Processing

Chapter IR:IV. IV. Indexing. Abstract Model of Ranking Inverted Indexes Compression Auxiliary Structures Index Construction Query Processing

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

Linear Work Suffix Array Construction

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

Caveat lector: This is the first edition of this lecture note. Please send bug reports and suggestions to

Formal languages and computation models

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

Proof Techniques Alphabets, Strings, and Languages. Foundations of Computer Science Theory

Parser Tools: lex and yacc-style Parsing

Last lecture CMSC330. This lecture. Finite Automata: States. Finite Automata. Implementing Regular Expressions. Languages. Regular expressions

Soir 1.4 Enterprise Search Server

Parser Tools: lex and yacc-style Parsing

Longest Common Subsequence

UC Berkeley CS 170: Efficient Algorithms and Intractable Problems Handout 15 Lecturer: Michael Jordan October 26, Notes 15 for CS 170

Chapter 6: Information Retrieval and Web Search. An introduction

Index extensions. Manning, Raghavan and Schütze, Chapter 2. Daniël de Kok

Overview. Lecture 3: Index Representation and Tolerant Retrieval. Type/token distinction. IR System components

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

Lecture 3: Phrasal queries and wildcards

3-1. Dictionaries and Tolerant Retrieval. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

Regular Expressions. Chapter 6

Index-assisted approximate matching

1 Introduciton. 2 Tries /651: Algorithms CMU, Spring Lecturer: Danny Sleator

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

Information Retrieval

1.3 Functions and Equivalence Relations 1.4 Languages

CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields

CS173 Longest Increasing Substrings. Tandy Warnow

An Efficient Algorithm for Identifying the Most Contributory Substring. Ben Stephenson Department of Computer Science University of Western Ontario

Efficient Implementation of Postings Lists

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

SMURF Language Reference Manual Serial MUsic Represented as Functions

CS314: FORMAL LANGUAGES AND AUTOMATA THEORY L. NADA ALZABEN. Lecture 1: Introduction

How to Search: EBSCO HOST

DATA STRUCTURE AND ALGORITHM USING PYTHON

More on indexing CE-324: Modern Information Retrieval Sharif University of Technology

Search Engine Architecture II

HELP DOCUMENTATION A User s Guide

Lexical Analysis (ASU Ch 3, Fig 3.1)

Text Search and Similarity Search

Modern Information Retrieval

Lexical Analysis. Lecture 3-4

Applications of. Regular Expressions BY NIKHIL KUMAR KATTE

Non-deterministic Finite Automata (NFA)

15.4 Longest common subsequence

Information Retrieval 6. Index compression

Chapter 2. Architecture of a Search Engine

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table

"GET /cgi-bin/purchase?itemid=109agfe111;ypcat%20passwd mail 200

Slides for Faculty Oxford University Press All rights reserved.

Suffix-based text indices, construction algorithms, and applications.

1. Draw the state graphs for the finite automata which accept sets of strings composed of zeros and ones which:

Ling/CSE 472: Introduction to Computational Linguistics. 4/6/15: Morphology & FST 2

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

Data structures for string pattern matching: Suffix trees

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

Glossary. ASCII: Standard binary codes to represent occidental characters in one byte.

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

Methods for High Degrees of Similarity. Index-Based Methods Exploiting Prefixes and Suffixes Exploiting Length

QUERY PROCESSING AND OPTIMIZATION FOR PICTORIAL QUERY TREES HANAN SAMET

Indexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

Regular Expressions Overview Suppose you needed to find a specific IPv4 address in a bunch of files? This is easy to do; you just specify the IP

S. Dasgupta, C.H. Papadimitriou, and U.V. Vazirani 165

Information Retrieval. Chap 7. Text Operations

Index. Bitmap Heap Scan, 156 Bitmap Index Scan, 156. Rahul Batra 2018 R. Batra, SQL Primer,

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Introduction to Spatial Database Systems

ECS 120 Lesson 7 Regular Expressions, Pt. 1

COMP4128 Programming Challenges

CSC 467 Lecture 3: Regular Expressions

1. (a) O(log n) algorithm for finding the logical AND of n bits with n processors

2 Regular Languages. 2.1 Languages

Methods for High Degrees of Similarity. Index-Based Methods Exploiting Prefixes and Suffixes Exploiting Length

CS54701: Information Retrieval

UNIT II LEXICAL ANALYSIS

Dr. D.M. Akbar Hussain

Index. caps method, 180 Character(s) base, 161 classes

An Evolution of Mathematical Tools

XFST2FSA. Comparing Two Finite-State Toolboxes

Transcription:

Query Languages 1

Boolean Queries Keywords combined with Boolean operators: OR: (e 1 OR e 2 ) AND: (e 1 AND e 2 ) BUT: (e 1 BUT e 2 ) Satisfy e 1 but not e 2 Negation only allowed using BUT to allow efficient use of inverted index by filtering another efficiently retrievable set. Naïve users have trouble with Boolean logic. 2

Boolean Retrieval with Inverted Indices Primitive keyword: Retrieve containing documents using the inverted index. OR: Recursively retrieve e 1 and e 2 and take union of results. AND: Recursively retrieve e 1 and e 2 and take intersection of results. BUT: Recursively retrieve e 1 and e 2 and take set difference of results. 3

Natural Language Queries Full text queries as arbitrary strings. Typically just treated as a bag-of-words for a vector-space model. Typically processed using standard vectorspace retrieval methods. 4

Phrasal Queries Retrieve documents with a specific phrase (ordered list of contiguous words) information theory May allow intervening stop words and/or stemming. buy camera matches: buy a camera buying the cameras etc. 5

Phrasal Retrieval with Inverted Indices Must have an inverted index that also stores positions of each keyword in a document. Retrieve documents and positions for each individual word, intersect documents, and then finally check for ordered contiguity of keyword positions. Best to start contiguity check with the least common word in the phrase. 6

Phrasal Search Find set of documents D in which all keywords (k 1 k m ) in phrase occur (using AND query processing). Intitialize empty set, R, of retrieved documents. For each document, d, in D: Get array, P i,of positions of occurrences for each k i in d Find shortest array P s of the P i s For each position p of keyword k s in P s For each keyword k i except k s Use binary search to find a position (p s + i) in the array P i If correct position for every keyword found, add d to R Return R 7

Proximity Queries List of words with specific maximal distance constraints between terms. Example: dogs and race within 4 words match dogs will begin the race May also perform stemming and/or not count stop words. 8

Proximity Retrieval with Inverted Index Use approach similar to phrasal search to find documents in which all keywords are found in a context that satisfies the proximity constraints. During binary search for positions of remaining keywords, find closest position of k i to p and check that it is within maximum allowed distance. 9

Pattern Matching Allow queries that match strings rather than word tokens. Requires more sophisticated data structures and algorithms than inverted indices to retrieve efficiently. 10

Simple Patterns Prefixes: Pattern that matches start of word. anti matches antiquity, antibody, etc. Suffixes: Pattern that matches end of word: ix matches fix, matrix, etc. Substrings: Pattern that matches arbitrary subsequence of characters. rapt matches enrapture, velociraptor etc. Ranges: Pair of strings that matches any word lexicographically (alphabetically) between them. tin to tix matches tip, tire, title, etc. 11

Allowing Errors What if query or document contains typos or misspellings? Judge similarity of words (or arbitrary strings) using: Edit distance (Levenstein distance) Longest Common Subsequence (LCS) Allow proximity search with bound on string similarity. 12

Edit (Levenstein) Distance Minimum number of character deletions, additions, or replacements needed to make two strings equivalent. misspell to mispell is distance 1 misspell to mistell is distance 2 misspell to misspelling is distance 3 Can be computed efficiently using dynamic programming in O(mn) time where m and n are the lengths of the two strings being compared. 13

Longest Common Subsequence (LCS) Length of the longest subsequence of characters shared by two strings. A subsequence of a string is obtained by deleting zero or more characters. Examples: misspell to mispell is 7 misspelled to misinterpretted is 7 mis p e ed 14

Searching for Similar Words When spell-correcting a word, it is inefficient to serially search every word in the dictionary, compute the edit distance or LCS for each, and then take the most similar word. Use indexing to find most similar dictionary word without doing a linear search. 15

k-gram Index An inverted index for sequences of k characters contained in a word. 3-grams for index : $in, ind, nde, dex, ex$ (where $ is a special char denoting start or end of a word) For each k-gram encountered in the dictionary, the k-gram index has a pointer to all words that contain that k-gram. dex {index, dexterity, ambidextrous} 16

Using a k-gram Index Given a word, generate its bag of k-grams and use the k-gram index like a normal inverted index to find a word that contains many of the same k- grams. Like normal document retrieval except: words k-grams documents words Example: Query: endex {$en, end, nde, dex, ex$} Retrieval Result: 1) index, 2) ended, 3) endear. Compute detailed score just for top retrievals and take final top-scoring candidate. 17

Regular Expressions Language for composing complex patterns from simpler ones. An individual character is a regex. Union: If e 1 and e 2 are regexes, then (e 1 e 2 ) is a regex that matches whatever either e 1 or e 2 matches. Concatenation: If e 1 and e 2 are regexes, then e 1 e 2 is a regex that matches a string that consists of a substring that matches e 1 immediately followed by a substring that matches e 2 Repetition (Kleene closure): If e 1 is a regex, then e 1 * is a regex that matches a sequence of zero or more strings that match e 1 18

Regular Expression Examples (u e)nabl(e ing) matches unable unabling enable enabling (un en)*able matches able unable unenable enununenable 19

Enhanced Regex s (Perl) Special terms for common sets of characters, such as alphabetic or numeric or general wildcard. Special repetition operator (+) for 1 or more occurrences. Special optional operator (?) for 0 or 1 occurrences. Special repetition operator for specific range of number of occurrences: {min,max}. A{1,5} One to five A s. A{5,} Five or more A s A{5} Exactly five A s 20

Perl Regex s Character classes: \w (word char) Any alpha-numeric (not: \W) \d (digit char) Any digit (not: \D) \s (space char) Any whitespace (not: \S). (wildcard) Anything Anchor points: \b (boundary) Word boundary ^ Beginning of string $ End of string 21

Perl Regex Examples U.S. phone number with optional area code: /\b(\(\d{3}\)\s?)?\d{3}-\d{4}\b/ Email address: /\b\s+@\s+(\.com \.edu \.gov \.org \.net)\b/ Note: Perl regex s supported in java.util.regex package 22

Structural Queries Assumes documents have structure that can be exploited in search. Structure could be: Fixed set of fields, e.g. title, author, abstract, etc. Hierarchical (recursive) tree structure: book chapter chapter title section title section title subsection 23

Queries with Structure Allow queries for text appearing in specific fields: nuclear fusion appearing in a chapter title SFQL: Relational database query language SQL enhanced with full text search. Select abstract from journal.papers where author contains Teller and title contains nuclear fusion and date < 1/1/1950 24