Text Technologies for Data Science INFR Indexing (2) Instructor: Walid Magdy

Similar documents
Text Technologies for Data Science INFR Indexing (2) Instructor: Walid Magdy

Tolerant Retrieval. Searching the Dictionary Tolerant Retrieval. Information Retrieval & Extraction Misbhauddin 1

3-1. Dictionaries and Tolerant Retrieval. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

Recap of the previous lecture. This lecture. A naïve dictionary. Introduction to Information Retrieval. Dictionary data structures Tolerant retrieval

Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology

Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology

Overview. Lecture 3: Index Representation and Tolerant Retrieval. Type/token distinction. IR System components

Dictionaries and Tolerant retrieval

Information Retrieval

Dictionaries and tolerant retrieval. Slides by Manning, Raghavan, Schutze

Information Retrieval

Information Retrieval

Recap of last time CS276A Information Retrieval

OUTLINE. Documents Terms. General + Non-English English. Skip pointers. Phrase queries

Information Retrieval

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression

Natural Language Processing and Information Retrieval

Part 2: Boolean Retrieval Francesco Ricci

CS347. Lecture 2 April 9, Prabhakar Raghavan

Today s topics CS347. Inverted index storage. Inverted index storage. Processing Boolean queries. Lecture 2 April 9, 2001 Prabhakar Raghavan

Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries

Course structure & admin. CS276A Text Information Retrieval, Mining, and Exploitation. Dictionary and postings files: a fast, compact inverted index

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Indexing. Lecture Objectives. Text Technologies for Data Science INFR Learn about and implement Boolean search Inverted index Positional index

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Information Retrieval

Information Retrieval

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Multimedia Information Extraction and Retrieval Indexing and Query Answering

Search Engines. Gertjan van Noord. September 17, 2018

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

Introduction to Information Retrieval

Information Retrieval

Information Retrieval

IN4325 Indexing and query processing. Claudia Hauff (WIS, TU Delft)

Information Retrieval

Introduction to. CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan. Lecture 4: Index Construction

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Recap: lecture 2 CS276A Information Retrieval

Indexing and Searching

Index extensions. Manning, Raghavan and Schütze, Chapter 2. Daniël de Kok

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 6: Information Retrieval I. Aidan Hogan

Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5

Indexing and Searching

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO 2018

Information Retrieval 6. Index compression

Indexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton

Information Retrieval

SIGNAL COMPRESSION Lecture Lempel-Ziv Coding

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Lecture 2: Encoding models for text. Many thanks to Prabhakar Raghavan for sharing most content from the following slides

INDEX CONSTRUCTION 1

Information Retrieval. Lecture 3 - Index compression. Introduction. Overview. Characterization of an index. Wintersemester 2007

Information Retrieval. Chap 7. Text Operations

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Part 3: The term vocabulary, postings lists and tolerant retrieval Francesco Ricci

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks

Module 4: Index Structures Lecture 13: Index structure. The Lecture Contains: Index structure. Binary search tree (BST) B-tree. B+-tree.

LCP Array Construction

Information Retrieval

Direct Addressing Hash table: Collision resolution how handle collisions Hash Functions:

Information Retrieval and Organisation

Index Construction. Slides by Manning, Raghavan, Schutze

Database Technology. Topic 7: Data Structures for Databases. Olaf Hartig.

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

Index Compression. David Kauchak cs160 Fall 2009 adapted from:

CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University

The questions will be short answer, similar to the problems you have done on the homework

CS 525: Advanced Database Organization 04: Indexing

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Informa(on Retrieval

CSE380 - Operating Systems

CSE 530A. B+ Trees. Washington University Fall 2013

Virtualization Technique For Replica Synchronization

Extra: B+ Trees. Motivations. Differences between BST and B+ 10/27/2017. CS1: Java Programming Colorado State University

Inside the InfluxDB Storage Engine

Indexing and Searching

RAID in Practice, Overview of Indexing

Structural Text Features. Structural Features

File Structures and Indexing

Representing Data Elements

230 Million Tweets per day

Last Class: Demand Paged Virtual Memory

CSCI 5417 Information Retrieval Systems Jim Martin!

Lecture 5: Information Retrieval using the Vector Space Model

Binghamton University. CS-220 Spring Cached Memory. Computer Systems Chapter

DDS Dynamic Search Trees

memory management Vaibhav Bajpai

modern database systems lecture 4 : information retrieval

FUNCTIONALLY OBLIVIOUS (AND SUCCINCT) Edward Kmett

Introduction to. Algorithms. Lecture 5

The Adaptive Radix Tree

Distributed computing: index building and use

CSE 562 Database Systems

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson

I/O-Algorithms Lars Arge

Transcription:

Text Technologies for Data Science INFR11145 Indexing (2) Instructor: Walid Magdy 03-Oct-2018 Lecture Objectives Learn more about indexing: Structured documents Extent index Index compression Data structure Wild-char search and applications * You are not asked to implement any of the content in this lecture, but you might think of using some for your course project 2 1

Structured Documents Document are not always flat: Meta-data: title, author, time-stamp Structure: headline, section, body Tags: link, hashtag, mention How to deal with it? Neglect! Create separate index for each field Use extent index 3 Extent Index Special term for each element/field/tag Index all terms in a structured document as plain text Terms in a given field/tag get special additional entry Posting: spans of window related to a given field Allows multiple overlapping spans of different types he 1,1 1,5 2,1 3,3 4,3 5,1 drink 1,8 2,4 2,6 2,8 3,6 ink 3,8 4,2 5,8 pink 4,8 5,7 Link 3,1:2 4,1:4 5,7:8 4,5 5,6 D1: He likes to wink, he likes to drink D2: He likes to drink, and drink, and drink D3: The thing he likes to drink is ink D4: The ink he likes to drink is pink D5: He likes to wink, and drink pink ink 4 2

Using Extent Doc: 1 1 2 3 Headline: Information retrieval lecture Text: this is lecture 6 of the TTDS course on IR 4 5 6 7 8 Query Headline: lecture lecture 1,3 1,4 2,9 3,7 3,11 Headline 1,1:3 2,1:5 3,1:4 5 Index Compression Inverted indices are big Large disk space large I/O operations Index compression Reduce space less I/O Allow more chunks of index to be cached in memory Large size goes to: terms? document numbers? Ideas: Compress document numbers, how? 6 3

Delta Encoding Large collections large sequence of doc IDs Large ID number more bytes to store 1 byte: 0 255 2 bytes: 0 65,535 4 bytes: 0 4.3 B Idea: delta in ID instead of full ID Very useful, especially for frequent terms 3 bytes term 100002 100007 100008 100011 100019 term? 5 1 3 7 321 15 2 1 byte 2 bytes 7 v-byte Encoding Have different byte storage for each delta in index Use fewer bits to encode High bit in a byte 1/0 = terminate/continue Remaining 7 bits binary number Examples: 6 10000110 127 11111111 128 0000000110000000 00000010000000 Real example sequence: 100001010000000011000001010000111 0000101 000000010000010 0000111 5 130 7 8 4

Index Compression There are more sophisticated compression algorithms: Elias gamma code The more compression Less storage More processing In general Less I/O + more processing > more I/O + no processing > = faster With new data structures, problem is less severe 9 Dictionary Data Structures The dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list For small collections, load full dictionary in memory. In real-life, cannot load all index to memory! Then what to load? How to reach quickly? What data structure to use for inverted index? 10 5

Hashes Each vocabulary term is hashed to an integer Pros Lookup is faster than for a tree: O(1) Cons No easy way to find minor variants: judgment/judgement No prefix search If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything 11 Trees: Binary Search Tree a-m Root n-z a-hu hy-m n-sh si-z 12 6

Trees: B-tree a-hu hy-m n-z Every internal node has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4]. 13 Trees Pros? Solves the prefix problem (terms starting with ab ) Cons? Slower: O(log M) [and this requires balanced tree] Rebalancing binary trees is expensive But B-trees mitigate the rebalancing problem 14 7

Wild-Card Queries: * mon*: find all docs containing any word beginning mon. Easy with binary tree (or B-tree) lexicon *mon: find words ending in mon : harder Maintain an additional B-tree for terms backwards. How can we enumerate all terms meeting the wildcard query pro*cent? Query processing: se*ate AND fil*er? Expensive 15 Permuterm Indexes Transform wild-card queries so that the * occurs at the end For term hello, index under: hello$, ello$h, llo$he, lo$hel, o$hell, $hello where $ is a special symbol. Rotate query wild-card to the right Queries: X lookup on X$ X* lookup on $X* *X lookup on X$* X*Ylookup on Y$X* Index Size? 16 8

Character n-gram Indexes Enumerate all n-grams (sequence of n chars) occurring in any term e.g., from text April is the cruelest month we get the 2- grams (bigrams) $a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,ue,el,le,es,st,t$, $m,mo,on,nt,h$ $ is a special word boundary symbol Maintain a second inverted index from bigrams to dictionary terms that match each bigram. Character n-grams terms terms documents 17 Character n-gram Indexes The n-gram index finds terms based on a query consisting of n-grams (here n=2). $m mace madden mo on among almond amortize among Wild card query Find possible terms Filter unmatching terms Search collection for all terms Documents Index of char bigrams Collection index of terms 18 9

Character n-gram Indexes: Query time Step 1: Query mon* $m AND mo AND on It would still match moon. Step 2: Must post-filter these terms against query. Phrase match, or post-step1 match Step 3: Surviving enumerated terms are then looked up in the term-document inverted index. Montreal OR monster OR monkey Wild-cards can result in expensive query execution (very large disjunctions ) 19 Character n-gram Indexes: Applications Spelling Correction Create n-gram representation for words Build index for words: Dictionary of words documents (each word is a document) Character n-grams terms When getting a search term that is misspelled (OOV or not frequent), find possible corrections Possible corrections = most matching results Query: elepgant $e el le ep pg ga an nt t$ Results: elegant $e el le eg ga an nt t$ elephant $e el le ep ph ha an nt t$ 20 10

ا$ أ$ 10/3/2018 Character n-gram Indexes: Applications Char n-grams can be used as direct index terms for some applications: Arabic IR, when no stemmer/segmenter is available Documents with spelling mistakes: OCR documents Word char representation can by with multiple n s elephant 2/3-gram $e el le ep ph ha an nt t$ $el $ele lep eph pha han ant nt$ The children behaved well Her children are cute ال أل أب بن نا اء ء$ أب بن نا اء ءه ها ا$ األبناء تصرفوا جيدا أبناءها لطاف Document: Elepbant $e el le ep pb ba an nt t$ Query: Elephant $e el le ep ph ha an nt t$ 21 Summary Index can by multilayer Extent index (multi-terms in one position in document) Index does not have to be formed of words Character n-grams representation of words Two indexes are sometimes used Index of character n-grams to find matching words Index of terms to search for matched words 22 11

Resources Text book 1: Intro to IR, Chapter 3.1 3.4 Text book 2: IR in Practice, Chapter 5 23 12