Relational Approach. Problem Definition

Similar documents
Relational Approach. Problem Definition

On Scalable Information Retrieval Systems

CS34800, Fall 2016, Assignment 4

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Information Retrieval. (M&S Ch 15)

Relevance Feedback & Other Query Expansion Techniques

Chapter 6: Information Retrieval and Web Search. An introduction

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

CS 582 Database Management Systems II

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

ADT 2005 Lecture 7 Chapter 10: XML

Database System Concepts

CS54701: Information Retrieval

A Parallel Relational Database Management System Approach to Relevance Feedback in Information Retrieval

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson

Chapter 13 XML: Extensible Markup Language

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

Chapter 2. Architecture of a Search Engine

Lecture 5: Information Retrieval using the Vector Space Model

One of the main selling points of a database engine is the ability to make declarative queries---like SQL---that specify what should be done while

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Introduction to Information Retrieval

COMP6237 Data Mining Searching and Ranking

CS3DB3/SE4DB3/SE6M03 TUTORIAL

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 27-1

EMERGING TECHNOLOGIES. XML Documents and Schemas for XML documents

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Q: Given a set of keywords how can we return relevant documents quickly?

Information Retrieval

Approaches. XML Storage. Storing arbitrary XML. Mapping XML to relational. Mapping the link structure. Mapping leaf values

CSCB20 Week 4. Introduction to Database and Web Application Programming. Anna Bretscher Winter 2017

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS3DB3/SE4DB3/SE6M03 TUTORIAL

Introduction to Information Retrieval

modern database systems lecture 4 : information retrieval

Full-Text Indexing For Heritrix

Database System Concepts

Chapter 3: Introduction to SQL

Introduction to Information Retrieval

Chapter 3: Introduction to SQL. Chapter 3: Introduction to SQL

CS 6320 Natural Language Processing

Comparative Analysis of Sparse Matrix Algorithms For Information Retrieval

Structured Retrieval Models and Query Languages. Query Language

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Query Processing & Optimization

Boolean Model. Hongning Wang

CS34800 Information Systems. Course Overview Prof. Walid Aref January 8, 2018

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable

Query Languages. Berlin Chen Reference: 1. Modern Information Retrieval, chapter 4

Query Evaluation Strategies

SQL Data Query Language

Models for Document & Query Representation. Ziawasch Abedjan

Lecture 6 - More SQL

Semistructured Content

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Relational Algebra. ICOM 5016 Database Systems. Roadmap. R.A. Operators. Selection. Example: Selection. Relational Algebra. Fundamental Property

Information Retrieval

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

The SQL data-definition language (DDL) allows defining :

Intro to Peer-to-Peer Search

Semistructured Content

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Information Retrieval CSCI

Classic IR Models 5/6/2012 1

CS425 Fall 2016 Boris Glavic Chapter 1: Introduction

Vector Space Scoring Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

Indexing and Query Processing. What will we cover?

XML and Relational Databases

MySQL Views & Comparing SQL to NoSQL

Chapter 3: Introduction to SQL

Data Formats and APIs

Chapter 2 Introduction to Relational Models

Query Processing Strategies and Optimization

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL

COMP9321 Web Application Engineering. Extensible Markup Language (XML)

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 11: Indexing and Hashing" Chapter 11: Indexing and Hashing"

More on indexing CE-324: Modern Information Retrieval Sharif University of Technology

CSE 544 Data Models. Lecture #3. CSE544 - Spring,

Introduction to Information Retrieval

A DTD-Syntax-Tree Based XML file Modularization Browsing Technique

Semistructured Content

Modern information retrieval

Semistructured Data and XML

Informatics 1: Data & Analysis

Overview. Introduction. Introduction XML XML. Lecture 16 Introduction to XML. Boriana Koleva Room: C54

Overview. Structured Data. The Structure of Data. Semi-Structured Data Introduction to XML Querying XML Documents. CMPUT 391: XML and Querying XML

Introduction to Information Retrieval. (COSC 488) Spring Nazli Goharian. Course Outline

XQuery. Leonidas Fegaras University of Texas at Arlington. Web Databases and XML L7: XQuery 1

Information Retrieval: Retrieval Models

Clustering (COSC 416) Nazli Goharian. Document Clustering.

XML-QE: A Query Engine for XML Data Soures

XML: extensible Markup Language

Introduction to XML. Yanlei Diao UMass Amherst April 17, Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu, Zack Ives and Gerome Miklau.

Transcription:

Relational Approach (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Slides are mostly based on Information Retrieval Algorithms and Heuristics, Grossman & Frieder 1 Problem Definition Three conceptual data types: Structured: Objective (absolute) correctness; perfect accuracy; efficiency only issue (e.g.; relational database) Unstructured: Subjective (relative) correctness; accuracy & efficiency trade-off (e.g.; documents, images, video, sound). Semi-structured: Combination of structured and unstructured (e.g.; XML) Problem: Individually querying each type & merging answers. A Solution: IR as an application of Relational Database Management System (RDBMS) with typical IR Functionalities: Boolean query Relevance Ranking multiple similarity measures Proximity Search 2 1

Relational Inverted Index All inverted index entries <term> <list of documents> e.g., vehicle D1, D3, D4 results in: term vehicle vehicle vehicle docid D1 D3 D4 4 Text Retrieval Conference (TREC) Sample Document <DOC> <DOCNO> AP881214-0028 </DOCNO> <FILEID>AP-NR-12-14-88 0117EST</FILEID> <FIRST>u i BC-Japan-Stocks 12-14 0027</FIRST> <SECOND>BC-Japan-Stocks,0026</SECOND> <HEAD>Stocks Up In Tokyo</HEAD> <DATELINE>TOKYO (AP) </DATELINE> <TEXT> The Nikkei Stock Average closed at 29,754.73 points up 156.92 points on the Tokyo Stock Exchange Wednesday. </TEXT> </DOC> 5 2

Relational Document Representation DOCUMENT docid docname headline dateline 28 AP881214-0028 Stocks Up In Tokyo TOKYO (AP) INDEX docid termcnt term 28 1 nikkei 28 2 stock 28 1 average 28 1 closed 28 2 points 28 1 up 28 1 tokyo 28 1 exchange 28 1 wednesday TERM term idf average 1.08 closed 1.08 exchange 1.00 nikkei 2.07 points 1.23 stock 1.00 tokyo 1.58 up 0.30 wednesday 0.60 6 Simplistic Models: Keyword and Boolean Searches 7 3

Keyword search Relational Approach: Keyword Search select i.docid from INDEX i, QUERY q where i.term = q.term Keyword search with stop word list select i.docid from INDEX i, QUERY q, STOPLIST s where (i.term = q.term) and (i.term <> s.term) 8 Relational Approach: Boolean Search OR query from INDEX from INDEX where term = term1 where term = term1 OR union term = term2 OR term = term3 OR from INDEX... where term = term2 term = termn union from INDEX where term = term3... union from INDEX where term = termn 9 4

Relational Approach: Boolean Search AND query from INDEX from INDEX a, INDEX b, INDEX c,... INDEX N where term = term1 where a.term = term1 AND intersect b.term = term2 AND c.term = term3 AND from INDEX... where term = term2 n.term = termn AND intersect a.docid = b.docid AND b.docid = c.docid AND from INDEX... where term = term3 N-1.docID = N.docID... intersect from INDEX where term = termn 10 Fixed Join-Count AND Queries Find all documents that contain all of the terms found in the QUERY relation: select i.docid from INDEX i, QUERY q where i.term = q.term group by i.docid having count (distinct (i.term)) = select count(distinct (term)) from QUERY 11 5

TAND Queries Find all documents that contain at least X of the terms found in the QUERY relation: select i.docid from INDEX i, QUERY q where i.term = q.term group by i.docid having count (distinct (i.term)) >= X 12 Relevance Ranking: Vector Space and Probabilistic Models 13 6

Relational Document Representation (Single Term Processing) DOCUMENT docid docname headline dateline 28 AP881214-0028 Stocks Up In Tokyo TOKYO (AP) INDEX docid termcnt term 28 1 nikkei 28 2 stock 28 1 average 28 1 closed 28 2 points 28 1 up 28 1 tokyo 28 1 exchange 28 1 wednesday TERM term idf average 1.08 closed 1.08 exchange 1.00 nikkei 2.07 points 1.23 stock 1.00 tokyo 1.58 up 0.30 wednesday 0.60 14 Relational Query Representation ORIGINAL QUERY: nikkei stock exchange american stock exchange QUERY Term tf nikkei 1 Stock 2 exchange 2 american 1 15 7

Vector Space Model Term Frequency (tf ik ): number of occurrences of term t k in document i Document Frequency (df j ): number of documents which contain t j Inverse Document Frequency (idf j ): log(d/df j ) where d is the total number of documents Notes: idf is a measure of uniqueness of a term across the collection tf is the frequency of a term in a given document 16 Similarity Coefficients Several similarity coefficients based on the query vector X and the document vector Y are defined: Inner Product xi yi i1 t x i y i Cosine Coefficient i 1 t t 2 2 x i y i i 1 i 1 t 17 8

Relevance Ranking: SQL for VSM dot product List all documents in the order of their similarity coefficient where the coefficient is computed using the dot product. SQL: (Query Weight * Document Weight) SELECT d.docid, d.docname, SUM(q.termcnt * t.idf * i.termcnt * t.idf) FROM QUERY q, INDEX i, TERM t, DOCUMENT d WHERE q.term = i.term AND q.term = t.term AND i.docid = d.docid GROUP BY d.docid, docname ORDER BY 3 DESC 18 Relevance Ranking: SQL for Probabilistic Retrieval num_ terms numdocs log i dfi.5 2.2tf id df.5.3.75doclength avgdoclength i 1 / tf id qtf SELECT d.docid, d.docname, SUM( LOG(((NumDocs - t.df) + 0.5) / (t.df + 0.5)) * ((2.2*i.tf) / (.3 + ((.75 * d.doclen)/avgdoclen) + i.tf)) * q.termcnt ) FROM INDEX i, TERM t, DOCUMENT d, QUERY q WHERE i.term = t.term AND i.docid = d.docid AND t.term = q.term GROUP BY d.docid, d.docname ORDER BY 3; 19 9

Relational Document Representation (Phrase Processing) DOCUMENT docid docname headline dateline 28 AP881214-0028 Stocks Up In Tokyo TOKYO (AP) INDEX docid termcnt phrase 28 1 nikkei stock 28 1 stock average 28 1 average closed 28 1 points up 28 1 tokyo stock 28 1 stock exchange 28 1 exchange wednesday PHRASE phrase idf average closed 2.49 exchange Wednesday 3.33 nikkei stock 2.14 points up 2.61 stock average 2.14 stock exchange 1.34 tokyo stock 2.10 20 Simple Phrase Parsing Simple phrase parser with the following rules Phrases do not include stop terms Phrases do not span across symbols Example: The Nikkei Stock Average closed at 29,754.73 points up 156.92 points, on the Tokyo Stock Exchange Wednesday. Phrases: nikkei stock stock average average closed points up tokyo stock stock exchange exchange wednesday 21 10

Relational Document Representation (Proximity Search) DOCUMENT docid docname headline dateline 28 AP881214-0028 Stocks Up In Tokyo TOKYO (AP) INDEX docid term offset 28 nikkei 42 28 stock 43 28 average 44 28 closed 45 28 points 50 28 up 51 28 points 54 28 tokyo 57 28 stock 58 28 exchange 59 28 wednesday 60 22 Relational XML Approach: Architecture XML Collection Semistructured Query SQL Query DB Database Results 26 11

XML Search XML provides tags that allow both structured and unstructured data to be represented in the same XML document. Frequently used as a common representation for a variety of document formats. 27 Introduction XML: Extensible Markup Language Defined by the WWW Consortium (W3C) Derived from SGML (Standard Generalized Markup Language), but simpler to use than SGML Documents have tags giving extra information about sections of the document E.g. <title> XML </title> <slide> Introduction </slide> Extensible, unlike HTML Users can add new tags, and separately specify how the tag should be handled for display A wide variety of tools is available for parsing, browsing and querying XML documents/data Silberschatz, Korth, Sudarshan 12

XML Introduction (Cont.) Tags make data (relatively) self-documenting E.g. <university> <department> <dept_name> Comp. Sci. </dept_name> <building> Taylor </building> <budget> 100000 </budget> </department> <course> <course_id> CS-101 </course_id> <title> Intro. to Computer Science </title> <dept_name> Comp. Sci </dept_name> <credits> 4 </credits> </course> </university> Silberschatz, Korth, Sudarshan Structure of XML Data Tag: label for a section of data Element: section of data beginning with <tagname> and ending with matching </tagname> Elements must be properly nested Proper nesting <course> <title>. </title> </course> Improper nesting <course> <title>. </course> </title> Formally: every start tag must have a unique matching end tag, that is in the context of the same parent element. Every document must have a single top-level element Silberschatz, Korth, Sudarshan 13

Example of Nested Elements <purchase_order> <identifier> P-101 </identifier> <purchaser>. </purchaser> <itemlist> <item> <identifier> RS1 </identifier> <description> Atom powered rocket sled </description> <quantity> 2 </quantity> <price> 199.95 </price> </item> <item> <identifier> SG2 </identifier> <description> Superb glue </description> <quantity> 1 </quantity> <unit-of-measure> liter </unit-of-measure> <price> 29.95 </price> </item> </itemlist> </purchase_order> Silberschatz, Korth, Sudarshan Tree Model of XML Data An XML document is modeled as a tree, with nodes corresponding to elements and attributes Element nodes have child nodes, which can be attributes or subelements Text in an element is modeled as a text node child of the element Children of a node are ordered according to their order in the XML document Element and attribute nodes (except for the root node) have a single parent, which is an element node The root node has a single child, which is the root element of the document Silberschatz, Korth, Sudarshan 14

Relational XML Approach: Storage <book> <author> John Smith </author> <title> Colonial America </title> <publisher state= NY > NY Publishing </publisher> <book> book author title publisher (E) (E) (E) State= NY (A) 33 Relational XML Approach: Storage <book> <author> John Smith </author> <title> Colonial America </title> <publisher state= NY > NY Publishing </publisher> <book> pinndx pinndxnum parent tagtype tagpath atrname value 1 0 E 1 1 NULL 2 1 E 2 1 John Smith 3 1 E 3 1 Colonial America 4 1 E 4 1 NY Publishing 5 4 A 4 2 NY tagpath value atrname value 1 [book] 1 NULL 2 [book,author] 2 state 3 [book,title] 4 [book,publisher] atrnametbl tagpathtbl 34 15