Introduction to Information Retrieval

Skiing Seminar Information Retrieval 2010/2011
Prof. Ulrich Müller-Funk, MScIS Andreas Baumgart and Kay Hildebrand

Agenda
1 Boolean Retrieval
2
3

1 Boolean Retrieval

Approaching the Term
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Almost no data are truly unstructured:
- language: grammar
- music: bars, chords, harmonies
Use structure for classification of entities (e.g. songs or documents)

Indexing
Create a binary term-document incidence matrix.

Doc 1: Like A Rolling Stone
Doc 2: Queens of the Stone Age
Doc 3: The Rolling Stones

Term      Doc 1  Doc 2  Doc 3
Like        1      0      0
Rolling     1      0      1
Stone       1      1      1
Queens      0      1      0
Age         0      1      0

Query: Rolling AND Stone AND NOT Like
101 AND 111 AND 011 = 001, so the query returns Doc 3.
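The Boolean query on the slide can be reproduced with plain bit operations. The following is a minimal sketch, not the seminar's own implementation: the names `INCIDENCE`, `negate` and `matching_docs` are made up for illustration, and the matrix is copied as shown, which implies that stop words were dropped and that "Stones" was reduced to "Stone".

```python
# Term-document incidence matrix from the slide, one bit per document
# (leftmost bit = Doc 1). Assumption: stop words are already removed and
# "Stones" is stemmed to "Stone", which is what the slide's matrix implies.
DOCS = ["Doc 1", "Doc 2", "Doc 3"]
INCIDENCE = {
    "Like":    0b100,
    "Rolling": 0b101,
    "Stone":   0b111,
    "Queens":  0b010,
    "Age":     0b010,
}
ALL_DOCS = (1 << len(DOCS)) - 1  # 0b111, keeps NOT within three bits


def negate(bits):
    """Bitwise NOT restricted to the documents in the collection."""
    return ~bits & ALL_DOCS


def matching_docs(bits):
    """Translate a result bit vector back into document names."""
    n = len(DOCS)
    return [DOCS[i] for i in range(n) if bits & (1 << (n - 1 - i))]


# Rolling AND Stone AND NOT Like
result = INCIDENCE["Rolling"] & INCIDENCE["Stone"] & negate(INCIDENCE["Like"])
print(format(result, "03b"), matching_docs(result))  # 001 ['Doc 3']
```

Masking with `ALL_DOCS` matters because Python integers are unbounded, so a bare `~` would yield a negative number rather than the three-bit complement 011.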

Index Building I
1. Choose document unit
2. Tokenisation
   "There is a cloud, but the water remains calm." -> There | is | a | cloud | but | the | water | remains | calm
3. Remove stop words
   Zipf's Law: $f(w) \propto \frac{1}{r(w)}$
   [Figure: frequency f(w) plotted against rank r(w), with an upper cut-off and a lower cut-off marked]

Index Building II
4. Normalisation
   - Case folding
   - Inner punctuation (e.g. U.S.A.)
5. Stemming
   - Porter (iteration)

     Rule          Example
     SSES -> SS    caresses -> caress
     IES  -> I     ponies   -> poni
     SS   -> SS    caress   -> caress
     S    ->       cats     -> cat

   - Lovins (longest match)
6. Invert index
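The six index-building steps can be sketched end to end in a few lines. This is an illustrative sketch under several assumptions: the stop-word set is a hand-picked stand-in for the frequency cut-offs, only the four Porter rules listed on the slide are implemented (not the full Porter stemmer), and all function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; in practice it would follow from the cut-offs.
STOP_WORDS = {"a", "but", "is", "of", "the", "there"}


def tokenise(text):
    """Step 2: keep runs of letters and full stops, drop outer full stops."""
    return [t.strip(".") for t in re.findall(r"[A-Za-z.]+", text) if t.strip(".")]


def normalise(token):
    """Step 4: case folding and removal of inner punctuation (U.S.A. -> usa)."""
    return token.lower().replace(".", "")


def porter_step_1a(word):
    """Step 5: the four Porter rules from the slide; the first match wins."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]  # IES -> I
    if word.endswith("ss"):
        return word       # SS -> SS
    if word.endswith("s"):
        return word[:-1]  # S -> (nothing)
    return word


def build_index(docs):
    """Step 6: map each processed term to the sorted ids of its documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenise(text):
            term = normalise(token)
            if term and term not in STOP_WORDS:  # step 3
                index[porter_step_1a(term)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {  # step 1: each title is one document unit
    1: "Like A Rolling Stone",
    2: "Queens of the Stone Age",
    3: "The Rolling Stones",
}
index = build_index(docs)
print(index["stone"])    # [1, 2, 3]
print(index["rolling"])  # [1, 3]
```

Because the rules are applied to every term, "Queens" ends up indexed as "queen" here; a different stemmer or rule set would produce different index terms.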

Precision and Recall
Precision: the fraction of retrieved documents that are relevant, $P(\text{relevant} \mid \text{retrieved})$

$$\text{Precision} = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})}$$

Recall: the fraction of relevant documents that are retrieved, $P(\text{retrieved} \mid \text{relevant})$

$$\text{Recall} = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})}$$
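Both measures follow directly from set intersection. The sketch below assumes documents are identified by hashable ids; the function name and the example ids are invented for illustration, and the convention of returning 0.0 for an empty set is one possible choice.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, five relevant overall, two in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)  # 0.5 0.4
```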

Precision-Recall Curve
[Figure: precision plotted against recall, both axes running up to 1]

2

Term Frequency and Weighting
- Bag of words model ("Eddie loves Penny" is equivalent to "Penny loves Eddie")
- Term frequency $tf_{t,d}$: all terms are treated as equally relevant
- Reduce the weight of a term as its document frequency $df_t$ grows
- Leads to the inverse document frequency $idf_t = \log_{10} \frac{N}{df_t}$
- If $df_{\text{easy}} = 8{,}000$ and $N = 100{,}000$, then $idf_{\text{easy}} = 1.097$
- If $df_{\text{intrinsic}} = 2{,}000$ and $N = 100{,}000$, then $idf_{\text{intrinsic}} = 1.699$
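The two idf values on the slide can be checked numerically. The sketch assumes a base-10 logarithm, which is the base that reproduces the slide's figures; the function name `idf` is chosen for illustration.

```python
import math


def idf(df, n_docs):
    """Inverse document frequency with a base-10 logarithm."""
    return math.log10(n_docs / df)


print(round(idf(8_000, 100_000), 3))  # 1.097 ("easy")
print(round(idf(2_000, 100_000), 3))  # 1.699 ("intrinsic")
```

The rarer term "intrinsic" receives the higher weight, as intended.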

tf-idf Scheme
Combination of both concepts:

$$\text{tf-idf}_{t,d} = tf_{t,d} \cdot idf_t = tf_{t,d} \cdot \log \frac{N}{df_t}$$

- Highest if t occurs a large number of times in few documents
- Lowest if t occurs in all documents
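A small sketch shows both extremes of the scheme on a toy collection. It assumes raw counts for tf and a base-10 logarithm for idf; the function name `tf_idf` and the three token lists are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log10(n_docs / df[t]) for t in tf})
    return weights


docs = [
    ["stone", "stone", "rolling"],
    ["stone", "age"],
    ["stone", "rolling"],
]
w = tf_idf(docs)
print(w[0]["stone"])           # 0.0, "stone" occurs in every document
print(round(w[1]["age"], 3))   # 0.477, "age" occurs in one document only
```

"stone" has the highest term frequency in the first document, yet its weight is zero because it does not discriminate between documents.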

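VSM Concept
[Figure: document vectors plotted in the term space spanned by $t_1$ and $t_2$]

Term-document matrix with rows $t_1, \dots, t_N$ (terms) and columns $d_1, \dots, d_M$ (documents):

$$A = \begin{pmatrix} d_{11} & d_{12} & \cdots & d_{1M} \\ d_{21} & d_{22} & \cdots & d_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ d_{N1} & d_{N2} & \cdots & d_{NM} \end{pmatrix}$$

Similarity measures:

$$s(d, q) = \cos(d, q) = \frac{d \cdot q}{\lVert d \rVert \, \lVert q \rVert} = \frac{\sum_{k=1}^{n} d_k q_k}{\sqrt{\sum_{k=1}^{n} d_k^2} \, \sqrt{\sum_{k=1}^{n} q_k^2}}$$

DICE or JACCARD coefficients for categorical data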
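The similarity measures named on the slide can be sketched as follows. The cosine follows the formula above term by term; for Dice and Jaccard the slide gives no formula, so the standard set-based definitions are assumed here. All function names are illustrative.

```python
import math


def cosine(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(dk * qk for dk, qk in zip(d, q))
    norm_d = math.sqrt(sum(dk * dk for dk in d))
    norm_q = math.sqrt(sum(qk * qk for qk in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_d * norm_q)


def jaccard(a, b):
    """Assumed standard definition: |A n B| / |A u B| on term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def dice(a, b):
    """Assumed standard definition: 2 |A n B| / (|A| + |B|) on term sets."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0


print(cosine([1, 0], [0, 1]))                          # 0.0, orthogonal vectors
print(dice({"rolling", "stone"}, {"stone", "age"}))    # 0.5
```

Cosine similarity ignores vector length, so a document and a copy of it with doubled term weights are treated as identical in direction.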

3

Overview
1. HOMALS for IR
2. Probabilistic Retrieval
3. XML-Retrieval
4. Ontology Retrieval
5. Music Information Retrieval
6. Empirical Search Engine Analysis
7. Multi-Language Retrieval
8. Web-Search
9. Content-based Image Retrieval

Detailed Topics I
1. HOMALS for IR
   - Homogeneity Analysis using Alternating Least Squares
   - Originally used for dimension reduction of categorical data
   - Term frequencies are categorised
2. Probabilistic Retrieval
   - Relevance as a binary notion
   - Documents are ordered by their probability of being relevant to a query
3. XML-Retrieval
   - Encode documents in XML; deal with nesting, specificity
   - Use structure for performance improvement

Detailed Topics II
4. Ontology Retrieval
   - Formal representation of knowledge through ontologies
   - Hierarchical structures are facilitated
5. Music Information Retrieval
   - Aspects of transcription, genre recognition
   - Beat tracking, classification of instruments
6. Empirical Search Engine Analysis
   - What are the big players doing?
   - Effectiveness, efficiency, constraints

Detailed Topics III
7. Multi-Language Retrieval
   - Transferring results to other languages
   - Bi- / multilingual corpora
8. Web-Search
   - Features of protocols
   - Dealing with links, dynamic content
9. Content-based Image Retrieval
   - No use of metadata
   - Leveraging color, texture, shape