Introduction to Information Retrieval
Skiing Seminar Information Retrieval 2010/2011
Prof. Ulrich Müller-Funk, MScIS Andreas Baumgart and Kay Hildebrand
Agenda
1 Boolean Retrieval
2
3
1 Boolean Retrieval
Approaching the Term
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Almost no data are truly unstructured:
  language: grammar
  music: bars, chords, harmonies
Use structure for classification of entities (e.g. songs or documents)
Indexing
Create a binary term-document incidence matrix.
  Doc 1: Like A Rolling Stone
  Doc 2: Queens of the Stone Age
  Doc 3: The Rolling Stones

           Doc 1  Doc 2  Doc 3
  Like       1      0      0
  Rolling    1      0      1
  Stone      1      1      1
  Queens     0      1      0
  Age        0      1      0

The query Rolling AND Stone AND NOT Like evaluates to 101 AND 111 AND 011 = 001, i.e. Doc 3.
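The query evaluation above can be sketched with integer bit vectors. This is a minimal illustration, assuming one bit per document with Doc 1 as the leftmost bit; the names INCIDENCE, negate and matching_docs are hypothetical helpers, not part of the slides.

```python
# Term-document incidence matrix from the slide; one bit per document,
# leftmost bit = Doc 1 (an assumed encoding for this sketch).
DOCS = ["Doc 1", "Doc 2", "Doc 3"]
INCIDENCE = {
    "Like":    0b100,
    "Rolling": 0b101,
    "Stone":   0b111,
    "Queens":  0b010,
    "Age":     0b010,
}
ALL_DOCS = (1 << len(DOCS)) - 1  # 0b111, mask used to implement NOT


def negate(vector: int) -> int:
    """Bitwise NOT restricted to the number of documents."""
    return ~vector & ALL_DOCS


def matching_docs(vector: int) -> list[str]:
    """Translate a result bit vector back into document names."""
    n = len(DOCS)
    return [DOCS[i] for i in range(n) if vector & (1 << (n - 1 - i))]


# Rolling AND Stone AND NOT Like
result = INCIDENCE["Rolling"] & INCIDENCE["Stone"] & negate(INCIDENCE["Like"])
print(format(result, "03b"), matching_docs(result))  # 001 ['Doc 3']
```

The mask in negate matters: Python integers are unbounded, so a plain ~ would set bits beyond the three documents.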
Index Building I
1. Choose document unit
2. Tokenisation
   "There is a cloud, but the water remains calm." becomes the tokens: There | is | a | cloud | but | the | water | remains | calm
3. Remove stop words
   Zipf's Law: f(w) \propto 1 / r(w), where f(w) is the frequency of word w and r(w) is its rank
   [Figure: f(w) plotted against r(w), with an upper cut-off and a lower cut-off]
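Steps 2 and 3 might look as follows. This is only a sketch: the regular expression is one simple tokenisation choice among many, and the cut-off values in the toy example are arbitrary assumptions (in practice a fixed stop list is often used instead). The function names are hypothetical.

```python
import re
from collections import Counter


def tokenise(text: str) -> list[str]:
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)


def frequency_cutoff(tokens: list[str], lower: int, upper: int) -> list[str]:
    """Keep only tokens whose frequency f(w) lies between the two cut-offs."""
    freq = Counter(tokens)
    return [t for t in tokens if lower <= freq[t] <= upper]


tokens = tokenise("There is a cloud, but the water remains calm.")
print(tokens)

# Toy collection: "the" is too frequent, "water" and "sky" are too rare.
toy = "the cloud the water the sky cloud".split()
print(frequency_cutoff(toy, lower=2, upper=2))  # ['cloud', 'cloud']
```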
Index Building II
4. Normalisation
   Case folding
   Inner punctuation (e.g. U.S.A.)
5. Stemming
   Porter (iteration)
     Rule          Example
     SSES -> SS    caresses -> caress
     IES  -> I     ponies   -> poni
     SS   -> SS    caress   -> caress
     S    ->       cats     -> cat
   Lovins (longest match)
6. Invert index
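A rough sketch of steps 4 to 6, under stated assumptions: only the four Porter rules listed above are implemented (the full algorithm has several further steps), normalisation is reduced to case folding and removing full stops, and the helper names are hypothetical.

```python
from collections import defaultdict

# The four suffix rules from the slide, tried in order; the first match wins.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def porter_step_1a(word: str) -> str:
    """Apply the first matching suffix rule, or return the word unchanged."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def normalise(token: str) -> str:
    """Case folding plus removal of inner full stops (U.S.A. becomes usa)."""
    return token.lower().replace(".", "")


def build_inverted_index(docs: dict[int, list[str]]) -> dict[str, list[int]]:
    """Map each normalised, stemmed term to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[porter_step_1a(normalise(token))].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: ["Like", "A", "Rolling", "Stone"],
    2: ["Queens", "of", "the", "Stone", "Age"],
    3: ["The", "Rolling", "Stones"],
}
index = build_inverted_index(docs)
print(index["stone"], index["rolling"])  # [1, 2, 3] [1, 3]
```

Because "Stones" is stemmed to "stone", Doc 3 appears in the postings list for "stone", which matches the incidence matrix on the indexing slide.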
Precision and Recall
Precision: the fraction of retrieved documents that are relevant, P(relevant | retrieved)
  Precision = #(relevant items retrieved) / #(retrieved items)
Recall: the fraction of relevant documents that are retrieved, P(retrieved | relevant)
  Recall = #(relevant items retrieved) / #(relevant items)
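Both measures can be computed directly from sets of document ids. A minimal sketch; the ids below are made-up illustration data, and returning 0.0 for an empty set is a convention assumed here, not taken from the slides.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Return (precision, recall) for one query."""
    hits = len(retrieved & relevant)  # relevant items retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = {1, 2, 3, 4}      # documents returned by the system
relevant = {2, 4, 5, 6, 7}    # documents judged relevant
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.5 0.4
```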
Precision-Recall Curve
[Figure: precision-recall curve; axes Precision and Recall, each running up to 1]
2
Term Frequency and Weighting
Bag of words model ("Eddie loves Penny" is equivalent to "Penny loves Eddie")
Term frequency tf_{t,d}: on its own, it treats all terms as equally relevant
Reduce the weight of tf as df (document frequency) grows
This leads to the inverse document frequency idf_t = \log(N / df_t) (base-10 logarithm in the examples)
  If df_easy = 8,000 and N = 100,000, then idf_easy = 1.097
  If df_intrinsic = 2,000 and N = 100,000, then idf_intrinsic = 1.699
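The two idf values can be reproduced in a few lines. The base-10 logarithm is inferred from the numbers on the slide; the function name is a hypothetical choice.

```python
import math


def idf(df_t: int, n_docs: int) -> float:
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(n_docs / df_t)


print(round(idf(8_000, 100_000), 3))  # 1.097 (the term "easy")
print(round(idf(2_000, 100_000), 3))  # 1.699 (the term "intrinsic")
```

The rarer term "intrinsic" receives the higher weight.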
tf-idf Scheme
Combination of both concepts:
  tf-idf_{t,d} = tf_{t,d} \cdot idf_t = tf_{t,d} \cdot \log(N / df_t)
Highest if t occurs a large number of times in few documents
Lowest if t occurs in all documents
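A small sketch of the scheme, applied to the three titles from the indexing slide. It assumes the tokens are already lower-cased and stemmed, uses raw counts as tf and a base-10 logarithm; the function name tf_idf is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: tf-idf weight} mapping per document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in docs for term in set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log10(n_docs / df[t]) for t in tf})
    return weights


docs = [
    ["like", "a", "rolling", "stone"],
    ["queens", "of", "the", "stone", "age"],
    ["the", "rolling", "stone"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy collection "stone" occurs in every document and therefore gets weight 0, while "like" (one document) outweighs "rolling" (two documents).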
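VSM Concept
[Figure: document vectors d_r, d_{r1} and d_{r2} in the space spanned by the terms t_1 and t_2]
Term-document matrix, with rows t_1, ..., t_N (terms) and columns d_1, ..., d_M (documents):
  A = \begin{pmatrix} d_{11} & d_{12} & \cdots & d_{1M} \\ d_{21} & d_{22} & \cdots & d_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ d_{N1} & d_{N2} & \cdots & d_{NM} \end{pmatrix}
Similarity measures
  s(d, q) = \cos(d, q) = \frac{d \cdot q}{\|d\| \, \|q\|} = \frac{\sum_{k=1}^{n} d_k q_k}{\sqrt{\sum_{k=1}^{n} d_k^2} \, \sqrt{\sum_{k=1}^{n} q_k^2}}
  DICE or JACCARD coefficients for categorical data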
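The similarity measures could be sketched as below. The Jaccard and Dice functions use the standard set-based definitions, which are assumed here because the slide only names them; returning 0.0 for zero vectors or empty sets is likewise an assumed convention.

```python
import math


def cosine_similarity(d: list[float], q: list[float]) -> float:
    """cos(d, q) = (d . q) / (|d| |q|) for two equal-length term vectors."""
    dot = sum(dk * qk for dk, qk in zip(d, q))
    norm_d = math.sqrt(sum(dk * dk for dk in d))
    norm_q = math.sqrt(sum(qk * qk for qk in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)


def jaccard(a: set, b: set) -> float:
    """|A intersect B| / |A union B| for two term sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def dice(a: set, b: set) -> float:
    """2 |A intersect B| / (|A| + |B|) for two term sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0


print(cosine_similarity([3, 4], [4, 3]))
print(jaccard({"rolling", "stone"}, {"like", "a", "rolling", "stone"}))  # 0.5
```

Cosine similarity depends only on the angle between the vectors, so a long document and a short query can still score highly.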
3
Overview
1. HOMALS for IR
2. Probabilistic Retrieval
3. XML-Retrieval
4. Ontology Retrieval
5. Music Information Retrieval
6. Empirical Search Engine Analysis
7. Multi-Language Retrieval
8. Web-Search
9. Content-based Image Retrieval
Detailed Topics I
1. HOMALS for IR
   Homogeneity Analysis using Alternating Least Squares
   Originally used for dimension reduction of categorical data
   Term frequencies are categorised
2. Probabilistic Retrieval
   Relevance as a binary notion
   Documents are ordered by their probability of being relevant to a query
3. XML-Retrieval
   Encode documents in XML; deal with nesting, specificity
   Use structure for performance improvement
Detailed Topics II
4. Ontology Retrieval
   Formal representation of knowledge through ontologies
   Hierarchical structures are facilitated
5. Music Information Retrieval
   Aspects of transcription, genre recognition
   Beat tracking, classification of instruments
6. Empirical Search Engine Analysis
   What are the big players doing?
   Effectiveness, efficiency, constraints
Detailed Topics III
7. Multi-Language Retrieval
   Transferring results to other languages
   Bi- / multilingual corpora
8. Web-Search
   Features of protocols
   Dealing with links, dynamic content
9. Content-based Image Retrieval
   No use of metadata
   Leveraging color, texture, shape