Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Similar documents
Outline. Lecture 3: EITN01 Web Intelligence and Information Retrieval. Query languages - aspects. Previous lecture. Anders Ardö.

Outline. Lecture 2: EITN01 Web Intelligence and Information Retrieval. Previous lecture. Representation/Indexing (fig 1.

Introduction to Information Retrieval

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 6: Information Retrieval and Web Search. An introduction

Information Retrieval

Part I: Data Mining Foundations

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Information Retrieval. (M&S Ch 15)

Mining Web Data. Lijun Zhang

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

60-538: Information Retrieval

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Glossary. ASCII: Standard binary codes to represent occidental characters in one byte.

Chapter 2. Architecture of a Search Engine

Agenda. 1 Web search. 2 Web search engines. 3 Web robots, crawler, focused crawling. 4 Web search vs Browsing. 5 Privacy, Filter bubble.

Mining Web Data. Lijun Zhang

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Information Retrieval and Web Search

Information Retrieval: Retrieval Models

Modern Information Retrieval

Agenda. 1 Web search. 2 Web search engines. 3 Web robots, crawler. 4 Focused Web crawling. 5 Web search vs Browsing. 6 Privacy, Filter bubble

Information Retrieval. hussein suleman uct cs

Information Retrieval. Information Retrieval and Web Search

Text Analytics (Text Mining)

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

Information Retrieval Spring Web retrieval

Text Analytics (Text Mining)

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

Introduction to Information Retrieval

Feature selection. LING 572 Fei Xia

Information Retrieval

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval May 15. Web retrieval

modern database systems lecture 4 : information retrieval

CS 6320 Natural Language Processing

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

INFSCI 2140 Information Storage and Retrieval Lecture 2: Models of Information Retrieval: Boolean model. Final Group Projects

SEARCH ENGINE INSIDE OUT

CS54701: Information Retrieval

Clustering Results. Result List Example. Clustering Results. Information Retrieval

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields

VK Multimedia Information Systems

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

Reading group on Ontologies and NLP:

CS371R: Final Exam Dec. 18, 2017

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search

Administrative. Web crawlers. Web Crawlers and Link Analysis!

: Semantic Web (2013 Fall)

Social Media Computing

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems

Boolean Model. Hongning Wang

Web Information Retrieval using WordNet

Computer Science 572 Midterm Prof. Horowitz Thursday, March 8, 2012, 2:00pm 3:00pm

COMP6237 Data Mining Searching and Ranking

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

Information Retrieval

An Introduction to Search Engines and Web Navigation

Searching. Outline. Copyright 2006 Haim Levkowitz. Copyright 2006 Haim Levkowitz

Fall CS646: Information Retrieval. Lecture 2 - Introduction to Search Result Ranking. Jiepu Jiang University of Massachusetts Amherst 2016/09/12

Chapter 3 - Text. Management and Retrieval

VK Multimedia Information Systems

Query Languages. Berlin Chen Reference: 1. Modern Information Retrieval, chapter 4

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 6: Information Retrieval I. Aidan Hogan

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson

DATA MINING II - 1DL460. Spring 2014"

Searching the Web for Information

Latent Semantic Indexing

DATA MINING - 1DL105, 1DL111

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

Session 10: Information Retrieval

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

Information Retrieval

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Lec 8: Adaptive Information Retrieval 2

A short introduction to the development and evaluation of Indexing systems

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO 2018

Building Search Applications

Search Engines. Information Retrieval in Practice

LIST OF ACRONYMS & ABBREVIATIONS

The Information Retrieval Series. Series Editor W. Bruce Croft

Information Retrieval and Data Mining Part 1 Information Retrieval

Multimedia Information Systems

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

INTRODUCTION. Chapter GENERAL

CSE 5243 INTRO. TO DATA MINING

Tag-based Social Interest Discovery

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

Information Retrieval on the Internet (Volume III, Part 3, 213)

MySQL Fulltext Search

Transcription:

Summary agenda Summary: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University March 13, 2013 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 1 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 2 / 49 Outline Lecture 1 Indexing IR models Weighting Evaluation Information Retrieval - IR A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 3 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 4 / 49

Representation/Indexing (fig 12) IR models - overview (fig 21) Documents text break into words document numbers words stoplist and *field numbers non-stoplist stemming* Lexical analysis words stemmed term weighting* words * Indicates optional operation terms with weights assign document IDs Index / database 43 U s e r T a s k s Retrieval: Ad hoc Filtering Browsing Classic models Boolean Vector space Probabilistic Structured models Non-overlapping Lists Proximal nodes Browsing Flat Structure Guided Hypertext Set theoretic Fuzzy Extended Boolean Algebraic Generalized vector LSI Neural Networks Probabilistic Inference Network Belief Network A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 5 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 6 / 49 IR models - vector space Weighting TF*IDF Bag-Of-Words: syntax irrelevant document structure irrelevant meta-information irrelevant document/query = n-dimensional vector similarity = cosine between vectors Query Document Terms should be important in the document Terms present in many documents are not important (importance in document) * (how rare is the term) tf * idf tf term frequency idf inverse document frequency Various normalizations A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 7 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 8 / 49

Recall/Precision Outline A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 9 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 10 / 49 Lecture 2 Query languages - aspects Query languages/protocols Link based ranking keyword context Boolean natural language pattern matching truncation wild cards regular expressions fuzzy search structured fields A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 11 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 12 / 49

Query languages Query operations Expressiveness Standardization Examples Structured Query Language - SQL Common Query Language - CQL XQuery Proprietary/home grown Google relevance feedback Give me more like this New Relevance Feedback vector = Original query vector + mean of relevant documents (vectors) in hit set - mean of non-relevant documents (vectors) in hit set query expansion Add or remove search terms Change Boolean operators Change wild cards A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 13 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 14 / 49 Network protocols/queries PageRank Dedicated protocol Z3950 MySQL network protocol PR(u) t = d PR(v) t 1 N v v B u + (1 d)e(u) CGI-script - Web Common Gateway Interface Web services Search/Retrieval via URL - SRU OpenSearch Open Archives Initiative - OAI Random surfer model Click on a random link in the page Eventually gets bored and jumps to a random page Converges to a stable solution Problems size of the Web pages without links - dangling pages (rank sinks) converging link-spamming A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 15 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 16 / 49

PageRank + TF*IDF Outline Relevance ranking Combine PageRank with vector space model PR(D) sim(q, D) or f (PR(D)) sim(q, D) In practice proximity structure: title, link-anchor text metadata: keywords, description and A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 17 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 18 / 49 Lecture 3 Markup Languages Text Unicode character set (UTF-8) > 100000 characters Zipf s law Skewed distribution - stopwords Heaps law: Vocabulary n β ; β < 1 Metadata Author, source, length Dublin Core Metadata Element Set Similarity models: Hamming Distance, Edit (Levenshtein) Distance Markup languages Text operations SGML - Standard Generalized Markup Language HTML - HyperText Markup Language XML - extensible Markup Language T E X/ L A T E X A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 19 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 20 / 49

Text operations - preprocessing Outline 1: Lexical analysis 2: Stopwords 3: Stemming 4: Selection of indexing terms 5: Thesaurus A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 21 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 22 / 49 Lecture 4 LSI - reduced SVD LSI (Latent Semantic Indexing)- concepts The term-document matrix is decomposed into three other matrices of a special form by use of Singular Value Decomposition (SVD) The matrices show a breakdown of the original relationships into linearly independent components Many of these components are very small and can be ignored - leading to an approximate model that contains fewer dimensions SVM (Support Vector Machines) - classification Reduce dimensionality => retain only k largest singular values Saved space = Ak M x N Uk M x k Σk k x k T V k x N k M x N matrix A (term/document) reduced SVD: A A k = U k Σ k V T k A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 23 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 24 / 49

LSI - Concept extraction Text classification use rows of Σ 1 k UT k as concepts Concept 1 Concept 2 carlstrom 00354 regia 00523 rick 00354 oct 00521 amelnx 00354 chrisp 00314 advmar 00354 problems 00273 cuttings 00329 pm 00265 september 00322 ip-forum 00264 miller 00265 stratification 00261 re -00287 uk -00267 wants -00303 bladderwort -00361 aquatic -00363 cuttings -00317 rotundifolia -00371 bladderwort -00605 HARD to interpret Goal: classify documents into predefined categories Examples Subject classification: business, sports, engineering, Review classification: positive or negative Web page classification: Personal homepage or others Approach: supervised machine learning ( SVM) Each predefined category needs a set of training documents From training sets train a classifier Use classifier to classify new documents A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 25 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 26 / 49 SVM Outline Support vectors SVM maximize the margin around the separating hyper-plane Decision function specified by support vectors (from training examples) Quadratic programming problem Hot text classification method A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 27 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 28 / 49

Lecture 5 Web Search Web Search Metasearch engines Web crawling Browsing vs search Challenges Distributed, dynamic data Large volume Unstructured, heterogeneous data Size, coverage General vs focused Special functions, User interface Ranking Limited overlap between search engines A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 29 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 30 / 49 Search Engine - Basic structure Metasearch engines Web robot HTTP The Web Database Web pages Database Interface Query Answer CGI script HTTP Web browser 000000000 111111111 000000000 111111111 000000000 111111111 Size efficiency response time software crawling the web (much like a human clicking on links) collect all found web-pages into a database (IR system) offer a web-interface to that database Simultaneously search several individual search engines Query translation Result merging Simple merge Duplicate detection tf-idf/similarity ranking Position based A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 31 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 32 / 49

Web Robot - Basic architecture Focused Crawling Spider, Crawler, Robot, agent, Get URL URLs Seed URLs Get URL URLs Seed URLs Focus: Database Web pages Repository of visited pages Fetch Web page Analyze Save Links Frontier List of unvisited pages Database Web pages Repository of visited pages Within the focus Not in focus Fetch Web page Analyze Focus filter Save Links URL focus filter Frontier List of unvisited pages Domain Project Country Region Topic Subject A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 33 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 34 / 49 Browsing vs search Outline Search LOTS of data Unstructured Unrelated items clutter results Browsing Small amounts of data Hierarchically structured Quality assessed A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 35 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 36 / 49

Lecture 6 Recommender systems ITEMS Recommender systems Indexing, searching Example IR systems USERS x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x Recommendation algorithm collaborative content based Predicted ratings p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p Recommendations r r r r A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 37 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 38 / 49 Content based filtering Collaborative filtering Try to predict a rating based on my own ratings Represent items as a set of features item j = (w 1j, w kj ) Users rank items user profile in feature space user c = (w c1, w ck ) Vector space! (feature/item matrix, tf idf, cosine similarity, ) User profile used as query Try to predict rating based on other users ratings Memory based Make rating based on entire collection Ex: rating c,s = k c C sim(c, c ) rating c,s User c, Item s C Set of users most similar to c 1 k Normalizing factor (usually sim(c, c ) ) c C Model based Try to learn a model to be used for predicting ratings Ex: Probabilistic model, Machine learning, A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 39 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 40 / 49

Hybrid systems Introduction: Indexing, searching Content based filtering + Collaborative filtering Combining separate recommenders Adding content based characteristics to collaborative filtering Adding collaborative characteristics to content based filtering Sequential search Small databases Volatile data Indexes Large databases Semi-static data Inverted files A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 41 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 42 / 49 Inverted files Example IR-systems Principal data structure Effective Allows fast searching Substantial storage overhead Speed more important than storage For each term List of document ID s (Term frequency in each document) (Position in document) Used for Boolean searches Vector space ranking Proximity, phrases IndexData: Zebra free, GPL license index records in XML, SGML, MARC, e-mail archives, supports SRU/CQL, Z3950, ZOOM Apache: Lucene and Solr free, open source index records via XML over HTTP query via HTTP GET, XML results A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 43 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 44 / 49

Outline Lecture 7 Digital Libraries Example systems A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 45 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 46 / 49 Digital Library Problems body of knowledge digital information large collection builds on current libraries can be accessed from anywhere extended IR system support large collections searching cataloguing(indexing) multilingual multimedia structured docs distribution federated search access interoperability! A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 47 / 49 A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 48 / 49

Standards protocols metadata classification systems query languages A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 49 / 49