Summary: EITN01 Web Intelligence and Information Retrieval
Anders Ardö
EIT, Electrical and Information Technology, Lund University
March 13, 2013

A Ardö, EIT Summary: EITN01 Web Intelligence and Information Retrieval March 13, 2013 1 / 49

Summary agenda

Outline

Lecture 1
  - Indexing
  - IR models
  - Weighting
  - Evaluation
  - Information Retrieval - IR

Representation/Indexing (fig 1.2)
  Figure (processing flow): Documents -> assign document IDs -> text, document numbers and *field numbers -> break into words -> words -> stoplist -> non-stoplist words -> stemming* -> stemmed words -> term weighting* -> terms with weights -> Index / database
  Stage label in the figure: Lexical analysis
  * Indicates optional operation

IR models - overview (fig 2.1)
  User tasks
    - Retrieval: Ad hoc, Filtering
    - Browsing
  Classic models
    - Boolean
    - Vector space
    - Probabilistic
  Set theoretic
    - Fuzzy
    - Extended Boolean
  Algebraic
    - Generalized vector
    - LSI
    - Neural Networks
  Probabilistic
    - Inference Network
    - Belief Network
  Structured models
    - Non-overlapping Lists
    - Proximal nodes
  Browsing
    - Flat
    - Structure Guided
    - Hypertext

IR models - vector space
  Bag-Of-Words:
    - syntax irrelevant
    - document structure irrelevant
    - meta-information irrelevant
  document/query = n-dimensional vector
  similarity = cosine between vectors
  Figure: Query and Document drawn as vectors

Weighting TF*IDF
  - Terms should be important in the document
  - Terms present in many documents are not important
  - (importance in document) * (how rare is the term)
  - tf * idf
      tf: term frequency
      idf: inverse document frequency
  - Various normalizations
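
The tf * idf weighting and the cosine similarity of the vector space model can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation: the function names are hypothetical, documents are assumed to be already tokenized, and idf = log(N / df) is only one of the "various normalizations" mentioned above.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse tf*idf vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency inside this document
        # Assumed idf variant: log(N / df). A term found in every document gets weight 0.
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A query can be treated as one more short document and ranked against the others by its cosine value.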

Recall/Precision

Lecture 2
  - Query languages/protocols
  - Link based ranking

Query languages - aspects
  - keyword
  - context
  - Boolean
  - natural language
  - pattern matching
  - truncation
  - wild cards
  - regular expressions
  - fuzzy search
  - structured fields
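
Several of the pattern-matching aspects listed above (truncation, wild cards, regular expressions, fuzzy search) can be tried out with the Python standard library. The sketch below is only an illustration: the vocabulary and the function names are made up, and `difflib` similarity is one possible stand-in for fuzzy search, not a method prescribed by the course.

```python
import difflib
import re
from fnmatch import fnmatchcase

# Hypothetical index vocabulary used only for this example.
VOCABULARY = ["retrieval", "retrieve", "retriever", "relevance", "ranking", "recall"]


def wildcard_search(pattern, terms=VOCABULARY):
    """Truncation and wild cards: '*' matches any run of characters, '?' exactly one."""
    return [t for t in terms if fnmatchcase(t, pattern)]


def regex_search(pattern, terms=VOCABULARY):
    """Regular expression that has to match the whole term."""
    return [t for t in terms if re.fullmatch(pattern, t)]


def fuzzy_search(word, terms=VOCABULARY, cutoff=0.8):
    """Fuzzy search: terms similar enough to a possibly misspelled word."""
    return difflib.get_close_matches(word, terms, n=3, cutoff=cutoff)
```

For example, `wildcard_search("retriev*")` is a right-truncated query, and `fuzzy_search("retreival")` tolerates the swapped letters.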

Query languages
  - Expressiveness
  - Standardization
  - Examples
      Structured Query Language - SQL
      Common Query Language - CQL
      XQuery
      Proprietary/home grown: Google

Query operations
  relevance feedback
    - Give me more like this
    - New Relevance Feedback vector =
        Original query vector
        + mean of relevant documents (vectors) in hit set
        - mean of non-relevant documents (vectors) in hit set
  query expansion
    - Add or remove search terms
    - Change Boolean operators
    - Change wild cards

Network protocols/queries
  Dedicated protocol
    - Z39.50
    - MySQL network protocol
  CGI-script - Web Common Gateway Interface
  Web services
    - Search/Retrieval via URL - SRU
    - OpenSearch
    - Open Archives Initiative - OAI

PageRank
  $PR_t(u) = d \sum_{v \in B_u} \frac{PR_{t-1}(v)}{N_v} + (1 - d)\, e(u)$
  Random surfer model
    - Click on a random link in the page
    - Eventually gets bored and jumps to a random page
  Converges to a stable solution
  Problems
    - size of the Web
    - pages without links - dangling pages (rank sinks)
    - converging
    - link-spamming
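
The PageRank iteration can be sketched as a small power iteration over a link graph. This is an illustrative sketch under stated assumptions, not the method used by any particular search engine: B_u is read as the set of pages linking to u and N_v as the number of out-links of v (the usual reading of the formula), e(u) is assumed uniform, d = 0.85 is an assumed default, and the rank of dangling pages is spread evenly over all pages, which is one common workaround for the dangling-page problem.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to (no duplicates)."""
    pages = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(pages)
    e = {u: 1.0 / n for u in pages}  # assumed uniform jump vector e(u)
    pr = dict(e)                     # start from the uniform distribution
    for _ in range(iterations):
        # Rank held by pages without out-links (dangling pages).
        dangling = sum(pr[v] for v in pages if not links.get(v))
        new = {}
        for u in pages:
            # Sum over B_u (pages v linking to u) of PR(v) / N_v.
            incoming = sum(pr[v] / len(links[v])
                           for v in pages if u in links.get(v, ()))
            # Assumed dangling fix: share the dangling rank evenly.
            new[u] = d * (incoming + dangling / n) + (1 - d) * e[u]
        pr = new
    return pr
```

With this dangling-page handling the ranks keep summing to 1, so they can be read as the random surfer's visit probabilities.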

PageRank + TF*IDF
  Relevance ranking
    - Combine PageRank with vector space model
    - $PR(D) \cdot sim(q, D)$ or $f(PR(D)) \cdot sim(q, D)$
  In practice
    - proximity
    - structure: title, link-anchor text
    - metadata: keywords, description
    - and ...

Lecture 3
  Text
    - Unicode character set (UTF-8) > 100000 characters
    - Zipf's law: Skewed distribution - stopwords
    - Heaps' law: Vocabulary $\propto n^{\beta}$; $\beta < 1$
    - Metadata
        Author, source, length
        Dublin Core Metadata Element Set
    - Similarity models: Hamming Distance, Edit (Levenshtein) Distance
  Markup languages
  Text operations

Markup Languages
  - SGML - Standard Generalized Markup Language
  - HTML - HyperText Markup Language
  - XML - eXtensible Markup Language
  - TeX/LaTeX
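
The two similarity models named under Text, Hamming distance and edit (Levenshtein) distance, are easy to state in code. The sketch below uses the textbook dynamic-programming formulation with unit costs for insert, delete and substitute; the function names are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]
```

Only two rows of the table are kept at a time, so memory use grows with the length of one string rather than with the product of both lengths.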

Text operations - preprocessing
  1: Lexical analysis
  2: Stopwords
  3: Stemming
  4: Selection of indexing terms
  5: Thesaurus

Lecture 4
  LSI (Latent Semantic Indexing) - concepts
    - The term-document matrix is decomposed into three other matrices of a special form by use of Singular Value Decomposition (SVD)
    - The matrices show a breakdown of the original relationships into linearly independent components
    - Many of these components are very small and can be ignored, leading to an approximate model that contains fewer dimensions
  SVM (Support Vector Machines) - classification

LSI - reduced SVD
  - M x N matrix A (term/document)
  - Reduce dimensionality => retain only k largest singular values
  - reduced SVD: $A \approx A_k = U_k \Sigma_k V_k^T$
  - Figure (matrix shapes): $A_k$ is M x N, $U_k$ is M x k, $\Sigma_k$ is k x k, $V_k^T$ is k x N
  - Saved space
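
To make the "saved space" point concrete, the number of stored values can be counted for the full matrix and for the three reduced factors. The sizes below (M = 10000 terms, N = 1000 documents, k = 100) are assumed purely for illustration, the count treats A as a dense matrix, and $\Sigma_k$ is counted as k diagonal values.

```latex
\begin{align*}
\text{full matrix } A:\quad
  & M N = 10\,000 \cdot 1\,000 = 10\,000\,000 \text{ values} \\
\text{reduced factors } U_k,\ \Sigma_k,\ V_k^T:\quad
  & M k + k + k N = 1\,000\,000 + 100 + 100\,000 = 1\,100\,100 \text{ values}
\end{align*}
```

Under these assumptions the reduced representation needs roughly a ninth of the storage of the dense matrix.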

LSI - Concept extraction
  use rows of $\Sigma_k^{-1} U_k^T$ as concepts

  Concept 1                  Concept 2
  carlstrom       0.0354     regia            0.0523
  rick            0.0354     oct              0.0521
  amelnx          0.0354     chrisp           0.0314
  advmar          0.0354     problems         0.0273
  cuttings        0.0329     pm               0.0265
  september       0.0322     ip-forum         0.0264
  miller          0.0265     stratification   0.0261
  re             -0.0287     uk              -0.0267
  wants          -0.0303     bladderwort     -0.0361
  aquatic        -0.0363     cuttings        -0.0317
  rotundifolia   -0.0371     bladderwort     -0.0605

  HARD to interpret

Text classification
  Goal: classify documents into predefined categories
  Examples
    - Subject classification: business, sports, engineering, ...
    - Review classification: positive or negative
    - Web page classification: Personal homepage or others
  Approach: supervised machine learning (SVM)
    - Each predefined category needs a set of training documents
    - From training sets train a classifier
    - Use classifier to classify new documents

SVM
  - SVM maximize the margin around the separating hyper-plane
  - Decision function specified by support vectors (from training examples)
  - Quadratic programming problem
  - Hot text classification method
  Figure label: Support vectors
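
A linear SVM can be illustrated without a quadratic programming solver. The sketch below deliberately swaps in a different technique: plain batch subgradient descent on the regularized hinge loss, which approximates the maximum-margin hyperplane for small, linearly separable data. The function names, the regularization weight `lam`, the learning rate and the epoch count are all assumptions chosen for a toy example.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=5000):
    """Fit w, b for f(x) = w.x + b by minimizing lam/2 * |w|^2 + mean hinge loss.

    X is a list of feature vectors, y a list of labels in {+1, -1}.
    """
    dim = len(X[0])
    n = len(X)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point lies inside the margin or is misclassified
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Classify x by the side of the hyperplane it falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

In a text classification setting the feature vectors would be, for instance, tf * idf document vectors and the labels would come from the training documents of one category versus the rest.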

Lecture 5
  - Web Search
  - Metasearch engines
  - Web crawling
  - Browsing vs search

Web Search
  Challenges
    - Distributed, dynamic data
    - Large volume
    - Unstructured, heterogeneous data
  Size, coverage
  General vs focused
  Special functions, User interface
  Ranking
  Limited overlap between search engines

Search Engine - Basic structure
  Figure (components): The Web, Web robot (HTTP), Database of Web pages, Database interface, CGI script, Web browser (HTTP); Query in, Answer out
  - Size, efficiency, response time
  - software crawling the web (much like a human clicking on links)
  - collect all found web-pages into a database (IR system)
  - offer a web-interface to that database

Metasearch engines
  - Simultaneously search several individual search engines
  - Query translation
  - Result merging
      Simple merge
      Duplicate detection
      tf-idf/similarity ranking
      Position based
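
Position-based result merging with duplicate detection, as listed for metasearch engines, might look like the sketch below. The scoring rule (sum of reciprocal ranks) and the URL normalization are assumptions made for this example; real metasearch engines may merge quite differently.

```python
from collections import defaultdict


def normalize(url):
    """Crude duplicate detection: ignore case, scheme, 'www.' and trailing slash."""
    url = url.strip().lower()
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    if url.startswith("www."):
        url = url[4:]
    return url.rstrip("/")


def merge_results(result_lists):
    """Merge ranked URL lists from several engines into one ranked list."""
    scores = defaultdict(float)
    for results in result_lists:
        seen = set()  # count each page once per engine
        for rank, url in enumerate(results, start=1):
            key = normalize(url)
            if key in seen:
                continue
            seen.add(key)
            scores[key] += 1.0 / rank  # assumed position-based score
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda k: (-scores[k], k))
```

A page returned near the top by several engines collects the largest score, which matches the intuition behind position-based merging.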

Web Robot - Basic architecture
  Spider, Crawler, Robot, agent, ...
  Figure (crawl loop): Seed URLs -> Frontier (list of unvisited pages) -> Get URL -> Fetch Web page -> Analyze -> Save to Database of Web pages (repository of visited pages); Links -> URLs -> Frontier

Focused Crawling
  Figure (crawl loop with filters): Seed URLs -> Frontier (list of unvisited pages) -> Get URL -> Fetch Web page -> Analyze -> Focus filter; within the focus: Save to Database of Web pages (repository of visited pages), Links -> URL focus filter -> URLs -> Frontier; not in focus: not saved
  Focus:
    - Domain
    - Project
    - Country
    - Region
    - Topic
    - Subject

Browsing vs search
  Search
    - LOTS of data
    - Unstructured
    - Unrelated items clutter results
  Browsing
    - Small amounts of data
    - Hierarchically structured
    - Quality assessed
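
The focused-crawling loop from the Web robot slides (frontier, fetch, analyze, focus filter, save, URL focus filter) can be sketched as follows. Everything here is illustrative: `fetch`, `in_focus` and `url_in_focus` are hypothetical callbacks supplied by the caller, and no real HTTP requests are made, so the sketch can be run against an in-memory toy web.

```python
from collections import deque


def focused_crawl(seed_urls, fetch, in_focus, url_in_focus, max_pages=100):
    """Crawl from the seeds, keeping only pages that pass the focus filter.

    fetch(url) returns (text, links); in_focus(text) and url_in_focus(url)
    return True or False.
    """
    frontier = deque(seed_urls)  # list of unvisited pages
    seen = set(seed_urls)        # every URL ever put on the frontier
    repository = {}              # repository of visited, in-focus pages
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        text, links = fetch(url)
        if not in_focus(text):
            continue             # not in focus: not saved, links not followed
        repository[url] = text
        for link in links:
            if link not in seen and url_in_focus(link):
                seen.add(link)
                frontier.append(link)
    return repository
```

Dropping both filters (always returning True) turns the same loop into the basic, unfocused robot.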

Lecture 6
  - Recommender systems
  - Indexing, searching
  - Example IR systems

Recommender systems
  Figure: ITEMS x USERS matrix of known ratings (x) -> Recommendation algorithm (collaborative, content based) -> Predicted ratings (p) -> Recommendations (r)

Content based filtering
  - Try to predict a rating based on my own ratings
  - Represent items as a set of features: $item_j = (w_{1j}, \ldots, w_{kj})$
  - Users rank items => user profile in feature space: $user_c = (w_{c1}, \ldots, w_{ck})$
  - Vector space! (feature/item matrix, tf*idf, cosine similarity, ...)
  - User profile used as query

Collaborative filtering
  Try to predict rating based on other users' ratings
  Memory based
    - Make rating based on entire collection
    - Ex: $rating_{c,s} = k \sum_{c' \in C} sim(c, c') \cdot rating_{c',s}$
        User c, Item s
        C: Set of users most similar to c
        k: Normalizing factor (usually $k = 1 / \sum_{c' \in C} sim(c, c')$)
  Model based
    - Try to learn a model to be used for predicting ratings
    - Ex: Probabilistic model, Machine learning, ...
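
The memory-based prediction can be sketched directly from the formula: a similarity-weighted average of the neighbours' ratings for item s. The details below are assumptions for illustration only: ratings are stored as nested dicts, sim(c, c') is the cosine over co-rated items, and C is taken to be the `n_neighbours` most similar users who have rated the item.

```python
import math


def cosine_sim(ratings_a, ratings_b):
    """Cosine similarity of two users over the items both have rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, item, n_neighbours=2):
    """Predict the rating of `item` by `user` from the most similar other users."""
    candidates = [(cosine_sim(ratings[user], ratings[other]), other)
                  for other in ratings
                  if other != user and item in ratings[other]]
    neighbours = sorted(candidates, reverse=True)[:n_neighbours]  # the set C
    norm = sum(sim for sim, _ in neighbours)                      # 1 / k
    if norm == 0:
        return None  # no usable neighbour
    return sum(sim * ratings[other][item] for sim, other in neighbours) / norm
```

The prediction is pulled towards the rating of the user whose taste is closest to that of the target user.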

Hybrid systems
  Content based filtering + Collaborative filtering
    - Combining separate recommenders
    - Adding content based characteristics to collaborative filtering
    - Adding collaborative characteristics to content based filtering

Introduction: Indexing, searching
  Sequential search
    - Small databases
    - Volatile data
  Indexes
    - Large databases
    - Semi-static data
    - Inverted files

Inverted files
  Principal data structure
    - Effective
    - Allows fast searching
    - Substantial storage overhead
    - Speed more important than storage
  For each term
    - List of document IDs
    - (Term frequency in each document)
    - (Position in document)
  Used for
    - Boolean searches
    - Vector space ranking
    - Proximity, phrases

Example IR-systems
  IndexData: Zebra
    - free, GPL license
    - index records in XML, SGML, MARC, e-mail archives, ...
    - supports SRU/CQL, Z39.50, ZOOM
  Apache: Lucene and Solr
    - free, open source
    - index records via XML over HTTP
    - query via HTTP GET, XML results
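
The inverted file described above (for each term a list of document IDs, optionally with positions) can be sketched with nested dictionaries. This is a toy in-memory version with hypothetical function names and whitespace tokenization; it is not how Zebra, Lucene or Solr store their indexes.

```python
from collections import defaultdict


def build_index(docs):
    """docs maps doc_id -> text. Returns term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, terms):
    """Boolean search: documents that contain every term."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()


def phrase(index, terms):
    """Phrase search: the terms must occur at consecutive positions."""
    hits = set()
    for doc_id in boolean_and(index, terms):
        starts = index[terms[0]][doc_id]
        if any(all(p + k in index[t][doc_id] for k, t in enumerate(terms))
               for p in starts):
            hits.add(doc_id)
    return hits
```

The length of each position list is the term frequency in that document, so the same structure also supports tf-based vector space ranking.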

Lecture 7
  - Digital Libraries
  - Example systems

Digital Library
  - body of knowledge
  - digital information
  - large collection
  - builds on current libraries
  - can be accessed from anywhere
  - extended IR system

Problems
  - support large collections
  - searching
  - cataloguing (indexing)
  - multilingual
  - multimedia
  - structured docs
  - distribution
  - federated search
  - access
  - interoperability!

Standards
  - protocols
  - metadata
  - classification systems
  - query languages