Outline. Lecture 2: EITN01 Web Intelligence and Information Retrieval. Previous lecture. Representation/Indexing (fig 1.

Size: px
Start display at page:

Download "Outline. Lecture 2: EITN01 Web Intelligence and Information Retrieval. Previous lecture. Representation/Indexing (fig 1."

Transcription

1 Outline Lecture 2: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University January 23, 2013 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 Previous lecture Representation/Indexing (fig 1.2) Indexing IR models Weighting Evaluation Information Retrieval - IR Documents assign document IDs text break into words document numbers words stoplist and *field numbers non-stoplist stemming* Lexical analysis words stemmed term weighting* words * Indicates optional operation terms with weights Index / database 43 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44

2 IR models - vector space Probabilistic vs. vector space model Bag-Of-Words: syntax irrelevant document structure irrelevant position within document irrelevant meta-information irrelevant document and query are n-dimensional vectors similarity = cosine between vectors Query Document Vector space model: rank documents according to similarity to query. The notion of similarity does not translate directly into an assessment of is the document a good document to give to the user or not? The most similar document can be highly relevant or completely non-relevant. Probability theory is arguably a cleaner formalization of what we really want an IR system to do: give relevant documents to the user. Does not improve results much, experimental, not really used A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 LSI Weighting TF*IDF LSI takes documents that are semantically similar (= talk about the same topics), but are not similar in the vector space (because they use different words) and re-represents them in a reduced vector space (SVD) in which they have higher similarity. Addresses the problems of synonymy. The dimensionality reduction forces us to omit a lot of detail. Fewer dimensions, more collapsing of axes, better recall, worse precision More dimensions, less collapsing, worse recall, better precision LSI has been tested and found to be modestly effective with traditional test collections. Terms should be important in the document Terms present in many documents are not important (importance in document) * (how rare is the term) tf * idf tf term frequency idf inverse document frequency Various normalizations A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44

3 Recall/Precision Lecture 2 agenda Chapters 3, 5, 7, 11.5 in Modern Information Retrieval The PageRank Citation Ranking: Bringing Order to the Web L. Page, S. Brin, R. Motwani, and T. Winograd (1999) PageRank : Authoritative Sources in a Hyperlinked Environment J. Kleinberg (1999) A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 Outline Query languages - aspects keyword context phrase proximity (position) Boolean natural language pattern matching truncation wild cards regular expressions fuzzy search structure fields A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44

4 Databases Query languages Relational Free text Free text - fields structured Graph-based Expressiveness Standardization Examples Structured Query Language - SQL Common Query Language - CQL XQuery Proprietary/home grown Google Bing Cypher (Graph Query Language) A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 SQL example SQL JOIN example SQL table country: SQL table cities: SQL table country: name region area population Afghanistan SouthAsia Albania Europe Algeria MiddleEast Andorra Europe SELECT name,population,area FROM country; SELECT name,region FROM country WHERE area > 20000; SELECT * FROM country WHERE name LIKE Al% ORDER BY region; cid name region area 1 Afghanistan SouthAsia Albania Europe Algeria MiddleEast Andorra Europe 468 cid name 1 Kabul 1 Herat 2 Tirana 2 Fier 4 Canilo 5 Rome SELECT country.name,cities.name,country.region FROM country INNER JOIN cities ON country.cid=cities.cid WHERE country.region= Europe ; Albania, Tirana, Europe Albania, Fier, Europe Andorra, Canilo, Europe A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44

5 XQuery example Common Query Language - CQL XQuery is to XML what SQL is to relational database tables. <bookstore> <book category="web"> <title lang="en">learning XML</title> <author>erik T. Ray</author> <year>2003</year> <price>39.95</price> </book>... </bookstore> used in the Web standardized simple intuitive powerful for $x in /bookstore/book where $x/price>30 order by $x/title return $x/title intended to do what you mean A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 Simple CQL - example 1 CQL - example 2 fish "complete dinosaur" dinosaur or bird "complete dinosaur" and fish (bird or dinosaur) and (feathers or scales) dinosaur not reptile foo and bar or baz = (foo and bar) or baz foo or bar and baz = (foo or bar) and baz foo and (bar or baz) title = dinosaur title = ((dinosaur and bird) or dinobird) bath.title="the complete dinosaur" title all "complete dinosaur" title any "dinosaur bird reptile" title exact "the complete dinosaur" publicationyear < 1980 title any "dinosaur bird reptile" and publicationyear > 1980 and author = Leakey A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44

6 Query operations Outline relevance feedback Give me more like this New Relevance Feedback vector = Original query vector + mean of relevant documents (vectors) in hit set - mean of non-relevant documents (vectors) in hit set query expansion Add or remove search terms Change Boolean operators Change wild cards A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 Network protocols SRU Dedicated protocol Z39.50 MySQL network protocol CGI-script - Web Common Gateway Interface Web services - REST/SOAP Search/Retrieval via URL - SRU (OpenSearch) Open Archives Initiative - OAI A SRU request is a HTTP URL consisting of a SRU base URL and a search-part base URL: search-part: version=1.1&operation=searchretrieve &query=dinosaur Query expressed in CQL URL-encode CQL query query=... in search-part title= kirkegård = title%3d%20kirkeg%c3%a5rd optional parameters: startrecord, maximumrecords, recordpacking, recordschema response is a XML record (SRW is a variation of SRU using SOAP with XML over HTTP) A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44

7 Outline Vector space relevance Bag-Of-Words: only words syntax irrelevant position irrelevant document structure irrelevant meta-information irrelevant document and query are n-dimensional vectors relevance = similarity = cosine between query vector and document vector Query Document A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 Relevance ranking in the Web Relevance ranking Which Web page/report/article/document is most relevant? No unambiguous answer! Which Web page/report/article/document is most relevant? Is it important to know? In relation to Popularity Quality Information need Query Expectation Collection Types of Relevance Objective - System based Algorithmic Subjective - User based Topicality Pertinence Situational Find relevant information fast We are lazy LOTS of information on Internet Advertisement SEO - Search Engine Optimization Evaluation of Universities/researchers - bibliometrics YES A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44

8 Outline Why can t I find X in Google? X isn t relevant enough! Results are sorted by relevance (ranking) We never look at more than 2-3 pages There is a LOT of data in Google (Make your search more specific) A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 Relevance ranking in the Web Vector space - cosine similarity How do you calculate relevance (importance)? In relation to what? What criteria/information should be used? Text similarity Link structure Meta-data Need numeric value (for sorting) Text only The similarity between two vectors is calculated by use of the cosine formula sim(d j, q) = d j q d = wi,j w i,q j q w 2 i,j wi,q 2 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44

9 Link based relevance - PageRank Simple PageRank Web: Use linking structure Why? Motivation Creating a link is a deliberate act that requires some effort. The more links to a page - the more important (popular) it is related to citation-analysis from bibliometrics PageRank (Google): a page has high rank if the sum of the ranks of the pages linking to it is high PR i = PR i = v inlinks v inlinks Iterate PR v PR v #outlinks A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 PageRank PageRank How to calculate? PR(u) t = d PR(v) t 1 N v v B u + (1 d)e(u) PR(u) t = d PR(v) t 1 N v v B u + (1 d)e(u) PR(u) t B u N v PageRank for u at time t pages that link to u number of links from v d constant, usually 0.85 E(u) initial PageRank, for example 1 #pages P1 t: t 1: 100 t: t 1: 9 P P2 t: 53 t 1: t: 50 t 1: P4 3 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 Random surfer model Click on a random link in the page Eventually gets bored and jumps to a random page Converges to a stable solution Problems size of the Web pages without links - dangling pages (rank sinks) converging link-spamming A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44

10 Interpretation of PageRank PageRank variants Random surfer... gets bored after several clicks... jumps to a random page PageRank reflects the probability that a random surfer will land on the page Model: Markov chain with pages as states, transitions are links between pages PR(u) t = d PR(v) t 1 + (1 d)e(u) N v v B u Personalized Concept specific Decentralized calculation (P2P), page u from server s PR(u) = PR global (s) PR local (u) A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 PageRank - Pros and Cons PageRank + TF*IDF Relevance ranking Pros Easily understandable Based on a theoretical model determined by users Works well in practice Cons Favors old pages Time consuming to compute Vulnerable to manipulation - link farms Google owns PageRank Combine PageRank with vector space model In practice PR(D) sim(q, D) or f (PR(D)) sim(q, D) proximity structure: title, link-anchor text meta-data: keywords, description and A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44

11 Intro HITS HITS What - How to calculate? Another link based algorithm to find good documents a good hub is a document that points to many documents a good authority is a document that many documents point to a mutually reinforcing relationship A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 Authorities Pages that many links to [A(u)] Hubs Pages with many out-links [H(u)] Hub = Authorities H(u) t = A(v) t 1 v L(u) L(u) pages that u links to Authority = Hubs A(u) t = H(v) t 1 v B(u) B(u) pages that links to u HUBS AUTHORITIES A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 HITS Other methods Not all the Web! Use a subset S S = Result set of a query S = S + L(S ) + B(S ) S S S Use behavior Clicks Co-occurrence - book orders (Amazon) A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence and Information Retrieval January 23, / 44

Outline. Lecture 3: EITN01 Web Intelligence and Information Retrieval. Query languages - aspects. Previous lecture. Anders Ardö.

Outline. Lecture 3: EITN01 Web Intelligence and Information Retrieval. Query languages - aspects. Previous lecture. Anders Ardö. Outline Lecture 3: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University February 5, 2013 A. Ardö, EIT Lecture 3: EITN01 Web Intelligence

More information

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s Summary agenda Summary: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University March 13, 2013 A Ardö, EIT Summary: EITN01 Web Intelligence

More information

COMP5331: Knowledge Discovery and Data Mining

COMP5331: Knowledge Discovery and Data Mining COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

COMP 4601 Hubs and Authorities

COMP 4601 Hubs and Authorities COMP 4601 Hubs and Authorities 1 Motivation PageRank gives a way to compute the value of a page given its position and connectivity w.r.t. the rest of the Web. Is it the only algorithm: No! It s just one

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)

More information

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7: Information Retrieval II Aidan Hogan aidhog@gmail.com How does Google know about the Web? Inverted Index: Example 1 Fruitvale Station is a 2013

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

Information Retrieval. hussein suleman uct cs

Information Retrieval. hussein suleman uct cs Information Management Information Retrieval hussein suleman uct cs 303 2004 Introduction Information retrieval is the process of locating the most relevant information to satisfy a specific information

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Information Retrieval. Lecture 11 - Link analysis

Information Retrieval. Lecture 11 - Link analysis Information Retrieval Lecture 11 - Link analysis Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 35 Introduction Link analysis: using hyperlinks

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases Roadmap Random Walks in Ranking Query in Vagelis Hristidis Roadmap Ranking Web Pages Rank according to Relevance of page to query Quality of page Roadmap PageRank Stanford project Lawrence Page, Sergey

More information

PAGE RANK ON MAP- REDUCE PARADIGM

PAGE RANK ON MAP- REDUCE PARADIGM PAGE RANK ON MAP- REDUCE PARADIGM Group 24 Nagaraju Y Thulasi Ram Naidu P Dhanush Chalasani Agenda Page Rank - introduction An example Page Rank in Map-reduce framework Dataset Description Work flow Modules.

More information

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Web Search Instructor: Walid Magdy 14-Nov-2017 Lecture Objectives Learn about: Working with Massive data Link analysis (PageRank) Anchor text 2 1 The Web Document

More information

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency Ralf Moeller Hamburg Univ. of Technology Acknowledgement Slides taken from presentation material for the following

More information

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld Link Analysis CSE 454 Advanced Internet Systems University of Washington 1/26/12 16:36 1 Ranking Search Results TF / IDF or BM25 Tag Information Title, headers Font Size / Capitalization Anchor Text on

More information

A New Technique for Ranking Web Pages and Adwords

A New Technique for Ranking Web Pages and Adwords A New Technique for Ranking Web Pages and Adwords K. P. Shyam Sharath Jagannathan Maheswari Rajavel, Ph.D ABSTRACT Web mining is an active research area which mainly deals with the application on data

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Lecture 8: Linkage algorithms and web search

Lecture 8: Linkage algorithms and web search Lecture 8: Linkage algorithms and web search Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk 2017

More information

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page International Journal of Soft Computing and Engineering (IJSCE) ISSN: 31-307, Volume-, Issue-3, July 01 Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page Neelam Tyagi, Simple

More information

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer

More information

PageRank and related algorithms

PageRank and related algorithms PageRank and related algorithms PageRank and HITS Jacob Kogan Department of Mathematics and Statistics University of Maryland, Baltimore County Baltimore, Maryland 21250 kogan@umbc.edu May 15, 2006 Basic

More information

Authoritative K-Means for Clustering of Web Search Results

Authoritative K-Means for Clustering of Web Search Results Authoritative K-Means for Clustering of Web Search Results Gaojie He Master in Information Systems Submission date: June 2010 Supervisor: Kjetil Nørvåg, IDI Co-supervisor: Robert Neumayer, IDI Norwegian

More information

5/30/2014. Acknowledgement. In this segment: Search Engine Architecture. Collecting Text. System Architecture. Web Information Retrieval

5/30/2014. Acknowledgement. In this segment: Search Engine Architecture. Collecting Text. System Architecture. Web Information Retrieval Acknowledgement Web Information Retrieval Textbook by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze Notes Revised by X. Meng for SEU May 2014 Contents of lectures, projects are extracted

More information

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search Web Search Ranking (COSC 488) Nazli Goharian nazli@cs.georgetown.edu 1 Evaluation of Web Search Engines: High Precision Search Traditional IR systems are evaluated based on precision and recall. Web search

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

Information Retrieval: Retrieval Models

Information Retrieval: Retrieval Models CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) CSE 6242 / CX 4242 Apr 1, 2014 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer,

More information

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields Informa/on Retrieval CISC437/637, Lecture #23 Ben CartereAe Copyright Ben CartereAe 1 Text Search Consider a database consis/ng of long textual informa/on fields News ar/cles, patents, web pages, books,

More information

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5) INTRODUCTION TO DATA SCIENCE Link Analysis (MMDS5) Introduction Motivation: accurate web search Spammers: want you to land on their pages Google s PageRank and variants TrustRank Hubs and Authorities (HITS)

More information

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node A Modified Algorithm to Handle Dangling Pages using Hypothetical Node Shipra Srivastava Student Department of Computer Science & Engineering Thapar University, Patiala, 147001 (India) Rinkle Rani Aggrawal

More information

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a !"#$ %#& ' Introduction ' Social network analysis ' Co-citation and bibliographic coupling ' PageRank ' HIS ' Summary ()*+,-/*,) Early search engines mainly compare content similarity of the query and

More information

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes CS630 Representing and Accessing Digital Information Information Retrieval: Retrieval Models Information Retrieval Basics Data Structures and Access Indexing and Preprocessing Retrieval Models Thorsten

More information

Link Analysis in Web Mining

Link Analysis in Web Mining Problem formulation (998) Link Analysis in Web Mining Hubs and Authorities Spam Detection Suppose we are given a collection of documents on some broad topic e.g., stanford, evolution, iraq perhaps obtained

More information

Link Analysis. Hongning Wang

Link Analysis. Hongning Wang Link Analysis Hongning Wang CS@UVa Structured v.s. unstructured data Our claim before IR v.s. DB = unstructured data v.s. structured data As a result, we have assumed Document = a sequence of words Query

More information

68A8 Multimedia DataBases Information Retrieval - Exercises

68A8 Multimedia DataBases Information Retrieval - Exercises 68A8 Multimedia DataBases Information Retrieval - Exercises Marco Gori May 31, 2004 Quiz examples for MidTerm (some with partial solution) 1. About inner product similarity When using the Boolean model,

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,

More information

CS54701: Information Retrieval

CS54701: Information Retrieval CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) CSE 6242 / CX 4242 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko,

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation

More information

Feature selection. LING 572 Fei Xia

Feature selection. LING 572 Fei Xia Feature selection LING 572 Fei Xia 1 Creating attribute-value table x 1 x 2 f 1 f 2 f K y Choose features: Define feature templates Instantiate the feature templates Dimensionality reduction: feature selection

More information

Reading Time: A Method for Improving the Ranking Scores of Web Pages

Reading Time: A Method for Improving the Ranking Scores of Web Pages Reading Time: A Method for Improving the Ranking Scores of Web Pages Shweta Agarwal Asst. Prof., CS&IT Deptt. MIT, Moradabad, U.P. India Bharat Bhushan Agarwal Asst. Prof., CS&IT Deptt. IFTM, Moradabad,

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Session 10: Information Retrieval

Session 10: Information Retrieval INFM 63: Information Technology and Organizational Context Session : Information Retrieval Jimmy Lin The ischool University of Maryland Thursday, November 7, 23 Information Retrieval What you search for!

More information

~ Ian Hunneybell: WWWT Revision Notes (15/06/2006) ~

~ Ian Hunneybell: WWWT Revision Notes (15/06/2006) ~ . Search Engines, history and different types In the beginning there was Archie (990, indexed computer files) and Gopher (99, indexed plain text documents). Lycos (994) and AltaVista (995) were amongst

More information

Personalized Information Retrieval

Personalized Information Retrieval Personalized Information Retrieval Shihn Yuarn Chen Traditional Information Retrieval Content based approaches Statistical and natural language techniques Results that contain a specific set of words or

More information

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University CS377: Database Systems Text data and information retrieval Li Xiong Department of Mathematics and Computer Science Emory University Outline Information Retrieval (IR) Concepts Text Preprocessing Inverted

More information

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Einführung in Web und Data Science Community Analysis Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Today s lecture Anchor text Link analysis for ranking Pagerank and variants

More information

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds MAE 298, Lecture 9 April 30, 2007 Web search and decentralized search on small-worlds Search for information Assume some resource of interest is stored at the vertices of a network: Web pages Files in

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris Manning at Stanford U.) The Web as a Directed Graph

More information

Information Retrieval Spring Web retrieval

Information Retrieval Spring Web retrieval Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic

More information

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

Pagerank Scoring. Imagine a browser doing a random walk on web pages: Ranking Sec. 21.2 Pagerank Scoring Imagine a browser doing a random walk on web pages: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval

More information

Instructor: Stefan Savev

Instructor: Stefan Savev LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information

More information

Lecture 17 November 7

Lecture 17 November 7 CS 559: Algorithmic Aspects of Computer Networks Fall 2007 Lecture 17 November 7 Lecturer: John Byers BOSTON UNIVERSITY Scribe: Flavio Esposito In this lecture, the last part of the PageRank paper has

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 21: Link Analysis Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-06-18 1/80 Overview

More information

Path Analysis References: Ch.10, Data Mining Techniques By M.Berry, andg.linoff Dr Ahmed Rafea

Path Analysis References: Ch.10, Data Mining Techniques By M.Berry, andg.linoff  Dr Ahmed Rafea Path Analysis References: Ch.10, Data Mining Techniques By M.Berry, andg.linoff http://www9.org/w9cdrom/68/68.html Dr Ahmed Rafea Outline Introduction Link Analysis Path Analysis Using Markov Chains Applications

More information

An Enhanced Page Ranking Algorithm Based on Weights and Third level Ranking of the Webpages

An Enhanced Page Ranking Algorithm Based on Weights and Third level Ranking of the Webpages An Enhanced Page Ranking Algorithm Based on eights and Third level Ranking of the ebpages Prahlad Kumar Sharma* 1, Sanjay Tiwari #2 M.Tech Scholar, Department of C.S.E, A.I.E.T Jaipur Raj.(India) Asst.

More information

Information Retrieval. Information Retrieval and Web Search

Information Retrieval. Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent

More information

COMP6237 Data Mining Searching and Ranking

COMP6237 Data Mining Searching and Ranking COMP6237 Data Mining Searching and Ranking Jonathon Hare jsh2@ecs.soton.ac.uk Note: portions of these slides are from those by ChengXiang Cheng Zhai at UIUC https://class.coursera.org/textretrieval-001

More information

CLOUD COMPUTING PROJECT. By: - Manish Motwani - Devendra Singh Parmar - Ashish Sharma

CLOUD COMPUTING PROJECT. By: - Manish Motwani - Devendra Singh Parmar - Ashish Sharma CLOUD COMPUTING PROJECT By: - Manish Motwani - Devendra Singh Parmar - Ashish Sharma Instructor: Prof. Reddy Raja Mentor: Ms M.Padmini To Implement PageRank Algorithm using Map-Reduce for Wikipedia and

More information

Boolean Model. Hongning Wang

Boolean Model. Hongning Wang Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

More information

Searching the Web [Arasu 01]

Searching the Web [Arasu 01] Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web

More information

World Wide Web has specific challenges and opportunities

World Wide Web has specific challenges and opportunities 6. Web Search Motivation Web search, as offered by commercial search engines such as Google, Bing, and DuckDuckGo, is arguably one of the most popular applications of IR methods today World Wide Web has

More information

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page Link Analysis Links Web consists of web pages and hyperlinks between pages A page receiving many links from other pages may be a hint of the authority of the page Links are also popular in some other information

More information

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google Web Search Engines 1 Web Search before Google Web Search Engines (WSEs) of the first generation (up to 1998) Identified relevance with topic-relateness Based on keywords inserted by web page creators (META

More information

COMP Page Rank

COMP Page Rank COMP 4601 Page Rank 1 Motivation Remember, we were interested in giving back the most relevant documents to a user. Importance is measured by reference as well as content. Think of this like academic paper

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Problem 1: A. In using the vector model to compare the similarity of two documents, why is it desirable to normalize the vectors to

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches

More information

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Σηµερινό ερώτηµα Typically we want to retrieve the top K docs (in the cosine ranking for the query) not totally order all docs in the corpus can we pick off docs

More information

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW ISSN: 9 694 (ONLINE) ICTACT JOURNAL ON COMMUNICATION TECHNOLOGY, MARCH, VOL:, ISSUE: WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW V Lakshmi Praba and T Vasantha Department of Computer

More information

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable References Bigtable: A Distributed Storage System for Structured Data. Fay Chang et. al. OSDI

More information

Static Pruning of Terms In Inverted Files

Static Pruning of Terms In Inverted Files In Inverted Files Roi Blanco and Álvaro Barreiro IRLab University of A Corunna, Spain 29th European Conference on Information Retrieval, Rome, 2007 Motivation : to reduce inverted files size with lossy

More information

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #11: Link Analysis 3 Seoul National University 1 In This Lecture WebSpam: definition and method of attacks TrustRank: how to combat WebSpam HITS algorithm: another algorithm

More information

Searching the Web for Information

Searching the Web for Information Search Xin Liu Searching the Web for Information How a Search Engine Works Basic parts: 1. Crawler: Visits sites on the Internet, discovering Web pages 2. Indexer: building an index to the Web's content

More information

Relevance of a Document to a Query

Relevance of a Document to a Query Relevance of a Document to a Query Computing the relevance of a document to a query has four parts: 1. Computing the significance of a word within document D. 2. Computing the significance of word to document

More information

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains

More information

Lecture 27: Learning from relational data

Lecture 27: Learning from relational data Lecture 27: Learning from relational data STATS 202: Data mining and analysis December 2, 2017 1 / 12 Announcements Kaggle deadline is this Thursday (Dec 7) at 4pm. If you haven t already, make a submission

More information

The Anatomy of a Large-Scale Hypertextual Web Search Engine

The Anatomy of a Large-Scale Hypertextual Web Search Engine The Anatomy of a Large-Scale Hypertextual Web Search Engine Article by: Larry Page and Sergey Brin Computer Networks 30(1-7):107-117, 1998 1 1. Introduction The authors: Lawrence Page, Sergey Brin started

More information

Multimedia Information Systems

Multimedia Information Systems Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive

More information

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured

More information

CSI 445/660 Part 10 (Link Analysis and Web Search)

CSI 445/660 Part 10 (Link Analysis and Web Search) CSI 445/660 Part 10 (Link Analysis and Web Search) Ref: Chapter 14 of [EK] text. 10 1 / 27 Searching the Web Ranking Web Pages Suppose you type UAlbany to Google. The web page for UAlbany is among the

More information

Measuring Similarity to Detect

Measuring Similarity to Detect Measuring Similarity to Detect Qualified Links Xiaoguang Qi, Lan Nie, and Brian D. Davison Dept. of Computer Science & Engineering Lehigh University Introduction Approach Experiments Discussion & Conclusion

More information

Information Retrieval. Lecture 4: Search engines and linkage algorithms

Information Retrieval. Lecture 4: Search engines and linkage algorithms Information Retrieval Lecture 4: Search engines and linkage algorithms Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk Today 2

More information

Generalized Social Networks. Social Networks and Ranking. How use links to improve information search? Hypertext

Generalized Social Networks. Social Networks and Ranking. How use links to improve information search? Hypertext Generalized Social Networs Social Networs and Raning Represent relationship between entities paper cites paper html page lins to html page A supervises B directed graph A and B are friends papers share

More information

TODAY S LECTURE HYPERTEXT AND

TODAY S LECTURE HYPERTEXT AND LINK ANALYSIS TODAY S LECTURE HYPERTEXT AND LINKS We look beyond the content of documents We begin to look at the hyperlinks between them Address questions like Do the links represent a conferral of authority

More information