Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Similar documents
Chapter 27 Introduction to Information Retrieval and Web Search

Information Retrieval

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

COMP5331: Knowledge Discovery and Data Mining

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

DATA MINING - 1DL105, 1DL111

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Chapter 6: Information Retrieval and Web Search. An introduction

Link Analysis and Web Search

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

DATA MINING II - 1DL460. Spring 2014"

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CS6200 Information Retreival. The WebGraph. July 13, 2015

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

Mining Web Data. Lijun Zhang

Information Retrieval. Lecture 4: Search engines and linkage algorithms

Searching the Web What is this Page Known for? Luis De Alba

A Survey on Web Information Retrieval Technologies

Information Retrieval. hussein suleman uct cs

An Introduction to Search Engines and Web Navigation

Information Retrieval Spring Web retrieval

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

Exam IST 441 Spring 2014

Searching the Web [Arasu 01]

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture 8: Linkage algorithms and web search

5/30/2014. Acknowledgement. In this segment: Search Engine Architecture. Collecting Text. System Architecture. Web Information Retrieval

Recent Researches on Web Page Ranking

Information Retrieval. Lecture 11 - Link analysis

How Does a Search Engine Work? Part 1

Exam IST 441 Spring 2013

Bruno Martins. 1 st Semester 2012/2013

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

DATA MINING II - 1DL460. Spring 2017

Information Retrieval May 15. Web retrieval

THE WEB SEARCH ENGINE

CS/INFO 1305 Summer 2009

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Chapter 4. Processing Text

Information Retrieval. (M&S Ch 15)

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

Information Retrieval. Lecture 9 - Web search basics

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Exam IST 441 Spring 2011

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

Outline. Lecture 2: EITN01 Web Intelligence and Information Retrieval. Previous lecture. Representation/Indexing (fig 1.

Information Networks: PageRank

CS/INFO 1305 Information Retrieval

SEARCH ENGINE OVERVIEW. Web Search Engine Architectures and their Performance Analysis

Computer Science 572 Midterm Prof. Horowitz Thursday, March 8, 2012, 2:00pm 3:00pm

Abstract. 1. Introduction

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

Information Retrieval

A Dynamic Clustering-Based Markov Model for Web Usage Mining

Personalized Information Retrieval

Web Structure Mining using Link Analysis Algorithms

COMP 4601 Hubs and Authorities

CHAPTER 31 WEB SEARCH TECHNOLOGIES FOR TEXT DOCUMENTS

Authoritative Sources in a Hyperlinked Environment

Link Analysis in Web Mining

~ Ian Hunneybell: WWWT Revision Notes (15/06/2006) ~

Information Retrieval

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Social Networks 2015 Lecture 10: The structure of the web and link analysis

Lec 8: Adaptive Information Retrieval 2

Social Network Analysis

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

: Semantic Web (2013 Fall)

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Link Structure Analysis

Minghai Liu, Rui Cai, Ming Zhang, and Lei Zhang. Microsoft Research, Asia School of EECS, Peking University

Mining Web Data. Lijun Zhang

Part I: Data Mining Foundations

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Keyword Search in External Memory Graph Representations of Data

Measuring Similarity to Detect

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

= a hypertext system which is accessible via internet

68A8 Multimedia DataBases Information Retrieval - Exercises

Collective Entity Resolution in Relational Data

PageRank and related algorithms

Unit VIII. Chapter 9. Link Analysis

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

Natural Language Processing

Hierarchical Link Analysis for Ranking Web Data

Semantic Website Clustering

Web Applications: Internet Search and Digital Preservation

Transcription:

Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing - examining documents Retrieval - searching for documents Database System versus IR System Category SQL Search Engine Matching exact answer ranked answer Language sophisticated simple Algorithm deterministic probabilistic Database structured semistructured Query complete incomplete Error sensitive insensitive

Page 2 of 14 The Web is a Hypertext System Content - collection of pages Structure - links (directed graph) Additional user task: Navigation - traversing links and following a trail of associated links Quote from Bush 1945 As We May Think (download from my web links): the process of tying two items together is an important thing.. when numerous items have been thus joined together to form a trail they can be reviewed in turn Nelson s vision of a universal hypertext database - Xanadu (1960 s)

Page 3 of 14 The Basic Information Retrieval Algorithm 1. Remove stopwords such as: of, the, a... 2. Apply stemming to terms (or words), i.e. remove prefixes and suffixes E.g. connected, connecting, connection and connections connect 3. Weight the terms in the query and in pages 4. Rank the pages according to similarity with the query

Page 4 of 14 Term Weighting N - total no. of pages in the system n j - no. pages in which term j appears term frequency tf ij = frequency of term j in page i inverse document frequency idf j = log n j N = log N n j (self-information of term j) normalised term weight w ij = tf ij idf j max k tf ik Query Weighting w qj = ( 0.5 + 0.5 tf qj max k tf qk ) idf j

Page 5 of 14 Similarity m - no. of terms considered Represent page i as a vector i = w i1, w i2,..., w im Represent query q as a vector q = w q1, w q2,..., w qm sim(i, q) = m w k=1 ik w qk (dot product of i and q) (Other similarity measures exist)

Page 6 of 14 Measures of Information Retrieval R F - no. relevant pages returned R N - no. relevant pages not returned I F - no. of irrelevant pages returned I N - no. of irrelevant pages not returned recall = R F R F +R N Proportion of relevant pages returned. precision = R F R F +I F Proportion of returned pages which are relevant. Precision versus Recall curve

Page 7 of 14 Searching the Web over 2 billion pages (2000) growing at 1 million pages per-day. Each page has on average 7 out-links. Over 600 GB of text changes every month. Largest crawlers cover 30-40% of the indexable web during several months. 10 percent redundancy in mirrored sites. Most users type in short queries on average less than 3 terms. Most users only look at the top ten results. Most users do not modify their original query.

Page 8 of 14 Using Link Structure in Search L ij = 1 if there is a link from i to j and 0 otherwise. Structured Weighting sw qj = w qj + k j αl kj w qk α is between 0 and 1 (0.2 seems to be optimal) the sum is over all pages that have a link to page j

Page 9 of 14 HITS - Hypertext Induced Topic Search Given a query such as XML distinguish between: authorities - pages which focus on the topic of XML such as various publications on the XML standard. hubs - pages that contain many useful links to relevant pages A densely linked focused subgraph of hubs and authorities is called a community. Over 100,000 emerging web communities have been discovered from a web crawl (a process called trawling).

Page 10 of 14 The HITS Algorithm 1. Collect the top t pages (say t = 200) based on similarity, called the root set. 2. Extend the root set into a base set as follows, for all pages p in the root set: 2.1. add to the root set all pages that p points to, and 2.2. add to the root set up-to d pages that point to p (say d = 50). 3. Delete all links between the same web site in the base set resulting in a focused subgraph. 4. Assign to each page p a non-negative hub weight y p and a non-negative authority weight x p. 5. Iteratively reinforce hubs and authorities as follows, until convergence: x p := q where q p y q y p := q where p q x q

Page 11 of 14 PageRank - Google Model of a random surfer : 1. The surfer given a web page at random. 2. The surfer follows forward links without going back. 3. When the surfer gets bored a random page is chosen as the next page. The PageRank of a page is the probability that a random surfer visit a page P - a page which has incoming links from pages P 1, P 2,..., P n r - a positive number between 0 and 1 O(P i ) - the number of links going out of page P i P R(P ) = r + (1 r) n i=1 P R(P i ) O(P i )

Page 12 of 14 Metasearch Problem: Search engines have limited coverage and overlap (Nature 1999) Relative coverage of major search engines about 20%. The overall coverage is small, less than 16% are indexed by all engines, not taking into account the deep web. Solution: Select and merge results from several data sources Not easy to do well due to heterogeneity of local search engines

Page 13 of 14 The Navigation Problem in Hypertext The steps in searching for information: 1) Query - user provides the context 2) Information Retrieval - ranked list of pages returned 3) Navigation - user repeats : (a) choose a page to browse (b) follow a link 4) Query Modification - user returns to (1)

Page 14 of 14 Problem: getting lost in hyperspace - navigation (link following) leads to disorientation in terms of the goals and relevance of the currently browsed page to the query. Solution: Trails are first-class citizens We develop algorithms which maximise the expected trail relevance.