DATA MINING - 1DL460

Size: px
Start display at page:

Download "DATA MINING - 1DL460"

Transcription

1 DATA MINING - 1DL460 Spring 2014" A second course in data mining Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala University, Uppsala, Sweden 07/04/14 1

2 Introduction to Data Mining: Search Engines Technology and Algorithms for Efficient Searching of Information on the Web (slides + supplemental articles)" Kjell Orsborn Department of Information Technology Uppsala University, Uppsala, Sweden 07/04/14 2

3 Lecture s Outline" The Web and its Search Engines Heuristics-based Ranking PageRank (Google) for discovering the most important web pages HITS: hubs and authorities (Clever project) more detailed evaluation of pages importance 07/04/14 3

4 The Web in 2001: Some Facts " More than 3 billion pages; several terabytes Highly dynamic More than 1 million new pages every day! Over 600 GB of pages change per month Average page changes in a few weeks Largest crawlers Refresh less than 18% in a few weeks Cover less than 50% ever (invisible Web) Average page has 7 10 links Links form content-based communities 07/04/14 4

5 Chaos on the Web" Internet lacks organization and structure: pages written in any language, dialect or style; different cultures, interests and motivation; mixes truth, falsehood, wisdom, propaganda Challenge: Quickly extract from this digital morass, high-quality, relevant, up-to-date pages in response to specific information needs No precise mathematical measure of best results 07/04/14 5

6 Search Products and Services " Verity Fulcrum PLS Oracle text extender DB2 text extender Infoseek Intranet SMART (academic) Glimpse (academic) BING (Microsoft) Baidu (China) Yandex (Russia) Inktomi (HotBot) Alta Vista Raging Search Google Dmoz.org Yahoo! Infoseek Internet Lycos Excite * heuristics-based * humanly-selected 07/04/14 6

7 Web Search Queries" Web search queries are short: ~2.4 words on average (Aug. 2000) Has increased, was 1.7 (~1997) User expectations: The first item shown should be what I want to see! This works if the user has the most popular or most common notion in mind; not otherwise 07/04/14 7

8 Standard Web Search Engine Architecture" crawl the web eliminate duplicates DocIds user! query create an inverted index show results to user Search engine servers inverted index 07/04/14 8

9 Heuristics-based ranking" Naive attempt used by many search engines Heuristics employed: number of times a page contains the query term favor instances where the term appears early give weight to word appearing in a special place or form e.g. in a title or in bold. All heuristics fail miserably due to: spamming, or synonymy and polysemy of natural language words 07/04/14 9

10 Hyperlink Graph Analysis" Hypermedia is a social network Telephoned, advised, co-authored, paid Social network theory (cf. Barnes, 1954 and Wasserman & Faust, 1994) Extensive research applying graph notions Centrality and prestige Co-citation (relevance judgement) and bibliographic coupling Applications Web search: HITS, Google Classification and topic distillation 07/04/14 10

11 Hypertext models for classification" c = class, t = text, N = neighbors Text-only model: Pr[t c] Using neighbors text to judge my topic: Pr[t, t(n) c]? Better model: Pr[t, c(n) c] Recursive relationships 07/04/14 11

12 Exploiting the Web s Hyperlink Structure" Underlying assumption: view each link as an implicit endorsement of the location it points to Assumption: If the pages pointing to this page are good, then this is also a good page. References: Kleinberg 98, Page et al. 98 Draws upon earlier research in sociology and bibliometrics. Kleinberg s model includes authorities (highly referenced pages) and hubs (pages containing good reference lists). Google model is a version with no hubs, and is closely related to work on influence weights by Pinski-Narin (1976). 07/04/14 12

13 Link analysis for ranking pages" Why does this work? The official Ferrari site will be linked to by lots of other official (or highquality) sites The best Ferrari fan-club sites probably also have many links pointing to it Less high-quality sites do not have as many high-quality sites linking to them 07/04/14 13

14 Page Rank" Intuition: recursive definition of importance. A page is important if important pages link to it. Method: create a stochastic matrix of the Web each page corresponds to a matrix s row and column if page j has L successors, (out-links) then the ij-th entry is: 1/L if page i is one of these L successors of page j 0 otherwise 07/04/14 14

15 Page Rank intuition" Initially, each page has one unit of importance. At each round, each page shares whatever importance it has with its successors, and receives new importance from its predecessors. Eventually, the importance reaches a limit, which happens to be its component of the principal eigenvector of this matrix. Importance = probability that a random Web surfer, starting from a random Web page, and following random links will be at the page in question after a long series of links. 07/04/14 15

16 Page Rank example: the web in 1689" Netscape Equation: N M A = 1/2 0 1/ /2 1/2 1 0 N M A Microsoft Solution by relaxation (iterative solution): Amazon N M A = = = /2 3/2 5/4 3/4 1 9/8 1/2 11/8 5/4 11/16 17/ /5 3/5 6/5 07/04/14 16

17 Problems with real web graphs" Dead ends: a page that has no successors has nowhere to send its importance Eventually, all importance will leak out of the Web Spider traps: a group of one or more pages that have no links outside the group Eventually, these pages will accumulate all the importance of the Web. 07/04/14 17

18 The Bowtie structure of the Web" Tendrils Out Tendrils In In Component Strongly Connected Component Out Component Tubes Disconnected Components From Rajaraman and Ullman, Mining Massive Data Sets 07/04/14 18

19 Dead ends - rank leaks: Microsoft tries to duck monopoly charges..." Netscape Equation: N M A = 1/2 0 1/ /2 1/2 0 0 N M A Microsoft Solution by relaxation (iterative solution): Amazon N M A = = = /2 1/2 3/4 1/4 1/2 5/8 1/4 1/2 1/2 3/16 5/ /04/14 19

20 Spider traps - rank sinks Microsoft considers itself the center of the universe..." Netscape Equation: N M A = 1/2 0 1/ /2 1/2 0 0 N M A Microsoft Solution by relaxation (iterative solution): Amazon N M A = = = /2 1/2 3/4 7/4 1/2 5/8 2 3/8 1/2 35/16 5/ /04/14 20

21 Google solution to dead ends and spider traps" Instead of applying the matrix directly, tax each page with some fraction of its current importance, and distribute the taxed importance equally among all pages. N M A = 0.8 1/2 0 1/ /2 1/2 0 0 N M A The solution to this equation is now: N = 7/11; M = 21/11; A = 5/11 Formally, the system should be: stochastic (columns of non-negative numbers with the sum of 1) irreducible (strongly connected), i.e. all pairs of nodes connected and aperiodic, all states should be aperiodic, i.e. have a period of 1 07/04/14 21

22 PageRank" Where PR is the PageRank, p 1, p 2,..., p N are the pages under consideration, M(p i ) is the set of pages that link to p i, L(p j ) is the number of outbound links on page p j, N is the total number of pages, and d is the damping factor. In an iterative formulation we have at t= 0: At each time step, the PageRank can be computed as follows: 07/04/14 22

23 PageRank" Expressed as a matrix equation (using a modified adjacency matrix): where is given by:, 1 is an N-length column vector of ones and the matrix M Convergence is assumed when: 07/04/14 23

24 Google anti-spam devices" Spamming: an attempt by many web sites to appear to be about a subject that will attract surfers, without truly being about the subject Early search engines relied on the words on a page to tell what it is about. Led to tricks in which pages attracted attention by placing false words in the background color on their page. Google trusts the words in anchor text Relies on others telling the truth about your page, rather than relying on you. Makes it harder for a homepage to appear to be about something it is not. The use of page rank to measure importance, rather than the more naive number of links into a page, also protects against spamming. E.g. Page Rank recognizes as unimportant 1000 pages that mutually link to one another. 07/04/14 24

25 Google search engine evolution, (ref: wired.com)" Backrub [September 1997] This search engine, which had run on Stanford s servers for almost two years, is renamed Google. Its breakthrough innovation: ranking searches based on the number and quality of incoming links. New algorithm [August 2001] The search algorithm is completely revamped to incorporate additional ranking criteria more easily. Local connectivity analysis [February 2003] Google s first patent is granted for this feature, which gives more weight to links from authoritative sites. Fritz [Summer 2003] This initiative allows Google to update its index constantly, instead of in big batches. Personalized results [June 2005] Users can choose to let Google mine their own search behavior to provide individualized results. Bigdaddy [December 2005] Engine update allows for more-comprehensive Web crawling. Universal search [May 2007] Building on Image Search, Google News, and Book Search, the new Universal Search allows users to get links to any medium on the same results page. Google Caffeine [August 2009] A new architecture designed to return results faster and to better deal with rapidly updated information from services including Facebook and Twitter. With Caffeine, Google moved its back-end indexing system away from MapReduce and onto BigTable, the company's distributed database platform. Panda [February 2011] Change to lower the rank of "low-quality sites" or "thin sites and return higher-quality sites near the top of the search results, e.g. drop rankings for sites containing large amounts of advertising Penquin [April 2012] Change to decrease search engine rankings of websites that violate Google s Webmaster Guidelines, such as keyword stuffing, cloaking, participating in link schemes, deliberate creation of duplicate content, and others. 07/04/14 25

26 Google Facts (from end 2001)" Indexes 3 billion web pages (50 billion pages in 2012) If printed, they would result in a stack of paper 200 km high If a person reads a page per minute (and does nothing else), (s)he would need 6000 years to read them all 200 million search queries a day Approx. 80 billion searches a year! (in 2013 there were around billion searches) Most searches take less than half second Support for 35 non-english languages (now in 2014 its 151 language choices) Searchable index contained 3 trillion items Updated every 28 days 07/04/14 26

27 Google architecture (approx.)" 3. The search results are returned to the user typically in less than a second. 1. The web server sends the query to the index servers. The content inside the index servers is similar to the index in the back of a book - it tells which pages contain the words that match the query. 2. The query travels to the doc servers, which actually retrieve the stored documents. Snippets are generated to describe each search result. 07/04/14 27

28 Google Advanced Search" 07/04/14 28

29 Google strengths and weaknesses" Successful business implementation Good at fighting spam Query independent very efficient at query time Time is not considered (limited time interval for result available) Basic algorithm do not distinguish between authoritative in general and on a certain topic (no public info available on this topic) 07/04/14 29

30 Hubs and Authorities" HITS - hypertext-induced topic selection (IBM s Clever project) Defined in a mutually recursive way: a hub links to many (valuable) authorities; an authority is linked to by many (good) hubs. Authorities turn out to be pages that offer the best information about a topic. Hubs are pages that do not provide any information, but specify a collection of links on where to find the information. 07/04/14 30

31 Hubs and Authorities" Use a matrix formulation similar to that of Page Rank, but without the stochastic restriction. Rows and columns correspond to web pages; A[i,j] = 1 if page i links to page j A[i,j] = 0 otherwise The transpose of A looks like the matrix for Page Rank, but it has a 1 where the Page Rank matrix has a fraction An iterative solution where the matrix A is repeatedly applied will result in a diverging solution vector. However, by introducing scaling factors the computed values of authority and hubbiness for each page can be kept within finite bounds. 07/04/14 31

32 Computing Hubbiness and Authority of pages" Let a and h be vectors The i-th component of a and h correspond to the degrees of authority and hubbiness of the i-th page respectively. Let λ and µ be suitable scaling factors. Then: the hubbiness of each page is the sum of authorities it links to, scaled by λ h = λ A a the authority of each page is the sum of the hubbiness of all pages that link to it, scaled by µ T a = µ A h 07/04/14 32

33 Computing Hubbiness and Authority of pages" By simple substitution, two equations that relate vectors a and h only to themselves: T a = λ µ A A a T h = λ µ A A h Thus, we can compute a and h by iteratively solving the equations, giving us the principal eigenvectors of the matrices AA T and A T A, respectively. 07/04/14 33

34 Hubs and Authorities example: Netscape acknowledges Microsoft s existence" Netscape Matrices: A = A T = Microsoft AA T = A T A = Amazon 07/04/14 34

35 Hubs and Authorities example: Netscape acknowledges Microsoft s existence" Netscape Iteratively solving the equations for a and h assuming λ = µ = 1: Microsoft a(n) a(m) a(a) = = = a(a) 1.3 a(a) a(a) Amazon h(n) h(m) h(a) = = = a(a) 0,268 a(a) 0,735 a(a) 07/04/14 35

36 Authorities and Hubs pragmatics" The system is jump-started by obtaining a set of root pages from a standard text index such as Google or Bing. The iterative process settles very rapidly. A root set of 3,000 pages requires just 5 rounds of calculations! The results are independent of the initial estimates Algorithm naturally separates web sites into clusters e.g. a search for abortion partitions the Web into a pro-life and a pro-choice community 07/04/14 36

37 HITS strengths and weaknesses" Rank pages according to query topic Less good at fighting spam (later developments more robust) Topic drift (due to expanded base set can include irrelevant pages) More time consuming (compared to Page Rank) 07/04/14 37

38 Moore s Law" 07/04/14 38

39 Web Sizes over Time" ~150M in 1998 ~5B in x increase Moore would predict 25x Monthly refresh in 1998 Daily refresh in 2005 What about 2010? 40B? (actually 29B but 50B in 2012) Where is the content? Public Web? Personal Web? 07/04/14 39

40 Philosophical Remarks" The Web of today is dramatically different from what it was five years ago. Predicting the next five years seems futile. Will even the basic act of indexing soon become infeasible? If so, will our notion of searching the Web undergo fundamental changes? The Web s relentless growth will continue to generate computational challenges for wading through the ever increasing volume of on-line information. 07/04/14 40

41 World Wide Web statistics" According to a 2001 study, there were massively more than 550 billion documents on the Web, mostly in the invisible Web, or deep Web.! A 2002 survey of 2,024 million Web pages determined that by far the most Web content was in English: 56.4%; next were pages in German (7.7%), French (5.6%), and Japanese (4.9%).! A more recent study, which used Web searches in 75 different languages to sample the Web, determined that there were over 11.5 billion Web pages in the publicly indexable Web as of the end of January 2005.! As of March 2009, the indexable web contains at least billion pages.! On July 25, 2008, Google software engineers Jesse Alpert and Nissan Hajaj announced that Google Search had discovered one trillion unique URLs.! As of May 2009, over million websites operated. Of these 74% were commercial or other sites operating in the.com generic top-level domain.! 07/04/14 41

DATA MINING - 1DL460

DATA MINING - 1DL460 DATA MINING - 1DL460 Spring 2015 A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt15 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

DATA MINING - 1DL105, 1DL111

DATA MINING - 1DL105, 1DL111 1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database

More information

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5) INTRODUCTION TO DATA SCIENCE Link Analysis (MMDS5) Introduction Motivation: accurate web search Spammers: want you to land on their pages Google s PageRank and variants TrustRank Hubs and Authorities (HITS)

More information

Unit VIII. Chapter 9. Link Analysis

Unit VIII. Chapter 9. Link Analysis Unit VIII Link Analysis: Page Ranking in web search engines, Efficient Computation of Page Rank using Map-Reduce and other approaches, Topic-Sensitive Page Rank, Link Spam, Hubs and Authorities (Text Book:2

More information

DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2014 DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Bruno Martins. 1 st Semester 2012/2013

Bruno Martins. 1 st Semester 2012/2013 Link Analysis Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 4

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Information Networks: PageRank

Information Networks: PageRank Information Networks: PageRank Web Science (VU) (706.716) Elisabeth Lex ISDS, TU Graz June 18, 2018 Elisabeth Lex (ISDS, TU Graz) Links June 18, 2018 1 / 38 Repetition Information Networks Shape of the

More information

Searching the Web [Arasu 01]

Searching the Web [Arasu 01] Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web

More information

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Einführung in Web und Data Science Community Analysis Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Today s lecture Anchor text Link analysis for ranking Pagerank and variants

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Information Retrieval. Lecture 11 - Link analysis

Information Retrieval. Lecture 11 - Link analysis Information Retrieval Lecture 11 - Link analysis Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 35 Introduction Link analysis: using hyperlinks

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a !"#$ %#& ' Introduction ' Social network analysis ' Co-citation and bibliographic coupling ' PageRank ' HIS ' Summary ()*+,-/*,) Early search engines mainly compare content similarity of the query and

More information

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page Link Analysis Links Web consists of web pages and hyperlinks between pages A page receiving many links from other pages may be a hint of the authority of the page Links are also popular in some other information

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University 3 Mutually recursive definition: A hub links to many authorities; An authority is linked to by many hubs. Authorities turn out to be places where information can

More information

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge Centralities (4) By: Ralucca Gera, NPS Excellence Through Knowledge Some slide from last week that we didn t talk about in class: 2 PageRank algorithm Eigenvector centrality: i s Rank score is the sum

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

Slides based on those in:

Slides based on those in: Spyros Kontogiannis & Christos Zaroliagis Slides based on those in: http://www.mmds.org A 3.3 B 38.4 C 34.3 D 3.9 E 8.1 F 3.9 1.6 1.6 1.6 1.6 1.6 2 y 0.8 ½+0.2 ⅓ M 1/2 1/2 0 0.8 1/2 0 0 + 0.2 0 1/2 1 [1/N]

More information

Part 1: Link Analysis & Page Rank

Part 1: Link Analysis & Page Rank Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Graph Data: Social Networks [Source: 4-degrees of separation, Backstrom-Boldi-Rosa-Ugander-Vigna,

More information

DATA MINING II - 1DL460. Spring 2017

DATA MINING II - 1DL460. Spring 2017 DATA MINING II - 1DL460 Spring 2017 A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt17 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

Brief (non-technical) history

Brief (non-technical) history Web Data Management Part 2 Advanced Topics in Database Management (INFSCI 2711) Textbooks: Database System Concepts - 2010 Introduction to Information Retrieval - 2008 Vladimir Zadorozhny, DINS, SCI, University

More information

CS435 Introduction to Big Data FALL 2018 Colorado State University. 9/24/2018 Week 6-A Sangmi Lee Pallickara. Topics. This material is built based on,

CS435 Introduction to Big Data FALL 2018 Colorado State University. 9/24/2018 Week 6-A Sangmi Lee Pallickara. Topics. This material is built based on, FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W6... PRT 1. LRGE SLE T NLYTIS WE-SLE LINK N SOIL NETWORK NLYSIS omputer Science, olorado State University

More information

Information Retrieval. Lecture 4: Search engines and linkage algorithms

Information Retrieval. Lecture 4: Search engines and linkage algorithms Information Retrieval Lecture 4: Search engines and linkage algorithms Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk Today 2

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #11: Link Analysis 3 Seoul National University 1 In This Lecture WebSpam: definition and method of attacks TrustRank: how to combat WebSpam HITS algorithm: another algorithm

More information

How to organize the Web?

How to organize the Web? How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second try: Web Search Information Retrieval attempts to find relevant docs in a small and trusted set Newspaper

More information

COMP 4601 Hubs and Authorities

COMP 4601 Hubs and Authorities COMP 4601 Hubs and Authorities 1 Motivation PageRank gives a way to compute the value of a page given its position and connectivity w.r.t. the rest of the Web. Is it the only algorithm: No! It s just one

More information

Lecture 8: Linkage algorithms and web search

Lecture 8: Linkage algorithms and web search Lecture 8: Linkage algorithms and web search Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk 2017

More information

TODAY S LECTURE HYPERTEXT AND

TODAY S LECTURE HYPERTEXT AND LINK ANALYSIS TODAY S LECTURE HYPERTEXT AND LINKS We look beyond the content of documents We begin to look at the hyperlinks between them Address questions like Do the links represent a conferral of authority

More information

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

COMP5331: Knowledge Discovery and Data Mining

COMP5331: Knowledge Discovery and Data Mining COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank

More information

Analysis of Large Graphs: TrustRank and WebSpam

Analysis of Large Graphs: TrustRank and WebSpam Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

Link Analysis in Web Mining

Link Analysis in Web Mining Problem formulation (998) Link Analysis in Web Mining Hubs and Authorities Spam Detection Suppose we are given a collection of documents on some broad topic e.g., stanford, evolution, iraq perhaps obtained

More information

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material. Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material. 1 Contents Introduction Network properties Social network analysis Co-citation

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu HITS (Hypertext Induced Topic Selection) Is a measure of importance of pages or documents, similar to PageRank

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu SPAM FARMING 2/11/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 2/11/2013 Jure Leskovec, Stanford

More information

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems Indexing: Part IV CPS 216 Advanced Database Systems Announcements (February 17) 2 Homework #2 due in two weeks Reading assignments for this and next week The query processing survey by Graefe Due next

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris Manning at Stanford U.) The Web as a Directed Graph

More information

Information retrieval

Information retrieval Information retrieval Lecture 8 Special thanks to Andrei Broder, IBM Krishna Bharat, Google for sharing some of the slides to follow. Top Online Activities (Jupiter Communications, 2000) Email 96% Web

More information

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google Web Search Engines 1 Web Search before Google Web Search Engines (WSEs) of the first generation (up to 1998) Identified relevance with topic-relateness Based on keywords inserted by web page creators (META

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org J.

More information

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today 3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

More information

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule 1 How big is the Web How big is the Web? In the past, this question

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 12: Link Analysis January 28 th, 2016 Wolf-Tilo Balke and Younes Ghammad Institut für Informationssysteme Technische Universität Braunschweig An Overview

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/6/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 High dim. data Graph data Infinite data Machine

More information

Bibliometrics: Citation Analysis

Bibliometrics: Citation Analysis Bibliometrics: Citation Analysis Many standard documents include bibliographies (or references), explicit citations to other previously published documents. Now, if you consider citations as links, academic

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Information Networks: Hubs and Authorities

Information Networks: Hubs and Authorities Information Networks: Hubs and Authorities Web Science (VU) (706.716) Elisabeth Lex KTI, TU Graz June 11, 2018 Elisabeth Lex (KTI, TU Graz) Links June 11, 2018 1 / 61 Repetition Opinion Dynamics Culture

More information

Link Structure Analysis

Link Structure Analysis Link Structure Analysis Kira Radinsky All of the following slides are courtesy of Ronny Lempel (Yahoo!) Link Analysis In the Lecture HITS: topic-specific algorithm Assigns each page two scores a hub score

More information

THE HISTORY & EVOLUTION OF SEARCH

THE HISTORY & EVOLUTION OF SEARCH THE HISTORY & EVOLUTION OF SEARCH Duration : 1 Hour 30 Minutes Let s talk about The History Of Search Crawling & Indexing Crawlers / Spiders Datacenters Answer Machine Relevancy (200+ Factors)

More information

Information Retrieval Spring Web retrieval

Information Retrieval Spring Web retrieval Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic

More information

CS535 Big Data Fall 2017 Colorado State University 9/5/2017. Week 3 - A. FAQs. This material is built based on,

CS535 Big Data Fall 2017 Colorado State University  9/5/2017. Week 3 - A. FAQs. This material is built based on, S535 ig ata Fall 217 olorado State University 9/5/217 Week 3-9/5/217 S535 ig ata - Fall 217 Week 3--1 S535 IG T FQs Programming ssignment 1 We will discuss link analysis in week3 Installation/configuration

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Generalized Social Networks. Social Networks and Ranking. How use links to improve information search? Hypertext

Generalized Social Networks. Social Networks and Ranking. How use links to improve information search? Hypertext Generalized Social Networs Social Networs and Raning Represent relationship between entities paper cites paper html page lins to html page A supervises B directed graph A and B are friends papers share

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

CS6200 Information Retreival. The WebGraph. July 13, 2015

CS6200 Information Retreival. The WebGraph. July 13, 2015 CS6200 Information Retreival The WebGraph The WebGraph July 13, 2015 1 Web Graph: pages and links The WebGraph describes the directed links between pages of the World Wide Web. A directed edge connects

More information

Link analysis. Query-independent ordering. Query processing. Spamming simple popularity

Link analysis. Query-independent ordering. Query processing. Spamming simple popularity Today s topic CS347 Link-based ranking in web search engines Lecture 6 April 25, 2001 Prabhakar Raghavan Web idiosyncrasies Distributed authorship Millions of people creating pages with their own style,

More information

Social Network Analysis

Social Network Analysis Social Network Analysis Giri Iyengar Cornell University gi43@cornell.edu March 14, 2018 Giri Iyengar (Cornell Tech) Social Network Analysis March 14, 2018 1 / 24 Overview 1 Social Networks 2 HITS 3 Page

More information

Link Analysis. Hongning Wang

Link Analysis. Hongning Wang Link Analysis Hongning Wang CS@UVa Structured v.s. unstructured data Our claim before IR v.s. DB = unstructured data v.s. structured data As a result, we have assumed Document = a sequence of words Query

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis Content Anchor text Link analysis for ranking Pagerank and variants HITS The Web as a Directed Graph Page A Anchor

More information

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,

More information

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE Bohar Singh 1, Gursewak Singh 2 1, 2 Computer Science and Application, Govt College Sri Muktsar sahib Abstract The World Wide Web is a popular

More information

Link Analysis in the Cloud

Link Analysis in the Cloud Cloud Computing Link Analysis in the Cloud Dell Zhang Birkbeck, University of London 2017/18 Graph Problems & Representations What is a Graph? G = (V,E), where V represents the set of vertices (nodes)

More information

Searching the Web What is this Page Known for? Luis De Alba

Searching the Web What is this Page Known for? Luis De Alba Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse

More information

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases Roadmap Random Walks in Ranking Query in Vagelis Hristidis Roadmap Ranking Web Pages Rank according to Relevance of page to query Quality of page Roadmap PageRank Stanford project Lawrence Page, Sergey

More information

Web Search Basics. Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University

Web Search Basics. Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University Web Search Basics Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction

More information

SEARCH ENGINE INSIDE OUT

SEARCH ENGINE INSIDE OUT SEARCH ENGINE INSIDE OUT From Technical Views r86526020 r88526016 r88526028 b85506013 b85506010 April 11,2000 Outline Why Search Engine so important Search Engine Architecture Crawling Subsystem Indexing

More information

PageRank and related algorithms

PageRank and related algorithms PageRank and related algorithms PageRank and HITS Jacob Kogan Department of Mathematics and Statistics University of Maryland, Baltimore County Baltimore, Maryland 21250 kogan@umbc.edu May 15, 2006 Basic

More information

World Wide Web has specific challenges and opportunities

World Wide Web has specific challenges and opportunities 6. Web Search Motivation Web search, as offered by commercial search engines such as Google, Bing, and DuckDuckGo, is arguably one of the most popular applications of IR methods today World Wide Web has

More information

Recent Researches on Web Page Ranking

Recent Researches on Web Page Ranking Recent Researches on Web Page Pradipta Biswas School of Information Technology Indian Institute of Technology Kharagpur, India Importance of Web Page Internet Surfers generally do not bother to go through

More information

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods Matthias Schubert, Matthias Renz, Felix Borutta, Evgeniy Faerman, Christian Frey, Klaus Arthur

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #10: Link Analysis-2 Seoul National University 1 In This Lecture Pagerank: Google formulation Make the solution to converge Computing Pagerank for very large graphs

More information

Popularity of Twitter Accounts: PageRank on a Social Network

Popularity of Twitter Accounts: PageRank on a Social Network Popularity of Twitter Accounts: PageRank on a Social Network A.D-A December 8, 2017 1 Problem Statement Twitter is a social networking service, where users can create and interact with 140 character messages,

More information

Link Analysis SEEM5680. Taken from Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schutze, Cambridge University Press.

Link Analysis SEEM5680. Taken from Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schutze, Cambridge University Press. Link Analysis SEEM5680 Taken from Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schutze, Cambridge University Press. 1 The Web as a Directed Graph Page A Anchor hyperlink Page

More information

COMP Page Rank

COMP Page Rank COMP 4601 Page Rank 1 Motivation Remember, we were interested in giving back the most relevant documents to a user. Importance is measured by reference as well as content. Think of this like academic paper

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 21: Link Analysis Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-06-18 1/80 Overview

More information

A web directory lists web sites by category and subcategory. Web directory entries are usually found and categorized by humans.

A web directory lists web sites by category and subcategory. Web directory entries are usually found and categorized by humans. 1 After WWW protocol was introduced in Internet in the early 1990s and the number of web servers started to grow, the first technology that appeared to be able to locate them were Internet listings, also

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS3245 12 Lecture 12: Crawling and Link Analysis Information Retrieval Last Time Chapter 11 1. Probabilistic Approach to Retrieval / Basic Probability Theory 2. Probability

More information

CS 345A Data Mining Lecture 1. Introduction to Web Mining

CS 345A Data Mining Lecture 1. Introduction to Web Mining CS 345A Data Mining Lecture 1 Introduction to Web Mining What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns Web Mining v. Data Mining Structure (or lack of

More information

Information Retrieval May 15. Web retrieval

Information Retrieval May 15. Web retrieval Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically

More information

CSI 445/660 Part 10 (Link Analysis and Web Search)

CSI 445/660 Part 10 (Link Analysis and Web Search) CSI 445/660 Part 10 (Link Analysis and Web Search) Ref: Chapter 14 of [EK] text. 10 1 / 27 Searching the Web Ranking Web Pages Suppose you type UAlbany to Google. The web page for UAlbany is among the

More information

Lec 8: Adaptive Information Retrieval 2

Lec 8: Adaptive Information Retrieval 2 Lec 8: Adaptive Information Retrieval 2 Advaith Siddharthan Introduction to Information Retrieval by Manning, Raghavan & Schütze. Website: http://nlp.stanford.edu/ir-book/ Linear Algebra Revision Vectors:

More information

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011 Data-Intensive Information Processing Applications! Session #5 Graph Algorithms Jordan Boyd-Graber University of Maryland Thursday, March 3, 2011 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

CS 6604: Data Mining Large Networks and Time-Series

CS 6604: Data Mining Large Networks and Time-Series CS 6604: Data Mining Large Networks and Time-Series Soumya Vundekode Lecture #12: Centrality Metrics Prof. B Aditya Prakash Agenda Link Analysis and Web Search Searching the Web: The Problem of Ranking

More information

Graph Algorithms. Revised based on the slides by Ruoming Kent State

Graph Algorithms. Revised based on the slides by Ruoming Kent State Graph Algorithms Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/

More information

5. search engine marketing

5. search engine marketing 5. search engine marketing What s inside: A look at the industry known as search and the different types of search results: organic results and paid results. We lay the foundation with key terms and concepts

More information

Link Analysis. Chapter PageRank

Link Analysis. Chapter PageRank Chapter 5 Link Analysis One of the biggest changes in our lives in the decade following the turn of the century was the availability of efficient and accurate Web search, through search engines such as

More information

Information Retrieval. Lecture 9 - Web search basics

Information Retrieval. Lecture 9 - Web search basics Information Retrieval Lecture 9 - Web search basics Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Up to now: techniques for general

More information

Lecture 11: Graph algorithms! Claudia Hauff (Web Information Systems)!

Lecture 11: Graph algorithms! Claudia Hauff (Web Information Systems)! Lecture 11: Graph algorithms!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the scenes of MapReduce:

More information

Google Inc. The world s leading Internet search engine. MarketLine Case Study. Reference Code: ML Publication Date: March 2012

Google Inc. The world s leading Internet search engine. MarketLine Case Study. Reference Code: ML Publication Date: March 2012 MarketLine Case Study Google Inc. The world s leading Internet search engine Reference Code: ML00001-091 Publication Date: March 2012 WWW.MARKETLINE.COM MARKETLINE. THIS PROFILE IS A LICENSED PRODUCT AND

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information