INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)


 Elfrieda Shields
 1 years ago
1 INTRODUCTION TO DATA SCIENCE Link Analysis (MMDS5)
2 Introduction. Motivation: accurate web search. Spammers want you to land on their pages. Topics: Google's PageRank and its variants, TrustRank, Hubs and Authorities (HITS). Link analysis: analyze the structure of a very large graph (the Web).
3 PageRank
4 Early SE and Term Spam. Early search engines invented term search: crawl the Web; extract terms (e.g. words) from each page; create an inverted index (which terms appear in which pages). Query processing: find all pages containing the query terms; rank pages according to importance/relevance, e.g. a term in the title of a page is more important. Spammers invented term spam: add fake terms (in an invisible font); run a popular query, see which page comes first, and copy it.
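The inverted index and query step described on this slide can be sketched in a few lines. This is a minimal illustration, not how any real search engine is built; the page ids, page texts, and function names are made up for the example, and tokenization is plain whitespace splitting.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each term to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index

def search(index, query):
    """Return the ids of pages that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical three-page corpus.
pages = {
    "p1": "jaguar is a large cat",
    "p2": "jaguar is a car maker",
    "p3": "the cat sat on the mat",
}
index = build_inverted_index(pages)
hits = search(index, "jaguar cat")
```

Because matching looks only at the terms on the page, a spammer who adds invisible copies of popular terms to a page gets it into the result set, which is the weakness term spam exploits.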
5 Google Innovation. PageRank: simulate a random surfer who starts from a random page and follows random out-links; important pages have a large chance of being on the simulated random path. Both page importance and terms are used for ranking. Terms around the link: the relevance of a page depends on the terms within the page and the terms around links to this page.
6 Definition of PageRank. A function that assigns a real number to each page; more important pages get a higher PageRank. The Web is modeled as a directed graph (nodes = pages, edges = links).
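7 Transition Matrix. Entry $m_{ij}$ of the transition matrix $M$ is the probability of jumping from node $j$ to node $i$, so that column $j$ describes the out-links of page $j$. Assume equal probabilities: a page with $k$ out-links follows each of them with probability $1/k$. PageRank is a column vector $v$ whose $i$-th component is the probability of being at node $i$.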
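As a rough illustration of this slide, the sketch below builds a column-stochastic transition matrix from an adjacency list. The four-page graph and the function name are assumptions made for the example; the convention used is that column $j$ holds the out-link probabilities of page $j$.

```python
def transition_matrix(out_links, nodes):
    """Return M as a list of rows, where M[i][j] is the probability of
    moving from nodes[j] to nodes[i]. A page with k out-links gives each
    of them probability 1/k, so every column with out-links sums to 1."""
    pos = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    M = [[0.0] * n for _ in range(n)]
    for src, dests in out_links.items():
        for dest in dests:
            M[pos[dest]][pos[src]] = 1.0 / len(dests)
    return M

# Hypothetical 4-page graph: A -> B, C, D; B -> A, D; C -> A; D -> B, C.
out_links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
nodes = ["A", "B", "C", "D"]
M = transition_matrix(out_links, nodes)
column_sums = [sum(M[i][j] for i in range(len(nodes))) for j in range(len(nodes))]
```

Checking that every entry of `column_sums` equals 1 is a quick way to confirm the graph has no dead ends.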
8 Stable Distribution. Assume the initial probability of being at each node is given by the vector $v_0 = (1/n, 1/n, \ldots, 1/n)$, and let $M$ be the transition matrix. What is the probability distribution after a single step? $x = M v_0$, that is, $x_i = \sum_j m_{ij} v_j$. After $k$ steps: $x_k = M^k v_0 = M M \cdots M v_0$.
9 Markov Process. The distribution over nodes at step $k$ depends only on the distribution at step $k-1$. A limiting distribution $v = Mv$ exists provided that the graph is strongly connected (it is possible to get from any node to any other node) and there are no dead ends (nodes that have no arcs out). The limiting distribution is an eigenvector of $M$.
10 Principal Eigenvector. The transition matrix $M$ is stochastic (each column adds up to 1). The limiting distribution is the principal eigenvector, associated with the largest eigenvalue: $Mv = \lambda v$, with $\lambda = 1$ for a stochastic matrix. Computation: iterate by multiplying by the matrix $M$ until there is no significant change; for the Web, slide 23 quotes 50 iterations.
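The iteration described here is ordinary power iteration. The sketch below is one possible way to write it; the matrix is a hypothetical four-page example (A -> B, C, D; B -> A, D; C -> A; D -> B, C), and the tolerance and iteration cap are arbitrary choices.

```python
def power_iteration(M, tol=1e-10, max_iter=1000):
    """Repeatedly apply v <- M v, starting from the uniform vector,
    until the L1 change between successive vectors drops below tol."""
    n = len(M)
    v = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    return v

# Hypothetical column-stochastic matrix; column j describes page j.
M = [
    [0.0, 0.5, 1.0, 0.0],
    [1 / 3, 0.0, 0.0, 0.5],
    [1 / 3, 0.0, 0.0, 0.5],
    [1 / 3, 0.5, 0.0, 0.0],
]
v = power_iteration(M)
```

For this graph the fixed point of $v = Mv$ can be checked by hand to be $(3/9, 2/9, 2/9, 2/9)$, and the iteration approaches it.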
11 Example Assuming transition matrix Successive multiplications
12 Structure of the Web. In practice, the Web is not a strongly connected graph.
13 Structure of the Web. A large strongly connected component (SCC). In-component: pages that can reach the SCC but are not reachable from it. Out-component: pages reachable from the SCC but unable to reach it. Two types of tendrils: pages reachable from the in-component, and pages that can reach the out-component. Tubes: paths from the in-component to the out-component that bypass the SCC. Isolated components.
14 Two General Problems. Dead ends: pages with no links out. Spider traps: groups of pages that have no links to pages outside the group, although each page has out-links within the group.
15 Avoiding Dead Ends. With dead ends the transition matrix is not stochastic (it has all-zero columns); it is substochastic (column sums are at most 1). Taking increasing powers of $M$ leads to some or all elements of $v$ going to zero. Example.
16 Dropping Dead Ends. Drop dead ends and their incoming arcs from the graph. Other nodes may then become dead ends, so drop recursively to obtain a strongly connected component. Compute PageRank on the remaining graph. Restore the graph by adding nodes back in reverse order of removal. PageRank of a restored node: each parent with PageRank $p$ and $k$ out-links contributes $p/k$ to the restored node.
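The drop-and-restore procedure can be sketched as follows. The five-page graph is a hypothetical example in which E is a dead end and C becomes one once E is dropped; the function names and the fixed iteration count are assumptions for illustration.

```python
def remove_dead_ends(out_links):
    """Recursively drop nodes with no out-links.
    Returns the reduced graph and the removal order."""
    graph = {u: set(vs) for u, vs in out_links.items()}
    removed = []
    while True:
        dead = sorted(u for u, vs in graph.items() if not vs)
        if not dead:
            return graph, removed
        for u in dead:
            del graph[u]
            removed.append(u)
        for vs in graph.values():
            vs.difference_update(dead)

def pagerank(graph, iterations=100):
    """Plain power iteration on a graph that has no dead ends."""
    rank = {u: 1.0 / len(graph) for u in graph}
    for _ in range(iterations):
        nxt = {u: 0.0 for u in graph}
        for u, vs in graph.items():
            for v in vs:
                nxt[v] += rank[u] / len(vs)
        rank = nxt
    return rank

def restore(out_links, rank, removed):
    """Add removed nodes back in reverse order of removal. Each parent
    with PageRank p and k out-links in the ORIGINAL graph contributes p/k."""
    rank = dict(rank)
    for node in reversed(removed):
        rank[node] = sum(rank[u] / len(vs)
                         for u, vs in out_links.items()
                         if node in vs and u in rank)
    return rank

# Hypothetical graph: E is a dead end; C only links to E.
out_links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["E"],
             "D": ["B", "C"], "E": []}
reduced, removed = remove_dead_ends(out_links)
final = restore(out_links, pagerank(reduced), removed)
```

In this example the reduced graph {A, B, D} has PageRank (2/9, 4/9, 3/9); C then receives (1/3)(2/9) + (1/2)(3/9) = 13/54, and E inherits the same value from its single parent C. The restored values push the total above 1, so the result is no longer a probability distribution.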
17 Example. Drop dead ends; compute PageRank on the reduced graph. Restore C; restore E, which has a single parent and receives the same PageRank. The result is not a distribution (it does not sum to 1).
18 Spider Traps and Taxation Example
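19 Teleporting. A random surfer has a small probability of jumping from any page to any other page: $v' = \beta M v + (1-\beta)\, e/n$, where $e$ is a vector of all 1s and $1-\beta$ is the small teleport probability (e.g. 0.15). With dead ends and spider traps there is always a probability of getting out.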
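A minimal sketch of the taxed iteration $v' = \beta M v + (1-\beta)\, e/n$ follows, where $\beta$ is the probability of following a link and $1-\beta$ the probability of teleporting. The graph is a hypothetical four-page example in which C links only to itself (a spider trap); the function name and iteration count are assumptions.

```python
def pagerank_teleport(out_links, beta=0.8, iterations=100):
    """Iterate v' = beta * M v + (1 - beta) * e / n on an adjacency list."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every page first receives its share of the teleporting surfers.
        nxt = {u: (1.0 - beta) / n for u in nodes}
        # Then each page passes beta times its rank along its out-links.
        for u, vs in out_links.items():
            for v in vs:
                nxt[v] += beta * rank[u] / len(vs)
        rank = nxt
    return rank

# Hypothetical graph with a spider trap: C links only to itself.
out_links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["C"], "D": ["B", "C"]}
rank = pagerank_teleport(out_links, beta=0.8)
```

With $\beta = 0.8$ the fixed point for this graph works out to (15/148, 19/148, 95/148, 19/148): the trap C still collects most of the PageRank, but the other pages keep a nonzero share instead of draining to zero.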
20 Example Assume β = 0.8
21 Using PageRank in a SE. A secret formula ranks pages in response to a query, combining term relevance, PageRank, and some 250 other properties of pages (Google).
22 Efficient Computation of PageRank
23 PageRank for a Large Graph. 50 iterations of matrix-vector multiplication, using the MapReduce method. The transition matrix $M$ is very sparse: represent only the nonzero elements. Modify the MapReduce striping approach to reduce the amount of data passed from Map tasks to Reduce tasks.
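24 Representing Transition Matrices. With 10B pages and 10 links per page, about 1 in every 1B entries is nonzero. Listing nonzero entries as triples: 4 bytes per coordinate index and 8 bytes for the value, 16 bytes in total per nonzero entry. Listing nonzero entries by column: a single integer for the number of nonzeros in the column (the out-degree, which also determines the value of every nonzero entry in that column), plus 4 bytes for the row number of each nonzero entry.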
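The column-wise representation can be sketched as below. For each source page only the out-degree and the list of destination rows are stored; the value of every nonzero entry in that column is implied as 1/out-degree. The graph and function names are hypothetical.

```python
def to_sparse_columns(out_links):
    """For each source page keep (out_degree, destination list).
    Every nonzero in that column equals 1/out_degree, so no value is stored."""
    return {src: (len(dests), list(dests))
            for src, dests in out_links.items() if dests}

def multiply(columns, v):
    """Compute M v directly from the column-wise representation."""
    result = {page: 0.0 for page in v}
    for src, (degree, dests) in columns.items():
        share = v[src] / degree
        for dest in dests:
            result[dest] += share
    return result

# Hypothetical 4-page graph and a uniform starting vector.
out_links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
columns = to_sparse_columns(out_links)
v0 = {page: 0.25 for page in out_links}
v1 = multiply(columns, v0)
```

The multiplication never touches a zero entry, which is what makes one PageRank step proportional to the number of links rather than to the square of the number of pages.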
25 Example Transition Matrix Representation
26 PageRank Iteration Using MapReduce. For small $n$, store the vector $v$ in the main memory of each node. Map: $(i, j, m_{ij}) \to (i, m_{ij} v_j)$. Reduce: $(i, [m_{i1} v_1, \ldots, m_{in} v_n]) \to (i, \sum_j m_{ij} v_j)$. For large $n$: break $M$ into vertical stripes and $v$ into horizontal stripes, or break $M$ into blocks and $v$ into stripes.
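The Map and Reduce functions for the small-$n$ case can be imitated in a single process, as in the sketch below. This is not a real MapReduce job: the shuffle is replaced by an in-memory dictionary, and the matrix entries, vector, and function names are made up for the example.

```python
from collections import defaultdict

def map_phase(matrix_entries, v):
    """Map: (i, j, m_ij) -> (i, m_ij * v_j), with v held in memory."""
    for i, j, m_ij in matrix_entries:
        yield i, m_ij * v[j]

def reduce_phase(pairs):
    """Reduce: group the values by key i and sum them, giving x_i."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {i: sum(values) for i, values in groups.items()}

# Hypothetical nonzero entries (row, column, value) of a 4x4 transition matrix.
entries = [
    (0, 1, 0.5), (0, 2, 1.0),
    (1, 0, 1 / 3), (1, 3, 0.5),
    (2, 0, 1 / 3), (2, 3, 0.5),
    (3, 0, 1 / 3), (3, 1, 0.5),
]
v = [0.25, 0.25, 0.25, 0.25]
x = reduce_phase(map_phase(entries, v))
```

When $v$ no longer fits in memory, the same two functions would run once per stripe or block, with each Map task seeing only the piece of $v$ that matches its piece of $M$.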
27 Topic-Sensitive PageRank
28 Motivation. A search for "jaguar" could mean the animal, the automobile, a Mac OS version, or an ancient game console. If the SE can guess the topic, it can return more relevant results. Select a small number of topics; create a PageRank vector for each topic (e.g. 16 DMOZ topics); detect the user's interest with respect to one of these topics.
29 Biased Random Walk. Assume random surfers start only from a random sports page: the teleport set $S$ consists of sports pages. Usage: decide on the topics; select a teleport set for each topic; find a way to decide which topic(s) are relevant to a query; use the appropriate PageRank vector.
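A biased random walk differs from ordinary teleporting only in where the teleporting surfers land: the update becomes $v' = \beta M v + (1-\beta)\, e_S/|S|$, where $e_S$ has 1s for the pages in the teleport set $S$ and 0s elsewhere. The sketch below assumes a graph without dead ends; the graph, the teleport set, and the function name are hypothetical.

```python
def topic_sensitive_pagerank(out_links, teleport_set, beta=0.8, iterations=100):
    """Iterate v' = beta * M v + (1 - beta) * e_S / |S|."""
    nodes = sorted(out_links)
    rank = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iterations):
        nxt = {u: 0.0 for u in nodes}
        # Teleporting surfers land only on pages of the teleport set.
        for u in teleport_set:
            nxt[u] = (1.0 - beta) / len(teleport_set)
        for u, vs in out_links.items():
            for v in vs:
                nxt[v] += beta * rank[u] / len(vs)
        rank = nxt
    return rank

# Hypothetical graph; B and D play the role of the topic's pages.
out_links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
rank = topic_sensitive_pagerank(out_links, teleport_set={"B", "D"}, beta=0.8)
```

For this example the fixed point is (54/210, 59/210, 38/210, 59/210) for A, B, C, D: the pages in the teleport set, and pages close to them, score higher than they would under uniform teleporting.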
30 Link Spam
31 Architecture of a Spam Farm. Spammers constantly try to improve the PageRank of their pages. The Web from the point of view of a spammer: inaccessible pages (e.g. Amazon), accessible pages (e.g. blogs), and own pages (the spam farm).
32 Spam Farm. A single target page and $m$ supporting pages.
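33 Analysis of a Spam Farm. $x$: the PageRank contributed by accessible pages, $x = \sum_i \beta\, p_i / k_i$, where $p_i$ is the PageRank and $k_i$ the number of out-links of accessible page $i$. $y$: the unknown PageRank of the target page. The PageRank of each of the $m$ supporting pages is $\beta y/m + (1-\beta)/n$.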
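34 PageRank of Target Page. Contribution $x$ from outside. Contribution $\beta\,(\beta y/m + (1-\beta)/n)$ from every supporting page. Contribution $(1-\beta)/n$ from teleported surfers (ignored). Total: $y = x + \beta m\,(\beta y/m + (1-\beta)/n) = x + \beta^2 y + \beta(1-\beta)\, m/n$. Solve: $y = x/(1-\beta^2) + c\, m/n$, where $c = \beta/(1+\beta)$.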
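35 Example. Assume $\beta = 0.85$; then $1/(1-\beta^2) = 3.6$ and $c = \beta/(1+\beta) = 0.46$, so $y = 3.6\,x + 0.46\, m/n$. The farm amplifies $x$, the contribution of outside pages, by 360%, and adds 46% of $m/n$, the fraction of the Web that is in the spam farm.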
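The two coefficients in this example can be checked numerically. The sketch below evaluates the closed form $y = x/(1-\beta^2) + c\,m/n$ with $c = \beta/(1+\beta)$ and confirms that it satisfies the balance equation $y = x + \beta m\,(\beta y/m + (1-\beta)/n)$; the values chosen for `x`, `m`, and `n` are arbitrary illustrations, not data from the slides.

```python
def target_pagerank(x, m, n, beta=0.85):
    """Closed form for the target page: y = x / (1 - beta^2) + c * m / n."""
    c = beta / (1.0 + beta)
    return x / (1.0 - beta ** 2) + c * m / n

beta = 0.85
amplification = 1.0 / (1.0 - beta ** 2)   # multiplier applied to x
c = beta / (1.0 + beta)                   # multiplier applied to m / n

# Arbitrary illustrative inputs: outside contribution, farm size, Web size.
x, m, n = 1e-9, 1_000, 10_000_000_000
y = target_pagerank(x, m, n, beta)

# The closed form should satisfy the balance equation it was derived from.
balance = x + beta * m * (beta * y / m + (1.0 - beta) / n)
```

Rounded to the precision used on the slide, `amplification` is 3.6 and `c` is 0.46.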
36 Combating Link Spam. There is an ongoing battle between SEs, which detect spam-farm-like structures, and spammers, who invent new ones. Two approaches are considered. TrustRank: a variation of topic-sensitive PageRank designed to lower the score of spam pages. Spam mass: identifies pages that are likely to be spam.
37 TrustRank. Let the teleport set $S$ be a set of pages that are considered trustworthy: spammers can't inject spam links into them (e.g. no talkbacks). Selecting trustworthy pages: human-selected pages, or pages from specific domains (.edu, .mil, .gov).
38 Spam Mass. Measures the fraction of a page's PageRank that comes from spam. Compute the PageRank $r$; compute the TrustRank $t$; the spam mass is $(r - t)/r$. Not spam: negative or small positive spam mass. Spam: spam mass close to one ($t$ is almost zero).
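Given a PageRank vector $r$ and a TrustRank vector $t$, the spam mass is a one-line computation. The numbers below are illustrative values for a four-page graph and are not taken from the slides; the function name is an assumption.

```python
def spam_mass(r, t):
    """Spam mass (r - t) / r for every page with a PageRank entry."""
    return {page: (r[page] - t[page]) / r[page] for page in r}

# Illustrative PageRank and TrustRank values for four pages (assumed).
r = {"A": 3 / 9, "B": 2 / 9, "C": 2 / 9, "D": 2 / 9}
t = {"A": 54 / 210, "B": 59 / 210, "C": 38 / 210, "D": 59 / 210}
mass = spam_mass(r, t)
```

Here B and D get negative spam mass because their TrustRank exceeds their PageRank, while A and C get small positive values; none is close to 1, so none would be flagged as spam.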
39 Example Trustworthy pages B and D No spam pages
40 Hubs and Authorities
41 HITS. Hyperlink-induced topic search (HITS), probably used by the Ask.com SE. Originally intended to help rank query results, not as a preprocessing step like PageRank. Here we apply it to the entire Web.
42 The Intuition Behind HITS. Authorities: certain pages are valuable because they provide information about a topic. Hubs: other pages are valuable because they point to good pages about that topic. Example: the homepage of a faculty is a hub; the homepage of each course is an authority. Recursive definition: a page is a good hub if it links to good authorities, and a good authority if it is linked to by good hubs.
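43 Formalizing Hubbiness and Authority. Link matrix of the Web $L$: $L_{ij} = 1$ if there is a link from page $i$ to page $j$, and 0 otherwise. Transpose $L^T$: $L^T_{ij} = 1$ if there is a link from $j$ to $i$. $L^T$ is similar to the transition matrix $M$ ($M$ has probabilities where $L^T$ has 1s).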
44 Scores. Let $h$ and $a$ be the score vectors for hubbiness and authority, respectively. Scale each vector to sum to 1. Computation: $h = \lambda L a$ and $a = \mu L^T h$, with scaling constants $\lambda$ and $\mu$. Substituting: $h = \lambda L \mu L^T h = \lambda\mu\, L L^T h$ and $a = \mu L^T \lambda L a = \lambda\mu\, L^T L a$.
45 Computing. $L L^T$ and $L^T L$ are much less sparse than $L$, so it is better to compute $h$ and $a$ by a true mutual recursion. Algorithm: compute $a = \mu L^T h$ and scale; compute $h = \lambda L a$ and scale; repeat until the changes are small.
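The mutual recursion can be sketched directly on an adjacency list, without ever forming $L L^T$ or $L^T L$. The five-page graph, the function name, and the fixed iteration count are assumptions; scaling is to sum 1, as on the Scores slide.

```python
def hits(out_links, iterations=50):
    """Alternate a = L^T h and h = L a, scaling each vector to sum to 1."""
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    h = {u: 1.0 for u in nodes}
    a = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages linking to it.
        a = {u: 0.0 for u in nodes}
        for u in nodes:
            for v in out_links.get(u, ()):
                a[v] += h[u]
        total = sum(a.values())
        a = {u: s / total for u, s in a.items()}
        # Hubbiness of a page: sum of the authority scores it links to.
        h = {u: sum(a[v] for v in out_links.get(u, ())) for u in nodes}
        total = sum(h.values())
        h = {u: s / total for u, s in h.items()}
    return h, a

# Hypothetical graph: E has no out-links, so its hub score is 0.
out_links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["E"],
             "D": ["B", "C"], "E": []}
h, a = hits(out_links)
```

In this example A ends up as the strongest hub because it links to the most authoritative pages, and B and C tie as authorities because they are linked from exactly the same hubs (A and D).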
46 Summary
47 Summary. Term spam: inject terms and copy pages. PageRank and the transition matrix: page importance defined by a random surfer. Dead ends and spider traps: taxation/teleporting and removal of dead ends. Combating spam farms: TrustRank and spam mass. Topic-sensitive PageRank: teleport sets. Hubs and authorities: mutually recursive definition.
More informationCOMPARATIVE ANALYSIS OF POWER METHOD AND GAUSSSEIDEL METHOD IN PAGERANK COMPUTATION
International Journal of Computer Engineering and Applications, Volume IX, Issue VIII, Sep. 15 www.ijcea.com ISSN 23213469 COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSSSEIDEL METHOD IN PAGERANK COMPUTATION
More informationThe Anatomy of a LargeScale Hypertextual Web Search Engine
The Anatomy of a LargeScale Hypertextual Web Search Engine Article by: Larry Page and Sergey Brin Computer Networks 30(17):107117, 1998 1 1. Introduction The authors: Lawrence Page, Sergey Brin started
More informationLinkbased Object Classification (LOC) Linkbased Object Ranking (LOR)... 79
5 Link Analysis Arpan Chakraborty, Kevin Wilson, Nathan Green, Shravan Kumar Alur, Fatih Ergin, Karthik Gurumurthy, Romulo Manzano, and Deepti Chinta North Carolina State University CONTENTS 5.1 Introduction......................................................
More informationPart I: Data Mining Foundations
Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?
More information22. TwoDimensional Arrays. Topics Motivation The numpy Module Subscripting functions and 2d Arrays GooglePage Rank
22. TwoDimensional Arrays Topics Motivation The numpy Module Subscripting functions and 2d Arrays GooglePage Rank Visualizing 12 17 49 61 38 18 82 77 83 53 12 10 Can have a 2d array of strings or objects.
More informationReduce and Aggregate: Similarity Ranking in MultiCategorical Bipartite Graphs
Reduce and Aggregate: Similarity Ranking in MultiCategorical Bipartite Graphs Alessandro Epasto J. Feldman*, S. Lattanzi*, S. Leonardi, V. Mirrokni*. *Google Research Sapienza U. Rome Motivation Recommendation
More informationGraphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech
CSE 6242/ CX 4242 Feb 18, 2014 Graphs / Networks Centrality measures, algorithms, interactive applications Duen Horng (Polo) Chau Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey
More informationImproving the Ranking Capability of the Hyperlink Based Search Engines Using Heuristic Approach
Journal of Computer Science 2 (8): 638645, 2006 ISSN 15493636 2006 Science Publications Improving the Ranking Capability of the Hyperlink Based Search Engines Using Heuristic Approach 1 Haider A. Ramadhan,
More informationInformation Retrieval. Lecture 9  Web search basics
Information Retrieval Lecture 9  Web search basics Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Up to now: techniques for general
More information