Lec 8: Adaptive Information Retrieval 2


1 Lec 8: Adaptive Information Retrieval 2 Advaith Siddharthan Introduction to Information Retrieval by Manning, Raghavan & Schütze. Website:
2 Linear Algebra Revision. Vectors: one-dimensional matrices, $X = [x_0, x_1, x_2, \ldots, x_n]$. $|X|$ = length of $X$ = $\sqrt{x_0^2 + x_1^2 + \cdots + x_n^2} = \sqrt{\sum_i x_i^2}$. Often used to represent coordinates in space (x, y, z), but vectors can have any dimension.
3 Scalar Multiplication. Two vectors can be multiplied using the dot product (also called the scalar product) to give a scalar number. $X = [x_0, x_1, x_2, \ldots, x_n]$, $Y = [y_0, y_1, y_2, \ldots, y_n]$, $X \cdot Y = x_0 y_0 + x_1 y_1 + x_2 y_2 + \cdots + x_n y_n = \sum_i x_i y_i$.
4 Scalar Multiplication (example). X = [ ], Y = [ ], $X \cdot Y = 70$. Geometrically, $X \cdot Y$ = (length of the projection of X on Y) × (length of Y).
5 Geometric Interpretation. $A \cdot A = |A|^2$, $B \cdot B = |B|^2$, $A \cdot B = |A||B|\cos\theta$, where θ is the angle between A and B. $\cos(0°) = 1$, $\cos(90°) = 0$. The cosine function is a similarity metric.
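The definitions on slides 2 to 5 can be checked numerically. The following is a minimal sketch in plain Python; the function names and the example vectors `a` and `b` are illustrative assumptions, not taken from the slides.

```python
import math

def dot(x, y):
    """Dot (scalar) product: sum of element-wise products."""
    return sum(xi * yi for xi, yi in zip(x, y))

def norm(x):
    """Length of a vector: square root of the sum of squared entries."""
    return math.sqrt(dot(x, x))

def cosine(x, y):
    """Cosine of the angle between x and y, from A.B = |A||B|cos(theta)."""
    return dot(x, y) / (norm(x) * norm(y))

# Illustrative vectors (assumed values, not from the lecture).
a = [3.0, 4.0]
b = [4.0, 3.0]

print(dot(a, b))                # 24.0
print(norm(a))                  # 5.0
print(cosine(a, b))             # 24 / (5 * 5) = 0.96
print(cosine([1, 0], [0, 1]))   # orthogonal vectors: 0.0
```

Vectors pointing in the same direction give a cosine of 1 and orthogonal vectors give 0, which is why the cosine can serve as a similarity score.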
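6 Vector Product. The slide calls this the cross product; what it actually shows is a matrix product: $A_{m \times n} B_{n \times p} = C_{m \times p}$, with $C_{ij}$ = (row i of A) · (column j of B). In the slide's example A is a column vector and B is a row vector such as [1 5] (an outer product), so the rows of C are the row of B multiplied by a scalar value from A, and the columns of C are the column of A multiplied by a scalar value from B.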
7 Matrix Multiplication. $C_{ij}$ = (row i of A) · (column j of B).
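The rule "entry (i, j) is row i dotted with column j" can be written out directly. This is a small sketch using nested lists; the function name `matmul` and the example matrices are assumptions chosen for illustration.

```python
def matmul(a, b):
    """Multiply an m x n matrix by an n x p matrix (lists of rows)."""
    n = len(b)
    if any(len(row) != n for row in a):
        raise ValueError("columns of a must match rows of b")
    p = len(b[0])
    # C[i][j] = (row i of a) . (column j of b)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(a))]

# Column vector times row vector (illustrative values): every row of the
# result is the row [1, 5] scaled by one entry of the column.
outer = matmul([[2], [3]], [[1, 5]])
print(outer)  # [[2, 10], [3, 15]]

# Ordinary 2 x 2 product.
print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```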
8 Overview. 3 lectures: Information Retrieval History and Evolution, Vector Models; Link Analysis (using anchor text for indexing, using hyperlinks as recommendations, PageRank, Personalised PageRank); Adaptive and Interactive IR.
9 Properties of the internet. Google indexes are big: 1998: 26 million pages; 2000: 1 billion pages; 2004: 8 billion pages; 2008: 1 trillion unique URLs. These numbers are now meaningless (auto-generated content, duplicates, etc.). Probably around 20 billion pages are indexed.
10 Properties of the internet. Dynamic: page content changes around twice a month on average, and over a million pages are added every day. Indexing is a continuous process: news sites etc. have to be indexed constantly, and popular sites are indexed more often.
11 Vector Space Model. Documents and queries are vectors of normalised term counts (tf*idf). Comparison of query Q and document D: $\mathrm{Cosine}(Q, D) = Q \cdot D / (|Q||D|)$. Returns ranked documents for a query, based entirely on the textual content of the documents and the query.
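As a rough illustration of this ranking scheme, the sketch below builds tf*idf vectors for a toy corpus and ranks the documents by cosine similarity to a query. The corpus, the whitespace tokeniser and the particular weighting (raw term count times log(N/df)) are assumptions; the slides do not fix a specific tf*idf variant.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return sparse tf*idf vectors (dicts) and the idf table for a corpus."""
    tokenised = {name: text.lower().split() for name, text in docs.items()}
    n = len(tokenised)
    df = Counter(t for words in tokenised.values() for t in set(words))
    idf = {t: math.log(n / count) for t, count in df.items()}
    vectors = {name: {t: tf * idf[t] for t, tf in Counter(words).items()}
               for name, words in tokenised.items()}
    return vectors, idf

def cosine(u, v):
    """Cosine(Q, D) = Q.D / (|Q||D|) on sparse vectors; 0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Rank documents by cosine similarity to the query, best first."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0)
         for t, tf in Counter(query.lower().split()).items()}
    scores = {name: cosine(q, vec) for name, vec in vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy corpus (assumed for illustration).
docs = {
    "d1": "ibm computers and ibm servers",
    "d2": "apple computers",
    "d3": "fruit apple pie",
}
ranking = rank("ibm computers", docs)
for name, score in ranking:
    print(name, round(score, 3))
```

Because the score depends only on term counts, a page that repeats a term many times can push itself up this ranking, which is the weakness the next slides discuss.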
12 Problems. Not all documents on the web are reliable. Websites can cheat to improve their rank on queries. Indexing is done by an algorithm based on the content provided on the web page. How do we know which websites are reliable?
13 Problems with Term Counts. For the term IBM, how do you distinguish: IBM's home page (mostly graphical; IBM occurs only a few times in the HTML); IBM's copyright page (IBM occurs over 100 times); a rival's spam page (arbitrarily large term count for IBM)?
14 Hyperlinks for search. The web as a graph. Anchor text pointing to page B provides a description of B. A hyperlink from page A to B is a recommendation or endorsement of B. Ignore internal links? (Figure: anchor texts "IBM computers", "IBM Corporation" and "International Business Machines" pointing to IBM.com.)
15 Anchor Text. <a href= > IBM computers </a> has the anchor text "IBM computers". "computer" occurs only once on the ibm.com HTML page; yahoo.com doesn't contain the word "portal"; apple.com doesn't contain the word "apple"! Gaps exist between the terms present on a website and useful terms for indexing. These can usually be filled by anchor text.
16 Anchor Text for indexing Need tf*idf again Most common words in anchor text are:
17 Anchor Text for indexing. Need tf*idf again: the most common words in anchor text are "Click" and "Here". Search engines give substantial weight to index terms obtained from anchor text, e.g. satchmo → louisarmstronghouse.org.
18 Extended Anchor Text. The area around anchor text is useful too: "Click here for information about mutual funds". Search engines make use of extended anchor text as well.
19 Links as recommendations. PageRank (Brin and Page, 1998): a link from A to B is a recommendation of B. Think of science: highly cited papers are considered of higher quality, and backlinks are like citations. But webpages aren't reviewed, so how do we know the citer A is reliable? By counting links to A, of course!
20 PageRank. Consider a random surfer who clicks on links at random. (Figure: graph of pages A to F; each outgoing link is labelled with its click probability: 1/3, 1/2 or 1/1.)
21 PageRank. If you continue this random walk, you will visit some pages more frequently than others. These are pages with lots of links from other pages with lots of links. PageRank: pages visited more often in a random walk are more important (reliable).
22 Teleporting. What if the random surfer reaches a page with no hyperlinks? Teleport: the surfer jumps from a page to any other page in the web graph at random. If there are N pages in the web graph, teleporting takes the surfer to each node with probability 1/N. Use the teleport operation with probability α = 1 if the page has no outgoing links, and with probability 0 < α < 1 otherwise.
23 Need for Teleporting To avoid loops where you are forced to keep visiting the same sites in the random walk
24 Steady State. Given this model of a random surfer, the surfer spends a fixed fraction of the time at each page, which depends on the hyperlink structure of the web and the value of α (usually 0.1). PageRank of page ν: π(ν) = fraction of the time spent at page ν.
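The random-surfer model with teleporting can be simulated directly. The sketch below is an illustration only: it assumes a three-page graph in which A links to B and C, and B and C each link back to A, with α = 1/2. For that graph the long-run visit fractions should come out near 4/9 for A and 5/18 for B and C. The function name, the fixed seed and the step count are all assumptions.

```python
import random

def simulate(links, alpha, steps, seed=0):
    """Estimate the fraction of time a random surfer spends on each page."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    pages = sorted(links)
    visits = {p: 0 for p in pages}
    page = pages[0]
    for _ in range(steps):
        out = links[page]
        # Teleport always from a page with no links, else with probability alpha.
        if not out or rng.random() < alpha:
            page = rng.choice(pages)   # each page with probability 1/N
        else:
            page = rng.choice(out)     # follow one outgoing link at random
        visits[page] += 1
    return {p: visits[p] / steps for p in pages}

# Assumed example graph: A -> B, A -> C, B -> A, C -> A.
links = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
freq = simulate(links, alpha=0.5, steps=200_000)
print(freq)
```

Simulation converges slowly and only approximately, which is one reason the matrix formulation on the following slides is preferred in practice.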
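25 Page Rank Computation. Represent the web as an adjacency matrix: Adj(i,j) = 1 iff there is a link from i to j; Adj(i,j) = 0 iff there is no link from i to j. Example with three pages A, B, C (rows and columns in the order A, B, C): $\mathrm{Adj} = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}$
26 Transition Probabilities. 1) If a row has no 1, replace each element by 1/N (teleport if no outgoing links). 2) Divide each 1 in Adj by the number of 1s in its row (probability of clicking on the link to that page). For pages A, B, C this gives $\begin{pmatrix} 0 & 1/2 & 1/2 \\ 1/1 & 0 & 0 \\ 1/1 & 0 & 0 \end{pmatrix}$
27 Transition Probabilities. Let's consider α = 1/2, N = 3. 3) Multiply everything by 1/2 = (1 − α) (probability of not teleporting). 4) Add 1/6 = (α/N) to every entry. For pages A, B, C: $P = \begin{pmatrix} 1/6 & 1/4 + 1/6 = 5/12 & 1/4 + 1/6 = 5/12 \\ 1/2 + 1/6 = 2/3 & 1/6 & 1/6 \\ 1/2 + 1/6 = 2/3 & 1/6 & 1/6 \end{pmatrix}$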
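The four steps on slides 26 and 27 can be sketched as a short function. This version uses exact fractions so the result can be compared with the slide; the function name is an assumption, and the adjacency matrix is the three-page example (A links to B and C; B and C link to A).

```python
from fractions import Fraction

def transition_matrix(adj, alpha):
    """Turn a 0/1 adjacency matrix into a teleporting transition matrix."""
    n = len(adj)
    matrix = []
    for row in adj:
        out_links = sum(row)
        if out_links == 0:
            # Step 1: no outgoing links, so always teleport (1/N everywhere).
            matrix.append([Fraction(1, n)] * n)
        else:
            # Steps 2-4: click probability, scaled by (1 - alpha), plus alpha/N.
            matrix.append([(1 - alpha) * Fraction(v, out_links) + alpha / n
                           for v in row])
    return matrix

adj = [[0, 1, 1],   # A -> B, A -> C
       [1, 0, 0],   # B -> A
       [1, 0, 0]]   # C -> A
P = transition_matrix(adj, Fraction(1, 2))
for row in P:
    print([str(v) for v in row])
# ['1/6', '5/12', '5/12']
# ['2/3', '1/6', '1/6']
# ['2/3', '1/6', '1/6']
```

Every row sums to 1, as it must for a matrix of transition probabilities.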
28 Starting State. Imagine the surfer starts at page B. At the beginning, $x_0 = [0, 1, 0]$. Vectors $x_n$ show the proportion of time spent on pages A, B, C at time n. At step one, with the transition matrix P from slide 27, $x_1 = x_0 P = [0, 1, 0]\,P = [2/3, 1/6, 1/6]$; for example, the entry for B is $0 \cdot 5/12 + 1 \cdot 1/6 + 0 \cdot 1/6 = 1/6$.
29 Iteration 2. At step one, $x_1 = [2/3, 1/6, 1/6]$. At step 2, $x_2 = x_1 P = [2/3, 1/6, 1/6]\,P = [2/18 + 2/18 + 2/18,\ 10/36 + 1/36 + 1/36,\ 10/36 + 1/36 + 1/36] = [1/3, 1/3, 1/3]$; for example, the entry for B is $2/3 \cdot 5/12 + 1/6 \cdot 1/6 + 1/6 \cdot 1/6 = 1/3$.
30 Iterating...
      A      B      C
x_0   0      1      0
x_1   2/3    1/6    1/6
x_2   1/3    1/3    1/3
x_3   1/2    1/4    1/4
x_4   5/12   7/24   7/24
...
Steady state: x = [4/9, 5/18, 5/18]
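The iteration on slides 28 to 30 is repeated vector-matrix multiplication. The sketch below reproduces the table with exact fractions and then keeps iterating to show the approach to the steady state; the matrix is the α = 1/2 transition matrix from slide 27, and the helper name `step` is an assumption.

```python
from fractions import Fraction as F

# Transition matrix for pages A, B, C with alpha = 1/2 (slide 27).
P = [[F(1, 6), F(5, 12), F(5, 12)],
     [F(2, 3), F(1, 6), F(1, 6)],
     [F(2, 3), F(1, 6), F(1, 6)]]

def step(x, p):
    """One iteration: x_next[j] = sum over i of x[i] * p[i][j]."""
    n = len(p)
    return [sum(x[i] * p[i][j] for i in range(n)) for j in range(n)]

x = [F(0), F(1), F(0)]        # the surfer starts at page B
history = [x]
for _ in range(4):
    x = step(x, P)
    history.append(x)

for n, xn in enumerate(history):
    print(f"x_{n}", [str(v) for v in xn])

# Keep iterating: the vector approaches [4/9, 5/18, 5/18].
for _ in range(40):
    x = step(x, P)
print([round(float(v), 6) for v in x])
```

For this matrix the distance to the steady state halves at every step, so a few dozen iterations are ample; real web graphs need sparse data structures rather than a dense matrix.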
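31 Solving by hand. B and C are symmetric, so $\pi = (1 - 2p,\ p,\ p)$ (the entries have to add up to 1). Solve $\pi P = \pi$, with P from slide 27, to get p. The first component gives $\frac{1}{6}(1 - 2p) + \frac{2}{3}p + \frac{2}{3}p = 1 - 2p$, i.e. $\frac{1}{6} - \frac{1}{3}p + \frac{4}{3}p = 1 - 2p$, so $3p = \frac{5}{6}$ and $p = \frac{5}{18}$. Hence $\pi = [4/9,\ 5/18,\ 5/18]$.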
32 Example. Which sites have low / high PageRank? (Figure: web graph with nodes D0 to D6.)
33 Example (α = 0.14). π = [0.05, 0.04, 0.11, 0.25, 0.21, 0.04, 0.31], i.e. D0 = 0.05, D1 = 0.04, D2 = 0.11, D3 = 0.25, D4 = 0.21, D5 = 0.04, D6 = 0.31.
34 Web Search: Ranking Documents for a Query. Vector similarity, Cosine(Q, D): terms from the document and from anchor text, normalised using tf*idf. PageRank: independent of the query (a property of the graph); a measure of reliability (collaborative trust). It has nothing to do with how often real users click on links: the random surfer was only used to calculate a property of the graph.
35 Properties of Page Rank. New pages have to acquire PageRank: either convince lots of sites to link to you, or convince a few high-PageRank sites. PageRank can change very fast: one link on Yahoo or the BBC is enough. Spamming PageRank costs money: you need to create a huge number of sites. Google never sells PageRank.
36 Top PageRank sites: google.com, adobe.com, w3.org, jigsaw.w3.org/cssvalidator, cnn.com, usa.gov, get.adobe.com/flashplayer, get.adobe.com/reader, india.gov.in.
37 Personalised PageRank: Why Personalise? Tech sites tend to have many backlinks and high PageRank. This is a problem if you are not interested in IT: try searching for Apache, Snow Leopard or Java. PageRank reflects the interests of the web-creating majority. What if you are in the minority?
38 Personalised PageRank. Keep track of a user's favorite websites and increase the PageRank of these sites. During the iterative process, this PageRank will spread to the sites they link to, so PageRank will now reflect the user's interests. If you give wwf.org a large PageRank, this will spread to other wildlife sites, and you might then see real snow leopards when you search. BUT?
39 Personalised PageRank. PageRank vectors are very big and time-consuming to compute, even once. You don't want to compute one for each user, or continuously update it as their browsing behaviour changes: that is too computationally intensive.
40 Personalised PageRank: Compromise. Personalise by subject, not by user: create a PageRank vector for each subject (Sports, Politics, etc.). How?
41 Topic-specific PageRank. Random surfer: follow a link OR teleport. Teleport only to sites relevant to the topic, e.g. use a directory of sports pages from Yahoo or DMOZ. We can then build $\pi_{sports}$, $\pi_{politics}$, etc.
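A minimal sketch of this idea: ordinary PageRank iteration, except that every teleport lands uniformly on the pages listed for the topic. The five-page graph, the page names, the value α = 0.15 and the treatment of pages with no outgoing links (they teleport with probability 1) are assumptions made for illustration.

```python
def topic_pagerank(links, topic_pages, alpha=0.15, iterations=100):
    """PageRank where teleporting only lands on pages in topic_pages."""
    pages = sorted(links)
    teleport = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0)
                for p in pages}
    rank = dict(teleport)                  # start on the topic pages
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if out:
                share = (1 - alpha) * rank[p] / len(out)
                for q in out:              # follow a link
                    nxt[q] += share
                teleport_mass = alpha * rank[p]
            else:
                teleport_mass = rank[p]    # no links: always teleport
            for q in pages:                # teleport to topic pages only
                nxt[q] += teleport_mass * teleport[q]
        rank = nxt
    return rank

# Hypothetical web graph: two sports pages, two politics pages, one news page.
links = {
    "sport1": ["sport2"],
    "sport2": ["sport1", "news"],
    "news": ["sport1", "politics1"],
    "politics1": ["politics2"],
    "politics2": ["politics1", "news"],
}
sports = topic_pagerank(links, {"sport1", "sport2"})
politics = topic_pagerank(links, {"politics1", "politics2"})
print({p: round(r, 3) for p, r in sports.items()})
print({p: round(r, 3) for p, r in politics.items()})
```

The link structure is identical in both runs; only the teleport targets differ, and that alone is enough to shift rank towards the chosen topic.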
42 User Modelling. We can then model a user as a linear combination of topics. For example, if we say a user's interests are 60% Sports and 40% Politics, can we compute a PageRank for this?
43 User Modelling
44 User Modelling. We don't need to recompute PageRank. If each webpage has a Politics PageRank and a Sports PageRank precomputed, we can just use a linear combination of PageRanks for a user with mixed interests: $\pi_{0.6\,sports + 0.4\,politics} = 0.6\,\pi_{sports} + 0.4\,\pi_{politics}$. Topic PageRanks are calculated offline by the server; the user profile (0.4, 0.6, ...) is maintained on the client side. This is an efficient method that can be used at runtime.
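Combining precomputed topic vectors is a weighted sum per page. The sketch below assumes the topic PageRanks are already available as dictionaries; the page names and the numbers in them are made up for illustration and are not results from the lecture.

```python
def personalised_rank(topic_ranks, profile):
    """Weighted sum of precomputed topic PageRank vectors.

    topic_ranks maps topic -> {page: rank}; profile maps topic -> weight.
    """
    pages = next(iter(topic_ranks.values())).keys()
    return {page: sum(weight * topic_ranks[topic][page]
                      for topic, weight in profile.items())
            for page in pages}

# Hypothetical precomputed topic PageRanks for three pages.
topic_ranks = {
    "sports":   {"a": 0.5, "b": 0.3, "c": 0.2},
    "politics": {"a": 0.1, "b": 0.3, "c": 0.6},
}
profile = {"sports": 0.6, "politics": 0.4}   # 60% Sports, 40% Politics

mixed = personalised_rank(topic_ranks, profile)
print({page: round(rank, 2) for page, rank in mixed.items()})
# {'a': 0.34, 'b': 0.3, 'c': 0.36}
```

Only this cheap weighted sum runs per user at query time; the expensive iterative computation is done once per topic.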
CSE 6242/ CX 4242 Feb 18, 2014 Graphs / Networks Centrality measures, algorithms, interactive applications Duen Horng (Polo) Chau Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey
More informationLink Analysis in the Cloud
Cloud Computing Link Analysis in the Cloud Dell Zhang Birkbeck, University of London 2017/18 Graph Problems & Representations What is a Graph? G = (V,E), where V represents the set of vertices (nodes)
More informationCorso di Biblioteche Digitali
Corso di Biblioteche Digitali Vittore Casarosa casarosa@isti.cnr.it tel. 050315 3115 cell. 348397 2168 Ricevimento dopo la lezione o per appuntamento Valutazione finale 7075% esame orale 2530% progetto
More informationInformation Retrieval. Lecture 10  Web crawling
Information Retrieval Lecture 10  Web crawling Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Crawling: gathering pages from the
More informationHow To Gain a Competitive Advantage
How To Gain a Competitive Advantage VIDEO See this video in High Definition Download this video How To Gain a Competitive Advantage  1 Video Transcript The number one reason why people fail online is
More informationHeat Kernels and Diffusion Processes
Heat Kernels and Diffusion Processes Definition: University of Alicante (Spain) Matrix Computing (subject 3168 Degree in Maths) 30 hours (theory)) + 15 hours (practical assignment) Contents 1. Solving
More informationBeyond PageRank: Machine Learning for Static Ranking
Beyond PageRank: Machine Learning for Static Ranking Matthew Richardson 1, Amit Prakash 1 Eric Brill 2 1 Microsoft Research 2 MSN World Wide Web Conference, 2006 Outline 1 2 3 4 5 6 Types of Ranking Dynamic
More informationLecture 7: Relevance Feedback and Query Expansion
Lecture 7: Relevance Feedback and Query Expansion Information Retrieval Computer Science Tripos Part II Ronan Cummins Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk
More informationSimilarity Ranking in Large Scale Bipartite Graphs
Similarity Ranking in Large Scale Bipartite Graphs Alessandro Epasto Brown University  20 th March 2014 1 Joint work with J. Feldman, S. Lattanzi, S. Leonardi, V. Mirrokni [WWW, 2014] 2 AdWords Ads Ads
More informationBefore I show you this month's sites, I need to go over a couple of things, so that we are all on the same page.
Before I show you this month's sites, I need to go over a couple of things, so that we are all on the same page. You will be shown how to leave your link on each of the sites, but abusing the sites can
More informationInformation Retrieval
Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing
More informationWhat is this Page Known for? Computing Web Page Reputations. Davood Rafiei, Alberto Mendelzon University of Toronto
What is this Page Known for? Computing Web Page Reputations Davood Rafiei, Alberto Mendelzon University of Toronto 1 Introduction Ranking plays an important role in searching the Web. But the importance
More informationMatrix Multiplication
Matrix Multiplication Nur Dean PhD Program in Computer Science The Graduate Center, CUNY 05/01/2017 Nur Dean (The Graduate Center) Matrix Multiplication 05/01/2017 1 / 36 Today, I will talk about matrix
More informationDataIntensive Computing with MapReduce
DataIntensive Computing with MapReduce Session 5: Graph Processing Jimmy Lin University of Maryland Thursday, February 21, 2013 This work is licensed under a Creative Commons AttributionNoncommercialShare
More informationCONTENTS. Internet Basics. Internet Explorer. Search Engines. . Advantages and Disadvantages of the Internet. Some good websites
USING THE INTERNET CONTENTS Internet Basics Internet Explorer Search Engines EMail Advantages and Disadvantages of the Internet Some good websites 2 WHAT IS INTERNET? A computer network Two or more connected
More information