CS435 Introduction to Big Data FALL 2018 Colorado State University. 9/24/2018 Week 6-A Sangmi Lee Pallickara. Topics. This material is built based on,

Size: px
Start display at page:

Download "CS435 Introduction to Big Data FALL 2018 Colorado State University. 9/24/2018 Week 6-A Sangmi Lee Pallickara. Topics. This material is built based on,"

Transcription

1 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W6... PRT 1. LRGE SLE T NLYTIS WE-SLE LINK N SOIL NETWORK NLYSIS omputer Science, olorado State University S435 Introduction to ig ata 9/24/218 S435 Introduction to ig ata - FLL 218 W6..1 FQs uplicated printout for P1 In reducer For (String s: record.keyset()){ ontext.write (id, new Text(s record.get(s))); Record.clear() } output: and are Your combiner creates the key as and---2 and the frequency of this key is recorded as 1 P2 has been posted 9/24/218 S435 Introduction to ig ata - FLL 218 W6..2 9/24/218 S435 Introduction to ig ata - FLL 218 W6..3 Topics Large-scale nalytics 1. Web-Scale Link and Social Network nalysis Part 1. Large Scale ata nalytics 1. Web-Scale Link nalysis and Social Network nalysis Web-Scale Link nalysis 9/24/218 S435 Introduction to ig ata - FLL 218 W6..4 This material is built based on, nand Rajaraman, Jure Leskovec, and Jeffrey Ullman, Mining of Massive atasets, ambridge University Press, 212 hapter 5 9/24/218 S435 Introduction to ig ata - FLL 218 W6..5 What are these? JumpStation Go Infoseek Snap irect Hit Lycos ltavista Excite Yahoo Google 1

2 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W6..6 9/24/218 S435 Introduction to ig ata - FLL 218 W6..7 Early Search Engines Inverted index () They worked by crawling the Web and listing the terms Words or other strings of characters other than white space In an inverted index n inverted index is a data structure that makes it easy, given a term, to find (pointer to) all the places where that term occurs Summarization esign Pattern Inverted index For given texts, T[] = olorado State University T[1] = olorado water source T[2] = University of olorado We have the following inverted file index olorado : {,1,2} of :{2} source :{1} State : {} University : {,2} water :{1} 9/24/218 S435 Introduction to ig ata - FLL 218 W6..8 Inverted index (2/2) olorado : {,1,2} of :{2} source :{1} State : {} University : {,2} water :{1} term search for the terms, olorado, State, and University would give the set {,1, 2} {} {, 2} = {} 9/24/218 S435 Introduction to ig ata - FLL 218 W6..9 Term spam If you were selling toaster on the Web ll you care about was that people would see your page You could add a term like movie to your page dd thousands of times It does not even need to show Give the same color as background to the letters search engine would think this page is very important one about movie You could go to the search engine and search movie and see the first listed page opy that page with the same color as background 9/24/218 S435 Introduction to ig ata - FLL 218 W6..1 9/24/218 S435 Introduction to ig ata - FLL 218 W6..11 urrent total number of Web pages: More than 1.4 indexed pages Part 1. Large Scale ata nalytics 1. Web-Scale Link nalysis and Social Network nalysis Web-Scale Link nalysis: PageRank lgorithm 2

3 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W /24/218 S435 Introduction to ig ata - FLL 218 W6..13 PageRank Goals Providing effective summaries for the search results Ordering/Ranking results Simulate random Web surfers Pages that would have a large number of surfers were considered more important than pages that would rarely be visited The content of a page was judged not only by the terms appearing on that page ut by the terms used in or near the links to that page 9/24/218 S435 Introduction to ig ata - FLL 218 W6..14 efinition of PageRank 9/24/218 S435 Introduction to ig ata - FLL 218 W function that assigns a real number to each page in the Web The higher the PageRank of a page, the more important it is There is NOT one fixed algorithm for assignment of PageRank 9/24/218 S435 Introduction to ig ata - FLL 218 W /24/218 S435 Introduction to ig ata - FLL 218 W6..17 Example [1/5] Page has links to, and Page has links to and Page has a link to Page has links to and 3

4 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W6..18 Example [2/5] Suppose that a random surfer starts at page Page, and will be the next with probability probability of being at 9/24/218 S435 Introduction to ig ata - FLL 218 W6..19 Example [3/5] Now suppose the random surfer at has probability of ½ of being at, ½ of being at and of being at or 9/24/218 S435 Introduction to ig ata - FLL 218 W6..2 Example [4/5] Transition matrix M What happens to random surfers after one step M has n rows and columns (n pages) What is the transition matrix for this example? 9/24/218 S435 Introduction to ig ata - FLL 218 W6..21 Example [5/5] 1 M= The first column a surfer at has a probability of next being at each of the other pages The second column a surfer at has a ½ probability of being next at and the same for being at 9/24/218 S435 Introduction to ig ata - FLL 218 W6..22 What does this matrix mean? [1/6] The probability distribution for the location of a random surfer column vector whose jth component is the probability that the surfer is at page j /24/218 S435 Introduction to ig ata - FLL 218 W6..23 What does this matrix mean? [2/6] If we surf at any of the n pages of the Web with equal probability The initial vector vo will have for each component If M is the transition matrix of the Web fter the first one step, the distribution of the surfer will be Mvo fter two steps, M(Mvo) =M2v and so on 1 4

5 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W6..24 What does this matrix mean? [3/6] Multiplying the initial vector v by M a total of i times The distribution of the surfer after i steps The probability for being in The probability for the current location the next step from the current location 1 1/4 1/4 The distribution of the surfer after the first step 1/4 = - Probability for being in the next possible location 1.2 1/4 9/24/218 S435 Introduction to ig ata - FLL 218 W6..25 What does this matrix mean? [4/6] The probability x i that a random surfer will be at node i at the next step x i = j m ij v j m ij is the probability that a surfer at node j will move to node i at the next step v j is the probability that the surfer was at node j at the previous step 9/24/218 S435 Introduction to ig ata - FLL 218 W6..26 What does this matrix mean? [5/6] The distribution of the surfer approaches a limiting distribution v that satisfies v = Mv provided two conditions are met: 1. The graph is strongly connected It is possible to get from any node to any other node 2. There are no dead ends Nodes that have no arcs out 9/24/218 S435 Introduction to ig ata - FLL 218 W6..27 What does this matrix mean? [6/6] The limit is reached when multiplying the distribution by M another time does not change the distribution The limiting v is an eigenvector of M Since M is stochastic (its columns each add up to 1), v is the principle eigenvector Its associated eigenvalue is the largest of all eigenvalues The principle eigenvector of M Where the surfer is most likely to be after a long time For the Web, 5-75 iterations are sufficient to converge to within the error limits of double-precision arithmetic 9/24/218 S435 Introduction to ig ata - FLL 218 W6..28 Example 1 M= Suppose we apply this process to the matrix M The initial vector v and v1 after multiplying M 9/24/218 S435 Introduction to ig ata - FLL 218 W6..29 Example 1 M= Suppose we apply this process to the matrix M The initial vector v and v1 after multiplying M 1 ¼ 9/24 ¼ 5/24 v1=mv= ¼ = 5/24 ¼ 5/24 5

6 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W6..3 What is the v 2? 1 M= Suppose we apply this process to the matrix M The initial vector v and v1 after multiplying M 1 ¼ 9/24 ¼ 5/24 v1=mv= ¼ = 5/24 ¼ 5/24 9/24/218 S435 Introduction to ig ata - FLL 218 W6..31 Example continued The sequence of approximations to the limit We get by multiplying at each step by M is ¼ 9/24 15/ /9 ¼ 5/24 11/48 7/32 2/9 ¼ 5/24 11/48 7/32 2/9 ¼ 5/24 11/48 7/32 2/9 This difference in probability is not noticeable In the real Web, there are billions of nodes of greatly varying importance The probability of being at a node like is orders of magnitude greater than others 9/24/218 S435 Introduction to ig ata - FLL 218 W /24/218 S435 Introduction to ig ata - FLL 218 W6..33 Matrix-vector Multiplication by MapReduce [] Part 1. Large Scale ata nalytics 1. Web-Scale Link nalysis and Social Network nalysis Web-Scale Link nalysis: PageRank lgorithm using MapReduce Suppose we have an n x n matrix M, whose element in row i and column j will be denoted M ij Then the matrix-vector product is the vector x of length n, whose i th element x i is given by, x i = n j=1 m ij v j 9/24/218 S435 Introduction to ig ata - FLL 218 W6..34 Matrix-vector Multiplication by MapReduce [2/3] 9/24/218 S435 Introduction to ig ata - FLL 218 W6..35 Matrix-vector Multiplication by MapReduce [3/3] If n = 1, we do NOT need FS or MapReduce However, if this calculation is a part of ranking Web pages (n is 1M) that goes on at search engine? The vector v cannot fit in main memory More than 1.4 pages The matrix M and the vector v each will be stored in a file of the FS(HFS) ssume that row-column coordinates of each matrix element will be discoverable Either from its position in the file or explicit coordinates 6

7 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W /24/218 S435 Introduction to ig ata - FLL 218 W6..37 The Map function The Reduce function The Map function is written to apply to one element of M Each Map task will operate on a chunk of the matrix M From each matrix element m ij, it produces the key-value pair (i, m ijv j ) ll terms of the sum that make up the component x k of the matrix-vector product will get the same key, k Sums all the values associated with a given key k The result will be a pair (k, x k) x k=m k,xv + M k,1xv 1 + M k,2xv M k,(n-1)xv n-1 9/24/218 S435 Introduction to ig ata - FLL 218 W6..38 If the vector v cannot fit in main memory? It is possible that the vector v is so large that it will not fit in its main memory entirely We can divide the matrix into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes of the same height The goal is to use enough stripes so that the portion of the vector in one stripe can fit into main memory 9/24/218 S435 Introduction to ig ata - FLL 218 W6..39 ivision of a matrix and vector into five stripes The ith stripe of the matrix multiplies only components from the ith stripe of the initial vector Results:.2 x +.17 x +.3 x +.1 x = (M x v) + (M1 x v1) + (M2 x v2) Matrix M Initial Vector v 9/24/218 S435 Introduction to ig ata - FLL 218 W6..4 9/24/218 S435 Introduction to ig ata - FLL 218 W6..41 ivision of a matrix and vector into five stripes The ith stripe of the matrix multiplies only components from the ith stripe of the initial vector Mapper 1 Link nalysis PageRank lgorithm for the real Web Mapper 2 Mapper 3 Mapper 4 Matrix M Initial Vector v Mapper 5 7

8 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W /24/218 S435 Introduction to ig ata - FLL 218 W6..43 Structure of Web () Is the Web strongly connected? Structure of Web (2/3) onsisting of pages reachable from the In omponent but not unable to reach In omponent Tendrils Out onsisting of pages reachable from the S, but unable to reach the S Tendrils In In omponent Strongly onnected omponent Out omponent Tubes onsisting of pages that could reach by following links but not reachable from the S isconnected omponents 9/24/218 S435 Introduction to ig ata - FLL 218 W6..44 Structure of Web (3/3) Tubes Pages reachable from the in-component and able to reach the output-component, but unable to reach the S or be reached from the S Isolated components Unreachable from the large components 9/24/218 S435 Introduction to ig ata - FLL 218 W6..45 nomalies from the Web structure These structures violate the assumptions needed for the Markov process iteration to converge to a limit When a random surfer enters the out-component, they can never leave Surfers starting in either the S or in-component are going to wind up in either the out-component or a tendril off the in-component No page in the S or in-component winds up with any probability of a surfer being there Nothing in the S or in-component will be of any importance 9/24/218 S435 Introduction to ig ata - FLL 218 W /24/218 S435 Introduction to ig ata - FLL 218 W6..47 Problems we need to avoid Link nalysis hallenges in PageRank lgorithm for the real Web ead end page that has no links out Surfers reaching such a page will disappear In the limit, no page that can reach a dead end can have any PageRank at all Spider traps Groups of pages that all have outlinks but they never link to any other pages 8

9 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W6..48 Questions? 9

CS535 Big Data Fall 2017 Colorado State University 9/5/2017. Week 3 - A. FAQs. This material is built based on,

CS535 Big Data Fall 2017 Colorado State University  9/5/2017. Week 3 - A. FAQs. This material is built based on, S535 ig ata Fall 217 olorado State University 9/5/217 Week 3-9/5/217 S535 ig ata - Fall 217 Week 3--1 S535 IG T FQs Programming ssignment 1 We will discuss link analysis in week3 Installation/configuration

More information

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5) INTRODUCTION TO DATA SCIENCE Link Analysis (MMDS5) Introduction Motivation: accurate web search Spammers: want you to land on their pages Google s PageRank and variants TrustRank Hubs and Authorities (HITS)

More information

Link Analysis. Chapter PageRank

Link Analysis. Chapter PageRank Chapter 5 Link Analysis One of the biggest changes in our lives in the decade following the turn of the century was the availability of efficient and accurate Web search, through search engines such as

More information

Matrix-Vector Multiplication by MapReduce. From Rajaraman / Ullman- Ch.2 Part 1

Matrix-Vector Multiplication by MapReduce. From Rajaraman / Ullman- Ch.2 Part 1 Matrix-Vector Multiplication by MapReduce From Rajaraman / Ullman- Ch.2 Part 1 Google implementation of MapReduce created to execute very large matrix-vector multiplications When ranking of Web pages that

More information

Unit VIII. Chapter 9. Link Analysis

Unit VIII. Chapter 9. Link Analysis Unit VIII Link Analysis: Page Ranking in web search engines, Efficient Computation of Page Rank using Map-Reduce and other approaches, Topic-Sensitive Page Rank, Link Spam, Hubs and Authorities (Text Book:2

More information

Information Networks: PageRank

Information Networks: PageRank Information Networks: PageRank Web Science (VU) (706.716) Elisabeth Lex ISDS, TU Graz June 18, 2018 Elisabeth Lex (ISDS, TU Graz) Links June 18, 2018 1 / 38 Repetition Information Networks Shape of the

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Part 1: Link Analysis & Page Rank

Part 1: Link Analysis & Page Rank Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Graph Data: Social Networks [Source: 4-degrees of separation, Backstrom-Boldi-Rosa-Ugander-Vigna,

More information

2.3 Algorithms Using Map-Reduce

2.3 Algorithms Using Map-Reduce 28 CHAPTER 2. MAP-REDUCE AND THE NEW SOFTWARE STACK one becomes available. The Master must also inform each Reduce task that the location of its input from that Map task has changed. Dealing with a failure

More information

Computer Science Department Picnic. FAQs. Topics. This material is developed based on, 8/27/2018 Week 2-A Sangmi Lee Pallickara

Computer Science Department Picnic. FAQs. Topics. This material is developed based on, 8/27/2018 Week 2-A Sangmi Lee Pallickara Week 2- W2..0 - FLL 2018 - FLL 2018 W2..1 omputer Science epartment Picnic No ell-phones in the class. Welcome to the 2018-2019 cademic year! If you need to use a laptop, please sit in the back row. I

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org J.

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Slides based on those in:

Slides based on those in: Spyros Kontogiannis & Christos Zaroliagis Slides based on those in: http://www.mmds.org A 3.3 B 38.4 C 34.3 D 3.9 E 8.1 F 3.9 1.6 1.6 1.6 1.6 1.6 2 y 0.8 ½+0.2 ⅓ M 1/2 1/2 0 0.8 1/2 0 0 + 0.2 0 1/2 1 [1/N]

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

How to organize the Web?

How to organize the Web? How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second try: Web Search Information Retrieval attempts to find relevant docs in a small and trusted set Newspaper

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today 3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu SPAM FARMING 2/11/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 2/11/2013 Jure Leskovec, Stanford

More information

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge Centralities (4) By: Ralucca Gera, NPS Excellence Through Knowledge Some slide from last week that we didn t talk about in class: 2 PageRank algorithm Eigenvector centrality: i s Rank score is the sum

More information

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) ' Sta306b May 11, 2012 $ PageRank: 1 Web search before Google (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) & % Sta306b May 11, 2012 PageRank: 2 Web search

More information

Information Retrieval. Lecture 11 - Link analysis

Information Retrieval. Lecture 11 - Link analysis Information Retrieval Lecture 11 - Link analysis Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 35 Introduction Link analysis: using hyperlinks

More information

COMP Page Rank

COMP Page Rank COMP 4601 Page Rank 1 Motivation Remember, we were interested in giving back the most relevant documents to a user. Importance is measured by reference as well as content. Think of this like academic paper

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/6/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 High dim. data Graph data Infinite data Machine

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

Lecture 8: Linkage algorithms and web search

Lecture 8: Linkage algorithms and web search Lecture 8: Linkage algorithms and web search Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk 2017

More information

DATA MINING - 1DL460

DATA MINING - 1DL460 DATA MINING - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

CS6200 Information Retreival. The WebGraph. July 13, 2015

CS6200 Information Retreival. The WebGraph. July 13, 2015 CS6200 Information Retreival The WebGraph The WebGraph July 13, 2015 1 Web Graph: pages and links The WebGraph describes the directed links between pages of the World Wide Web. A directed edge connects

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #11: Link Analysis 3 Seoul National University 1 In This Lecture WebSpam: definition and method of attacks TrustRank: how to combat WebSpam HITS algorithm: another algorithm

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

Analysis of Large Graphs: TrustRank and WebSpam

Analysis of Large Graphs: TrustRank and WebSpam Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods Matthias Schubert, Matthias Renz, Felix Borutta, Evgeniy Faerman, Christian Frey, Klaus Arthur

More information

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page Agenda Math 104 1 Google PageRank algorithm 2 Developing a formula for ranking web pages 3 Interpretation 4 Computing the score of each page Google: background Mid nineties: many search engines often times

More information

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University 3 Mutually recursive definition: A hub links to many authorities; An authority is linked to by many hubs. Authorities turn out to be places where information can

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #10: Link Analysis-2 Seoul National University 1 In This Lecture Pagerank: Google formulation Make the solution to converge Computing Pagerank for very large graphs

More information

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)

More information

DATA MINING - 1DL460

DATA MINING - 1DL460 DATA MINING - 1DL460 Spring 2015 A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt15 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala

More information

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page Link Analysis Links Web consists of web pages and hyperlinks between pages A page receiving many links from other pages may be a hint of the authority of the page Links are also popular in some other information

More information

COMP 4601 Hubs and Authorities

COMP 4601 Hubs and Authorities COMP 4601 Hubs and Authorities 1 Motivation PageRank gives a way to compute the value of a page given its position and connectivity w.r.t. the rest of the Web. Is it the only algorithm: No! It s just one

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/24/2014 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 High dim. data

More information

Finding Top UI/UX Design Talent on Adobe Behance

Finding Top UI/UX Design Talent on Adobe Behance Procedia Computer Science Volume 51, 2015, Pages 2426 2434 ICCS 2015 International Conference On Computational Science Finding Top UI/UX Design Talent on Adobe Behance Susanne Halstead 1,H.DanielSerrano

More information

Brief (non-technical) history

Brief (non-technical) history Web Data Management Part 2 Advanced Topics in Database Management (INFSCI 2711) Textbooks: Database System Concepts - 2010 Introduction to Information Retrieval - 2008 Vladimir Zadorozhny, DINS, SCI, University

More information

Graph Algorithms. Revised based on the slides by Ruoming Kent State

Graph Algorithms. Revised based on the slides by Ruoming Kent State Graph Algorithms Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/

More information

Lecture 27: Learning from relational data

Lecture 27: Learning from relational data Lecture 27: Learning from relational data STATS 202: Data mining and analysis December 2, 2017 1 / 12 Announcements Kaggle deadline is this Thursday (Dec 7) at 4pm. If you haven t already, make a submission

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 12: Link Analysis January 28 th, 2016 Wolf-Tilo Balke and Younes Ghammad Institut für Informationssysteme Technische Universität Braunschweig An Overview

More information

Searching the Web [Arasu 01]

Searching the Web [Arasu 01] Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web

More information

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank Page rank computation HPC course project a.y. 2012-13 Compute efficient and scalable Pagerank 1 PageRank PageRank is a link analysis algorithm, named after Brin & Page [1], and used by the Google Internet

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 3/6/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 In many data mining

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 5: Analyzing Graphs (2/2) February 2, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

Link Analysis. Hongning Wang

Link Analysis. Hongning Wang Link Analysis Hongning Wang CS@UVa Structured v.s. unstructured data Our claim before IR v.s. DB = unstructured data v.s. structured data As a result, we have assumed Document = a sequence of words Query

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu HITS (Hypertext Induced Topic Selection) Is a measure of importance of pages or documents, similar to PageRank

More information

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University Instead of generic popularity, can we measure popularity within a topic? E.g., computer science, health Bias the random walk When

More information

Lecture 11: Graph algorithms! Claudia Hauff (Web Information Systems)!

Lecture 11: Graph algorithms! Claudia Hauff (Web Information Systems)! Lecture 11: Graph algorithms!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the scenes of MapReduce:

More information

Link Analysis in Web Mining

Link Analysis in Web Mining Problem formulation (998) Link Analysis in Web Mining Hubs and Authorities Spam Detection Suppose we are given a collection of documents on some broad topic e.g., stanford, evolution, iraq perhaps obtained

More information

University of Maryland. Tuesday, March 2, 2010

University of Maryland. Tuesday, March 2, 2010 Data-Intensive Information Processing Applications Session #5 Graph Algorithms Jimmy Lin University of Maryland Tuesday, March 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

B490 Mining the Big Data. 5. Models for Big Data

B490 Mining the Big Data. 5. Models for Big Data B490 Mining the Big Data 5. Models for Big Data Qin Zhang 1-1 2-1 MapReduce MapReduce The MapReduce model (Dean & Ghemawat 2004) Input Output Goal Map Shuffle Reduce Standard model in industry for massive

More information

Link Analysis in the Cloud

Link Analysis in the Cloud Cloud Computing Link Analysis in the Cloud Dell Zhang Birkbeck, University of London 2017/18 Graph Problems & Representations What is a Graph? G = (V,E), where V represents the set of vertices (nodes)

More information

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Einführung in Web und Data Science Community Analysis Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Today s lecture Anchor text Link analysis for ranking Pagerank and variants

More information

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science CS-C3160 - Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web Jaakko Hollmén, Department of Computer Science 30.10.2017-18.12.2017 1 Contents of this chapter Story

More information

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

Pagerank Scoring. Imagine a browser doing a random walk on web pages: Ranking Sec. 21.2 Pagerank Scoring Imagine a browser doing a random walk on web pages: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably

More information

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Administrative. Web crawlers. Web Crawlers and Link Analysis! Web Crawlers and Link Analysis! David Kauchak cs458 Fall 2011 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture15-linkanalysis.ppt http://webcourse.cs.technion.ac.il/236522/spring2007/ho/wcfiles/tutorial05.ppt

More information

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS Overview of Networks Instructor: Yizhou Sun yzsun@cs.ucla.edu January 10, 2017 Overview of Information Network Analysis Network Representation Network

More information

Jeffrey D. Ullman Stanford University/Infolab

Jeffrey D. Ullman Stanford University/Infolab Jeffrey D. Ullman Stanford University/Infolab Spamming = any deliberate action intended solely to boost a Web page s position in searchengine results. Web Spam = Web pages that are the result of spamming.

More information

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a !"#$ %#& ' Introduction ' Social network analysis ' Co-citation and bibliographic coupling ' PageRank ' HIS ' Summary ()*+,-/*,) Early search engines mainly compare content similarity of the query and

More information

Collaborative filtering based on a random walk model on a graph

Collaborative filtering based on a random walk model on a graph Collaborative filtering based on a random walk model on a graph Marco Saerens, Francois Fouss, Alain Pirotte, Luh Yen, Pierre Dupont (UCL) Jean-Michel Renders (Xerox Research Europe) Some recent methods:

More information

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011 Data-Intensive Information Processing Applications! Session #5 Graph Algorithms Jordan Boyd-Graber University of Maryland Thursday, March 3, 2011 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 5: Graph Processing Jimmy Lin University of Maryland Thursday, February 21, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Databases 2 (VU) ( )

Databases 2 (VU) ( ) Databases 2 (VU) (707.030) Map-Reduce Denis Helic KMI, TU Graz Nov 4, 2013 Denis Helic (KMI, TU Graz) Map-Reduce Nov 4, 2013 1 / 90 Outline 1 Motivation 2 Large Scale Computation 3 Map-Reduce 4 Environment

More information

CHAPTER-2 A GRAPH BASED APPROACH TO FIND CANDIDATE KEYS IN A RELATIONAL DATABASE SCHEME **

CHAPTER-2 A GRAPH BASED APPROACH TO FIND CANDIDATE KEYS IN A RELATIONAL DATABASE SCHEME ** HPTR- GRPH S PPROH TO FIN NIT KYS IN RLTIONL TS SHM **. INTROUTION Graphs have many applications in the field of knowledge and data engineering. There are many graph based approaches used in database management

More information

Design and Analysis of Algorithms

Design and Analysis of Algorithms S 101, Winter 2018 esign and nalysis of lgorithms Lecture 3: onnected omponents lass URL: http://vlsicad.ucsd.edu/courses/cse101-w18/ nd of Lecture 2: s, Topological Order irected cyclic raph.g., Vertices

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/25/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 3 In many data mining

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 21: Link Analysis Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-06-18 1/80 Overview

More information

Analysis of Algorithms Prof. Karen Daniels

Analysis of Algorithms Prof. Karen Daniels UMass Lowell omputer Science 91.404 nalysis of lgorithms Prof. Karen aniels Spring, 2013 hapter 22: raph lgorithms & rief Introduction to Shortest Paths [Source: ormen et al. textbook except where noted]

More information

Experimental study of Web Page Ranking Algorithms

Experimental study of Web Page Ranking Algorithms IOSR IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 2, Ver. II (Mar-pr. 2014), PP 100-106 Experimental study of Web Page Ranking lgorithms Rachna

More information

Graphs (Part II) Shannon Quinn

Graphs (Part II) Shannon Quinn Graphs (Part II) Shannon Quinn (with thanks to William Cohen and Aapo Kyrola of CMU, and J. Leskovec, A. Rajaraman, and J. Ullman of Stanford University) Parallel Graph Computation Distributed computation

More information

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material. Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material. 1 Contents Introduction Network properties Social network analysis Co-citation

More information

11/12/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data FALL 2018 Colorado State University

11/12/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data FALL 2018 Colorado State University S Introduction to ig Data Week - W... S Introduction to ig Data W.. FQs Term project final report Preparation guide is available at: http://www.cs.colostate.edu/~cs/tp.html Successful final project must

More information

Adaptive methods for the computation of PageRank

Adaptive methods for the computation of PageRank Linear Algebra and its Applications 386 (24) 51 65 www.elsevier.com/locate/laa Adaptive methods for the computation of PageRank Sepandar Kamvar a,, Taher Haveliwala b,genegolub a a Scientific omputing

More information

How Google Finds Your Needle in the Web's

How Google Finds Your Needle in the Web's of the content. In fact, Google feels that the value of its service is largely in its ability to provide unbiased results to search queries; Google claims, "the heart of our software is PageRank." As we'll

More information

PageRank Algorithm Abstract: Keywords: I. Introduction II. Text Ranking Vs. Page Ranking

PageRank Algorithm Abstract: Keywords: I. Introduction II. Text Ranking Vs. Page Ranking IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 19, Issue 1, Ver. III (Jan.-Feb. 2017), PP 01-07 www.iosrjournals.org PageRank Algorithm Albi Dode 1, Silvester

More information

PageRank and related algorithms

PageRank and related algorithms PageRank and related algorithms PageRank and HITS Jacob Kogan Department of Mathematics and Statistics University of Maryland, Baltimore County Baltimore, Maryland 21250 kogan@umbc.edu May 15, 2006 Basic

More information

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies Large-Scale Networks PageRank Dr Vincent Gramoli Lecturer School of Information Technologies Introduction Last week we talked about: - Hubs whose scores depend on the authority of the nodes they point

More information

Fast Iterative Solvers for Markov Chains, with Application to Google's PageRank. Hans De Sterck

Fast Iterative Solvers for Markov Chains, with Application to Google's PageRank. Hans De Sterck Fast Iterative Solvers for Markov Chains, with Application to Google's PageRank Hans De Sterck Department of Applied Mathematics University of Waterloo, Ontario, Canada joint work with Steve McCormick,

More information

An Introduction to Double Integrals Math Insight

An Introduction to Double Integrals Math Insight An Introduction to Double Integrals Math Insight Suppose that you knew the hair density at each point on your head and you wanted to calculate the total number of hairs on your head. In other words, let

More information

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule 1 How big is the Web How big is the Web? In the past, this question

More information

Development of MapReduced Topic Sensitive PageRank

Development of MapReduced Topic Sensitive PageRank Development of MapReduced Topic Sensitive PageRank Swaraj Khadanga Department of Computer Science and Engineering National Institute of Technology Rourkela Rourkela-769 008, Odisha, India. Development

More information

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7: Information Retrieval II Aidan Hogan aidhog@gmail.com How does Google know about the Web? Inverted Index: Example 1 Fruitvale Station is a 2013

More information

Popularity of Twitter Accounts: PageRank on a Social Network

Popularity of Twitter Accounts: PageRank on a Social Network Popularity of Twitter Accounts: PageRank on a Social Network A.D-A December 8, 2017 1 Problem Statement Twitter is a social networking service, where users can create and interact with 140 character messages,

More information

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches

More information

Generalizing Map- Reduce

Generalizing Map- Reduce Generalizing Map- Reduce 1 Example: A Map- Reduce Graph map reduce map... reduce reduce map 2 Map- reduce is not a solu;on to every problem, not even every problem that profitably can use many compute

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld Link Analysis CSE 454 Advanced Internet Systems University of Washington 1/26/12 16:36 1 Ranking Search Results TF / IDF or BM25 Tag Information Title, headers Font Size / Capitalization Anchor Text on

More information

Introduction To Graphs and Networks. Fall 2013 Carola Wenk

Introduction To Graphs and Networks. Fall 2013 Carola Wenk Introduction To Graphs and Networks Fall 203 Carola Wenk On the Internet, links are essentially weighted by factors such as transit time, or cost. The goal is to find the shortest path from one node to

More information

Introduction In to Info Inf rmation o Ret Re r t ie i v e a v l a LINK ANALYSIS 1

Introduction In to Info Inf rmation o Ret Re r t ie i v e a v l a LINK ANALYSIS 1 LINK ANALYSIS 1 Today s lecture hypertext and links We look beyond the content of documents We begin to look at the hyperlinks between them Address questions like Do the links represent a conferral of

More information