CS435 Introduction to Big Data FALL 2018 Colorado State University. 9/24/2018 Week 6-A Sangmi Lee Pallickara. Topics. This material is built based on,

Size: px

Start display at page:

Download "CS435 Introduction to Big Data FALL 2018 Colorado State University. 9/24/2018 Week 6-A Sangmi Lee Pallickara. Topics. This material is built based on,"

Raymond Singleton
5 years ago
Views:

1 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W6... PRT 1. LRGE SLE T NLYTIS WE-SLE LINK N SOIL NETWORK NLYSIS omputer Science, olorado State University S435 Introduction to ig ata 9/24/218 S435 Introduction to ig ata - FLL 218 W6..1 FQs uplicated printout for P1 In reducer For (String s: record.keyset()){ ontext.write (id, new Text(s record.get(s))); Record.clear() } output: and are Your combiner creates the key as and---2 and the frequency of this key is recorded as 1 P2 has been posted 9/24/218 S435 Introduction to ig ata - FLL 218 W6..2 9/24/218 S435 Introduction to ig ata - FLL 218 W6..3 Topics Large-scale nalytics 1. Web-Scale Link and Social Network nalysis Part 1. Large Scale ata nalytics 1. Web-Scale Link nalysis and Social Network nalysis Web-Scale Link nalysis 9/24/218 S435 Introduction to ig ata - FLL 218 W6..4 This material is built based on, nand Rajaraman, Jure Leskovec, and Jeffrey Ullman, Mining of Massive atasets, ambridge University Press, 212 hapter 5 9/24/218 S435 Introduction to ig ata - FLL 218 W6..5 What are these? JumpStation Go Infoseek Snap irect Hit Lycos ltavista Excite Yahoo Google 1

2 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W6..6 9/24/218 S435 Introduction to ig ata - FLL 218 W6..7 Early Search Engines Inverted index () They worked by crawling the Web and listing the terms Words or other strings of characters other than white space In an inverted index n inverted index is a data structure that makes it easy, given a term, to find (pointer to) all the places where that term occurs Summarization esign Pattern Inverted index For given texts, T[] = olorado State University T[1] = olorado water source T[2] = University of olorado We have the following inverted file index olorado : {,1,2} of :{2} source :{1} State : {} University : {,2} water :{1} 9/24/218 S435 Introduction to ig ata - FLL 218 W6..8 Inverted index (2/2) olorado : {,1,2} of :{2} source :{1} State : {} University : {,2} water :{1} term search for the terms, olorado, State, and University would give the set {,1, 2} {} {, 2} = {} 9/24/218 S435 Introduction to ig ata - FLL 218 W6..9 Term spam If you were selling toaster on the Web ll you care about was that people would see your page You could add a term like movie to your page dd thousands of times It does not even need to show Give the same color as background to the letters search engine would think this page is very important one about movie You could go to the search engine and search movie and see the first listed page opy that page with the same color as background 9/24/218 S435 Introduction to ig ata - FLL 218 W6..1 9/24/218 S435 Introduction to ig ata - FLL 218 W6..11 urrent total number of Web pages: More than 1.4 indexed pages Part 1. Large Scale ata nalytics 1. Web-Scale Link nalysis and Social Network nalysis Web-Scale Link nalysis: PageRank lgorithm 2

FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W6..12 9/24/218 S435 Introduction to ig ata - FLL 218 W6.

important than pages that would rarely be visited The content of a page was judged not only by the terms appearing on that page ut by the terms used in or near the links to that page 9/24/218 S435

3 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W /24/218 S435 Introduction to ig ata - FLL 218 W6..13 PageRank Goals Providing effective summaries for the search results Ordering/Ranking results Simulate random Web surfers Pages that would have a large number of surfers were considered more important than pages that would rarely be visited The content of a page was judged not only by the terms appearing on that page ut by the terms used in or near the links to that page 9/24/218 S435 Introduction to ig ata - FLL 218 W6..14 efinition of PageRank 9/24/218 S435 Introduction to ig ata - FLL 218 W function that assigns a real number to each page in the Web The higher the PageRank of a page, the more important it is There is NOT one fixed algorithm for assignment of PageRank 9/24/218 S435 Introduction to ig ata - FLL 218 W /24/218 S435 Introduction to ig ata - FLL 218 W6..17 Example [1/5] Page has links to, and Page has links to and Page has a link to Page has links to and 3

4 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W6..18 Example [2/5] Suppose that a random surfer starts at page Page, and will be the next with probability probability of being at 9/24/218 S435 Introduction to ig ata - FLL 218 W6..19 Example [3/5] Now suppose the random surfer at has probability of ½ of being at, ½ of being at and of being at or 9/24/218 S435 Introduction to ig ata - FLL 218 W6..2 Example [4/5] Transition matrix M What happens to random surfers after one step M has n rows and columns (n pages) What is the transition matrix for this example? 9/24/218 S435 Introduction to ig ata - FLL 218 W6..21 Example [5/5] 1 M= The first column a surfer at has a probability of next being at each of the other pages The second column a surfer at has a ½ probability of being next at and the same for being at 9/24/218 S435 Introduction to ig ata - FLL 218 W6..22 What does this matrix mean? [1/6] The probability distribution for the location of a random surfer column vector whose jth component is the probability that the surfer is at page j /24/218 S435 Introduction to ig ata - FLL 218 W6..23 What does this matrix mean? [2/6] If we surf at any of the n pages of the Web with equal probability The initial vector vo will have for each component If M is the transition matrix of the Web fter the first one step, the distribution of the surfer will be Mvo fter two steps, M(Mvo) =M2v and so on 1 4

5 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W6..24 What does this matrix mean? [3/6] Multiplying the initial vector v by M a total of i times The distribution of the surfer after i steps The probability for being in The probability for the current location the next step from the current location 1 1/4 1/4 The distribution of the surfer after the first step 1/4 = - Probability for being in the next possible location 1.2 1/4 9/24/218 S435 Introduction to ig ata - FLL 218 W6..25 What does this matrix mean? [4/6] The probability x i that a random surfer will be at node i at the next step x i = j m ij v j m ij is the probability that a surfer at node j will move to node i at the next step v j is the probability that the surfer was at node j at the previous step 9/24/218 S435 Introduction to ig ata - FLL 218 W6..26 What does this matrix mean? [5/6] The distribution of the surfer approaches a limiting distribution v that satisfies v = Mv provided two conditions are met: 1. The graph is strongly connected It is possible to get from any node to any other node 2. There are no dead ends Nodes that have no arcs out 9/24/218 S435 Introduction to ig ata - FLL 218 W6..27 What does this matrix mean? [6/6] The limit is reached when multiplying the distribution by M another time does not change the distribution The limiting v is an eigenvector of M Since M is stochastic (its columns each add up to 1), v is the principle eigenvector Its associated eigenvalue is the largest of all eigenvalues The principle eigenvector of M Where the surfer is most likely to be after a long time For the Web, 5-75 iterations are sufficient to converge to within the error limits of double-precision arithmetic 9/24/218 S435 Introduction to ig ata - FLL 218 W6..28 Example 1 M= Suppose we apply this process to the matrix M The initial vector v and v1 after multiplying M 9/24/218 S435 Introduction to ig ata - FLL 218 W6..29 Example 1 M= Suppose we apply this process to the matrix M The initial vector v and v1 after multiplying M 1 ¼ 9/24 ¼ 5/24 v1=mv= ¼ = 5/24 ¼ 5/24 5

6 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W6..3 What is the v 2? 1 M= Suppose we apply this process to the matrix M The initial vector v and v1 after multiplying M 1 ¼ 9/24 ¼ 5/24 v1=mv= ¼ = 5/24 ¼ 5/24 9/24/218 S435 Introduction to ig ata - FLL 218 W6..31 Example continued The sequence of approximations to the limit We get by multiplying at each step by M is ¼ 9/24 15/ /9 ¼ 5/24 11/48 7/32 2/9 ¼ 5/24 11/48 7/32 2/9 ¼ 5/24 11/48 7/32 2/9 This difference in probability is not noticeable In the real Web, there are billions of nodes of greatly varying importance The probability of being at a node like is orders of magnitude greater than others 9/24/218 S435 Introduction to ig ata - FLL 218 W /24/218 S435 Introduction to ig ata - FLL 218 W6..33 Matrix-vector Multiplication by MapReduce [] Part 1. Large Scale ata nalytics 1. Web-Scale Link nalysis and Social Network nalysis Web-Scale Link nalysis: PageRank lgorithm using MapReduce Suppose we have an n x n matrix M, whose element in row i and column j will be denoted M ij Then the matrix-vector product is the vector x of length n, whose i th element x i is given by, x i = n j=1 m ij v j 9/24/218 S435 Introduction to ig ata - FLL 218 W6..34 Matrix-vector Multiplication by MapReduce [2/3] 9/24/218 S435 Introduction to ig ata - FLL 218 W6..35 Matrix-vector Multiplication by MapReduce [3/3] If n = 1, we do NOT need FS or MapReduce However, if this calculation is a part of ranking Web pages (n is 1M) that goes on at search engine? The vector v cannot fit in main memory More than 1.4 pages The matrix M and the vector v each will be stored in a file of the FS(HFS) ssume that row-column coordinates of each matrix element will be discoverable Either from its position in the file or explicit coordinates 6

7 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W /24/218 S435 Introduction to ig ata - FLL 218 W6..37 The Map function The Reduce function The Map function is written to apply to one element of M Each Map task will operate on a chunk of the matrix M From each matrix element m ij, it produces the key-value pair (i, m ijv j ) ll terms of the sum that make up the component x k of the matrix-vector product will get the same key, k Sums all the values associated with a given key k The result will be a pair (k, x k) x k=m k,xv + M k,1xv 1 + M k,2xv M k,(n-1)xv n-1 9/24/218 S435 Introduction to ig ata - FLL 218 W6..38 If the vector v cannot fit in main memory? It is possible that the vector v is so large that it will not fit in its main memory entirely We can divide the matrix into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes of the same height The goal is to use enough stripes so that the portion of the vector in one stripe can fit into main memory 9/24/218 S435 Introduction to ig ata - FLL 218 W6..39 ivision of a matrix and vector into five stripes The ith stripe of the matrix multiplies only components from the ith stripe of the initial vector Results:.2 x +.17 x +.3 x +.1 x = (M x v) + (M1 x v1) + (M2 x v2) Matrix M Initial Vector v 9/24/218 S435 Introduction to ig ata - FLL 218 W6..4 9/24/218 S435 Introduction to ig ata - FLL 218 W6..41 ivision of a matrix and vector into five stripes The ith stripe of the matrix multiplies only components from the ith stripe of the initial vector Mapper 1 Link nalysis PageRank lgorithm for the real Web Mapper 2 Mapper 3 Mapper 4 Matrix M Initial Vector v Mapper 5 7

8 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W /24/218 S435 Introduction to ig ata - FLL 218 W6..43 Structure of Web () Is the Web strongly connected? Structure of Web (2/3) onsisting of pages reachable from the In omponent but not unable to reach In omponent Tendrils Out onsisting of pages reachable from the S, but unable to reach the S Tendrils In In omponent Strongly onnected omponent Out omponent Tubes onsisting of pages that could reach by following links but not reachable from the S isconnected omponents 9/24/218 S435 Introduction to ig ata - FLL 218 W6..44 Structure of Web (3/3) Tubes Pages reachable from the in-component and able to reach the output-component, but unable to reach the S or be reached from the S Isolated components Unreachable from the large components 9/24/218 S435 Introduction to ig ata - FLL 218 W6..45 nomalies from the Web structure These structures violate the assumptions needed for the Markov process iteration to converge to a limit When a random surfer enters the out-component, they can never leave Surfers starting in either the S or in-component are going to wind up in either the out-component or a tendril off the in-component No page in the S or in-component winds up with any probability of a surfer being there Nothing in the S or in-component will be of any importance 9/24/218 S435 Introduction to ig ata - FLL 218 W /24/218 S435 Introduction to ig ata - FLL 218 W6..47 Problems we need to avoid Link nalysis hallenges in PageRank lgorithm for the real Web ead end page that has no links out Surfers reaching such a page will disappear In the limit, no page that can reach a dead end can have any PageRank at all Spider traps Groups of pages that all have outlinks but they never link to any other pages 8

9 FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W6..48 Questions? 9

CS535 Big Data Fall 2017 Colorado State University 9/5/2017. Week 3 - A. FAQs. This material is built based on,

CS535 Big Data Fall 2017 Colorado State University 9/5/2017. Week 3 - A. FAQs. This material is built based on, S535 ig ata Fall 217 olorado State University 9/5/217 Week 3-9/5/217 S535 ig ata - Fall 217 Week 3--1 S535 IG T FQs Programming ssignment 1 We will discuss link analysis in week3 Installation/configuration