04198250 Social Networks 2015 Lecture 10: The structure of the web and link analysis
The structure of the web
Information networks
- Nodes: pieces of information
- Links: different relations between pieces of information
- Key example: the World Wide Web
Other information networks Citation networks Difference from the web: an implicit time line, since papers can only cite papers published before them
Information networks and short paths
World Wide Web
Key features of the early web:
- Distributed information system: different pages stored on different computers
- Protocols for accessing this information using a browser
- Information is represented using hypertext
This makes the web into a network, where nodes are pages and (directed) links are hyperlinks
The modern web
Two types of links:
- Navigational links: traditional hyperlinks; clicking on one shows a new page in the browser
- Transactional links: have side-effects; clicking on one triggers a program that can cause effects other than showing a new page in the browser
We will focus on the information network spanned by navigational links
Using graph theory to analyse the structure of the web
Representing the web as a graph, the same techniques used to analyse social networks can be employed
Important difference: links are now directed
The network concepts (connectivity, components, etc.) we defined for undirected graphs can be generalised to directed graphs, but they become slightly more complicated
Concepts for directed graphs: paths and connectivity
A path from A to B in a directed graph is a sequence of nodes, beginning with A and ending with B, such that every two consecutive nodes are connected by a forward edge
A directed graph is strongly connected if there is a path from every node to every other node
Concepts for directed graphs: strongly connected components
A strongly connected component (SCC) in a directed graph is a subset of the nodes such that: (i) every node in the subset has a path to every other; and (ii) the subset is not part of some larger set with the property that every node can reach every other
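As an illustration, SCCs can be found with two depth-first searches (Kosaraju's algorithm). The following is a minimal sketch in Python, not from the lecture; the three-node graph at the end is a hypothetical example.

    from collections import defaultdict

    def strongly_connected_components(graph):
        """graph: dict mapping each node to a list of successor nodes."""
        # Pass 1: record nodes in order of completed DFS (finish order).
        visited, order = set(), []
        def dfs(node):
            visited.add(node)
            for succ in graph.get(node, []):
                if succ not in visited:
                    dfs(succ)
            order.append(node)
        for node in graph:
            if node not in visited:
                dfs(node)
        # Pass 2: DFS on the reversed graph in reverse finish order;
        # each tree found is one strongly connected component.
        reverse = defaultdict(list)
        for node, succs in graph.items():
            for succ in succs:
                reverse[succ].append(node)
        assigned, components = set(), []
        for node in reversed(order):
            if node in assigned:
                continue
            component, stack = set(), [node]
            while stack:
                n = stack.pop()
                if n in assigned:
                    continue
                assigned.add(n)
                component.add(n)
                stack.extend(reverse[n])
            components.append(component)
        return components

    # A and B can reach each other; C is reachable from them but cannot reach back.
    print(strongly_connected_components({"A": ["B"], "B": ["A", "C"], "C": []}))
    # -> [{'A', 'B'}, {'C'}]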
Example (figures worked through on slides)
The bow-tie structure of the web
What does the web look like? What are its strongly connected components?
Early influential study: Broder et al. (1999); its findings were later confirmed by others
The bow-tie structure of the web
The bow-tie structure: three very large components
- One giant strongly connected component
- IN-nodes: can reach the giant SCC but cannot be reached from it
- OUT-nodes: can be reached from the giant SCC, but cannot reach it
Web 2.0
Three main principles:
- towards shared content and collective creation
- personal data in the cloud
- mechanisms for maintaining social connections between people
Web 2.0 applications: Wikipedia, Facebook, Twitter, Gmail, ...
Many of the general concepts in this course are extremely relevant for analysing Web 2.0!
Link analysis and web search
Page ranking
Web search is keyword-based
Key web search problem: almost always too many results!
How can the results be ranked, so that we get the most important ones first?
Difficult, because:
- Queries have low expressivity
- Synonymy: several words mean the same thing
- Polysemy: the same word has several meanings
- It is difficult to say anything about the importance of a web page based only on the keywords present there
- Different people expect different results from the same query
Link analysis
Key idea: rank pages not (only) according to their local content, but (also) by looking at their links
Idea 1: if a page is linked to by many of the relevant pages (pages with the keywords), then that page is important
Example
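A minimal sketch of Idea 1 in Python: count, for each page, how many of the relevant pages link to it, treating each link as one "vote". All page names are hypothetical.

    from collections import Counter

    # Pages relevant to the query, mapped to the pages they link to.
    relevant = {
        "pageA": ["result1", "result2"],
        "pageB": ["result1"],
        "pageC": ["result1", "result2"],
    }

    # Each link is one vote; rank results by number of in-links.
    votes = Counter(target for links in relevant.values() for target in links)
    print(votes.most_common())   # [('result1', 3), ('result2', 2)]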
List pages
Problem: ranking by in-links only can give results that have many in-links in general but are not relevant
Idea 2: not all links are equally important. Links from pages that link to many of the pages with many votes are likely to be more important. We call such pages list pages.
Two steps now: (1) find how good a page is as a list; (2) let links from good list pages count higher
Example: (1) a page's value as a list (figure: list values 8, 11, 7, 6, 5, and 6 computed for the pages in the example network)
Example: (2) let links from good list pages count higher (figure: the pages receive new scores 19, 19, 1, 24, 15, 12, and 5)
Why stop here?
We can repeat this process: recompute the list values using the new scores, then compute new scores from the new list values. Again and again and again...
Ranking algorithm: hubs and authorities
1. For the query, find all the hubs: pages that contain the keywords. These pages will be used as potential lists; the hub score of each hub is initially 1. The authorities are the pages linked to by the hubs; these are the pages we rank.
2. For each authority, update its authority score to be the sum of the hub scores of all the hubs pointing to it.
3. For each hub, update its hub score to be the sum of the authority scores of all the authorities it points to.
4. Repeat steps 2 and 3 a given number of times.
Ranking algorithm: hubs and authorities
Normalisation: divide each hub score by the sum of all hub scores, and each authority score by the sum of all authority scores
The normalised values converge: with each iteration, the change gets smaller and smaller
And they converge to the same values, no matter which initial authority and hub scores we used! (advanced material)
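The update loop with normalisation fits in a few lines of Python. A minimal sketch, assuming step 1 is done and the result is given as a dictionary from hubs to the authorities they link to; the graph at the end is a made-up example.

    def hits(hub_links, k):
        """hub_links: dict mapping each hub to the authorities it links to."""
        hub = {h: 1.0 for h in hub_links}                       # hub scores start at 1
        auth = {a: 0.0 for links in hub_links.values() for a in links}
        for _ in range(k):
            # Step 2: authority score = sum of hub scores of hubs pointing to it.
            for a in auth:
                auth[a] = sum(hub[h] for h in hub_links if a in hub_links[h])
            # Step 3: hub score = sum of authority scores of pages it points to.
            for h in hub:
                hub[h] = sum(auth[a] for a in hub_links[h])
            # Normalise so that each set of scores sums to 1.
            hub_total, auth_total = sum(hub.values()), sum(auth.values())
            hub = {h: s / hub_total for h, s in hub.items()}
            auth = {a: s / auth_total for a, s in auth.items()}
        return hub, auth

    hub, auth = hits({"h1": ["x", "y"], "h2": ["y", "z"], "h3": ["y"]}, k=20)
    print(sorted(auth.items(), key=lambda item: -item[1]))   # y ranks highest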
PageRank
In the hubs and authorities algorithm, pages have different roles: a page can cast many votes without itself being a relevant result
PageRank is an alternative algorithm, where a page is considered important if it is linked to by other pages that are themselves important
Ranking algorithm: PageRank
Let n be the number of nodes in the network. Let the initial PageRank value of each node be 1/n.
Update the PageRank value of each node as follows: each page divides its current PageRank value equally across its outgoing links; each page then updates its PageRank value to be the sum of the values arriving over its incoming links.
Repeat the update step k times.
PageRank: example (whiteboard)
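A minimal sketch of the basic update rule in Python. The four-node graph is a hypothetical example, not the whiteboard one, and the sketch assumes every node has at least one outgoing link.

    def pagerank(graph, k):
        """graph: dict mapping each node to the list of nodes it links to."""
        n = len(graph)
        pr = {node: 1.0 / n for node in graph}    # initial value 1/n everywhere
        for _ in range(k):
            new = {node: 0.0 for node in graph}
            for node, links in graph.items():
                # Each page divides its current value over its out-links...
                share = pr[node] / len(links)
                # ...and each target collects the values sent to it.
                for target in links:
                    new[target] += share
            pr = new
        return pr

    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    print(pagerank(graph, k=50))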
PageRank: analysis
The PageRank values converge towards final limiting values as the number of iterations increases
We call an assignment of PageRank values equilibrium values if they are not changed by the update rule
Limiting values are equilibrium values
In strongly connected networks, equilibrium values are limiting values
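Written as an equation (with PR(v) denoting the PageRank of page v and out(u) the number of outgoing links of page u, notation introduced here for convenience), equilibrium values are exactly those satisfying, for every page v:

    PR(v) = sum over all pages u that link to v of PR(u) / out(u)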
PageRank: a problem (figure: what are the limiting values? all PageRank will end up in a set of nodes with no links back to the rest of the network)
PageRank: a problem
In all real, large networks this is a genuine problem
Solution: modify the update rule with a scaling factor
This is the version of PageRank that is used in practice
PageRank with scaling factor
Let n be the number of nodes in the network. Let the initial PageRank value of each node be 1/n.
Let s be a scaling factor between 0 and 1 (typically 0.8-0.9)
Update:
1. First apply the basic/standard PageRank update rule
2. Scale down all PageRank values by a factor of s
3. Add the value (1-s)/n to the PageRank of each node
Repeat the update step k times
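The scaled version differs from the basic sketch above only in the last step. A minimal sketch, reusing the same hypothetical graph and picking s = 0.85 from the typical range:

    def scaled_pagerank(graph, k, s=0.85):
        n = len(graph)
        pr = {node: 1.0 / n for node in graph}
        for _ in range(k):
            new = {node: 0.0 for node in graph}
            # Step 1: the basic update rule.
            for node, links in graph.items():
                share = pr[node] / len(links)
                for target in links:
                    new[target] += share
            # Steps 2-3: scale down by s, then add (1-s)/n to every node.
            pr = {node: s * value + (1 - s) / n for node, value in new.items()}
        return pr

    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    print(scaled_pagerank(graph, k=50, s=0.85))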
PageRank with scaling factor: analysis
The PageRank values still converge towards final limiting values as the number of iterations increases
The limiting values depend on s
In any network, the limiting values are the unique equilibrium values
PageRank as random walks
PageRank as random walks
Consider the following definition of a random walk on a web graph:
- Pick a web page at random (uniform probability)
- Follow a random link from that web page (again, uniform probability)
- Repeat the previous step k times
Theorem: the probability of being at page X after k steps is equal to the PageRank value of X after k update steps (unscaled)
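The theorem can be checked empirically: simulate many k-step walks and compare the visit frequencies with k basic PageRank updates. A minimal sketch, reusing the pagerank function and hypothetical graph from the earlier sketch (assumes every node has at least one outgoing link):

    import random

    def random_walk_distribution(graph, k, trials=100_000):
        nodes = list(graph)
        counts = {node: 0 for node in nodes}
        for _ in range(trials):
            page = random.choice(nodes)            # uniform starting page
            for _ in range(k):
                page = random.choice(graph[page])  # uniform random out-link
            counts[page] += 1
        return {node: c / trials for node, c in counts.items()}

    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    print(random_walk_distribution(graph, k=10))
    print(pagerank(graph, k=10))   # should be close to the frequencies above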
Link analysis and modern web search
PageRank was developed by Google, and was used to rank search results for many years
Today, both Google and others use ranking methods that are extremely complex, extremely secret, and always changing
Examples and analysis (blackboard)