Social Networks 2015 Lecture 10: The structure of the web and link analysis


Lesley Walker
5 years ago

1 Social Networks 2015 Lecture 10: The structure of the web and link analysis

2 The structure of the web

3 Information networks. Nodes: pieces of information. Links: different relations between pieces of information. Key example: the World Wide Web.

4 Other information networks. Citation networks. Difference from the web: an implicit timeline.

5 Information networks and short paths

6 World Wide Web. Key features of the early web: a distributed information system, with different pages stored on different computers; protocols for accessing this information using a browser; information represented using hypertext. This makes the web into a network, where nodes are pages and (directed) links are hyperlinks.

7 The modern web. Two types of links. Navigational links: traditional hyperlinks; clicking on one shows a new page in the browser. Transactional links: links with side effects; clicking on one triggers a program that can have effects other than showing a new page in the browser. We will focus on the information network spanned by navigational links.

8 Using graph theory to analyse the structure of the web. When the web is represented as a graph, the same techniques used to analyse social networks can be employed. Important difference: links are now directed. The network concepts (connectivity, components, etc.) we defined for undirected graphs can be generalised to directed graphs, but become slightly more complicated.

9 Concepts for directed graphs: paths and connectivity. A path from A to B in a directed graph is a sequence of nodes, beginning with A and ending with B, such that each pair of consecutive nodes is connected by an edge pointing forward. A directed graph is strongly connected if there is a path from every node to every other node.
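
The two definitions above can be sketched in Python. The adjacency-dictionary representation, the function names `reachable` and `is_strongly_connected`, and the two toy graphs are assumptions made for illustration; they are not from the lecture. A node is treated as reachable from itself.

```python
from collections import deque

def reachable(graph, start):
    """Return the set of nodes that can be reached from start along forward edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def is_strongly_connected(graph):
    """True if every node has a directed path to every other node."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    return all(reachable(graph, node) == nodes for node in nodes)

# Hypothetical toy graphs: a directed cycle, and the same cycle with a dead end D.
cycle = {"A": ["B"], "B": ["C"], "C": ["A"]}
dead_end = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}

print(is_strongly_connected(cycle))
print(is_strongly_connected(dead_end))
```

In the second graph D can be reached from every node, but no path leads out of D, so the graph is connected without being strongly connected.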

10 Concepts for directed graphs: strongly connected components. We say that a strongly connected component (SCC) in a directed graph is a subset of the nodes such that: (i) every node in the subset has a path to every other; and (ii) the subset is not part of some larger set with the property that every node can reach every other.
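
One simple way to compute SCCs follows directly from the definition: the component of a node is the set of nodes it can reach that can also reach it back. The sketch below assumes an adjacency-dictionary graph with comparable node labels; the helper names and the example graph are hypothetical. It is quadratic in the worst case, and linear-time algorithms such as Tarjan's or Kosaraju's are normally preferred for large graphs.

```python
def reach(graph, start):
    """Nodes reachable from start (including start) along forward edges."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def reverse(graph):
    """Same nodes, every edge flipped."""
    rev = {node: [] for node in graph}
    for node, targets in graph.items():
        for target in targets:
            rev.setdefault(target, []).append(node)
    return rev

def strongly_connected_components(graph):
    rev = reverse(graph)
    remaining = set(rev)  # rev contains every node, including pure targets
    components = []
    while remaining:
        node = min(remaining)  # deterministic choice of the next node
        # Forward reach gives (i)-candidates; backward reach keeps those that can return.
        component = reach(graph, node) & reach(rev, node)
        components.append(component)
        remaining -= component
    return components

# Hypothetical example graph with three SCCs.
graph = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": ["E"], "E": ["D"], "F": ["A"]}
components = strongly_connected_components(graph)
print(components)
```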

11-17 Example (figures)

18 The bow-tie structure of the web. What does the web look like? What are its strongly connected components? Early influential study: Broder et al. (1999); findings later confirmed by others.

19-23 The bow-tie structure of the web (figure). One giant strongly connected component. IN-nodes: can reach the giant SCC but cannot be reached from it. OUT-nodes: can be reached from the giant SCC but cannot reach it. The bow-tie structure: three very large components.
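
Given a directed graph, the three parts of the bow-tie can be recovered from reachability alone. The following is a minimal sketch, assuming a non-empty adjacency-dictionary graph with a unique largest SCC; the function `bow_tie`, its helpers, and the tiny example web are hypothetical. Nodes that belong to none of the three parts are returned separately.

```python
def reach(graph, start):
    """Nodes reachable from start (including start) along forward edges."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def reverse(graph):
    rev = {node: [] for node in graph}
    for node, targets in graph.items():
        for target in targets:
            rev.setdefault(target, []).append(node)
    return rev

def bow_tie(graph):
    """Return (giant SCC, IN-nodes, OUT-nodes, remaining nodes)."""
    rev = reverse(graph)
    nodes = set(rev)
    core, unassigned = set(), set(nodes)
    while unassigned:  # find the largest strongly connected component
        node = unassigned.pop()
        scc = reach(graph, node) & reach(rev, node)
        unassigned -= scc
        if len(scc) > len(core):
            core = scc
    anchor = next(iter(core))  # any core node reaches exactly what the core reaches
    out_nodes = reach(graph, anchor) - core
    in_nodes = reach(rev, anchor) - core
    other = nodes - core - in_nodes - out_nodes
    return core, in_nodes, out_nodes, other

# Hypothetical miniature web.
web = {"in1": ["a"], "a": ["b"], "b": ["c"], "c": ["a", "out1"], "out1": [], "island": []}
core, in_nodes, out_nodes, other = bow_tie(web)
print(core, in_nodes, out_nodes, other)
```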

24 Web 2.0. Three main principles: a move towards shared content and collective creation; personal data in the cloud; mechanisms for maintaining social connections between people. Web 2.0 applications: Wikipedia, Facebook, Twitter, Gmail, ... Many of the general concepts in this course are extremely relevant for analysing Web 2.0!

25 Link analysis and web search

26 Page ranking. Web search is keyword-based. Key web search problem: almost always too many results! How can the results be ranked, so that we get the most important ones first? This is difficult, because: queries have low expressivity; synonymy: several words mean the same thing; polysemy: the same word has several meanings; it is difficult to say anything about the importance of a web page based only on the keywords present there; different people expect different results from the same query.

27 Link analysis. Key idea: rank pages not (only) according to their local content, but (also) according to their links. Idea 1: if a page is linked to by many of the relevant pages (pages with the keywords), then that page is important.
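
Idea 1 amounts to counting votes: each relevant page casts one vote for every page it links to. A small sketch, in which the function name `count_votes`, the link dictionary, and the page names are all made up for illustration:

```python
from collections import Counter

def count_votes(links, relevant_pages):
    """Count, for each page, how many relevant pages link to it."""
    votes = Counter()
    for page in relevant_pages:
        for target in set(links.get(page, ())):  # one vote per linking page
            votes[target] += 1
    return votes

# Hypothetical link data: p1, p2, p3 contain the keywords, "unrelated" does not.
links = {"p1": ["x", "y"], "p2": ["x"], "p3": ["x", "z"], "unrelated": ["z"]}
votes = count_votes(links, ["p1", "p2", "p3"])
print(votes.most_common())
```

The link from "unrelated" to z is ignored, because only pages matching the query are allowed to vote.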

28 Example

29 List pages. Problem: ranking by in-links only can give results that have many in-links in general but are not relevant. Idea 2: not all links are equally important. Links from pages that link to many of the pages with many votes are likely to be more important. We call such pages list pages. Two steps now: (1) find how good a page is as a list; (2) let links from good list pages count higher.

30-39 Example: (1) a page's value as a list (figures)

40-47 Example: (2) let links from good list pages count higher (figures showing the new score of each page)

48 Why stop here? We can repeat this process: change the list values using the new scores, then compute newer scores from those. Again and again and again...

49 Ranking algorithm: hubs and authorities. 1. For the query, find all the hubs: these are pages with the keywords. These pages will be used as potential lists. The hub score for each hub is initially 1. Also find the authorities: the pages linked to by the hubs. These will be the pages we rank. 2. For each authority, update its authority score to be the sum of the hub scores of all the hubs pointing to it. 3. For each hub, update its hub score to be the sum of the authority scores of all the authorities it points to. 4. Repeat steps 2 and 3 a given number of times.
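
The update rules of the hubs and authorities algorithm can be sketched as follows. The representation is an assumption: `links` maps each hub to the list of distinct pages it points to, and the hub and page names in the example are hypothetical.

```python
def hubs_and_authorities(links, k):
    """Run k rounds of the authority update followed by the hub update."""
    hub = {h: 1 for h in links}  # every hub starts with score 1
    auth = {}
    for _ in range(k):
        # Authority update: sum the hub scores of all hubs pointing to the page.
        auth = {}
        for h, targets in links.items():
            for page in targets:
                auth[page] = auth.get(page, 0) + hub[h]
        # Hub update: sum the authority scores of all pages the hub points to.
        hub = {h: sum(auth[page] for page in targets) for h, targets in links.items()}
    return hub, auth

# Hypothetical query result: three hubs pointing to three candidate authorities.
links = {"h1": ["a", "b"], "h2": ["a"], "h3": ["a", "b", "c"]}
print(hubs_and_authorities(links, 1))
print(hubs_and_authorities(links, 2))
```

After one round the authority scores are plain vote counts (a: 3, b: 2, c: 1). From the second round on, votes from better lists weigh more, and the raw scores grow quickly.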

50 Ranking algorithm: hubs and authorities. Normalisation: divide each hub score by the sum of all hub scores, and each authority score by the sum of all authority scores. The normalised values converge: with each iteration, the change gets smaller and smaller. And they converge to the same values, no matter which (positive) initial authority and hub scores we used! (advanced material)
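
A sketch of the normalised variant, run from two different starting points to illustrate the convergence claim on one small example. Assumptions: scores are normalised after every round (this gives the same proportions as normalising at the end and keeps the numbers small), initial hub scores are positive, and the graph and the alternative starting scores are invented.

```python
def normalise(scores):
    total = sum(scores.values())
    return {page: value / total for page, value in scores.items()}

def normalised_hits(links, k, initial_hub=None):
    """Hubs and authorities with normalisation after every update."""
    hub = normalise(initial_hub or {h: 1.0 for h in links})
    auth = {}
    for _ in range(k):
        auth = {}
        for h, targets in links.items():
            for page in targets:
                auth[page] = auth.get(page, 0.0) + hub[h]
        auth = normalise(auth)
        hub = normalise({h: sum(auth[p] for p in targets) for h, targets in links.items()})
    return hub, auth

links = {"h1": ["a", "b"], "h2": ["a"], "h3": ["a", "b", "c"]}
hub_a, auth_a = normalised_hits(links, 30)
hub_b, auth_b = normalised_hits(links, 30, {"h1": 5.0, "h2": 0.1, "h3": 2.0})

# Largest difference between the two runs' authority scores.
largest_gap = max(abs(auth_a[p] - auth_b[p]) for p in auth_a)
print(auth_a, largest_gap)
```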

51 PageRank. In the hubs and authorities algorithm, pages have different roles: a page can cast many votes without itself being a relevant result. PageRank is an alternative algorithm, where a page is considered important if it is linked to by other pages that are also important.

52 Ranking algorithm: PageRank. Let n be the number of nodes in the network. Let the initial PageRank value for each node be 1/n. Update the PageRank value for each node as follows: each page divides its current PageRank value equally across its outgoing links; each page updates its PageRank value to be the sum of the incoming values. Repeat the update step k times.
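
A minimal sketch of this basic update rule, using exact fractions so the values can be checked by hand. It assumes every node appears as a key in the adjacency dictionary and has at least one outgoing link; the three-page graph is hypothetical.

```python
from fractions import Fraction

def pagerank_basic(graph, k):
    """Apply the basic PageRank update k times, starting from 1/n per node."""
    n = len(graph)
    rank = {node: Fraction(1, n) for node in graph}
    for _ in range(k):
        new_rank = {node: Fraction(0) for node in graph}
        for node, targets in graph.items():
            share = rank[node] / len(targets)  # split equally over outgoing links
            for target in targets:
                new_rank[target] += share      # sum of the incoming shares
        rank = new_rank
    return rank

# Hypothetical three-page web.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank_basic(graph, 1))
print(pagerank_basic(graph, 3))
```

The update only moves PageRank around, so the values always sum to 1.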

53 PageRank: example (whiteboard)

54 PageRank: analysis. The PageRank values converge towards final limiting values as the number of iterations increases. We say that an assignment of PageRank values is an equilibrium if it is not changed by the update rule. Limiting values are equilibrium values. In strongly connected networks, equilibrium values are limiting values.
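
Whether an assignment is an equilibrium can be checked by applying the update rule once and comparing. The sketch below does this for a hypothetical strongly connected three-page graph; the names `update` and `is_equilibrium` and the candidate values are assumptions for illustration.

```python
from fractions import Fraction

def update(graph, rank):
    """One application of the basic PageRank update rule."""
    new_rank = {node: Fraction(0) for node in graph}
    for node, targets in graph.items():
        share = rank[node] / len(targets)
        for target in targets:
            new_rank[target] += share
    return new_rank

def is_equilibrium(graph, rank):
    """True if the update rule leaves the assignment unchanged."""
    return update(graph, rank) == rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
candidate = {"A": Fraction(2, 5), "B": Fraction(1, 5), "C": Fraction(2, 5)}
uniform = {node: Fraction(1, 3) for node in graph}

print(is_equilibrium(graph, candidate))
print(is_equilibrium(graph, uniform))
```

Here A receives all of C's value (2/5), B receives half of A's (1/5), and C receives half of A's plus all of B's (1/5 + 1/5), so the candidate reproduces itself, while the uniform starting assignment does not.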

55-57 PageRank: a problem (figure). Limiting values? All PageRank will end up here.
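
The effect can be reproduced on a small hypothetical graph in which two pages, X and Y, link only to each other while the rest of the network links into them. The graph and names below are assumptions, not the example from the slide; the sketch reports only the combined share of X and Y.

```python
def pagerank_basic(graph, k):
    """Basic (unscaled) PageRank update applied k times."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(k):
        new_rank = {node: 0.0 for node in graph}
        for node, targets in graph.items():
            share = rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# A, B, C form a cycle; C also links to X; X and Y only link to each other.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A", "X"], "X": ["Y"], "Y": ["X"]}
rank = pagerank_basic(graph, 200)
trapped = rank["X"] + rank["Y"]
print(trapped, rank["A"], rank["B"], rank["C"])
```

Every round some PageRank leaks from C into X, and nothing ever flows back, so A, B and C end up with essentially nothing.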

58 PageRank: a problem. In all large real networks, this is a real problem. Solution: the update rule is modified by using a scaling factor. This is the version of PageRank that is used in practice.

59 PageRank with scaling factor. Let n be the number of nodes in the network. Let the initial PageRank value for each node be 1/n. Let s be a scaling factor between 0 and 1 (typically: ). Update: 1. First apply the basic/standard PageRank update rule. 2. Scale down all PageRank values by a factor of s. 3. Add the value (1-s)/n to the PageRank of each node. Repeat the update step k times.
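
The scaled update can be sketched as below. The graph is hypothetical, and the value s = 0.85 used in the example is an assumption chosen for illustration, not a value taken from the slide. Every node is assumed to have at least one outgoing link.

```python
def pagerank_scaled(graph, k, s):
    """PageRank with scaling factor s, applied k times from 1/n per node."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(k):
        # 1. Basic update rule.
        new_rank = {node: 0.0 for node in graph}
        for node, targets in graph.items():
            share = rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        # 2. Scale down by s, then 3. add (1 - s)/n to every node.
        rank = {node: s * value + (1 - s) / n for node, value in new_rank.items()}
    return rank

# X and Y link only to each other; without scaling they would absorb everything.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A", "X"], "X": ["Y"], "Y": ["X"]}
s = 0.85  # assumed value, for illustration only
rank = pagerank_scaled(graph, 100, s)
print(rank)
```

The scaled-down values sum to s and the n additions contribute 1 - s in total, so the values still sum to 1, and every node keeps at least (1-s)/n.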

60 PageRank with scaling factor: analysis. The PageRank values still converge towards final limiting values as the number of iterations increases. Limiting values depend on s. In any network, limiting values are unique equilibrium values.

61 PageRank as random walks

62-63 PageRank as random walks. Consider the following definition of a random walk on a web graph: 1. Pick a web page at random (uniform probability). 2. Follow a random link from that web page (again, uniform probability). 3. Repeat the previous step k times. Theorem: the probability of being at page X after k steps is equal to the PageRank value of X after k steps (unscaled).
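
The theorem can be checked empirically by simulating many walks and comparing the end-point frequencies with the unscaled PageRank values. In this sketch a walk of k steps follows k links, and the graph, the number of walks, and the random seed are arbitrary choices for illustration.

```python
import random
from collections import Counter

def pagerank_basic(graph, k):
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(k):
        new_rank = {node: 0.0 for node in graph}
        for node, targets in graph.items():
            for target in targets:
                new_rank[target] += rank[node] / len(targets)
        rank = new_rank
    return rank

def simulate_walks(graph, k, walks, rng):
    """Fraction of random walks of k steps that end at each page."""
    nodes = list(graph)
    ends = Counter()
    for _ in range(walks):
        page = rng.choice(nodes)              # uniformly random starting page
        for _ in range(k):
            page = rng.choice(graph[page])    # uniformly random outgoing link
        ends[page] += 1
    return {node: ends[node] / walks for node in nodes}

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
exact = pagerank_basic(graph, 3)
observed = simulate_walks(graph, 3, 20000, random.Random(0))
print(exact)
print(observed)
```

The observed frequencies should agree with the computed PageRank values up to sampling noise, which shrinks as the number of walks grows.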

64 Link analysis and modern web search. PageRank was developed by the founders of Google, and was used to rank search results for many years. Today, both Google and others use ranking methods that are extremely complex, extremely secret, and always changing.

65 Examples and analysis (blackboard)
