Recap! CMSC 498J: Social Media Computing Department of Computer Science University of Maryland Spring 2015 Hadi Amiri hadi@umd.edu

Announcement: CourseEvalUM https://www.courseevalum.umd.edu/

Announcement: Project Presentations & Reports (40%)
- Each team (3 students) has strictly 14 minutes.
- Each student should present the parts of the project that he or she worked on; a student who does not present receives a zero mark.
- Each team should email me one PPT file containing all slides for the group presentation, by May 2, 11:59 PM at the latest.
- Rehearse sufficiently with your teammates before the presentation.

Announcement: Project Presentations & Reports (40%)
The report has a maximum of 7 pages, in which you explain:
- the problem and its importance, with a concrete definition;
- a short description of the dataset(s) used;
- your approach to solving the problem;
- evaluation results and insights.
Each team should submit the final report, code, and datasets used by May 10, 11:00 AM. Either write the files to a CD or DVD, or email us a link to download your project materials. There must be a readme file in the root directory explaining the steps to run your code.

Announcement: Final Exam (30%)
Thursday, May 12, 8:00-10:00 AM, CSI 2107
Closed-book. One A4 sheet of notes is allowed!

Network Basics
- Graph density
- Complete graph
- Connectivity; connected graph
- Distance between nodes i and j: d(i,j)
- Diameter of a graph
- Connected component: a subset of nodes in which every node has a path to every other node, and which is not part of a bigger component.
- BFS: the distance between the root and every level-k node is k!
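The BFS and connected-component ideas above can be sketched in a few lines. This is only an illustration: the adjacency-dictionary representation, the function names, and the toy graph are assumptions, not part of the course material.

```python
from collections import deque

def bfs_levels(graph, root):
    """Return {node: distance from root} for every node reachable from root."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:            # first visit gives the BFS level
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def connected_components(graph):
    """Group the nodes of an undirected graph into connected components."""
    seen, components = set(), []
    for node in graph:
        if node not in seen:
            component = set(bfs_levels(graph, node))
            seen |= component
            components.append(component)
    return components

# Hypothetical toy graph: a 4-cycle A-B-D-C plus a separate edge E-F.
toy_graph = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"],
    "E": ["F"], "F": ["E"],
}
```

Calling `bfs_levels(toy_graph, "A")` puts B and C at level 1 and D at level 2, which matches the "level k means distance k" rule.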

Triadic Closure
If two nodes in a network have a neighbor in common, then there is an increased likelihood that they will become connected themselves.
Reasons for triadic closure: opportunity, trust, incentives.
Clustering Coefficient: a measure that captures the prevalence of triadic closure. It is defined for nodes:
$CF(A) = \frac{\text{number of connections between A's friends}}{\text{possible number of connections between A's friends}} = \frac{1}{6}$ (in the slide's example)
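A minimal sketch of the clustering coefficient, assuming the graph is stored as a dictionary mapping each node to the set of its friends. The function name and the toy graph are hypothetical; the toy graph is chosen so that A gets the value 1/6, and is not the figure from the slide.

```python
from itertools import combinations

def clustering_coefficient(graph, node):
    """Fraction of pairs of the node's friends that are themselves connected."""
    friends = graph[node]
    if len(friends) < 2:
        return 0.0                       # no pair of friends to close (a convention)
    possible = len(friends) * (len(friends) - 1) // 2
    actual = sum(1 for u, v in combinations(friends, 2) if v in graph[u])
    return actual / possible

# Hypothetical example: A has 4 friends, and only C and D know each other.
toy_graph = {
    "A": {"B", "C", "D", "E"},
    "B": {"A"}, "C": {"A", "D"}, "D": {"A", "C"}, "E": {"A"},
}
```

Here A's friends form 6 possible pairs and only the pair C-D is connected, so the coefficient is 1/6.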

Bridge
An edge is a bridge if deleting it would put its two endpoints into two different connected components.
Bridges provide access to parts of the network that are unreachable by other means!

Local Bridge
An edge whose endpoints have no friends in common, i.e. an edge that is not part of any triangle!
Deleting a local bridge increases the distance between its endpoints to a value strictly greater than 2.

Neighborhood Overlap
A measure that captures the "bridgeness" of an edge! For an edge A-B, don't count A and B themselves.

Edge    Neighborhood overlap
-----   --------------------
A-E     2/4
A-F     1/6
A-B     0/8 (overlap = 0 for local bridges)

Edges with very small neighborhood overlap can be considered "almost" local bridges.
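A sketch of neighborhood overlap, using the standard definition from the course textbook [NCM]: the number of nodes that are neighbors of both endpoints, divided by the number of nodes that are neighbors of at least one endpoint, not counting the endpoints themselves. The function names and the toy graph are assumptions made for illustration.

```python
def neighborhood_overlap(graph, a, b):
    """Overlap of edge a-b; graph maps each node to the set of its neighbors."""
    common = graph[a] & graph[b]
    either = (graph[a] | graph[b]) - {a, b}   # don't count a and b themselves
    return len(common) / len(either) if either else 0.0

def is_local_bridge(graph, a, b):
    """An existing edge a-b is a local bridge if a and b share no friends."""
    return b in graph[a] and not (graph[a] & graph[b])

# Hypothetical toy graph: triangle A-C-D, plus the chain A-B-E.
toy_graph = {
    "A": {"B", "C", "D"}, "B": {"A", "E"},
    "C": {"A", "D"}, "D": {"A", "C"}, "E": {"B"},
}
```

In this toy graph, edge A-C has overlap 1/2 (D is shared; B and D are the neighbors of either endpoint), while edge A-B has overlap 0/3 and is a local bridge.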

Strong Triadic Closure
If A has strong links to B and C, then there must be a link, either weak or strong, between B and C!

Local Bridges and Weak Ties
The relationship between local bridges and weak ties follows from strong triadic closure.
If node A:
- satisfies strong triadic closure, AND
- is involved in at least two strong ties,
then any local bridge adjacent to A must be a weak tie.

The Strength of Weak Ties
Weak ties (acquaintances) connect us to new sources of information.
This dual role, as weak connections but also valuable links to hard-to-reach parts of the network, is the surprising strength of weak ties.

Questions
1. How does the neighborhood overlap of an edge depend on its strength?
   Neighborhood overlap should grow as tie strength grows.
2. How do weak ties serve to link different communities that each contain a large number of stronger ties?
   Delete edges from the network one at a time, starting with the weakest ties! The giant component shrinks rapidly (its size decreases quickly).

Structural Holes (cont.)
Structural hole: the "empty space" in the network between 2 sets of nodes that don't interact closely!
A node with multiple local bridges spans a structural hole in the network.
- B has early access to information!
- B is a gatekeeper and controls the ways in which groups learn about information. She has power!
- B may try to prevent triangles from forming around the local bridges she is part of!
How long do these local bridges last before triadic closure produces shortcuts around them?

Node Centrality
- Degree centrality: a node is central if it has ties to many other nodes. Look at node degree!
- Closeness centrality: a node is central if it is close to other nodes. Look at the distance between nodes.
- Betweenness centrality: a node is central if other nodes have to go through it to get to each other. Look at the shortest paths between nodes.

Clustering
Break a network into groups of densely connected nodes, with sparse connections between the groups.
Graph Partitioning: find the edges that carry most of the traffic in a network, and successively remove the edges with the highest traffic!

Edge Betweenness
Edge betweenness: assume that 1 unit of flow passes between every pair of nodes A and B, along the shortest paths between them. The betweenness of an edge is the total amount of flow it carries! If there are k shortest paths between A and B, then 1/k units of flow go along each shortest path!
Girvan-Newman Algorithm: repeat until no edges are left:
- Calculate the betweenness of all edges.
- Remove the edge(s) with the highest betweenness.

Computing Edge Betweenness
A clever way to compute betweenness values efficiently: use breadth-first search (BFS).
1. For each node A:
   a. Run BFS from A.
   b. Count the number of shortest paths from A to every other node.
   c. Determine the amount of traffic from A to the other nodes.
2. Compute the betweenness of each edge by summing all the traffic passing over the edge, and divide by 2.
Note that we count the flow between each pair of nodes A and B twice (once when running BFS from A and once when running BFS from B)! So we need to divide the resulting values by 2!
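The BFS-based procedure above can be sketched as follows for an undirected, unweighted graph. The representation (a dictionary of neighbor lists, edges keyed by `frozenset`) and all names are assumptions; the traffic step pushes flow from the deepest BFS level back towards the root, splitting it in proportion to the number of shortest paths.

```python
from collections import deque, defaultdict

def edge_betweenness(graph):
    """Return {frozenset({u, v}): betweenness} for every edge on a shortest path."""
    betweenness = defaultdict(float)
    for root in graph:
        # BFS from root, counting shortest paths to every node.
        dist = {root: 0}
        num_paths = defaultdict(int)
        num_paths[root] = 1
        parents = defaultdict(list)
        order = []
        queue = deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:      # u precedes v on a shortest path
                    num_paths[v] += num_paths[u]
                    parents[v].append(u)
        # Traffic: each node receives 1 unit from root, plus whatever passes through it.
        flow = {node: 1.0 for node in order}
        for v in reversed(order):               # deepest nodes first
            for u in parents[v]:
                share = flow[v] * num_paths[u] / num_paths[v]
                betweenness[frozenset((u, v))] += share
                flow[u] += share
    # Every pair was counted from both endpoints, so divide by 2.
    return {edge: value / 2 for edge, value in betweenness.items()}

# Hypothetical toy graph: the path A - B - C.
toy_path = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
```

On the path A - B - C, each edge carries the flow of two pairs (for A-B: the pairs A,B and A,C), so both edges get betweenness 2. A Girvan-Newman loop would call this function, remove the highest-scoring edge(s), and repeat.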

Homophily
Links connect people with similar characteristics.
Homophily has two mechanisms for link formation:
- Selection: selecting friends with similar characteristics. Individual characteristics drive the formation of links. (Immutable characteristics)
- Social influence (socialization): modifying behaviors to bring them close to the behaviors of friends. Existing links influence the individual characteristics of the nodes. (Mutable characteristics)

Homophily (cont.)
Different mechanisms for link formation as types of closure processes!
Focal Closure: B and C are people, A is a focus.
Selection: B links to a similar C (common focus).

Homophily (cont.)
Different mechanisms for link formation as types of closure processes!
Membership Closure: A and B are people, C is a focus.
Social influence: B links to C, influenced by A.

Tracking Closure
Tracking link formation in large-scale datasets based on the above mechanisms.
Algorithm
---------
1) Take 2 snapshots of the network at different times: S(1), S(2).
2) For each k, find all pairs of nodes in S(1) that are not directly connected but have k common friends.
3) Compute T(k) as the fraction of these pairs that are connected in S(2).
4) Plot T(k) as a function of k.
T(k) is an estimate of the probability that a link will form between 2 people with k common friends.
T(0) is the rate of link formation when it does not close a triangle.
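Steps 1 to 3 of the algorithm can be sketched as below, assuming each snapshot is a dictionary mapping a node to the set of its friends. The function name and the two toy snapshots are hypothetical; plotting (step 4) is left out.

```python
from itertools import combinations
from collections import defaultdict

def closure_rates(s1, s2):
    """Return {k: T(k)} computed from two snapshots s1 (earlier) and s2 (later)."""
    pairs = defaultdict(int)    # unconnected pairs in s1 with k common friends
    closed = defaultdict(int)   # ... of which this many are connected in s2
    for u, v in combinations(sorted(s1), 2):
        if v in s1[u]:
            continue                        # already linked in the first snapshot
        k = len(s1[u] & s1[v])
        pairs[k] += 1
        if v in s2.get(u, set()):
            closed[k] += 1
    return {k: closed[k] / pairs[k] for k in pairs}

# Hypothetical snapshots: B and C share friend A in S(1) and become friends in S(2).
snapshot_1 = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}, "D": set()}
snapshot_2 = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
```

In this toy data, three unconnected pairs have no common friend and one of them (A, D) links up, so T(0) = 1/3; the single pair with one common friend (B, C) links up, so T(1) = 1.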

Spatial Model of Segregation
Effects of homophily in the formation of ethnically and racially homogeneous neighborhoods in cities. People live near others like them!
Color the map with respect to a given race:
--Lighter: lowest percentage of the race.
--Darker: highest percentage of the race.

Schelling Model
The overall effect: local preferences of individual agents have produced a global pattern that none of them necessarily intended.
Immutable characteristics can become highly correlated with mutable characteristics (here, the decision about where to live).

Structural Balance
For 3 people, certain configurations are socially / psychologically more plausible.

Triangle   Number of +'s   Status
--------   -------------   ----------
+ + +      3               balanced
+ + -      2               unbalanced
+ - -      1               balanced
- - -      0               unbalanced

Triangles with one or three +'s are balanced; triangles with zero or two +'s are unbalanced.

Structure of Balanced Nets
What does a balanced network look like?
If a labeled complete graph is balanced, then:
- the network contains only positive edges, or
- there is a global division of the network: the nodes can be divided into 2 groups X and Y such that there is a positive link between every pair of nodes in X, a positive link between every pair of nodes in Y, and a negative link between every node of X and every node of Y.
Balance Theorem: these are the only ways to have a balanced network!

Weak Structural Balance
Weak Structural Balance Property: there is no triangle with exactly two positive edges and one negative edge.
3 mutual enemies (- - -) are allowed, as there could be less of a force leading any 2 of them to become friends, compared to the first case (two mutual enemies with a common friend who tries to reconcile them)!
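The two triangle rules (structural balance and its weak version) reduce to counting positive edges. A small sketch, assuming each triangle is given as three edge signs encoded as +1 / -1; the encoding and function names are assumptions.

```python
def is_balanced_triangle(signs):
    """Structural balance: a triangle needs one or three positive edges."""
    positives = sum(1 for s in signs if s > 0)
    return positives in (1, 3)

def is_weakly_balanced_triangle(signs):
    """Weak balance: only 'two positive, one negative' is forbidden."""
    positives = sum(1 for s in signs if s > 0)
    return positives != 2
```

The only triangle on which the two rules disagree is three mutual enemies, `(-1, -1, -1)`: unbalanced under the first rule, allowed under the weak one.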

Structure of Weakly Balanced Nets
What does a weakly balanced network look like?
If a labeled complete graph is weakly balanced, then its nodes can be divided into groups such that:
- every 2 nodes in the same group are friends, and
- every 2 nodes in different groups are enemies.
Weak Balance Theorem: this is the only way to have a weakly balanced network!

Balance in General Networks
Claim: a graph is balanced if and only if it contains no cycle with an odd number of negative edges.
(The example graph on the slide is not balanced!)
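One way to test the claim on a signed graph is to try to split the nodes into two sides so that positive edges stay within a side and negative edges cross sides; a conflict during this BFS corresponds to a cycle with an odd number of negative edges. This is a sketch under assumed conventions (edges as `(u, v, sign)` triples with sign +1 or -1), not necessarily the procedure used in class.

```python
from collections import deque

def is_balanced(nodes, signed_edges):
    """True if no cycle of the signed graph has an odd number of negative edges."""
    adjacency = {n: [] for n in nodes}
    for u, v, sign in signed_edges:
        adjacency[u].append((v, sign))
        adjacency[v].append((u, sign))
    side = {}
    for start in nodes:                     # handle every connected component
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, sign in adjacency[u]:
                # Friends share a side; enemies sit on opposite sides.
                expected = side[u] if sign > 0 else 1 - side[u]
                if v not in side:
                    side[v] = expected
                    queue.append(v)
                elif side[v] != expected:
                    return False            # found an odd-negative cycle
    return True
```

For example, a triangle with signs +, +, - contains a cycle with one negative edge and is reported as not balanced, while +, -, - is balanced.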

Web Structure
Distribution of WCCs and SCCs on the web.
Source: Graph structure in the Web. Broder et al., WWW 2000.

Web Structure (cont.)
- IN nodes: can reach the SCC but cannot be reached from it.
- OUT nodes: can be reached from the SCC but cannot reach it.
- Tendril nodes: (a) reachable from IN but cannot reach the SCC, or (b) can reach OUT but cannot be reached from the SCC. Tendril nodes satisfying both (a) and (b) travel in a "tube" from IN to OUT without touching the SCC.
- Disconnected nodes: have no path to the SCC, even ignoring edge directions.
Source: Graph structure in the Web. Broder et al., WWW 2000.

Power Law
A function that decreases as k to some fixed power, e.g. $a k^{-c}$, is called a power law!
It allows us to see very large values of k in the data!
If a power law holds, the log-log plot should be a straight line.
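Why the log-log plot is a straight line: taking logarithms of both sides of the power law (writing $f(k)$ for the fraction of items with value k; the name $f$ is introduced here only for illustration) gives

```latex
f(k) = a\,k^{-c}
\quad\Longrightarrow\quad
\log f(k) = \log a - c \log k
```

so plotting $\log f(k)$ against $\log k$ yields a line with slope $-c$ and intercept $\log a$.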

Rich Get Richer
Rich-Get-Richer: a simple model for the creation of links, as a basis for power laws!
1. Pages are created in order and named 1, 2, ..., N.
2. When page j is created, it produces a link to an earlier page i < j according to the following rules:
   a) With probability p, page j chooses a page i uniformly at random and creates a link to i.
   b) With probability (1 - p), page j chooses a page i uniformly at random and creates a link to the page that i points to (it copies the decision made by i).
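A small simulation sketch of the model above. The function name, the `seed` parameter, and one tie-breaking choice are assumptions: page 1 has no out-link to copy, so when the copy rule picks a page without a link, this sketch falls back to linking to that page directly.

```python
import random
from collections import Counter

def rich_get_richer(n_pages, p, seed=0):
    """Simulate the copying model; return a Counter of in-links per page."""
    rng = random.Random(seed)
    target = {}                              # target[j] = page that j links to
    for j in range(2, n_pages + 1):
        i = rng.randint(1, j - 1)            # an earlier page, uniformly at random
        if rng.random() < p or i not in target:
            target[j] = i                    # rule (a), or the fallback (assumption)
        else:
            target[j] = target[i]            # rule (b): copy i's decision
    return Counter(target.values())

def in_degree_histogram(in_links):
    """Map k -> number of pages that have exactly k in-links (k >= 1)."""
    return Counter(in_links.values())
```

Running `in_degree_histogram(rich_get_richer(100000, 0.3))` and plotting the counts on log-log axes is one way to eyeball whether the result looks like a power law; no particular output is claimed here.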

Rich Get Richer (cont.)
We observe a power law: if we run this model for many pages, the fraction of pages with k in-links will be distributed according to a power law $1/k^c$!
The value of the exponent c depends on the choice of p.
Correlation between c and p? Smaller p: copying becomes more frequent, so we are more likely to see extremely popular pages, and c gets smaller as well.

Rich Get Richer (cont.)
Explain power laws using the Rich-Get-Richer model:
- Fraction of phone numbers receiving k calls per day: $1/k^2$
- Fraction of books bought by k people: $1/k^3$
- Fraction of papers with k citations: $1/k^3$
- Fraction of cities with population k: $1/k^c$, c constant. Cities grow in proportion to their size, simply as a result of people having children!
Once an item becomes popular, the rich-get-richer dynamics are likely to push it even higher!

HITS Algorithm
1. Set all hub scores and authority scores to 1.
2. Choose a number of steps k.
3. Perform a sequence of k hub-authority updates:
   a. First apply the Authority Update Rule to the current set of scores.
   b. Then apply the Hub Update Rule to the resulting set of scores.
4. Normalize the authority and hub scores.

HITS (cont.)
Authority Update Rule: for each page p, update auth(p) to be the sum of the hub scores of all pages that point to it.
Hub Update Rule: for each page p, update hub(p) to be the sum of the authority scores of all pages that it points to.
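The HITS steps and the two update rules can be sketched as follows. The data layout (`links[p]` is the list of pages p points to) and the function name are assumptions, and the final normalization divides each score by the sum of its kind, which is one common choice.

```python
def hits(links, k):
    """Run k hub-authority updates; return (authority, hub) score dictionaries."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(k):
        # Authority Update Rule: sum the hub scores of the pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for p, outs in links.items():
            for q in outs:
                new_auth[q] += hub[p]
        auth = new_auth
        # Hub Update Rule: sum the (new) authority scores of the pages p points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
    # Normalize so that each kind of score sums to 1 (assumed convention).
    auth_total = sum(auth.values()) or 1.0
    hub_total = sum(hub.values()) or 1.0
    return ({p: a / auth_total for p, a in auth.items()},
            {p: h / hub_total for p, h in hub.items()})

# Hypothetical toy web: two hub-like pages pointing at two candidate authorities.
toy_links = {"H1": ["A", "B"], "H2": ["A"]}
```

With one update on the toy web, A collects hub score from both H1 and H2 (authority 2 versus 1 for B), and H1 in turn becomes the better hub (3 versus 2) before normalization.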

PageRank Algorithm
1. Set the initial PageRank of all nodes to 1/n.
2. Perform k updates to the PageRank values, applying the PageRank Update Rule each time.

PageRank (cont.)
PageRank Update Rule:
1. Each page divides its current PageRank equally across its out-going links and passes these equal shares to the pages it points to. If a page has no out-going links, it passes all its current PageRank to itself.
2. Each page updates its new PageRank to be the sum of the shares it receives.

PageRank (cont.)
Issue with the PageRank algorithm? Link farms: the "wrong" nodes can end up with all the PageRank in the network!

PageRank (cont.)
Scaled PageRank Update Rule:
- Pick a scaling factor 0 < s < 1.
- Apply the PageRank Update Rule as before.
- Then scale down all PageRank values by a factor of s. The total PageRank in the network shrinks from 1 to s.
- Divide the residual (1 - s) units of PageRank equally over the nodes, giving (1 - s)/n to each.
The above rule follows from the "fluid" intuition for PageRank. Why doesn't all the water on earth inexorably run downhill and reside exclusively at the lowest points? There's a counter-balancing process at work: water also evaporates and gets rained back down at higher elevations!
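Both the basic and the scaled update rule can be sketched in a few lines. The names and the link representation (`links[p]` is the list of pages p points to) are assumptions; the scaling factor s is left as a required argument because the slides do not fix a value.

```python
def pagerank_update(links, rank):
    """One application of the basic PageRank Update Rule."""
    new_rank = {p: 0.0 for p in rank}
    for p, r in rank.items():
        outs = links.get(p, [])
        if not outs:
            new_rank[p] += r                 # no out-links: keep everything
        else:
            share = r / len(outs)            # equal share per out-going link
            for q in outs:
                new_rank[q] += share
    return new_rank

def scaled_pagerank(links, k, s):
    """k applications of the Scaled PageRank Update Rule with factor 0 < s < 1."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}       # initial PageRank 1/n
    for _ in range(k):
        rank = pagerank_update(links, rank)
        # Scale down by s, then hand (1 - s)/n back to every node.
        rank = {p: s * r + (1 - s) / n for p, r in rank.items()}
    return rank

# Hypothetical toy web with three pages.
toy_links = {"A": ["B"], "B": ["A"], "C": ["A"]}
```

Because each step removes a fraction (1 - s) of the total and redistributes exactly that amount, the values still sum to 1 after every update, e.g. for `scaled_pagerank(toy_links, 10, 0.85)`.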

Random Walks
Random walk on a network:
- Start by choosing a page at random, picking each page with equal probability.
- Follow links for a sequence of k steps: in each step, pick a random out-going link from the current page and follow it to where it leads. If there are no out-going links, stay at the current node.
This is an equivalent formulation of PageRank that leads to exactly the same definition!

Random Walks (cont.)
Interpretation of the leakage issue in terms of random walks. As the walk runs for more and more steps:
- The probability of reaching F or G converges to 1; once the walk reaches F or G, it is stuck there forever.
- The probability of being at F (or at G) converges to 1/2.
- The probabilities converge to 0 for all other nodes.

Information Cascade
Behaviors that cascade from one node to another like an epidemic and produce collective outcomes!
An information cascade occurs when people make decisions sequentially, with later people watching the actions of earlier people.

Information Cascade (cont.)
General Cascade Principles
- Cascades can easily occur, given the right structural conditions!
- Cascades can lead to non-optimal (wrong) outcomes!
- Some (but not all) cascades can be very fragile!

Information Cascade (cont.)
[Figure: sequences of accept (acc) / reject (rej) decisions, marking where people follow their private info and where they follow the crowd.]

Diffusion
Look at cascades from a network-structure perspective.
People imitate the behaviors of others: they adopt a new behavior once a sufficient proportion of their neighbors have done so.

Diffusion (cont.)
Question: what makes a cascade stop, or prevents it from breaking into all parts of a network?
Claim: given a set of initial adopters of A and a threshold q:
i. If the remaining network contains a cluster of density greater than (1 - q), then there is no complete cascade.
ii. If there is no complete cascade, then the remaining network contains a cluster of density greater than (1 - q).
Cluster Density: a cluster with density p is a set of nodes such that each node has at least a p fraction of its neighbors in the set.
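The threshold rule and the cluster-density definition can be tried out on a toy network. This is a sketch under assumptions: a node adopts once at least a fraction q of its neighbors have adopted, the graph maps each node to the set of its neighbors, and the names and the example graph are hypothetical.

```python
def run_cascade(graph, initial_adopters, q):
    """Spread behavior A until no further node reaches the threshold q."""
    adopters = set(initial_adopters)
    changed = True
    while changed:
        changed = False
        for node, neighbors in graph.items():
            if node in adopters or not neighbors:
                continue
            fraction = len(neighbors & adopters) / len(neighbors)
            if fraction >= q:
                adopters.add(node)
                changed = True
    return adopters

def cluster_density(graph, cluster):
    """Largest p such that every node has at least a p fraction of its
    neighbors inside the cluster (assumes every node has a neighbor)."""
    return min(len(graph[n] & cluster) / len(graph[n]) for n in cluster)

# Hypothetical toy network: chain A - B - C attached to the triangle C, D, E.
toy_graph = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B", "D", "E"},
    "D": {"C", "E"}, "E": {"C", "D"},
}
```

Starting from A with q = 0.5, B adopts but the cascade stops at C, which sees only 1/3 of its neighbors adopting. The remaining nodes {C, D, E} form a cluster of density 2/3, which is greater than 1 - q = 0.5, in line with the claim.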

Six Degrees of Separation
Watts-Strogatz Model: a simple model that makes the world small through:
- many closed triads,
- many short paths.
Suppose nodes live on a 2-dimensional grid. Link formation:
- Homophily: each node forms a link to all other nodes that lie within a radius of up to r grid steps, where r is a constant. These are the links to similar people!
- Weak ties: for some constant k, each node also forms a link to k other nodes selected uniformly at random from the grid. These links connect nodes that lie very far apart on the grid.

Six Degrees of Separation (cont.)
The resulting network is built from local structure and random edges:
- The network has many triangles.
- There are many short paths connecting pairs of nodes in the network!

Six Degrees of Separation (cont.)
It was shown that only a small amount of randomness is needed to achieve the same qualitative effect.
Instead of k random friends per node, allow one out of every k nodes to have a single random friend!
Interpretation: group k × k nodes into one town. Each town then has k links to other towns, just like in the previous model.
To find a short path between 2 people, first find a short path between their towns, then use the proximity-based edges to find the actual path.

Decentralized Search
People are effective at collectively finding short paths.
Generalizing the Watts-Strogatz network model:
- Nodes live on a grid.
- Each node has edges to the nodes within r grid steps.
- Weak ties (k random edges) are generated in a way that decays with distance. This is controlled by the clustering exponent q.
Let d(v,w) be the number of grid steps between nodes v and w (the distance if one had to walk along adjacent nodes on the grid). Then v links to w with probability proportional to $d(v,w)^{-q}$.
q controls how uniform the random links are!
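To see how q shapes the random links, the probabilities proportional to $d(v,w)^{-q}$ can be computed explicitly on a tiny grid. This sketch assumes nodes are `(x, y)` tuples and that grid steps are counted as Manhattan distance; the function names and the 3 × 3 grid are hypothetical.

```python
def grid_distance(v, w):
    """Number of grid steps between v and w (Manhattan distance)."""
    return abs(v[0] - w[0]) + abs(v[1] - w[1])

def long_range_link_probabilities(v, nodes, q):
    """Probability that v's random link goes to each other node w,
    proportional to d(v, w) ** (-q)."""
    weights = {w: grid_distance(v, w) ** (-q) for w in nodes if w != v}
    total = sum(weights.values())
    return {w: weight / total for w, weight in weights.items()}

# Hypothetical 3 x 3 grid.
grid_nodes = [(x, y) for x in range(3) for y in range(3)]
```

With q = 0 every other node is equally likely (the original uniform random links); as q grows, the probability mass shifts towards nearby nodes, e.g. `long_range_link_probabilities((0, 0), grid_nodes, 2)` favors `(0, 1)` over `(2, 2)`.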

Decentralized Search (cont.)
With link probability proportional to $d(v,w)^{-q}$:
- Small q: links are too random, and can't be used effectively for decentralized search.
- Large q: links are not random enough, and there are not enough long-distance jumps to create a small world.
Is there an optimal q for the network that allows rapid decentralized search?

Decentralized Search (cont.)
[Figure: delivery time as a function of q, for a network of 400M nodes, averaged over 1k runs.]
Best delivery time at q ~ 2: the inverse-square distribution!
Generalized models: rank-based friendship, social distance.

Microblog Search

Microblog Search
- Architecture of search / retrieval systems
- Techniques for improving search
- Challenges of real-time search
- Twitter's architecture for real-time search
- Evaluation

Questions?

Reading
- Ch.01 Overview [NCM]
- Ch.02 Graphs [NCM]
- Ch.03 Strong and Weak Ties [NCM]
- Ch.04 Homophily, and Link Prediction [NCM]
- Ch.05 Positive and Negative Relationships [NCM]
- Ch.13 The Structure of the Web [NCM]
- Ch.14 Link Analysis [NCM]
- Ch.05 [MMD]
- Ch.16 Information Cascades [NCM]
- Ch.18 Power Laws and Rich-Get-Richer [NCM]
- Ch.19 Cascading Behavior in Networks [NCM]
- Ch.20 The Small-World Phenomenon [NCM]