Lecture Note: Computation problems in social network analysis

Bang Ye Wu
CSIE, Chung Cheng University, Taiwan
September 29, 2008

A note used in the SNA course of CS Department, Chung Cheng University. http://www.cs.ccu.edu.tw/ bangye/; Email: bangye@cs.ccu.edu.tw

In this lecture note, several computational problems are listed, including exercises for computer programming and state-of-the-art research issues. Of course, the list of problems here is not exhaustive. This note is not a good survey because the current status of most of the problems is not given.

1 Exercises of Computer programming

There are many basic metrics to be computed in SNA. Many of them are easy exercises for students in computer science. But the network may be large and sparse, so data structures suited to such graphs (for example, adjacency lists) should be employed in the program. In the following, a graph is denoted by G = (V, E), which may be either directed or undirected. Also, the vertices and the edges/arcs may be weighted or unweighted.

k-neighbors: given a node v and an integer k, find all the nodes at distance k from v. This set is denoted by $N_k(v)$. Solvable by Breadth-First Search (BFS).

Components: find the connected components of a graph. Solvable by BFS or Depth-First Search (DFS).

Shortest paths: solvable by BFS or 2-way (bidirectional) BFS when the edges are unweighted.
- one shortest path: find a shortest path between two given nodes.
- all shortest paths: find all shortest paths between two given nodes.
- all-to-all shortest paths: find one/all shortest paths between every pair of nodes. This problem is more general than finding the distances of all pairs.
- shortest-paths tree: find a shortest-paths tree for a given node.
- shortest-path DAG*: for a node v, the shortest-path DAG is the DAG containing all shortest paths from v to all other nodes. (Is there any interesting research issue?)
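The BFS-based exercises above can be sketched as follows. This is a minimal illustration, assuming an unweighted graph stored as a dictionary of adjacency lists (a sparse representation); the function names and the small example graph are hypothetical and not part of the original note.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return {node: hop distance from source} for every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:          # first visit gives the shortest distance
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def k_neighbors(adj, v, k):
    """N_k(v): the nodes at distance exactly k from v."""
    return {u for u, d in bfs_distances(adj, v).items() if d == k}

def connected_components(adj):
    """Connected components of an undirected graph, one set per component."""
    seen, components = set(), []
    for v in adj:
        if v not in seen:
            comp = set(bfs_distances(adj, v))  # everything reachable from v
            seen |= comp
            components.append(comp)
    return components

# Hypothetical example: a path 1-2-3-4 plus a separate edge 5-6.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3], 5: [6], 6: [5]}
print(k_neighbors(adj, 1, 2))
print(connected_components(adj))
```

Each BFS touches every node and edge at most once, so the adjacency-list representation keeps the cost proportional to the size of the sparse graph rather than to $|V|^2$.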

diameter: The diameter is the maximum distance over all pairs of nodes, i.e., $\max_{u,v \in V} d(u, v)$, where $d(u, v)$ is the distance from u to v.

Remark 1: The distance between two nodes is the length of a shortest path between them (for an undirected graph). Sometimes we use the term eccentricity. The eccentricity of a node is the longest distance from any node to it, and the diameter is the largest eccentricity over all nodes.

max-flow: In the general Maximum Flow problem, a directed graph with a capacity limit associated with each arc is given. For two given nodes (a source and a sink), we want to find a flow of maximum value from the source to the sink. Solvable by the Labeling Algorithm. In SNA, the flow is regarded as a measurement of influence. In most cases, the capacity limit on each arc is the same, i.e., unit capacity. In that case, can we compute the max flow more efficiently?

closeness centrality (in, out, all): The closeness of a node is the total distance to all the other nodes. It is sometimes called farness (which may be more suitable). There are three variants: from the node to others, from others to the node, and the distance defined by the sum (average) of both directions. Solvable by BFS. What if we only want to know the node with the maximum/minimum closeness?

betweenness centrality.

Node betweenness: The betweenness of a node v is the number of node pairs whose shortest path passes through v. Every such node pair contributes one point to the betweenness of v. If a pair has multiple shortest paths, these shortest paths share the one point equally.

Edge betweenness: Similar to node betweenness, it is the number of node pairs whose shortest path passes through the edge.

degree centrality: The in-degree and out-degree of a node.

clustering coefficient (CC): The CC of a node v is the number of edges between its neighbors divided by the maximum possible number ($|N(v)|(|N(v)| - 1)/2$ in the undirected case). The CC of the whole network could be taken as the mean of the CCs of its nodes. However, the unweighted mean over all nodes is not a good measurement. Instead, it is defined by the total number of triangles divided by the total number of connected triples. Precisely, with a, b, c distinct,

$$CC = \frac{|\{(a, b, c) : ab, bc, ac \in E\}|}{|\{(a, b, c) : ab, bc \in E\}|}$$
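The two clustering coefficients can be compared on a small example. The sketch below assumes an undirected graph stored as a dictionary of neighbor sets; `local_cc`, `global_cc`, and the example graph are illustrative names, not taken from the note. The global version counts ordered triples exactly as in the formula above.

```python
from itertools import combinations

def local_cc(adj, v):
    """Edges among the neighbors of v, divided by |N(v)|(|N(v)|-1)/2."""
    k = len(adj[v])
    if k < 2:
        return 0.0                     # convention assumed here for degree < 2
    links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return links / (k * (k - 1) / 2)

def global_cc(adj):
    """Ordered triples (a,b,c) closing a triangle over all connected triples."""
    triangles = 0                      # (a,b,c) with ab, bc, ac in E
    triples = 0                        # (a,b,c) with ab, bc in E, a != c
    for b in adj:
        k = len(adj[b])
        triples += k * (k - 1)         # ordered pairs of distinct neighbors of b
        for a, c in combinations(adj[b], 2):
            if c in adj[a]:
                triangles += 2         # counts (a,b,c) and (c,b,a)
    return triangles / triples if triples else 0.0

# Hypothetical example: a triangle 1-2-3 with a pendant node 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
mean_local = sum(local_cc(adj, v) for v in adj) / len(adj)
print(mean_local, global_cc(adj))
```

On this graph the unweighted mean of the node CCs and the triangle-based CC differ, which illustrates why the two definitions should not be used interchangeably.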

counting 3-rings (undirected, transitive, cyclic)

2 Research Issues

2.1 Clustering problem

2.1.1 Correlation clustering

Each person has two lists: the people he likes and the people he dislikes. Find a clustering such that the favorite score is maximum. The problem can be modeled as follows. Each edge is labeled with +, with -, or with nothing. We want to partition the vertices into k clusters such that the total number of positive edges minus the total number of negative edges within clusters is maximum.

Remarks: This problem has several variants. First, the number of clusters may or may not be specified in the input. Second, the score can be defined in several similar ways. For example, we can maximize the total number of positive edges within clusters minus the number of positive edges across clusters. (See the reference papers.)

Remarks: Has the bottleneck version been studied?
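To make the objective of Section 2.1.1 concrete, the following sketch evaluates the score of a given clustering: positive edges minus negative edges inside clusters. The edge dictionary, the labels `+1`/`-1`, and the function name are assumptions made for illustration only; it evaluates a candidate clustering and does not search for the best one.

```python
def correlation_score(signed_edges, cluster_of):
    """Number of + edges minus number of - edges whose endpoints share a cluster."""
    score = 0
    for (u, v), sign in signed_edges.items():
        if cluster_of[u] == cluster_of[v]:
            score += sign              # +1 for a liked pair, -1 for a disliked pair
    return score

# Hypothetical signed network on four people.
signed_edges = {
    ("a", "b"): +1,
    ("b", "c"): +1,
    ("c", "d"): +1,
    ("a", "c"): -1,
    ("a", "d"): -1,
}
one_cluster = {"a": 0, "b": 0, "c": 0, "d": 0}
two_clusters = {"a": 0, "b": 0, "c": 1, "d": 1}
print(correlation_score(signed_edges, one_cluster))   # 3 positive - 2 negative
print(correlation_score(signed_edges, two_clusters))  # 2 positive - 0 negative
```

Splitting the network loses one positive edge but removes both negative edges from inside the clusters, so the two-cluster partition scores higher here.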

2.1.2 k-center problem

The k-center problem is a classical problem in graph algorithms. Given a graph and an integer k, we look for a vertex subset S of k centers minimizing the bottleneck distance from any node to its nearest center, i.e., we want to minimize

$$\max_{v \in V} \min_{s \in S} d(v, s).$$

For fixed k, the problem can be solved in polynomial time: a naive algorithm that tries all $O(n^k)$ possible k-node subsets takes only polynomial time when k is fixed. However, for general k, it is NP-hard. The problem can be viewed as a clustering problem in the sense that the vertices closest to each center $s \in S$ are regarded as belonging to the same cluster. The quality of the clustering is then measured by the maximum radius of the clusters. The k-center problem has been well studied, and many results can be found in the literature.

2.1.3 Maximal/maximum 2-clique

Definition 1: For a graph G and an integer k, a k-clique in G is a subgraph induced on a vertex subset S such that $d(u, v) \le k$ for all $u, v \in S$.

Traditionally, the k-clique is defined on undirected graphs and should be maximal (a substructure is maximal if it is not properly contained in any substructure with the same property). Of course, we can also define it on directed graphs. Please note that the terminology is different from Graph Theory, in which a k-clique usually means a clique of size k.

The k-clique is a relaxation of the clique. The definition of a clique is too restrictive, and it is almost impossible to find a meaningful clique in real data. A k-clique is a clique in the k-th transitive extension of the graph (the graph in which two nodes are adjacent iff their distance is at most k). Finding a k-clique of maximum cardinality is NP-hard; MAX-CLIQUE, a well-known NP-hard problem, is the special case k = 1. The good news is that many results about MAX-CLIQUE can be applied to this problem.

2.1.4 Maximal/maximum 2-club

The definition of a k-club is similar to that of a k-clique. The only difference is that we require $d_{G[S]}(u, v) \le k$ for all $u, v \in S$, where G[S] is the subgraph induced on S and $d_{G[S]}(u, v)$ is the distance within G[S], i.e., the minimum length of any path between u and v passing only through vertices in S. In SNA, it seems more meaningful to use the k-club than the k-clique. Finding a maximum k-club is also NP-hard. Interestingly, the complexity of determining a maximal k-club is still unknown.
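The difference between a k-clique (distances measured in G) and a k-club (distances measured in G[S]) can be checked directly. The sketch below assumes an undirected, unweighted graph stored as a dictionary of neighbor sets; the function names and the 5-cycle example are illustrative assumptions. It only tests the distance condition and does not test maximality.

```python
from collections import deque
from itertools import combinations

def distances_within(adj, source, allowed):
    """BFS distances from source using only vertices in `allowed`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w in allowed and w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def is_k_clique(adj, S, k):
    """d(u, v) <= k in the whole graph G for all u, v in S."""
    everything = set(adj)
    return all(
        distances_within(adj, u, everything).get(v, float("inf")) <= k
        for u, v in combinations(S, 2)
    )

def is_k_club(adj, S, k):
    """d(u, v) <= k inside the induced subgraph G[S] for all u, v in S."""
    S = set(S)
    return all(
        distances_within(adj, u, S).get(v, float("inf")) <= k
        for u, v in combinations(S, 2)
    )

# Hypothetical example: the 5-cycle a - b - y - c - x - a.
adj = {
    "a": {"b", "x"}, "b": {"a", "y"}, "y": {"b", "c"},
    "c": {"y", "x"}, "x": {"c", "a"},
}
S = {"a", "b", "c"}
print(is_k_clique(adj, S, 2), is_k_club(adj, S, 2))
```

Here c reaches a and b only through x and y, which lie outside S, so S satisfies the 2-clique condition but not the 2-club condition.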

2.1.5 Maximal/maximum directed graph 1.5-clique, 1.5-club

For directed graphs, we can generalize the k-clique/club. We define a middle ground between the clique and the 2-clique by requiring $d(u, v) + d(v, u) \le 3$. It was shown that any 1.5-clique has density at least 0.5. This may be an important feature for finding a cohesive cluster.

Remark 2: There are several interesting problems about the k-clique/club, for example exact algorithms (DP or branch-and-bound; there is an integer programming algorithm published in European J. OR 2002), heuristic algorithms, or other AI methods.

2.1.6 Maximum density subgraph with/without given size

The maximum density subgraph (MDS) problem asks for a subgraph with maximum density, defined as the total number (weight) of edges divided by the number of vertices. MDS has also been studied in graph algorithms. If the size of the subgraph is not restricted, the problem can be solved in polynomial time. But if the size of the subgraph is given in the input, the problem becomes NP-hard.

2.1.7 Overlapping and non-overlapping clustering

Most current techniques for graph clustering focus on non-overlapping clustering. Recently, some research results on the overlapping case have been published. Some of them modify previous algorithms for the non-overlapping case. In social networks, clusters usually overlap with one another.

2.1.8 Bipartite clustering

In a bipartite graph, the vertex set can be partitioned into two subsets $V = V_1 \cup V_2$ such that all edges cross the cut, i.e., $(u, v) \notin E$ whenever u and v are both in $V_1$ or both in $V_2$. Similar to the MAX-CLIQUE problem, we may also be interested in finding a maximum-size bi-clique (complete bipartite subgraph). Another problem is to partition a bipartite graph into subgraphs such that each of them is complete, or at least dense.

There is another interesting bipartite clustering problem. In this problem, $V_1$ is a set of members and $V_2$ is a set of attributes, and $(u, v) \in E$ iff member u has attribute v. From this information, we want to partition $V_1$ into several clusters, overlapping or non-overlapping, such that the members in one cluster are similar in their attributes.

2.1.9 Hierarchical partition of graph

To deal with a large network, a hierarchical method is needed and is usually considered. Most of the algorithms are time-consuming and hard to apply to a large network directly. One may first partition a network into small ones and then employ these algorithms.

2.1.10 Clustering with modularity (Newman)

Modularity was defined by Newman to measure how good a partition (clustering) is. See the reference for the formal definition. Finding a clustering with maximum modularity is still an NP-hard problem. Many of the results in this area use spectral graph theory.

2.1.11 Improving GN algorithm for clustering

The GN algorithm was proposed by Girvan and Newman. It repeatedly removes the edge of maximum betweenness until the network is partitioned into the desired number of clusters. It is a heuristic algorithm for non-overlapping clustering. One drawback is its high time complexity. There are some results on improving the GN algorithm, and further improvements may still be discovered.
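The GN procedure described in Section 2.1.11 can be sketched in a few lines. This is a plain, unoptimized illustration, assuming an undirected, unweighted graph without self-loops stored as a dictionary of neighbor sets; the function names and the example graph are hypothetical. Edge betweenness is accumulated by one BFS per source, with multiple shortest paths between a pair sharing that pair's single point, and it is recomputed from scratch after every removal, which is exactly the source of the high time complexity mentioned above.

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """Edge betweenness; a pair's shortest paths share one point equally."""
    score = defaultdict(float)
    for s in adj:
        dist, sigma = {s: 0}, defaultdict(float)   # sigma = number of shortest paths
        sigma[s] = 1.0
        preds, order = defaultdict(list), []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:         # u precedes w on a shortest path
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        delta = defaultdict(float)
        for w in reversed(order):                  # farthest nodes first
            for u in preds[w]:
                share = sigma[u] / sigma[w] * (1.0 + delta[w])
                score[frozenset((u, w))] += share
                delta[u] += share
    return {e: b / 2.0 for e, b in score.items()}  # each pair was seen from both ends

def components(adj):
    """Connected components by depth-first search."""
    seen, comps = set(), []
    for v in adj:
        if v in seen:
            continue
        comp, stack = {v}, [v]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps

def girvan_newman(adj, k):
    """Remove maximum-betweenness edges until at least k components remain."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    while len(components(adj)) < k:
        bet = edge_betweenness(adj)
        if not bet:                                  # no edges left to remove
            break
        u, w = max(bet, key=bet.get)
        adj[u].discard(w)
        adj[w].discard(u)
    return components(adj)

# Hypothetical example: two triangles joined by the bridge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(girvan_newman(adj, 2))
```

In the example, all nine pairs with one endpoint in each triangle route through the bridge, so it has the largest betweenness and is removed first.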

2.1.12 Other approaches for community detection

Community detection, or graph clustering, is an important topic with many fruitful research issues.

2.2 Influence problem

The influence problem naturally arises in marketing. Suppose that we have a social network describing the social relationships among people. We want to promote our new product and want to select some people as the initial set (we shall also call these nodes source nodes, or sources for simplicity). Using their influence on others, we hope that as many people as possible will know our product. The exact definition depends on the way we define influence.

In general, we are given a social network G (usually directed) and an integer k. There is an influence function $f_S(v)$ of an initial set S on a node v. In addition, there is a threshold $\delta(v)$ associated with each node v. We want to find a vertex subset S of size k such that the cardinality of the influenced set $\{v : f_S(v) \ge \delta(v)\}$ is maximized. Surely the problem can be defined with weights on initial nodes, influenced nodes, or both. The threshold function $\delta$ provides flexibility in that nodes can differ in how hard they are to convince. For simplicity, in the following we assume that the thresholds of all nodes are the same.

2.2.1 Minimum distance mode

Maybe the simplest definition of the influence function is to use only the distance. Intuitively, the farther a node is from the initial nodes, the weaker the influence is. Usually we use an exponential function as the influence function. The attenuation factor may be any positive number smaller than 1, but 1/2 is usually used. Another question is how to measure the total effect of all initial nodes. If we consider only the closest source, we have

$$f_S(v) = \max_{s \in S} 2^{1 - d(s, v)},$$

or equivalently $f_S(v) = 2^{1 - d(S, v)}$, in which the distance $d(S, v)$ from a set to a node is the minimum distance from any node in the set to the node. In such a case, only thresholds that are powers of the attenuation factor matter. Therefore, in this mode, a node is influenced iff its distance from the initial set is within a specified bound. So we can have the following definition.

Problem: Minimum-Distance-Mode Max-Influence
Input: A network G, an integer $k \ge 1$, and an integer q
Output: A set S of k sources such that $|\{v : d(S, v) \le q\}|$ is maximized.

We call it the minimum distance mode.
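For a given source set S, the influenced set in the minimum distance mode can be found with a single multi-source BFS that stops at depth q. The sketch below assumes an unweighted directed graph stored as a dictionary of out-neighbor lists and q >= 0; the names are illustrative. The brute-force search over all k-subsets is included only to make the problem statement concrete on tiny inputs and is not a practical algorithm.

```python
from collections import deque
from itertools import combinations

def influenced_set(adj, sources, q):
    """Nodes v with d(S, v) <= q, found by one multi-source BFS."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        if dist[u] == q:
            continue                   # do not expand beyond distance q
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

def best_sources_bruteforce(adj, k, q):
    """Try every k-subset of nodes; exponential in k, for tiny examples only."""
    return max(
        combinations(sorted(adj), k),
        key=lambda S: len(influenced_set(adj, S, q)),
    )

# Hypothetical example: a directed path 1 -> 2 -> 3 -> 4 -> 5 -> 6.
adj = {1: [2], 2: [3], 3: [4], 4: [5], 5: [6], 6: []}
best = best_sources_bruteforce(adj, k=2, q=1)
print(best, influenced_set(adj, best, 1))
```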

2.2.2 Distance-mode

An objection to the minimum distance mode is that it ignores the cooperation of sources. A person may be influenced by more than one source. With this in mind, we define the problem as follows.

Problem: Distance-Mode Max-Influence
Input: A network G, an integer $k \ge 1$, and a real q
Output: A set S of k sources such that $|\{v : \sum_{s \in S} 2^{1 - d(s, v)} \ge q\}|$ is maximized.

2.2.3 Path mode

An objection to the distance mode is that only shortest paths are counted. But in the real world, influence may spread along any path. So another influence mode considers all paths from the sources to the influenced node. Precisely speaking, we define

$$f_S(v) = \sum_{s \in S} \sum_{i=1}^{k} \#(P_i)\, 2^{-i+1},$$

where $\#(P_i)$ is the number of paths of length i (from s to v) and k is the selected limit on the path length (here k is not the number of sources). In practice, k is small since a path of length more than 5 is usually meaningless.
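The two influence functions above can be compared on a small graph. The sketch below assumes an unweighted directed graph stored as a dictionary of out-neighbor lists, and it assumes that "path" in the path mode means a simple path (no repeated node), which the note does not specify; all function names are illustrative. The parameter `limit` plays the role of the path-length limit k.

```python
from collections import deque

def bfs_distances(adj, s):
    """Hop distances from s to every reachable node."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def distance_mode_influence(adj, S, v):
    """Sum over sources s of 2^(1 - d(s, v)); unreachable sources add nothing."""
    total = 0.0
    for s in S:
        d = bfs_distances(adj, s).get(v)
        if d is not None:
            total += 2.0 ** (1 - d)
    return total

def count_paths_by_length(adj, s, v, limit):
    """counts[i] = number of simple paths of length i from s to v, i <= limit."""
    counts = [0] * (limit + 1)

    def dfs(u, length, visited):
        if u == v and length > 0:
            counts[length] += 1
            return
        if length == limit:
            return
        for w in adj.get(u, ()):
            if w not in visited:
                visited.add(w)
                dfs(w, length + 1, visited)
                visited.remove(w)

    dfs(s, 0, {s})
    return counts

def path_mode_influence(adj, S, v, limit):
    """Sum over sources and lengths i of #(P_i) * 2^(-i+1)."""
    total = 0.0
    for s in S:
        counts = count_paths_by_length(adj, s, v, limit)
        total += sum(counts[i] * 2.0 ** (1 - i) for i in range(1, limit + 1))
    return total

# Hypothetical example: a -> b directly, and a -> c -> b.
adj = {"a": ["b", "c"], "c": ["b"], "b": []}
print(distance_mode_influence(adj, ["a"], "b"))   # only the shortest path counts
print(path_mode_influence(adj, ["a"], "b", 3))    # the detour a -> c -> b also counts
```

Enumerating simple paths by DFS is exponential in general; it is tolerable here only because the length limit is small.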

2.2.4 Flow-mode (step-limited flow mode)

Another way to take all pathways into consideration is to define the influence by the maximum flow. The point is as follows. Suppose that A has an arc pointing to B. There may be many paths passing through the arc from A to B. In the path mode, we may overestimate the importance of these paths because A is only one person influencing B. This is the reason to use flow. But if the maximum flow is used, then in a dense network it is very likely that the maximum flow is just the in-degree of the target. From another point of view, ignoring the length of the flow paths is also questionable. A modification is to use flow along paths of limited length.

2.2.5 Diffusion Models

There are several diffusion models used in considering the spread of an idea or innovation through a social network. In these models, a node is either active (an adopter of the innovation) or inactive. The tendency of a node to become active increases monotonically as more of its neighbors become active. Usually, these models focus on the progressive case in which nodes can switch from being inactive to being active, but do not switch in the other direction. Thus, the process will look roughly as follows from the perspective of an initially inactive node v: as time unfolds, more and more of v's neighbors become active; at some point, this may cause v to become active, and v's decision may in turn trigger further decisions by nodes to which v is connected. [3]

Linear threshold model

In this model, a node v is influenced by each neighbor w according to a weight $b_{v,w}$ such that $\sum_{w \in N(v)} b_{v,w} \le 1$, in which N(v) is the incoming neighborhood of v. The dynamics of the process then proceed as follows. Each node v chooses a threshold $\theta_v$ uniformly at random from the interval [0, 1]; this represents the weighted fraction of v's neighbors that must become active in order for v to become active. Given a random choice of thresholds, and an initial set of active nodes $A_0$ (with all other nodes inactive), the diffusion process unfolds deterministically in discrete steps: in step t, all nodes that were active in step t - 1 remain active, and we activate any node v for which the total weight of its active neighbors is at least $\theta_v$:

$$\sum_{w \in N(v),\ w\ \text{active}} b_{v,w} \ge \theta_v$$

An alternative version of this model sets all $\theta_v$ to 1/2.

Independent Cascade model [1]

We start with an initial set of active nodes $A_0$, and the process unfolds in discrete steps according to the following randomized rule. When node v first becomes active in step t, it is given a single chance to activate each currently inactive neighbor w; it succeeds with a probability $p_{v,w}$ (a parameter of the system), independently of the history thus far. (If w has multiple newly activated neighbors, their attempts are sequenced in an arbitrary order.) If v succeeds, then w will become active in step t + 1; but whether or not v succeeds, it cannot make any further attempts to activate w in subsequent rounds. The process runs until no more activations are possible.

Remark 3: The maximum influence problem under the above two diffusion models was shown in [3] to be NP-hard and to be approximable with ratio $1 - 1/e - \epsilon$, where e is the base of the natural logarithm and $\epsilon$ is any positive real number. (In the convention of approximation ratios for maximization problems, we shall say the ratio is $e/(e - 1) + \epsilon$, the inverse of the above ratio.)

2.2.6 Other modes

2.3 Covering (Dominating) problem

The covering problem corresponds to the influence problem. Here we want to find a cheapest initial set such that all the people (or a given subset, or a desired number of unspecified persons) are influenced.
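Both the influence problem and the covering problem need a way to evaluate a candidate initial set, and for the two diffusion models of Section 2.2.5 this can be done by direct simulation. The sketch below is one possible way to write such simulators; the data layout (`in_weights`, `out_nbrs`, `prob`), the function names, and the example networks are assumptions for illustration, not taken from the note or from [3]. Thresholds are passed in explicitly, so either random thresholds or the fixed value 1/2 can be used.

```python
import random

def linear_threshold(in_weights, seeds, theta):
    """in_weights[v] = {w: b_vw} over incoming neighbors w; theta[v] = threshold."""
    active = set(seeds)
    while True:
        # Synchronous step: decide activations from the current active set only.
        newly = {
            v for v in in_weights
            if v not in active
            and sum(b for w, b in in_weights[v].items() if w in active) >= theta[v]
        }
        if not newly:
            return active
        active |= newly

def independent_cascade(out_nbrs, prob, seeds, rng=None):
    """prob[(v, w)] = chance that a newly active v activates its out-neighbor w."""
    rng = rng or random.Random()
    active = set(seeds)
    frontier = list(seeds)             # nodes that became active in the last step
    while frontier:
        newly = []
        for v in frontier:
            for w in out_nbrs.get(v, ()):
                # v gets exactly one attempt on each still-inactive neighbor
                if w not in active and rng.random() < prob[(v, w)]:
                    active.add(w)
                    newly.append(w)    # w starts its own attempts in the next step
        frontier = newly
    return active

# Hypothetical linear threshold example with all thresholds set to 1/2.
in_weights = {
    "a": {},
    "b": {"a": 0.6},
    "c": {"a": 0.3, "b": 0.3},
    "d": {"c": 0.4},
}
theta = {v: 0.5 for v in in_weights}
print(linear_threshold(in_weights, {"a"}, theta))

# Hypothetical independent cascade example; probabilities 0 and 1 make it deterministic.
out_nbrs = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
prob = {("a", "b"): 1.0, ("a", "c"): 0.0, ("b", "d"): 1.0}
print(independent_cascade(out_nbrs, prob, {"a"}, random.Random(0)))
```

With genuinely random thresholds or probabilities, the size of the final active set is a random variable, so one would average it over many simulation runs before comparing initial sets.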

2.4 Social roles: Equivalence or Similarity [2]

In such problems, we want to find positions (nodes) that are equivalent, or similar, in a social network. Two nodes are structurally equivalent if they have the same ties to all others. Checking whether two nodes are structurally equivalent and computing their similarity are simple. But it is still worth studying how to find equivalent or similar positions in a large social network without enumerating all pairs.

Besides structural equivalence, a more important definition of social role is regular equivalence. Two nodes are said to be regularly equivalent if they have the same profile of ties with members of other sets of actors that are also regularly equivalent. Regular equivalence seems more abstract than structural equivalence. It involves the problem of graph homomorphism.

2.5 Structure (link) mining or latent friend discovery

This is a research topic related to data mining. We suppose that there is an underlying social network, but for some reason some links are lost, and we want to find the lost links. Some research finds the links by analyzing the contents of web sites or blogs, and some uses only structural information. Of course, one can also develop methods using both kinds of information.

2.6 Computing problems with huge networks

When the network is huge, even a simple problem may become troublesome. This is a research issue in the field of external (memory) algorithms. See [4] for a nice survey.

Final remarks

The reference papers have not been enumerated in this note. Maybe they will be given somewhere later.

References

[1] J. Goldenberg, B. Libai, E. Muller. Talk of the Network: A Complex Systems Look at the Underlying Process of Word-of-Mouth. Marketing Letters 12:3 (2001), 211-223.

[2] A. Hanneman, M. Riddle. Introduction to social network methods, 2005, online at [http://www.faculty.ucr.edu/hanneman/nettext/].

[3] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. KDD (2003).

[4] S. Vitter. External Memory Algorithms and Data Structures: Dealing with Massive Data. ACM Computing Surveys, Vol. 33, No. 2, June 2001. February 2007 revision available online.