Metropolitan Community Delineation. Community detection algorithms. Nathaniel Pritchard


Metropolitan Community Delineation
Community detection algorithms
Nathaniel Pritchard
Advised by: Professor Shankar Bhamidi
Reviewed by: Professor Shankar Bhamidi, Professor Sayan Banerjee
Statistics and Operations Research
The University of North Carolina at Chapel Hill
April 2, 2018

1 Introduction

There are endless ways in which we can group people together in space; the most common are cities, towns, villages, and metropolitan areas. Today we define cities and metropolitan areas using a mixture of techniques, such as historical boundaries and strict rules about commuting and population density, to determine the outline of a metropolitan area [2]. The way in which these delineations are determined is important because they are the basis for various statistical calculations detailing anything from community diversity to median incomes. These statistics are then used to help determine economic and infrastructure initiatives, which could be beneficial to the community but could also be wasteful if the statistics do not represent the true community. Thus a robust method of community delineation that captures the true structure of connections would allow for more representative statistics. These more representative statistics would give policy makers a better understanding of the issues, and with this better understanding they should be better able to solve these problems.

This robust method can be found by turning to network theory, the study of connections between entities. This field of study is a useful one because metropolitan areas are supposed to represent closely connected areas. Specifically, it would be useful to look to network theory for an algorithm that identifies community structure from data on the workers who commute to and from various counties. Using these techniques we should be able to provide delineations that are judged to be both significant and proper.

2 Network Theory Background

Networks are the way that we mathematically capture the connections between entities. They can be used to describe social connections, food chains, protein structures, or, in our case, commuter connections.
These networks are based on entities (nodes, in network language), whether people, internet devices, or animals, that are connected to other entities via edges [6]. Networks can be weighted, in which case each edge takes on a certain value, or unweighted, in which case an edge either exists or does not. There is also no requirement that edges be bi-directional: if every edge in a network is bi-directional then the network is undirected; otherwise it is directed, meaning that at least some connections are one-way paths [6]. The edges between nodes can be recorded in an adjacency matrix $A$, where the $ij$th entry is the weight of the edge between $i$ and $j$, or zero if no edge exists [6]. In an unweighted network the entry is one if the edge exists and zero otherwise. Additionally, in the undirected case, if the $ij$th entry exists then the $ji$th entry exists and has the same value [6].

Once the network has been established there needs to be a way to extract information about its structure. There are several ways to get a rudimentary idea of network structure; however, only two are necessary for this paper. One is degree, which in an undirected graph is the number of edges connected to a node [6]. In a directed graph there is an incoming degree and an outgoing degree, which are the number of edges coming into a node and leaving a node, respectively [6]. If there are $n$ total nodes in a graph, the degree of a particular node $i$ will be denoted $d_i$. The total degree $d_t$ of the graph is defined as

$$d_t = \sum_{i=1}^{n} d_i$$

The other important measure is geodesic distance, which is the shortest distance between two nodes [6]. Distance in networks is the number of edges separating two nodes from one another. For example, if the fewest jumps needed to connect Kevin Bacon to Idris Elba were six, then six would be the geodesic distance between them.

3 Community detection: basic algorithms

What is a community? In the simplest terms it is a grouping of nodes that is more connected within itself than with the rest of the network [6]. This simple definition has led to numerous different ways to find a community. One way of dividing a network into communities, as proposed by Mark Newman, is to look for groups where the number of edges is greater than the number expected in a randomly generated graph with the same degree distribution. By maximizing that measure one can then identify a community. The model Newman uses for this expected value of an edge is the configuration model [6]. Other techniques do things like maximize the number of triangles, which occur when three nodes are all connected to each other [6].

3.1 The configuration model

The configuration model is designed for unweighted networks. In this model each node is assumed to have the same degree as the corresponding node in the actual graph being examined [6]. From this setup the probability of a connection between nodes $i$ and $j$ can be easily defined. This is done by thinking of the degree of a node as the number of half edges attached to it [6]. An edge then forms when two half edges point toward each other [6].
From this idea we define the total number of half edges of a graph with $n$ nodes to be

$$d_t = \sum_{i=1}^{n} d_i$$

Since there are $d_t$ half edges in total, the chance of a given half edge from node $i$ pointing toward node $j$ is $d_j / (d_t - 1)$, since it has to point at one of the $d_j$ half edges connected to $j$ out of the total number of half edges in the graph excluding itself [6]. Then, since there are $d_i$ half edges leaving node $i$, the probability of an edge between $i$ and $j$ is [6]

$$p_{ij} = \frac{d_i d_j}{d_t - 1}$$

If the network is large, $d_t$ is much greater than one, so the effect of the $-1$ in the denominator is minimal and it can be removed, making the probability

$$p_{ij} = \frac{d_i d_j}{d_t}$$

In an unweighted network this probability of an edge is also the expected value of an edge, since an edge can only take a value of zero or one. Thus by calculating this probability we have a good null model for expected edges from which to try to determine community structure in a graph [6].

3.2 Modularity

With this null model described it is now possible to understand the concept of modularity. Modularity is a measure that assigns a quality value to a division of a network. This quality reflects how much more strongly the nodes in a community are connected to each other than expected [3]. Modularity is calculated by taking, for each pair of nodes in the same group, the amount by which their connection exceeds the expected value from the configuration model [3]. These values are then summed [3], and the sum is divided by the number of half edges $d_t$ [3]. Mathematically this can be represented as

$$Q = \frac{1}{d_t} \sum_{ij} \left( A_{ij} - \frac{d_i d_j}{d_t} \right) \delta_{ij}$$

Here $A_{ij}$ is the adjacency matrix representing each edge in the graph, and $\delta_{ij}$ is the Kronecker delta, which is one if $i$ and $j$ are in the same group and zero otherwise. Modularity takes values less than one, and the further a division's value is from one the worse the quality of the division [3]. This concept of modularity gives a good framework for determining the quality of a division into non-overlapping communities.

3.3 Modularity based Community detection

With the concept of modularity it is now possible to divide a network into communities by maximizing this measure. Because optimizing this measure is an NP-hard problem, Newman has proposed several heuristic algorithms to perform these divisions [6].
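
As a small illustration of the modularity formula in Section 3.2, the following sketch computes $Q$ for a toy graph. It is not taken from the paper: the function name `modularity`, the list-of-lists adjacency format, and the six-node example graph (two triangles joined by one edge) are all assumptions made for illustration, and the graph is taken to be undirected and unweighted.

```python
from itertools import product


def modularity(adj, groups):
    """Newman modularity Q for an undirected, unweighted graph.

    adj    -- symmetric 0/1 adjacency matrix as a list of lists (assumed format)
    groups -- groups[i] is the community label of node i
    """
    n = len(adj)
    degree = [sum(row) for row in adj]   # d_i
    d_t = sum(degree)                    # total degree = number of half edges
    q = 0.0
    for i, j in product(range(n), repeat=2):
        if groups[i] == groups[j]:       # Kronecker delta: same group only
            expected = degree[i] * degree[j] / d_t   # configuration-model p_ij
            q += adj[i][j] - expected
    return q / d_t


# Hypothetical example: two triangles (0,1,2) and (3,4,5) joined by edge 2-3.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
natural_split = modularity(adj, [0, 0, 0, 1, 1, 1])   # one group per triangle
mixed_split = modularity(adj, [0, 1, 0, 1, 0, 1])     # triangles cut apart
one_group = modularity(adj, [0, 0, 0, 0, 0, 0])       # no division at all
```

In this example the split along the two triangles scores higher than the split that cuts across them, and placing every node in a single group gives a modularity of zero, which is the behavior the definition is meant to capture.
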
One approach is a fast greedy algorithm in which a graph is initially treated as $n$ separate groupings, and these groupings are joined together based on the merge that would most increase modularity. The algorithm runs until any further merging of groupings would result in a decrease in overall modularity [3]. The major problem with this algorithm, as can be seen in figure 1, is that it has a tendency to favor large communities over smaller ones, which is problematic for metropolitan area detection, where communities can be small as well as large [3].

The walktrap algorithm is another algorithm that we initially examined. It is based on the premise that a random walker, once it enters a community, should spend more time inside that community than outside of it. The algorithm works by defining a distance based on the transition probabilities from $i$ to $j$ in a particular number of steps [7]. With these transition probabilities serving as the distance, communities are merged together using a greedy approach similar to the one used for modularity [7]. This gives a sequence of partitions of the graph, one defined at each merge; the best partition is then chosen to be the one that maximizes modularity. The results from this algorithm can be found in figure 2.

Figure 1: Fast Greedy Algorithm applied to 1970 commuter data

Figure 2: Walktrap applied to 1970 commuter data

Looking at these plots, two major issues in relation to metropolitan areas arise. The first is that every county in these plots has been placed inside a community. This is not the case for metropolitan areas, since many counties are rural and not near any population center. The other major issue is that the community delineations are too large to represent metropolitan areas. For instance, almost the entire state of Florida has been placed in a single metro area by the walktrap algorithm, which is not appropriate. Similarly problematic is the fast greedy algorithm grouping essentially the entire west coast together in figure 1. For this reason we decided to take a different approach: algorithms that start from a seed node and work their way outwards to form communities, rather than looking at the entire graph topology at one time.

4 Extractive approaches

The less than satisfactory results of the previous approaches led to the consideration of extractive algorithms rather than the previous, more holistic ones. The difference is that extractive algorithms create communities starting from specific seed nodes. This is ideal for delineating metropolitan areas because the OMB has already defined them as containing counties with more than 50,000 people, so we know where to look. One algorithm used is the CCME algorithm, which was created by John Palowitch, Shankar Bhamidi, and Andrew Nobel. The other algorithm is tentatively named the dampened modularity algorithm and was created for this specific purpose.

4.1 CCME

The main purpose of the CCME algorithm is to search for statistically significant communities, and it was designed with weighted networks in mind. The basis for the CCME algorithm is the continuous configuration model, which is similar to the configuration model but is designed to account for the weights of edges in its calculation of the expected edge value [5]. It does this by using the same expression of probability found in the configuration model and, in a similar manner, creating a formula for the expected strength of connection between two nodes. Specifically, we can derive the expected strength of connection between nodes $i$ and $j$ by calculating the proportion of the total strength of the graph that node $i$ makes up and then multiplying that by the total strength of node $j$ [5].
This means that if, for example, 5% of the strength of a graph lies in node $i$, then after a random distribution of strengths we would expect 5% of the strength of node $j$ to be connected to node $i$. Mathematically this can be represented as

$$s_{ij} = \frac{s_i s_j}{s_t}$$

where $s_t$ is the total strength of the graph [5]. Dividing $s_{ij}$ by $d_{ij}$, the configuration-model probability of an edge between $i$ and $j$, gives the expected weight of each edge, which we define as $f_{ij}(s, d)$ [5]:

$$f_{ij}(s, d) = \frac{s_{ij}}{d_{ij}}$$

where $s$ is a vector containing the strengths of nodes $i$ and $j$, and $d$ is a vector containing the degrees of nodes $i$ and $j$ [5]. Along with the expected value, the weight distribution $F$ is characterized by a variance parameter called kappa. This value is calculated using the equation [5]

$$\kappa(d, s) = \frac{\sum_{ij : A_{ij} = 1} \left( W_{ij} - f_{ij}(s, d) \right)^2}{\sum_{ij : A_{ij} = 1} f_{ij}(s, d)^2}$$

where $W_{ij}$ is the observed weight of the edge between $i$ and $j$ and the sums run over the edges that exist in the graph.
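
The expected weight $f_{ij}$ and the variance parameter $\kappa$ above can be sketched for a small weighted graph as follows. This is a minimal illustration, not the CCME implementation: the function name `ccm_terms`, the list-of-lists weight matrix, and the three-node example are hypothetical, the graph is assumed undirected with no self loops, and $d_{ij}$ is taken to be the configuration-model probability $d_i d_j / d_t$ from Section 3.1.

```python
def ccm_terms(weights):
    """Expected edge weights f_ij and kappa under the continuous
    configuration model, for a symmetric weight matrix with zero diagonal.

    weights -- list of lists, weights[i][j] > 0 where an edge exists (assumed format)
    """
    n = len(weights)
    strength = [sum(row) for row in weights]                     # s_i
    degree = [sum(1 for w in row if w > 0) for row in weights]   # d_i
    s_t = sum(strength)
    d_t = sum(degree)

    f = [[0.0] * n for _ in range(n)]
    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            if i == j or degree[i] == 0 or degree[j] == 0:
                continue
            s_ij = strength[i] * strength[j] / s_t   # expected strength of connection
            d_ij = degree[i] * degree[j] / d_t       # configuration-model edge probability
            f[i][j] = s_ij / d_ij                    # expected weight of the edge
            if weights[i][j] > 0:                    # sums run over existing edges only
                numerator += (weights[i][j] - f[i][j]) ** 2
                denominator += f[i][j] ** 2
    kappa = numerator / denominator
    return f, kappa


# Hypothetical weighted triangle.
W = [
    [0, 10, 2],
    [10, 0, 4],
    [2, 4, 0],
]
f, kappa = ccm_terms(W)
```

In this toy graph the edge between nodes 0 and 1 has an observed weight of 10 against an expected weight of 7.875, so it is the edge that exceeds the null model by the largest margin.
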

Once it has these terms, the algorithm starts from a set of $n$ seed nodes and forms $n$ subgraphs $B_h$, where $h \le n$ [5]. For each subgraph, the algorithm then checks the surrounding nodes, adding a node to the community if its connectivity to the community is significant [5]. This decision is based on a multiple testing threshold on z scores, which are calculated by summing all the connections of a node to the community, subtracting the expected value of those same connections, and dividing by a standard deviation given by [5]

$$\sigma(i, B \mid \kappa(d, s))^2 = \sum_{j \in B} \frac{s_{ij}^2}{d_{ij}} \left( 1 - d_{ij} + \kappa(d, s) \right)$$

The algorithm tests all nodes repeatedly until $B_t = B_{t+1}$; at this point it terminates and $B_t$ is added to the list of communities [5]. The algorithm finishes by filtering out redundant communities based on the Jaccard similarity and returning the filtered list of communities [5].

This works well for networks without self loops. However, in our data it is also possible that a county node could be its own community based on its self commuting, and in this case the lack of self loops becomes an issue. While the initial algorithm does not use self loops, we have modified it to account for them. We did this by breaking the system into two stages: one that operates as the algorithm was intended, and one that identifies statistically significant self commuting communities, $s^3c^2$. Specifically, this is done by adding a few parameters to the algorithm. One addition is a separate $\kappa$ for self loops and for non self loops, each of which functions the same way as the original $\kappa$. We also add a $p$, which represents the proportion of total commuters in the graph that are self commuters.
The specifics of the self loop and non self loop kappa values are as follows:

$$\kappa_{\text{nonselfloop}} = \frac{\sum_{i, j : i \ne j} \left( W_{ij} - (1 - \rho_i) \frac{s_i s_j / s_t}{d_i d_j / d_t} \right)^2}{\sum_{i, j : i \ne j} (1 - \rho_i)^2 \left( \frac{s_i s_j / s_t}{d_i d_j / d_t} \right)^2}$$

$$\kappa_{\text{selfloop}} = \frac{\sum_i \left( W_{ii} - p s_i \right)^2}{\sum_i p^2 s_i^2}$$

where

$$\rho_i = \frac{W_{ii}}{\sum_{j : j \ne i} W_{ij}} \quad \text{and} \quad p = \frac{1}{n} \sum_{i=1}^{n} \rho_i$$

From here the algorithm runs twice: once to detect communities without self loops and once to detect the $s^3c^2$ communities.

4.2 Dampened Modularity

A second approach that we have applied to detecting metropolitan areas is based on two premises. One is that there are nodes with attributes that indicate they are central to a community. The second is that there is a limit, dependent on geodesic distance, on how far a node can be from the seed node and still be part of the community. The algorithm works in three main phases: gathering, refining, and filtering.

4.2.1 Gathering potential nodes

The algorithm works in the following way. The nodes that indicate where a community should be are taken to be seed nodes; from there we collect all nodes that have a strong connection to that seed node. From this new set of nodes, all the nodes with a strong connection to these nodes are collected, and the process continues until no more nodes can be added. The algorithm also includes a dampening factor on the weight of an edge to account for the fact that, at increasing geodesic distances from the seed node, it should be harder for a node to be in a community with that seed node. This makes sense because with each geodesic level away from the seed node, the connectivity of a node relative to the initial seed node is necessarily less, since the connection is less direct. In step form, the algorithm looks like this:

1: Identify seed nodes.
2: Calculate an expected weight for each edge based on strength and degree.
3: Check incoming and outgoing edges and add to the community all nodes that have an actual edge weight greater than the expected value.
4: Repeat step three for the nodes most recently added to the community until no more nodes can be added.

Mathematically this can be represented as an optimization problem. For all $i$ in the set of seed nodes, maximize

$$y_i = \sum_j \left( \theta^{x-1} \omega_{ij} - E(\omega_{ij}) \right) \delta_{ij}$$

where

$x$ - is the geodesic distance between nodes $i$ and $j$
$\omega_{ij}$ - is the weight of the edge between nodes $i$ and $j$
$E(\omega_{ij})$ - is the expected weight
$\delta_{ij}$ - is the Kronecker delta indicating whether node $j$ is in group $i$
$\theta$ - is the dampening factor

This function will be maximized by selecting all nodes such that

$$\theta^{x-1} \omega_{ij} - E(\omega_{ij}) \ge 0$$

4.2.2 Refining Communities

The algorithms that maximize modularity do more than just choose nodes with strong connections; they also choose the groupings that have the strongest connections with each other.
Since the initial stage of the algorithm only requires a strong connection with one node in the community for a new node to be added, a further step is needed to ensure strong connectivity with the other nodes in the community. OSLOM, written by Lancichinetti et al., ranks the nodes in a community by their connectivity to the community, checks whether the least connected node is strongly connected to the community, and removes it if it is not; communities here can be refined in a similar manner [1]. Specifically, for each node $i$ in a community $C_i$ we can define its connectivity to $C_i$ by counting the number of nodes in the community that node $i$ is strongly connected to and dividing by the number of nodes in the community. This is contingent on defining a minimum threshold $t$ for what constitutes a strongly connected community. Mathematically, for a community with $n$ nodes this would be

$$\text{Connectivity}_{C_i} = \frac{\sum_{j=1}^{n} f(j)}{n}, \qquad f(j) = \begin{cases} 1 & \omega_{ij} \ge E(\omega_{ij}) \\ 0 & \text{otherwise} \end{cases}$$

Once these connectivity values have been calculated, we take the smallest of them and check that it is at least as large as the threshold value $t$. If it is not, that node is removed and the ranking process starts over, repeating until no more nodes can be removed [1]. Once this is completed, the communities should resemble the result of a modularity optimization technique.

4.2.3 Filtering communities

Because seed nodes can be close together, it is possible that the dampened modularity algorithm produces several communities that are approximately the same. The question is how to use this information to represent the communities. A couple of approaches attempted in this research required calculating similarities between communities. These similarities can be calculated using the Jaccard similarity measure, which is the size of the intersection of two sets divided by the size of their union [5]. Which communities are similar enough can be decided by choosing a threshold value, which can then be tuned in a similar way to the dampening factor.

One approach would be to merge communities whose similarity is above the threshold into a single community. As simple as this approach is, it has the issue of fundamentally changing the composition of the community, so that connectivity is no longer guaranteed. This was observed empirically when the approach was applied to the communities produced by the dampened modularity algorithm. Another approach is to filter the communities whose similarity is above the threshold and select which one is the better community. At the moment this is done by looking at the connectivity of the nodes inside the community relative to the nodes outside of the community.
This connectivity is defined relative to nodes outside the community $C$ by

$$F = \frac{\sum_{i \in C} \rho_i s_i}{\sum_{i \in C} (1 - \rho_i) \sum_{j \ne i,\, j \notin C} s_j}$$

Also, in the case where we are not allowing overlapping communities, we could easily adopt a methodology that merges similar communities together if their Jaccard value is above the threshold and, if the Jaccard value is between zero and the threshold, assigns the overlapping nodes to the community to which they are most strongly connected.

4.2.4 Determining the dampening factor

The validity of the dampening factor still needs to be demonstrated; however, at the moment it seems to be a reasonable addition because of the theoretical possibility that adding nodes to a community based only on their actual edge strength minus their expected edge strength could result in the community including a large node set that lacks any semblance of connectivity. Thus it seems reasonable to impose a dampening factor that scales with geodesic distance from the seed node to prevent that from happening.

Assuming that the dampening factor is significant, it is also necessary to determine what its value should be. A value that is too small would result in only nodes at geodesic distance one from the seed node being collected. On the other hand, a value that is too large could result in the issue mentioned above, with far too many nodes being added. The easiest way to determine the value is to tune the dampening parameter to the value that results in the best community divisions. To do this we need some measure of how well a network is divided into overlapping groups. The overlapping aspect is significant because it is reasonable for nodes to be part of more than one community in this commuter data, just as a person could be in more than one friend group. The most widely used quality measure is Newman's modularity, discussed earlier; however, it is unclear how well this measure will fare at judging the overlapping communities generated by this algorithm. An alternative measure proposed by Shen et al. deals with determining the modularity of overlapping communities [4]. The formula they derived for $K$ communities is [4]

$$Q_{ov} = \frac{1}{2M} \sum_{w=1}^{K} \sum_{i \in C_w,\, j \in C_w} \frac{1}{O_i O_j} \left( A_{ij} - \frac{d_i d_j}{2M} \right)$$

where $C_w$ is community $w$, $A_{ij}$ is the value of the edge from $i$ to $j$, $O_i$ is the number of communities node $i$ belongs to, and $M$ is the number of edges. Because this metric is a modification of the modularity measure, it allows for the substitution of a weighted null model. Thus in the case where the continuous configuration null model is used, it can easily be plugged in and the metric will still work.
This method has also been shown to be similarly effective at evaluating the quality of a division as other overlapping metrics tested in a survey paper by Tanmoy Chakraborty et al., while being less computationally intensive than the other metrics tested [9]. It is also possible that, once the dampening factor has been justified, it could be defined numerically based on a value like graph density. This would decrease the run time of the algorithm because there would be no need for tuning.

4.2.5 The expected value

The other major question that arises is how best to calculate the expected value used in the optimization. Attempts in this project have included using a fixed expected value of .01 and one determined from a Taylor approximation. The method that seems to work best, however, is the continuous configuration model, although other methods, such as defining degree conditional distributions, could be used as well.
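
The gathering and refining phases described in Section 4.2 can be sketched as below. This is an illustrative reading of the steps rather than the author's implementation: the names `gather`, `connectivity`, and `refine`, the edge-dictionary input format, and the four-node example are hypothetical. Two assumptions are worth stating: the geodesic distance $x$ is approximated by the gathering round in which a node is reached, and a node counts as strongly connected to another if either the incoming or the outgoing edge meets its expected weight.

```python
def gather(weights, expected, seed, theta):
    """Gathering phase: collect candidate nodes for one seed.

    weights, expected -- dicts mapping a directed edge (i, j) to its actual
                         and expected weight (hypothetical input format)
    theta             -- dampening factor, applied once per level past the first
    """
    neighbours = {}
    for i, j in weights:
        neighbours.setdefault(i, set()).add(j)
        neighbours.setdefault(j, set()).add(i)

    community, frontier, level = {seed}, {seed}, 1
    while frontier:
        damp = theta ** (level - 1)          # theta^(x - 1)
        added = set()
        for i in frontier:
            for j in neighbours.get(i, ()):
                if j in community:
                    continue
                # Both outgoing (i, j) and incoming (j, i) edges are checked.
                for edge in ((i, j), (j, i)):
                    if edge in weights and damp * weights[edge] - expected[edge] >= 0:
                        added.add(j)
        community |= added
        frontier = added                     # repeat only for newly added nodes
        level += 1
    return community


def connectivity(i, community, weights, expected):
    """Share of the community that node i is strongly connected to."""
    strong = 0
    for j in community:
        if j == i:
            continue
        if any(e in weights and weights[e] >= expected[e] for e in ((i, j), (j, i))):
            strong += 1
    return strong / len(community)


def refine(community, weights, expected, t):
    """Refining phase: drop the least connected node until all reach threshold t."""
    community = set(community)
    while len(community) > 1:
        # sorted() only makes tie-breaking deterministic in this sketch
        weakest = min(sorted(community),
                      key=lambda i: connectivity(i, community, weights, expected))
        if connectivity(weakest, community, weights, expected) >= t:
            break
        community.remove(weakest)
    return community


# Hypothetical chain of counties A -> B -> C -> D with a flat expected weight.
weights = {("A", "B"): 10.0, ("B", "C"): 6.0, ("C", "D"): 5.0}
expected = {edge: 4.0 for edge in weights}
tight = gather(weights, expected, "A", theta=0.7)
loose = gather(weights, expected, "A", theta=1.0)
```

With a dampening factor of 0.7 the chain stops before the most distant node, while a factor of 1 collects every node, which mirrors the concern about undampened gathering reaching too far. Raising the refinement threshold in this example can also remove the seed itself, the same effect the tuning procedure later penalizes.
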

4.2.6 Application in relation to CCME

In its current form the dampened modularity algorithm could prove valid as an accurate and legitimate community detection algorithm. What may be a better application for this algorithm, however, is creating the subset of nodes that could be in a seed node's community. This is significant because in our current usage of the CCME algorithm we have been getting communities that are far larger than we would expect. By applying the gathering portion of the dampened modularity algorithm we could gather a set of feasible nodes. One could then apply the CCME algorithm to that set for the corresponding seed and find a statistically significant community. This should reduce the overall run time of the CCME algorithm, because it has fewer nodes to check than it would if no such restriction had been applied. This technique should also result in a smaller set of nodes within a community. The key to getting this to work would be demonstrating that it is in fact possible to find a dampening factor for which the gathered nodes include all possible community nodes for a given seed without collecting every node in the graph.

4.3 A dynamic and possibly predictive alteration

Using a framework set up in the OSLOM paper, it is possible to adapt both the dampened modularity and CCME algorithms to dynamic datasets due to their extractive nature. The importance of their extractive nature comes from the fact that these algorithms can start from an initial community generated at time $t$ and detect communities at time $t + \Delta t$ using that initial community as a starting point [1]. Specifically, this would mean taking the initial community, adding nodes in a gathering stage based on data from $t + \Delta t$, and then trimming the communities in a refining phase based on data from $t + \Delta t$ [1].
It is necessary to add a comparison of the seed nodes at time $t$ and at time $t + \Delta t$, so that if a new node becomes a seed node it has a community created around it. Once that is done, the dampened modularity algorithm can be applied using data from $t + \Delta t$, using the communities from time $t$ in addition to the new seed nodes as starting points. It should be noted that, because the dampened modularity algorithm relies on geodesic distance, the node from which that distance is measured is the one with the largest seed criterion value. For example, in our network data we have used population as the seed determinant, so if there were seven nodes in a community then the node with the largest population at time $t + \Delta t$ would be the one from which geodesic distance is calculated.

For the CCME algorithm the approach in OSLOM could be applied, but it would be necessary to filter the communities on the data from time $t + \Delta t$ to ensure that all nodes still have enough connectivity with the community. This could be done using the approach mentioned in section […]. Once the filtering is done, the CCME can be run on the data at time $t + \Delta t$ using those filtered communities as starting points. It will also have to perform the full algorithm on the new seed nodes that appeared in the time $t + \Delta t$ dataset.

If these dynamic alterations are appropriate, this same dynamic approach could be used for prediction of communities if one were able to predict the weights of the edges at some future time.

5 Application to the 2010 commuter data

5.1 Application of the Dampened Modularity Algorithm

In applying the Dampened Modularity Algorithm to the 2010 commuter data it was vital to tune the dampening factor, the connectivity parameter, and the similarity threshold for filtering. These values are tuned by maximizing the overlapping modularity measure, using the CCM as the null model. A further modification of the overlapping modularity measure is to subtract from it the number of initial seed nodes that do not remain in communities. This is done because after running the algorithm it is possible for initial seed nodes not to remain in the finalized groups, either because they are not strongly connected to their group or because they are not in the most connected of the overlapping groups; the penalty accounts for the seed counties that do not appear in the final communities. To save time these values will only be tuned to the tenths place.

The dampening factor is tuned by looking at the quality values for dampening factors from 1 down to .1, with the other two thresholds set to zero. From those quality values, the dampening factors with the two highest quality scores are selected and tested on the connectivity measure. For the connectivity measure and the similarity threshold, the tuning looks at .9, .5, and .1. The values at which the two largest overlapping modularity values occur then bound the interval in which the remaining values are tested. So if .5 and .1 had the two largest overlapping modularity values, .4, .3, and .2 would also be tested to find the connectivity value. The reason .9 is used as the maximum instead of 1 is that 1 means perfectly connected, or exactly the same, for each respective measure.
These values are not going to occur often and thus are not going to be very useful for this particular problem. The dampening factor results can be seen in figure 4. In this figure it is clear that the two highest quality values are achieved at the highest and lowest dampening factors tested. Because of the negative penalty added to the overlapping modularity measure, it seems reasonable to test both of these values at each connectivity level and compare the results, since connectivity could affect them in very different ways. From those results it appears that a connectivity threshold of .3 combined with a dampening factor of 1 produces the highest quality measure. With these values the final step was to choose a similarity threshold for filtering, which from the graph was .7.
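
The quality score used for tuning, overlapping modularity minus the number of seed nodes that drop out, can be sketched as follows. This is a simplified illustration and not the code used for the results: the names `overlapping_modularity` and `penalized_quality` are hypothetical, the example graph is made up, and for brevity the sketch uses the unweighted configuration-model null term $d_i d_j / 2M$ rather than the CCM null model used in the actual tuning. A seed is treated as missing if it belongs to none of the final communities.

```python
def overlapping_modularity(adj, communities):
    """Overlapping modularity in the style of Shen et al. for an undirected graph.

    adj         -- symmetric 0/1 adjacency matrix as a list of lists (assumed format)
    communities -- list of sets of node indices; sets may overlap
    """
    n = len(adj)
    degree = [sum(row) for row in adj]       # d_i
    two_m = sum(degree)                      # 2M, the total degree
    membership = [0] * n                     # O_i: communities node i belongs to
    for community in communities:
        for i in community:
            membership[i] += 1

    q = 0.0
    for community in communities:
        for i in community:
            for j in community:
                null_term = degree[i] * degree[j] / two_m
                q += (adj[i][j] - null_term) / (membership[i] * membership[j])
    return q / two_m


def penalized_quality(adj, communities, seeds):
    """Overlapping modularity minus the number of seeds left out of every community."""
    covered = set().union(*communities) if communities else set()
    missing = sum(1 for seed in seeds if seed not in covered)
    return overlapping_modularity(adj, communities) - missing


# Hypothetical example: two triangles (0,1,2) and (3,4,5) joined by edge 2-3,
# with nodes 0 and 3 acting as seeds.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
both_kept = penalized_quality(adj, [{0, 1, 2}, {3, 4, 5}], seeds=[0, 3])
one_lost = penalized_quality(adj, [{0, 1, 2}], seeds=[0, 3])
```

When the communities do not overlap, every $O_i$ is one and the measure reduces to ordinary modularity. Because modularity is below one, losing a single seed outweighs any gain in the modularity term, which is one way to read why the penalty pushes the tuning toward parameter values that keep seeds in their communities.
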

Figure 4: Dampening factor tuning

Figure 5: Connectivity Optimization

Figure 6: Filtering Optimization

Plugging these values into the algorithm results in the following set of graphs, with each layer depicting disjoint communities.


Figure 7: Dampened Modularity Communities

These graphs represent 347 community divisions. On average these areas contain …06055 counties with a standard deviation of 3.044 counties. The largest division by population was the Los Angeles Metro Area, with a size of 15,620,448 people and including Los Angeles County, Orange County, Riverside County, and San Bernardino County. The size distribution is depicted in figure 8.

Figure 8: Dampened Modularity size distribution

5.2 Application of CCME

The CCME algorithm was much more straightforward to apply due to its lack of parameter tuning. The only necessary step was identifying the initializing nodes. This was done by looking for counties that have at least 1% self commuters relative to total commuters, as long as the number of self commuters is at least 20,000. This was done to select large population centers, but in a way that was more dependent on the network data. The resulting non-overlapping communities are similarly depicted in the following plots, with the final plot representing single county communities.

Figure 9: CCME communities

The CCME algorithm produced 214 communities with an average size of 11.01 counties and a standard deviation of 10.67 counties. The distribution of the community sizes is depicted in the following plot.

Figure 10: CCME community size distribution

5.3 Comparing to MSA results

Figure 11: MSA Metro Areas [8]

The metropolitan communities are outlined in red and the micropolitan communities are outlined in blue. Micropolitan communities differ from metropolitan communities by requiring at least one area with a population of 20,000, versus 50,000 for metropolitan areas [8]. The MSA has identified 919 communities. The average size of the MSA metro communities is 1.94 counties, with a standard deviation of […]. The distribution of these community sizes can be found in figure 12.

Figure 12: MSA Size of Communities

When comparing the three result sets, the first noticeable difference is that the MSA tended to produce the smallest communities, followed by the Dampened Modularity method, while the CCME formed much larger communities than the other two. Unsurprisingly, the number of communities identified by each method follows the reverse order of community size, with the CCME finding the fewest communities and the MSA the most. The size distributions also differ in an interesting way: the MSA and CCME methods produce a fairly exponential distribution of community sizes, while the Dampened Modularity approach does not seem to produce anything resembling a known distribution.

5.4 Finding meaning

In assessing the two methods against the MSA delineations, one big advantage is that, because of the structure of the algorithms, there is a reason for each community that can be supported by data. The same cannot be said for the MSA delineations, which depend in part on historical delineations [2]. The other advantage that the algorithms have over the MSA is that they identify overlapping communities, which makes intuitive sense because these counties interact with so many other counties.

The particular algorithms also have several disadvantages. For the CCME, the major issue is that it forms communities that are almost twice as large as the ones defined by the other two methods. The reason may be the way the test statistic is defined, namely that the connectivity is the sum of weights in a community compared to the expected weights in that community.
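A small sketch may help show how a sum-based connectivity of this kind behaves. It compares a candidate county's observed edge weights into a community with the expected weights, once as a plain sum and once using only the middle 50% of the per-edge differences. The function names, the toy weights, and the use of a simple difference in place of the actual CCME test statistic are all assumptions made for illustration.

```python
def sum_connectivity(observed, expected):
    """Total observed weight into the community minus total expected weight."""
    return sum(observed) - sum(expected)


def middle_half_connectivity(observed, expected):
    """Same comparison, keeping only the middle 50% of per-edge differences.

    Dropping the lowest and highest quarter of the differences limits the
    influence of a single unusually heavy edge.
    """
    excess = sorted(o - e for o, e in zip(observed, expected))
    k = len(excess) // 4  # number of values trimmed from each end
    trimmed = excess[k:len(excess) - k]
    return sum(trimmed)


# Hypothetical candidate county with one very heavy edge into the community
# and seven edges that fall below their expected weight.
observed = [5000, 10, 12, 8, 9, 11, 10, 10]
expected = [100, 50, 50, 50, 50, 50, 50, 50]

total_score = sum_connectivity(observed, expected)           # 4620
robust_score = middle_half_connectivity(observed, expected)  # -159
print(total_score, robust_score)
```

In this toy case the sum-based score is strongly positive because of one edge, while the trimmed score is negative, which is the kind of disagreement a more outlier-resistant definition of connectivity would be meant to expose.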
The problem is that a single edge weight that greatly exceeds the expected number could skew this statistic enough that loosely connected counties end up being added, when they should not have been, solely because they have a strong connection with one county in the community. This could be corrected by adjusting the manner in which the connectivity of a node to a set is defined. For example, the connectivity could be defined by the middle 50% of the weights, or by some other measure that is less affected by large outliers than the overall connectivity. A different option would be to trim communities once they are formed, in a manner similar to OSLOM, ensuring that there is some type of minimum connectivity between nodes.

For the Dampened Modularity algorithm, a problematic occurrence is that, as noted in section 5.1, it at times forms communities that do not include the seed nodes. One possible cause is that the quality measure used does not properly represent the best delineations of communities, which could result in a bad choice of dampening factor, connectivity threshold, etc. Another possibility is that the way in which the seed nodes are chosen is not necessarily representative of where the communities should be, which could again result in those nodes falling out of the communities that they seeded. This could be corrected by using the procedure used by the CCME, or possibly a centrality measure, to determine what should be a seed node.

Another interesting occurrence in the Dampened Modularity results is that the size distribution of communities is so wildly different from that of the other two methods. This difference could be evidence that the method simply does not work, or it could be caused by the method being derived from one designed to identify community delineations for a whole graph. Specifically, because only certain communities are chosen from the graph using a method designed to identify every community present in it, the distribution could differ from the one expected had every community been chosen. Further investigation into the algorithm is therefore needed to make a complete determination of its validity.

6 Summary

After examining the work done so far this year, it seems we have the beginnings of a strategy that will allow us to robustly determine these metropolitan areas. A theoretical proof of the validity of the dampening factor would be useful in the future, especially given its debatable importance in the 2010 data.
Both the CCME method and the Dampened Modularity method can and should be further modified to better capture the metropolitan community structure. However, both algorithms result in divisions that can be supported by data, which is an improvement over the MSA results. There are also numerous other algorithms capable of making these same divisions that could be considered. Finally, the longitudinal aspect of the algorithms could be further explored to see how the communities evolve over time. This could be done using an OSLOM-type edge addition and subtraction technique as described in section 4.3. If this approach is refined, it would then be a natural extension to attempt to predict what the communities would be in the future, which would hopefully give policy makers a better long-term picture of the problems that they could face.

References

[1] Andrea Lancichinetti, Filippo Radicchi, José J. Ramasco, and Santo Fortunato. Finding statistically significant communities in networks.
[2] United States Census. Metropolitan and micropolitan. January Accessed on
[3] Santo Fortunato. Community detection in graphs.
[4] Huawei Shen, Xueqi Cheng, Kai Cai, and Mao-Bin Hu. Detect overlapping and hierarchical community structure in networks.
[5] John Palowitch, Shankar Bhamidi, and Andrew Nobel. Significance-based community detection in weighted networks.
[6] Mark Newman. Networks: An Introduction. Oxford University Press,
[7] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks.
[8] Steven G. Wilson, David A. Plane, Paul J. Mackun, Thomas R. Fischetti, and Justyna Goworowska. Patterns of metropolitan and micropolitan population change: 2000 to
[9] Tanmoy Chakraborty, Ayushi Dalmia, Animesh Mukherjee, and Niloy Ganguly. Metrics for community analysis: a survey.
