Metropolitan Community Delineation. Community detection algorithms. Nathaniel Pritchard


Metropolitan Community Delineation
Community detection algorithms
Nathaniel Pritchard
Advised by: Professor Shankar Bhamidi
Reviewed by: Professor Shankar Bhamidi, Professor Sayan Banerjee
Statistics and Operations Research
The University of North Carolina at Chapel Hill
April 2, 2018

1 Introduction

There are endless ways in which we can group people together in space; the most common are cities, towns, villages, and metropolitan areas. Today we define cities and metropolitan areas using a mixture of techniques, such as historical boundaries and strict rules dealing with commuting and population density, to determine the outline of a metropolitan area [2]. The way in which these delineations are determined is important because they are the basis for various statistical calculations detailing anything from community diversity to median incomes. These statistics are then used to help determine economic and infrastructure initiatives, which could be beneficial to the community but could also be wasteful if the statistics do not represent the true community. Thus a robust method of community delineation that captures the true structure of connections would allow for more representative statistics. These more representative statistics would give policy makers a better understanding of the issues, and with this better understanding they should be better able to solve them.

This robust method can be found by turning to network theory, the study of connections between entities. This field is a useful one because metropolitan areas are supposed to represent closely connected areas. Specifically, it is useful to look to network theory for an algorithm that identifies community structure from data on the workers who commute to and from various counties. Using these techniques we should be able to provide delineations that are judged to be both significant and proper.

2 Network Theory Background

Networks are the way that we mathematically capture the connections between entities. They can be used to describe social connections, food chains, protein structures, or, in our case, commuter connections.
These networks are built from entities, called nodes in network language, whether they be people, internet devices, or animals, that are connected to other entities via edges [6]. A network can be weighted, in which case each edge takes on a certain value, or unweighted, in which case an edge either exists or does not. There is also no requirement that edges be bi-directional: if every edge in a network is bi-directional then the network is undirected; otherwise it is directed, meaning that at least some connections are one-way paths [6]. The edges between nodes can be recorded in an adjacency matrix $A$, where the $ij$th entry is the weight of the edge between $i$ and $j$, or zero if no edge exists [6]. In an unweighted network the entry is one if the edge exists and zero otherwise. Additionally, in the undirected case, if the $ij$th entry exists then the $ji$th entry exists and has the same value [6].

Once the network has been established there needs to be a way to pull out information about its structure. There are several ways to get a rudimentary idea of network structure; however, only two are necessary for this paper. One is degree, which in an undirected graph is the number of edges connected to a node [6]. In a directed graph there are an incoming and an outgoing degree, which are the number of edges coming into a node and leaving a node, respectively [6]. If there are $n$ total nodes in a graph, the degree of a particular node $i$ is denoted $d_i$. The total degree $d_t$ of the graph can be defined through

$$d_t = \sum_{i=1}^{n} d_i$$

The other important measure is geodesic distance, which is the shortest distance between two nodes [6]. Distance in networks is the number of edges separating two nodes from one another. For example, if the fewest jumps needed to connect Kevin Bacon to Idris Elba were six, then six would be the geodesic distance between them.

3 Community detection: basic algorithms

What is a community? In the simplest terms it is a grouping of nodes that is more connected within itself than with the rest of the network [6]. This simple definition has led to numerous different ways of finding communities. One way of dividing a network into communities, proposed by Mark Newman, is to look at where the number of edges is greater than the number expected if the graph were randomly generated with the same degree distribution. By maximizing that measure one can identify communities. The model Newman uses for the expected value of an edge is the configuration model [6]. Other techniques do things like maximize the number of triangles, which occur when three nodes are all connected to each other [6].

3.1 The configuration model

The configuration model is designed for unweighted networks. In this model each node is assumed to have the same degree as the corresponding node in the actual graph being examined [6]. From this setup the probability of a connection between nodes $i$ and $j$ can be easily defined. This is done by thinking of the degree of a node as the number of half edges attached to it [6]. An edge then forms when two half edges point toward each other [6].
From this idea we define the total number of half edges of a graph with $n$ nodes to be

$$d_t = \sum_{i=1}^{n} d_i$$

Since there are $d_t$ half edges in total, the chance of a given half edge from node $i$ pointing in the direction of node $j$ is $\frac{d_j}{d_t - 1}$, since it has to point at one of the $d_j$ half edges connected to $j$ out of the total number of half edges in the graph excluding itself [6]. Then, since there are $d_i$ half edges leaving node $i$, the probability is

$$p_{ij} = \frac{d_i d_j}{d_t - 1}$$

[6]. If the network is large, $d_t$ is large, so the effect of the $-1$ in the denominator is minimal and it can be removed, making the probability

$$p_{ij} = \frac{d_i d_j}{d_t}$$

In an unweighted network this probability of an edge is also the expected value of the edge, since an edge can only take a value of zero or one. Thus by calculating this probability we have a good null model for expected edges from which to try to determine community structure in a graph [6].

3.2 Modularity

With this null model described it is now possible to understand the concept of modularity. Modularity is a measure that assigns a quality value to a division of a network. The quality is high when the nodes in a community have a higher than expected connection with each other [3]. Modularity is calculated by summing the amount by which each connection within a particular division exceeds the expected value from the configuration model [3]. These values are then summed over all divisions [3], and the total is divided by the number of half edges, $d_t$ [3]. Mathematically this can be represented as

$$Q = \frac{1}{d_t} \sum_{ij} \left( A_{ij} - \frac{d_i d_j}{d_t} \right) \delta_{ij}$$

Here $A_{ij}$ is the adjacency matrix entry for the edge between $i$ and $j$, and $\delta_{ij}$ is the Kronecker delta, which is 1 if $i$ and $j$ are in the same group and zero otherwise. Modularity takes values less than one, and the further the value is from one, the worse the quality of the division [3]. This concept of modularity gives a good framework for judging the quality of a division into non-overlapping communities.

3.3 Modularity based Community detection

With the concept of modularity it is now possible to divide a network into communities by maximizing this measure. Because the optimization of this measure is an NP-hard problem, Newman has proposed several algorithms to perform these divisions [6].
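The modularity formula of section 3.2, with the configuration model of section 3.1 as the null model, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `modularity`, the list-of-lists adjacency matrix, and the six-node toy graph are hypothetical and are not taken from the commuter data or from any particular library.

```python
def modularity(adj, groups):
    """Modularity Q of a partition of an undirected, unweighted graph.

    adj    -- symmetric 0/1 adjacency matrix as a list of lists
    groups -- groups[i] is the community label of node i
    """
    degrees = [sum(row) for row in adj]   # d_i
    d_t = sum(degrees)                    # total degree (number of half edges)
    total = 0.0
    for i, row in enumerate(adj):
        for j, a_ij in enumerate(row):
            if groups[i] == groups[j]:    # Kronecker delta
                expected = degrees[i] * degrees[j] / d_t  # configuration model
                total += a_ij - expected
    return total / d_t


# Toy graph (hypothetical): two triangles joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1

q_split = modularity(adj, [0, 0, 0, 1, 1, 1])  # one community per triangle
q_whole = modularity(adj, [0, 0, 0, 0, 0, 0])  # everything in one community
print(q_split, q_whole)
```

On this toy graph the split into two triangles scores 5/14 (about 0.357), while placing every node in one community scores zero, which matches the intuition that modularity rewards divisions with more internal edges than the null model expects.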
One approach is a fast greedy algorithm in which a graph is initially treated as $n$ different groupings, and these groupings are joined together based on the merge that would most increase modularity. The algorithm runs until any further merging of groupings would result in a decrease in overall modularity [3]. The major problem with this algorithm, as can be seen in figure 1, is that it has a tendency to favor large communities over smaller ones, which is problematic for metropolitan area detection, where communities could be small as well as large [3].

The walktrap algorithm is another algorithm that we initially examined. It is based on the premise that a random walker should spend more time inside a community than outside of it once the walker enters the community. The algorithm works by defining a distance based on the transition probabilities from $i$ to $j$ in a particular number of steps [7]. With these transition probabilities serving as the distance, communities are merged together using a greedy approach similar to the one used for modularity [7]. This gives a sequence of partitions of the graph, one defined at each merge; the best partition is then chosen to be the one that maximizes modularity. The results from this algorithm can be found in figure 2.

Figure 1: Fast Greedy Algorithm applied to 1970 commuter data

Figure 2: Walktrap applied to 1970 commuter data

Looking at these plots, two major issues in relation to metropolitan areas arise. The first is that every county in these plots has been placed inside a community. This is not the case for metropolitan areas, since many counties are rural and not near any population center. The other major issue is that the community delineations are too large to represent metropolitan areas. For instance, almost the entire state of Florida has been placed in a single metro area by walktrap, which is not appropriate. Similarly problematic is the fast greedy algorithm grouping essentially the entire west coast together in figure 1. For this reason we decided to take a different approach: algorithms that start from a seed node and work their way outwards to form communities, rather than looking at the entire graph topology at one time.

4 Extractive approaches

The less than satisfactory results of the previous approaches led to the consideration of extractive algorithms rather than the previous, more holistic ones. The difference is that in extractive approaches the algorithms create communities starting from specific seed nodes. This is ideal for delineating metropolitan areas because the OMB has already defined them as containing counties with more than 50,000 people, so we know where to look. One algorithm used is the CCME algorithm, which was created by John Palowitch, Shankar Bhamidi, and Andrew Nobel. The other is tentatively named the dampened modularity algorithm and was created for this specific purpose.

4.1 CCME

The main purpose of the CCME algorithm is to search for statistically significant communities, and it was designed with weighted networks in mind. The basis for the CCME algorithm is the continuous configuration model, which is similar to the configuration model but accounts for the weights of edges in its calculation of the expected edge value [5]. It does so by using the same expression for probability found in the configuration model and, in a similar manner, creating a formula for the expected strength of connection between two nodes. Specifically, we can derive the expected strength of connection between nodes $i$ and $j$ by calculating the proportion of the total strength of the graph that node $i$ makes up and then multiplying that by the total strength of node $j$ [5].
This means that if, for example, 5% of the strength of a graph lies in node $i$, then after a random distribution of strengths we would expect 5% of the strength of node $j$ to be connected to node $i$. Mathematically this can be represented as

$$s_{ij} = \frac{s_i s_j}{s_t}$$

where $s_t$ is the total strength of the graph [5]. Dividing $s_{ij}$ by $d_{ij}$, the configuration-model edge probability $\frac{d_i d_j}{d_t}$, gives the expected weight of each edge, which we define as $f_{ij}(s, d)$ [5]:

$$f_{ij}(s, d) = \frac{s_{ij}}{d_{ij}}$$

where $s$ is a vector containing the strengths of nodes $i$ and $j$, and $d$ is a vector containing the degrees of nodes $i$ and $j$ [5]. Along with the expected value, the distribution $F$ is characterized by a variance parameter called kappa, calculated as

$$\kappa(d, s) = \frac{\sum_{ij, A_{ij}=1} \left( W_{ij} - f_{ij}(s, d) \right)^2}{\sum_{ij, A_{ij}=1} f_{ij}(s, d)^2}$$

[5]. Once it has these terms, the algorithm starts from a set of $n$ seed nodes and forms $n$ subgraphs $B_h$, where $h \leq n$ [5]. For each subgraph, the algorithm then checks the surrounding nodes, adding a node to the community if its connectivity to the community is significant [5]. This decision is based on a multiple testing threshold applied to z scores, which are calculated from the sum of all the connections of a node to a community and the expected value of those same connections [5]. The expected value is subtracted from the observed sum, and the difference is divided by the standard deviation given by

$$\sigma(i, B \mid \kappa(d, s))^2 = \sum_{j \in B} \frac{s_{ij}^2}{d_{ij}} \left( 1 - d_{ij} + \kappa(d, s) \right)$$

[5]. The algorithm tests all nodes until $B_t = B_{t+1}$; at this point it terminates and $B_t$ is added to the list of communities [5]. The algorithm finishes by filtering out all redundant communities based on the Jaccard similarity and returning the filtered list of communities [5].

This works well for networks without self loops. However, in our data it is also possible that a county node could be its own community based on its self commuting, in which case the lack of self loops becomes an issue. While the initial algorithm does not use self loops, we have modified it to account for them. We did this by breaking the system into two stages: one that operates as the algorithm was intended, and one that identifies statistically significant self commuting communities, $s^3c^2$. Specifically, this is done by adding a few parameters to the algorithm. One addition is a separate $\kappa$ for self loops and for non self loops, each of which functions the same way as the original $\kappa$. We also add a $p$, which represents the proportion of total commuters in the graph that are self commuters.
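As a rough illustration of the base quantities of the continuous configuration model (before any self-loop modification), the sketch below computes the expected edge weights $f_{ij}$ and the variance parameter $\kappa$ for a small weighted graph. It assumes a symmetric weight matrix without self loops and takes $d_{ij} = d_i d_j / d_t$; the function name `ccm_quantities` and the toy weights are hypothetical, and this is not the CCME implementation of [5].

```python
def ccm_quantities(weights):
    """Expected edge weights f_ij and kappa under the continuous configuration model.

    weights -- symmetric matrix (list of lists); weights[i][j] > 0 means an edge
               exists between i and j. Self loops are assumed to be absent.
    Returns (expected, kappa) where expected maps (i, j) -> f_ij for existing edges.
    """
    n = len(weights)
    strength = [sum(row) for row in weights]                    # s_i
    degree = [sum(1 for w in row if w > 0) for row in weights]  # d_i
    s_t = sum(strength)
    d_t = sum(degree)

    expected = {}
    for i in range(n):
        for j in range(n):
            if i != j and weights[i][j] > 0:
                s_ij = strength[i] * strength[j] / s_t   # expected strength
                d_ij = degree[i] * degree[j] / d_t       # assumed edge probability
                expected[(i, j)] = s_ij / d_ij           # f_ij(s, d)

    # Each undirected edge appears as (i, j) and (j, i); the double counting
    # cancels in the ratio, so kappa is unaffected.
    num = sum((weights[i][j] - f) ** 2 for (i, j), f in expected.items())
    den = sum(f ** 2 for f in expected.values())
    return expected, num / den


# Hypothetical triangle with equal weights: observed weights match expectation.
even = [[0, 2, 2], [2, 0, 2], [2, 2, 0]]
f_even, kappa_even = ccm_quantities(even)

# Hypothetical triangle with unequal weights: kappa becomes positive.
uneven = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
f_uneven, kappa_uneven = ccm_quantities(uneven)
print(kappa_even, kappa_uneven)
```

In the equal-weight triangle every expected weight equals the observed weight of 2, so $\kappa$ is zero; in the unequal triangle the observed weights deviate from expectation and $\kappa$ is positive.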
The specifics of the two $\kappa$ values are as follows:

$$\kappa_{\text{nonselfloop}} = \frac{\sum_{i \neq j} \left( W_{ij} - (1 - \rho_i) \frac{s_i s_j / s_t}{d_i d_j / d_t} \right)^2}{\sum_{i \neq j} \left( (1 - \rho_i) \frac{s_i s_j / s_t}{d_i d_j / d_t} \right)^2}$$

$$\kappa_{\text{selfloop}} = \frac{\sum_i \left( W_{ii} - p s_i \right)^2}{\sum_i p^2 s_i^2}$$

where

$$\rho_i = \frac{W_{ii}}{\sum_{j, j \neq i} W_{ij}} \quad \text{and} \quad p = \frac{1}{n} \sum_{i=1}^{n} \rho_i$$

From here the algorithm runs twice: once to detect communities without self loops and once to detect the $s^3c^2$ communities.

4.2 Dampened Modularity

A second approach that we have applied to detecting metropolitan areas is based on two premises. One is that there are nodes with attributes indicating that they are central to a community. The second is that there is a limit, dependent on geodesic distance, on how far a node can be from the seed node and still be part of the community. The algorithm works in three main phases: gathering, refining, and filtering.

4.2.1 Gathering potential nodes

The algorithm works in the following way. The nodes that indicate where a community should be are taken to be seed nodes; from there we collect all nodes that have a strong connection to a seed node. From this new set of nodes, all the nodes with a strong connection to them are collected, and the process continues until no more nodes can be added. The algorithm also includes a dampening factor on the weight of an edge, to account for the fact that at increasing geodesic distances from the seed node it should be harder for a node to be in a community with that seed node. This makes sense because with each geodesic level away from the seed node the connectivity of a node relative to the initial seed node is necessarily less, since the connection is less direct. In step form the algorithm looks like this:

1: Identify seed nodes.
2: Calculate an expected weight for each edge based on strength and degree.
3: Check incoming and outgoing edges and add to the community all nodes that have an actual edge weight greater than the expected value.
4: Repeat step 3 for the nodes most recently added to the community until no more nodes can be added.

Mathematically this can be represented as an optimization problem in this manner. For all $i$ in the set of seed nodes, maximize

$$y_i = \sum_j \left( \theta^{x-1} \omega_{ij} - E(\omega_{ij}) \right) \delta_{ij}$$

where
$x$ is the geodesic distance between nodes $i$ and $j$,
$\omega_{ij}$ is the weight of the edge between nodes $i$ and $j$,
$E(\omega_{ij})$ is the expected weight,
$\delta_{ij}$ is the Kronecker delta indicating whether node $j$ is in group $i$,
$\theta$ is the dampening factor.

This function will be maximized by selecting all nodes such that

$$\theta^{x-1} \omega_{ij} - E(\omega_{ij}) \geq 0$$

4.2.2 Refining Communities

The algorithms that maximize modularity do more than just choose nodes with strong connections; they also choose the groupings that have the strongest connections with each other.
Since the initial stage of the algorithm only requires a strong connection with one node in the community for a new node to be added, a step needs to be taken to ensure strong connectivity with the other nodes in the community. Communities can be refined by drawing on the filtering idea of OSLOM, written by Lancichinetti et al., which ranks the nodes in a community by their connectivity to the community, then checks whether the least connected node is strongly connected to the community and removes it if it is not [1]. Specifically, for each node $i$ in a community $C_i$ we can define its connectivity to $C_i$ by counting the number of nodes in the community that node $i$ is strongly connected to and dividing by the number of nodes in the community. This is contingent on defining a minimum threshold $t$ for what constitutes a strongly connected community. Mathematically, for a community with $n$ nodes this would be

$$\text{Connectivity}_{C_i} = \frac{\sum_{j=1}^{n} f(j)}{n}, \qquad f(j) = \begin{cases} 1 & \omega_{ij} \geq E(\omega_{ij}) \\ 0 & \text{otherwise} \end{cases}$$

Once these connectivity rankings have been calculated we can take the smallest one and check that it is at least as big as the threshold value $t$. If it is not, that node is removed and the ranking process starts over, repeating until no more nodes can be removed [1]. Once this is completed the communities should resemble a result given by a modularity optimization technique.

4.2.3 Filtering communities

Because seed nodes could be close together, it is possible that the dampened modularity algorithm produces several communities that are approximately the same. The question is how to use this information to represent the communities. A couple of the approaches attempted in this research required the calculation of similarities between communities. These similarities can be calculated with the Jaccard similarity measure, which is the size of the intersection of the two sets divided by the size of their union [5]. Which communities are similar enough can be decided by choosing a threshold value, which can then be tuned in a similar way to the dampening factor.

One approach would be to merge communities whose similarity is above the threshold value into a single community. As simple as the approach is, it has the issue of fundamentally changing the composition of the community so that connectivity is no longer guaranteed. This was seen empirically when the approach was applied to the communities produced by the dampened modularity algorithm. Another approach is to take the communities whose similarity is above the threshold and select which one is the better community. At the moment this is done by looking at the connectivity of the nodes inside the community relative to the nodes outside of the community.
This connectivity is defined relative to nodes outside the community $C$ by

$$F = \frac{\sum_{i \in C} \rho_i s_i}{\sum_{i \in C} (1 - \rho_i) \sum_{j \neq i, j \notin C} s_j}$$

Also, in the case where we are not allowing overlapping communities, we could easily adopt a methodology that merges similar communities together if they are above the threshold and, if the Jaccard value is between zero and the threshold, assigns the overlapping nodes of the communities to the community to which they are most strongly connected.

4.2.4 Determining the dampening factor

The validity of the dampening factor still needs to be demonstrated; however, at the moment it seems to be a reasonable addition because of the theoretical possibility that adding nodes to a community based on their actual edge strength minus their expected edge strength could result in the community including a large node set that lacks any semblance of connectivity. Thus it seems reasonable to impose a dampening factor that scales with geodesic distance from the seed node to prevent that from happening.

Assuming that the dampening factor is significant, it is also necessary to determine what its value should be. A value that is too small would result in only nodes at geodesic distance one from the seed node being collected. On the other hand, a value that is too large could result in the issue mentioned above, with far too many nodes being added. The easiest way to determine the value is to tune the dampening parameter to whatever results in the best community divisions. To do this we need some measure of how well a network is divided into overlapping groups. The overlapping aspect is significant because it is reasonable for nodes to be part of more than one community in this commuter data, just as a person could be in more than one friend group. The most widely used quality measure is Newman's modularity, discussed earlier; however, it is unclear how well this measure will fare at judging the overlapping communities generated by this algorithm. An alternative measure proposed by Shen et al. deals with the modularity of overlapping communities [4]. The formula they derived for $K$ communities is as follows:

$$Q_{ov} = \frac{1}{2M} \sum_{w=1}^{K} \sum_{i \in C_w, j \in C_w} \frac{1}{O_i O_j} \left( A_{ij} - \frac{d_i d_j}{2M} \right)$$

[4], where $C_w$ is community $w$, $A_{ij}$ is the value of the edge from $i$ to $j$, $O_i$ is the number of communities node $i$ belongs to, and $2M$ is the total degree of the graph ($d_t$ in the notation used earlier). An advantage of this metric being a modification of the modularity measure is that it allows for the substitution of a weighted null model. Thus, in the case where the continuous configuration null model is used, it can easily be plugged in and the metric will still work.
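The overlapping modularity measure can be sketched as follows for the unweighted configuration null model. This is an illustrative reading of the formula above, not the implementation used for the results in this paper; the function name `overlapping_modularity` and the toy graph (two triangles joined by one edge) are hypothetical.

```python
def overlapping_modularity(adj, communities):
    """Overlapping modularity Q_ov for a cover of an undirected graph.

    adj         -- symmetric 0/1 adjacency matrix as a list of lists
    communities -- list of sets of node indices; a node may appear in several
    """
    degree = [sum(row) for row in adj]    # d_i
    two_m = sum(degree)                   # 2M, the total degree

    # O_i: number of communities that node i belongs to
    membership = [0] * len(adj)
    for community in communities:
        for i in community:
            membership[i] += 1

    total = 0.0
    for community in communities:
        for i in community:
            for j in community:
                null = degree[i] * degree[j] / two_m
                total += (adj[i][j] - null) / (membership[i] * membership[j])
    return total / two_m


# Toy graph (hypothetical): two triangles joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1

q_disjoint = overlapping_modularity(adj, [{0, 1, 2}, {3, 4, 5}])
q_overlap = overlapping_modularity(adj, [{0, 1, 2}, {2, 3, 4, 5}])
print(q_disjoint, q_overlap)
```

When no node belongs to more than one community every $O_i$ equals one, so the measure reduces to ordinary modularity (5/14 on this toy graph). Swapping in a weighted null model would amount to replacing the `null` term with the expected weight from that model.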
The overlapping modularity measure has also been shown to be similarly effective at evaluating the quality of a division as the other overlapping metrics tested in a survey paper by Tanmoy Chakraborty et al., while being less computationally intensive than those metrics [9]. It is also possible that, once the dampening factor is justified, it could in fact be defined numerically based on a value like graph density. This would decrease the run time of the algorithm because there would be no need for tuning.

4.2.5 The expected value

The other major question that arises is how best to calculate the expected value used in the optimization. Attempts for this project have included using a constant expected value of 0.01 and one determined from a Taylor approximation. The method that seems to work best, however, is the continuous configuration model, although other methods, such as defining degree-conditional distributions, could be used as well.
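Pulling together the gathering phase (section 4.2.1), the refining phase (section 4.2.2), and the Jaccard similarity used for filtering (section 4.2.3), a minimal sketch might look like the following. Several things here are assumptions made for illustration: the dict-of-dicts edge weights, the `expected` callable (a constant is used below purely for the toy example), the use of the gathering level as the geodesic distance $x$, and all function names. It is not the implementation used to produce the results in this paper.

```python
def is_strong(weights, expected, a, b):
    """True if the edge a -> b exists and its weight is at least the expected weight."""
    w = weights.get(a, {}).get(b, 0.0)
    return w > 0 and w >= expected(a, b)


def gather(weights, expected, seed, theta):
    """Gathering phase: grow a community outwards from a seed node, level by level."""
    community = {seed}
    frontier = [seed]
    level = 1                                   # distance x of the candidates
    while frontier:
        next_frontier = []
        for u in frontier:
            out_nbrs = set(weights.get(u, {}))
            in_nbrs = {a for a, nbrs in weights.items() if u in nbrs}
            for v in out_nbrs | in_nbrs:
                if v in community:
                    continue
                for a, b in ((u, v), (v, u)):   # outgoing and incoming edges
                    w = weights.get(a, {}).get(b, 0.0)
                    if w > 0 and theta ** (level - 1) * w - expected(a, b) >= 0:
                        community.add(v)
                        next_frontier.append(v)
                        break
        frontier = next_frontier
        level += 1
    return community


def refine(weights, expected, community, t):
    """Refining phase: drop the least connected node until all reach threshold t."""
    community = set(community)

    def connectivity(i):
        strong = sum(1 for j in community if is_strong(weights, expected, i, j))
        return strong / len(community)

    while community:
        weakest = min(sorted(community), key=connectivity)
        if connectivity(weakest) >= t:
            break
        community.remove(weakest)
    return community


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)


def undirected(edges):
    """Helper for the toy example: store each weighted edge in both directions."""
    weights = {}
    for a, b, w in edges:
        weights.setdefault(a, {})[b] = w
        weights.setdefault(b, {})[a] = w
    return weights


# Hypothetical toy network and a constant expected weight of 5.
toy = undirected([(0, 1, 10), (0, 2, 10), (1, 2, 10), (2, 3, 6), (3, 4, 1)])
constant = lambda a, b: 5.0

gathered = gather(toy, constant, seed=0, theta=1.0)
dampened = gather(toy, constant, seed=0, theta=0.5)
refined = refine(toy, constant, gathered, t=0.5)
print(gathered, dampened, refined, jaccard(gathered, refined))
```

In this toy case node 3 is gathered when $\theta = 1$ but excluded when $\theta = 0.5$, because its edge weight of 6 is halved at distance two and falls below the expected weight. With $\theta = 1$ the refining phase removes node 3 anyway, since it is strongly connected to only one of the four community members.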

4.2.6 Application in relation to CCME

In its current form the dampened modularity algorithm could prove valid as an accurate and legitimate community detection algorithm. What may be a better application for this algorithm, however, is creating subsets of nodes that could be in a seed node's community. This is significant because in our current usage of the CCME algorithm we have been getting communities that are far larger than we would expect. By applying the gathering portion of the dampened modularity algorithm we could gather a set of feasible nodes. One could then apply the CCME algorithm to that set for the corresponding seed and find a statistically significant community. This should reduce the overall run time of the CCME algorithm because it has fewer nodes to check than it would have had no such restriction been applied. This technique should also result in a smaller set of nodes within a community. The key to making this work would be demonstrating that it is in fact possible to find a factor for which the gathered nodes include all possible community nodes for a given seed without collecting every node in the graph.

4.3 A dynamic and possibly predictive alteration

Using a framework set up in the OSLOM paper it is possible to adapt both the dampened modularity and CCME algorithms to dynamic datasets, thanks to their extractive nature. The importance of their extractive nature comes from the fact that these algorithms can start from an initial community generated at time $t$ and detect communities at time $t + \Delta t$ using that initial community as a starting point [1]. Specifically, this would mean taking the initial community, then adding nodes in a gathering stage based on data from $t + \Delta t$, before trimming the communities in a refining phase based on data from $t + \Delta t$ [1].
It is necessary to add a comparison of the seed nodes at time $t$ and at time $t + \Delta t$, so that if a new node becomes a seed node it has a community created around it. Once that is done, the dampened modularity algorithm can be applied using data from $t + \Delta t$, with the communities from time $t$ and the new seed nodes as starting points. It should be noted that, because the dampened modularity algorithm relies on geodesic distance, the node from which that distance is measured is the one with the largest seed criterion value. For example, in our network data we have used population as the seed determinant, so if there were seven nodes in a community then the node with the largest population at time $t + \Delta t$ would be the one from which the geodesic distance is calculated.

For the CCME algorithm the approach in OSLOM could be applied, but it would be necessary to filter the communities on the data from time $t + \Delta t$ to ensure that all nodes still have enough connectivity with the community. This could be done using the approach mentioned in section [section number missing in source]. Once the filtering is done, the CCME can be run on the data at time $t + \Delta t$ using those filtered communities as starting points. It will also have to perform the full algorithm on the new seed nodes that appear in the time $t + \Delta t$ dataset.

If these dynamic alterations are appropriate, the same dynamic approach could be used for prediction of communities if one were able to predict the weights of the edges at some future time.

5 Application to the 2010 commuter data

5.1 Application of the Dampened Modularity Algorithm

In applying the Dampened Modularity Algorithm to the 2010 commuter data it was vital to tune the dampening factor, the connectivity parameter, and the similarity threshold for filtering. These values are tuned by maximizing the overlapping modularity measure, using the CCM as the null model. A further modification of the overlapping modularity measure is to subtract from it the number of initial seed nodes that do not remain in communities. This is done because, after running the algorithm, it appears possible for initial seed nodes not to remain in the finalized groups, either because they are not strongly connected to their group or because they are not in the most connected of several overlapping groups.

To save time these values are only tuned to the tenths place. The dampening factor is tuned by looking at the quality values for dampening factors from 1 down to 0.1 with the other two thresholds set to zero. From those quality values, the two dampening factors with the highest quality scores are selected and tested against the connectivity measure. For the connectivity measure and the similarity threshold, the tuning looks at 0.9, 0.5, and 0.1. The values at which the two largest overlapping modularity values occur then define the interval in which the remaining values are tested. So if 0.5 and 0.1 had the two largest overlapping modularity values, then 0.4, 0.3, and 0.2 would also be tested to find the connectivity value. The reason 0.9 is used as the maximum instead of 1 is that 1 means perfectly connected, or exactly the same, for the respective measures.
These values are not going to occur often and thus are not going to be very useful for this particular problem. The dampening factor results can be seen in figure 4. In this figure it is clear that the two highest quality values are achieved at the highest and lowest dampening factors tested. Because of the negative penalty added to the overlapping modularity measure, it seems reasonable to test both of these values at the connectivity levels and compare the results, since connectivity could affect them in very different ways. From those results it appears that a connectivity threshold of 0.3 combined with a dampening factor of 1 produces the highest quality measure. With these values fixed, the final step was to choose a similarity threshold for filtering, which from the graph was 0.7.

Figure 4: Dampening factor tuning

Figure 5: Connectivity Optimization

Figure 6: Filtering Optimization

Plugging these values into the algorithm results in the following set of graphs, with each layer depicting disjoint communities.


Figure 7: Dampened Modularity Communities

These graphs represent 347 community divisions. On average these areas contain [value missing in source] counties with a standard deviation of [value missing in source] counties. The largest division by population was the Los Angeles Metro Area, with 15,620,448 people, which includes Los Angeles County, Orange County, Riverside County, and San Bernardino County. The size distribution is depicted in figure 8.

Figure 8: Dampened Modularity size distribution

5.2 Application of CCME

The CCME algorithm was much more straightforward to apply because it requires no parameter tuning. The only necessary step was identifying the initializing nodes. This was done by looking for counties that have at least 1% self commuters relative to total commuters, as long as the number of self commuters is at least 20,000. The reason for this was to select large population centers, but in a way that was more dependent on the network data. The results for the non-overlapping communities are similarly depicted in the following plots, with the final plot representing single county communities.

Figure 9: CCME communities

The CCME algorithm produced 214 communities with an average size of [value missing in source] counties and a standard deviation of [value missing in source] counties. The distribution of the community sizes is depicted in the following plot.

Figure 10: CCME community size distribution

5.3 Comparing to MSA results

Figure 11: MSA Metro Areas [8]

The metropolitan communities are outlined in red and the micropolitan communities are outlined in blue. Micropolitan communities differ from metropolitan communities by having at least one area with a population of 20,000, versus 50,000 in metropolitan areas [8]. The MSA has identified 919 communities. The average size of the MSA metro communities is 1.94 with a standard deviation of [value missing in source]. The distribution of these community sizes can be found in figure 12.

Figure 12: MSA Size of Communities

When comparing the three result sets, the first noticeable difference is that the MSA tended to produce the smallest communities, followed by the Dampened Modularity method, and then the CCME, which formed much larger communities than the other two. Unsurprisingly, the number of communities identified by each method follows the reverse order of community size, with CCME finding the fewest communities and MSA the most. The difference in size distribution between the methods is interesting in that the MSA and CCME methods produce a fairly exponential distribution of community sizes, while the Dampened Modularity approach does not seem to produce anything resembling a known distribution.

5.4 Finding meaning

In assessing the two methods in comparison to the MSA delineations, one big advantage is that, because of the structure of the algorithms, there is a reason for each of the communities that can be supported by data. The same cannot be said for the MSA delineations, which depend in part on historical delineations [2]. The other advantage the algorithms have over the MSA is that they identify overlapping communities, which makes intuitive sense because these counties interact with so many other counties.

The particular algorithms also have several disadvantages. For the CCME the major issue is that it forms communities that are almost twice as large as the ones defined by the other two methods. The reason for this may be the way the test statistic is defined, namely that the connectivity is the sum of weights in a community compared to the expected weights in that community.
The problem is that a single edge weight that greatly exceeds its expected value could skew this statistic enough that loosely connected counties end up being added when they should not have been, solely because they have a strong connection with one county in the community. This could be corrected by adjusting the manner in which the connectivity of a node to a set is defined. For example, the connectivity could be defined using the middle 50% of the edge weights, or some other delineation that is less affected by large outliers than the overall connectivity. A different option would be to trim communities in a manner similar to OSLOM once they are formed, ensuring that there is some type of minimum connectivity between nodes.

For the Dampened Modularity algorithm, a problematic occurrence is that, as noted in section 5.1, it at times forms communities that do not include their seed nodes. One possible cause is that the quality measure used does not properly represent the best delineations of communities, which could result in a bad choice of dampening factor, connectivity threshold, etc. Another possibility is that the way in which the seed nodes are chosen is not necessarily representative of where the communities should be, which could again result in those nodes falling out of the communities that they seeded. This could be corrected by using the procedure used by the CCME, or possibly a centrality measure, to determine what should be a seed node.

Another interesting occurrence in the Dampened Modularity method is that its size distribution of communities is so wildly different from that of the other two methods. This difference could be evidence that the method simply does not work, or it could be caused by the method being derived from one designed to identify community delineations for a whole graph. Specifically, because only certain communities are being chosen from the graph, based on a method designed to identify every community present in a graph, the distribution could differ from the one expected had every community been chosen. Further investigation into the algorithm is therefore needed to make a complete determination of its validity.

6 Summary

After examining the work done so far this year, it seems we have the beginnings of a strategy that will allow us to robustly determine these metropolitan areas. A theoretical proof of the validity of the dampening factor would be useful in the future, especially given its debatable importance in the 2010 data.
Both the CCME method and the Dampened Modularity method can and should be further modified to better capture the metropolitan community structure. However, both algorithms result in divisions that can be supported by data, which is an improvement over the MSA results. There are also numerous other algorithms capable of making these same divisions that could be considered. Finally, the longitudinal aspect of the algorithms could be further explored to see how the communities evolve over time. This could be done using an OSLOM-type edge addition and subtraction technique as described in section 4.3. If this approach is refined, a natural extension would be to attempt to predict what the communities would be in the future, which would hopefully give policy makers a better long-term picture of the problems that they could face.

References

[1] Andrea Lancichinetti, Filippo Radicchi, José J. Ramasco, and Santo Fortunato. Finding statistically significant communities in networks.
[2] United States Census. Metropolitan and micropolitan. January. Accessed on
[3] Santo Fortunato. Community detection in graphs.
[4] Huawei Shen, Xueqi Cheng, Kai Cai, and Mao-Bin Hu. Detect overlapping and hierarchical community structure in networks.
[5] John Palowitch, Shankar Bhamidi, and Andrew Nobel. Significance-based community detection in weighted networks.
[6] Mark Newman. Networks: an Introduction. Oxford University Press.
[7] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks.
[8] Steven G. Wilson, David A. Plane, Paul J. Mackun, Thomas R. Fischetti, and Justyna Goworowska. Patterns of metropolitan and micropolitan population change: 2000 to
[9] Tanmoy Chakraborty, Ayushi Dalmia, Animesh Mukherjee, and Niloy Ganguly. Metrics for community analysis: a survey.


Discrete mathematics , Fall Instructor: prof. János Pach Discrete mathematics 2016-2017, Fall Instructor: prof. János Pach - covered material - Lecture 1. Counting problems To read: [Lov]: 1.2. Sets, 1.3. Number of subsets, 1.5. Sequences, 1.6. Permutations,

More information

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 11 Coding Strategies and Introduction to Huffman Coding The Fundamental

More information

GRADE 6 PAT REVIEW. Math Vocabulary NAME:

GRADE 6 PAT REVIEW. Math Vocabulary NAME: GRADE 6 PAT REVIEW Math Vocabulary NAME: Estimate Round Number Concepts An approximate or rough calculation, often based on rounding. Change a number to a more convenient value. (0 4: place value stays

More information

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures STA 2023 Module 3 Descriptive Measures Learning Objectives Upon completing this module, you should be able to: 1. Explain the purpose of a measure of center. 2. Obtain and interpret the mean, median, and

More information

MT5821 Advanced Combinatorics

MT5821 Advanced Combinatorics MT5821 Advanced Combinatorics 4 Graph colouring and symmetry There are two colourings of a 4-cycle with two colours (red and blue): one pair of opposite vertices should be red, the other pair blue. There

More information

3.1 Basic Definitions and Applications

3.1 Basic Definitions and Applications Chapter 3 Graphs Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved. 1 3.1 Basic Definitions and Applications Undirected Graphs Undirected graph. G = (V, E) V = nodes. E

More information

Greedy Algorithms. Previous Examples: Huffman coding, Minimum Spanning Tree Algorithms

Greedy Algorithms. Previous Examples: Huffman coding, Minimum Spanning Tree Algorithms Greedy Algorithms A greedy algorithm is one where you take the step that seems the best at the time while executing the algorithm. Previous Examples: Huffman coding, Minimum Spanning Tree Algorithms Coin

More information

Heteroskedasticity and Homoskedasticity, and Homoskedasticity-Only Standard Errors

Heteroskedasticity and Homoskedasticity, and Homoskedasticity-Only Standard Errors Heteroskedasticity and Homoskedasticity, and Homoskedasticity-Only Standard Errors (Section 5.4) What? Consequences of homoskedasticity Implication for computing standard errors What do these two terms

More information

Greedy algorithms is another useful way for solving optimization problems.

Greedy algorithms is another useful way for solving optimization problems. Greedy Algorithms Greedy algorithms is another useful way for solving optimization problems. Optimization Problems For the given input, we are seeking solutions that must satisfy certain conditions. These

More information

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order. Chapter 2 2.1 Descriptive Statistics A stem-and-leaf graph, also called a stemplot, allows for a nice overview of quantitative data without losing information on individual observations. It can be a good

More information

Spatial Patterns Point Pattern Analysis Geographic Patterns in Areal Data

Spatial Patterns Point Pattern Analysis Geographic Patterns in Areal Data Spatial Patterns We will examine methods that are used to analyze patterns in two sorts of spatial data: Point Pattern Analysis - These methods concern themselves with the location information associated

More information

A Formal Approach to Score Normalization for Meta-search

A Formal Approach to Score Normalization for Meta-search A Formal Approach to Score Normalization for Meta-search R. Manmatha and H. Sever Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts Amherst, MA 01003

More information

Measures of Central Tendency

Measures of Central Tendency Measures of Central Tendency MATH 130, Elements of Statistics I J. Robert Buchanan Department of Mathematics Fall 2017 Introduction Measures of central tendency are designed to provide one number which

More information