Metropolitan Community Delineation. Community detection algorithms. Nathaniel Pritchard
Advised by: Professor Shankar Bhamidi
Reviewed by: Professor Shankar Bhamidi, Professor Sayan Banerjee
Statistics and Operations Research
The University of North Carolina at Chapel Hill
April 2, 2018
1 Introduction

There are endless ways to group people together in space; the most common are cities, towns, villages, and metropolitan areas. Today cities and metropolitan areas are defined using a mixture of techniques, such as historical boundaries and strict rules on commuting and population density, to determine the outline of a metropolitan area [2]. How these delineations are determined matters because they are the basis for statistical calculations detailing anything from community diversity to median incomes. These statistics are then used to help determine economic and infrastructure initiatives, which could benefit the community but could also be wasteful if the statistics do not represent the true community. A robust method of community delineation that captures the true structure of connections would therefore allow for more representative statistics, giving policy makers a better understanding of the issues and leaving them better able to solve these problems.

Such a method can be found by turning to network theory, the study of connections between entities. This field is a useful one because metropolitan areas are supposed to represent closely connected areas. Specifically, it is useful to look to network theory for an algorithm that identifies community structure from data on the workers who commute to and from various counties. Using these techniques we should be able to provide delineations that are judged to be both significant and proper.

2 Network Theory Background

Networks are the way that we mathematically capture the connections between entities. They can be used to describe social connections, food chains, protein structures, or, in our case, commuter connections.
These networks are built from entities (nodes, in network language), whether people, internet devices, or animals, that are connected to other entities via edges [6]. A network can be weighted, in which case each edge takes on a certain value, or unweighted, in which case an edge either exists or does not. There is also no requirement that edges be bi-directional: if every edge in a network is bi-directional then the network is undirected; otherwise it is directed, meaning that at least some connections are one-way paths [6]. The edges between nodes can be recorded in an adjacency matrix $A$, where the $ij$th entry is the weight of the edge between $i$ and $j$, or zero if no edge exists [6]. In an unweighted network the entry is one if the edge exists and zero otherwise. Additionally, in the undirected case, if the $ij$th entry is nonzero then the $ji$th entry is also nonzero and has the same value [6].

Once the network has been established there needs to be a way to extract information about its structure. There are several ways to get a rudimentary idea of network structure; in the context of this paper, only two are necessary to understand. One is degree, which in an undirected graph is the number of edges connected to each node [6]. In a directed graph there are an incoming and an outgoing degree, which are the number of edges coming
into a node or leaving a node, respectively [6]. If there are $n$ total nodes in a graph, the degree of a particular node $i$ is denoted $d_i$. The total degree $d_t$ of the graph is defined as

$$d_t = \sum_{i=1}^{n} d_i$$

The other important measure is geodesic distance, which is the shortest distance between two nodes [6], where distance in a network is the number of edges separating two nodes. For example, if the fewest jumps needed to connect Kevin Bacon to Idris Elba were six, then six would be the geodesic distance between them.

3 Community detection: basic algorithms

What is a community? In the simplest terms it is a grouping that is more connected within itself than with the rest of the network [6]. This simple definition has led to numerous ways of finding communities. One way, proposed by Mark Newman, is to look for groupings where the number of edges is greater than the number expected in a randomly generated graph with the same degree distribution. Maximizing that measure then identifies the communities. The model Newman uses for the expected value of an edge is the configuration model [6]. Other techniques do things like maximize the number of triangles, which occur when three nodes are all connected to each other [6].

3.1 The configuration model

The configuration model is designed for unweighted networks. In this model each node is assumed to have the same degree as the corresponding node in the actual graph being examined [6]. From this setup the probability of a connection between nodes $i$ and $j$ can be easily defined. This is done by thinking of the degree of a node as its number of half edges [6]; an edge forms when two half edges point toward each other [6].
From this idea we define the total number of half edges of a graph with $n$ nodes to be

$$d_t = \sum_{i=1}^{n} d_i$$

Since there are $d_t$ half edges in total, the chance that a given half edge from node $i$ points toward node $j$ is $\frac{d_j}{d_t - 1}$, since it has to point at one of the $d_j$ half edges connected to $j$ out of the total number of half edges in the graph excluding itself [6]. Then, since there are $d_i$ half edges leaving node $i$, the probability of an edge between $i$ and $j$ is

$$p_{ij} = \frac{d_i d_j}{d_t - 1}$$
[6] If the network is large, the effect of the $-1$ in the denominator is minimal and it can be dropped, making the probability

$$p_{ij} = \frac{d_i d_j}{d_t}$$

In an unweighted network this probability of an edge is also the expected value of the edge, since an edge can only take a value of zero or one. Calculating this probability thus gives a good null model for expected edges from which to try to determine community structure in a graph [6].

3.2 Modularity

With this null model described it is now possible to understand the concept of modularity. Modularity is a measure that assigns a quality value to a division of a network. This quality is determined by the nodes in a community having a higher than expected connection with each other [3]. Modularity is calculated by summing the amount by which each connection within a particular division exceeds its expected value under the configuration model [3]. These values are then summed over all divisions [3], and the total is divided by the number of half edges, $d_t$ [3]. Mathematically this can be represented as

$$Q = \frac{1}{d_t} \sum_{ij} \left( A_{ij} - \frac{d_i d_j}{d_t} \right) \delta_{ij}$$

Here $A_{ij}$ is the adjacency matrix entry representing the edge between $i$ and $j$, and $\delta_{ij}$ is the Kronecker delta, which is one if $i$ and $j$ are in the same group and zero otherwise. Modularity takes values less than one, and the further a division's value is from one, the worse the quality of the division [3]. This concept gives a good framework for judging the quality of a division into non-overlapping communities.

3.3 Modularity based Community detection

With the concept of modularity it is now possible to divide a network into communities by maximizing this measure. Because optimizing this measure is an NP-hard problem, Newman has proposed several heuristic algorithms to perform these divisions [6].
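To make the modularity formula of Section 3.2 concrete, the following is a minimal sketch for an undirected, unweighted graph stored as a NumPy adjacency matrix. The function name `modularity` and the small two-triangle test graph are illustrative assumptions, not part of the original work.

```python
import numpy as np

def modularity(A, labels):
    """Modularity Q of a division, using the configuration-model null.

    A      : symmetric 0/1 adjacency matrix (assumed undirected, unweighted)
    labels : labels[i] is the group that node i is assigned to
    """
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                  # degree d_i of each node
    d_t = d.sum()                      # total degree = number of half edges
    expected = np.outer(d, d) / d_t    # null model p_ij = d_i d_j / d_t
    labels = np.asarray(labels)
    same_group = labels[:, None] == labels[None, :]   # Kronecker delta
    return float(((A - expected) * same_group).sum() / d_t)

# Hypothetical example: two triangles joined by a single edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one group per triangle
q_single = modularity(A, [0, 0, 0, 0, 0, 0])  # everything in one group
print(q_split, q_single)
```

Splitting the graph into its two triangles scores higher than lumping all nodes together, which is the behaviour the greedy algorithms described next try to exploit.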
One approach is a fast greedy algorithm in which a graph is initially treated as $n$ separate groupings, and at each step the two groupings whose merger would most increase modularity are joined. The algorithm runs until any further merging would decrease overall modularity [3]. The major problem with this algorithm, as can be seen in Figure 1, is that it tends to favor large communities over smaller ones, which is problematic for metropolitan area detection, where small communities can exist alongside large ones [3].

The walktrap algorithm is another algorithm that we initially examined. It is based on the premise that a random walker, once it enters a community, should spend more time inside that community than outside of it. The algorithm works by defining a distance based on the transition probabilities from $i$ to $j$ in a particular
number of steps [7]. With these transition probabilities serving as distances, communities are merged together using a greedy approach similar to that of the modularity algorithm [7]. This gives a sequence of partitions of the graph, one defined at each merge; the best partition is then chosen to be the one that maximizes modularity. The results from this algorithm can be found in Figure 2.

Figure 1: Fast Greedy Algorithm applied to 1970 commuter data
Figure 2: Walktrap applied to 1970 commuter data

Looking at these plots, two major issues arise in relation to metropolitan areas. The first is that every county in these plots has been placed inside a community. This is not the case for metropolitan areas, since many counties are rural and not near any population center. The other major issue is that the community delineations are too large to represent metropolitan areas. For instance, the walktrap algorithm places almost the entire state of Florida in a single metro area, which is not appropriate. Similarly problematic is the fast greedy algorithm grouping essentially the entire west
coast together in Figure 1. For this reason we decided to take a different approach: algorithms that start from a seed node and work their way outwards to form communities, rather than looking at the entire graph topology at one time.

4 Extractive approaches

The less than satisfactory results of the previous approaches led to the consideration of extractive algorithms rather than the previous, more holistic ones. The difference is that extractive algorithms create communities starting from specific seed nodes. This is ideal for delineating metropolitan areas because the OMB has already defined them as having counties with more than 50,000 people, so we know where to look. One algorithm used is the CCME algorithm, which was created by John Palowitch, Shankar Bhamidi, and Andrew Nobel. The other is tentatively named the dampened modularity algorithm and was created for this specific purpose.

4.1 CCME

The main purpose of the CCME algorithm is to search for statistically significant communities, and it was designed with weighted networks in mind. The basis of the algorithm is the continuous configuration model, which is similar to the configuration model but accounts for edge weights in its calculation of expected edge values [5]. It does so by using the same expression for probability found in the configuration model and, in a similar manner, creating a formula for the expected strength of connection between two nodes. Specifically, we can derive the expected strength of connection between nodes $i$ and $j$ by calculating the proportion of the total strength of the graph that node $i$ makes up and then multiplying that by the total strength of node $j$ [5].
This means that if, for example, 5% of the strength of a graph lies in node $i$, then after a random distribution of strengths we would expect 5% of the strength of node $j$ to be connected to node $i$. Mathematically this can be represented as

$$s_{ij} = \frac{s_i s_j}{s_t}$$

where $s_t$ is the total strength of the graph [5]. Dividing $s_{ij}$ by $d_{ij}$, the configuration-model edge probability $d_i d_j / d_t$ of Section 3.1, then gives the expected weight of each edge, which we define as $f_{ij}(s, d)$ [5]:

$$f_{ij}(s, d) = \frac{s_{ij}}{d_{ij}}$$

where $s$ is a vector containing the strengths of nodes $i$ and $j$, and $d$ is a vector containing the degrees of nodes $i$ and $j$ [5]. Along with the expected value, the distribution $F$ is characterized by a variance parameter called kappa. With $W_{ij}$ the observed weight of the edge between $i$ and $j$, it is calculated as

$$\kappa(d, s) = \frac{\sum_{ij:\, A_{ij} = 1} \left( W_{ij} - f_{ij}(s, d) \right)^2}{\sum_{ij:\, A_{ij} = 1} f_{ij}(s, d)^2}$$
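The expected edge weight and the kappa parameter described above can be sketched numerically as follows. This is a minimal illustration for an undirected weighted graph with no isolated nodes; it assumes that $d_{ij}$ is the configuration-model edge probability $d_i d_j / d_t$, and the function name `ccm_expected_weights` and the three-node example are hypothetical.

```python
import numpy as np

def ccm_expected_weights(W):
    """Expected edge weights f_ij and variance parameter kappa.

    W : symmetric matrix of edge weights, zero where no edge exists.
    Assumes every node has at least one edge, so no d_ij is zero.
    """
    W = np.asarray(W, dtype=float)
    A = (W > 0).astype(float)        # unweighted adjacency matrix
    d = A.sum(axis=1)                # degrees d_i
    s = W.sum(axis=1)                # strengths s_i
    d_ij = np.outer(d, d) / d.sum()  # assumed edge probability d_i d_j / d_t
    s_ij = np.outer(s, s) / s.sum()  # expected strength s_i s_j / s_t
    f = s_ij / d_ij                  # expected weight of an edge, f_ij
    on_edges = A == 1                # kappa only sums over existing edges
    kappa = ((W - f)[on_edges] ** 2).sum() / (f[on_edges] ** 2).sum()
    return f, float(kappa)

# Hypothetical example: a triangle with edge weights 1, 2 and 3.
W = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 3.0],
              [2.0, 3.0, 0.0]])
f, kappa = ccm_expected_weights(W)
print(f[0, 1], kappa)
```

A kappa near zero means the observed weights sit close to their expected values; larger values indicate more spread around the null model.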
[5] Once it has these terms, the algorithm starts from a set of $n$ seed nodes and forms $n$ subgraphs $B_h$, where $h \le n$ [5]. For each subgraph, the algorithm checks the surrounding nodes, adding a node to the community if its connectivity to the community is significant [5]. This decision is based on a multiple testing threshold on z-scores. Each z-score is calculated by summing all the connections of a node to a community, subtracting the expected value of those same connections [5], and dividing by a standard deviation given by

$$\sigma(i, B \mid \kappa(d, s))^2 = \sum_{j \in B} \frac{s_{ij}^2}{d_{ij}} \left( 1 - d_{ij} + \kappa(d, s) \right)$$

[5] The algorithm tests all nodes until $B_t = B_{t+1}$, at which point it terminates and $B_t$ is added to the list of communities [5]. The algorithm finishes by filtering out redundant communities based on Jaccard similarity and returning the filtered list [5].

This works well for networks with no self loops. In our data, however, a county node could be its own community based on its self commuting, and in this case the lack of self loops becomes an issue. While the initial algorithm does not use self loops, we have modified it to account for them. We did this by breaking the system into two stages: one that operates as the algorithm was intended, and one that identifies statistically significant self commuting communities, $s^3c^2$. This is done by adding a few parameters to the algorithm. One addition is a separate $\kappa$ for self loops and for non self loops, each of which functions in the same way as the original $\kappa$. We also add a $p$, which represents the proportion of total commuters in the graph that are self commuters.
The specifics of the two kappa values are as follows:

$$\kappa_{\text{nonselfloop}} = \frac{\sum_{i \neq j} \left( W_{ij} - (1 - \rho_i) \frac{S_i S_j / S_t}{d_i d_j / d_t} \right)^2}{\sum_{i \neq j} (1 - \rho_i)^2 \left( \frac{S_i S_j / S_t}{d_i d_j / d_t} \right)^2}$$

$$\kappa_{\text{selfloop}} = \frac{\sum_{i} \left( W_{ii} - p s_i \right)^2}{\sum_{i} p^2 s_i^2}$$

where

$$\rho_i = \frac{W_{ii}}{\sum_{j,\, j \neq i} W_{ij}} \quad \text{and} \quad p = \frac{1}{n} \sum_{i=1}^{n} \rho_i$$

From here the algorithm runs twice: once to detect communities without self loops and once to detect the $s^3c^2$ communities.

4.2 Dampened Modularity

A second approach that we have applied to detecting metropolitan areas is based on two premises. One is that some nodes have attributes indicating that they are central to a community. The second is that there is a limit, dependent on geodesic distance, on how far a node can be from the seed node and still be part of the community. The algorithm works in three main phases: gathering, refining, and filtering.
4.2.1 Gathering potential nodes

The algorithm works in the following way. The nodes that indicate where a community should be are taken as seed nodes, and from each seed we collect all nodes that have a strong connection to it. From this new set of nodes, all nodes with a strong connection to them are collected, and the process continues until no more nodes can be added. The algorithm also applies a dampening factor to the weight of an edge, to account for the fact that at increasing geodesic distance from the seed node it should be harder for a node to be in a community with that seed. This makes sense because with each geodesic level away from the seed, a node's connection to the initial seed node is necessarily less direct. In step form the algorithm looks like this:

1: Identify seed nodes.
2: Calculate an expected weight for each edge based on strength and degree.
3: Check incoming and outgoing edges, and add to the community all nodes whose actual edge weight is greater than the expected value.
4: Repeat step three for the nodes most recently added to the community until no more nodes can be added.

Mathematically this can be represented as an optimization problem. For all $i$ in the set of seed nodes,

$$y_i = \sum_{j} \left( \theta^{x-1} \omega_{ij} - E(\omega_{ij}) \right) \delta_{ij}$$

where
$x$ is the geodesic distance between nodes $i$ and $j$,
$\omega_{ij}$ is the weight of the edge between nodes $i$ and $j$,
$E(\omega_{ij})$ is the expected weight,
$\delta_{ij}$ is the Kronecker delta indicating whether node $j$ is in group $i$, and
$\theta$ is the dampening factor.

This function is maximized by selecting all nodes such that

$$\theta^{x-1} \omega_{ij} - E(\omega_{ij}) \ge 0$$

4.2.2 Refining Communities

The algorithms that maximize modularity do more than just choose nodes with strong connections; they also choose the groupings that have the strongest connections with each other.
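The gathering steps of Section 4.2.1 can be sketched as a level-by-level search outward from one seed. This is only an illustration under stated assumptions: the distance $x$ is taken to be the round in which a node is reached, the dampened weight is compared with the expected weight of the same edge, and the function name `gather`, the constant expected-weight matrix, and the toy chain graph are all hypothetical.

```python
import numpy as np

def gather(W, E, seed, theta):
    """Gathering phase sketch: grow a node set outward from one seed.

    W     : matrix of directed edge weights, W[u, v] is the weight u -> v
    E     : matrix of expected edge weights on the same index set
    theta : dampening factor applied once per level beyond the first
    """
    W = np.asarray(W, dtype=float)
    E = np.asarray(E, dtype=float)
    n = W.shape[0]
    community = {seed}
    frontier = [seed]
    x = 1  # distance from the seed of the candidates tested this round
    while frontier:
        damp = theta ** (x - 1)  # no dampening for direct neighbours
        added = set()
        for u in frontier:
            for v in range(n):
                if v in community or v in added:
                    continue
                # check both outgoing and incoming edges of the frontier node
                out_ok = W[u, v] > 0 and damp * W[u, v] >= E[u, v]
                in_ok = W[v, u] > 0 and damp * W[v, u] >= E[v, u]
                if out_ok or in_ok:
                    added.add(v)
        community |= added
        frontier = sorted(added)  # only newly added nodes are expanded next
        x += 1
    return community

# Hypothetical chain 0 -> 1 -> 2 -> 3 with a weak final edge.
W = np.zeros((4, 4))
W[0, 1] = 10.0
W[1, 2] = 10.0
W[2, 3] = 1.0
E = np.full((4, 4), 6.0)  # constant expected weight, for illustration only
undamped = gather(W, E, seed=0, theta=1.0)
damped = gather(W, E, seed=0, theta=0.5)
print(undamped, damped)
```

With no dampening the search reaches node 2 through node 1, while a dampening factor of 0.5 stops it after the seed's direct neighbour, which is the effect the factor is meant to have.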
Because the initial stage of the algorithm only requires a strong connection with one node in the community for a new node to be added, a further step is needed to ensure strong connectivity with the other nodes in the community. Communities can be refined by drawing on the filtering idea of OSLOM, written by Lancichinetti et al., which ranks the nodes in a community by their connectivity to the community, then checks whether the least connected node is strongly connected to the community and removes it if it is not [1]. Specifically, for each node $i$ in a community $C_i$ we can define its connectivity to $C_i$ by counting the number of nodes in the community that node $i$ is strongly connected to and dividing by the number of nodes in the community, contingent on defining a minimum
threshold $t$ for what constitutes a strongly connected community. Mathematically, for a community with $n$ nodes this is

$$\text{Connectivity}_{C_i} = \frac{\sum_{j=1}^{n} f(j)}{n}, \qquad f(j) = \begin{cases} 1 & \omega_{ij} \ge E(\omega_{ij}) \\ 0 & \text{otherwise} \end{cases}$$

Once these connectivity rankings have been calculated, we take the smallest and check that it is at least as large as the threshold value $t$. If it is not, that node is removed, and the ranking process starts over and repeats until no more nodes can be removed [1]. Once this is complete, the communities should resemble the result given by a modularity optimization technique.

4.2.3 Filtering communities

Because seed nodes can be close together, it is possible for the dampened modularity algorithm to produce several communities that are approximately the same. The question is how to use this information to represent the communities. A couple of approaches attempted in this research required calculating similarities between communities. These similarities can be calculated with the Jaccard similarity measure, which is the size of the intersection of two sets divided by the size of their union [5]. Which communities are similar enough can be decided by choosing a threshold value, which can then be tuned in a similar way to the dampening factor.

One approach is to merge communities whose similarity is above the threshold into a single community. As simple as this approach is, it fundamentally changes the composition of the community, so that connectivity is no longer guaranteed. This was seen empirically when the approach was applied to the communities found by the dampened modularity algorithm. Another approach is to filter the communities whose similarity is above the threshold and select which one is the better community. At the moment this is done by looking at the connectivity of the nodes inside the community relative to the nodes outside of it.
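The refining loop and the Jaccard similarity used for filtering can be sketched together. This is a minimal illustration rather than the implementation used for the results: the helper names `connectivity`, `refine`, and `jaccard` are hypothetical, a connection is counted as strong only where an edge actually exists and its weight is at least the expected weight, and the small weight matrix is made up for the example.

```python
def connectivity(W, E, community, i):
    """Fraction of the community that node i is strongly connected to."""
    strong = sum(
        1 for j in community
        if W[i][j] > 0 and W[i][j] >= E[i][j]   # f(j) = 1, assumed to need an edge
    )
    return strong / len(community)

def refine(W, E, community, t):
    """Repeatedly drop the least connected node while it falls below t."""
    community = set(community)
    while community:
        scores = {i: connectivity(W, E, community, i) for i in community}
        weakest = min(sorted(scores), key=scores.get)
        if scores[weakest] >= t:
            break                    # every remaining node passes the threshold
        community.remove(weakest)    # rankings are recomputed on the next pass
    return community

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical data: nodes 0, 1, 2 form a triangle; node 3 hangs off node 2.
W = [[0, 5, 5, 0],
     [5, 0, 5, 0],
     [5, 5, 0, 5],
     [0, 0, 5, 0]]
E = [[1] * 4 for _ in range(4)]      # constant expected weight, illustration only
refined = refine(W, E, {0, 1, 2, 3}, t=0.5)
similarity = jaccard({0, 1, 2}, {1, 2, 3})
print(refined, similarity)
```

Here node 3 is strongly connected to only one of four members, so it is removed at a threshold of 0.5, after which the remaining triangle passes the check.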
This connectivity is defined relative to nodes outside the community $C$ by

$$F = \frac{\sum_{i \in C} \rho_i s_i}{\sum_{i \in C} (1 - \rho_i) \sum_{j \neq i,\, j \notin C} s_j}$$

In the case where overlapping communities are not allowed, we could also easily adapt a methodology that merges similar communities if their Jaccard value is above the threshold and, if the Jaccard value is between zero and the threshold, assigns the overlapping nodes to the community to which they are most strongly connected.

4.2.4 Determining the dampening factor

The validity of the dampening factor still needs to be demonstrated; at the moment, however, it seems a reasonable addition because of the theoretical possibility that adding nodes to a community based on their actual edge strength
minus their expected edge strength could result in the community including a large node set that lacks any semblance of connectivity. It therefore seems reasonable to impose a dampening factor that scales with geodesic distance from the seed node to prevent that from happening.

Assuming that the dampening factor is significant, it is also necessary to determine what its value should be. A value that is too small would result in only nodes at geodesic distance one from the seed node being collected. A value that is too large, on the other hand, could result in the issue mentioned above, with far too many nodes being added. The easiest way to determine the value is to tune the dampening parameter to whatever results in the best community divisions. To do this we need some measure of how well a network is divided into overlapping groups. The overlapping aspect is significant because it is reasonable for nodes to be part of more than one community in this commuter data, just as a person can be in more than one friend group. The most widely used quality measure is Newman's modularity, discussed earlier; however, it is unclear how well this measure will fare at judging the overlapping communities generated by this algorithm. An alternative measure proposed by Shen et al. deals with the modularity of overlapping communities [4]. The formula they derived for $K$ communities is as follows:

$$Q_{ov} = \frac{1}{2M} \sum_{w=1}^{K} \sum_{i \in C_w,\, j \in C_w} \frac{1}{O_i O_j} \left( A_{ij} - \frac{d_i d_j}{2M} \right)$$

[4], where $C_w$ is community $w$, $A_{ij}$ is the value of the edge from $i$ to $j$, $O_i$ is the number of communities node $i$ belongs to, and $2M$ is the total degree $d_t$ of Section 3.1. Because this metric is a modification of the modularity measure, it allows for substitution of a weighted null model. Thus in the case where the continuous configuration null model is used, it can easily be plugged in and the metric will still work.
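As a rough check on the overlapping modularity formula, the sketch below evaluates it for an undirected, unweighted graph using the plain configuration-model null. The function name `overlapping_modularity` and the two-triangle example are illustrative assumptions; substituting a weighted null model would mean replacing the `np.outer(...) / two_m` term.

```python
import numpy as np

def overlapping_modularity(A, communities):
    """Overlapping modularity Q_ov for a list of (possibly overlapping) node sets."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)            # degrees d_i
    two_m = d.sum()              # 2M, the total degree
    O = np.zeros(A.shape[0])     # O_i = number of communities containing node i
    for c in communities:
        for i in set(c):
            O[i] += 1
    q = 0.0
    for c in communities:
        idx = np.array(sorted(set(c)))
        excess = A[np.ix_(idx, idx)] - np.outer(d[idx], d[idx]) / two_m
        q += (excess / np.outer(O[idx], O[idx])).sum()
    return float(q / two_m)

# Hypothetical example: two triangles joined by the edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

q_disjoint = overlapping_modularity(A, [{0, 1, 2}, {3, 4, 5}])
q_overlap = overlapping_modularity(A, [{0, 1, 2, 3}, {3, 4, 5}])
print(q_disjoint, q_overlap)
```

When no node belongs to more than one community every $O_i$ is one, so the measure reduces to the ordinary modularity of Section 3.2.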
This method has also been shown to be similarly effective at evaluating the quality of a division as the other overlapping metrics tested in a survey paper by Tanmoy Chakraborty et al., while being less computationally intensive than those metrics [9]. It is also possible that, once the dampening factor has been justified, it could be defined numerically from a value like graph density. This would unambiguously decrease the run time of the algorithm, because there would be no need for tuning.

4.2.5 The expected value

The other major question is how best to calculate the expected value used in the optimization. Attempts for this project have included a constant expected value of .01 and one determined from a Taylor approximation. The method that seems to work best, however, is the continuous configuration model, although other methods, such as defining degree-conditional distributions, could be used as well.
4.2.6 Application in relation to CCME

In its current form the dampened modularity algorithm could prove valid as an accurate and legitimate community detection algorithm. What may be a better application for this algorithm, however, is creating subsets of nodes that could be in a seed node's community. This is significant because in our current usage of the CCME algorithm we have been getting communities that are far larger than we would expect. By applying the gathering portion of the dampened modularity algorithm we could gather a set of feasible nodes, and then apply the CCME algorithm to that set for the corresponding seed to find a statistically significant community. This should reduce the overall run time of the CCME algorithm, because it has fewer nodes to check than it would without such a restriction. This technique should also result in a smaller set of nodes within a community. The key to making this work is demonstrating that it is in fact possible to find a dampening factor for which the gathered nodes include all possible community nodes for a given seed without collecting every node in the graph.

4.3 A dynamic and possibly predictive alteration

Using a framework set up in the OSLOM paper, it is possible to adapt both the dampened modularity and CCME algorithms to dynamic datasets because of their extractive nature. The importance of their extractive nature is that these algorithms can start from an initial community generated at time $t$ and detect communities at time $t + \Delta t$ using that initial community as a starting point [1]. Specifically, this would mean taking the initial community, adding nodes in a gathering stage based on data from $t + \Delta t$, and then trimming the communities in a refining phase based on data from $t + \Delta t$ [1].
It is necessary to add a comparison of the seed nodes at time $t$ and at time $t + \Delta t$, so that if a new node becomes a seed node, a community is created around it. Once that is done, the dampened modularity algorithm can be applied to the data from $t + \Delta t$, using the communities from time $t$ together with the new seed nodes as starting points. It should be noted that, because the dampened modularity algorithm relies on geodesic distance, the node from which that distance is measured is the one with the largest seed criterion value. For example, in our network data we have used population as the seed determinant, so if there were seven nodes in a community, the node with the largest population at time $t + \Delta t$ would be the one from which geodesic distance is calculated.

For the CCME algorithm the approach in OSLOM could be applied, but it would be necessary to first filter the communities on the data from time $t + \Delta t$ to ensure that all nodes still have enough connectivity with the community. This could be done using the approach mentioned in an earlier section. Once the filtering is done, the CCME can be run on the data at time $t + \Delta t$ using those filtered communities as starting points. It will also have to perform the full algorithm on the new seed nodes that appeared in the time $t + \Delta t$ dataset.

If these dynamic alterations are appropriate, this same dynamic approach could be
used for prediction of communities, if one were able to predict the weights of the edges at some future time.

5 Application to the 2010 commuter data

5.1 Application of the Dampened Modularity Algorithm

In applying the Dampened Modularity Algorithm to the 2010 commuter data it was vital to tune the dampening factor, the connectivity parameter, and the similarity threshold for filtering. These values are tuned by maximizing the overlapping modularity measure, using the CCM as the null model. A further modification of the overlapping modularity measure is to subtract from it the number of initial seed nodes that do not remain in communities. This is done because, after running the algorithm, it appears possible for the initial seed nodes not to remain in the finalized groups, either because they are not strongly connected to the group or because they are not in the most connected of the overlapping groups. Due to the desire to save time, these values are only tuned to the tenths place.

The dampening factor is tuned by looking at the quality values for dampening factors from 1 down to .1, with the other two thresholds set to zero. From those quality values, the two dampening factors with the highest quality scores are selected and tested on the connectivity measure. For the connectivity measure and the similarity threshold, the tuning looks at .9, .5, and .1. The values at which the two largest overlapping modularity values occur then bound the interval in which the remaining values are tested. So if .5 and .1 had the two largest overlapping modularity values, .4, .3, and .2 would also be tested to find the connectivity value. The reason .9 rather than 1 is used as the maximum is that 1 means perfectly connected and exactly the same for the respective measures.
These values are not going to occur often and thus are not going to be very useful for this particular problem.

The dampening factor results can be seen in Figure 4. In this figure it is clear that the two highest quality values are achieved at the highest and lowest dampening factors tested. Because of the negative penalty added to the overlapping modularity measure, it seems reasonable to test both of these values at the connectivity levels and compare the results, since connectivity could affect them in very different ways. From those results it appears that a connectivity threshold of .3 combined with a dampening factor of 1 produces the highest quality measure. With these values fixed, the final step was to choose a similarity threshold for filtering, which from the graph was .7.
Figure 4: Dampening factor tuning
Figure 5: Connectivity Optimization
Figure 6: Filtering Optimization

Plugging these values into the algorithm results in the following set of graphs, with each layer depicting disjoint communities.
Figure 7: Dampened Modularity Communities

These graphs represent 347 community divisions. On average these areas contain counties with a standard deviation of counties. The largest division by population was the Los Angeles Metro Area, with 15,620,448 people, including Los Angeles County, Orange County, Riverside County, and San Bernardino County. The size distribution is depicted in Figure 8.
Figure 8: Dampened Modularity size distribution

5.2 Application of CCME

The CCME algorithm was much more straightforward to apply because it requires no parameter tuning. The only necessary step was identifying the initializing nodes. This was done by looking for counties that have at least 1% self commuters relative to total commuters, as long as the number of self commuters is at least 20,000. The reason for this was to select large population centers, but in a way that was more dependent on the network data. The results for the non-overlapping communities are similarly depicted in the following plots, with the final plot representing single county communities.
Figure 9: CCME communities
The CCME algorithm produced 214 communities with an average size of counties and a standard deviation of counties. The distribution of the community sizes is depicted in the following plot.

Figure 10: CCME community size distribution

5.3 Comparing to MSA results

Figure 11: MSA Metro Areas [8]

The metropolitan communities are outlined in red and the micropolitan communities are outlined in blue. Micropolitan communities differ from metropolitan communities by having at least one area with a population of 20,000, versus 50,000 in metropolitan areas [8]. The MSA has identified 919 communities. The average size of the MSA metro communities is 1.94 with a standard deviation of The distribution of these community sizes can be found in Figure 12.
Figure 12: MSA Size of Communities

When comparing the three result sets, the first noticeable difference is that the MSA tended to produce the smallest communities, followed by the Dampened Modularity method, and then the CCME, which formed much larger communities than the other two. It unsurprisingly follows that the number of communities identified by each method runs in the reverse order of community size, with CCME finding the fewest communities and MSA the most. The difference in size distributions between the methods is interesting in that the MSA and CCME methods produce a fairly exponential distribution of community sizes, while the Dampened Modularity approach does not seem to produce anything resembling a known distribution.

5.4 Finding meaning

In assessing the two methods in comparison to the MSA delineations, one big advantage is that, because of the structure of the algorithms, there is a reason for each of the communities that can be supported by data. The same cannot be said for the MSA delineations, which depend in part on historical delineations [2]. The other advantage the algorithms have over the MSA is that they identify overlapping communities, which makes intuitive sense given that these counties interact with so many other counties.

The algorithms also have several disadvantages. For the CCME the major issue is that it forms communities that are almost twice as large as the ones defined by the other two methods. The reason may be the way the test statistic is defined, namely that connectivity is the sum of weights in a community compared to the expected weights in that community.
The problem is that a single edge weight that greatly exceeds its expected value could skew this statistic enough that loosely connected counties end up being added, when they should not have been, solely because they have a strong connection with one county in the community. This could be corrected by adjusting the manner in which the connectivity of a node to a set is defined. For example, the connectivity could be defined by the middle 50% of the weights, or by some other delineation that is less affected by large outliers than the overall connectivity. A different option would be to trim communities once they are formed, in a manner similar to OSLOM, ensuring
that there is some type of minimum connectivity between nodes.

For the Dampened Modularity algorithm, a problematic occurrence is that, as noted in section 5.1, it at times forms communities that do not include their seed nodes. This could happen because the quality measure used does not properly represent the best delineation of communities, which could result in a bad choice of dampening factor, connectivity threshold, etc. Another possibility is that the way the seed nodes are chosen is not necessarily representative of where the communities should be, which could again result in those nodes falling out of the communities that they seeded. This could be corrected by using the procedure used by the CCME, or possibly a centrality measure, to determine what should be a seed node.

Another interesting occurrence in the Dampened Modularity method is that its size distribution of communities is so different from those of the other two methods. This difference could be evidence that the method simply does not work, or it could be caused by the method being derived from one designed to identify community delineations for a whole graph. Specifically, because only certain communities are chosen from the graph, using a method designed to identify every community present in a graph, the distribution could differ from the one expected had every community been chosen. Further investigation into the algorithm is therefore needed to make a complete determination of its validity.

6 Summary

After examining the work done so far this year, it seems we have the beginnings of a strategy that will allow us to robustly determine these metropolitan areas. A theoretical proof of the validity of the dampening factor would be useful in the future, especially given its debatable importance in the 2010 data.
Both the CCME method and the Dampened Modularity method can and should be further modified to better capture the metropolitan community structure. However, both algorithms produce divisions that can be supported by data, which is an improvement over the MSA results. There are also numerous other algorithms capable of making these same divisions that could be considered. Finally, the longitudinal aspect of the algorithms could be further explored to see how the communities evolve over time. This could be done using an OSLOM-type edge addition and subtraction technique as described in section 4.3. If this approach is refined, it would be a natural extension to attempt to predict what the communities would be in the future, which would hopefully give policy makers a better long-term picture of the problems that they could face.

References

[1] Andrea Lancichinetti, Filippo Radicchi, José J. Ramasco, and Santo Fortunato. Finding statistically significant communities in networks
[2] United States Census. Metropolitan and micropolitan. January Accessed on
[3] Santo Fortunato. Community detection in graphs
[4] Huawei Shen, Xueqi Cheng, Kai Cai, and Mao-Bin Hu. Detect overlapping and hierarchical community structure in networks
[5] John Palowitch, Shankar Bhamidi, and Andrew Nobel. Significance-based community detection in weighted networks.
[6] Mark Newman. Networks: an Introduction. Oxford University Press,
[7] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks
[8] Steven G. Wilson, David A. Plane, Paul J. Mackun, Thomas R. Fischetti, and Justyna Goworowska. Patterns of metropolitan and micropolitan population change: 2000 to
[9] Tanmoy Chakraborty, Ayushi Dalmia, Animesh Mukherjee, and Niloy Ganguly. Metrics for community analysis: a survey.