Graph Structure Over Time


Graph Structure Over Time
Observing how time alters the structure of the IEEE data set

Priti Kumar
Computer Science, Rensselaer Polytechnic Institute, Troy, NY
Kumarp3@rpi.edu

Abstract: This paper examines the IEEE data set and looks at how snapshots of the graph taken at different times differ in structure. It also looks at how subsets of the graph differ from one another.

I. INTRODUCTION

A social network is a special type of graph that represents how people interact and relate to one another. Because social networks are built from people, their structural properties differ greatly from the anticipated structure of a randomly generated graph: human actors make deliberate, non-random choices about whom they connect to.

This paper explores the properties of a graph built from a data set of papers in the IEEE database from 1990 to 2010, and examines how different subdivisions of the graph relate to each other. Since the properties are not what would be expected of a random graph, looking at portions of the graph lets us delve deeper into why this might happen and what leads to it. The graph is examined both as it evolves through the addition of new papers and by looking at only the papers newly added each year. The consistency of certain unusual properties is observed, relative to the graph of the entire data set and to graphs of other divisions of the data set. These questions about the structure of the graph, and how constant its properties are, allow for a deeper understanding of what a social network is.

The IEEE data set depicts an even more specialized graph than a general social network: a network based specifically on work done and papers published. The way authors connect differs significantly both from how nodes of a random graph would connect and from how people connect in other areas of their lives. This makes the data set a particularly interesting case for these questions.

II. ORGANIZING THE DATA SET

The first step in answering these questions was to put all the information in the data set into a graph, which provides a better framework for analysis. The raw data is organized by paper: for each paper, it gives the paper's identification number, the year in which it was written, and the authors who collaborated on it, covering the years 1990 through 2010.

In the graph, each author is a node, and two nodes are connected if the authors ever worked on a paper together. The graph is therefore unweighted; the reasoning for requiring only a single joint paper is discussed in later sections.

The division of the graph into subsets is a crucial part of this paper. The overall data set is divided by year and then viewed in two ways. The first is a cumulative graph, in which each year's data includes that year and all previous years; the 1997 cumulative graph, for example, contains all papers from 1990 up to and including 1997. This view is used to study how additional information changes the structure of the graph. The other view is year by year, in which each year's data contains only the papers from that year; the 1997 separated graph holds only information from 1997. Both views are necessary to observe trends both in the growing graph and in the incoming information being added. To properly understand the developments in the graph, information about what new data is being added is necessary, and this organizational structure provides it.

III. INCOMING INFORMATION

When observing the effects of time on the graph, it is crucial to look at how the information added at each time step relates to the graph and to the information from other time steps. Changes in the inputs help explain how and why certain properties of the overall graph hold true; some trends in the graph of the entire data set, such as its overall size, are directly driven by the incoming information.

The data set is organized by paper, not by author as the graph is, so the first quantity to analyze is the number of papers written per year. Of the two graph sets, the cumulative set will clearly increase continually: each year adds more papers, and all of these are added to a graph that already contains all previous papers. The separated data set has no particular reason to increase, yet as Chart 1 below shows, the number of papers added rises steadily. In 1990 there are 1,367 papers; by 2000 there are 4,534, roughly 3.3 times the 1990 figure; and 2010 contributes 8,435 papers, about 6.2 times the 1990 count and 1.9 times the 2000 count. The yearly count does reach a local peak of 9,477 papers in 2008, but there is not enough information to confirm whether the slight decrease that follows is a continuing trend or an anomaly; 2007 has 8,789 papers and 2009 has 8,810, so 2008 is a peak only for that local area. The general trend, however, indicates a rising number of papers per year.
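The organization described above can be sketched directly in code. The following is a minimal Python sketch, assuming each record is a (paper id, year, author list) tuple; the toy records here are invented for illustration and are not taken from the IEEE data set.

```python
from collections import defaultdict
from itertools import combinations

# Assumed record layout: (paper_id, year, [authors]).
# These sample records are invented, not real IEEE data.
papers = [
    (1, 1990, ["A", "B", "C"]),
    (2, 1990, ["B", "D"]),
    (3, 1991, ["A", "D"]),
]

def cumulative_graph(papers, up_to_year):
    """Unweighted co-authorship graph over all papers up to and including a year."""
    adj = defaultdict(set)
    for _, year, authors in papers:
        if year <= up_to_year:
            # A single joint paper is enough to create an edge.
            for u, v in combinations(authors, 2):
                adj[u].add(v)
                adj[v].add(u)
    return adj

def separated_graph(papers, year):
    """Graph containing only the papers from a single year."""
    return cumulative_graph([p for p in papers if p[1] == year], year)

g = cumulative_graph(papers, 1991)
print(sorted(g["A"]))  # ['B', 'C', 'D']
```

Note that an edge is unweighted: writing three papers with the same co-author produces the same single edge as writing one.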
Chart 1: Number of papers by year
Chart 2: Number of authors per year

In addition to the number of papers written per year, the number of authors added to the data set per year must be examined. If the number of papers increased while the total number of authors remained relatively constant, it would mean that the same authors were continually writing papers. Chart 2 shows that the number of authors in a given year follows a trend similar to that of the number of papers. For the separated data set, the number of authors added correlates directly with the number of papers added, increasing significantly until 2008. In 1990, the graph starts with 3,331 authors across the 1,367 papers. In 2000, 11,325 authors wrote papers added to the data set, about 3.4 times the 1990 figure. By 2010, the yearly figure is 23,827 authors, about 2.1 times the 2000 count and 7.2 times the 1990 count. As with the papers, the numbers rise until 2008 and dip slightly afterward; again this hints at a possible new trend, but there is not enough data to analyze it properly. In the cumulative graph the number of authors also increases, but it is not simply the sum of the yearly counts, since authors with papers in more than one year are counted only once; still, its rate of increase is similar to that of the number of papers.

From these counts, the average number of authors per paper can be examined. A higher average implies papers with large numbers of authors, authors writing fewer papers per year, or a combination of the two; a lower average implies authors publishing alone or writing more papers per year. An interesting distinction appears between the two representations of the data. The separated data set shows an increase in the average, though not a large one, from 2.437 in 1990 to 2.825 in 2010, while the cumulative graph shows a decrease, ending at 1.556 in 2010, down from 2.437 in 1990. This indicates that new authors are continually being added to the data set at a significant rate.

IV. PERCENT OF CONNECTIONS

To examine the IEEE data set fully, the number of connections should be compared with the number of possible connections. This gives a general idea of how hard it is to connect to another node in the graph. The calculation divides the total number of edges formed in the graph by the number of possible edges, where the number of possible edges is the number of nodes multiplied by the number of nodes minus one, allowing every node to connect to every other node. The quotient is the percentage of possible connections actually formed, and, for our purposes, also the probability of a connection forming.

The percentage of connections actually formed is quite low. It starts at 0.00089 for the separated 1990 data set, increases over the next few years to a still-low peak of 0.00106 in 1991, and ends at 0.000159 for 2010. The trend resembles that of the paper and author counts, and again 2008 stands out, this time as a local low of 0.000145, against 0.000154 in 2007 and 0.000160 in 2009. Because the 2009 value exceeds the 2007 value, this is further evidence that 2008 may be an anomaly, though it still causes a slight shift.
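The percent-of-connections calculation described above can be written down directly. A sketch, following the paper's definition of dividing the edge count by n(n-1): since an adjacency map records each undirected edge in both directions, this equals the conventional density 2m/(n(n-1)). The small adjacency map is an invented example.

```python
def percent_of_connections(adj):
    """Fraction of possible connections actually formed.

    The paper divides the edge count by n*(n-1); counting each undirected
    edge in both directions, as an adjacency map does, gives the standard
    density 2m / (n*(n-1)).
    """
    n = len(adj)
    if n < 2:
        return 0.0
    directed_edges = sum(len(neigh) for neigh in adj.values())  # each edge twice
    return directed_edges / (n * (n - 1))

# Toy graph: 4 nodes, 3 edges (A-B, A-C, C-D) -> 2*3 / (4*3) = 0.5
adj = {"A": {"B", "C"}, "B": {"A"}, "C": {"A", "D"}, "D": {"C"}}
print(percent_of_connections(adj))  # 0.5
```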
Still, it could easily be a small anomaly; more data would be needed to confirm anything. The significant decline in the percentage of connections over the years shows that many more authors were either working alone or working with the same group of people rather than with new people.

The cumulative graph tells a slightly different story. Its starting value is the same 0.00089 for 1990, but since new nodes are added while old nodes remain present, the overall probability falls far more sharply than the decrease in the separated data set. There is no 1991 peak as in the separated graph; the percentage of connections simply declines continuously. The rate of decrease is itself telling: it starts out fairly high and then shrinks as the curve appears to level off, although, again, there is not enough data to fully support this idea. The cumulative values are much lower than the separated values even though both start at the same point, since both initially include only the 1990 papers. The reason is intuitive: more nodes are added, but not every new node connects to old nodes and not every old node connects to new ones. As papers are added, the growth in potential connections is too great to overcome, so a higher percentage of realized connections cannot be maintained. Many of the newer nodes make few connections to older nodes and also realize few of their potential connections within their own years. By 2010, the percentage of connections in the cumulative graph is 0.0000367, much lower than the separated-year value of 0.000159.

Chart 3: Percent of connections formed, separated years data set
Chart 4: Percent of connections formed, cumulative years data set

V. DEGREE DISTRIBUTION

Another crucial part of the data is its degree distribution. The degree of a node counts the edges that go from that node to other nodes in the graph; because this is an unweighted, undirected graph, the degree is simply the number of edges the node has. Another option would have been to weight each edge by the number of papers written together, so that an author who wrote three papers with the same co-author would share an edge of weight three, but that information did not seem valuable for the question discussed in this paper. The degree distribution shows how authors collaborate with their peers: it reveals whether people tend to work with larger or smaller groups. A distribution in which high degrees are better represented than low degrees would indicate that most authors prefer to work with a wide variety of people. The shape of the distribution curve is also informative, telling us the general trend and how sharply the extremes fall off.

Part of the question is comparing the actual structure of the graph to what would be expected of a random graph with similar attributes. To examine this, an Erdős-Rényi graph was constructed to match each time stamp of the graph. The random graph was created by giving a simulator the number of nodes, n, and the probability of connecting, p, where p is the probability of connecting for the given time step as discussed earlier. The simulator created n nodes, and each node attempted to connect to every other node; based on probability p, the node would connect.
This means each pair had probability p of connecting and probability (1-p) of not connecting; a random number between zero and one was generated to decide each attempt. The random graph gives a baseline expectation of what the degree distribution should look like. The expected graphs show a bell-shaped curve peaked toward the middle. Graphs were generated for each time step, and each showed a similar bell-shaped expected degree distribution, meaning that most nodes have a degree somewhere around the middle of the distribution. For a randomly generated graph this is the instinctive trend: there is no specific reason not to connect to other nodes, so if the probability comes up, the nodes connect. Since all the generated graphs show the trend, it can safely be assumed that this is the expected behavior for any graph of this type. Chart 5 shows an example based on the 1993 data, using the number of nodes for 1993 as a singleton year (4,264 authors with a 0.000730356 probability of connecting). Chart 6 shows the corresponding graph for 1993 as a cumulative data set (11,262 authors and a 0.00032135 chance of connecting).

While the expected graphs show nodes connecting with a decent number of other nodes, the actual data set shows nothing of the sort. A significant number of nodes have a degree of only one or two, meaning that all the papers they have worked on are with the same group of people. Compared to the expected graphs, the results are striking: why should the difference be so great? Thinking of this as a social network answers the question to some degree, and remembering that the data set is based on papers written together answers another portion of it. Because this is a social network, decisions cannot be expected to be made randomly.

The random graph allows any node to connect to any other node; a graph based on real people under real circumstances would not form connections that way. Writing a paper is a commitment of time and work, and for two authors to agree to collaborate, many preconditions must be met. There is usually some form of existing connection between the two, very often working at the same university or research institution, which limits the possible connections a great deal; two authors with no previous connection are rare in this setting. Physical distance is another limitation, since some authors might not want to work with a collaborator they cannot meet in person. A further limit is the field in which the authors work: while some papers span multiple fields, many require specialized knowledge of one area. All of this explains why the actual results shown in Charts 7 and 8 differ so much from the expected results shown in Charts 5 and 6. Chart 7 shows the degree distribution for the singleton year 1993; Chart 8 shows the degree distribution for the cumulative data set from 1990 to 1993.
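The Erdős-Rényi simulation described above is simple to reproduce. The following is a minimal G(n, p) sketch, not the author's actual simulator, using the 1993 separated parameters quoted in the text (n = 4,264 nodes, p = 0.000730356); the seed is arbitrary.

```python
import random
from collections import Counter

def erdos_renyi(n, p, seed=0):
    """G(n, p): each of the n*(n-1)/2 possible edges exists with probability p."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:  # draw in [0, 1); connect if below p
                adj[u].add(v)
                adj[v].add(u)
    return adj

def degree_distribution(adj):
    """Map degree -> number of nodes with that degree."""
    return Counter(len(neigh) for neigh in adj.values())

# Parameters from the 1993 separated graph described above.
g = erdos_renyi(4264, 0.000730356, seed=42)
dist = degree_distribution(g)
mean_degree = sum(d * c for d, c in dist.items()) / len(g)
print(mean_degree)  # close to p*(n-1), about 3.1, with a bell-shaped distribution
```

The degree counts follow a binomial distribution peaked near p(n-1), which is the bell shape the expected charts show.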

[Charts 5-8. Chart 5: Expected distribution, 1993 separated. Chart 6: Expected distribution, 1993 cumulative. Chart 7: Actual distribution, 1993 separated. Chart 8: Actual distribution, 1993 cumulative. Each chart plots degree on the horizontal axis against number of nodes on the vertical axis.]

VI. CLUSTERING COEFFICIENT

Along with the degree distribution, the clustering coefficient is an important statistic of the data set. The clustering coefficient analyzes how the graph is structured in terms of which nodes connect to which others. For each node, it looks at the node's neighbors and checks whether those neighbors are connected to each other; if two neighbors are connected, they form a triangle with the original node. This value is calculated for every node in the graph, and the total is divided by the number of nodes, producing a number between zero and one. A complete graph would have a coefficient of one. This statistic shows whether authors work within tight groups or with people who do not know each other.

Random graphs were again used to determine the anticipated clustering coefficient for a graph with this probability of connecting and number of nodes: the Erdős-Rényi model was used once more, and the expected clustering coefficient was read from the random graph. The expected clustering coefficient correlates directly with the probability of connecting discussed earlier; if the probability entered is 0.000892877, the expected clustering coefficient is the same. In a random graph, the chance of forming a triangle of the sort described above equals the probability of connecting two nodes in the first place.

As with several other properties examined, the clustering coefficient increases with time in this data set. This relates to the growing number of authors relative to the number of papers being added: if there are more authors per paper, the clustering coefficient will rise. Charts 9 and 10 below show clearly that the clustering coefficient rises almost constantly (there is a slight anomaly, but it is corrected quickly). Chart 9 shows the coefficient for each separated year; Chart 10 shows the same information for the cumulative data set. While the coefficient of the separated graph rises by 0.19, the cumulative graph rises by only 0.07. This is an effect of new nodes being added while old nodes are no longer active; because of it, the cumulative coefficient does not rise as fast as the separated one. In the random graph, nodes are always active with no consideration of time, and the separated data set shares this property, since each year stands alone. The anomaly is the year 1992, which shows a decrease, but it proves to be only an anomaly rather than a trend, because the coefficient increases again after that year.

Chart 9: Clustering coefficient by separated years
Chart 10: Clustering coefficient by cumulative years

The actual values of the clustering coefficient are wildly different from the expected values. As with the degree distribution, the expected values describe a scenario where random connection greatly changes the situation; the observed values are much, much higher than the anticipated ones. Table 1 below shows how these numbers vary. The first column lists the years represented; the second gives the expected clustering coefficient, calculated from the actual probability of connecting as discussed earlier; the last column shows the actual clustering coefficient for the graph over those years. The actual values are far higher than the expected values: the actual coefficient for 1990 is more than six hundred times the expected value, and this gap widens over time as the probability of connecting decreases while the clustering coefficient increases.
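The clustering coefficient computation described above can be sketched as follows. One detail is an assumption: the paper does not say how nodes with fewer than two neighbors are treated, so here they contribute zero to the average.

```python
from itertools import combinations

def avg_clustering(adj):
    """Average local clustering coefficient.

    For each node, take the fraction of its neighbor pairs that are
    themselves connected, then average over all nodes. Nodes with fewer
    than two neighbors contribute 0 (an assumption; the paper does not
    specify their treatment).
    """
    total = 0.0
    for node, neigh in adj.items():
        k = len(neigh)
        if k < 2:
            continue
        closed = sum(1 for u, v in combinations(neigh, 2) if v in adj[u])
        total += closed / (k * (k - 1) / 2)
    return total / len(adj)

# A triangle A-B-C plus a pendant node D attached to C:
# A and B each score 1, C scores 1/3 (only A-B of its three pairs), D scores 0.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(avg_clustering(adj))  # (1 + 1 + 1/3 + 0) / 4 = 0.58333...
```

Run on a G(n, p) graph, this average comes out close to p, which is the correspondence between the expected coefficient and the probability of connecting noted above.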

Table 1: Expected vs. actual clustering coefficient

Separated years:
  Years       Expected        Actual
  1990        0.000892877     0.595256
  1991        0.001058157     0.601461
  1992        0.001002386     0.592469
  1993        0.000730356     0.651579
  1994        0.000649383     0.641051
  1995        0.000503312     0.641131
  1996        0.000426278     0.656631
  1997        0.000449407     0.680394
  1998        0.000362182     0.666448
  1999        0.000308224     0.69693
  2000        0.000289385     0.69492
  2001        0.00029336      0.69369
  2002        0.000238        0.701429
  2003        0.000230582     0.719133
  2004        0.000216318     0.722704
  2005        0.000215321     0.744165
  2006        0.000166057     0.760883
  2007        0.000154151     0.757536
  2008        0.000144627     0.765875
  2009        0.000159618     0.774594
  2010        0.000159129     0.786621

Cumulative years:
  Years       Expected        Actual
  1990        0.000892877     0.595256
  1990-1991   0.000556907     0.59419
  1990-1992   0.000425347     0.590745
  1990-1993   0.00032135      0.604458
  1990-1994   0.000254234     0.608008
  1990-1995   0.000204267     0.611121
  1990-1996   0.00016769      0.614859
  1990-1997   0.000146121     0.620332
  1990-1998   0.00012633      0.621839
  1990-1999   0.000107856     0.628934
  1990-2000   9.52645E-05     0.632215
  1990-2001   8.63471E-05     0.634434
  1990-2002   7.64192E-05     0.63577
  1990-2003   6.92452E-05     0.6401
  1990-2004   6.33298E-05     0.643752
  1990-2005   5.86011E-05     0.648977
  1990-2006   5.20065E-05     0.657558
  1990-2007   4.67062E-05     0.662459
  1990-2008   4.24328E-05     0.666659
  1990-2009   3.94624E-05     0.669954
  1990-2010   3.67434E-05     0.674314

The differences from the randomized graphs go back to what was discussed with the degree distribution: this data set carries many more limitations than a randomized graph can represent. If authors really are working with colleagues at their own workplace, two collaborators of a single author are far more likely to connect with each other than in a randomized graph. Field of work also plays a big role, since authors look for collaborators within a much smaller set than the entire graph, and that set will most likely contain previous collaborators and their collaborators. As a result, more triangles form, increasing the clustering coefficient.

Another aspect of this idea is finding a new partner through a collaborator. An author looking to write another paper might ask an existing connection for a recommendation on whom to work with; a co-author on one paper can often suggest another author or co-worker in a similar field. This effect produces more triangles and rests on trust: a good experience with a co-author increases both the likelihood of working with them again and the likelihood of trusting their opinions about others. When looking for a partner, such factors negate the idea of randomization.

Aside from these non-random factors, there is also the matter of authors who collaborate on only one paper. These authors raise the clustering coefficient of the entire graph, because each of them has a clustering coefficient of one: all of their co-authors have also worked with each other. For example, suppose author A worked on only one paper in the data set, together with authors B, C, and D. There are three possible triangles containing A: A-B-C, A-B-D, and A-C-D. All three exist, because the same paper that connected A to each co-author also connected the co-authors to one another.

Table 2 shows the percentage of authors with only one paper, for the singleton data sets as well as the combined data sets, along with the percentage of authors with two papers for comparison's sake. As the table shows, the percentage of authors with only one paper to their name is remarkably high: the lowest value is 83% for a singleton year and 68% for the cumulative data set. Most of these authors add to the clustering coefficient, although a few did not work with enough collaborators to contribute. The percentage of authors with exactly two papers is also fairly high, and their clustering coefficients are likely to be high as well, further raising the overall coefficient.
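The author-A example above can be checked mechanically: a single paper with four authors creates a clique among them, so each author's local clustering coefficient is one. A small sketch:

```python
from itertools import combinations

def local_clustering(adj, node):
    """Fraction of a node's neighbor pairs that are directly connected."""
    neigh = adj[node]
    k = len(neigh)
    if k < 2:
        return 0.0
    closed = sum(1 for u, v in combinations(neigh, 2) if v in adj[u])
    return closed / (k * (k - 1) / 2)

# One paper with authors A, B, C, D connects every pair of authors,
# so A's neighborhood {B, C, D} is a clique: all three triangles
# (A-B-C, A-B-D, A-C-D) exist, and A's coefficient is 1.
authors = ["A", "B", "C", "D"]
adj = {a: {b for b in authors if b != a} for a in authors}
print(local_clustering(adj, "A"))  # 1.0
```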

Years Included | Authors | Authors with One Paper | % One Paper | Authors with Two Papers | % Two Papers
---------------|---------|------------------------|-------------|-------------------------|-------------
Separated      |         |                        |             |                         |
1990           |    3331 |                   2991 |       89.79 |                     262 |         7.87
1991           |    2567 |                   2362 |       92.01 |                     169 |         6.58
1992           |    3253 |                   2931 |       90.10 |                     258 |         7.93
1993           |    4264 |                   3819 |       89.56 |                     355 |         8.33
1994           |    5208 |                   4626 |       88.82 |                     449 |         8.62
1995           |    6223 |                   5473 |       87.95 |                     571 |         9.18
1996           |    7407 |                   6500 |       87.75 |                     684 |         9.23
1997           |    7117 |                   6255 |       87.89 |                     658 |         9.25
1998           |    8521 |                   7352 |       86.28 |                     905 |        10.62
1999           |   11325 |                   9596 |       84.73 |                    1253 |        11.06
2000           |   11343 |                   9707 |       85.58 |                    1218 |        10.74
2001           |   11350 |                   9645 |       84.98 |                    1287 |        11.34
2002           |   13451 |                  11405 |       84.79 |                    1457 |        10.83
2003           |   14505 |                  12061 |       83.15 |                    1741 |        12.00
2004           |   15783 |                  13179 |       83.50 |                    1825 |        11.56
2005           |   18260 |                  15211 |       83.30 |                    2097 |        11.48
2006           |   21991 |                  18307 |       83.25 |                    2586 |        11.76
2007           |   22918 |                  19198 |       83.77 |                    2607 |        11.38
2008           |   24754 |                  20618 |       83.29 |                    2862 |        11.56
2009           |   23796 |                  20112 |       84.52 |                    2638 |        11.09
2010           |   23827 |                  20414 |       85.68 |                    2398 |        10.06
Cumulative     |         |                        |             |                         |
1990           |    3331 |                   2991 |       89.79 |                     262 |         7.87
1990-1991      |    5505 |                   4751 |       86.30 |                     558 |        10.14
1990-1992      |    8038 |                   6662 |       82.88 |                     925 |        11.51
1990-1993      |   11262 |                   9081 |       80.63 |                    1372 |        12.18
1990-1994      |   15152 |                  11983 |       79.09 |                    1918 |        12.66
1990-1995      |   19523 |                  15078 |       77.23 |                    2619 |        13.41
1990-1996      |   24571 |                  18629 |       75.82 |                    3404 |        13.85
1990-1997      |   29131 |                  21712 |       74.53 |                    4136 |        14.20
1990-1998      |   34492 |                  25348 |       73.49 |                    4959 |        14.38
1990-1999      |   41952 |                  30464 |       72.62 |                    6125 |        14.60
1990-2000      |   48820 |                  34933 |       71.55 |                    7319 |        14.99
1990-2001      |   55382 |                  39227 |       70.83 |                    8308 |        15.00
1990-2002      |   63475 |                  44702 |       70.42 |                    9551 |        15.05
1990-2003      |   71743 |                  49927 |       69.59 |                   10910 |        15.21
1990-2004      |   80490 |                  55508 |       68.96 |                   12279 |        15.26
1990-2005      |   90700 |                  62079 |       68.44 |                   13872 |        15.29
1990-2006      |  103989 |                  71076 |       68.35 |                   15813 |        15.21
1990-2007      |  117364 |                  80054 |       68.21 |                   17830 |        15.19
1990-2008      |  131339 |                  89323 |       68.01 |                   19857 |        15.12
1990-2009      |  144356 |                  97875 |       67.80 |                   21937 |        15.20
1990-2010      |  157605 |                 106993 |       67.89 |                   23762 |        15.08

Table 2: Number and percentage of authors with one and with two papers, for the separated (singleton) and cumulative data sets.
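The one-paper and two-paper percentages reported in Table 2 can be computed from raw paper–author lists with a simple count. A minimal sketch on toy data (not the IEEE set; the function name is illustrative):

```python
from collections import Counter

def paper_count_stats(papers):
    """papers: list of author lists, one per paper.
    Returns (% of authors with one paper, % of authors with two papers)."""
    counts = Counter(a for paper in papers for a in set(paper))
    n = len(counts)
    one = sum(1 for c in counts.values() if c == 1)
    two = sum(1 for c in counts.values() if c == 2)
    return 100 * one / n, 100 * two / n

# Toy example: three papers over four authors (A: 3 papers, B: 2, C: 1, D: 1).
papers = [["A", "B", "C"], ["A", "B"], ["A", "D"]]
print(paper_count_stats(papers))  # (50.0, 25.0)
```

On the real data, `papers` would be built from the IEEE records; the same tally then yields each row of the table.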

VII. CONCLUSIONS
The IEEE data set shows clear distinctions from a random data set of this size. The degree distributions of this graph bear almost no resemblance to the projected degree distributions: there are far more nodes with small degrees than the randomized graph would predict, since the randomized graph depicts a situation in which most nodes have a higher number of connections. The reality is that circumstances arise to limit the number of authors an author can work with. The clustering coefficient also demonstrates the effects of non-randomness on the graph. The significantly higher values of the coefficient for the actual graph, compared with the anticipated values for the randomized graph, show that social networking is clearly occurring in this data set: there must be some reasoning behind the choice of co-authors beyond chance. A large part of this is that most authors added to the data set have only a single paper. This aspect does not occur in the randomized graph and changes the results dramatically. There are several reasons for the data to deviate from the estimated graphs, most of which lie in the fact that choosing a co-author is not random, while the connections in the random graph have no constraints besides chance. Co-authors are chosen for a variety of reasons. Location is crucial to meeting potential co-authors, and can cause more triangles between authors. Papers also take time to write. This seems obvious, but it provides insight into why the degree distribution shows lower numbers of connections: an author will most likely not be able to write 20 papers in a year, which limits how many authors they can actually work with in a given year. The amount of effort it takes to form a connection in the IEEE data set is much greater than the effort required in the random data set.
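For context on the clustering comparison: if the randomized graph is taken to be an Erdős–Rényi graph with the same number of nodes n and edges m (an assumption; the paper does not name its exact null model), every pair of neighbours is connected independently with probability p, so the expected clustering coefficient is simply p = m / (n(n−1)/2). A minimal sketch with purely illustrative sizes, not figures from this data set:

```python
def expected_random_clustering(n, m):
    """Expected clustering coefficient of an Erdos-Renyi graph with
    n nodes and m edges: the edge probability p = m / C(n, 2)."""
    return m / (n * (n - 1) / 2)

# Illustrative sizes only, chosen for scale rather than taken from the paper:
n, m = 150_000, 600_000
print(expected_random_clustering(n, m))  # on the order of 1e-4 or smaller
```

For graphs of this size the expectation is tiny, which is why any appreciable measured clustering coefficient signals non-random, socially driven edge formation.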
In addition, choosing a random co-author can lead to very negative results (such as wasted time or a frustrating experience), whereas adding another edge in the random graph has no negative repercussions. The more surprising result was the change in the statistics for the singleton-years data set. Most of the changes follow from the change in the number of papers added; that statistic, however, seems to increase year by year with no obvious explanation. This might be due to factors that cannot be represented on a graph, such as the general availability of resources, the increasing popularity of IEEE, the growth in the fields that IEEE caters to, and other broad trends. The increase in the number of authors added can be seen as a direct function of the increase in the number of papers added: since there is only a certain amount of work any one author can do in a year, more papers make it more likely that new authors are added. The IEEE data set shows clear trends as a function of time. The increase in the number of papers added does affect the other trends, such as the increase in the number of authors over time. This also limits the percentage of connections formed: it is much more likely that one author will work within a specific group of collaborators than explore the graph and work with as many people as possible. That idea leads to the high clustering coefficient and the skewed degree distribution. All these factors are inter-related, and stem from the nature of writing a paper as a time-consuming task that takes much more consideration than a random connection.

VIII. FUTURE WORK
The work done in this paper is a stepping stone to many other possibilities. Analyzing the graph in terms of structure and its relation to time provides a platform for discovering other questions, along with some of the data needed to explore them.
The next step would be to cluster the authors and determine how the graph is actually organized. The high clustering coefficient implies that there will be many very small clusters rather than a few large ones. The clusters found would depend on the algorithms and conditions used; a seemingly trivial cluster of only a few people might be very meaningful in this data set. The clusters would be based on strong bonds, or common neighbors, and the data and algorithms would have to be implemented and tuned to determine the specifics. After determining the clusters, it would be interesting to look at the physical locations and affiliations of the authors in each cluster. It is possible that most clusters form within a single affiliated university or other research establishment, though confirming this would require a great deal more information. The results might show which factors authors actually take into account when writing a paper and choosing a co-author. This data set has many more possibilities, but these two directions would provide very interesting information, especially if structured in the correct manner.
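One way the proposed clustering on "strong bonds, or common neighbors" could be prototyped: call an edge strong when its endpoints share at least some minimum number of co-authors, then take connected components of the strong-edge subgraph. A hypothetical sketch; the threshold, function names, and toy data are illustrative, not from the paper:

```python
def strong_edge(adj, u, v, min_common=1):
    """Hypothetical 'strong bond': u and v share at least min_common co-authors."""
    return len(adj[u] & adj[v]) >= min_common

def clusters_by_strong_bonds(adj, min_common=1):
    """Connected components of the subgraph restricted to strong edges."""
    seen, out = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in comp and strong_edge(adj, u, v, min_common):
                    comp.add(v)
                    stack.append(v)
        seen |= comp
        out.append(comp)
    return out

# Two triangles joined by a single weak bridge C-D.
adj = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
print(sorted(sorted(c) for c in clusters_by_strong_bonds(adj)))
# [['A', 'B', 'C'], ['D', 'E', 'F']] -- the bridge has no common neighbour, so it does not merge the groups
```

This matches the intuition in the text: triangles (strong bonds) hold small groups together, while lone bridges between groups are discarded, yielding many small clusters.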