Content-based Modeling and Prediction of Information Dissemination


Kathy Macropol
Department of Computer Science
University of California
Santa Barbara, CA 93106 USA

Ambuj Singh
Department of Computer Science
University of California
Santa Barbara, CA 93106 USA

Abstract

Social and communication networks across the world generate vast amounts of graph-like data each day. The modeling and prediction of how these communication structures evolve can be highly useful for many applications. Previous research in this area has focused largely on using past graph structure to predict future links. However, a useful observation is that many graph datasets have additional information associated with them beyond just their graph structure. In particular, communication graphs (such as email, Twitter, blog graphs, etc.) have information content associated with their graph edges. In this paper we examine the link between information content and graph structure, proposing a new graph modeling approach, GC-Model, which combines both. We then apply this model to multiple real-world communication graphs, demonstrating that the built models can be used effectively to predict future graph structure and information flow. On average, GC-Model's top predictions covered 19% more of the actual future graph communication structure when compared to other previously introduced algorithms, far outperforming multiple link prediction methods and several naive approaches.

I. INTRODUCTION

The mining and analysis of graph datasets has proven to be both useful and important for a large variety of applications, from social community detection to biological function prediction [1]. Graph representations can naturally capture the structure and interactions present in many types of data, and are therefore commonly used to represent a wide variety of datasets.
As a large portion of graph datasets change over time, one important area of research is the modeling and prediction of how these networks will evolve and how information will flow over them. By building and analyzing models of these changes, many important applications arise, including prediction of future links, anomaly detection, and expert finding [2], [3]. Previous research in the area of link prediction has focused largely on using past graph structure to predict future links. Models based on graph measures such as node neighborhoods, path information (random walk and flow-based distances), and in and out degrees have been used successfully to predict the formation of future links. However, a useful observation is that many graph datasets have additional information associated with them beyond just their graph structure. In particular, communication graphs (such as email, Twitter, blog graphs, etc.) are a rich and important source of network data, and have a large amount of information content associated with them. In a communication graph, each node is associated with a unique user, and edges are made through messages sent between users. In this way, a communication graph is a directed multigraph, with each edge having an associated timestamp and message text. Another important aspect of communication graphs is the fact that graph edges may be clustered into message threads. Each thread consists of an originating message, along with all ensuing replies and forwards initiated by the originating message. In this way, a thread's graph structure describes the structure of information flow its users followed during communication. In addition, by collecting graph edges into threads, we obtain a dataset which consists of numerous smaller graphs and their associated text.
While previous research has mainly focused on analyzing and modeling time-evolving graphs as one large graph, the additional information associated with communication graphs allows us to approach the problem instead as a modeling problem over multiple smaller graphs. In this paper, we introduce a new method, GC-Model, which takes advantage of additional properties and attributes contained within communication graphs. This model can be applied to a dataset consisting of numerous graph threads, and combines both graph and content information to predict future graph communication structure. In addition, starting information on graph structure (for example, an originating user or message link) can be input, allowing more specific targeted models with increased accuracy to be built. We implement GC-Model and test it on multiple real-world communication graphs, finding that its performance dominates both naive methods of link prediction which rely on graph structure or content information alone, as well as other existing link prediction algorithms. On average, we find that GC-Model is able to cover over 80% of a test thread's actual communication structure using only its top 20 predictions. In addition, GC-Model performs especially well on its highest predicted links, with its top 5 predictions covering 19% more graph structure on average than that covered by the other compared methods. The rest of this paper is divided as follows. In Section II, previous research and several methods related to our work are overviewed. Next, in Section III we describe and introduce

our new method, GC-Model. In Section IV we implement GC-Model on multiple communication graphs, analyzing its predictive results and comparing it with multiple previously introduced link prediction methods, as well as two naive approaches. Finally, Section V overviews our results and contributions.

II. RELATED WORK

There has been much focus recently on the link prediction problem in graphs. In general, link prediction concentrates on the analysis of a single large graph which contains links appearing and disappearing across time. The goal is to predict what the overall graph structure will be (what links will be added or deleted) at a later time step. Multiple papers have shown that even simple analysis of basic statistical graph properties can be successful for accurate link predictions in a future network. For the majority of these approaches, a single score is calculated between every pair of nodes, and the resulting node pairs with the highest scores are used as predictions for future links. Examples of these scoring methods include counting the common neighbors between two nodes (|N(a) ∩ N(b)|, where N(a) is the neighborhood set of node a) and calculating the Jaccard coefficient between the two nodes' neighborhood sets (|N(a) ∩ N(b)| / |N(a) ∪ N(b)|). Other related methods include modeling the graph using preferential attachment, which finds a score (|N(a)| · |N(b)|) favoring nodes of high degree [4], [5]. In the Adamic/Adar measure (Σ_{i ∈ N(a) ∩ N(b)} 1 / log |N(i)|), a score is calculated between two nodes based on their number of common neighbors, with each neighbor weighted by its importance (the rarity of links in its neighbor set) [6]. Another recently introduced method of unsupervised link prediction is PropFlow, which looks at random walk probabilities to make predictions [7].
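To make these classic scoring functions concrete, the four measures can be sketched in a few lines of Python (the toy graph and function names are ours, for illustration only):

```python
import math

def common_neighbors(N, a, b):
    # |N(a) ∩ N(b)|
    return len(N[a] & N[b])

def jaccard(N, a, b):
    # |N(a) ∩ N(b)| / |N(a) ∪ N(b)|
    union = N[a] | N[b]
    return len(N[a] & N[b]) / len(union) if union else 0.0

def preferential_attachment(N, a, b):
    # |N(a)| * |N(b)|: favors pairs of high-degree nodes
    return len(N[a]) * len(N[b])

def adamic_adar(N, a, b):
    # sum over common neighbors i of 1 / log |N(i)|
    # (degree-1 neighbors are skipped to avoid log(1) = 0)
    return sum(1.0 / math.log(len(N[i]))
               for i in N[a] & N[b] if len(N[i]) > 1)

# Toy undirected graph given as adjacency sets.
N = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}
```

In all four cases, the score is computed for every node pair and the highest-scoring pairs become the predicted links.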
Besides these simple unsupervised methods of link prediction based on graph structure, a supervised learning method was recently introduced, which uses an ensemble of the above unsupervised methods, combined with supervised learning, to make predictions on link formation in graphs [7]. All the previously mentioned methods of link prediction focus heavily on the prediction of new link formations between nodes. By calculating a score between every pair of nodes in the graph, previously unseen links are likely to be predicted. However, these methods of prediction ignore both the timing and the amount of communication previously seen between two nodes, and are not as appropriate in communication graphs, where the vast majority of links are repeats. In addition, these previously introduced link prediction methods look only at graph structure while making link predictions. Beyond graph structure, however, many large evolving graph datasets contain other features associated with their nodes and links. Current work has shown that augmenting these simple graph models with the information contained within the additional features allows for increased predictive power as well. The inclusion of family circle information has been shown to increase social network link prediction accuracy by as much as 3% [8], and aggregating features such as paper categories and neighbor counts has been used for link prediction in publication networks as well [9]. Recent work has also looked at learning random walk edge weights based on social network attributes [10].

[Fig. 1. Example graph and text word documents created from three threads, using a graph word length of l = 2. Thread 1 graph words: (A→B, A→B), (A→B, A→C), (A→B, A→C); text words: No, Meeting, This, Tuesday, Canceled, Meeting, Tomorrow. Thread 2 graph words: (A→C, A→C); text words: Need, This, Tax, Info, Tomorrow. Thread 3 graph words: (A→C, C→A), (A→C, A→C), (A→C, C→A); text words: Tax, Info, Sent, Already.]
Besides these categorical features, a rich source of feature information is contained within document and communication networks. In these cases, each node or link in the graph has a text document associated with it. One recent work focuses on document link prediction by building a relational topic model (RTM) between document texts [11]. In this case, each node is associated with a text document, and LDA is used to discover the document topic probabilities, with the likelihood of two text documents containing a link between them a function of their topic similarities. This RTM approach, however, focuses on graphs containing one single static document per node, and allows only for the prediction of new links between a given new node and the rest of the graph, rather than predicting a larger, flowing thread structure. In this work, we introduce GC-Model, a new method which extends these previous ideas by building a model for communication threads using graph and text content, the amount of communication between nodes, and message timing information. The built models can be customized by inputting starting information on the graph structure, and allow for the prediction of future thread structure.

III. METHOD

In GC-Model, a single, large communication graph is split into numerous smaller communication threads. Each thread is a non-overlapping set of messages sent between users from a single message flow. By partitioning the larger graph into these separate thread structures, we model separate communication flows based on subject and time. This partitioning is closer to real life, where communication often happens in multiple separate flows, rather than being part of a single giant structure. Three example threads are shown at the top of Figure 1, with Thread 1 consisting of three messages, two sent from A to B and one sent from A to C, Thread 2 consisting of two

messages, both from A to C, etc.

Fig. 2. Algorithm: Build GC-Model

  Algorithm: GC-Model
  Require: Training threads dataset, num topics t, input graph structure I, num sampling iterations sample, subgraph length l.
  Ensure: Topic-word probabilities P[word][topic].
    S[] ← threads in dataset matching I
    gWords[] ← []
    for thread in S[] do
      create graph words of length l for thread
      add graph words to gWords
    sort S[] by first link's start time
    tProb[][] ← [][]
    for thread in S[] do
      stem and remove stopwords in text for thread
      use LDA to find topic distribution on thread
      store tProb[thread][topic]
    curve ← findCurve(S[])
    prTopic[] ← findPrTopic(t, S[], tProb[][], curve)
    prWord[] ← findPrWord(gWords, S[], curve)
    prTopicWord[][] ← findPrTopicWord(gWords, S[], t, sample)
    P[][] ← [][]
    for word in gWords[] do
      P[word][topic] ← prTopicWord[word][topic] · prWord[word] / prTopic[topic]
    return P[][]

Once these threads are created, the main idea behind GC-Model is that each communication thread can be thought of as two different types of documents: a text document and a graph document. Both document types consist of a grouping of words. The text document consists of all text words contained within the thread messages. The graph document consists of all possible connected subgraphs, of length l, that may be created from the thread's graph structure. Each of the possible connected subgraphs composes one word of the graph document. The bottom half of Figure 1 shows examples of these text and graph documents (where l = 2) for three different sample threads. From Figure 1, the text document for Thread 1 consists of all the text words contained in the three messages sent in that thread. The graph document consists of all connected subgraphs of size 2 created from Thread 1's graph structure. Similar text and graph documents are created for Threads 2 and 3.

Fig. 3. Algorithm: Find Pr[Topic]

  Algorithm: findPrTopic
  Require: Num topics t, filtered dataset S, thread LDA topic probs tProb[][], curve fit for Pr[Thread] curve.
  Ensure: Array of prior topic probabilities, prTopic.
    prTopic[] ← []
    for i = 1 to S.size do
      for each topic do
        prTopic[topic] += tProb[i][topic] · curve(testThr.time, S[i].time)
    return prTopic

For both types of documents, GC-Model makes a mixture-of-topics assumption, where every document is assumed to be associated with a distribution over a set of topics. These topics, in turn, are a distribution over a vocabulary of words. Each topic, therefore, is associated with a distribution over two separate vocabularies: one for text words and one for graph words. In both cases, the probability of a word appearing in its respective document type for that thread is:

  P(word) = Σ_{i=1..t} Pr[word | T_i] · Q[T_i]    (1)

where T_i represents the i-th topic, and Q[T_i] represents the distribution probability of T_i in this document. For a thread, the graph structure is therefore contained in its graph document words. If we are able to build a model which predicts, given its topic distribution, what graph words will likely appear in a thread, then we can predict which links will appear in this thread in the future. This idea is the basis behind our model. Various methods of modeling documents composed of a series of words have been introduced in past research. One method that has gained widespread use is Latent Dirichlet Allocation (LDA) [12]. Previous studies have shown LDA modeling and its variants to be effective at discovering topics and clustering documents contained in communication messages [13], [14]. LDA is a generative probabilistic model where every document is assumed to be associated with a mixture of topics, where a topic is a distribution over a vocabulary of words. During the generative process, each document draws its topic proportions from a Dirichlet distribution.
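As an illustrative sketch of the graph-document construction (the function names and edge representation are ours), the graph words of a thread can be enumerated by testing every combination of l edges for connectivity; the example below reproduces the Thread 1 graph document from Figure 1:

```python
from itertools import combinations

def is_connected(chosen):
    # Weak connectivity over the nodes touched by the chosen edges.
    nodes = {n for e in chosen for n in e}
    seen = {next(iter(nodes))}
    grew = True
    while grew:
        grew = False
        for u, v in chosen:
            if (u in seen) != (v in seen):  # edge bridges into the seen set
                seen |= {u, v}
                grew = True
    return seen == nodes

def graph_words(edges, l):
    """Enumerate all connected subgraphs of l edges in a thread's
    directed multigraph; each such subgraph is one graph word.
    Edges are (sender, receiver) pairs in message order."""
    words = []
    for combo in combinations(range(len(edges)), l):
        chosen = [edges[i] for i in combo]
        if is_connected(chosen):
            words.append(tuple(chosen))
    return words

# Thread 1 from Fig. 1: two messages A->B and one message A->C.
thread1 = [("A", "B"), ("A", "B"), ("A", "C")]
```

Running `graph_words(thread1, 2)` yields three graph words: (A→B, A→B) once and (A→B, A→C) twice, matching the Thread 1 document in Figure 1.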
Next, words in a document are generated by first randomly choosing a topic, weighted by the topic proportions, and then choosing a word from the corresponding topic distribution. Multiple different techniques have been proposed to infer these topic and word distributions from a collection of documents. In this paper we use Gibbs sampling to infer the various LDA distributions, using the publicly available JGibbLDA

Fig. 4. Algorithm: Find Pr[Word]

  Algorithm: findPrWord
  Require: List of graph words words, filtered dataset S, curve fit for Pr[Thread] curve.
  Ensure: Array of prior word probabilities, prWord.
    prWord[] ← []
    for word = 1 to words.size do
      for i = 1 to S.size do
        wordCount ← 0
        for currentWord in S[i].graphWords do
          if currentWord = word then
            wordCount++
        wordProb ← wordCount / |S[i].graphWords|
        prWord[word] += wordProb · curve(testThr.time, S[i].time)
    return prWord

implementation. To obtain the probability of graph words given topic, Pr[word | T_i], and therefore find Pr[word] from Equation 1, GC-Model follows several steps (pseudocode for the overall GC-Model algorithm is shown in Figure 2). First, given a starting portion of the graph, all threads matching the starting condition are collected for use in training, resulting in a training set of size n. Applying LDA inference to the training threads' text word documents gives the associated topic probabilities for each thread. In addition, from Bayes' theorem, we obtain Equation 2:

  Pr[word_w | T_i] = Pr[T_i | word_w] · Pr[word_w] / Pr[T_i]    (2)

where word_w represents the w-th word in the vocabulary. This means that to calculate Pr[word_w | T_i], we must find three values: Pr[T_i | word_w], Pr[word_w], and Pr[T_i]. The last value, Pr[T_i], represents the prior probability of a topic appearing for any thread. It may be found, as shown in Equation 3, by summing the product of each thread's topic probability, obtained from the LDA inference, with the prior probability of that thread appearing:

  Pr[T_i] = Σ_{j=1..n} Pr[T_i | Thread_j] · Pr[Thread_j]    (3)

The prior probability of a thread's likelihood, Pr[Thread_j] from Equation 3, is estimated using a simple exponential curve fitted to the thread's start time. Details on how this value is found are given in subsection III-A. Pseudocode for finding Pr[Topic] is shown in Figure 3.
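A minimal sketch of Equation 3 (variable names are ours; for simplicity the thread priors below are taken as uniform, rather than the recency-weighted values GC-Model actually uses):

```python
def topic_priors(thread_topic_probs, thread_priors):
    """Eq. (3): Pr[T_i] = sum_j Pr[T_i | Thread_j] * Pr[Thread_j].
    thread_topic_probs[j][i] is the LDA topic distribution of thread j;
    thread_priors[j] is the prior probability of thread j."""
    t = len(thread_topic_probs[0])
    pr_topic = [0.0] * t
    for probs, prior in zip(thread_topic_probs, thread_priors):
        for i in range(t):
            pr_topic[i] += probs[i] * prior
    return pr_topic

# Two training threads, two topics; uniform priors Pr[Thread_j] = 1/n.
tprob = [[0.8, 0.2], [0.4, 0.6]]
priors = [0.5, 0.5]
```

With these inputs the topic priors come out to (0.6, 0.4); substituting the exponential-curve weights for `priors` gives the recency-weighted version used in the paper.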
The second value from Equation 2, Pr[word_w], may be found similarly to Pr[Thread_j], as shown in Equation 4:

  Pr[word_w] = Σ_{j=1..n} Pr[word_w | Thread_j] · Pr[Thread_j]    (4)

Fig. 5. Algorithm: Find Pr[Topic | Word]

  Algorithm: findPrTopicWord
  Require: List of graph words words, filtered dataset S, num topics t, curve fit for Pr[Thread] curve, sampling iterations sample, LDA topic probs tProb[][].
  Ensure: Array of prior probabilities of topics given a word, prTopicWord.
    prTopicWord[][] ← [][]
    for word = 1 to words.size do
      U ← {}
      for i = 1 to S.size do
        if S[i].graphWords contains word then
          add S[i] to U
      for i = 1 to sample do
        topicCount[] ← []
        for thread in U do
          randomly pick a topic, m, weighted according to tProb[thread][topic]
          topicCount[m]++
        for j = 1 to t do
          prTopicWord[word][j] += topicCount[j] / |U|
      for i = 1 to t do
        prTopicWord[word][i] /= sample
    return prTopicWord

In this case, the prior probability of finding a document word in a thread, Pr[word_w | Thread_j], can be calculated by finding its probability of occurrence in the graph document for that thread, as shown in Equation 5:

  Pr[word_w | Thread_j] = (count of word_w in Thread_j) / |words(Thread_j)|    (5)

where the count of word_w in Thread_j represents the number of times word_w occurs in Thread_j's graph document, and words(Thread_j) is the set of graph words contained in Thread_j's graph document. As an example, Thread 1 from Figure 1 contains the graph word (A→B, A→B) once out of a total of three graph words, and therefore has an associated probability Pr[(A→B, A→B) | Thread 1] = 1/3. The next graph word in Thread 1, (A→B, A→C), occurs twice out

Fig. 6. Algorithm: Fit Curve for Pr[Thread]

  Algorithm: findCurve
  Require: Filtered dataset S.
  Ensure: Curve used to find Pr[Thread] given time, curve.
    sumOfFitCurves ← 0; total ← 0
    for j = 1 to S.size do
      currentThread ← S[j]
      timeProb ← {}
      for k = j − 1 down to 1 do
        Time ← currentThread.time − S[k].time
        Train ← {}
        for m = k to j − 1 do
          add S[m] to Train
        Prob ← 1; set ← false
        for link l in currentThread do
          if Train contains l then
            Prob ← Prob · (number of l contained in Train) / (number of links in Train)
            set ← true
        if set is true then
          add (Time, Prob) to timeProb
          total ← total + 1
      fit exponential curve c to timeProb
      sumOfFitCurves += c
    curve ← sumOfFitCurves / total
    return curve

of three graph words, and therefore has a probability of 2/3. The pseudocode for finding Pr[Word] is shown in Figure 4. The third and final value necessary to calculate the probability in Equation 2, Pr[T_i | word_w], is estimated using sampling. By collecting and analyzing the set of threads containing word_w, the associated topic probabilities may be found. To do this, the set U = {Thread_j | word_w ∈ words(Thread_j)} is first created, containing every thread in the training set whose graph document contains word_w. Next, a sampled value for the probability of every topic is found. This is done by repeatedly choosing a random topic, weighted by the inferred topic distribution, for each thread in U. The final sampled probability for each topic is calculated as its average probability of being chosen during this sampling process. Pseudocode for finding this prior is shown in Figure 5. By combining the three previously found values according to Equation 2, we can calculate the probability, given the thread's topic distribution, of any graph word appearing. These probabilities therefore allow us to model the connection
Fig. 7. Algorithm: GeneratePredictions

  Algorithm: GeneratePredictions
  Require: Text for thread test, training threads dataset, num topics t, input graph structure I, num sampling iterations sample, subgraph length l.
  Ensure: Sorted array of link predictions, predict[].
    tProb[] ← []
    stem and remove stopwords in test
    use LDA to find topic distribution on test
    store tProb[topic]
    P[word][topic] ← GC-Model(dataset, t, I, sample, l)
    predict[] ← []
    for w = 1 to P.length do
      for each topic do
        predict[w] += P[w][topic] · tProb[topic]
    return predict

between a thread's graph structure and its communication text content, and can be used on new test threads to predict which graph words and communication structures are most likely to appear in that graph. Given a new thread's text content, applying LDA inference on its text document yields the list of topic probabilities for this test thread. Using Equation 1 to combine this topic distribution with the word probabilities calculated in Equation 2, the probability of any graph word appearing in this test thread may be found. The graph words with the highest predicted probabilities are GC-Model's top predictions for the future graph structure of this test thread. The pseudocode for this overall prediction algorithm is shown in Figure 7.

A. Thread Probability

It is possible to model the likelihood of a thread, Pr[Thread_j], from Equations 3 and 4 using the assumption that every training thread is equally likely, obtaining a constant value of Pr[Thread_j] = 1/n. However, in actuality, the probability of a past thread's structure appearing again is often related to how recently it occurred. For example, people messaged recently by a user are more likely to be contacted again in the near future than those not contacted for several years. To account for this, GC-Model fits a simple relation between probability and time.
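A minimal sketch of this kind of exponential probability-versus-time fit (a log-linear least-squares fit; the equi-depth binning step is omitted, and the helper name and synthetic data are ours):

```python
import math

def fit_exp_curve(time_prob_pairs):
    """Fit p ≈ a * exp(b * t) by ordinary least squares on log p,
    returning the fitted curve as a function of the time gap t."""
    ts = [t for t, _ in time_prob_pairs]
    ys = [math.log(p) for _, p in time_prob_pairs]
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    # Slope and intercept of the log-linear regression line.
    b = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys)) / \
        sum((t - mean_t) ** 2 for t in ts)
    a = math.exp(mean_y - b * mean_t)
    return lambda t: a * math.exp(b * t)

# Synthetic (time gap, reoccurrence probability) pairs: links in
# recent threads reoccur with higher probability.
pairs = [(1, 0.8), (2, 0.64), (3, 0.512), (4, 0.4096)]
curve = fit_exp_curve(pairs)
```

The fitted `curve` can then be evaluated at the gap between a test thread's time and a training thread's time to weight that training thread's contribution.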
The probability of each thread is approximated by the likelihood of its graph links reoccurring in another thread. Between every pair of training threads, their distance in time (the difference between the timestamps of their first messages) and their percentage of link overlap is calculated. These probability / time pairs are then sorted by their timestamp distances and binned into equi-depth bins, averaging the values in each bin. Finally, a simple exponential curve is fit to this series, allowing for the

connection between thread time and probability. Pseudocode for this algorithm is shown in Figure 6.

[Fig. 8. Varying graph word length (l = 1, 2, 3). The parameter affects scores only slightly.]

[Fig. 9. Varying number of LDA topics (1, 2, 4, 6, 8, 10, and 12 topics). Increasing topics raises the F1 score until a steady state is reached, where further adjustment has little effect.]

IV. EXPERIMENTS

In this section, we report our results from implementing and testing the GC-Model approach introduced above. To confirm the effectiveness of our modeling and prediction strategy, GC-Model was implemented in Java and tested on multiple real-world communication networks. We compare our performance to the recently introduced link prediction method PropFlow [7], as well as Common Neighbors, Preferential Attachment, the Jaccard coefficient, Adamic/Adar, and both a graph-based and a text-based naive method. For all approaches, the training graph dataset was created by collecting only those threads which matched the given starting input. The graph-based naive method used here predicts links based on frequency, where the highest scoring predictions are the links which appear most frequently. The text-based naive method predicts links based on text distances, where the top scoring predictions are the links whose communication text (stemmed, with stopwords removed) has the closest TF-IDF cosine similarity to the test threads.

A. Networks

Three separate communication graphs are used in this paper. The first is a crawled Twitter dataset, consisting of ~1M nodes (unique Twitter users) and ~2M edges (tweets). The total graph dataset is grouped into threads using each tweet's exact reply-to and retweet information. Keeping only those threads containing 3 or more links, a total of ~6k threads were obtained from this network. The second dataset is the Enron email dataset, consisting of over 500k links (emails) between 37k nodes (unique email addresses).
To separate the Enron graph into approximate threads, the subject lines and timestamps were used. Emails sent close in time (within one month of each other) were grouped according to exact subject line matches, ignoring prepended "Re" and "Fwd" text. The ten largest threads were removed, due to the general nature of their subject lines, as were broadcast emails sent to more than 20 people. Only threads containing at least 3 links and unique users were kept, leaving a total of ~5k threads. The last dataset is the SNAP Twitter dataset [15], consisting of ~17M nodes and ~476M edges. Approximate threads were formed through the retweet text information contained in the links. Threads were created by doing exact text matches between the tweet text, the RT retweet information, and the user ids contained in the messages, resulting in ~4k threads.

B. Parameters and Setup

In our implementation and experiments on GC-Model, the graph word length parameter l was found to influence results only slightly, and was set to 2 for all tests reported here, unless otherwise noted. An example of the results obtained by varying l from 1 to 3 on the Enron dataset is shown in Figure 8, which plots the F1 score, defined as:

  F1 = 2 · (precision · recall) / (precision + recall)

As can be seen, the resulting score curves are mostly clustered and similar. Comparable results were found in the other networks. In addition, a sampling size of 1 to find Pr[T_i | word_w], as discussed in Section III, was discovered to give consistent estimates, and is the value used in this paper. Finally, 8 LDA topics were found to give high-quality results for all tested communication graphs, and this is the value used for the displayed results here. Figure 9 shows the results of varying the number of topics from 1 to 12 on the Enron dataset.
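The F1 evaluation over a top-k prediction set can be sketched as follows (function names and the toy prediction list are ours):

```python
def f1_at_k(predictions, actual_links, k):
    """F1 score of the top-k predicted directed links against the
    actual set of directed links in a test thread."""
    top_k = set(predictions[:k])
    hits = len(top_k & actual_links)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(actual_links)
    return 2 * precision * recall / (precision + recall)

# Toy ranked predictions (highest probability first) and ground truth.
predicted = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
actual = {("A", "B"), ("C", "A")}
```

For the toy data, the top-2 predictions give precision 0.5 and recall 0.5 (F1 = 0.5), while taking all four predictions raises recall to 1.0 at the cost of precision.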
As can be seen from the figure, result scores increase steadily as the number of topics is increased, until reaching a stable state where further increases in topic numbers affect results only slightly. Similar results were found for the other datasets. To validate the effectiveness of our technique at building models of communication graphs, a series of models was built and evaluated for each network. This was done twice, with the results reported in Figures 10 and 11. For the results shown in Figure 10, the starting information given on graph structure was the beginning user for the thread. In Figure 11, the starting information was the beginning graph link. For both these types, the starting graph information was obtained by pulling the top 20% of starting users or graph links with the highest frequencies in the dataset. This was done to allow for

[Fig. 10. Results for predicting directed links given the starting user, on (a) the Enron dataset, (b) the crawled Twitter dataset, and (c) the SNAP Twitter dataset. GC-Model consistently returns higher scores than the other previously introduced algorithms.]

[Fig. 11. Results for predicting directed links given the starting link, on (a) the Enron dataset, (b) the crawled Twitter dataset, and (c) the SNAP Twitter dataset. Again, GC-Model's results dominate the other methods, especially among the top predictions.]

sufficient dataset sizes in training and testing. For each model, the threads matching the given starting information were found and sorted according to the timestamp of the first message sent, and the ten latest threads were separated to form the testing set. The remaining graph threads were used for training and building the model. After building the models, the most likely predicted graph links were found for each test thread using the method shown in Section III. To evaluate these predictions, the F1 score was again used, as it allows for the combination of both recall and precision, favoring results with a good balance of both. The F1 score was calculated repeatedly for every test thread and model, using a series of top-k predictions, with k ranging from 1 to 20. For each value of k, the recall and precision of the top-k prediction set were found against the actual set of directed links, and used to find its F1 score. To allow for fair prediction and scoring, only those links in the testing set occurring at least once in the training set were used in these result calculations. Finally, the series of F1 scores found across the test threads for each model were averaged to obtain the final results.

C. Prediction Results

The results of applying GC-Model, as well as the other link prediction and naive graph and text methods, can be seen in Figures 10 and 11.
Figure 10 shows the plot of F1 score versus top-k value, on each network, for the various methods. As can be seen, GC-Model consistently outperforms the other methods, reporting higher results across all the graphs. GC-Model also performs especially well on those earliest predictions assigned the highest probability, obtaining F1 scores up to 45% higher than the next closest method. The graph and text naive methods, as well as the recently introduced PropFlow link prediction method, score consistently lower on the highest predictions, but report similar scores eventually as the k value is increased. This helps to confirm the effectiveness of GC-Model's method in the modeling and use of both text and graph structure information for prediction, as GC-Model needs fewer predictions to cover more of the test threads. The much lower prediction results of the remaining previously introduced link prediction methods emphasize their use for the prediction of new, rather than repeating, links. These trends are echoed in Figure 11, where the starting link, rather than the starting user, is given. GC-Model again returns much stronger F1 scores, dominating the results across the graphs. The addition of the extra starting link information, used to prune the training sets for all the reported methods, can be seen to increase the strength of the models and predictions as well. In addition, Tables I and II report the percentage of directed links in the test graphs covered by the top-k predictions. Only the graph and text naive methods and the PropFlow method are used for comparison here, as the other link prediction methods generally had lower values. As can be seen from Table I, taking just the top 10 GC-Model link predictions covers over 70% of the actual thread structure in the Enron dataset, and the top 20 cover over 80%.
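The coverage numbers reported in the tables can be computed with a sketch like the following (function names and toy data are ours):

```python
def coverage_at_k(predictions, actual_links, k):
    """Percent of a test thread's actual directed links that appear
    among the top-k predicted links."""
    covered = set(predictions[:k]) & actual_links
    return 100.0 * len(covered) / len(actual_links)

# Toy ranked predictions (highest probability first) and ground truth.
predicted = [("A", "B"), ("A", "C"), ("C", "A")]
actual = {("A", "B"), ("C", "A")}
```

Averaging this quantity over all test threads, for each k, produces entries of the kind shown in Tables I and II.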
GC-Model's strong results again emphasize its ability to build communication thread models that are both effective and useful for information dissemination prediction, and confirm the usefulness of combining content and graph structure information.

TABLE I
GIVEN STARTING USER, PERCENT OF GRAPH COVERED BY PREDICTIONS

Enron (% graph covered):
  Predictions  GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5        43%       38%          37%         23%
  Top 10       62%       54%          53%         35%
  Top 20       73%       66%          67%         47%
  Top 50       86%       85%          85%         55%

Crawled Twitter (% graph covered):
  Predictions  GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5        54%       38%          35%         33%
  Top 10       63%       46%          43%         38%
  Top 20       74%       55%          53%         41%
  Top 50       89%       65%          68%         45%

SNAP Twitter (% graph covered):
  Predictions  GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5        74%       41%          18%         18%
  Top 10       81%       45%          20%         20%
  Top 20       86%       48%          22%         22%
  Top 50       93%       53%          25%         24%

TABLE II
GIVEN STARTING LINK, PERCENT OF GRAPH COVERED BY PREDICTIONS

Enron (% graph covered):
  Predictions  GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5        62%       53%          45%         49%
  Top 10       78%       65%          56%         51%
  Top 20       87%       71%          71%         59%
  Top 50       91%       83%          80%         64%

Crawled Twitter (% graph covered):
  Predictions  GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5        57%       40%          37%         34%
  Top 10       66%       47%          45%         39%
  Top 20       75%       55%          54%         43%
  Top 50       89%       66%          68%         47%

SNAP Twitter (% graph covered):
  Predictions  GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5        77%       45%          18%         19%
  Top 10       83%       49%          21%         21%
  Top 20       89%       53%          23%         22%
  Top 50       96%       60%          27%         26%

V. CONCLUSION

The modeling and prediction of graph structure in graph datasets has been both an interesting and important area of research in recent years. In particular, with the growth of social networks and services such as Twitter, communication graphs have been an especially fast growing source of data and information. While numerous link prediction methods have been introduced, most focusing on predicting new link formation using only graph structure, an important insight is that communication graphs have additional text and information content associated with them. In this paper, we introduced a new method, GC-Model, which utilizes the extra content information included in communication graphs to build graph models that may be used to predict future network structure and information flow.
When implemented and tested on three real-world communication networks, GC-Model predicted with far greater accuracy and coverage than multiple previously introduced link prediction and naive methods. GC-Model's returned scores dominate the comparison sets, with its top 10 predictions covering on average up to 22% more of the actual communication graph structure than the compared methods. These results emphasize the importance and strength of including both information content and graph structure in the modeling of communication networks, as well as the effectiveness of GC-Model's modeling scheme for the prediction of future information flow structure.

ACKNOWLEDGMENT

Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

REFERENCES

[1] K. Macropol and A. Singh, Scalable discovery of best clusters on large graphs, Proc. VLDB Endow., vol. 3, September 2010.
[2] D. Liben-Nowell and J. Kleinberg, The link-prediction problem for social networks, J. of ASIST, vol. 58, 2007.
[3] Z. Huang and D. Zeng, A link prediction approach to anomalous email detection, in SMC, 2006.
[4] M. E. J. Newman, Clustering and preferential attachment in growing networks, Phys. Rev. E, vol. 64, 2001.
[5] A. L. Barabasi et al., Evolution of the social network of scientific collaborations, Physica A: Stat. Mech. and its Applications.
[6] L. Adamic, Friends and neighbors on the Web, Social Networks, vol. 25, no. 3, Jul. 2003.
[7] R. N. Lichtenwalter, J. T. Lussier, and N. V. Chawla, New perspectives and methods in link prediction, in KDD, 2010.
[8] E. Zheleva, L. Getoor, J. Golbeck, and U. Kuter, Using friendship ties and family circles for link prediction, in SNAKDD, 2010.
[9] M. A. Hasan, V. Chaoji, S. Salem, and M. Zaki, Link prediction using supervised learning, in SDM '06 workshop on LACS, 2006.
[10] L. Backstrom and J. Leskovec, Supervised random walks: predicting and recommending links in social networks, in Proc. of the WSDM. ACM, 2011.
[11] J. Chang and D. M. Blei, Hierarchical relational models for document networks, Annals of Applied Statistics, Oct. 2010.
[12] D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent dirichlet allocation, J. Mach. Learn. Res., vol. 3, March 2003.
[13] C. Ozcaglar, Classification of Email Messages Into Topics Using Latent Dirichlet Allocation, Master's thesis, Rensselaer Polytechnic Institute, Troy, New York, 2008.
[14] L. Hong and B. D. Davison, Empirical study of topic modeling in twitter, in SOMA, 2010.
[15] J. Yang and J. Leskovec, Patterns of temporal variation in online media, in Proc. of the WSDM. ACM, 2011.


More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Partitioning Algorithms for Improving Efficiency of Topic Modeling Parallelization

Partitioning Algorithms for Improving Efficiency of Topic Modeling Parallelization Partitioning Algorithms for Improving Efficiency of Topic Modeling Parallelization Hung Nghiep Tran University of Information Technology VNU-HCMC Vietnam Email: nghiepth@uit.edu.vn Atsuhiro Takasu National

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

The Data Mining Application Based on WEKA: Geographical Original of Music

The Data Mining Application Based on WEKA: Geographical Original of Music Management Science and Engineering Vol. 10, No. 4, 2016, pp. 36-46 DOI:10.3968/8997 ISSN 1913-0341 [Print] ISSN 1913-035X [Online] www.cscanada.net www.cscanada.org The Data Mining Application Based on

More information

Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm

Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm Majid Hatami Faculty of Electrical and Computer Engineering University of Tabriz,

More information

More Efficient Classification of Web Content Using Graph Sampling

More Efficient Classification of Web Content Using Graph Sampling More Efficient Classification of Web Content Using Graph Sampling Chris Bennett Department of Computer Science University of Georgia Athens, Georgia, USA 30602 bennett@cs.uga.edu Abstract In mining information

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy {ienco,meo,botta}@di.unito.it Abstract. Feature selection is an important

More information

Study of Data Mining Algorithm in Social Network Analysis

Study of Data Mining Algorithm in Social Network Analysis 3rd International Conference on Mechatronics, Robotics and Automation (ICMRA 2015) Study of Data Mining Algorithm in Social Network Analysis Chang Zhang 1,a, Yanfeng Jin 1,b, Wei Jin 1,c, Yu Liu 1,d 1

More information

Failure in Complex Social Networks

Failure in Complex Social Networks Journal of Mathematical Sociology, 33:64 68, 2009 Copyright # Taylor & Francis Group, LLC ISSN: 0022-250X print/1545-5874 online DOI: 10.1080/00222500802536988 Failure in Complex Social Networks Damon

More information

Classication of Corporate and Public Text

Classication of Corporate and Public Text Classication of Corporate and Public Text Kevin Nguyen December 16, 2011 1 Introduction In this project we try to tackle the problem of classifying a body of text as a corporate message (text usually sent

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate

More information

CS224W Final Report Emergence of Global Status Hierarchy in Social Networks

CS224W Final Report Emergence of Global Status Hierarchy in Social Networks CS224W Final Report Emergence of Global Status Hierarchy in Social Networks Group 0: Yue Chen, Jia Ji, Yizheng Liao December 0, 202 Introduction Social network analysis provides insights into a wide range

More information