Content-based Modeling and Prediction of Information Dissemination


Kathy Macropol
Department of Computer Science
University of California
Santa Barbara, CA 93106 USA

Ambuj Singh
Department of Computer Science
University of California
Santa Barbara, CA 93106 USA

Abstract

Social and communication networks across the world generate vast amounts of graph-like data each day. The modeling and prediction of how these communication structures evolve can be highly useful for many applications. Previous research in this area has focused largely on using past graph structure to predict future links. However, a useful observation is that many graph datasets have additional information associated with them beyond just their graph structure. In particular, communication graphs (such as email, Twitter, blog graphs, etc.) have information content associated with their graph edges. In this paper we examine the link between information content and graph structure, proposing a new graph modeling approach, GC-Model, which combines both. We then apply this model to multiple real-world communication graphs, demonstrating that the built models can be used effectively to predict future graph structure and information flow. On average, GC-Model's top predictions covered 19% more of the actual future graph communication structure when compared to other previously introduced algorithms, far outperforming multiple link prediction methods and several naive approaches.

I. INTRODUCTION

The mining and analysis of graph datasets has proven to be both useful and important for a large variety of applications, from social community detection to biological function prediction [1]. Graph representations can naturally capture the structure and interactions present in many types of data, and are therefore commonly used to represent a wide variety of datasets.
As a large portion of graph datasets change over time, one important area of research is the modeling and prediction of how these networks will evolve and how information will flow over them. By building and analyzing models of these changes, many important applications arise, including prediction of future links, anomaly detection, and expert finding [2], [3]. Previous research in the area of link prediction has focused largely on using past graph structure to predict future links. Models based on graph measures such as node neighborhoods, path information (random walk and flow-based distances), and in and out degrees have been used successfully to predict the formation of future links. However, a useful observation is that many graph datasets have additional information associated with them beyond just their graph structure. In particular, communication graphs (such as email, Twitter, blog graphs, etc.) are a rich and important source of network data, and have a large amount of information content associated with them. In a communication graph, each node is associated with a unique user, and edges are made through messages sent between users. In this way, a communication graph is a directed multigraph, with each edge having an associated timestamp and message text. Another important aspect of communication graphs is the fact that graph edges may be clustered into message threads. Each thread consists of an originating message, along with all ensuing replies and forwards initiated by the originating message. In this way, a thread's graph structure describes the structure of information flow its users followed during communication. In addition, by collecting graph edges into threads, we obtain a dataset which consists of numerous smaller graphs and their associated text.
While previous research has mainly focused on analyzing and modeling time-evolving graphs as one large graph, the additional information associated with communication graphs allows us to approach the problem instead as a modeling problem over multiple smaller graphs. In this paper, we introduce a new method, GC-Model, which takes advantage of additional properties and attributes contained within communication graphs. This model can be applied to a dataset consisting of numerous graph threads, and combines both graph and content information to predict future graph communication structure. In addition, starting information on graph structure (for example, an originating user or message link) can be input, allowing more specific targeted models with increased accuracy to be built. We implement GC-Model and test it on multiple real-world communication graphs, finding that its performance dominates both naive methods of link prediction which rely on graph structure or content information alone, as well as other existing link prediction algorithms. On average, we find that GC-Model is able to cover over 80% of a test thread's actual communication structure using only its top 20 predictions. In addition, GC-Model performs especially well on its highest predicted links, with its top 5 predictions covering 19% more graph structure on average than that covered by the other compared methods. The rest of this paper is divided as follows. In Section II, previous research and several methods related to our work are overviewed. Next, in Section III we describe and introduce

our new method, GC-Model. In Section IV we implement GC-Model on multiple communication graphs, analyzing its predictive results and comparing it with multiple previously introduced link prediction methods, as well as two naive approaches. Finally, Section V overviews our results and contributions.

II. RELATED WORK

There has been much focus recently on the link prediction problem in graphs. In general, link prediction concentrates on the analysis of a single large graph which contains links appearing and disappearing across time. The goal is to predict what the overall graph structure will be (what links will be added or deleted) at a later time step. Multiple papers have shown that even simple analysis of basic statistical graph properties can be successful for accurate link predictions in a future network. For the majority of these approaches, a single score is calculated between every pair of nodes, and the resulting node pairs with the highest scores are used as predictions for future links. Examples of these scoring methods include counting the common neighbors between two nodes (|N(a) ∩ N(b)|, where N(a) is the neighborhood set of node a) and calculating the Jaccard coefficient between the two nodes' neighborhood sets (|N(a) ∩ N(b)| / |N(a) ∪ N(b)|). Other related methods include modeling the graph using preferential attachment, which finds a score (|N(a)| · |N(b)|) favoring nodes of high degree [4], [5]. In the Adamic/Adar measure (Σ_{i ∈ N(a) ∩ N(b)} 1 / log |N(i)|), a score is calculated between two nodes based on their number of common neighbors, with each neighbor weighted by its importance (the rarity of links in its neighbor set) [6]. Another recently introduced method of unsupervised link prediction is PropFlow, which looks at random walk probabilities to make predictions [7].
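To make these classic scoring functions concrete, the four measures can be sketched in a few lines of Python (the toy graph and function names are ours, for illustration only):

```python
import math

def common_neighbors(N, a, b):
    # |N(a) ∩ N(b)|
    return len(N[a] & N[b])

def jaccard(N, a, b):
    # |N(a) ∩ N(b)| / |N(a) ∪ N(b)|
    union = N[a] | N[b]
    return len(N[a] & N[b]) / len(union) if union else 0.0

def preferential_attachment(N, a, b):
    # |N(a)| * |N(b)|: favors pairs of high-degree nodes
    return len(N[a]) * len(N[b])

def adamic_adar(N, a, b):
    # sum over common neighbors i of 1 / log |N(i)|
    # (degree-1 neighbors are skipped to avoid log(1) = 0)
    return sum(1.0 / math.log(len(N[i]))
               for i in N[a] & N[b] if len(N[i]) > 1)

# Toy undirected graph given as adjacency sets.
N = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}
```

In all four cases, the score is computed for every node pair and the highest-scoring pairs become the predicted links.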
Besides these simple unsupervised methods of link prediction based on graph structure, a supervised learning method was recently introduced, which uses an ensemble of the above unsupervised methods, combined with supervised learning, to make predictions on link formation in graphs [7]. All the previously mentioned methods of link prediction focus heavily on the prediction of new link formations between nodes. By calculating a score between every pair of nodes in the graph, previously unseen links are likely to be predicted. However, these methods of prediction ignore both the timing and the amount of communication previously seen between two nodes, and are not as appropriate in communication graphs, where the vast majority of links are repeats. In addition, these previously introduced link prediction methods look only at graph structure while making link predictions. Beyond graph structure, however, many large evolving graph datasets contain other features associated with their nodes and links. Current work has shown that augmenting these simple graph models with the information contained within the additional features allows for increased predictive power as well. The inclusion of family circle information has been shown to increase social network link prediction accuracy by as much as 3% [8], and aggregating features such as paper categories and neighbor counts has been used for link prediction in publication networks as well [9]. Recent work has also looked at learning random walk edge weights based on social network attributes [10].

[Fig. 1. Example graph and text word documents created from three threads, using a graph word length of l = 2. Thread 1 graph words: (A→B, A→B), (A→B, A→C), (A→B, A→C); text words: No, Meeting, This, Tuesday, Canceled, Meeting, Tomorrow. Thread 2 graph words: (A→C, A→C); text words: Need, This, Tax, Info, Tomorrow. Thread 3 graph words: (A→C, C→A), (A→C, A→C), (A→C, C→A); text words: Tax, Info, Sent, Already.]
Besides these categorical features, a rich source of feature information is contained within document and communication networks. In these cases, each node or link in the graph has a text document associated with it. One recent work focuses on document link prediction by building a relational topic model (RTM) between document texts [11]. In this case, each node is associated with a text document, and LDA is used to discover the document topic probabilities, with the likelihood of two text documents containing a link between them a function of their topic similarities. This RTM approach, however, focuses on graphs containing one single static document per node, and allows only for the prediction of new links between a given new node and the rest of the graph, rather than predicting a larger, flowing thread structure. In this work, we introduce GC-Model, a new method which extends these previous ideas by building a model for communication threads using graph and text content, the amount of communication between nodes, and message timing information. The built models can be customized by inputting starting information on the graph structure, and allow for the prediction of future thread structure.

III. METHOD

In GC-Model, a single, large communication graph is split into numerous smaller communication threads. Each thread is a non-overlapping set of messages sent between users from a single message flow. By partitioning the larger graph into these separate thread structures, we model separate communication flows based on subject and time. This partitioning is closer to real life, where communication often happens in multiple separate flows, rather than being part of a single giant structure. Three example threads are shown at the top of Figure 1, with Thread 1 consisting of three messages, two sent from A to B and one sent from A to C, Thread 2 consisting of two

messages, both from A to C, etc.

Fig. 2. Algorithm: Build GC-Model

  Algorithm: GC-Model
  Require: Training threads dataset, num topics t, input graph structure I, num sampling iterations sample, subgraph length l.
  Ensure: Topic-word probabilities P[word][topic].
    S[] ← threads in dataset matching I
    gWords[] ← []
    for thread in S[] do
      create graph words of length l for thread
      add graph words to gWords
    sort S[] by first link's start time
    tProb[][] ← [][]
    for thread in S[] do
      stem and remove stopwords in text for thread
      use LDA to find topic distribution on thread
      store tProb[thread][topic]
    curve ← findCurve(S[])
    prTopic[] ← findPrTopic(t, S[], tProb[][], curve)
    prWord[] ← findPrWord(gWords, S[], curve)
    prTopicWord[][] ← findPrTopicWord(gWords, S[], t, sample)
    P[][] ← [][]
    for word in gWords[] do
      P[word][topic] ← prTopicWord[word][topic] · prWord[word] / prTopic[topic]
    return P[][]

Once these threads are created, the main idea behind GC-Model is that each communication thread can be thought of as two different types of documents: a text document and a graph document. Both document types consist of a grouping of words. The text document consists of all text words contained within the thread messages. The graph document consists of all possible connected subgraphs, of length l, that may be created from the thread's graph structure. Each of the possible connected subgraphs composes one word of the graph document. The bottom half of Figure 1 shows examples of these text and graph documents (where l = 2) for three different sample threads. From Figure 1, the text document for Thread 1 consists of all the text words contained in the three messages sent in that thread. The graph document consists of all connected subgraphs of size 2 created from Thread 1's graph structure. Similar text and graph documents are created for Threads 2 and 3.

Fig. 3. Algorithm: Find Pr[Topic]

  Algorithm: findPrTopic
  Require: Num topics t, filtered dataset S, thread LDA topic probs tProb[][], curve fit for Pr[Thread] curve.
  Ensure: Array of prior topic probabilities, prTopic.
    prTopic[] ← []
    for i = 1 to S.size do
      for each topic do
        prTopic[topic] += tProb[i][topic] · curve(testThr.time, S[i].time)
    return prTopic

For both types of documents, GC-Model makes a mixture-of-topics assumption, where every document is assumed to be associated with a distribution over a set of topics. These topics, in turn, are a distribution over a vocabulary of words. Each topic, therefore, is associated with a distribution over two separate vocabularies: one for text words and one for graph words. In both cases, the probability of a word appearing in its respective document type for that thread is:

  P(word) = Σ_{i=1..t} Pr[word | T_i] · Q[T_i]    (1)

where T_i represents the i-th topic, and Q[T_i] represents the distribution probability of T_i in this document. For a thread, the graph structure is therefore contained in its graph document words. If we are able to build a model which predicts, given its topic distribution, what graph words will likely appear in a thread, then we can predict which links will appear in this thread in the future. This idea is the basis behind our model. Various methods of modeling documents composed of a series of words have been introduced in past research. One method that has gained widespread use is Latent Dirichlet Allocation (LDA) [12]. Previous studies have shown LDA modeling and its variants to be effective at discovering topics and clustering documents contained in communication messages [13], [14]. LDA is a generative probabilistic model where every document is assumed to be associated with a mixture of topics, where a topic is a distribution over a vocabulary of words. During the generative process, each document draws its topic proportions from a Dirichlet distribution.
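As an illustrative sketch of the graph-document construction (the function names and edge representation are ours), the graph words of a thread can be enumerated by testing every combination of l edges for connectivity; the example below reproduces the Thread 1 graph document from Figure 1:

```python
from itertools import combinations

def is_connected(chosen):
    # Weak connectivity over the nodes touched by the chosen edges.
    nodes = {n for e in chosen for n in e}
    seen = {next(iter(nodes))}
    grew = True
    while grew:
        grew = False
        for u, v in chosen:
            if (u in seen) != (v in seen):  # edge bridges into the seen set
                seen |= {u, v}
                grew = True
    return seen == nodes

def graph_words(edges, l):
    """Enumerate all connected subgraphs of l edges in a thread's
    directed multigraph; each such subgraph is one graph word.
    Edges are (sender, receiver) pairs in message order."""
    words = []
    for combo in combinations(range(len(edges)), l):
        chosen = [edges[i] for i in combo]
        if is_connected(chosen):
            words.append(tuple(chosen))
    return words

# Thread 1 from Fig. 1: two messages A->B and one message A->C.
thread1 = [("A", "B"), ("A", "B"), ("A", "C")]
```

Running `graph_words(thread1, 2)` yields three graph words: (A→B, A→B) once and (A→B, A→C) twice, matching the Thread 1 document in Figure 1.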
Next, words in a document are generated by first randomly choosing a topic, weighted by the topic proportions, and then choosing a word from the corresponding topic distribution. Multiple different techniques have been proposed to infer these topic and word distributions from a collection of documents. In this paper we use Gibbs sampling to infer the various LDA distributions, using the publicly available JGibbLDA

Fig. 4. Algorithm: Find Pr[Word]

  Algorithm: findPrWord
  Require: List of graph words words, filtered dataset S, curve fit for Pr[Thread] curve.
  Ensure: Array of prior word probabilities, prWord.
    prWord[] ← []
    for word = 1 to words.size do
      for i = 1 to S.size do
        wordCount ← 0
        for currentWord in S[i].graphWords do
          if currentWord = word then
            wordCount++
        wordProb ← wordCount / |S[i].graphWords|
        prWord[word] += wordProb · curve(testThr.time, S[i].time)
    return prWord

implementation. To obtain the probability of graph words given topic, Pr[word | T_i], and therefore find Pr[word] from Equation 1, GC-Model follows several steps (pseudocode for the overall GC-Model algorithm is shown in Figure 2). First, given a starting portion of the graph, all threads matching the starting condition are collected for use in training, resulting in a training set of size n. Applying LDA inference to the training threads' text word documents gives the associated topic probabilities for each thread. In addition, from Bayes' theorem, we obtain Equation 2:

  Pr[word_w | T_i] = Pr[T_i | word_w] · Pr[word_w] / Pr[T_i]    (2)

where word_w represents the w-th word in the vocabulary. This means that to calculate Pr[word_w | T_i], we must find three values: Pr[T_i | word_w], Pr[word_w], and Pr[T_i]. The last value, Pr[T_i], represents the prior probability of a topic appearing for any thread. It may be found, as shown in Equation 3, by summing the product of each thread's topic probability, obtained from the LDA inference, with the prior probability of that thread appearing:

  Pr[T_i] = Σ_{j=1..n} Pr[T_i | Thread_j] · Pr[Thread_j]    (3)

The prior probability of a thread's likelihood, Pr[Thread_j] from Equation 3, is estimated using a simple exponential curve fitted to the thread's start time. Details on how this value is found are given in subsection III-A. Pseudocode for finding Pr[Topic] is shown in Figure 3.
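A minimal sketch of Equation 3 (variable names are ours; for simplicity the thread priors below are taken as uniform, rather than the recency-weighted values GC-Model actually uses):

```python
def topic_priors(thread_topic_probs, thread_priors):
    """Eq. (3): Pr[T_i] = sum_j Pr[T_i | Thread_j] * Pr[Thread_j].
    thread_topic_probs[j][i] is the LDA topic distribution of thread j;
    thread_priors[j] is the prior probability of thread j."""
    t = len(thread_topic_probs[0])
    pr_topic = [0.0] * t
    for probs, prior in zip(thread_topic_probs, thread_priors):
        for i in range(t):
            pr_topic[i] += probs[i] * prior
    return pr_topic

# Two training threads, two topics; uniform priors Pr[Thread_j] = 1/n.
tprob = [[0.8, 0.2], [0.4, 0.6]]
priors = [0.5, 0.5]
```

With these inputs the topic priors come out to (0.6, 0.4); substituting the exponential-curve weights for `priors` gives the recency-weighted version used in the paper.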
The second value from Equation 2, Pr[word_w], may be found similarly to Pr[Thread_j], as shown in Equation 4:

  Pr[word_w] = Σ_{j=1..n} Pr[word_w | Thread_j] · Pr[Thread_j]    (4)

Fig. 5. Algorithm: Find Pr[Topic | Word]

  Algorithm: findPrTopicWord
  Require: List of graph words words, filtered dataset S, num topics t, curve fit for Pr[Thread] curve, sampling iterations sample, LDA topic probs tProb[][].
  Ensure: Array of prior probabilities of topics given a word, prTopicWord.
    prTopicWord[][] ← [][]
    for word = 1 to words.size do
      U ← {}
      for i = 1 to S.size do
        if S[i].graphWords contains word then
          add S[i] to U
      for i = 1 to sample do
        topicCount[] ← []
        for thread in U do
          randomly pick a topic, m, weighted according to tProb[thread][topic]
          topicCount[m]++
        for j = 1 to t do
          prTopicWord[word][j] += topicCount[j] / |U|
      for i = 1 to t do
        prTopicWord[word][i] /= sample
    return prTopicWord

In this case, the prior probability of finding a document word in a thread, Pr[word_w | Thread_j], can be calculated by finding its probability of occurrence in the graph document for that thread, as shown in Equation 5:

  Pr[word_w | Thread_j] = (count of word_w in Thread_j) / |words(Thread_j)|    (5)

where the count of word_w in Thread_j represents the number of times word_w occurs in Thread_j's graph document, and words(Thread_j) is the set of graph words contained in Thread_j's graph document. As an example, Thread 1 from Figure 1 contains the graph word (A→B, A→B) once out of a total of three graph words, and therefore has an associated probability Pr[(A→B, A→B) | Thread 1] = 1/3. The next graph word in Thread 1, (A→B, A→C), occurs twice out

Fig. 6. Algorithm: Fit Curve for Pr[Thread]

  Algorithm: findCurve
  Require: Filtered dataset S.
  Ensure: Curve used to find Pr[Thread] given time, curve.
    sumOfFitCurves ← 0; total ← 0
    for j = 1 to S.size do
      currentThread ← S[j]
      timeProb ← {}
      for k = j − 1 down to 1 do
        Time ← currentThread.time − S[k].time
        Train ← {}
        for m = k to j − 1 do
          add S[m] to Train
        Prob ← 1; set ← false
        for link l in currentThread do
          if Train contains l then
            Prob ← Prob · (number of l contained in Train) / (number of links in Train)
            set ← true
        if set is true then
          add (Time, Prob) to timeProb
          total ← total + 1
      fit exponential curve c to timeProb
      sumOfFitCurves += c
    curve ← sumOfFitCurves / total
    return curve

of three graph words, and therefore has a probability of 2/3. The pseudocode for finding Pr[Word] is shown in Figure 4. The third and final value necessary to calculate the probability in Equation 2, Pr[T_i | word_w], is estimated using sampling. By collecting and analyzing the set of threads containing word_w, the associated topic probabilities may be found. To do this, the set U = {Thread_j | word_w ∈ words(Thread_j)} is first created, containing every thread in the training set whose graph document contains word_w. Next, a sampled value for the probability of every topic is found. This is done by repeatedly choosing a random topic, weighted by the inferred topic distribution, for each thread in U. The final sampled probability for each topic is calculated as its average probability of being chosen during this sampling process. Pseudocode for finding this prior is shown in Figure 5. By combining the three previously found values according to Equation 2, we can calculate the probability, given the thread's topic distribution, of any graph word appearing. These probabilities therefore allow us to model the connection
Fig. 7. Algorithm: GeneratePredictions

  Algorithm: GeneratePredictions
  Require: Text for thread test, training threads dataset, num topics t, input graph structure I, num sampling iterations sample, subgraph length l.
  Ensure: Sorted array of link predictions, predict[].
    tProb[] ← []
    stem and remove stopwords in test
    use LDA to find topic distribution on test
    store tProb[topic]
    P[word][topic] ← GC-Model(dataset, t, I, sample, l)
    predict[] ← []
    for w = 1 to P.length do
      for each topic do
        predict[w] += P[w][topic] · tProb[topic]
    return predict

between a thread's graph structure and its communication text content, and can be used on new test threads to predict which graph words and communication structures are most likely to appear in that graph. Given a new thread's text content, applying LDA inference on its text document yields the list of topic probabilities for this test thread. Using Equation 1 to combine this topic distribution with the word probabilities calculated in Equation 2, the probability of any graph word appearing in this test thread may be found. The graph words with the highest predicted probabilities are GC-Model's top predictions for the future graph structure of this test thread. The pseudocode for this overall prediction algorithm is shown in Figure 7.

A. Thread Probability

It is possible to model the likelihood of a thread, Pr[Thread_j], from Equations 3 and 4 using the assumption that every training thread is equally likely, obtaining a constant value of Pr[Thread_j] = 1/n. However, in actuality, the probability of a past thread's structure appearing again is often related to how recently it occurred. For example, people messaged recently by a user are more likely to be contacted again in the near future than those not contacted for several years. To account for this, GC-Model fits a simple relation between probability and time.
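A minimal sketch of this kind of exponential probability-versus-time fit (a log-linear least-squares fit; the equi-depth binning step is omitted, and the helper name and synthetic data are ours):

```python
import math

def fit_exp_curve(time_prob_pairs):
    """Fit p ≈ a * exp(b * t) by ordinary least squares on log p,
    returning the fitted curve as a function of the time gap t."""
    ts = [t for t, _ in time_prob_pairs]
    ys = [math.log(p) for _, p in time_prob_pairs]
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    # Slope and intercept of the log-linear regression line.
    b = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys)) / \
        sum((t - mean_t) ** 2 for t in ts)
    a = math.exp(mean_y - b * mean_t)
    return lambda t: a * math.exp(b * t)

# Synthetic (time gap, reoccurrence probability) pairs: links in
# recent threads reoccur with higher probability.
pairs = [(1, 0.8), (2, 0.64), (3, 0.512), (4, 0.4096)]
curve = fit_exp_curve(pairs)
```

The fitted `curve` can then be evaluated at the gap between a test thread's time and a training thread's time to weight that training thread's contribution.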
The probability of each thread is approximated by the likelihood of its graph links reoccurring in another thread. Between every pair of training threads, their distance in time (the difference between the timestamps of their first messages) and their percentage of link overlap is calculated. These probability / time pairs are then sorted by their timestamp distances and binned into equi-depth bins, averaging the values in each bin. Finally, a simple exponential curve is fit to this series, allowing for the

connection between thread time and probability. Pseudocode for this algorithm is shown in Figure 6.

[Fig. 8. Varying graph word length (l = 1, 2, 3). The parameter affects scores only slightly.]

[Fig. 9. Varying number of LDA topics (1, 2, 4, 6, 8, 10, and 12 topics). Increasing topics raises the F1 score until a steady state is reached, where further adjustment has little effect.]

IV. EXPERIMENTS

In this section, we report our results from implementing and testing the GC-Model approach introduced above. To confirm the effectiveness of our modeling and prediction strategy, GC-Model was implemented in Java and tested on multiple real-world communication networks. We compare our performance to the recently introduced link prediction method PropFlow [7], as well as Common Neighbors, Preferential Attachment, the Jaccard coefficient, Adamic/Adar, and both a graph-based and a text-based naive method. For all approaches, the training graph dataset was created by collecting only those threads which matched the given starting input. The graph-based naive method used here predicts links based on frequency, where the highest scoring predictions are the links which appear most frequently. The text-based naive method predicts links based on text distances, where the top scoring predictions are the links whose communication text (stemmed, with stopwords removed) has the closest TF-IDF cosine similarity to the test threads.

A. Networks

Three separate communication graphs are used in this paper. The first is a crawled Twitter dataset, consisting of ~1M nodes (unique Twitter users) and ~2M edges (tweets). The total graph dataset is grouped into threads using each tweet's exact reply-to and retweet information. Keeping only those threads containing 3 or more links, a total of ~6k threads were obtained from this network. The second dataset is the Enron email dataset, consisting of over 500k links (emails) between 37k nodes (unique email addresses).
To separate the Enron graph into approximate threads, the subject lines and timestamps were used. Emails sent close in time (within one month of each other) were grouped according to exact subject line matches, ignoring prepended "Re" and "Fwd" text. The ten largest threads were removed, due to the general nature of their subject lines, as were broadcast emails sent to more than 20 people. Only threads containing at least 3 links and unique users were kept, leaving a total of ~5k threads. The last dataset is the SNAP Twitter dataset [15], consisting of ~17M nodes and ~476M edges. Approximate threads were formed through the retweet text information contained in the links. Threads were created by doing exact text matches between the tweet text, the RT retweet information, and the user ids contained in the messages, resulting in ~4k threads.

B. Parameters and Setup

In our implementation and experiments on GC-Model, the graph word length parameter l was found to influence results only slightly, and was set to 2 for all tests reported here, unless otherwise noted. An example of the results obtained by varying l from 1 to 3 on the Enron dataset is shown in Figure 8, which plots the F1 score, defined as:

  F1 = 2 · (precision · recall) / (precision + recall)

As can be seen, the resulting score curves are mostly clustered and similar. Comparable results were found in the other networks. In addition, a sampling size of 1 to find Pr[T_i | word_w], as discussed in Section III, was discovered to give consistent estimates, and is the value used in this paper. Finally, 8 LDA topics were found to give high-quality results for all tested communication graphs, and this is the value used for the displayed results here. Figure 9 shows the results of varying the number of topics from 1 to 12 on the Enron dataset.
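The F1 evaluation over a top-k prediction set can be sketched as follows (function names and the toy prediction list are ours):

```python
def f1_at_k(predictions, actual_links, k):
    """F1 score of the top-k predicted directed links against the
    actual set of directed links in a test thread."""
    top_k = set(predictions[:k])
    hits = len(top_k & actual_links)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(actual_links)
    return 2 * precision * recall / (precision + recall)

# Toy ranked predictions (highest probability first) and ground truth.
predicted = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
actual = {("A", "B"), ("C", "A")}
```

For the toy data, the top-2 predictions give precision 0.5 and recall 0.5 (F1 = 0.5), while taking all four predictions raises recall to 1.0 at the cost of precision.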
As can be seen from the figure, result scores increase steadily as the number of topics is increased, until reaching a stable state where further increases in topic numbers affect results only slightly. Similar results were found for the other datasets. To validate the effectiveness of our technique at building models of communication graphs, a series of models was built and evaluated for each network. This was done twice, with the results reported in Figures 10 and 11. For the results shown in Figure 10, the starting information given on graph structure was the beginning user for the thread. In Figure 11, the starting information was the beginning graph link. For both these types, the starting graph information was obtained by pulling the top 20% of starting users or graph links with the highest frequencies in the dataset. This was done to allow for

[Fig. 10. Results for predicting directed links given the starting user, on (a) the Enron dataset, (b) the crawled Twitter dataset, and (c) the SNAP Twitter dataset. GC-Model consistently returns higher scores than the other previously introduced algorithms.]

[Fig. 11. Results for predicting directed links given the starting link, on (a) the Enron dataset, (b) the crawled Twitter dataset, and (c) the SNAP Twitter dataset. Again, GC-Model's results dominate the other methods, especially among the top predictions.]

sufficient dataset sizes in training and testing. For each model, the threads matching the given starting information were found and sorted according to the timestamp of the first message sent, and the ten latest threads were separated to form the testing set. The remaining graph threads were used for training and building the model. After building the models, the most likely predicted graph links were found for each test thread using the method shown in Section III. To evaluate these predictions, the F1 score was again used, as it allows for the combination of both recall and precision, favoring results with a good balance of both. The F1 score was calculated repeatedly for every test thread and model, using a series of top-k predictions, with k ranging from 1 to 20. For each value of k, the recall and precision of the top-k prediction set were found against the actual set of directed links, and used to find its F1 score. To allow for fair prediction and scoring, only those links in the testing set occurring at least once in the training set were used in these result calculations. Finally, the series of F1 scores found across the test threads for each model were averaged to obtain the final results.

C. Prediction Results

The results of applying GC-Model, as well as the other link prediction and naive graph and text methods, can be seen in Figures 10 and 11.
Figure 10 shows the plot of F1 score versus top-k value, on each network, for the various methods. As can be seen, GC-Model consistently outperforms the other methods, reporting higher results across all the graphs. GC-Model also performs especially well on those earliest predictions assigned the highest probability, obtaining F1 scores up to 45% higher than the next closest method. The graph and text naive methods, as well as the recently introduced PropFlow link prediction method, score consistently lower on the highest predictions, but report similar scores eventually as the k value is increased. This helps to confirm the effectiveness of GC-Model's method in the modeling and use of both text and graph structure information for prediction, as GC-Model needs fewer predictions to cover more of the test threads. The much lower prediction results of the remaining previously introduced link prediction methods emphasize their use for the prediction of new, rather than repeating, links. These trends are echoed in Figure 11, where the starting link, rather than the starting user, is given. GC-Model again returns much stronger F1 scores, dominating the results across the graphs. The addition of the extra starting link information, used to prune the training sets for all the reported methods, can be seen to increase the strength of the models and predictions as well. In addition, Tables I and II report the percentage of directed links in the test graphs covered by the top-k predictions. Only the graph and text naive methods and the PropFlow method are used for comparison here, as the other link prediction methods generally had lower values. As can be seen from Table I, taking just the top 10 GC-Model link predictions covers over 70% of the actual thread structure in the Enron dataset, and the top 20 cover over 80%.
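The coverage numbers reported in the tables can be computed with a sketch like the following (function names and toy data are ours):

```python
def coverage_at_k(predictions, actual_links, k):
    """Percent of a test thread's actual directed links that appear
    among the top-k predicted links."""
    covered = set(predictions[:k]) & actual_links
    return 100.0 * len(covered) / len(actual_links)

# Toy ranked predictions (highest probability first) and ground truth.
predicted = [("A", "B"), ("A", "C"), ("C", "A")]
actual = {("A", "B"), ("C", "A")}
```

Averaging this quantity over all test threads, for each k, produces entries of the kind shown in Tables I and II.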
GC-Model's strong results again emphasize its ability to build communication thread models that are both effective and useful for information dissemination prediction, and confirm the usefulness of combining content and graph structure information.

TABLE I
GIVEN STARTING USER, PERCENT OF GRAPH COVERED BY PREDICTIONS

Enron (% graph covered):
  Predictions  GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5        43%       38%          37%         23%
  Top 10       62%       54%          53%         35%
  Top 20       73%       66%          67%         47%
  Top 50       86%       85%          85%         55%

Crawled Twitter (% graph covered):
  Predictions  GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5        54%       38%          35%         33%
  Top 10       63%       46%          43%         38%
  Top 20       74%       55%          53%         41%
  Top 50       89%       65%          68%         45%

SNAP Twitter (% graph covered):
  Predictions  GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5        74%       41%          18%         18%
  Top 10       81%       45%          20%         20%
  Top 20       86%       48%          22%         22%
  Top 50       93%       53%          25%         24%

TABLE II
GIVEN STARTING LINK, PERCENT OF GRAPH COVERED BY PREDICTIONS

Enron (% graph covered):
  Predictions  GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5        62%       53%          45%         49%
  Top 10       78%       65%          56%         51%
  Top 20       87%       71%          71%         59%
  Top 50       91%       83%          80%         64%

Crawled Twitter (% graph covered):
  Predictions  GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5        57%       40%          37%         34%
  Top 10       66%       47%          45%         39%
  Top 20       75%       55%          54%         43%
  Top 50       89%       66%          68%         47%

SNAP Twitter (% graph covered):
  Predictions  GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5        77%       45%          18%         19%
  Top 10       83%       49%          21%         21%
  Top 20       89%       53%          23%         22%
  Top 50       96%       60%          27%         26%

V. CONCLUSION

The modeling and prediction of graph structure in graph datasets has been both an interesting and important area of research in recent years. In particular, with the growth of social networks and services such as Twitter, communication graphs have been an especially fast growing source of data and information. While numerous link prediction methods have been introduced, most focusing on predicting new link formation using only graph structure, an important insight is that communication graphs have additional text and information content associated with them. In this paper, we introduced a new method, GC-Model, which utilizes the extra content information included in communication graphs to build graph models that may be used to predict future network structure and information flow.
When implemented and tested on three real-world communication networks, GC-Model predicted with far greater accuracy and coverage than multiple previously introduced link prediction and naive methods. GC-Model's returned scores dominate the comparison sets, with its top 10 predictions covering on average up to 22% more of the actual communication graph structure than the compared methods. These results emphasize the importance and strength of including both information content and graph structure in the modeling of communication networks, as well as the effectiveness of GC-Model's modeling scheme for the prediction of future information flow structure.

ACKNOWLEDGMENT

Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

REFERENCES

[1] K. Macropol and A. Singh, Scalable discovery of best clusters on large graphs, Proc. VLDB Endow., vol. 3, September 2010.
[2] D. Liben-Nowell and J. Kleinberg, The link-prediction problem for social networks, J. of ASIST, vol. 58, 2007.
[3] Z. Huang and D. Zeng, A link prediction approach to anomalous email detection, in SMC, 2006.
[4] M. E. J. Newman, Clustering and preferential attachment in growing networks, Phys. Rev. E, vol. 64, 2001.
[5] A. L. Barabasi et al., Evolution of the social network of scientific collaborations, Physica A: Stat. Mech. and its Applications.
[6] L. Adamic, Friends and neighbors on the Web, Social Networks, vol. 25, no. 3, Jul. 2003.
[7] R. N. Lichtenwalter, J. T. Lussier, and N. V. Chawla, New perspectives and methods in link prediction, in KDD, 2010.
[8] E. Zheleva, L. Getoor, J. Golbeck, and U. Kuter, Using friendship ties and family circles for link prediction, in SNAKDD, 2010.
[9] M. A. Hasan, V. Chaoji, S. Salem, and M. Zaki, Link prediction using supervised learning, in SDM '06 workshop on LACS, 2006.
[10] L. Backstrom and J. Leskovec, Supervised random walks: predicting and recommending links in social networks, in Proc. of the WSDM. ACM, 2011.
[11] J. Chang and D. M. Blei, Hierarchical relational models for document networks, Annals of Applied Statistics, Oct. 2010.
[12] D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent dirichlet allocation, J. Mach. Learn. Res., vol. 3, March 2003.
[13] C. Ozcaglar, Classification of Email Messages Into Topics Using Latent Dirichlet Allocation, Master's thesis, Rensselaer Polytechnic Institute, Troy, New York, 2008.
[14] L. Hong and B. D. Davison, Empirical study of topic modeling in twitter, in SOMA, 2010.
[15] J. Yang and J. Leskovec, Patterns of temporal variation in online media, in Proc. of the WSDM. ACM, 2011.


More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Partitioning Algorithms for Improving Efficiency of Topic Modeling Parallelization

Partitioning Algorithms for Improving Efficiency of Topic Modeling Parallelization Partitioning Algorithms for Improving Efficiency of Topic Modeling Parallelization Hung Nghiep Tran University of Information Technology VNU-HCMC Vietnam Email: nghiepth@uit.edu.vn Atsuhiro Takasu National

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

The Data Mining Application Based on WEKA: Geographical Original of Music

The Data Mining Application Based on WEKA: Geographical Original of Music Management Science and Engineering Vol. 10, No. 4, 2016, pp. 36-46 DOI:10.3968/8997 ISSN 1913-0341 [Print] ISSN 1913-035X [Online] www.cscanada.net www.cscanada.org The Data Mining Application Based on

More information

Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm

Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm Majid Hatami Faculty of Electrical and Computer Engineering University of Tabriz,

More information

More Efficient Classification of Web Content Using Graph Sampling

More Efficient Classification of Web Content Using Graph Sampling More Efficient Classification of Web Content Using Graph Sampling Chris Bennett Department of Computer Science University of Georgia Athens, Georgia, USA 30602 bennett@cs.uga.edu Abstract In mining information

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy {ienco,meo,botta}@di.unito.it Abstract. Feature selection is an important

More information

Study of Data Mining Algorithm in Social Network Analysis

Study of Data Mining Algorithm in Social Network Analysis 3rd International Conference on Mechatronics, Robotics and Automation (ICMRA 2015) Study of Data Mining Algorithm in Social Network Analysis Chang Zhang 1,a, Yanfeng Jin 1,b, Wei Jin 1,c, Yu Liu 1,d 1

More information

Failure in Complex Social Networks

Failure in Complex Social Networks Journal of Mathematical Sociology, 33:64 68, 2009 Copyright # Taylor & Francis Group, LLC ISSN: 0022-250X print/1545-5874 online DOI: 10.1080/00222500802536988 Failure in Complex Social Networks Damon

More information

Classication of Corporate and Public Text

Classication of Corporate and Public Text Classication of Corporate and Public Text Kevin Nguyen December 16, 2011 1 Introduction In this project we try to tackle the problem of classifying a body of text as a corporate message (text usually sent

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate

More information

CS224W Final Report Emergence of Global Status Hierarchy in Social Networks

CS224W Final Report Emergence of Global Status Hierarchy in Social Networks CS224W Final Report Emergence of Global Status Hierarchy in Social Networks Group 0: Yue Chen, Jia Ji, Yizheng Liao December 0, 202 Introduction Social network analysis provides insights into a wide range

More information