Content-based Modeling and Prediction of Information Dissemination
Kathy Macropol
Department of Computer Science
University of California
Santa Barbara, CA 93106 USA

Ambuj Singh
Department of Computer Science
University of California
Santa Barbara, CA 93106 USA

Abstract—Social and communication networks across the world generate vast amounts of graph-like data each day. The modeling and prediction of how these communication structures evolve can be highly useful for many applications. Previous research in this area has focused largely on using past graph structure to predict future links. However, a useful observation is that many graph datasets have additional information associated with them beyond just their graph structure. In particular, communication graphs (such as email, Twitter, blog graphs, etc.) have information content associated with their graph edges. In this paper we examine the link between information content and graph structure, proposing a new graph modeling approach, GC-Model, which combines both. We then apply this model to multiple real world communication graphs, demonstrating that the built models can be used effectively to predict future graph structure and information flow. On average, GC-Model's top predictions covered 19% more of the actual future graph communication structure when compared to other previously introduced algorithms, far outperforming multiple link prediction methods and several naive approaches.

I. INTRODUCTION

The mining and analysis of graph datasets has proven to be both useful and important for a large variety of applications, from social community detection to biological function prediction [1]. Graph representations can naturally capture the structure and interactions present in many types of data, and are therefore commonly used to represent a wide variety of datasets.
As a large portion of graph datasets change over time, one important area of research is the modeling and prediction of how these networks will evolve and how information will flow over them. By building and analyzing models of these changes, many important applications arise, including prediction of future links, anomaly detection, and expert finding [2], [3]. Previous research in the area of link prediction has focused largely on using past graph structure to predict future links. Models looking at graph measures such as node neighborhoods, path information (such as random walk and flow-based distances), in and out degrees, etc. have been used successfully to predict the formation of links in the future. However, a useful observation is that many graph datasets have additional information associated with them beyond just their graph structure. In particular, communication graphs (such as email, Twitter, blog graphs, etc.) are a rich and important source of network data, and have a large amount of information content associated with them. In a communication graph, each node is associated with a unique user, and edges are made through messages sent between users. In this way, a communication graph is a directed multigraph, with each edge having an associated timestamp and message text. Another important aspect of communication graphs is the fact that graph edges may be clustered into message threads. Each thread consists of an originating message, along with all ensuing replies and forwards initiated by the originating message. In this way, a thread's graph structure describes the structure of information flow its users followed during communication. In addition, by collecting graph edges into threads, we obtain a dataset which consists of numerous smaller graphs, and their associated text.
While previous research has mainly focused on analyzing and modeling time evolving graphs as one large graph, the additional information associated with communication graphs allows us to approach the problem instead as a modeling problem of multiple smaller graphs. In this paper, we introduce a new method, GC-Model, which takes advantage of additional properties and attributes contained within communication graphs. This model can be applied to a dataset consisting of numerous graph threads, and combines both graph and content information to predict future graph communication structure. In addition, starting information on graph structure (for example, an originating user or message link) can be input, allowing for more specific targeted models with increased accuracy to be built. We implement GC-Model and test it on multiple real world communication graphs, finding that its performance dominates both naive methods of link prediction which rely only on graph structure or content information alone, as well as other existing link prediction algorithms. On average, we find that GC-Model is able to cover over 80% of a test thread's actual communication structure using only its top 20 predictions. In addition, GC-Model performs especially well on its highest predicted links, with its top 5 predictions covering 19% more graph structure on average than that covered by the other compared methods. The rest of this paper is organized as follows. In Section II, previous research and several methods related to our work are overviewed. Next, in Section III we describe and introduce
our new method, GC-Model. In Section IV we implement GC-Model on multiple communication graphs, analyzing its predictive results and comparing it with multiple previously introduced link prediction methods, as well as two naive approaches. Finally, Section V overviews our results and contributions.

II. RELATED WORK

There has been much focus recently on the link prediction problem in graphs. In general, link prediction concentrates on the analysis of a single large graph which contains links appearing / disappearing across time. The goal is to predict what the overall graph structure will be (what links will be added or deleted) at a later time step. Multiple papers have shown that even simple analysis of basic statistical graph properties can be successful for accurate link predictions in a future network. For the majority of these approaches, a single score is calculated between every pair of nodes, and the resulting node pairs with the highest scores are used as predictions for future links. Examples of these scoring methods include counting the common neighbors between two nodes (|N(a) ∩ N(b)|, where N(a) is the neighborhood set of node a) and calculating Jaccard's coefficient between the two nodes' neighborhood sets (|N(a) ∩ N(b)| / |N(a) ∪ N(b)|). Other related methods include modeling the graph using preferential attachment, which finds a score (|N(a)| · |N(b)|) favoring nodes of high degree [4], [5]. In the Adamic/Adar measure (Σ_{i ∈ N(a) ∩ N(b)} 1 / log |N(i)|), a score is calculated between two nodes based on their number of common neighbors, with each neighbor weighted by its importance (the rarity of links in its neighbor set) [6]. Another recently introduced method of unsupervised link prediction includes PropFlow, which looks at random walk probabilities to make predictions [7].
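To make these scoring functions concrete, the following Python sketch implements the four measures above over a simple adjacency-dict graph; the graph representation and function names are illustrative assumptions, not code from the paper.

```python
import math

# Toy undirected graph as an adjacency dict: node -> set of neighbors.
# This representation is an assumption for illustration.
N = {
    "a": {"x", "y"},
    "b": {"x", "y", "z"},
    "x": {"a", "b"},
    "y": {"a", "b"},
    "z": {"b"},
}

def common_neighbors(N, a, b):
    # |N(a) ∩ N(b)|
    return len(N[a] & N[b])

def jaccard(N, a, b):
    # |N(a) ∩ N(b)| / |N(a) ∪ N(b)|
    union = N[a] | N[b]
    return len(N[a] & N[b]) / len(union) if union else 0.0

def preferential_attachment(N, a, b):
    # |N(a)| · |N(b)|
    return len(N[a]) * len(N[b])

def adamic_adar(N, a, b):
    # Sum over shared neighbors i of 1 / log|N(i)|,
    # skipping degree-1 neighbors to avoid log(1) = 0.
    return sum(1.0 / math.log(len(N[i]))
               for i in N[a] & N[b] if len(N[i]) > 1)
```

Ranking all node pairs by any one of these scores and taking the highest-scoring pairs yields the kind of unsupervised predictor that GC-Model is compared against later in the paper.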
Besides these simple unsupervised methods of link prediction based on graph structure, a supervised learning method was recently introduced, which uses an ensemble of the above unsupervised methods, combined with supervised learning, to make predictions on link formation in graphs [7]. All the previously mentioned methods of link prediction focus heavily on the prediction of new link formations between nodes. By calculating a score between every pair of nodes in the graph, previously unseen links are likely to be predicted. However, these methods of prediction ignore both the timing and the amount of communication previously seen between two nodes, and are not as appropriate in communication graphs where the vast majority of links are repeats. In addition, these previously introduced link prediction methods look only at graph structure while making link predictions. Beyond graph structure, however, many large evolving graph datasets contain other features associated with their nodes and links. Current work has shown that augmenting these simple graph models with the information contained within the additional features allows for increased predictive power, as well. The inclusion of family circle information has been shown to increase social network link prediction accuracy by as much as 3% [8], and aggregated features such as paper categories and neighbor counts have been used for link prediction in publication networks as well [9]. Recent work has also looked at learning random walk edge weights based on social network attributes [10].

Fig. 1. Example Graph and Text word documents created from three threads, using a graph word length of l = 2:
  Thread 1 — Graph words: 1. A→B, A→B; 2. A→B, A→C; 3. A→B, A→C. Text words: No, Meeting, This, Tuesday, Canceled, Meeting, Tomorrow.
  Thread 2 — Graph words: 1. A→C, A→C. Text words: Need, This, Tax, Info, Tomorrow.
  Thread 3 — Graph words: 1. A→C, C→A; 2. A→C, A→C; 3. A→C, C→A. Text words: Tax, Info, Sent, Already.
Besides these categorical features, a rich source of feature information is contained within document and communication networks. In these cases, each node or link in the graph has an associated text document. One recent work focuses on document link prediction by building a relational topic model (RTM) between document texts [11]. In this case, each node is associated with a text document, and LDA is used to discover the document topic probabilities, with the likelihood of two text documents containing a link between them a function of their topic similarities. This RTM approach, however, focuses on graphs containing one single static document per node, and allows only for the prediction of new links between a given new node and the rest of the graph, rather than predicting a larger, flowing thread structure. In this work, we introduce GC-Model, a new method which extends these previous ideas by building a model for communication threads using both graph and text content, the amount of communication between nodes, and message timing information. The built models can be customized by inputting starting information on the graph structure, and allow for the prediction of future thread structure.

III. METHOD

In GC-Model, a single, large communication graph is split into numerous smaller communication threads. Each thread is a non-overlapping set of messages sent between users from a single message flow. By partitioning the larger graph into these separate thread structures, we model separate communication flows based on subject and time. This partitioning is closer to real life, where communication often happens in multiple separate flows, rather than being part of a single giant structure. Three example threads are shown at the top of Figure 1, with Thread 1 consisting of three messages, two sent from A to B and one sent from A to C, Thread 2 consisting of two
Fig. 2. Algorithm: Build GC-Model
Require: Training threads dataset, num topics t, input graph structure I, num sampling iterations sample, subgraph length l.
Ensure: Topic-word probabilities P[word][topic].
  S[] ← threads in dataset matching I
  gWords[] ← []
  for thread in S[] do
    create graph words of length l for thread
    add graph words to gWords
  sort S[] by first link's start time
  tProb[][] ← [][]
  for thread in S[] do
    stem and remove stopwords in text for thread
    use LDA to find topic distribution on thread
    store tProb[thread][topic]
  curve ← findCurve(S[])
  prTopic[] ← findPrTopic(t, S[], tProb[][], curve)
  prWord[] ← findPrWord(gWords, S[], curve)
  prTopicWord[][] ← findPrTopicWord(gWords, S[], t, sample)
  P[][] ← [][]
  for word in gWords[] do
    P[word][topic] ← prTopicWord[word][topic] · prWord[word] / prTopic[topic]
  return P[][]

messages, both from A to C, etc. Once these threads are created, the main idea behind GC-Model is that each communication thread can be thought of as two different types of documents: a text document and a graph document. Both document types consist of a grouping of words. The text document consists of all text words contained within the thread messages. The graph document consists of all possible connected subgraphs, of length l, that may be created from the thread's graph structure. Each of the possible connected subgraphs composes one word of the graph document. The bottom half of Figure 1 shows examples of these text and graph documents (where l = 2) for three different sample threads. From Figure 1, the text document for Thread 1 consists of all the text words contained in the three messages sent in that thread. The graph document consists of all connected subgraphs of size 2 created from Thread 1's graph structure. Similar text and graph documents are created
Fig. 3.
Algorithm: findPrTopic
Require: Num topics t, filtered dataset S, thread LDA topic probs tProb[][], curve fit for Pr[Thread] curve.
Ensure: Array of prior probabilities of topics, prTopic.
  prTopic[] ← []
  for i = 1 to S.size do
    prTopic[topic] += tProb[i][topic] · curve(testThr.time, S[i].time)
  return prTopic

for Threads 2 and 3. For both types of documents, GC-Model makes a mixture-of-topics assumption, where every document is assumed to be associated with a distribution over a set of topics. These topics, in turn, are a distribution over a vocabulary of words. Each topic, therefore, is associated with a distribution over two separate vocabularies: one for text words and one for graph words. In both cases, the probability of a word appearing in its respective document type for that thread is:

Probability of word = Σ_{i=1}^{t} Pr[word | T_i] · Q[T_i]    (1)

where T_i represents the i-th topic, and Q[T_i] represents the distribution probability of T_i in this document. For a thread, the graph structure is therefore contained in its graph document words. If we are able to build a model which predicts, given its topic distribution, what graph words will likely appear in a thread, then we can predict which links will appear in this thread in the future. This idea is the basis behind our model. Various methods of modeling documents composed of a series of words have been introduced in past research. One method that has gained widespread use is Latent Dirichlet Allocation (LDA) [12]. Previous studies have shown LDA modeling and its variants to be effective at discovering topics and clustering documents contained in communication messages [13], [14]. LDA is a generative probabilistic model where every document is assumed to be associated with a mixture of topics, where a topic is a distribution over a vocabulary of words. During the generative process, each document draws its topic proportions from a Dirichlet distribution.
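The graph documents of Figure 1 can be generated mechanically. The sketch below assumes one particular reading of the construction, in which a graph word of length l is any combination of l thread edges whose union forms a connected subgraph; all names here are illustrative, not from the paper.

```python
from itertools import combinations

def is_connected(edges):
    # Union-find over the endpoints of the chosen edges.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(n) for e in edges for n in e}) == 1

def graph_words(edges, l=2):
    # Every combination of l (multi)edges forming a connected subgraph
    # becomes one word of the thread's graph document.
    return [combo for combo in combinations(edges, l) if is_connected(combo)]

# Thread 1 from Figure 1: two messages A->B and one message A->C.
thread1 = [("A", "B"), ("A", "B"), ("A", "C")]
words = graph_words(thread1, l=2)  # three graph words, matching Figure 1
```

Because the thread is a multigraph, the two A→B edges are distinct, which is why Thread 1 yields three graph words rather than two distinct ones.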
Next, words in a document are generated by first randomly choosing a topic, weighted by the topic proportions, and then choosing a word from the corresponding topic distribution. Multiple different techniques have been proposed to infer these topic and word distributions from a collection of documents. In this paper we use Gibbs sampling to infer the various LDA distributions, using the publicly available JGibbLDA
Fig. 4. Algorithm: findPrWord
Require: List of graph words words, filtered dataset S, curve fit for Pr[Thread] curve.
Ensure: Array of prior probabilities of words, prWord.
  prWord[] ← []
  for word = 1 to words.size do
    for i = 1 to S.size do
      wordCount ← 0
      for currentWord in S[i].graphWords do
        if currentWord = word then
          wordCount++
      wordProb ← wordCount / |S[i].graphWords|
      prWord[word] += wordProb · curve(testThr.time, S[i].time)
  return prWord

implementation. To obtain the probability of graph words given topic, Pr[word | T_i], and therefore find Pr[word] from Equation 1, GC-Model follows several steps (pseudocode for the overall GC-Model algorithm is shown in Figure 2). First, given a starting portion of the graph, all threads matching the starting condition are collected for use in training, resulting in a training set of size n. Applying LDA inference to the training threads' text word documents gives the associated topic probabilities for each thread. In addition, from Bayes' Theorem, we obtain Equation 2:

Pr[word_w | T_i] = Pr[T_i | word_w] · Pr[word_w] / Pr[T_i]    (2)

where word_w represents the w-th word in the vocabulary. This means that to calculate Pr[word_w | T_i], we must find three values: Pr[T_i | word_w], Pr[word_w], and Pr[T_i]. The last value, Pr[T_i], represents the prior probability of a topic appearing for any thread. This may be found, as shown in Equation 3, by summing the product of each thread's topic probability, obtained from the LDA inference, with the prior probability of that thread appearing:

Pr[T_i] = Σ_{j=1}^{n} Pr[T_i | Thread_j] · Pr[Thread_j]    (3)

The prior probability of a thread's likelihood, Pr[Thread_j] from Equation 3, is estimated using a simple exponential curve fitted to the thread's start time. Details on how this value is found are given in subsection III-A. Pseudocode for finding Pr[Topic] is shown in Figure 3.
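The two priors in Equations 3 and 4 are both sums of per-thread quantities weighted by Pr[Thread_j]. The sketch below assumes a uniform thread prior of 1/n in place of the paper's fitted exponential curve, and its data and names are illustrative.

```python
from collections import Counter

def topic_prior(tprob):
    # Eq. 3 with Pr[Thread_j] = 1/n: Pr[T_i] = (1/n) * sum_j Pr[T_i | Thread_j]
    n, t = len(tprob), len(tprob[0])
    return [sum(row[i] for row in tprob) / n for i in range(t)]

def word_prior(graph_docs, vocab):
    # Eq. 5: Pr[word_w | Thread_j] is word_w's relative frequency in the
    # thread's graph document; Eq. 4 sums these with Pr[Thread_j] = 1/n.
    n = len(graph_docs)
    prior = dict.fromkeys(vocab, 0.0)
    for doc in graph_docs:
        counts = Counter(doc)
        for w in vocab:
            prior[w] += (counts[w] / len(doc)) / n
    return prior

tprob = [[0.9, 0.1], [0.5, 0.5]]                 # LDA topic probs, 2 threads
docs = [["AB|AB", "AB|AC", "AB|AC"], ["AC|AC"]]  # graph documents, as strings
topics = topic_prior(tprob)                      # [0.7, 0.3]
wprior = word_prior(docs, ["AB|AB", "AB|AC", "AC|AC"])
```

Swapping the uniform 1/n weight for the recency curve of subsection III-A recovers the weighting the paper actually uses.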
The second value from Equation 2, Pr[word_w], may be found similarly to Pr[Thread_j], as shown in Equation 4:

Pr[word_w] = Σ_{j=1}^{n} Pr[word_w | Thread_j] · Pr[Thread_j]    (4)

Fig. 5. Algorithm: findPrTopicWord
Require: List of graph words words, filtered dataset S, num topics t, curve fit for Pr[Thread] curve, sampling iterations sample, LDA topic probs tProb[][].
Ensure: Array of prior probabilities of topics given a word, prTopicWord.
  prTopicWord[][] ← [][]
  for word = 1 to words.size do
    U ← {}
    for i = 1 to S.size do
      if S[i].graphWords contains word then
        add S[i] to U
    for i = 1 to sample do
      topicCount[] ← []
      for thread in U do
        randomly pick a topic m, weighted according to tProb[thread][topic]
        topicCount[m]++
      for j = 1 to t do
        prTopicWord[word][j] += topicCount[j] / |U|
    for i = 1 to t do
      prTopicWord[word][i] /= sample
  return prTopicWord

In this case, the prior probability of finding a document word in a thread, Pr[word_w | Thread_j], can be calculated by finding its probability of occurrence in the graph document for that thread, as shown in Equation 5:

Pr[word_w | Thread_j] = |word_w in Thread_j| / |words(Thread_j)|    (5)

where |word_w in Thread_j| represents the number of times word_w occurs in Thread_j's graph document, and words(Thread_j) is the set of graph words contained in Thread_j's graph document. As an example, Thread 1 from Figure 1 contains the graph word A→B, A→B once out of a total of three graph words, and therefore has an associated probability Pr[A→B, A→B | Thread 1] = 1/3. The next graph word in Thread 1, A→B, A→C, occurs twice out
Fig. 6. Algorithm: findCurve
Require: Filtered dataset S.
Ensure: Curve used to find Pr[Thread] given time, curve.
  sumOfFitCurves ← 0; total ← 0
  for j = 1 to S.size do
    currentThread ← S[j]
    timeProb ← {}
    for k = j − 1 to 1 do
      Time ← currentThread.time − S[k].time
      Train ← {}
      for m = k to j − 1 do
        add S[m] to Train
      Prob ← 1; set ← false
      for link l in currentThread do
        if Train contains l then
          Prob ← (number of l contained in Train) / (number of links in Train)
          set ← true
      if set is true then
        add (Time, Prob) to timeProb
        total ← total + 1
    fit exponential curve c to timeProb
    sumOfFitCurves += c
  curve ← sumOfFitCurves / total
  return curve

of three graph words, and therefore has a probability of 2/3. The pseudocode for finding Pr[Word] is shown in Figure 4. The third and final value necessary to calculate the probability in Equation 2, Pr[T_i | word_w], is estimated using sampling. By collecting and analyzing the set of threads containing word_w, the associated topic probabilities may be found. To do this, the set S = {Thread_j : word_w ∈ words(Thread_j)} is first created, containing every thread in the training set whose graph document contains word_w. Next, a sampled value for the probability of every topic is found. This is done by repeatedly choosing a random topic, weighted by the inferred topic distribution, for each thread in S. The final sampled probability for each topic is calculated to be its average probability of being chosen during this sampling process. Pseudocode for finding this prior is shown in Figure 5. By combining the three previously found values according to Equation 2, we can calculate the probability, given the thread's topic distribution, of any graph word appearing. These probabilities therefore allow us to model the connection
Fig. 7.
Algorithm: GeneratePredictions
Require: Text for thread test, training threads dataset, num topics t, input graph structure I, num sampling iterations sample, subgraph length l.
Ensure: Sorted array of link predictions, predict[].
  tProb[] ← []
  stem and remove stopwords in test
  use LDA to find topic distribution on test; store tProb[topic]
  P[word][topic] ← GC-Model(dataset, t, I, sample, l)
  predict[] ← []
  for w = 1 to P.length do
    predict[w] += P[w][topic] · tProb[topic]
  return predict

between a thread's graph structure and its communication text content, and can be used on new test threads to predict which graph words and communication structures are most likely to appear in that graph. Given a new thread's text content, by applying LDA inference on its text document, the list of topic probabilities for this test thread may be found. Using Equation 1 to combine this topic distribution with the word probabilities calculated in Equation 2, the probability of any graph word appearing in this test thread may be found. The graph words with the highest predicted probabilities are GC-Model's top predictions for the future graph structure of this test thread. The pseudocode for this overall prediction algorithm is shown in Figure 7.

A. Thread Probability

It is possible to model the likelihood of a thread, Pr[Thread_j], from Equations 3 and 4 using the assumption that every training thread is equally likely, obtaining a constant value of Pr[Thread_j] = 1/n. However, in actuality, the probability of a past thread's structure appearing again is often related to how recently it occurred. In this way, for example, people messaged recently by a user are more likely to be contacted again in the near future than those not contacted for several years. To account for this, GC-Model fits a simple relation between probability and time.
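One way to realize such a recency relation is a least-squares fit of Pr ≈ a · exp(b · Δt) on the logarithm of the overlap probabilities. The closed-form log-linear fit below is a minimal sketch and an assumption standing in for whatever fitting routine the authors used; the data points are invented for illustration.

```python
import math

def fit_exp_curve(pairs):
    # pairs: [(time_gap, overlap_probability > 0), ...]
    # Fit prob ≈ a * exp(b * dt) by ordinary least squares on log(prob).
    xs = [dt for dt, _ in pairs]
    ys = [math.log(p) for _, p in pairs]
    n = len(pairs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = math.exp(ybar - b * xbar)
    return lambda dt: a * math.exp(b * dt)

# Toy series where the reoccurrence probability halves per unit of time gap:
curve = fit_exp_curve([(1, 0.8), (2, 0.4), (3, 0.2)])
```

Evaluating the returned curve at a given time gap then plays the role of Pr[Thread_j] in Equations 3 and 4.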
The probability of each thread is approximated by the likelihood of its graph links reoccurring in another thread. Between every pair of training threads, their distance in time (the difference between the timestamps of their first messages) and their percentage of link overlap are calculated. These probability / time pairs are then sorted by their timestamp distances and binned into equi-depth bins, averaging the values in each bin. Finally, a simple exponential curve is fit to this series, allowing for the
connection between thread time and probability. Pseudocode for this algorithm is shown in Figure 6.

Fig. 8. Varying graph word length (l = 1, 2, 3). The parameter affects scores only slightly.

Fig. 9. Varying number of LDA topics (1 to 12 topics). Increasing topics raises the F1 score until a steady state is reached, where further adjustment has little effect.

IV. EXPERIMENTS

In this section, we report our results from implementing and testing the GC-Model method introduced above. To confirm the effectiveness of our modeling and prediction strategy, GC-Model was implemented in Java and tested on multiple real-world communication networks. We compare our performance to the recently introduced link prediction method PropFlow [7], as well as Common Neighbors, Preferential Attachment, Jaccard's coefficient, Adamic/Adar, and both a graph-based and a text-based naive method. For all approaches, the training graph dataset was created by collecting only those threads which matched the given starting input. The graph-based naive method used here predicts links based on frequency, where the highest scoring predictions are the links which appear most frequently. The text-based naive method predicts links based on text distances, where the top scoring predictions are the links which contain communication text (stemmed, with stopwords removed) with closest TF-IDF cosine similarity to the test threads.

A. Networks

Three separate communication graphs are used in this paper. The first is a crawled Twitter dataset, consisting of ~1M nodes (unique Twitter users) and ~2M edges (tweets). The total graph dataset is grouped into threads using each tweet's exact reply-to and retweet information. Keeping only those threads containing 3 or more links, a total of ~6k threads were obtained from this network. The second dataset is the Enron email dataset, consisting of over 5k links (emails) between 37k nodes (unique email addresses).
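The text-based naive baseline above can be sketched as follows; the tokenized input and the particular TF-IDF weighting (raw term frequency times log inverse document frequency) are illustrative assumptions rather than the paper's exact recipe.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists; returns one sparse TF-IDF dict per document.
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequencies
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["tax", "info"], ["tax", "sent"], ["meeting", "tomorrow"]]
vecs = tfidf_vectors(docs)
```

Scoring each candidate link by the cosine similarity between its past message text and the test thread's text, then ranking, gives the text-based naive predictor.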
To separate the Enron graph into approximate threads, the subject lines and timestamps were used. Emails sent close in time (within one month of each other) were grouped according to exact subject line matches, ignoring prepended "Re" and "Fwd" text. The ten largest threads were removed, due to the general nature of their subject lines, as well as broadcast emails sent to greater than 2 people. Only threads containing at least 3 links and unique users were kept, leaving a total of ~5k threads. The last dataset is the SNAP Twitter dataset [15], consisting of ~17M nodes and ~476M edges. Approximate threads were formed through the retweet text information contained in the links. Threads were created by doing exact text matches between the tweet text, the RT retweet information, and the user ids contained in the messages, resulting in ~4k threads.

B. Parameters and Setup

In our implementation and experiments on GC-Model, the graph word length parameter l was found to influence results only slightly and was set to 2 for all tests reported here, unless otherwise noted. An example of the results obtained varying l from 1 to 3 on the Enron dataset is shown in Figure 8, which plots the F1 score, defined as:

F1 Score = 2 · precision · recall / (precision + recall)

As can be seen, the graphs of the resulting scores are mostly clustered and have similar scores. Comparable results were found in the other networks. In addition, a sampling size of 1 to find Pr[T_i | word_w], as discussed in Section III, was discovered to give consistent estimates, and is the value used in this paper. Finally, a size of 8 LDA topics was found to result in high quality results for all tested communication graphs, and is the value used for the displayed results here. Figure 9 shows the results of varying the number of topics from 1 to 12 for the Enron dataset.
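The Enron thread-grouping heuristic just described can be sketched as below: subjects are normalized by stripping any run of leading "Re"/"Fwd" prefixes and then grouped on exact match. The regular expression and the omission of the one-month time window are simplifying assumptions, and all names are illustrative.

```python
import re
from collections import defaultdict

def normalize_subject(subject):
    # Strip repeated leading "Re:", "Fw:", or "Fwd:" prefixes, then
    # lowercase so that exact matching is case-insensitive.
    return re.sub(r"^\s*((re|fwd?)\s*:\s*)+", "", subject,
                  flags=re.I).strip().lower()

def group_threads(messages):
    # messages: list of (subject, sender, recipient) tuples.
    threads = defaultdict(list)
    for subj, src, dst in messages:
        threads[normalize_subject(subj)].append((src, dst))
    return dict(threads)

msgs = [
    ("Tax Info", "A", "C"),
    ("Re: Tax Info", "C", "A"),
    ("FWD: Re: tax info", "A", "B"),
]
threads = group_threads(msgs)  # one approximate thread with three links
```

A fuller version would additionally split a subject group whenever consecutive messages fall more than one month apart, per the windowing rule above.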
As can be seen from the figure, result scores increase steadily as the number of topics is increased, until reaching a stable state where further increases in topic numbers affect results only slightly. Similar results were found for the other datasets. To validate the effectiveness of our technique at building models of communication graphs, for each network a series of models were built and evaluated. This was done twice, and the results reported in Figures 10 and 11. For the results shown in Figure 10, the starting information given on graph structure was the beginning user for the thread. In Figure 11, the starting information was the beginning graph link. For both these types, the starting graph information was obtained by pulling the top 2% of starting users or graph links with the highest frequencies in the dataset. This was done to allow for
Fig. 10. Results for predicting directed links given starting user, on (a) the Enron dataset, (b) the crawled Twitter dataset, and (c) the SNAP Twitter dataset. GC-Model consistently returns higher scores than the other previously introduced algorithms.

Fig. 11. Results for predicting directed links given starting link, on (a) the Enron dataset, (b) the crawled Twitter dataset, and (c) the SNAP Twitter dataset. Again, GC-Model's results dominate the other methods, especially among the top predictions.

sufficient dataset sizes in training and testing. For each model, the threads matching the given starting information were found and sorted according to the timestamp of the first message sent, and the ten latest threads were separated to form the testing set. The remaining graph threads were used for training and building the model. After building the models, the most likely predicted graph links were found for each test thread using the method shown in Section III. To evaluate these predictions, the F1 score was again used, as it allows for the combination of both recall and precision, favoring results with a good balance of both. The F1 score was calculated repeatedly for every test thread and model, using a series of top-K predictions, with K ranging from 1 to 20. For each value of K, the recall and precision of the top-K prediction set was found against the actual set of directed links, and used to find its F1 score. To allow for fair prediction and scoring, only those links in the testing set occurring at least once in the training set were used in these result calculations. Finally, the series of F1 scores found across the test threads for each model were averaged to obtain the final results.

C. Prediction Results

The results of applying GC-Model, as well as the other link prediction and naive graph and text methods, can be seen in Figures 10 and 11.
Figure 10 shows the plot of the F1 score versus top-K value, on each network, for the various methods. As can be seen, GC-Model consistently outperforms the other methods, reporting higher results consistently across all the graphs. GC-Model also performs especially well on those earliest predictions assigned the highest probability, obtaining F1 scores of up to 45% higher than the next closest method. The Graph and Text Naive methods, as well as the recently introduced PropFlow link prediction method, score consistently lower on the highest predictions, but report similar scores eventually as the K value is increased. This helps to confirm the effectiveness of GC-Model's method in the modeling and use of both text and graph structure information for prediction, as GC-Model needs fewer predictions to cover more of the test threads. The much lower prediction results of the remaining previously introduced link prediction methods emphasize their use for the prediction of new, rather than repeating, links. These trends are echoed in Figure 11, where the starting link, rather than the starting user, is given. GC-Model again returns much stronger F1 scores, dominating the results across the graphs. The addition of the extra starting link information, used to prune the training sets for all the reported methods, can be seen to increase the strength of the models and predictions, as well. In addition, Tables I and II report the percentage of directed links in the test graphs covered by the top-K predictions. Only the graph and text naive methods and the PropFlow method are used for comparison here, as the other link prediction methods generally had lower values. As can be seen from Table I, taking just the top 20 GC-Model link predictions covers over 70% of the actual thread structure in the Enron dataset, and the top 50 cover over 80%.
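The top-K F1 evaluation described above can be sketched as follows; the representation of links as ordered pairs and all names are illustrative assumptions.

```python
def top_k_f1(ranked_links, actual_links, k):
    # F1 of the top-k predicted directed links against the thread's
    # actual set of directed links.
    topk = set(ranked_links[:k])
    hits = len(topk & actual_links)
    if hits == 0:
        return 0.0
    precision = hits / len(topk)
    recall = hits / len(actual_links)
    return 2 * precision * recall / (precision + recall)

ranked = [("A", "B"), ("A", "C"), ("B", "C")]   # predictions, best first
actual = {("A", "B"), ("B", "D")}               # links in the test thread
score = top_k_f1(ranked, actual, k=2)           # precision 1/2, recall 1/2
```

Averaging this score over all test threads for each K from 1 to 20 reproduces the kind of F1-versus-K curves plotted in the results figures.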
GC-Model's strong results again emphasize its ability to build communication thread models that both work effectively and are useful for information dissemination prediction, and confirm the usefulness of combining content and graph structure information.
TABLE I
GIVEN STARTING USER, PERCENT OF GRAPH COVERED BY PREDICTIONS

Enron:            GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5             43%        38%          37%         23%
  Top 10            62%        54%          53%         35%
  Top 20            73%        66%          67%         47%
  Top 50            86%        85%          85%         55%
Crawled Twitter:  GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5             54%        38%          35%         33%
  Top 10            63%        46%          43%         38%
  Top 20            74%        55%          53%         41%
  Top 50            89%        65%          68%         45%
SNAP Twitter:     GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5             74%        41%          18%         18%
  Top 10            81%        45%          20%         20%
  Top 20            86%        48%          22%         22%
  Top 50            93%        53%          25%         24%

TABLE II
GIVEN STARTING LINK, PERCENT OF GRAPH COVERED BY PREDICTIONS

Enron:            GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5             62%        53%          45%         49%
  Top 10            78%        65%          56%         51%
  Top 20            87%        71%          71%         59%
  Top 50            91%        83%          80%         64%
Crawled Twitter:  GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5             57%        40%          37%         34%
  Top 10            66%        47%          45%         39%
  Top 20            75%        55%          54%         43%
  Top 50            89%        66%          68%         47%
SNAP Twitter:     GC-Model  Graph Naive  Text Naive  PropFlow
  Top 5             77%        45%          18%         19%
  Top 10            83%        49%          21%         21%
  Top 20            89%        53%          23%         22%
  Top 50            96%        60%          27%         26%

V. CONCLUSION

The modeling and prediction of graph structure in graph datasets has been both an interesting and important area of research in recent years. In particular, with the growth of social networks and services such as Twitter, communication graphs have been an especially fast growing source of data and information. While numerous link prediction methods have been introduced, most focusing on predicting new link formation using only graph structure, an important insight is that communication graphs have additional text and information content associated with them. In this paper, we introduced a new method, GC-Model, which utilizes the extra content information included in communication graphs to build graph models that may be used to predict future network structure and information flow.
When implemented and tested on three real world communication networks, GC-Model predicted with far greater accuracy and coverage than multiple previously introduced link prediction and naive methods. GC-Model's returned scores dominate the comparison sets, with its top 10 predictions covering on average up to 22% more of the actual communication graph structure than the compared methods. These results emphasize the importance and strength of including both information content and graph structure in the modeling of communication networks, as well as the effectiveness of GC-Model's modeling scheme for the prediction of future information flow structure.

ACKNOWLEDGMENT

Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

REFERENCES

[1] K. Macropol and A. Singh, "Scalable discovery of best clusters on large graphs," Proc. VLDB Endow., vol. 3, September 2010.
[2] D. Liben-Nowell and J. Kleinberg, "The link-prediction problem for social networks," J. of ASIST, vol. 58, 2007.
[3] Z. Huang and D. Zeng, "A link prediction approach to anomalous email detection," in SMC, 2006.
[4] M. E. J. Newman, "Clustering and preferential attachment in growing networks," Phys. Rev. E, vol. 64, 2001.
[5] A. L. Barabasi et al., "Evolution of the social network of scientific collaborations," Physica A: Stat. Mech. and its Applications.
[6] L. Adamic, "Friends and neighbors on the Web," Social Networks, vol. 25, no. 3, Jul. 2003.
[7] R. N. Lichtenwalter, J. T. Lussier, and N. V.
Chawla, "New perspectives and methods in link prediction," in KDD, 2010.
[8] E. Zheleva, L. Getoor, J. Golbeck, and U. Kuter, "Using friendship ties and family circles for link prediction," in SNAKDD, 2010.
[9] M. A. Hasan, V. Chaoji, S. Salem, and M. Zaki, "Link prediction using supervised learning," in SDM '06 workshop on LACS, 2006.
[10] L. Backstrom and J. Leskovec, "Supervised random walks: predicting and recommending links in social networks," in Proc. of the WSDM. ACM, 2011.
[11] J. Chang and D. M. Blei, "Hierarchical relational models for document networks," Annals of Applied Statistics, Oct. 2010.
[12] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, March 2003.
[13] C. Ozcaglar, "Classification of Email Messages Into Topics Using Latent Dirichlet Allocation," Master's thesis, Rensselaer Polytechnic Institute, Troy, New York, 2008.
[14] L. Hong and B. D. Davison, "Empirical study of topic modeling in twitter," in SOMA, 2010.
[15] J. Yang and J. Leskovec, "Patterns of temporal variation in online media," in Proc. of the WSDM. ACM, 2011.