Link Prediction for Social Network

Ning Lin
Computer Science and Engineering
University of California, San Diego

Abstract

Friendship recommendation has become an important issue for social networks in the digital era. In this paper, I cast the friendship recommendation problem as a link prediction problem on a graph. Using supervised learning methods, I select several useful features and classification models to achieve accurate friendship recommendation. Experimental results show that the approach achieves high accuracy, precision, recall, and F-measure.

Keywords: link prediction, social network, Support Vector Machine, Logistic Regression.

I. INTRODUCTION AND LITERATURE

Social networks have gained tremendous popularity in recent years, and people communicate through them more and more frequently. For anyone who manages a social network, it is therefore an important task to predict correctly, through a program, the relationship or proximity of arbitrary pairs of users, so that the service can provide a better user experience. In this paper, I build a reliable friendship recommendation system.

There are two types of datasets in friendship prediction: homogeneous and heterogeneous networks. Homogeneous networks carry only topological information, whereas heterogeneous networks come with extra side information. For example, we might have features describing the friendship connection, such as how many posts the two users are tagged in together and how frequently the two users chat with each other.

According to [1], there are three challenges in friendship prediction: (1) in a heterogeneous network, it is not obvious how best to combine the topology and the side information; (2) link prediction datasets are extremely imbalanced, that is, the number of edges known to be present is often far smaller than the number of edges known to be absent; and (3) the large scale of social networks leads to computational inefficiency.
There are many methods for link prediction: (1) methods based on node neighborhoods, (2) methods based on the ensemble of all paths, and (3) higher-level approaches such as low-rank approximation, unseen bigrams, and clustering. In this paper I address link prediction using the dataset of [2], a homogeneous network collected from Facebook with incomplete temporal information. This dataset is often used in link prediction research. State-of-the-art methods for this problem include matrix factorization [1] and tensor factorization [3]. However, scalability is a major concern for both feature-based methods (discussed in this assignment) and kernel-based methods on most real-life social networks. Cold start is also a problem. Much of the research has moved on to heterogeneous networks, whose additional user information helps address these problems.

The rest of this paper is organized as follows. Section II describes the dataset. Section III details the selection of features and models. Section IV compares the performance of the different models and discusses the results. Finally, Section V concludes the paper.

II. DATASET

The dataset is a temporal undirected social network that contains friendship data of Facebook users. The network has [...] nodes (i.e., users) and [...] edges (i.e., links). On average, every user has [...] friendship connections. The user with the most friends has 1098 connections. Figure 1 shows the degree distribution of the graph.

Fig. 1. Degree Distributions

The temporal information is the UNIX timestamp of link establishment. However, it is incomplete: over half of the edges (481327) lack temporal information, and their timestamps are treated as zero. The rest of the edges were established between [...] and [...]. Figure 2 shows the temporal distribution of the graph.
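As an illustration of the statistics above, the following minimal Python sketch parses an edge list and computes the average degree, the maximum degree, and a degree histogram. The file layout (one `u v timestamp` record per line, with `\N` marking a missing timestamp) and the names `parse_edges` and `degree_stats` are assumptions made for this example, not details given in the paper.

```python
from collections import Counter, defaultdict


def parse_edges(lines, missing_token="\\N"):
    """Parse 'u v timestamp' lines; a missing timestamp is stored as 0.

    The line format and the missing-value token are assumptions.
    """
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        u, v = int(parts[0]), int(parts[1])
        has_time = len(parts) > 2 and parts[2] != missing_token
        edges.append((u, v, int(parts[2]) if has_time else 0))
    return edges


def degree_stats(edges):
    """Return (average degree, maximum degree, degree histogram)."""
    neighbors = defaultdict(set)
    for u, v, _ in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degrees = [len(nbrs) for nbrs in neighbors.values()]
    histogram = Counter(degrees)  # degree -> number of users with that degree
    return sum(degrees) / len(degrees), max(degrees), histogram


# Toy input: two edges carry a timestamp, two do not.
sample = ["1 2 1200000000", "1 3 \\N", "2 3 1210000000", "3 4"]
edges = parse_edges(sample)
avg_deg, max_deg, hist = degree_stats(edges)
```

Storing neighbors as sets keeps the undirected graph free of duplicate links, so each degree counts distinct friends.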
To conduct experiments, I sort the edges by timestamp and remove the last 1% of the edges from the graph as the testing set; the remaining edges form the training set. The table below gives an overview of the dataset.
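The split described above can be sketched as follows. This is a minimal illustration, assuming edges are stored as `(u, v, timestamp)` tuples with missing timestamps already set to zero; `temporal_split` and its `test_fraction` parameter are hypothetical names introduced for the example.

```python
def temporal_split(edges, test_fraction=0.01):
    """Split (u, v, timestamp) edges into training and testing sets by time.

    Edges with a zero (missing) timestamp sort first, so they always
    end up in the training set.
    """
    ordered = sorted(edges, key=lambda e: e[2])
    n_test = max(1, int(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


# Toy data: 200 edges, the first 120 without a timestamp (stored as 0).
toy_edges = [(i, i + 1, 0 if i < 120 else 1_000_000 + i) for i in range(200)]
train, test = temporal_split(toy_edges)
```

Because the sort is stable and untimed edges carry a zero, only the most recently established links can land in the testing set.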

Fig. 2. Temporal Distribution

                  Number of Edges   Number of Nodes
whole data set
training data
testing data

Since the cold start problem is an important issue for recommendation systems, I count how many users in the testing data are new, that is, unseen in the training data. This measures how serious the cold start problem is. The result shows that 481 users are new to the training data.

III. PREDICTIVE TASK AND MODEL SELECTION

As shown in Figure 3, we are given a partial undirected graph $G_p = (V, E)$ and two nodes $u, v \in G_p$ that do not have a link in $G_p$. The task is to predict whether there exists a link between them in the original graph $G$.

Fig. 3. Model of Link Prediction

This problem can be formulated as a binary classification problem. For any given pair of nodes, we classify whether there should be an edge or not, using attributes obtained from the graph. The goal is to find the best method and features for training the classifier. Limited memory is also a significant issue; I apply edge sampling to address it.

A. Feature Extraction

1) Common Neighbors: Common neighbors may be the most direct and intuitive feature for link prediction. The more common friends two nodes share, the higher the probability that the two nodes will be connected in the future. I count the number of neighbors that two nodes, $x$ and $y$, have in common and denote it

$$|\Gamma(x) \cap \Gamma(y)| \tag{1}$$

where $\Gamma(x)$ is the set of neighbors of $x$.

2) Jaccard's Coefficient: Building on the idea of common neighbors, a link between two nodes $x$ and $y$ is more likely if they share more common neighbors while each has fewer neighbors overall. Here, I select Jaccard's coefficient, also known as the Jaccard index, as an indicator to predict a link. Originally, the Jaccard index is used for comparing the similarity and diversity of two sets $A$ and $B$, and is defined as the size of the intersection divided by the size of the union of the two sets:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}. \tag{2}$$

In [4], it is modified as:

$$J(x, y) = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}. \tag{3}$$

3) Adamic and Adar: Counting common neighbors alone is not enough, since we should not weight equally every link between node $x$ or $y$ and a common neighbor $z$. A common neighbor is more representative and significant if it has fewer friends. Based on this idea, I choose Adamic/Adar as a means to weight common neighbors who have fewer friends more heavily and to discount common neighbors who are gregarious. Adamic and Adar define the similarity between two nodes as

$$\sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log |\Gamma(z)|} \tag{4}$$

where $z$ is a common neighbor of $x$ and $y$.

4) Preferential Attachment: One is more likely to gain new friends if one already has more friends, that is, "the rich get richer." Similarly, we can assume that a new link is more likely to be established if its two endpoints have more neighbors. Therefore, I choose preferential attachment as one of the features and denote it

$$|\Gamma(x)| \cdot |\Gamma(y)| \tag{5}$$

The basic premise of preferential attachment is that the probability that a new link has node $x$ as an endpoint is proportional to the current number of neighbors of $x$.

5) Simplified Katz: In addition to common neighbors, the most direct method of link prediction, counting the number and length of paths between two nodes may also be helpful. Consequently, I choose Katz as the basis for this feature. Katz defined a measure that directly sums over this collection of paths, exponentially damped by length to count short paths more heavily. This notion leads to the measure

$$\sum_{l=1}^{\infty} \beta^{l} \cdot |\mathrm{paths}^{\langle l \rangle}_{x,y}| \tag{6}$$

where $\mathrm{paths}^{\langle l \rangle}_{x,y}$ is the set of all length-$l$ paths from $x$ to $y$, and $\beta > 0$ is a parameter of the predictor. However, since the data set is too large, it is infeasible to sum over path lengths $l$ from 1 to infinity. I simplify Katz by summing $l$ only from 1 to 3. That is, in addition to common neighbors, which correspond to paths of length 2, I also consider paths of length up to 3. The experimental results show that this is enough for predicting new links. Note that a very small $\beta$ yields predictions much like common neighbors, since paths of length 3 or more contribute very little to the summation. I empirically set $\beta$ to [...].

B. Training Set

1) Positive Training Data Sampling: To handle the large scale of the social network and to avoid overfitting during training, I sample a subset of the graph's edges as positive training data. I empirically set the number of sampled edges to [...]. There is no need to increase the amount of training data, since testing accuracy does not improve with it while training time grows.

2) Negative Training Data Sampling: This social graph is relatively sparse. To prevent imbalance in the dataset, I randomly select two different nodes and check whether they are adjacent. If the two nodes are already neighbors, the pair is labeled as a positive edge. If not, the pair is labeled as a negative one and added to the negative training data set.

C. Testing Set

Here I remove the last 1% of the edges in terms of time from the original graph, to model the situation in which people are friends in the real world but not yet in the social network. In addition, I sample as many negative testing examples as there are positive ones to emulate real prediction conditions.

D. Baseline Model

The baseline model in this prediction task is Naive Bayesian, a simple probabilistic model. Naive Bayesian applies Bayes' theorem under the assumption that each attribute is conditionally independent of the others given the class. Two major advantages of Naive Bayesian are that (1) it takes little time to build the model, and (2) it requires only a small amount of training data to estimate the necessary parameters, such as the means and variances of the model.

E. Model

1) Support Vector Machine: The support vector machine is one of the most commonly used classifiers today. It can be applied to a wide variety of data sets and is very reliable. Given a set of n-dimensional training points, each labeled with one of two classes, a support vector machine finds an (n-1)-dimensional hyperplane that separates the two classes with the largest margin. However, data sets are sometimes not linearly separable in the original feature space. We then need to map the data from the original feature space into a higher-dimensional space with the help of a transform function. Kernel functions are usually used for this purpose. They share the property that dot products in the higher-dimensional space can be computed in the original space, which eases the computational burden. I choose the radial basis function as the kernel, which empirically gives better results.

2) Logistic Regression: Logistic regression is a classification model derived from a probabilistic perspective. It can perform both binary and multi-class classification by modeling the probability of each class. The probability of each class is modeled by applying the logistic (sigmoid) function to a linear combination of the features; equivalently, the logit of the probability is linear in the features. Thus, logistic regression is a generalized linear model. Generally, logistic regression is trained with gradient descent in a supervised fashion. In my link prediction experiment, a binary outcome (i.e., whether a link exists or not) is classified, so a binary logistic regression model is trained. Because the features I use are highly correlated, I adopt an L2 regularizer to avoid overfitting.

IV. EXPERIMENTAL RESULTS

A. SVM

Below is the result of SVM with different values of the penalty parameter C of the error term.

           C=     C=     C=     C=     C=
Accuracy

Finally, I select C=1.7 for the model.

B. Logistic Regression

Below is the table for different values of C, the inverse of the regularization strength, for the logistic regression model.

           C=     C=     C=     C=     C=
Accuracy

Finally, I select C=0.05 for the model.

C. Comparison between Naive Bayesian, SVM and Logistic Regression

The table below shows the results of the three models using their best parameters.
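The models are compared on accuracy, precision, recall, and F1 score, the four metrics named in the abstract. As a reminder of how these are derived from the confusion counts, here is a minimal Python sketch. It assumes labels are encoded as 1 (link) and 0 (no link); the function name `classification_metrics` and the toy labels are hypothetical and do not reproduce any result from the paper.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = link)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # links predicted as links
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-links predicted as non-links
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-links predicted as links
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # links that were missed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy balanced test set: four true links and four sampled non-links.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

On a balanced testing set such as the one described in Section III-C, accuracy is informative on its own; precision and recall additionally show whether errors come from spurious or from missed links.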

            Naive Bayesian   SVM   Logistic Regression
Accuracy
Precision
Recall
F1 Score

As expected, the two proposed models outperform the baseline model. This may be due to the dependency between features, which conflicts with the independence assumption of the Naive Bayesian model. With a non-linear kernel, SVM does a good job of separating the two classes. Logistic Regression performs better than the other two models.

D. Effects of Unseen Nodes

To see to what extent new nodes affect the results, I remove the unseen nodes from the testing set. The table below shows the results of the three models using their best parameters.

            Naive Bayesian   SVM   Logistic Regression
Accuracy
Precision
Recall
F1 Score

We can observe that both the Naive Bayesian and SVM models benefit considerably from removing new nodes on every metric, while the Logistic Regression model loses points. We can further see that there is no significant difference between the performance of the SVM and Logistic Regression models. From these facts, we can infer that (1) the Logistic Regression model is good at predicting links that involve new users, and (2) both the SVM and Logistic Regression models are roughly equally good at predicting links between two known users.

E. Importance of Each Feature

1) Whole Testing Set: To see the importance of every feature, I remove each feature in turn. The results of removing each feature using the Naive Bayesian model are shown in the table below.

CommonNbor   Jaccard   AdamAdar   preferattach   Shortest Path   Simplified Katz   All features

The results of removing each feature using the SVM model are shown in the table below.

CommonNbor   Jaccard   AdamAdar   preferattach   Shortest Path   Simplified Katz   All features

The results of removing each feature using the Logistic Regression model are shown in the table below.
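Each ablation run drops one entry of the feature vector. To make that vector concrete, the following minimal Python sketch computes the five features defined in Section III-A for a node pair and removes one of them by name. Several points are assumptions rather than details from the paper: paths are counted as walks (nodes may repeat), which is what powers of the adjacency matrix count; `beta=0.05` is an arbitrary placeholder because the value used in the experiments is not given here; and all function names are hypothetical. The Shortest Path feature listed in the tables is not defined in the text and is therefore omitted.

```python
import math
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def count_walks(adj, x, y, max_len):
    """Return [number of walks of length l from x to y for l = 1..max_len]."""
    counts = []
    frontier = {x: 1}  # node -> number of walks from x that end at that node
    for _ in range(max_len):
        nxt = defaultdict(int)
        for node, ways in frontier.items():
            for nb in adj.get(node, ()):
                nxt[nb] += ways
        frontier = nxt
        counts.append(frontier.get(y, 0))
    return counts


def pair_features(adj, x, y, beta=0.05, max_len=3):
    """Features (1), (3), (4), (5) and the truncated Katz sum (6) for a pair."""
    nx, ny = adj.get(x, set()), adj.get(y, set())
    common, union = nx & ny, nx | ny
    walks = count_walks(adj, x, y, max_len)
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        # For x != y a common neighbor has degree >= 2, so the log is positive.
        "adamic_adar": sum(1.0 / math.log(len(adj[z])) for z in common),
        "preferential_attachment": len(nx) * len(ny),
        "simplified_katz": sum(beta ** l * c for l, c in enumerate(walks, start=1)),
    }


def drop_feature(features, name):
    """Return a copy of the feature dict without the named feature."""
    return {k: v for k, v in features.items() if k != name}


# Toy graph; nodes 1 and 4 are not adjacent and share neighbors 2 and 3.
adj = build_adjacency([(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (2, 3)])
feats = pair_features(adj, 1, 4)
ablated = drop_feature(feats, "preferential_attachment")
```

With a small `beta`, the truncated Katz score is dominated by the length-2 term, which equals `beta ** 2` times the number of common neighbors, matching the remark in Section III-A.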
CommonNbor   Jaccard   AdamAdar   preferattach   Shortest Path   Simplified Katz   All features

In general, for the Naive Bayesian and SVM models, Preferential Attachment is the most important feature, while Simplified Katz is the least important and contributes little. Without Preferential Attachment, the baseline model cannot even outperform random guessing. For the Logistic Regression model, Adamic and Adar is the most important feature, while Preferential Attachment seems to have a negative effect on prediction.

2) Remove New Nodes from Testing Set: The results of removing each feature using the Naive Bayesian model are shown in the table below.

CommonNbor   Jaccard   AdamAdar   preferattach   Shortest Path   Simplified Katz   All features

The results of removing each feature using the SVM model are shown in the table below.

CommonNbor   Jaccard   AdamAdar   preferattach   Shortest Path   Simplified Katz   All features

The results of removing each feature using the Logistic Regression model are shown in the table below.

CommonNbor   Jaccard   AdamAdar   preferattach   Shortest Path   Simplified Katz   All features

We can observe that when only seen users are considered, Simplified Katz becomes a useful feature for the SVM model, while Preferential Attachment still does not help the Logistic Regression model. We can conclude that a feature that is important to one model may not be important to another, and can even be unhelpful.

V. CONCLUSION

In this paper, I cast the friendship recommendation problem as a link prediction problem. On the basis of supervised learning methods, I select useful features and models to attain accurate prediction. I choose common neighbors, Jaccard's coefficient, Adamic and Adar, Preferential Attachment, and Simplified Katz as features, and Support Vector Machine and Logistic Regression as models. The performance of the models is evaluated by accuracy, precision, recall, and F-measure, since I am concerned with the correctness of predicting links that truly exist. Generally speaking, Logistic Regression achieves the highest accuracy and F1 score. Therefore, I believe that the model built with Logistic Regression is the best one.

REFERENCES

[1] Menon, A. K., and Elkan, C. Link prediction via matrix factorization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer Berlin Heidelberg.
[2] Viswanath, B., Mislove, A., Cha, M., and Gummadi, K. P. (2009, August). On the evolution of user interaction in Facebook. In Proceedings of the 2nd ACM Workshop on Online Social Networks. ACM.
[3] Dunlavy, D. M., Kolda, T. G., and Acar, E. (2011). Temporal link prediction using matrix and tensor factorizations. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(2), 10.
[4] Liben-Nowell, D., and Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the Association for Information Science and Technology, 58(7).

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks Archana Sulebele, Usha Prabhu, William Yang (Group 29) Keywords: Link Prediction, Review Networks, Adamic/Adar,

More information

Online Social Networks and Media

Online Social Networks and Media Online Social Networks and Media Absorbing Random Walks Link Prediction Why does the Power Method work? If a matrix R is real and symmetric, it has real eigenvalues and eigenvectors: λ, w, λ 2, w 2,, (λ

More information

The link prediction problem for social networks

The link prediction problem for social networks The link prediction problem for social networks Alexandra Chouldechova STATS 319, February 1, 2011 Motivation Recommending new friends in in online social networks. Suggesting interactions between the

More information

Topic mash II: assortativity, resilience, link prediction CS224W

Topic mash II: assortativity, resilience, link prediction CS224W Topic mash II: assortativity, resilience, link prediction CS224W Outline Node vs. edge percolation Resilience of randomly vs. preferentially grown networks Resilience in real-world networks network resilience

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Supervised Learning Classification Algorithms Comparison

Supervised Learning Classification Algorithms Comparison Supervised Learning Classification Algorithms Comparison Aditya Singh Rathore B.Tech, J.K. Lakshmipat University -------------------------------------------------------------***---------------------------------------------------------

More information

Introduction to Automated Text Analysis. bit.ly/poir599

Introduction to Automated Text Analysis. bit.ly/poir599 Introduction to Automated Text Analysis Pablo Barberá School of International Relations University of Southern California pablobarbera.com Lecture materials: bit.ly/poir599 Today 1. Solutions for last

More information

CS229 Final Project: Predicting Expected Response Times

CS229 Final Project: Predicting Expected  Response Times CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time

More information

Data mining with Support Vector Machine

Data mining with Support Vector Machine Data mining with Support Vector Machine Ms. Arti Patle IES, IPS Academy Indore (M.P.) artipatle@gmail.com Mr. Deepak Singh Chouhan IES, IPS Academy Indore (M.P.) deepak.schouhan@yahoo.com Abstract: Machine

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Link Prediction and Anomoly Detection

Link Prediction and Anomoly Detection Graphs and Networks Lecture 23 Link Prediction and Anomoly Detection Daniel A. Spielman November 19, 2013 23.1 Disclaimer These notes are not necessarily an accurate representation of what happened in

More information

Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models

Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models DB Tsai Steven Hillion Outline Introduction Linear / Nonlinear Classification Feature Engineering - Polynomial Expansion Big-data

More information

Performance Evaluation of Various Classification Algorithms

Performance Evaluation of Various Classification Algorithms Performance Evaluation of Various Classification Algorithms Shafali Deora Amritsar College of Engineering & Technology, Punjab Technical University -----------------------------------------------------------***----------------------------------------------------------

More information

Application of Support Vector Machine Algorithm in Spam Filtering

Application of Support Vector Machine Algorithm in  Spam Filtering Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification

More information

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

The Comparative Study of Machine Learning Algorithms in Text Data Classification* The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification

More information

Link Sign Prediction and Ranking in Signed Directed Social Networks

Link Sign Prediction and Ranking in Signed Directed Social Networks Noname manuscript No. (will be inserted by the editor) Link Sign Prediction and Ranking in Signed Directed Social Networks Dongjin Song David A. Meyer Received: date / Accepted: date Abstract Signed directed

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information

Naïve Bayes for text classification

Naïve Bayes for text classification Road Map Basic concepts Decision tree induction Evaluation of classifiers Rule induction Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Support

More information

Supervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty)

Supervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty) Supervised Learning (contd) Linear Separation Mausam (based on slides by UW-AI faculty) Images as Vectors Binary handwritten characters Treat an image as a highdimensional vector (e.g., by reading pixel

More information

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers James P. Biagioni Piotr M. Szczurek Peter C. Nelson, Ph.D. Abolfazl Mohammadian, Ph.D. Agenda Background

More information

Using network evolution theory and singular value decomposition method to improve accuracy of link prediction in social networks

Using network evolution theory and singular value decomposition method to improve accuracy of link prediction in social networks Proceedings of the Tenth Australasian Data Mining Conference (AusDM 2012), Sydney, Australia Using network evolution theory and singular value decomposition method to improve accuracy of link prediction

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

Accelerometer Gesture Recognition

Accelerometer Gesture Recognition Accelerometer Gesture Recognition Michael Xie xie@cs.stanford.edu David Pan napdivad@stanford.edu December 12, 2014 Abstract Our goal is to make gesture-based input for smartphones and smartwatches accurate

More information

Supervised Link Prediction with Path Scores

Supervised Link Prediction with Path Scores Supervised Link Prediction with Path Scores Wanzi Zhou Stanford University wanziz@stanford.edu Yangxin Zhong Stanford University yangxin@stanford.edu Yang Yuan Stanford University yyuan16@stanford.edu

More information

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 1. Introduction Reddit is one of the most popular online social news websites with millions

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8 & 9 Han, Chapters 4 & 5 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data Mining.

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Identifying Important Communications

Identifying Important Communications Identifying Important Communications Aaron Jaffey ajaffey@stanford.edu Akifumi Kobashi akobashi@stanford.edu Abstract As we move towards a society increasingly dependent on electronic communication, our

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Feature Selection for fmri Classification

Feature Selection for fmri Classification Feature Selection for fmri Classification Chuang Wu Program of Computational Biology Carnegie Mellon University Pittsburgh, PA 15213 chuangw@andrew.cmu.edu Abstract The functional Magnetic Resonance Imaging

More information

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1 Preface to the Second Edition Preface to the First Edition vii xi 1 Introduction 1 2 Overview of Supervised Learning 9 2.1 Introduction... 9 2.2 Variable Types and Terminology... 9 2.3 Two Simple Approaches

More information

CS 229 Midterm Review

CS 229 Midterm Review CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask

More information

Building Classifiers using Bayesian Networks

Building Classifiers using Bayesian Networks Building Classifiers using Bayesian Networks Nir Friedman and Moises Goldszmidt 1997 Presented by Brian Collins and Lukas Seitlinger Paper Summary The Naive Bayes classifier has reasonable performance

More information

Generating Useful Network-based Features for Analyzing Social Networks

Generating Useful Network-based Features for Analyzing Social Networks Generating Useful Network-based Features for Analyzing Social Networks Abstract Recently, many Web services such as social networking services, blogs, and collaborative tagging have become widely popular.

More information

Automated Tagging for Online Q&A Forums

Automated Tagging for Online Q&A Forums 1 Automated Tagging for Online Q&A Forums Rajat Sharma, Nitin Kalra, Gautam Nagpal University of California, San Diego, La Jolla, CA 92093, USA {ras043, nikalra, gnagpal}@ucsd.edu Abstract Hashtags created

More information

Lecture 9: Support Vector Machines

Lecture 9: Support Vector Machines Lecture 9: Support Vector Machines William Webber (william@williamwebber.com) COMP90042, 2014, Semester 1, Lecture 8 What we ll learn in this lecture Support Vector Machines (SVMs) a highly robust and

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

Predicting Investments in Startups using Network Features and Supervised Random Walks

Predicting Investments in Startups using Network Features and Supervised Random Walks Predicting Investments in Startups using Network Features and Supervised Random Walks Arushi Raghuvanshi arushi@stanford.edu Tara Balakrishnan taragb@stanford.edu Maya Balakrishnan mayanb@stanford.edu

More information

Supervised vs unsupervised clustering

Supervised vs unsupervised clustering Classification Supervised vs unsupervised clustering Cluster analysis: Classes are not known a- priori. Classification: Classes are defined a-priori Sometimes called supervised clustering Extract useful

More information

Chakra Chennubhotla and David Koes

Chakra Chennubhotla and David Koes MSCBIO/CMPBIO 2065: Support Vector Machines Chakra Chennubhotla and David Koes Nov 15, 2017 Sources mmds.org chapter 12 Bishop s book Ch. 7 Notes from Toronto, Mark Schmidt (UBC) 2 SVM SVMs and Logistic

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

BUBBLE RAP: Social-Based Forwarding in Delay-Tolerant Networks

BUBBLE RAP: Social-Based Forwarding in Delay-Tolerant Networks 1 BUBBLE RAP: Social-Based Forwarding in Delay-Tolerant Networks Pan Hui, Jon Crowcroft, Eiko Yoneki Presented By: Shaymaa Khater 2 Outline Introduction. Goals. Data Sets. Community Detection Algorithms

More information

CS 8520: Artificial Intelligence. Machine Learning 2. Paula Matuszek Fall, CSC 8520 Fall Paula Matuszek

CS 8520: Artificial Intelligence. Machine Learning 2. Paula Matuszek Fall, CSC 8520 Fall Paula Matuszek CS 8520: Artificial Intelligence Machine Learning 2 Paula Matuszek Fall, 2015!1 Regression Classifiers We said earlier that the task of a supervised learning system can be viewed as learning a function

More information

Combine the PA Algorithm with a Proximal Classifier

Combine the Passive and Aggressive Algorithm with a Proximal Classifier Yuh-Jye Lee Joint work with Y.-C. Tseng Dept. of Computer Science & Information Engineering TaiwanTech. Dept. of Statistics@NCKU

Robustness of Selective Desensitization Perceptron Against Irrelevant and Partially Relevant Features in Pattern Classification

Robustness of Selective Desensitization Perceptron Against Irrelevant and Partially Relevant Features in Pattern Classification Tomohiro Tanno, Kazumasa Horie, Jun Izawa, and Masahiko Morita University

Fraud Detection using Machine Learning

Fraud Detection using Machine Learning Aditya Oza - aditya19@stanford.edu Abstract Recent research has shown that machine learning techniques have been applied very effectively to the problem of payments

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul Introduction Our problem is crawling a static social graph (snapshot). Given

An Empirical Analysis of Communities in Real-World Networks

An Empirical Analysis of Communities in Real-World Networks Chuan Sheng Foo Computer Science Department Stanford University csfoo@cs.stanford.edu ABSTRACT Little work has been done on the characterization

Support Vector Machines

Support Vector Machines Chapter 9 9.1 Maximal margin classifier 9.2 Support vector classifiers 9.3 Support vector machines 9.4 SVMs with more than two classes 9.5 Relationship to

SOCIAL MEDIA MINING. Data Mining Essentials

SOCIAL MEDIA MINING Data Mining Essentials Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate

To earn the extra credit, one of the following has to hold true. Please circle and sign.

CS 188 Spring 2011 Introduction to Artificial Intelligence Practice Final Exam To earn the extra credit, one of the following has to hold true. Please circle and sign. A I spent 3 or more hours on the

Conditional Random Fields and beyond. Daniel Khashabi, CS 546, UIUC, 2013

Conditional Random Fields and beyond Daniel Khashabi CS 546 UIUC, 2013 Outline Modeling Inference Training Applications Outline Modeling Problem definition Discriminative vs. Generative

Class 6 Large-Scale Image Classification

Class 6 Large-Scale Image Classification Liangliang Cao, March 7, 2013 EECS 6890 Topics in Information Processing Spring 2013, Columbia University http://rogerioferis.com/visualrecognitionandsearch Visual

Generating Useful Network-based Features for Analyzing Social Networks

Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008) Generating Useful Network-based Features for Analyzing Social Networks Jun Karamon and Yutaka Matsuo and Mitsuru Ishizuka

Link Prediction in Graph Streams

Peixiang Zhao, Charu C. Aggarwal, and Gewen He Florida State University IBM T J Watson Research Center Link Prediction in Graph Streams ICDE Conference, 2016 Graph Streams Graph Streams arise in a wide

Linear methods for supervised learning

Linear methods for supervised learning LDA Logistic regression Naïve Bayes PLA Maximum margin hyperplanes Soft-margin hyperplanes Least squares regression Ridge regression Nonlinear feature maps Sometimes

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

Combining Text Embedding and Knowledge Graph Embedding Techniques for Academic Search Engines

Combining Text Embedding and Knowledge Graph Embedding Techniques for Academic Search Engines SemDeep-4, Oct. 2018 Gengchen Mai Krzysztof Janowicz Bo Yan STKO Lab, University of California, Santa Barbara

Internal Link Prediction in Early Stage using External Network

Internal Link Prediction in Early Stage using External Network Honghao Wei Stanford University weihh16 weihh16@stanford.edu Yiwei Zhao Stanford University ywzhao ywzhao@stanford.edu Junjie Ke Stanford

Influence Maximization in Location-Based Social Networks Ivan Suarez, Sudarshan Seshadri, Patrick Cho CS224W Final Project Report

Influence Maximization in Location-Based Social Networks Ivan Suarez, Sudarshan Seshadri, Patrick Cho CS224W Final Project Report Abstract The goal of influence maximization has led to research into different

Facial Expression Classification with Random Filters Feature Extraction

Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It's Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle

All lecture slides will be available at CSC2515_Winter15.html

CSC2515 Fall 2015 Introduction to Machine Learning Lecture 9: Support Vector Machines All lecture slides will be available at http://www.cs.toronto.edu/~urtasun/courses/csc2515/ CSC2515_Winter15.html Many

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018

Performance Estimation and Regularization Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Bias-Variance Tradeoff Fundamental to machine learning approaches Bias-Variance Tradeoff Error due to Bias:

Hotel Recommendation Based on Hybrid Model

Hotel Recommendation Based on Hybrid Model Jing WANG, Jiajun SUN, Zhendong LIN Abstract: This project develops a hybrid model that combines content-based with collaborative filtering (CF) for hotel recommendation.

Managing Open Bug Repositories through Bug Report Prioritization Using SVMs

Managing Open Bug Repositories through Bug Report Prioritization Using SVMs Jaweria Kanwal Quaid-i-Azam University, Islamabad kjaweria09@yahoo.com Onaiza Maqbool Quaid-i-Azam University, Islamabad onaiza@qau.edu.pk

Ranking Algorithms For Digital Forensic String Search Hits

DIGITAL FORENSIC RESEARCH CONFERENCE Ranking Algorithms For Digital Forensic String Search Hits By Nicole Beebe and Lishu Liu Presented At The Digital Forensic Research Conference DFRWS 2014 USA Denver,

Generative and discriminative classification techniques

Generative and discriminative classification techniques Machine Learning and Category Representation 2013-2014 Jakob Verbeek, December 13+20, 2013 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.13.14

Collaborative Filtering using Weighted BiPartite Graph Projection A Recommendation System for Yelp

Collaborative Filtering using Weighted BiPartite Graph Projection A Recommendation System for Yelp Sumedh Sawant sumedh@stanford.edu Team 38 December 10, 2013 Abstract We implement a personal recommendation

Improving Positron Emission Tomography Imaging with Machine Learning David Fan-Chung Hsu CS 229 Fall

Improving Positron Emission Tomography Imaging with Machine Learning David Fan-Chung Hsu (fcdh@stanford.edu), CS 229 Fall 2014-15 1. Introduction and Motivation High-resolution Positron Emission Tomography

Fast or furious? - User analysis of SF Express Inc

CS 229 PROJECT, DEC. 2017 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood

A study of classification algorithms using Rapidminer

Volume 119 No. 12 2018, 15977-15988 ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu A study of classification algorithms using Rapidminer Dr.J.Arunadevi 1, S.Ramya 2, M.Ramesh Raja

Efficient Case Based Feature Construction

Efficient Case Based Feature Construction Ingo Mierswa and Michael Wurst Artificial Intelligence Unit, Department of Computer Science, University of Dortmund, Germany {mierswa, wurst}@ls8.cs.uni-dortmund.de

Using Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions

Using Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions Offer Sharabi, Yi Sun, Mark Robinson, Rod Adams, Rene te Boekhorst, Alistair G. Rust, Neil Davey University of

Classification Algorithms in Data Mining

August 9th, 2016 Suhas Mallesh Yash Thakkar Ashok Choudhary CIS660 Data Mining and Big Data Processing -Dr. Sunnie S. Chung Classification Algorithms in Data Mining Deciding on the classification algorithms

Comparative analysis of data mining methods for predicting credit default probabilities in a retail bank portfolio

Comparative analysis of data mining methods for predicting credit default probabilities in a retail bank portfolio Adela Ioana Tudor, Adela Bâra, Simona Vasilica Oprea Department of Economic Informatics

Louis Fourrier Fabien Gaie Thomas Rolf

CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted

Evaluating Classifiers

Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

Bayesian model ensembling using meta-trained recurrent neural networks

Bayesian model ensembling using meta-trained recurrent neural networks Luca Ambrogioni l.ambrogioni@donders.ru.nl Umut Güçlü u.guclu@donders.ru.nl Yağmur Güçlütürk y.gucluturk@donders.ru.nl Julia Berezutskaya

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017

CPSC 340: Machine Learning and Data Mining Probabilistic Classification Fall 2017 Admin Assignment 0 is due tonight: you should be almost done. 1 late day to hand it in Monday, 2 late days for Wednesday.

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp Chris Guthrie Abstract In this paper I present my investigation of machine learning as

Supervised classification of law area in the legal domain

GRADUATION PROJECT BSC AI Supervised classification of law area in the legal domain Author: Mees FRÖBERG (10559949) Supervisors: Evangelos KANOULAS Tjerk DE GREEF June 24, 2016 Abstract Search algorithms

Supervised Random Walks

Supervised Random Walks Pawan Goyal CSE, IITKGP September 8, 2014 Correlation Discovery by random walk Problem definition Estimate

CS 8520: Artificial Intelligence

CS 8520: Artificial Intelligence Machine Learning 2 Paula Matuszek Spring, 2013 Regression Classifiers We said earlier that the task of a supervised learning system can be viewed as learning a function

Network Traffic Measurements and Analysis

DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank

Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank Implementation: Real machine learning schemes Decision trees Classification

node2vec: Scalable Feature Learning for Networks

node2vec: Scalable Feature Learning for Networks A paper by Aditya Grover and Jure Leskovec, presented at Knowledge Discovery and Data Mining 16. 11/27/2018 Presented by: Dharvi Verma CS 848: Graph Database

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Generalized Additive Model and Applications in Direct Marketing Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Abstract Logistic regression 1 has been widely used in direct marketing applications

Object and Action Detection from a Single Example

Object and Action Detection from a Single Example Peyman Milanfar* EE Department University of California, Santa Cruz *Joint work with Hae Jong Seo AFOSR Program Review, June 4-5, 29 Take a look at this:

A Survey on Postive and Unlabelled Learning

A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled

Mathematics of Data. INFO-4604, Applied Machine Learning University of Colorado Boulder. September 5, 2017 Prof. Michael Paul

Mathematics of Data INFO-4604, Applied Machine Learning University of Colorado Boulder September 5, 2017 Prof. Michael Paul Goals In the intro lecture, every visualization was in 2D What happens when we

Machine Learning Classifiers and Boosting

Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve

Machine Learning (CSE 446): Practical Issues

Machine Learning (CSE 446): Practical Issues Noah Smith © 2017 University of Washington nasmith@cs.washington.edu October 18, 2017 scary words Outline of CSE 446 We've already covered stuff

Study of Data Mining Algorithm in Social Network Analysis

3rd International Conference on Mechatronics, Robotics and Automation (ICMRA 2015) Study of Data Mining Algorithm in Social Network Analysis Chang Zhang 1,a, Yanfeng Jin 1,b, Wei Jin 1,c, Yu Liu 1,d 1

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Part 12: Advanced Topics in Collaborative Filtering Francesco Ricci Content Generating recommendations in CF using frequency of ratings Role of neighborhood size Comparison of CF with association rules

Example for calculation of clustering coefficient Node N 1 has 8 neighbors (red arrows) There are 12 connectivities among neighbors (blue arrows)

Example for calculation of clustering coefficient Node N 1 has 8 neighbors (red arrows) There are 12 connectivities among neighbors (blue arrows) Average clustering coefficient of a graph Overall measure

Classification. 1st Semester 2007/2008

Classification Departamento de Engenharia Informática Instituto Superior Técnico 1st Semester 2007/2008 Slides based on the official slides for the book Mining the Web © Soumen Chakrabarti. Outline 1 2 3 Single-Class

SCALABLE KNOWLEDGE BASED AGGREGATION OF COLLECTIVE BEHAVIOR

SCALABLE KNOWLEDGE BASED AGGREGATION OF COLLECTIVE BEHAVIOR P.SHENBAGAVALLI M.E., Research Scholar, Assistant professor/cse MPNMJ Engineering college Sspshenba2@gmail.com J.SARAVANAKUMAR B.Tech(IT)., PG

Evaluation of different biological data and computational classification methods for use in protein interaction prediction.

Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Yanjun Qi, Ziv Bar-Joseph, Judith Klein-Seetharaman Protein 2006 Motivation Correctly

Neural Network Optimization and Tuning / Spring 2018 / Recitation 3

Neural Network Optimization and Tuning 11-785 / Spring 2018 / Recitation 3 Logistics You will work through a Jupyter notebook that contains sample and starter code with explanations and comments throughout.

Web Data Mining. Lesson 3 - Clustering and Classification

Web Data Mining Lesson 3 - Clustering and Classification Introduction Clustering and classification are both learning techniques They learn functions describing data Clustering is also known as Unsupervised
