Link Analysis in Weibo


Liwen Sun, AMPLab, EECS
Di Wang, Theory Group, EECS

Abstract

With the widespread use of social network applications, online user behaviors, ranging from information search to marketing, have changed greatly because of the connected nature of these platforms. In this project, we study the link analysis problem: analyzing the existing links in a social graph and predicting new ones. With these techniques, online platforms can make effective recommendations to users so that the network becomes more connected, which opens the door to business opportunities and information sharing. We apply machine learning techniques and consider both global methods, such as SVM and logistic regression, and local methods, such as k-nearest neighbor. We compare these tools in terms of both effectiveness and efficiency.

1 Introduction

"In an information-rich world, the wealth of information means a dearth of something else: a scarcity of whatever it is that information consumes. What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it."
Herbert Simon, Designing Organizations for an Information-Rich World

Online social networking platforms have become tremendously popular in recent years and keep pushing the boundary of online business. The largest sites, such as Facebook and Twitter, add thousands of enthusiastic new users every day to their existing hundreds of millions of actively engaged users. Micro-blogging services, in particular, have rapidly gained popularity due to the rise of Twitter. Through limited-length messages (140 characters at most), micro-blogging sites have demonstrated their power to spread information as a crowd-powered social medium.

Figure 1: Growth of Weibo Users
In China, Weibo, which literally means "microblog" in Chinese, is a micro-blogging service modeled after Twitter. Two major Weibo sites, Tencent and Sina, have each grown their user base to over three hundred million within two years, as shown in Figure 1. Although Weibo is viewed as a social platform similar to Twitter, it has its own distinctive features. One primary distinction is that a 140-character message in Chinese can be much more informative than one in English. It also has a unique social effect in Chinese society, e.g., the ability to spread forbidden messages like wildfire before they are censored.

In this project, we study link analysis problems on Weibo. Social platforms bring online users together to form a gigantic social graph, where users and other entities serve as nodes, and their interactions, such as friend and follow relationships, serve as edges. Such link information is an important source for analyzing and predicting user behaviors. In particular, the link recommendation problem aims to help users connect with others and thereby make the social graph more densely connected.

Table 1: Dataset Description

data file        details                          type             size
user profile     age, gender, num of tweets       list of records  2.3M users
user keywords    keywords extracted from tweets   list of records  2.3M users
item attributes  category, tags                   list of records  6.5K items
follows          follow history                   graph            2M nodes, 50M edges
actions          mention, retweet, comment        graph            2M nodes, 10M edges

Motivation. The service host has an incentive to identify pairs of users who are likely to connect and to send them a recommendation, as they might be interested in connecting with each other. From the users' perspective, they would like to find the right information sources for their interests and to connect with the people they know. Business users and celebrities are eager to be seen and followed by more people so that they can gain wider coverage and greater popularity. By making the social network more connected, the service host can maintain user stickiness and generate traffic. With inappropriate recommendations, however, the system would flood users with huge volumes of information and put them at risk of information overload. Reducing this risk is a priority for improving the user experience, and it also presents opportunities for novel data mining solutions. Thus, capturing users' interests and serving them with potentially interesting items (e.g., news, games, advertisements, products) is a fundamental and crucial feature of Weibo and other social platforms.

To generate effective link recommendations, we analyze the proximity, or similarity, of nodes (users) in the social graph. If a pair of nodes exhibits high network proximity and yet no direct link exists between them, e.g., they are not friends yet, then a recommendation for the pair is likely to be accepted. We also consider the attributes of nodes, such as the age and gender of users and the keywords from their online posts, as features for similarity analysis.
Overall, our focus is on exploiting the combination of link-based and content-based information and identifying useful features. Recent advances in data mining and machine learning provide abundant tools for our link analysis problem. Logistic regression, SVM, and k-nearest neighbor are among the most important ones and have been successfully applied to a wide range of problems [2, 3]. We explore these state-of-the-art machine learning techniques and compare the resulting models on our problem. We observed that in a connected social graph, local methods, such as k-nearest neighbor, can significantly outperform global methods, such as SVM and logistic regression. Moreover, k-nearest neighbor performs best when we set k = 20, which means we should look at a small local neighborhood. Computationally, as expected, k-nearest neighbor algorithms are much more expensive than linear methods such as linear SVM and logistic regression.

The remainder of this report is structured as follows. Section 2 describes the Weibo dataset we use. Section 3 presents the problem definition. We discuss our approaches in Section 4. Some computational issues are addressed in Section 5. We show the experimental results in Section 6. Section 7 concludes the report and outlines some future directions.

2 Dataset

Tencent Weibo is one of the largest micro-blogging websites in China. Since its launch in April 2010, it has become a major platform for building friendships and sharing interests online. Currently, there are more than two hundred million registered users on Tencent Weibo, generating over forty million messages each day. Recently, Tencent Weibo released a sample of its data, which is the first known Weibo data at this scale. The dataset is summarized in Table 1. Note that each item is also a user, and items and users share the same domain of ids. Thus we are able to look up the corresponding information in the user profile data for a given item id.
Items differ from other users in that they are the entities recommended to users, and thus additional information, such as category, is provided for them. A large proportion of items are celebrities and special groups. Also note that the item set (6K) comes from only a small fraction of the user set (2M). This gives us an opportunity to optimize our learning algorithms, as will be shown in later sections.

Training Data. Apart from the dataset itself, we have training data available as well. The training set contains 70M records. A record in the training set has the format (i, u, c), where i is an item, which can be any entity in Weibo such as a user, a celebrity, or a group; u represents a user; and c ∈ {-1, 1} is the class label, where 1 indicates that u accepted the recommendation to follow i and -1 indicates otherwise. When the trained classifier is applied to a test pair (u, i), it outputs 1 or -1 as the classification result.

3 Problem Definition

Given the various kinds of information in the data and an adequate amount of training data, our task is to select three items from the item set and recommend them to each user. This three-item recommendation scheme is the one used in the real Weibo system, as shown in Figure 2.

Figure 2: Recommender System in Weibo Snapshot

Objective Function. Given a recommendation in the form of a (user, item) pair, we compute a real-valued score indicating how likely the user is to accept the recommendation and follow the item. For each user, we have a list of recommended items together with their scores. We rank these items by score and pick the top 3.

To measure the effectiveness of these 3-item ranked results, we use average precision. Suppose m items in an ordered list are recommended to a user, who may click zero, one, or more of them to follow. The average precision at n for this user is defined as

$$\mathrm{AP}@n = \frac{\sum_{k=1}^{n} P(k)\,\mathrm{Rel}(k)}{m}$$

where the result is set to zero if the denominator is zero; P(k) is the precision at cut-off k in the item list, i.e., the number of clicked items up to position k divided by k; Rel(k) is the indicator of whether the user follows the k-th recommended item, so that the k-th term is 0 when that item is not followed; and n = 3, the default number of items recommended to each user in Weibo. The score of an algorithm is the AP@n averaged over all users.

4 Approaches

In this section, we explore different machine learning algorithms and study how they can be applied to our link prediction problem.

4.1 Similarity Analysis

To predict links, our approach is to identify pairs of nodes that are not currently connected but are close to each other. Thus, we need some metrics, or features, to quantify the proximity, or similarity, of two nodes in a graph. We consider the following two kinds of similarity.

Attribute-based Similarity. We can extract attribute-based information from the data, such as user profiles and item categories. To quantify the similarity of two nodes, we estimate the similarity of this information. For example, we can derive the cosine similarity of two users' keywords, or their age difference. These values are fed into our machine learning algorithms as features.

Link-based Similarity. Another important source of information is the graph structure itself, from which we estimate the proximity of two nodes. For example, we can compute the number of common neighbors of two nodes, or the cosine similarity of their neighborhoods.

Our focus in this project is to measure the effectiveness of these two categories of information and to study how to combine them effectively.

4.2 Global methods

We first consider the most straightforward global classification methods, SVM and logistic regression. For each observation (i, u, c) where u is in our training set (recall that u is the user, i is the item recommended to u, and c indicates whether u accepted the recommendation), we compute a feature vector for the user-item pair from the data related to u and i. The features we use include the gender and age of u, the keyword similarity between u and i, the popularity of i, the number of common neighbors of u and i in the follow graph, and the number of items followed by u that are in the same category as i. Treating the feature vectors as points and the corresponding values of c as their labels, we run linear SVM and logistic regression on this set of points. Each algorithm produces a weight vector after going over the training data. For each pair of a test user and an item, we compute the same set of features and use the trained model to compute a score of how likely the user is to follow the item.
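The similarity features described above can be illustrated with a minimal sketch. It is not the code used in our experiments: the function names, the toy data, and the choice of only three features (keyword cosine similarity, common neighbors, item popularity) are assumptions made for illustration, and "neighbors" is taken here to mean the set of accounts a node follows.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse keyword vectors stored as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def common_neighbors(follows, u, i):
    """Number of accounts followed by both u and i (assumed neighbor definition)."""
    return len(follows.get(u, set()) & follows.get(i, set()))


def feature_vector(u, i, keywords, follows, popularity):
    """Hypothetical feature vector for a (user, item) pair."""
    return [
        cosine_similarity(keywords.get(u, {}), keywords.get(i, {})),  # attribute-based
        float(common_neighbors(follows, u, i)),                       # link-based
        float(popularity.get(i, 0)),                                  # item popularity
    ]


# Toy data, invented for this example only.
keywords = {"u1": {"music": 1.0, "film": 1.0}, "i1": {"music": 1.0}}
follows = {"u1": {"a", "b", "c"}, "i1": {"b", "c", "d"}}
popularity = {"i1": 120}

features = feature_vector("u1", "i1", keywords, follows, popularity)
print(features)
```

Vectors of this kind, one per training record, would then be passed to a linear classifier together with the labels c.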
We recommend the items that a test user is most likely to follow.

Another method we implemented is based on SVD. We consider the user-item following matrix A and compute the k-dimensional subspace that best approximates A. Formally, we find matrices U and V with orthonormal columns and a diagonal matrix D such that the Frobenius norm $\lVert A - U D V^{\top} \rVert_F$ is minimized. For each test user, we project his or her vector onto this k-dimensional space (with basis V) and use the projected vector as the scores for each item. We consider this a global approach since we use the projection directly; the low-dimensional projections can also serve as inputs to local methods.

Figure 3: Degree distribution of our social graph (axes: degree, frequency count)

Figure 4: Node Partition of the Social Graph

4.3 Local methods

In contrast to the global methods, which aim to find patterns in the entire dataset, we also consider local methods that try to pick out signals from the local neighborhoods of users. We experimented with both user-based and item-based local methods. In this setting, we focus on the follow history of a user and do not include other features.

For the user-based k-nearest-neighbor algorithm, we view each user as a vector in the item space. For each test user, we find the k most similar users in the training set in terms of the cosine similarity between their vectors, and predict the items that are most popular among these nearest neighbors and that the test user has not followed yet.

For the item-based algorithm, we compute the correlation between each pair of items (x, y). Let x_i and y_i be the indicator variables of whether user i followed x and y, respectively. We treat them as two random variables, so that each user provides one paired observation, and compute the Pearson correlation between the two items. For a test user, the score of an item is the sum of its correlations with all the items that we observed this user following. We then recommend the items with the highest scores that the test user has not followed yet.

5 Computational Efficiency

The problem we deal with raises some nontrivial computational issues. The training set has millions of records, and for each record we have to compute a feature vector. The features based on graph structure can be expensive to derive, since the graph has millions of nodes.
Even if we only consider the local neighborhood of a node, some nodes have very large degrees and thus large neighborhoods, due to the power-law degree distribution. In this section, we discuss two techniques that we use to speed up the learning algorithms.

5.1 Materialized View

We observe that some primitive values are queried frequently when building feature matrices. Processing these queries from scratch for each training example involves redundant computation. Therefore, we identify such primitive queries, precompute their answers, and store them as a materialized view so that later queries can simply look up the values when needed. We store this information for user-item entries. Although we have millions of users, the number of items is as small as six thousand, as given in Table 1. Therefore, the storage cost is still in a manageable range. In detail, we store the following information in the materialized view:

- number of common neighbors
- attribute value (e.g., age, gender) distribution of an item's fans
- category distribution for users following items

5.2 Parallelization

Although we focus on link-based computations, which are connected and dependent in nature, we can still exploit some embarrassingly parallel tasks to make the best use of our computer cluster of multi-core nodes.

Precomputation. In the initial phase of materialized view construction, where we compute values on top of the graph, we can parallelize the graph-based computations. The tasks are essentially defined over the set of nodes. We separate these tasks (or nodes) into several partitions, as depicted in Figure 4. For each partition, we replicate the entire graph structure, so that the cut does not affect the link-based computation.

Constructing Feature Matrix. As mentioned, one of the biggest bottlenecks in our computation is deriving the feature matrix. We can partition the training set and derive the feature matrix for each partition in parallel.
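The two ideas of this section, precomputed lookups and partitioned feature construction, can be sketched together as follows. This is a minimal illustration under stated assumptions rather than our actual implementation: all function names and the toy data are hypothetical, the view stores only the common-neighbor count, and a thread pool from the Python standard library stands in for the multi-core cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def build_common_neighbor_view(follows, users, items):
    """Precompute the common-neighbor count once for every (user, item) entry."""
    return {
        (u, i): len(follows.get(u, set()) & follows.get(i, set()))
        for u in users
        for i in items
    }


def featurize_partition(records, view):
    """Turn one slice of (item, user, label) records into feature rows by lookup."""
    return [[float(view.get((u, i), 0))] for i, u, _label in records]


def build_feature_matrix(records, view, num_partitions=2):
    """Split the records into contiguous partitions, featurize each, then stack."""
    size = max(1, -(-len(records) // num_partitions))  # ceiling division
    partitions = [records[s:s + size] for s in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # pool.map returns results in partition order, so row order is preserved.
        blocks = list(pool.map(lambda part: featurize_partition(part, view), partitions))
    return [row for block in blocks for row in block]


# Toy data, invented for this example only.
follows = {"u1": {"a", "b"}, "u2": {"b"}, "i1": {"a", "b"}, "i2": {"c"}}
view = build_common_neighbor_view(follows, ["u1", "u2"], ["i1", "i2"])
records = [("i1", "u1", 1), ("i2", "u1", -1), ("i1", "u2", 1)]

matrix = build_feature_matrix(records, view, num_partitions=2)
print(matrix)
```

Because the view has one entry per user-item pair, its size grows with the number of items, which is why a small item set keeps the storage cost manageable.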
Since each row in the matrix corresponds to a training example, we can finally combine these horizontally partitioned sub-matrices to form the overall feature matrix.

6 Performance Analysis

In this section, we report the results of our experimental study. We conduct our experiments on a large machine in AMPLab, which features 2 Intel E5645 processors at 2.40GHz, 288GB of RAM, and 16 1TB disks. Note that it has 24 cores, so we are able to run the parallelized computations discussed in Section 5.2. For linear classification methods, such as SVM and logistic regression, we use the liblinear library [1].

6.1 Comparing Different Approaches

Due to time constraints, we conducted our experiments on the subset of users who followed more than 50 items. This yields the set of users used in our experiments. We split the users into training and test users, and for each test user we hold out 30% of the items he or she follows and try to predict them. We computed the average prediction rate of the algorithms discussed earlier; the results are shown in Figure 5. As the baseline, we use the algorithm that always recommends the most popular items (in terms of number of followers) that a test user has not followed so far. For KNN and LSI (the SVD-based method), we only plot the result for the optimal parameters (i.e., number of neighbors and number of dimensions).

Among the various algorithms, KNN clearly stands out, which suggests that the signal in the local neighborhood is much stronger than the global pattern. Among the global methods, SVM and logistic regression perform on par with each other, and SVD predicts better than both of them. We do want to point out that SVD also takes much longer to compute than SVM and logistic regression. Also, when the number of items is much smaller than the number of users, the item-based correlation algorithm is much more efficient than the other algorithms. It is also worth noting that SVM and logistic regression consider information beyond the follow history, while KNN, LSI, and item correlation use only the user-item following matrix.
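The user-based KNN recommender of Section 4.3 can be sketched as follows. This is a naive illustration under stated assumptions, not the implementation behind the reported numbers: the function names and toy data are hypothetical, each user is represented by the set of items he or she follows (a binary vector in item space), every neighbor votes with equal weight, and ties are broken alphabetically.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two binary item vectors stored as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def knn_recommend(target_items, train_follows, k=2, top_n=3):
    """Recommend the items most popular among the k most similar training users."""
    neighbors = sorted(
        train_follows,
        key=lambda user: cosine(target_items, train_follows[user]),
        reverse=True,
    )[:k]
    votes = Counter()
    for user in neighbors:
        # Only items the target user has not followed yet are candidates.
        for item in train_follows[user] - target_items:
            votes[item] += 1
    ranked = sorted(votes, key=lambda item: (-votes[item], item))
    return ranked[:top_n]


# Toy follow histories, invented for this example only.
train_follows = {
    "a": {"i1", "i2", "i3"},
    "b": {"i1", "i2", "i4"},
    "c": {"i5"},
}

recs = knn_recommend({"i1", "i2"}, train_follows, k=2, top_n=3)
print(recs)
```

Since every neighbor's vote counts equally in this sketch, setting k to the number of training users makes it recommend the globally most popular unfollowed items. The full scan over training users for each test user is also what makes a naive KNN expensive on large datasets.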
6.2 Study on k-nn Performance

To substantiate our intuition that the local neighborhood is more informative for recommendation, we run KNN with various values of k. We plot the results in Figure 6. Notice that when we take k to be the total number of users, the algorithm is essentially equivalent to the baseline algorithm.

Figure 5: Comparison of Approaches (success rate (%) for Baseline, SVM, Log. Reg., LSI-25, KNN-10, ItemCo)

Figure 6: k-nn Performance vs. k

7 Conclusions and Future Work

We study the problem of link analysis on a Weibo dataset, which is the first Weibo data ever released at this scale. We apply different machine learning tools to this problem and compare the results. Since the dataset is of nontrivial scale, we also address some of the computational issues we encountered and improve the efficiency. We observe that, for recommending items for users to follow, algorithms that use information from the local neighborhood perform much better than algorithms that try to find global patterns. On the other hand, the global methods are more efficient to compute than algorithms like KNN, and a naive implementation of KNN would be infeasible for large-scale datasets. Thus it will be interesting to speed up KNN to make it feasible for big data, probably via clustering and dimension reduction.

References

[1] Liblinear: cjlin/liblinear/.
[2] BISHOP, C. M. Pattern Recognition and Machine Learning.
[3] MANNING, C. D., RAGHAVAN, P., AND SCHÜTZE, H. Introduction to Information Retrieval.


More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Data mining: concepts and algorithms

Data mining: concepts and algorithms Data mining: concepts and algorithms Practice Data mining Objective Exploit data mining algorithms to analyze a real dataset using the RapidMiner machine learning tool. The practice session is organized

More information

Learning Similarity Metrics for Event Identification in Social Media. Hila Becker, Luis Gravano

Learning Similarity Metrics for Event Identification in Social Media. Hila Becker, Luis Gravano Learning Similarity Metrics for Event Identification in Social Media Hila Becker, Luis Gravano Columbia University Mor Naaman Rutgers University Event Content in Social Media Sites Event Content in Social

More information

Design and Implementation of Music Recommendation System Based on Hadoop

Design and Implementation of Music Recommendation System Based on Hadoop Design and Implementation of Music Recommendation System Based on Hadoop Zhao Yufeng School of Computer Science and Engineering Xi'an University of Technology Shaanxi, Xi an, China e-mail:

More information

Face Recognition using Eigenfaces SMAI Course Project

Face Recognition using Eigenfaces SMAI Course Project Face Recognition using Eigenfaces SMAI Course Project Satarupa Guha IIIT Hyderabad 201307566 Ayushi Dalmia IIIT Hyderabad 201307565 Abstract

More information

Divide & Recombine with Tessera: Analyzing Larger and More Complex Data.

Divide & Recombine with Tessera: Analyzing Larger and More Complex Data. 1 Divide & Recombine with Tessera: Analyzing Larger and More Complex Data The D&R Framework Computationally, this is a very simple. 2 Division a division method specified by the analyst divides

More information

Distribution-free Predictive Approaches

Distribution-free Predictive Approaches Distribution-free Predictive Approaches The methods discussed in the previous sections are essentially model-based. Model-free approaches such as tree-based classification also exist and are popular for

More information

Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits

Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits Carl Sabottke Octavian Suciu Tudor Dumitraș University of Maryland 2 Problem Increasing number

More information

Part 11: Collaborative Filtering. Francesco Ricci

Part 11: Collaborative Filtering. Francesco Ricci Part : Collaborative Filtering Francesco Ricci Content An example of a Collaborative Filtering system: MovieLens The collaborative filtering method n Similarity of users n Methods for building the rating

More information

Copyright 2011 please consult the authors

Copyright 2011 please consult the authors Alsaleh, Slah, Nayak, Richi, Xu, Yue, & Chen, Lin (2011) Improving matching process in social network using implicit and explicit user information. In: Proceedings of the Asia-Pacific Web Conference (APWeb

More information

Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs

Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs Alessandro Epasto J. Feldman*, S. Lattanzi*, S. Leonardi, V. Mirrokni*. *Google Research Sapienza U. Rome Motivation Recommendation

More information

Clustering in Networks

Clustering in Networks Clustering in Networks (Spectral Clustering with the Graph Laplacian... a brief introduction) Tom Carter Computer Science CSU Stanislaus tom/clustering April 1, 2012 1 Our general

More information

Detecting and Analyzing Communities in Social Network Graphs for Targeted Marketing

Detecting and Analyzing Communities in Social Network Graphs for Targeted Marketing Detecting and Analyzing Communities in Social Network Graphs for Targeted Marketing Gautam Bhat, Rajeev Kumar Singh Department of Computer Science and Engineering Shiv Nadar University Gautam Buddh Nagar,

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

arxiv: v1 [cs.lg] 5 Mar 2013

arxiv: v1 [cs.lg] 5 Mar 2013 GURLS: a Least Squares Library for Supervised Learning Andrea Tacchetti, Pavan K. Mallapragada, Matteo Santoro, Lorenzo Rosasco Center for Biological and Computational Learning, Massachusetts Institute

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Naïve Bayes for text classification

Naïve Bayes for text classification Road Map Basic concepts Decision tree induction Evaluation of classifiers Rule induction Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Support

More information

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Scene Completion Problem The Bare Data Approach High Dimensional Data Many real-world problems Web Search and Text Mining Billions

More information

Nearest neighbor classification DSE 220

Nearest neighbor classification DSE 220 Nearest neighbor classification DSE 220 Decision Trees Target variable Label Dependent variable Output space Person ID Age Gender Income Balance Mortgag e payment 123213 32 F 25000 32000 Y 17824 49 M 12000-3000

More information

Web Personalization & Recommender Systems

Web Personalization & Recommender Systems Web Personalization & Recommender Systems COSC 488 Slides are based on: - Bamshad Mobasher, Depaul University - Recent publications: see the last page (Reference section) Web Personalization & Recommender

More information

Towards a hybrid approach to Netflix Challenge

Towards a hybrid approach to Netflix Challenge Towards a hybrid approach to Netflix Challenge Abhishek Gupta, Abhijeet Mohapatra, Tejaswi Tenneti March 12, 2009 1 Introduction Today Recommendation systems [3] have become indispensible because of the

More information


CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster)

More information

CS229 Final Project: Predicting Expected Response Times

CS229 Final Project: Predicting Expected  Response Times CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time

More information



More information

Virtualizing Agilent OpenLAB CDS EZChrom Edition with VMware

Virtualizing Agilent OpenLAB CDS EZChrom Edition with VMware Virtualizing Agilent OpenLAB CDS EZChrom Edition with VMware Technical Overview Abstract This technical overview describes the considerations, recommended configurations, and host server requirements when

More information

7. Nearest neighbors. Learning objectives. Centre for Computational Biology, Mines ParisTech

7. Nearest neighbors. Learning objectives. Centre for Computational Biology, Mines ParisTech Foundations of Machine Learning CentraleSupélec Paris Fall 2016 7. Nearest neighbors Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech Learning

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining Fundamentals of learning (continued) and the k-nearest neighbours classifier Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart.

More information

k-nearest Neighbor (knn) Sept Youn-Hee Han

k-nearest Neighbor (knn) Sept Youn-Hee Han k-nearest Neighbor (knn) Sept. 2015 Youn-Hee Han ²Eager Learners Eager vs. Lazy Learning when given a set of training data, it will construct a generalization model before receiving

More information

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University

More information

Lecture 22 : Distributed Systems for ML

Lecture 22 : Distributed Systems for ML 10-708: Probabilistic Graphical Models, Spring 2017 Lecture 22 : Distributed Systems for ML Lecturer: Qirong Ho Scribes: Zihang Dai, Fan Yang 1 Introduction Big data has been very popular in recent years.

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Tag-based Social Interest Discovery

Tag-based Social Interest Discovery Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le ( 1 Outline Introduction Data set collection & Pre-processing Architecture

More information

A Novel deep learning models for Cold Start Product Recommendation using Micro blogging Information

A Novel deep learning models for Cold Start Product Recommendation using Micro blogging Information A Novel deep learning models for Cold Start Product Recommendation using Micro blogging Information Chunchu.Harika, PG Scholar, Department of CSE, QIS College of Engineering and Technology, Ongole, Andhra

More information

Feature selection. LING 572 Fei Xia

Feature selection. LING 572 Fei Xia Feature selection LING 572 Fei Xia 1 Creating attribute-value table x 1 x 2 f 1 f 2 f K y Choose features: Define feature templates Instantiate the feature templates Dimensionality reduction: feature selection

More information

Using Social Networks to Improve Movie Rating Predictions

Using Social Networks to Improve Movie Rating Predictions Introduction Using Social Networks to Improve Movie Rating Predictions Suhaas Prasad Recommender systems based on collaborative filtering techniques have become a large area of interest ever since the

More information

Using Data Mining to Determine User-Specific Movie Ratings

Using Data Mining to Determine User-Specific Movie Ratings Available Online at International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,

More information

arxiv: v3 [] 3 May 2017

arxiv: v3 [] 3 May 2017 Modeling Request Patterns in VoD Services with Recommendation Systems Samarth Gupta and Sharayu Moharir arxiv:1609.02391v3 [] 3 May 2017 Department of Electrical Engineering, Indian Institute of Technology

More information

Unsupervised learning in Vision

Unsupervised learning in Vision Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang Chenyan Xiong William Yang Wang Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

Knowledge Discovery and Data Mining 1 (VO) ( )

Knowledge Discovery and Data Mining 1 (VO) ( ) Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability

More information

CS 231A CA Session: Problem Set 4 Review. Kevin Chen May 13, 2016

CS 231A CA Session: Problem Set 4 Review. Kevin Chen May 13, 2016 CS 231A CA Session: Problem Set 4 Review Kevin Chen May 13, 2016 PS4 Outline Problem 1: Viewpoint estimation Problem 2: Segmentation Meanshift segmentation Normalized cut Problem 1: Viewpoint Estimation

More information

Metric Learning for Large-Scale Image Classification:

Metric Learning for Large-Scale Image Classification: Metric Learning for Large-Scale Image Classification: Generalizing to New Classes at Near-Zero Cost Florent Perronnin 1 work published at ECCV 2012 with: Thomas Mensink 1,2 Jakob Verbeek 2 Gabriela Csurka

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University SPAM FARMING 2/11/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 2/11/2013 Jure Leskovec, Stanford

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Dipak J Kakade, Nilesh P Sable Department of Computer Engineering, JSPM S Imperial College of Engg. And Research,

More information

A Brief Look at Optimization

A Brief Look at Optimization A Brief Look at Optimization CSC 412/2506 Tutorial David Madras January 18, 2018 Slides adapted from last year s version Overview Introduction Classes of optimization problems Linear programming Steepest

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

More information


CHAPTER 5 CLUSTERING USING MUST LINK AND CANNOT LINK ALGORITHM 82 CHAPTER 5 CLUSTERING USING MUST LINK AND CANNOT LINK ALGORITHM 5.1 INTRODUCTION In this phase, the prime attribute that is taken into consideration is the high dimensionality of the document space.

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

CS178: Machine Learning and Data Mining. Complexity & Nearest Neighbor Methods

CS178: Machine Learning and Data Mining. Complexity & Nearest Neighbor Methods + CS78: Machine Learning and Data Mining Complexity & Nearest Neighbor Methods Prof. Erik Sudderth Some materials courtesy Alex Ihler & Sameer Singh Machine Learning Complexity and Overfitting Nearest

More information

Part 11: Collaborative Filtering. Francesco Ricci

Part 11: Collaborative Filtering. Francesco Ricci Part : Collaborative Filtering Francesco Ricci Content An example of a Collaborative Filtering system: MovieLens The collaborative filtering method n Similarity of users n Methods for building the rating

More information

Object and Action Detection from a Single Example

Object and Action Detection from a Single Example Object and Action Detection from a Single Example Peyman Milanfar* EE Department University of California, Santa Cruz *Joint work with Hae Jong Seo AFOSR Program Review, June 4-5, 29 Take a look at this:

More information

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University  Infinite data. Filtering data streams /9/7 Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them

More information

6.034 Quiz 2, Spring 2005

6.034 Quiz 2, Spring 2005 6.034 Quiz 2, Spring 2005 Open Book, Open Notes Name: Problem 1 (13 pts) 2 (8 pts) 3 (7 pts) 4 (9 pts) 5 (8 pts) 6 (16 pts) 7 (15 pts) 8 (12 pts) 9 (12 pts) Total (100 pts) Score 1 1 Decision Trees (13

More information

Support vector machines

Support vector machines Support vector machines Cavan Reilly October 24, 2018 Table of contents K-nearest neighbor classification Support vector machines K-nearest neighbor classification Suppose we have a collection of measurements

More information