E-commercial Recommendation Algorithms Based on Link Analysis


Guanlin Li, Le Lu (l2lu@ucsd.edu), Shulin Cao, Junjie Zhu (juz091@eng.ucsd.edu)
Instructor: Fragkiskos Malliaros
University of California, San Diego
June 11, 2017

ABSTRACT

This paper focuses on similar-product recommendation using graph mining methodology on the Amazon product co-purchasing network metadata dataset[1]. In detail, we want to build a similarity graph in which nodes are products (or customers) on Amazon and edges are weighted by the degree of similarity between nodes. We first demonstrate a baseline algorithm, the "Common Neighbors Algorithm"[2]. Then, we implement a machine learning algorithm that calculates similarity based on the attributes of nodes. Finally, we carry out a graph mining algorithm, the "Supervised Random Walk Algorithm"[3], which derives similarities among nodes from network properties. Our measurements of prediction accuracy are 1) accuracy of classification on labels and 2) the cardinality of the intersection of the top prediction set and the ground truth set. We compare the performance of product and customer similarity graphs and the accuracy of the different algorithms. With the above work, we explore in detail the attributes and network properties of the Amazon product co-purchasing network dataset.

KEYWORDS
Amazon metadata, e-commercial recommendation, graph mining, link analysis, similarity graph

1 INTRODUCTION

1.1 Project Motivation

E-commerce has spread widely since its first appearance on ARPANET in the 1970s; in 2003, Amazon posted its first-ever yearly profit. Following rapid economic development across the world, e-commerce makes it very convenient for us to shop remotely and efficiently. The emergence of online shopping websites such as Yahoo, eBay, Amazon and BestBuy has brought us great convenience. At the same time, the need for customers to make the best choices and for merchants to efficiently promote products has attracted increasing attention. Therefore, a recommendation system is necessary. Many online shopping websites now provide recommendation tools.
They present product bundles and similar products. However, they may not bring up the most-wanted and best-fit products for customers. Under this circumstance, we hope to explore several (currently three) of the many algorithms being studied, evaluate their performance, and hopefully put forward a solution that could optimize the recommendation system.

1.2 Problem Definition

Recommendation systems usually provide a list of recommendations in a variety of ways. We aim to construct a recommendation system that provides the top few similar commercial products for every product, so that customers can easily glance at other similar options while shopping online. For the purpose of constructing an effective recommendation system, we propose constructing a similarity graph, in which nodes represent either commercial products or customers and links denote the similarity between two nodes. This similarity value is a float ranging from 0 to 1, with larger values indicating a closer relationship. Our main technique for this task is link analysis (machine learning or graph mining algorithms) based on features of the dataset to generate a similarity graph. With link analysis, we are able to evaluate the existing relations among nodes and derive similarities. We can also predict connections that are not currently in the graph but may be formed in the future. Several challenges and limitations have been foreseen: (1) a large dataset may lead to difficulties in scalability. Categories in the Amazon metadata vary widely from art & literature to technology, from home supplies to specialized objectives. A larger dataset has a more complex network and properties.
Thus the choice of category and the respective analysis would be our primary difficulty; (2) designing appropriate features for machine learning algorithms: attributes of the items/network that contribute most to the quality of training and final precision. The main goal is to derive top-K recommendations from similarity graphs, and we want to optimize the accuracy of our recommendations compared with the ground truth similarity (provided in the Amazon metadata).

1.3 Dataset Information

Here we choose the Amazon product co-purchasing network metadata[1] as our dataset. This dataset contains 548,552 products and 1,788,725 product-product edges, which provides a fair amount of useful information for analysis. Each record has eight fields[6]:

Id: Product id (number 0,..., )
ASIN: Amazon Standard Identification Number
title: Name/title of the product
group: Product group (Book, DVD, Video or Music)
salesrank: Amazon Salesrank
similar: ASINs of co-purchased products (people who buy X also buy Y)
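As an illustration of how such records can be parsed, here is a minimal Python sketch; the field layout is assumed from the SNAP description of amazon-meta, and the embedded sample records are hypothetical stand-ins for the real file:

```python
SAMPLE = """\
Id:   1
ASIN: 0827229534
  title: Patterns of Preaching
  group: Book
  salesrank: 396585
  similar: 2  0804215715 156101074X

Id:   2
ASIN: 0738700797
  title: Candlemas
  group: Book
  salesrank: 168596
  similar: 1  0738700827
"""

def parse_records(text):
    """Parse amazon-meta style records into a list of dicts."""
    records, rec = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, val = line.split(":", 1)
        val = val.strip()
        if key == "Id":
            rec = {"Id": int(val)}   # "Id:" opens a new record
            records.append(rec)
        elif key == "similar":
            # "similar: N asin1 asin2 ..." -> keep only the ASINs
            rec["similar"] = val.split()[1:]
        elif key == "salesrank":
            rec["salesrank"] = int(val)
        elif key in ("ASIN", "title", "group"):
            rec[key] = val
    return records

records = parse_records(SAMPLE)
```

The "categories:" and "reviews:" blocks of the real file would need additional multi-line handling along the same lines.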

categories: Location in the product category hierarchy to which the product belongs
reviews: Product review information: time, user id, rating, total number of votes on the review, total number of helpfulness votes (how many people found the review to be helpful)

The "similar" field of the records is obtained from customer purchasing history, which can be used as ground truth for our research. Based on the attributes of the products, we have plenty of features for both machine learning and graph mining algorithms. Thus, we can use these techniques to effectively predict and evaluate the similarity between two products.

Figure 1. Category count distribution for all items

1.4 Data Preprocessing

For the sake of graph construction and machine-learning data preparation, preprocessing the raw data is necessary. We first parsed the raw dataset and constructed an <item, customers> table that stores a list of customers for each item and a <customer, items> table that stores a list of items purchased by each customer. Afterwards, we generated (1) an item-item graph by associating all pairs of items that have been purchased by the same customer and (2) a customer-customer graph by connecting all pairs of customers who have purchased the same item. Besides these two graphs, we also (1) constructed a ground truth graph in which items are nodes linked to their "similar items"; a fraction of this ground truth graph serves as the training data for supervised machine learning; and (2) reformatted the information of each item into a .csv file, including the item's ASIN, title, group, sales rank, customer ratings and categories. These attributes, especially categories, are fairly useful for predicting similarities between items. Challenges came when the dataset in hand was too large to process.
We first tried to construct an item-item graph based on all 548,552 items, yet the graph grew beyond what a personal computer can handle (more than 300 GB of space by our estimation). Therefore, we decided to narrow the dataset down and analyze items from specific categories. At this stage, we propose two different ways of filtering items: (1) keeping items that fall under any of the specified categories and (2) keeping items that fall under all specified categories. Figure 1 shows the category count distribution: the x axis represents the number of occurrences of a category among all items, and the y axis represents the frequency of that occurrence count. In addition, we processed the ground truth data accordingly by removing each item's "similar items" that are not in the graph. This way, we ensure that all ground truth items can be found in the constructed graph. Although this approach may affect the global accuracy of recommendation prediction, it makes analyzing the data and conducting our methodology possible on a personal computer. At this stage, it is more important to test algorithms and analyze meaningful results than to fully address scalability.

1.5 Data Analysis

To better understand the structure and properties of the network generated from the Amazon metadata, we randomly selected one of the original categories for analysis. Here, the category Publishing & Books was chosen and its properties are analyzed as follows. We generated an item-item graph from the item information of the original dataset, with edge weights representing the number of customers shared between items (Fig. 2).

Figure 2. GCC graph visualization

The New Weighted Network (NWN) is composed of 555 nodes and their edges, together with weights assigned to the links among the nodes. The data (shown in Table 1) demonstrate that the NWN is a highly centered and highly connected network.
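Statistics of this kind (connected components, average clustering coefficient) can be computed with NetworkX, which we use elsewhere; a dependency-free sketch on a toy graph, for illustration only:

```python
from collections import deque

def connected_components(adj):
    """Yield the node set of each connected component via BFS."""
    seen = set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        yield comp

def clustering(adj, u):
    """Local clustering coefficient: fraction of neighbor pairs linked."""
    nbrs = list(adj[u])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

# Toy graph: a triangle (1,2,3) with a pendant node 4 and a separate pair (5,6).
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

gcc = max(connected_components(adj), key=len)
gcc_node_frac = len(gcc) / len(adj)                    # share of nodes in the GCC
avg_cc = sum(clustering(adj, u) for u in gcc) / len(gcc)
```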
The Giant Connected Component (GCC) of the NWN contains 91.17% of the nodes and 99.89% of the edges of the NWN. That the marginal nodes and edges hold no more than 10% of the entire network indicates how centered it is, and the GCC diameter of 6 further supports this deduction. Besides, there are more than 1 million triangles in the GCC, from which we can conjecture that the item-item graph is highly connected. To verify this hypothesis, we analyzed the clustering coefficients, path lengths and degree distribution of the GCC. Results show that the average clustering coefficient is 0.81 and the average shortest path length is 2.04, which match our previous assumptions and our expectations. The degree distribution plot provides a more visual picture of the centrality and connectivity of the network (Fig. 3). Beyond verifying these properties, we can discover more details about the network from the plot. Nearly 30% of the nodes have high degrees, while around 20% of the nodes have low degrees. This difference could result from our way of separating the data: since we chose only one category from the network, some links present in the original graph are dropped from ours. The vanishing of these marginal links between our NWN and the entire network also explains the appearance of marginal nodes with low degrees. This loss actually helps us understand the potential properties of the entire network.

Table 1. Basic statistics of GCC and NWN

Figure 3. Degree distribution plot of GCC

2 RELATED WORK

Ideas for recommendation systems emerged long ago and many algorithms have been studied; most target generic recommendation systems or new recommendation algorithms, and very few conduct thorough experiments on e-commerce recommendation via link prediction while comparing the performance of multiple algorithms. Some algorithms are carefully designed but not practical for academic study: in Linden et al.'s study on e-commerce recommendation systems, they proposed Item-to-Item Collaborative Filtering[4], which effectively personalizes recommendations for each customer. Item-to-item collaborative filtering analyzes each customer's shopping cart, purchased items and rated items, matching them to similar items before finalizing a customized recommendation list. However, such an algorithm can be inefficient in terms of processing time and memory usage[4] when many product pairs have no common customers. In addition, this algorithm can only be used internally, since customers' credentials are not accessible outside the company.
Therefore, although this algorithm may achieve high recommendation accuracy, it could be impractical for generic application purposes. Many researchers are also interested in link prediction and its broad applications: Liben-Nowell and Kleinberg[2] proposed multiple link prediction methods for social networks, such as the common-neighbor algorithm, graph distance, rooted PageRank, SimRank, etc. They posed the challenge of inferring which new interactions among members of a social network are likely to occur in the near future, and developed link-prediction approaches to analyze the proximity of nodes in a network. However, social networks and product networks have many different properties, and Liben-Nowell and Kleinberg did not explore the potential of their algorithms for e-commerce recommendation systems. Despite that, we employed their common-neighbor algorithm as our baseline, considering its practicality and operability. Beyond that, many researchers have devised innovative recommendation algorithms. Li et al. proposed a new product recommendation algorithm for bipartite networks via link prediction[7]: instead of relying solely on the topological features of the graph, as most network-based recommendation algorithms do, they combined link prediction methods with domain knowledge to study the evolution of interactions among consumers and products, with weights assigned to products based on domain similarities. Although their algorithm can be expected to achieve high prediction accuracy, they only presented theoretical hypotheses and limited evaluations, without comparison to other algorithms. Our work, in contrast, applies multiple approaches to e-commerce product recommendation via link prediction and compares their performance.
As large e-commerce businesses such as Amazon.com expand, choosing an effective and efficient recommendation algorithm is necessary for improving user experience and maximizing revenue.

3 METHODOLOGY

3.1 Baseline Algorithm

One baseline algorithm we use is the "Common-Neighbor Algorithm" described in The link-prediction problem for social networks[2]. For a node x, let τ(x) denote the set of neighbors of x; in our case, the set of customers who bought item x (or, in a customer graph, the set of items purchased by customer x). The similarity score between nodes x and y is defined as:

score(x, y) = |τ(x) ∩ τ(y)|

The intuition is that, in a product similarity graph, if two items are bought by two groups of customers whose intersection is relatively large, then these two items might be similar, following the definition of similarity as "people who buy X also buy Y". In a customer similarity graph, two people might have similar interests if they buy almost the same set of items.
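A minimal sketch of this score, with hypothetical customer sets standing in for the real τ(·):

```python
# τ(item) = set of customers who bought the item (hypothetical data).
tau = {
    "itemX": {"c1", "c2", "c3"},
    "itemY": {"c2", "c3", "c4"},
    "itemZ": {"c5"},
}

def score(tau, x, y):
    """Common-neighbor similarity |τ(x) ∩ τ(y)|."""
    return len(tau[x] & tau[y])
```

itemX and itemY share two buyers, so their score is 2; itemX and itemZ share none, so their score is 0.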

In order to predict the similarity of two items, we apply the Common-Neighbor Algorithm to the item-item graph described in Section 1.4. We use the item-item graph as the training set and the ground truth as the testing set. The ground truth contains the actual similarity between two items obtained from the Amazon metadata. Based on the item-item graph, we compute the similarity score between two nodes as their number of common neighbors. Then, we sort the list of node pairs by decreasing similarity score. Finally, we pick the top n pairs in the list, where n is the number of edges in the ground truth. Compared with the actual similarity between nodes, the accuracy of similarity predictions based on the common-neighbor algorithm is 5.99%. The accuracy is the proportion of ground truth edges predicted correctly. After discussion, the two main reasons why the accuracy is low are as follows. First, the item-item graph is not large enough (it involves only 555 nodes at present); the more information the item-item graph contains, the more accurate the prediction would be. Second, in the Amazon dataset the maximum number of similar items listed for one specific item is 5, whereas the true number of similar items is obviously not limited to five. For example, the first pair of items in our predicted list are the same book in two different editions; they are not counted as similar items because of the five-item limit. Furthermore, we compare the performance of our algorithm with that described in The link-prediction problem for social networks[2]. There, the common-neighbor algorithm is applied to five co-authorship networks and the accuracy is between 5% and 10%, slightly higher than ours. A possible reason is that their algorithm predicts future new links between two authors.
Our goal, however, is to predict the similarity of two items based on the item-item graph, which is a different task.

3.2 Machine Learning Approach

The machine learning algorithm we use takes advantage of the features of the product nodes. We may not be able to apply machine learning to the customer similarity graph, since customer information is too confidential to obtain; with the product similarity graph, however, we can still do much meaningful work. The intuition is that, with the provided product features, e.g. title, group, hierarchical category and ratings, the algorithm will be able to capture the key properties of similar items. One obvious example: when two items are under the same hierarchical category and have nearly the same ratings, they are quite likely to be similar, because they may serve the same function and have the same quality. To make this concrete, "Logitech G303 Gaming Mouse" and "Razer Chroma Gaming Mouse" are both under the category "Computers & Accessories > Computer Accessories > Game Hardware > PC Game Hardware > Gaming Mice" and have similar ratings. We are confident in saying that these two products are similar. The features we extract from the Amazon metadata cover five similarities and three graph properties:

(1) Title similarity: we obtain TF-IDF vectors for each title and use the cosine similarity between two products' titles as a feature. The intuition is that the title of a product matters.
(2) Group similarity: the feature is 1 if two products are in the same general group, 0 otherwise. If two products are not even in the same general group, it is very hard for them to be similar items (consider the gaming mouse example above).
(3) Number-of-reviews similarity: the feature is the absolute difference in the number of reviews between two products. The idea is that similar items might draw similar attention from customers.
(4) Category similarity: the feature is the cardinality of the intersection of the two products' detailed category sets. In the above example, both gaming mice have the same set of detailed categories: "Computers & Accessories", "Computer Accessories", "Game Hardware", "PC Game Hardware", "Gaming Mice". Thus the cardinality of the intersection is 5, which is used as a feature.
(5) Rating similarity: the feature is the difference in average ratings between two items. The thought is that two similar items should have similar ratings (or quality).
(6) Node popularity: the feature is the sum of the degrees of the two nodes in the similarity graph. Higher degree means higher popularity within the Amazon dataset, which might indicate core products that are similar to many other products.
(7) Distance in graph: the feature is the distance between the two nodes in the similarity graph. Greater distance means the two products are less similar; disconnected nodes strongly indicate that they are not related at all.
(8) Common neighbors: the feature is the number of common neighbors between the two nodes (which then have distance 1 or 2 between each other; two nodes at distance greater than 2 have no common neighbors). This gives a fine-grained view of how similar two products are.

The supervised labels are obtained from partial ground truth. The ground truth is read in a format resembling an adjacency list, where a row indicates a group of nodes similar to each other. Using NetworkX, we build a ground truth graph from the list of nodes; an edge in this graph means two nodes are closely similar to each other (labeled 1 for classification). Two nodes connected only by a longer path may be roughly similar (labeled 0). Two nodes that are disconnected are probably not similar items (labeled -1). We output the graph as node-node pairs with their labels (for example, "item1, item2, -1").
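A dependency-free sketch of features (1)-(5), where a plain term-frequency cosine stands in for the TF-IDF cosine of feature (1) (an assumption, to avoid external libraries) and the two product records are hypothetical:

```python
import math
from collections import Counter

def title_cosine(t1, t2):
    # Term-frequency cosine over whitespace tokens; the paper uses
    # TF-IDF vectors, so this is a simplified stand-in.
    a, b = Counter(t1.lower().split()), Counter(t2.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def pair_features(p, q):
    return [
        title_cosine(p["title"], q["title"]),              # (1) title similarity
        1.0 if p["group"] == q["group"] else 0.0,          # (2) group match
        abs(p["n_reviews"] - q["n_reviews"]),              # (3) review-count gap
        len(set(p["categories"]) & set(q["categories"])),  # (4) shared categories
        abs(p["rating"] - q["rating"]),                    # (5) rating gap
    ]

# Hypothetical records for the two gaming-mouse products discussed above.
p = {"title": "Logitech G303 Gaming Mouse", "group": "Electronics",
     "n_reviews": 120, "categories": ["Game Hardware", "Gaming Mice"], "rating": 4.5}
q = {"title": "Razer Chroma Gaming Mouse", "group": "Electronics",
     "n_reviews": 100, "categories": ["Game Hardware", "Gaming Mice"], "rating": 4.3}
feats = pair_features(p, q)
```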
We further divide the output file into two parts: a training set and a testing set. With the above similarity features extracted from the training set, we used an SVM as our classification algorithm, with partial ground truth as labels. Since the labels are heavily unbalanced (1 has much lower frequency than 0 and -1), we set the SVM's class_weight property to "balanced". We then use the SVM classifier to label item pairs in the testing set.
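The distance-based labeling described above can be sketched with a breadth-first search standing in for NetworkX (the tiny ground-truth graph is hypothetical):

```python
from collections import deque

def bfs_dist(adj, s, t):
    """Shortest-path length from s to t; None if disconnected."""
    if s == t:
        return 0
    seen, queue = {s}, deque([(s, 0)])
    while queue:
        u, d = queue.popleft()
        for v in adj.get(u, ()):
            if v == t:
                return d + 1
            if v not in seen:
                seen.add(v)
                queue.append((v, d + 1))
    return None

def label(adj, x, y):
    # 1: directly linked (closely similar), 0: connected by a longer
    # path (roughly similar), -1: disconnected (not similar).
    d = bfs_dist(adj, x, y)
    if d == 1:
        return 1
    return 0 if d is not None else -1

truth = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
```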

3.3 Graph Mining: Supervised Random Walks Algorithm

The graph mining algorithm we use is the "Supervised Random Walks Algorithm" described in Supervised Random Walks: Predicting and Recommending Links in Social Networks[3]. This algorithm considers both the network structure and the features of nodes and edges. The intuition behind the algorithm is that the provided features bias a PageRank-like random walk on the constructed network graph. A basic random walk on the graph starts from a node chosen uniformly at random, picks one of the outgoing edges uniformly at random, and then moves to the destination of that edge. It repeats this step, computing the PageRank weights until convergence. The transition probability of each edge in the basic random walk is equally distributed. In our algorithm, however, we use the attribute data of nodes and edges to bias the random walk, assigning each edge (u, v) a transition probability based on the features of nodes u and v. To predict new edges of a node s, we follow the steps described in [3]:

(1) Calculate the edge strength a_uv = f_w(ψ_uv) of all edges, where ψ_uv is the edge feature vector between nodes u and v and w is the parameter vector.
(2) Create a weighted graph by assigning each edge its strength a_uv.
(3) Run a random walk with restarts from node s; each node u then has a PageRank score p_u.
(4) Sort the nodes other than s by decreasing PageRank score.
(5) The top k nodes with the highest scores are the predicted destination nodes of node s.

For step (3), we should mention the optimization problem of finding the optimal parameter vector w of the edge strength function f_w(ψ_uv). We label the nodes to which node s will create edges in the future as the destination node set D, and the other nodes, to which s will not create new edges, as the no-link node set L.
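A minimal sketch of steps (1)-(3), using a logistic edge-strength function as in the paper and power iteration for the restarted walk; the 3-node graph and its edge strengths are hypothetical, and learning the parameter vector w is omitted:

```python
import math

def edge_strength(psi, w):
    # Logistic edge strength a_uv = 1 / (1 + exp(-psi_uv . w)).
    return 1.0 / (1.0 + math.exp(-sum(p * q for p, q in zip(psi, w))))

def rwr_scores(adj, s, strength, alpha=0.5, iters=200):
    """PageRank-style random walk with restart probability alpha to s."""
    p = {u: 1.0 / len(adj) for u in adj}
    for _ in range(iters):
        nxt = {u: (alpha if u == s else 0.0) for u in adj}
        for u in adj:
            total = sum(strength[(u, v)] for v in adj[u])
            for v in adj[u]:
                # Move (1 - alpha) of u's mass along edges, biased by strength.
                nxt[v] += (1.0 - alpha) * p[u] * strength[(u, v)] / total
        p = nxt
    return p

# Hypothetical graph; real strengths would come from f_w(psi_uv).
adj = {"s": ["a", "b"], "a": ["s", "b"], "b": ["s", "a"]}
strength = {("s", "a"): 0.9, ("s", "b"): 0.1,
            ("a", "s"): 0.5, ("a", "b"): 0.5,
            ("b", "s"): 0.5, ("b", "a"): 0.5}
scores = rwr_scores(adj, "s", strength)
# Ranking the nodes other than s by score gives the predicted links.
```

Here the strong edge (s, a) biases the walk toward a, so a outranks b.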
The optimization problem is:

min_w F(w) = ||w||^2 + λ Σ_{d∈D, l∈L} h(p_l - p_d)

where λ is the regularization parameter used to prevent over-fitting of the model, h(·) is a loss function that takes the difference between p_l and p_d and assigns a penalty, and p is the vector of PageRank scores. Solving this optimization problem is involved, so we do not discuss it in this paper; for details, refer to Supervised Random Walks: Predicting and Recommending Links in Social Networks[3]. Next, we list the choices made in our algorithm. Following Backstrom's paper, we choose the functions and parameters whose performance is best:

(1) Loss function: h(x) = 1 / (1 + exp(-x/b)) with b = 0.01
(2) Edge strength function: a_uv = 1 / (1 + exp(-ψ_uv · w))
(3) Random walk restart parameter: α = 0.5
(4) Regularization parameter: λ = 1
(5) Feature vector: ψ_uv = [number-of-reviews similarity, category similarity]

4 EVALUATION

In this part, we include another category of the original set, Finance & Investing, in our training and testing. Figures 4-6 show the basic properties of the new category.

Figure 4. Degree rank plot of GCC-Finance & Investing
Figure 5. Degree distribution plot of GCC-Finance & Investing
Figure 6. Network-Finance & Investing

Category | Diameter | Triangle No. | Clustering Coefficient | Shortest Path Length
Publishing & Books
Finance & Investing

Unlike the previous analysis of the Publishing & Books sub-dataset, the new dataset has more nodes with low degrees, which means this network is not as centered as the former one. Also, compared with the former network, this one is closer to a real-world network under the assumption of a power-law degree distribution.

4.1 Performance Determination Criteria

To evaluate the performance of our algorithms, we compare our prediction results to the ground truth (similar items) obtained from the Amazon dataset. We present two kinds of metrics. The first evaluation metric, which is easy to compute, measures classification accuracy. From the ground truth, we built a ground truth graph using the Python package NetworkX. Then, for each pair of nodes in the graph, we assign labels -1, 0, 1 based on their relative position in the graph, as in Fig. 6:

Label 1: two connected nodes at distance 1 are closely similar to each other
Label 0: two connected nodes at distance greater than 1 may be roughly similar
Label -1: two disconnected nodes are not similar

For testing, our algorithm predicts a label for each pair of nodes, and we calculate the classification accuracy (Measurement #1):

accuracy = (number of correct classifications) / (total number of classifications)

Note that the graph is sparse (a trivial prediction of the majority label would achieve 99% accuracy); therefore we also measure the accuracy of correct labeling on label 1 (Measurement #2):

accuracy = (number of correct classifications on label 1) / (total number of label-1 pairs)

Another metric focuses on the accuracy of the top K recommendations for each item. First, we compute similarity scores between a node and all of its neighbors using the algorithms mentioned previously. Based on the similarity scores, we present the top K similar items as our results.
We then calculate the cardinality of the intersection of the top-K result set and the ground truth item set, and we use the ratio of the intersection size to the ground truth size as our prediction accuracy. We compare the accuracies of top-10 recommendations (Measurement #3) and top-5 recommendations (Measurement #4). For each item:

ratio = (number of correct top-K recommendations) / (number of ground truth recommendations)
accuracy = average of the ratios over all items

4.2 Performance of Baseline Algorithm

The Common-Neighbor Algorithm serves as our baseline. To see how it performs, we first run the measurements on a small subcategory, Publishing & Books, which contains 1099 nodes (670 of them isolated) and 863 edges; see the detailed analysis of this small network in Section 1.5. Accuracies for the four measurements are listed in the table below.

Common Neighbors

For comparison, we then run tests on a slightly bigger subcategory, Finance & Investing, which contains 1545 nodes (820 of them isolated) and 1233 edges (see the analysis in Section 4). Here is the result:

Common Neighbors

From the two tables above, we can see that our baseline algorithm suffers from the unbalanced dataset (compare Measurements #1 and #2). It also performs poorly on recommendation accuracy, with a sharp decrease between the first two measurements. This may result from the fact that it is a naive way of predicting similarity among items: it purely exploits the structure of the graph without taking into consideration factors such as the titles and ratings of items. Another limitation of the Common-Neighbor algorithm comes from our measurement technique. For Measurements #3 and #4, we form a similarity graph whose number of edges matches that of the ground truth graph, which means that common neighbors makes only 863 predictions.
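The four measurements can be sketched as follows; the toy label vectors and ranking are illustrative only:

```python
def accuracy(pred, truth):
    # Measurement #1: fraction of all pairs labeled correctly.
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def accuracy_on_label_1(pred, truth):
    # Measurement #2: accuracy restricted to ground-truth label-1 pairs.
    ones = [i for i, t in enumerate(truth) if t == 1]
    return sum(pred[i] == 1 for i in ones) / len(ones)

def topk_ratio(ranked, ground_truth, k):
    # Measurements #3/#4: |top-K ∩ ground truth| / |ground truth|;
    # averaging this ratio over items gives the overall accuracy.
    return len(set(ranked[:k]) & set(ground_truth)) / len(ground_truth)

truth = [1, -1, -1, 1, 0, -1]
pred  = [1, -1,  0, 0, 0, -1]
```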
We sort all node pairs by their number of common neighbors in descending order, take the top 863 pairs and exclude the rest. This method keeps our predictions on the same scale as the ground truth, but it degrades our prediction accuracy. Approaches that take more features of the network into account should improve on this with corresponding adjustments. In the following analysis, the same four measurements are used to compare the different algorithms.

4.3 Performance of Machine Learning Approach

To measure the performance of the machine learning predictions, we first run measurements on the small subcategory Publishing & Books. In addition, we make some adjustments to the Linear-SVC classifier for the skewed label distribution of the dataset: most labels in our training set are -1, so we assign less weight to label -1 and more weight to labels 0 and 1. In our machine learning approach, we chose four different algorithms, namely linear regression, Linear-SVC with balanced class weight, Linear-SVC, and SVC.

Linear regression | Linear-SVC balanced | Linear-SVC | SVC

As with our baseline algorithm, we also run measurements on the medium-sized subcategory Finance & Investing, which contains 1545 nodes (820 of them isolated) and 1233 edges. The results involve the same algorithms as for the previous sub-dataset.

Linear regression | Linear-SVC balanced | Linear-SVC | SVC

From the results above, we notice that Linear-SVC with balanced class weight achieves the best results on both sub-datasets. It performs better than the other three algorithms because they train on an unbalanced dataset (one label dominates); in other words, our graphs are sparse. Assigning class weights inversely proportional to label frequency helps reduce testing errors. In addition, Linear-SVC-balanced has the smallest gap between the first two measurements, which means it makes the most reliable predictions under unbalanced labels. We also notice that linear regression performs poorly on both training sets, achieving only about 20% recommendation accuracy on average in Measurement #4; more concretely, only 1 in 5 recommendations would succeed. The reason might be that the features and labels are not in a linear relationship. Also, with an unbalanced training set, linear regression tends to make trivial predictions and cannot distinguish the slight differences among similar items. Comparing results between the two datasets, we can see that Linear-SVC balanced performs better on the second dataset while all the other algorithms perform worse on it. One possible explanation is that the second dataset is sparser than the first, making it more unbalanced. Beyond this, all algorithms perform on roughly the same scale on both datasets; increasing the dataset size does not significantly change the classification or recommendation accuracy. Despite those drawbacks, the best algorithm we used achieves 80% accuracy on top-5 recommendations (4 out of 5 recommendations are correct), already a significant improvement over our baseline algorithm. Although it observes partial ground truth and uses features from the ground truth graph, it can still generate recommendations when provided with some user behavior.

4.4 Performance of Supervised Random Walk Algorithm

In addition to the previous experiments on both datasets, we also introduce a graph mining approach, the Supervised Random Walk algorithm.
The performances of Supervised Random Walks on the subcategories are presented as follows:

Publishing&Books | Finance&Investing

From the table above, we notice that the pattern across measurements differs from what the other algorithms have shown. An interesting finding is that for Measurements #1 and #2, precision on the Finance & Investing dataset is higher, while for the other measurements it is the other way around. Besides, the sharp drop in Measurement #2 draws our attention to the reasons behind it. The diameters of the two networks are 6 and 7 for Publishing & Books and Finance & Investing respectively, which are close to each other and not large. The reason for the poor performance on Measurement #2 may be that short walks can get stuck in a local network neighborhood. Compared with the machine learning approach, the performance of the Supervised Random Walks algorithm is not as good. The diameters of these two subcategories and other network properties may help explain its low precision: both networks are highly connected with high clustering coefficients, so the restarted walk tends to stay in dense local neighborhoods and ranks many nearby nodes similarly. Under these circumstances, we can take some steps toward a partial remedy. Zhue et al. use Metropolis-Hastings biasing, in which they tweak the input graph. This does improve the performance of the random walk algorithm, but such tricks are not easy to find for all cases. Further study on this topic would involve a better supervised random walks algorithm.

4.5 Avoiding Over-fitting

When conducting the supervised machine learning experiments, we employed two methods to avoid overfitting: (1) feature choice: we chose a limited number of features that contribute significantly to classification.
Too many features, or features of trivial significance, may result in overfitting; (2) validation set: we prepared a validation set disjoint from both the training set and the test set and ran the classifier on it. If the data is not overfitted, predictions on the validation set should perform similarly to predictions on the test set.

5 CONCLUSIONS

In this report, we implemented three approaches to build a recommendation system on the Amazon dataset: (1) the Common-Neighbors algorithm (baseline); (2) supervised machine learning; (3) Supervised Random Walks. We then proposed four measurement metrics to evaluate the performance of these approaches: (1) accuracy of classification; (2) accuracy of classification on label 1; (3) accuracy of top-10 recommendations; (4) accuracy of top-5 recommendations. Our experiments show that supervised machine learning, specifically the Linear-SVC-balanced model, performs best among all algorithms on all four metrics. The Common-Neighbors baseline, although it predicts well under metric #1, performs very poorly under the other three. The Supervised Random Walk algorithm outperforms the baseline on metrics #2, #3, and #4, yet it does not reach the accuracy of the supervised machine learning approach. However, with some data modification (e.g., tweaking the input graph), the Supervised Random Walk algorithm could perform better; we leave this to future study. Also, due to limitations of our experimental conditions, we have yet to try our approaches on larger datasets; the scalability of these algorithms is another direction for further study.

REFERENCES
[1] J. Leskovec, L. Adamic, and B. Huberman. The Dynamics of Viral Marketing. ACM Transactions on the Web (ACM TWEB), 1(1), 2007.
[2] D. Liben-Nowell and J. Kleinberg. The Link-Prediction Problem for Social Networks.
In CIKM, 2003.
[3] L. Backstrom and J. Leskovec. Supervised Random Walks: Predicting and Recommending Links in Social Networks. In WSDM, 2011.
[4] G. Linden, B. Smith, and J. York. Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Computing, 7(1):76-80, 2003.
[5] L. Lü and T. Zhou. Link Prediction in Complex Networks: A Survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150-1170, 2011.
[6] J. Leskovec. SNAP: Amazon Product Co-purchasing Network Metadata.
[7] J. Li, L. Zhang, F. Meng, and F. Li. Recommendation Algorithm Based on Link Prediction and Domain Knowledge in Retail Transactions. Procedia Computer Science, 31, 2014.


More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm

Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm Majid Hatami Faculty of Electrical and Computer Engineering University of Tabriz,

More information

The Complex Network Phenomena. and Their Origin

The Complex Network Phenomena. and Their Origin The Complex Network Phenomena and Their Origin An Annotated Bibliography ESL 33C 003180159 Instructor: Gerriet Janssen Match 18, 2004 Introduction A coupled system can be described as a complex network,

More information

CSC 2515 Introduction to Machine Learning Assignment 2

CSC 2515 Introduction to Machine Learning Assignment 2 CSC 2515 Introduction to Machine Learning Assignment 2 Zhongtian Qiu(1002274530) Problem 1 See attached scan files for question 1. 2. Neural Network 2.1 Examine the statistics and plots of training error

More information

Application of Support Vector Machine Algorithm in Spam Filtering

Application of Support Vector Machine Algorithm in  Spam Filtering Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification

More information

V Conclusions. V.1 Related work

V Conclusions. V.1 Related work V Conclusions V.1 Related work Even though MapReduce appears to be constructed specifically for performing group-by aggregations, there are also many interesting research work being done on studying critical

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

DIAL: A Distributed Adaptive-Learning Routing Method in VDTNs

DIAL: A Distributed Adaptive-Learning Routing Method in VDTNs : A Distributed Adaptive-Learning Routing Method in VDTNs Bo Wu, Haiying Shen and Kang Chen Department of Electrical and Computer Engineering Clemson University, Clemson, South Carolina 29634 {bwu2, shenh,

More information

Link prediction in graph construction for supervised and semi-supervised learning

Link prediction in graph construction for supervised and semi-supervised learning Link prediction in graph construction for supervised and semi-supervised learning Lilian Berton, Jorge Valverde-Rebaza and Alneu de Andrade Lopes Laboratory of Computational Intelligence (LABIC) University

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information