E-commercial Recommendation Algorithms Based on Link Analysis


Guanlin Li, Le Lu (l2lu@ucsd.edu), Shulin Cao, Junjie Zhu (juz091@eng.ucsd.edu)
Instructor: Fragkiskos Malliaros
University of California, San Diego
June 11, 2017

ABSTRACT

This paper focuses on similar-product recommendation using graph mining methodology on the Amazon product co-purchasing network metadata dataset[1]. In detail, we want to build a similarity graph in which nodes are products (or customers) on Amazon and edges are weighted by the degree of similarity between nodes. We first demonstrate a baseline algorithm, the "Common Neighbors Algorithm"[2]. Then, we implement a machine learning algorithm that calculates similarity based on the attributes of nodes. Finally, we carry out a graph mining algorithm, the "Supervised Random Walk Algorithm"[3], which derives similarities among nodes from network properties. Our measurements of prediction accuracy are 1) accuracy of classification on labels and 2) the cardinality of the intersection of the top prediction set and the ground truth set. We compare the performance of product and customer similarity graphs and the accuracy of the different algorithms. With the above work, we explore in detail the attributes and network properties of the Amazon product co-purchasing network dataset.

KEYWORDS
Amazon metadata, e-commercial recommendation, graph mining, link analysis, similarity graph

1 INTRODUCTION

1.1 Project Motivation

E-commerce has spread widely since its first appearance on ARPANET in the 1970s; in 2003, Amazon posted its first-ever yearly profit. Following rapid economic development across the world, e-commerce makes it very convenient for us to shop remotely and efficiently. The emergence of online shopping websites such as Yahoo, eBay, Amazon and BestBuy has brought us great convenience. At the same time, the need for customers to make the best choices and for merchants to efficiently promote products has attracted increasing attention. Therefore, a recommendation system is necessary. Many online shopping websites now provide recommendation tools.
They present product bundles and similar products. However, they may not bring up the most-wanted and best-fit products for customers. Under this circumstance, we hope to explore several (currently three) of the many algorithms being studied, evaluate their performance, and hopefully put forward a solution that could optimize the recommendation system.

1.2 Problem Definition

Recommendation systems usually provide a list of recommendations in a variety of ways. We aim to construct a recommendation system that provides the top few similar commercial products for every product, so that customers can easily glance at other similar options while shopping online. For the purpose of constructing an effective recommendation system, we propose constructing a similarity graph, in which nodes represent either commercial products or customers and links denote the similarity between two nodes. This similarity value is a float ranging from 0 to 1, with larger values indicating a closer relationship. Our main technique for this task is link analysis (machine learning or graph mining algorithms) based on features of the dataset to generate a similarity graph. With link analysis, we are able to evaluate the existing relations among nodes and derive similarities. We can also predict connections that are not currently in the graph but may be formed in the future. Several challenges and limitations have been foreseen: (1) a large dataset may lead to difficulties in scalability. Categories in the Amazon metadata vary widely from art & literature to technology, from home supplies to specialized objectives. A larger dataset has a more complex network and properties.
Thus the choice of category and the respective analysis would be our primary difficulty; (2) designing appropriate features for machine learning algorithms: attributes of the items/network that contribute most to the quality of training and final precision. The main goal is to derive top-K recommendations from similarity graphs, and we want to optimize the accuracy of our recommendations compared with the ground truth similarity (provided in the Amazon metadata).

1.3 Dataset Information

Here we choose the Amazon product co-purchasing network metadata[1] as our dataset. This dataset contains 548,552 products and 1,788,725 product-product edges, which provides a fair amount of useful information for analysis. Each record has eight fields[6]:

Id: Product id (number 0,..., )
ASIN: Amazon Standard Identification Number
title: Name/title of the product
group: Product group (Book, DVD, Video or Music)
salesrank: Amazon Salesrank
similar: ASINs of co-purchased products (people who buy X also buy Y)
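As an illustration of how such records can be parsed, here is a minimal Python sketch; the field layout is assumed from the SNAP description of amazon-meta, and the embedded sample records are hypothetical stand-ins for the real file:

```python
SAMPLE = """\
Id:   1
ASIN: 0827229534
  title: Patterns of Preaching
  group: Book
  salesrank: 396585
  similar: 2  0804215715 156101074X

Id:   2
ASIN: 0738700797
  title: Candlemas
  group: Book
  salesrank: 168596
  similar: 1  0738700827
"""

def parse_records(text):
    """Parse amazon-meta style records into a list of dicts."""
    records, rec = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, val = line.split(":", 1)
        val = val.strip()
        if key == "Id":
            rec = {"Id": int(val)}   # "Id:" opens a new record
            records.append(rec)
        elif key == "similar":
            # "similar: N asin1 asin2 ..." -> keep only the ASINs
            rec["similar"] = val.split()[1:]
        elif key == "salesrank":
            rec["salesrank"] = int(val)
        elif key in ("ASIN", "title", "group"):
            rec[key] = val
    return records

records = parse_records(SAMPLE)
```

The "categories:" and "reviews:" blocks of the real file would need additional multi-line handling along the same lines.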

categories: Location in the product category hierarchy to which the product belongs
reviews: Product review information: time, user id, rating, total number of votes on the review, total number of helpfulness votes (how many people found the review to be helpful)

The "similar" field of the records is obtained from customer purchasing history, which can be used as ground truth for our research. Based on the attributes of the products, we have plenty of features for both machine learning and graph mining algorithms. Thus, we can use these techniques to effectively predict and evaluate the similarity between two products.

Figure 1. Category count distribution for all items

1.4 Data Preprocessing

For the sake of graph construction and machine-learning data preparation, preprocessing the raw data is necessary. We first parsed the raw dataset and constructed an <item, customers> table that stores a list of customers for each item and a <customer, items> table that stores a list of items purchased by each customer. Afterwards, we generated (1) an item-item graph by associating all pairs of items that have been purchased by the same customer and (2) a customer-customer graph by connecting all pairs of customers who have purchased the same item. Besides these two graphs, we also (1) constructed a ground truth graph in which items are nodes linked to their "similar items"; a fraction of this ground truth graph serves as the training data for supervised machine learning; and (2) reformatted the information of each item into a .csv file, including the item's ASIN, title, group, sales rank, customer ratings and categories. These attributes, especially categories, are fairly useful for predicting similarities between items. Challenges came when the dataset in hand was too large to process.
We first tried to construct an item-item graph based on all 548,552 items, yet the graph grew beyond what a personal computer can handle (more than 300 GB of space by our estimation). Therefore, we decided to narrow the dataset down and analyze items from specific categories. At this stage, we propose two different ways of filtering items: (1) keeping items that fall under any of the specified categories and (2) keeping items that fall under all specified categories. Figure 1 shows the category count distribution: the x axis represents the number of occurrences of a category among all items, and the y axis represents the frequency of that occurrence count. In addition, we processed the ground truth data accordingly by removing each item's "similar items" that are not in the graph. This way, we ensure that all ground truth items can be found in the constructed graph. Although this approach may affect the global accuracy of recommendation prediction, it makes analyzing the data and conducting our methodology possible on a personal computer. At this stage, it is more important to test algorithms and analyze meaningful results than to fully address scalability.

1.5 Data Analysis

To better understand the structure and properties of the network generated from the Amazon metadata, we randomly selected one of the original categories for analysis. Here, the category Publishing & Books was chosen and its properties are analyzed as follows. We generated an item-item graph from the item information of the original dataset, with edge weights representing the number of customers shared between items (Fig. 2).

Figure 2. GCC graph visualization

The New Weighted Network (NWN) is composed of 555 nodes and their edges, together with weights assigned to the links among the nodes. The data (shown in Table 1) demonstrate that the NWN is a highly centered and highly connected network.
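Statistics of this kind (connected components, average clustering coefficient) can be computed with NetworkX, which we use elsewhere; a dependency-free sketch on a toy graph, for illustration only:

```python
from collections import deque

def connected_components(adj):
    """Yield the node set of each connected component via BFS."""
    seen = set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        yield comp

def clustering(adj, u):
    """Local clustering coefficient: fraction of neighbor pairs linked."""
    nbrs = list(adj[u])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

# Toy graph: a triangle (1,2,3) with a pendant node 4 and a separate pair (5,6).
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

gcc = max(connected_components(adj), key=len)
gcc_node_frac = len(gcc) / len(adj)                    # share of nodes in the GCC
avg_cc = sum(clustering(adj, u) for u in gcc) / len(gcc)
```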
The Giant Connected Component (GCC) of the NWN contains 91.17% of the nodes and 99.89% of the edges of the NWN. That the marginal nodes and edges hold no more than 10% of the entire network indicates how centered it is, and the GCC diameter of 6 further supports this deduction. Besides, there are more than 1 million triangles in the GCC, from which we can conjecture that the item-item graph is highly connected. To verify this hypothesis, we analyzed the clustering coefficients, path lengths and degree distribution of the GCC. Results show that the average clustering coefficient is 0.81 and the average shortest path length is 2.04, which match our previous assumptions and our expectations. The degree distribution plot provides a more visual picture of the centrality and connectivity of the network (Fig. 3). Beyond verifying these properties, we can discover more details about the network from the plot. Nearly 30% of the nodes have high degrees, while around 20% of the nodes have low degrees. This difference could result from our way of separating the data: since we chose only one category from the network, some links present in the original graph are dropped from ours. The vanishing of these marginal links between our NWN and the entire network also explains the appearance of marginal nodes with low degrees. This loss actually helps us understand the potential properties of the entire network.

Table 1. Basic statistics of GCC and NWN

Figure 3. Degree distribution plot of GCC

2 RELATED WORK

Ideas for recommendation systems emerged long ago and many algorithms have been studied; most target generic recommendation systems or new recommendation algorithms, and very few conduct thorough experiments on e-commerce recommendation via link prediction while comparing the performance of multiple algorithms. Some algorithms are carefully designed but not practical for academic study: in Linden et al.'s study on e-commerce recommendation systems, they proposed Item-to-Item Collaborative Filtering[4], which effectively personalizes recommendations for each customer. Item-to-item collaborative filtering analyzes each customer's shopping cart, purchased items and rated items, matching them to similar items before finalizing a customized recommendation list. However, such an algorithm can be inefficient in terms of processing time and memory usage[4] when many product pairs have no common customers. In addition, this algorithm can only be used internally, since customers' credentials are not accessible outside the company.
Therefore, although this algorithm may achieve high recommendation accuracy, it could be impractical for generic application purposes. Many researchers are also interested in link prediction and its broad applications: Liben-Nowell and Kleinberg[2] proposed multiple link prediction methods for social networks, such as the common-neighbor algorithm, graph distance, rooted PageRank, SimRank, etc. They posed the challenge of inferring which new interactions among members of a social network are likely to occur in the near future, and developed link-prediction approaches to analyze the proximity of nodes in a network. However, social networks and product networks have many different properties, and Liben-Nowell and Kleinberg did not explore the potential of their algorithms for e-commerce recommendation systems. Despite that, we employed their common-neighbor algorithm as our baseline, considering its practicality and operability. Beyond that, many researchers have devised innovative recommendation algorithms. Li et al. proposed a new product recommendation algorithm for bipartite networks via link prediction[7]: instead of relying solely on the topological features of the graph, as most network-based recommendation algorithms do, they combined link prediction methods with domain knowledge to study the evolution of interactions among consumers and products, with weights assigned to products based on domain similarities. Although their algorithm can be expected to achieve high prediction accuracy, they only presented theoretical hypotheses and limited evaluations, without comparison to other algorithms. Our work, in contrast, applies multiple approaches to e-commerce product recommendation via link prediction and compares their performance.
As large e-commerce businesses such as Amazon.com expand, choosing an effective and efficient recommendation algorithm is necessary for improving user experience and maximizing revenue.

3 METHODOLOGY

3.1 Baseline Algorithm

One baseline algorithm we use is the "Common-Neighbor Algorithm" described in The link-prediction problem for social networks[2]. For a node x, let τ(x) denote the set of neighbors of x; in our case, the set of customers who bought item x (or, in a customer graph, the set of items purchased by customer x). The similarity score between nodes x and y is defined as:

score(x, y) = |τ(x) ∩ τ(y)|

The intuition is that, in a product similarity graph, if two items are bought by two groups of customers whose intersection is relatively large, then these two items might be similar, following the definition of similarity as "people who buy X also buy Y". In a customer similarity graph, two people might have similar interests if they buy almost the same set of items.
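A minimal sketch of this score, with hypothetical customer sets standing in for the real τ(·):

```python
# τ(item) = set of customers who bought the item (hypothetical data).
tau = {
    "itemX": {"c1", "c2", "c3"},
    "itemY": {"c2", "c3", "c4"},
    "itemZ": {"c5"},
}

def score(tau, x, y):
    """Common-neighbor similarity |τ(x) ∩ τ(y)|."""
    return len(tau[x] & tau[y])
```

itemX and itemY share two buyers, so their score is 2; itemX and itemZ share none, so their score is 0.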

In order to predict the similarity of two items, we apply the Common-Neighbor Algorithm to the item-item graph described in Section 1.4. We use the item-item graph as the training set and the ground truth as the testing set. The ground truth contains the actual similarity between two items obtained from the Amazon metadata. Based on the item-item graph, we compute the similarity score between two nodes as their number of common neighbors. Then, we sort the list of node pairs by decreasing similarity score. Finally, we pick the top n pairs in the list, where n is the number of edges in the ground truth. Compared with the actual similarity between nodes, the accuracy of similarity predictions based on the common-neighbor algorithm is 5.99%. The accuracy is the proportion of ground truth edges predicted correctly. After discussion, the two main reasons why the accuracy is low are as follows. First, the item-item graph is not large enough (it involves only 555 nodes at present); the more information the item-item graph contains, the more accurate the prediction would be. Second, in the Amazon dataset the maximum number of similar items listed for one specific item is 5, whereas the true number of similar items is obviously not limited to five. For example, the first pair of items in our predicted list are the same book in two different editions; they are not counted as similar items because of the five-item limit. Furthermore, we compare the performance of our algorithm with that described in The link-prediction problem for social networks[2]. There, the common-neighbor algorithm is applied to five co-authorship networks and the accuracy is between 5% and 10%, slightly higher than ours. A possible reason is that their algorithm predicts future new links between two authors.
Our goal, however, is to predict the similarity of two items based on the item-item graph, which is a different task.

3.2 Machine Learning Approach

The machine learning algorithm we use takes advantage of the features of the product nodes. We may not be able to apply machine learning to the customer similarity graph, since customer information is too confidential to obtain; with the product similarity graph, however, we can still do much meaningful work. The intuition is that, with the provided product features, e.g. title, group, hierarchical category and ratings, the algorithm will be able to capture the key properties of similar items. One obvious example: when two items are under the same hierarchical category and have nearly the same ratings, they are quite likely to be similar, because they may serve the same function and have the same quality. To make this concrete, "Logitech G303 Gaming Mouse" and "Razer Chroma Gaming Mouse" are both under the category "Computers & Accessories > Computer Accessories > Game Hardware > PC Game Hardware > Gaming Mice" and have similar ratings. We are confident in saying that these two products are similar. The features we extract from the Amazon metadata cover five similarities and three graph properties:

(1) Title similarity: we obtain TF-IDF vectors for each title and use the cosine similarity between two products' titles as a feature. The intuition is that the title of a product matters.
(2) Group similarity: the feature is 1 if two products are in the same general group, 0 otherwise. If two products are not even in the same general group, it is very hard for them to be similar items (consider the gaming mouse example above).
(3) Number-of-reviews similarity: the feature is the absolute difference in the number of reviews between two products. The idea is that similar items might draw similar attention from customers.
(4) Category similarity: the feature is the cardinality of the intersection of the two products' detailed category sets. In the above example, both gaming mice have the same set of detailed categories: "Computers & Accessories", "Computer Accessories", "Game Hardware", "PC Game Hardware", "Gaming Mice". Thus the cardinality of the intersection is 5, which is used as a feature.
(5) Rating similarity: the feature is the difference in average ratings between two items. The thought is that two similar items should have similar ratings (or quality).
(6) Node popularity: the feature is the sum of the degrees of the two nodes in the similarity graph. Higher degree means higher popularity within the Amazon dataset, which might indicate core products that are similar to many other products.
(7) Distance in graph: the feature is the distance between the two nodes in the similarity graph. Greater distance means the two products are less similar; disconnected nodes strongly indicate that they are not related at all.
(8) Common neighbors: the feature is the number of common neighbors between the two nodes (which then have distance 1 or 2 between each other; two nodes at distance greater than 2 have no common neighbors). This gives a fine-grained view of how similar two products are.

The supervised labels are obtained from partial ground truth. The ground truth is read in a format resembling an adjacency list, where a row indicates a group of nodes similar to each other. Using NetworkX, we build a ground truth graph from the list of nodes; an edge in this graph means two nodes are closely similar to each other (labeled 1 for classification). Two nodes connected only by a longer path may be roughly similar (labeled 0). Two nodes that are disconnected are probably not similar items (labeled -1). We output the graph as node-node pairs with their labels (for example, "item1, item2, -1").
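A dependency-free sketch of features (1)-(5), where a plain term-frequency cosine stands in for the TF-IDF cosine of feature (1) (an assumption, to avoid external libraries) and the two product records are hypothetical:

```python
import math
from collections import Counter

def title_cosine(t1, t2):
    # Term-frequency cosine over whitespace tokens; the paper uses
    # TF-IDF vectors, so this is a simplified stand-in.
    a, b = Counter(t1.lower().split()), Counter(t2.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def pair_features(p, q):
    return [
        title_cosine(p["title"], q["title"]),              # (1) title similarity
        1.0 if p["group"] == q["group"] else 0.0,          # (2) group match
        abs(p["n_reviews"] - q["n_reviews"]),              # (3) review-count gap
        len(set(p["categories"]) & set(q["categories"])),  # (4) shared categories
        abs(p["rating"] - q["rating"]),                    # (5) rating gap
    ]

# Hypothetical records for the two gaming-mouse products discussed above.
p = {"title": "Logitech G303 Gaming Mouse", "group": "Electronics",
     "n_reviews": 120, "categories": ["Game Hardware", "Gaming Mice"], "rating": 4.5}
q = {"title": "Razer Chroma Gaming Mouse", "group": "Electronics",
     "n_reviews": 100, "categories": ["Game Hardware", "Gaming Mice"], "rating": 4.3}
feats = pair_features(p, q)
```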
We further divide the output file into two parts: a training set and a testing set. With the above similarity features extracted from the training set, we used an SVM as our classification algorithm, with partial ground truth as labels. Since the labels are heavily unbalanced (1 has much lower frequency than 0 and -1), we set the SVM's class_weight property to "balanced". We then use the SVM classifier to label item pairs in the testing set.
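The distance-based labeling described above can be sketched with a breadth-first search standing in for NetworkX (the tiny ground-truth graph is hypothetical):

```python
from collections import deque

def bfs_dist(adj, s, t):
    """Shortest-path length from s to t; None if disconnected."""
    if s == t:
        return 0
    seen, queue = {s}, deque([(s, 0)])
    while queue:
        u, d = queue.popleft()
        for v in adj.get(u, ()):
            if v == t:
                return d + 1
            if v not in seen:
                seen.add(v)
                queue.append((v, d + 1))
    return None

def label(adj, x, y):
    # 1: directly linked (closely similar), 0: connected by a longer
    # path (roughly similar), -1: disconnected (not similar).
    d = bfs_dist(adj, x, y)
    if d == 1:
        return 1
    return 0 if d is not None else -1

truth = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
```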

3.3 Graph Mining: Supervised Random Walks Algorithm

The graph mining algorithm we use is the "Supervised Random Walks Algorithm" described in Supervised Random Walks: Predicting and Recommending Links in Social Networks[3]. This algorithm considers both the network structure and the features of nodes and edges. The intuition behind the algorithm is that the provided features bias a PageRank-like random walk on the constructed network graph. A basic random walk on the graph starts from a node chosen uniformly at random, picks one of the outgoing edges uniformly at random, and then moves to the destination of that edge. It repeats this step, computing the PageRank weights until convergence. The transition probability of each edge in the basic random walk is equally distributed. In our algorithm, however, we use the attribute data of nodes and edges to bias the random walk, assigning each edge (u, v) a transition probability based on the features of nodes u and v. To predict new edges of a node s, we follow the steps described in [3]:

(1) Calculate the edge strength a_uv = f_w(ψ_uv) of all edges, where ψ_uv is the edge feature vector between nodes u and v and w is the parameter vector.
(2) Create a weighted graph by assigning each edge its strength a_uv.
(3) Run a random walk with restarts from node s; each node u then has a PageRank score p_u.
(4) Sort the nodes other than s by decreasing PageRank score.
(5) The top k nodes with the highest scores are the predicted destination nodes of node s.

For step (3), we should mention the optimization problem of finding the optimal parameter vector w of the edge strength function f_w(ψ_uv). We label the nodes to which node s will create edges in the future as the destination node set D, and the other nodes, to which s will not create new edges, as the no-link node set L.
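A minimal sketch of steps (1)-(3), using a logistic edge-strength function as in the paper and power iteration for the restarted walk; the 3-node graph and its edge strengths are hypothetical, and learning the parameter vector w is omitted:

```python
import math

def edge_strength(psi, w):
    # Logistic edge strength a_uv = 1 / (1 + exp(-psi_uv . w)).
    return 1.0 / (1.0 + math.exp(-sum(p * q for p, q in zip(psi, w))))

def rwr_scores(adj, s, strength, alpha=0.5, iters=200):
    """PageRank-style random walk with restart probability alpha to s."""
    p = {u: 1.0 / len(adj) for u in adj}
    for _ in range(iters):
        nxt = {u: (alpha if u == s else 0.0) for u in adj}
        for u in adj:
            total = sum(strength[(u, v)] for v in adj[u])
            for v in adj[u]:
                # Move (1 - alpha) of u's mass along edges, biased by strength.
                nxt[v] += (1.0 - alpha) * p[u] * strength[(u, v)] / total
        p = nxt
    return p

# Hypothetical graph; real strengths would come from f_w(psi_uv).
adj = {"s": ["a", "b"], "a": ["s", "b"], "b": ["s", "a"]}
strength = {("s", "a"): 0.9, ("s", "b"): 0.1,
            ("a", "s"): 0.5, ("a", "b"): 0.5,
            ("b", "s"): 0.5, ("b", "a"): 0.5}
scores = rwr_scores(adj, "s", strength)
# Ranking the nodes other than s by score gives the predicted links.
```

Here the strong edge (s, a) biases the walk toward a, so a outranks b.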
The optimization problem is:

min_w F(w) = ||w||^2 + λ Σ_{d∈D, l∈L} h(p_l - p_d)

where λ is the regularization parameter used to prevent over-fitting of the model, h(·) is a loss function that takes the difference between p_l and p_d and assigns a penalty, and p is the vector of PageRank scores. Solving this optimization problem is involved, so we do not discuss it in this paper; for details, refer to Supervised Random Walks: Predicting and Recommending Links in Social Networks[3]. Next, we list the choices made in our algorithm. Following Backstrom's paper, we choose the functions and parameters whose performance is best:

(1) Loss function: h(x) = 1 / (1 + exp(-x/b)) with b = 0.01
(2) Edge strength function: a_uv = 1 / (1 + exp(-ψ_uv · w))
(3) Random walk restart parameter: α = 0.5
(4) Regularization parameter: λ = 1
(5) Feature vector: ψ_uv = [number-of-reviews similarity, category similarity]

4 EVALUATION

In this part, we include another category of the original set, Finance & Investing, in our training and testing. Figures 4-6 show the basic properties of the new category.

Figure 4. Degree rank plot of GCC-Finance & Investing
Figure 5. Degree distribution plot of GCC-Finance & Investing
Figure 6. Network-Finance & Investing

Category | Diameter | Triangle No. | Clustering Coefficient | Shortest Path Length
Publishing & Books
Finance & Investing

Unlike the previous analysis of the Publishing & Books sub-dataset, the new dataset has more nodes with low degrees, which means this network is not as centered as the former one. Also, compared with the former network, this one is closer to a real-world network under the assumption of a power-law degree distribution.

4.1 Performance Determination Criteria

To evaluate the performance of our algorithms, we compare our prediction results to the ground truth (similar items) obtained from the Amazon dataset. We present two kinds of metrics. The first evaluation metric, which is easy to compute, measures classification accuracy. From the ground truth, we built a ground truth graph using the Python package NetworkX. Then, for each pair of nodes in the graph, we assign labels -1, 0, 1 based on their relative position in the graph, as in Fig. 6:

Label 1: two connected nodes at distance 1 are closely similar to each other
Label 0: two connected nodes at distance greater than 1 may be roughly similar
Label -1: two disconnected nodes are not similar

For testing, our algorithm predicts a label for each pair of nodes, and we calculate the classification accuracy (Measurement #1):

accuracy = (number of correct classifications) / (total number of classifications)

Note that the graph is sparse (a trivial prediction of the majority label would achieve 99% accuracy); therefore we also measure the accuracy of correct labeling on label 1 (Measurement #2):

accuracy = (number of correct classifications on label 1) / (total number of label-1 pairs)

Another metric focuses on the accuracy of the top K recommendations for each item. First, we compute similarity scores between a node and all of its neighbors using the algorithms mentioned previously. Based on the similarity scores, we present the top K similar items as our results.
We then calculate the cardinality of the intersection of the top-K result set and the ground truth item set, and we use the ratio of the intersection size to the ground truth size as our prediction accuracy. We compare the accuracies of top-10 recommendations (Measurement #3) and top-5 recommendations (Measurement #4). For each item:

ratio = (number of correct top-K recommendations) / (number of ground truth recommendations)
accuracy = average of the ratios over all items

4.2 Performance of Baseline Algorithm

The Common-Neighbor Algorithm serves as our baseline. To see how it performs, we first run the measurements on a small subcategory, Publishing & Books, which contains 1099 nodes (670 of them isolated) and 863 edges; see the detailed analysis of this small network in Section 1.5. Accuracies for the four measurements are listed in the table below.

Common Neighbors

For comparison, we then run tests on a slightly bigger subcategory, Finance & Investing, which contains 1545 nodes (820 of them isolated) and 1233 edges (see the analysis in Section 4). Here is the result:

Common Neighbors

From the two tables above, we can see that our baseline algorithm suffers from the unbalanced dataset (compare Measurements #1 and #2). It also performs poorly on recommendation accuracy, with a sharp decrease between the first two measurements. This may result from the fact that it is a naive way of predicting similarity among items: it purely exploits the structure of the graph without taking into consideration factors such as the titles and ratings of items. Another limitation of the Common-Neighbor algorithm comes from our measurement technique. For Measurements #3 and #4, we form a similarity graph whose number of edges matches that of the ground truth graph, which means that common neighbors makes only 863 predictions.
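The four measurements can be sketched as follows; the toy label vectors and ranking are illustrative only:

```python
def accuracy(pred, truth):
    # Measurement #1: fraction of all pairs labeled correctly.
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def accuracy_on_label_1(pred, truth):
    # Measurement #2: accuracy restricted to ground-truth label-1 pairs.
    ones = [i for i, t in enumerate(truth) if t == 1]
    return sum(pred[i] == 1 for i in ones) / len(ones)

def topk_ratio(ranked, ground_truth, k):
    # Measurements #3/#4: |top-K ∩ ground truth| / |ground truth|;
    # averaging this ratio over items gives the overall accuracy.
    return len(set(ranked[:k]) & set(ground_truth)) / len(ground_truth)

truth = [1, -1, -1, 1, 0, -1]
pred  = [1, -1,  0, 0, 0, -1]
```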
We sort all node pairs by their number of common neighbors in descending order, take the top 863 pairs and exclude the rest. This method keeps our predictions on the same scale as the ground truth, but it degrades our prediction accuracy. Approaches that take more features of the network into account should improve on this with corresponding adjustments. In the following analysis, the same four measurements are used to compare the different algorithms.

4.3 Performance of Machine Learning Approach

To measure the performance of the machine learning predictions, we first run measurements on the small subcategory Publishing & Books. In addition, we make some adjustments to the Linear-SVC classifier for the skewed label distribution of the dataset: most labels in our training set are -1, so we assign less weight to label -1 and more weight to labels 0 and 1. In our machine learning approach, we chose four different algorithms, namely linear regression, Linear-SVC with balanced class weight, Linear-SVC, and SVC.

Linear regression | Linear-SVC balanced | Linear-SVC | SVC

As with our baseline algorithm, we also run measurements on the medium-sized subcategory Finance & Investing, which contains 1545 nodes (820 of them isolated) and 1233 edges. The results involve the same algorithms as for the previous sub-dataset.

Linear regression | Linear-SVC balanced | Linear-SVC | SVC

From the results above, we notice that Linear-SVC with balanced class weight achieves the best results on both sub-datasets. It performs better than the other three algorithms because they train on an unbalanced dataset (one label dominates); in other words, our graphs are sparse. Assigning class weights inversely proportional to label frequency helps reduce testing errors. In addition, Linear-SVC-balanced has the smallest gap between the first two measurements, which means it makes the most reliable predictions under unbalanced labels. We also notice that linear regression performs poorly on both training sets, achieving only about 20% recommendation accuracy on average in Measurement #4; more concretely, only 1 in 5 recommendations would succeed. The reason might be that the features and labels are not in a linear relationship. Also, with an unbalanced training set, linear regression tends to make trivial predictions and cannot distinguish the slight differences among similar items. Comparing results between the two datasets, we can see that Linear-SVC balanced performs better on the second dataset while all the other algorithms perform worse on it. One possible explanation is that the second dataset is sparser than the first, making it more unbalanced. Beyond this, all algorithms perform on roughly the same scale on both datasets; increasing the dataset size does not significantly change the classification or recommendation accuracy. Despite those drawbacks, the best algorithm we used achieves 80% accuracy on top-5 recommendations (4 out of 5 recommendations are correct), already a significant improvement over our baseline algorithm. Although it observes partial ground truth and uses features from the ground truth graph, it can still generate recommendations when provided with some user behavior.

4.4 Performance of Supervised Random Walk Algorithm

In addition to the previous experiments on both datasets, we also introduce a graph mining approach, the Supervised Random Walk algorithm.
The performances of Supervised Random Walks on the subcategories are presented as follows:

Publishing&Books | Finance&Investing

From the table above, we notice that the pattern across measurements differs from what the other algorithms have shown. An interesting finding is that for Measurements #1 and #2, precision on the Finance & Investing dataset is higher, while for the other measurements it is the other way around. Besides, the sharp drop in Measurement #2 draws our attention to the reasons behind it. The diameters of the two networks are 6 and 7 for Publishing & Books and Finance & Investing respectively, which are close to each other and not large. The reason for the poor performance on Measurement #2 may be that short walks can get stuck in a local network neighborhood. Compared with the machine learning approach, the performance of the Supervised Random Walks algorithm is not as good. The diameters of these two subcategories and other network properties may help explain its low precision: both networks are highly connected with high clustering coefficients, so the restarted walk tends to stay in dense local neighborhoods and ranks many nearby nodes similarly. Under these circumstances, we can take some steps toward a partial remedy. Zhue et al. use Metropolis-Hastings biasing, in which they tweak the input graph. This does improve the performance of the random walk algorithm, but such tricks are not easy to find for all cases. Further study on this topic would involve a better supervised random walks algorithm.

4.5 Avoiding Over-fitting

When conducting the supervised machine learning experiments, we employed two methods to avoid overfitting: (1) feature choice: we chose a limited number of features that contribute significantly to classification.
Too many features, or features of trivial significance, may result in overfitting; (2) validation set: we prepared a validation set disjoint from both the training set and the test set and ran the classifier on it. If the data is not overfitted, predictions on the validation set should perform similarly to predictions on the test set.

5 CONCLUSIONS

In this report, we implemented three approaches to build a recommendation system on the Amazon dataset: (1) the Common-Neighbors algorithm (baseline); (2) supervised machine learning; (3) Supervised Random Walks. We then proposed four measurement metrics to evaluate the performance of these approaches: (1) accuracy of classification; (2) accuracy of classification on label 1; (3) accuracy of top-10 recommendations; (4) accuracy of top-5 recommendations. Our experiments show that supervised machine learning, specifically the Linear-SVC-balanced model, performs best among all algorithms on all four metrics. The Common-Neighbors baseline, although it predicts well under metric #1, performs very poorly under the other three. The Supervised Random Walk algorithm outperforms the baseline on metrics #2, #3, and #4, yet it does not reach the accuracy of the supervised machine learning approach. However, with some data modification (e.g., tweaking the input graph), the Supervised Random Walk algorithm could perform better; we leave this to future study. Also, due to limitations of our experimental conditions, we have yet to try our approaches on larger datasets; the scalability of these algorithms is another direction for further study.

REFERENCES
[1] J. Leskovec, L. Adamic, and B. Huberman. The Dynamics of Viral Marketing. ACM Transactions on the Web (ACM TWEB), 1(1), 2007.
[2] D. Liben-Nowell and J. Kleinberg. The Link-Prediction Problem for Social Networks.
In CIKM, 2003.
[3] L. Backstrom and J. Leskovec. Supervised Random Walks: Predicting and Recommending Links in Social Networks. In WSDM, 2011.
[4] G. Linden, B. Smith, and J. York. Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Computing, 7(1):76-80, 2003.
[5] L. Lü and T. Zhou. Link Prediction in Complex Networks: A Survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150-1170, 2011.
[6] J. Leskovec. SNAP: Amazon Product Co-purchasing Network Metadata.
[7] J. Li, L. Zhang, F. Meng, and F. Li. Recommendation Algorithm Based on Link Prediction and Domain Knowledge in Retail Transactions. Procedia Computer Science, 31, 2014.


More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm

Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm Majid Hatami Faculty of Electrical and Computer Engineering University of Tabriz,

More information

The Complex Network Phenomena. and Their Origin

The Complex Network Phenomena. and Their Origin The Complex Network Phenomena and Their Origin An Annotated Bibliography ESL 33C 003180159 Instructor: Gerriet Janssen Match 18, 2004 Introduction A coupled system can be described as a complex network,

More information

CSC 2515 Introduction to Machine Learning Assignment 2

CSC 2515 Introduction to Machine Learning Assignment 2 CSC 2515 Introduction to Machine Learning Assignment 2 Zhongtian Qiu(1002274530) Problem 1 See attached scan files for question 1. 2. Neural Network 2.1 Examine the statistics and plots of training error

More information

Application of Support Vector Machine Algorithm in Spam Filtering

Application of Support Vector Machine Algorithm in  Spam Filtering Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification

More information

V Conclusions. V.1 Related work

V Conclusions. V.1 Related work V Conclusions V.1 Related work Even though MapReduce appears to be constructed specifically for performing group-by aggregations, there are also many interesting research work being done on studying critical

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

DIAL: A Distributed Adaptive-Learning Routing Method in VDTNs

DIAL: A Distributed Adaptive-Learning Routing Method in VDTNs : A Distributed Adaptive-Learning Routing Method in VDTNs Bo Wu, Haiying Shen and Kang Chen Department of Electrical and Computer Engineering Clemson University, Clemson, South Carolina 29634 {bwu2, shenh,

More information

Link prediction in graph construction for supervised and semi-supervised learning

Link prediction in graph construction for supervised and semi-supervised learning Link prediction in graph construction for supervised and semi-supervised learning Lilian Berton, Jorge Valverde-Rebaza and Alneu de Andrade Lopes Laboratory of Computational Intelligence (LABIC) University

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information