Local is Good: A Fast Citation Recommendation Approach


Haofeng Jia and Erik Saule
Dept. of Computer Science, UNC Charlotte, USA
{hjia1,esaule}@uncc.edu

Abstract. Finding relevant research works among the large number of published articles has become a nontrivial problem. In this paper, we consider the problem of citation recommendation, where the query is a set of seed papers. Collaborative filtering and PaperRank are classical approaches for this task. Previous work has shown that PaperRank achieves better recommendations in experiments. However, the running time of PaperRank typically depends on the size of the input graph and thus tends to be expensive. Here we explore LocRank, a local ranking method on the subgraph induced by the ego network of the vertices in the query. We experimentally demonstrate that LocRank is as effective as PaperRank while being 15x faster than PaperRank and 6x faster than collaborative filtering.

1 Introduction

Scientists around the world have published tens of millions of research papers, and the number of new papers keeps increasing over time. At the same time, literature search has become an essential task performed daily by thousands of researchers around the world. Finding relevant research works among the large number of published articles has become a nontrivial problem.

Several keyword-based paper recommendation systems have been developed, such as Google Scholar or Mendeley. However, keyword-based systems have clear drawbacks: first, the vocabulary gap between the query and the relevant documents can result in poor performance; moreover, a simple string of keywords may not be enough to convey the information needs of researchers. In many cases, a keyword query is either overly broad, returning many articles only loosely relevant to what the researcher actually needs, or too narrow, filtering out many potentially relevant articles or returning nothing at all.
Researchers have devoted effort to citation recommendation based on a set of seed papers [1-10], which avoids the problems mentioned above. Intuitively, most approaches rely on the citation graph to recommend relevant papers. In general, there are two classic frameworks: collaborative filtering (CF) [1, 2] and random walk with restart [3, 7, 9], of which PaperRank [3] is the most representative. The different approaches to recommending academic papers have been extensively surveyed [11].

Previous work has shown that PaperRank achieves better recall than CF, while CF is more efficient. The running time of PaperRank typically depends on the size of the input graph and thus tends to be more expensive. Collaborative filtering essentially computes weighted co-citation relationships and thus does not need to take the global graph into account. However, for a practical citation recommendation system, both the quality and the speed of recommendation matter. To this end, we explore LocRank, a local method whose running time depends on the size of the local subgraph of the query. Experiments demonstrate that LocRank provides recommendations of comparable quality to PaperRank while being 15x faster than PaperRank and 6x faster than CF.

2 A Local Ranking Method

2.1 Citation Recommendation

Let G = (V, E) be the citation graph, with n papers V = {v_1, ..., v_n}. In G, each edge e ∈ E represents a citation relationship between two papers. We use Ref(v) and Cit(v) to denote the reference set of and the citation set to v, respectively, and Adj(v) to denote the union of Ref(v) and Cit(v). We will use a superscript to denote the graph used when it is not obvious from context.

In this paper, we focus on the citation recommendation problem assuming that researchers already know a set of relevant papers; this assumption is not uncommon and is made by systems such as theadvisor [9]. The recommendation task can therefore be formalized as:

Citation Recommendation. Given a set of seed papers S as a query, return a list of papers ranked by relevance to the papers in S.

Collaborative Filtering [1] has proven to be an effective idea for most recommendation problems. For citation recommendation, a ratings matrix is built from the adjacency matrix of the citation graph, where citing papers correspond to users and cited papers correspond to items. A pseudo target paper that cites all seed papers is added to the matrix.
CF then computes the cosine similarity of all papers with the target paper and identifies the x peer papers with the highest similarity to the target paper. Each candidate paper is then scored by summing the similarities of the peer papers that cite it.

PaperRank [3] is a biased random walk proposed to recommend papers based on the citation graph. In particular, the restart probability from any paper is distributed only among the seed papers.

2.2 LocRank

In general, PaperRank tends to find papers that are close to the seeds and have high eigenvector centrality, while collaborative filtering essentially focuses on the local co-citation relationships of the seed papers. We now present LocRank, a method that provides high-quality recommendations at low computational cost.
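The CF scoring just described (pseudo target paper, cosine similarity, top-x peers, summed peer similarities) can be sketched compactly. The following is an illustrative Python sketch, not the authors' C++ implementation; the `cf_scores` name, the dictionary-based data layout, and the binary-vector cosine formula |Ref(p) ∩ S| / (√|Ref(p)| · √|S|) are assumptions made for the example.

```python
import math
from collections import defaultdict

def cf_scores(citations, seeds, x):
    """Collaborative filtering with a pseudo target paper that cites all seeds.

    citations: dict mapping each citing paper to the set of papers it cites.
    seeds: the set of seed papers (the query).
    x: number of peer papers to keep.
    """
    target = set(seeds)
    # Cosine similarity between the pseudo target's row and each citing paper's
    # row of the binary ratings matrix: |overlap| / (sqrt(|refs|) * sqrt(|seeds|)).
    sims = {}
    for paper, refs in citations.items():
        overlap = len(refs & target)
        if overlap:
            sims[paper] = overlap / (math.sqrt(len(refs)) * math.sqrt(len(target)))
    # Keep the x peers most similar to the pseudo target paper.
    peers = sorted(sims, key=sims.get, reverse=True)[:x]
    # Score each candidate by summing the similarities of the peers that cite it.
    scores = defaultdict(float)
    for p in peers:
        for cited in citations[p]:
            if cited not in target:
                scores[cited] += sims[p]
    return dict(scores)
```

Note that only papers sharing at least one reference with the seeds get a nonzero similarity, which is why CF never needs the global graph.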

We define the local induced subgraph of a query q as G_q = (V_q, E_q), where V_q contains all nodes in S and every node that is a neighbor of at least one seed paper:

    V_q = S ∪ S_n,  where  S_n = ∪_{s ∈ S} {v : v ∈ Adj(s)}

E_q retains all citation relationships between nodes in V_q. In other words, G_q is the subgraph induced by the distance-1 neighborhood of the seed papers.

Given the induced subgraph G_q, LocRank computes a random walk on G_q. It assumes a random walker at paper v continues to a neighbor with damping factor d, and with probability (1 - d) restarts at one of the seed papers in S. Edges are followed proportionally to their weight w_ji, which is often set to 1 but can be set to the number of times paper i is referenced by paper j:

    R(v_i) = (1 - d) · 1_S(v_i)/|S| + d · Σ_{v_j ∈ Adj^{G_q}(v_i)} ( w_ji / Σ_{v_k ∈ Adj^{G_q}(v_j)} w_jk ) · R(v_j)

where 1_S(v_i) equals 1 if v_i ∈ S and 0 otherwise. After convergence, LocRank returns the papers v ∈ S_n ranked by R(v).

3 Experiments

3.1 Data Preparation

There are many publicly available academic data sets, but none of them is perfect for the citation recommendation task. For instance, Microsoft Academic Graph [12]^1 contains abundant information from various disciplines but is fairly noisy: some important attributes are missing or wrong. In contrast, the records in DBLP are much more reliable, although DBLP does not contain citation information. To obtain a clean and comprehensive academic dataset, we match the Microsoft Academic Graph, CiteSeerX^2, and DBLP^3 [13] datasets through DOIs and titles to exploit their complementary advantages. In this way, we derive a corpus of Computer Science papers consisting of 2,035,246 papers and 12,439,090 citations.

3.2 Experiment Setup

To simulate the typical use case where a researcher is writing a paper and tries to find some more references, we design the random-hide experiment.

1 https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/
2 http://citeseerx.ist.psu.edu/
3 http://dblp.uni-trier.de/xml/
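The construction of G_q and the iteration of Section 2.2 can be sketched as follows. This is an illustrative Python sketch, not the authors' C++ implementation: adjacency is given as undirected neighbor sets (Adj(v) = Ref(v) ∪ Cit(v)), edge weights default to 1, and the `locrank` function name and data layout are assumptions.

```python
def locrank(adj, seeds, d=0.85, weights=None, tol=1e-10, max_iter=1000):
    """Random walk with restart on the subgraph induced by the seeds and
    their distance-1 neighborhood.

    adj: dict mapping each paper to its neighbor set Adj(v) = Ref(v) ∪ Cit(v).
    seeds: the set S of seed papers; restarts land only on seeds.
    weights: optional dict (j, i) -> w_ji; unspecified edges default to 1.
    """
    w = weights or {}
    # V_q = S ∪ S_n: the seeds plus all of their neighbors.
    vq = set(seeds)
    for s in seeds:
        vq |= set(adj.get(s, ()))
    # Restrict adjacency to the induced subgraph G_q.
    local = {v: {u for u in adj.get(v, ()) if u in vq} for v in vq}
    out_w = {v: sum(w.get((v, u), 1.0) for u in local[v]) for v in vq}
    r = {v: 1.0 / len(vq) for v in vq}
    for _ in range(max_iter):
        nxt = {}
        for vi in vq:
            # Restart mass (1 - d) is spread uniformly over the seeds only.
            restart = (1 - d) / len(seeds) if vi in seeds else 0.0
            # Walk mass arrives from neighbors proportionally to edge weights.
            walk = sum(w.get((vj, vi), 1.0) / out_w[vj] * r[vj]
                       for vj in local[vi] if out_w[vj] > 0)
            nxt[vi] = restart + d * walk
        done = max(abs(nxt[v] - r[v]) for v in vq) < tol
        r = nxt
        if done:
            break
    # Return the non-seed neighbors S_n ranked by score.
    return sorted((v for v in vq if v not in seeds), key=r.get, reverse=True)
```

Because the iteration only ever touches vertices of V_q, its cost per step is proportional to the number of edges in G_q rather than in the full citation graph.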

First, a query paper q with 20 to 200 references, published between 2005 and 2010, is selected uniformly at random from the dataset. We then remove q and all papers published after q from the citation graph to simulate the time when the query paper was being written. Instead of the hide-one strategy [1, 2], we randomly hide 10% of the references and place them in the hidden set. This set of hidden papers is used as the ground truth to recommend; the remaining references are used as the set of seed papers. To evaluate the effectiveness of a recommendation algorithm, we use recall@k, the ratio of hidden papers appearing in the top k of the recommended list. The following results are based on 2,500 independent, randomly selected queries.

The code is written in C++. Graphs are represented in Compressed Row Storage format for compact storage. The code is compiled with g++ 4.8.2 with option -O3 and run on one core of an Intel(R) Xeon E5-2623 @ 3.00GHz processor.

3.3 Results on Efficiency

Table 1: Performance on Efficiency (in seconds)

Method    | Mean Runtime | 95% Confidence Interval
CF        | 1.0070       | [1.0067, 1.0073]
PaperRank | 2.4796       | [2.4621, 2.4972]
LocRank   | 0.1674       | [0.1671, 0.1677]

Table 1 shows the average runtime for a single query and the 95% confidence intervals. Overall, LocRank is around 15x faster than PaperRank and 6x faster than CF. We compare LocRank with PaperRank and CF in turn.

LocRank vs PaperRank. LocRank is much faster than PaperRank for two reasons: first, the runtime of LocRank depends only on the size of the local induced graph, while PaperRank is a global ranking method; second, a local induced graph tends to have a smaller diameter, so LocRank reaches convergence in fewer iterations.
LocRank vs CF. Although LocRank and CF are both local approaches to citation recommendation, CF must, for each query, sort the neighbors by their similarity to the pseudo target paper.

The confidence intervals in Table 1 show that the standard deviation of PaperRank's runtime is the largest. This is because we remove the query paper q and all papers published after q from the citation graph to simulate the time when the query paper was being written, so the size of the global graph differs for queries with different publication dates. Figure 1 shows the runtime for 100 randomly sampled independent queries; the fluctuations in PaperRank's runtime reflect its larger standard deviation.

Fig. 1: Runtime (in seconds) of CF, PaperRank, and LocRank for 100 instance queries

3.4 Results on Effectiveness

Although LocRank is a ranking method on a local induced graph, its recommendation quality is competitive with the classical methods. Table 2 shows the mean recall over 2,500 independent queries. The 95% confidence intervals at top 50 show that the difference between PaperRank and LocRank is not statistically significant, while both LocRank and PaperRank are statistically significantly better than CF.

Table 2: Performance on Recall

Method    | @10    | @20    | @30    | @40    | @50    | 95% Confidence Interval @50
CF        | 0.1894 | 0.2669 | 0.3158 | 0.3597 | 0.3917 | [0.3775, 0.4059]
PaperRank | 0.2344 | 0.3260 | 0.3895 | 0.4360 | 0.4715 | [0.4573, 0.4857]
LocRank   | 0.2395 | 0.3281 | 0.3900 | 0.4338 | 0.4679 | [0.4537, 0.4821]

Essentially, LocRank trades an upper bound on recall for efficiency. It turns out that LocRank finds hidden papers as well as PaperRank does, which demonstrates that most findable hidden papers are in fact neighbors of seed papers. Even though PaperRank could in principle find 100% of the hidden papers, hidden papers that are not directly connected to a seed paper are hard to find for current citation recommendation systems; recent work has shown that popular methods are poor at finding loosely connected hidden papers [10]. Many factors can cause this phenomenon. For example, a machine learning paper will typically cite previous work on the same problem, but it may also cite a bioinformatics paper to demonstrate the wide applicability of the proposed method; in this case, it is hard to infer the bioinformatics paper from the rest of the references.
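The random-hide protocol of Section 3.2 and the recall@k metric evaluated above can be sketched as follows. This is an illustrative Python sketch; the function names and the fixed random seed are choices made for the example, not part of the paper's setup.

```python
import random

def recall_at_k(recommended, hidden, k):
    """Fraction of the hidden references found in the top-k recommendations."""
    return len(set(recommended[:k]) & set(hidden)) / len(hidden)

def random_hide_split(references, hide_fraction=0.1, rng=None):
    """Split a query paper's reference list into (seeds, hidden) sets,
    hiding roughly hide_fraction of the references as ground truth."""
    rng = rng or random.Random(0)  # fixed seed only for reproducibility here
    refs = list(references)
    rng.shuffle(refs)
    n_hidden = max(1, int(len(refs) * hide_fraction))
    return refs[n_hidden:], refs[:n_hidden]
```

A recommender is then run on the seed set, and recall@k is averaged over many independently drawn query papers, as in Tables 1 and 2.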

4 Conclusion and Future Work

This paper explores LocRank, a local method for the citation recommendation problem. The runtime of LocRank depends on the size of the local subgraph of the query. Experiments show that LocRank is 15x faster than PaperRank and 6x faster than CF, while its recall demonstrates that it remains as effective as PaperRank and better than CF. In the future, it would be interesting to investigate other local methods that can achieve strong performance on both efficiency and effectiveness; in particular, closeness centrality and k-core decomposition of the local graph seem relevant and easy to compute.

Acknowledgments. This material is based upon work supported by the National Science Foundation under Grant No. 1652442.

References

1. McNee, S.M., Albert, I., Cosley, D., Gopalkrishnan, P., Lam, S.K., Rashid, A.M., Konstan, J.A., Riedl, J.: On the recommending of citations for research papers. In: Proc. of CSCW. (2002) 116-125
2. Torres, R., McNee, S.M., Abel, M., Konstan, J.A., Riedl, J.: Enhancing digital libraries with TechLens+. In: Proc. of JCDL. (2004) 228-236
3. Gori, M., Pucci, A.: Research paper recommender systems: A random-walk based approach. In: Proc. of Web Intelligence. (2006) 778-781
4. Ekstrand, M.D., Kannan, P., Stemper, J.A., Butler, J.T., Konstan, J.A., Riedl, J.T.: Automatically building research reading lists. In: Proc. of RecSys. (2010) 159-166
5. El-Arini, K., Guestrin, C.: Beyond keyword search: discovering relevant scientific literature. In: Proc. of KDD. (2011) 439-447
6. Golshan, B., Lappas, T., Terzi, E.: Sofia search: a tool for automating related-work search. In: Proc. of SIGMOD. (2012) 621-624
7. Caragea, C., Silvescu, A., Mitra, P., Giles, C.L.: Can't see the forest for the trees?: a citation recommendation system. In: Proc. of JCDL. (2013) 111-114
8. Küçüktunç, O., Kaya, K., Saule, E., Çatalyürek, Ü.V.: Fast recommendation on bibliographic networks. In: Proc. of ASONAM. (2012)
9. Küçüktunç, O., Saule, E., Kaya, K., Çatalyürek, Ü.V.: Towards a personalized, scalable, and exploratory academic recommendation service. In: Proc. of ASONAM. (2013)
10. Jia, H., Saule, E.: An analysis of citation recommender systems: Beyond the obvious. In: Proc. of ASONAM. (2017)
11. Beel, J., Gipp, B., Langer, S., Breitinger, C.: Research-paper recommender systems: a literature survey. International Journal on Digital Libraries 17(4) (2016) 305-338
12. Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B.J.P., Wang, K.: An overview of Microsoft Academic Service (MAS) and applications. In: Proc. of WWW. (2015) 243-246
13. Ley, M.: DBLP - some lessons learned. PVLDB 2(2) (2009) 1493-1500