Academic Recommendation using Citation Analysis with theadvisor

1 Academic Recommendation using Citation Analysis with theadvisor
Joint work with Onur Küçüktunç, Kamer Kaya, Ümit V. Çatalyürek
Department of Biomedical Informatics, The Ohio State University
CSTA 2013

2 Table of Contents
1 Introduction
  Why?
  Overview
2 Citation Analysis for Document Recommendation
  Previous Approaches
  Direction Aware Recommendation
3 A High Performance Computing Problem
  A specialization of SpMV
  Ordering and Partitioning
4 Result Diversification
5 Other Features
6 Final Thoughts
  Conclusion
  Future Works

6 Once upon a time: a survey paper
The Jimmy John's scheduling problem
Keywords: scheduling, partitioning, mapping, pipeline, workflow, data flow, task graph, linear chain, sequences, (tree), (serial parallel)
But also...
  Scheduling problems in parallel query optimization
  Bringing skeletons out of the closet: A pragmatic manifesto for skeletal parallel programming
After 6 months, unknown papers were still being uncovered.
Develop software to make the search easier!

7 Design Goals
Personalized: The user should be able to make a query that describes precisely what she is looking for.
Conceptual: The system should be free of linguistic problems. Ambiguity and synonymy should be taken into account.
Exploratory: Different perspectives should be available. The system should enhance the user's search.
Easy to use: The user should not need to know anything about data mining or algorithms.

15 The Academic Web Service Ecosystem
DBLP: List of CS papers with clean references and disambiguated names.
Citeseer, {Ref,Ack,Collab}Seer: Automatically crawled papers in CS. Give PDFs. Contain citation information and full text. Compute similarity.
CiteUlike: Social paper tagging application. Find papers from researchers with similar interests.
ArnetMiner: Academic network analysis.
Mendeley: Application for managing references. Database of references.
Google Scholar: Keyword-based search engine (with citation information).
Microsoft Academic Search: Keyword-based search engine and academic network analysis.
IEEE, ACM, Elsevier, JSTOR, ...: Publishers or digital libraries with complete text and references. Some suggestions.

16 A Use Case

17 System Overview
Architecture
  A web server as a front end.
  A cluster in the back end. New instances are dynamically created as the load varies.
Functional
  Diagram labels: inputs .bib, .ris, .xml; Paper Mapper; parameters {k, d, κ}; paper IDs; Recommendation Engine; Venue Rec.; Reviewer Rec.; Diversification Engine; outputs venues, reviewers, papers; Visualization; Relevance Feedback

18 Outline: 2 Citation Analysis for Document Recommendation (Previous Approaches, Direction Aware Recommendation)

19 Using the Citation Graph
Hypothesis: If two papers are related or treat the same subject, then they will be close to each other in the citation graph (and reciprocally).
Benefits
  No linguistic analysis, hence no synonymy and no ambiguity
  Automatically crowdsourced by researchers
Drawbacks
  Difficult to gather the data (but thanks, Citeseer)
  Relies on researchers already having made similar connections

21 Local Approaches
Figure labels: reference edges of v; citation edges of v; vertices v, x, u, t
Bibliographic coupling [Kessler63]: papers having similar references are related
Cocitation [Small73]: papers which are cited by the same papers are related
CCIDF [Lawrence99]: cocitations weighted with inverse frequencies
Problem: these approaches consider only level-2 papers, based on level-1 information.
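The two unweighted local measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is a dictionary mapping each paper to the set of papers it cites, the scores are plain shared-neighbor counts (CCIDF weighting is not shown), and the toy graph and function names are hypothetical.

```python
from collections import defaultdict

def invert(references):
    """Build the citation lists (who cites each paper) from the reference lists."""
    cited_by = defaultdict(set)
    for paper, refs in references.items():
        for ref in refs:
            cited_by[ref].add(paper)
    return cited_by

def bibliographic_coupling(references, v):
    """Score papers by the number of references they share with v."""
    cited_by = invert(references)
    scores = defaultdict(int)
    for ref in references.get(v, ()):
        for other in cited_by[ref]:
            if other != v:
                scores[other] += 1
    return dict(scores)

def cocitation(references, v):
    """Score papers by the number of papers that cite both them and v."""
    cited_by = invert(references)
    scores = defaultdict(int)
    for citer in cited_by[v]:
        for other in references[citer]:
            if other != v:
                scores[other] += 1
    return dict(scores)

# Hypothetical toy graph: each key cites the papers in its set.
references = {
    "v": {"a", "b"},
    "x": {"a", "b", "c"},
    "u": {"v", "t"},
    "w": {"v", "t", "a"},
}
coupling_scores = bibliographic_coupling(references, "v")
cocitation_scores = cocitation(references, "v")
```

Both functions only ever reach papers two hops away from v, which is the limitation stated on the slide.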

23 Global Approaches
Graph distance-based
  Katz: number of paths between two papers [Strohman07]
Random walk with restarts (RWR) based
  ArticleRank [Li09] (PageRank [Brin98] extension)
  PaperRank [Gori06] (Personalized PageRank [Haveliwala02] extension)
RWR treats the citations and references in the same way.
This is not exploratory!

27 PageRank
Let G = (V, E) be the citation graph.
PageRank [Brin98]
\[ \pi_i(u) = d\,\frac{1}{|V|} + (1-d) \sum_{v \in N(u)} \frac{\pi_{i-1}(v)}{\delta(v)} \]
where $d \in (0, 1)$ is the damping factor. It converges to a stable distribution.
But it is not personalized.
Personalized PageRank [Jeh03]
\[ \pi_i(u) = d\,p^{*}(u) + (1-d) \sum_{v \in N(u)} \frac{\pi_{i-1}(v)}{\delta(v)} \]
with $\sum_{u \in V} p^{*}(u) = 1$.
But it is not exploratory.
Figure source: Wikipedia
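The iteration above can be sketched as a short power iteration. This is a minimal illustration, assuming an undirected neighbor-list representation (N(u) and δ(v) taken as neighbors and degree), a fixed iteration count instead of a convergence test, and a hypothetical three-paper graph; the function name and default values are not from the slides.

```python
def personalized_pagerank(neighbors, restart, d=0.15, iterations=50):
    """Iterate pi_i(u) = d*p(u) + (1-d) * sum_{v in N(u)} pi_{i-1}(v) / deg(v)."""
    pi = dict(restart)  # start from the restart distribution
    for _ in range(iterations):
        nxt = {u: d * restart[u] for u in neighbors}
        for v, adj in neighbors.items():
            if not adj:
                continue  # an isolated vertex has nobody to pass its score to
            share = (1 - d) * pi[v] / len(adj)
            for u in adj:
                nxt[u] += share
        pi = nxt
    return pi

# Hypothetical toy graph: a path a - b - c.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

# Plain PageRank: uniform restart vector 1/|V|.
uniform = {u: 1.0 / len(neighbors) for u in neighbors}
plain = personalized_pagerank(neighbors, uniform)

# Personalized PageRank: all restart mass on paper "a".
personal = personalized_pagerank(neighbors, {"a": 1.0, "b": 0.0, "c": 0.0})
```

With the uniform restart vector the two end papers receive the same score; putting the restart mass on "a" breaks that symmetry in favor of "a".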

28 Direction Awareness
Time exploration: What if we are interested in searching papers by year? Recent papers? Traditional papers?
Let M be a set of known relevant papers.
Direction Aware Random Walk with Restart
\[ \pi_i(u) = d\,p^{*}(u) + (1-d)\left( \kappa \sum_{v \in N^{+}(u)} \frac{\pi_{i-1}(v)}{\delta^{-}(v)} + (1-\kappa) \sum_{v \in N^{-}(u)} \frac{\pi_{i-1}(v)}{\delta^{+}(v)} \right) \]
$d \in (0, 1)$ is the damping factor.
$\kappa \in (0, 1)$.
$p^{*}(u) = \frac{1}{|M|}$ if $u \in M$, and $p^{*}(u) = 0$ otherwise.
Figure legend: $\delta^{+}(v)$ reference edges; $\delta^{-}(v)$ back-reference (citation) edges; restart edges
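A small sketch may help to see how κ steers the walk. It rests on an interpretation that the slide does not spell out: N+(u) is taken as the papers u cites, N-(u) as the papers citing u, δ+(v) as the number of references of v and δ-(v) as its number of citations. The function name, defaults and toy chain are hypothetical.

```python
def darwr(refs, query, d=0.5, kappa=0.5, iterations=20):
    """Direction-aware random walk with restart; refs[v] lists the papers v cites."""
    papers = set(refs) | {r for rs in refs.values() for r in rs}
    cited_by = {v: [] for v in papers}
    for v, rs in refs.items():
        for r in rs:
            cited_by[r].append(v)
    restart = {v: (1.0 / len(query) if v in query else 0.0) for v in papers}
    pi = dict(restart)
    for _ in range(iterations):
        nxt = {v: d * restart[v] for v in papers}
        for v in papers:
            if pi[v] == 0.0:
                continue
            out_refs, citers = refs.get(v, []), cited_by[v]
            # A kappa share of v flows to the papers citing v (assumed: v in N+(u)).
            for u in citers:
                nxt[u] += (1 - d) * kappa * pi[v] / len(citers)
            # A (1 - kappa) share flows to the papers v cites (assumed: v in N-(u)).
            for u in out_refs:
                nxt[u] += (1 - d) * (1 - kappa) * pi[v] / len(out_refs)
        pi = nxt
    return pi

# Hypothetical chain: "new" cites "mid", which cites "old".
refs = {"new": ["mid"], "mid": ["old"], "old": []}
toward_recent = darwr(refs, {"mid"}, kappa=0.9)
toward_older = darwr(refs, {"mid"}, kappa=0.1)
```

Under this reading, a κ close to 1 favors the papers that cite the query (more recent work), and a κ close to 0 favors the papers the query cites (older work).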

29 Exploring in Depth
Figure: average shortest distance; axes d and κ

30 Exploring in Time
Figure: average publication year; axes d and κ

31 Hidden Reference Discovery
The recovery test: Let's hide some references from a paper and see if an algorithm can find them.
Results of the experiments with mean average precision (MAP@50) and 95% confidence intervals.
Table columns: hide random, hide recent, hide earlier (mean and interval for each)
Table rows: DaRWR, P.R., Katz β, Cocit, Cocoup, CCIDF
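For readers unfamiliar with the metric, here is one common way to compute MAP@k for the recovery test. The slides do not say which variant was used, so the normalization by min(k, number of hidden references) is an assumption, and the function names and sample ranking are hypothetical.

```python
def average_precision_at_k(ranked, hidden, k=50):
    """AP@k of a ranked recommendation list against the set of hidden references."""
    hits, total = 0, 0.0
    for rank, paper in enumerate(ranked[:k], start=1):
        if paper in hidden:
            hits += 1
            total += hits / rank  # precision at this rank
    denom = min(k, len(hidden))
    return total / denom if denom else 0.0

def mean_average_precision(runs, k=50):
    """Mean of AP@k over (ranked list, hidden set) pairs, one pair per test paper."""
    return sum(average_precision_at_k(r, h, k) for r, h in runs) / len(runs)

# Hypothetical run: two hidden references, recovered at ranks 1 and 3.
ap = average_precision_at_k(["a", "x", "b", "y"], {"a", "b"})
map_score = mean_average_precision([(["a", "b"], {"a", "b"}), (["x", "y"], {"a"})])
```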

32 Outline: 3 A High Performance Computing Problem (A specialization of SpMV, Ordering and Partitioning)

34 A Sparse Matrix-Vector Multiplication (SpMV)
Rewriting DaRWR
\[ \pi_i(u) = d\,p^{*}(u) + (1-d)\left( \kappa \sum_{v \in N^{+}(u)} \frac{\pi_{i-1}(v)}{\delta^{-}(v)} + (1-\kappa) \sum_{v \in N^{-}(u)} \frac{\pi_{i-1}(v)}{\delta^{+}(v)} \right) \]
\[ \pi_i(u) = d\,p^{*}(u) + \sum_{v \in N^{+}(u)} \frac{(1-d)\kappa}{\delta^{-}(v)}\,\pi_{i-1}(v) + \sum_{v \in N^{-}(u)} \frac{(1-d)(1-\kappa)}{\delta^{+}(v)}\,\pi_{i-1}(v) \]
\[ \pi_i = d\,p^{*} + A\,\pi_{i-1} \]
CRS Full
  Traverse A column by column.
  Skip columns where $\pi_{i-1}(v) = 0$.
  Per edge: 2 non-zeros (2 indices, 2 values)
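The full-matrix variant can be sketched with plain dictionaries standing in for the compressed storage. This is an illustration, not the authors' kernel: it assumes edges are (citing, cited) pairs, reads N+(u) as the papers u cites, and counts updates the way the slide suggests (one per stored non-zero touched). All names are hypothetical.

```python
def build_columns(edges, d, kappa):
    """Column-oriented A: columns[v] holds (row u, value A[u][v]) pairs."""
    out_deg, in_deg = {}, {}
    for a, b in edges:
        out_deg[a] = out_deg.get(a, 0) + 1  # delta+(a): references of a
        in_deg[b] = in_deg.get(b, 0) + 1    # delta-(b): citations of b
    columns = {}
    for a, b in edges:
        # a cites b: b passes a kappa share to its citer a.
        columns.setdefault(b, []).append((a, (1 - d) * kappa / in_deg[b]))
        # a cites b: a passes a (1 - kappa) share to its reference b.
        columns.setdefault(a, []).append((b, (1 - d) * (1 - kappa) / out_deg[a]))
    return columns

def spmv_full(columns, p, pi_prev, d):
    """One iteration pi_i = d*p + A*pi_{i-1}, traversing A column by column."""
    pi = {u: d * p[u] for u in p}
    updates = 0
    for v, col in columns.items():
        x = pi_prev[v]
        if x == 0.0:  # skip columns whose vector entry is zero
            continue
        for u, a_uv in col:
            pi[u] += a_uv * x
            updates += 1
    return pi, updates

# Hypothetical chain: "new" cites "mid", which cites "old"; the query is "mid".
edges = [("new", "mid"), ("mid", "old")]
p = {"new": 0.0, "mid": 1.0, "old": 0.0}
columns = build_columns(edges, d=0.5, kappa=0.75)
pi_full, n_updates = spmv_full(columns, p, p, d=0.5)
```

Each citation edge contributes two stored entries, one per direction, each with its own row index and value, which matches the "2 indices, 2 values" cost listed above.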

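36 A Sparse Matrix-Vector Multiplication (SpMV)
Rewriting DaRWR
\[ \pi_i(u) = d\,p^{*}(u) + \sum_{v \in N^{+}(u)} \frac{(1-d)\kappa}{\delta^{-}(v)}\,\pi_{i-1}(v) + \sum_{v \in N^{-}(u)} \frac{(1-d)(1-\kappa)}{\delta^{+}(v)}\,\pi_{i-1}(v) \]
\[ \pi_i = d\,p^{*} + B^{-}\left( \frac{(1-d)\kappa}{\delta^{-}}\,\pi_{i-1} \right) + B^{+}\left( \frac{(1-d)(1-\kappa)}{\delta^{+}}\,\pi_{i-1} \right) \]
CRS Half
  Pre-compute $\frac{(1-d)\kappa}{\delta^{-}}\,\pi_{i-1}$ and $\frac{(1-d)(1-\kappa)}{\delta^{+}}\,\pi_{i-1}$ (element-wise).
  $B^{-}$ and $B^{+}$ are 0/1 and symmetric.
  Traverse the matrix twice ($B^{-}$ and $B^{+}$).
  Skip columns where $\pi_{i-1}(v) = 0$.
  Per edge: 1 non-zero (1 index, 0 values)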

37 A Sparse Matrix-Vector Multiplication (SpMV)
Rewriting DaRWR
\[ \pi_i = d\,p^{*} + B^{-}\left( \frac{(1-d)\kappa}{\delta^{-}}\,\pi_{i-1} \right) + B^{+}\left( \frac{(1-d)(1-\kappa)}{\delta^{+}}\,\pi_{i-1} \right) \]
COO Half
  Pre-compute $\frac{(1-d)\kappa}{\delta^{-}}\,\pi_{i-1}$ and $\frac{(1-d)(1-\kappa)}{\delta^{+}}\,\pi_{i-1}$ (element-wise).
  $B^{-}$ and $B^{+}$ are 0/1 and symmetric.
  Traverse the matrix once ($B^{-}$ and $B^{+}$ together).
  Arbitrary order. Don't skip anything.
  Per edge: 1 non-zero (2 indices, 0 values)
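The coordinate-format variant is easy to sketch because the edge list itself is the matrix. The code below is an illustration under assumptions: edges are (citing, cited) pairs, N+(u) is read as the papers u cites, and the degrees are recomputed inside the function only to keep the example self-contained (a real kernel would compute them once). Names are hypothetical.

```python
def coo_half_iteration(edges, p, pi_prev, d, kappa):
    """One DaRWR iteration over an edge list of (citing, cited) pairs."""
    out_deg = {v: 0 for v in p}  # delta+(v): number of references of v
    in_deg = {v: 0 for v in p}   # delta-(v): number of citations of v
    for a, b in edges:
        out_deg[a] += 1
        in_deg[b] += 1
    # Pre-scale the previous vector once, so the edge loop needs no stored values.
    to_citers = {v: (1 - d) * kappa * pi_prev[v] / in_deg[v] if in_deg[v] else 0.0
                 for v in p}
    to_refs = {v: (1 - d) * (1 - kappa) * pi_prev[v] / out_deg[v] if out_deg[v] else 0.0
               for v in p}
    pi = {v: d * p[v] for v in p}
    for a, b in edges:         # any order; nothing is skipped
        pi[a] += to_citers[b]  # citation direction: cited b to citing a
        pi[b] += to_refs[a]    # reference direction: citing a to cited b
    return pi

# Hypothetical chain: "new" cites "mid", which cites "old"; the query is "mid".
edges = [("new", "mid"), ("mid", "old")]
p = {"new": 0.0, "mid": 1.0, "old": 0.0}
pi_half = coo_half_iteration(edges, p, p, d=0.5, kappa=0.75)
```

Each edge is stored once as two indices and no value. The price is that every edge is visited in every iteration, even when the vector entries it touches are zero.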

38 Number of updates
Figure: number of updates (0 to 12M) per iteration for CRS-Full, CRS-Half, COO-Half

39 Runtimes
Figure: execution time (s) against #partitions for CRS-Full, CRS-Half, COO-Half and Hybrid, each without reordering and with RCM, AMD and SB orderings

43 Ordering
Locality: SpMV is sensitive to non-zero locality.
Reverse Cuthill-McKee [Cuthill, McKee, 69]: Order with respect to a Breadth First Search ordering. (Do 10 times, pick best)
Approximate Minimum Degree [Amestoy et al., 96]: Greedily, add the vertex whose degree is minimum.
Slashburn [Kang, Faloutsos, 11]: Order by connected components. Remove the highest degree vertex. Repeat.
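Of the three orderings, Reverse Cuthill-McKee is the simplest to sketch: a breadth-first search that visits neighbors by increasing degree, followed by a reversal. The version below is a textbook-style illustration in pure Python, not the implementation used in the experiments. Scoring candidate orderings by matrix bandwidth when picking the best start vertex is an assumption, since the slide does not say how "best" was judged.

```python
from collections import deque

def reverse_cuthill_mckee(adj, start):
    """Return an RCM ordering of an undirected graph given as adjacency lists."""
    order, seen = [], set()
    # Begin with the chosen start vertex, then cover any remaining components.
    for root in [start] + sorted(adj, key=lambda v: len(adj[v])):
        if root in seen:
            continue
        seen.add(root)
        queue = deque([root])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Visit unseen neighbors by increasing degree (Cuthill-McKee rule).
            for w in sorted(adj[v], key=lambda x: len(adj[x])):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    order.reverse()
    return order

def bandwidth(adj, order):
    """Largest distance between the positions of two adjacent vertices."""
    pos = {v: i for i, v in enumerate(order)}
    return max((abs(pos[v] - pos[w]) for v in adj for w in adj[v]), default=0)

# Hypothetical graph: a path whose vertices are numbered in a scrambled order.
adj = {0: [3], 1: [3, 4], 2: [4], 3: [0, 1], 4: [1, 2]}
rcm = reverse_cuthill_mckee(adj, start=0)

# Try several start vertices and keep the ordering with the smallest bandwidth.
best = min((reverse_cuthill_mckee(adj, s) for s in adj), key=lambda o: bandwidth(adj, o))
```

On this toy graph the original numbering has bandwidth 3 while the RCM ordering brings it down to 1, which is the kind of non-zero locality SpMV benefits from.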

44 Partitioning

46 Outline: 4 Result Diversification

49 Principle
The goal of diversity is to avoid clustered answers.
Figure labels: Relevant; Relevant; Diverse

52 A Modeling Problem
Figure: known algorithms plotted on axes "rel" and "dens", with the "better" direction marked. Labels include 10-RLM, BC1, BC1 V, BC2, BC2 V, CDivRank, DRAGON, GRA, GSP, IL1, IL2, LM, PDiv, PR, k-rl, top-9, top-7, top-5, top-2, All R.
Here is a distribution of known algorithms.
Would such an algorithm be of interest?
That algorithm is random!

53 What to do?
See later talk!

54 Results
Figure: references and recommendations (top-100) by topic: Graph mining, Compression, Generic SpMV, Multicore, Partitioning, GPU, Eigensolvers

55 Outline: 5 Other Features

57 Relevance Feedback
Papers can be tagged as relevant or irrelevant.
Positive feedback (+RF): Relevant results are added to Q
Negative feedback (-RF): Irrelevant results are removed from the graph
How long does it take to find the first level-3 paper?
Figure: axes Tau and Ratio; curves No RF, Pos RF, Neg RF, Pos+Neg RF
More exploration!

59 Visualization
More exploration!

62 Application Programming Interface
Web service
  theadvisor can be accessed programmatically.
  Emit HTTP requests and obtain JSON-encoded replies.
Potential Applications
  Interfacing with article editors (e.g., TexShop)
  Recommendation in bibliography managers (e.g., Mendeley)
  Suggesting reviewers to program committees (e.g., EasyChair)
  Suggesting sessions of interest at conferences (e.g., iconference)
Easier to use!

63 Outline: 6 Final Thoughts (Conclusion, Future Works)

65 Design Goals - Are they matched?
Personalized: The query is expressed in a very precise way.
Conceptual: Using the citation graph avoids all linguistic issues. However, it may not be enough to find all relevant papers.
Exploratory: Direction Awareness (to choose time), Diversification (to see more topics), Visualization (for manual crawling)
Easy to use: Efficient (recommendation in less than 2 seconds), web-based.
Is it good enough? Tell us!

66 Future Works
Clustering: Let's assume for an instant that we have accurate, disambiguated tags for every document. We could restrict the analysis to some fields and improve diversification.
Betweenness Centrality: DaRWR provides recommendations around the query set. What about recommending what lies between the query papers?
Contextual information: Distinguishing types of papers and citations: Survey, Method, Application...

67 Thank you
More information
contact: esaule@bmi.osu.edu
visit: (or or
Research at HPC lab is supported by
