PANDA: A Platform for Academic Knowledge Discovery and Acquisition

1 PANDA: A Platform for Academic Knowledge Discovery and Acquisition
Zhaoan Dong (1), Jiaheng Lu (2, 1), Tok Wang Ling (3)
(1) Renmin University of China; (2) University of Helsinki; (3) National University of Singapore

2 CONTENT
1. Motivation and background
2. Definitions and problem statement
3. Our hybrid framework
4. Current system implementation
5. Related work
6. Conclusion and future work

3 1. Motivation and background
Existing popular web-based academic search systems provide literature search and retrieval services through a user-friendly interface. A keyword search returns a long list of paper titles and other textual information. To find the papers they want, users often need to scan the long list and download papers to read one by one, which is time-consuming and costly.

4 1. Motivation and background
PandaSearch: A Fine-grained Academic Search Engine for Research Documents on Computer Science (ICDE 2015)

5 1. Motivation and background
PandaSearch: A Fine-grained Academic Search Engine for Research Documents on Computer Science (ICDE 2015)

6 1. Motivation and background
Knowledge Cells: meaningful information objects within academic documents (e.g., Figures, Tables, Definitions).
Example 1: [Figure: examples of different types of Knowledge Cells: Figure, Table, Definition, Algorithm]

7 1. Motivation and background
Some relationships among Knowledge Cells are implied or hidden in the sentences of the articles.
Example 2:
(1) K-Medoids is compared with PIL.
(2) The K-Medoids algorithm depends on Equation 4.
(3) PAM is a kind of K-Medoids algorithm.
[Figure: graph over the nodes K-medoids, PIL, PAM and Equation 4; visible edge label: DEPD]

8 1. Motivation and background
Some relationships among Knowledge Cells are implied or hidden in the sentences of the articles.
Example 3:
(4) HKMed is adapted from LKMed.
(5) HKMed is adapted from PAM.
From these sentences we can find the relationships among three algorithms: HKMed, LKMed and PAM.
[Figure: graph over the nodes HKMed, LKMed, PAM, K-medoids, PIL and Equation 4; visible edge labels: VARNT, DEPD]

9 1. Motivation and background
Example 4: [Figure: a fragment of an Academic Knowledge Graph. Node types: Figure, Definition, Algorithm, Theorem, Table. Edge labels: REF, DEPD, CMP]

10 1. Motivation and background
The Academic Knowledge Graph can provide:
More accurate paper-level results: improving the ranking of relevant papers for keyword queries.
Fine-grained search: looking inside the documents to search research data within scientific articles, and returning fine-grained information objects rather than only a flat list of paper-level information.
Deep-level information exploration: academic knowledge discovery and academic information exploration, including for developers.

11 1. Motivation and background
In the future, on the one hand, we want to add Advanced Search to PandaSearch for common users, as shown below. On the other hand, we can provide SQL-like APIs for external systems, as demonstrated in the following examples.

12 1. Motivation and background
Example 5: Find the Figures that contain "inverted list" in their captions, and the papers these Figures come from.
    SELECT p.pid, p.title, k.name, k.content
    FROM papers p, cells k
    WHERE contains(k.name, "inverted list")
      AND k.type = "figure"
      AND p.pid = k.pid;
We use a non-standard SQL statement to illustrate what the query language looks like. papers and cells can be either relational tables or non-relational data collections; contains can be a built-in function.
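To make the query in Example 5 concrete, here is a minimal Python sketch that runs it against an in-memory SQLite database. The table layouts, the sample rows and the `contains` helper (a plain case-insensitive substring test) are assumptions made for illustration; the slides do not specify how PANDA stores cells or implements `contains`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for the slide's contains(): case-insensitive substring test.
conn.create_function(
    "contains", 2,
    lambda text, phrase: int(phrase.lower() in (text or "").lower()),
)

# Assumed schemas; only pid, title, type, name and content appear in the slides.
conn.executescript("""
CREATE TABLE papers (pid INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE cells (kid INTEGER PRIMARY KEY, pid INTEGER, type TEXT,
                    name TEXT, content TEXT);
""")

# Made-up sample data.
conn.executemany("INSERT INTO papers VALUES (?, ?)",
                 [(1, "Paper A"), (2, "Paper B")])
conn.executemany("INSERT INTO cells VALUES (?, ?, ?, ?, ?)", [
    (10, 1, "figure", "Figure 2: An inverted list index", "fig2.png"),
    (11, 1, "table", "Table 1: Inverted list sizes", "table1"),
    (12, 2, "figure", "Figure 1: System overview", "fig1.png"),
])

# Same shape as Example 5, with single-quoted string literals for SQLite.
rows = conn.execute("""
    SELECT p.pid, p.title, k.name, k.content
    FROM papers p, cells k
    WHERE contains(k.name, 'inverted list')
      AND k.type = 'figure'
      AND p.pid = k.pid
""").fetchall()

for row in rows:
    print(row)
```

Only the Figure whose caption mentions the phrase is returned; the Table with a matching caption is excluded by the type filter.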

13 1. Motivation and background
Example 6: Search for algorithms from different papers that are variants of, or have been compared with, an Algorithm whose name is related to "hash join".
    SELECT k1.pid, k1.name, k2.pid, k2.name
    FROM cells k1, cells k2
    WHERE relations(k1, k2) IN ("CMP", "VARNT")
      AND contains(k2.name, "hash join")
      AND k1.type = "Algorithm" AND k2.type = "Algorithm"
      AND k1.pid != k2.pid;
cells can be either a relational table or a non-relational data collection; relations can be a built-in function.
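The query in Example 6 can also be read as a filter over labelled edges between cells. The following Python sketch evaluates it over a non-relational collection; the cell identifiers, sample names, edge directions and the `related_algorithms` helper are all hypothetical and serve only to illustrate the semantics of `relations` and `contains`.

```python
# Made-up cells keyed by a hypothetical cell identifier.
cells = {
    "a1": {"pid": 1, "type": "Algorithm", "name": "Grace hash join"},
    "a2": {"pid": 2, "type": "Algorithm", "name": "Sort-merge join"},
    "a3": {"pid": 3, "type": "Algorithm", "name": "Hybrid hash join"},
    "a4": {"pid": 1, "type": "Algorithm", "name": "Simple hash join"},
    "f1": {"pid": 2, "type": "Figure", "name": "Hash join throughput"},
}

# Directed, labelled edges: (k1, k2) -> relationship type.
edges = {
    ("a2", "a1"): "CMP",
    ("a3", "a1"): "VARNT",
    ("a4", "a1"): "VARNT",  # same paper as a1, so the query excludes it
    ("f1", "a1"): "REF",    # wrong relationship type and wrong cell type
}


def contains(text, phrase):
    """Case-insensitive substring test (an assumed meaning of contains)."""
    return phrase.lower() in text.lower()


def related_algorithms(cells, edges, phrase, wanted=("CMP", "VARNT")):
    """Return (k1.pid, k1.name, k2.pid, k2.name) rows matching Example 6."""
    rows = []
    for (k1, k2), rel in edges.items():
        c1, c2 = cells[k1], cells[k2]
        if (rel in wanted
                and contains(c2["name"], phrase)
                and c1["type"] == "Algorithm"
                and c2["type"] == "Algorithm"
                and c1["pid"] != c2["pid"]):
            rows.append((c1["pid"], c1["name"], c2["pid"], c2["name"]))
    return sorted(rows)


result = related_algorithms(cells, edges, "hash join")
for row in result:
    print(row)
```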

14 1. Motivation and background
Objectives and challenges:
(1) Correctly identify and extract the contents of each Knowledge Cell. PDFs lack sufficient structural information, and the documents come from diverse journals, published in different years and with different layouts.
(2) Extract the attributes, key phrases and contexts of the Knowledge Cells. The captions of Figures, the specifications of Algorithms, etc. are hard for a computer to understand.
(3) Identify and extract the various relationships between Knowledge Cells. The relationships are usually implied in the text, rare or invisible. Some even require expertise to be recognized.

15 1. Motivation and background
Example 8: The layouts of Knowledge Cells vary with the format of different documents, conferences and journals.

16 1. Motivation and background
Example 9: Even within one document, the layouts differ. There are at least three different layouts among 11 logical objects, including one Table and ten Figures.

17 1. Motivation and background
Example 10: [Figure annotations: Text, Number, Formula; missing Caption]
Missing information makes some attributes of the Knowledge Cells null.
Information overload makes it hard to extract the attributes and relationships.

18 1. Motivation and background
To overcome these challenges, we propose a hybrid framework combining the accuracy of human workers with the speed of computer algorithms.
Automatic computer algorithms: low cost and fast, but they can hardly be extended to handle diverse journals and layouts as the number of scientific publications grows.
Human workers in crowdsourcing: higher accuracy and performance, but crowdsourcing is expensive in time and money.
The cooperation of human and machine can help researchers resolve large-scale complex problems more efficiently.

19 2. Definitions and problem statement
The definition of Knowledge Cell
Definition 1: A Knowledge Cell is a meaningful information object within an academic document. Each Knowledge Cell has attributes including an identifier, paper identifier, type, name, content, key phrases, and so on. More generally, papers can also be regarded as a special kind of Knowledge Cell, with attributes such as paper identifier (e.g., pid), title, authors, pages, conference or journal, date, references, etc.
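As an illustration of Definition 1, the following Python sketch models a Knowledge Cell as a record with the attributes listed above. The field names (`kid`, `key_phrases`) and the `missing_attributes` helper are assumptions for this sketch, not part of the PANDA schema.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class KnowledgeCell:
    kid: str                        # identifier (field name is an assumption)
    pid: str                        # identifier of the paper the cell belongs to
    type: str                       # e.g. "Figure", "Table", "Definition"
    name: Optional[str] = None      # e.g. the caption; may be missing
    content: Optional[str] = None   # e.g. cropped image path or extracted text
    key_phrases: List[str] = field(default_factory=list)

    def missing_attributes(self) -> List[str]:
        """Names of attributes that are still empty and need to be supplied."""
        return [f.name for f in fields(self)
                if getattr(self, f.name) in (None, "", [])]


cell = KnowledgeCell(kid="k1", pid="p1", type="Figure",
                     name="Figure 3: Index structure")
print(cell.missing_attributes())
```

Listing the empty attributes of a record is one simple way to decide what still has to be filled in, whether by an algorithm or by a human worker.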

20 2. Definitions and problem statement
The definition of Academic Knowledge Graph
Definition 2: An Academic Knowledge Graph is a directed graph $AKG = (K, R)$, where $K$ is the set of Knowledge Cells extracted from a collection of academic documents and $R = \{ (k_1, k_2, r) \mid k_1, k_2 \in K;\ k_1 \neq k_2;\ r \text{ is the relationship between } k_1 \text{ and } k_2 \}$. Note that $k_1$ and $k_2$ are two Knowledge Cells from either the same PDF file or two different files.
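Definition 2 maps naturally onto a set of cells plus a set of labelled triples. The sketch below is one possible in-memory representation; the class and method names are hypothetical, and the edge labels and directions assigned to the example sentences (CMP, DEPD, VARNT) are an assumed reading of the earlier slides.

```python
class AcademicKnowledgeGraph:
    """Directed graph AKG = (K, R) with labelled edges (k1, k2, r)."""

    def __init__(self):
        self.cells = set()      # K
        self.relations = set()  # R, a set of (k1, k2, r) triples

    def add_cell(self, k):
        self.cells.add(k)

    def add_relation(self, k1, k2, r):
        # Both endpoints must be known cells, and they must be distinct.
        if k1 not in self.cells or k2 not in self.cells:
            raise KeyError("both endpoints must be in K")
        if k1 == k2:
            raise ValueError("k1 and k2 must be distinct")
        self.relations.add((k1, k2, r))

    def neighbors(self, k1, r=None):
        """Cells reachable from k1 by one edge, optionally of type r."""
        return sorted(k2 for (a, k2, rel) in self.relations
                      if a == k1 and (r is None or rel == r))


akg = AcademicKnowledgeGraph()
for name in ["K-Medoids", "PIL", "PAM", "Equation 4", "HKMed", "LKMed"]:
    akg.add_cell(name)

akg.add_relation("K-Medoids", "PIL", "CMP")          # compared with
akg.add_relation("K-Medoids", "Equation 4", "DEPD")  # depends on
akg.add_relation("HKMed", "LKMed", "VARNT")          # adapted from (assumed label)
akg.add_relation("HKMed", "PAM", "VARNT")

print(akg.neighbors("HKMed", "VARNT"))
print(akg.neighbors("K-Medoids"))
```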

21 2. Definitions and problem statement
We obtain a more general Academic Knowledge Graph (GAKG), a hypergraph, if it also contains the relationships between:
each paper and its citations;
each paper and the Knowledge Cells within it;
Knowledge Cells.
[Figure: a fragment of a general Academic Knowledge Graph. Node types: Figure, Definition, Algorithm, Theorem, Table. Edge label: CITE]

22 2. Definitions and problem statement
Problem statement: The problem of academic knowledge discovery and acquisition can be modeled as a crowd-sourced database problem, where scholarly papers, Knowledge Cells and their relationships are represented as rows (records) with some missing attributes that can be supplied by either automatic algorithms or anonymous human workers. Our objective is to identify and extract them, by either automatic algorithms or anonymous human workers, for further queries. We focus on how to design hybrid workflows that combine automatic algorithms and crowdsourced tasks efficiently and effectively.

23 3. Our hybrid framework
A generic framework for knowledge discovery and acquisition from PDF documents.
[Figure: the hybrid workflows. Components shown: PDF pages, automated extracting algorithm, high confidence, low confidence, HIT candidates, HITs generating, HITs, crowdsourcing, crowd training, confirmed Knowledge Cells]

24 3. Our hybrid framework
Our hybrid workflows can be regarded as a multi-stage process.
(1) Preprocessing stage.
Metadata harvesting: metadata of papers (title, authors, publication date, page numbers, etc.) can be harvested from public websites in advance.
Format conversion: PDF documents to text files; PDF pages to JPEG/PNG images.
Page filtering by rule-based filters: PDF pages that obviously do not contain the target objects to be extracted should be filtered out.

25 3. Our hybrid framework
(2) Extracting academic knowledge using automatic algorithms.
Heuristic methods and machine learning algorithms are employed to:
Locate the area of each Knowledge Cell.
Analyze the text and extract the attributes, contexts and key phrases of each Knowledge Cell.
Provide a confidence estimate of how accurate and reliable an identified result is likely to be.
Adjust the confidence filtering threshold dynamically, taking into account time cost, result quality and crowdsourcing budget.

26 3. Our hybrid framework
(3) Crowdsourcing task design.
Results with a high confidence value are retained. Otherwise, the current page is passed to the crowdsourcing layer as a Human Intelligence Task Candidate (HITC).
Human Intelligence Tasks (HITs) for extracting certain Knowledge Cells or information are designed and generated.
A web-based, task-oriented crowdsourcing system offers identifying tasks, reviewing tasks, tutorial tasks and test tasks.
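The confidence-based routing described in stages (2) and (3) can be sketched as follows. This is a simplified illustration under stated assumptions: each extraction result carries a `confidence` score in [0, 1], the only budget considered is a maximum number of HITs, and the names `route`, `threshold_for_budget` and `ceiling` are hypothetical. The slides do not describe how PANDA actually adjusts its threshold.

```python
def route(results, threshold):
    """Split results into confirmed cells and HIT candidates by confidence."""
    confirmed = [r for r in results if r["confidence"] >= threshold]
    hit_candidates = [r for r in results if r["confidence"] < threshold]
    return confirmed, hit_candidates


def threshold_for_budget(results, max_hits, ceiling=0.9):
    """Largest threshold <= ceiling that sends at most max_hits results
    to the crowd (a result is sent when its confidence is below the threshold)."""
    scores = sorted(r["confidence"] for r in results)
    if max_hits >= len(scores):
        return ceiling
    # At most max_hits scores are strictly below scores[max_hits].
    return min(ceiling, scores[max_hits])


# Made-up extraction results for five pages.
results = [
    {"page": 1, "confidence": 0.95},
    {"page": 2, "confidence": 0.40},
    {"page": 3, "confidence": 0.85},
    {"page": 4, "confidence": 0.60},
    {"page": 5, "confidence": 0.92},
]

threshold = threshold_for_budget(results, max_hits=2)
confirmed, hit_candidates = route(results, threshold)
print(threshold, [r["page"] for r in hit_candidates])
```

With a budget of two HITs, the two least confident pages go to the crowd and the rest are retained; a larger budget raises the threshold up to the ceiling.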

27 3. Our hybrid framework
(4) Crowdsourcing process management and cost model.
Answer aggregation and quality control: majority vote, etc.; a tutorial module; a test module.
A crowdsourcing cost model: how to achieve higher quality with a fixed budget; how to reduce the overall cost under quality constraints.
User management module: registration; ranking and reputation.
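Majority vote, the aggregation method named above, can be sketched in a few lines of Python. The `min_agreement` parameter and the choice to return `None` when no answer wins a strict majority are assumptions of this sketch, not details taken from the slides.

```python
from collections import Counter


def majority_vote(answers, min_agreement=0.5):
    """Return the answer given by more than min_agreement of the workers,
    or None if no answer reaches that share (e.g. on a tie)."""
    if not answers:
        return None
    top_answer, votes = Counter(answers).most_common(1)[0]
    if votes / len(answers) > min_agreement:
        return top_answer
    return None


# Three workers label the same region; answers must be hashable.
print(majority_vote(["figure", "figure", "table"]))
print(majority_vote(["figure", "table"]))
```

An unresolved vote could then be sent to a reviewing task or to additional workers.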

28 4. Current system implementation
Platform for Academic Knowledge Discovery and Acquisition (PANDA)
[Figure: components shown are Internet, PDFs, Crowds, PANDA, Academic Knowledge Base, PandaSearch, Query, Result and User]
PANDA serves as a data provider for PandaSearch.

29 4. Current system implementation
[Figure: the system architecture of PANDA]

30 4. Current system implementation
(1) Data Storage
2.9 million PDF documents in computer science. We currently focus on the extraction of Figures.

Statistics of current data stores
Data Type       Number
Papers          [...]
Figures         [...]
Definitions     1939
Lemmas          757
Theorems        726
Algorithms      671
Propositions    52
Examples        1038

So far, we have extracted Figures from 5,000 papers, including nearly 4,000 SIGMOD papers published from 1980 to [...], so the number of Figures is much smaller than the number of papers. This is why we want to develop PANDA: to process the remaining papers, whose number is still increasing.

31 4. Current system implementation
(2) Algorithmic Layer
We have built an algorithm using rule-based and machine learning methods to automatically extract Figures:
1. Split the PDF document into pages.
2. Convert the PDF file into a standard text file format.
3. Filter out the pages that obviously do not contain figures.
4. Locate the boundary of the figure's content area with a detector (PDFBox and libsvm are used).
5. Crop the Figure's content with an Extractor or a Cropper according to the position information.
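Step 3 can be illustrated with a simple rule-based filter. The sketch below keeps only the pages whose extracted text contains a line that looks like a figure caption. The caption pattern and the function name are assumptions; the slides do not state which rules the actual filter uses, and the real system works on text produced by PDFBox rather than on hand-written strings.

```python
import re

# Assumed caption pattern: a line starting with "Figure 3:" or "Fig. 3."
CAPTION_RE = re.compile(r"^\s*(Figure|Fig\.)\s*\d+\s*[:.]",
                        re.IGNORECASE | re.MULTILINE)


def pages_with_figure_captions(page_texts):
    """Return the 0-based indices of pages that appear to contain a figure."""
    return [i for i, text in enumerate(page_texts)
            if CAPTION_RE.search(text)]


# Made-up page texts standing in for the converted text files.
pages = [
    "Intro text. As shown in Figure 1, the system has two layers.",
    "Figure 1: System overview\nThe architecture consists of ...",
    "References\n[1] A. Author. Some paper.",
    "Experimental setup\n  Fig. 2. Results on the test set",
]

kept = pages_with_figure_captions(pages)
print(kept)
```

An in-text mention such as "As shown in Figure 1" does not count as a caption here, so the first page is filtered out.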

32 4. Current system implementation
We performed an initial experiment on extracting Figures from nearly 4,000 SIGMOD papers published from 1980 to [...]. In addition to Precision, Recall and F-Measure, we use Completeness and Purity to evaluate the results of the boundary detector.
Complete: the result region includes all parts of the Knowledge Cell content.
Pure: the result region does not contain anything that does not belong to the Knowledge Cell.
A correctly identified component of a Knowledge Cell is therefore both complete and pure.
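The Completeness and Purity criteria can be checked mechanically once regions are represented as boxes. The sketch below assumes axis-aligned boxes `(x0, y0, x1, y1)`, a list of blocks that belong to the Knowledge Cell and a list of blocks that do not; this representation and the function names are assumptions for illustration, not the evaluation code used in the experiment.

```python
def contains_box(outer, inner):
    """True if the inner box lies entirely within the outer box."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def overlaps(a, b):
    """True if two boxes share a region of non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def is_complete(region, cell_blocks):
    """The region includes every block of the Knowledge Cell."""
    return all(contains_box(region, b) for b in cell_blocks)


def is_pure(region, other_blocks):
    """The region touches no block outside the Knowledge Cell."""
    return not any(overlaps(region, b) for b in other_blocks)


def is_correct(region, cell_blocks, other_blocks):
    return is_complete(region, cell_blocks) and is_pure(region, other_blocks)


# Made-up layout: a figure body and its caption, with body text below.
cell_blocks = [(10, 10, 50, 40), (10, 40, 50, 50)]
other_blocks = [(10, 60, 50, 90)]

exact = (10, 10, 50, 50)      # complete and pure
cut_left = (20, 10, 50, 50)   # discards the left part: not complete
too_large = (10, 10, 50, 70)  # covers body text: not pure

for region in (exact, cut_left, too_large):
    print(is_complete(region, cell_blocks), is_pure(region, other_blocks))
```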

33 4. Current system implementation
Example 11: The identified results on the left page are not correct: the first one discards the left part of the Figure, and the second one covers too much text.

34 4. Current system implementation
Preliminary experimental results of the current algorithms for Figure extraction.
[Figure: Recall, Precision and F-Measure]
The figure shows that performance for papers from 1980 to 1989 is lower than for papers from later years.

35 4. Current system implementation
Example 12: PDF pages from earlier years.
This is because the PDF files from earlier years usually have low quality or resolution. The extracted text usually contains various kinds of noise from the character recognition process, e.g., typos. This may affect the discovery and location of some Knowledge Cells.

36 4. Current system implementation
(3) Crowdsourcing Layer
[Figure: an example of the web-based interfaces for extracting Figures]

37 4. Current system implementation
(4) Crowds / human workers
Who might contribute to the crowdsourced tasks? Common users, authors, student volunteers. Publish tasks on Mechanical Turk or CrowdFlower?
How to motivate and retain human workers? Games? Award points? reCAPTCHA?

38 5. Related work
Increasing attention has been paid to the extraction and management of research data within the scientific literature.
Digital Curation (DC) is the selection, preservation, maintenance, collection and archiving of digital assets. It establishes, maintains and adds value to repositories of digital data for present and future use.
Deep Indexing (DI) indexes the research data within articles that are invisible to traditional bibliographic searches. Deep Indexing is now available in ProQuest, CiteSeerX, ScienceDirect, etc.

39 5. Related work
Figures and tables are also displayed when the paper they come from is returned as a search result. In CiteSeer, users can search tables by entering keywords.

40 5. Related work
However:
The extraction and management of each kind of Knowledge Cell is independent.
Their query and display depend on the query of academic papers, not on the attributes of the Knowledge Cells themselves.
No published work focuses on the relationships among the various kinds of Knowledge Cells.
No related work uses these relationships to build an Academic Knowledge Graph as we propose.

41 5. Related work
Automatic Information Extraction
A number of methods, techniques and tools have been employed to analyze the structure of PDFs and identify different layout blocks within them.
Hu, Jianying, and Y. Liu. "Analysis of Documents Born Digital." Handbook of Document Image Processing and Recognition. Springer London, 2014.
Klampfl, Stefan, et al. "Unsupervised document structure analysis of digital scientific articles." International Journal on Digital Libraries 14.3 (2014).
J. Wu, K. Williams, H. Chen, M. Khabsa, C. Caragea, A. Ororbia, D. Jordan, and C. L. Giles. "CiteSeerX: AI in a digital library search engine." In AAAI, 2014.
Most of them focus on the structure analysis of PDF documents to identify and extract the content of Figures and Tables. We want to extend them to the extraction of other kinds of Knowledge Cells and their attributes.

42 5. Related work
Task-Oriented Crowdsourcing
C. Lofi and K. E. Maarry. "Design patterns for hybrid algorithmic crowdsourcing workflows." In CBI, 2014.
N. Luz, N. Silva, and P. Novais. "A survey of task-oriented crowdsourcing." Artificial Intelligence Review, pp. 1-27.
N. Luz, N. Silva, and P. Novais. "Generating human-computer microtask workflows from domain ontologies." In Human-Computer Interaction. Theories, Methods, and Tools. Springer, 2014.
E. Kamar, S. Hacker, and E. Horvitz. "Combining human and machine intelligence in large-scale crowdsourcing." In AAMAS, 2012.
S. K. Kondreddi, P. Triantafillou, and G. Weikum. "Combining information extraction and human computing for crowdsourced knowledge acquisition." In ICDE, 2014.
There is no related work on academic knowledge discovery and acquisition using crowdsourcing methods.

43 6. Conclusion and future work
The objective of this research is to identify and extract academic knowledge using a hybrid framework integrating the accuracy of human workers and the speed of algorithms.
The contributions of this paper:
Stated the problem of academic knowledge discovery and acquisition as a crowd-sourced database problem, based on the definitions of Knowledge Cells and the Academic Knowledge Graph.
Proposed a hybrid framework integrating the accuracy of human workers and the speed of automatic algorithms.
Designed a web-based crowdsourcing module for Figure extraction, with some preliminary achievements.

44 6. Conclusion and future work
We have a lot of work to do:
Improving the feasibility of the crowdsourcing interfaces and optimizing the design of HITs.
Making the algorithms confidence-aware and able to interact iteratively with the crowdsourcing modules: strategies for switching tasks; optimization of the algorithms using human contributions; trade-off considerations.
Extending the framework to identify and extract various attributes and information of Knowledge Cells. Different Knowledge Cells have different features.

45 Thank you!
