Scholarly Big Data: Leverage for Science
C. Lee Giles
The Pennsylvania State University, University Park, PA, USA
giles@ist.psu.edu | http://clgiles.ist.psu.edu
Funded in part by NSF, the Allen Institute for Artificial Intelligence (AI2), Dow Chemical, & the Qatar Foundation.
What is Scholarly Big Data?
- All academic/research documents: journal & conference papers, books, theses, technical reports
- Related data:
  - academic/researcher/group/lab web homepages
  - funding agency and organization grants, records, reports
  - research laboratory reports
  - patents
- Associated data: presentations, experimental data (very large), images, video, figures, tables, course materials
- Social networks
- Examples: Google Scholar, Microsoft Academic Search, publishers/repositories, CiteSeer, ArnetMiner, funding agencies, universities, Mendeley, ResearchGate, Semantic Scholar, LibGen, Sci-Hub, others
Scholarly Big Data
- Most of the data available in the era of scholarly big data is not clean, structured records [example figures omitted]
- It looks more like linked data with semantics (tags and labels)
- Courtesy Lise Getoor, NIPS 2012
Where do you get this data?
- Web (Wayback Machine; crawling, e.g., with Heritrix)
- Repositories (arXiv, CERN, PubMed, us)
- Bibliographic resources (PubMed, DBLP)
- Funding sources/laboratories
- Publishers
- Data aggregators (Web of Science)
- Patents
- APIs (Microsoft Academic)
- How much is there, & how much is available?
Who is interested in scholarly big data?
- Scholars, scientists/engineers
- Economists
- Policy makers
- Funding agencies (government, foundations, etc.)
- Educators
- Social scientists
- Business
- Governments
- Science of science
Scholarly Big Data Research Directions
- Data creation, management, collections
- Search and access; data mining and information extraction (NER, entity disambiguation)
- Data integration and linking
- Data integrity and cleaning
- Large-scale experiments
- Knowledge discovery
- Collaboration and sharing
- Visualization
- Privacy & security (not so much)
- New social networks: collaboration; teams; sociology & policy of science
- Many uses of AI & machine learning (Ng, ICML 2012)
Applications of scholarly big data
- New discoveries, directions & trends in research (DARPA Big Mechanism)
- Scientific, technical, and scholarly trends
- Science and technology innovation
- Evaluation of science, technology, and scholarly investments - the science of science
- Individual, group, and organization evaluation
- Collaboration opportunities; building teams
- "Moneyball" for scholars/scientists
IARPA FUSE Program
Scholarly Big Data Workshop
Big Scholar Workshop
Semantic Scholar
Semantic data in CiteSeerX
Automatic Metadata Information Extraction (IE) - CiteSeerX
- Header: title, authors, affiliations, abstract
- Body: text, figures, tables, formulae, citations
- Pipeline: PDF → converter → IE → databases → search index
- Many other open-source academic document metadata extractors are available: recent JCDL workshop, metadata hackathon, JCDL tutorial 2016
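As a concrete (if toy) illustration of the header-extraction step, the sketch below pulls a title, author emails, and an abstract out of a paper's first-page text with hand-written rules. The rules and field names are illustrative assumptions only; CiteSeerX and the other extractors mentioned rely on trained machine-learning models rather than regexes.

```python
import re

def extract_header(first_page_text: str) -> dict:
    """Toy rule-based header extraction: assume the first non-empty line
    is the title, email addresses mark authorship, and an 'Abstract'
    marker starts the abstract (ending at the next blank line)."""
    lines = [ln.strip() for ln in first_page_text.splitlines() if ln.strip()]
    title = lines[0] if lines else ""
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", first_page_text)
    m = re.search(r"Abstract[:.\s]*(.+?)(?:\n\s*\n|$)",
                  first_page_text, re.IGNORECASE | re.DOTALL)
    abstract = m.group(1).strip() if m else ""
    return {"title": title, "emails": emails, "abstract": abstract}
```

Rules like these break quickly on multi-column layouts and title pages with running headers, which is why trained sequence models dominate in practice.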
PDFMEF: a tool for entity extraction from scholarly documents (Wu et al., ACM K-CAP 2015)
- Header: title, authors, year, conference/journal
- Full text
- Citations
- Filtering
- Figures, tables, algorithms
Download CiteSeerX Tools
Highlights of AI/ML Technologies in CiteSeerX (Wu et al., IAAI 2014)
- Document classification
- Document deduplication and the citation graph
- Metadata extraction: header, citation, table, figure, and algorithm extraction
- Author disambiguation
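The document-deduplication item above can be illustrated with a generic near-duplicate technique: word shingling plus Jaccard similarity. This is a common textbook approach, not necessarily the exact method CiteSeerX uses, and the shingle size and threshold below are arbitrary.

```python
def shingles(text: str, k: int = 5) -> set:
    """All k-word shingles (contiguous word windows) of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b| of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(docs: dict, threshold: float = 0.8) -> list:
    """Return id pairs whose shingle overlap meets the threshold.
    docs maps document id -> full text."""
    sh = {doc_id: shingles(text) for doc_id, text in docs.items()}
    ids = sorted(sh)
    return [(x, y) for i, x in enumerate(ids) for y in ids[i + 1:]
            if jaccard(sh[x], sh[y]) >= threshold]
```

At CiteSeerX scale the all-pairs comparison above is infeasible; systems typically approximate it with MinHash signatures and locality-sensitive hashing.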
TableSeer: table extraction & search engine (Liu et al., AAAI 2007, JCDL 2006)
Efficient Large-Scale Author Disambiguation - CiteSeerX & PubMed (must scale!)
- Motivation: correct attribution; even manually curated databases (DBLP, medical records) still have errors
- An entity disambiguation problem: determine the real identity of authors using paper metadata (co-authors, affiliation, physical address, email address) plus information from crawling, such as the host server; entity normalization
- Challenges: accuracy, scalability, expandability
- Key features: learn a distance function (Random Forest, others); DBSCAN clustering, which ameliorates labeling inconsistency (the transitivity problem); an efficient solution to finding name clusters with N log N scaling
- Recently applied to all PubMed authors: 80M mentions
- Han et al., JCDL 2004; Huang et al., PKDD 2006; Treeratpituk et al., JCDL 2009; Khabsa et al., JCDL 2015
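A minimal sketch of the pipeline described above: a pairwise distance between author mentions feeds a density-based (DBSCAN) clustering of name mentions. The published systems learn the distance with a Random Forest over many features; the hand-weighted three-feature distance here is a toy stand-in, and the feature names are assumptions for illustration.

```python
def mention_distance(a: dict, b: dict) -> float:
    """Hand-weighted stand-in for a learned (Random Forest) pairwise
    distance: agreement on email, affiliation, and co-author overlap."""
    email = 1.0 if a["email"] and a["email"] == b["email"] else 0.0
    affil = 1.0 if a["affil"] == b["affil"] else 0.0
    coauth = min(len(set(a["coauthors"]) & set(b["coauthors"])) / 2.0, 1.0)
    return 1.0 - (email + affil + coauth) / 3.0

def dbscan(n: int, dist, eps: float = 0.5, min_pts: int = 2) -> list:
    """Minimal DBSCAN over a pairwise distance function; label -1 = noise."""
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = [j for j in range(n) if dist(i, j) <= eps]
        if len(seeds) < min_pts:
            labels[i] = -1            # noise (may later become a border point)
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise point reached: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = [k for k in range(n) if dist(j, k) <= eps]
            if len(nbrs) >= min_pts:  # j is a core point: expand from it
                queue.extend(nbrs)
    return labels
```

Because DBSCAN links mentions through chains of core points, two mentions can land in one cluster without being directly similar, which is how it sidesteps the transitivity problem of pairwise-only merging.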
ChemXSeer
CSSeer (csseer.ist.psu.edu): expert search for authors (H.-H. Chen, JCDL 2014)
CollabSeer: an experimental collaborator recommendation system, currently supporting 400k authors (http://collabseer.ist.psu.edu) (H.-H. Chen, JCDL 2011)
Figure Extraction (Al-Zaidy, AAAI 2016)
- Pipeline: bar chart → chart data extraction → data feature extraction → chart data values
- Example generated summary for a bar chart: "User traffic increases significantly, then really drops off"
- Chart structured as a semantic graph; indexed text, text summaries, user queries
Automated Figure Data Extraction and Search
- A large amount of the results in digital documents is recorded in figures: time series, experimental results (e.g., NMR spectra, income growth)
- Extraction for purposes of:
  - further modeling using the presented data
  - indexing and metadata creation for storage & search on figures, enabling data reuse
- Current extraction is done manually!
- Pipeline: documents → extracted plots & extracted info → plot index merged with the document index → digital library users
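One small, self-contained piece of such a pipeline can be sketched as follows: once vision stages have detected bar bounding boxes and two labeled axis ticks, recovering the plotted values is linear calibration. The input format here is an assumption for illustration, not the representation used by the cited systems.

```python
def bar_values(bars, y_ticks):
    """Recover data values from a detected bar chart.  Assumes earlier
    vision stages produced bar boxes (x, y_top, width, height) in pixels
    with the origin at top-left, and two labeled y-axis ticks as
    (pixel_y, value) pairs.  value(py) = v0 + (py - p0) * units_per_pixel."""
    (p0, v0), (p1, v1) = y_ticks[:2]
    units_per_pixel = (v1 - v0) / (p1 - p0)   # negative when y grows downward
    return [v0 + (y_top - p0) * units_per_pixel
            for (x, y_top, width, height) in bars]
```

The hard part in practice is everything before this step: detecting the bars, ticks, and labels, and running OCR on the tick text.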
Automatic Citation (or Paper) Recommendation
- Built on millions of papers
- Never miss a citation; know about the latest work
- Several recommendation models (Huang, AAAI 2015; Huang, CIKM 2013; He, WWW 2010)
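As a baseline illustration of content-based citation recommendation, the sketch below ranks candidate papers by TF-IDF cosine similarity to a query manuscript. The cited models are considerably richer than this; the generic similarity ranking only conveys the flavor of the task.

```python
import math
from collections import Counter

def tfidf_vectors(docs: dict) -> dict:
    """TF-IDF vectors for a tiny corpus (doc id -> text)."""
    tf = {d: Counter(t.lower().split()) for d, t in docs.items()}
    df = Counter(w for counts in tf.values() for w in set(counts))
    n = len(docs)
    idf = {w: math.log(n / df[w]) for w in df}
    return {d: {w: c * idf[w] for w, c in counts.items()}
            for d, counts in tf.items()}

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(wt * v.get(w, 0.0) for w, wt in u.items())
    nu = math.sqrt(sum(wt * wt for wt in u.values()))
    nv = math.sqrt(sum(wt * wt for wt in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(query_text: str, docs: dict, k: int = 3) -> list:
    """Rank candidate papers by TF-IDF cosine similarity to the query.
    '_q' is a reserved id assumed not to clash with real paper ids."""
    vecs = tfidf_vectors({**docs, "_q": query_text})
    q = vecs.pop("_q")
    return sorted(docs, key=lambda d: cosine(q, vecs[d]), reverse=True)[:k]
```

Real systems combine text similarity with the citation graph itself (co-citation, translation, and neural models), which pure bag-of-words matching cannot capture.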
Big Data: Scholarly Document Sizes
- A large number of academic/research documents, all containing a great deal of data & related semantics: many millions of documents
- Microsoft Academic (2013): 50M records
- PubMed: 25M records, 10 million authors, ~3 times as many mentions
- Google Scholar (English): estimated ~100M records
- Total online estimate: ~120M records, with ~25 million full documents freely available
- 100s of millions of authors, affiliations, locations, dates
- Billions of citation mentions
- 100s of millions of tables, figures, math, formulae, etc.
- Related & linked data; raw data > petabytes
(Khabsa & Giles, PLoS ONE 2014)
Challenges
- Scalable methods for extraction and search: tables, figures, formulae, equations, methodologies, etc.
- How do we effectively integrate and utilize this data for search and research?
- Natural language generation
- What does the data mean (semantics)? Ontologies for scholarly data; scholarly knowledge vault(s)
- Big Mechanism approaches to knowledge discovery and relations
- Monetization?
"The future ain't what it used to be." - Yogi Berra, catcher, NY Yankees
For more information: clgiles.ist.psu.edu | giles@ist.psu.edu | github.com/seerlabs