Scholarly Big Data: Leverage for Science

Scholarly Big Data: Leverage for Science. C. Lee Giles, The Pennsylvania State University, University Park, PA, USA. giles@ist.psu.edu, http://clgiles.ist.psu.edu. Funded in part by NSF, the Allen Institute for Artificial Intelligence (AI2), Dow Chemical, and the Qatar Foundation.

What is Scholarly Big Data? All academic/research documents (journal and conference papers, books, theses, technical reports). Related data: academic/researcher/group/lab web homepages; funding agency and organization grants, records, and reports; research laboratory reports; patents. Associated data: presentations; experimental data (very large); images, video, figures, tables, etc.; course materials; social networks. Examples: Google Scholar, Microsoft Academic Search, publishers/repositories, CiteSeer, ArnetMiner, funding agencies, universities, Mendeley, ResearchGate, Semantic Scholar, LibGen, Sci-Hub, others.

Scholarly Big Data. Most of the data available in the era of scholarly big data does not look like this [figure], or even like this [figure]. It looks more like this [figure], with semantics (tags and labels). Courtesy Lise Getoor, NIPS 2012.

Where do you get this data? Web (Wayback Machine, crawl with Heritrix); repositories (arXiv, CERN, PubMed, us); bibliographic resources (PubMed, DBLP); funding sources/laboratories; publishers; data aggregators (Web of Science); patents; APIs (Microsoft Academic). How much is there, and how much is available?

Who is interested in scholarly big data? Scholars, scientists/engineers; economists; policy makers; funding agencies (government, foundations, etc.); educators; social scientists; business; governments; science of science.

Scholarly Big Data Research Directions. Data creation, management, collections; search and access, data mining and information extraction; NER, entity disambiguation; data integration and linking; data integrity and cleaning; large scale experiments; knowledge discovery; collaboration and sharing; visualization; privacy & security (not so much); new social networks (collaboration; teams); sociology & policy of science. Many uses of AI & machine learning (Ng, ICML 2012).

Applications of scholarly big data. New discoveries, directions & trends in research (DARPA Big Mechanism); scientific, technical and scholarly trends; science and technology innovation; evaluation of science, technology and scholarly investments (science of science); individual, group and organization evaluation; collaboration opportunities, building teams; Moneyball for scholars/scientists.

IARPA FUSE Program

Scholarly Big Data Workshop

Big Scholar Workshop

Semantic Scholar

Semantic data in CiteSeerX

Automatic Metadata Information Extraction (IE) - CiteSeerX. [Pipeline diagram; components: PDF, converter, text, IE, header (title, authors, affiliations, abstract), body, citations, table, figure, formulae, databases, search index.] Many other open source academic document metadata extractors are available: see the recent JCDL workshop, metadata hackathon, and JCDL 2016 tutorial.

Tool for entity extraction for scholarly documents: PDFMEF (Wu et al., ACM K-CAP 2015). [Diagram; components: header, title, authors, year, conference, journal, full text, citations, filtering, figures, tables, algorithms.]

Download CiteSeerX Tools

Highlights of AI/ML Technologies in CiteSeerX (Wu et al., IAAI 2014). Document classification; document deduplication and citation graph; metadata extraction; header extraction; citation extraction; table extraction; figure extraction; algorithm extraction; author disambiguation.

TableSeer: table extraction & search engine (Liu et al., AAAI 2007, JCDL 2006).

Efficient Large Scale Author Disambiguation: CiteSeerX & PubMed. Must scale!!
Motivation: correct attribution. Manually curated databases still have errors (DBLP, medical records).
Entity disambiguation problem (documents; actors, entities): determine the real identity of the authors using metadata of the research papers, including co-authors, affiliation, physical address, email address, and information from crawling such as host server. Entity normalization.
Challenges: accuracy, scalability, expandability.
Key features: learn a distance function (Random Forest, others); DBSCAN clustering; ameliorate labeling inconsistency (transitivity problem); efficient solution to find name clusters; N log N scaling. Recently applied to all PubMed authors, 80M mentions.
References: Han et al., JCDL 2004; Huang et al., PKDD 2006; Treeratpituk et al., JCDL 2009; Khabsa et al., JCDL 2015.
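
The idea of clustering author mentions with DBSCAN over a pairwise distance can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CiteSeerX implementation: the distance here is a hand-written stand-in (name compatibility, co-author overlap, affiliation match) where the slide calls for a learned one such as a Random Forest; the field names, the 0.7/0.3 weights, `eps`, `min_pts`, and the sample mentions are all hypothetical. The neighbor search is brute force (quadratic), so it does not show the N log N scaling, which would need blocking or an index.

```python
from collections import deque

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mention_distance(m1, m2):
    """Hand-written stand-in for a learned distance between two mentions."""
    # Mentions with a different last name or first initial are never merged.
    if m1["last"] != m2["last"] or m1["first"][0] != m2["first"][0]:
        return 1.0
    coauthor_sim = jaccard(m1["coauthors"], m2["coauthors"])
    affil_sim = 1.0 if m1["affil"] == m2["affil"] else 0.0
    # Hypothetical weights; a trained model would replace this line.
    return 1.0 - (0.7 * coauthor_sim + 0.3 * affil_sim)

def dbscan(points, dist, eps, min_pts):
    """Plain DBSCAN; returns one cluster id per point, -1 for noise."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        # Brute-force scan over all points (includes i itself).
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(nbrs)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is a core point, keep expanding
    return labels

# Hypothetical author mentions, one per (paper, author-name) occurrence.
mentions = [
    {"first": "j", "last": "smith", "coauthors": {"lee", "patel"}, "affil": "univ-a"},
    {"first": "john", "last": "smith", "coauthors": {"patel", "kim"}, "affil": "univ-a"},
    {"first": "j", "last": "smith", "coauthors": {"garcia"}, "affil": "univ-b"},
    {"first": "m", "last": "jones", "coauthors": {"lee"}, "affil": "univ-a"},
]

labels = dbscan(mentions, mention_distance, eps=0.5, min_pts=2)
print(labels)
```

Mentions that share a cluster id are treated as the same person; a label of -1 marks a mention that was not merged with any other.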

ChemXSeer

csseer.ist.psu.edu: expert search for authors (H-H Chen, JCDL 2014).

Experimental collaborator recommendation system. CollabSeer currently supports 400k authors. http://collabseer.ist.psu.edu (H-H Chen, JCDL 2011)
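
One structural signal a collaborator recommender could use is the overlap between two authors' co-author neighborhoods. The sketch below ranks non-collaborators by Jaccard similarity of those neighborhoods. The slide does not say which measure CollabSeer uses, so the choice of Jaccard, the function names, and the toy paper list are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def build_coauthor_graph(papers):
    """Map each author to the set of people they have co-authored with."""
    graph = defaultdict(set)
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def recommend(graph, author, top_k=3):
    """Rank authors who are not yet collaborators by neighborhood overlap."""
    current = graph.get(author, set())
    scores = {}
    for other, other_nbrs in graph.items():
        if other == author or other in current:
            continue  # skip the author and existing collaborators
        union = current | other_nbrs
        score = len(current & other_nbrs) / len(union) if union else 0.0
        if score > 0:
            scores[other] = score
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Hypothetical author lists, one per paper.
papers = [
    ["ana", "bo", "cy"],
    ["bo", "dee"],
    ["cy", "dee"],
    ["dee", "eli"],
    ["fay", "gus"],
]

graph = build_coauthor_graph(papers)
recs = recommend(graph, "ana")
print(recs)
```

Here "dee" is the only candidate for "ana" because both of ana's co-authors have also written with dee; authors with no shared neighbors score zero and are dropped.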

Figure extraction: bar charts (Al-Zaidy, AAAI 2016). [Diagram; components: chart data extraction, data feature extraction, bar chart, chart data values, chart structured as semantic graph, indexed text, text summaries, user queries. Example text summary: "User traffic increases significantly then really drops off."]

Automated Figure Data Extraction and Search. A large amount of results in digital documents is recorded in figures, time series, and experimental results (e.g., NMR spectra, income growth). Extraction serves two purposes: further modeling using the presented data; and indexing and metadata creation for storage and search on figures for data reuse. Current extraction is done manually! [Diagram; components: documents, extracted plot, extracted info, document index, plot index, merged index, digital library, user.]

Automatic Citation (or Paper) Recommendation. Built on millions of papers. Never miss a citation, and know about the latest work. Several recommendation models: Huang, AAAI 2015; Huang, CIKM 2013; He, WWW 2010.

Big Data: Scholarly Document Size. A large number of academic/research documents, all containing a great deal of data and related semantics. Many millions of documents: 50M records in Microsoft Academic (2013); 25M records, 10 million authors, and 3 times as many mentions in PubMed; Google Scholar (English) estimated at ~100M records; total online estimate ~120M records; ~25 million full documents freely available. 100s of millions of authors, affiliations, locations, dates. Billions of citation mentions. 100s of millions of tables, figures, math, formulae, etc. Related & linked data. Raw data > petabytes. (Khabsa, Giles, PLoS ONE, 2014)

Challenges. Scalable methods for extraction and search: tables, figures, formulae, equations, methodologies, etc. How do we effectively integrate and utilize this data for search and research? Natural language generation. What does the data mean (semantics)? Ontologies for scholarly data. Scholarly knowledge vault(s). Big Mechanism approaches, knowledge discovery, and relations. Monetization?

"The future ain't what it used to be." Yogi Berra, catcher, NY Yankees. For more information: clgiles.ist.psu.edu, giles@ist.psu.edu, github.com/seerlabs