Enabling Open Science: Data Discoverability, Access and Use. Jo McEntyre Head of Literature Services

Enabling Open Science: Data Discoverability, Access and Use Jo McEntyre Head of Literature Services www.ebi.ac.uk

About EMBL-EBI Part of the European Molecular Biology Laboratory International, non-profit research institute Europe s hub for biological data services and research

A stable, freely available, shared repository Europe PMC Abstracts: 30 million Full-text articles: 3 million Article citation counts Grants ORCIDs Semantic annotation Data citations Data integration Europe PMC is a member of the PMC International Collaboration. Funded by 26 European funders of life science research

Degrees of Data Unstructured/semistructured A picture of a graph A spreadsheet of my results A narrative with citations, pictures and attachments Article Metadata Structured A record in a DNA sequence database A graphical display of a genome Added Value

The scent of information retaining files and being prepared to share them accessible data It's healthy to remember that users are selfish, lazy, and ruthless in applying their cost-benefit analyses Information Foraging: by JAKOB NIELSEN, June 30, 2003 http://www.nngroup.com/articles/information-scent/ Data placement: where people will find it Scent alone is not scientific

1. Discoverability through accessibility Deposit in a public/open database Where possible, structured archive (e.g. PDB, ENA) >> unstructured archive (e.g. Zenodo, Figshare) Uniquely identify it: PID, Accession number, DOI, ROI Give it context: metadata (and more) All of the above = citable =

Data Citation Principles Data as legitimate, citable products of research Attribution and credit Cited as evidence for a claim Unique identification Access Persistence Specificity and verifiability (provenance) Interoperability and flexibility https://www.force11.org/datacitation

2. Discoverability through structured data structured data is one of the true enablers of life science - Discovery of homology between genes across species - Predicting function based on protein folds Structured data can be cross-analysed, compared by algorithm, and encourages development of new products and tools

Discoverability through structured data

Metadata critical to discoverability Generic: title, submitters, date, file format, version. citation basic search Wagner F.F., 23-APR-2002, TPA: Homo sapiens SMP1 gene, RHD gene and RHCE gene, INSDC, 14-NOV-2006 (Rel. 89, Last updated, Version 7). BN000065 Specific: organism, tissue, assay, page number deep search analysis computation

Structured data is good value for money Annual cost of generating new protein structure data in labs around the world Annual cost of maintaining it in a central database

3. Discoverable through added value In-built QA: otherwise you couldn t do this!

Literature-Data Integration

Quality control and trustworthiness Pre publication QA: a range of activities from is this file valid to is it described adequately/richly by metadata? Post publication Data are immutable: time and repetition required to show outliers. Openness: enabling assessment & reuse Provenance & connectivity with related data, methods, code, tools, articles

Text and Data Mining (TDM) The more structure there is the more reusable data becomes TDM builds bridges between unstructured (including articles) and structured data worlds Such analyses can drive adoption of more structure. Open Access infrastructure is a critical enabler Graphic kindly supplied by the National Centre for Text Mining (NaCTeM), UK

BioStudy database for unstructured data EBI BioStudy Metadata Study Data files Ontologies Other DBs Publications Other DBs

Making data discoverable provide tools to help researchers use it Labs around the world deposit data and we Analyse, add value and integrate it Share it with other data providers A collaborative enterprise Archive it Classify it

Elixir: An international distributed infrastructure for Data Standards Tools Compute Training Industry

Some of many Open Questions What can stakeholders do to capitalize on the research investment? Incentives and credit for good data management Can all data outputs to be peer reviewed (like articles)? How far does the article model apply to data? Negative results? What are we prepared to spend: costs benefits? How does this apply to big data? What is the public interest in discoverable data?