Supporting Data Workflows at STFC. Brian Matthews Scientific Computing Department

Similar documents
Metadata Models for Experimental Science Data Management

Managing Research Data for Diverse Scientific Experiments

PaNSIG (PaNData), and the interactions between SB-IG and PaNSIG

Supporting scientific processes in National Facilities

Requirements for data catalogues within facilities

EUDAT-B2FIND A FAIR and Interdisciplinary Discovery Portal for Research Data

FREYA Connected Open Identifiers for Discovery, Access and Use of Research Resources

EUDAT & SeaDataCloud

PaNdata-ODI Deliverable: D4.4. PaNdata-ODI. Deliverable D4.4. Benchmark of performance of the metadata catalogue

ICAT Job Portal. a generic job submission system built on a scientific data catalog. IWSG 2013 ETH, Zurich, Switzerland 3-5 June 2013

Towards a generalised approach for defining, organising and storing metadata from all experiments at the ESRF. by Andy Götz ESRF

University of Bath. Publication date: Document Version Early version, also known as pre-print. Link to publication. Publisher Rights CC BY-SA

The CEDA Archive: Data, Services and Infrastructure

The Photon and Neutron Data Initiative PaN-data

An Evaluation of Metadata Tools and Methods to Assess the Impact and Value of Provenance Management in the PaNData community.

EUDAT B2FIND A Cross-Discipline Metadata Service and Discovery Portal

EUDAT- Towards a Global Collaborative Data Infrastructure

Introduction to Grid Computing

Coupled Computing and Data Analytics to support Science EGI Viewpoint Yannick Legré, EGI.eu Director

PaNdata-ODI Deliverable: D4.3. PaNdata-ODI. Deliverable D4.3. Deployment of cross-facility metadata searching

Computing and Networking at Diamond Light Source. Mark Heron Head of Control Systems

EUDAT Towards a Collaborative Data Infrastructure

Inge Van Nieuwerburgh OpenAIRE NOAD Belgium. Tools&Services. OpenAIRE EUDAT. can be reused under the CC BY license

B2FIND and Metadata Quality

Indiana University Research Technology and the Research Data Alliance

The data explosion. and the need to manage diverse data sources in scientific research. Simon Coles

CORE: Improving access and enabling re-use of open access content using aggregations

OpenAIRE. Fostering the social and technical links that enable Open Science in Europe and beyond

Data Management at NIST

Design patterns for data-driven research acceleration

EUDAT Common data infrastructure

I data set della ricerca ed il progetto EUDAT

The Materials Data Facility

PaBdataODI-8.6 Deliverable: D8.6. PaN-data ODI. Deliverable D8.6

Federated Identity Management for Research Collaborations. Bob Jones IT dept CERN 29 October 2013

The Final Updates. Philippe Rocca-Serra Alejandra Gonzalez-Beltran, Susanna-Assunta Sansone, Oxford e-research Centre, University of Oxford, UK

The iplant Data Commons

EUDAT. Towards a pan-european Collaborative Data Infrastructure

Welcome to the Pure International Conference. Jill Lindmeier HR, Brand and Event Manager Oct 31, 2018

Reproducible & Transparent Computational Science with Galaxy. Jeremy Goecks The Galaxy Team

From Open Data to Data- Intensive Science through CERIF

Making Open Data work for Europe

Metadata for Data Discovery: The NERC Data Catalogue Service. Steve Donegan

Tools for Data Management. Research Data Management : Session 3 9 th June 2015

Building a federated science web one link at a time

OpenAIRE Guidelines Promoting Repositories Interoperability and Supporting Open Access Funder Mandates

Scientific data processing at global scale The LHC Computing Grid. fabio hernandez

Overview. Steve Fisher Please do interrupt with any questions

RADAR. Establishing a generic Research Data Repository: RESEARCH DATA REPOSITORY. Dr. Angelina Kraft

BPMN Processes for machine-actionable DMPs

EUDAT. A European Collaborative Data Infrastructure. Daan Broeder The Language Archive MPI for Psycholinguistics CLARIN, DASISH, EUDAT

B2FIND: EUDAT Metadata Service. Daan Broeder, et al. EUDAT Metadata Task Force

Giovanni Lamanna LAPP - Laboratoire d'annecy-le-vieux de Physique des Particules, Université de Savoie, CNRS/IN2P3, Annecy-le-Vieux, France

Deliverable D4.4. Overview of external datasets, strategy of access methods, and implications on the portal architecture

Data publication and discovery with Globus

EUDAT. Towards a Collaborative Data Infrastructure. Ari Lukkarinen CSC-IT Center for Science, Finland NORDUnet 2012 Oslo, 18 August 2012

Why CERIF? Keith G Jeffery Scientific Coordinator ERCIM Anne Assserson eurocris. Keith G Jeffery SDSVoc Workshop Amsterdam

Data Replication: Automated move and copy of data. PRACE Advanced Training Course on Data Staging and Data Movement Helsinki, September 10 th 2013

Trust and Certification: the case for Trustworthy Digital Repositories. RDA Europe webinar, 14 February 2017 Ingrid Dillo, DANS, The Netherlands

EUDAT Training 2 nd EUDAT Conference, Rome October 28 th Introduction, Vision and Architecture. Giuseppe Fiameni CINECA Rob Baxter EPCC EUDAT members

Long-term preservation for INSPIRE: a metadata framework and geo-portal implementation

Grid technologies, solutions and concepts in the synchrotron Elettra

Storage Virtualization. Eric Yen Academia Sinica Grid Computing Centre (ASGC) Taiwan

Reducing Consumer Uncertainty

(Geo)DCAT-AP Status, Usage, Implementation Guidelines, Extensions

Globus Platform Services for Data Publication. Greg Nawrocki University of Chicago & Argonne National Lab GeoDaRRS August 7, 2018

DOIs for Research Data

Technical documentation. D2.4 KPI Specification

Groovy in Jenkins. Ioannis K. Moutsatsos. Repurposing Jenkins for Life Sciences Data Pipelining

Best Practice Guidelines for the Development and Evaluation of Digital Humanities Projects

Diamond Moonshot Pilot Participation

First Experience with LCG. Board of Sponsors 3 rd April 2009

The EUDAT Collaborative Data Infrastructure

PaN-data ODI Deliverable: D8.3. PaN-data ODI. Deliverable D8.3. Draft: D8.3: Implementation of pnexus and MPI I/O on parallel file systems (Month 21)

Deliverable 6.4. Initial Data Management Plan. RINGO (GA no ) PUBLIC; R. Readiness of ICOS for Necessities of integrated Global Observations

Persistent Identifier the data publishing perspective. Sünje Dallmeier-Tiessen, CERN 1

EGI federated e-infrastructure, a building block for the Open Science Commons

An Approach to Software Preservation

Putting Open Access into Practice

DSpace Fedora. Eprints Greenstone. Handle System

Digital repositories as research infrastructure: a UK perspective

Preparing for High-Luminosity LHC. Bob Jones CERN Bob.Jones <at> cern.ch

Pushing the Limits. ADSM Symposium Sheelagh Treweek September 1999 Oxford University Computing Services 1

Global Data Sharing The Research Data Alliance

Clare Richards, Benjamin Evans, Kate Snow, Chris Allen, Jingbo Wang, Kelsey A Druken, Sean Pringle, Jon Smillie and Matt Nethery. nci.org.

EUDAT Data Services & Tools for Researchers and Communities. Dr. Per Öster Director, Research Infrastructures CSC IT Center for Science Ltd

Data Citation and Scholarship

Chemotion funded by. Göttingen eresearch Toolbox Series - Electronic Note Keeping. Nicole Jung.

Registry Interchange Format: Collections and Services (RIF-CS) explained

Scientific Computing at SLAC. Amber Boehnlein

EUDAT. Towards a pan-european Collaborative Data Infrastructure

Monash High Performance Computing

Dryad Curation Manual, Summer 2009

INSPIRE & Environment Data in the EU

DIGITAL STEWARDSHIP SUPPLEMENTARY INFORMATION FORM


Welcome! Presenters: STFC January 10, 2019

Smart Protection Network. Raimund Genes, CTO

Metadata Ingestion and Processinng

A service-oriented national e-thesis information system and repository

Transcription:

Supporting Data Workflows at STFC Brian Matthews Scientific Computing Department 1

What we do now : Raw Data Management What we want to do : Supporting user workflows What we want to do : sharing and publishing data Coming back to Metadata

What we do now : Raw Data Management

Supporting Facilities Data Management STFC Scientific Computing Department - Support three STFC Funded facilities on the RAL campus - Provide data archiving and management tools ISIS Neutron and Muon Source - Provide tools to support ISIS s data workflows - Support through the science lifecycle - Rich metadata - Provide Data archiving DLS Synchrotron Light Source - Data Archiving - Limited metadata - Managing the scale of the archive Central Laser Facility - Real-time data management and feedback to users - Rich metadata on laser configuration - Access to data

Dec-11 May-12 Oct-12 Mar-13 Aug-13 Jan-14 Jun-14 Nov-14 Apr-15 TiB Archive of - 3.3PB - 846 million files in total in archive (July 2015) (cf 2.2PB, 620m Jan 2015) DLS Archive Architecture Data Acquisition Data Storage Data Access ICAT API Metadata ICAT DB Metadata Catalogue Metadata DB Cache TopCAT Web frontend Downloader (IDS) - Cataloguing 12000 Files per minute Lustre file store StorageD Client DB Cache Data StorageD Data (De-) aggregator, Metadata Ingestor Data Retrieved data 3500 3000 2500 DB Cache CASTOR Storage System DB FUSE Data browser 2000 1500 1000 500 0

Supporting Data Management for STFC Facilities Integrated data management pipelines for data handling From data acquisition to storage A Catalogue of Experimental Data ICAT Tool Suite: Metadata as Middleware Automated metadata capture Integrated with the User Office and data acquisition system Providing access to the user TopCat web front end Integrated into Analysis frameworks Mantid for Neutrons, DAWN for X-Rays

Example ISIS Proposal Secure access to user s data Flexible data searching Scalable and extensible architecture GEM High intensity, high resolution neutron diffractometer H2-(zeolite) vibrational frequencies vs polarising potential of cations B-lactoglobulin protein interfacial structure Integration with analysis tools Access to highperformance resources Proposals Experiment Analysed Data Publication Linking to other scientific outputs Data policy aware Once awarded beamtime at ISIS, an entry will be created in ICAT that describes your proposed experiment. Data collected from your experiment will be indexed by ICAT (with additional experimental conditions) and made available to your experimental team You will have the capability to upload any desired analysed data and associate it with your experiments. Using ICAT you will also be able to associate publications to your experiment and even reference data from your publications. An international collaboration http://icatproject.org

Core Scientific Metadata Model (CSMD) Topic Publication Keyword Authorisation Investigation Investigator The Core Metadata model forms the information model for ICAT. Dataset Sample Sample Parameter Designed to describe facilities based experiments in Structural Science throughout a facility s scientific workflow. Datafile Dataset Parameter Parameter http://purl.org/net/csmd http://icatproject.org/csmd/ Related Datafile Datafile Parameter

ICAT Tool Suite and Clients Metadata as Middleware TopCAT (Web Interface to ICATs) Python ICAT Desktop app Clusters/HPC Disk Tape ICAT Job Portal IDS (ICAT Data Service) AA Plugins ICAT Manager ICAT + DAWN (Eclipse Plugin) ICAT + Mantid (desktop client) Data transfer protocols ICAT APIs

The ICAT Data Server (IDS) ICAT metadata catalogue a SOAP web service interface to metadata IDS provides a RESTful interface to the data files cataloged by ICAT Write IDS Write data Storage AA handled via the ICAT Can plugin to different storage infrastructure Authorize, Write metadata Can use different data transfer protocols (http, gftp, GlobusOnline ) ICAT Separation of concerns: metadata management vs data ingest/access Manage data scaling issues Frazer Barnsley, Steve Fisher, Wojciech Grajewski Antony Wilson

ICAT: An international collaboration In daily production use on the RAL campus: CLF, ISIS, DLS Also internationally: In production: ESRF, ILL,SNS, Pre-production: HZB, ALBA Development: PSI, ELLETRA (FERMI) PaNData Consortium Actively contributing to tool development E.g. python library ICAT steering committee has been established. Andy Götz (ESRF) the chairman http://icatproject.org http://code.google.com/p/icatproject/

What we want to do : Supporting user workflows

Facility Data Lifecycle Metadata Catalogue Proposal Publication Approval Data analysis Scheduling Experiment Data reduction http://www.icatproject.org Traditionally, these steps are decoupled from facilities. However, they are key to derive useful insights.

Data Analysis Challenges Diverse science Varying levels of expertise Help users through the analysis Data getting bigger too big too move High CPU / memory requirements Complex software environments Open data / reuse provenance Automation

Supporting Data Analysis Managing analysis codes for external users Accessing HPC Tracking provenance Modified ICAT to support: Derived data Software, jobs Linking between these Modification to the metadata model

Tools to support analysis processes MANTID ICAT Job Portal ISIS Auto-Reduction All use ICAT to : Access data Record Provenance Steps

In- and Post-experimental support Scan Reconstruct Segment + Quantify 3D mesh + Image based Modelling Predict + Compare Data Catalogue Petabyte Data storage Parallel File system HPC CPU+GPU Visualisation ISIS:IMAT Some mage credit: Avizo, Visualization Sciences Group (VSG) Infrastructure + Software + Expertise! DLS:I12/I13 Erica Yang, Sri Nagella Tomography: Dealing with high data volumes 200Gb/scan, ~5 TB/day (one experiment at DLS) MX: high data volumes, smaller files, but a lot more experiments Hard to move the data needs to be handled at the facility?

Tomography Reconstruction for IMAT In- (ISIS) and post-experiment (ISIS and DLS) data processing. IMAT is a new neutron imaging instrument on ISIS HPC integration with experiments; Using SCARF CPU and GPU clusters A tomographic image reconstruction toolbox With supported algorithms; High throughput image reconstruction framework; With fast 3D visualisation; An integral component of IMAT s in-experiment data analysis capability through Mantid (ISIS) and DAWN (DLS), Maximise the science resulting from Data collected on facility instruments. Towards a service in 2015/2016

PanDaas Data Analysis as a Service Led by ESRF 18 institutes worldwide Data reduction and analysis platform Photon and Neutron analytical facilities Not funded, but a continuing need Looking to continue.

Sharing and Publishing Data

Data Publication

Publish metadata to general purpose harvesters, and search engines which provide search tools across disciplines Being developed by other e- Infrastructure projects Worked with the EUDat project B2Find Data Discovery Service www.eudat.eu Made core metadata available to B2Find OAI-PMH interface Published Data (e.g. with DOIs). Mapping of CSMD metadata to Dublin Core and EUDat metadata requirements. Publishing and Sharing Metadata EUDAT ICAT Field(s) Field dc:identifi - Investigation->doi er dc:title title Investigation->title dc:descrit ption notes dc:relation tags dcterms:re ferences URL Investigation->summary dc:creator author User->fullName - spatial - Instrument->fullName Investigation->name InvestigationParameter->name (multiple) dx.doi.org/ + Investigation->doi dc:contrib utor maintainer Science and Technology Facility Council, ISIS dc:subject discipline Clean energy and the environment, pharmaceuticals and health care, nanotechnology and materials engineering, catalysis and polymers, fundamental studies of materials - PublicationY - ear dcterms:is sued PublicationT imestamp Investigation->releaseDate dcterms:te mporal TemporalCo verage:end Date Investigation->startDate Investigation->endDate

NFFA-EUROPE Nanoscience Foundries and Fine Analysis Research and Innovation actions Integrated, distributed research infrastructure for multidisciplinary research at the nanoscale from synthesis and nanolithography Nano-characterization, theoretical modelling and numerical simulation, coordinated open-access to complementary facilities Information and Data management Repository Platform (IDRP) CNR-IOM, ESRF, STFC, KIT RDA standardisation

Back to metadata

3 Levels of Metadata Discovery General lowdetail metadata search engines and aggregators Dublin Core, CKAN, EUDat, DataCite Dryad, Figshare, Zenodo PIDs and DOIs Domain specific terms Access How data is organised Who it belongs to an how to access What was done to it provenance Can be used in data management processes. CSMD, DCAT, CERIF, PROV-O Usage Sample, instrument, technique details Controlled vocabularies ESRF approach CIF, NeXus

NFFA-Europe: Metadata Management To develop metadata standards for the cataloguing, access and exchange of data and associated information describing nanoscience experiments In support of Information and Data management Repository Platform Underpins the data discovery and sharing services Work within the Research Data Alliance www.rd-alliance.org Organisation for sharing and developing best practise in research data management Working with the existing Materials IG and Photon and Neutron Science IG, Metadata WG - may work through these groups Starting points : EUDat, CSMD, CIF, Nexus COData Framework for Nanostructures

Plug: CoData Data Journal Recently Relaunched dedicated to the advancement of data science and its application in policies, practices and management as Open descriptions of data systems, their implementations and their publication, applications, infrastructures, software, legal, reproducibility and transparency issues, the availability and usability of complex datasets, principles, policies and practices for data. Section Editor for large scale data facilities, data intensive research and data management

Conclusion Management of large amounts of raw data complex Good systematic metadata collection Automation Track what happens to data too Need to extend support across the lifecycle Data analysis and publication Support the whole research object Metadata at different levels, Discovery, Access, Use MetaData as an active part of the computing infrastructure