The CEDA Archive: Data, Services and Infrastructure

Similar documents
Linking datasets with user commentary, annotations and publications: the CHARMe project

Metadata for Data Discovery: The NERC Data Catalogue Service. Steve Donegan

ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P.

Long-term preservation for INSPIRE: a metadata framework and geo-portal implementation

Storing and manipulating environmental big data with JASMIN

Terabit Networking with JASMIN

EUDAT & SeaDataCloud

Data Issues for next generation HPC

THE ISSUE. TITLE: The East Midlands Business link. Author: Andrew Groom

ExArch, Edinburgh, March 2014

Bruce Wright, John Ward, Malcolm Field, Met Office, United Kingdom

Clare Richards, Benjamin Evans, Kate Snow, Chris Allen, Jingbo Wang, Kelsey A Druken, Sean Pringle, Jon Smillie and Matt Nethery. nci.org.

UK Space Agency Update for CEOS.

Copernicus Space Component. Technical Collaborative Arrangement. between ESA. and. Enterprise Estonia

C3S Data Portal: Setting the scene

The Copernicus Sentinel-3 Mission: Getting Ready for Operations

Western U.S. TEMPO Early Adopters

The Logical Data Store

Sodankylä National Satellite Data Centre. Jyri Heilimo Head of Satellite based services R&D

Simplify EO data exploitation for information-based services

A Climate Monitoring architecture for space-based observations

Future Core Ground Segment Scenarios

The Future of ESGF. in the context of ENES Strategy

CERA: Database System and Data Model

Metadata Models for Experimental Science Data Management

The EOC Geoservice: Standardized Access to Earth Observation Data Sets and Value Added Products ABSTRACT

CMIP5 Datenmanagement erste Erfahrungen

Overview of the GMES Space Component & DMC as a GMES Contributing Mission

JASMIN Overview. UKMO Visit 24/11/2014. Matt Pritchard

JASMIN Petascale storage and terabit networking for environmental science

Centre for Environmental Data Analysis (CEDA) Annual Report 2016 (April-2015 to March-2016)

CCI Visualisation Corner Presentations of ESA s Climate Change Initiative

Analysis Ready Data For Land (CARD4L-ST)

Italy - Information Day: 2012 FP7 Space WP and 5th Call. Peter Breger Space Research and Development Unit

Processing and analysis of Earth Observation data

HPC IN EUROPE. Organisation of public HPC resources

The CEDA Web Processing Service for rapid deployment of earth system data services

NEODC Service Level Agreement (SLA): NDG Discovery Client Service

Technical documentation. D2.4 KPI Specification

EUDAT- Towards a Global Collaborative Data Infrastructure


Data Management Components for a Research Data Archive

EO Ground Segment Evolution Reflections by

Earth Science Community view on Digital Repositories

Greece s Collaborative Ground Segment Initiatives

Managing Research Data for Diverse Scientific Experiments

Inge Van Nieuwerburgh OpenAIRE NOAD Belgium. Tools&Services. OpenAIRE EUDAT. can be reused under the CC BY license

Copernicus Data and Information Access Services(DIAS)

Data Replication: Automated move and copy of data. PRACE Advanced Training Course on Data Staging and Data Movement Helsinki, September 10 th 2013

Uniform Resource Locator Wide Area Network World Climate Research Programme Coupled Model Intercomparison

Data Management Plans. Sarah Jones Digital Curation Centre, Glasgow

NATIONAL OCEANIC AND ATMOSPHERIC ADMINISTRATION S SCIENTIFIC DATA STEWARDSHIP PROGRAM

Building a Global Data Federation for Climate Change Science The Earth System Grid (ESG) and International Partners

The VITO Earth Observation LTDA Facility

: This document describes the verification of SST-CCI products and prototype system elements.

Scientific data processing at global scale The LHC Computing Grid. fabio hernandez

BIG DATA CHALLENGES A NOAA PERSPECTIVE

Giovanni Lamanna LAPP - Laboratoire d'annecy-le-vieux de Physique des Particules, Université de Savoie, CNRS/IN2P3, Annecy-le-Vieux, France

First Review of COP 21 and Potential impacts on Space Agencies

aerosol_cci Phase 2 System Specification Document issue 3 ESA Climate Change Initiative aerosol_cci Deliverable D3.1c

Needs for satellite validation of GreenHouse (related) Gases, CO 2, CH 4 and CO

Modeling groups and Data Center Requirements. Session s Keynote. Sébastien Denvil, CNRS, Institut Pierre Simon Laplace (IPSL)

Technical documentation. SIOS Data Management Plan

Scientific Data Curation and the Grid

Next GEOSS der neue europäische GEOSS Hub

Introduction to Grid Computing

Show me the data. The pilot UK Research Data Registry. 26 February 2014

Copernicus Climate Change Service

Pascal Gilles H-EOP-GT. Meeting ESA-FFG-Austrian Actors ESRIN, 24 th May 2016

Connecting the e-infrastructure chain

Space Park Vision Develop a global space hub & collaborative community Supporting UK growth

Data Serving Climate Simulation Science at the NASA Center for Climate Simulation (NCCS)

The Copernicus Data and Exploitation Platform Deutschland (CODE-DE)

NERC EARTH OBSERVATION DATA CENTRE (NEODC) ANNUAL REPORT 2003/04

The Computation and Data Needs of Canadian Astronomy

HPC SERVICE PROVISION FOR THE UK

Ocean Colour Vicarious Calibration Community requirements for future infrastructures

InfraStructure for the European Network for Earth System modelling. From «IS-ENES» to IS-ENES2

RAL IASI MetOp-A TIR Methane Dataset User Guide. Reference : RAL_IASI_TIR_CH4_PUG Version : 1.0 Page Date : 17 Aug /12.

European Cloud Initiative: implementation status. Augusto BURGUEÑO ARJONA European Commission DG CNECT Unit C1: e-infrastructure and Science Cloud

The C3S Climate Data Store and its upcoming use by CAMS

Sea Level CCI project Phase II. System Engineering Status

The Changing Role of Data Stewardship in Creating Trustworthy, Transdisciplinary High Performance Data Platforms for the Future

EVOlution of EO Online Data Access Services (EVO-ODAS) ESA GSTP-6 Project by DLR, EOX and GeoSolutions (2015/ /04)

Some Reflections on Advanced Geocomputations and the Data Deluge

Introduction to PRECIS

High Performance Computing Data Management. Philippe Trautmann BDM High Performance Computing Global Research

Addressing Geospatial Big Data Management and Distribution Challenges ERDAS APOLLO & ECW

Data Stewardship NOAA s Programs for Archive, Access, and Producing Climate Data Records

WHAT IS IRIDIUM PRIME?

CEH Environmental Information Data Centre support to NERCfunded. Jonathan Newman EIDC

GMES The EU Earth Observation Programme

SST Retrieval Methods in the ESA Climate Change Initiative

Big Data Earth Observation Standardization elements Codrina Ilie TERRASIGNA TF7/SG5

GSCB-Workshop on Ground Segment Evolution Strategy

2017 Resource Allocations Competition Results

The Scottish Spatial Data Infrastructure (SSDI)

Benefits of CORDA platform features

Copernicus Global Land Operations Cryosphere and Water

NEXTGEN NOW: The Big Build. Bill Murphy, Managing Director NGA, BT

Transcription:

The CEDA Archive: Data, Services and Infrastructure Kevin Marsh Centre for Environmental Data Archival (CEDA) www.ceda.ac.uk with thanks to V. Bennett, P. Kershaw, S. Donegan and the rest of the CEDA Team

What is CEDA? The Centre for Environmental Data Archival Includes BADC Serves the environmental science community through 4 data centres and involvement in a host of projects www.ceda.ac.uk

CEDA home page: http://www.ceda.ac.uk/

About CEDA Based at RAL, Oxford. One of several NERC Data Centres in the UK. Supports national/international projects funded by NERC, EU, UK Government, etc. and helps develop international standards (e.g. OGC) Promotes good data management practice. Help develop data policy for NERC, Data Management Plans. Provides dedicated user helpdesks/community support/ file format advice CF-Standard names for variables, etc. Supports data users and providers >200 datasets, >100 Million files, ~2Pb of data. Many datasets are available free for academic use. NERC s view is that data is an asset and needs to be properly managed Just finding data can be a challenge.!

NERC Data Catalogue Service (DCS) CEDA supports the NERC DCS to allow users to easily find and access the data they need across all of the Data Centres ARSF in this case.

NERC Data Catalogue Service (DCS) NERC DCS search directs users to ARSF dataset at CEDA.

NERC Data Catalogue Service (DCS) and to the CEDA tools to allow them to find exactly what they need On-line data coverage viewer at CEDA

CEDA Services Traditionally have provided http, and ftp access to data, as well as data delivery on physical media. Also have developed on-line interfaces for data subsetting and visualisation. The CEDA Web Processing Service (WPS) is the latest of these. Based on OGC standards OpenDAP also implemented for some datasets. Extracted data

Centre for Environmental Data Archival CEDA Data (Sept 2013) Project Type Current archive volume (Tb) NEODC Earth Observation 400 (was 300 in 2012) BADC Atmospheric Science 500 (was 350 in 2012) CMIP5 Climate Model 1000 (was 350 in 2012) Total 1900 Tb = 1.9 Pb Data themselves come from a wide range of sources: Observations, Satellites, aircraft, numerical models. -> data deluge Volume of data has doubled in 12 months. -> Petascale Total storage available is 5 Pb and all has been allocated! Storage will increase to 11Pb over the next 2 years.

Centre for Environmental Data Archival CEDA Users ~3000 active users Most are from: Universities (71%) Government (18%) Other Europe Unknown UK

Centre for Environmental Data Archival CEDA Users Earth Observation Geography Engineering Atmospheric Physics Earth Science Diverse range of CEDA user disciplines

BADC Download Statistics A ~small dataset is most popular! A ~small dataset is most popular!

NEODC Download Statistics

JASMIN/CEMS Government funding for infrastructure has enabled us to set up the JASMIN/CEMS system at CEDA JASMIN= Joint Analysis System Meeting Infrastructure Needs CEMS= Centre for Environmental Monitoring from Space JASMIN/CEMS are not just about supporting the traditional archival services of CEDA though they are intended to additionally provide support for high performance analysis of high volume data by the greater NERC scientific community. Bring the users to the data

JASMIN/CEMS

CEMS what is it? A joint academic-industrial facility for climate and environmental data services Will provide: Step change in EO/climate data storage, processing and analysis A scalable model for developing services and applications : hosted in a cloud-based infrastructure Data quality and integrity tools Information on data accuracy and provenance To give users confidence in the data, services and products

JASMIN/CEMS functions CEDA data storage & services Curated data archive Archive management services Archive access services (HTTP, FTP, Helpdesk,...) Data intensive scientific computing Global / regional datasets & models High spatial, temporal resolution Private cloud Flexible access to high-volume & complex data for climate & earth observation communities Online workspaces Services for sharing & collaboration

JASMIN/CEMS 4.6 Petabytes of fast disk with excellent connectivity - to 11 Pb in next 2 years A compute platform for running Virtual Machines - to 3000 cores in next 2 years A small HPC compute cluster (known as LOTUS ) Connected to : Some CEMS infrastructure for commercial applications JASMIN nodes at remote sites Dedicated network connections to specific sites

JASMIN/CEMS All of the CEDA archive has now been moved onto JASMIN/CEMS storage GWS (Group Work Spaces) and Virtual Machines are available to allow groups of users to do processing of data in the CEDA archive locally on the JASMIN system. Advantages : No need for users to download large volumes of data to their local system. No need for users to have access to their own processing capability locally. In the era of Big Data, this is increasingly important. Essential to provide such services to allow such large and complex datasets to be fully exploited.

CEMS Opendap Data discovery and data download: find and access CEDA data through the CEMS interface

JASMIN/CEMS Latest Status Group Workspaces provide additional capability which many projects are already finding invaluable: Multi-terabyte online workspace High-performance, parallel-io storage can be provisioned rapidly in response to user needs GWSs are created as volumes in the same storage infrastructure as archives Access control -GWS manager enable secure data sharing within the group. Data transfer-a shared data transfer server Data analysis- A shared data analysis environment is available Dedicated virtual machines for data analysis are available on request, Elastic Tape near-line storage

JASMIN/CEMS Latest Status As of the time of writing (August 2013), there are: 25 NCAS/JASMIN Group Workspaces, total GWS allocation of 1.4 Petabytes between 1 and 27 users per workspace JASMIN-CEMS Phase 2 underway to expand the infrastructure funded by the NERC Big Data initiative.

JASMIN/CEMS Science Use cases Example academic CEMS projects are: Trial processing of AATSR for Climate (ARC) Sea Surface Temperature Data (University of Edinburgh, University of Reading) Land Surface Temperature processing from AATSR data (University of Leicester) Processing/hosting GlobAlbedo data products (MSSL, UCL) ATSR1+2 Full Mission Reprocessing (RAL) Cloud Essential Climate Variables processing (RAL) All report more efficient and faster data processing on CEMS.

JASMIN/CEMS Science Use case (1) Processing large volume EO datasets to produce: Essential Climate Variables Long term global climate-quality datasets e.g ESA CCI

JASMIN/CEMS Science Use Case (2) Cloud ECV reprocessing Performed by RAL Space Remote Sensing Group (RSG) 15 years worth of (A)ATSR data Co-location with JASMIN / CEMS enables combination of STFC s compute cluster with JASMIN / CEMS fast IO storage. On the previous system it took ~2-3 weeks per year of data to process the AATSR data to cloud products. On the JASMIN/CEMS system 3 days per year of data. Every 3 days cycle means reprocessing is much more feasible. Use of this system set-up is a game changer for RSG.

JASMIN/CEMS Science Use Case (3) LST plot for the UK [John Remedios and Darren Ghent, University of Leicester]. AATSR reprocessing: ATSR1 and ATSR2 data reprocessed using improved algorithms into ENVISAT AATSR format data and hence the aatsr_multimission archive at CEDA. Better algorithms means better cloud clearing and flagging, better ground calibrated and more complete coverage. CEDA has copies of AATSR v2.1 reprocessing done at ESA alongside the ATSR1 and 2 v2.1 data reprocessed on JASMIN Much easier having fast direct access to data and processing power!

JASMIN/CEMS Science Use case (4) User access to 5 th Coupled Model Intercomparison Project (CMIP5) Large volumes of data from best climate models Greater throughput required Critical for analysis of data for next IPCC Report Large model data analysis facility Workspaces for scientific users. Climate modellers need 100s of Tb of disk space, with high-speed connectivity Large cache on fast disk available for post-processing results

JASMIN/CEMS Science Use case (5) Near Real Time Ash and SO2 Hazard Alert Volcanic Gas Detection and Monitoring Oxford university project funded by NCAS starting Jan 2014. CEDA will provide NRT EUMETSAT and ECMWF data to enable early detection and monitoring of volcanic gas emissions. Processing software will run on the JASMIN system but be fully managed by the Oxford group.

Digital Object Identifiers CEDA is actively supporting the development and wider use of DOI s for datasets. DOI s mean that the dataset is CITABLE Users get the credit for the data they produce Online Data Journals now available to describe datasets

Commentary Metadata- CHARMe CHARacterization of Metadata to enable high-quality climate applications and services How can climate data users decide whether a dataset is fit for their purpose? 2 Year EU project to develop the CHARMe system http://www.charme.org.uk

CHARMe metadata Post-fact annotations, e.g. citations, ad-hoc comments and notes; Results of assessments, e.g. validation campaigns, intercomparisons with models or other observations, reanalysis; Provenance, e.g. dependencies on other datasets, processing algorithms and chain, data source; Properties of data distribution, e.g. data policy and licensing, timeliness (is the data delivered in real time?), reliability; External events that may affect the data, e.g. volcanic eruptions, El- Nino index, satellite or instrument failure, operational changes to the orbit calculations. General rule: information originates from users or external entities, not original data providers However, sometimes information is not available from the data provider!

CEDA Summary Archive contains a diverse range of datasets CEDA supports a wide range of data users and data providers CEDA provides Data Management support and advice to data users, providers and policy makers CEDA have implemented scalable systems to deal with era of Big Data CEDA are involved with numerous projects

Thank you! www.ceda.ac.uk