Challenges in Storage

Similar documents
DSIT WP1 WP2. Federated AAI and Federated Storage Report for the Autumn 2014 All Hands Meeting

Data Storage. Paul Millar dcache

dcache integration into HDF

Data Challenges in Photon Science. Manuela Kuhn GridKa School 2016 Karlsruhe, 29th August 2016

Challenges of Big Data Movement in support of the ESA Copernicus program and global research collaborations

A scalable storage element and its usage in HEP

Scientific data management

Distributing storage of LHC data - in the nordic countries

dcache, activities Patrick Fuhrmann 14 April 2010 Wuppertal, DE 4. dcache Workshop dcache.org

Bootstrapping a (New?) LHC Data Transfer Ecosystem

Data Access and Data Management

Storage Management in INDIGO

Future trends in distributed infrastructures the Nordic Tier-1 example

Introduction to SRM. Riccardo Zappi 1

Experiment Control Upgrades at DESY

Giovanni Lamanna LAPP - Laboratoire d'annecy-le-vieux de Physique des Particules, Université de Savoie, CNRS/IN2P3, Annecy-le-Vieux, France

From raw data to new fundamental particles: The data management lifecycle at the Large Hadron Collider

1 / 23. CS 137: File Systems. General Filesystem Design

Computing / The DESY Grid Center

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili

CSD3 The Cambridge Service for Data Driven Discovery. A New National HPC Service for Data Intensive science

dcache: challenges and opportunities when growing into new communities Paul Millar on behalf of the dcache team

Operating the Distributed NDGF Tier-1

NFS around the world Tigran Mkrtchyan for dcache Team dcache User Workshop, Umeå, Sweden

Scientific data processing at global scale The LHC Computing Grid. fabio hernandez

Coming Changes in Storage Technology. Be Ready or Be Left Behind

Preparing for High-Luminosity LHC. Bob Jones CERN Bob.Jones <at> cern.ch

ELIXIR Compute platform

Matthias Wobben working in Berlin, Germany. Senior Sales Engineer at Nextcloud

[537] Flash. Tyler Harter

Storage Virtualization. Eric Yen Academia Sinica Grid Computing Centre (ASGC) Taiwan

High-speed data ingest and analysis for photon science at DESY with IBM Spectrum Scale.

Streaming Log Analytics with Kafka

Insight: that s for NSA Decision making: that s for Google, Facebook. so they find the best way to push out adds and products

Visita delegazione ditte italiane

Was ist dran an einer spezialisierten Data Warehousing platform?

Memory Hierarchy. Mehran Rezaei

INDIGO-DataCloud Architectural Overview

WELCOME TO THE JUNGLE!

DataFinder A Scientific Data Management Solution ABSTRACT

Introduction to Programming and Computing for Scientists

CMPSC 311- Introduction to Systems Programming Module: Caching

IT Challenges and Initiatives in Scientific Research

Daniel J. Bernstein University of Illinois at Chicago & Technische Universiteit Eindhoven

DESY site report. HEPiX Spring 2016 at DESY. Yves Kemp, Peter van der Reest. Zeuthen,

The Internet of Things. Steven M. Bellovin November 24,

MongoDB Schema Design for. David Murphy MongoDB Practice Manager - Percona

Software-defined Storage: Fast, Safe and Efficient

TECHNICAL OVERVIEW OF NEW AND IMPROVED FEATURES OF EMC ISILON ONEFS 7.1.1

PRESENTATION TITLE GOES HERE

Introduction to High Performance Parallel I/O

dcache NFSv4.1 Tigran Mkrtchyan Zeuthen, dcache NFSv4.1 Tigran Mkrtchyan 4/13/12 Page 1

Cluster-Level Google How we use Colossus to improve storage efficiency

White Paper. How the Meltdown and Spectre bugs work and what you can do to prevent a performance plummet. Contents

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft. Presented by Manfred Alef Contributions of Jos van Wezel, Andreas Heiss

CSE 333 Lecture 9 - storage

Worldwide Production Distributed Data Management at the LHC. Brian Bockelman MSST 2010, 4 May 2010

CMPSC 311- Introduction to Systems Programming Module: Caching

The Grid. Processing the Data from the World s Largest Scientific Machine II Brazilian LHC Computing Workshop

Deep Storage for Exponential Data. Nathan Thompson CEO, Spectra Logic

Data preservation for the HERA experiments at DESY using dcache technology

CERN Tape Archive (CTA) :

For those of you who may not have heard of the BHL let me give you some background. The Biodiversity Heritage Library (BHL) is a consortium of

Deep Learning Performance and Cost Evaluation

Deduplication Storage System

High Performance Computing Course Notes Grid Computing I

Outline. ASP 2012 Grid School

Memory and Storage-Side Processing

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX

Distributed Data Management on the Grid. Mario Lassnig

Parallel Storage Systems for Large-Scale Machines

Academic Workflow for Research Repositories Using irods and Object Storage

I data set della ricerca ed il progetto EUDAT

Grid Architectural Models

Analytics Platform for ATLAS Computing Services

BlueDBM: An Appliance for Big Data Analytics*

ECE 571 Advanced Microprocessor-Based Design Lecture 10

A distributed tier-1. International Conference on Computing in High Energy and Nuclear Physics (CHEP 07) IOP Publishing. c 2008 IOP Publishing Ltd 1

Andrea Sciabà CERN, Switzerland

Promoting Open Standards for Digital Repository. case study examples and challenges

Register Allocation. Lecture 16

Speeding Up Cloud/Server Applications Using Flash Memory

Drawing Fast The Graphics Pipeline

Software-Defined Data Infrastructure Essentials

Federated Data Storage System Prototype based on dcache

New strategies of the LHC experiments to meet the computing requirements of the HL-LHC era

CMPSC 311- Introduction to Systems Programming Module: Systems Programming

Securing Grid Data Transfer Services with Active Network Portals

Security in Big and Open Data - WG. Alessandra Scicchitano GÉANT - Project Development Officer Ralph Niederberger Jülich Supercomputing Center (FZJ)

SCA19 APRP. Update Andrew Howard - Co-Chair APAN APRP Working Group. nci.org.au

Transferring Deep Learning into a. Recommender System

Overview. About CERN 2 / 11

CS 3733 Operating Systems:

Nextcloud 13: How to Get Started and Why You Should

The Global Grid and the Local Analysis

PaNSIG (PaNData), and the interactions between SB-IG and PaNSIG

The Materials Data Facility

DIGGING INTO THE SOFTWARE DEFINED DATA CENTER

Distributed File Systems Part IV. Hierarchical Mass Storage Systems

Patrick Fuhrmann (DESY)

Transcription:

Challenges in Storage or : A random walk Presented by Patrick Fuhrmann with contributions from many experts.

Contributions and thoughts provided by Oxana Smirnova, NeIC, Lund Markus Schulz, CERN IT Steven Newhouse, EBI (EMBL) Eygene Ryabinkin, KI Material from: Martin Gasthuber at al., DESY for XFEL Paul Alexander, SKA Daniele Cecini, CNAF, INFN Hermann Hessling, HTW Berlin Ian Bird, WLCG 2

My simplified idea of data life cycle Reconstruction and Analysis Reduction Algorithms Visualization Manipulation <Meta data> Long Term Archive e.g. the Location of a nuclear waste facility. 3

What is the order of magnitude we are talking about? Let s pick two arbitrary big data experiments. 4

The Square Kilometer Array SKA I, 2018 SKA II 2030 Stolen from the SKA Homepage 5

The Square Kilometer Array 2020 2030 5.000 100.000 Number in PBytes/Day Stolen from the SKA P. Alexander DSP 2020 2030 1-10 3.000-10.000 6

The European XFEL 4 M Pixel Detector : 30 MBytes / sec up to 11 Beamlines Expected 100-500 PBytes/year 7

The Data Storage rates Now-ish Around 2030 2 Exa Byte / year 1 Exa Byte / year WLCG XFEL SKA I Genome of all humans on earth WLCG HL SKA II 8

Data Injection Factor 1000 DSP Experiment Specific : Here the key-phrase is Smart Algorithm In some fields you should be very careful rejecting events. It s a difficult issue but not really our problem. 9

Next up Get data back for processing (analysis) 10

Current Data Access Structure Super Bottleneck: Process power is expensive, don t let your CPU/core wait. SSD Cache Core Cache Core Core Distributed Storage Files System Cache Core Core CPU Level x Cache 11

Things to look into Predictive engine for smart data placement (caching) Allow platform layer (experiment framework) to steer data location before processing. Use deep learning to predict access pattern. Use vendor provided API s for SSD Move data from spinning devices to SSD (Flash), as you application knows when data is needed. Consider circumvent file system cache. I you know what you are doing. Consider HADOOOP/Sparc approach 12

Things to look into (continued) Simplify access layer (avoid name space lookups) Option : Object stores. Skips name space lookups, client calculates the location of the data itself. Experiment frameworks store IDs anyway. Application Software Improve algorithms Teach best practice programming Learn to parallel programming Port old applications properly. 13

Sensitive Application: Visualization Latency become a major issue. After milliseconds delay, viewer gets nervous or runs against a wall. Jülich Aachen Research Alliance, JARA 3D cave Oculus Rift The GPU has to get data from disk fast enough to keep objects moving smoothly (Like here the details of the Human Brain) 14

Accessing data by the platform layer Authentication Portals Delegation Platform Layer (PaaS) Tape Storage Open Stack 15

The delegation Macaroon Storage Pool M M Portal m Reducing Authz only WRITE, only from <IP addr>, only for 10 minutes. m m m Group Service m m m 16

Orchestrating Storage Resources or The art of Quality of Service in storage. High Speed Data Ingest Fast Analysis NFS 4.1/pNFS Wide Area Transfers Visualization (Globus Online, FTS) & Sharing by GridFTP by WebDAV, OwnCloud

Unfortunately there is no standard protocol resp. API (any more) defining changes in Quality of Service in storage. It would be needed. 18

Coming back to your little friend. 19

Scientists need to share data.. chown chmod acl set Share with /afs/cern.ch/home Solutions available OwnCloud, NextCloud, Seafile, PowerFolder, *** Agreeing on cross platform sharing In progress Difficult for Scientific Systems First attempts by EOS and dcache. 20

Long Term Data Preservation Bit file preservation Easiest Case Still has issues : Cold data gets silently corrupted. Encrypting your data General encryption mechanisms only last for some year. Companies are offering long term security with distributed systems to reduce the possibility of data getting stolen. Who should hold the master key. Best would be one time pad. Difficult to handle the keys. Content preservation. No general solution found yet. DPHEP initiative tries to coordinate some efforts in HEP 21

Long Term Data Preservation (cont.) Legal issue (boring) Ownership question after PI leaves. When should data be made public? How verifies agreed rules? 22

Ending with a sentence Markus Schulz sent as a first reaction on my request for input to this presentation : There is hardy any field in Storage and Data Management which isn t a challenge. Enjoy coffee. 23