SIRIUS: SCIENCE-DRIVEN DATA MANAGEMENT FOR MULTI-TIERED STORAGE. June 23, 2016, HPC-IODC Workshop


SCIENCE-DRIVEN DATA MANAGEMENT FOR MULTI-TIERED STORAGE. June 23, 2016, HPC-IODC Workshop. Presented by J. Lofstead, Sandia. S. Klasky, H. Abbasi, Q. Liu, F. Wang (ORNL); J. Lofstead, M. Curry, L. Ward (Sandia); M. Parashar (Rutgers); C. Maltzahn (UCSC); M. Ainsworth (Brown, ORNL).

Where Do We Spend Our Time in Science? Goals: make the process predictable, make the process adaptable, make the process scalable, make the software easy to use. Observation: too much time is spent in managing, moving, storing, retrieving, and turning the science data into knowledge. Sirius-specific goal: refactor data so that we can efficiently store, find, and retrieve data in predictable times set by the users and negotiated with the system.

SIRIUS Story. Exascale apps will generate too much data. Data needs to be prioritized by the users generating and reading it. Data needs to be reduced and re-organized from large volumes into different buckets of importance. Storage systems may not have the capacity or performance to sustain large datasets. The management of data in multi-tier storage systems across the data lifecycle will become more difficult. Data access times need to be described by users and negotiated with the storage system for predictable storage performance. Challenges: How can the middleware and storage system understand refactored data? How to build scalable metadata searching to find refactored data? The interplay between data refactoring, data re-generation, and data access time.

Principles. Principle 1: a knowledge-centric system design that allows user knowledge to define data policies. Today SSIO layers are written in a stove-pipe fashion and quite often do not allow optimizations to take place. Re-design the layers in a highly integrated fashion where users place their intentions into the system and actions take place statically and dynamically to optimize for the system and for individual requests. Principle 2: predictable performance and quality of data in the SSIO layers need to be established so science can be done on exascale systems in a more efficient manner. Without predictable performance, not only can the runs be slowed down because of contention on shared resources, but key science decisions are also affected.

Outline: Motivation, SIRIUS Building Blocks, Data Refactoring, Auditing, Data Description, Metadata Searching, Fuzzy Predictable Performance.

ADIOS: a critical library for DOE apps. Used heavily in many LCF/NERSC applications which we partner with. Allows our team to rapidly prototype new methods and test them for application data lifecycles. Has the ability to describe data utility (XML description). Allows our team to test out hypotheses. Applications: 1. Accelerator: PIConGPU, Warp. 2. Astrophysics: Chimera. 3. Combustion: S3D. 4. CFD: FINE/Turbo, OpenFOAM. 5. Fusion: XGC, GTC, GTC-P, M3D, M3D-C1, M3D-K, Pixie3D. 6. Geoscience: SPECFEM3D_GLOBE, AWP-ODC, RTM. 7. Materials Science: QMCPack. 8. Medical: cancer pathology imaging. 9. Quantum Turbulence: QLG2Q. 10. Visualization: ParaView, VisIt, VTK, OpenCV, VTK-m. http://www.nccs.gov/user-support/center-projects/adios/

DataSpaces: used to explore placement strategies on multi-tier memory/storage hierarchies. The DataSpaces abstraction allows our team to stage data across memory and storage hierarchies with different data placement strategies, building on our learning techniques to optimize data placement for different optimization strategies.

Object Storage. Lightweight File System-inspired philosophy: clients bring/opt in to the services they require (naming, locking, distributed transactions). Peer-to-peer-inspired design: ephemeral, diverse servers and clients; data and location(s) are decoupled. Addressed systems: Sirocco (Sandia, http://www.cs.sandia.gov/scalable_io/sirocco/), full control from ownership, a technology still in development; Ceph (Red Hat), a mature production system that requires special attention to address HPC workloads; DAOS (Intel), a next-generation Lustre with strong commercial and DOE support, still under development.

Outline: Motivation, Building Blocks, Data Refactoring, Auditing, Data Description, Metadata Searching, Fuzzy Predictable Performance.

Data Refactoring. Data needs to be refactored so that more important data can be prioritized, e.g., by being placed on a higher tier of the storage system. Challenges: understanding the cost associated with refactoring data in terms of CPU cycles, extra memory consumed, and communication. Does the refactoring affect the fidelity of the applications? If so, by how much? After data is refactored, how do we map it to the storage hierarchy? How do we enforce policy? How will data be used? E.g., B is a fundamental variable in an MHD code, but many times the user wants the current J = ∇ × B, often looking at the low-frequency modes, or at B itself (see the regeneration sketch below). We need error bounds for our refactored quantities.
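
Where a derived quantity can be recomputed from a stored fundamental field, re-generation can substitute for storage. A minimal Python/NumPy sketch of that idea (the grid spacing, axis convention, and function name are assumptions for illustration, not the SIRIUS implementation):

```python
import numpy as np

def current_from_B(Bx, By, Bz, d=1.0):
    """Regenerate the derived quantity J = curl(B) (up to constants) from the
    stored fundamental field B with central differences instead of storing J.
    Assumes a uniform grid spacing d and axis order (x, y, z)."""
    dBx = np.gradient(Bx, d)   # [dBx/dx, dBx/dy, dBx/dz]
    dBy = np.gradient(By, d)
    dBz = np.gradient(Bz, d)
    Jx = dBz[1] - dBy[2]       # dBz/dy - dBy/dz
    Jy = dBx[2] - dBz[0]       # dBx/dz - dBz/dx
    Jz = dBy[0] - dBx[1]       # dBy/dx - dBx/dy
    return Jx, Jy, Jz

# Toy 3-D field; in practice B would be read back from a (possibly
# reduced-precision) refactored representation, so any error bound on the
# stored B propagates into the regenerated J.
B = [np.random.rand(32, 32, 32) for _ in range(3)]
Jx, Jy, Jz = current_from_B(*B)
```

The trade-off the slide raises is exactly this: the compute cost of re-generation versus the storage and retrieval cost of keeping the derived quantity, subject to the error bounds on the refactored data.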

Data Refactoring and Utility Functions. Need to explore what and where the computation will occur: flexibility in location based on past work; flexibility in which operation to perform on which chunk of data. Code generation or code containers are potential study targets. Maintaining the relationship between data chunks: carry attributes from generation to consumption and feed back into a utility computation.

Data Utility. Describes how long a data chunk will live at a level of the storage hierarchy. Utility is a broad description: spatial or temporal utility of data, utility based on in-data features, utility based on statistical features. Utility has a large component from the user and the use case; experimental design factors in here. Solving a specific scientific problem implies a specific data utility function. An API ingests user preferences and combines them with historical provenance. Dynamic utility supports online analysis/visualization use cases. E.g., the utility of Mesh may be defined more explicitly as (priority=1, (time-NVRAM=8 hours, time-PFS=3 days, time-CAMPAIGN=100 days, time-TAPE=1000 days)), (priority=2, (time-NVRAM=1 hour, time-PFS=4 days, time-CAMPAIGN=100 days, time-TAPE=300 days)).
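
The per-priority residency example above can be captured as a small policy structure; the Python sketch below (tier names and the residency helper are hypothetical, not a SIRIUS API) shows one way preferences of that shape might be ingested:

```python
from datetime import timedelta

# Residency times per tier for the "Mesh" chunk, keyed by priority,
# mirroring the example in the slide (values copied from the slide).
mesh_utility = {
    1: {"NVRAM": timedelta(hours=8), "PFS": timedelta(days=3),
        "CAMPAIGN": timedelta(days=100), "TAPE": timedelta(days=1000)},
    2: {"NVRAM": timedelta(hours=1), "PFS": timedelta(days=4),
        "CAMPAIGN": timedelta(days=100), "TAPE": timedelta(days=300)},
}

def residency(utility, priority, tier):
    """How long a chunk with the given priority should stay on a tier."""
    return utility[priority][tier]

print(residency(mesh_utility, 1, "NVRAM"))   # 8:00:00
```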

Utility-Driven Data Placement. Goal: determine the placement of data objects vertically across different levels of the memory hierarchy, e.g., SSD or DRAM, and horizontally across different staging nodes. Utility quantifies the relative value of data objects based on anticipated data read patterns. Utility is based on data access patterns (monitored and learned at runtime) and on the location of the application and staging nodes within the system network topology. For example, data objects with higher data utility are placed closer to the computing nodes accessing them. References: Exploring Data Staging Across Deep Memory Hierarchies for Coupled Data-Intensive Simulation Workflows. T. Jin, F. Zhang, Q. Sun, H. Bui, M. Romanus, N. Podhorszki, S. Klasky, H. Kolla, J. Chen, R. Hager, C. Chang, M. Parashar. IEEE IPDPS'15, May 2015. Adaptive Data Placement for Staging-Based Coupled Scientific Workflows. Q. Sun, T. Jin, M. Romanus, H. Bui, F. Zhang, H. Yu, H. Kolla, S. Klasky, J. Chen, M. Parashar. ACM/IEEE SC'15, Nov. 2015.
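
As a rough illustration of the vertical placement idea (tier capacities, object sizes, and utility scores are invented for the example; this is not the DataSpaces policy itself), a greedy pass that fills the fastest tiers with the highest-utility objects first might look like:

```python
# Tiers ordered fastest-first, with capacities in GB (illustrative numbers).
tiers = [("DRAM", 64), ("NVRAM", 512), ("PFS", 10_000)]
# (name, size in GB, utility score) per data object, also illustrative.
objects = [("mesh", 300, 0.9), ("particles", 800, 0.7), ("checkpoint", 400, 0.2)]

placement = {}
free = {name: cap for name, cap in tiers}
for name, size, utility in sorted(objects, key=lambda o: -o[2]):
    for tier, _ in tiers:              # try the fastest tier that still fits
        if free[tier] >= size:
            placement[name] = tier
            free[tier] -= size
            break

print(placement)  # {'mesh': 'NVRAM', 'particles': 'PFS', 'checkpoint': 'PFS'}
```

A runtime version would re-score utility from monitored access patterns and also spread objects horizontally across staging nodes by their network distance to the consumers, as in the cited papers.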

Performance. Initial tests use 3/5-byte splits for doubles. XGC particle data: wrote 819,200,000 particles using 5 nodes (160 processes) on Sith @ ORNL.
Write with no compression: 10.3 s.
Time to split each double from that set of particles into 3 significant and 5 less-significant bytes: 4.1 s.
Time to write out the 3-byte pieces: 3.4 s.
Time to write out the 5-byte pieces: 9.3 s.
Separate read times on a laptop. Errors: L2 norm: 72028.2; Linf: 0.00109242. Total read time: 10.3131 s, decompress time: 27.9715 s. Whole-data read time: 34.1505 s, decompress time: 0 s.
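
The 3/5-byte split can be reproduced in a few lines of NumPy. This is a simplified sketch of the idea (big-endian byte order is assumed so the first bytes are the most significant); it is not the code used for the XGC runs:

```python
import numpy as np

def byte_split(values, n_hi=3):
    """Split each 8-byte double into its n_hi most-significant bytes
    (sign, exponent, leading mantissa bits) and the remaining
    least-significant bytes, so the two pieces can be written separately."""
    raw = np.asarray(values, dtype='>f8').view(np.uint8).reshape(-1, 8)
    return raw[:, :n_hi].copy(), raw[:, n_hi:].copy()

def byte_join(hi, lo=None):
    """Reassemble doubles; if the low bytes are not read back, pad with
    zeros and accept a bounded truncation error."""
    if lo is None:
        lo = np.zeros((hi.shape[0], 8 - hi.shape[1]), dtype=np.uint8)
    return np.hstack([hi, lo]).view('>f8').ravel()

data = np.random.rand(1_000_000)
hi, lo = byte_split(data)        # roughly 3/8 and 5/8 of the original volume
approx = byte_join(hi)           # read back only the significant piece
print(np.max(np.abs(data - approx)))  # small, bounded reconstruction error
```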

Result sets: a variety of reading options.
Table 1: Reading options for variable A
#   Size               Time     Time err   Refactoring          Data err
1   (1/4)^3 (3/8) A    10 s     ±3 s       stride, byte-split   99%
2   (3/4)^3 (3/8) A    90 s     ±30 s      stride, byte-split   58%
3   (1/4)^3 (5/8) A    16 s     ±5 s       stride, byte-split   0.01%
4   (3/4)^3 (5/8) A    120 s    ±50 s      stride, byte-split   0.01%
5   (1/4)^3 A          1200 s   ±30 s      stride               98%
6   (3/4)^3 A          2400 s   ±90 s      stride               58%
7   (3/8) A            1350 s   ±120 s     byte-split           5%
8   (5/8) A            2250 s   ±120 s     byte-split           0.01%
9   A                  36 s     ±6 s       wavelet              1%
10  A                  3600 s   ±600 s     none                 0%

Outline: Motivation, Building Blocks, Data Refactoring, Auditing, Data Description, Metadata Searching, Fuzzy Predictable Performance.

New Techniques for Data-Intensive Science. AUDITOR: creating a reduced model to approximate the solution in a limited spatial/temporal dimension. The basic idea is that we need a model to generate a close approximation to the data. Basic quantities in information theory: for a data stream S and for x ∈ S, let Pr(X = x) = p_x ∈ [0, 1]. Shannon information content: h(x) = -log2 p_x. Entropy: H(S) = -Σ p_x log2 p_x. Noisy/random data has HIGH ENTROPY.
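
A quick way to see the entropy argument in practice (a plain Python/NumPy sketch; the quantization and data are made up for illustration):

```python
import numpy as np
from collections import Counter

def entropy_bits(stream):
    """Empirical Shannon entropy H(S) = -sum(p_x * log2 p_x) over the
    distinct values x observed in the stream."""
    counts = Counter(stream)
    n = len(stream)
    p = np.array([c / n for c in counts.values()])
    return float(-(p * np.log2(p)).sum())

# Smooth, quantized simulation-like output has far lower entropy (and is far
# easier to approximate with a reduced model) than random noise.
smooth = np.round(np.sin(np.linspace(0.0, 10.0, 10_000)), 2).tolist()
noise = np.random.rand(10_000).tolist()
print(entropy_bits(smooth), entropy_bits(noise))   # low vs. ~13.3 bits
```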

Current practice today: linear interpolation. Users want to write data every m-th timestep, but because of the storage and I/O requirements they are forced to write less. When users reconstruct their data u(t) at the n-th timestep, they need to interpolate between the neighboring written timesteps. Φ_M(u) = interpolant on the coarser grid (stride M), reducing storage to 1/M of the original. Assume C is a constant depending on the complexity of the data; then u - Φ_M(u) = O(M^2 Δt^2) for 2nd-order interpolation. Cost = cost to store Φ_M + cost to store the mantissa of u - Φ_M(u). Original storage cost = 32N bits (floats). New storage cost = 32N/M bits + {23 + log2(C M^2 Δt^2)} N bits. Ratio = 1/M + (1/16) log2 M + (1/16) log2 Δt + constant.
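
Filling in the step between the two storage costs and the quoted ratio (a hedged reconstruction, assuming single-precision floats with a 23-bit mantissa and the 2nd-order interpolation error above):

```latex
\begin{aligned}
\text{original cost} &= 32N \text{ bits},\\
\text{new cost} &= \underbrace{\tfrac{32N}{M}}_{\text{coarse snapshots } \Phi_M(u)}
  + \underbrace{\bigl(23 + \log_2(C\,M^2\,\Delta t^2)\bigr)N}_{\text{mantissa of } u-\Phi_M(u)} \text{ bits},\\
\text{ratio} &= \frac{1}{M} + \frac{23+\log_2 C}{32} + \frac{\log_2 M}{16} + \frac{\log_2 \Delta t}{16}
  = \frac{1}{M} + \frac{\log_2 M}{16} + \frac{\log_2 \Delta t}{16} + \text{const}.
\end{aligned}
```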

Compression with an Interpolation Auditor. Linear interpolation (LA) is the auditor. If we look at 10 MB of output with a stride of 5: total output = 50 MB for 5 steps; 10 MB if we output 1 step; 43 MB with typical lossless compression; 18 MB using linear auditing, but still lossless. Investigating adaptive techniques.
1 step (MB)   Lossless compr. (MB)   Linear audit (MB)   50 steps, typical compr. (MB)   50 steps, LA (MB)
10            43                     18                  430                             180
10            85                     25                  850                             125
10            170                    40                  1700                            100
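
A toy 1-D version of a linear-interpolation auditor (Python/NumPy; the stride and tolerance are illustrative, and the variable names are invented). The table's numbers correspond to storing the residuals losslessly; the quantized variant sketched here trades a user-set error bound for further reduction:

```python
import numpy as np

def audit_encode(u, M=5, tol=1e-3):
    """Keep every M-th timestep exactly, predict the steps in between by
    linear interpolation, and store only the quantized residuals, which are
    small and low-entropy and therefore compress well."""
    t = np.arange(len(u))
    keyframes = u[::M]
    predicted = np.interp(t, t[::M], keyframes)
    residual_q = np.round((u - predicted) / tol).astype(np.int64)
    return keyframes, residual_q

def audit_decode(keyframes, residual_q, M=5, tol=1e-3):
    t = np.arange(len(residual_q))
    predicted = np.interp(t, t[::M], keyframes)
    return predicted + residual_q * tol

u = np.cumsum(np.random.randn(50))      # stand-in time series (50 steps)
kf, rq = audit_encode(u)
u_hat = audit_decode(kf, rq)
assert np.max(np.abs(u - u_hat)) <= 5e-4 + 1e-12   # error bounded by tol/2
```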

Other Types of Auditors. The key to better auditors is to understand what you are simulating/observing: use a reduced model, use less resolution, use a linear equation for short spatial/temporal regions. Other ways to refactor: precision-based re-organization; frequency-based re-organization (wavelets); more knowledgeable auditors. Weigh the cost of data re-generation vs. data storage/retrieval. Storage becomes more than a stream of bytes: data + code + workflow.

Outline: Motivation, Building Blocks, Data Refactoring, Auditing, Data Description, Metadata Searching, Fuzzy Predictable Performance.

Data Descriptions. Now that data is refactored, how do we describe/annotate data so that it can be discovered? Capture application knowledge and communicate it to middleware/storage: data utility, relationships between datasets, semantics. Specify user requirements: QoS (bandwidth, latency); policy, e.g., where and how long my data should stay on a storage layer.

Outline: Motivation, Building Blocks, Data Refactoring, Auditing, Data Description, Metadata Searching, Fuzzy Predictable Performance. gflofst@sandia.gov

Challenges in Metadata Searching. Storage devices rather than a file system: no built-in metadata operations as part of I/O. Distributed pieces EVERYWHERE: resilience copies; storage devices come and go; performance characteristics can vary considerably. Pick a variety of storage targets based on data importance. Different data compression on different data pieces. Try to guarantee performance: need to consider decompression/regeneration time if multiple versions exist. Enhance placement decisions based on predicted future use, which is based on tracking previous use (and that use needs to be tracked somehow).

Outline: Motivation, Building Blocks, Data Refactoring, Auditing, Data Description, Metadata Searching, Fuzzy Predictable Performance.

The current-day end-to-end path is complex.

Autonomic Runtime Optimization. Autonomic Objective (AO): a requirement/objective/goal defined by the user, e.g., minimize data movement, optimize throughput. Autonomic Policy (AP): a rule that defines how the objectives should be achieved, i.e., which mechanisms should be used, e.g., use a specific data placement adaptation to minimize data movement. Autonomic Mechanism (AM): an action that can be used to achieve an AO, e.g., use topology-aware and access-pattern-driven data placement to minimize data movement. The flow: autonomic objectives (AO) feed a Decide step that selects autonomic policies (AP), which specify/utilize autonomic mechanisms (AM) to achieve the objective.
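
One minimal way to read the AO/AP/AM split as code (a Python sketch with invented names and a trivial mechanism, purely illustrative):

```python
# AO -> AP: the policy table maps a user objective to the mechanism to use.
policies = {"minimize_data_movement": "topology_aware_placement"}

# AP -> AM: mechanisms are the concrete actions the runtime can take.
mechanisms = {
    "topology_aware_placement":
        lambda nodes: min(nodes, key=lambda n: n["hops_to_consumer"]),
}

def decide_and_act(objective, staging_nodes):
    mechanism = policies[objective]              # Decide: pick the policy
    return mechanisms[mechanism](staging_nodes)  # Achieve: run the mechanism

nodes = [{"name": "stage-0", "hops_to_consumer": 3},
         {"name": "stage-1", "hops_to_consumer": 1}]
print(decide_and_act("minimize_data_movement", nodes)["name"])   # stage-1
```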

Challenges. Scalable admission control: we need a scalable control plane (leveraging existing work on causal metadata propagation and online sampling for latency/error trade-offs). Resource-specific schedulers to make the performance of each resource predictable; example: read/write separation at flash devices. Online sampling to provide quick latency/resolution trade-offs without significantly interfering with the ongoing workload.

Summary. Sirius is attempting to redefine I/O based on key findings: the POSIX-compliant block interface does not give the system enough information to fully optimize; data is too big to keep in one place, and current systems purge data without user intervention; variability is too large, and users are not in control of their data. Scientific data is not random data: there is content to the data. Auditing calculations that prioritize and reduce data sizes while keeping the information are critical to reducing the time to understanding.