Scibox: Online Sharing of Scientific Data via the Cloud


Scibox: Online Sharing of Scientific Data via the Cloud
Jian Huang, Xuechen Zhang, Greg Eisenhauer, Karsten Schwan, Matthew Wolf (CERCS Research Center, Georgia Tech), Stephane Ethier (Princeton Plasma Physics Laboratory), Scott Klasky (Oak Ridge National Laboratory)
Supported in part by funding from the US Department of Energy for DOE SDAV SciDAC

Outline
- Background and Motivation
- Problems and Challenges
- Design and Implementation
- Evaluation
- Conclusion and Future Work

Cloud Storage is Popular
- Ease of use
- Pay-as-you-go model
- Universal accessibility
- Good scalability and durability
- Services built on cloud storage: Dropbox, Google Drive, iCloud, SkyDrive, etc.
- Scibox: focus on scientific data sharing

Use Cases for Cloud Storage
- Combustion experimental data from the Aero cluster, shared through a private/public cloud
- GTS/LAMMPS simulation output from the Vogue cluster
- Image processing on a student PC
- Visualization consumers at Georgia Tech (Atlanta), WSU (Detroit), and OSU (Columbus)
Why cloud storage fits: (1) ease of use, (2) universal accessibility, (3) good scalability

Outline
- Background and Motivation
- Problems and Challenges
- Design and Implementation
- Evaluation
- Conclusion and Future Work

Cloud Storage is Not Ready for HEC
- Scientific applications are data-intensive: they generate large amounts of data
- Network bandwidth is limited: inadequate ingress and egress bandwidth to/from remote cloud stores
- High costs imposed by cloud providers: the pay-as-you-go model becomes expensive for large data volumes
An example: a GTS run on 29K cores of the Jaguar machine at OLCF generates over 54 terabytes of data in a 24-hour period. With Amazon S3 at ~$0.03/GB for storage and $0.09/GB for data transfer out, that is $6635.52/day, and the cost increases with the number of collaborators.
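The quoted daily figure can be checked directly from the prices on the slide; it evidently applies both per-GB rates to one day's 54 TB of output (with 1 TB = 1024 GB):

54\,\text{TB} \approx 55{,}296\,\text{GB}, \qquad (\$0.03 + \$0.09)/\text{GB} \times 55{,}296\,\text{GB} = \$6635.52 \text{ per day}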

Problem: Too Much Data Movement
Issue: the naive approach transfers lots of data from producer to cloud to consumer, even if only some of it is needed.
- Example: each time step of the GTS fusion modeling simulation outputs checkpoint data, diagnosis data, visualization data, etc., organized as variables (A, B, C, ...) whose data subsets each contain many elements (A0 ... An, B0 ... Bn, C0 ... Cn)
- Scientific data formats (BP/HDF5) are structured and metadata-rich
- A standard I/O interface (ADIOS) makes this almost transparent to users
Goal: reduce data transfer from producers to consumers.

Solutions for Minimizing Data Transfers for Data Sharing
- Compression: helps, but compression ratios can be low for floating-point scientific data
- Data selection: users can ask for subsets of data files by specifying file offsets, but this requires knowledge of the data layout in files
- Content-based data indexing: useful, but may require large amounts of metadata
Scibox approaches:
- Filter unnecessary data at the producer side via metadata (uploads)
- Merge overlapping subsets when multiple users share the same data (uploads)
- Minimize data-sharing cost in cloud storage via a new software protocol (downloads)

Challenge: How to Filter Data
Recall: scientific data is structured and metadata-rich. Each time step holds variables (A, B, C, ...) with elements 0..n, and each variable carries its dimensions and attributes, organized across time steps.
Analytics users know what can, and needs to, be filtered.
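To make the point concrete, here is a minimal sketch of the kind of per-variable metadata a producer-side filter can consult without touching the data payload. The names and layout are illustrative assumptions, not Scibox's actual data structures:

/* Illustrative only -- not Scibox's actual types. */
#include <stdint.h>
#include <string.h>

typedef struct {
    const char *name;      /* e.g. "phi_2darray"                 */
    const char *type;      /* e.g. "double"                      */
    int         ndims;     /* number of dimensions               */
    uint64_t    dims[4];   /* extent of each dimension           */
    int         time_step; /* output step this block belongs to  */
    uint64_t    offset;    /* byte offset of the payload         */
    uint64_t    nbytes;    /* payload size in bytes              */
} var_metadata;

/* A filter can decide from metadata alone whether a block is needed. */
static int wanted(const var_metadata *m, const char *var, int step) {
    return strcmp(m->name, var) == 0 && m->time_step == step;
}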

Outline
- Background and Motivation
- Problems and Challenges
- Design and Implementation
- Evaluation
- Conclusion and Future Work

Scibox Design
- D(ata)R(eduction)-functions for data filtering: reduce cloud upload/download volumes; permit end users to identify, via filters, the exact data needed for each specific analytics activity
- Merge shared data objects before uploading: promote data sharing in the cloud to reduce data redundancy on the storage server and the data volume transferred across the network
- Utilize ADIOS metadata-rich I/O methods: a new ADIOS I/O transport writes output to the cloud that can be directly read by subsequent, potentially remote, data analytics or visualization codes; transparent on both the producer and consumer sides
- Partial object access for private cloud storage: patch the OpenStack Swift object store

Scibox software stack: HPC applications call the ADIOS API, beneath which transports are pluggable (POSIX-IO, MPI-IO, pnetcdf, HDF-5, DART, LIVE/DataTap, visualization engines, other plug-ins, and the new Cloud-IO transport). The Cloud-IO transport performs DR-function run-time execution and authentication, and talks to Swift object storage: a proxy node plus store nodes, each store node running account, container, and object servers.

Scibox Architecture: a Scibox client with Cloud-IO (Write) at the data producer, cloud storage (Amazon S3 / OpenStack Swift), and Scibox clients with Cloud-IO (Read) at the data consumers in a user group; control flow and data flow connect the three.

Sample XML File

<?xml version="1.0"?>
<adios-config host-language="fortran">
  <adios-group name="restart" coordination-communicator="comm">
    <var name="mype" type="integer"/>
    <var name="numberpe" type="integer"/>
    <var name="istep" type="integer"/>
    <var name="mimax_var" type="integer"/>
    <var name="nx" type="integer"/>
    <var name="ny" type="integer"/>
    <var name="zion0_1darray" gwrite="zion0" type="double" dimensions="mimax_var"/>
    <var name="phi_2darray" gwrite="phi" type="double" dimensions="nx,ny"/>
    <!-- for reader.xml -->
    <rd type="8" name="phi_2darray" cod="
      int i;
      double sum = 0.0;
      for (i = 0; i < input.count; i = i + 1) {
        sum = sum + input.vals[i];
      }
      return sum;
    "/>
  </adios-group>
  <method group="restart" method="CLOUD-IO"/>
  <buffer size-mb="20" allocate-time="now"/>
</adios-config>
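For context, the producer side stays unchanged because the transport is chosen in the XML. A minimal sketch using the generic ADIOS 1.x C API (the talk's applications are Fortran codes; "scibox.xml" and the function name here are illustrative assumptions, not Scibox-specific code):

/* Generic ADIOS 1.x write path; the CLOUD-IO transport is picked in the XML. */
#include <mpi.h>
#include <adios.h>

void write_restart(MPI_Comm comm, int mype, int numberpe, int istep,
                   int nx, int ny, double *phi) {
    int64_t fd;
    adios_init("scibox.xml", comm);      /* parse config; normally once per run */
    adios_open(&fd, "restart", "restart.bp", "w", comm);
    adios_write(fd, "mype", &mype);
    adios_write(fd, "numberpe", &numberpe);
    adios_write(fd, "istep", &istep);
    adios_write(fd, "nx", &nx);
    adios_write(fd, "ny", &ny);
    adios_write(fd, "phi_2darray", phi); /* nx x ny doubles                     */
    adios_close(fd);                     /* transport ships the data            */
    adios_finalize(mype);
}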

Data Reduction (DR) Functions
- Definition: a function defined by end users that transforms/reduces data to prepare it for cloud storage
- Creation: programmed by end users, generated from higher-level descriptions, or derived from users' I/O access patterns
- Current implementation: customized CoD (C on Demand) functions require producer-side computational resources; via the DR-function library, the same DR-function specified by multiple clients is executed only once, and its output data is reused for multiple consumers

DR-functions Provided by Scibox

Type  Description                                            Example
DR1   Max(variable)                                          Max(var_double_2Darray)
DR2   Min(variable)                                          Min(var_double_2Darray)
DR3   Mean(variable)                                         Mean(var_double_2Darray)
DR4   Range(variable, dimension, start_pos, end_pos)         Range(var_int_1Darray, 1, 100, 1000)
DR5   Select(variable, threshold1, threshold2)               Select var.value where var.value in (threshold1, threshold2)
DR6   Select(variable, DR_Function1, DR_Function2)           Select var.value where var.value > Mean(var)
DR7   Select(variable1, variable2, threshold1, threshold2)   Select var2.value where var1.value in (threshold1, threshold2)
DR8   Self-defined function, e.g.:

double proc(cod_exec_context ec, input_type *input, int k, int m) {
    int i;
    double sum = 0.0, average = 0.0;
    for (i = 0; i < m; i++) sum += input.tmpbuf[i + k*m];
    average = sum / m;
    return average;
}

C on Demand (CoD): the consumer supplies the function as a string; the producer (1) registers it, then (2) compiles and executes it on demand.
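DR6 is the interesting case: its threshold is itself a DR-function, so the producer scans the data twice (once for Mean(var), once for the selection), which is why DR6 costs roughly double DR3 in the evaluation later. A self-contained sketch, assuming a flat double array; this is illustrative C, not the CoD source Scibox actually compiles:

/* DR6 sketch: select values above the mean; two passes over the input. */
#include <stdlib.h>

size_t dr6_select_above_mean(const double *vals, size_t n, double **out) {
    if (n == 0) { *out = NULL; return 0; }
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += vals[i];  /* pass 1: Mean(var) */
    double mean = sum / (double)n;

    double *sel = malloc(n * sizeof *sel);
    if (!sel) { *out = NULL; return 0; }
    size_t k = 0;
    for (size_t i = 0; i < n; i++)                  /* pass 2: filter    */
        if (vals[i] > mean) sel[k++] = vals[i];
    *out = sel;                                     /* caller frees      */
    return k;
}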

Data Object Management
The Scibox super file is organized into three containers:
- Group metadata container: group metadata, per-user metadata (User 0 ... User N), Writer.xml, and Reader0.xml ... ReaderN.xml
- User metadata container: User0 ... UserN metadata objects
- Data container: Object 0 ... Object N
This layout enables object sharing.

Determination of Object Size
- Merging overlapping data sets reduces upload data size
- Partial object access: current Amazon S3 and OpenStack Swift stores do not support it, so users must download a whole object even if only a small portion of its data is needed
Two approaches used in Scibox:
- Private cloud: modify the software to enable partial object access; object size is determined by predicting upload throughput
- Public cloud: limit object size considering storage pricing; object size is determined by comparing the cost with and without sharing
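When several consumers request overlapping ranges of the same variable, the union can be uploaded once. A standard interval-merge sketch of that idea; illustrative only, since Scibox's actual merging operates on its own metadata rather than this toy struct:

/* Merge overlapping/adjacent [start, end) ranges so shared data is uploaded once. */
#include <stdlib.h>
#include <stdint.h>

typedef struct { uint64_t start, end; } range;

static int by_start(const void *a, const void *b) {
    const range *x = a, *y = b;
    return (x->start > y->start) - (x->start < y->start);
}

/* Sorts in place; returns how many merged ranges remain at the front. */
size_t merge_ranges(range *r, size_t n) {
    if (n == 0) return 0;
    qsort(r, n, sizeof *r, by_start);
    size_t m = 0;
    for (size_t i = 1; i < n; i++) {
        if (r[i].start <= r[m].end) {          /* overlaps or touches */
            if (r[i].end > r[m].end) r[m].end = r[i].end;
        } else {
            r[++m] = r[i];
        }
    }
    return m + 1;
}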

Cost Model
Definitions:
- α: $/GB of standard cloud storage
- β: $/GB of data transfer into the cloud
- γ: $/GB of data transfer out of the cloud
Assumption: n clients request Data_1, Data_2, ..., Data_n respectively. The model compares the cost without sharing against the cost with sharing, where the n objects can be merged into one (reconstructed below).
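The formulas themselves did not survive transcription. A reconstruction consistent with the definitions above (an assumption, not necessarily the paper's exact notation), writing $|D_i|$ for the size of $Data_i$ and $|D_\cup|$ for the size of the merged object:

C_{\text{no-share}} = \sum_{i=1}^{n} (\alpha + \beta + \gamma)\,|D_i|

C_{\text{share}} = (\alpha + \beta)\,|D_\cup| + \gamma \sum_{i=1}^{n} |D_i|, \qquad |D_\cup| \le \sum_{i=1}^{n} |D_i|

Each consumer still downloads its own subset, so the γ term is unchanged; sharing wins whenever the merged object is smaller than the sum of the individual ones, saving on both upload and storage.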

Outline
- Background and Motivation
- Problems and Challenges
- Design and Implementation
- Evaluation
- Conclusion and Future Work

Experimental Setup
- Producers: an aerospace application on the Aero cluster, GTS on Vogue, and image processing on the Jedi cluster, all sharing data through OpenStack Swift
- Consumers: visualization at Georgia Tech (Atlanta), WSU (Detroit), and OSU (Columbus)

Workloads
Synthetic workloads:
- 10 variables shared by multiple consumers; 1,000 requests generated by each consumer
- 8 types of DR-functions, uniformly distributed
- 1 data producer serving 1,000 requests x the number of client servers
- 3 self-defined DR-functions: FFT, histogram diagnosis, and the average of row values of a matrix
Real workloads:
- GTS workload: 128 parallel processes; consumers in 3 different US states
- Combustion workload: 10,000 512x512 12-bit double-framed images (~1.5 MB per image)

Latency Breakdown of Swift Object Store
[Figure: stacked latency breakdowns for Put and Get as object size grows from 1 MB to 128 MB. Components: headers transfer, data transfer & server processing, disk read/write & checksum, authorization & container checking, and software overhead & others. The share of latency spent in data transfer & server processing climbs from roughly 17-35% at 1 MB to roughly 83-93% at 128 MB; total request latencies range from ~0.4 s to ~14.6 s.]
Takeaway: data transfer dominates the request latency.

Execution Time of DR-functions
[Figure: execution time of DR1-DR3, of DR4 at result sizes from 1 KB to 128 MB, and of DR5-DR7, for variable sizes from 64 MB to 1.5 GB.]
Recall: DR6 selects var.value where var.value > Mean(var), requiring a double scan of the input data. DR7 selects var2.value where var1.value in (r1, r2), and var1 is small.

Scibox System Scalability
[Figure: left, execution time of the stock system vs. Scibox as the number of producers grows from 1 to 16; right, latency of the stock system vs. Scibox as the number of consumers in the same sharing group grows from 2 to 64. Scibox is consistently and substantially faster in both cases.]
With Scibox, data is merged before upload, and partial object access is supported.

GTS Workload
[Figure: end-to-end latency broken into GTS computation, filter computation, download, and post-computation, comparing Scibox against FTP-based sharing at Georgia Tech, OSU, and WSU. Link bandwidths to Georgia Tech: WSU 900 KB/s, OSU 4.4 MB/s, local GT 44 MB/s.]

Combustion Workload
[Figure: performance speedup with 1, 2, 4, 8, 16, and 32 producer processes as the number of clients in the same sharing group grows from 1 to 16.]
DR-function: (ImageName, DR8, FFT); 10,000 images (~1.5 MB each) shared via Scibox.

Outline
- Background and Motivation
- Problems and Challenges
- Design and Implementation
- Evaluation
- Conclusion and Future Work

Conclusions and Future Work
- Scibox: cloud-based support for scientific data sharing that can operate across both public and private cloud stores
- Data Reduction functions exploit the structured nature of scientific data to reduce the amount of transferred data
- Transparent to applications and data producers: implemented in ADIOS, and applicable to other I/O libraries as well
Future work:
- Deploy Scibox at national lab facilities to better understand potential use cases
- Additional optimizations of cloud storage for scientific data management

Thanks!