Scibox: Online Sharing of Scientific Data via the Cloud. Jian Huang, Xuechen Zhang, Greg Eisenhauer, Karsten Schwan, Matthew Wolf (CERCS Research Center, Georgia Tech); Stephane Ethier (Princeton Plasma Physics Laboratory); Scott Klasky (Oak Ridge National Laboratory). Supported in part by funding from the US Department of Energy for DOE SDAV SciDAC.
Outline: Background and Motivation; Problems and Challenges; Design and Implementation; Evaluation; Conclusion and Future Work
Cloud Storage is Popular: ease of use, a pay-as-you-go model, universal accessibility, and good scalability and durability. Many products build on cloud storage: Dropbox, Google Drive, iCloud, SkyDrive, etc. Scibox focuses on scientific data sharing.
Use Cases for Cloud Storage (diagram): combustion experimental data from the Aero cluster, GTS/LAMMPS simulation output from Vogue, and image processing results from a student PC are shared through a private or public cloud for visualization at Georgia Tech (Atlanta), WSU (Detroit), and OSU (Columbus). Benefits: 1. ease of use; 2. universal accessibility; 3. good scalability.
Outline Background and Motivation Problems and Challenges Design and Implementation Evaluation Conclusion and Future Work 5
Cloud Storage is Not Ready for HEC. Scientific applications are data-intensive and generate large amounts of data. Networking bandwidth is limited: inadequate levels of ingress and egress bandwidth are available to/from remote cloud stores. High costs are imposed by cloud providers: the pay-as-you-go model is expensive for large amounts of data. An example: a GTS run on 29K cores of the Jaguar machine at OLCF generates over 54 terabytes of data in a 24-hour period. At Amazon S3 rates of ~$0.03/GB for storage and $0.09/GB for data transfer out, that is $6635.52/day, and the cost increases with the number of collaborators.
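The daily figure follows directly from the quoted rates; a short sketch reproducing the slide's arithmetic (rates applied per day of shared data, as the slide does):

```c
/* Reproduce the slide's cost estimate: a GTS run generating 54 TB/day,
   at Amazon S3's quoted rates of ~$0.03/GB (storage) and $0.09/GB
   (data transfer out of the cloud). */
double daily_cost_usd(double data_tb, double storage_per_gb,
                      double transfer_out_per_gb) {
    double gb = data_tb * 1024.0;                       /* TB -> GB */
    return gb * (storage_per_gb + transfer_out_per_gb); /* $/day */
}
/* daily_cost_usd(54.0, 0.03, 0.09) gives $6635.52/day, matching the slide */
```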
Problem: Too Much Data Movement. Issue: the naïve approach transfers lots of data from producer to cloud to consumer, even if only some of it is needed. Example: output of the GTS fusion modeling simulation includes checkpoint data, diagnosis data, visualization data, etc., and each data subset includes many elements (variables A, B, C with elements 0..n per time step). Scientific data formats such as BP/HDF5 are structured and metadata-rich, and the standard ADIOS I/O interface is almost transparent to users. Goal: reduce data transfer from producers to consumers.
Solutions for Minimizing Data Transfers for Data Sharing. Compression helps, but the compression ratio can be low for floating-point scientific data. Data selection: users can ask for subsets of data files by specifying file offsets, but this requires knowledge of the data layout in files. Content-based data indexing is useful, but may require large amounts of metadata. Scibox approaches: (1) filter unnecessary data at the producer side via metadata (uploads); (2) merge overlapping subsets when multiple users share the same data (uploads); (3) minimize data-sharing cost in cloud storage via a new software protocol (downloads).
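The second approach, merging overlapping subsets requested by multiple users, amounts to classic interval merging over the requested ranges. A minimal sketch; the flat half-open ranges here are an illustrative assumption, since Scibox's real merging operates on structured variables rather than byte offsets:

```c
#include <stdlib.h>

/* Hypothetical sketch of merging overlapping requested subsets before
   upload. Each request is modeled as a half-open range [start, end). */
typedef struct { long start, end; } range_t;

static int cmp_range(const void *a, const void *b) {
    long d = ((const range_t *)a)->start - ((const range_t *)b)->start;
    return (d > 0) - (d < 0);
}

/* Merge in place; returns the number of merged ranges. */
int merge_ranges(range_t *r, int n) {
    if (n == 0) return 0;
    qsort(r, (size_t)n, sizeof *r, cmp_range);
    int m = 0;                        /* index of the last merged range */
    for (int i = 1; i < n; i++) {
        if (r[i].start <= r[m].end) { /* overlaps or touches: extend */
            if (r[i].end > r[m].end) r[m].end = r[i].end;
        } else {
            r[++m] = r[i];            /* disjoint: start a new range */
        }
    }
    return m + 1;
}
```

Uploading the merged ranges instead of each client's request avoids sending the overlapping bytes more than once.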
Challenge: How to Filter Data. Recall that scientific data is structured and metadata-rich: each time step holds variables (with dimensions and attributes) whose elements span A_0..A_n, B_0..B_n, C_0..C_n. Analytics users know what can, and needs to, be filtered.
Outline: Background and Motivation; Problems and Challenges; Design and Implementation; Evaluation; Conclusion and Future Work
Scibox Design. D(ata)R(eduction)-Functions for data filtering: reduce cloud upload/download volumes by permitting end users to identify, via filters, the exact data needed for each specific analytics activity. Merge shared data objects before uploading: promote data sharing in the cloud to reduce data redundancy on the storage server and the data volumes transferred across the network. Utilize ADIOS metadata-rich I/O methods: a new ADIOS I/O transport writes output to the cloud that can be directly read by subsequent, potentially remote, data analytics or visualization codes, transparently on both the producer and consumer sides. Partial object access for private cloud storage: patch the OpenStack Swift object store.
Scibox software stack (diagram): HPC applications use the ADIOS API, whose plug-in transports (POSIX-IO, MPI-IO, pnetcdf, HDF-5, DART, LIVE/DataTap, viz engines, and others) are extended with a new Cloud-IO transport hosting DR-function run-time execution on the HPC platform. On the storage side, Swift object storage consists of a proxy node, an auth node, and store nodes, each running account, container, and object servers.
Scibox Architecture (diagram): within a user group, the data producer runs a Scibox client with the Cloud-IO (write) transport, cloud storage is Amazon S3 or OpenStack Swift, and data consumers run Scibox clients with the Cloud-IO (read) transport; control flow and data flow are shown separately.
Sample XML File:

  <?xml version="1.0"?>
  <adios-config host-language="fortran">
    <adios-group name="restart" coordination-communicator="comm">
      <var name="mype" type="integer"/>
      <var name="numberpe" type="integer"/>
      <var name="istep" type="integer"/>
      <var name="mimax_var" type="integer"/>
      <var name="nx" type="integer"/>
      <var name="ny" type="integer"/>
      <var name="zion0_1darray" gwrite="zion0" type="double" dimensions="mimax_var"/>
      <var name="phi_2darray" gwrite="phi" type="double" dimensions="nx,ny"/>
      <!-- for reader.xml -->
      <rd type="8" name="phi_2darray"
          cod="int i; double sum = 0.0;
               for (i = 0; i < input.count; i = i + 1) { sum = sum + input.vals[i]; }
               return sum;"/>
    </adios-group>
    <method group="restart" method="CLOUD-IO"/>
    <buffer size-mb="20" allocate-time="now"/>
  </adios-config>
Data Reduction (DR) Functions. Definition: a function defined by end users that transforms/reduces data to prepare it for cloud storage. Creation: programmed by end users, generated from higher-level descriptions, or derived from users' I/O access patterns. Current implementation: customized CoD (C on Demand) functions require producer-side computational resources; with the DR-function library, the same DR-function specified by multiple clients is executed only once, and its output data is reused for multiple consumers.
DR-functions Provided by SciBox:
  DR1: Max(variable), e.g. Max(var_double_2Darray)
  DR2: Min(variable), e.g. Min(var_double_2Darray)
  DR3: Mean(variable), e.g. Mean(var_double_2Darray)
  DR4: Range(variable, dimension, start_pos, end_pos), e.g. Range(var_int_1Darray, 1, 100, 1000)
  DR5: Select(variable, threshold1, threshold2): select var.value where var.value in (threshold1, threshold2)
  DR6: Select(variable, DR_Function1, DR_Function2), e.g. select var.value where var.value > Mean(var)
  DR7: Select(variable1, variable2, threshold1, threshold2): select var2.value where var1.value in (threshold1, threshold2)
  DR8: self-defined function, e.g.
      double proc(cod_exec_context ec, input_type *input, int k, int m) {
          int i;
          double sum = 0.0;
          for (i = 0; i < m; i++) sum += input->tmpbuf[i + k * m];
          return sum / m;
      }
C on Demand (CoD): the consumer submits a function as a string; the producer (1) registers it, then (2) compiles and executes it on demand.
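Stripped of the CoD execution-context plumbing (cod_exec_context, input_type), the DR8 example reduces one row of a row-major m-column matrix to its average; a standalone sketch of the same computation:

```c
/* Average of row k of an m-column, row-major matrix in buf.
   Mirrors the slide's self-defined DR8 function, minus the CoD plumbing. */
double row_mean(const double *buf, int k, int m) {
    double sum = 0.0;
    for (int i = 0; i < m; i++)
        sum += buf[i + k * m];  /* element (k, i) in row-major order */
    return sum / m;
}
```

A consumer would ship this body as a string; the producer compiles it on demand and runs it against the variable before upload, so only the per-row averages cross the network.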
Data Object Management (diagram): a Scibox super file comprises group metadata, per-user metadata (users 0..N), and data. In the object store these map to a group metadata container (writer.xml plus reader0.xml..readerN.xml), a user metadata container (user0..userN metadata objects), and a data container (objects 0..N). This layout enables object sharing.
Determination of Object Size. Merging overlapping data sets reduces the upload data size, but raises the question of partial object access: current Amazon S3 and OpenStack Swift stores do not support it, so users have to download a whole object even if only a small portion of its data is needed. Scibox uses two approaches. Private cloud: modify the software to enable partial object access; object size is determined by predicting upload throughput. Public cloud: limit object size considering storage pricing; object size is determined by comparing the cost with and without sharing.
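One way the public-cloud comparison could look, using the cost-model parameters α, β, γ defined on the next slide. The whole-object-download term follows from the lack of partial object access; the exact formulas are not in the extracted slides, so this is an illustrative model, not Scibox's actual code:

```c
/* Hypothetical sketch: decide whether to merge n requested subsets into
   one shared object. alpha/beta/gamma are $/GB for storage, transfer in,
   and transfer out; sizes_gb[i] is client i's subset, merged_gb the size
   of the union of all subsets. */
double cost_without_sharing(const double *sizes_gb, int n,
                            double alpha, double beta, double gamma) {
    double c = 0.0;
    for (int i = 0; i < n; i++)      /* each copy is stored, uploaded, */
        c += (alpha + beta + gamma) * sizes_gb[i]; /* and downloaded */
    return c;
}

double cost_with_sharing(int n, double merged_gb,
                         double alpha, double beta, double gamma) {
    /* store and upload the merged object once; without partial object
       access, each of the n consumers downloads the whole merged object */
    return (alpha + beta) * merged_gb + n * gamma * merged_gb;
}

int should_merge(const double *sizes_gb, int n, double merged_gb,
                 double alpha, double beta, double gamma) {
    return cost_with_sharing(n, merged_gb, alpha, beta, gamma)
         < cost_without_sharing(sizes_gb, n, alpha, beta, gamma);
}
```

Under this model, merging pays off when the subsets overlap heavily (the union is small) and loses when they are nearly disjoint, which is exactly why the public-cloud path bounds object size by this comparison.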
Cost Model. Definitions: α is the $/GB price of standard cloud storage, β the $/GB price of data transfer into the cloud, and γ the $/GB price of data transfer out of the cloud. Assumption: n clients request Data_1, Data_2, ..., Data_n respectively. The model compares the cost without sharing against the cost with sharing, when the n objects can be merged into one.
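The formulas themselves were rendered as images and are missing from the extracted text; a reconstruction consistent with the stated parameters and the surrounding slides (an inference, not the paper's verified model):

```latex
% Without sharing: each of the n subsets D_i is stored, uploaded,
% and downloaded separately.
C_{\mathrm{no\ share}} = \sum_{i=1}^{n} (\alpha + \beta + \gamma)\,|D_i|

% With sharing: the merged object D_M = \bigcup_{i=1}^{n} D_i is stored
% and uploaded once; without partial object access, each of the n
% consumers downloads the whole merged object.
C_{\mathrm{share}} = (\alpha + \beta)\,|D_M| + n\,\gamma\,|D_M|
```

With the private-cloud Swift patch enabling partial object access, the download term drops back to γ Σ|D_i|, so sharing is never worse there.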
Outline: Background and Motivation; Problems and Challenges; Design and Implementation; Evaluation; Conclusion and Future Work
Experimental Setup (diagram): the aerospace application runs on the Aero cluster, GTS on Vogue, and image processing on the Jedi cluster, all writing to OpenStack Swift; visualization consumers are at Georgia Tech (Atlanta), WSU (Detroit), and OSU (Columbus).
Workloads. Synthetic: 10 variables shared by multiple consumers; 1,000 requests generated by each consumer; 8 types of DR-functions, uniformly distributed; 1 data producer serving 1,000 requests x the number of client servers; 3 self-defined DR-functions (FFT, histogram diagnosis, and the average of row values of a matrix). Real: the GTS workload (128 parallel processes, with consumers in 3 different US states) and the combustion workload (10,000 512x512 12-bit double-framed images, ~1.5 MB per image).
Latency Breakdown of Swift Object Store (charts): Get and Put latency for object sizes from 1 MB to 128 MB, broken down into data transfer & server processing, disk read (Get) or disk write & checksum (Put), header transfer, authorization & container checking, and software overhead. The data-transfer share grows with object size, reaching 83.20% of a 9.36 s Get and 93.28% of a 14.64 s Put at 128 MB. Takeaway: data transfer dominates the request latency.
Execution Time of DR-functions (charts): DR1-DR3 on variable sizes from 64 MB to 1.5 GB (up to ~4 s); DR4 with range sizes from 1 KB to 128 MB (up to ~7 s); DR5-DR7 (up to ~40 s). Recall: DR6 (select var.value where var.value > Mean(var)) requires a double scan of the input data; DR7 (select var2.value where var1.value in (r1, r2)) stays cheap because var1 is small.
SciBox System Scalability (charts): execution time versus number of producers (1 to 16) and download latency versus number of consumers in the same sharing group (2 to 64), stock system versus Scibox. With Scibox, data is merged before upload and partial object access is supported, which substantially reduces time and latency at larger scales.
GTS Workload (chart): latency broken into post-computation, download, filter-computation, and GTS-computation, comparing Scibox against FTP for consumers at Georgia Tech, OSU, and WSU. Link bandwidths to GT: WSU 900 KB/s, OSU 4.4 MB/s, GT 44 MB/s.
Combustion Workload (chart): performance speedup versus number of clients in the same sharing group (1 to 16), for 1 to 32 processes. DR-function: (ImageName, DR8, FFT); 10,000 images (~1.5 MB each) shared via Scibox.
Outline: Background and Motivation; Problems and Challenges; Design and Implementation; Evaluation; Conclusion and Future Work
Conclusions and Future Work. Scibox provides cloud-based support for scientific data sharing and can operate across both public and private cloud stores. Data Reduction functions exploit the structured nature of scientific data to reduce the amount of transferred data. Scibox is transparent to applications and data producers: it is implemented in ADIOS and can also be applied to other I/O libraries. Future work: deploy Scibox on national laboratory facilities to better understand potential use cases, and pursue additional optimizations of cloud storage for scientific data management.
Thanks!