Dynamic Active Storage for High Performance I/O


Chao Chen (chao.chen@ttu.edu), 4.02.2012, UREaSON

Outline
- Background
- Active Storage
- Issues/Challenges
- Dynamic Active Storage
- Prototyping and Evaluation
- Conclusion and Future Work

Background
- Applications from areas such as geographic information systems, climate science, astrophysics, and high-energy physics are becoming more and more data intensive.
  - NASA's Shuttle Radar Topography Mission (10 TB)
  - FLASH: Buoyancy-Driven Turbulent Nuclear Burning (75 TB - 300 TB)
  - Climate science (10 TB - 355 TB)
- Efficient tools are needed to store and analyze these data sets.

Background
- CN: compute nodes, dedicated to processing (sum, subtraction, multiplication, etc.)
- SN: storage nodes, dedicated to storing the data.
- Moving data from the storage nodes to the compute nodes for analysis is very time consuming.
- I/O operations dominate overall system performance.
[Figure: traditional architecture - the application and analysis kernel run on compute nodes (CN 1 .. CN n), which send I/O requests over the network to storage nodes (SN 1 .. SN m) backed by disks.]

Active Storage
- Active Storage was proposed to mitigate this issue and has attracted considerable attention.
- It moves appropriate computations near the data, i.e., onto the storage nodes.
[Figure: active storage architecture - the application on the compute node issues an I/O request and receives only the result; the analysis kernel runs on the storage node next to the data on disk, so the network bandwidth cost is reduced.]

Active Storage
Two well-known prototypes:
- Felix et al. proposed the first prototype, based on Lustre.
  - Supports only limited, simple operations.
  - Lacks a flexible method for adding processing kernels.
[Figure: Lustre-based design - a user-space processing component attached to the OST stack (NAL, ASOBD, ASDEV, OBDfilter, ext3).]

Active Storage
- Son et al. proposed another prototype, based on PVFS.
  - It provides a more sophisticated design built on MPI.
  - Users can register their own processing kernels.
[Figure: PVFS-based design - clients (Client 1 .. Client n) run the application over the parallel file system API plus an Active Storage API; servers (Server 1 .. Server n) run the parallel file system API and the processing kernels next to disk and GPU, connected by an interconnection network.]
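To make the client/server split above concrete, here is a minimal sketch of the flow the slide describes: the client issues an active read naming a registered kernel, each storage node runs the kernel on its local data, and only the small partial results cross the network. The class and method names (ActiveStorageClient, StorageServer, register_kernel, active_read) are illustrative assumptions, not the actual API of the PVFS prototype.

```python
# Hypothetical sketch of the active-storage flow described above; the names are
# illustrative, not the real PVFS active-storage API.

class StorageServer:
    """A storage node that can run registered analysis kernels locally."""
    def __init__(self):
        self.kernels = {}     # kernel name -> callable
        self.local_data = {}  # file path -> values stored on this node

    def register_kernel(self, name, func):
        self.kernels[name] = func  # user-registered processing kernel

    def run_kernel(self, name, path):
        # Execute the kernel next to the data; only the result leaves the node.
        return self.kernels[name](self.local_data.get(path, []))


class ActiveStorageClient:
    """Compute-node side: issues active I/O requests instead of reading raw data."""
    def __init__(self, servers):
        self.servers = servers

    def active_read(self, path, kernel):
        # Each server reduces its local portion; only small partials cross the network.
        partials = [s.run_kernel(kernel, path) for s in self.servers]
        return sum(partials)


if __name__ == "__main__":
    servers = [StorageServer() for _ in range(3)]
    for i, s in enumerate(servers):
        s.register_kernel("sum", sum)
        s.local_data["/terrain.dat"] = [i, i + 1, i + 2]  # toy local stripes
    client = ActiveStorageClient(servers)
    print(client.active_read("/terrain.dat", "sum"))      # 3 + 6 + 9 = 18
```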

Issues/Challenges
- Existing studies do not consider data dependence.
- Dependence commonly exists among data accesses.

Issues/Challenges
For example, the flow-direction and flow-accumulation operations in terrain analysis: each cell's value depends on its neighbours.
[Fig. 1: Examples of SFD (single flow direction) and MFD (multiple flow direction) over a latitude/longitude grid.]

Issues/Challenges
- Dependence has a great impact on performance.
[Figure: execution time (s) of traditional storage (TS) vs. active storage (AS) for data sizes of 24-60 GB; left panel: SUM operation (no dependence), right panel: flow-routing operation (with dependence).]
Question: is every operation suitable to be offloaded to the storage nodes?

Data Dependence
- Each stripe is 64 KB in PVFS.
[Figure: a terrain map of M x N cells split into stripes (Stripe 1 .. Stripe L) and distributed round-robin across servers a, b, and c, each with its own analysis kernel and disk; neighbouring stripes (o, p, q) land on different servers.]
- Possible bandwidth cost: 2 times the file size, since dependent stripes must be fetched from other servers.
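To make the "2 times" estimate concrete, the sketch below maps terrain cells to 64 KB stripes placed round-robin across three servers and lists the dependence neighbours of a cell that live on a different server. The grid width, element size, and 8-cell neighbourhood are illustrative assumptions, not values from the talk.

```python
# Sketch of round-robin striping and cross-server dependences; the grid size,
# element size, and neighbourhood below are illustrative assumptions.

STRIPE_SIZE = 64 * 1024  # PVFS stripe size mentioned in the slide (64 KB)
NUM_SERVERS = 3          # storage nodes (servers a, b, c in the figure)
ELEM_SIZE = 8            # bytes per cell (assumed)
COLS = 16 * 1024         # cells per row of the terrain map (assumed)

def server_of(row, col):
    """Which storage node holds a given cell under round-robin striping."""
    offset = (row * COLS + col) * ELEM_SIZE
    stripe = offset // STRIPE_SIZE
    return stripe % NUM_SERVERS

def remote_neighbours(row, col):
    """Dependence neighbours (8-neighbourhood, as in flow routing) stored elsewhere."""
    home = server_of(row, col)
    nbrs = [(row + dr, col + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)]
    return [n for n in nbrs if server_of(*n) != home]

# A cell near a stripe boundary depends on cells stored on other servers, so
# their stripes must be shipped across the network before the kernel can run.
print(remote_neighbours(7, 100))
```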

Dynamic Active Storage
A Dynamic Active Storage (DAS) prototype is proposed. It:
- Predicts the I/O bandwidth cost before an active I/O request is accepted.
- Dynamically determines which operations are beneficial to offload and process on the storage nodes.
- Introduces a new data layout method.

DAS System Architecture
Key components:
1. Bandwidth prediction
2. Data distribution calculation (layout optimizer)
3. Kernel features
4. Local I/O API
5. Processing kernels
(New components are highlighted in the slide's architecture diagram.)

Bandwidth Prediction
Knowing the dependence pattern, we can calculate the data locations and estimate the bandwidth cost in advance:
- i, j, k: the i-th, j-th, and k-th data elements, separated by a stride
- E: data element size
- D: number of storage nodes
- L: location of a data element
- stripe_size: parallel file system parameter

Bandwidth Prediction
- If Formula 1 holds: all dependent data is located on the same storage node, so the offload (active I/O) request is accepted.
- Otherwise: offloading would cost roughly twice the file size in bandwidth, so the active I/O request is rejected.
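The transcription does not reproduce Formula 1 itself, so the sketch below is only a plausible reconstruction from the parameters on the previous slide (element size E, number of storage nodes D, stripe_size, and the locations L of dependent elements i, j, k): the offload request is accepted only when all dependent elements map to the same storage node.

```python
# Hedged reconstruction of the acceptance test; the slide's Formula 1 is not in
# the transcription, so this only follows the parameter list given above.

def location(index, E, D, stripe_size):
    """L: storage node holding the index-th data element (round-robin stripes)."""
    return (index * E // stripe_size) % D

def accept_active_io(dep_indices, E, D, stripe_size):
    """Accept the offload request only if every dependent element is co-located."""
    nodes = {location(i, E, D, stripe_size) for i in dep_indices}
    return len(nodes) == 1

# Example: element i and its stride neighbours j = i - stride, k = i + stride.
E, D, stripe_size, stride = 8, 3, 64 * 1024, 16 * 1024
i = 100_000
deps = [i - stride, i, i + stride]
if accept_active_io(deps, E, D, stripe_size):
    print("offload to storage node", location(i, E, D, stripe_size))
else:
    print("reject active I/O: dependent data spans servers (~2x bandwidth cost)")
```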

Issues/Challenges
On the other hand, it is common in terrain analysis and image processing for successive operations to share the same data access pattern. For example, flow-direction is always followed by flow-accumulation in terrain analysis, and flow-direction generates the intermediate image/map consumed by flow-accumulation.

Layout Optimizer
A new data distribution method is introduced (see the sketch below):
- Adopt a suitable distribution to store the intermediate image/data.
- Ensure little or no data dependency for successive operations (such as flow-accumulation).
- The round-robin pattern is discarded; each storage node stores k successive stripes.
- Two copies of each boundary stripe are kept, on two successive storage nodes.
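A minimal sketch of the layout just described, under the assumption that stripes are assigned to nodes in contiguous groups of k and that the last stripe of each group is also copied to the next node; the stripe count, node count, and k below are toy values.

```python
# Sketch of the DAS layout: contiguous groups of k stripes per node, with the
# boundary stripe of each group also copied to the next node. Values are toy.

def das_layout(num_stripes, num_nodes, k):
    """Return {node: [stripe ids]} with boundary stripes duplicated on the next node."""
    placement = {n: [] for n in range(num_nodes)}
    for stripe in range(num_stripes):
        node = (stripe // k) % num_nodes
        placement[node].append(stripe)
        # Replicate the last stripe of each group so the successor node can
        # resolve boundary dependences locally (no cross-server transfer).
        if stripe % k == k - 1:
            placement[(node + 1) % num_nodes].append(stripe)
    return placement

print(das_layout(num_stripes=12, num_nodes=3, k=2))
# e.g. node 0 holds stripes [0, 1, 5, 6, 7, 11]: its own groups plus copied boundaries
```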

Layout Optimizer
[Figure: normal (round-robin) data layout - stripes l .. q of the M x N map are spread across servers a and b, so boundary stripes require data transfer between servers.]

Layout Optimizer
[Figure: proposed layout - server a stores stripes l, m, n and server b stores stripes o, p, q; the boundary stripe is copied to the neighbouring server, replacing the cross-server data transfer.]

Layout Optimizer
New formulas: what the prototype needs to do is calculate suitable values for k, D, and stripe_size.
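The new formulas themselves are not in the transcription; as one plausible reading, k can simply be chosen so that the file splits into D contiguous groups of stripes, leaving only D - 1 boundaries to cover with the replicated copies. The helper below assumes that interpretation and uses made-up sizes.

```python
# The slide's formulas are not reproduced here; this is one plausible way to
# pick k: split the file into D contiguous groups of stripes so only D - 1
# boundaries need the replicated boundary copy.

import math

def choose_k(file_size, stripe_size, num_nodes):
    total_stripes = math.ceil(file_size / stripe_size)
    return math.ceil(total_stripes / num_nodes)

# Example with made-up sizes: a 24 GB file, 64 KB stripes, 24 storage nodes.
k = choose_k(24 * 2**30, 64 * 1024, 24)
print(k)  # number of successive stripes stored per storage node
```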

Evaluation
- Platform: Hrothgar cluster
- Number of nodes: 24, 36, 48, 60
- Evaluated operations: flow-routing, flow-accumulation, and 2D Gaussian filter
- Data set sizes: 24 GB, 36 GB, 48 GB, and 60 GB
- Evaluated schemes: TS (traditional storage), NAS (normal active storage), DAS (proposed prototype)

Impact of Data Dependence
[Figure: "Performance Impact of Data Dependence" - execution time (s) vs. data size (24-60 GB) for flow-routing, flow-accumulation, and the Gaussian filter under the NAS and TS schemes.]
The execution time of the NAS scheme is compared with that of the TS scheme.

Performance Improvement
[Figure: "Execution Time of Each Scheme" - execution time (s) of NAS, DAS, and TS for flow-routing, flow-accumulation, and the Gaussian filter (24 GB data, 24 nodes).]
DAS achieves about a 30% improvement over TS and about a 60% improvement over NAS.

Scalability Analysis
[Figure: "Scalability with Varied Number of Nodes" - execution time (s) of flow-routing, flow-accumulation, and the Gaussian filter under DAS and TS for 24-60 nodes.]
Execution time of all operations decreases by about 15% for every 12 additional nodes.

Scalability Analysis
[Figure: "Scalability with Varied Data Set Size" - execution time (s) of flow-routing, flow-accumulation, and the Gaussian filter under NAS, DAS, and TS for varied data set sizes.]
For every additional 12 GB of data, execution time increases by about 15% for DAS and about 30% for NAS and TS.

Bandwidth Improvement
[Figure: "Normalized Bandwidth" - normalized sustained bandwidth of NAS, DAS, and TS for data sizes of 24-60 GB.]
Compared to TS, DAS delivers about 1.8 times the bandwidth, while NAS delivers about 0.7 times.

Conclusion and Future Work
- Data dependence has a great impact on the performance of Active Storage.
- DAS is introduced to address this challenge.
- Future work: resource contention.

References
1. R. Ross, R. Latham, M. Unangst, and B. Welch. Parallel I/O in Practice. Tutorial at the ACM/IEEE Supercomputing Conference, 2009.
2. J. F. O'Callaghan and D. M. Mark. The Extraction of Drainage Networks from Digital Elevation Data. Computer Vision, Graphics, and Image Processing, 28:323-344, 1984.
3. J. Piernas, J. Nieplocha, and E. J. Felix. Evaluation of Active Storage Strategies for the Lustre Parallel File System. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, 2007.
4. E. J. Felix, K. Fox, K. Regimbal, and J. Nieplocha. Active Storage Processing in a Parallel File System. In 6th LCI International Conference on Linux Clusters: The HPC Revolution, Chapel Hill, North Carolina, 2005.
5. S. W. Son, S. Lang, P. Carns, R. Ross, and R. Thakur. Enabling Active Storage on Parallel I/O Software Stacks. In 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST), 2010.
... etc.

Thank you