Dynamic Active Storage for High-Performance I/O
Chao Chen (chao.chen@ttu.edu)
4.02.2012, UREaSON
Outline
Ø Background
Ø Active Storage
Ø Issues/Challenges
Ø Dynamic Active Storage
Ø Prototyping and Evaluation
Ø Conclusion and Future Work
Background
Ø Applications in geographical information systems, climate science, astrophysics, high-energy physics, etc. are becoming increasingly data-intensive:
  - NASA's Shuttle Radar Topography Mission (10 TB)
  - FLASH: Buoyancy-Driven Turbulent Nuclear Burning (75 TB~300 TB)
  - Climate science (10 TB~355 TB)
Ø Efficient tools are needed to store and analyze these data sets.
Background
Ø CN: compute nodes, dedicated to processing (sum, difference, multiplication, etc.)
Ø SN: storage nodes, dedicated to storing the data
Ø Moving data sets from the storage nodes to the compute nodes is very time-consuming
Ø I/O operations dominate overall system performance
[Figure: compute nodes CN 1..CN n run the application and analysis kernel; I/O requests and data cross the network to storage nodes SN 1..SN m and their disks.]
Active Storage
Ø Active Storage was proposed to mitigate this issue and has attracted considerable attention.
Ø It moves appropriate computations close to the data, onto the storage nodes, so that network bandwidth cost is reduced.
[Figure: the compute node sends an I/O request and receives only the result; the analysis kernel runs on the storage node, next to the data on disk.]
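To make the bandwidth saving concrete, here is a minimal back-of-the-envelope sketch (our illustration, with assumed sizes): for a reduction such as SUM, traditional storage (TS) ships the whole data set over the network, while active storage (AS) ships only the result.

```python
# Hypothetical sizes for illustration; only the ratio matters.
data_size = 24 * 2**30    # 24 GB input file read by the analysis kernel
result_size = 8           # one 8-byte scalar produced by SUM

ts_traffic = data_size    # TS: all data crosses the network to the CNs
as_traffic = result_size  # AS: the kernel runs on the SNs; only the
                          # result crosses the network
print(f"TS moves {ts_traffic} bytes, AS moves {as_traffic} bytes")
```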
Active Storage
Two well-known prototypes:
Ø Felix et al. proposed the first prototype, based on Lustre
  - Supports only a limited set of simple operations
  - Lacks a flexible method for adding processing kernels
[Figure: Lustre-based architecture with a user-space processing component layered over NAL, OST, ASOBD, ASDEV, OBDfilter, and ext3.]
Active Storage
Ø Son et al. proposed another prototype, based on PVFS
  - Provides a more sophisticated prototype built on MPI
  - Users can register their own processing kernels
[Figure: clients 1..n run the application over the parallel file system API and an Active Storage API; an interconnection network links them to servers 1..n, each running registered kernels over its disk and GPU.]
Issues/Challenges
Ø All existing studies don't consider data dependence
Ø Dependence commonly exists among data accesses
Issues/Challenges
For example, the flow-direction and flow-accumulation operations in terrain analysis: each cell's value is computed from its neighboring cells on the latitude/longitude grid.
[Fig. 1: Examples of SFD and MFD. SFD: single flow direction; MFD: multiple flow direction.]
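To show where the dependence comes from, here is a simplified sketch (our illustration, not the talk's code) of single flow direction in the style of the D8 method of O'Callaghan and Mark [2], ignoring the diagonal-distance weighting: each interior cell drains toward the lowest of its 8 neighbors, so every cell depends on a 3x3 neighborhood of the elevation grid.

```python
import numpy as np

# Neighbor offsets of the 3x3 stencil, clockwise-independent fixed order.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           ( 0, -1),          ( 0, 1),
           ( 1, -1), ( 1, 0), ( 1, 1)]

def sfd(dem):
    """Single flow direction: index of the steepest-drop neighbor per cell."""
    rows, cols = dem.shape
    direction = np.full((rows, cols), -1, dtype=int)  # -1: border cell
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            drops = [dem[i, j] - dem[i + di, j + dj] for di, dj in OFFSETS]
            direction[i, j] = int(np.argmax(drops))
    return direction

dem = np.array([[3., 2., 1.],
                [4., 3., 0.],
                [5., 4., 3.]])
print(sfd(dem))  # the center cell drains toward its lowest neighbor (value 0)
```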
Issues/Challenges
Ø Dependence has a great impact on performance
[Figure: execution time (s) of AS vs. TS for data sizes 24~60 GB. Left: SUM operation (no dependence). Right: flow-routing operation (with dependence).]
Question: is every operation suitable for offloading to the storage nodes?
Data Dependence
Ø In PVFS, a file is split into stripes (64 KB each) and distributed round-robin across the storage nodes.
[Figure: a terrain map of M x N cells is split into stripes 1..L; stripes o, p, q land on servers a, b, c, each running an analysis kernel over its local disk.]
Ø A cell's neighbors can fall into stripes stored on other servers, so each kernel must fetch remote stripes before it can compute.
Ø Possible bandwidth cost: 2 times the file size.
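The sketch below (assumed round-robin striping as in PVFS; element size, grid width, and server count are our hypothetical parameters) shows why a stencil kernel on one storage node ends up needing stripes held by other nodes.

```python
STRIPE_SIZE = 64 * 1024   # 64 KB, the PVFS stripe size cited on the slide
ELEM_SIZE = 8             # bytes per cell (assumption)
NUM_SERVERS = 3           # servers a, b, c in the figure

def server_of(row, col, cols):
    """Storage node holding cell (row, col) of a row-major, cols-wide grid."""
    offset = (row * cols + col) * ELEM_SIZE
    return (offset // STRIPE_SIZE) % NUM_SERVERS

cols = 16384              # one 128 KB row spans exactly two stripes
row, col = 100, 8191      # a cell right at a stripe boundary
home = server_of(row, col, cols)
neighbors = [(row + di, col + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
             if (di, dj) != (0, 0)]
remote = [n for n in neighbors if server_of(*n, cols) != home]
print(home, remote)       # several of the 8 neighbors live on other servers
```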
Dynamic Active Storage
A Dynamic Active Storage (DAS) prototype is proposed. It:
Ø Predicts the I/O bandwidth cost before an active I/O request is accepted
Ø Dynamically determines which operations are beneficial to offload to the storage nodes
Ø Introduces a new data layout method
DAS System Architecture
Key components (the new parts are highlighted on the slide):
1. Bandwidth prediction
2. Data distribution calculation (layout optimizer)
3. Kernel features
4. Local I/O API
5. Processing kernels
Bandwidth Prediction
Knowing the dependence pattern, we can calculate the data locations and estimate the bandwidth cost in advance.

Notation:
- i, j, k: the i-th, j-th, k-th data elements (neighbors at a fixed stride in the dependence stencil)
- E: data element size
- D: number of storage nodes
- L: location (storage node index) of a data element
- stripe_size: parallel file system striping parameter

Under round-robin striping, L(i) = floor(i * E / stripe_size) mod D.
Bandwidth Prediction
if L(i) = L(j) = L(k) for all dependent elements (Formula 1):
    all of the dependent data is located on the same storage node, so accept the offload request
else:
    the request would cost about twice the file size in network bandwidth, so reject the active I/O request
Issues/Challenges
On the other hand, it is common for successive operations to share the same data access patterns in terrain analysis and image processing. For example, flow-direction is always followed by flow-accumulation in terrain analysis: flow-direction generates an intermediate image/map that flow-accumulation consumes.
Layout Optimizer
A new data distribution method is introduced:
Ø Adopts a suitable data distribution method for storing the intermediate image/data
Ø Ensures no (or little) data dependency for successive operations (such as flow-accumulation)
Ø Discards the round-robin pattern; each storage node stores k successive stripes
Ø Stores two copies of each boundary stripe, on the two adjacent storage nodes
Layout Optimizer
[Figure: with the normal round-robin layout, stripes l..q alternate between servers a and b, so dependent stripes must be transferred between servers during the analysis.]
Layout Optimizer
[Figure: with the optimized layout, server a stores stripes l, m, n and server b stores stripes o, p, q; only the boundary stripes are copied to the adjacent server, replacing most of the data transfer.]
Layout Optimizer
New formulas: the prototype needs to calculate suitable values for k, D, and stripe_size so that each node's k successive stripes, plus the replicated boundary stripes, cover all the dependencies of the successive operation.
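A minimal sketch of this layout follows. The block-placement and boundary-replication policy are our reading of the preceding slides (the exact DAS formulas for choosing k, D, and stripe_size are not shown here), so treat the helper and its behavior as assumptions:

```python
def block_layout(num_stripes, num_servers):
    """Map stripe -> list of servers holding it (primary first)."""
    k = -(-num_stripes // num_servers)       # ceil: stripes per server
    placement = {}
    for s in range(num_stripes):
        primary = s // k
        holders = [primary]
        if s % k == k - 1 and primary + 1 < num_servers:
            holders.append(primary + 1)      # copy last stripe of a block forward
        if s % k == 0 and primary > 0:
            holders.append(primary - 1)      # copy first stripe of a block back
        placement[s] = holders
    return placement

for stripe, servers in block_layout(num_stripes=9, num_servers=3).items():
    print(stripe, servers)
# Stripes 0-2 on server 0, 3-5 on server 1, 6-8 on server 2;
# boundary stripes 2/3 and 5/6 are replicated on the neighboring server,
# so a stencil over successive stripes never has to leave its node.
```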
Evaluation
Platform: Hrothgar Cluster
# of nodes: 24, 36, 48, 60
Evaluated operations: flow-routing, flow-accumulation, and 2D Gaussian filter
Data set sizes: 24 GB, 36 GB, 48 GB, and 60 GB
Evaluated schemes: TS (traditional storage), NAS (normal active storage), DAS (the proposed prototype)
Impact of Data Dependence
[Figure: execution time (s) of NAS vs. TS for flow-routing, flow-accumulation, and Gaussian filter at 24~60 GB.]
The execution time of the NAS scheme is compared with that of the TS scheme.
Performance Improvement
[Figure: execution time (s) of NAS, DAS, and TS for flow-routing, flow-accumulation, and Gaussian filter (24 GB data, 24 nodes).]
DAS: 30% improvement vs. TS, 60% improvement vs. NAS.
Scalability Analysis
[Figure: execution time (s) of DAS and TS for flow-routing, flow-accumulation, and Gaussian filter as the number of nodes grows from 24 to 60.]
Execution time decreases by about 15% for every 12 nodes added.
Scalability Analysis
[Figure: execution time (s) of NAS, DAS, and TS for flow-routing, flow-accumulation, and Gaussian filter as the data set grows from 24 GB to 60 GB.]
For every additional 12 GB of data, execution time increases by about 15% for DAS versus about 30% for NAS and TS.
Bandwidth Improvement 2.5" Normalized+Bandwidth+ 2" Normalized+band+width+ 1.5" 1" 0.5" NAS" DAS" TS" Compared to TS DAS: 1.8 times bandwidth NAS: 0.7 times bandwidth 0" 24" 36" 48" 60" Data+size(GB)+ Normalized Sustained Bandwidth Improvement
Conclusion and Future Work
Ø Data dependence has a great impact on the performance of Active Storage
Ø DAS is introduced to address this challenge
Ø Future work: resource contention
Reference
1. R. Ross, R. Latham, M. Unangst, and B. Welch. Parallel I/O in Practice. Tutorial at the ACM/IEEE Supercomputing Conference, 2009.
2. J. F. O'Callaghan and D. M. Mark. The Extraction of Drainage Networks from Digital Elevation Data. Computer Vision, Graphics, and Image Processing, 28:323-344, 1984.
3. J. Piernas, J. Nieplocha, and E. J. Felix. Evaluation of Active Storage Strategies for the Lustre Parallel File System. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, 2007.
4. E. J. Felix, K. Fox, K. Regimbal, and J. Nieplocha. Active Storage Processing in a Parallel File System. In 6th LCI International Conference on Linux Clusters: The HPC Revolution, Chapel Hill, North Carolina, 2005.
5. S. W. Son, S. Lang, P. Carns, R. Ross, and R. Thakur. Enabling Active Storage on Parallel I/O Software Stacks. In 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST), 2010.
... etc.
Thank you