Enabling Active Storage on Parallel I/O Software Stacks. Seung Woo Son, Mathematics and Computer Science Division


Enabling Active Storage on Parallel I/O Software Stacks
Seung Woo Son (sson@mcs.anl.gov)
Mathematics and Computer Science Division
MSST 2010, Incline Village, NV, May 7, 2010

Performing analysis on large data sets is often frustrating
[Figure: the data-analysis pipeline. Original Data -> (acquire data or simulate) -> Target Data -> (data integration & selection) -> Preprocessed Data -> (preprocessing: pattern recognition & feature extraction) -> Patterns -> (model construction) -> Knowledge -> (interpretation: interactive analysis & visualization).]
S3D simulations for combustion research are producing 30-130 TB of data per simulation.
Scientists and engineers spend too much time on data manipulation, especially moving and reorganizing data.

Talk outline
Motivation
Active storage in parallel file systems
Our prototype: enhanced runtime interface that uses embedded analysis kernels, runtime stripe alignment, and server-to-server communication for reduction and aggregation
Experimental evaluation
Conclusion

Active storage in parallel file systems
Active storage is a technique for performing data transformations in the storage system: during a file read, filters run on the server nodes (S1 ... Sn) and only the reduced data is sent across the interconnection network to the client nodes (C1 ... Cm).
Limitations of prior work: library or user-space implementations that are not well integrated into I/O software stacks; targeting applications that manipulate fundamentally independent data sets; and a lack of reduction and aggregation on the storage nodes.
E. Riedel et al., "Active disks for large-scale data processing," IEEE Computer, 2001.
J. Piernas et al., "Evaluation of active storage strategies for the Lustre parallel file system," in SC, 2007.

We enable active storage on the parallel I/O software stack through:
1. An enhanced runtime interface (API) to enable active storage operations
2. Runtime data stripe alignment
3. Server-to-server communication primitives for complex analysis

Enhanced runtime I/O interface to trigger embedded analysis kernels

Conventional MPI-based (all data is read to the client, which computes the sum):
    sum = 0.0;
    MPI_File_open(&fh);
    double *tmp = (double *)malloc(nitem * sizeof(double));
    offset = rank * nitem * type_size;
    MPI_File_read_at(fh, offset, tmp, nitem, MPI_DOUBLE, &status);
    for (i = 0; i < nitem; i++)
        sum += tmp[i];

Active storage based:
    <client>
    ...
    MPI_File_open(&fh);
    ...
    MPI_File_read_ex(..., SUM, ...);
    ...
    <server>
    for (i = 0; i < nitem; i++)
        sum += tmp[i];
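For reference, a complete, self-contained version of the conventional client-side pattern might look like the sketch below. It is illustrative only: the file name, the per-rank item count, and the final MPI_Reduce across clients are assumptions, and the extended MPI_File_read_ex call on the slide is the proposed interface, not part of standard MPI.

    /* Conventional client-side sum over a file of doubles (illustrative sketch).
     * Each rank reads its contiguous block, reduces locally, then globally. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int nitem = 1 << 20;                  /* items per rank (assumed) */
        double *tmp = malloc(nitem * sizeof(double));

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "input.dat",  /* hypothetical file name */
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

        MPI_Offset offset = (MPI_Offset)rank * nitem * sizeof(double);
        MPI_Status status;
        MPI_File_read_at(fh, offset, tmp, nitem, MPI_DOUBLE, &status);

        double sum = 0.0, global_sum = 0.0;
        for (int i = 0; i < nitem; i++)
            sum += tmp[i];                          /* all data crossed the network first */
        MPI_Reduce(&sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %f\n", global_sum);

        MPI_File_close(&fh);
        free(tmp);
        MPI_Finalize();
        return 0;
    }

The point of the active-storage variant is that the summation loop runs on the server, so only the scalar result travels over the network.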

Why MPI?
MPI is a widely used interface, and a large number of applications already use it, so migrating them should be relatively easy.
The MPI specification provides interfaces into which user functions can be embedded, which makes it easy to incorporate data mining and statistical functions.
The hint mechanism allows kernel-specific arguments, e.g., data types, to be passed to the server.
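As a concrete illustration of the hook the slide refers to, standard MPI already lets user functions be plugged into the library through MPI_Op_create; the active-storage work extends the same idea so analysis kernels run inside the I/O servers. The sketch below uses an arbitrary user-defined reduction and is not the paper's interface.

    /* User-defined reduction registered with MPI_Op_create (illustrative). */
    #include <mpi.h>
    #include <stdio.h>

    /* user function: element-wise sum of absolute values */
    static void abs_sum(void *in, void *inout, int *len, MPI_Datatype *dtype)
    {
        double *a = (double *)in, *b = (double *)inout;
        for (int i = 0; i < *len; i++)
            b[i] += (a[i] < 0.0) ? -a[i] : a[i];
        (void)dtype;                       /* single-datatype example */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Op op;
        MPI_Op_create(abs_sum, 1, &op);    /* 1 = commutative */

        double local = (rank % 2) ? -1.5 : 2.0, global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, op, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("reduced value = %f\n", global);

        MPI_Op_free(&op);
        MPI_Finalize();
        return 0;
    }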

Mapping embedded analysis kernels into the I/O pipeline

PVFS state machine:
    machine pvfs_pipeline_sm
    {
        state fetch {
            run fetch_data;
            normal_op => dispatch;
            active_op => do_comp;
        }
        state do_comp {
            run do_comp_op;
            success => dispatch;
        }
        state dispatch {
            run dispatch_data;
            success => check_done;
        }
        state check_done {
            run check_done_action;
            not_done => fetch;
            default => terminate;
        }
    }

Associated actions:
    static int fetch_data { disk I/O; }
    static int dispatch_data { send the data; }
    static int do_comp_op {
        for (i = 0; i < nitem; i++)
            sum += tmp[i];
    }
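To make the do_comp step concrete, a hypothetical sketch of how an embedded kernel could be selected by operation code is shown below. The opcode names, the kernel table, and the function signatures are illustrative assumptions, not the actual PVFS server code.

    /* Hypothetical kernel dispatch for the do_comp state (illustrative only). */
    #include <stddef.h>

    typedef enum { OP_NONE = 0, OP_SUM, OP_GREP, OP_KMEANS } as_opcode_t;

    typedef int (*as_kernel_fn)(const void *buf, size_t nbytes,
                                void *result, const void *hints);

    static int kernel_sum(const void *buf, size_t nbytes,
                          void *result, const void *hints)
    {
        const double *d = (const double *)buf;
        double *sum = (double *)result;
        size_t n = nbytes / sizeof(double);
        for (size_t i = 0; i < n; i++)
            *sum += d[i];              /* only the reduced value leaves the server */
        (void)hints;
        return 0;
    }

    /* one entry per embedded kernel; consulted for each fetched buffer */
    static as_kernel_fn kernel_table[OP_KMEANS + 1] = {
        [OP_SUM] = kernel_sum,
        /* OP_GREP and OP_KMEANS would be registered the same way */
    };

    int do_comp_op(as_opcode_t op, const void *buf, size_t nbytes,
                   void *result, const void *hints)
    {
        if (op <= OP_NONE || op > OP_KMEANS || kernel_table[op] == NULL)
            return -1;                 /* not an active operation: normal dispatch */
        return kernel_table[op](buf, nbytes, result, hints);
    }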

Computational units are often not perfectly aligned to the file stripe unit
[Figure: an n-dimensional data set stored as 80-byte records (day1, day2, day3, ...), striped with a 65536-byte stripe unit; the nearest whole-record boundary falls at 65600 bytes, so records straddle stripe boundaries.]
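The alignment arithmetic behind those numbers is simple. The sketch below (record and stripe sizes taken from the figure, everything else illustrative) rounds a stripe-sized request up to the next whole-record boundary, which is what the runtime alignment step must do before a kernel can operate on a server's local data.

    /* Round a stripe-sized request up to a whole-record boundary (illustrative). */
    #include <stdio.h>

    int main(void)
    {
        const long record_size = 80;      /* one computational unit, from the figure */
        const long stripe_size = 65536;   /* file stripe unit, from the figure */

        long records = (stripe_size + record_size - 1) / record_size;  /* ceiling */
        long aligned = records * record_size;

        printf("stripe %ld B -> %ld records -> aligned fetch %ld B\n",
               stripe_size, records, aligned);
        /* prints: stripe 65536 B -> 820 records -> aligned fetch 65600 B */
        return 0;
    }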

I/O pipeline with data alignment

Server-to-server communication for reduction and aggregation
Reduction and aggregation can be done on the client side for simple statistical operations, but complex analysis kernels (e.g., k-means clustering) require broadcast and reduction during their iterative execution.
K-means clustering algorithm: 1. Randomly choose initial centers. 2. Assign each point to the nearest center. 3. Update centers (mean of members). 4. Repeat until convergence.
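The reduction each k-means iteration needs across the server nodes looks roughly like the sketch below. MPI_Allreduce is used here as a stand-in for the server-to-server primitives the slide describes, and K, DIM, and the partial-sum layout are illustrative assumptions.

    /* Per-iteration center update from per-server partial sums (sketch). */
    #include <mpi.h>

    #define K   20      /* clusters   (assumed) */
    #define DIM 10      /* dimensions (assumed) */

    /* local_sum[k][d] and local_cnt[k] hold one server's partial results
     * after assigning its locally stored points to the nearest centers. */
    void reduce_centers(double local_sum[K][DIM], long local_cnt[K],
                        double centers[K][DIM], MPI_Comm servers)
    {
        double global_sum[K][DIM];
        long   global_cnt[K];

        MPI_Allreduce(local_sum, global_sum, K * DIM, MPI_DOUBLE, MPI_SUM, servers);
        MPI_Allreduce(local_cnt, global_cnt, K, MPI_LONG, MPI_SUM, servers);

        for (int k = 0; k < K; k++)
            for (int d = 0; d < DIM; d++)
                if (global_cnt[k] > 0)
                    centers[k][d] = global_sum[k][d] / global_cnt[k];
    }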

K-means clustering is performed purely on the server side!

Benchmarks and evaluation platform

Name     Description                    Base (sec)   Input data                          % of filtering
sum      Global reduction               1.38         512 MB                              ~100%
grep     String pattern matching        1.49         512 MB (4M strings of 128 chars)    ~100%
kmeans   K-means clustering algorithm   0.44         40 MB (1M points x 10-dim double)   90%
vren     Parallel volume rendering      2.61         103 MB (300x300x300 float)          97%

Test cluster: 32 nodes, dual quad-core Intel Xeon 2.66 GHz
Main memory: 16 GB
Storage capacity: ~200 GB per node
Interconnection network: 1 Gb Ethernet
GPU accelerator: 2 NVIDIA C1060 GPU cards

All benchmarks are I/O dominant: 64.4% of the execution time is spent on I/O. Benchmarks are executed using 4 nodes.

Moving computation to the storage servers (AS) improves performance significantly. TS: Traditional Storage, 4 client nodes and 4 server nodes. AS: Active Storage, 4 server nodes.

Our approach is scalable with respect to the number of nodes (SUM benchmark, fixed data size: 512 MB).

Putting client and server together (SUM benchmark using 1 node): there is no inter-node communication, but inter-process communication still exists. To achieve this in practice, the client should be aware of the storage layout.
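One way a client could become layout-aware is by inspecting the file's striping hints, as sketched below. The "striping_factor" and "striping_unit" keys are common ROMIO hint names; whether and how they are exposed depends on the MPI-IO implementation and file system, so this is an assumption rather than the paper's mechanism.

    /* Query file-layout hints so a client can co-locate with its data (sketch). */
    #include <mpi.h>
    #include <stdio.h>

    void print_layout_hints(MPI_File fh)
    {
        MPI_Info info;
        char value[MPI_MAX_INFO_VAL + 1];
        int flag;

        MPI_File_get_info(fh, &info);

        MPI_Info_get(info, "striping_factor", MPI_MAX_INFO_VAL, value, &flag);
        if (flag) printf("striping_factor = %s\n", value);

        MPI_Info_get(info, "striping_unit", MPI_MAX_INFO_VAL, value, &flag);
        if (flag) printf("striping_unit   = %s\n", value);

        MPI_Info_free(&info);
    }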

Conclusion
[Figure: the data-analysis pipeline from the motivation slide: Original Data -> Target Data -> Preprocessed Data -> Patterns -> Knowledge.]
We enable active storage through:
1. Enhanced runtime interfaces (APIs)
2. Runtime stripe alignment
3. Server-to-server aggregation
Enabling active storage within the parallel I/O software stack removes not only inter-node data transfer but also inter-process data communication, resulting in a huge performance improvement for data-intensive analysis applications.

Acknowledgments
Department of Energy for funding this work
Phil Carns, Sam Lang, Rob Ross, Rajeev Thakur (ANL)
Alok Choudhary, Prabhat Kumar, Wei-Keng Liao, Berkin Ozisikyilmaz (NWU)

Thanks!

Future work
Function shipping: a more flexible hint mechanism
Hadoop-style execution: write output results to the local storage
Scalability analysis: NCSA Lincoln cluster (192 compute nodes and 96 NVIDIA Tesla S1070 accelerator units)
More benchmarks/applications: visualization and bioinformatics

Give hints to file servers for more information

General MPI hint mechanism:
    MPI_Info info;
    MPI_Init(...);
    MPI_Comm_rank(...);
    MPI_Info_create(&info);
    MPI_Info_set(info, key, val);
    MPI_File_open(..., info, ...);
    MPI_Info_free(&info);
    MPI_Finalize();

Data types and operators are sufficient for simple operations, e.g., sum, but some kernels need more information to perform the correct computation:
Grep: string length per line (128), search pattern ("aaaaa")
K-means: number of dimensions (10), number of clusters (20), threshold value (0.001), etc.
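Put together, passing kernel-specific arguments through this mechanism might look like the sketch below. The hint keys ("as_kernel", "as_kmeans_dims", and so on) and the file name are hypothetical illustrations, not a fixed interface from the paper.

    /* Passing kernel-specific arguments as MPI hints (hypothetical keys). */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "as_kernel", "kmeans");
        MPI_Info_set(info, "as_kmeans_dims", "10");
        MPI_Info_set(info, "as_kmeans_clusters", "20");
        MPI_Info_set(info, "as_kmeans_threshold", "0.001");

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "points.dat",   /* hypothetical file */
                      MPI_MODE_RDONLY, info, &fh);

        /* an active-storage read would consume these hints on the server */

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }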

Our approach is scalable with respect to the data set size (SUM benchmark, fixed number of nodes: 4).

Data mining kernels can be compute intensive. K-means clustering algorithm: 1. Randomly choose initial centers. 2. Assign each point to the nearest center. 3. Update centers (mean of members). 4. Repeat until convergence.
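The compute intensity comes from the assignment step, which compares every point against every center in each iteration. A minimal sequential sketch of one iteration is shown below; K, DIM, and the absence of a convergence loop are illustrative simplifications.

    /* One k-means iteration: assignment + center update (sequential sketch). */
    #include <float.h>
    #include <stddef.h>

    #define K   20
    #define DIM 10

    void kmeans_step(const double *points, size_t n, double centers[K][DIM])
    {
        double sum[K][DIM] = {{0.0}};
        size_t cnt[K] = {0};

        /* assignment: each point goes to its nearest center, O(n * K * DIM) */
        for (size_t i = 0; i < n; i++) {
            const double *p = points + i * DIM;
            int best = 0;
            double best_d = DBL_MAX;
            for (int k = 0; k < K; k++) {
                double d = 0.0;
                for (int j = 0; j < DIM; j++) {
                    double diff = p[j] - centers[k][j];
                    d += diff * diff;
                }
                if (d < best_d) { best_d = d; best = k; }
            }
            for (int j = 0; j < DIM; j++) sum[best][j] += p[j];
            cnt[best]++;
        }

        /* update: each new center is the mean of its members */
        for (int k = 0; k < K; k++)
            if (cnt[k] > 0)
                for (int j = 0; j < DIM; j++)
                    centers[k][j] = sum[k][j] / cnt[k];
    }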

Our approach is scalable with respect to both the number of nodes used for execution and the data set size. One experiment fixes the data set size at 1M data points (delta = 0.001); the other fixes the number of nodes at 4 (delta = 0.001). AS+GPU: active storage with GPU.