A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing


Z. Sebepou, K. Magoutis, M. Marazakis, A. Bilas
Institute of Computer Science (ICS), Foundation for Research and Technology - Hellas (FORTH), Heraklion, Crete, Greece
First USENIX LaSCo Workshop, 2008

Distributed file systems
- Evolution: NFS versions 2, 3, 4; AFS; etc.
- Shared-disk parallel file systems: Frangipani/Petal; GPFS; GFS; etc.
- Separating data/metadata paths to object storage: NASD; pNFS; Panasas; Lustre; PVFS; etc.

NASD-style Parallel File Systems
[diagram: clients issue Read/Write requests directly to the object storage servers, while Open, Lookup, and other metadata operations go to the metadata server]

Lustre, PVFS
- Open-source systems following the NASD paradigm
  - Lustre: Cluster File Systems, Inc.; acquired (2007) by Sun
  - PVFS: Clemson University, ANL, OSC
- Targeted for large-scale data processing (LLNL, ORNL, ANL, CERN)
- Representative of different approaches to file system design:
  - Client caching
  - Statelessness
  - Consistency and file access semantics
  - Portability

Lustre Architecture
- Kernel-based client structure
- Data/metadata caching
- Striping
- TCP/IP, RDMA, etc.
- Intent-based locking; POSIX semantics
- Pre-allocation of metadata objects
- Metadata servers: active/backup, on modified Linux ext3
- Object servers: modified Linux ext3

PVFS2 Architecture
- User-level client structure
- No data caching
- Striping
- TCP/IP, RDMA, etc.
- No file locking; semantics defined for non-conflicting writes
- Metadata servers: multiple active
- Object servers: any Linux file system, via DBPF ("database plus files")

Benchmarks
- Streaming, one client, many servers: IOZone
- Streaming, many clients, many servers: Parallel I/O (MPI)
- Metadata-intensive: Cluster PostMark
- Near-random I/O, optionally with data overlap: Tile I/O (MPI)
- User-perceived response time: ls -lR on a Linux kernel tree

Experimental Testbed
- PVFS 2.6.3, Lustre 1.6.0.1, MPICH2
- One metadata server; multiple object servers
- Server-local file system: Linux ext3 or modified ext3
- PVFS2 stripe size: 64KB (default)
- Lustre stripe size: 64KB, 256KB, 1MB (default)

Streaming: One Client, Many Servers (IOZone Benchmark)
- 1 and 4 threads
- Block sizes: 1KB, 64KB, 1MB, 4MB
- Testbed configuration as on the previous slide
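To make the access pattern concrete, here is a minimal C sketch of the kind of single-client streaming workload IOZone generates in this test: each thread sequentially writes and then reads back its own file using a fixed block size. It is not IOZone itself; the mount point, per-thread file size, and the particular thread/block-size values shown are illustrative assumptions.

/*
 * Minimal sketch (not IOZone) of the streaming pattern measured here:
 * each thread writes a private file sequentially with a fixed block
 * size, then reads it back. Mount point and sizes are illustrative.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS   4                    /* 1 or 4 in the experiments      */
#define BLOCK_SIZE (64 * 1024)          /* 1KB, 64KB, 1MB, 4MB were used  */
#define FILE_SIZE  (256L * 1024 * 1024) /* per-thread file size (example) */

static void *stream_worker(void *arg)
{
    long id = (long)arg;
    char path[64], *buf;
    long written;
    int fd;

    snprintf(path, sizeof(path), "/mnt/pfs/stream.%ld", id); /* hypothetical mount */
    buf = malloc(BLOCK_SIZE);
    memset(buf, 'x', BLOCK_SIZE);

    /* sequential write phase */
    fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    for (written = 0; written < FILE_SIZE; written += BLOCK_SIZE)
        if (write(fd, buf, BLOCK_SIZE) != BLOCK_SIZE) { perror("write"); break; }
    close(fd);

    /* sequential read phase */
    fd = open(path, O_RDONLY);
    while (read(fd, buf, BLOCK_SIZE) > 0)
        ;
    close(fd);

    free(buf);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    long i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, stream_worker, (void *)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}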

Streaming: One Client, Many Servers (IOZone Benchmark) - Lustre; 1 client, 12 servers
[chart: throughput results; client CPU utilization reaches ~95% with 4 threads and ~70% with 1 thread]

Streaming: One Client, Many Servers (IOZone Benchmark) - PVFS; 1 client, 12 servers
[chart: throughput results; client CPU utilization reaches ~50% with 4 threads and ~20% with 1 thread]

Streaming: Many Clients, Many Servers (Parallel I/O Benchmark)
- Block sizes: 64KB-16MB
- Access patterns: (1) writes to separate files, (2) reads from a single file, (3) read-modify-write to a single file
- Testbed configuration as before
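A minimal MPI-IO sketch of the first two access patterns follows; it is not the actual benchmark code. Each rank first writes its own file, then all ranks read disjoint contiguous regions of one shared file. The mount point, file names, block size, and block count are illustrative assumptions.

/*
 * Sketch of two of the access patterns in the MPI parallel I/O benchmark:
 * (1) each rank writes its own file; (2) all ranks read disjoint,
 * contiguous regions of one shared file. Paths and sizes are illustrative.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK   (1024 * 1024)   /* 64KB-16MB were used in the experiments */
#define NBLOCKS 64

int main(int argc, char **argv)
{
    int rank, i;
    char *buf = malloc(BLOCK);
    char path[64];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Pattern 1: writes to separate files, one per rank */
    snprintf(path, sizeof(path), "/mnt/pfs/out.%d", rank); /* hypothetical mount */
    MPI_File_open(MPI_COMM_SELF, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    for (i = 0; i < NBLOCKS; i++)
        MPI_File_write(fh, buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    /* Pattern 2: reads from a single shared file, disjoint region per rank */
    MPI_File_open(MPI_COMM_WORLD, "/mnt/pfs/shared.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)rank * NBLOCKS * BLOCK;
    for (i = 0; i < NBLOCKS; i++)
        MPI_File_read_at(fh, off + (MPI_Offset)i * BLOCK, buf, BLOCK,
                         MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    free(buf);
    return 0;
}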

Streaming: Many Clients, Many Servers (Parallel I/O Benchmark) - Writes to Separate Files; 12 clients, 12 servers
[charts: Lustre and PVFS2 results]

Streaming: Many Clients, Many Servers (Parallel I/O Benchmark) - Reads from a Single File; 12 clients, 12 servers
[charts: Lustre and PVFS2 results]

Cluster PostMark
Three configurations:
(a) Few (100) large (10-100MB) files
(b) Moderate number (800) of medium-size (1-10MB) files
(c) Many (8000) small (4-128KB) files
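For reference, here is a rough C sketch of what a PostMark-style metadata-intensive workload does on each client: create a pool of files, run a mix of read/append/delete-and-recreate transactions against randomly chosen files, then delete the pool. This is not the Cluster PostMark code; the directory, pool size, transaction count, and operation mix are illustrative assumptions loosely modeled on configuration (c).

/*
 * Rough sketch of a PostMark-style metadata-intensive workload:
 * create a pool of small files, run random transactions, delete the pool.
 * Paths, counts, and the operation mix are illustrative.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NFILES 8000
#define NTRANS 20000
#define FSIZE  4096          /* 4KB-128KB in configuration (c) */

static char buf[FSIZE];

static void file_path(char *out, size_t len, int i)
{
    snprintf(out, len, "/mnt/pfs/postmark/f%05d", i); /* hypothetical dir */
}

static void create_file(int i)
{
    char p[64];
    int fd;
    file_path(p, sizeof(p), i);
    fd = open(p, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd >= 0) { write(fd, buf, FSIZE); close(fd); }
}

int main(void)
{
    char p[64];
    int i, fd;

    memset(buf, 'x', FSIZE);
    for (i = 0; i < NFILES; i++)          /* creation phase */
        create_file(i);

    for (i = 0; i < NTRANS; i++) {        /* transaction phase */
        int target = rand() % NFILES;
        file_path(p, sizeof(p), target);
        switch (rand() % 3) {
        case 0:                           /* read */
            fd = open(p, O_RDONLY);
            if (fd >= 0) { read(fd, buf, FSIZE); close(fd); }
            break;
        case 1:                           /* append */
            fd = open(p, O_WRONLY | O_APPEND);
            if (fd >= 0) { write(fd, buf, 512); close(fd); }
            break;
        default:                          /* delete and recreate */
            unlink(p);
            create_file(target);
        }
    }

    for (i = 0; i < NFILES; i++) {        /* deletion phase */
        file_path(p, sizeof(p), i);
        unlink(p);
    }
    return 0;
}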

Cluster PostMark - Bandwidth; 12 clients, 12 servers
[chart: bandwidth vs. block size]

Cluster PostMark - Transactions; 12 clients, 12 servers
[chart: transaction rate vs. block size]

Near-random I/O, optionally with data overlap (Tile I/O)
- A two-dimensional logical structure is overlaid on a single file
- Each tile is assigned to a separate client process
[diagram: tiles, elements, and blocks within the shared file]
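A minimal MPI-IO sketch of this tile access pattern follows; it is not the actual Tile I/O benchmark. Each rank defines a subarray file view over the shared two-dimensional file and reads the tile assigned to it. The grid dimensions, tile sizes, element type, and file path are illustrative assumptions (the sketch assumes one rank per tile).

/*
 * Sketch of the tile access pattern: a 2D array stored row-major in one
 * shared file; each rank reads its tile through an MPI subarray file view.
 * Dimensions, element type, and path are illustrative.
 */
#include <mpi.h>
#include <stdlib.h>

#define TILES_X 4          /* tiles per row (= ranks per row) */
#define TILES_Y 3          /* tiles per column                */
#define TILE_W  1024       /* elements per tile row           */
#define TILE_H  1024       /* element rows per tile           */

int main(int argc, char **argv)
{
    int rank, gsizes[2], lsizes[2], starts[2];
    MPI_Datatype tile_type;
    MPI_File fh;
    double *tile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    gsizes[0] = TILES_Y * TILE_H;          /* global rows    */
    gsizes[1] = TILES_X * TILE_W;          /* global columns */
    lsizes[0] = TILE_H;
    lsizes[1] = TILE_W;
    starts[0] = (rank / TILES_X) * TILE_H; /* this rank's tile origin */
    starts[1] = (rank % TILES_X) * TILE_W;

    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &tile_type);
    MPI_Type_commit(&tile_type);

    tile = malloc((size_t)TILE_H * TILE_W * sizeof(double));

    MPI_File_open(MPI_COMM_WORLD, "/mnt/pfs/tiles.dat",  /* hypothetical file */
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, tile_type, "native", MPI_INFO_NULL);
    MPI_File_read_all(fh, tile, TILE_H * TILE_W, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Type_free(&tile_type);
    free(tile);
    MPI_Finalize();
    return 0;
}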

Tile I/O - Reads
[chart: read performance results]

Tile I/O - Writes
[chart: write performance results]

User-Perceived Response Time
ls -lR on Linux 2.6.12 kernel tree (~25,000 files)

File System          Response Time (sec)
Linux ext3 (local)   5.5
Lustre               58
PVFS2                80
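This measurement is dominated by metadata operations (directory reads and per-file attribute lookups). A minimal C sketch of an equivalent workload follows: recursively walk a directory tree, stat()ing every entry, as ls -lR does. The kernel-tree path is an illustrative assumption.

/*
 * Sketch of the metadata workload behind the ls -lR measurement:
 * recursively walk a directory tree; nftw() stat()s every entry,
 * exercising the file system's lookup/getattr path.
 */
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>

static long nentries;

static int count_entry(const char *path, const struct stat *sb,
                       int typeflag, struct FTW *ftwbuf)
{
    (void)path; (void)sb; (void)typeflag; (void)ftwbuf;
    nentries++;                /* nftw() already stat()ed this entry */
    return 0;
}

int main(void)
{
    /* hypothetical location of the kernel source tree on the parallel FS */
    if (nftw("/mnt/pfs/linux-2.6.12", count_entry, 64, FTW_PHYS) == -1) {
        perror("nftw");
        return 1;
    }
    printf("visited %ld entries\n", nentries);
    return 0;
}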

Conclusions
- Scalable I/O bandwidth is achievable through parallel I/O paths to file servers.
- Lustre's efficient metadata management is critical for metadata-intensive applications.
- Lustre's consistency semantics are useful to some applications, but impose unnecessary overhead on others that do not require them.

Thank You!