The Fusion Distributed File System


Slide 1 / 44 The Fusion Distributed File System Dongfang Zhao February 2015

Slide 2 / 44 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique Features: virtual chunks, cooperative caching, distributed provenance Results Future Work

Slide 3 / 44 Outline Introduction FusionFS Results Future Work

Slide 4 / 44 Background State-of-the-art high-performance computing (HPC) system architecture Decades old Has compute and storage resources separated Was originally designed for compute-intensive workloads

Slide 5 / 44 I/O Bottleneck Would this architecture suffice in the Big Data Era at extreme scales? Concerns: Shared network infrastructure All I/Os are (eventually) remote More frequent checkpointing

Slide 6 / 44 Challenges at Exascales Let's simulate, for example, checkpointing at extreme scales More detail available in: ACM/SCS HPC 13 [6]

Slide 7 / 44 The Conventional Wisdom Recent studies to address the I/O bottleneck (without changing the existing HPC architecture): Ning Liu et al., On the Role of Burst Buffers in Leadership-Class Storage Systems, IEEE MSST 12; Philip Carns et al., Small-file access in parallel file systems, IEEE IPDPS 09

Slide 8 / 44 Proposed Work We address the issue by proposing a new architecture: introducing node-local storage ACM LSAP 11

Slide 9 / 44 Outline Introduction FusionFS Results Future Work

Slide 10 / 44 Overview Goal: node-local distributed storage for HPC systems Design principles: maximal metadata concurrency (distributed hash table); optimal write throughput (write to the local disk if possible)
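The metadata design principle above can be illustrated with a small C++ sketch (the FNV-1a hash and the static membership list are placeholder assumptions, not necessarily what ZHT actually uses): because every client knows the full membership list, mapping a path to its metadata server is a single hash plus modulo, i.e., a zero-hop lookup with no routing.

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// FNV-1a hash (a placeholder choice; ZHT's real hash function may differ).
static uint64_t fnv1a(const std::string& s) {
    uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) { h ^= c; h *= 1099511628211ULL; }
    return h;
}

// Zero-hop lookup: every client holds the same membership list, so the
// metadata server for a path is one hash + modulo, with no routing hops.
static std::string metadata_server(const std::string& path,
                                   const std::vector<std::string>& nodes) {
    return nodes[fnv1a(path) % nodes.size()];
}

int main() {
    std::vector<std::string> nodes = {"node00", "node01", "node02", "node03"};
    std::cout << "/fusionfs/foo/bar.dat -> "
              << metadata_server("/fusionfs/foo/bar.dat", nodes) << "\n";
    return 0;
}

Different clients agree on the placement without contacting a central server, which is what lets metadata concurrency grow with the number of nodes.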

Slide 11 / 44 Architecture On the IBM Blue Gene/P supercomputer: architectural change

Slide 12 / 44 Directory Tree Physical-logical mapping

Slide 13 / 44 Metadata Management Directories vs. regular files

Slide 14 / 44 File Manipulation File open

Slide 15 / 44 File Manipulation File close

Slide 16 / 44 FusionFS Implementation Implemented in C/C++. Building blocks: FUSE (user-level POSIX interface), ZHT (distributed hash table for metadata), Protocol Buffers (data serialization), UDT (data transfer protocol). Open source: https://bitbucket.org/dongfang/fusionfszht/src
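For context on the FUSE building block, here is a minimal, self-contained sketch (not FusionFS's actual code; the /hello file and its contents are invented) of how a user-level file system registers POSIX callbacks with libfuse 2.x; in FusionFS, the getattr path would consult ZHT and read/write would go to the node-local disk.

#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <sys/stat.h>
#include <cerrno>
#include <cstring>

static const char *kPath = "/hello";
static const char *kData = "stored on the local node\n";

// Metadata lookup; a FusionFS-style implementation would query ZHT here.
static int ffs_getattr(const char *path, struct stat *st) {
    std::memset(st, 0, sizeof(*st));
    if (std::strcmp(path, "/") == 0) { st->st_mode = S_IFDIR | 0755; st->st_nlink = 2; return 0; }
    if (std::strcmp(path, kPath) == 0) {
        st->st_mode = S_IFREG | 0444; st->st_nlink = 1;
        st->st_size = (off_t)std::strlen(kData);
        return 0;
    }
    return -ENOENT;
}

static int ffs_open(const char *path, struct fuse_file_info *) {
    return std::strcmp(path, kPath) == 0 ? 0 : -ENOENT;
}

// Data read; FusionFS would serve this from the local disk when possible.
static int ffs_read(const char *path, char *buf, size_t size, off_t off,
                    struct fuse_file_info *) {
    if (std::strcmp(path, kPath) != 0) return -ENOENT;
    size_t len = std::strlen(kData);
    if ((size_t)off >= len) return 0;
    if (off + size > len) size = len - (size_t)off;
    std::memcpy(buf, kData + off, size);
    return (int)size;
}

int main(int argc, char *argv[]) {
    struct fuse_operations ops;
    std::memset(&ops, 0, sizeof(ops));
    ops.getattr = ffs_getattr;
    ops.open    = ffs_open;
    ops.read    = ffs_read;
    return fuse_main(argc, argv, &ops, nullptr);
}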

Slide 17 / 44 Virtual Chunks One of the unique features of FusionFS Goal: support efficient random access to large compressed scientific data

Slide 18 / 44 Background In the Big Data Era, the performance of many data-intensive scientific applications is bounded by I/O Two major solutions: Parallelize the file I/O by concurrently reading or writing independent file chunks Compress the (big) file before writing it to disk, and decompress it before reading it Both can be done efficiently at the file system layer, transparently to end users

Slide 19 / 44 Conventional Wisdom File systems leverage data (de)compression in a naïve manner: apply the compressor when writing files, apply the decompressor when reading files, and leave the important metrics (compression ratio, computational overhead) to the underlying compression algorithms

Slide 20 / 44 Limitations File-level compression: computation overhead on reads. E.g., to read the latest temperature: Step 1: compress the entire 1GB of climate data into a 500MB file; Step 2: decompress the 500MB file only to retrieve the last few bytes. Chunk-level compression: space overhead on writes. E.g., write 64MB of data with a 4:1 compression ratio and 4KB of metadata per chunk: with 64KB chunks, 16MB + 4KB * 1K = 20MB; with no chunking, 16MB + 4KB ≈ 16MB

Slide 21 / 44 Key Idea Virtually split the file into smaller chunks but keep its physical entirety Small chunks: fine granularity for random accesses, which fixes the computation overhead of file-level compression Physical entirety: high compression ratio, which fixes the space overhead of (physical) chunk-level compression

Slide 22 / 44 Motivating Example A simple XOR-based delta compression

Slide 23 / 44 Reference Placement Need to consider different strategies In place: Pro: saves one jump; Con: overhead for sequential reads Coalition: beginning or end

Slide 24 / 44 Compression Procedure Turns out to be very straightforward Runs efficiently: time complexity is O(n)

Slide 25 / 44 But How Many Virtual Chunks? Extreme cases: 1 virtual chunk is the same as file-level compression; N virtual chunks (suppose there are N physical chunks in the file) is the same as (physical) chunk-level compression Can we find a number with a good tradeoff? Let's assume the requested data is located at the end of the virtual chunk

Slide 26 / 44 Number of Virtual Chunks Variables; optimum for end-to-end I/O; short version (under some assumptions; refer to the paper for details)

Slide 27 / 44 Decompression Procedure Algorithms omitted (refer to the paper for details) Key steps: Find the latest reference (r_last) before the starting position of the requested data Decompress from r_last until the end of the requested data For file writes, we also need to update values after the decompression and re-compress the affected portion
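A compact sketch of the compression and decompression procedures just outlined, assuming 32-bit values and the simple XOR delta from the motivating example; reference placement options, file writes (update and re-compress the affected portion), and FusionFS's actual code are more involved than this.

#include <cstddef>
#include <cstdint>
#include <vector>

// Compressed form of one file: an XOR delta per value, plus one
// uncompressed reference at every virtual-chunk boundary (every k values).
struct VCompressed {
    std::vector<uint32_t> refs;    // one reference per virtual chunk
    std::vector<uint32_t> deltas;  // each value XORed with its predecessor
    std::size_t k = 1;             // virtual-chunk length
};

// Compression: a single O(n) pass. The file stays physically whole
// (keeping the compression ratio); only the references cost extra space.
VCompressed vc_compress(const std::vector<uint32_t>& data, std::size_t k) {
    VCompressed out;
    out.k = k;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i % k == 0) out.refs.push_back(data[i]);          // chunk boundary
        out.deltas.push_back(i == 0 ? data[i] : data[i] ^ data[i - 1]);
    }
    return out;
}

// Random read of [pos, pos + len): find the latest reference r_last at or
// before pos, then XOR forward only until the requested range is rebuilt,
// instead of decompressing the whole file.
std::vector<uint32_t> vc_read(const VCompressed& c, std::size_t pos, std::size_t len) {
    std::vector<uint32_t> out;
    if (pos >= c.deltas.size() || len == 0) return out;
    std::size_t start = (pos / c.k) * c.k;   // boundary that owns r_last
    uint32_t cur = c.refs[pos / c.k];        // r_last itself
    for (std::size_t i = start; i < pos + len && i < c.deltas.size(); ++i) {
        if (i != start) cur ^= c.deltas[i];  // undo one XOR delta
        if (i >= pos) out.push_back(cur);
    }
    return out;
}

With k = 1 every value is a reference, which degenerates to chunk-level compression; with k = n there is a single reference, which degenerates to file-level compression; this is exactly the tradeoff discussed on the previous slides.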

Slide 28 / 44 Outline Introduction FusionFS Results Future Work

Slide 29 / 44 Test Bed Intrepid (Argonne National Laboratory): 40K compute nodes, each with a quad-core 850 MHz PowerPC CPU and 2 GB RAM; 128 storage nodes; 7.6 PB GPFS parallel file system Others not covered by this presentation: Kodiak, a 1,024-node cluster at Los Alamos National Laboratory; HEC, a 64-node Linux cluster at the SCS lab; Amazon EC2

Slide 30 / 44 Metadata Rate FusionFS vs. GPFS

Slide 31 / 44 Metadata Rate FusionFS vs. PVFS

Slide 32 / 44 I/O Throughput Write, FusionFS vs. GPFS

Slide 33 / 44 I/O Throughput Read, FusionFS vs. GPFS

Slide 34 / 44 I/O Throughput at Extreme Scale Peak: 2.5 TB/s on 16K nodes; GPFS theoretical upper-bound throughput: 88 GB/s

Slide 35 / 44 The BLAST Application BLAST: Basic Local Alignment Search Tool, a bioinformatics application that searches one or more protein sequences against a sequence database and calculates similarities Main idea: divide and conquer Split the database and store the chunks on different nodes Each node searches its assigned chunks for matches All per-node results are merged into the final result
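A rough sketch of the divide-and-conquer flow described above (the Hit type, the even partitioning, and the top-k merge are illustrative assumptions, not BLAST's own code):

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// A hypothetical search hit: matched sequence id and a similarity score.
struct Hit { std::string id; double score; };

// Divide: node `rank` out of `nranks` gets an even slice of the database,
// which FusionFS would keep on that node's local disk.
std::vector<std::string> my_chunk(const std::vector<std::string>& db,
                                  std::size_t rank, std::size_t nranks) {
    std::size_t per = (db.size() + nranks - 1) / nranks;
    std::size_t lo = std::min(db.size(), per * rank);
    std::size_t hi = std::min(db.size(), lo + per);
    return std::vector<std::string>(db.begin() + lo, db.begin() + hi);
}

// Conquer/merge: concatenate per-node hit lists and keep the best scores.
std::vector<Hit> merge_results(std::vector<std::vector<Hit>> per_node,
                               std::size_t topk) {
    std::vector<Hit> all;
    for (auto& v : per_node) all.insert(all.end(), v.begin(), v.end());
    std::sort(all.begin(), all.end(),
              [](const Hit& a, const Hit& b) { return a.score > b.score; });
    if (all.size() > topk) all.resize(topk);
    return all;
}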

Slide 36 / 44 Execution Time FusionFS 30X faster on 1,024 nodes

Slide 37 / 44 System Efficiency FusionFS's sustainable efficiency:

Slide 38 / 44 Data Throughput FusionFS is orders of magnitude higher

Slide 39 / 44 Virtual Chunks in FusionFS Setup: 64-node Linux cluster; each node has a FUSE mount point for the POSIX and virtual chunk module Data sets: GCRM (Global Cloud Resolving Model), 3.2 million data entries, each entry with 80 single-precision floats; SDSS (Sloan Digital Sky Survey)

Slide 40 / 44 Result Number of virtual chunks: sqrt(n) Compression ratio GCRM 1.49 vs. 2.28 SDSS

Slide 41 / 44 Outline Introduction FusionFS Results Future Work

Slide 42 / 44 Simulation of FusionFS at Exascales Approach: leverage the CODES framework to develop a simulator (FusionSim) to simulate FusionFS at exascale Validate FusionSim with FusionFS traces Predict FusionFS performance at exascale with the Darshan logs

Slide 43 / 44 Integrate FusionFS with Swift Approach: expose/develop the FusionFS API for Swift Test FusionFS with Swift applications Ultimately, an ecosystem from data management down to the underlying file system, as an analogy to the Hadoop stack: MapReduce <-> Swift, HDFS <-> FusionFS

Slide 44 / 44 Questions