FCP: A Fast and Scalable Data Copy Tool for High Performance Parallel File Systems

FCP: A Fast and Scalable Data Copy Tool for High Performance Parallel File Systems
Feiyi Wang (Ph.D.), Veronica Vergara Larrea, Dustin Leverman, Sarp Oral
ORNL is managed by UT-Battelle for the US Department of Energy

The Challenge
Conceptually simple, everyone knows how to copy data. The hard part is balancing usability, scalability, and performance across:
- Workload parallelization
- Asynchronous progress reporting
- Chunking
- Checksumming
- Checkpoint & restart
- Out-of-core operation

What Tools Do We Have?
- cp
- bbcp
- grid-ftp
- dcp
- customized rsync (manual or automatic, e.g. with the GNU parallel utility)

Large Scale Profiling
[Histogram of file count and file size across buckets from 4 KiB to beyond 4 TiB: small files (<512 KiB) account for 89.97% of the file count, while large files (>256 MiB) account for 91.19% of the total file size.]
- ~750 million files
- ~88 million directories
- ~3.4 million files in a single directory (max)
- tens of thousands of files in the TB range, 32 TB (max)

Workload Parallelization: Master-Slave
[Diagram: a master process owning /path/to/traversal/root services query/response traffic from slaves 1 through N.]
Problems: (1) centralized, (2) unbalanced

Workload Parallelization: Work Stealing
Key ideas:
- Each worker maintains its own work queue.
- Once its local work queue is drained, a worker picks a random peer and asks it for more work items.
Key attributes:
- Initial load placement doesn't matter: the scheme is self-correcting.
- Slow workers don't matter: the scheme is self-pacing.
Question: without a master process, how do we know when to terminate?
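Before getting to termination, the stealing loop itself is easy to picture. The sketch below uses plain Python threads instead of MPI ranks (FCP itself is MPI-based), a fixed sleep as a stand-in for the copy work, and a crude idle counter in place of real termination detection, which the next slide addresses; all names and parameters here are illustrative assumptions.

import random
import threading
import time
from collections import deque

class Worker(threading.Thread):
    def __init__(self, wid):
        super().__init__()
        self.wid = wid
        self.peers = []              # set once all workers exist
        self.queue = deque()         # local work queue
        self.lock = threading.Lock()

    def push(self, item):
        with self.lock:
            self.queue.append(item)

    def give_half(self):
        # A victim hands half of its local queue to the thief.
        with self.lock:
            n = len(self.queue) // 2
            return [self.queue.popleft() for _ in range(n)]

    def run(self):
        idle = 0
        while idle < 20:             # crude stand-in for real termination detection
            with self.lock:
                item = self.queue.popleft() if self.queue else None
            if item is not None:
                time.sleep(0.001)    # stand-in for the actual copy/chunk/checksum work
                idle = 0
            else:
                # Local queue drained: pick a random victim and try to steal.
                victim = random.choice([p for p in self.peers if p is not self])
                for it in victim.give_half():
                    self.push(it)
                idle += 1
                time.sleep(0.001)

workers = [Worker(i) for i in range(4)]
for w in workers:
    w.peers = workers
for item in range(1000):             # initial placement doesn't matter: seed worker 0 only
    workers[0].push(item)
for w in workers:
    w.start()
for w in workers:
    w.join()
print("all workers idle")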

Distributed Termination Detection
Edsger W. Dijkstra: "Derivation of a termination detection algorithm for distributed computations," June 10, 1983.
1. The system consists of N machines, n_0, n_1, ..., n_(N-1), logically ordered and arranged as a ring. Each machine is either white or black; all machines start white.
2. A token is passed around the ring: machine n_i forwards it to machine n_(i+1). The token is either white or black. Initially, machine n_0 is white and holds the white token.
3. A machine forwards the token only when it is passive (has no work).
4. Any time a machine sends work to a machine with a lower rank, it colors itself black.
5. Both initially at n_0 and upon receiving the token:
   - a white machine forwards the token unchanged;
   - a black machine colors the token black, colors itself white, and forwards the token.
Termination condition: a white n_0 receives a white token.
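The rules above can be exercised with a small sequential simulation. Everything about the work model below (initial items, spawn probability) is an illustrative assumption; only the coloring and token-handling rules come from the slide.

import random

N = 5
color = ["white"] * N                            # rule 1: all machines start white
work = [5 if i == 0 else 0 for i in range(N)]    # assumed: only n_0 starts with work

token_at = 0                                     # rule 2: n_0 holds the white token
token_color = "white"
probe_started = False

def do_work(i):
    # Process one item; with some probability hand new work to a random machine.
    if work[i] > 0:
        work[i] -= 1
        if random.random() < 0.3:
            j = random.randrange(N)
            work[j] += 1
            if j < i:                            # rule 4: sent work to a lower rank
                color[i] = "black"

while True:
    for i in range(N):
        do_work(i)

    holder = token_at
    if work[holder] > 0:                         # rule 3: only a passive machine forwards
        continue

    if holder == 0:
        if probe_started and token_color == "white" and color[0] == "white":
            break                                # termination: white n_0 got a white token
        token_color = "white"                    # otherwise start a fresh probe
        probe_started = True
        token_at = 1 % N
    else:
        if color[holder] == "black":             # rule 5: a black machine blackens the token,
            token_color = "black"                # whitens itself, then forwards
            color[holder] = "white"
        token_at = (holder + 1) % N

print("termination detected: all machines are passive")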

Understanding the Algorithm
- A stable state is reached when all machines are passive.
- Edge case: if the system consists of a single machine, it sends the white token to itself, so it meets the termination condition as soon as it reaches the stable state.
- Even if a machine becomes passive at time t and forwards the token, it can become active again upon receiving work from others.
- When a black token returns to machine n_0, or a white token returns to a black n_0, the termination condition is not met and the token keeps circulating.
[Diagram: processes P0 through P4 on a ring, shown in four stages: the initial state; a process sends work to a lower-ranked process j < i and turns black; a black process colors the token black; P2 forwards the token and colors itself white.]

Workflow and Division of Labor: A Compromise
[Diagram: a parallel walk of the directory tree (A; B through F; B1 through B5; D1 through D4) feeds each worker's local work queue; queued files are then chunked and copied.]
The problem is that all three stages intermix: parallel walk, chunking, and copy.

Async Progress Report
Users want to know the progress, in particular when doing a large data transfer. Yet this can be difficult in a fully distributed environment.
Solution: a hierarchical, asynchronous reduce() callback.
[Diagram: ranks 0 through 9 arranged as a tree; reduce requests flow down from rank 0 and responses are aggregated back up.]
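The slide gives only the shape of the mechanism, so the sketch below is a hedged guess at the reduction tree: each rank folds its local item count into its parent until rank 0 holds the global total. The k-ary layout, the helper names, and the sequential emulation are assumptions for illustration, not pcircle's API.

def parent(rank, k=2):
    # Rank 0 is the root of the k-ary reduction tree.
    return None if rank == 0 else (rank - 1) // k

def children(rank, nprocs, k=2):
    return [c for c in range(k * rank + 1, k * rank + k + 1) if c < nprocs]

def tree_reduce(local_counts, k=2):
    # Emulate the upward pass sequentially: deepest ranks fold in first.
    totals = list(local_counts)
    for rank in reversed(range(1, len(totals))):
        totals[parent(rank, k)] += totals[rank]
    return totals[0]                         # rank 0 ends up with the global progress

print(tree_reduce([10, 7, 3, 5, 2, 8, 1, 4]))   # -> 40 items processed so far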

Adaptive Chunking
[Diagram: files 1 through N are split into chunks, each chunk becoming an independent work item.]
- Chunks too large: you may miss the opportunity for parallelization.
- Chunks too small: you may pay too much overhead.
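One way to act on that trade-off is to scale the chunk size with the file and clamp it to a window, as in the minimal sketch below. The bounds and the per-file parallelism target are illustrative assumptions, not FCP's actual defaults.

MIN_CHUNK = 1 << 20            # 1 MiB: below this, per-chunk overhead dominates
MAX_CHUNK = 1 << 30            # 1 GiB: above this, parallelism suffers
TARGET_CHUNKS = 64             # rough number of work items per large file (assumed)

def chunk_size(file_size):
    # Proportional to file size, clamped to [MIN_CHUNK, MAX_CHUNK].
    return max(MIN_CHUNK, min(MAX_CHUNK, file_size // TARGET_CHUNKS))

def chunks(file_size):
    # Yield (offset, length) work items that cover the whole file.
    cs = chunk_size(file_size)
    off = 0
    while off < file_size:
        yield off, min(cs, file_size - off)
        off += cs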

Parallel Checksumming
[Diagram: each file under the directory is split into blocks 1 through N; per-block checksums 1 through N are combined into a final checksum.]
Use cases:
(1) Post-copy verification
(2) On-the-fly dataset signature
(3) Post-copy dataset signature
(4) Comparing two datasets
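A minimal sketch of the two-level scheme, assuming SHA-1 block digests folded into a dataset signature over sorted (path, offset, digest) records so the result does not depend on which worker handled which block; the hash choice, block size, and record layout are assumptions for illustration.

import hashlib

BLOCK = 4 * 1024 * 1024        # assumed 4 MiB checksum blocks

def block_checksums(path):
    # Yield one (path, offset, digest) record per block of the file.
    with open(path, "rb") as f:
        off = 0
        while True:
            data = f.read(BLOCK)
            if not data:
                break
            yield path, off, hashlib.sha1(data).hexdigest()
            off += len(data)

def dataset_signature(records):
    # Fold all per-block records into one signature for the whole dataset.
    h = hashlib.sha1()
    for path, off, digest in sorted(records):
        h.update(f"{path}:{off}:{digest}".encode())
    return h.hexdigest()

Comparing two datasets then reduces to comparing their two signatures.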

Parallel Checksumming: Scaling
23.17 TiB dataset, processing time (8 hosts, scaling from 8 to 128 processes):
- 8m8p: 319.05 minutes
- 8m16p: 160.52 minutes
- 8m32p: 86.95 minutes
- 8m64p: 65.72 minutes
- 8m128p: 63.23 minutes

Parallel Checksumming: Scaling
23.17 TiB dataset, processing rate (8 hosts, scaling from 8 to 128 processes):
- 8m8p: 1.24 GiB/s
- 8m16p: 2.46 GiB/s
- 8m32p: 4.55 GiB/s
- 8m64p: 6.02 GiB/s
- 8m128p: 6.25 GiB/s

Checkpoint and Restart
For a serial application, resuming would be trivial. For FCP it is conceptually simple as well:
- Write a checkpoint file periodically.
- Read it back and resume the work.
The devil is in the details (a minimal crash-safe write is sketched below):
- Failure during the parallel walk
- Failure before the first checkpoint
- Failure while the checkpoint is being written
- Failure while a work-stealing message is in flight
- Should each rank have its own checkpoint?
- What if users launch differently next time (e.g. a different number of MPI processes)?
- After a restart, what bookkeeping is needed to show correct progress?
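The sketch below addresses only the "failure while the checkpoint is being written" case, assuming one checkpoint file per rank: write to a temporary file, fsync, then atomically rename, so a crash mid-write never clobbers the previous checkpoint. The file name and pickled state format are illustrative assumptions, not FCP's own.

import os
import pickle

def write_checkpoint(rank, state, directory="."):
    # state: remaining work items, finished chunks, counters, ... (assumed)
    final = os.path.join(directory, f"fcp.ckpt.{rank}")
    tmp = final + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, final)       # atomic on POSIX file systems

def read_checkpoint(rank, directory="."):
    path = os.path.join(directory, f"fcp.ckpt.{rank}")
    with open(path, "rb") as f:
        return pickle.load(f)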

Extreme Scalability: Out-Of-Core
- Each metadata object = 112 bytes (file system)
- Each chunk object = 64 bytes (work queue)
- 500 million objects = 56 GB
Optimization for extremely unbalanced directories: queue file objects at the head of the local work queue and directory objects (which expand into more items) at the tail.
We also considered: a K-V store, a database, a distributed task queue or cache.
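A sketch of that queue-ordering trick, assuming a simple deque per worker; the class and method names are illustrative, not pcircle's API.

from collections import deque

class LocalWorkQueue:
    def __init__(self):
        self.q = deque()

    def add_file(self, item):
        self.q.appendleft(item)    # files (and chunks) go to the head, processed first

    def add_dir(self, item):
        self.q.append(item)        # directories go to the tail, expanded later

    def next_item(self):
        return self.q.popleft() if self.q else None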

Evaluation: Small Files
[Plot: scaling for small files; copy time in seconds on 1, 2, 4, and 8 nodes.]
Real application dataset: 100,000 small files ("small" as in less than 4 KiB each), default striping.

Evaluation: Large Files
[Plot: scaling for a large file; copy time in seconds on 1, 2, 4, and 8 nodes.]
1 TB file, 32 stripes, np=8. Striping matters.

Conclusions
We presented a parallel data copy tool, part of a suite of tools developed at OLCF. It is built on the following core assumptions and principles:
- MPI is ubiquitous in cluster environments.
- Work-stealing pattern.
- Distributed termination detection, for scalability and self-stabilization.
It has shown promising results at our production site:
- Large-scale profiling
- Large-scale parallel checksumming
- Large-scale file transfer
Source availability: http://www.github.com/olcf/pcircle