FCP: A Fast and Scalable Data Copy Tool for High Performance Parallel File Systems

Feiyi Wang (Ph.D.), Veronica Vergara Larrea, Dustin Leverman, Sarp Oral

ORNL is managed by UT-Battelle for the US Department of Energy
The Challenge
Copying data is conceptually simple; everyone knows how to do it. The challenge is balancing usability, scalability, and performance across:
- Workload parallelization
- Asynchronous progress reporting
- Chunking
- Checksumming
- Checkpoint & restart
- Out-of-core operation
What Tools Do We Have?
- cp
- bbcp
- grid-ftp
- dcp
- customized rsync (manual or automatic, with the GNU parallel utility)
Large Scale Profiling
[Histogram: percent of file count and file size across size buckets from 4 KiB to >4 TiB]
- Small files (<512 KiB) account for 89.97% of the file count
- Large files (>256 MiB) account for 91.19% of the file size
- ~750 million files
- ~88 million directories
- ~3.4 million files in a single directory (max)
- Tens of thousands of files in the TB range, 32 TB (max)
Workload Parallelization: Master-Slave
[Diagram: a master process walking /path/to/traversal/root exchanges query/response messages with slaves 1..N]
Problems: (1) centralized, (2) unbalanced
Workload Parallelization: Work Stealing
Key ideas:
- Each worker maintains its own work queue.
- After the local work queue is drained, a worker picks a random peer and asks it for more work items.
Key attributes:
- Initial load placement doesn't matter: the system self-corrects.
- Slow workers don't matter: the system self-paces.
Question: without a master process, how do we know when to terminate?
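The work-stealing loop can be sketched in a few lines. This is a single-process toy model, not FCP's MPI implementation: `Worker`, `steal_from`, and the "steal half" policy are illustrative assumptions, and a round-robin loop stands in for truly concurrent ranks.

```python
import random
from collections import deque

class Worker:
    """Toy worker with its own deque of work items (illustrative, not FCP's API)."""
    def __init__(self, wid, items=None):
        self.wid = wid
        self.queue = deque(items or [])
        self.processed = []

    def steal_from(self, victim):
        # Take half of the victim's remaining items (rounded down).
        n = len(victim.queue) // 2
        for _ in range(n):
            self.queue.append(victim.queue.pop())
        return n

    def step(self, peers):
        if self.queue:
            self.processed.append(self.queue.popleft())
            return True
        # Local queue empty: ask a random peer that still has work.
        candidates = [p for p in peers if p is not self and p.queue]
        if candidates:
            return self.steal_from(random.choice(candidates)) > 0
        return False

def run(workers):
    # Round-robin scheduling stands in for concurrent workers; stop
    # when nobody can process or steal anything (all queues empty).
    while True:
        progressed = [w.step(workers) for w in workers]
        if not any(progressed):
            break

# All 100 items initially land on worker 0; stealing spreads them out.
workers = [Worker(0, range(100))] + [Worker(i) for i in range(1, 4)]
run(workers)
```

Note that the initial placement is maximally unbalanced, yet every item gets processed: this is the self-correction property the slide describes.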
Distributed Termination Detection
Edsger W. Dijkstra: "Derivation of a termination detection algorithm for distributed computations," June 10, 1983.
1. The system under consideration is composed of N machines, n_0, n_1, ..., n_{N-1}, logically ordered and arranged as a ring. Each machine can be either white or black; all machines are initially white.
2. A token is passed around the ring: machine n_i's next stop is n_{(i+1) mod N}. The token can be either white or black. Initially, machine n_0 is white and holds the white token.
3. A machine only forwards the token when it is passive (has no work).
4. Any time a machine sends work to a machine with a lower rank, it colors itself black.
5. Both initially at n_0 and upon receiving the token:
   1. If a machine is white, it forwards the token unchanged.
   2. If a machine is black, it makes the token black, makes itself white, and forwards the token.
Termination condition: a white n_0 receives a white token.
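The coloring rules can be rendered as a small in-process simulation. This is an illustrative model of the slide's rules only: all machines are assumed passive throughout the probe, `probe` is a hypothetical name, and machine n_0 is assumed to stay white (it sends no work backwards during the probe).

```python
WHITE, BLACK = "white", "black"

def probe(colors):
    """Pass the token around the ring until white machine 0 receives a
    white token back. `colors` is the mutable list of machine colors;
    returns the number of full trips the token made."""
    n = len(colors)
    trips = 0
    while True:
        token = WHITE          # machine 0 launches a fresh white token
        colors[0] = WHITE      # rule 5.2 applies to the initiator too
        for i in range(1, n):  # token visits n_1, n_2, ..., n_{N-1}
            if colors[i] == BLACK:
                token = BLACK      # a black machine taints the token ...
                colors[i] = WHITE  # ... and whitens itself (rule 5.2)
            # a white machine forwards the token unchanged (rule 5.1)
        trips += 1
        if token == WHITE and colors[0] == WHITE:
            return trips       # termination detected

# Machine 3 sent work to a lower rank earlier, so it is black: the
# first trip comes back tainted, and the second trip comes back clean.
colors = [WHITE, WHITE, WHITE, BLACK, WHITE]
print(probe(colors))  # 2
```

The black machine "confesses" by dirtying the token once, then resets to white, so the very next clean trip proves no work crossed the ring backwards; this matches the slide's claim that a black token returning to n_0 merely delays, never prevents, detection.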
Understanding the Algorithm
- The stable state is reached when all machines are passive.
- Edge case: a system composed of a single machine sends a white token to itself, so it meets the termination condition and has also reached the stable state.
- Even if a machine becomes passive at time t and forwards the token, it can become active again upon receiving work from others.
- When a black token returns to machine n_0, or a white token returns to a black n_0, the termination condition is not met and the token keeps circulating.
[Diagram: five machines P0-P4 on a ring: initial state; a machine sends work to a lower rank and turns black; a black machine colors the token black; P2 forwards the token and colors itself white]
Workflow and Division of Labor: A Compromise
[Diagram: a parallel walk of the directory tree feeds a local work queue; files are split into chunks, and chunks are copied]
We have the problem of intermixing all three activities: parallel walk, chunking, and copy.
Async Progress Report
Users want to know the progress, in particular when doing a large data transfer; yet this can be difficult in a fully distributed environment.
Solution: a reduce() callback with hierarchical, asynchronous reduction.
[Diagram: reduce requests fan out from rank 0 down a tree over ranks 0-9; reduce responses flow back up]
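The hierarchical reduction can be sketched as a sum flowing up a binary tree of ranks. This is an in-process model of the idea only; in FCP the request/response arrows in the diagram would be MPI messages, and `tree_reduce` is a hypothetical name.

```python
def tree_reduce(values, root=0):
    """Sum per-rank progress counters up a binary tree rooted at `root`.
    values[i] is rank i's local counter; children of rank r are
    2r+1 and 2r+2, mirroring the fan-out in the slide's diagram."""
    n = len(values)

    def gather(rank):
        total = values[rank]            # this rank's local counter
        for child in (2 * rank + 1, 2 * rank + 2):
            if child < n:               # request fans out to children
                total += gather(child)  # child responds with its subtree sum
        return total

    return gather(root)

# Per-rank "bytes copied" counters; the root reports global progress.
progress = [10, 20, 30, 40, 50, 60, 70]
print(tree_reduce(progress))  # 280
```

With a tree, each rank talks to at most a constant number of peers per reduction, so the progress query costs O(log N) latency instead of the O(N) a central collector would pay.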
Adaptive Chunking
[Diagram: files 1..N, each split into chunks]
- Large chunks: you may miss the opportunity for parallelization.
- Small chunks: you may have too much overhead.
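One plausible policy matching this trade-off is to aim for a fixed number of chunks per file, clamped between a floor (too many tiny chunks means overhead) and a ceiling (too few huge chunks means lost parallelism). The constants, `pick_chunk_size`, and `target_chunks` below are assumptions for illustration, not FCP's actual tuning.

```python
KiB, MiB, GiB = 1024, 1024**2, 1024**3

def pick_chunk_size(file_size, target_chunks=64,
                    min_chunk=1 * MiB, max_chunk=1 * GiB):
    """Aim for ~target_chunks pieces, clamped to [min_chunk, max_chunk]."""
    raw = max(1, file_size // target_chunks)
    return max(min_chunk, min(raw, max_chunk))

def chunks(file_size, chunk_size):
    """Yield (offset, length) pairs covering the whole file."""
    for off in range(0, file_size, chunk_size):
        yield off, min(chunk_size, file_size - off)

# A 4 KiB file collapses to a single 1 MiB-floor chunk, while a 1 TiB
# file yields 1024 chunks of 1 GiB each.
print(pick_chunk_size(4 * KiB) // MiB)                              # 1
print(len(list(chunks(1024 * GiB, pick_chunk_size(1024 * GiB)))))   # 1024
```

The clamping is what makes the policy "adaptive": small files cost one work item, while huge files fan out into enough chunks to keep many workers busy.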
Parallel Checksumming
[Diagram: directory -> files -> blocks; each block 1..N gets a checksum, and the block checksums combine into a final checksum]
Use cases: (1) post-copy verification, (2) on-the-fly dataset signature, (3) post-copy dataset signature, (4) comparing two datasets.
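The two-level scheme in the diagram can be sketched as: checksum each block independently (which is what parallelizes), then fold the per-block digests into one dataset signature. Sorting by (path, offset) makes the final digest independent of the order in which workers finish. The function names, SHA-1 choice, and record format are illustrative assumptions, not FCP's actual on-disk format.

```python
import hashlib

def block_checksums(path, data, block_size=4):
    """Return [(path, offset, sha1-of-block), ...] for one file's bytes.
    Each tuple could be produced by a different worker in parallel."""
    out = []
    for off in range(0, len(data), block_size):
        digest = hashlib.sha1(data[off:off + block_size]).hexdigest()
        out.append((path, off, digest))
    return out

def dataset_signature(all_blocks):
    """Fold per-block digests into one signature; sorting makes the
    result independent of worker completion order."""
    final = hashlib.sha1()
    for path, off, digest in sorted(all_blocks):
        final.update(f"{path}:{off}:{digest}".encode())
    return final.hexdigest()

blocks = block_checksums("a.dat", b"hello world!")
# Shuffling the block list does not change the signature:
assert dataset_signature(blocks) == dataset_signature(blocks[::-1])
print(dataset_signature(blocks))
```

Because the signature depends only on the set of (path, offset, digest) records, two independently produced datasets can be compared (use case 4) by comparing two short strings.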
Parallel Checksumming: Scaling
23.17 TiB dataset, processing time (8 hosts, scaling from 8 to 128 processes):

  Scale     Time (minutes)
  8m8p      319.05
  8m16p     160.52
  8m32p     86.95
  8m64p     65.72
  8m128p    63.23
Parallel Checksumming: Scaling
23.17 TiB dataset, processing rate (8 hosts, scaling from 8 to 128 processes):

  Scale     Rate (GiB/s)
  8m8p      1.24
  8m16p     2.46
  8m32p     4.55
  8m64p     6.02
  8m128p    6.25
Checkpoint and Restart
For a serial application, resuming would be trivial. For FCP, it is conceptually simple:
- Write a checkpoint file periodically.
- Read it back and resume the work.
The devil is in the details:
- Failure during the parallel walk
- Failure before the first checkpoint
- Failure while writing the checkpoint
- Failure while a work-stealing message is in flight
- Should each rank have its own checkpoint?
- What if users launch differently next time (e.g., a different number of MPI ranks)?
- After restart, what bookkeeping is needed to show correct progress?
Extreme Scalability: Out-Of-Core
- Each metadata object = 112 bytes (file system)
- Each chunk object = 64 bytes (work queue)
- 500 million objects = 56 GB
Optimization: in the local work queue, place file objects at the front (head) and directory objects at the back (tail), so extremely unbalanced directories expand late.
We also considered: a K-V store, a database, a distributed task queue or cache.
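The queue-ordering trick can be sketched with a double-ended queue: file (chunk) work drains first from the head, while directory entries wait at the tail, which bounds how quickly the queue fans out on extremely unbalanced trees. `WorkQueue` and the `("kind", path)` item shape are illustrative assumptions.

```python
from collections import deque

class WorkQueue:
    """Deque-backed queue: files go to the head, directories to the tail."""
    def __init__(self):
        self.q = deque()

    def push(self, item):
        kind, _ = item
        if kind == "dir":
            self.q.append(item)      # directories expand later: tail
        else:
            self.q.appendleft(item)  # file/chunk work drains first: head

    def pop(self):
        return self.q.popleft()      # always serve the head

wq = WorkQueue()
for item in [("dir", "/a"), ("file", "/a/x"),
             ("file", "/a/y"), ("dir", "/a/b")]:
    wq.push(item)
print([kind for kind, _ in wq.q])  # files first, then directories
```

Since expanding a directory is the only operation that grows the queue, deferring directories keeps the in-memory footprint small: pending file work is consumed before new batches of entries are generated, which is exactly what matters when 500 million objects would otherwise mean 56 GB of queue state.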
Evaluation: Small Files
[Bar chart: copy time in seconds (0-2000) vs. node count at 1, 2, 4, and 8 nodes]
Real application dataset: 100,000 small files ("small" as in less than 4 KiB each), default striping.
Evaluation: Large Files
[Bar chart: copy time in seconds (0-2000) vs. node count at 1, 2, 4, and 8 nodes]
1 TB file, 32 stripes, np=8. Striping matters.
Conclusions
We presented a parallel data copy tool, part of a suite of tools developed at OLCF. It is built on the following core assumptions and principles:
- Ubiquitous MPI in cluster environments
- The work-stealing pattern
- Distributed termination detection, for scalability and self-stabilization
It has shown promising results at our production site:
- Large-scale profiling
- Large-scale parallel checksumming
- Large-scale file transfer
Source availability: http://www.github.com/olcf/pcircle