Crossing the Chasm: Sneaking a parallel file system into Hadoop

1 Crossing the Chasm: Sneaking a parallel file system into Hadoop. Wittawat Tantisiriroj, Swapnil Patil, Garth Gibson. Parallel Data Laboratory, Carnegie Mellon University.

2 In this work: Compare and contrast large storage system architectures: Internet services and high performance computing. Can we use a parallel file system for Internet service applications? Hadoop, an Internet service software stack; HDFS, an Internet service file system for Hadoop; PVFS, a parallel file system.

3 Today's Internet services. Applications are becoming data-intensive: large input data sets (e.g. the entire web) and distributed, parallel application execution. A distributed file system is a key component: it defines new semantics for anticipated workloads (atomic append in Google FS, write-once in HDFS), runs on commodity hardware and network, and handles failures through replication.

4 The HPC world. Equally large applications: large input data sets (e.g. astronomy data) and parallel execution on large clusters. Use parallel file systems for scalable I/O, e.g. IBM's GPFS, Sun's Lustre FS, PanFS, and Parallel Virtual File System (PVFS).

5 Why use parallel file systems? Handle a wide variety of workloads: highly concurrent reads and writes; small file support, scalable metadata. Offer a performance vs. reliability tradeoff: RAID-5 (e.g., PanFS); mirroring; failover (e.g., LustreFS). Standard Unix FS interface and POSIX semantics; pNFS standard (NFS v4.1).

6 Outline: A basic shim layer & preliminary evaluation; three add-on features in a shim layer; evaluation.

7 HDFS & PVFS: high level design. Metadata servers store all file system metadata and handle all metadata operations. Data servers store actual file system data and handle all read and write operations. Files are divided into chunks; the chunks of a file are distributed across servers.

8 PVFS shim layer under Hadoop. [Diagram] Client: Hadoop applications, Hadoop framework, extensible file system API, then either the HDFS client library or the PVFS shim layer over the unmodified PVFS client library (C). Server: HDFS servers or unmodified PVFS servers. The shim layer forwards requests to, and returns responses from, the PVFS client library using the Java Native Interface (JNI).

9 Preliminary evaluation. Text search (grep), a common workload in Internet service applications. Search for a rare pattern in 100-byte records; 64 GB data set; 32 nodes; each node serves as both a storage node and a compute node.

10 Vanilla PVFS is disappointing. [Chart: completion time (sec) for Grep (64 GB, 32 nodes, no replication), PVFS vs. HDFS.] PVFS is 2.5 times slower than HDFS.

11 Outline: A basic shim layer & preliminary evaluation; three add-on features in a shim layer (readahead buffer, file layout information, replication); evaluation.

12 Read operation in Hadoop. Typical read workload: small reads (less than 128 KB), sequential through an entire chunk. HDFS prefetches an entire chunk; its write-once semantics mean there is no cache coherence issue.

13 Readahead buffer. PVFS has no client buffer cache, which avoids cache coherence issues with concurrent writes. A readahead buffer can be added to the PVFS shim layer: in Hadoop, a file can become immutable after it is closed, so no cache coherence mechanism is needed.
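
A readahead buffer of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, not the shim layer's code: the class name, the `raw_read(offset, length)` callback standing in for an unbuffered PVFS read, and the `raw_calls` counter are all assumptions made for the example. It relies on the file being immutable, so the buffer is never invalidated.

```python
class ReadaheadBuffer:
    """Serve small sequential reads out of one large buffered read.

    raw_read(offset, length) is a hypothetical stand-in for an unbuffered
    read against the underlying file system; it returns at most `length`
    bytes and an empty bytes object at end of file.
    """

    def __init__(self, raw_read, buffer_size=4 * 1024 * 1024):
        self.raw_read = raw_read
        self.buffer_size = buffer_size
        self.buf_start = 0   # file offset of the first buffered byte
        self.buf = b""       # currently buffered bytes
        self.raw_calls = 0   # number of reads sent to the file system

    def read(self, offset, length):
        out = bytearray()
        while length > 0:
            buf_end = self.buf_start + len(self.buf)
            if not (self.buf_start <= offset < buf_end):
                # Miss: fetch a whole buffer starting at the requested offset.
                self.raw_calls += 1
                self.buf = self.raw_read(offset, self.buffer_size)
                self.buf_start = offset
                if not self.buf:
                    break  # end of file
            lo = offset - self.buf_start
            piece = self.buf[lo:lo + length]
            out += piece
            offset += len(piece)
            length -= len(piece)
        return bytes(out)
```

With a 4 MB buffer, a scan in 100-byte records touches the file system roughly once per 4 MB instead of once per record.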

14 PVFS with 4 MB buffer. [Chart: completion time (sec) for Grep (64 GB, 32 nodes, no replication); bars: PVFS with no buffer, PVFS with buffer, HDFS.] PVFS with the buffer is still quite slow.

15 Outline: A basic shim layer & preliminary evaluation; three add-on features in a shim layer (readahead buffer, file layout information, replication); evaluation.

16 Collocation in Hadoop. File layout information describes where chunks are located. Collocate computation and data: ship computation to where the data is located to reduce network traffic.

17 Hadoop without collocation. [Diagram: computation on Chunk1, Chunk2, and Chunk3 runs on compute nodes that do not hold those chunks, so 3 data transfers cross the network. File layout: Chunk 3 on Node A, Chunk 1 on Node B, Chunk 2 on Node C.]

18 Hadoop with collocation. [Diagram: computation on Chunk1, Chunk2, and Chunk3 runs on the nodes that store those chunks, so no data is transferred over the network. File layout: Chunk 3 on Node A, Chunk 1 on Node B, Chunk 2 on Node C.]

19 Expose file layout information. File layout information in PVFS is stored as extended attributes, in a format different from Hadoop's. A shim layer converts file layout information from the PVFS format to the Hadoop format, which enables Hadoop to collocate computation and data.
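
The conversion can be pictured as turning a striping description into a per-block list of hosts, which is the kind of information a scheduler needs to place computation. The sketch below is a simplified model: the function name, the dictionary fields, and the assumption of a plain round-robin stripe with one host per block are all illustrative, and neither the real PVFS extended-attribute format nor the Hadoop API is reproduced here.

```python
def block_locations(file_size, stripe_size, servers):
    """Expand a round-robin striping description into per-block locations.

    file_size   -- file length in bytes
    stripe_size -- bytes stored on one server before moving to the next
    servers     -- host names in stripe order (hypothetical layout input)

    Returns a list of dicts with the offset, length, and hosts of each block.
    """
    blocks = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(stripe_size, file_size - offset)
        host = servers[index % len(servers)]
        blocks.append({"offset": offset, "length": length, "hosts": [host]})
        offset += length
        index += 1
    return blocks
```

For example, `block_locations(10, 4, ["nodeA", "nodeB"])` describes three blocks: bytes 0-3 on nodeA, bytes 4-7 on nodeB, and bytes 8-9 on nodeA again.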

20 PVFS with file layout information. [Chart: completion time (sec) for Grep (64 GB, 32 nodes, no replication); bars: PVFS with no buffer and no file layout; PVFS with buffer and no file layout; PVFS with buffer and file layout; HDFS.] With the buffer and file layout information, PVFS performance is comparable to HDFS.

21 Outline: A basic shim layer & preliminary evaluation; three add-on features in a shim layer (readahead buffer, file layout information, replication); evaluation.

22 Replication in HDFS. Rack-aware replication: by default, 3 copies of each file (triplication). 1. Write to a local storage node. 2. Write to a storage node in the local rack. 3. Write to a storage node in another rack.
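
The three placement steps listed on this slide can be sketched as follows. This follows the slide's description only; it is not HDFS's placement code, and the function name, the `racks` mapping, and the random choice among candidates are assumptions for illustration.

```python
import random


def place_replicas(writer, racks, rng=random):
    """Pick three replica targets following the three steps on the slide.

    writer -- name of the node issuing the write
    racks  -- dict mapping rack name to the list of node names in that rack
    rng    -- object with a choice() method (assumption: uniform random pick)
    """
    local_rack = next(rack for rack, nodes in racks.items() if writer in nodes)
    same_rack = [node for node in racks[local_rack] if node != writer]
    other_racks = [node
                   for rack, nodes in racks.items() if rack != local_rack
                   for node in nodes]
    return [
        writer,                   # 1. local storage node
        rng.choice(same_rack),    # 2. another node in the local rack
        rng.choice(other_racks),  # 3. a node in a different rack
    ]
```

The first copy costs no network traffic, which is the property the later write benchmarks attribute HDFS's advantage to.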

23 Replication in PVFS. No replication in the public release of PVFS; it relies on hardware-based reliability solutions (per-server RAID inside logical storage devices). Replication can be added in a shim layer: write each file to three servers. No reconstruction/recovery in the prototype.
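
One way to picture shim-level replication is a write path that sends every chunk to three different servers. The sketch below is a toy model under stated assumptions: the servers are plain dictionaries standing in for storage nodes, and the rule that copy r of chunk i goes to server (i + r) mod N is an illustrative choice, not the prototype's documented layout. As in the prototype, there is no reconstruction or recovery.

```python
def replica_servers(chunk_index, num_servers, copies=3):
    """Return the server indices holding each copy of one chunk.

    Assumption for this sketch: copies are shifted round-robin so that the
    copies of a chunk always land on distinct servers.
    """
    if copies > num_servers:
        raise ValueError("need at least as many servers as copies")
    return [(chunk_index + r) % num_servers for r in range(copies)]


def replicated_write(data, chunk_size, servers, copies=3):
    """Write every chunk of `data` to `copies` servers.

    servers -- list of dicts used as in-memory stand-ins for storage servers,
               each mapping chunk index to chunk bytes
    """
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        chunk = data[offset:offset + chunk_size]
        for server in replica_servers(index, len(servers), copies):
            servers[server][index] = chunk
```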

24 PVFS with replication. [Diagram: on the client, Hadoop applications, Hadoop framework, extensible file system API, PVFS shim layer, and unmodified PVFS client library (C), writing to three unmodified PVFS servers.]

25 PVFS shim layer under Hadoop. [Diagram] Client: Hadoop applications, Hadoop framework, extensible file system API, then either the HDFS client library or the PVFS shim layer (~1,700 lines of code; adds readahead buffer, file layout info, and replication) over the unmodified PVFS client library (C). Server: HDFS servers or unmodified PVFS servers.

26 Outline: A basic shim layer & preliminary evaluation; three add-on features in a shim layer; evaluation: micro-benchmark (non-MapReduce) and MapReduce benchmark.

27 Micro-benchmark. Cluster configuration: 16 nodes; Pentium D dual-core 3.0 GHz; 4 GB memory; one 7200 rpm SATA 160 GB disk (8 MB buffer); Gigabit Ethernet. Uses the file system API directly, without Hadoop involvement.

28 N clients, each reads 1/N of a single file. [Chart: aggregate read throughput (MB/s) vs. number of clients; series: PVFS (no replication), HDFS (no replication).] Round-robin file layout in PVFS helps avoid contention.

29 Why is PVFS better in this case? Without scheduling, clients read in a uniform pattern: Client1 reads A1 then A4; Client2 reads A2 then A5; Client3 reads A3 then A6. PVFS round-robin placement: servers hold (A1, A4), (A2, A5), (A3, A6). HDFS random placement: servers hold (A1, A3), (A2, A5), (A4, A6), which causes contention.
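
The contention argument on this slide can be checked with a small calculation: for each step of the read schedule, count how many clients hit the same server at once. The helper name and the server indices 0 to 2 are assumptions for the sketch; the placements and read order are the ones shown on the slide.

```python
from collections import Counter


def max_load_per_step(placement, schedule):
    """Return, for each read step, the largest number of clients on one server.

    placement -- dict mapping chunk name to the server that stores it
    schedule  -- one list of chunk names per client, in read order
    """
    steps = zip(*schedule)  # chunks read concurrently at each step
    return [max(Counter(placement[chunk] for chunk in step).values())
            for step in steps]


# Placements from the slide (server indices are illustrative).
round_robin = {"A1": 0, "A4": 0, "A2": 1, "A5": 1, "A3": 2, "A6": 2}
random_like = {"A1": 0, "A3": 0, "A2": 1, "A5": 1, "A4": 2, "A6": 2}

# Uniform read pattern without scheduling.
uniform = [["A1", "A4"], ["A2", "A5"], ["A3", "A6"]]

pvfs_load = max_load_per_step(round_robin, uniform)
hdfs_load = max_load_per_step(random_like, uniform)
```

Here `pvfs_load` is `[1, 1]` (every server serves one client per step) while `hdfs_load` is `[2, 2]` (two clients collide on one server in each step while another server sits idle).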

30 HDFS with Hadoop's scheduling. Example 1: Client1 reads A1 then A4; Client2 reads A2 then A5; Client3 reads A6 then A3. Example 2: Client1 reads A1 then A3; Client2 reads A2 then A5; Client3 reads A4 then A6. In both examples the servers hold (A1, A3), (A2, A5), (A4, A6).
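
Example 2 corresponds to a schedule in which each client reads only the chunks stored on its own node. A minimal sketch of that idea follows; it is not Hadoop's scheduler, and the function name and node names are hypothetical.

```python
def schedule_local_first(placement, clients):
    """Assign each client the chunks stored on its own node.

    placement -- dict mapping chunk name to the node that stores it
    clients   -- node names on which clients run (each is also a storage node)
    """
    return {client: sorted(chunk for chunk, node in placement.items()
                           if node == client)
            for client in clients}


# HDFS placement from the slide, with hypothetical node names.
placement = {"A1": "n1", "A3": "n1", "A2": "n2", "A5": "n2",
             "A4": "n3", "A6": "n3"}
assignment = schedule_local_first(placement, ["n1", "n2", "n3"])
```

Under this assignment every read is local, so the uneven layout no longer causes two clients to compete for one server.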

31 Read with Hadoop's scheduling. [Chart: completion time (sec) for Read (16 GB, 16 nodes), PVFS vs. HDFS.] Hadoop's scheduling can mask a problem with a non-uniform file layout in HDFS.

32 N clients write to N distinct files. [Chart: aggregate write throughput (MB/s) vs. number of clients; series: PVFS (no replication), HDFS (no replication).] By writing one of three copies locally, HDFS write throughput grows linearly.

33 Concurrent writes to a single file. [Chart: completion time (sec) for Parallel Copy (16 GB, 16 nodes); bars: PVFS (16 writers), HDFS (1 writer).] By allowing concurrent writes in PVFS, the copy completes faster by using multiple writers.
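
A parallel copy with multiple writers amounts to giving each writer a disjoint byte range of the same destination file. The sketch below models that with threads writing into a shared `bytearray`, which stands in for a file that accepts concurrent writes; the function names are assumptions and no PVFS or Hadoop call is involved.

```python
from concurrent.futures import ThreadPoolExecutor


def writer_ranges(file_size, writers):
    """Split [0, file_size) into `writers` contiguous, disjoint byte ranges."""
    base, extra = divmod(file_size, writers)
    ranges = []
    start = 0
    for w in range(writers):
        length = base + (1 if w < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


def parallel_copy(src, writers):
    """Copy `src` (bytes) using several writers, one byte range each."""
    dst = bytearray(len(src))  # stand-in for the shared destination file

    def copy_range(bounds):
        start, end = bounds
        dst[start:end] = src[start:end]  # same-length slice, ranges never overlap

    with ThreadPoolExecutor(max_workers=writers) as pool:
        list(pool.map(copy_range, writer_ranges(len(src), writers)))
    return bytes(dst)
```

With a write-once file system that admits a single writer per file, the same copy has to go through one writer, which is the HDFS (1 writer) case in the chart.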

34 Outline: A basic shim layer & preliminary evaluation; three add-on features in a shim layer; evaluation: micro-benchmark (non-MapReduce) and MapReduce benchmark.

35 MapReduce benchmark setting. Yahoo! M45 cluster. Nodes used: Xeon quad-core 1.86 GHz with 6 GB memory; one 7200 rpm SATA 750 GB disk (8 MB buffer); Gigabit Ethernet. Uses the Hadoop framework for MapReduce processing.

36 MapReduce benchmark. Grep: search for a rare pattern in hundred million 100-byte records (100GB). Sort: sort hundred million 100-byte records (100GB). Never-Ending Language Learning (NELL) (J. Betteridge, CMU): count the number of occurrences of selected phrases in a 37 GB data set.

37 Read-intensive benchmark. [Charts: completion time (sec) for Grep (100 GB, 50 nodes) and NELL (37 GB, 100 nodes), PVFS vs. HDFS.] PVFS's performance is similar to HDFS.

38 Write-intensive benchmark. [Charts: completion time (sec) and network traffic (GB) for Sort (100 GB, 50 nodes); bars: PVFS, HDFS, PVFS (2 copies).] By writing one of three copies locally, HDFS does better than PVFS.

39 Summary. PVFS can be tuned to deliver promising performance for Hadoop applications: a simple shim layer in Hadoop, with no modification to PVFS. PVFS can expose file layout information, enabling Hadoop to collocate computation and data. Hadoop applications can benefit from the concurrent writing supported by parallel file systems.

40 Acknowledgements. Sam Lang and Rob Ross for help with PVFS internals; Yahoo! for the M45 cluster; Julio Lopez for help with M45 and Hadoop; Justin Betteridge, Le Zhao, Jamie Callan, Shay Cohen, Noah Smith, U Kang and Christos Faloutsos for their scientific applications.
