Crossing the Chasm: Sneaking a parallel file system into Hadoop
Wittawat Tantisiriroj, Swapnil Patil, Garth Gibson
PARALLEL DATA LABORATORY
Carnegie Mellon University

In this work
- Compare and contrast large storage system architectures
  - Internet services
  - High performance computing
- Can we use a parallel file system for Internet service applications?
  - Hadoop, an Internet service software stack
  - HDFS, an Internet service file system for Hadoop
  - PVFS, a parallel file system
http://www.pdl.cmu.edu/

Today's Internet services
- Applications are becoming data-intensive
  - Large input data set (e.g. the entire web)
  - Distributed, parallel application execution
- Distributed file system is a key component
  - Define new semantics for anticipated workloads
    - Atomic append in Google FS
    - Write-once in HDFS
  - Commodity hardware and network
    - Handle failures through replication

The HPC world
- Equally large applications
  - Large input data set (e.g. astronomy data)
  - Parallel execution on large clusters
- Use parallel file systems for scalable I/O
  - e.g. IBM's GPFS, Sun's Lustre FS, PanFS, and Parallel Virtual File System (PVFS)

Why use parallel file systems?
- Handle a wide variety of workloads
  - Highly concurrent reads and writes
  - Small file support, scalable metadata
- Offer performance vs. reliability tradeoff
  - RAID-5 (e.g., PanFS)
  - Mirroring
  - Failover (e.g., LustreFS)
- Standard Unix FS interface & POSIX semantics
  - pNFS standard (NFS v4.1)

Outline
- A basic shim layer & preliminary evaluation
- Three add-on features in a shim layer
- Evaluation

HDFS & PVFS: high level design
- Metadata servers
  - Store all file system metadata
  - Handle all metadata operations
- Data servers
  - Store actual file system data
  - Handle all read and write operations
- Files are divided into chunks
  - Chunks of a file are distributed across servers

PVFS shim layer under Hadoop
[Diagram: software stack]
- Client: Hadoop applications, on the Hadoop framework, on the extensible file system API
  - HDFS path: HDFS client library
  - PVFS path: PVFS shim layer, on the unmodified PVFS client library (C)
- Server: HDFS servers or unmodified PVFS servers
- The shim layer forwards requests to, and returns responses from, the PVFS client library using the Java Native Interface (JNI)
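
The forwarding role of the shim layer can be sketched as a thin adapter. This is only an illustration under stated assumptions: the real shim is Java code that calls the C PVFS client library through JNI, whereas `FakePvfsClient`, `PvfsShim` and all of their method names are hypothetical stand-ins invented for this example.

```python
class FakePvfsClient:
    """Hypothetical in-memory stand-in for the unmodified PVFS client library."""

    def __init__(self):
        self._files = {}

    def pvfs_write(self, path, data):
        # Append data to the named file and report how many bytes were written.
        self._files[path] = self._files.get(path, b"") + data
        return len(data)

    def pvfs_read(self, path, offset, length):
        return self._files[path][offset:offset + length]


class PvfsShim:
    """Adapter that exposes a framework-facing API and forwards every call."""

    def __init__(self, client):
        self._client = client

    def create(self, path, data):
        # Forward the request unchanged; the shim adds no file system logic here.
        return self._client.pvfs_write(path, data)

    def read(self, path, offset, length):
        return self._client.pvfs_read(path, offset, length)
```

The point of the design is that neither side changes: the framework only sees the adapter's interface, and the client library is used as shipped.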

Preliminary Evaluation
- Text search ("grep"): a common workload in Internet service applications
- Search for a rare pattern in 100-byte records
- 64GB data set
- 32 nodes
- Each node serves as both a storage node and a compute node

Vanilla PVFS is disappointing
- 2.5 times slower

Outline
- A basic shim layer & preliminary evaluation
- Three add-on features in a shim layer
  - Readahead buffer
  - File layout information
  - Replication
- Evaluation

Read operation in Hadoop
- Typical read workload:
  - Small (less than 128 KB)
  - Sequential through an entire chunk
- HDFS prefetches an entire chunk
  - No cache coherence issue with its write-once semantics

Readahead buffer
- PVFS has no client buffer cache
  - Avoids a cache coherence issue with concurrent writes
- Readahead buffer can be added to PVFS shim layer
  - In Hadoop, a file can become immutable after it is closed
  - No need for cache coherence mechanism
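
A readahead buffer of this kind can be sketched as a wrapper that fetches a large block on a miss and serves the following small sequential reads from memory. This is a minimal sketch, not the shim's actual code: `ReadaheadReader`, `read_fn` and `backend_reads` are hypothetical names, and the sketch relies on the file being immutable, so the buffer is never invalidated.

```python
class ReadaheadReader:
    """Serve small reads from one large buffer that is refilled on a miss."""

    def __init__(self, read_fn, buffer_size=4 * 1024 * 1024):
        self._read_fn = read_fn          # assumed signature: read_fn(offset, length) -> bytes
        self._buffer_size = buffer_size
        self._buf = b""
        self._buf_start = 0
        self.backend_reads = 0           # number of requests sent to the file system

    def read(self, offset, length):
        out = bytearray()
        while length > 0:
            buf_end = self._buf_start + len(self._buf)
            if not (self._buf_start <= offset < buf_end):
                # Miss: fetch a whole buffer starting at the requested offset.
                self._buf = self._read_fn(offset, self._buffer_size)
                self._buf_start = offset
                self.backend_reads += 1
                if not self._buf:
                    break                # end of file
            start = offset - self._buf_start
            piece = self._buf[start:start + length]
            out += piece
            offset += len(piece)
            length -= len(piece)
        return bytes(out)
```

With 100-byte sequential reads and a 4000-byte buffer, a 16000-byte file needs only four backend requests instead of 160.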

PVFS with 4MB buffer
- Still quite slow

Outline
- A basic shim layer & preliminary evaluation
- Three add-on features in a shim layer
  - Readahead buffer
  - File layout information
  - Replication
- Evaluation

Collocation in Hadoop
- File layout information
  - Describes where chunks are located
- Collocate computation and data
  - Ship computation to where data is located
  - Reduce network traffic

Hadoop without collocation
[Diagram: compute nodes running computation on Chunk1, Chunk2 and Chunk3; storage nodes holding Chunk 3 (Node A), Chunk 1 (Node B) and Chunk 2 (Node C)]
- 3 data transfers over network

Hadoop with collocation
[Diagram: compute nodes running computation on Chunk1, Chunk2 and Chunk3; storage nodes holding Chunk 3 (Node A), Chunk 1 (Node B) and Chunk 2 (Node C)]
- No data transfer over network
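
The effect of collocation on network traffic can be illustrated with a small greedy scheduler. This is a sketch under assumptions, not Hadoop's scheduler: `assign_tasks` and `network_transfers` are hypothetical helpers, each node is assumed to run one task, and the "naive" assignment in the usage note is an assumed example of ignoring layout.

```python
def assign_tasks(chunk_locations, idle_nodes):
    """Greedily place each chunk's task on an idle node that stores the chunk.

    chunk_locations maps a chunk name to the set of nodes holding a copy.
    Chunks with no idle local node fall back to any remaining idle node.
    """
    assignment = {}
    free = list(idle_nodes)
    remote = []
    for chunk, holders in chunk_locations.items():
        local = next((node for node in free if node in holders), None)
        if local is not None:
            assignment[chunk] = local
            free.remove(local)
        else:
            remote.append(chunk)
    for chunk in remote:
        if not free:
            break
        assignment[chunk] = free.pop(0)
    return assignment


def network_transfers(assignment, chunk_locations):
    """Count tasks that must fetch their chunk from another node."""
    return sum(1 for chunk, node in assignment.items()
               if node not in chunk_locations[chunk])
```

Using the layout from the diagrams (Chunk 3 on Node A, Chunk 1 on Node B, Chunk 2 on Node C), `assign_tasks` places every task next to its data, so `network_transfers` reports 0. An assignment that ignores layout, such as Chunk1 on A, Chunk2 on B and Chunk3 on C, reports 3.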

Expose file layout information
- File layout information in PVFS
  - Stored as extended attributes
  - Different format from Hadoop format
- A shim layer converts file layout information from PVFS format to Hadoop format
  - Enables Hadoop to collocate computation and data
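
The kind of conversion involved can be sketched as follows. The actual PVFS extended-attribute format and the Hadoop data structures are not reproduced here; this sketch simply assumes a round-robin striped layout described by a stripe size and an ordered server list, and emits per-block records with an offset, a length and a host list. The function name and the dictionary keys are hypothetical.

```python
def pvfs_layout_to_block_locations(file_size, stripe_size, servers):
    """Turn an assumed round-robin stripe description into per-block host lists."""
    if stripe_size <= 0 or not servers:
        raise ValueError("stripe_size must be positive and servers non-empty")
    blocks = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(stripe_size, file_size - offset)
        blocks.append({
            "offset": offset,
            "length": length,
            # Stripe i lives on server i mod len(servers) under round-robin placement.
            "hosts": [servers[index % len(servers)]],
        })
        offset += length
        index += 1
    return blocks
```

For a 10-byte file with 4-byte stripes over two servers, this yields three blocks on `s0`, `s1` and `s0`, which is the per-chunk location information a scheduler needs in order to collocate computation and data.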

PVFS with file layout information
- Comparable performance

Outline
- A basic shim layer & preliminary evaluation
- Three add-on features in a shim layer
  - Readahead buffer
  - File layout information
  - Replication
- Evaluation

Replication in HDFS
- Rack-aware replication
- By default, 3 copies for each file (triplication)
  1. Write to a local storage node
  2. Write to a storage node in the local rack
  3. Write to a storage node in the other rack
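
The three placement steps listed above can be sketched as a small function. It follows the slide's description only and is not HDFS's placement code: `place_replicas` is a hypothetical helper, and it assumes the writer's rack has at least one other node and that at least one other rack exists.

```python
import random


def place_replicas(writer, racks, rng=random):
    """Pick three replica targets following the three steps on the slide.

    racks maps a rack name to the list of storage nodes in that rack.
    """
    local_rack = next(rack for rack, nodes in racks.items() if writer in nodes)
    same_rack = [node for node in racks[local_rack] if node != writer]
    other_racks = [node for rack, nodes in racks.items()
                   if rack != local_rack for node in nodes]
    return [
        writer,                   # 1. the local storage node
        rng.choice(same_rack),    # 2. another node in the local rack
        rng.choice(other_racks),  # 3. a node in a different rack
    ]
```

Keeping the first copy on the writer's own node is what lets one of the three writes avoid the network entirely.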

Replication in PVFS
- No replication in the public release of PVFS
  - Rely on hardware-based reliability solutions
    - Per-server RAID inside logical storage devices
- Replication can be added in a shim layer
  - Write each file to three servers
  - No reconstruction/recovery in the prototype
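
Replication in a shim layer can be sketched as a writer that stores every file on three servers and reads from whichever copy is still present. This is an illustration under assumptions: `ReplicatingWriter` is a hypothetical class, plain dictionaries stand in for PVFS servers, the target-selection rule is invented for the example, and, like the prototype, the sketch performs no reconstruction or recovery.

```python
class ReplicatingWriter:
    """Write each file to several servers; read from the first copy found."""

    def __init__(self, servers, copies=3):
        if len(servers) < copies:
            raise ValueError("need at least as many servers as copies")
        self._servers = servers   # list of dicts standing in for storage servers
        self._copies = copies

    def _targets(self, path):
        # Deterministic choice of consecutive servers (an assumed placement rule).
        start = sum(path.encode()) % len(self._servers)
        return [(start + i) % len(self._servers) for i in range(self._copies)]

    def write(self, path, data):
        for index in self._targets(path):
            self._servers[index][path] = data

    def read(self, path):
        for index in self._targets(path):
            if path in self._servers[index]:
                return self._servers[index][path]
        raise FileNotFoundError(path)
```

A read still succeeds after one server loses its copy, but nothing re-creates the lost replica.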

PVFS with replication
[Diagram: Hadoop applications, Hadoop framework, extensible file system API and PVFS shim layer on top of the unmodified PVFS client library (C), which talks to three unmodified PVFS servers]

PVFS shim layer under Hadoop
[Diagram: software stack]
- Client: Hadoop applications, on the Hadoop framework, on the extensible file system API
  - HDFS path: HDFS client library
  - PVFS path: PVFS shim layer (~1,700 lines of code), on the unmodified PVFS client library (C)
    - Readahead buffer
    - File layout info
    - Replication
- Server: HDFS servers or unmodified PVFS servers

Outline
- A basic shim layer & preliminary evaluation
- Three add-on features in a shim layer
- Evaluation
  - Micro-benchmark (non MapReduce)
  - MapReduce benchmark

Micro-benchmark
- Cluster configuration
  - 16 nodes
  - Pentium D dual-core 3.0GHz
  - 4 GB memory
  - One 7200 rpm SATA 160 GB (8 MB buffer)
  - Gigabit Ethernet
- Use file system API directly without Hadoop involvement

N clients, each reads 1/N of a single file
- Round-robin file layout in PVFS helps avoid contention

Why is PVFS better in this case?
- Without scheduling, clients read in a uniform pattern
  - Client1 reads A1 then A4
  - Client2 reads A2 then A5
  - Client3 reads A3 then A6
- PVFS: round-robin placement
  - Servers hold (A1, A4), (A2, A5), (A3, A6)
- HDFS: random placement
  - Servers hold (A1, A3), (A2, A5), (A4, A6)
  - Contention

HDFS with Hadoop's scheduling
- Example 1:
  - Client1 reads A1 then A4
  - Client2 reads A2 then A5
  - Client3 reads A6 then A3
  - Servers hold (A1, A3), (A2, A5), (A4, A6)
- Example 2:
  - Client1 reads A1 then A3
  - Client2 reads A2 then A5
  - Client3 reads A4 then A6
  - Servers hold (A1, A3), (A2, A5), (A4, A6)
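
The contention in these examples can be checked by counting, for each time step, how many clients hit the same server. This is a sketch under assumptions: `max_server_load` is a hypothetical helper, the three clients are assumed to read in lockstep, and the server numbering (1 to 3) and the pairing of chunks to servers are read off the diagrams.

```python
from collections import Counter


def max_server_load(placement, steps):
    """Return, per time step, the largest number of clients sharing one server.

    placement maps a chunk to the server storing it; steps lists the chunks
    that the clients read at the same time.
    """
    return [max(Counter(placement[chunk] for chunk in step).values())
            for step in steps]


# Chunk placements as drawn on the slides (server ids are illustrative).
pvfs_round_robin = {"A1": 1, "A4": 1, "A2": 2, "A5": 2, "A3": 3, "A6": 3}
hdfs_random = {"A1": 1, "A3": 1, "A2": 2, "A5": 2, "A4": 3, "A6": 3}

# Uniform pattern without scheduling: clients read (A1, A2, A3), then (A4, A5, A6).
uniform = [["A1", "A2", "A3"], ["A4", "A5", "A6"]]
# Scheduled patterns from Example 1 and Example 2.
example_1 = [["A1", "A2", "A6"], ["A4", "A5", "A3"]]
example_2 = [["A1", "A2", "A4"], ["A3", "A5", "A6"]]
```

A load of 1 in every step means no two clients share a server. The uniform pattern gives that on the round-robin layout but not on the random one, while both scheduled patterns give it on the random layout.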

Read with Hadoop's scheduling
- Hadoop's scheduling can mask a problem with a non-uniform file layout in HDFS

N clients write to N distinct files
- By writing one of three copies locally, HDFS write throughput grows linearly

Concurrent writes to a single file
- By allowing concurrent writes in PVFS, a copy completes faster by using multiple writers
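
The idea of several writers filling one file can be sketched with threads that each write a disjoint byte range. This only illustrates the access pattern on a local file: `parallel_copy` is a hypothetical helper, it makes no claim about PVFS internals, and it says nothing about the speedup measured in the benchmark.

```python
import os
import tempfile
import threading


def parallel_copy(data, dest_path, writers=4):
    """Write data to dest_path using several writers on disjoint regions."""
    # Pre-size the file so that every writer can seek to its own region.
    with open(dest_path, "wb") as f:
        f.truncate(len(data))
    region = -(-len(data) // writers)  # ceiling division

    def write_region(i):
        start = i * region
        # Each writer opens its own handle, so file positions are not shared.
        with open(dest_path, "r+b") as f:
            f.seek(start)
            f.write(data[start:start + region])

    threads = [threading.Thread(target=write_region, args=(i,))
               for i in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    payload = bytes(i % 256 for i in range(1_000_000))
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "copy.bin")
        parallel_copy(payload, path)
        with open(path, "rb") as f:
            print(f.read() == payload)
```

Because the regions do not overlap, the writers need no coordination beyond agreeing on the region boundaries.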

Outline
- A basic shim layer & preliminary evaluation
- Three add-on features in a shim layer
- Evaluation
  - Micro-benchmark (non MapReduce)
  - MapReduce benchmark

MapReduce benchmark setting
- Yahoo! M45 cluster
  - Use 50-100 nodes
  - Xeon quad-core 1.86 GHz with 6GB memory
  - One 7200 rpm SATA 750 GB (8 MB buffer)
  - Gigabit Ethernet
- Use Hadoop framework for MapReduce processing

MapReduce benchmark
- Grep: Search for a rare pattern in hundred million 100-byte records (100GB)
- Sort: Sort hundred million 100-byte records (100GB)
- Never-Ending Language Learning (NELL) (J. Betteridge, CMU): Count the numbers of selected phrases in 37GB data set

Read-Intensive Benchmark
- PVFS's performance is similar to HDFS

Write-Intensive Benchmark
- By writing one of three copies locally, HDFS does better than PVFS

Summary
- PVFS can be tuned to deliver promising performance for Hadoop applications
  - Simple shim layer in Hadoop
  - No modification to PVFS
- PVFS can expose file layout information
  - Enable Hadoop to collocate computation and data
- Hadoop applications can benefit from concurrent writing supported by parallel file systems

Acknowledgements
- Sam Lang and Rob Ross for help with PVFS internals
- Yahoo! for the M45 cluster
- Julio Lopez for help with M45 and Hadoop
- Justin Betteridge, Le Zhao, Jamie Callan, Shay Cohen, Noah Smith, U Kang and Christos Faloutsos for their scientific applications