
DiskReduce: Making Room for More Data on DISCs. Wittawat Tantisiriroj, Lin Xiao, Bin Fan, and Garth Gibson. Parallel Data Laboratory, Carnegie Mellon University.

GFS/HDFS Triplication. GFS and HDFS triplicate every data block: one local copy plus two remote copies, a 200% space overhead, to handle node failures. RAID has long been used to handle disk failures. Why can't we use RAID to handle node failures? Is it too complex? Is it too hard to scale? Can it work with commodity hardware?

RAID5 Across Nodes at Scale. RAID5 across nodes can be done at scale; Panasas does it [Welch08]. But error handling is complicated.

GFS & HDFS Reconstruction. GFS and HDFS defer repair: a background (asynchronous) process repairs copies. Notably less scary to developers.

Outline. Motivation; DiskReduce basics (replace one copy with RAID5): encoding, reconstruction, design options, evaluation; DiskReduce V2.0 (work in progress); conclusion.

Triplication First. Start the same way: triplicate every data block, one local plus two remote copies, 200% space overhead. (Diagram: three copies of block A.)

Background Encoding. Goal: a parity encoding plus two copies. An asynchronous background process performs the encoding. In coding terms: the data blocks are A and B, and the check block is f(A,B) = A+B. (Diagram: copies of A and B, with the third copies replaced by one parity block A+B.)
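To make the "+" concrete, here is a minimal sketch of the parity step, assuming blocks are equal-length byte strings and "+" is bitwise XOR as in RAID5; the function name is illustrative, not DiskReduce's actual code.

```python
def encode_parity(blocks):
    """XOR all data blocks together to form the check block f(A, B) = A + B.
    Assumes every block in the parity group has the same length."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

A = b"\x01\x02\x03\x04"
B = b"\x10\x20\x30\x40"
P = encode_parity([A, B])   # the check block A+B == b"\x11\x22\x33\x44"
```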

Background Repair (Single Failure). Standard single-failure recovery: use the second copy of the data block to reconstruct. (Diagram: the lost copy of A is rebuilt from the surviving copy.)

Background Repair (Double Failure). Use the parity block and the other data blocks in the group to reconstruct one lost block, then continue with standard single-failure recovery. (Diagram: both copies of A are lost; A is rebuilt from B and the parity A+B.)
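The corresponding repair path, under the same XOR assumption: a lost block is the XOR of the parity with every surviving data block in the group. A self-contained sketch:

```python
def xor_blocks(blocks):
    """XOR equal-length blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Double failure from the slide: both copies of A are gone, but B and
# the parity A+B survive, so A = (A+B) XOR B.
A, B = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"
parity = xor_blocks([A, B])          # the check block A+B
recovered = xor_blocks([parity, B])  # reconstruct the lost A
assert recovered == A
```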

Design Options. Encode blocks within a single file: pro, simple deletion; con, not space efficient for small files. Encode blocks across files: pro, more space efficient; con, deletion needs cleanup. (Diagram: two copies each of blocks A, B, and C from different files, plus one parity block A+B+C.)

Cloud File Size Distribution (Yahoo! M45). A large portion of capacity is used by files with a small number of blocks: 58% of capacity is used by files of 8 blocks or less, and 25% of capacity by files of a single block. (Plot: cumulative capacity used (%) versus file size in 128 MB blocks, 0 to 64.)

Across-file RAID Saves More Capacity. Given the M45 file-size distribution, encoding within each file yields ~138% overhead, while encoding across files approaches the lower bound for RAID5 + mirror of ~103% overhead.
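A hedged back-of-the-envelope for why this is so: assuming "RAID5 + mirror" keeps two copies of every block plus one parity block per group of up to g blocks (a rough model, not DiskReduce's exact accounting), a one-block file encoded on its own still pays 200%, whereas large across-file groups approach the ~100% mirror floor.

```python
import math

def raid5_mirror_overhead(num_blocks, group_size):
    """Space overhead (fraction of user data) for 2 copies + 1 parity per group."""
    parity = math.ceil(num_blocks / group_size)
    stored = 2 * num_blocks + parity
    return (stored - num_blocks) / num_blocks

print(raid5_mirror_overhead(1, 8))    # 2.000 -> a 1-block file still pays 200%
print(raid5_mirror_overhead(8, 8))    # 1.125 -> ~113% with group size 8
print(raid5_mirror_overhead(32, 32))  # 1.031 -> near the ~103% lower bound
```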

Testbed Evaluation. 16 nodes, each with a dual-core 3.00 GHz Pentium D, 4 GB memory, a 7200 rpm 160 GB SATA disk, and Gigabit Ethernet. Implementation: Hadoop/HDFS version 0.17.1. Test conditions: benchmarks modeled on the Google File System paper [Ghemawat03]; benchmark input is read after all parity groups are encoded; benchmark output is written with encoding running in the background; no failures during tests.

As Expected, Little Performance Degradation. Grep with 45 GB of data: ~1% degradation (no writes, so encoding does not interfere). Sort with 7.5 GB of data: ~7% degradation (encoding competes with the write). (Plots: completion time in seconds for original Hadoop with 3 replicas versus parity group sizes 8 and 16.)

Reduce Overhead to Nearly Optimal. (Plot: measured storage overheads of ~113% and ~106% against the optimal bounds.) Overhead is optimal only when blocks are perfectly balanced across parity groups.

DiskReduce in the Real World! Based on a talk about DiskReduce v1, a user-level implementation of RAID5 + mirror now exists in HDFS [Borthakur09]: combine the third replicas of the blocks of a single file to create parity blocks, then remove the third replicas. Apache JIRA HDFS-503, in Hadoop 0.22.0.

Outline. Motivation; DiskReduce basics (apply RAID to HDFS); DiskReduce V2.0 (work in progress): goal, delayed encoding, future work; conclusion.

Why DiskReduce V2.0? Goal: save more space with stronger codes. Challenge: the simple search used in DiskReduce V1.0 to find feasible groups cannot be applied to stronger codes. Solution: pre-determine the placement of blocks.

Example: Block Placement. The codeword can be drawn from any erasure code. All data blocks in a codeword are created at one node. Pick a codeword at random. (Diagram: data blocks D1, D2, D3 replicated three times, with parity blocks P1 and P2.)
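A minimal sketch of what pre-determined placement could look like, assuming each codeword has k data slots filled by one writer node and m parity slots pinned to distinct other nodes, with writers picking an open codeword at random. All names and parameters here are hypothetical, not the prototype's code.

```python
import random

def new_codeword(writer, all_nodes, k=3, m=2):
    """Reserve a codeword: k data blocks created at the writer node, m parity
    slots pre-assigned to m distinct other nodes (assumes len(all_nodes) > m)."""
    others = [n for n in all_nodes if n != writer]
    return {"data_node": writer,
            "parity_nodes": random.sample(others, m),
            "free_data_slots": k}

def place_block(open_codewords, writer, all_nodes):
    """Pick one of the writer's open codewords at random, or reserve a new one."""
    candidates = [c for c in open_codewords
                  if c["data_node"] == writer and c["free_data_slots"] > 0]
    if candidates:
        cw = random.choice(candidates)
    else:
        cw = new_codeword(writer, all_nodes)
        open_codewords.append(cw)
    cw["free_data_slots"] -= 1
    return cw   # caller writes the block on cw["data_node"]
```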

Testbed Prototype Evaluation. 32 nodes, each with two quad-core 2.83 GHz Xeons, 16 GB memory, 4 x 7200 rpm 1 TB SATA disks, and 10 Gigabit Ethernet. Implementation: Hadoop/HDFS version 0.20.0; the encoding part is implemented, other parts are work in progress. Test: each node writes a 16 GB file into a DiskReduce-modified HDFS, 512 GB of user data in total.

DiskReduce v2.0 Prototype Works! RAID6 (group size 8): ~25% overhead. RAID5 & mirror (group size 8): ~113% overhead. Both schemes achieve close to optimal. (Plot: storage used in GB over time in seconds, against the 512 GB of user data.)
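Under the same rough accounting as before, the ~25% figure falls out of a RAID6-style code that keeps a single copy of the data plus two parity blocks per group of 8:

```python
def raid6_overhead(group_size, parity_per_group=2):
    """One copy of the data plus p parity blocks per full group."""
    return parity_per_group / group_size

print(raid6_overhead(8))   # 0.25 -> ~25% overhead, versus 200% for triplication
```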

Possible Performance Degradation. When does more than one copy help? Backup tasks: more data copies make it easier to schedule a backup task on a node that holds a local copy. Hot files: popular files may be read by many jobs at the same time. Load balance and local assignment: with more data copies, the job tracker has more flexibility to assign tasks to nodes with a local data copy.

Delayed Encoding. Encode blocks only when the extra copies are likely to yield little further benefit; for example, encode only blocks that have existed for at least one day. How long should we delay?

Data Access Happens a Short Time After Creation. ~99% of data accesses happen within the first hour of a data block's lifetime.

How Long Should We Delay? With a fixed delay (e.g., 1 hour): Benefit: ~99% of data accesses benefit from multiple copies. Cost: for a workload of continuous writing at 25 MB/s per disk, ~90 GB per hour per disk must remain unencoded, which is less than 5% of a 2 TB disk's capacity.
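The cost line works out as follows (a sketch of the slide's arithmetic, taking the talk's 25 MB/s sustained write rate as given and using decimal units):

```python
write_rate_mb_s = 25                        # sustained write rate per disk (from the talk)
delayed_gb = write_rate_mb_s * 3600 / 1000  # data written during a 1-hour delay
disk_capacity_gb = 2000                     # a 2 TB disk
fraction = delayed_gb / disk_capacity_gb
print(f"~{delayed_gb:.0f} GB/hour per disk stays unencoded")  # ~90 GB
print(f"that is {fraction:.1%} of the disk, under 5%")        # 4.5%
```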

Future Work: Online Reconstruction. Allow degraded-mode reads while repair is in progress. (Diagram: a read of a lost block A served from B and the parity A+B before repair completes.)
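A sketch of what a degraded-mode read could look like: if no replica of the requested block survives, decode it on the fly from the rest of its parity group instead of failing or waiting for background repair. The store layout and helper names are hypothetical, and an XOR code is assumed.

```python
def degraded_read(block_id, store, group):
    """Return the block if a replica survives; otherwise XOR the parity and
    the other surviving blocks of its group to rebuild it on the fly."""
    data = store.get(block_id)
    if data is not None:
        return data
    survivors = [store[b] for b in group if b != block_id and b in store]
    out = bytearray(len(survivors[0]))
    for block in survivors:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Demo: both copies of A are lost; B and the parity P = A+B survive.
A, B = b"\x01\x02", b"\x10\x20"
P = bytes(a ^ b for a, b in zip(A, B))
store = {"B": B, "P": P}
assert degraded_read("A", store, ["A", "B", "P"]) == A
```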

Future Work: Deletion. How can we reduce the cleanup work when blocks of an across-file parity group are deleted? (Diagram: parity block A+B+C whose member blocks have been deleted.) Also: analyze additional traces for file size distribution and data access time.

Conclusion. RAID can be applied to HDFS: Dhruba Borthakur of Facebook has implemented a variant of RAID5 + mirror in HDFS. RAID can bring the storage overhead down from 200% to 25%. Delayed encoding helps avoid performance degradation.

References

[Borthakur09] D. Borthakur. HDFS and Erasure Codes, Aug. 2009. URL http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html

[Ghemawat03] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. SIGOPS Oper. Syst. Rev., 37(5):29-43, 2003.

[Hadoop] Hadoop. URL http://hadoop.apache.org/

[Patterson88] D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). SIGMOD Rec., 17(3):109-116, 1988.

[Welch08] B. Welch, M. Unangst, Z. Abbasi, G. Gibson, B. Mueller, J. Small, J. Zelenka, and B. Zhou. Scalable Performance of the Panasas Parallel File System. In FAST, 2008.

[Yahoo] Yahoo! Reaches for the Stars with M45 Supercomputing Project. URL http://research.yahoo.com/node/1879