DiskReduce: Making Room for More Data on DISCs. Wittawat Tantisiriroj


DiskReduce: Making Room for More Data on DISCs
Wittawat Tantisiriroj, Lin Xiao, Bin Fan, and Garth Gibson
Parallel Data Laboratory, Carnegie Mellon University

GFS/HDFS Triplication

GFS & HDFS triplicate every data block. Triplication: one local + two remote copies, a 200% space overhead to handle node failures. RAID has been used to handle disk failures, so why can't we use RAID to handle node failures?
- Is it too complex?
- Is it too hard to scale?
- Can it work with commodity hardware?

Wittawat Tantisiriroj, November 2009

RAID5 Across Nodes at Scale

RAID5 across nodes can be done at scale; Panasas does it [Welch08]. But error handling is complicated.

GFS & HDFS Reconstruction

GFS & HDFS defer repair: a background (asynchronous) process repairs copies. Notably less scary to developers.

Outline

- Motivation
- DiskReduce basic (replace one copy with RAID5)
  - Encoding
  - Reconstruction
  - Design options
  - Evaluation
- DiskReduce V2.0
- Conclusion

Triplication First

Start the same: triplicate every data block. Triplication: one local + two remote copies, 200% space overhead.
[Diagram: blocks A and B, each stored as one local and two remote copies]

Background Encoding

Goal: a parity encoding and two copies, produced by an asynchronous background process. In coding terms: the data is A, B; the check is A, B, f(A,B) = A+B.
[Diagram: the third copies of A and B are replaced by the single parity block A+B]
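The background encoding step above can be sketched in a few lines of Python. This is an illustrative sketch, not DiskReduce's actual code: the block values and the `xor_blocks`/`encode_group` helpers are hypothetical, and "+" is taken to be XOR, as is usual for RAID5-style parity.

```python
def xor_blocks(blocks):
    """XOR a list of equal-sized byte blocks into one parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def encode_group(data_blocks):
    """Replace the third replica of each block in a group with one shared
    parity block; each data block keeps two copies plus the group parity."""
    return xor_blocks(data_blocks)

A = b"\x01\x02\x03"
B = b"\x10\x20\x30"
parity = encode_group([A, B])   # the check block f(A, B) = A + B
```

With two data blocks per group this is exactly the A, B, A+B codeword on the slide; larger groups simply XOR more blocks into the same parity.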

Background Repair (Single Failure)

Standard single failure recovery: use the second copy of the lost data block to reconstruct it.
[Diagram: one copy of B fails; B is rebuilt from its surviving copy]

Background Repair (Double Failure)

Use the parity and the other data blocks in the group to reconstruct the lost block, then continue with standard single failure recovery.
[Diagram: both copies of B fail; B is rebuilt from A and the parity A+B]
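The double-failure case can be sketched the same way, assuming XOR parity as above; the values and helper are illustrative, not DiskReduce internals.

```python
def xor_blocks(x, y):
    """XOR two equal-sized byte blocks."""
    return bytes(a ^ b for a, b in zip(x, y))

A = b"\x01\x02\x03"
B = b"\x10\x20\x30"
parity = xor_blocks(A, B)            # created earlier by background encoding

# Double failure: both copies of B are gone.
# Rebuild B from the parity and the surviving data block A:
recovered_B = xor_blocks(parity, A)  # (A ^ B) ^ A == B

# Standard single-failure recovery then re-replicates recovered_B.
```

The same identity generalizes to larger groups: XOR the parity with every surviving data block in the group to recover the one that was lost.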

Design Options

Encode blocks within a single file:
- Pro: simple deletion
- Con: not space efficient for small files

Encode blocks across files:
- Pro: more space efficient
- Con: need to clean up after deletion
[Diagram: blocks A, B, C from different files encoded into one parity group with parity A+B+C]
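The space trade-off between the two options comes down to how blocks are packed into parity groups. A minimal sketch, with made-up file names and a hypothetical `group_size`: within-file grouping leaves mostly-empty groups when files are small, while across-file grouping packs them full (at the cost of cleanup when a member file is deleted).

```python
def group_within_file(files, group_size):
    """One (possibly under-full) run of parity groups per file."""
    return [file_blocks[i:i + group_size]
            for file_blocks in files
            for i in range(0, len(file_blocks), group_size)]

def group_across_files(files, group_size):
    """Pack blocks from all files into full groups, ignoring file boundaries."""
    all_blocks = [b for file_blocks in files for b in file_blocks]
    return [all_blocks[i:i + group_size]
            for i in range(0, len(all_blocks), group_size)]

# Mostly one-block files, as in the M45 trace on the next slide:
files = [["f1.b0"], ["f2.b0"], ["f3.b0", "f3.b1"]]
within = group_within_file(files, 8)   # 3 groups, each far from full
across = group_across_files(files, 8)  # 1 group holding all 4 blocks
```

Each group pays for one parity block, so fewer, fuller groups means less parity overhead; the deletion cleanup the slide mentions is the price of letting one group span files.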

Cloud File Size Distribution (Yahoo! M45)

A large portion of space is used by files with a small number of blocks:
- 58% of capacity is used by files with 8 blocks or less
- 25% of capacity is used by files with 1 block
[Figure: cumulative capacity used (%) vs. file size in 128MB blocks]

Across-file RAID Saves More Capacity

Within-file encoding: ~138% overhead. Across-file encoding approaches the lower bound for RAID5 + Mirror: ~103% overhead.

Testbed Evaluation

- 16 nodes: Pentium D dual-core 3.00GHz, 4GB memory, 7200rpm 160GB SATA disk, Gigabit Ethernet
- Implementation: Hadoop/HDFS version 0.17.1
- Test conditions:
  - Benchmarks modeled on the Google FS paper
  - Benchmark input is read after all parity groups are encoded
  - Benchmark output is written with encoding running in the background
  - No failures during tests

As Expected, Little Performance Degradation

- Grep with 45GB of data (no write): ~1% degradation
- Sort with 7.5GB of data: ~7% degradation, because encoding competes with the write
- Configurations compared: original Hadoop with 3 replicas vs. parity group sizes 8 and 16
[Charts: completion time (s) for Grep and Sort under each configuration]

Reduce Overhead to Nearly Optimal

[Chart: storage overhead approaches the optimal values of ~113% and ~106%; the optimum is reached only when blocks are perfectly balanced]

DiskReduce in the Real World!

Based on a talk about DiskReduce v1, a user-level implementation of RAID5 + Mirror in HDFS [Borthakur09]:
- Combines the third replicas of blocks from a single file to create parity blocks, then removes the third replicas
- Apache JIRA HDFS-503, in Hadoop 0.22.0

Outline

- Motivation
- DiskReduce Basic (apply RAID to HDFS)
- DiskReduce V2.0
  - Goal
  - Delayed Encoding
- Conclusion

Why DiskReduce V2.0?

- Goal: save more space with stronger codes
- Challenge: the simple search used in DiskReduce V1.0 to find feasible groups cannot be applied to stronger codes
- Solution: pre-determine the placement of blocks

Example: Block Placement

- A codeword can be drawn from any erasure code
- All data blocks in a codeword are created at one node
- Each new block is assigned to a codeword picked at random
[Diagram: data blocks D1-D3 replicated across nodes, with parity blocks P1 and P2]
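The placement rule above can be sketched as follows. This is a simplified illustration of the idea, not the prototype's code: `place_new_block`, the codeword bookkeeping, and the three-data-block codeword size are assumptions chosen to match the D1-D3, P1-P2 diagram.

```python
import random

DATA_PER_CODEWORD = 3   # D1..D3 per codeword, as in the diagram

def place_new_block(writer_node, open_codewords):
    """Assign a block written at writer_node to a pre-determined codeword.

    All data blocks of a codeword are created at one node, so the writer
    only considers codewords it owns, picking one with a free slot at
    random (a fresh codeword is opened when none has room)."""
    mine = open_codewords.setdefault(writer_node, [])
    candidates = [cw for cw in mine if len(cw) < DATA_PER_CODEWORD]
    if not candidates:
        candidates = [[]]
        mine.append(candidates[0])
    codeword = random.choice(candidates)
    codeword.append("block")
    return codeword

open_codewords = {}
for _ in range(7):                  # node 0 writes seven blocks
    place_new_block(0, open_codewords)
# 7 blocks at 3 data slots each -> 3 codewords (2 full, 1 partial)
```

Because membership is fixed at write time, the parity blocks for any erasure code can later be computed per codeword with no search for feasible groups.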

Testbed Prototype Evaluation

- 32 nodes: two quad-core 2.83GHz Xeons, 16GB memory, 4 x 7200rpm 1TB SATA disks, 10 Gigabit Ethernet
- Implementation: Hadoop/HDFS version 0.20.0; the encoding part is implemented, other parts are work in progress
- Test: each node writes a 16GB file into a DiskReduce-modified HDFS, 512GB of user data in total

DiskReduce v2.0 Prototype Works!

Writing 512GB of user data:
- RAID6 (group size 8): ~25% overhead
- RAID5 & Mirror (group size 8): ~113% overhead
Both schemes can achieve close to optimal.
[Chart: storage use (GB) over time (s) for both schemes]
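A back-of-envelope check makes the two measured overheads plausible, assuming a group of g = 8 data blocks: RAID5 + Mirror keeps one full mirror copy plus one parity block per group, while RAID6 keeps only two parity blocks per group with no mirror.

```python
g = 8   # data blocks per parity group (assumed from the slide)

raid5_mirror_overhead = 1.0 + 1.0 / g   # one mirror copy + 1 parity per g blocks
raid6_overhead = 2.0 / g                # two parity blocks per g blocks, no mirror

print(f"RAID5 + Mirror: {raid5_mirror_overhead:.1%}")  # 112.5%, close to the ~113% measured
print(f"RAID6:          {raid6_overhead:.1%}")         # 25.0%, matching the ~25% measured
```

The same arithmetic explains the earlier triplication figure: two extra full copies is exactly 200% overhead.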

Possible Performance Degradation

When does more than one copy help?
- Backup tasks: more data copies may help schedule a backup task on a node where it has a local copy
- Hot files: popular files may be read by many jobs at the same time
- Load balance and local assignment: with more data copies, the job tracker has more flexibility to assign tasks to nodes with a local data copy

Delayed Encoding

Encode blocks only when the extra copies are likely to yield little further benefit. For example, only blocks that have existed for at least one day can be encoded. How long should we delay?

Age of Block Accesses Distribution (Yahoo! M45)

~99% of data accesses happen within the first hour of a data block's lifetime.

How Long Should We Delay?

Fixed delay (e.g. 1 hour):
- Benefit: ~99% of data accesses get the benefit of multiple copies
- Cost: for a workload of continuous writing at 25MB/s per disk, ~90GB/hour per disk stays un-encoded, which is < 5% of each disk's capacity (for a 2TB disk)
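The cost side of this argument is simple arithmetic, worked out here under the slide's own assumptions (25MB/s continuous writes per disk, a one-hour delay, a 2TB disk):

```python
write_rate_mb_s = 25     # continuous write rate per disk (MB/s)
delay_s = 3600           # fixed encoding delay: one hour
disk_gb = 2 * 1024       # a 2TB disk

pending_gb = write_rate_mb_s * delay_s / 1024   # un-encoded data per disk
fraction = pending_gb / disk_gb                 # share of disk capacity

print(f"{pending_gb:.0f} GB pending per disk, {fraction:.1%} of capacity")
```

At these rates roughly 88GB per disk is still triplicated at any moment, under 5% of capacity, so even a worst-case write-only workload pays little for the one-hour delay.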

Conclusions and Future Work

- RAID can be applied to HDFS: Dhruba Borthakur of Facebook has implemented a variant of RAID5 + Mirror in HDFS
- RAID can bring overhead down from 200% to 25%
- Delayed encoding helps avoid performance degradation
- We are currently working on:
  - Reducing cleanup work for deletion
  - Analyzing additional traces