HARDFS: Hardening HDFS with Selective and Lightweight Versioning


HARDFS: Hardening HDFS with Selective and Lightweight Versioning
Thanh Do, Tyler Harter, Yingchao Liu, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau; Haryadi S. Gunawi 1

Cloud Reliability
- Cloud systems: complex software on thousands of commodity machines; rare failures become frequent [Hamilton]
- Failure detection and recovery has to come from the software [Dean] and must be a first-class operation [Ramakrishnan et al.] 2

Fail-stop failures
- Machine crashes, disk failures
- Pretty much handled: current systems have sophisticated crash-recovery machinery (data replication, logging, fail-over) 3

Fail-silent failures
- The system exhibits incorrect behaviors instead of crashing
- Caused by memory corruption or software bugs
- Crash recovery is useless if the fault can spread (e.g., from the master to the workers) 4

Fail-silent failure headlines 5

Current approaches
- Replicated state machine using a BFT library
- N-version programming (Ver. 1, Ver. 2, and Ver. 3 must agree)
- Drawbacks: high resource consumption, high engineering effort, rare deployment 6

Selective and Lightweight Versioning (SLEEVE)
[Figure: master reloads state from trusted sources during reboot; workers]
- A 2nd version models the basic protocols of the system
- Detects and isolates fail-silent behaviors
- Exploits crash-recovery machinery for recovery 7

Selective and lightweight versioning (SLEEVE)
- Selective (goal: small engineering effort): protects important parts that are bug-sensitive, frequently changed, or currently unprotected
- Lightweight: avoids replicating full state; encodes state into a compact bit array to reduce space 8

HARDFS
- HARDFS: a hardened version of HDFS covering namespace management, replica management, and the read/write protocol
- HARDFS detects and recovers from: 90% of the faults caused by random memory corruption, 100% of the faults caused by targeted memory corruption, and 5 injected software bugs
- Fast recovery using micro-recovery: 3 orders of magnitude faster than a full reboot
- Little space and performance overhead 9

Outline
✓ Introduction
- HARDFS Design
- HARDFS Implementation
- Evaluation
- Conclusion 10

Case study: namespace integrity
[Figure: a NameNode handling a client's Create(F) request under three scenarios: normal operation, corrupted HDFS (incorrect behavior on the exists(F) check), and HARDFS (state reloaded from a trusted source)] 11

SLEEVE layer components
- Interposition module
- State manager
- Action verifier
- Recovery module 12

State manager
- Replicates a subset of the main version's state (e.g., directory entries without modification time)
- Adds new state incrementally (e.g., permissions for security checks)
- Understands the semantics of various protocol messages and thread events to update state correctly
- Compresses state using a compact encoding 13

Naïve: full replication (100% memory overhead)
- The HDFS master manages millions of files
- 100% memory overhead reduces HDFS master scalability [;login: '11] 14

Lightweight: Counting Bloom Filters
- Space-efficient data structure
- Supports 3 APIs: insert("a fact"), delete("a fact"), exists("a fact") 15

Lightweight: Counting Bloom Filters
[Figure: after insert("F is 10 bytes"), a check of exists("F is 5 bytes") returns NO, so a disagreement is detected]
- Suitable for boolean checking: Does F exist? Does F have length X? Has block B been allocated? 16
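A minimal Java sketch of such a counting Bloom filter over string "facts", supporting the three APIs named on the slides; the sizing, hashing, and class name are illustrative assumptions, not the paper's actual implementation:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Counting Bloom filter: small counters instead of single bits, so delete() is possible.
public class CountingBloomFilter {
    private final byte[] counters;
    private final int numHashes;

    public CountingBloomFilter(int numCounters, int numHashes) {
        this.counters = new byte[numCounters];
        this.numHashes = numHashes;
    }

    // Derive the i-th slot index from two base hashes (double hashing); a sketch, not a tuned scheme.
    private int slot(String fact, int i) {
        int h1 = fact.hashCode();
        CRC32 crc = new CRC32();
        crc.update(fact.getBytes(StandardCharsets.UTF_8));
        int h2 = (int) crc.getValue();
        return Math.floorMod(h1 + i * h2, counters.length);
    }

    public void insert(String fact) {
        for (int i = 0; i < numHashes; i++) counters[slot(fact, i)]++;
    }

    public void delete(String fact) {
        for (int i = 0; i < numHashes; i++) {
            int s = slot(fact, i);
            if (counters[s] > 0) counters[s]--;
        }
    }

    public boolean exists(String fact) {
        for (int i = 0; i < numHashes; i++) {
            if (counters[slot(fact, i)] == 0) return false;  // definitely absent
        }
        return true;  // present, with a small false-positive probability
    }

    public static void main(String[] args) {
        CountingBloomFilter bf = new CountingBloomFilter(1 << 20, 3);
        bf.insert("F is 10 bytes");
        System.out.println(bf.exists("F is 10 bytes")); // expected: true
        System.out.println(bf.exists("F is 5 bytes"));  // expected: false -> disagreement detected
    }
}
```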

Challenges of using Counting Bloom Filters
- Hard to check a stateful system
- False positives 17

Non-boolean verification
- Applying the update "F is 20 bytes" would require X = returnSize(F); delete(F:X); insert(F:20), but the Bloom filter does not support a returnSize API
[Figure: state before (F:10) and after (F:20)] 18

Non-boolean verification: Ask-Then-Check
To apply "F is 20 bytes":
  X ← MainVersion.returnSize(F);
  IF exists(F:X) THEN delete(F:X); insert(F:20);
  ELSE initiate recovery;
[Figure: state before (F:10) and after (F:20)] 19
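A hedged Java sketch of the ask-then-check pattern, reusing the CountingBloomFilter sketch above; MainVersion, applySizeUpdate, and initiateRecovery are illustrative names standing in for HARDFS internals:

```java
// The second version asks the (untrusted) main version for a value it does not store,
// then checks that answer against its own encoded facts before acting on it.
public class AskThenCheck {
    interface MainVersion { long returnSize(String file); }   // hypothetical query into HDFS state

    private final CountingBloomFilter bf;
    private final MainVersion mainVersion;

    AskThenCheck(CountingBloomFilter bf, MainVersion mainVersion) {
        this.bf = bf;
        this.mainVersion = mainVersion;
    }

    void applySizeUpdate(String file, long newSize) {
        long x = mainVersion.returnSize(file);               // ask
        String oldFact = file + " is " + x + " bytes";
        if (bf.exists(oldFact)) {                            // then check
            bf.delete(oldFact);
            bf.insert(file + " is " + newSize + " bytes");
        } else {
            initiateRecovery();                              // answers disagree: recover
        }
    }

    void initiateRecovery() {
        // Placeholder: in HARDFS this would trigger micro-recovery or a reboot.
        throw new IllegalStateException("main and second versions disagree");
    }
}
```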

Stateful verification
[Figure: Ask-Then-Check (checking stateful systems) is layered on top of the Bloom filter (boolean verification)] 20

Dealing with false positives
- Bloom filters can give false positives: about 4 per billion, roughly 1 false positive per month (given 100 op/s)
- A false positive only leads to unnecessary recovery (reloading state from the trusted source) 21

Outline
✓ Introduction
- HARDFS Design
  ✓ Lightweight
  - Selective
  - Recovery
- HARDFS Implementation
- Evaluation
- Conclusion 22

Selective Checks
- Goal: small engineering effort
- Selectively chooses what to protect in the namespace; excludes security checks
- Example check on create(F):
  X ← mainVersion.exists(F);
  Y ← bloomFilter.exists(F);
  IF X != Y THEN handleDisagreement();
[Figure: a client's create(F) flows through the HDFS master and operation log; the disagreement is detected before txCreate(F)] 23
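A minimal Java sketch of this selective check, reusing the CountingBloomFilter sketch above; the MainVersion view, the fact encoding, and handleDisagreement are assumptions for illustration, not HDFS's real internals:

```java
// Compare the main version's answer with the second version's encoded fact on create(F).
public class CreateCheck {
    interface MainVersion { boolean exists(String path); }    // hypothetical view of HDFS state

    boolean verifyCreate(String path, MainVersion main, CountingBloomFilter bf) {
        boolean x = main.exists(path);                        // main version's answer
        boolean y = bf.exists("file exists: " + path);        // second version's encoded fact
        if (x != y) {
            handleDisagreement(path);                         // recovery decides who is right
            return false;
        }
        return true;
    }

    void handleDisagreement(String path) {
        // Placeholder: HARDFS would use domain knowledge, micro-recovery, or a reboot.
        System.err.println("Disagreement detected on " + path);
    }
}
```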

Incorrect action examples
- Normal correct action: Create(F) → txCreate(F)
- Corrupt action: Create(F) → reject
- Missing action: Create(F) → no action
- Orphan action: txCreate(F) without a matching request
- Out-of-order action: Create(D/F); Mkdir(D) → txCreate(D/F); txMkdir(D)
All of these happen in practice 24

Action verifier
- A set of micro-checks to detect incorrect actions of the main version
- Mechanisms: expected-action list, action dependency checking, timeouts, and domain knowledge to handle disagreements 25
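A hedged Java sketch of one such mechanism, an expected-action list with a timeout; the action strings, timeout value, and method names are illustrative assumptions, not HARDFS's code:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// When the second version processes a request, it records the action it expects the
// main version to emit; observed actions are matched against this list, flagging
// orphan actions (never expected) and missing actions (expected but timed out).
public class ExpectedActionList {
    private static final long TIMEOUT_MS = 5_000;             // assumed bound
    private final Map<String, Long> expected = new ConcurrentHashMap<>();

    // Called after the second version decides what the main version should do.
    public void expect(String action) {                       // e.g., "txCreate(/dir/F)"
        expected.put(action, System.currentTimeMillis());
    }

    // Called when the main version's action is observed (e.g., an edit-log record).
    // Returns false for an orphan action that was never expected.
    public boolean observe(String action) {
        return expected.remove(action) != null;
    }

    // Periodically called: any expected action that never showed up is "missing".
    public void checkTimeouts() {
        long now = System.currentTimeMillis();
        for (Iterator<Map.Entry<String, Long>> it = expected.entrySet().iterator(); it.hasNext(); ) {
            Map.Entry<String, Long> e = it.next();
            if (now - e.getValue() > TIMEOUT_MS) {
                it.remove();
                System.err.println("Missing action detected: " + e.getKey());
            }
        }
    }
}
```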

Outline
✓ Introduction
- HARDFS Design
  ✓ Lightweight
  ✓ Selective
  - Recovery
- HARDFS Implementation
- Evaluation
- Conclusion 26

Recovery
[Figure: master reloads state from trusted sources during reboot; workers]
- Crashing is good, provided there is no fault propagation
- HARDFS detects bad behaviors and turns them into crashes
- Exploits HDFS's crash-recovery machinery 27

HARDFS Recovery
- Full recovery (crash and reboot)
- Micro-recovery: repairing the main version, repairing the 2nd version 28

Crash and Reboot
- Full state is reconstructed from trusted sources
- Full recovery may be expensive: restarting an HDFS master could take hours 29

Micro-recovery
- Repairs only the corrupted state, using trusted sources
- Falls back to a full reboot when micro-recovery fails 30
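A minimal sketch of this fallback policy; microRecover and crashAndReboot are hypothetical helpers standing in for HARDFS internals:

```java
// Try micro-recovery first; fall back to a full crash-and-reboot if it fails.
public class RecoveryPolicy {
    void recover(String corruptedItem) {
        try {
            microRecover(corruptedItem);     // repair only the corrupted state from a trusted source
        } catch (RuntimeException e) {
            crashAndReboot();                // full recovery: reconstruct all state on reboot
        }
    }

    void microRecover(String item) { /* e.g., re-read the item from the checkpoint file */ }

    void crashAndReboot() { /* e.g., exit so HDFS's existing restart machinery takes over */ }
}
```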

Repairing the main version
[Figure: the main version holds F:200 while the 2nd version holds F:100; a direct update F:200 ← F:100 repairs the main version. Trusted source: checkpoint file] 31

Repairing the 2nd version
[Figure: the main version holds F:100 while the 2nd version's Bloom filter holds F:200. Trusted source: checkpoint file]
- Must: 1. delete("F is 200 bytes"), 2. insert("F is 100 bytes"); but a Bloom filter cannot be enumerated to find the bad fact
- Solution: 1. start with an empty Bloom filter, 2. add facts back as they are verified 32
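A hedged Java sketch of this rebuild, reusing the CountingBloomFilter sketch above; the Checkpoint interface and its verifiedFacts() method are assumptions standing in for the trusted source:

```java
import java.util.List;

// Rather than deleting an unknown corrupted fact, rebuild the second version's
// Bloom filter from scratch using facts verified against a trusted checkpoint.
public class SecondVersionRepair {
    interface Checkpoint { List<String> verifiedFacts(); }    // hypothetical trusted source

    public CountingBloomFilter rebuild(Checkpoint checkpoint) {
        // 1. Start with an empty Bloom filter.
        CountingBloomFilter fresh = new CountingBloomFilter(1 << 20, 3);
        // 2. Add facts back as they are verified against the trusted source.
        for (String fact : checkpoint.verifiedFacts()) {
            fresh.insert(fact);                               // e.g., "F is 100 bytes"
        }
        return fresh;
    }
}
```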

Outline
✓ Introduction
✓ HARDFS Design
- HARDFS Implementation
- Evaluation
- Conclusion 33

Implementation
- Hardens three functionalities of HDFS: namespace management (HARDFS-N), replica management (HARDFS-R), and the read/write protocol of datanodes (HARDFS-D)
- Uses the 3 Bloom filter APIs: insert("a fact"), delete("a fact"), exists("a fact")
- Uses ask-then-check for non-boolean verification 34

Protecting namespace integrity
- Guards namespace structures necessary for reaching data: file hierarchy, file-to-block mapping, file length information
- Detects and recovers from namespace-related problems: corrupt file-to-block mappings, unreachable files 35

Namespace management: messages and the logic of the secondary version
- Create(F): client requests the NN to create F
  Entry: IF exists(F) THEN reject; ELSE insert(F); generateAction(txCreate[F]);
  Return: check response;
- AddBlock(F): client requests the NN to allocate a block to file F
  Entry: F:X = ask-then-check(F);
  Return: B = addBlk(F); IF exists(F) AND !exists(B) THEN X' = X ∪ {B}; delete(F:X); insert(F:X'); insert(B@0); ELSE declare error; 36
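A hedged Java sketch of the Create(F) entry logic above, reusing the CountingBloomFilter and ExpectedActionList sketches; the fact encoding and method names are illustrative assumptions rather than HDFS identifiers:

```java
// Secondary-version logic for Create(F): record that F now exists and note the
// edit-log action the main version is expected to emit.
public class CreateHandler {
    private final CountingBloomFilter bf;
    private final ExpectedActionList expected;

    CreateHandler(CountingBloomFilter bf, ExpectedActionList expected) {
        this.bf = bf;
        this.expected = expected;
    }

    // Entry point: called when a Create(F) request is interposed.
    boolean onCreate(String path) {
        String fact = "file exists: " + path;
        if (bf.exists(fact)) {
            return false;                               // already exists: the main version should reject
        }
        bf.insert(fact);                                // record the new fact in the second version
        expected.expect("txCreate(" + path + ")");      // the main version should log this action
        return true;
    }

    // On return, the main version's response is checked against this decision.
}
```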

Outline
✓ Introduction
✓ HARDFS Design
✓ HARDFS Implementation
- Evaluation and Conclusion 37

Evaluation
- Is HARDFS robust against fail-silent faults?
- How much time and space overhead is incurred?
- Is micro-recovery efficient?
- How much engineering effort is required? 38

Random memory corruption results

Outcome                  | HDFS | HARDFS
Silent failure           | 117  | 9
Detect and reboot        | -    | 140
Detect and micro-recover | -    | 107
Crash                    | 133  | 268
Hang                     | 22   | 16
No problem observed      | 728  | 460

- The number of fail-silent failures is reduced by a factor of 10
- Crashes happen twice as often 39

Silent failures 40

Field             | HDFS | HARDFS
pathname          | 95   | 0
replication       | 1    | 0
modification time | 6    | 8
permission        | 3    | 0
block size        | 12   | 1

Namespace management space overhead
[Figure: memory allocated (MB) vs. file system size (200K-1000K files) for HDFS, HARDFS with concrete state, and HARDFS with Bloom filters]
HARDFS with Bloom filters incurs little space overhead (2.6%) 41

Recovery time
[Figure: recovery time (seconds) vs. file system size (200K-1000K files) for reboot, micro-recovery, and optimized micro-recovery]
Rebooting the NameNode is expensive; micro-recovery is 3 orders of magnitude faster 42

Complexity (LOC)

Functionality        | HDFS  | HARDFS | HARDFS/HDFS
Namespace management | 10114 | 1751   | 17%
Replica management   | 2342  | 934    | 40%
Read/write protocol  | 5050  | 944    | 19%
Others               | 13339 | -      | -

Lightweight versions are smaller 43

Injecting software bugs 44

Bug         | Year | Priority | Description
HADOOP-1135 | 2007 | Major    | Blocks in block report wrongly marked for deletion
HADOOP-3002 | 2008 | Blocker  | Blocks removed during safemode
HDFS-900    | 2010 | Blocker  | Valid replica deleted rather than corrupt replica
HDFS-1250   | 2010 | Major    | Namenode processes block report from dead datanode
HDFS-3087   | 2012 | Critical | Decommission before replication during namenode restart

Conclusion
- Crashing is good
- To die (and be reborn) is better than to lie
- But lies do happen in reality
- HARDFS turns lies into crashes
- Leverages existing crash-recovery techniques to resurrect 45

Thank you! Questions? http://research.cs.wisc.edu/adsl/ http://wisdom.cs.wisc.edu/ http://ucare.cs.uchicago.edu/ 46