Distributed File Systems II

To do:
- Very-large scale: Google FS, Hadoop FS, BigTable
- Next time: Naming things

GFS: A radically new environment
NFS, etc.:
- Independence
- Small scale
- Variety of workloads
GFS:
- Cooperation
- Large scale
- Very specific, well-understood workloads

GFS environment
Why did Google build its own file system? Unique file system requirements:
- Huge volume of data
- Huge read/write bandwidth
- Reliability over tens of thousands of nodes with frequent failures (clusters built from commodity nodes)
- Mostly operating on large data blocks
- Needs efficient distributed operations
Google's unique position: it has control over, and customizes, its applications, libraries, operating system, networks, even its computers!

GFS workload
- Files are huge by traditional standards (GB, TB, PB): large files are >= 100 MB, and multi-GB files are common
- Most files are mutated by appending new data rather than overwriting existing data (e.g., what did you search for, which link did you follow, ...)
- Once written, files are only read, often sequentially (mining for patterns)
- Appending therefore becomes the focus of performance optimization and atomicity guarantees
- A conventional, if not standard, interface; some specialized operations (snapshot, record append)

GFS Design aims
- Maintain data and system availability
- Handle failures gracefully and transparently
- Low synchronization overhead between the entities of GFS
- Exploit the parallelism of numerous entities
- Ensure high sustained throughput, favoring it over low latency for individual reads/writes

GFS File layout
- Files are divided into fixed-size chunks (64 MB), each with an immutable, globally unique id
- Each chunk is replicated on multiple chunkservers for reliability
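
For concreteness, a byte offset in a file maps to a chunk index by integer division on the fixed 64 MB chunk size; the master then translates (file name, chunk index) into a chunk handle. A minimal sketch (function names are illustrative, not from GFS itself):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size

def chunk_index(offset: int) -> int:
    """Which chunk of the file a byte offset falls into."""
    return offset // CHUNK_SIZE

def chunk_span(offset: int, length: int) -> range:
    """All chunk indexes touched by a read/write of `length` bytes at `offset`."""
    first = chunk_index(offset)
    last = chunk_index(offset + length - 1)
    return range(first, last + 1)

# Example: a 150 MB read starting at byte 0 touches chunks 0, 1 and 2.
assert list(chunk_span(0, 150 * 1024 * 1024)) == [0, 1, 2]
```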

GFS Architecture
One master server, many chunkservers (100s-1000s)
Master maintains all FS metadata:
- File namespace
- File-to-chunk mappings
- Chunk location info
- Access control info
- Chunk version #s
Metadata is maintained persistently in a replicated operation log
The master also:
- Uses heartbeats to check on chunkservers
- Garbage collects orphaned chunks
- Migrates chunks between chunkservers
[Diagram: the application's client sends metadata requests/responses to the master and read/write requests/responses to chunkservers, which store chunks in a local Linux FS]
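
The metadata listed above can be pictured as a few in-memory maps on the master. The sketch below is illustrative only; the class and field names are hypothetical, and (as the fault-tolerance slide later notes) chunk locations are not persisted but re-learned from chunkservers:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    version: int                                         # chunk version number
    locations: list[str] = field(default_factory=list)   # chunkserver addresses (not persisted)

@dataclass
class MasterMetadata:
    # File namespace: path -> ordered list of chunk handles (persisted via the operation log).
    namespace: dict[str, list[int]] = field(default_factory=dict)
    # Chunk handle -> version number and current replica locations.
    chunks: dict[int, ChunkInfo] = field(default_factory=dict)

    def lookup(self, path: str, chunk_index: int) -> tuple[int, list[str]]:
        """Translate (file name, chunk index) into (chunk handle, replica locations)."""
        handle = self.namespace[path][chunk_index]
        return handle, self.chunks[handle].locations
```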

GFS Architecture: Chunkserver
- Stores 64 MB file chunks on local disk using a standard Linux filesystem, each with a version # and checksum
- Has no understanding of the overall file system; only deals with chunks
- Read/write requests specify a chunk handle and byte range
- Chunks are replicated on a configurable number of chunkservers (default: 3)
- No caching of file data (beyond the standard Linux buffer cache)
- Sends periodic heartbeats to the master

GFS Architecture: Client
- No file system interface at the operating-system level; a user-level API is provided
- Does not support all the features of POSIX file system access, but looks familiar (e.g., open, close, read, ...)
- Two special operations:
  - Snapshot: an efficient way of creating a copy of the current instance of a file or directory tree
  - Record append: allows a client to append data to a file as an atomic operation without having to lock the file; multiple processes can append to the same file concurrently without fear of overwriting one another's data

Read algorithm
The access request is translated by the GFS client; the client asks the master (via RPC) for the chunk handle and replica locations (this info is cached at the client), then gets the data (via RPC) from one of the replicas:
1. Application issues a read (file name, byte range) to the GFS client
2. Client sends (file name, chunk index) to the master
3. Master replies with (chunk handle, replica locations)
4. Client sends (chunk handle, byte range) to one of the chunkservers
5. Chunkserver returns the data
6. Client returns the data from the file to the application
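
A minimal client-side sketch of this read path, assuming hypothetical master_rpc / chunkserver_rpc stubs; the essential points are the (file name, chunk index) request to the master, caching of its reply, and fetching the byte range from any replica (single-chunk reads only, for brevity):

```python
CHUNK_SIZE = 64 * 1024 * 1024

class GFSClientSketch:
    def __init__(self, master_rpc, chunkserver_rpc):
        self.master = master_rpc             # hypothetical RPC stub to the master
        self.chunkservers = chunkserver_rpc  # hypothetical RPC stub to chunkservers
        self.cache = {}                      # (path, chunk index) -> (handle, replica locations)

    def read(self, path: str, offset: int, length: int) -> bytes:
        index = offset // CHUNK_SIZE
        if (path, index) not in self.cache:
            # Steps 1-3: ask the master once, then reuse the cached answer.
            self.cache[(path, index)] = self.master.find_chunk(path, index)
        handle, replicas = self.cache[(path, index)]
        # Steps 4-6: read the byte range (within the chunk) from one replica.
        start = offset % CHUNK_SIZE
        return self.chunkservers.read(replicas[0], handle, start, length)
```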

Write algorithm
The master grants a chunk lease to one replica; that replica is called the primary. Leases expire in 60 seconds; the primary can request extensions, and the master can take the lease back.
1. Application issues a write (file name, data) to the GFS client
2. Client sends (file name, chunk index) to the master
3. Master replies with (chunk handle, primary & replica locations)
4. Client pushes the data to all replicas (primary and secondaries buffer it); once all replicas ACK, the client sends the write request to the primary

Write algorithm (continued)
5. Client sends the write command to the primary
6. The primary picks a serial order for the mutations to the chunk, applies them to its own replica, and forwards the write command with that serial order to the secondaries
7. The secondaries apply the same mutations in the same order to their chunks
8. The secondaries respond to the primary when done
9. When all ACKs arrive, the primary reports the result to the client
Similar to passive replication, but optimized for large data
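
A sketch of the primary's role in steps 5-9, with hypothetical names: it assigns a serial number to each mutation, applies it locally, forwards the same serial order to the secondaries, and reports success only when every replica has acknowledged:

```python
class PrimaryReplicaSketch:
    def __init__(self, secondaries):
        self.secondaries = secondaries  # hypothetical RPC stubs to the secondary replicas
        self.next_serial = 0
        self.applied = {}               # serial number -> mutation applied locally

    def apply_write(self, mutation) -> bool:
        # Step 6: the primary picks the order for all mutations to the chunk.
        serial = self.next_serial
        self.next_serial += 1
        self.applied[serial] = mutation                       # apply to own replica
        # Steps 6-8: secondaries apply the same mutation at the same serial position.
        acks = [s.apply(serial, mutation) for s in self.secondaries]
        # Step 9: success is reported only if every replica succeeded;
        # otherwise the client retries and the region stays inconsistent until then.
        return all(acks)
```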

GFS record append
Google uses large files as queues between multiple producers and consumers
Same control flow as for writes, except:
- The client pushes data to the replicas of the last chunk of the file
- The client then sends the request to the primary
Common case: the request fits in the current last chunk:
- The primary appends the data to its own replica
- The primary tells the secondaries to do the same, at the same byte offset in theirs
- The primary replies with success to the client

GFS record append (continued)
When the data won't fit in the last chunk:
- The primary fills the current chunk with padding
- The primary instructs the other replicas to do the same
- The primary replies to the client: retry on the next chunk
If a record append fails at any replica, the client retries the operation, so replicas of the same chunk may contain different data, even duplicates of all or part of a record's data
What guarantee does GFS provide on success? The data is written at least once as an atomic unit
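
A sketch of the primary-side decision just described (hypothetical and simplified to a single in-memory chunk): append at a primary-chosen offset if the record fits, otherwise pad the chunk and tell the client to retry on the next one. Retries after partial failures are what produce the at-least-once, possibly duplicated, records:

```python
CHUNK_SIZE = 64 * 1024 * 1024

def record_append(primary_chunk: bytearray, record: bytes):
    """Hypothetical primary-side logic; returns (status, offset)."""
    if len(primary_chunk) + len(record) <= CHUNK_SIZE:
        offset = len(primary_chunk)       # the primary picks the offset
        primary_chunk.extend(record)      # secondaries are told to write at the same offset
        return "ok", offset
    # Record does not fit: pad the chunk so every replica ends at the same size,
    # then make the client retry the append on the next chunk.
    primary_chunk.extend(b"\x00" * (CHUNK_SIZE - len(primary_chunk)))
    return "retry_on_next_chunk", None
```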

GFS Limitations
- Security? Trusted environment, trusted users
- The master is the biggest impediment to scaling:
  - Performance bottleneck; holds all data structures in memory
  - Takes a long time to rebuild metadata
  - Most vulnerable point for reliability
  - Workaround: run systems with multiple master nodes, all sharing the same set of chunkservers, but then there is no uniform name space
- Large chunk size: can't afford to make it smaller, since that would create more work for the master

Fault tolerance
- Fast recovery: the master and chunkservers are designed to restart and restore their state in seconds
- No persistent log of chunk locations in the master (locations are re-learned from the chunkservers)
- Chunks are replicated across multiple machines and racks
- The master's data structures are kept in memory, so it must be able to recover from system failure:
  - A log of all changes made to metadata; checkpoints of the state when the log grows too big
  - The log and the latest checkpoint are used to recover the state
  - The log and checkpoints are replicated on multiple machines
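
The checkpoint-plus-log recovery can be sketched as: load the latest checkpoint of the metadata, then replay every log record written after it. A toy version, with a made-up JSON-lines log format (not GFS's actual on-disk format):

```python
import json

def recover_metadata(checkpoint_path: str, log_path: str) -> dict:
    """Rebuild master metadata: latest checkpoint + replay of the operation log."""
    with open(checkpoint_path) as f:
        metadata = json.load(f)              # snapshot of namespace / chunk mappings
    with open(log_path) as f:
        for line in f:                       # each line: one logged metadata mutation
            op = json.loads(line)
            if op["type"] == "create":
                metadata[op["path"]] = []
            elif op["type"] == "add_chunk":
                metadata[op["path"]].append(op["handle"])
            elif op["type"] == "delete":
                metadata.pop(op["path"], None)
    return metadata
```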

GFS Summary
Success: used actively by Google
- Availability and recoverability on cheap hardware
- High throughput by decoupling control and data
- Supports massive data sets and concurrent appends
Semantics not transparent to apps
- Apps must verify file contents to cope with inconsistent regions and repeated appends (at-least-once semantics)
Performance not good for all apps
- Assumes a read-once, write-once workload (so no client caching!)
Replaced in 2010 by Colossus
- Eliminates the master node as a single point of failure
- Targets latency problems and more latency-sensitive applications
- Reduces the block size to between 1 and 8 MB
- Few details are public

Hadoop Distributed File System (HDFS)
Apache Hadoop:
- A software framework for distributed storage and processing of big data sets using the MapReduce programming model
- Key to it: HDFS
- Hadoop splits files into blocks, distributes them across nodes, and transfers packaged code to the nodes to process the data in parallel (data locality)
- Hadoop's MapReduce and HDFS are inspired by Google's MapReduce and GFS
HDFS:
- A portable file system, not POSIX-compliant
- Provides shell commands and a Java API similar to other file systems
- Can be mounted using FUSE

HDFS [diagram]

GFS vs. HDFS
- Master (GFS) vs. NameNode (HDFS)
- chunkserver vs. DataNode
- operation log vs. journal (edit log)
- chunk vs. block
- random file writes possible vs. only appends possible
- multiple-writer, multiple-reader model vs. single-writer, multiple-reader model
- a chunk is divided into 64 KB pieces, each with a 32-bit checksum vs. per HDFS block, two files are created on a DataNode: a data file and a metadata file (checksums, timestamp)
- default block size: 64 MB vs. default block size: 128 MB

Bigtable
- Distributed storage (not a file system) for structured data
- Designed to scale to petabytes of data stored across thousands of commodity servers
  - 450,000 machines (NYTimes estimate, June 2006)
- Example users: Google Earth, Google Analytics, Google Finance, Personalized Search, ...
- Built on:
  - Scheduler (Google WorkQueue)
  - Google File System
  - Chubby lock service: a {lock/file/name} service with coarse-grained locks; can store a small amount of data in a lock

Data model: a big map
- A <row, column, timestamp> triple is the key; each value is an uninterpreted array of bytes
- Arbitrary columns on a row-by-row basis
- Columns are named family:qualifier; a family is heavyweight, a qualifier lightweight
- Column-oriented physical store; rows are sparse!
- Lookup, insert, delete API
- Each read or write of data under a single row key is atomic
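
The <row, column, timestamp> → bytes model can be pictured as a nested, sorted map. The toy class below is illustrative only (Bigtable's real API is C++, as the next slide notes); it shows column keys written as family:qualifier, per-row writes, and newest-timestamp-wins reads, using the webtable-style row and column names from the Bigtable paper:

```python
import time
from collections import defaultdict

class TinyBigtableModel:
    """Row key -> column ('family:qualifier') -> timestamp -> uninterpreted bytes."""
    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row: str, column: str, value: bytes, ts: int | None = None):
        ts = ts if ts is not None else int(time.time() * 1e6)
        self.rows[row][column][ts] = value   # a write under a single row key is atomic in Bigtable

    def get(self, row: str, column: str) -> bytes:
        versions = self.rows[row][column]
        return versions[max(versions)]       # newest timestamp wins

t = TinyBigtableModel()
t.put("com.cnn.www", "contents:", b"<html>...")
t.put("com.cnn.www", "anchor:cnnsi.com", b"CNN")
assert t.get("com.cnn.www", "anchor:cnnsi.com") == b"CNN"
```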

Bigtable is not...
Structured data, but not a DHT:
- Not addressing the same problems as DHTs: churn, variable bandwidth, untrusted participants
- Key-value pairs are useful but too limiting
Nor a database:
- No table-wide integrity constraints
- No multi-row transactions
- Uninterpreted values: no aggregation over data
- Can specify: keep the last N versions or the last N days
- C++ functions, not SQL (no complex queries)
- Clients indicate what data to cache in memory

Tables, tablets and SSTables
- Bigtable keeps data in lexicographic order by row key
- The row range of a table is dynamically partitioned; each row range is called a tablet
  - The unit of distribution and load balancing
- Clients can exploit this by selecting their row keys for good locality, e.g., maps.google.com/index.html is stored under the key com.google.maps/index.html (see the sketch below)
- A tablet is built out of multiple, possibly shared, SSTables
[Diagram: a tablet, e.g. start key aardvark, end key apple, is made up of several SSTables, each consisting of 64 KB blocks plus an index]
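
The locality trick in the row-key bullet above is simply reversing the hostname part of a URL, so that pages from the same domain sort adjacently and tend to fall into the same tablet. A small sketch (the helper name is made up):

```python
def bigtable_row_key(url: str) -> str:
    """Reverse the hostname so keys from one domain are lexicographically adjacent."""
    host, _, path = url.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return f"{reversed_host}/{path}"

assert bigtable_row_key("maps.google.com/index.html") == "com.google.maps/index.html"
assert bigtable_row_key("mail.google.com/inbox") == "com.google.mail/inbox"
```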

SSTable
- An immutable, sorted file of key-value pairs
- Chunks of data plus an index
  - The index is of block ranges, not values
  - The index is loaded into memory when the SSTable is opened
- Lookup is a single disk seek
- Alternatively, a client can load the whole SSTable into memory
[Diagram: an SSTable is a sequence of 64 KB blocks followed by an index]
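
A toy, in-memory SSTable showing why a lookup costs a single disk seek: the small block index stays in memory, a binary search on it selects the one block that could hold the key, and only that block is then read and scanned. Blocks here are just short Python lists standing in for 64 KB on-disk blocks:

```python
import bisect

class ToySSTable:
    """Immutable, sorted key-value pairs split into fixed-size blocks with an in-memory index."""
    def __init__(self, sorted_items, block_size=4):
        # `sorted_items`: a list of (key, value) pairs already sorted by key (SSTables are immutable).
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        self.index = [block[0][0] for block in self.blocks]   # first key of each block

    def lookup(self, key):
        # Binary-search the in-memory index, then read ("seek to") exactly one block.
        pos = bisect.bisect_right(self.index, key) - 1
        if pos < 0:
            return None
        for k, v in self.blocks[pos]:
            if k == key:
                return v
        return None

sst = ToySSTable([("a", 1), ("b", 2), ("d", 3), ("f", 4), ("g", 5)], block_size=2)
assert sst.lookup("d") == 3 and sst.lookup("c") is None
```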

Servers
- Tablet servers manage tablets, multiple tablets per server; each tablet is 100-200 MB
  - Each tablet lives at only one server
  - A tablet server splits tablets that get too big
- The master is responsible for load balancing and fault tolerance
  - Uses Chubby to monitor the health of tablet servers, restarts failed servers
  - GFS replicates the data; prefer to start a tablet server on the same machine where the data already is

Editing/Reading a table
- Mutations are first committed to a commit log (stored in GFS), then applied to an in-memory version (the memtable)
- For concurrency, each memtable row is copy-on-write
- Reads are applied to a merged view of the SSTables & the memtable
- Reads & writes continue during a tablet split or merge
[Diagram: a tablet's state is a sorted memtable of recent inserts/deletes layered on top of several sorted SSTables]
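
A sketch of this write/read path (names and formats are made up): every mutation is appended to the commit log before being applied to the in-memory memtable, and a read consults the memtable first, then the SSTables from newest to oldest:

```python
import io

class TabletSketch:
    def __init__(self, log_file, sstables):
        self.log = log_file            # append-only commit log (kept in GFS in the real system)
        self.memtable = {}             # in-memory map of the most recent mutations
        self.sstables = sstables       # older data, newest first (e.g. ToySSTable-like objects)

    def write(self, key, value):
        self.log.write(f"{key}\t{value}\n")   # 1: commit the mutation to the log
        self.log.flush()
        self.memtable[key] = value            # 2: apply it to the memtable

    def read(self, key):
        if key in self.memtable:              # merged view: the memtable shadows the SSTables
            return self.memtable[key]
        for sst in self.sstables:             # then SSTables, newest to oldest
            value = sst.lookup(key)
            if value is not None:
                return value
        return None

t = TabletSketch(io.StringIO(), sstables=[])
t.write("apple", "red")
assert t.read("apple") == "red"
```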

Compactions
- Minor compaction: convert a full memtable into an SSTable and start a new memtable
  - Reduces memory usage
  - Reduces log traffic on restart
- Merging compaction: reduce the number of SSTables
  - A good place to apply policies such as "keep only N versions"
- Major compaction: a merging compaction that results in only one SSTable
  - No deletion records, only live data
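
Continuing the toy model above (ToySSTable and TabletSketch), a minor compaction freezes the full memtable into a new immutable, sorted SSTable and starts a fresh memtable; a sketch only:

```python
def minor_compaction(tablet: "TabletSketch") -> None:
    """Convert the full memtable into an SSTable and start a new, empty memtable."""
    frozen = sorted(tablet.memtable.items())        # SSTables are sorted by row key
    tablet.sstables.insert(0, ToySSTable(frozen))   # the newest SSTable is consulted first on reads
    tablet.memtable = {}                            # memory is freed; the commit log can be
                                                    # truncated once the SSTable is safely written
```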

Finding a tablet
- A three-level hierarchy stores tablet location information (a root metadata tablet points to other metadata tablets, which in turn point to user tablets)
- The client library caches tablet locations
- The metadata also includes a log of all events pertaining to each tablet

Summary
GFS / HDFS:
- Data-center-customized API and optimizations
- Append-focused DFS
- Separate control (file system) and data (chunks)
- Replication and locality
- Rough consistency → apps handle the rest
Bigtable:
- Specialized storage rather than a file system
- Value simple designs