DISTRIBUTED FILE SYSTEMS CARSTEN WEINHOLD


Department of Computer Science Institute of System Architecture, Operating Systems Group DISTRIBUTED FILE SYSTEMS CARSTEN WEINHOLD

OUTLINE Classical distributed file systems NFS: Sun Network File System AFS: Andrew File System Parallel distributed file systems Case study: The Google File System Scalability Fault tolerance Other approaches 2

DFS ARCHITECTURE [Diagram: client machines run an Application on top of a DFS client; requests travel over the network to a File Server running the DFS server on top of a local FS.] Simplified idea: forward all file system operations to the server via network RPC. 3

DFS ARCHITECTURE [Diagram: the same setup, with the file server backed by a local FS and a storage array.] Multiple servers are possible, but cooperation and consistency become difficult. 4

SUN NFS (V2, V3)
API: As close to UNIX as possible
Names/Lookup: Message to file server for each path element
Open/Close: Unique NFS handle per file, no state on server
Read/Write: Messages to read/write blocks, small block size
Caching (client): Metadata (e.g., NFS handle), data blocks
Consistency: Consistency messages exchanged regularly, clients might see stale data/metadata
Replication: Multiple read-only servers (if synced manually)
Fault Handling: Write-through on server (v2), idempotent client writes, clients block if server crashed 5

ANDREW FILE SYSTEM
API: As close to UNIX as possible
Names/Lookup: Name resolution on client, uses dir caches
Open/Close: Local file, might need to transmit from/to server
Read/Write: Local file, but some work in open/close phase
Caching (client): Complete files, LRU replacement if needed
Consistency: Callback promises: server informs client if another client wants to modify a cached file
Replication: Pool of servers, may improve performance
Fault Handling: Some (e.g., client can still access files in local cache if network or servers fail) 6

DFS SCALABILITY Work well for home directories (e.g., AFS) POSIX consistency causes complexity: Cache coherency traffic (e.g., AFS callbacks) Write semantics (e.g., may need distributed locks for concurrent writes to same file) One-to-one mapping: File in DFS is file on server (higher load?) Servers must cache both metadata+data 7

BIG DATA DEMANDS Scientific computing: Approximately 1 GB/s of data generated at the Worldwide LHC Computing Grid. This is after two filtering stages... [3] (ATLAS Experiment 2012, CERN) Social media: Facebook serves over one million images per second at peak. [...] our previous approach [...] leveraged network attached storage appliances over NFS. Our key observation is that this traditional design incurs an excessive number of disk operations because of metadata lookups. [4] 8

OUTLINE Classical distributed file systems NFS: Sun Network File System AFS: Andrew File System Parallel distributed file systems Case study: The Google File System Scalability Fault tolerance Other approaches 9

DFS ARCHITECTURE [Diagram repeated: two client machines with DFS clients reach one file server (DFS server, local FS, storage array) over the network.] 10

DFS ARCHITECTURE [Diagram: the same picture with the network boxes removed, leading up to the parallel file system architecture on the next slide.] 11

PARALLEL FILE SYSTEMS [Diagram: clients send metadata operations to a dedicated metadata server (DFS metadata: coordination, consistency, heartbeats), while data transfers go directly between the DFS clients and data-only file servers backed by their local FSes.] 12

LARGER DESIGN SPACE Better load balancing: Few servers handle metadata only Many servers serve (their) data More flexibility, more options: Replication, fault tolerance built in Specialized APIs for different workloads Lower hardware requirements per machine Client and data server on same machine 13

PARALLEL FILE SYSTEMS Lustre GFS = Google File System (not to be confused with the Global File System) GPFS PVFS HadoopFS TidyFS... 14

OUTLINE Classical distributed file systems NFS: Sun Network File System AFS: Andrew File System Parallel distributed file systems Case study: The Google File System Scalability Fault tolerance Other approaches 15

GFS KEY DESIGN GOALS Scalability: High throughput, parallel reads/writes Fault tolerance built in: Commodity components might fail often Network partitions can happen Re-examine standard I/O semantics: Complicated POSIX semantics vs scalable primitives vs common workloads Co-design file system and applications 16

GFS ARCHITECTURE [Diagram: the client exchanges metadata requests/responses with the Master (which holds the metadata) and sends read/write requests directly to the chunkservers, which return the read/write responses.] Chunkservers store data as chunks, which are files in the local Linux file system. The Master manages metadata (e.g., which chunks belong to which file, etc.) Source [2] 17

MASTER & METADATA Master is process on separate machine Manages all metadata: File namespace File-to-chunk mappings Chunk location information Chunk version information Access control information Does not store/read/write any file data! 18

FILES & CHUNKS Files are made of (multiple) chunks: Chunk size: 64 MiB Stored on chunkserver, in Linux file system Referenced by chunk handle (i.e., filename in Linux file system) Replicated across multiple chunkservers Chunkservers located in different racks 19

FILES & CHUNKS [Example: the logical view of file /some/dir/f is the chunk sequence C4, C6, C8, C9, each at version #4. The metadata describing the file maps each chunk to its replica locations: C4: (S1,S2,S3), C6: (S1,S2,S3), C8: (S1,S3,S4), C9: (S1,S2,S4). The chunks themselves are replicated on the chunkservers S1, S2, S3, ...] 20
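To make the mapping above concrete, here is a minimal Python sketch of the master-side metadata for the example file; the class and field names are illustrative and not taken from the paper [5].

from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB

@dataclass
class ChunkInfo:
    version: int     # chunk version number
    replicas: list   # chunkservers holding a replica, e.g. ["S1", "S2", "S3"]

@dataclass
class MasterMetadata:
    # file namespace: path -> ordered list of chunk handles
    files: dict = field(default_factory=dict)
    # chunk handle -> version + replica locations
    chunks: dict = field(default_factory=dict)

# The example file /some/dir/f from the slide above:
master = MasterMetadata()
master.files["/some/dir/f"] = ["C4", "C6", "C8", "C9"]
master.chunks["C4"] = ChunkInfo(version=4, replicas=["S1", "S2", "S3"])
master.chunks["C6"] = ChunkInfo(version=4, replicas=["S1", "S2", "S3"])
master.chunks["C8"] = ChunkInfo(version=4, replicas=["S1", "S3", "S4"])
master.chunks["C9"] = ChunkInfo(version=4, replicas=["S1", "S2", "S4"])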

ACCESSING FILES Client accesses file in two steps: (1) Contact Master to retrieve metadata (2) Talk to chunkservers directly Benefits: Metadata is small (one master can handle it) Metadata can be cached at client Master not involved in data operations Note: clients cache metadata, but not data 21

READ ALGORITHM (1) Application originates read request (filename, byte range) (2) GFS client translates request from (filename, byte range) to (filename, chunk index) and sends it to the master (3) Master responds with (chunk handle, replica locations) Source [6] 22

READ ALGORITHM (4) GFS client picks a replica location and sends a (chunk handle, byte range) request to that location (5) Chunkserver sends the requested data to the client (6) Client forwards the data to the application Source [6] 23
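The client-side read path can be summarized in a few lines of Python. This is a hedged sketch: master.lookup_chunk, pick_replica and read_from are hypothetical stand-ins for the master RPC and the chunkserver RPC, and the byte range is assumed to stay within a single chunk.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB

def read(filename, offset, length, master, pick_replica, read_from):
    """Translate a byte range to a chunk index, ask the master for the
    chunk handle and replica locations, then read directly from one
    chunkserver (the master never touches file data)."""
    chunk_index = offset // CHUNK_SIZE                              # step (2)
    handle, replicas = master.lookup_chunk(filename, chunk_index)   # step (3), cacheable
    server = pick_replica(replicas)                                 # step (4), e.g. the closest replica
    start_in_chunk = offset % CHUNK_SIZE
    return read_from(server, handle, start_in_chunk, length)        # steps (5)+(6)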

READ PERFORMANCE Division of work reduces load: Master provides metadata quickly (in RAM) Multiple chunkservers available One chunkserver (e.g., the closest one) is selected for delivering requested data Chunk replicas equally distributed across chunkservers for load balancing Can we do this for writes, too? 24

WRITE ALGORITHM (1) Application originates write request (filename, data) (2) GFS client translates request from (filename, data) to (filename, chunk index) and sends it to the master (3) Master responds with (chunk handle, replica locations) Source [6] 25

HOW TO WRITE DATA? Data needs to be pushed to all chunkservers to reach the required replication count. The client could send a copy to each chunkserver on its own (send the data 3x to 3 locations?), but that could cause contention on its network link. This might not scale... 26

PIPELINING WRITES Sending data over the client's network card multiple times is inefficient, but... ...network cards can send and receive at the same time at full speed (full duplex) Idea: Pipeline data writes Client sends data to just one chunkserver Chunkserver starts forwarding data to next chunkserver, while still receiving more data Multiple links utilized, lower latency 27

WRITE ALGORITHM Write algorithm, continued: (4) GFS client sends data to the first chunkserver (5) While receiving: the first chunkserver forwards received data to the second chunkserver, the second chunkserver forwards to the third replica location, ... The data is buffered on the servers at first, but not applied to chunks immediately. Open questions: How to coordinate concurrent writes? What if a write fails? Source [6] 28
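A minimal sketch of the pipelined data push in steps (4) and (5), assuming a hypothetical send(sender, receiver, data) transfer primitive. In reality each chunkserver forwards on its own while still receiving; the nested loop below only models the hop order, block by block.

def push_data_pipelined(client, data, replica_chain, send, block=1 << 20):
    """The client sends each block only to the first chunkserver; every
    chunkserver forwards the block to its successor while later blocks
    are still arriving, so all links are used in parallel. The data is
    only buffered on the servers, not yet applied to the chunk."""
    for off in range(0, len(data), block):
        piece = data[off:off + block]
        sender = client
        for server in replica_chain:      # client -> S1, S1 -> S2, S2 -> S3, ...
            send(sender, server, piece)   # hypothetical transfer primitive
            sender = server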

REPLICA TYPES Primary: Determines serial order of pending writes Forwards write command + serial order Secondary: Execute writes as ordered by primary Replies to primary (success or failure) Replica roles determined by Master: Tells client in step (2) of write algorithm Decided per chunk, not per chunkserver 29

WRITE ALGORITHM Write algorithm, continued: (6) GFS client sends write command to the primary (7) Primary determines serial order of writes, then writes the buffered data to its chunk replica (8) Primary forwards (write command, serial order) to the secondaries (9) Secondaries execute the writes, respond to the primary (10) Primary responds to the GFS client (11) GFS client reports write done to the application Source [6] 30
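The role of the primary in steps (6) to (10) can be sketched as follows; apply_to_chunk and forward are assumed helper callbacks, not GFS interfaces. The point is that a single node picks the serial order, so all replicas apply the same writes in the same sequence.

class PrimaryReplica:
    """Sketch of the primary's role: it alone decides the order in which
    buffered writes are applied, so all replicas stay identical."""
    def __init__(self, apply_to_chunk, forward):
        self.next_serial = 0
        self.apply_to_chunk = apply_to_chunk   # applies buffered data to the local replica
        self.forward = forward                 # sends (write_id, serial) to one secondary

    def handle_write(self, write_id, secondaries):
        serial = self.next_serial              # step (7): pick the serial order
        self.next_serial += 1
        self.apply_to_chunk(write_id, serial)  # step (7): write to own chunk replica
        ok = all(self.forward(s, write_id, serial) for s in secondaries)  # steps (8)+(9)
        return ok                              # step (10): success only if all secondaries replied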

WRITE SEMANTICS Multiple clients can write concurrently Things to consider: Clients determine offset in chunk Concurrent writes to overlapping byte ranges possible, may cause overwrites Last writer wins, as determined by primary Problem: what if multiple clients want to write to a file and no write must be lost? 31

ATOMIC RECORD APPEND Append is a common workload (at Google): Multiple clients merge results in a single file Must not overwrite each other's records, but the specific order is not important Use the file as a consumer-producer queue! Primary + secondary chunkservers agree on a common order of records Client library provides the record abstraction 32

APPEND ALGORITHM (1) Application originates append request (2) GFS client translates it, sends it to the master (3) Master responds with (chunk handle, primary + secondary replica locations) (4) Client pushes data to the locations (pipelined) (5) Primary checks if the record fits into the chunk Case (A): It fits, primary does: (6) Appends record to end of chunk (7) Tells secondaries to do the same (8) Receives responses from all secondaries (9) Sends final response to client Case (B): It does not fit, primary does: (6) Pads chunk (7) Tells secondaries to do the same (8) Informs client about padding (9) Client retries with next chunk Source [6] 33
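A short sketch of the primary-side decision in step (5), under the assumption of a chunk object with a used byte counter and hypothetical append/pad/tell_secondaries helpers. Case (A) returns the offset at which every replica appended the record; case (B) tells the client to retry on the next chunk.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB

def record_append(chunk, record, append, pad, tell_secondaries):
    """Primary-side sketch of atomic record append: either the record fits
    and all replicas append it at the same offset, or the chunk is padded
    and the client must retry on the next chunk."""
    if chunk.used + len(record) <= CHUNK_SIZE:          # case (A): it fits
        offset = append(chunk, record)                  # step (6)
        tell_secondaries("append", record, offset)      # steps (7)+(8)
        return ("ok", offset)                           # step (9)
    pad(chunk)                                          # case (B), step (6): pad to chunk end
    tell_secondaries("pad", None, chunk.used)           # step (7)
    return ("retry_next_chunk", None)                   # steps (8)+(9)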

HANDLING RECORDS GFS guarantees: Records are appended atomically (not fragmented, not partially overwritten) Each record is appended at least once Failed append: may lead to undefined regions (partial records, no data) Retries: may lead to duplicate records in some chunks Client: handles broken/duplicate records 34

EXAMPLE: RECORDS Client library: generic support for per-record checksums. Application: may add unique IDs to records to help detect duplicates. [Diagram: records laid out across chunk boundaries, each with a record header (contains checksum) and record data; duplicate records and a partial record are highlighted.] Note: the example offers a conceptual view, as the paper [5] does not give details on the real data layout for records. 35
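Since the paper [5] leaves the record layout open, the following is only a conceptual Python sketch of what the client library and the application could do: frame each record with a length and a CRC32 checksum, skip broken or partial regions when reading, and drop duplicates based on an application-chosen unique ID assumed to sit at the start of each record.

import struct, zlib

def frame_record(payload: bytes) -> bytes:
    """Record header: payload length + CRC32 checksum, followed by the data."""
    return struct.pack(">II", len(payload), zlib.crc32(payload)) + payload

def parse_records(blob: bytes):
    """Yield valid, de-duplicated payloads; skip partial or corrupt regions."""
    seen_ids = set()
    pos = 0
    while pos + 8 <= len(blob):
        length, crc = struct.unpack_from(">II", blob, pos)
        data = blob[pos + 8:pos + 8 + length]
        if len(data) == length and zlib.crc32(data) == crc:
            rec_id = data[:16]                 # assumption: app-chosen unique ID at the front
            if rec_id not in seen_ids:
                seen_ids.add(rec_id)
                yield data
            pos += 8 + length
        else:
            pos += 1                           # resync after a broken or partial record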

FILE NAMESPACE [Example namespace: / contains crawls (with files abc, xyz), foo..., and map_tiles (with files 40N74E, 51N13E, 43,0,s0, 43,0,s1, ..., 81,23,s3).] Hierarchical namespace, held in the Master's memory. Master is multi-threaded: concurrent access possible (reader/writer locks). No real directories: (+) Read-lock parent dirs, write-lock the file's name (-) No readdir() 36
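The locking rule can be illustrated with a tiny helper (a sketch, not the master's actual code): for a mutation of a path, read-lock every prefix and write-lock the full name.

def locks_for_mutation(path):
    """Return (paths to read-lock, path to write-lock) for a namespace
    mutation; there are no real directories, just names with prefixes."""
    parts = path.strip("/").split("/")
    prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    return prefixes, path

# Example:
# locks_for_mutation("/some/dir/f") -> (["/some", "/some/dir"], "/some/dir/f")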

EFFICIENT SNAPSHOTS Copy-on-write snapshots are cheap: Master revokes leases on chunks to be snapshotted to temporarily block writes Master acquires write locks on all directories / files to be snapshotted Master creates new metadata structures pointing to the original chunks Upon write access to those chunks, the master delays the client reply until the chunkservers have duplicated the respective chunks 37

DELETING FILES Deleting a file: Renamed to hidden filename + timestamp Can still be accessed under hidden name Undelete possible via rename Chunkservers not involved (yet) Background scan of namespace: Find deleted file based on special filename Erase metadata if timestamp is older than grace period 38

GARBAGE COLLECTION Garbage collection is background activity Master: Scans chunk namespace regularly Chunks not linked from any file are obsolete Chunkservers: Send heartbeat messages to master Receive list of obsolete chunks in reply Delete obsolete chunks when convenient 39
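A minimal sketch of this interaction, with made-up function names: the master derives the set of orphaned chunks from its file table, and each heartbeat reply tells a chunkserver which of its reported chunks it may delete.

def find_orphaned_chunks(files, known_chunks):
    """Master-side scan: any chunk handle not referenced by any file is obsolete."""
    referenced = {handle for handles in files.values() for handle in handles}
    return set(known_chunks) - referenced

def heartbeat_reply(reported_chunks, orphaned_chunks):
    """Tell a chunkserver which of the chunks it reported can be deleted
    whenever convenient (lazy garbage collection)."""
    return sorted(set(reported_chunks) & orphaned_chunks)

# Example:
files = {"/some/dir/f": ["C4", "C6"]}
orphans = find_orphaned_chunks(files, {"C4", "C6", "C7"})   # -> {"C7"}
print(heartbeat_reply(["C4", "C7"], orphans))               # -> ["C7"]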

OUTLINE Classical distributed file systems NFS: Sun Network File System AFS: Andrew File System Parallel distributed file systems Case study: The Google File System Scalability Fault tolerance Other approaches 40

MASTER & METADATA Master is a process on a separate machine Manages all metadata: File namespace File-to-chunk mappings Chunk location information Access control information Chunk version information All metadata is cached in RAM; the namespace, file-to-chunk mappings, access control and chunk version information are also persistently stored on disk (chunk locations are not persisted, the master learns them from the chunkservers) Does not store/read/write any file data! 41

LOG + CHECKPOINTS (1) GFS client sends a modifying request (2) Master writes a new log entry for the requested operation to its disk (the log holds all operations since the last checkpoint) (3) Master applies the modification to its in-memory metadata (4) Master sends the reply to the GFS client [Diagram: the metadata lives in volatile memory; persistent storage (local file system) holds checkpoints (Chkpt) plus the log of operations since the last checkpoint.] 42
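The ordering that matters here is: append to the on-disk log first, then update the in-memory metadata, then reply. The sketch below assumes a simple JSON-lines log and made-up file names; it is not the master's real log format.

import json, os

class MasterLog:
    """Write-ahead operation log with periodic checkpoints (sketch)."""
    def __init__(self, path="master.log"):
        self.path = path
        self.metadata = {}                       # volatile, in-memory state

    def handle_request(self, op, key, value):
        # (2) log the operation durably first ...
        with open(self.path, "a") as log:
            log.write(json.dumps({"op": op, "key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # (3) ... then apply it to the in-memory metadata
        self.metadata[key] = value
        return "ok"                              # (4) reply only after both steps

    def checkpoint(self, chkpt_path="master.chkpt"):
        # Dump the in-memory state; the log can then be restarted from empty.
        with open(chkpt_path, "w") as f:
            json.dump(self.metadata, f)
        open(self.path, "w").close()             # truncate the log of older operations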

MASTER FAILURES Fast restart from checkpoint+log, if Master process dies, but...... the Master s machine might still fail! Master replication: Log + checkpoints replicated on multiple machines Changes considered committed after being logged both locally and remotely Clients are sent reply only after full commit 43

SHADOW MASTERS Only one (real) master is in charge, performs background jobs (e.g., garbage collection) For better read availability: Shadow Masters Read replicated logs, apply observed changes to their own in-memory metadata Receive heartbeat messages from all chunkservers, like real master Can serve read-only requests, if real master is down 44

GFS: KEY TECHNIQUES Scalability: metadata + data separated Large chunk size, less coordination overhead Simple, in-memory metadata (namespace,...) Fault tolerant: Replication: Master + chunks More in paper [5]: checksums for chunks, chunk replica recovery,... Non-POSIX: applications use primitives that suit their workload (e.g., record append) 45

OUTLINE Classical distributed file systems NFS: Sun Network File System AFS: Andrew File System Parallel distributed file systems Case study: The Google File System Scalability Fault tolerance Other approaches 46

OTHER APPROACHES Distributed metadata servers: Replicated state machine handles metadata TidyFS, GPFS,... Distributed key value stores: Data stored as binary objects (blobs) Read / write access via get() / put() Multiple nodes store replicas of blobs Consistent hashing determines location 47

EXAMPLE: DYNAMO [7] [Diagram: nodes A, B, C, D, E, F, G, H organized in a circle; each node handles a range of keys; key K is the uniform hash of a name.] Overview (details in paper [7]): Blobs are replicated to N-1 neighboring nodes A gossip protocol is used to inform neighbors about key-range assignments The first node (determined by position in the key range) is the coordinator responsible for replication to its N-1 successor nodes Example: N=3, node A is the coordinator for key K; nodes B, C, and D store replicas in the range (A,B) including K The coordinator node manages get() / put() requests on the N-1 successor nodes To account for node failures, more than N adjacent nodes form a preference list Node failures may temporarily redirect writes to nodes further down in the preference list (in the example: D may receive writes if A failed) 48
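The placement rule can be sketched with plain consistent hashing (no virtual nodes, no preference-list extension; SHA-1 is assumed as the uniform hash): the coordinator is the first node clockwise from the key's ring position, and the next N-1 nodes hold the replicas.

import hashlib
from bisect import bisect_right

def ring_position(name: str) -> int:
    """Uniform hash mapping a name onto the ring."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

def replica_nodes(key: str, nodes, n=3):
    """Return the coordinator and its n-1 successors on the hash ring."""
    ring = sorted(nodes, key=ring_position)
    positions = [ring_position(node) for node in ring]
    start = bisect_right(positions, ring_position(key)) % len(ring)   # first node clockwise
    return [ring[(start + i) % len(ring)] for i in range(n)]

# Example: the first returned node is the coordinator for the key,
# and the following two store the replicas (N=3).
print(replica_nodes("some-blob-name", ["A", "B", "C", "D", "E", "F", "G", "H"]))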

REFERENCES
Classical distributed file systems:
[1] Text book: Distributed Systems - Concepts and Design, Coulouris, Dollimore, Kindberg
[2] Basic lecture on distributed file systems from Operating Systems and Security (in German)
Large-scale distributed file systems and applications:
[3] Data processing at the Worldwide LHC Computing Grid: http://lcg.web.cern.ch/lcg/public/dataprocessing.htm
[4] Finding a Needle in Haystack: Facebook's Photo Storage, Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel, OSDI '10: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, 2010
[5] The Google File System, Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, 2003
[6] The Google File System, slides presented at SOSP '03, copy provided by the authors, mirrored on the DOS lecture website: http://os.inf.tu-dresden.de/studium/dos/ss2012/gfs-sosp03.pdf
[7] Dynamo: Amazon's Highly Available Key-value Store, Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels, SOSP '07: Proceedings of the 21st ACM Symposium on Operating Systems Principles, 2007
49