DISTRIBUTED SYSTEMS [COMP9243] Lecture 9b: Distributed File Systems

Slide 1: DISTRIBUTED SYSTEMS [COMP9243], Lecture 9b: Distributed File Systems
  ➀ Introduction
  ➁ NFS (Network File System)
  ➂ AFS (Andrew File System) & Coda
  ➃ GFS (Google File System)

Slide 2: INTRODUCTION
  Distributed File System Paradigm:
  - File system that is shared by many distributed clients
  - Communication through shared files
  - Shared data remains available for a long time
  - Basic layer for many distributed systems and applications
  Clients and Servers:
  - Clients access files and directories
  - Servers provide files and directories
  - Servers allow clients to perform operations on the files and directories
  - Operations: add/remove, read/write
  - Servers may provide different views to different clients

Slide 3: CHALLENGES
  Transparency:
  - Location: a client cannot tell where a file is located
  - Migration: a file can transparently move to another server
  - Replication: multiple copies of a file may exist
  - Concurrency: multiple clients access the same file
  Flexibility:
  - Servers may be added or replaced
  - Support for multiple file system types
  Dependability:
  - Consistency: conflicts with replication & concurrency
  - Security: users may have different access rights on clients sharing files; network transmission must be protected
  - Fault tolerance: server crashes, availability of files

Slide 4: CHALLENGES (continued)
  Performance:
  - Requests may be distributed across servers
  - Multiple servers allow higher storage capacity
  Scalability:
  - Handle increasing numbers of files and users
  - Growth over geographic and administrative areas
  - Growth of storage space
  - No central naming service
  - No centralised locking
  - No central file store

Slide 5: THE CLIENT'S PERSPECTIVE: FILE SERVICES
  Ideally, the client would perceive remote files like local ones.
  File Service Interface:
  - File: uninterpreted sequence of bytes
  - Attributes: owner, size, creation date, permissions, etc.
  - Protection: access control lists or capabilities
  - Immutable files: simplifies caching and replication
  - Upload/download model versus remote access model

Slide 6: FILE ACCESS SEMANTICS
  UNIX semantics:
  - A READ after a WRITE returns the value just written
  - When two WRITEs follow in quick succession, the second persists
  - Caches are needed for performance, but write-through is expensive
  - UNIX semantics is too strong for a distributed file system
  Session semantics (a sketch of this model follows after Slide 8):
  - Changes to an open file are only locally visible
  - When a file is closed, changes are propagated to the server (and other clients)
  - But it also has problems: what happens if two clients modify the same file simultaneously? Parent and child processes cannot share file pointers if running on different machines.

Slide 7: FILE ACCESS SEMANTICS (continued)
  Immutable files:
  - Files allow only CREATE and READ
  - Directories can be updated
  - Instead of overwriting the contents of a file, a new one is created and replaces the old one
  - Race condition when two clients replace the same file
  - How to handle readers of a file when it is replaced?
  Atomic transactions:
  - A sequence of file manipulations is executed indivisibly
  - Two transactions can never interfere
  - Standard for databases
  - Expensive to implement

Slide 8: THE SERVER'S PERSPECTIVE: IMPLEMENTATION
  Design depends on the use (Satyanarayanan, 1980s university UNIX use):
  - Most files are small (less than 10k)
  - Reading is much more common than writing
  - Access is usually sequential; random access is rare
  - Most files have a short lifetime
  - File sharing is unusual
  - Most processes use only a few files
  - Distinct file classes with different properties exist
  - Is this still valid?
  There are also varying reasons for using a DFS:
  - Big file system, many users, inherent distribution
  - High performance
  - Fault tolerance
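To make the upload/download model with session semantics concrete, here is a minimal Python sketch. The Server and SessionFile classes are invented for illustration and are not taken from any real system: a file is downloaded whole on open, writes stay in a private copy, and the copy is uploaded only on close, so when two clients modify the same file the one that closes last silently wins.

# Minimal sketch of session semantics (upload/download model); all names invented.
class Server:
    def __init__(self):
        self.files = {}                              # filename -> bytes

    def download(self, name):
        return self.files.get(name, b"")

    def upload(self, name, data):
        self.files[name] = data                      # whole-file replacement on close


class SessionFile:
    def __init__(self, server, name):
        self.server, self.name = server, name
        self.data = bytearray(server.download(name)) # private working copy

    def write(self, offset, payload):
        self.data[offset:offset + len(payload)] = payload   # visible locally only

    def read(self):
        return bytes(self.data)

    def close(self):
        self.server.upload(self.name, bytes(self.data))     # propagate on close


srv = Server()
a, b = SessionFile(srv, "notes"), SessionFile(srv, "notes")
a.write(0, b"from client A")
b.write(0, b"from client B")
a.close()
b.close()                                            # B closed last: its version wins
print(srv.files["notes"])                            # b'from client B'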

Slide 9: STATELESS VERSUS STATEFUL SERVERS
  Advantages of stateless servers:
  - Fault tolerance
  - No OPEN/CLOSE calls needed
  - No server space needed for tables
  - No limits on number of open files
  - No problems if server crashes
  - No problems if client crashes
  Advantages of stateful servers:
  - Shorter request messages
  - Better performance
  - Read-ahead easier
  - File locking possible

Slide 10: CACHING
  We can cache in three locations:
  ➀ Main memory of the server: easy & transparent
  ➁ Disk of the client
  ➂ Main memory of the client (process local, kernel, or dedicated cache process)
  Cache consistency:
  - Obvious parallels to shared-memory systems, but different trade-offs
  - No UNIX semantics without centralised control
  - Plain write-through is too expensive; alternatives: delay WRITEs and agglomerate multiple WRITEs; write-on-close, possibly with delay (the file may be deleted)
  - Invalid cache entries may be accessed if the server is not contacted whenever a file is opened

Slide 11: REPLICATION
  Multiple copies of files on different servers:
  - Prevent data loss
  - Protect the system against downtime of a single server
  - Distribute workload
  Three designs (compared in the sketch after Slide 12):
  - Explicit replication: the client explicitly writes files to multiple servers (not transparent)
  - Lazy file replication: the server automatically copies files to other servers after the file is written
  - Group file replication: WRITEs go simultaneously to a group of servers

Slide 12: CASE STUDIES
  - Network File System (NFS)
  - Andrew File System (AFS) & Coda
  - Google File System (GFS)
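A toy Python sketch contrasting the three replication designs from Slide 11; the FileServer class and the function names are invented for illustration, not taken from any real system.

class FileServer:
    def __init__(self, name, peers=None):
        self.name = name
        self.files = {}                  # filename -> data
        self.peers = peers or []         # other replicas of this volume
        self.pending = []                # copies not yet pushed to peers

    def store(self, fname, data, replicate_lazily=False):
        self.files[fname] = data
        if replicate_lazily:
            self.pending.append((fname, data))

    def background_copy(self):
        # Lazy replication: the server pushes queued writes to its peers after the fact.
        while self.pending:
            fname, data = self.pending.pop()
            for p in self.peers:
                p.store(fname, data)

def explicit_replication(servers, fname, data):
    # Not transparent: the client itself writes the file to every server.
    for s in servers:
        s.store(fname, data)

def group_replication(group, fname, data):
    # The WRITE goes to the whole server group at once (e.g. via a multicast RPC).
    for s in group:
        s.store(fname, data)

s1, s2, s3 = FileServer("s1"), FileServer("s2"), FileServer("s3")
s1.peers = [s2, s3]
s1.store("report.txt", b"v1", replicate_lazily=True)   # lazy: server copies later
s1.background_copy()
explicit_replication([s1, s2, s3], "notes.txt", b"v1")
group_replication([s1, s2, s3], "log.txt", b"v1")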

Slide 13: NETWORK FILE SYSTEM (NFS)
  Properties:
  - Introduced by Sun
  - Fits nicely into UNIX's idea of mount points, but does not implement UNIX semantics
  - Multiple clients & servers (a single machine can be both a client and a server)
  - Stateless servers (no OPEN & CLOSE) (changed in v4)
  - File locking through a separate server
  - No replication
  - ONC RPC for communication

Slides 14 & 16: Architecture
  [Figure: layered NFS architecture. On the client: system call layer, virtual file system (VFS) layer, local file system interface and NFS client, RPC client stub. On the server: system call layer, VFS layer, NFS server and local file system interface, RPC server stub. Client and server stubs communicate over the network.]

Slide 15: Server side
  - NFS protocol is independent of the underlying FS
  - NFS server runs as a daemon
  - /etc/export: specifies what directories are exported to whom under which policy (see the example below)
  - Transparent caching
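The export policy and the client-side mount choices are easiest to see in an example. This is an illustration only: the directory and host names are made up, and the syntax shown is the Linux nfs-utils one, where the file is called /etc/exports; other implementations use different files and options.

# /etc/exports on the server (Linux nfs-utils syntax; paths and hosts invented):
/srv/projects   192.168.1.0/24(rw,sync,no_subtree_check)
/srv/software   labclient.example.com(ro,sync,no_subtree_check)

# On a client: explicit mount of an exported directory
mount -t nfs fileserver.example.com:/srv/projects /mnt/projects
# A soft mount gives up after a few retries instead of blocking indefinitely (hard mount):
mount -t nfs -o soft,retrans=3 fileserver.example.com:/srv/projects /mnt/projects

After editing the exports file, the server typically has to be told to re-read it (with nfs-utils, exportfs -ra).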

Slide 17: Client side
  - Explicit mounting versus automounting
  - Hard mounts versus soft mounts
  - Supports diskless workstations
  - Caching of file attributes and file data
  Caching:
  - Implementation specific
  - Caches results of read, write, getattr, lookup, readdir
  - Consistency through polling and timestamps (see the validation sketch after Slide 20)
  - Cache entries are discarded after a fixed period of time
  - Modified files are sent to the server asynchronously (when closed, or when the client performs a sync)
  - Read-ahead and delayed write possible

Slide 18: ANDREW FILE SYSTEM (AFS) & CODA
  Properties:
  - From Carnegie Mellon University (CMU) in the 1980s
  - Developed as a campus-wide file system: scalability
  - Global name space for the file system (divided into cells, e.g. /afs/cs.cmu.edu, /afs/ethz.ch)

Slide 19: System Architecture
  - Client: user-level process Venus (the AFS daemon), cache on local disk
  - Trusted servers collectively called Vice
  [Figure: a Virtue client machine with user processes and the Venus process above the virtual file system layer, local file system interface, RPC client stub and local OS, providing transparent access over the network to a Vice file server.]

Slide 20:
  - Virtue client API same as for UNIX
  - UNIX semantics for processes on one machine, but globally write-on-close
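Slide 17's "consistency through polling and timestamps" can be sketched as follows. This is a simplified model rather than real NFS client code: AttrCache, getattr_rpc and fetch_rpc are invented names, and the 3-second timeout stands in for the tunable attribute-cache timeouts of real clients.

import time

ATTR_TIMEOUT = 3.0      # seconds a cached entry is trusted without asking the server

class AttrCache:
    def __init__(self, getattr_rpc):
        self.getattr_rpc = getattr_rpc   # function: filename -> (mtime, attrs)
        self.entries = {}                # filename -> (cached_at, mtime, data)

    def read(self, fname, fetch_rpc):
        now = time.time()
        entry = self.entries.get(fname)
        if entry:
            cached_at, mtime, data = entry
            if now - cached_at < ATTR_TIMEOUT:
                return data                                   # trust the cache, no server contact
            server_mtime, _ = self.getattr_rpc(fname)         # entry expired: poll the timestamp
            if server_mtime == mtime:
                self.entries[fname] = (now, mtime, data)      # unchanged on the server: re-arm timer
                return data
        else:
            server_mtime, _ = self.getattr_rpc(fname)
        # No entry, or the file changed on the server: fetch fresh data
        data = fetch_rpc(fname)
        self.entries[fname] = (now, server_mtime, data)
        return data

Note how stale data can still be served for up to ATTR_TIMEOUT seconds; that window is exactly the "invalid cache entries may be accessed" problem from Slide 10.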

Slide 21: DESIGN & ARCHITECTURE
  Scalability:
  - Server serves whole files
  - Clients cache whole files
  - Server invalidates cached files with a callback (stateful servers)
  - Clients do not validate their cache (except on first use after booting)
  - Modified files are written back to the server on close()
  - Result: very little cache validation traffic
  - Flexible volume per user (resize, move to another server)
  - Read-only volumes for software
  - Read-only replicas

Slide 22: CODA
  - Successor of the Andrew File System (AFS)
  - System architecture quite similar to AFS
  - Supports disconnected, mobile operation of clients
  - Supports replication

Slide 23: Disconnected operation
  - All client updates are logged in a Client Modification Log (CML)
  - On re-connection, the operations recorded in the CML are replayed on the server
  - The CML is optimised (e.g. a file creation and its later removal cancel out; see the CML sketch after Slide 24)
  - On a weak connection, the CML is reintegrated on the server by trickle reintegration
  - Trickle reintegration trade-off: immediate reintegration of log entries reduces the chance for optimisation; late reintegration increases the risk of conflicts
  File hoarding:
  - The system (or user) can build a user hoard database, which it uses to update frequently used files in a hoard walk
  Conflicts:
  - Automatically resolved where possible; otherwise manual correction is necessary
  - Conflict resolution for temporarily disconnected servers

Slide 24: Servers
  - Read/write replication across servers is supported
  - Replication is organised on a per-volume basis
  - Group file replication (multicast RPCs); read from any server
  - Version stamps are used to recognise servers with out-of-date files (due to disconnection or failure)
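A sketch of a Coda-style Client Modification Log as described on Slide 23. The record format and the single optimisation rule shown (a CREATE that is later REMOVEd while still disconnected cancels out, together with any intermediate stores) are simplified illustrations, not Coda's actual log format.

class ClientModificationLog:
    def __init__(self):
        self.ops = []                    # operations logged while disconnected

    def log(self, op, path, payload=None):
        self.ops.append((op, path, payload))

    def optimise(self):
        # Drop a CREATE and everything done to that file if the same file is
        # REMOVEd later in the same disconnected session.
        removed = {path for op, path, _ in self.ops if op == "remove"}
        created = {path for op, path, _ in self.ops if op == "create"}
        cancelled = removed & created
        self.ops = [(op, path, data) for op, path, data in self.ops
                    if path not in cancelled]

    def reintegrate(self, server):
        # On re-connection: replay the (optimised) log on the server.
        self.optimise()
        for op, path, payload in self.ops:
            server.replay(op, path, payload)
        self.ops = []

# Hypothetical usage while disconnected:
# cml.log("create", "/coda/tmp/scratch")
# cml.log("store",  "/coda/tmp/scratch", b"draft")
# cml.log("remove", "/coda/tmp/scratch")   # all three cancel out in optimise()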

Slide 25: GOOGLE FILE SYSTEM
  Motivation:
  - 10+ clusters, 1000+ nodes per cluster
  - Pools of 1000+ clients
  - 350 TB+ file systems
  - 500 Mb/s read/write load
  - Commercial and R&D applications
  Assumptions:
  - Failure occurs often
  - Huge files (millions of files, 100+ MB each)
  - Large streaming reads, small random reads
  - Large appends, concurrent appends
  - Bandwidth more important than latency

Slide 26: Interface
  - No common standard like POSIX
  - Provides a familiar file system interface: Create, Delete, Open, Close, Read, Write
  - In addition:
    Snapshot: low-cost copy of a whole file with a copy-on-write operation
    Record append: atomic append operation

Slide 27: Design Overview
  - Files are split into fixed-size chunks of 64 MByte
  - Chunks are stored on chunk servers
  - Chunks are replicated on multiple chunk servers
  - GFS master manages the name space
  - Clients interact with the master to get chunk handles
  - Clients interact with chunk servers for reads and writes
  - No explicit caching
  (The read path is sketched after Slide 28.)

Slide 28: Architecture
  [Figure: the application passes (file name, chunk index) to the GFS client; the client asks the GFS master, which maps the file name space (e.g. /foo/bar) to chunk handles such as 2ef0 and returns (chunk handle, chunk locations); the client then sends (chunk handle, byte range) to a GFS chunkserver and receives chunk data; the master exchanges instructions and chunkserver state with the chunkservers, which store chunks in a Linux file system.]
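The read path from Slides 27 and 28 boils down to turning a byte offset into a chunk index, asking the master for the handle and replica locations, and then reading the byte range from a chunkserver. The Python sketch below follows those steps; the class and RPC names (lookup, read) are invented, while the 64 MB chunk size comes from the slides.

CHUNK_SIZE = 64 * 1024 * 1024      # 64 MByte fixed-size chunks

class GFSClient:
    def __init__(self, master, chunkservers):
        self.master = master               # maps (filename, chunk_index) -> (handle, locations)
        self.chunkservers = chunkservers   # location -> chunkserver stub

    def read(self, filename, offset, length):
        data = b""
        while length > 0:
            chunk_index = offset // CHUNK_SIZE            # which chunk holds this offset
            chunk_offset = offset % CHUNK_SIZE            # position inside that chunk
            handle, locations = self.master.lookup(filename, chunk_index)
            n = min(length, CHUNK_SIZE - chunk_offset)    # do not read past the chunk boundary
            replica = self.chunkservers[locations[0]]     # any replica will do for reads
            data += replica.read(handle, chunk_offset, n)
            offset, length = offset + n, length - n
        return data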

Slide 29: GFS Master
  - Single point of failure
  - Keeps data structures in memory (speed, easy background tasks)
  - Mutations are logged to an operation log
  - The operation log is replicated
  - Checkpoints state when the log grows too large
  - The checkpoint has the same form as the in-memory state (quick recovery)
  - Note: locations of chunks are not stored (the master periodically asks chunk servers for the list of their chunks)
  GFS Chunkservers:
  - Checksum blocks of chunks
  - Verify checksums before data is delivered
  - Verify checksums of seldom-used blocks when idle

Slide 30: Data Mutations
  - Write, atomic record append, snapshot
  - Master grants a chunk lease to one of a chunk's replicas
  - The replica holding the lease becomes the primary
  - The primary defines the serial order for all mutations
  - Leases typically expire after 60 s, but are usually extended
  - Easy recovery from a failed primary: the master chooses another replica after the initial lease expires

Slide 31: Example: Write
  Write(filename, offset, data)
  [Figure: write control and data flow: 1. client asks the master who has the lease; 2. master replies with lease info; 3. client pushes data to all replicas (secondary replica, lease holder, secondary replica); 4. client sends commit to the lease holder (primary); 5. primary forwards the serialised commit to the secondaries; 6. secondaries send commit ACKs to the primary; 7. primary ACKs the client.]
  (This flow is sketched in code after Slide 32.)

Slide 32: Example: Append
  RecordAppend(filename, data)
  - Primary has extra logic
  - Check if the record fits in the current chunk
  - If not, pad the chunk and tell the client to try again
  - Otherwise continue as with a write
  - Guarantees that data is written at least once atomically
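The numbered arrows in the Slide 31 figure translate roughly into the client-side steps below. This is a sketch only: master, chunkservers and their methods (who_has_lease, push_data, commit) are hypothetical stubs, and the real protocol additionally pipelines the data push along a chain of chunkservers and caches lease information at the client.

def gfs_write(master, chunkservers, filename, offset, data):
    CHUNK_SIZE = 64 * 1024 * 1024
    chunk_index = offset // CHUNK_SIZE
    chunk_offset = offset % CHUNK_SIZE

    # 1-2. Ask the master who holds the lease for this chunk and where all replicas are.
    handle, primary_loc, secondary_locs = master.who_has_lease(filename, chunk_index)

    # 3. Push the data to every replica; it is buffered there, not yet applied.
    for loc in [primary_loc] + secondary_locs:
        chunkservers[loc].push_data(handle, data)

    # 4. Send the commit (the actual write request) to the primary, the lease holder.
    # 5-6. The primary picks a serial order, applies the buffered data, forwards the
    #      serialised commit to the secondaries and collects their commit ACKs.
    ok = chunkservers[primary_loc].commit(
        handle, chunk_offset, [chunkservers[loc] for loc in secondary_locs])

    # 7. The primary's reply is the ACK to the client; on failure the client retries.
    return ok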

Slide 33: Consistency
  Relaxed consistency model. State of a file region after a mutation (see the reader sketch after Slide 36):

                         Write                     Record Append
  Serial success         defined                   defined/inconsistent
  Concurrent success     consistent & undefined    defined/inconsistent
  Failure                inconsistent              inconsistent

Slide 34: RE-EVALUATING GFS AFTER 10 YEARS
  Workload has changed → changed assumptions
  Single Master:
  - Too many requests for a single master
  - Single point of failure
  - Tune master performance
  - Multiple cells
  - Develop distributed masters
  File Counts:
  - Too much meta-data for a single master → applications rely on BigTable (distributed)

Slide 35: RE-EVALUATING GFS AFTER 10 YEARS (continued)
  File Size:
  - Smaller files than expected
  - Reduce block size to 1 MB
  Throughput vs Latency:
  - Too much latency for interactive applications (e.g. Gmail)
  - Automated master failover
  - Applications hide latency: e.g. multi-homed model

Slide 36: CHUBBY
  Chubby is...:
  - Lock service
  - Simple FS
  - Name service
  - Synchronisation/consensus service
  Architecture:
  - Cell: 5 replicas
  - Master: gets all client requests; elected with Paxos; master lease: no new master until the lease expires
  - Write: Paxos agreement of all replicas
  - Read: served locally by the master
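Because record append only guarantees that a record is written at least once atomically, a successful region can be interspersed with padding and duplicate records (the "defined/inconsistent" cells in the Slide 33 table). Applications therefore typically skip padding and de-duplicate using record identifiers. Below is a sketch with an invented on-record format (4-byte big-endian length, 16-byte unique id, payload); it is not the format of any real GFS application.

import struct

def read_records(chunk_bytes):
    seen_ids, records, pos = set(), [], 0
    while pos + 4 <= len(chunk_bytes):
        (length,) = struct.unpack_from(">I", chunk_bytes, pos)
        if length == 0:                        # zero length: padding, nothing more to read here
            break
        rec = chunk_bytes[pos + 4: pos + 4 + length]
        rec_id, payload = rec[:16], rec[16:]   # first 16 bytes: unique record id
        if rec_id not in seen_ids:             # drop duplicates left by retried appends
            seen_ids.add(rec_id)
            records.append(payload)
        pos += 4 + length
    return records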

Slide 37: Chubby API
  - Pathname: /ls/cell/some/file/name
  - Open (R/W), Close, Read, Write, Delete
  - Lock: Acquire, Release
  - Events: file modified, lock acquired, etc.
  Using Chubby: electing a leader:
    if (open("/ls/cell/theleader", W)) {
        write(my_id);
    } else {
        wait until "/ls/cell/theleader" modified;
        leader_id = read();
    }

Slide 38: WHAT ELSE...?
  Colossus: follow-up to GFS
  BigTable:
  - Distributed, sparse storage map
  - Chubby for consistency
  - GFS/Colossus for actual storage
  Megastore:
  - Semi-relational data model, ACID transactions
  - BigTable as storage, synchronous replication (using Paxos)
  - Poor write latency (100-400 ms) and throughput
  Spanner:
  - Structured storage, SQL-like language
  - Transactions with TrueTime, synchronous replication (Paxos)
  - Better write latency (72-100 ms)

Slide 39: READING LIST
  - Scale and Performance in a Distributed File System (file system properties)
  - NFS Version 3: Design and Implementation (NFS)
  - Disconnected Operation in the Coda File System (Coda)
  - The Google File System (GFS)

Slide 40: HOMEWORK
  - Compare Dropbox, Google Drive, or other popular distributed file systems to the ones discussed in class.
  - Hacker's edition: see the Naming slides