Cloud Computing CS

Similar documents
Service and Cloud Computing Lecture 10: DFS2 Prof. George Baciu PQ838

Cloud Computing CS

Distributed File Systems. Distributed Systems IT332

Chapter 11 DISTRIBUTED FILE SYSTEMS

Advanced Operating Systems

Introduction to Cloud Computing

Distributed Systems. Hajussüsteemid MTAT Distributed File Systems. (slides: adopted from Meelis Roos DS12 course) 1/25

Today: Distributed File Systems

UNIT-IV HDFS. Ms. Selva Mary. G

Distributed File Systems. CS 537 Lecture 15. Distributed File Systems. Transfer Model. Naming transparency 3/27/09

COSC 6397 Big Data Analytics. Distributed File Systems (II) Edgar Gabriel Spring HDFS Basics

Lecture 7: Distributed File Systems

Today: Distributed File Systems. File System Basics

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017

HDFS Architecture Guide

What is a file system

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

Today: Distributed File Systems

Distributed Systems - III

Distributed File Systems. Directory Hierarchy. Transfer Model

NFSv4 as the Building Block for Fault Tolerant Applications

Distributed File Systems II

CS 470 Spring Distributed Web and File Systems. Mike Lam, Professor. Content taken from the following:

CSE 486/586: Distributed Systems

CS 470 Spring Distributed Web and File Systems. Mike Lam, Professor. Content taken from the following:

Distributed Filesystem

Distributed File Systems: Design Comparisons

System that permanently stores data Usually layered on top of a lower-level physical storage medium Divided into logical units called files

DISTRIBUTED FILE SYSTEMS & NFS

COSC 6397 Big Data Analytics. Distributed File Systems (II) Edgar Gabriel Fall HDFS Basics

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

Operating Systems. Week 13 Recitation: Exam 3 Preview Review of Exam 3, Spring Paul Krzyzanowski. Rutgers University.

CS 416: Operating Systems Design April 22, 2015

Today: Distributed File Systems!

Distributed Systems. Hajussüsteemid MTAT Distributed File Systems. (slides: adopted from Meelis Roos DS12 course) 1/15

Distributed File Systems I

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

Distributed Systems 16. Distributed File Systems II

Distributed File Systems

HDFS What is New and Futures

CS60021: Scalable Data Mining. Sourangshu Bhattacharya

Hadoop and HDFS Overview. Madhu Ankam

Announcements. me your survey: See the Announcements page. Today. Reading. Take a break around 10:15am. Ack: Some figures are from Coulouris

DISTRIBUTED SYSTEMS [COMP9243] Lecture 9b: Distributed File Systems INTRODUCTION. Transparency: Flexibility: Slide 1. Slide 3.

Today: Distributed File Systems. Naming and Transparency

Database Applications (15-415)

Distributed File Systems. CS432: Distributed Systems Spring 2017

CLOUD-SCALE FILE SYSTEMS

Chapter 11: File-System Interface

COS 318: Operating Systems. Journaling, NFS and WAFL

GFS: The Google File System

NFS Design Goals. Network File System - NFS

AN OVERVIEW OF DISTRIBUTED FILE SYSTEM Aditi Khazanchi, Akshay Kanwar, Lovenish Saluja

Lecture 19. NFS: Big Picture. File Lookup. File Positioning. Stateful Approach. Version 4. NFS March 4, 2005

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University

Module 7 File Systems & Replication CS755! 7-1!

Dept. Of Computer Science, Colorado State University

Lecture 14: Distributed File Systems. Contents. Basic File Service Architecture. CDK: Chapter 8 TVS: Chapter 11

CS435 Introduction to Big Data FALL 2018 Colorado State University. 11/7/2018 Week 12-B Sangmi Lee Pallickara. FAQs

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Operating Systems 2010/2011

7680: Distributed Systems

Announcements. Reading: Chapter 16 Project #5 Due on Friday at 6:00 PM. CMSC 412 S10 (lect 24) copyright Jeffrey K.

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission

Chapter 17: Distributed-File Systems. Operating System Concepts 8 th Edition,

The Google File System. Alexandru Costan

NFS: Naming indirection, abstraction. Abstraction, abstraction, abstraction! Network File Systems: Naming, cache control, consistency

File-System Structure

Hadoop Distributed File System (HDFS) 10/05/2018 1

Distributed File Systems. File Systems

Distributed File Systems

DFS Case Studies, Part 1

Chapter 11: Implementing File Systems

COS 318: Operating Systems. NSF, Snapshot, Dedup and Review

Distributed System. Gang Wu. Spring,2018

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia,

CS November 2017

Current Topics in OS Research. So, what s hot?

Distributed Systems. 09r. Map-Reduce Programming on AWS/EMR (Part I) 2017 Paul Krzyzanowski. TA: Long Zhao Rutgers University Fall 2017

Cloud Computing CS

Chapter 11: Implementing File-Systems

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

Today: Coda, xfs. Case Study: Coda File System. Brief overview of other file systems. xfs Log structured file systems HDFS Object Storage Systems

Chapter 11: File-System Interface

Filesystems Lecture 11

CS 425 / ECE 428 Distributed Systems Fall Indranil Gupta (Indy) Nov 28, 2017 Lecture 25: Distributed File Systems All slides IG

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 5 Naming

Chapter 12 Distributed File Systems. Copyright 2015 Prof. Amr El-Kadi

Distributed File Systems. Jonathan Walpole CSE515 Distributed Computing Systems

Chapter 9: File System Interface

Distributed File Systems Part II. Distributed File System Implementation

Distributed Computing Environment (DCE)

Windows 7 Overview. Windows 7. Objectives. The History of Windows. CS140M Fall Lake 1

Chapter 18 Distributed Systems and Web Services

Vendor: Cloudera. Exam Code: CCA-505. Exam Name: Cloudera Certified Administrator for Apache Hadoop (CCAH) CDH5 Upgrade Exam.

CS454/654 Midterm Exam Fall 2004

Chapter 11: File-System Interface. Operating System Concepts 9 th Edition

File Management. Chapter 12

Chapter 12 File-System Implementation

C 1. Recap. CSE 486/586 Distributed Systems Distributed File Systems. Traditional Distributed File Systems. Local File Systems.

Transcription:

Cloud Computing CS 15-319 Distributed File Systems and Cloud Storage Part II Lecture 13, Feb 27, 2012 Majd F. Sakr, Mohammad Hammoud and Suhail Rehman 1

Today Last session Distributed File Systems and Cloud Storage- Part I Today s session Distributed File Systems and Cloud Storage- Part II Announcement: Project update is due next Wednesday, Feb 29 2

Discussion on Distributed File Systems Distributed File Systems (DFSs) Basics DFS Aspects 3

DFS Aspects Aspect Description Architecture How are DFSs generally organized? Processes Who are the cooperating processes? Are processes stateful or stateless? Communication What is the typical communication paradigm followed by DFSs? How do processes in DFSs communicate? Naming How is naming often handled in DFSs? Synchronization What are the file sharing semantics adopted by DFSs? Consistency and Replication What are the various features of client-side caching as well as server-side replication? Fault Tolerance How is fault tolerance handled in DFSs?

DFS Aspects Aspect Description Architecture How are DFSs generally organized? Processes Who are the cooperating processes? Are processes stateful or stateless? Communication What is the typical communication paradigm followed by DFSs? How do processes in DFSs communicate? Naming How is naming often handled in DFSs? Synchronization What are the file sharing semantics adopted by DFSs? Consistency and Replication What are the various features of client-side caching as well as server-side replication? Fault Tolerance How is fault tolerance handled in DFSs?

Processes (1) Cooperating processes in DFSs are usually the storage servers and file manager(s) The most important aspect concerning DFS processes is whether they should be stateless or stateful 1. Stateless Approach: Does not require that servers maintain any client state When a server crashes, there is no need to enter a recovery phase to bring the server to a previous state Locking a file cannot be easily done E.g., NFSv3 and PVFS (no client-side caching)

Processes (2) 2. Stateful Approach: Requires that a server maintains some client state Clients can make effective use of caches but this would entail an efficient underlying cache consistency protocol Provides a server with the ability to support callbacks (i.e., the ability to do RPC to a client) in order to keep track of its clients E.g., NFSv4 and HDFS

DFS Aspects Aspect Description Architecture How are DFSs generally organized? Processes Who are the cooperating processes? Are processes stateful or stateless? Communication What is the typical communication paradigm followed by DFSs? How do processes in DFSs communicate? Naming How is naming often handled in DFSs? Synchronization What are the file sharing semantics adopted by DFSs? Consistency and Replication What are the various features of client-side caching as well as server-side replication? Fault Tolerance How is fault tolerance handled in DFSs?

Communication Communication in DFSs is typically based on remote procedure calls (RPCs) The main reason for choosing RPC is to make the system independent from underlying OSs, networks, and transport protocols In NFS, all communication between a client and server proceeds along the Open Network Computing RPC (ONC RPC) HDFS uses RPC for the communication between clients, DataNodes and the NameNode PVFS currently uses TCP for all its internal communication The communication with I/O daemons and the manager is handled transparently within the API implementation

DFS Aspects Aspect Description Architecture How are DFSs generally organized? Processes Who are the cooperating processes? Are processes stateful or stateless? Communication What is the typical communication paradigm followed by DFSs? How do processes in DFSs communicate? Naming How is naming often handled in DFSs? Synchronization What are the file sharing semantics adopted by DFSs? Consistency and Replication What are the various features of client-side caching as well as server-side replication? Fault Tolerance How is fault tolerance handled in DFSs?

Naming Names are used to uniquely identify entities in distributed systems Entities may be processes, remote objects, newsgroups, Names are mapped to an entity s location using a name resolution An example of name resolution Name http://www.cdk5.net:8888/webexamples/earth.html DNS Lookup Resource ID (IP Address, Port, File Path) 55.55.55.55 8888 WebExamples/earth.html MAC address Host 02:60:8c:02:b0:5a

Naming In DFSs NFS is considered as a representative of how naming is handled in DFSs

Naming In NFS The fundamental idea underlying the NFS naming model is to provide clients with complete transparency Transparency in NFS is achieved by allowing a client to mount a remote file system into its own local file system However, instead of mounting an entire file system, NFS allows clients to mount only part of a file system A server is said to export a directory to a client when a client mounts a directory, and its entries, into its own name space

Mounting in NFS Client A Server Client B remote usr Mount steen subdirectory users usr Mount steen subdirectory work usr vu steen me mbox mbox mbox Exported directory mounted by Client A Exported directory mounted by Client B The file named /remote/vu/mbox at Client A Sharing files becomes harder The file named /work/vu/mbox at Client B

Sharing Files In NFS A common solution for sharing files in NFS is to provide each client with a name space that is partly standardized For example, each client may by using the local directory /usr/bin to mount a file system A remote file system can then be mounted in the same manner for each user

Example Client A Server Client B remote usr Mount steen subdirectory users usr Mount steen subdirectory work usr bin steen bin mbox mbox mbox Exported directory mounted by Client A Exported directory mounted by Client B The file named /usr/bin/mbox at Client A Sharing files resolved The file named /usr/bin/mbox at Client B

Mounting Nested Directories In NFSv3 An NFS server, S, can itself mount directories, Ds, that are exported by other servers However, in NFSv3, S is not allowed to export Ds to its own clients Instead, a client of S will have to explicitly mount Ds If S will be allowed to export Ds, it would have to return to its clients file handles that include identifiers for the exporting servers NFSv4 solves this problem

Mounting Nested Directories in NFS Client Server A Server B bin packages draw draw install install install Client imports directory from server A Server A imports directory from server B Client needs to explicitly import subdirectory from server B

NFS: Mounting Upon Logging In (1) Another problem with the NFS naming model has to do with deciding when a remote file system should be mounted Example: Let us assume a large system with 1000s of users and that each user has a local directory /home that is used to mount the home directories of other users Alice s (a user) home directory is made locally available to her as /home/alice This directory can be automatically mounted when Alice logs into her workstation In addition, Alice may have access to Bob s (another user) public files by accessing Bob s directory through /home/bob

NFS: Mounting Upon Logging In (2) Example (Cont d): The question, however, is whether Bob s home directory should also be mounted automatically when Alice logs in If automatic mounting is followed for each user: Logging in could incur a lot of communication and administrative overhead All users should be known in advance A better approach is to transparently mount another user s home directory on-demand

On-Demand Mounting In NFS On-demand mounting of a remote file system is handled in NFS by an automounter, which runs as a separate process on the client s machine Client Machine Server Machine 1. Lookup /home/alice NFS Client Automounter 3. Mount request users 2. Create subdir alice alice Local File System Interface home alice

Naming in HDFS An HDFS cluster consists of a single NameNode (the master) and multiple DataNodes (the slaves) The NameNode manages HDFS namespace and regulates accesses to files by clients It executes file system namespace operations (e.g., opening, closing, and renaming files and directories) It is an arbitrator and repository for all HDFS metadata It determines the mapping of blocks to DataNodes The DataNodes manage storage attached to the nodes that they run on They are responsible for serving read and write requests from clients They perform block creation, deletion, and replication upon instructions from the NameNode

A Client Reading Data from HDFS Here is the main sequence of events when reading a file in HDFS DistributedFileSystem HDFS Client NameNode 6. Close FSDataInputStream namenode Client JVM Client Node DataNode DataNode DataNode namenode namenode namenode

Data Reads DistributedFileSystem calls the NameNode, using RPC, to determine the locations of the blocks for the first few blocks in the file For each block, the NameNode returns the addresses of the DataNodes that have a copy of that block During the read process, DFSInputStream calls the NameNode to retrieve the DataNode locations for the next batch of blocks needed The DataNodes are sorted according to their proximity to the client in order to exploit data locality An important aspect of this design is that the client contacts DataNodes directly to retrieve data and is guided by the NameNode to the best DataNode for each block

A Client Writing Data to HDFS Here is the main sequence of events when writing a file to HDFS (the case assumes creating a new file, writing data to it, then closing the file) DistributedFileSystem HDFS Client NameNode 6. Close FSDataOutputStream namenode Client JVM Client Node 4 4 Pipeline of DataNodes DataNode DataNode DataNode namenode 5 namenode 5 namenode

Data Pipelining When a client is writing data to an HDFS file, its data is first written to a local file When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode The client then flushes the block to the first DataNode The first DataNode: Starts receiving the data in small portions (4KB) Writes each portion to its local repository Transfers that portion to the subsequent DataNode in the list A subsequent DataNode follows the same steps as the previous DataNode Thus, the data is pipelined from one DataNode to the next

PVFS System View Some major components of the PVFS system: Metadata server (mgr) I/O server (iod) PVFS native API (libpvfs) Compute Nodes Metadata Manager The mgr manages all file metadata for PVFS files The iods handle storing and retrieving file data stored on local disks connected to the node Libpvfs: provides user-space access to the PVFS servers handles the scatter/gather operations necessary to move data between user buffers and PVFS servers Network I/O Nodes

Naming in PVFS PVFS file systems may be mounted on all nodes in the same directory simultaneously This allows all nodes to see and access all files on the PVFS file system through the same directory scheme Once mounted, PVFS files and directories can be operated on with all the familiar tools, such as ls, cp, and rm With PVFS, clients can avoid making requests to the file system through the kernel by linking to the PVFS native API This library implements a subset of the UNIX operations which directly contact PVFS servers rather than passing through the local kernel

Metadata and Data Accesses For metadata operations, applications communicate through the library with the metadata server For data access, the metadata server is eliminated from the access path and instead I/O servers are contacted directly Metadata Access Data Access

Next Class: DFS Aspects Aspect Description Architecture How are DFSs generally organized? Processes Who are the cooperating processes? Are processes stateful or stateless? Communication What is the typical communication paradigm followed by DFSs? How do processes in DFSs communicate? Naming How is naming often handled in DFSs? Synchronization What are the file sharing semantics adopted by DFSs? Consistency and Replication What are the various features of client-side caching as well as server-side replication? Fault Tolerance How is fault tolerance handled in DFSs?