Fault-Tolerant Storage and Implications for the Cloud

Charles Snyder
Abstract

Fault-tolerance is an essential aspect of any storage system: data must be correctly preserved and transmitted in order to be useful. The emergence of cloud computing as a service requires storage systems to be fault-tolerant in the same sense as traditional computing systems, as well as in new ways that accommodate a distributed system. This paper surveys the current state of reliability in storage systems and how the picture changes in the cloud.

1. Introduction

Reliability is a key issue in system design: an unreliable system is subject to frequent downtime and decreased availability, which in practice translates to a loss of customers. Designing reliable systems requires the ability to deal with faults by masking them and recovering behind the scenes, so that users are unaware a fault has even occurred. The most common technique for achieving such fault-tolerance is replication: creating copies of data, components, or even the entire system in the hope of providing at least one correct copy at any given time. Creating fault-tolerant storage for a traditional computing system is no easy task; component degradation, physical stress, power failures, and similar events can cause data corruption, data loss, or even bring down the entire storage system [1]. Fortunately, current magnetic disk technology has been around long enough for significant advances in fault-tolerance, though new storage technologies bring their own sets of problems and solutions with respect to reliability [2]. Fault-tolerant storage is an even greater challenge in the cloud. With the paradigm's focus on up-time and reliable computing, storage systems must meet higher standards of fault-tolerance. This typically means sacrificing resources for either more replication or smarter replication, to keep throughput high even during high-traffic periods or geographically isolated system failures.

2. Basic Concepts

Fault: Faults are unintended operations of the system. The term covers incorrectly performed read/write operations as well as storage failures; a faulty read or write places an incorrect value into memory or storage, causing errors when that value is used later, while a storage failure makes storage permanently or temporarily unavailable. Faulty reads and writes can be caused by component deterioration, accumulation of dust on the magnetic platter, and a variety of other noise events [1]. Storage failures generally have more serious causes: overheating or power failures make a disk or rack inoperable, or a natural disaster wipes out a physical data center.

Reliability/Fault-tolerance: In storage systems, fault-tolerance is the ability to consistently access correct data. The longer a disk remains operable, the more it can be accessed; however, it must also provide non-volatile storage: when accessed, it must supply data that is consistent with past operations. Fault-tolerance is typically measured
by the mean time to failure (MTTF), mean time to detection (MTTD), mean time to repair (MTTR) [3], and mean time to data loss (MTTDL) metrics.

Replication and Redundancy: Replication and redundancy are the main methods by which systems improve fault-tolerance. Components are replicated to account for the failure of a single component (or a few); if some components fail, the others remain operational and assume responsibility [4]. Such components are placed redundantly so as not to cause an extreme change in control flow when a fault occurs [3]; this helps to mask both the fact that a fault has occurred and the latency involved in switching.

Cloud Storage: Storage in a cloud architecture amounts to a large number of traditional storage systems with an additional layer of abstraction that provides the illusion of a single, high-capacity storage system. To ensure reliable storage, not only must each disk be fault-tolerant, but the overarching system must be fault-tolerant as well: cloud providers often operate many server farms across the world, and the system must be reliable in the face of a single disk failure as well as a farm failure. Typically a cloud storage system leaves the responsibility for data reliability to the individual disks and network-level error correction, while assuming the responsibility of masking disk and farm failures by replicating data across a handful of farms. In this paper, we examine fault-tolerance at both the disk and cloud levels.

3. Traditional (Single-Machine) Storage Fault-Tolerance

The main ways in which traditional machines have achieved reasonably fault-tolerant storage are by making storage disks themselves more fault-tolerant and by designing aggregated storage systems that introduce redundancy. A physical disk is subject to wear from use and age: constant rotation and exposure to magnetic fields can degrade parts of the disk and affect its ability to store data.
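Before turning to specific mechanisms, the reliability metrics defined in Section 2 can be combined into a rough availability estimate. A minimal sketch in Python, using illustrative figures (the MTTF and MTTR values below are assumptions for the example, not measurements):

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is usable."""
    return mttf_hours / (mttf_hours + mttr_hours)

# A single disk: assume 1,000,000 hours MTTF and 24 hours MTTR.
single = availability(1_000_000, 24)

# A mirrored pair is unavailable only when both copies are down at once;
# assuming independent failures, the combined unavailability is the product
# of the individual unavailabilities.
mirrored = 1 - (1 - single) ** 2

print(f"single disk  : {single:.6f}")
print(f"mirrored pair: {mirrored:.9f}")
```

Even this toy calculation shows why replication pays off: squaring an already small unavailability drives it toward zero far faster than improving a single disk's MTTF could.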
During operation the machine can be jarred, causing the read head to impact the disk in a "head crash"; this damages the area of impact and scatters magnetic dust within the disk's sealed container that can later interfere with reading. Such degradation, in addition to manufacturing defects, renders parts of the disk temporarily or permanently unusable, so the main sense of fault-tolerance in a single disk is to mask as many of the temporary failures as possible (permanent failures cannot be overcome without outside assistance). To achieve this fault-tolerance, levels of disk controllers have been introduced to ensure fail-fast reliable behavior. In a fail-fast system, when a component fails it immediately stops operation and alerts its superior; rather than attempting to complete a failed operation, the superior is notified and is responsible for making a more "intelligent" decision. One way a storage system can be made fail-fast is by introducing an error-detecting mechanism such as a sector checksum; if a read fails at the mechanical level some data will still be reported, but by verifying the data against a checksum the controller detects the failure and can report a fail-fast error [4]. [1] discusses such an approach: at the lowest level the hardware simply reads and writes as asked; at the fail-fast level a read entails checking the sector checksum and verifying that the read data was correct (failing if it was not); and at the careful level a read entails performing fail-fast reads until one succeeds. If there is no success within a certain number of fail-fast reads, the careful layer concludes that a hardware error has occurred and, since it cannot obtain correct data, reports an error. From a higher level, such a system tolerates temporary errors, such as magnetic dust on the disk, by retrying in hopes of a success: if the dust shifts as the disk rotates, the original data will be uncovered. However, permanent errors such as unusable sectors from the manufacturing process are insurmountable with this technique.

To provide additional reliability, we can aggregate several disks into a functionally single storage system, as in RAID. At the core of RAID design is the duplication and distribution of data across multiple disks; RAID 1 simply duplicates the data onto two identical disks: if disk 1 fails, we simply try disk 2. This system will work through a single disk failure, and it will preserve data after a single permanent sector failure, but if both disks fail at the same sector the data is lost. Other RAID organizations, such as RAID 4 and 5, provide the same single-disk failure tolerance with increased performance. The advantage of most RAID implementations is that they not only detect and mask errors but also provide for error correction. In the event that an entire RAID 1 disk fails and all its data is lost, the contents of the functioning disk can be copied to a new disk. Similarly, for RAID implementations with parity, the missing disk's contents can be inferred from the remaining disks. Unfortunately, many obvious disk recovery schemes require the total use of the remaining hardware: if a RAID 4 parity disk fails, the obvious solution is to reconstruct the missing sectors by reading the corresponding data sectors and re-computing the parity. While these reads can be done in parallel, user reads cannot proceed in parallel with the parity reconstruction, and so system performance will suffer. [7] proposes a more efficient technique for recovering data from an XOR-based checksum, using fewer disk I/O operations so as to improve performance. With the emergence of new storage technology, it may become possible to make storage more innately reliable thanks to the medium.
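The parity-based recovery described above, in which a missing disk's contents are inferred by XOR-ing the survivors with the parity block, can be sketched as follows. This is a toy illustration on short byte strings rather than real disk sectors, and the block contents are invented for the example:

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks, as used for RAID 4/5 parity."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # contents of three data disks
parity = xor_blocks(data)            # block written to the parity disk

# Disk 1 fails: because x ^ x = 0, XOR-ing the surviving data blocks
# with the parity block cancels everything except the lost block.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]
```

The same identity is what makes the optimized recovery schemes in [7] possible: since XOR is associative and commutative, a reconstruction can choose which subsets of surviving blocks to read, and hence how many I/O operations to spend.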
Advances in solid-state drives could lead to memory that is competitive in capacity and price (with advantages in speed), while avoiding the current manufacturing problems of bad sectors and the operational degradation from head crashes and magnetic dust. Other technologies such as phase-change memory could provide fast, dense memory without the volatility of DRAM [2].

4. Cloud Storage Fault-Tolerance

Cloud datacenters offer large volumes of storage to many users; these datacenters comprise enormous numbers of individual storage drives located in farms across continents, yet must still appear and function as a single storage unit. For such systems it is assumed that each individual storage device functions as reliably as necessary, and the topic of fault-tolerant cloud storage focuses on maintaining data accessibility and consistency in the face of shifting network loads and geographical disturbances.

A major initial concern of cloud storage is achieving an acceptable balance between the large amounts of data and the customers' preference for fast access. If the data is placed on as few machines as possible, the cloud host saves on overhead and operating costs, but congestion at those machines will hurt performance in the customers' eyes. Conversely, distributing data across machines such that the amount of data on each machine is well below capacity will provide lower access times, but will incur higher operating costs (costs that directly affect customers). In addition, saving the data on only one machine would introduce an obvious single point of failure, which, given the large network traffic seen by cloud systems, is unacceptable. Instead, data is saved on multiple machines: if one machine already has a large queue of requests, a new request can be routed to a second, available machine to reduce response time; similarly, if the first machine overheats and becomes temporarily or permanently unavailable, the data is not lost and requests can be routed to the second machine. However, the datacenter that contains these machines becomes the new point of failure: a spike in network traffic, a power outage, or a natural disaster can make the entire datacenter unusable. Because of this, it is common for cloud providers to distribute data across geographically diverse datacenters in order to ensure constant availability (an earthquake at a datacenter in California will leave a datacenter in China untouched and working).

This replication introduces the problem of consistency: if a user's data is written to a number of corresponding machines, then each of these machines must contain an identical copy of the data. Luckily, protocols for establishing such consistency in a system of multiple machines already exist, such as the Paxos algorithm or the solution proposed in [6], and it is (relatively) straightforward to apply these protocols in a cloud environment. To maintain this replication consistency, the system must be able to update itself if a machine comes back online after being absent from the original write, or if one of the original machines fails and must be replaced. Restoring this consistency is much the same as in other multiple-storage systems like RAID; however, the constant network traffic and concurrent machine accesses in a cloud environment place more emphasis on efficiency to prevent interference with other activities, so finding an optimal recovery scheme such as the one proposed in [7] is advantageous. A key advantage in attacking the problem of reliability in cloud storage is the cloud's similarity to a single-machine RAID implementation. If we view the entire cloud structure - processing machines, network, controllers, etc.
- as the non-storage elements of a single machine, then the machines aggregated from all of the datacenters become an extensive array of semi-redundant, semi-independent disks. The main difference between the cloud array and the traditional RAID array is that the cloud uses an uneven distribution: while the RAID array places data evenly according to a simple formula, data placement in the cloud appears more chaotic and can even adapt to network load and usage spikes. Regardless of this difference, viewing cloud storage at a high level as RAID allows the general application of classic tools and algorithms, as well as simplifying analysis. Indeed, even the physical dissimilarities of RAID and cloud storage systems mirror each other: RAID ideally uses identical disks, but this is impossible thanks to the heterogeneity imparted by disk manufacturing techniques, and RAID controllers must adapt to small differences [5], while a cloud storage system intentionally aggregates heterogeneous disks.

5. Conclusions

The main mechanisms used to increase the fault-tolerance of traditional storage systems are abstraction and replication: by adding levels of abstraction above the basic disk controller we achieve fail-fast reliable performance, while the replication of data and disks provides redundancy that can tolerate a fixed number of permanent failures. Reliable cloud storage expands on these ideas, using controllers that distribute data to provide independence and quick access; however, the vast scale of cloud storage requires a more calculated approach to ensure a realistic system. Thanks to the abstract similarity of traditional and cloud storage systems, existing methods to ensure fault-tolerance are often easily applied to the cloud in theory, but may require modifications in practice to accommodate heterogeneous and geographically distributed systems.

6. References

[1] J. Saltzer and M. Kaashoek. Principles of Computer System Design: An Introduction, Part II. Morgan Kaufmann, 2009.
[2] K. Bailey, L. Ceze, S. Gribble, and H. Levy. Operating system implications of fast, cheap, non-volatile memory. In USENIX HotOS XIII, 2011.
[3] D. Siewiorek. Architecture of fault-tolerant computers. In Computer, Vol. 17, Issue 8, 1984.
[4] J. Gray and D. Siewiorek. High-availability computer systems. In Computer, Vol. 24, Issue 9, 1991.
[5] E. Krevat, J. Tucek, and G. Ganger. Disks are like snowflakes: no two are alike. In USENIX HotOS XIII, 2011.
[6] R. Padilha and F. Pedone. Scalable Byzantine fault-tolerant storage. In IEEE/IFIP 41st International Conference on Dependable Systems and Networks, 2011.
[7] O. Khan, R. Burns, J. Plank, and C. Huang. In search of I/O-optimal recovery from disk failures. In USENIX HotStorage '11, 2011.
WHITE PAPER: BEST PRACTICES Sizing and Scalability Recommendations for Symantec Rev 2.2 Symantec Enterprise Security Solutions Group White Paper: Symantec Best Practices Contents Introduction... 4 The
More informationViewstamped Replication to Practical Byzantine Fault Tolerance. Pradipta De
Viewstamped Replication to Practical Byzantine Fault Tolerance Pradipta De pradipta.de@sunykorea.ac.kr ViewStamped Replication: Basics What does VR solve? VR supports replicated service Abstraction is
More information3.3 Understanding Disk Fault Tolerance Windows May 15th, 2007
3.3 Understanding Disk Fault Tolerance Windows May 15th, 2007 Fault tolerance refers to the capability of a computer or network to continue to function when some component fails. Disk fault tolerance refers
More informationBlizzard: A Distributed Queue
Blizzard: A Distributed Queue Amit Levy (levya@cs), Daniel Suskin (dsuskin@u), Josh Goodwin (dravir@cs) December 14th 2009 CSE 551 Project Report 1 Motivation Distributed systems have received much attention
More informationDisks. Storage Technology. Vera Goebel Thomas Plagemann. Department of Informatics University of Oslo
Disks Vera Goebel Thomas Plagemann 2014 Department of Informatics University of Oslo Storage Technology [Source: http://www-03.ibm.com/ibm/history/exhibits/storage/storage_photo.html] 1 Filesystems & Disks
More informationFAULT TOLERANT SYSTEMS
FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 18 Chapter 7 Case Studies Part.18.1 Introduction Illustrate practical use of methods described previously Highlight fault-tolerance
More information416 Distributed Systems. Errors and Failures Oct 16, 2018
416 Distributed Systems Errors and Failures Oct 16, 2018 Types of Errors Hard errors: The component is dead. Soft errors: A signal or bit is wrong, but it doesn t mean the component must be faulty Note:
More informationRediffmail Enterprise High Availability Architecture
Rediffmail Enterprise High Availability Architecture Introduction Rediffmail Enterprise has proven track record of 99.9%+ service availability. Multifold increase in number of users and introduction of
More informationDistributed Systems COMP 212. Revision 2 Othon Michail
Distributed Systems COMP 212 Revision 2 Othon Michail Synchronisation 2/55 How would Lamport s algorithm synchronise the clocks in the following scenario? 3/55 How would Lamport s algorithm synchronise
More informationComputer Architecture Computer Science & Engineering. Chapter 6. Storage and Other I/O Topics BK TP.HCM
Computer Architecture Computer Science & Engineering Chapter 6 Storage and Other I/O Topics Introduction I/O devices can be characterized by Behaviour: input, output, storage Partner: human or machine
More informationChapter 11: File System Implementation. Objectives
Chapter 11: File System Implementation Objectives To describe the details of implementing local file systems and directory structures To describe the implementation of remote file systems To discuss block
More informationSpecifying data availability in multi-device file systems
Specifying data availability in multi-device file systems John Wilkes and Raymie Stata Concurrent Computing Department Hewlett-Packard Laboratories Palo Alto, CA Technical report HPL CSP 90 6 1 April 1990
More informationFault Tolerance Dealing with an imperfect world
Fault Tolerance Dealing with an imperfect world Paul Krzyzanowski Rutgers University September 14, 2012 1 Introduction If we look at the words fault and tolerance, we can define the fault as a malfunction
More informationHigh Availability and Disaster Recovery Solutions for Perforce
High Availability and Disaster Recovery Solutions for Perforce This paper provides strategies for achieving high Perforce server availability and minimizing data loss in the event of a disaster. Perforce
More informationBasic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.
Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery
More informationINFRASTRUCTURE BEST PRACTICES FOR PERFORMANCE
INFRASTRUCTURE BEST PRACTICES FOR PERFORMANCE Michael Poulson and Devin Jansen EMS Software Software Support Engineer October 16-18, 2017 Performance Improvements and Best Practices Medium-Volume Traffic
More informationCSE380 - Operating Systems. Communicating with Devices
CSE380 - Operating Systems Notes for Lecture 15-11/4/04 Matt Blaze (some examples by Insup Lee) Communicating with Devices Modern architectures support convenient communication with devices memory mapped
More informationAppendix D: Storage Systems (Cont)
Appendix D: Storage Systems (Cont) Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Reliability, Availability, Dependability Dependability: deliver service such that
More informationIntelligent Drive Recovery (IDR): helping prevent media errors and disk failures with smart media scan
Intelligent Drive Recovery (IDR): helping prevent media errors and disk failures with smart media scan White paper Version: 1.1 Updated: Sep., 2017 Abstract: This white paper introduces Infortrend Intelligent
More informationCourse: Advanced Software Engineering. academic year: Lecture 14: Software Dependability
Course: Advanced Software Engineering academic year: 2011-2012 Lecture 14: Software Dependability Lecturer: Vittorio Cortellessa Computer Science Department University of L'Aquila - Italy vittorio.cortellessa@di.univaq.it
More informationDisks and RAID. CS 4410 Operating Systems. [R. Agarwal, L. Alvisi, A. Bracy, E. Sirer, R. Van Renesse]
Disks and RAID CS 4410 Operating Systems [R. Agarwal, L. Alvisi, A. Bracy, E. Sirer, R. Van Renesse] Storage Devices Magnetic disks Storage that rarely becomes corrupted Large capacity at low cost Block
More informationIn this unit we are going to review a set of computer protection measures also known as countermeasures.
1 In this unit we are going to review a set of computer protection measures also known as countermeasures. A countermeasure can be defined as an action, device, procedure, or technique that reduces a threat,
More informationConsidering the 2.5-inch SSD-based RAID Solution:
Considering the 2.5-inch SSD-based RAID Solution: Using Infortrend EonStor B12 Series with Intel SSD in a Microsoft SQL Server Environment Application Note Abstract This application note discusses the
More informationRAID - Redundant Array of Inexpensive/Independent Disks
Safety of information systems Lecturer: Roman Danel Hardware means - UPS, RAID... RAID - Redundant Array of Inexpensive/Independent Disks is a data storage virtualization technology that combines multiple
More informationCS370: System Architecture & Software [Fall 2014] Dept. Of Computer Science, Colorado State University
CS 370: SYSTEM ARCHITECTURE & SOFTWARE [MASS STORAGE] Frequently asked questions from the previous class survey Shrideep Pallickara Computer Science Colorado State University L29.1 L29.2 Topics covered
More informationDistributed Operating Systems
2 Distributed Operating Systems System Models, Processor Allocation, Distributed Scheduling, and Fault Tolerance Steve Goddard goddard@cse.unl.edu http://www.cse.unl.edu/~goddard/courses/csce855 System
More informationReplicator Disaster Recovery Best Practices
Replicator Disaster Recovery Best Practices VERSION 7.4.0 June 21, 2017 Scenario Guide Article 1120504-01 www.metalogix.com info@metalogix.com 202.609.9100 Copyright International GmbH, 2002-2017 All rights
More informationVERITAS Volume Replicator Successful Replication and Disaster Recovery
VERITAS Replicator Successful Replication and Disaster Recovery Introduction Companies today rely to an unprecedented extent on online, frequently accessed, constantly changing data to run their businesses.
More informationNutanix Tech Note. Virtualizing Microsoft Applications on Web-Scale Infrastructure
Nutanix Tech Note Virtualizing Microsoft Applications on Web-Scale Infrastructure The increase in virtualization of critical applications has brought significant attention to compute and storage infrastructure.
More informationCSE 451: Operating Systems. Section 10 Project 3 wrap-up, final exam review
CSE 451: Operating Systems Section 10 Project 3 wrap-up, final exam review Final exam review Goal of this section: key concepts you should understand Not just a summary of lectures Slides coverage and
More informationThe Microsoft Large Mailbox Vision
WHITE PAPER The Microsoft Large Mailbox Vision Giving users large mailboxes without breaking your budget Introduction Giving your users the ability to store more email has many advantages. Large mailboxes
More informationCONFIGURATION GUIDE WHITE PAPER JULY ActiveScale. Family Configuration Guide
WHITE PAPER JULY 2018 ActiveScale Family Configuration Guide Introduction The world is awash in a sea of data. Unstructured data from our mobile devices, emails, social media, clickstreams, log files,
More informationData Protection Using Premium Features
Data Protection Using Premium Features A Dell Technical White Paper PowerVault MD3200 and MD3200i Series Storage Arrays www.dell.com/md3200 www.dell.com/md3200i THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES
More informationVolley: Automated Data Placement for Geo-Distributed Cloud Services
Volley: Automated Data Placement for Geo-Distributed Cloud Services Authors: Sharad Agarwal, John Dunagen, Navendu Jain, Stefan Saroiu, Alec Wolman, Harbinder Bogan 7th USENIX Symposium on Networked Systems
More information1 of 6 4/8/2011 4:08 PM Electronic Hardware Information, Guides and Tools search newsletter subscribe Home Utilities Downloads Links Info Ads by Google Raid Hard Drives Raid Raid Data Recovery SSD in Raid
More information