Resiliency at Scale in the Distributed Storage Cloud
Alma Riska, Advanced Storage Division, EMC Corporation
In collaboration with many at the Cloud Infrastructure Group
Outline
- Wide topic, but this talk will focus on:
  - Architecture
  - Resiliency
  - Failures
  - Redundancy schemes
  - Policies to differentiate services
Digital Content Creation & Investment
Scaled-out Storage Systems
- Large amount of hardware
  - Thousands of disks
  - Tens to hundreds of servers
  - Significant amount of networking
- Wide range of applications
  - Internet Service Providers
  - Online Service Providers
  - Private cloud
- Up to millions of users
Storage Requirements
- Store massive amounts of data: (tens of) petabytes
  - Direct-attached, high-capacity nearline HDDs
- Highly available: minimum downtime
- Reliably stored: beyond the traditional 5 nines
- Ubiquitous access: across geographical boundaries
Scaled-out Storage Architecture
- Hardware organized in nodes / racks / geographical sites
- Sites connected via LAN / WAN
- Services run on each node
Scalability in Scaled-out Storage
- Independence between components: no single point of failure
  - Hardware: disks, nodes, racks, sites
  - Software: services such as metadata
- Seamlessly add/remove storage devices or nodes
- Isolation of failures
- Sustained performance
- Shared-nothing architecture: elasticity / resilience / performance
EMC Atmos Architecture
- Shared-nothing architecture
- Nodes: 15-60 large-capacity SAS HDDs
- Racks: up to 8 nodes, or 480 4 TB HDDs (>1 PByte)
- At least two sites, connected via LAN / WAN
Storage Resiliency
- Data Reliability: data is stored persistently on device(s) such as HDDs
- Data Availability: data remains available independently of hardware failures
- Data Consistency and Accuracy: returned data is what the user stored in the system
Failures
- Data devices (HDDs)
- Other components
  - Hardware
  - Network
  - Power outages
  - Cooling outages
- Software
  - Drivers
  - Services (metadata)
Transient Failures
- Many failures are transient: a temporary interruption of a component's operation
- Variability in component response time can be seen as a transient failure, particularly network delays
- System load causes transient failures
- Transient failures occur much more often than hardware component failures
Impact of Failures
- Reliability: disk failures directly, but all other failures too
- Availability: directly impacted by any failure, particularly transient ones
- Consistency: service failures (metadata), transient failures
Criticality of Failures in the Cloud
- Large scale: a node failure, for example, makes a large amount of data and other components unavailable simultaneously
- Since there are more components in the system, failures happen more often
- The system needs to be designed with high component unavailability in mind, even if the unavailability is transient
Challenges of Handling Failures
- Correct identification of failures
  - Many failures have similar symptoms: a disk may be unreachable due to disk failure, controller failure, power failure, or network failure
- Effective isolation of failures
  - Limit the cases where a single component failure becomes a node or site failure
- Timely detection of failures
  - In a large system, failures may go undetected, particularly transient failures and their impact
Example of System Alerts
- HDD events are overwhelming
- Events do not necessarily indicate disk failures, but rather temporarily unreachable HDDs
- Various causes; the majority are transient
Fault Tolerance in Cloud Storage
- Transparency toward failures
  - Disks / nodes / racks
  - Services
  - Even entire sites
- Transparency varies by system goals or targets
- Resilience goals determine fault domains
Fault Domains
- The hierarchy of the sets of resources whose failure can be tolerated in a system
- Example: tolerating a site failure means tolerating two racks, or 16 nodes, or 240 disks
- Determines the distribution of data and services
Fault Tolerance and Redundancy
- Fault tolerance is primarily achieved via redundancy: more hardware and software than needed
- Achieving a fault-tolerance goal depends on:
  - Amount of redundancy (storage capacity)
    - Traditionally parity (RAID)
    - In the cloud, often replication
    - Erasure coding
  - Proactive measures
    - Monitoring/analysis/prediction of system health
    - Background detection of failures
Fault Tolerance and Data Replication
- Replicate data (including metadata) up to 4 times
- Pros
  - High reliability
  - High availability
  - Good performance and accessibility
  - Easy to implement
- Cons
  - High capacity overhead: up to 300% with 4-way replication
Replication in Scale-Out Cloud Storage
- Average case in a cloud storage system:
  - Several tens (up to a hundred) of PBytes of raw capacity
  - Multiple tens of PBytes of user capacity
- Does not scale well with regard to:
  - Cost
  - Resilience: with only 3 replicas it is not always possible to tolerate multi-node and site failures
Erasure Coding
- Generalization of parity-based fault-tolerance (RAID) schemes; replication is a special case
- Out of n fragments of information:
  - m are actual data
  - k are additional codes (n = m + k)
- Up to k missing fragments can be tolerated
- The code is referred to as an m/n code
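The n = m + k scheme can be illustrated with the simplest erasure code, a single XOR parity (k = 1). This is a toy sketch only; production systems use Reed-Solomon-style codes to tolerate k > 1 losses:

```python
from functools import reduce

def xor_bytes(fragments):
    """Column-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*fragments))

def encode(data_fragments):
    """m data fragments -> n = m + 1 fragments (the last is the XOR parity)."""
    return data_fragments + [xor_bytes(data_fragments)]

def reconstruct(fragments):
    """Rebuild at most one missing (None) fragment; return the m data fragments."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    assert len(missing) <= 1, "single parity tolerates only one lost fragment"
    if missing:
        fragments[missing[0]] = xor_bytes([f for f in fragments if f is not None])
    return fragments[:-1]

frags = encode([b"AB", b"CD", b"EF"])  # m = 3, k = 1, n = 4
frags[1] = None                        # lose any one fragment
assert reconstruct(frags) == [b"AB", b"CD", b"EF"]
```

XOR works here because XOR-ing all surviving fragments (data and parity) regenerates the missing one; replication (m = 1) and RAID-5 parity both fit this n = m + k view.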
Erasure Coding
- Capacity overhead: k/n
  - Overhead reduces as n increases for the same protection level
- Complexity: computational and management
  - Increases as n increases
- As network delays come to dominate performance, erasure coding becomes a feasible approach
- Trade-off between protection, complexity, and overhead
- Common EMC Atmos codes are 9/12 and 10/16
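The k/n overhead figures can be checked with one line of arithmetic; the helper below is a sketch, with replication falling out as the m = 1 special case:

```python
def ec_overhead(m: int, k: int) -> float:
    """Fraction of raw capacity spent on redundancy: k / n, with n = m + k."""
    return k / (m + k)

# The common Atmos codes from the slide:
assert ec_overhead(9, 3) == 0.25     # 9/12 code: 25% of raw capacity is codes
assert ec_overhead(10, 6) == 0.375   # 10/16 code: 37.5%, but tolerates 6 losses
# Replication is the m = 1 special case: 4-way replication spends 75% of raw
assert ec_overhead(1, 3) == 0.75
```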
EC vs. Other Redundancy Schemes
Erasure Coding at Scale
- Data fragments are distributed based on the system's fault domains
- Placement of these fragments is crucial
  - Round-robin placement ensures uniform distribution of fragments (assumed in the previous calculations)
- Placement of data fragments depends on user requirements with regard to performance and priorities
EC Data Placement in the Cloud
- We develop a model to see the dependencies between EC fragment placement and system size/architecture
- Determine tolerance toward site failures as a function of:
  - Number of sites
  - m/n erasure code parameters
- Additional node-failure tolerance
EC Data Placement in the Cloud
- Assumptions:
  - Homogeneous, geographically distributed sites
  - Equal number of nodes and disks per site
  - Equal network delays between any pair of sites
  - Equal data priority
  - Round-robin distribution of the fragments across sites / nodes / disks
  - Failures on disks / nodes / sites (power, network)
Failure Tolerance in a 2-Site System
- In a two-site system, only a single site failure can be tolerated
- Each site has 6 nodes available
- The numbers inside each (x, y) tuple are the number of node failures tolerated in addition to the site failures tolerated
Failure Tolerance in a 4-Site System
- In a four-site system, one, two, or three site failures can be tolerated
- Each site has 6 nodes available
- The numbers inside each (x, y) tuple are the number of node failures tolerated in addition to the site failures tolerated
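Under the round-robin assumption, the site-failure tolerance of an m/n code can be estimated with a small model like the following (a sketch, not the talk's actual model): a failed site loses at most ceil(n/sites) fragments, and the data survives as long as the total losses stay within k:

```python
import math

def site_failures_tolerated(m: int, k: int, sites: int) -> int:
    """Round-robin spreads n = m + k fragments across sites, so each lost
    site costs at most ceil(n / sites) fragments; the code survives while
    the total number of lost fragments stays <= k."""
    per_site = math.ceil((m + k) / sites)
    return min(k // per_site, sites - 1)

# 9/12 code (m=9, k=3) across 4 sites: 3 fragments per site -> 1 site failure
assert site_failures_tolerated(9, 3, 4) == 1
# 6/12 code across 4 sites: 3 fragments per site, k=6 -> 2 site failures
assert site_failures_tolerated(6, 6, 4) == 2
```

The additional node-failure tolerance in the (x, y) tuples follows the same logic one level down the fault-domain hierarchy, with the fragment budget that remains after the site failures.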
Heterogeneous Protection Policies
- As systems evolve, their resources become heterogeneous
  - Different node or site sizes
  - Different network bandwidth
  - Different data priority / location / origin
- In such a case:
  - Uniformity of data distribution is not a requirement
  - The above factors (including performance) should determine data-fragment placement
Abstraction of Heterogeneous Cloud Storage
- Group components based on affinity criteria (e.g., network bandwidth) to create homogeneous sub-clusters
- Determine redundancy for each sub-cluster
- Handle each sub-cluster independently
- Combine the outcomes for system-wide placement
Abstraction of Heterogeneous Cloud Storage - Example
- Two sites are close (e.g., on the same US coast) with a fast network connection
  - Data can be placed on any of the nodes in both sites, and retrieving it will not suffer extra network delay
- If a 6/12 redundancy scheme is used and the data's primary location is the two-site sub-cluster, then the 6 data fragments can be placed in its two sites and the 6 codes in the other, remote sites
  - Accessing the data is not affected by network bandwidth
  - One site failure is tolerated
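A minimal sketch of the placement rule just described (the helper and site names are hypothetical): data fragments stay in the tenant's near sub-cluster, coded fragments go to the remote sites:

```python
def place_fragments(m, k, near_sites, remote_sites):
    """Affinity-aware placement: m data fragments round-robin across the
    near sub-cluster, k coded fragments round-robin across remote sites."""
    placement = {}
    for i in range(m):
        placement[f"data-{i}"] = near_sites[i % len(near_sites)]
    for i in range(k):
        placement[f"code-{i}"] = remote_sites[i % len(remote_sites)]
    return placement

# 6/12 scheme: 6 data fragments in the two near sites, 6 codes remote
p = place_fragments(6, 6, ["site-A", "site-B"], ["site-C", "site-D"])
assert sum(1 for s in p.values() if s in ("site-A", "site-B")) == 6
assert sum(1 for s in p.values() if s in ("site-C", "site-D")) == 6
```

Reads hit only the near sub-cluster in the common case, while losing any one of the four sites still leaves at least 6 of the 12 fragments, enough to decode.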
Differentiate Protection via Policy
- Flexible policy settings for grouping resources and isolating applications/tenants
  - Eases management of a large heterogeneous system
- Hybrid protection schemes that combine multiple replication schemes
  - E.g., a two-replica policy where:
    - The first replica is the original data (stored in the site closest to the tenant)
    - The second replica is a 9/12 EC scheme that distributes the data across the remaining sites for resilience
Protection Policies in the Field
[Table: per-tenant policy configurations observed in the field, with columns for 2 replicas, >=3 replicas, 1 EC replica, >=2 EC replicas, and mixed regular/EC policies; entries combine sync/async replication with EC codes such as 9/3, 10/2, and 10/6]
Proactive Failure Detection
- Monitoring the health of devices and services
  - Logging events
- Taking corrective measures before failures happen
  - Strengthens resilience: issues are addressed before the redundancy is degraded by a failure
- Example: use SMART logs to determine drive health and replace HDDs that are about to fail rather than already failed
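The SMART-based example might look like the sketch below; the attribute names and thresholds are illustrative assumptions, not EMC Atmos values:

```python
# Hypothetical SMART attributes and replacement thresholds -- illustrative only.
FAILURE_INDICATORS = {
    "reallocated_sector_count": 50,
    "current_pending_sectors": 10,
    "uncorrectable_errors": 1,
}

def drive_at_risk(smart_log: dict) -> bool:
    """Flag a drive for proactive replacement before it actually fails."""
    return any(smart_log.get(attr, 0) >= limit
               for attr, limit in FAILURE_INDICATORS.items())

assert drive_at_risk({"current_pending_sectors": 25})
assert not drive_at_risk({"reallocated_sector_count": 3})
```

Replacing an at-risk drive triggers a planned copy while all redundancy is intact, instead of a rebuild racing against further failures.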
Proactive Failure Detection
- Verify in the background the validity of data and services and the health of hardware
- A critical aspect of resiliency in the cloud
  - Systems are large, and some portions may be idle for extended periods of time
  - Failures and issues may go undetected
- Ensures timely failure detection
- Improves resilience for a given amount of redundancy
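Background verification of stored data can be sketched as a checksum scrub; the fragment-store layout here is a hypothetical stand-in:

```python
import hashlib

def scrub(fragment_store: dict, checksums: dict) -> list:
    """Recompute each stored fragment's digest and report mismatches, so
    repair can run while the remaining redundancy is still intact."""
    return [fid for fid, data in fragment_store.items()
            if hashlib.sha256(data).hexdigest() != checksums[fid]]

store = {"f1": b"hello", "f2": b"world"}
sums = {fid: hashlib.sha256(d).hexdigest() for fid, d in store.items()}
store["f2"] = b"w0rld"  # simulate silent corruption on an idle fragment
assert scrub(store, sums) == ["f2"]
```

Running such a scrub periodically bounds how long a latent fault can hide, which is what lets a fixed redundancy level deliver better effective resilience.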
Conclusions
- Resilience at scale = reliability + availability + consistency
- Wide range of large-scale failures
- Redundancy aids resiliency at scale
- Erasure coding: efficient scaling of resiliency
- Proactive measures ensure resiliency at scale