Designing Fault-Tolerant Applications
Miles Ward, Enterprise Solutions Architect
Building Fault-Tolerant Applications on AWS
- White paper published last year
- Sharing best practices
- We'd like to hear your best practices as well
- http://media.amazonwebservices.com/aws_building_fault_tolerant_applications.pdf
AWS Fault-Tolerant Building Blocks
Two approaches:
1) AWS services that are inherently fault-tolerant and highly available:
   - Amazon Simple Storage Service (S3)
   - Amazon SimpleDB
   - Amazon SQS, SNS, SES, CloudWatch, CloudFront, and more
2) AWS services that offer tools and features to design fault-tolerant and highly available systems:
   - Amazon Elastic Compute Cloud (EC2): Availability Zones, Elastic IPs, EBS, etc.; flexible to trade off budget vs. time to recovery
   - Amazon Relational Database Service (RDS): Multi-AZ Deployments, Backup/Restore
Amazon EC2 Architecture
[Diagram: a Region containing an Availability Zone; an EC2 Instance launched from an Amazon Machine Image (AMI) with Ephemeral Storage, Security Group(s), an Elastic IP Address, and Elastic Block Storage; EBS Snapshots stored in Amazon S3; CloudWatch, Auto Scaling, and Load Balancing surrounding the instance]
EC2 Features
- AMI: packaged, reusable functionality
- On-Instance Storage: lifetime tied to instance lifetime; AFR like a standard hard disk (around 5%)
- EBS Volumes: lifetime independent of any particular EC2 instance; redundant within an AZ; AFR is 0.1% to 0.5%
  - Incorporate volume mappings into your architecture
  - Use EBS snapshot backups
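The EBS snapshot-backup advice above implies a retention policy: take snapshots regularly and prune the ones that have aged out. A minimal sketch of the pruning decision (the function name and tuple shape are hypothetical, not an AWS API):

```python
from datetime import datetime, timedelta

def snapshots_to_prune(snapshots, retention_days, now=None):
    """Return the IDs of snapshots older than the retention window.

    `snapshots` is a list of (snapshot_id, start_time) pairs.
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=retention_days)
    # Anything started before the cutoff has aged out of the window.
    return [sid for sid, started in snapshots if started < cutoff]
```

In a real backup job the returned IDs would be fed to the EC2 DeleteSnapshot call; keeping the decision separate from the API call makes the policy easy to test.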
EC2 Features
- Elastic IP Addresses
  - Map to any EC2 instance within a given Region
  - Detach from a failed instance; map to the replacement
- Auto Scaling: two ways to use it
  - Respond to changing conditions by adding or terminating EC2 instances (attach to CloudWatch metrics)
  - Maintain a fixed number of instances running, replacing them if they fail or become unhealthy
- Reserved Instances: guarantee capacity for when it's needed
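The "maintain a fixed number of instances" mode of Auto Scaling is a reconciliation loop: drop unhealthy instances and launch replacements until the fleet is back at the desired size. A toy sketch of that loop (all names hypothetical; a real group would call the EC2 launch and terminate APIs):

```python
def reconcile(instances, desired_count, launcher):
    """Drop unhealthy instances and launch replacements so that exactly
    `desired_count` healthy instances remain.

    `instances` maps instance-id -> healthy (bool); `launcher` is an
    iterator yielding IDs of freshly launched instances.
    """
    healthy = [i for i, ok in sorted(instances.items()) if ok]
    while len(healthy) < desired_count:
        healthy.append(next(launcher))  # replace a failed instance
    return healthy[:desired_count]
```

Running the loop on each health-check cycle keeps the fleet at the target size without any operator involvement, which is the property the slide is describing.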
EC2 Features
- CloudWatch Alarms
EC2 Features
- Elastic Load Balancing
  - Distributes incoming traffic across multiple instances
  - Sends traffic only to healthy instances
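The two bullets above combine into one behavior: rotate across instances, but skip any that have failed their health check. A minimal sketch of that routing rule (class and method names are illustrative, not the ELB API):

```python
import itertools

class MiniBalancer:
    """Round-robin over instances, skipping any marked unhealthy."""

    def __init__(self, instances):
        self.health = {i: True for i in instances}
        self._ring = itertools.cycle(instances)

    def mark(self, instance, healthy):
        # In ELB terms: the instance failed (or passed) its health check.
        self.health[instance] = healthy

    def route(self):
        # Try each instance at most once per request.
        for _ in range(len(self.health)):
            candidate = next(self._ring)
            if self.health[candidate]:
                return candidate
        raise RuntimeError("no healthy instances")
```

Marking an instance unhealthy immediately drains traffic away from it, and marking it healthy again restores it to rotation.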
Amazon EC2 Regions and Availability Zones
[Diagram: US East (Northern Virginia) with Availability Zones A, B, C, and D; EU (Dublin) with Availability Zones A and B]
Amazon EC2 Regions: US East (Northern Virginia) / US West (Northern California) / EU (Ireland) / Asia Pacific (Singapore) / Asia Pacific (Tokyo)
Availability Zone Characteristics and Advice
- Distinct physical locations
- Low-latency network connections between AZs
- Independent power, cooling, network, security
- Always partition app stacks across 2 or more AZs
- Elastic Load Balance across instances in multiple AZs
Proper Use of Multiple Availability Zones
[Diagram: incoming requests reach an Elastic Load Balancer, which sends requests and health checks to Web Servers and App Servers in both Availability Zone A and Availability Zone B; each AZ contains a Database Server or RDS DB Instance; centralized services (S3 backups, SimpleDB, etc.) span both AZs]
Region Characteristics and Advice
Regions are:
- Functionally separate
- Composed of 2 or more AZs
- Connected via the public internet
Use regions to:
- Have functionality geographically close to customers
- Comply with national laws and practices
- Implement a DR strategy
RDS Fault-Tolerant Features
- Multi-AZ Deployments
  - Synchronous replication across AZs
  - Automatic fail-over to standby replica
- Automated Backups
  - Enable point-in-time recovery of the DB instance
  - Retention period configurable
- Snapshots
  - User-initiated full backup of the DB
  - New DB can be created from snapshots
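Point-in-time recovery only works inside the configured retention window, so a restore plan should check the target time against that window before attempting it. A small sketch of the check (function name hypothetical; a real check would read the retention period and latest restorable time from the RDS API):

```python
from datetime import datetime, timedelta

def can_restore_to(target, latest_restorable, retention_days):
    """True if `target` falls inside the automated-backup retention
    window ending at `latest_restorable`."""
    earliest = latest_restorable - timedelta(days=retention_days)
    return earliest <= target <= latest_restorable
```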
AWS Architectural Guidance
Design For Failure: Basic Principles
- Avoid single points of failure
- Assume everything fails, and design backwards
- Goal: applications should continue to function even if the underlying physical hardware fails or is removed or replaced
- Design your recovery process
- Trade off business needs vs. cost of high availability
Design For Failure: Use AWS Building Blocks
- Use Elastic IP addresses for consistent and remappable routes
- Use multiple Amazon EC2 Availability Zones (AZs)
- Replicate data across multiple AZs (example: Amazon RDS Multi-AZ mode)
- Use real-time monitoring (Amazon CloudWatch)
- Use Amazon Elastic Block Store (EBS) for persistent file systems
- Take EBS Snapshots and use S3 for backups
Build Loosely Coupled Systems
- Use independent components
- Design everything as a black box
- Load-balance and scale clusters
- Think about graceful degradation
- Use Amazon SQS as buffers
[Diagram: tight coupling chains Controller A, Controller B, and Controller C together directly; loose coupling places a queue (Q) between each pair of controllers]
Copyright 2011 Amazon Web Services
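The queue-as-buffer idea can be shown locally with Python's standard `queue` module standing in for SQS (names are illustrative): the producer enqueues work and returns, and the consumer drains at its own pace, so a slow or failed consumer never blocks the producer.

```python
import queue

def produce(q, jobs):
    """Controller A: hand work to the queue instead of calling B directly."""
    for job in jobs:
        q.put(job)

def consume(q):
    """Controller B: drain whatever work is buffered, at its own pace."""
    done = []
    while not q.empty():
        done.append(q.get())
    return done
```

If the consumer crashes, the buffered jobs simply wait; with SQS they additionally survive the producer's instance failing, which is what makes the pattern fault-tolerant rather than just asynchronous.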
Implement Elasticity
- Don't assume health or fixed location of components
- Use designs that are resilient to reboot and re-launch
- Bootstrap your instances: "Who am I, and what is my role?"
  - Enable dynamic configuration
  - Use configurations in SimpleDB for bootstrapping
- Use Auto Scaling
- Use Elastic Load Balancing on each tier
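Bootstrapping means a fresh instance asks "who am I?" at boot and pulls its configuration from a shared store. A toy sketch with an in-memory dict standing in for SimpleDB (the store contents and tag names are hypothetical):

```python
# Hypothetical config store standing in for SimpleDB: keyed by role.
CONFIG_STORE = {
    "web": {"packages": ["nginx"], "tier": "frontend"},
    "app": {"packages": ["tomcat"], "tier": "middle"},
}

def bootstrap(instance_tags):
    """At boot, look up this instance's role and return its config."""
    role = instance_tags.get("role")
    if role not in CONFIG_STORE:
        raise ValueError("unknown role: %r" % role)
    return CONFIG_STORE[role]
```

Because the instance derives everything from its tags and the store, a replacement launched by Auto Scaling configures itself identically with no manual steps.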
Implementing Elasticity
[Diagram: Elastic Load Balancing feeds utilization metrics to CloudWatch; CloudWatch metrics trigger Auto Scaling, which adds or removes instances behind the load balancer]
Use a Chaos Monkey
From the Netflix blog:
- Simple monkey: kill any instance in the account
- Complex monkey: kill instances with specific tags; introduce other faults (e.g. connectivity via Security Group)
- Human monkey: kill instances from the AWS Management Console
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
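The simple and complex monkeys differ only in victim selection: the simple one picks any instance, the complex one filters by tag first. A sketch of that selection step (function and parameter names hypothetical; the real monkey would then call the EC2 terminate API):

```python
import random

def chaos_monkey(instances, tag=None, rng=random):
    """Pick one victim at random. With `tag`, only tagged instances are
    eligible (the 'complex monkey'); without it, any instance is fair
    game (the 'simple monkey'). `instances` maps id -> set of tags.
    """
    eligible = [i for i, tags in instances.items()
                if tag is None or tag in tags]
    if not eligible:
        return None
    return rng.choice(eligible)
```

Passing a seeded `random.Random` makes test runs reproducible, which matters when you want to replay the exact failure a monkey induced.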
AWS Architecture Center
aws.amazon.com/architecture
White papers:
- Cloud architectures
- Building fault-tolerant applications
- Web hosting best practices
- Leveraging different storage options
- AWS security best practices
Thank You!