Fault-Tolerant Computer System Design ECE 695/CS 590. Putting it All Together

Similar documents
Design Patterns for the Cloud. MCSN - N. Tonellotto - Distributed Enabling Platforms 68

Designing Fault-Tolerant Applications

AWS Solution Architecture Patterns

BERLIN. 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

Principal Solutions Architect. Architecting in the Cloud

CS15-319: Cloud Computing. Lecture 3 Course Project and Amazon AWS Majd Sakr and Mohammad Hammoud

Introduction to Cloud Computing

Training on Amazon AWS Cloud Computing. Course Content

Cloud Computing /AWS Course Content

Enroll Now to Take online Course Contact: Demo video By Chandra sir

Amazon Web Services and Feb 28 outage. Overview presented by Divya

Introduction to Amazon Web Services. Jeff Barr Senior AWS /

PrepAwayExam. High-efficient Exam Materials are the best high pass-rate Exam Dumps

CIT 668: System Architecture. Amazon Web Services

CPET 581 Cloud Computing: Technologies and Enterprise IT Strategies

How can you implement this through a script that a scheduling daemon runs daily on the application servers?

BERLIN. 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

Amazon Web Services (AWS) Solutions Architect Intermediate Level Course Content

LINUX, WINDOWS(MCSE),

Amazon AWS-Solution-Architect-Associate Exam

SAA-C01. AWS Solutions Architect Associate. Exam Summary Syllabus Questions

AWS_SOA-C00 Exam. Volume: 758 Questions

Document Sub Title. Yotpo. Technical Overview 07/18/ Yotpo

AWS SQS for better architecture. 23 rd Dec'14 Saurabh S Bangad

Building Web-Scale Applications with AWS

AWS Solutions Architect Associate (SAA-C01) Sample Exam Questions

Amazon Web Services Training. Training Topics:

ARCHITECTING WEB APPLICATIONS FOR THE CLOUD: DESIGN PRINCIPLES AND PRACTICAL GUIDANCE FOR AWS

AWS Administration. Suggested Pre-requisites Basic IT Knowledge

DISTRIBUTED SYSTEMS [COMP9243] Lecture 8a: Cloud Computing WHAT IS CLOUD COMPUTING? 2. Slide 3. Slide 1. Why is it called Cloud?

Scaling on AWS. From 1 to 10 Million Users. Matthias Jung, Solutions Architect

Security & Compliance in the AWS Cloud. Amazon Web Services

What is Cloud Computing? What are the Private and Public Clouds? What are IaaS, PaaS, and SaaS? What is the Amazon Web Services (AWS)?

Lassoing the Clouds: Best Practices on AWS. Brian DeShong May 26, 2017

Security & Compliance in the AWS Cloud. Vijay Rangarajan Senior Cloud Architect, ASEAN Amazon Web

WHITEPAPER AMAZON ELB: Your Master Key to a Secure, Cost-Efficient and Scalable Cloud.

Amazon Web Services (AWS) Training Course Content

Lassoing the Clouds: Best Practices on AWS. Brian DeShong May 26, 2017

Project Presentation

Aurora, RDS, or On-Prem, Which is right for you

PracticeDump. Free Practice Dumps - Unlimited Free Access of practice exam

Better, Faster, Stronger web apps with Amazon Web Services. Senior Technology Evangelist, Amazon Web Services

AWS Course Syllabus. Linux Fundamentals. Installation and Initialization:

CIT 668: System Architecture

AWS Solution Architect Associate

At Course Completion Prepares you as per certification requirements for AWS Developer Associate.

Pass4test Certification IT garanti, The Easy Way!

HOW TO PLAN & EXECUTE A SUCCESSFUL CLOUD MIGRATION

AWS Storage Gateway. Not your father s hybrid storage. University of Arizona IT Summit October 23, Jay Vagalatos, AWS Solutions Architect

Highly Available Database Architectures in AWS. Santa Clara, California April 23th 25th, 2018 Mike Benshoof, Technical Account Manager, Percona

How to host and manage enterprise customers on AWS: TOYOTA, Nippon Television, UNIQLO use cases

Getting Started with AWS. Computing Basics for Windows

Amazon Aurora Deep Dive

Emulating Lambda to speed up development. Kevin Epstein CTO CorpInfo AWS Premier Partner

Business Continuity and Disaster Recovery. Ed Crowley Ch 12

Executive Summary. Bare Metal Performance for Virtualized Servers

We are ready to serve Latest IT Trends, Are you ready to learn? New Batches Info

Amazon Linux: Operating System of the Cloud

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Amazon Web Services. Amazon Web Services

The Orion Papers. AWS Solutions Architect (Associate) Exam Course Manual. Enter

Expected Learning Outcomes Introduction To AWS

ARCHITECTURAL DESIGN ON AWS: 3 COMMONLY MISSED BEST PRACTICES

Splunk & AWS. Gain real-time insights from your data at scale. Ray Zhu Product Manager, AWS Elias Haddad Product Manager, Splunk

Amazon Aurora Relational databases reimagined.

Managing IoT and Time Series Data with Amazon ElastiCache for Redis

Programming Models MapReduce

Amazon Web Services. Block 402, 4 th Floor, Saptagiri Towers, Above Pantaloons, Begumpet Main Road, Hyderabad Telangana India

AWS: Basic Architecture Session SUNEY SHARMA Solutions Architect: AWS

HPE Digital Learner AWS Certified SysOps Administrator (Intermediate) Content Pack

About Intellipaat. About the Course. Why Take This Course?

Data Centers. Tom Anderson

MySQL In the Cloud. Migration, Best Practices, High Availability, Scaling. Peter Zaitsev CEO Los Angeles MySQL Meetup June 12 th, 2017.

Cloud and Storage. Transforming IT with AWS and Zadara. Doug Cliche, Storage Solutions Architect June 5, 2018

Introduction to. Amazon Web Services. Thilina Gunarathne Salsa Group, Indiana University. With contributions from Saliya Ekanayake.

CANVAS DISASTER RECOVERY PLAN AND PROCEDURES

Getting Started with AWS Web Application Hosting for Microsoft Windows

Cloudera s Enterprise Data Hub on the Amazon Web Services Cloud: Quick Start Reference Deployment October 2014

Oracle WebLogic Server 12c on AWS. December 2018

High School Technology Services myhsts.org Certification Courses

Migrating Existing Applications to AWS. Matt Tavis Principal Solutions Architect

Advanced Architectures for Oracle Database on Amazon EC2

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines

Lean & Mean on AWS: Cost-Effective Architectures. Constantin Gonzalez, Solutions Architect, AWS

Introduction: Is Amazon Web Service (AWS) cloud supports best cost effective & high performance modern disaster recovery.

2013 AWS Worldwide Public Sector Summit Washington, D.C.

Security Aspekts on Services for Serverless Architectures. Bertram Dorn EMEA Specialized Solutions Architect Security and Compliance

Reactive Microservices Architecture on AWS

Deep Dive on Amazon Elastic File System

Serverless Computing. Redefining the Cloud. Roger S. Barga, Ph.D. General Manager Amazon Web Services

TestkingPass. Reliable test dumps & stable pass king & valid test questions

Servers fail, who cares? (Answer: I do, sort of) Gregg Ulrich, #netflixcloud #cassandra12

Introduction to Database Services

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Primary-Backup Replication

MySQL Multi-Site/Multi-Master Done Right

Getting Started with Amazon EC2 and Amazon SQS

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia,

Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite. Peter Zaitsev, Denis Magda Santa Clara, California April 25th, 2017

ArcGIS 10.3 Server on Amazon Web Services

Transcription:

Fault-Tolerant Computer System Design ECE 695/CS 590 Putting it All Together Saurabh Bagchi ECE/CS Purdue University ECE 695/CS 590 1 Outline Looking at some practical systems that integrate multiple techniques that we have learned Amazon Web Services (AWS) Hadoop Distributed Processing ECE 695/CS 590 2 1

Amazon Web Service (AWS): Case Study A set of services built in for reliability and security ECE 695/CS 590 3 AWS Ecosystem ECE 695/CS 590 4 2

AWS Reliability Examples Amazon S3 is highly durable and distributed data store With a simple web services interface, you can store and retrieve large amounts of data as objects in containers From anywhere on the web using standard HTTP commands Copies of objects can be distributed and cached at 14 edge locations around the world by creating a distribution Distribution handled using Amazon CloudFront service a web service for content delivery (static or streaming content) Amazon Simple Queue Service (Amazon SQS) is a reliable, highly scalable, hosted distributed queue for storing messages as they travel between computers and application components ECE 695/CS 590 5 Amazon Web Service: Case Study Amazon Machine Images (AMIs): Commonly used machine instances from which the user can choose to use as an execution platform; Spare instances can be kept running Amazon Elastic Block Store (Amazon EBS): Block-level storage volumes for AMIs Durability of EBS is higher than a typical hard drive due to storing data redundantly; Annual failure rate for an EBS volume is 0.1 to 0.5% compared to 4% for a regular hard drive. EBS provides a snapshot feature a backup of the system taken at a specific instance of time. Snapshots are stored in the Amazon S3 to ensure high durability. ECE 695/CS 590 6 3

Amazon Web Service: Case Study Autoscaling and Elastic Load Balancing: Allows EC2 capacity to go up or down as needed by load Example: When # running server instances is below a threshold, launch new server instances Example: Monitor resource utilization of server instances using CloudWatch service; if utilization too high, start up new server instances Example: Distribute incoming traffic across EC2 instances, for load balancing or to route around failed instances Regions and Availability zones: Distribute the application geographically in distant data centers. Each geographic location is called a Region. Within each Region, there are Availability Zones. Availability Zones are distinct locations that are insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. ECE 695/CS 590 7 AWS Regions and Availability Zones US East (Northern Virginia) Region EC2 Availability Zones: 5 US East (Ohio) Region EC2 Availability Zones: 3 (Launched 2016) US West (Oregon) Region EC2 Availability Zones: 3 US West (Northern California) Region EC2 Availability Zones: 3 AWS GovCloud (US) Region EC2 Availability Zones: 2 ECE 695/CS 590 8 4

Designing for Failure in AWS Questions that should be asked wrt hardware failure: What happens if a node in your system fails? How do you recognize that failure? How do I replace that node? What kind of scenarios do I have to plan for? What are my single points of failure? If a load balancer is sitting in front of an array of application servers, what if that load balancer fails? If there are master and slaves in your architecture, what if the master node fails? How does the failover occur and how is a new slave instantiated and brought into sync with the master? Questions that should be asked wrt software failure: What happens to my application if the dependent services changes its interface? What if downstream service times out or returns an exception? What if the cache keys grow beyond memory limit of an instance? ECE 695/CS 590 9 Designing for Failures in AWS Best practices that help dealing with failures: 1. Have a coherent backup and restore strategy for your data and automate it 2. Build process threads that resume on reboot 3. Allow the state of the system to re-sync by reloading messages from queues 4. Keep pre-configured and pre-optimized virtual images to support (2) and (3) on launch/boot 5. Avoid in-memory sessions or stateful user context, move that to data stores. ECE 695/CS 590 10 5

AWS Specific Fault-Tolerance Techniques 1. Failover gracefully using Elastic IPs: Elastic IP is a static IP that is dynamically re-mappable. You can quickly remap and failover to another set of servers so that your traffic is routed to the new servers. It works great when you want to upgrade from old to new versions or in case of hardware failures 2. Utilize multiple Availability Zones: Availability Zones are conceptually like logical datacenters. By deploying your architecture to multiple availability zones, you can ensure highly availability. Automatically replicate database updates across multiple Availability Zones. 3. Maintain an Amazon Machine Image so that you can restore and clone environments very easily in a different Availability Zone; Maintain multiple Database slaves across Availability Zones and setup hot replication. ECE 695/CS 590 11 AWS Specific Fault-Tolerance Techniques 4. Utilize Amazon CloudWatch (or various real-time open source monitoring tools) to get more visibility and take appropriate actions in case of hardware failure or performance degradation. Setup an Auto scaling group to maintain a fixed fleet size so that it replaces unhealthy Amazon EC2 instances by new ones. 5. Utilize Amazon EBS and set up cron jobs so that incremental snapshots are automatically uploaded to Amazon S3 and data is persisted independent of your instances. 6. Utilize Amazon RDS and set the retention period for backups, so that it can perform automated backups. ECE 695/CS 590 12 6

Loosely Connected Components Catchy principle: The more loosely coupled the components of the system, the bigger and better it scales. Key: Build components that do not have tight dependencies If one component were to die (fail), sleep (not respond) or remain busy (slow to respond), the other components will continue to work as if no failure is happening Each component interacts asynchronously with the others and treats them as a black box Use Amazon SQS as buffers between components Make your applications as stateless as possible. Store session state outside of component (e.g., in Amazon SimpleDB) ECE 695/CS 590 13 Outline Looking at some practical systems that integrate multiple techniques that we have learned Amazon Web Services (AWS) Hadoop Distributed Processing ECE 695/CS 590 14 7

Hadoop: Case Study Runs on a collection of COTS shared-nothing servers ECE 695/CS 590 15 Hadoop: Case Study Hadoop job has two types of tasks: mappers and reducers. Mappers read the job input data from a distributed file system (HDFS) and produce key-value pairs. These map outputs are stored locally on compute nodes Each reducer processes a particular key range. For this, it copies map outputs from the mappers which produced values with that key (oftentimes all mappers). A reducer writes job output data to HDFS. A Task- Tracker (TT) is a Hadoop process running on compute nodes which is responsible for starting and managing tasks locally. A TT has a number of mapper and reducer slots which determine task concurrency. A TT communicates regularly with a Job Tracker (JT), a centralized Hadoop component that decides when and where to start tasks. JT also runs a speculative execution algorithm which attempts to improve job running time by duplicating under-performing tasks. ECE 695/CS 590 16 8

Hadoop: Case Study Failure cases it worries about: non-responsiveness of a task Can be due to overload of the task or network congestion Waits for non-responsive tasks (on the order of 10 minutes) and then re-executes the work of these tasks TT sends heartbeat to JT every 3 s. JT declares a TT dead if no heartbeat for 600 s. Then tasks are restarted on a different node A reducer is considered faulty if it failed too many times to copy map outputs. This decision is made at the TT. ECE 695/CS 590 17 References Architecting for the cloud: Best Practices, Jinesh Varia (AWS), Jan 2011. On Designing and Deploying Internet-Scale Services, James Hamilton - Windows Live Services Platform, pp. 231-242 of the Proceedings of the 21st Large Installation System Administration Conference (LISA '07) Hadoop MapReduce Tutorial at: https://hadoop.apache.org/docs/stable/hadoop-mapreduceclient/hadoop-mapreduce-client-core/mapreducetutorial.html ECE 695/CS 590 18 9