Availability, Reliability, and Fault Tolerance


Guest Lecture for Software Systems Security
Professor Tim Wood, The George Washington University

Distributed Systems have Problems
Hardware breaks. Software is buggy. People are jerks. But software is increasingly important: it runs nuclear reactors, controls the engines and wings on a passenger jet, and runs your Facebook wall! How can we make these systems more reliable, particularly large distributed systems?

Inside a Data Center
A giant warehouse filled with:
- Racks of servers
- Disk arrays
- Cooling infrastructure
- Power converters
- Backup generators

Modular Data Center
...or use shipping containers. Each container is filled with thousands of servers, and new containers can easily be added: plug and play, just add electricity. Containers come pre-assembled and cheaper, and they allow the data center to be expanded easily.

Definitions
- Availability: whether the system is ready to use at a particular time
- Reliability: whether the system can run continuously without failure
- Safety: whether a disaster happens if the system fails to run correctly at some point
- Maintainability: how easily a system can be repaired after failure

Availability and Reliability
- System 1 crashes for 1 millisecond every hour: better than 99.9999% availability, but not very good reliability...
- System 2 never crashes, but has to be shut down two weeks a year: "perfectly" reliable, but only 96% availability.
Is one more important than the other?
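Working out the arithmetic behind those figures: System 1 is down 0.001 seconds out of every 3600, so availability = 1 - 0.001/3600 ≈ 99.99997%, even though it fails 8,760 times a year. System 2 is up 50 weeks out of 52, so availability = 50/52 ≈ 96.2%, despite never failing unexpectedly.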

Quantifying Reliability
- MTTF (Mean Time To Failure): the average amount of time until a failure occurs
- MTTR (Mean Time To Repair): the average amount of time to repair after a failure
- MTBF (Mean Time Between Failures): the time from one failure to the next

    time: |<--- MTTF --->|<- MTTR ->|<--- MTTF --->| ...
          |<-------- MTBF -------->|
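As the timeline shows, MTBF = MTTF + MTTR, and long-run availability is the fraction of each interval the system is up: MTTF / MTBF. A minimal sketch in Python (function and variable names are illustrative, not from the lecture):

    def availability(mttf_hours, mttr_hours):
        """Long-run availability: fraction of time the system is up.

        MTBF = MTTF + MTTR, and availability = MTTF / MTBF.
        """
        mtbf = mttf_hours + mttr_hours
        return mttf_hours / mtbf

    # System 1: fails every hour, repaired in 1 millisecond
    print(availability(1.0, 0.001 / 3600))   # ~0.9999997
    # System 2: up 50 weeks, then down for 2 weeks of maintenance
    print(availability(50 * 168, 2 * 168))   # ~0.962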

Real Failure Rates
A standard hard drive has an MTBF of 600,000 hours, about 68 years, which is roughly a 1/68 ≈ 1.5% chance of failure per year. A big Google data center has 200,000+ hard drives: 1.5% x 200,000 is about 2,921 drive crashes per year, or about 8 disk failures per day. Failures happen a lot, so software must be designed to be resilient to all types of hardware failure. (Actual observed failure rates are closer to 3% per year.)
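Checking that arithmetic (a quick sketch using the slide's figures):

    # Rough fleet-wide failure arithmetic from the slide's numbers.
    mtbf_hours = 600_000              # per-drive MTBF
    hours_per_year = 24 * 365
    drives = 200_000                  # drives in one large data center

    failures_per_drive_year = hours_per_year / mtbf_hours   # ~0.0146 (~1.5%)
    failures_per_year = failures_per_drive_year * drives    # ~2,920
    failures_per_day = failures_per_year / 365              # ~8
    print(failures_per_year, failures_per_day)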

Reliability Challenges
Typical failures in one year of a Google data center:
- 1000 individual machine failures
- thousands of hard drive failures
- 1 PDU (Power Distribution Unit) failure (about 500-1000 machines suddenly disappear; budget 6 hours to come back)
- 1 rack reorganization (you have plenty of warning: 500-1000 machines powered down, about 6 hours)
- 1 network rewiring (rolling 5% of machines down over a 2-day span)
- 20 rack failures (40-80 machines instantly disappear; 1-6 hours to get back)
- 5 racks go wonky (40-80 machines see 50% packet loss)
- 8 network maintenances (4 might cause ~30-minute random connectivity losses)
- 12 router reloads (takes out DNS and external virtual IPs (VIPs) for a couple of minutes)
- 3 router failures (have to immediately pull traffic for an hour)
- ~0.5 overheating events (power down most machines in under five minutes; expect 1-2 days to recover)
- dozens of minor 30-second blips for DNS
Source: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/stanford-295-talk.pdf

Types of Failures
Systems can fail in different ways:
- Crash failure
- Timing failure
- Content failure
- Malicious failure
Are some easier to deal with than others?

Fault Tolerance through Replication
We can handle failures through redundancy: have multiple replicas run the program. We may want to keep the replicas away from each other, use different hardware platforms, or even use different software implementations.

Fault Detection
Detecting a fault can be difficult, and each failure type suggests a different detection approach:
- Crash failure: heartbeat messages
- Timing failure: adaptive timeouts
- Content failure: voting / quorums
- Malicious failure: authentication / signatures
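As a concrete illustration of the first two approaches, here is a minimal sketch of heartbeat-based crash detection with an adaptive timeout (class and parameter names are made up for illustration, not from the lecture):

    import time

    class HeartbeatDetector:
        """Suspect a peer has crashed if no heartbeat arrives in time.

        The timeout adapts to observed inter-arrival gaps, since a fixed
        timeout either suspects slow-but-alive nodes or reacts too slowly.
        """
        def __init__(self, multiplier=3.0):
            self.multiplier = multiplier   # safety margin over the average gap
            self.avg_gap = 1.0             # initial guess: 1 second between beats
            self.last_beat = time.monotonic()

        def on_heartbeat(self):
            now = time.monotonic()
            gap = now - self.last_beat
            self.avg_gap = 0.9 * self.avg_gap + 0.1 * gap   # moving average
            self.last_beat = now

        def suspected_crashed(self):
            return time.monotonic() - self.last_beat > self.multiplier * self.avg_gap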

Detection is Hard
Or maybe even impossible. How long should we set a timeout? How do we know heartbeat messages will go through?

Two Generals Problem
[Figure: the Ninja army and the Pirate army, with the Enemy between them.]
The Ninja general and the Pirate general need to coordinate an attack. They can (try to) send messengers back and forth, but messengers can be shot. How can they guarantee they will both attack at the same time?

Two Generals Problem
We need to worry about physical characteristics when we build systems: packets can be lost, delayed, or reordered; disks can be slow, fail, or crash; or things can be actively malicious. Big trouble... What kinds of assumptions do we need to make? Here: the network is ordered and reliable, but may be slow, and we can quantify the expected number of nodes that will fail.

Fault Tolerance through Replication
How many replicas do we need to tolerate f failures?
- Crash failure: f+1 replicas. Example: replicas P1 and P2 both compute 2+2=4; if P1 crashes, P2 still delivers the output 4.
- Content failure: 2f+1 replicas plus a majority voter. Example: replicas P1, P2, and P3 compute 2+2; P1 and P3 answer 4, faulty P2 answers 5; the majority voter outputs 4.
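A minimal sketch of the majority voter for content failures (illustrative, not the lecture's code):

    from collections import Counter

    def majority_vote(replies):
        """Return the value reported by a majority of replicas.

        With 2f+1 replicas and at most f content failures, the correct
        value always has a strict majority.
        """
        value, count = Counter(replies).most_common(1)[0]
        if count >= len(replies) // 2 + 1:
            return value
        raise RuntimeError("no majority: too many faulty replicas")

    print(majority_vote([4, 5, 4]))   # 4: tolerates f=1 bad answer out of 2f+1=3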

Agreement without Voters
We can't always assume there is a perfectly correct voter to validate the answers. Better: have the replicas reach agreement amongst themselves about what to do. Each replica exchanges its calculated value with the others, and each node picks the winner.
[Figure: replicas A, B, and C exchange the values 4, 4, and 5.]

    Replica   Receives    Action
    A         4, 4, 5     = 4
    B         4, 4, 5     = 4
    C         4, 4, 5     = 4?

Byzantine Generals Problem
There are N generals making plans for an attack, and they need to decide whether to Attack or Retreat. Each sends its vote to everyone (0 = retreat, 1 = attack), but f generals may be traitors that lie and collude. Can all correct replicas agree on what to do by taking a majority vote of the planned actions?
[Figure: A votes 1, B votes 0; traitor C tells A "1" but tells B "0".]

    Replica   Receives    Action
    A         1, 0, 1     Attack!
    B         1, 0, 0     Retreat!
    C         1, 0, ?     ???

Majority voting doesn't work if a replica lies!

Byzantine Generals Solved!
We need more replicas to reach consensus: 3f+1 replicas are required to tolerate f Byzantine faults.
- Step 1: Send your plan to everyone
- Step 2: Send the plans you learned to everyone
- Step 3: Use the majority of each column
[Figure: four generals A, B, C, and D exchanging plans; C is the traitor.]

    Replica   Receives                                                Vote
    A         A: (1,0,1,1)  B: (1,0,0,1)  C: (1,1,1,1)  D: (1,0,1,1)  A: 1, B: 0, C: 1, D: 1
    B         A: (1,0,1,1)  B: (1,0,0,1)  C: (0,0,0,0)  D: (1,0,0,1)  A: 1, B: 0, C: 0, D: 1
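A minimal sketch of that two-round exchange and column-wise majority (simplified: a real BFT protocol like PBFT needs much more machinery; the names and data here are illustrative):

    from collections import Counter

    def decide(vectors):
        """vectors[i] is the full vote vector that replica i claims to have
        seen (the round-2 messages). Taking the majority of each column
        prevents one liar from splitting the honest replicas."""
        n = len(vectors[0])
        decision = []
        for col in range(n):
            column = [v[col] for v in vectors]
            decision.append(Counter(column).most_common(1)[0][0])
        return decision

    # What honest replica A hears in round 2 (C is the traitor):
    heard_by_a = [(1, 0, 1, 1),   # A's own view
                  (1, 0, 0, 1),   # forwarded by B
                  (1, 1, 1, 1),   # forwarded by C (lies)
                  (1, 0, 1, 1)]   # forwarded by D
    print(decide(heard_by_a))     # [1, 0, 1, 1]: the honest votes prevail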

Can we make this any easier?
The fundamental challenge in BFT is that nodes can misrepresent what they heard from other nodes. How can we fix this? Have nodes sign messages! Then liars can't forge messages with false information. Crypto actually is useful!
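A minimal sketch of the idea, using an HMAC as a stand-in for a digital signature (real BFT systems use public-key signatures or pairwise MACs; the key and messages here are made up):

    import hmac, hashlib

    KEY_B = b"key shared between B and its verifiers"   # hypothetical key

    def sign(key: bytes, message: bytes) -> bytes:
        # MAC over the message; any tampering changes the tag.
        return hmac.new(key, message, hashlib.sha256).digest()

    vote = b"B votes 0"
    tag = sign(KEY_B, vote)          # B attaches this tag to its vote

    # A traitor relaying B's vote tries to claim B voted 1:
    forged = b"B votes 1"
    print(hmac.compare_digest(sign(KEY_B, forged), tag))   # False: forgery detected
    print(hmac.compare_digest(sign(KEY_B, vote), tag))     # True: genuine message
    # Caveat: anyone holding KEY_B could also forge tags; true digital
    # signatures (private signing key, public verification key) avoid this.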

Denial of Service
An attack to reduce the availability of a service; it can also cause crashes if the software is poorly written. "Unsophisticated but effective": flood the target with traffic. There is no easy way to differentiate between a valid request and an attacker.
[Figure: attack traffic converging on Amazon.com.]

Sept 2012 DDoS
Six US banks were attacked. The attacks were announced in advance, but the banks still could not prevent the damage: attackers sent 65 gigabits of data per second. An Iranian "Cyber Fighters" group claimed responsibility and encouraged members to use the Low Orbit Ion Cannon software to flood the banks with traffic. Botnets were also used as a traffic source.

Sept 2012 DDoS
But it's not clear that was the real source. Botnet machines have relatively low bandwidth, so 65,000+ compromised machines would have been needed, yet most of the traffic to the banks came from about 200 IP addresses that appear to be a small set of compromised, high-powered web servers. It is not clear whether the Iranian hacker group did all of this or whether some other group was the mastermind: the Iranian government fighting against sanctions? Eastern European crime groups making fraudulent purchases and then disrupting bank web activity long enough for them to go through?

Anonymous (?)
The Anonymous "hacktivist" group has used DDoS for various political causes: members run the LOIC software and target a specific site, forming a "volunteer botnet". But be careful... in March 2012 a version of the LOIC software carried a trojan: it ran a DDoS on the enemy... and stole your bank and Gmail account info.

Defending against DDoS
Some DDoS traffic can be easily distinguished: most web apps can safely ignore ICMP and UDP traffic. But the performance impact depends on where the filtering is performed. A firewall on the server being attacked may limit the impact on the application, but the attack traffic still clogs the network. A firewall at the ISP is much better, but may be under someone else's control.
[Figure: attack traffic flowing through the ISP toward Amazon.com.]

Summary
Software systems must worry about:
- Hardware and software failures
- Service availability
- Malicious attacks that affect reliability and/or availability
Approaches:
- Redundancy
- Fault mitigation
- Fault detection