Heading Off Correlated Failures through Independence-as-a-Service

Similar documents
Cloud Computing. Ennan Zhai. Computer Science at Yale University

Failures in Distributed Systems

DISTRIBUTED SYSTEMS [COMP9243] Lecture 8a: Cloud Computing WHAT IS CLOUD COMPUTING? 2. Slide 3. Slide 1. Why is it called Cloud?

Amazon Web Services. Block 402, 4 th Floor, Saptagiri Towers, Above Pantaloons, Begumpet Main Road, Hyderabad Telangana India

Introduction to data centers

LINUX, WINDOWS(MCSE),

Developing Microsoft Azure Solutions (70-532) Syllabus

AWS Solution Architecture Patterns

Distributed Systems. 31. The Cloud: Infrastructure as a Service Paul Krzyzanowski. Rutgers University. Fall 2013

Cloud Computing. Amazon Web Services (AWS)

Designing Fault-Tolerant Applications

Lecture 7: Data Center Networks

Introduction To Cloud Computing

Enroll Now to Take online Course Contact: Demo video By Chandra sir

HOW TO PLAN & EXECUTE A SUCCESSFUL CLOUD MIGRATION

AWS: Basic Architecture Session SUNEY SHARMA Solutions Architect: AWS

Developing Microsoft Azure Solutions (70-532) Syllabus

ArcGIS Server Architecture Considerations. Andrew Sakowicz

Oracle IaaS, a modern felhő infrastruktúra

Cooperative Private Searching in Clouds

Managing and Auditing Organizational Migration to the Cloud TELASA SECURITY

ARCHITECTING WEB APPLICATIONS FOR THE CLOUD: DESIGN PRINCIPLES AND PRACTICAL GUIDANCE FOR AWS

The Cloud's Cutting Edge: ArcGIS for Server Use Cases for Amazon Web Services. David Cordes David McGuire Jim Herries Sridhar Karra

Design and Implementation of Privacy-Preserving Surveillance. Aaron Segal

Introduction to Database Services

At Course Completion Prepares you as per certification requirements for AWS Developer Associate.

Oracle Exadata X7. Uwe Kirchhoff Oracle ACS - Delivery Senior Principal Service Delivery Engineer

Microsoft Azure for AWS Experts

Document Sub Title. Yotpo. Technical Overview 07/18/ Yotpo

Developing Enterprise Cloud Solutions with Azure

Cloud & container monitoring , Lars Michelsen Check_MK Conference #4

Question: 1 Which three methods can you use to manage Oracle Cloud Infrastructure services? (Choose three.)

CIT 668: System Architecture. Amazon Web Services

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Lecture 09: VMs and VCS head in the clouds

Developing Microsoft Azure Solutions (70-532) Syllabus

Best Practices and Performance Tuning on Amazon Elastic MapReduce

Migrating Enterprise Applications to the Cloud Session 672. Leighton L. Nelson

Welcome to the New Era of Cloud Computing

Windows Azure Services - At Different Levels

Welcome to the. Migrating SQL Server Databases to Azure

NEXT GENERATION CLOUD SECURITY

Amazon Web Services Training. Training Topics:

Splunk & AWS. Gain real-time insights from your data at scale. Ray Zhu Product Manager, AWS Elias Haddad Product Manager, Splunk

Optimizing Network Performance in Distributed Machine Learning. Luo Mai Chuntao Hong Paolo Costa

Next-Generation Cloud Platform

Live Migration of Virtualized Edge Networks: Analytical Modeling and Performance Evaluation

Get the Most Out of GoAnywhere: Achieving Cloud File Transfers and Integrations

Secure Block Storage (SBS) FAQ

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Course Outline. Module 1: Microsoft Azure for AWS Experts Course Overview

High Availability & Disaster Recovery. Witt Mathot

Availability in the Modern Datacenter

COMP6511A: Large-Scale Distributed Systems. Windows Azure. Lin Gu. Hong Kong University of Science and Technology Spring, 2014

DEEP DIVE INTO CLOUD COMPUTING

Amazon Web Services (AWS) Training Course Content

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache

An Auditing Language for Preventing Correlated Failures in the Cloud

NetApp AWS Worldwide Public Sector Summit Washington, D.C.

Cloud Computing /AWS Course Content

Security & Compliance in the AWS Cloud. Amazon Web Services

Mellanox InfiniBand Solutions Accelerate Oracle s Data Center and Cloud Solutions

RADIAN6 SECURITY, PRIVACY, AND ARCHITECTURE

WEBSCALE CONVERGED APPLICATION DELIVERY PLATFORM

Exam : Implementing Microsoft Azure Infrastructure Solutions

Data Centers and Cloud Computing

Azure File Sync. Webinaari

Data Centers and Cloud Computing. Slides courtesy of Tim Wood

DATA PROTECTION FOR THE CLOUD

High Availability Infrastructure for Cloud Computing

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

Security & Compliance in the AWS Cloud. Vijay Rangarajan Senior Cloud Architect, ASEAN Amazon Web

Lecture 16: Data Center Network Architectures

Web Cloud Solution. User Guide. Issue 01. Date

Utilizing Datacenter Networks: Centralized or Distributed Solutions?

Data Centers and Cloud Computing. Data Centers

/ Cloud Computing. Recitation 5 September 26 th, 2017

ANIKET DAPTARI & RANJINI RAJENDRAN CONTRAIL TEAM

Exam Questions

Hosting DesktopNow in Amazon Web Services. Ivanti DesktopNow powered by AppSense

CogniFit Technical Security Details

Automate best practices and operational health for your AWS resources with Trusted Advisor and AWS Health

1V0-621.testking. 1V VMware Certified Associate 6 - Data Center Virtualization Fundamentals Exam

Data Sheet Gigamon Visibility Platform for AWS

70-414: Implementing an Advanced Server Infrastructure Course 01 - Creating the Virtualization Infrastructure

Hosted Azure for your business. Build virtual servers, deploy with flexibility, and reduce your hardware costs with a managed cloud solution.

Amazon Web Services (AWS) Solutions Architect Intermediate Level Course Content

AUTOTASK ENDPOINT BACKUP (AEB) SECURITY ARCHITECTURE GUIDE

Deploying High Availability and Business Resilient R12 Applications over the Cloud

What s in Installing and Configuring Windows Server 2012 (70-410):

Launching a Highly-regulated Startup in the Cloud

About Intellipaat. About the Course. Why Take This Course?

Oracle WebLogic Server 12c on AWS. December 2018

Training on Amazon AWS Cloud Computing. Course Content

Magento Commerce Architecture and Security Model Last updated: Aug 2017

We are ready to serve Latest IT Trends, Are you ready to learn? New Batches Info

Differential Privacy

The HR Avatar Testing Platform

Better, Faster, Stronger web apps with Amazon Web Services. Senior Technology Evangelist, Amazon Web Services

Oracle Database 11g: Real Application Testing & Manageability Overview

Transcription:

Heading Off Correlated Failures through Independence-as-a-Service Ennan Zhai 1 Ruichuan Chen 2, David Isaac Wolinsky 1, Bryan Ford 1 1 Yale University 2 Bell Labs/Alcatel-Lucent

Background Cloud services ensure reliability by redundancy: - Amazon S3 replicates data on multiple racks - icloud rents EC2 and Azure redundantly

Background Cloud services ensure reliability by redundancy: - Amazon S3 replicates data on multiple racks - icloud rents EC2 and Azure redundantly

Background Cloud services ensure reliability by redundancy: - Amazon S3 replicates data on multiple racks - icloud rents EC2 and Azure redundantly

Background Cloud services ensure reliability by redundancy: - Amazon S3 replicates data on multiple racks - icloud rents EC2 and Azure redundantly Unexpected common dependencies!

Service Outage Losses Data Center Outages Generate Big Losses Downtime in a data center can cost an average of $505,500 per incident, according to a Ponemon Institute study. Analytics Slideshow: 2010 Data Center Operational Trends Report

Service Outage Losses Data Center Outages Generate Big Losses Downtime in a data center can cost an average of $505,500 per incident, according to a Ponemon Institute study. Analytics Slideshow: 2010 Data Center Operational Trends Report

What is correlated failure?

What is correlated failure? Rack1 Rack2 Rack3 Switch1 Switch2 Switch3

What is correlated failure? Rack1 Rack2 Rack3 Primary Backup Backup Switch1 Switch2 Switch3

What is correlated failure? Rack1 Rack2 Rack3 Primary Backup Backup Switch1 Switch2 Switch3 Agg Switch

What is correlated failure? Rack1 Rack2 Rack3 Primary Backup Backup Switch1 Switch2 Switch3 Agg Switch

What is correlated failure? Rack1 Rack2 Rack3 Primary Backup Backup Switch1 Switch2 Switch3 Agg Switch

Realistic Example

Realistic Example Correlated failures resulting from EBS due to bugs in one EBS server Summary of the October 22, 2012 AWS Service Event in the US-East Region We d like to share more about the service event that occurred on Monday, October 22nd in the US-East Region. We have now completed the analysis of the events that affected AWS customers, and we want to describe what happened, our understanding of how customers were affected, and what we are doing to prevent a similar issue from occurring in the future. The Primary Event and the Impact to Amazon Elastic Block Store (EBS) and Amazon Elastic Compute Cloud (EC2)

Realistic Example Elastic Compute Cloud (EC2) Elastic Block Store (EBS)

Realistic Example...... Elastic Block Store (EBS)

Realistic Example VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4............ Elastic Block Store (EBS)

Realistic Example VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4............ EBS Server1 EBS Server2

Realistic Example VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4............ EBS Server1 EBS Server2

Realistic Example VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4............ EBS Server1 EBS Server2

Realistic Example VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4............ EBS Server1 EBS Server2

Even Worse

Video App Cloud Provider A Cloud Provider B

Video App Cloud Provider A Cloud Provider B Third-party infrastructure components

Video App Cloud Provider A Cloud Provider B Third-party infrastructure components ISP Router A ISP Router B ISP Router C

Video App Cloud Provider A Cloud Provider B Third-party infrastructure components ISP Router A ISP Router B ISP Router C Power Source

Video App Cloud Provider A Cloud Provider B Third-party infrastructure components ISP Router A ISP Router B ISP Router C Power Source

Video App Cloud Provider A Cloud Provider B Third-party infrastructure components ISP Router A ISP Router B ISP Router C Power Source

Video App Cloud Provider A Cloud Provider B Cloud providers do not usually share Third-party infrastructure components information about their dependencies ISP Router A ISP Router B ISP Router C Power Source

Existing Efforts Cloud providers allocate or tolerate failures via: - diagnosis systems; - fault-tolerant systems.

Existing Efforts Cloud providers allocate or tolerate failures via: - diagnosis systems; - fault-tolerant systems. Solving the problem after outage occurs. We want to prevent the problem before the outage occurs.

Existing Efforts Cloud providers allocate or tolerate failures via: - diagnosis systems; - fault-tolerant systems. Solving the problem after outage occurs. Prevent correlated failures before outage occurs.

Existing Efforts Cloud providers allocate or tolerate failures via: - diagnosis systems; - fault-tolerant systems. Solving the problem after outage occurs. Prevent correlated failures before outage occurs. Independence-as-a-Service (INDaaS)

INDaaS Workflow Service Provider, Alice INDaaS A Given Redundancy Configuration

INDaaS Workflow Service Provider, Alice INDaaS Two-Way Redundancy Configuration Dependency Data Source1 Dependency Data Source2

INDaaS Workflow Service Provider, Alice Independence of this two-way redundancy? INDaaS Two-Way Redundancy Configuration Dependency Data Source1 Dependency Data Source2

INDaaS Workflow Service Provider, Alice Step1: Specification Submission INDaaS Dependency Data Source1 Dependency Data Source2

INDaaS Workflow Service Provider, Alice Step1 INDaaS Step2: Dependency data collection Dependency Data Source1 Step2: Dependency data collection Dependency Data Source2

INDaaS Workflow Service Provider, Alice Step1 Step3: Independence Evaluation INDaaS Step2 Step2 Dependency Data Source1 Dependency Data Source2

INDaaS Workflow Service Provider, Alice Relative Independence is 0.3 Step1 Step3 INDaaS Step2 Step2 Dependency Data Source1 Dependency Data Source2

Multiple Cloud Providers Service Provider, Alice Step1 Step3 INDaaS Step2 Step2 Dependency Data Source1 Dependency Data Source2

Multiple Cloud Providers Service Provider, Alice Step1 Step3 Unwilling to share the dependency data INDaaS Step2 Step2 Unwilling to share the dependency data Data Source1 (Cloud1) Data Source2 (Cloud2)

Private Independence Evaluation Service Provider, Alice Step1 INDaaS Step2 Data Source1 (Cloud1) Step3: Private auditing Step2 Data Source2 (Cloud2)

Private Independence Evaluation Service Provider, Alice Step1 INDaaS Step2 Step4 Step4 Step2 Data Source1 (Cloud1) Step3 Data Source2 (Cloud2)

Private Independence Evaluation Service Provider, Alice Step1 Only know my information Step2 Step4 INDaaS Step4 Step2 Only know my information Data Source1 (Cloud1) Step3 Data Source2 (Cloud2)

Private Independence Evaluation Service Provider, Alice Only the relative independence Step1 INDaaS Step2 Step4 Step4 Step2 Data Source1 (Cloud1) Step3 Data Source2 (Cloud2)

Private Independence Evaluation Service Provider, Alice Step1 Step5 INDaaS Step2 Step4 Step4 Step2 Data Source1 (Cloud1) Step3 Data Source2 (Cloud2)

Technical Challenges Step1 Step5 INDaaS Step2 Step4 Step4 Step2 Dependency Data Source1 Step3 Dependency Data Source2

Technical Challenges #1: Dependency collections - Solution: Reusing existing tools Step1 Step5 INDaaS Step2 Step4 Step4 Step2 Dependency Data Source1 Step3 Dependency Data Source2

Technical Challenges #1: Dependency collections - Solution: Reusing existing tools Step1 INDaaS Step5 #2: Dependency representation - Solution: Fault graphs Step2 Step4 Step4 Step2 Dependency Data Source1 Step3 Dependency Data Source2

Technical Challenges #1: Dependency collections - Solution: Reusing existing tools Step2 Step1 Step5 INDaaS Step4 Step4 Step2 #2: Dependency representation - Solution: Fault graphs #3: Efficient auditing - Solution: Failure sampling algorithm Dependency Data Source1 Step3 Dependency Data Source2

Technical Challenges Step2 Step1 Step4 Dependency Data Source1 INDaaS Step3 Step5 Step4 Step2 Dependency Data Source2 #1: Dependency collections - Solution: Reusing existing tools #2: Dependency representation - Solution: Fault graphs #3: Efficient auditing - Solution: Failure sampling algorithm #4: Private independence audit - Solution: Private Jaccard similarity

RoadMap Step2 Step1 Step4 Dependency Data Source1 INDaaS Step3 Step5 Step4 Step2 Dependency Data Source2 #1: Dependency collections - Solution: Reusing existing tools #2: Dependency representation - Solution: Fault graphs #3: Efficient auditing - Solution: Failure sampling algorithm #4: Private independence audit - Solution: Private Jaccard similarity

RoadMap Step2 Step1 Step4 Dependency Data Source1 INDaaS Step3 Step5 Step4 Step2 Dependency Data Source2 #1: Dependency collections - Solution: Reusing existing tools #2: Dependency representation - Solution: Fault graphs #3: Efficient auditing - Solution: Failure sampling algorithm #4: Private independence audit - Solution: Private Jaccard similarity

Dependency Data Collections Reuse existing data collection tools: - Convert the outputs to uniform format. - Three types of format: NET, HW and SW. Our defined format Type Network Hardware Software Dependency Expression <src= S dst= D route= x,y,z /> <hw= H type= T dep= x /> <pgm= S hw= H dep= x,y,z /> Please see our paper for more details

RoadMap Step2 Step1 Step4 Dependency Data Source1 INDaaS Step3 Step5 Step4 Step2 Dependency Data Source2 #1: Dependency collections - Solution: Reusing existing tools #2: Dependency representation - Solution: Fault graphs #3: Efficient auditing - Solution: Failure sampling algorithm #4: Private independence audit - Solution: Private Jaccard similarity

RoadMap Step2 Step1 Step4 Dependency Data Source1 INDaaS Step3 Step5 Step4 Step2 Dependency Data Source2 #1: Dependency collections - Solution: Reusing existing tools #2: Dependency representation - Solution: Fault graphs #3: Efficient auditing - Solution: Failure sampling algorithm #4: Private independence audit - Solution: Private Jaccard similarity

Example Redundancy

Example Redundancy

Example Redundancy SW HW NET

Building Fault Graph Top-to-Bottom SW HW NET

Step1: Root Node Redundancy configuration fails

Step2: Server Nodes Redundancy configuration fails Server 1 fails Server 2 fails

Step2: Server Nodes Redundancy configuration fails Server 1 fails Server 2 fails AND gate: all the sublayer nodes fail, the upper layer node fails

Step3: Dependency Nodes Redundancy configuration fails Server 1 fails Server 2 fails +" +" HW fails Net fails SW fails Net fails SW fails HW fails

Step3: Dependency Nodes Redundancy configuration fails Server 1 fails Server 2 fails +" +" HW fails Net fails SW fails Net fails SW fails HW fails OR gate: one of the sublayer nodes fails, the upper layer node fails

Step4: Hardware Dependency Redundancy configuration fails Server 1 fails Server 2 fails +" +" HW fails Net fails SW fails Net fails SW fails HW fails +" +" CPU1 Disk1 CPU2 Disk2

Step5: Network Dependency Redundancy configuration fails Server 1 fails Server 2 fails +" +" HW fails Net fails SW fails Net fails SW fails HW fails +" +" CPU1 Disk1 Path1 Path2 CPU2 Disk2 Core1 +" +" ToR1 Core2

Step6: Software Dependency Redundancy configuration fails Server 1 fails Server 2 fails +" +" HW fails Net fails SW fails Net fails SW fails HW fails +" CPU1 Disk1 Path1 Path2 Riak +" Query CPU2 +" Disk2 +" +" +" +" Core1 ToR1 Core2 libsvnl libc6 libccl

RoadMap Step2 Step1 Step4 Dependency Data Source1 INDaaS Step3 Step5 Step4 Step2 Dependency Data Source2 #1: Dependency collections - Solution: Reusing existing tools #2: Dependency representation - Solution: Fault graphs #3: Efficient auditing - Solution: Failure sampling algorithm #4: Private independence audit - Solution: Private Jaccard similarity

RoadMap Step2 Step1 Step4 Dependency Data Source1 INDaaS Step3 Step5 Step4 Step2 Dependency Data Source2 #1: Dependency collections - Solution: Reusing existing tools #2: Dependency representation - Solution: Fault graphs #3: Efficient auditing - Solution: Failure sampling algorithm #4: Private independence audit - Solution: Private Jaccard similarity

Efficient Auditing Two algorithms balancing cost and accuracy: - Minimal fault set algorithm - Failure sampling algorithm

Efficient Auditing Two algorithms balancing cost and accuracy: - Minimal fault set algorithm - Failure sampling algorithm

Minimal Fault Set Algorithm Traditional algorithm in safety engineering - Exponential complexity (NP-hard) We are the first to apply it in Cloud area: - Analyzing a fat tree with 30,528 with ~40 hours

Minimal Fault Set Algorithm Traditional algorithm in safety engineer - Exponential complexity (NP-hard) We are the first to apply it in Cloud area: - Analyzing a fat tree with 30,528 with ~40 hours We propose efficient failure sampling algorithm.

Failure Sampling Algorithm Redundancy configuration fails Server 1 fails Server 2 fails +" +" Server1 s HW fails Switch1 fails Server2 s HW fails

Failure Sampling Algorithm Redundancy configuration fails Server 1 fails Server 2 fails +" +" Server1 s HW fails Switch1 fails Server2 s HW fails 1 or 0 1 or 0 1 or 0

Failure Sampling Algorithm Redundancy configuration fails 1 or 0? Server 1 fails Server 2 fails +" +" Server1 s HW fails Switch1 fails Server2 s HW fails 1 or 0 1 or 0 1 or 0

Failure Sampling Algorithm Redundancy configuration fails 1 or 0? Server 1 fails Server 2 fails +" +" Server1 s HW fails Switch1 fails Server2 s HW fails 1 or 0 1 or 0 1 or 0 Fault Sets

The 1st Sampling Round Redundancy configuration fails Server 1 fails Server 2 fails +" +" Server1 s HW fails Switch1 fails Server2 s HW fails Fault Sets

The 1st Sampling Round Redundancy configuration fails Server 1 fails Server 2 fails +" +" Server1 s HW fails Switch1 fails Server2 s HW fails 1 or 0 1 or 0 1 or 0 Fault Sets

The 1st Sampling Round Redundancy configuration fails Server 1 fails Server 2 fails +" +" Server1 s HW fails Switch1 fails Fault Sets Server2 s HW fails

The 1st Sampling Round Redundancy configuration fails Server 1 fails Server 2 fails +" +" Server1 s HW fails Switch1 fails Fault Sets Server2 s HW fails

The 1st Sampling Round Redundancy configuration fails Server 1 fails Server 2 fails +" +" Server1 s HW fails Switch1 fails Fault Sets Server2 s HW fails {Server1 s HW, Server2 s HW}

The 2nd Sampling Round Redundancy configuration fails Server 1 fails Server 2 fails +" +" Server1 s HW fails Switch1 fails Server2 s HW fails Fault Sets {Server1 s HW, Server2 s HW}

The 2nd Sampling Round Redundancy configuration fails Server 1 fails Server 2 fails +" +" Server1 s HW fails Switch1 fails Server2 s HW fails 1 or 0 1 or 0 1 or 0 Fault Sets {Server1 s HW, Server2 s HW}

The 2nd Sampling Round Redundancy configuration fails Server 1 fails Server 2 fails +" +" Server1 s HW fails Switch1 fails Fault Sets Server2 s HW fails {Server1 s HW, Server2 s HW}

The 2nd Sampling Round Redundancy configuration fails Server 1 fails Server 2 fails +" +" Server1 s HW fails Switch1 fails Fault Sets Server2 s HW fails {Server1 s HW, Server2 s HW}

The 3rd Sampling Round Redundancy configuration fails Server 1 fails Server 2 fails +" +" Server1 s HW fails Switch1 fails Server2 s HW fails Fault Sets {Server1 s HW, Server2 s HW}

The 3rd Sampling Round Redundancy configuration fails Server 1 fails Server 2 fails +" +" Server1 s HW fails Switch1 fails Fault Sets Server2 s HW fails {Server1 s HW, Server2 s HW}

The 3rd Sampling Round Redundancy configuration fails Server 1 fails Server 2 fails +" +" Server1 s HW fails Switch1 fails Fault Sets Server2 s HW fails {Server1 s HW, Server2 s HW}

The 3rd Sampling Round Redundancy configuration fails Server 1 fails Server 2 fails +" +" Server1 s HW fails Switch1 fails Fault Sets Server2 s HW fails {Server1 s HW, Server2 s HW} {Switch1}

After Many (e.g., 10 7 ) Rounds Fault Sets {Server1 s HW, Server2 s HW} {Switch1} {Switch1, Server2 s HW} {Switch1} {Switch1, Server2 s HW}......

Size-Based Ranking Fault Sets {Server1 s HW, Server2 s HW} {Switch1} {Switch1, Server2 s HW} {Switch1} {Switch1, Server2 s HW}......

Size-Based Ranking Fault Sets {Switch1} {Switch1} {Switch1, Server2 s HW} {Switch1, Server2 s HW} {Server1 s HW, Server2 s HW}......

Independence Evaluation Multiple equations for option: - summation of sizes - weighted average of sizes Fault Sets {Switch1} {Switch1} {Switch1, Server2 s HW} {Switch1, Server2 s HW} {Server1 s HW, Server2 s HW}......

RoadMap Step2 Step1 Step4 Dependency Data Source1 INDaaS Step3 Step5 Step4 Step2 Dependency Data Source2 #1: Dependency collections - Solution: Reusing existing tools #2: Dependency representation - Solution: Fault graphs #3: Efficient auditing - Solution: Failure sampling algorithm #4: Private independence audit - Solution: Private Jaccard similarity

RoadMap Step2 Step1 Step4 Dependency Data Source1 INDaaS Step3 Step5 Step4 Step2 Dependency Data Source2 #1: Dependency collections - Solution: Reusing existing tools #2: Dependency representation - Solution: Fault graphs #3: Efficient auditing - Solution: Failure sampling algorithm #4: Private independence audit - Solution: Private Jaccard similarity

RoadMap Step2 Step1 Step4 Data Source1 (Cloud1) INDaaS Step3 Step5 Step4 Step2 Data Source2 (Cloud2) #1: Dependency collections - Solution: Reusing existing tools #2: Dependency representation - Solution: Fault graphs #3: Efficient auditing - Solution: Failure sampling algorithm #4: Private independence audit - Solution: Private Jaccard similarity

Service Provider INDaaS Agent Cloud A Cloud B Cloud C

Service Provider INDaaS Agent Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider Select two clouds for redundancy: A&B? B&C? or A&C? INDaaS Agent Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Trusted Third-Party Service Provider Select two clouds for redundancy: A&B? B&C? or A&C? INDaaS Agent Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Trusted Third-Party Service Provider Select two clouds for redundancy: A&B? B&C? or A&C? INDaaS Agent Cloud providers are reluctant to share this information! Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Secure Multiparty Computation (SMPC) Service Provider Select two clouds for redundancy: A&B? B&C? or A&C? INDaaS Agent SMPC Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Secure Multiparty Computation (SMPC) Select two clouds for redundancy: A&B? B&C? or A&C? INDaaS Agent SMPC is hard to scale! SMPC [Xiao et al. CCSW 13] Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider Select two clouds for redundancy: A&B? B&C? or A&C? INDaaS Agent Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider INDaaS Agent Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider INDaaS Agent Evaluating independence by the dataset similarity between clouds Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider INDaaS Agent Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider INDaaS Agent ISP A Power A Power B ISP B Power A Power B ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

App Provider INDaaS Auditor Agent ISP A Power A Power B ISP B Power A Power B Using Jaccard similarity to evaluate the ISP B Power C independence of each redundancy configuration. Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

App Provider INDaaS Agent S1 S2...... Sn J(S1, S2,..., Sn) = ISP A Power A Power B ISP B Power A Power B S1 S2...... Sn ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider INDaaS Agent ISP A Power A Power B ISP B Power A Power B ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider INDaaS Agent ISP A Power A Power B =2 =4 ISP B Power A Power B ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider INDaaS Agent ISP A Power A Power B =2 =4 ISP B Power A Power B ISP B Power C J = 2/4 Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Deployment Sim Cloud A&B 0.5 Service Provider INDaaS Agent ISP A Power A Power B =2 =4 ISP B Power A Power B ISP B Power C J = 2/4 Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Deployment Sim Cloud A&B 0.5 Cloud B&C 0.25 Service Provider INDaaS Agent ISP A Power A Power B ISP B Power A Power B =1 =4 ISP B Power C J = 1/4 Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Deployment Sim Cloud A&B 0.5 Cloud B&C 0.25 Cloud A&C 0 Service Provider INDaaS Agent ISP A Power A Power B =0, =5, J = 0/5 ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider 0 means fully independent INDaaS Agent Deployment Sim Cloud A&B 0.5 Cloud B&C 0.25 Cloud A&C 0 ISP A Power A Power B =0, =5, J = 0/5 ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Deployment Sim Cloud A&B 0.5 Cloud B&C 0.25 Cloud A&C 0 Service Provider INDaaS Agent ISP A Power A Power B =0, =5, J = 0/5 ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Deployment Sim Cloud A&C 0 Cloud B&C 0.25 Cloud A&B 0.5 Service Provider INDaaS Agent ISP A Power A Power B =0, =5, J = 0/5 ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Deployment Sim Cloud A&C 0 Cloud B&C 0.25 Cloud A&B 0.5 Service Provider INDaaS Agent Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

P-SOP [Vaidya et al. JCS05] We apply Private Set Operation Protocol (P-SOP): - Private set intersection cardinality. - Private set union cardinality. S1 S2...... Sn J(S1, S2,..., Sn) = S1 S2...... Sn

P-SOP [Vaidya et al. JCS05] Allow k parties to compute both intersection and union cardinalities without learning other information.

P-SOP [Vaidya et al. JCS05] Allow k parties to compute both intersection and union cardinalities without learning other information. 11 3 10 3 7 1 5 20 3

P-SOP [Vaidya et al. JCS05] Allow k parties to compute both intersection and union cardinalities without learning other information. 11 3 10 3 7 P-SOP 1 5 20 3

P-SOP [Vaidya et al. JCS05] Allow k parties to compute both intersection and union cardinalities without learning other information. 11 3 10 =1, =7 3 7 1 =1, =7 =1, =7 P-SOP 5 20 3

P-SOP [Vaidya et al. JCS05] Allow k parties to compute both intersection and union cardinalities without learning other 11 3 10 information. But I do not know which elements are overlapping/union. Protocol P-SOP 3 7 But I do not know which elements are overlapping/union. But I do not know which elements are overlapping/union. 1 5 20 3

P-SOP [Vaidya et al. JCS05] 3 7 Kf 11 3 10 Kd Ke 1 5 20 3 Each party maintains a commutative encryption key Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

P-SOP [Vaidya et al. JCS05] 3 7 Kf 11 3 10 Kd Ke 1 5 20 3 Each party maintains a commutative encryption key Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

P-SOP [Vaidya et al. JCS05] 3 7 Kf 11 3 10 Kd Ke 1 5 20 3 Each party maintains a commutative encryption key Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

P-SOP [Vaidya et al. JCS05] Ee(Ed(Ef(3))) 3 Ee(Ed(Ef(7))) 7 Kf Ef(Ee(Ed(3))) 11 Ef(Ee(Ed(10))) 3 Ef(Ee(Ed(11))) 10 Kd Ke Ed(Ef(Ee(3))) 1 Ed(Ef(Ee(5))) 5 Ed(Ef(Ee(1))) 20 Ed(Ef(Ee(20))) 3 Each party maintains a commutative encryption key Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

P-SOP [Vaidya et al. JCS05] Ee(Ed(Ef(3))) 3 Ee(Ed(Ef(7))) 7 Kf Ef(Ee(Ed(3))) 11 Ef(Ee(Ed(10))) 3 Ef(Ee(Ed(11))) 10 Kd Ke Ed(Ef(Ee(3))) 1 Ed(Ef(Ee(5))) 5 Ed(Ef(Ee(1))) 20 Ed(Ef(Ee(20))) 3 Each party maintains a commutative encryption key Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

P-SOP [Vaidya et al. JCS05] Ee(Ed(Ef(3))) 3 Ee(Ed(Ef(7))) 7 Kf Ef(Ee(Ed(3))) 11 Ef(Ee(Ed(10))) 3 Ef(Ee(Ed(11))) 10 Kd Ke Ed(Ef(Ee(3))) 1 Ed(Ef(Ee(5))) 5 Ed(Ef(Ee(1))) 20 Ed(Ef(Ee(20))) 3 Each party maintains a commutative encryption key Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

P-SOP [Vaidya et al. JCS05] Ee(Ed(Ef(3))) 3 Ee(Ed(Ef(7))) 7 Kf Ef(Ee(Ed(3))) 11 Ef(Ee(Ed(10))) 3 Ef(Ee(Ed(11))) 10 Kd Ef(Ee(Ed(3))) Ke Ed(Ef(Ee(3))) 1 Ed(Ef(Ee(5))) 5 Ed(Ef(Ee(1))) 20 Ed(Ef(Ee(20))) 3 Each party maintains a commutative encryption key Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

P-SOP [Vaidya et al. JCS05] Ee(Ed(Ef(3))) 3 Kf Ef(Ee(Ed(3))) 11 Kd Ef(Ee(Ed(3))) Ef(Ee(Ed(10))) Ef(Ee(Ed(11))) Ee(Ed(Ef(7))) Ed(Ef(Ee(5))) Ed(Ef(Ee(1))) Ed(Ef(Ee(20))) Ke Ed(Ef(Ee(3))) 1 Each party maintains a commutative encryption key Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

P-SOP [Vaidya et al. JCS05] Kf Kd Ef(Ee(Ed(3))) Ef(Ee(Ed(10))) Ef(Ee(Ed(11))) Ee(Ed(Ef(7))) Ed(Ef(Ee(5))) Ed(Ef(Ee(1))) Ed(Ef(Ee(20))) 7 Ke Each party maintains a commutative encryption key Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

Private Independence Evaluation

Service Provider Select two clouds for redundancy: A&B? B&C? or A&C? INDaaS Agent Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider INDaaS Agent Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider INDaaS Agent Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider INDaaS Agent ISP A Power A Power B ISP B Power A Power B ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider INDaaS Agent ISP A Power A Power B P-SOP ISP B Power A Power B ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider INDaaS Agent ISP A Power A Power B =2 ISP B Power A Power B ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider INDaaS Agent ISP A Power A Power B =2 =4 ISP B Power A Power B ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider INDaaS Agent ISP A Power A Power B =2 =4 ISP B Power A Power B ISP B Power C J = 2/4 Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Deployment Sim Cloud A&B 0.5 Service Provider INDaaS Agent ISP A Power A Power B =2 =4 ISP B Power A Power B ISP B Power C J = 2/4 Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Deployment Sim Cloud A&B 0.5 Service Provider INDaaS Agent ISP A Power A Power B ISP B Power A Power B P-SOP ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Deployment Sim Cloud A&B 0.5 Service Provider INDaaS Agent ISP A Power A Power B ISP B Power A Power B =1 =4 ISP B Power C J = 1/4 Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Deployment Sim Cloud A&B 0.5 Cloud B&C 0.25 Service Provider INDaaS Agent ISP A Power A Power B ISP B Power A Power B =1 =4 ISP B Power C J = 1/4 Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Deployment Sim Cloud A&B 0.5 Cloud B&C 0.25 Service Provider INDaaS Agent ISP A Power A Power B P-SOP ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Deployment Sim Cloud A&B 0.5 Cloud B&C 0.25 Service Provider INDaaS Agent ISP A Power A Power B =0, =5, J = 0/5 ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Deployment Sim Cloud A&B 0.5 Cloud B&C 0.25 Cloud A&C 0 Service Provider INDaaS Agent ISP A Power A Power B =0, =5, J = 0/5 ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Deployment Sim Cloud A&B 0.5 Cloud B&C 0.25 Cloud A&C 0 Service Provider INDaaS Agent ISP A Power A Power B =0, =5, J = 0/5 ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Deployment Sim Cloud A&C 0 Cloud B&C 0.25 Cloud A&B 0.5 Service Provider INDaaS Agent ISP A Power A Power B =0, =5, J = 0/5 ISP B Power C Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Deployment Sim Cloud A&C 0 Cloud B&C 0.25 Cloud A&B 0.5 Service Provider INDaaS Agent Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

Service Provider Deployment Sim Cloud A&C 0 Cloud B&C 0.25 Cloud A&B 0.5 INDaaS Agent Cloud A Cloud B Cloud C ISP A Power A Power B ISP B Power C

RoadMap Step2 Step1 INDaaS Agent Step4 Dependency Data Source1 Step3 Step5 Step4 Step2 Dependency Data Source2 #1: Dependency collections - Solution: Reusing existing tools #2: Dependency representation - Solution: Fault graphs #3: Efficient auditing - Solution: Failure sampling algorithm #4: Private independence audit - Solution: Private Jaccard similarity

Evaluation

Evaluation Three realistic case studies. Tradeoff between auditing algorithms Overhead of P-SOP protocol

Evaluation Three realistic case studies. Tradeoff between auditing algorithms Overhead of P-SOP protocol

Three Case Studies Common network dependency Common hardware dependency Common software dependency Please see our paper for more details

Evaluation Three realistic case studies. Tradeoff between auditing algorithms Overhead of P-SOP protocol

Tradeoff Evaluation We evaluate efficiency/accuracy tradeoff. We generate topology based on fat tree model. We also bu

Tradeoff Evaluation Topology A Topology B Topology C # of Core Routers 64 144 576 # of Agg Switches 128 288 1,152 # of ToR Switches 128 288 1,152 # of Servers 1,024 3,456 27,648 Total # of devices 1,344 4,176 30,528

Tradeoff Evaluation Topology A Topology B Topology C # of Core Routers 64 144 576 # of Agg Switches 128 288 1,152 # of ToR Switches 128 288 1,152 # of Servers 1,024 3,456 27,648 Total # of devices 1,344 4,176 30,528

Topology C: 30,528 Devices Minimal Fault Set Algorithm % critical fault sets detected 100 90 80 70 60 50 40 30 Failure Sampling Algorithm (10 4 ) Minimal Fault Set Algorithm Failure Sampling Algorithm 1 2 4 8 16 32 64 128 256 512 2048 Computational time (minutes) Failure Sampling Algorithm (10 6 )

Topology C: 30,528 Devices Minimal Fault Set Algorithm % critical fault sets detected 100 90 80 70 60 50 40 30 Failure Sampling Algorithm (10 4 ) Minimal Fault Set Algorithm Failure Sampling Algorithm 1 2 4 8 16 32 64 128 256 512 2048 Computational time (minutes) Failure Sampling Algorithm (10 6 )

Evaluation Three realistic case studies. Tradeoff between auditing algorithms Overhead of P-SOP protocol

What we evaluate? Kissner and Song (KS) protocol for comparison. Bandwidth overhead of P-SOP and KS. Computational overhead of P-SOP and KS.

Bandwidth Overhead Total traffic sent (MB) 200 150 100 50 0 KS (4) P-SOP (4) KS (3) P-SOP (3) KS (2) P-SOP (2) 1000 10000 100000 Number of elements in each provider s dataset

Bandwidth Overhead Total traffic sent (MB) 200 150 100 50 0 KS (4) P-SOP (4) KS (3) P-SOP (3) KS (2) P-SOP (2) 1000 10000 100000 Number of elements in each provider s dataset

Bandwidth Overhead Total traffic sent (MB) 200 150 100 50 0 KS (4) P-SOP (4) KS (3) P-SOP (3) KS (2) P-SOP (2) 1000 10000 100000 Number of elements in each provider s dataset ~80MB

Computational Overhead Computational time (seconds) 1e+06 100000 10000 1000 100 10 1 KS (4) KS (3) KS (2) P-SOP (4) P-SOP (3) P-SOP (2) 1000 10000 100000 Number of elements in each provider s dataset ~10 3 sec

Conclusions INDaaS is a first step towards reliable clouds: - Dependency collections - Dependency representations - Efficient auditing - Private independence auditing We evaluated INDaas with three realistic case studies and large-scale simulations.

Thanks, questions? INDaaS: Heading off correlated failures in clouds Find out more at: - http://dedis.cs.yale.edu/cloud/ We will be at the poster session tonight.