Heading Off Correlated Failures through Independence-as-a-Service
Ennan Zhai (1), Ruichuan Chen (2), David Isaac Wolinsky (1), Bryan Ford (1)
(1) Yale University, (2) Bell Labs/Alcatel-Lucent
Background
Cloud services ensure reliability through redundancy:
- Amazon S3 replicates data on multiple racks
- iCloud rents EC2 and Azure redundantly
Unexpected common dependencies!
Service Outage Losses
[Screenshot of a news article: "Data Center Outages Generate Big Losses"]
Downtime in a data center can cost an average of $505,500 per incident, according to a Ponemon Institute study.
What is correlated failure?
[Figure: three racks, Rack1 (Primary), Rack2 (Backup) and Rack3 (Backup), sit behind Switch1, Switch2 and Switch3; all three switches hang off a single aggregation switch (Agg Switch).]
Realistic Example
Correlated failures originating in EBS, caused by bugs in one EBS server.
[Screenshot: "Summary of the October 22, 2012 AWS Service Event in the US-East Region"]
"We'd like to share more about the service event that occurred on Monday, October 22nd in the US-East Region. We have now completed the analysis of the events that affected AWS customers, and we want to describe what happened, our understanding of how customers were affected, and what we are doing to prevent a similar issue from occurring in the future."
"The Primary Event and the Impact to Amazon Elastic Block Store (EBS) and Amazon Elastic Compute Cloud (EC2)"
[Figure: Elastic Compute Cloud (EC2) hosts running VM1 to VM4 sit on top of Elastic Block Store (EBS), which is served by EBS Server1 and EBS Server2.]
Even Worse
[Figure: a Video App is deployed on Cloud Provider A and Cloud Provider B. Beneath both providers lie third-party infrastructure components: ISP Router A, ISP Router B, ISP Router C and a Power Source.]
Cloud providers do not usually share information about their dependencies.
Existing Efforts
Cloud providers diagnose or tolerate failures via:
- diagnosis systems;
- fault-tolerant systems.
These solve the problem after an outage occurs.
We want to prevent correlated failures before an outage occurs: Independence-as-a-Service (INDaaS).
INDaaS Workflow
Service provider Alice has a given redundancy configuration (here, a two-way redundancy configuration) and asks: how independent is this two-way redundancy?
- Step1: Specification submission (Alice to INDaaS)
- Step2: Dependency data collection (INDaaS from Dependency Data Source1 and Dependency Data Source2)
- Step3: Independence evaluation (INDaaS to Alice, e.g., "Relative independence is 0.3")
Multiple Cloud Providers
The workflow is the same (Step1 to Step3), but the data sources are now separate clouds: Data Source1 (Cloud1) and Data Source2 (Cloud2).
Each cloud is unwilling to share its dependency data.
Private Independence Evaluation
[Figure: Alice, INDaaS, Data Source1 (Cloud1) and Data Source2 (Cloud2), connected by Step1 (Alice and INDaaS), Step2 (INDaaS and each data source), Step3: Private auditing (between the two data sources), Step4 (each data source and INDaaS) and Step5 (INDaaS and Alice).]
- Each data source: "Only know my information"
- Alice: "Only the relative independence"
Technical Challenges
- #1: Dependency collections. Solution: reusing existing tools
- #2: Dependency representation. Solution: fault graphs
- #3: Efficient auditing. Solution: failure sampling algorithm
- #4: Private independence audit. Solution: private Jaccard similarity
RoadMap
The rest of the talk addresses the four challenges in order: #1 dependency collections, #2 dependency representation, #3 efficient auditing, #4 private independence audit.
Dependency Data Collections
Reuse existing data collection tools:
- Convert the outputs to a uniform format.
- Three types of format: NET, HW and SW.
Our defined format:
Type      | Dependency Expression
Network   | <src="S" dst="D" route="x,y,z"/>
Hardware  | <hw="H" type="T" dep="x"/>
Software  | <pgm="S" hw="H" dep="x,y,z"/>
Please see our paper for more details.
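To make the uniform format concrete, here is a minimal sketch of how such records might be parsed. It assumes attribute values are double-quoted; the function name `parse_record` and the way record types are detected are hypothetical and are not taken from the INDaaS implementation.

```python
import re

# Hypothetical parser for the three record types (NET, HW, SW).
# Assumption: attribute values are double-quoted, e.g. src="S".
ATTR = re.compile(r'(\w+)="([^"]*)"')

def parse_record(line):
    """Return (record_type, attributes) for one dependency record."""
    attrs = dict(ATTR.findall(line))
    if "src" in attrs:
        kind = "NET"            # network record: src, dst, route
    elif "pgm" in attrs:
        kind = "SW"             # software record: pgm, hw, dep
    elif "hw" in attrs:
        kind = "HW"             # hardware record: hw, type, dep
    else:
        raise ValueError(f"unrecognized record: {line!r}")
    # Comma-separated fields become lists of component names.
    for key in ("route", "dep"):
        if key in attrs:
            attrs[key] = [x.strip() for x in attrs[key].split(",") if x.strip()]
    return kind, attrs
```

For example, `parse_record('<src="S1" dst="Internet" route="ToR1,Core1"/>')` yields a `NET` record whose `route` is the list `["ToR1", "Core1"]`; the component names here are made up for illustration.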
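Example Redundancy
[Figure: an example redundancy deployment whose dependencies are grouped into three layers: SW, HW and NET.]
Building Fault Graph Top-to-Bottom
[Figure: the same deployment; the fault graph is built from the top down across the SW, HW and NET layers.]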
Step1: Root Node
- Redundancy configuration fails
Step2: Server Nodes
- Redundancy configuration fails (AND gate): Server 1 fails, Server 2 fails
- AND gate: if all the sublayer nodes fail, the upper layer node fails.
Step3: Dependency Nodes
- Server 1 fails (OR gate): HW fails, Net fails, SW fails
- Server 2 fails (OR gate): Net fails, SW fails, HW fails
- OR gate: if any one of the sublayer nodes fails, the upper layer node fails.
Step4: Hardware Dependency
- Server 1's HW fails: CPU1, Disk1
- Server 2's HW fails: CPU2, Disk2
Step5: Network Dependency
- Net fails: Path1, Path2
- The paths run through Core1, ToR1 and Core2
Step6: Software Dependency
- SW fails: Riak, Query
- Libraries: libsvn1, libc6, libccl
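The AND/OR semantics of the fault graph can be sketched in a few lines. This is a simplified illustration, not the INDaaS data structure: the `Node` class, the `fails` function and the leaf names (one hardware leaf, one software leaf and a shared `ToR1` leaf per server) are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    gate: str = "LEAF"                      # "AND", "OR", or "LEAF"
    children: list = field(default_factory=list)

def fails(node, failed_leaves):
    """Evaluate whether `node` fails given the set of failed leaf names."""
    if node.gate == "LEAF":
        return node.name in failed_leaves
    results = [fails(child, failed_leaves) for child in node.children]
    # AND gate: all children must fail. OR gate: one failing child is enough.
    return all(results) if node.gate == "AND" else any(results)

def server(n):
    # Hypothetical, flattened server subtree; both servers share the ToR1 leaf.
    return Node(f"Server {n} fails", "OR", [
        Node(f"HW{n} fails"), Node("ToR1 fails"), Node(f"SW{n} fails")])

root = Node("Redundancy configuration fails", "AND", [server(1), server(2)])
```

With this graph, `fails(root, {"ToR1 fails"})` is true because the shared switch takes down both servers, while `fails(root, {"HW1 fails"})` is false because Server 2 survives.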
Efficient Auditing
Two algorithms balancing cost and accuracy:
- Minimal fault set algorithm
- Failure sampling algorithm
Minimal Fault Set Algorithm
Traditional algorithm in safety engineering:
- Exponential complexity (NP-hard)
We are the first to apply it in the cloud area:
- Analyzing a fat tree with 30,528 devices takes ~40 hours
We propose an efficient failure sampling algorithm.
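To see where the exponential cost comes from, the following brute-force sketch enumerates leaf subsets by increasing size and keeps those that fail the root and contain no smaller fault set. It is an illustration only, not the traditional safety-engineering algorithm itself; the three-leaf toy graph (two servers sharing one switch) and the names `root_fails` and `minimal_fault_sets` are assumptions.

```python
from itertools import combinations

LEAVES = ["Server1 HW", "Switch1", "Server2 HW"]

def root_fails(failed):
    # Toy fault graph: each server fails if its hardware or the shared switch fails;
    # the redundancy configuration fails only if both servers fail.
    server1 = "Server1 HW" in failed or "Switch1" in failed
    server2 = "Switch1" in failed or "Server2 HW" in failed
    return server1 and server2

def minimal_fault_sets(leaves, fails):
    """Brute force: try every subset, smallest first (2^n subsets in the worst case)."""
    minimal = []
    for size in range(1, len(leaves) + 1):
        for combo in combinations(leaves, size):
            candidate = frozenset(combo)
            if any(m <= candidate for m in minimal):
                continue                    # a smaller fault set is already inside
            if fails(candidate):
                minimal.append(candidate)
    return minimal
```

On the toy graph this returns `{Switch1}` and `{Server1 HW, Server2 HW}`. The number of candidate subsets doubles with every added leaf, which is why exhaustive analysis becomes expensive on large topologies.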
Failure Sampling Algorithm
[Fault graph: "Redundancy configuration fails" (AND) has children "Server 1 fails" and "Server 2 fails". "Server 1 fails" (OR) depends on "Server1's HW fails" and "Switch1 fails"; "Server 2 fails" (OR) depends on "Switch1 fails" and "Server2's HW fails".]
- Each round assigns 1 or 0 to every leaf: Server1's HW fails, Switch1 fails, Server2's HW fails.
- Does the root, "Redundancy configuration fails", evaluate to 1 or 0?
- When the root fails, the failed leaves are recorded in the list of fault sets.
The 1st Sampling Round
- Each leaf is assigned 1 or 0; this round yields the fault set {Server1's HW, Server2's HW}.
- Fault Sets: {Server1's HW, Server2's HW}
The 2nd Sampling Round
- Each leaf is assigned 1 or 0 again; no new fault set is recorded.
- Fault Sets: {Server1's HW, Server2's HW}
The 3rd Sampling Round
- This round yields the fault set {Switch1}.
- Fault Sets: {Server1's HW, Server2's HW}, {Switch1}
After Many (e.g., 10^7) Rounds
Fault Sets: {Server1's HW, Server2's HW}, {Switch1}, {Switch1, Server2's HW}, {Switch1}, {Switch1, Server2's HW}, ...
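The sampling rounds can be sketched as a small Monte Carlo loop over the three-leaf fault graph from the slides. The per-leaf failure probability `p_fail=0.5`, the fixed seed and the function names are assumptions made for this illustration; the slides only say that each leaf is assigned 1 or 0. Unlike the slide, which lists repeated fault sets, this sketch collapses duplicates into a set.

```python
import random

LEAVES = ["Server1 HW", "Switch1", "Server2 HW"]

def root_fails(failed):
    # Server i fails if its own hardware or the shared switch fails (OR gates);
    # the redundancy configuration fails only if both servers fail (AND gate).
    server1 = "Server1 HW" in failed or "Switch1" in failed
    server2 = "Switch1" in failed or "Server2 HW" in failed
    return server1 and server2

def sample_fault_sets(rounds, p_fail=0.5, seed=0):
    """Flip a coin per leaf each round; record the failed leaves when the root fails."""
    rng = random.Random(seed)
    found = set()
    for _ in range(rounds):
        failed = frozenset(leaf for leaf in LEAVES if rng.random() < p_fail)
        if root_fails(failed):
            found.add(failed)
    return found
```

Each round costs one pass over the graph, so the total cost grows linearly with the number of rounds rather than exponentially with the number of leaves; the price is that some fault sets may be missed when too few rounds are run.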
Size-Based Ranking
Fault sets before ranking: {Server1's HW, Server2's HW}, {Switch1}, {Switch1, Server2's HW}, {Switch1}, {Switch1, Server2's HW}, ...
Fault sets ranked by size, smallest first: {Switch1}, {Switch1}, {Switch1, Server2's HW}, {Switch1, Server2's HW}, {Server1's HW, Server2's HW}, ...
Independence Evaluation
Multiple equations are available as options:
- summation of sizes
- weighted average of sizes
Ranked fault sets: {Switch1}, {Switch1}, {Switch1, Server2's HW}, {Switch1, Server2's HW}, {Server1's HW, Server2's HW}, ...
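A minimal sketch of the size-based ranking and of the "summation of sizes" option follows. The slides do not give the exact equations, so `size_sum` is only one plausible reading, and the weighted-average variant is left out because its weights are not specified. The function names are hypothetical.

```python
def rank_by_size(fault_sets):
    """Order fault sets from smallest to largest; smaller sets are more critical."""
    return sorted(fault_sets, key=len)

def size_sum(fault_sets):
    """Assumed reading of 'summation of sizes': add up the size of every fault set."""
    return sum(len(fs) for fs in fault_sets)

# The five sampled fault sets listed on the slide.
sampled = [
    {"Server1 HW", "Server2 HW"},
    {"Switch1"},
    {"Switch1", "Server2 HW"},
    {"Switch1"},
    {"Switch1", "Server2 HW"},
]
ranked = rank_by_size(sampled)
```

Here the size-1 sets come first in `ranked`, and `size_sum(sampled)` is 8. Under this reading a larger score means that more components must fail together to break the deployment.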
[Figure: a Service Provider talks to the INDaaS Agent, which talks to Cloud A, Cloud B and Cloud C. The clouds depend on third-party components ISP A, Power A, Power B, ISP B and Power C.]
Service Provider: select two clouds for redundancy: A&B? B&C? or A&C?
Trusted Third-Party
Option: the INDaaS Agent acts as a trusted third party between the Service Provider and Cloud A, Cloud B and Cloud C.
Cloud providers are reluctant to share this information!
Secure Multiparty Computation (SMPC)
Option: Cloud A, Cloud B and Cloud C evaluate independence through SMPC [Xiao et al. CCSW'13].
SMPC is hard to scale!
[Figure: each cloud holds its own dependency set. Cloud A: {ISP A, Power A, Power B}. Cloud B: {ISP B, Power A, Power B}. Cloud C: {ISP B, Power C}.]
Evaluating independence by the dataset similarity between clouds.
Using Jaccard similarity to evaluate the independence of each redundancy configuration.
Jaccard similarity over the clouds' dependency sets $S_1, S_2, \ldots, S_n$:
$$J(S_1, S_2, \ldots, S_n) = \frac{|S_1 \cap S_2 \cap \cdots \cap S_n|}{|S_1 \cup S_2 \cup \cdots \cup S_n|}$$
Worked example:
- Cloud A & B: |A ∩ B| = 2, |A ∪ B| = 4, J = 2/4 = 0.5
- Cloud B & C: |B ∩ C| = 1, |B ∪ C| = 4, J = 1/4 = 0.25
- Cloud A & C: |A ∩ C| = 0, |A ∪ C| = 5, J = 0/5 = 0
0 means fully independent.
Results ranked by similarity, lowest first:
Deployment  | Sim
Cloud A&C   | 0
Cloud B&C   | 0.25
Cloud A&B   | 0.5
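The ranking above can be reproduced with a short plaintext computation. This sketch works on the raw dependency sets, so it is not private; it only checks the arithmetic. The dictionary `DEPS` encodes the dependency sets shown in the figure, and the function names are hypothetical.

```python
from itertools import combinations

# Dependency sets read off the slide's figure.
DEPS = {
    "Cloud A": {"ISP A", "Power A", "Power B"},
    "Cloud B": {"ISP B", "Power A", "Power B"},
    "Cloud C": {"ISP B", "Power C"},
}

def jaccard(*sets):
    """Size of the intersection divided by size of the union; 0.0 if all sets are empty."""
    union = set().union(*sets)
    if not union:
        return 0.0
    return len(set.intersection(*sets)) / len(union)

def rank_deployments(deps, size=2):
    """Score every `size`-way deployment and sort from most to least independent."""
    scores = {combo: jaccard(*(deps[c] for c in combo))
              for combo in combinations(sorted(deps), size)}
    return sorted(scores.items(), key=lambda kv: kv[1])
```

Calling `rank_deployments(DEPS)` lists Cloud A & C first (0.0), then Cloud B & C (0.25), then Cloud A & B (0.5), matching the table.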
P-SOP [Vaidya et al. JCS05]
We apply the Private Set Operation Protocol (P-SOP):
- Private set intersection cardinality.
- Private set union cardinality.
These are the numerator and denominator of $J(S_1, S_2, \ldots, S_n) = |S_1 \cap \cdots \cap S_n| / |S_1 \cup \cdots \cup S_n|$.
P-SOP allows k parties to compute both intersection and union cardinalities without learning other information.
Example: three parties hold {3, 7}, {11, 3, 10} and {1, 5, 20, 3}. After running P-SOP, every party learns |∩| = 1 and |∪| = 7.
Each party: "But I do not know which elements are in the intersection or the union."
P-SOP [Vaidya et al. JCS05]
Each party maintains a commutative encryption key: Kf for the party holding {3, 7}, Kd for the party holding {11, 3, 10}, and Ke for the party holding {1, 5, 20, 3}.
Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m)).
Every element ends up encrypted under all three keys:
- Party f: Ee(Ed(Ef(3))), Ee(Ed(Ef(7)))
- Party d: Ef(Ee(Ed(3))), Ef(Ee(Ed(10))), Ef(Ee(Ed(11)))
- Party e: Ed(Ef(Ee(3))), Ed(Ef(Ee(5))), Ed(Ef(Ee(1))), Ed(Ef(Ee(20)))
Because encryption commutes, the three ciphertexts of 3 are equal, so the intersection cardinality is 1.
The distinct ciphertexts are Ef(Ee(Ed(3))), Ef(Ee(Ed(10))), Ef(Ee(Ed(11))), Ee(Ed(Ef(7))), Ed(Ef(Ee(5))), Ed(Ef(Ee(1))) and Ed(Ef(Ee(20))), so the union cardinality is 7.
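The counting trick can be illustrated with a toy commutative cipher. The slides do not say which cipher P-SOP uses, so this sketch substitutes modular exponentiation, E_k(m) = m^k mod P, which commutes because (m^a)^b = (m^b)^a. Everything runs in one process, whereas real parties would each apply only their own key and pass shuffled ciphertexts along. The modulus, key generation and function names are assumptions, and the parameters are not vetted for security.

```python
import hashlib
import random
from math import gcd

P = 2**127 - 1  # a Mersenne prime; toy modulus for illustration only

def to_group(element):
    """Hash an element to an integer in [1, P-1]."""
    digest = hashlib.sha256(str(element).encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def make_key(seed):
    """Pick an exponent coprime to P-1 so that encryption is a permutation."""
    rng = random.Random(seed)  # seeded only to keep the demo repeatable
    while True:
        k = rng.randrange(3, P - 1)
        if gcd(k, P - 1) == 1:
            return k

def encrypt(value, key):
    return pow(value, key, P)

def psop_cardinalities(party_sets, keys):
    """Encrypt every element under all keys, then count equal and distinct ciphertexts."""
    fully_encrypted = []
    for i, items in enumerate(party_sets):
        order = keys[i:] + keys[:i]        # own key first, then the other parties' keys
        ciphertexts = set()
        for item in items:
            c = to_group(item)
            for k in order:
                c = encrypt(c, k)
            ciphertexts.add(c)
        fully_encrypted.append(ciphertexts)
    return (len(set.intersection(*fully_encrypted)),
            len(set.union(*fully_encrypted)))
```

With the slide's three sets and any three valid keys, `psop_cardinalities([{3, 7}, {11, 3, 10}, {1, 5, 20, 3}], keys)` returns `(1, 7)`, even though each party applies the keys in a different order.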
Private Independence Evaluation
Service Provider: select two clouds for redundancy: A&B? B&C? or A&C?
P-SOP is run over the dependency sets of each candidate pair of clouds:
- Cloud A & B: P-SOP returns |∩| = 2 and |∪| = 4, so J = 2/4 = 0.5
- Cloud B & C: P-SOP returns |∩| = 1 and |∪| = 4, so J = 1/4 = 0.25
- Cloud A & C: P-SOP returns |∩| = 0 and |∪| = 5, so J = 0/5 = 0
The ranked list goes back to the Service Provider:
Deployment  | Sim
Cloud A&C   | 0
Cloud B&C   | 0.25
Cloud A&B   | 0.5
Evaluation
- Three realistic case studies
- Tradeoff between auditing algorithms
- Overhead of the P-SOP protocol
Three Case Studies
- Common network dependency
- Common hardware dependency
- Common software dependency
Please see our paper for more details.
Tradeoff Evaluation
- We evaluate the efficiency/accuracy tradeoff.
- We generate topologies based on the fat-tree model.
- We also bu
Tradeoff Evaluation
                   | Topology A | Topology B | Topology C
# of Core Routers  | 64         | 144        | 576
# of Agg Switches  | 128        | 288        | 1,152
# of ToR Switches  | 128        | 288        | 1,152
# of Servers       | 1,024      | 3,456      | 27,648
Total # of devices | 1,344      | 4,176      | 30,528
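The device counts in the table are consistent with the standard k-ary fat tree built from k-port switches, with k = 16, 24 and 48. The slides do not state k, so these values are inferred from the numbers; the sketch below just recomputes the table under that assumption.

```python
def fat_tree_counts(k):
    """Device counts for a standard k-ary fat tree (k must be even)."""
    if k % 2:
        raise ValueError("k must be even")
    core = (k // 2) ** 2        # (k/2)^2 core routers
    agg = k * k // 2            # k pods, each with k/2 aggregation switches
    tor = k * k // 2            # k pods, each with k/2 ToR (edge) switches
    servers = k ** 3 // 4       # k/2 servers under each ToR switch
    return {"core": core, "agg": agg, "tor": tor, "servers": servers,
            "total": core + agg + tor + servers}
```

Under this assumption `fat_tree_counts(16)`, `fat_tree_counts(24)` and `fat_tree_counts(48)` reproduce the columns for Topology A, B and C, including the totals of 1,344, 4,176 and 30,528 devices.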
Topology C: 30,528 Devices
[Figure: % of critical fault sets detected (y-axis, 30 to 100) versus computational time in minutes (x-axis, log scale, 1 to 2048). Series: Minimal Fault Set Algorithm and Failure Sampling Algorithm, with the sampling runs at 10^4 and 10^6 rounds marked.]
What do we evaluate?
- Kissner and Song (KS) protocol for comparison.
- Bandwidth overhead of P-SOP and KS.
- Computational overhead of P-SOP and KS.
Bandwidth Overhead
[Figure: total traffic sent in MB (y-axis, 0 to 200) versus number of elements in each provider's dataset (x-axis, 1,000 to 100,000). Series: KS (2), KS (3), KS (4), P-SOP (2), P-SOP (3), P-SOP (4). One point is annotated "~80MB".]
Computational Overhead
[Figure: computational time in seconds (y-axis, log scale, 1 to 1e+06) versus number of elements in each provider's dataset (x-axis, 1,000 to 100,000). Series: KS (2), KS (3), KS (4), P-SOP (2), P-SOP (3), P-SOP (4). One point is annotated "~10^3 sec".]
Conclusions
INDaaS is a first step towards reliable clouds:
- Dependency collections
- Dependency representations
- Efficient auditing
- Private independence auditing
We evaluated INDaaS with three realistic case studies and large-scale simulations.
Thanks, questions?
INDaaS: Heading off correlated failures in clouds
Find out more at:
- http://dedis.cs.yale.edu/cloud/
We will be at the poster session tonight.