Haryadi Gunawi and Andrew Chien


1
Haryadi Gunawi and Andrew Chien, in collaboration with Gokul Soundararajan and Deepak Kenchammana (NetApp) and Rob Ross and Dries Kimpe (Argonne National Labs)

2
- Complete fail-stop
- Fail partial
  (Rich literature)
- Corruption
- Performance degradation ("limpware")?

3
Limping NIC! (1,000,000x)
- A 1Gb NIC card on a machine that suddenly starts transmitting at 1 kbps
- This one slow machine caused a chain reaction, making a 100-node cluster crawl at a snail's pace (Facebook engineers)
Cascading impact!

4
- Disks
  - 4 servers having high wait times on I/O, up to 103 seconds. This was left uncorrected for 50 [...] (Argonne)
  - Causes: weak disk head, bad packaging, missing screws, broken/old fans, too many disks/box, firmware bugs, bad sector remapping, ...
- SSDs
  - Samsung firmware bug (reduces bandwidth by 4x)
- Network cards and switches
  - On Intrepid, a bad batch of optical transceivers with an extremely high error rate caused an effective throughput of 1-2 [...] (Argonne)
  - Causes: broken adapter, error correcting, driver bugs, power fluctuation, ...
- Memory
  - Runs only at 25% of normal speed (HBase operators)
- Processors
  - 26% variation
  - Causes: aging transistors, overheating, self-throttling, ...
- Many others: "Yes, we've seen that in production"
More anecdotes in our paper [SoCC 13]

6
- Introduction
- Impact of limpware on scale-out cloud systems? [HotCloud 13, SoCC 13]
- Progress Summary
  - What bugs live in the cloud? [SoCC 14]
  - Detecting performance bugs [HotCloud 15]
  - The Tail at Store [In Submission]
  - Other ongoing work

7
- Anecdotes
  - The performance of a 100-node cluster was crawling at a snail's pace (Facebook)
- But, why?

8
- Goals:
  - Measure system-level impacts
  - Find design flaws
- Run distributed systems/protocols
  - E.g., 3-node write in HDFS
- Measure slowdowns under:
  - No failure, crash, a limping NIC
[Figure: execution slowdown per workload (axis ticks 1, 10x slower, 100x slower, 1000x slower) for a 10 Mbps NIC, a 1 Mbps NIC, and a 0.1 Mbps NIC]
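To give a feel for why a single limping NIC dominates a 3-node pipelined write, here is a minimal sketch. It is an idealized model, not the experiment from the talk: it assumes the write streams through the replicas in a chain, so end-to-end throughput is bounded by the slowest link. The function names, the 64 MB block size, and the 100 Mbps healthy bandwidth are all assumptions chosen for illustration.

```python
def pipeline_write_seconds(size_mb, link_mbps):
    """Seconds to stream size_mb megabytes through a chain of links.

    Idealized assumption: the slowest link bounds pipeline throughput,
    and all other overheads (disk, protocol, acks) are ignored.
    """
    bottleneck_mbps = min(link_mbps)
    return size_mb * 8 / bottleneck_mbps


def slowdown(healthy_mbps, limping_mbps, replicas=3, size_mb=64):
    """Ratio of write time with one limping link to write time with none."""
    healthy_links = [healthy_mbps] * replicas
    limping_links = [healthy_mbps] * (replicas - 1) + [limping_mbps]
    return (pipeline_write_seconds(size_mb, limping_links)
            / pipeline_write_seconds(size_mb, healthy_links))


if __name__ == "__main__":
    # Hypothetical healthy NIC speed of 100 Mbps; limping speeds from the slide.
    for limping in (10.0, 1.0, 0.1):
        print(f"{limping} Mbps NIC: {slowdown(100.0, limping):.0f}x slower")
```

In this model a crash is cheaper than a limp: a crashed node is dropped from the pipeline, while a limping node stays in it and sets the pace for every write that passes through it.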

9
HDFS, Hadoop, ZooKeeper, Cassandra, HBase

10
Fail-stop tolerant, but not limpware tolerant (no failover recovery)

11
- Run Hadoop with 6+ hours of Facebook workload
- Also happens in HDFS and ZooKeeper
[Figure: 30-node cluster vs. 30-node cluster (w/ 1 slow 0.1 Mbps); annotations: 1 job/hour, cluster collapse after ~4 hours]

12
- Single point of performance failure
- Coarse-grained timeouts
- Bounded thread/queue pool → resource exhaustion
- Unbounded thread/queue pool → OOM
- No throttling or back-pressure
- Limp-oblivious background jobs
- Unexploited parallelism of small transactional I/Os
- Long lock/resource contention
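One way to picture the "coarse-grained timeouts" flaw listed above: a peer that trickles out a packet every second never trips a 60-second liveness timeout, yet it may be delivering a millionth of the expected bandwidth. The sketch below is a hypothetical check, not code from any of the studied systems; the function name, thresholds, and labels are all assumptions. It adds a progress-rate test next to the usual liveness test.

```python
def classify_peer(bytes_received, window_s, last_packet_age_s,
                  expected_bytes_per_s,
                  coarse_timeout_s=60.0, min_fraction=0.1):
    """Label a peer as "dead", "limping", or "healthy".

    All thresholds are illustrative assumptions:
    - coarse_timeout_s: the classic liveness timeout (fail-stop detection)
    - min_fraction: fraction of expected throughput below which the peer
      is treated as limping even though it is still alive
    """
    # Fail-stop check: nothing heard for longer than the coarse timeout.
    if last_packet_age_s > coarse_timeout_s:
        return "dead"
    # No measurement window yet, so there is no evidence of limping.
    if window_s <= 0:
        return "healthy"
    # Limp check: alive, but making far less progress than expected.
    observed_bytes_per_s = bytes_received / window_s
    if observed_bytes_per_s < min_fraction * expected_bytes_per_s:
        return "limping"
    return "healthy"


if __name__ == "__main__":
    # A 1 Gbps link (125,000,000 bytes/s) that degrades to 1 kbps (125 bytes/s).
    print(classify_peer(bytes_received=1250, window_s=10.0,
                        last_packet_age_s=0.5,
                        expected_bytes_per_s=125_000_000))
```

A system that only runs the first check would keep routing work to this peer; what to do once a peer is labeled limping (fail over, throttle, or exclude it) is a separate design decision not covered here.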

13
- Introduction
- Impact of limpware [SoCC 13]
- Progress Summary

14
- Study/Analysis
  - Limplock/limpware [HotCloud 13, SoCC 13]
  - What bugs live in the cloud? [SoCC 14]

15
- Study/Analysis
  - Limplock/limpware [HotCloud 13, SoCC 13]
  - What bugs live in the cloud? [SoCC 14]
    - Study of bugs in scale-out distributed systems
    - New: scalability bugs, single-point-of-failure bugs, ...

16
- Study/Analysis
  - Limplock/limpware [HotCloud 13, SoCC 13]
  - What bugs live in the cloud? [SoCC 14]
  - The Tail at Store [In Submission]
    - Goal: from anecdotes to real statistics
    - Collaboration with Gokul Soundararajan and Deepak Kenchammana
    - Study of over 450,000 disks, 4000 SSDs, and 240 EBS drives
    - Ask: How many slow drives? How often? Transient?

17
- Study/Analysis
  - Limplock/limpware [HotCloud 13, SoCC 13]
  - What bugs live in the cloud? [SoCC 14]
  - The Tail at Store [In Submission]
    - Limping disks and SSDs are real!
    - 2-digit slowdowns occurred in 0.01% of disk and SSD hours
    - 4- and 3-digit slowdowns occurred in 124 and 2461 disk hours, and 3-digit SSD slowdowns in 10 SSD hours
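The "N-digit slowdown" wording can be read as the number of digits in the slowdown factor: 2-digit means 10x to 99x, 3-digit means 100x to 999x, and so on. The sketch below shows one way to compute such a bucket. It assumes, purely for illustration, that a drive's slowdown is its latency divided by the median latency of its peers; the slides do not state how the study defines it, and the function name and sample values are hypothetical.

```python
from statistics import median


def slowdown_digits(latency, peer_latencies):
    """Return (slowdown factor, number of digits in its integer part).

    Assumption: slowdown = this drive's latency / median peer latency.
    A result of 1 digit means less than 10x slower than the median peer.
    """
    baseline = median(peer_latencies)
    if baseline <= 0:
        raise ValueError("peer latencies must have a positive median")
    factor = latency / baseline
    # int() truncates, so 25.7x counts as 2 digits and 0.8x as 1 digit.
    return factor, len(str(int(factor)))


if __name__ == "__main__":
    peers_ms = [10.0, 11.0, 9.0]  # made-up peer latencies in milliseconds
    for latency_ms in (12.0, 250.0, 12340.0):
        factor, digits = slowdown_digits(latency_ms, peers_ms)
        print(f"{latency_ms} ms -> {factor:.1f}x ({digits}-digit)")
```

Counting the hours in which each bucket occurs, rather than the raw events, yields the kind of "disk hours" figures quoted on the slide.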

18
- Study/Analysis
- Towards Limpware-Tolerant Systems
  - Detecting limpware-intolerant designs in distributed systems [HotCloud 15]
  - Tail-tolerant storage [In Progress]
    - In flash controller, operating system, and distributed storage
    - + Coordination with MapReduce Speculative Execution
    - (A cross-cutting approach)
[Figure: TT Flash Ctrl, TT OS/RAID, TT Distr. FS, MapReduce Spec. Ex.]

19
XPS → Exploit Scale
Limpware → Underexploit Scale
ucare.cs.uchicago.edu
ceres.uchicago.edu
