Principled Schedulability Analysis for Distributed Storage Systems Using Thread Architecture Models

1 Principled Schedulability Analysis for Distributed Storage Systems Using Thread Architecture Models
Suli Yang*, Jing Liu, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
* Work done while at UW-Madison

2 Scheduling: A Fundamental Primitive
- Modern storage systems are shared
- Correct and efficient request scheduling is indispensable
[Figure: applications such as Snapchat issuing reads and writes (R/W) to shared storage]

3 Broken Scheduling in Current Systems
Popular storage systems have fundamental scheduling deficiencies:
- [MongoDB - #21858]: A high throughput update workload could cause starvation on secondary reads
- [HBase - #8884]: when the read load on a specific RS is high, the write throughput also gets impacted dramatically, and even write data loss...
- [Cassandra - #10989]: inability to balance writes/reads/compaction/flushing, etc.

4 Why Is Scheduling Broken?
The complexities in modern storage systems:
- Distributed: >1000 servers
- Highly concurrent: ~1000 interacting threads in each server
- Long execution path: requests traverse numerous threads across multiple machines
We introduce the Thread Architecture Model to describe these scheduling complexities

5 Thread Architecture Model (TAM)
- Encodes scheduling-related info: request flows, thread interactions, resource consumption patterns
- Easy to obtain automatically
- From complicated systems to an understandable and analyzable model
- Systems modeled: HBase, Cassandra, MongoDB, Riak
[Figure: HBase TAM (RegionServer/DataNode) with stages RPC Read, RPC Handle, RPC Respond, Mem Flush, LOG Append, LOG Sync, Data Stream, Data Xceive, Packet Ack, Ack Process]

6 TAM Exposes Scheduling Problems
We discovered five categories of problems that occur in real systems:
- Lack of scheduling points
- Unknown resource usage
- Hidden contention between threads
- Uncontrolled thread blocking
- Ordering constraints upon requests

7 Fixing Problems Leads to Effective Scheduling
- TAM-based simulation finds problem-free thread architectures (HBase to Tamed-HBase)
- Provides schedulability: various desired scheduling policies can be realized
- Implementation transforms the system to be schedulable
- Muzzled-HBase: approximated implementation
- Effective scheduling under YCSB and other workloads

8 The Thread Architecture Model enables principled schedulability analysis on general distributed storage systems

9 Outline
- Overview
- Thread Architecture Model
- Scheduling Problems
- Achieve Schedulability: A Case Study
- Conclusion

10 Thread Architecture Model
[Figure: HBase TAM spanning RegionServer/DataNode nodes, with stages RPC Read, RPC Handle, RPC Respond, Mem Flush, LOG Append, LOG Sync, Data Stream, Data Xceive, Packet Ack, Ack Process]
Legend:
- Stage: threads performing similar tasks
- Resource usage: C = CPU, I = I/O, N = network, L = lock
- Request flow
- Request queue (scheduling point)
- Blocking

11 Thread Architecture Model
- TAM encodes scheduling-related info: request flows, thread interactions, resource consumption patterns
- From complex systems to analyzable models
- TADalyzer: from live system to TAM automatically
- Only [...] lines of user annotation code required
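
To make the three kinds of encoded information concrete, here is a minimal sketch of how a TAM could be held in memory. It is not TADalyzer's format: the class and method names are hypothetical, and the resource assigned to each stage in the example is an assumption for illustration, not taken from the actual HBase TAM.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """A group of threads performing similar tasks."""
    name: str
    resources: frozenset      # e.g. {"cpu", "io", "network", "lock"}
    has_queue: bool = True    # a request queue is a scheduling point


@dataclass
class ThreadArchitectureModel:
    """Hypothetical container for stages and per-request-type flows."""
    stages: dict = field(default_factory=dict)
    flows: dict = field(default_factory=dict)  # request type -> ordered stage names

    def add_stage(self, name, resources, has_queue=True):
        self.stages[name] = Stage(name, frozenset(resources), has_queue)

    def add_flow(self, request_type, path):
        missing = [s for s in path if s not in self.stages]
        if missing:
            raise KeyError(f"unknown stages: {missing}")
        self.flows[request_type] = list(path)

    def resources_used(self, request_type):
        """Union of resources consumed along a request's path."""
        used = set()
        for name in self.flows[request_type]:
            used |= self.stages[name].resources
        return used

    def scheduling_points(self, request_type):
        """Stages on the path where a scheduler can reorder requests."""
        return [n for n in self.flows[request_type] if self.stages[n].has_queue]


# Illustrative only: stage names come from the slides, resources are assumed.
tam = ThreadArchitectureModel()
tam.add_stage("RPC Read", {"network"})
tam.add_stage("RPC Handle", {"cpu"})
tam.add_stage("RPC Respond", {"network"})
tam.add_flow("get", ["RPC Read", "RPC Handle", "RPC Respond"])
print(sorted(tam.resources_used("get")))
```

Keeping flows as ordered lists of stage names is the simplest choice for a sketch; a real model would need a graph to express fan-out and blocking edges.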

13 TAM Exposes Scheduling Problems
Five problems: no scheduling, unknown resource usage, hidden contention, blocking, ordering constraint
- Common in distributed storage systems: HBase, Cassandra, MongoDB, Riak
- Directly identifiable from TAM
- No low-level implementation details required
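
Since the problems are described as directly identifiable from the TAM, a structural check can be sketched as below. This is an illustrative assumption, not the authors' tool: the `Stage` fields and `find_problems` are hypothetical names, and only three of the five problems are covered, using the visual cues the slides mention (a stage without a request queue, bracketed resource symbols, and a blocking arrow into a stage with a queue).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    has_queue: bool               # request queue = scheduling point
    usage_known: bool             # False models bracketed resource symbols
    bounded_threads: bool = True  # fixed number of threads in the stage


def find_problems(stages, blocking_edges):
    """Return (problem, stage name) pairs found by inspecting the model.

    blocking_edges is a list of (blocked_stage, awaited_stage) names.
    """
    by_name = {s.name: s for s in stages}
    problems = []
    for s in stages:
        if not s.has_queue:
            problems.append(("no scheduling", s.name))
        if not s.usage_known:
            problems.append(("unknown resource usage", s.name))
    for src, dst in blocking_edges:
        # A fixed-size stage waiting on a queued stage can exhaust its threads.
        if by_name[src].bounded_threads and by_name[dst].has_queue:
            problems.append(("blocking", src))
    return problems


# Hypothetical three-stage model; "Respond" is an invented stage name.
stages = [
    Stage("Req Handle", has_queue=True, usage_known=False),
    Stage("I/O", has_queue=True, usage_known=True),
    Stage("Respond", has_queue=False, usage_known=True),
]
for problem in find_problems(stages, [("Req Handle", "I/O")]):
    print(problem)
```

Hidden contention and ordering constraints are left out because the slides do not say how they appear in the TAM.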

15 Scheduling Problem: Unknown Resource Usage
[Figure: Cassandra TAM across two Cassandra nodes, with stages C-ReqHandle, Msg In, Read, Mutation, V-Mutation, ..., Respond, C-Respond, Msg Out]

16 Scheduling Problem: Unknown Resource Usage
Workload:
- C1: issues cold requests
- C2: issues cold and cached requests
Expectation: C2 has much higher throughput (due to cached requests)
CPU underutilized

17 Unknown Resource Usage: Solution
Workload:
- C1: issues cold requests
- C2: issues cold and cached requests
Expectation: C2 has much higher throughput (due to cached requests)

18 Scheduling Problem: Unknown Resource Usage
- Resource usage patterns are unknown to schedulers until after processing begins
- Forces schedulers to make decisions before information is available
- Identified as red square brackets around resource symbols in TAM (example stage: Req Handle)

19 Scheduling Problem: Blocking
[Figure: MongoDB TAM with a primary node and a secondary node; stages include NetInterface, Worker, Fetcher, Batcher, Oplog Writer, Writer, Feedback]

20 Scheduling Problem: Blocking (MongoDB)
Workload:
- C1: reads from primary (does not go to secondary)
- C2: writes to primary (replicated to secondary node)
- At time 10, the secondary node slows down
Expectation: C1 read throughput remains stable
[Plot: x-axis Time (s)]

21 Blocking: Solution (MongoDB)
Workload:
- C1: reads
- C2: writes (replicated to secondary node)
- At time 10, the secondary node slows down
Expectation: C1 read throughput remains stable

22 Scheduling Problem: Blocking
- Stages with a fixed number of threads block on other stages
- Unable to schedule requests that could have been completed, because all threads are blocked
- Identified in TAM as dashed arrows pointing to stages with queues (example: Req Handle blocking on I/O)
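
The effect described above can be reproduced with a toy model: a FIFO queue served by a fixed pool of handler threads, where each write holds its thread until a downstream stage answers. The function name and all parameters are hypothetical, and this is a deliberately simplified sketch, not the authors' simulation framework.

```python
import heapq


def read_completion_times(requests, num_threads, write_block_time, read_time=1):
    """Serve requests in FIFO order with a fixed thread pool.

    requests is a sequence of "R" (read) or "W" (write); all arrive at time 0.
    A write occupies its thread for write_block_time while it waits on a
    downstream stage. Returns the completion time of every read.
    """
    free_at = [0] * num_threads  # min-heap: when each thread becomes free
    heapq.heapify(free_at)
    read_done = []
    for kind in requests:
        start = heapq.heappop(free_at)  # earliest available thread
        cost = write_block_time if kind == "W" else read_time
        finish = start + cost
        heapq.heappush(free_at, finish)
        if kind == "R":
            read_done.append(finish)
    return read_done


queue = ["W", "W", "R", "R"]
# Healthy downstream: writes release their threads quickly.
print(read_completion_times(queue, num_threads=2, write_block_time=1))
# Slow downstream: both threads block, and the reads wait behind them
# although the reads never touch the slow stage.
print(read_completion_times(queue, num_threads=2, write_block_time=100))
```

With two threads the reads finish at time 2 when the downstream is fast and at time 101 when it is slow, even though their own service time never changed.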

24 Fixing Problems Leads to Schedulability
TAM-based simulation framework: explore thread architectures
- Simulates how systems perform under workloads
- Makes it easy to study architecture designs and scheduling policies
Implementation: realize schedulable systems
- Also validates that simulation matches the real world

25 Simulation: HBase to Tamed-HBase
[Figure: TAMs of HBase and Tamed-HBase over RegionServer/DataNode nodes; resource labels include Network, CPU, and IO]

26 Implementation: Tamed-HBase to Muzzled-HBase
- Some approximations to make implementation easier
- Supports multiple scheduling policies
- Proper scheduling under various workloads

27 Muzzled-HBase: Weighted Fairness
Workload: five clients, each with a different weight, run YCSB (mostly reads)
Expectation: each client receives throughput proportional to its weight
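
As an illustration of the policy being tested, the sketch below uses stride scheduling, a standard proportional-share technique. It is not Muzzled-HBase's scheduler, which the slides do not describe. The weights 1 to 5, the client names, and the function name are assumptions, and every request is assumed to cost the same.

```python
import heapq
from collections import Counter
from fractions import Fraction


def stride_schedule(weights, num_dispatches):
    """Dispatch unit-cost requests from always-backlogged clients.

    Each client carries a "pass" value; the client with the smallest pass
    is served next and its pass advances by 1/weight, so heavier clients
    are picked proportionally more often.
    """
    heap = [(Fraction(0), client) for client in sorted(weights)]
    heapq.heapify(heap)
    served = Counter()
    for _ in range(num_dispatches):
        pass_value, client = heapq.heappop(heap)
        served[client] += 1
        stride = Fraction(1, weights[client])  # exact, avoids float drift
        heapq.heappush(heap, (pass_value + stride, client))
    return dict(served)


weights = {"c1": 1, "c2": 2, "c3": 3, "c4": 4, "c5": 5}
for client, count in sorted(stride_schedule(weights, 1500).items()):
    print(client, count)
```

Such a scheduler only delivers proportional throughput if it sits at a point where request costs are known and threads do not block on other stages, which is what the preceding problem slides argue a schedulable architecture must provide.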

29 Muzzled-HBase: Tail Latency Guarantee
Workload:
- Foreground client: runs YCSB (update-heavy)
- Background client: random Gets or Puts
Expectation: foreground latency remains stable

33 Conclusion
We introduce thread architecture models
- Reduce complex distributed scheduling to an understandable representation
- Enable schedulability analysis
We discover five scheduling problems
- Point to problematic architectures that exist in real systems
- Fixing them enables effective scheduling
Complex systems need to be built with the help of TAM
- Analyze existing systems and enable schedulability
- Design systems that are problem-free and natively schedulable

34 Thank you! Questions? (poster number: 28)
OceanBase: We are Hiring
- Geo-scale relational database behind Alipay
- 42,000,000 SQLs per second
- US and China based

Principled Schedulability Analysis for Distributed Storage Systems using Thread Architecture Models Suli Yang, Ant Financial Services Group; Jing Liu, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau,