Enlightening the I/O Path: A Holistic Approach for Application Performance

Size: px

Start display at page:

Download "Enlightening the I/O Path: A Holistic Approach for Application Performance"

Vincent Short
6 years ago
Views:

1 Enlightening the I/O Path: A Holistic Approach for Application Performance Sangwook Kim 13, Hwanju Kim 2, Joonwon Lee 3, and Jinkyu Jeong 3 Apposha 1 Dell EMC 2 Sungkyunkwan University 3

2 Data-Intensive Applications Relational Document Key-value Column Search 2

3 Data-Intensive Applications Common structure Request Application performance Client I/O T1 Response T2 T3 I/O I/O I/O Operating System Application T4 * Example: MongoDB - Client (foreground) - Checkpointer - Log writer - Eviction worker - Storage Device 3

4 Data-Intensive Applications Common structure Client Request Response Application Background tasks are * Example: problematic MongoDB Application T1 T2 T3 T4 - Server (client) performance for application performance I/O I/O I/O I/O Operating System - Checkpointer - Log writer - Evict worker - Storage Device 4

5 Application Impact Illustrative experiment YCSB update-heavy workload against MongoDB 5

6 Operation throughput (ops/sec) Application Impact Illustrative experiment YCSB update-heavy workload against MongoDB CFQ 30 seconds latency at th percentile Regular checkpoint task Elapsed time (sec) 6

7 Operation throughput (ops/sec) Application Impact Illustrative experiment YCSB update-heavy workload against MongoDB CFQ CFQ-IDLE I/O priority does not help Elapsed time (sec) 7

8 Operation throughput (ops/sec) Application Impact Illustrative experiment YCSB update-heavy workload against MongoDB CFQ CFQ-IDLE SPLIT-A SPLIT-D QASIO State-of-the-art schedulers do not help much Elapsed time (sec) 8

9 What s the Problem? Independent policies in multiple layers Each layer processes I/Os w/ limited information I/O priority inversion Background I/Os can arbitrarily delay foreground tasks 9

10 What s the Problem? Independent policies in multiple layers Each layer processes I/Os w/ limited information I/O priority inversion Background I/Os can arbitrarily delay foreground tasks 10

11 Abstraction Multiple Independent Layers Independent I/O processing Application Caching Layer File System Layer Block Layer Storage Device 11

12 Abstraction Multiple Independent Layers Independent I/O processing Application Caching Layer read() write() Buffer Cache admission control File System Layer Block Layer Storage Device 12

13 Abstraction Multiple Independent Layers Independent I/O processing Application read() write() Caching Layer Buffer Cache admission control File System Layer Block-level Q Block Layer Storage Device 13

14 Abstraction Multiple Independent Layers Independent I/O processing Application Caching Layer read() write() Buffer Cache reorder File System Layer FG BG FG BG Block Layer Storage Device 14

15 Abstraction Multiple Independent Layers Independent I/O processing Application Caching Layer read() write() Buffer Cache File System Layer Block Layer Storage Device FG BG FG admission control Device-internal Q 15

16 Abstraction Multiple Independent Layers Independent I/O processing Application Caching Layer read() write() Buffer Cache File System Layer FG BG FG Block Layer Storage Device BG BG reorder FG BG 16

17 What s the Problem? Independent policies in multiple layers Each layer processes I/Os w/ limited information I/O priority inversion Background I/Os can arbitrarily delay foreground tasks 17

18 I/O Priority Inversion Task dependency Application Caching Layer File System Layer Block Layer Locks Condition variables Storage Device 18

19 I/O Priority Inversion Task dependency Application Caching Layer File System Layer Block Layer FG lock BG wait Condition variables I/O Storage Device 19

20 I/O Priority Inversion Task dependency Application Caching Layer File System Layer Block Layer FG lock BG wait FG wait var wake I/O BG wait Storage Device 20

21 I/O Priority Inversion Task dependency Application FG wait user wake var wait waitbg FG I/O Caching Layer File System Layer Block Layer Storage Device 21

22 I/O Priority Inversion I/O dependency Application Caching Layer File System Layer Outstanding I/Os Block Layer Storage Device 22

23 I/O Priority Inversion I/O dependency Application Caching Layer File System Layer Block Layer FG wait I/O For ensuring consistency and/or durability Storage Device 23

24 Operation throughput (ops/sec) Our Approach Request-centric I/O prioritization (RCP) Critical I/O: I/O in the critical path of request handling Policy: holistically prioritizes critical I/Os along the I/O path CFQ CFQ-IDLE SPLIT-A SPLIT-D QASIO RCP Elapsed time (sec) 100 ms latency at th percentile 24

25 Challenges How to accurately identify I/O criticality How to effectively enforce I/O criticality 25

26 Critical I/O Detection Enlightenment API Interface for tagging foreground tasks I/O priority inheritance Handling task dependency Handling I/O dependency 26

27 I/O Priority Inheritance Handling task dependency Locks inherit submit FG lock BG FG BG I/O FG unlock BG complete inherit wake register BG FG wait submit CV CV BG I/O FG CV Condition variables BG complete 27

28 I/O Priority Inheritance Handling I/O dependency 28

29 I/O Priority Inheritance Handling I/O dependency I/O Block Layer 29

30 I/O Priority Inheritance Handling I/O dependency I/O Q admission stage Sched queueing stage Block Layer 30

31 I/O Priority Inheritance Handling I/O dependency I/O Non-critical I/O tracking PER-DEV ROOT Q admission stage NCIO NCIO NCIO NCIO Descriptor Location Resolver Sector # Sched queueing stage Block Layer 31

32 I/O Priority Inheritance Handling I/O dependency I/O Non-critical I/O tracking I/O PER-DEV ROOT allocate NCIO NCIO Q admission stage Sched queueing stage Block Layer Descriptor Location Resolver Sector # NCIO NCIO 32

33 I/O Priority Inheritance Handling I/O dependency I/O Non-critical I/O tracking I/O PER-DEV ROOT I/O NCIO NCIO Q admission stage Sched queueing stage Block Layer update Descriptor Location Resolver Sector # NCIO NCIO 33

34 I/O Priority Inheritance Handling I/O dependency I/O Non-critical I/O tracking I/O PER-DEV ROOT NCIO NCIO Q admission stage I/O Sched queueing stage Block Layer update Descriptor Location Resolver Sector # NCIO NCIO 34

35 I/O Priority Inheritance Handling I/O dependency I/O Non-critical I/O tracking I/O PER-DEV ROOT NCIO NCIO Q admission stage I/O Sched queueing stage Block Layer update Descriptor Location Resolver Sector # NCIO NCIO 35

36 I/O Priority Inheritance Handling I/O dependency I/O Non-critical I/O tracking I/O PER-DEV ROOT NCIO NCIO Q admission stage Sched queueing stage Block Layer I/O update Descriptor Location Resolver Sector # NCIO NCIO 36

37 I/O Priority Inheritance Handling I/O dependency I/O Non-critical I/O tracking I/O PER-DEV ROOT NCIO NCIO Q admission stage Sched queueing stage Block Layer I/O delete on completion Descriptor Location Resolver Sector # NCIO NCIO 37

38 Handling Transitive Dependency Possible states of dependent task FG inherit BG wait BG FG inherit BG wait I/O FG inherit BG wait Blocked on task Blocked on I/O Blocked at admission stage 38

39 Handling Transitive Dependency Recording blocking status FG inherit BG wait BG FG inherit BG wait I/O FG inherit BG wait Blocked on task Blocked on I/O Blocked at admission stage 39

40 Handling Transitive Dependency Recording blocking status FG inherit BG inherit BG FG inherit BG wait I/O FG inherit BG wait Task is recorded Blocked on I/O Blocked at admission stage 40

41 Handling Transitive Dependency Recording blocking status inherit inherit inherit reprio inherit wait FG BG BG FG BG I/O FG BG Task is recorded I/O is recorded Blocked at admission stage 41

42 Handling Transitive Dependency Recording blocking status inherit inherit inherit reprio inherit retry FG BG BG FG BG I/O FG BG Task is recorded I/O is recorded 42

43 Challenges How to accurately identify I/O criticality Enlightenment API I/O priority inheritance Recording blocking status How to effectively enforce I/O criticality 43

44 Criticality-Aware I/O Prioritization Caching layer Apply low dirty ratio for non-critical writes (1% by default) Block layer Isolate allocation of block queue slots Maintain 2 FIFO queues Schedule critical I/O first Limit # of outstanding non-critical I/Os (1 by default) Support queue promotion to resolve I/O dependency 44

45 Evaluation Implementation on Linux 3.13 w/ ext4 Application studies PostgreSQL relational database Backend processes as foreground tasks I/O priority inheritance on LWLocks (semop) MongoDB document store Client threads as foreground tasks I/O priority inheritance on Pthread mutex and condition vars (futex) Redis key-value store Master process as foreground task 45

46 Evaluation Experimental setup 2 Dell PowerEdge R530 (server & client) 1TB Micron MX200 SSD I/O prioritization schemes CFQ (default), CFQ-IDLE SPLIT-A (priority), SPLIT-D (deadline) [SOSP 15] QASIO [FAST 15] RCP 46

47 Transaction throughput (trx/sec) Application Throughput PostgreSQL w/ TPC-C workload CFQ CFQ-IDLE 10GB dataset 60GB dataset 200GB dataset 47

48 Transaction throughput (trx/sec) Application Throughput PostgreSQL w/ TPC-C workload CFQ CFQ-IDLE SPLIT-A GB dataset 60GB dataset 200GB dataset 48

49 Transaction throughput (trx/sec) Application Throughput PostgreSQL w/ TPC-C workload CFQ CFQ-IDLE SPLIT-A SPLIT-D GB dataset 60GB dataset 200GB dataset 49

50 Transaction throughput (trx/sec) Application Throughput PostgreSQL w/ TPC-C workload CFQ CFQ-IDLE SPLIT-A SPLIT-D QASIO GB dataset 60GB dataset 200GB dataset 50

51 Transaction throughput (trx/sec) Application Throughput PostgreSQL w/ TPC-C workload CFQ CFQ-IDLE SPLIT-A SPLIT-D QASIO RCP % 31% 28% 10GB dataset 60GB dataset 200GB dataset 51

52 Transaction log size (GB) Application Throughput Impact on background task CFQ CFQ-IDLE QASIO Elapsed time (sec) 52

53 Transaction log size (GB) Application Throughput Impact on background task CFQ CFQ-IDLE SPLIT-A SPLIT-D QASIO Elapsed time (sec) 53

54 Transaction log size (GB) Application Throughput Impact on background task CFQ CFQ-IDLE SPLIT-A SPLIT-D QASIO RCP Our scheme improves application throughput w/o penalizing background tasks Elapsed time (sec) 54

55 CCDF P[X>=x] Application Latency PostgreSQL w/ TPC-C workload 1.E+00 0th 10 0 CFQ CFQ-IDLE SPLIT-A SPLIT-D QASIO 1.E-01 90th E-02 99th th 1.E th 1.E-04 Over 2 sec at 99.9th th 1.E Transaction latency (msec) 55

56 CCDF P[X>=x] Application Latency PostgreSQL w/ TPC-C workload CFQ CFQ-IDLE SPLIT-A SPLIT-D QASIO RCP 1.E+00 0th E-01 90th E-02 99th th 1.E th 1.E-04 Our scheme is effective for improving tail latency Over 2 sec at 99.9th th 1.E msec at th Transaction latency (msec) 56

57 Summary of Other Results Performance results MongoDB: 12%-201% throughput, 5x-20x latency at 99.9 th Redis: 7%-49% throughput, 2x-20x latency at 99.9 th Analysis results System latency analysis using LatencyTOP System throughput vs. Application latency Need for holistic approach 57

58 Conclusions Key observation All the layers in the I/O path should be considered as a whole with I/O priority inversion in mind for effective I/O prioritization Request-centric I/O prioritization Enlightens the I/O path solely for application performance Improves throughput and latency of real applications Ongoing work Practicalizing implementation Applying RCP to database cluster with multiple replicas 58

59 Thank You! Questions and comments Contact 59

Analyzing and Optimizing Linux Kernel for PostgreSQL. Sangwook Kim PGConf.Asia 2017

Analyzing and Optimizing Linux Kernel for PostgreSQL Sangwook Kim PGConf.Asia 2017 Sangwook Kim Co-founder and CEO @ Apposha Ph. D. in Computer Science Cloud/Virtualization SMP scheduling [ASPLOS 13, VEE