Performance Analysis of Virtual Environments

Size: px

Start display at page:

Download "Performance Analysis of Virtual Environments"

Lesley Cox
5 years ago
Views:

1 Performance Analysis of Virtual Environments Nikhil V Mishrikoti (nikmys@cs.utah.edu) Advisor: Dr. Rob Ricci Co-Advisor: Anton Burtsev 1

2 Introduction Motivation Virtual Machines (VMs) becoming pervasive in data centers and academic institutions. Honouring SLAs. Promised vs Actual. Quantify impact of Virtualization. Making sense of performance data. 2

3 Introduction Project Goals Empower user to investigate performance problems with as little inertia as possible. 3

4 Introduction Project Framework to build tools for performing fine grained analysis of, resource utilization overheads and performance bottlenecks, in virtual setups. 4

5 Agenda Introduction Xen Overview Data Collection Analysis Framework Analysis Algorithms Composibility 5

6 Agenda Introduction Xen Overview Data Collection Analysis Framework Analysis Algorithms Composibility 6

7 Xen Overview Dom 0 Dom 1... Dom U Admin VM Xen Hardware Open source. Widely used. Ex Amazon EC2 7

8 Agenda Introduction Xen Overview Data Collection Analysis Framework Analysis Algorithms Composibility 8

9 Data Collection Xentrace - Overview Xentrace is a lightweight tracing utility that collects hypervisor and domain level events. Ships with Xen. Dom 0 Dom 1 Dom n Xentrace Domain Events... Domain Events Hypervisor Events Xen Hardware Xentrace logs HDD 9

10 Data Collection Xentrace - Advantages Widely available since it ships with Xen. Easily extensible. 10

11 Data Collection Xentrace - Data Not originally intended for performance. Xentrace collects enormous amounts of raw information. E.x: data collected during a disk intensive load for 1 minute exceeds 700 MB Hence, chose as the source of performance data. 11

12 Data Collection Xentrace - Details Event masks to selectively capture event data. Log data is in binary format. Additional events not provided by Xentrace, can be manually added and collected by inserting trace macros in Xen or domain source. 12

13 Data Collection Xentrace - Event Format uint64 uint32 uint32 uint32 CPU tsc ns event_id d0 d1 d2 d3 d4 d5 Physical CPU id Optional trace data Timestamps - CPU clock cycles & nanoseconds Trace Event type 13

14 Data Collection Xentrace - Limitations xentrace_format : binary to text. E.x: 700 MB log has 20+ million lines of text Very difficult to manually peruse and, identify performance problems. gain high level overview of performance. 14

15 Agenda Introduction Xen Overview Data Collection Analysis Framework Analysis Algorithms Composibility 15

16 Analysis Framework Architecture Dom 0 Dom 1... Dom n Xentrace Domain Events Domain Events Hypervisor Events Xen Hardware Xentrace logs HDD Reader Analyses 16

17 Analysis Framework Overview Implementation split in two components. Reader: Parses binary log data offline and passes C - style structs to Analyses component. Analyses: Algorithms consisting a group of handlers for different event types. Generate high level performance metrics like CPU utilization, disk i/o performance etc. 17

18 Introduction Xen Overview Data Collection Analysis Framework Reader Analyses Analysis Algorithms Composibility Agenda 18

19 Analysis Framework Reader - Caveats Two caveats make parsing non-trivial. Events collected in logs not completely ordered. 19

20 Analysis Framework Reader - Unordered logs CPU 0 20

21 Analysis Framework Reader - Unordered logs CPU 0 ev1 ev2 ev3 ev4 ev5 ev6 Time 21

22 Analysis Framework Reader - Unordered logs CPU 0 ev1 ev2 ev3 ev4 ev5 ev6 CPU 1 ev1 ev2 ev3 ev4 ev5 ev6 CPU 2 ev1 ev2 ev3 ev4 ev5 ev6 Time 22

23 Analysis Framework Reader - Unordered logs CPU 0 ev1 ev2 ev3 ev4 ev5 ev6 ev1 ev1 ev2 Ordered ev1 CPU 1 ev1 ev2 ev3 ev4 ev5 ev6 ev2 ev2 ev3 CPU 2 ev1 ev2 ev3 ev4 ev5 ev6 ev4 ev3 ev3 ev4 ev5 Time ev5 ev6 23 ev4 ev5

24 Analysis Framework Reader - Unordered logs CPU 0 CPU 1 CPU 2 ev1 ev2 ev3 ev4 ev5 ev6 ev1 ev2 ev3 ev4 ev5 ev6 ev1 ev2 ev3 ev4 ev5 ev6 Time ev1 ev1 ev2 ev1 ev2 ev2 ev3 ev4 ev3 ev3 ev4 ev5 ev5 Ordered Log ev1 ev2 ev3 ev4 ev5 ev6 ev1 ev2 ev3 ev4 ev5 ev6 ev1 ev2 ev3 ev6 ev4 24 ev4 ev5 ev5 ev6

25 Analysis Framework Reader - Unordered logs Unordered logs ev1 ev2 ev3 ev4 ev5 ev6 ev1 ev2 Sort? ev1 ev1 ev2 ev1 ev2 ev2 ev3 Ordered logs ev3 ev4 ev4 ev5 ev3 ev6 ev3 ev1 ev4 ev2 ev5 ev3 ev4 ev5 ev6 25 ev5 ev6 ev4 ev5

26 Analysis Framework Reader - Unordered logs Unordered large logs ev1 ev2 ev3 ev4 ev5 ev6 External Sort? ev1 ev2 ev3 ev4 ev5 ev6 d-way Merge Sort ev1 ev1 ev2 ev1 ev2 ev2 ev3 ev4 ev3 ev3 Ordered large logs ev1 ev2 ev4 ev5 ev3 ev4 ev5 ev6 26 ev5 ev6 ev4 ev5

27 Analysis Framework Reader - Unordered logs Unordered large logs ev1 ev2 ev3 ev4 ev5 ev6 ev1 ev2 d-way Merge ev1 ev1 ev2 ev1 ev2 ev2 ev3 Ordered large logs ev3 ev4 ev4 ev5 ev3 ev6 ev3 ev1 ev4 ev2 ev5 ev3 ev4 ev5 ev6 Min-Heap 27 ev5 ev6 ev4 ev5

28 Analysis Framework Reader - Caveats Two caveats make parsing non-trivial. Events collected in logs not completely ordered. Loss of events during log collection. 28

29 Analysis Framework Reader - Lost Records Events come in faster than they can be flushed to disk. Xentrace inserts a lost_record event in the logs. Interferes with analysis esp. time sensitive. ev1 ev1 ev2 ev1 ev2 ev3 ev4 ev5 ev5 ev6 29 ev4 ev5

30 Analysis Framework Reader - Lost Records Tried different approaches. Increase buffer size. Use Event masks when possible. Fix Xentrace bug. Treat lost_records as just another event. It s handler notifies other event handlers in execution of its occurrence. They deal with it appropriately (discarding analysis, ignoring it completely etc.) 30

31 Introduction Xen Overview Data Collection Analysis Framework Reader Analyses Analysis Algorithms Composibility Agenda 31

32 Analyses Each tool s analysis logic is made up of Event Handlers. Event Handlers registered with Reader. Handler needs 3 methods written by the user. Initialize Handle Finalize Ex: Count of Events. Initialize count to 0, increment count, print count at the end. 32

33 Analyses Architecture Xentrace logs Reader Handlers Parse Xentrace records ev_id_1 ev_handler_1(..) Call Event Handlers ev_id_2 ev_id_3 ev_handler_2(..) Finalizers ev_handler_3(..) Free Handlers 33

34 Introduction Xen Overview Data Collection Analysis Framework Reader Analyses Analysis Algorithms Composibility Agenda 34

35 Analysis Algorithms Reasoning about performance in virtual environments is not always straightforward. The user has to follow his instinct or data from another analysis. 35

36 Analysis Algorithms Motivation Quantify phenomena observed on virtual setups. Utilization Saturation Errors (USE) methodology [1] 36

37 Analysis Algorithms CPU Utilization CPU Scheduling Latency Time in Hypervisor Disk I/O Device Driver Queue Status Device Driver Queue Request and Response Latency 37

38 Analysis Algorithms CPU Utilization - Why? Fine grained CPU utilization information can, Check if hypervisor adheres to Service Level Agreements (SLA) between hosting providers and clients. Detecting unbalanced mapping between physical CPUs and VMs. 38

39 Analysis Algorithms CPU Utilization dom 0 dom1 dom U vcpu0 vcpu1 v0 v1 v2 vcpu0 CPU 0 CPU 1 CPU n 39

40 Analysis Algorithms CPU Utilization CPU Scheduling Latency Time in Hypervisor Disk I/O Device Driver Queue Status Device Driver Queue Request and Response Latency 40

41 Analysis Algorithms CPU Scheduling Latency domain_wake(dom1, vcpu 0) domain_wake(dom1, vcpu 1) enter_sched(v0) enter_sched(v1) 41

42 Analysis Algorithms CPU Scheduling Latency Measures time a VM had to wait to get scheduled since the context switch request was sent out. Delay in context switch not only affects CPU bound tasks but also I/O jobs. Since domain needs CPU to process I/O requests/responses. 42

43 Analysis Algorithms CPU Scheduling Latency Wait times can increase for a number of reasons, VCPUs are over-scheduled. Physical CPU is always busy. Imbalance in VCPU => CPU affinity. 43

44 Analysis Algorithms CPU Scheduling Latency domid: 0 : CPU Wait Time: (ms) domid: 1 : CPU Wait Time: (ms) Total CPU Wait time for all domains: (ms) (ns) : (ns) : (ns) : (ns) : (ns) : (ns) : (ns) : (ns) : (ns) : (ns) : 0 > 7000 (ns) : 0 44

45 Analysis Algorithms CPU Utilization CPU Scheduling Latency Time in Hypervisor Disk I/O Device Driver Queue Status Device Driver Queue Request and Response Latency 45

46 Analysis Algorithms Time in Hypervisor - Why? CPU utilization data can sometimes be unreliable to infer performance problems from. Possible case is when most execution takes place in hypervisor. E.x: When significant amount of time is spent in the hypervisor, executing instructions or performing I/O on behalf of a domain, CPU util data will not show this behavior. E.x: Honoring SLAs. Does SLA include Xen runtime? 46

47 Analysis Algorithms Time in Hypervisor Total of 0 lost_record events encountered Total time spent in Domain 0 : CPU 1 = (ms) Total time spent in Domain IDLE : CPU 1 = (ms) Total time spent in Domain 0 : CPU 0 = (ms) Total time spent in Domain IDLE : CPU 0 = (ms) Total time spent in Domain 1 : CPU 2 = (ms) Total time spent in Domain IDLE : CPU 2 = (ms) Total time spent in Xen: (ms) 47

48 Analysis Algorithms CPU Utilization CPU Scheduling Latency Time in Hypervisor Disk I/O Device Driver Queue Status Device Driver Queue Request and Response Latency 48

49 Analysis Algorithms Disk I/O - Split Device Driver Model Backend Frontend Xen 49

50 Analysis Algorithms Disk I/O - Why? Performance impact of split device drivers on Disk I/O. 50

51 Analysis Algorithms 1 Disk I/O - Shared Ring Buffer 2 [2] 3 DomU writes Req 1 DomU writes Req 2 Dom0 writes Resp DomU reads Resp 1 Dom0 writes Resp 2 DomU reads Resp 2 51

52 Analysis Algorithms Disk I/O - Simplified Disk Response Shared Ring buffer Disk Request Dom Dom... Dom 0 1 U Event Mechanism Xen Hardware HDD 52

53 Analysis Algorithms Disk I/O - Device Driver Queues Front Shared-Ring Request Queue Dom 0 Dom 1 Back Shared-Ring Request Queue Front Request Queue Back Shared-Ring Response Queue BDD FDD Front Shared-Ring Response Queue Xen Hardware HDD 53

54 Analysis Algorithms Disk I/O - Device Driver Queue States Blocked : Cannot process requests to/from queue. Unable to add new requests to queue or Queue is empty. Unblocked : Can enqueue new incoming requests. 54

55 Analysis Algorithms Disk I/O - Device Driver Queue States Intuition was that a queue blocked for a long time would block the entire pipeline. 55

56 Analysis Algorithms Disk I/O - Device Driver Queue States Q_blocked Blocked Q_unblocked Q_blocked Q_unblocked Unblocked 56

57 Analysis Algorithms Disk I/O - Device Driver Queue States Q_blocked Blocked Q_unblocked Q_blocked Q_unblocked lost_records Q_blocked lost_records lost_records Unblocked Unknown Q_unblocked 57

58 Analysis Algorithms Disk I/O - Observations Blocked Unbuffered : 99 % - All queues Buffered : 50 % - Frontend request queue, 99 % rest. Buffer cache enables faster request processing at frontend. Disk I/O so slow, virtualization overheads negligible. 58

59 Analysis Algorithms CPU Utilization CPU Scheduling Latency Time in Hypervisor Disk I/O Device Driver Queue Status Device Driver Queue Request and Response Latency 59

60 Analysis Algorithms Queue Latency - Simplified DomU writes Req & notifies Dom0 Dom0 reads Req 3 4 Dom0 writes Resp & notifies DomU 60 DomU reads Resp

61 Analysis Algorithms Queue Latency - Simplified t2 - t1 SR Request Latency DomU writes Req & notifies Dom0 Dom0 reads Req 3 t4 - t3 4 SR Response Latency Dom0 writes Resp & notifies DomU 61 DomU reads Resp

62 Analysis Algorithms Disk I/O - Queue Latency SR Request Latency t1 t0 Dom 0 Dom 1 Disk Request Disk Response BDD FDD t3 t0 Xen Hardware HDD t3 t4 SR Response Latency 62

63 1 Analysis Algorithms Queue Latency - Results SR Response Latency >> SR Request Latency approx. 2 order of magnitudes greater for buffered i/o 63

64 Analysis Algorithms Queue Latency - Results QUEUE TIMES ================================================================== Queue BLOCKED: Unable to add new requests to queue or queue empty. Queue UNBLOCKED: Can enqueue new incoming requests. 1 Front Request Queue Unblocked : (ms) ; Blocked : (ms) Back Request Queue Unblocked : (ms) ; Blocked : (ms) Front Shared Ring Resp Queue Unblocked : (ms) ; Blocked : (ms) QUEUE WAIT TIMES ================================================================== Back Request Queue Wait Time : Back Response Queue Wait Time : (ms) (ms) 64

65 Introduction Xen Overview Data Collection Analysis Framework Reader Analyses Analysis Algorithms Composibility Agenda 65

66 Composibility Problem Analysis : Average queue blocked times on domain 0 66

67 Composibility Problem Analysis : Average queue blocked times on domain 0 Disk I/O analysis Averaging function Logic from CPU utilization 67

68 Composibility Problem Analysis : Average queue blocked times on domain 0 Disk I/O analysis Averaging function New Tool Logic from CPU utilization Lots of duplication of effort 68

69 Composibility Overview Framework, so far, gives us stand alone tools for focussed analysis. Composibility gives agility to this framework. 69

70 Composibility Overview Ability to reuse analysis algorithms logic for different event types. Easy to combine analysis tool outputs without having to rewrite large part of logic. Compose new analysis using reusable parts. 70

71 Composibility Overview Ability to reuse analysis algorithms logic for different event types. Stages Easy to combine analysis tool outputs without having to rewrite large parts of logic. Operators 71

72 Composibility Pipeline Analysis : Average queue blocked times on domain 0 event dom 0? Disk I/O tool Average output 72

73 Composibility Pipeline Analysis : Average queue blocked times on domain 0 event dom 0? Disk I/O tool Average output Stages Pipeline Operators 73

74 Composibility Pipeline - Operators Pipe ( ) : Connects a single stage to another. Split ( + ) : Connects a single stage to multiple stages. Executes all stages. Or ( or ) : Connects a single stage to multiple stages. Executes stages until valid return. Join : Connects multiple stages to a single stage. Either wait for a single connected stage to pass a valid event (JOIN_OR) or wait for all the connected stages to pass a successful event (JOIN_SPLIT). Syntax ideas inspired from A Universal Calculus for Streaming Processing Languages [3] 74

75 Composibility Pipeline - Operators Simplified split AND join_split or OR join_or 75

76 Composibility Pipeline - Stages Reusable and independent analysis components. Input: One or more events. Output : Same event, new event with results from execution or invalid event. If invalid event returned, break from Pipeline. 76

77 Composibility Pipeline Analysis : Average queue blocked times on domain 0 event dom 0? Disk I/O tool Average output 77

78 Composibility Pipeline Analysis : Average queue blocked times on domain 0 event vm(dom0) Disk I/O tool Average output 78

79 Composibility Pipeline Analysis : Average queue blocked times on domain 0 event vm(dom0) wait_time (Q_B, Q_U) Average output 79

80 Composibility Pipeline Analysis : Average queue blocked times on domain 0 event_id(q_b) event vm(dom0) + join_split wait_time() Average output + join_split event_id(q_u) 80

81 Composibility Pipeline Analysis : Average queue blocked times on domain 0 event_id(q_b) event vm(dom0) + join_split wait_time() average() output + join_split event_id(q_u) 81

82 Composibility Pipeline - Syntax event_id(q_b) event vm(dom0) + join_split wait_time() average() output + join_split event_id(q_u) vm(dom0) event_id(q_b) + event_id(q_u) wait_time() average() 82

83 Composibility Pipeline - Runtime vm(dom0) event_id(q_b) + event_id(q_u) wait_time() average() Parser [4] s1 = create_stage(vm, dom0); s2 = create_stage(event_id, Q_b); s3 = create_stage(event_id, Q_u); s4 = create_stage(wait_time, NULL); s5 = create_stage(average, NULL); split(s1, s2); split(s1, s3); join(s2, s4, JOIN_SPLIT); join(s3, s4, JOIN_SPLIT); pipe(s4, s5); while(!feof(fp)) { parse_next_event(&ev); execute_pipe(s1, ev); }

84 Composibility Summary Reusable and independent analysis components - Stages Connect stages using Operators. Compose Pipeline using Stages and Operators

85 Demo If time permits?? 85

86 Conclusion Goals met. Easier to build tools for fine grained performance analysis of Xen - Reader & Analyses Build complex analysis tools in a short time - Composibility 86

87 Thank You Q & A 87

88 References [1] USE method ( [2] Xen reference guide [3] R. Soule, M. Hirzel, R. Grimm. A Universal Calculus of Streaming Languages. ESOP 10. [4] Vembyr. ( 88

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office: