Exploring History with Hawk: An Introduction to Cluster Forensics
Kristoffer Grönlund, High Availability Software Developer
kgronlund@suse.com
This tutorial
- High Availability in 5 minutes
- Introduction to HAWK
- What's new in HAWK 2
- History Explorer
- Cluster Forensics
- Example Usage
- Summary
About me: Kristoffer Grönlund
- Developer: crmsh, hawk, resource-agents
- Maintainer: fence-agents, haproxy
High Availability
What is a cluster?
- Cluster: 1-32* nodes
- Node: a single machine in the cluster; hardware or virtualized; remote nodes
- Site: a physical location; local, metro, or geographical
* Scale beyond 32 nodes with remote nodes
Resources
Agent classes:
- Open Cluster Framework (OCF) agents (resource-agents)
- systemd services
- Fencing agents
- Init scripts
Examples: web servers, file servers, databases, filesystems, IP addresses, VMs, resources in VMs...
Constraints
- Order: start resource A before resource B
- Location: resource A prefers a node
- Colocation: resource A with resource B
- Score: mandatory vs. preference; numeric value or +/- infinity; resource stickiness
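As a sketch, these constraint types map onto crmsh configuration fragments like the following. The resource names (rsc_A, rsc_B), node name (node1), and constraint IDs are hypothetical; they are not from the slides.

```shell
# Location: rsc_A prefers node1 with score 200 (a preference, not mandatory)
crm configure location loc-prefer-node1 rsc_A 200: node1
# Order: start rsc_A before rsc_B; inf: makes the ordering mandatory
crm configure order ord-A-before-B inf: rsc_A rsc_B
# Colocation: keep rsc_B on the same node as rsc_A
crm configure colocation col-B-with-A inf: rsc_B rsc_A
```

These are configuration fragments that assume a running Pacemaker cluster; run `crm configure verify` afterwards to check the result.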
Overview
[Architecture diagram: resources and resource agents on top; Pacemaker below, with a Local Resource Manager and Cluster Resource Manager per node, the Policy Engine and Designated Coordinator (DC) on one node, and the Cluster Information Base (CIB) replicated to the others; Corosync provides messaging/infrastructure between nodes. Resource allocation flows from the DC down through Corosync to each node.]
Fencing: dealing with Schrödinger's cat
- Goal: preventing corruption
- Storage based: SBD; recommended if possible; no special hardware required
- Hardware based: IPMI, iLO, ...; many supported devices
Tools
- crmsh: command line interface
- HAWK: web interface
Learn more: www.suse.com/documentation/sle-ha-12/
Two node cluster in two commands:
node1 # ha-cluster-init
node2 # ha-cluster-join -c node1
Introducing HAWK
HAWK - Overview
High Availability Web Konsole
- Monitoring
- Configuration / Administration
- Dashboard
HAWK - Technical details
- Installed by ha-cluster-bootstrap
- Runs on the cluster nodes
- Ruby on Rails
- https://<node>:7630/
HAWK - Security
- Default user is hacluster; remember to change the password
- HTTPS for secure access
- Replace the SSL certificate with your own: /etc/hawk/hawk.key, /etc/hawk/hawk.pem
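For instance, a replacement self-signed certificate can be generated with openssl. The /etc/hawk paths are from the slide; the hostname and the temporary output directory below are placeholders.

```shell
# Generate a self-signed key/certificate pair as a replacement for the default.
# This writes to a temp dir; in production you would copy the results to
# /etc/hawk/hawk.key and /etc/hawk/hawk.pem and restart the hawk service.
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
    -subj "/CN=node1.example.com" \
    -keyout "$tmp/hawk.key" -out "$tmp/hawk.pem" 2>/dev/null
ls "$tmp"
```

Don't forget `passwd hacluster` on each node to change the default password as well.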
HAWK 0.7
Status
Dashboard
HAWK 2
A New Look
- Complete visual overhaul; more intuitive; similar to other SUSE tools
- Improved features: History Explorer, more powerful wizards, integrated help
- Supports new cluster features
Upgrading to HAWK 2
zypper install hawk2
Login
Status
Dashboard
Graph
Simulator
Simulator, node event
Simulator, results
Creating resources
Command log
Wizards
Wizards
- Apply a complete cluster configuration
- Help configure constraints and groups
- Install and configure required software
Wizards
Wizard, configuration
Wizard, verify changes
Wizard, advanced options
Wizard, optional steps
Wizard, verify changes (1)
Wizard, verify changes (2)
Command line wizards
crm script list
crm script show virtual-ip
crm script verify virtual-ip id=admin-ip ip=10.13.37.42
crm script run virtual-ip id=...
History Explorer
Cluster Forensics
Something went wrong. How can we figure it out?
- Get a cluster report
- Use the history explorer
- Understanding the cluster logs
- Pitfalls
Root Cause Analysis
- Start at the evidence
- Trace backwards
- Know the application
- Assume you know nothing
Jumping To Conclusions
- Always stay on the evidence
- When the evidence runs out, we are guessing
- Guessing is OK! But know when you are guessing
The Evidence
- Failed cluster action: software bugs, crashes, configuration error
- Failed node: hardware failure, communication error
Collecting data
crm report -f '2015-10-10 12:00' -t '2015-10-10 14:00' strange_event
Understanding the logs
2015-10-11T19:40:11.717167+02:00 sle12sp1a crmd[1590]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=i_te_success cause=c_fsa_internal origin=notify_crmd ]
2015-10-11T19:40:19.777412+02:00 sle12sp1a apache(srv2)[20777]: INFO: Successfully retrieved http header at http://localhost:8000
2015-10-11T19:40:24.524292+02:00 sle12sp1a crmd[1590]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=i_pe_calc cause=c_fsa_internal origin=abort_transition_graph ]
2015-10-11T19:40:24.528651+02:00 sle12sp1a pengine[1589]: notice: Restart admin_addr#011(started sle12sp1b)
2015-10-11T19:40:24.528851+02:00 sle12sp1a pengine[1589]: notice: Calculated Transition 156: /var/lib/pacemaker/pengine/pe-input-55.bz2
2015-10-11T19:40:24.530055+02:00 sle12sp1a crmd[1590]: notice: Processing graph 156 (ref=pe_calc-dc-1444585224-290) derived from /var/lib/pacemaker/pengine/pe-input-55.bz2
2015-10-11T19:40:24.530701+02:00 sle12sp1a crmd[1590]: notice: Initiating action 16: stop admin_addr_stop_0 on sle12sp1b
2015-10-11T19:40:24.740118+02:00 sle12sp1a crmd[1590]: notice: Initiating action 6: start admin_addr_start_0 on sle12sp1b
2015-10-11T19:40:24.801183+02:00 sle12sp1a crmd[1590]: notice: Initiating action 1: monitor admin_addr_monitor_10000 on sle12sp1b
2015-10-11T19:40:24.836022+02:00 sle12sp1a crmd[1590]: notice: Transition 156 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-55.bz2): Complete
Internal components
- Cluster Information Base (CIB)
- Cluster Resource Management daemon (crmd)
- Local Resource Management daemon (lrmd)
- Policy Engine (pengine)
- Fencing daemon (stonithd)
Policy Engine
- Runs on the Designated Controller (DC), which is elected automatically
- Calculates the ideal cluster state
- Decides on actions to achieve that state
Transition
- A sequence of actions to reach a new state
- Records the state before and after the transition
- Saved to /var/lib/pacemaker/pengine/
- Numbered with a sequence number; the sequence may reset to 0 if the DC is re-elected
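When hunting for one transition in the logs, grepping by its number is often enough. A minimal sketch on a fabricated excerpt (modeled on the log slide above; on a real cluster you would point the grep at e.g. /var/log/messages on the DC node):

```shell
# sample.log is a tiny fabricated excerpt, not real cluster output.
cat > sample.log <<'EOF'
pengine[1589]: notice: Calculated Transition 156: /var/lib/pacemaker/pengine/pe-input-55.bz2
crmd[1590]: notice: Processing graph 156 (ref=pe_calc-dc-1444585224-290)
crmd[1590]: notice: Initiating action 16: stop admin_addr_stop_0 on sle12sp1b
crmd[1590]: notice: Transition 156 (Complete=4, Pending=0): Complete
EOF
# Pull out every line belonging to transition 156
grep -E 'Transition 156|graph 156' sample.log
```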
Cluster Actions
Naming: <resource>_<action>_<nn>
Actions: start, stop, promote, demote, monitor, migrate_to, migrate_from
Cluster Action Error Codes
- 0: Success
- 1: Generic error
- 2: Argument error
- 3: Unimplemented action
- 4: Insufficient permissions
- 5: Required component is missing
- 6: Configuration error
- 7: Resource was not running
- 8: Running as primary
- 9: Failed as primary
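These numbers correspond to the symbolic OCF exit codes used by resource agents. A small lookup helper makes log reading easier; the OCF_* names below are the standard OCF names of this era (newer Pacemaker releases rename the master/primary codes), not something from the slides.

```shell
# Map a numeric OCF exit code to its symbolic name.
ocf_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo unknown ;;
    esac
}
ocf_name 7   # the code behind "resource was not running"
```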
Cluster Action Failure
- An unexpected result when performing an action
- Triggers a transition
- May also trigger fencing (stop failure)
Node Failure
- Quorum = majority vote
- Improves availability; avoids fence loops
- Smaller partitions are fenced
- Downside: need more nodes
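The majority rule can be stated in one line: a partition keeps quorum only if it holds strictly more than half of the configured nodes. A toy illustration of that rule (not a real votequorum implementation):

```shell
# quorate ALIVE TOTAL -> "yes" if ALIVE nodes form a strict majority of TOTAL.
quorate() { [ $(( $1 * 2 )) -gt "$2" ] && echo yes || echo no; }
quorate 2 3   # two of three nodes: quorate
quorate 1 2   # a two-node split: neither half has a majority
```

This is also why an even split of a two-node cluster needs extra help (e.g. SBD or two_node mode) to avoid both halves losing quorum.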
Node Failure
- Crash / reboot; network issues
- Leads to chaos without fencing
- Uncommunicative nodes are fenced: the cluster no longer knows if the node is running resources, so fencing enforces a known state
History Explorer
Command line: crm history
- Collect logs from cluster nodes
- Analyse transitions
- Present a summary of events
- View configuration, transition graph, transition diff
- Extract logs during a particular transition
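A typical command-line session might look like the sketch below. The subcommand names come from crmsh's history sublevel; the timeframe values are placeholders, and all of these need a running cluster (or a crm report archive) to do anything useful.

```shell
# Narrow the scope to the window around the incident (placeholder times)
crm history timeframe "2015-10-11 19:30" "2015-10-11 20:00"
# Summary of cluster events in that window
crm history events
# Inspect the most recent transition in detail
crm history transition
# Combined log from all nodes
crm history log
```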
History Explorer
Example configuration
[Configuration diagram: group g-proxy containing proxy and proxy-vip; a ping resource; web servers srv1 and srv2 on demo-node1 and demo-node2; location constraint scores 200 and 50.]
Example Description
- Two web servers on port 8000
- HAProxy on port 80: load balancer (round robin)
- Failed action: kill -9 proxy, detected by the monitor operation
Failed Action
History Explorer
Pitfalls
Too many logs
- The history explorer can get slow
- Find the relevant transitions; narrow the scope
- Command line: timeframe <from> <to>
- Run HAWK in offline mode to avoid burdening the cluster
End of the tracks
- Analysing action failure; example: monitor fails for unknown reasons
- At this point, the cluster can't help you
- Probes: before starting a resource, Pacemaker checks if it is already running; for a probe, "success" means the resource was found running, which is itself a failure
- Know your application: start at the action failure and read the application logs backwards
General Confusion
- Which node wrote this log? Was it even running the resource in question?
- Get back to the evidence; if in doubt, start over
- Cancelled transitions: sometimes the history explorer gets confused; fencing can cancel a transition
- By default, Pacemaker fences offline nodes at startup
Possible Problems
- Network: latency; does your network fulfill the requirements?
- Disk is full
- Misconfiguration: use csync2 or a configuration management tool
- Fencing device failure: is fencing enabled? Does the fencing device work? Use SBD
Resource tracing
crm resource trace <resource>
- Complete trace of every action, written to /var/lib/heartbeat/trace_ra/<agent>/
- Note: the trace is written on the node where the resource runs
- Can be a lot of data: remember to untrace!
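In practice the trace/untrace cycle looks roughly like this. The resource name proxy is from the example scenario, and the agent directory name assumes its agent is haproxy; since these commands need a running cluster, this is a sketch rather than a runnable snippet.

```shell
# Enable tracing for every action of the 'proxy' resource
crm resource trace proxy
# ... reproduce the failure ...
# Inspect the trace files, on the node where 'proxy' runs
# (assuming the agent is named haproxy)
ls /var/lib/heartbeat/trace_ra/haproxy/
# Disable tracing again -- trace files grow quickly
crm resource untrace proxy
```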
Summary
- Try the new HAWK
- Use the History Explorer
- Follow the evidence
- Action failure leads to actions
- Node failure leads to fencing
- Without fencing, anything can happen
Open Source
https://github.com/clusterlabs/hawk
https://github.com/clusterlabs/crmsh
Questions?
www.suse.com
Thank you.
Unpublished Work of SUSE LLC. All Rights Reserved.
This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.

General Disclaimer
This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.