Exploring History with Hawk

An Introduction to Cluster Forensics
Kristoffer Grönlund
High Availability Software Developer
kgronlund@suse.com

This tutorial
- High Availability in 5 minutes
- Introduction to HAWK
- What's new in HAWK 2
- History Explorer
- Cluster Forensics
- Example Usage
- Summary

About me
Kristoffer Grönlund
- Developer: crmsh, hawk, resource-agents
- Maintainer: fence-agents, haproxy

High Availability


What is a cluster?
- Cluster: 1-32* nodes
- Node: single machine in the cluster
  - Hardware or virtualized
  - Remote nodes
- Site: physical location
  - Local
  - Metro
  - Geographical
* Scale beyond 32 nodes with remote nodes

Resources
Agent classes:
- Open Cluster Framework (OCF) agents (resource-agents)
- systemd services
- Fencing agents
- Init scripts
Examples:
- Web server, file server
- Databases
- Filesystems, IP addresses
- VMs, resources in VMs...

Constraints
- Location: resource A prefers node
- Order: start resource A before resource B
- Colocation: resource A with resource B
- Score
  - Mandatory vs. preference
  - Numeric value or +/- infinity
  - Resource stickiness
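The way finite and infinite scores combine can be sketched in Python. This is a minimal illustration, assuming the convention described in the Pacemaker documentation that INFINITY is 1,000,000 and that -INFINITY wins when both infinities meet. `add_scores` is a hypothetical helper name, not part of any Pacemaker or crmsh API.

```python
INFINITY = 1_000_000  # assumption: scores at or beyond this value count as infinite


def add_scores(a, b):
    """Combine two constraint scores with saturating infinity arithmetic."""
    # -INFINITY dominates everything, including +INFINITY.
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    # Finite scores add up, clamped to the valid range.
    return max(-INFINITY, min(INFINITY, a + b))


print(add_scores(200, 50))              # two preferences add up
print(add_scores(200, INFINITY))        # a mandatory placement stays mandatory
print(add_scores(INFINITY, -INFINITY))  # a ban overrides a mandatory placement
```

A finite score expresses a preference that other scores can outweigh; an infinite score is mandatory in the sense that no finite score can change the outcome.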

Overview
[Architecture diagram, three layers.
RESOURCES: resources and resource agents.
PACEMAKER (resource allocation): Local Resource Manager, Cluster Resource Manager, Policy Engine, Cluster Information Base (CIB) and CIB replica, Designated Coordinator (DC).
COROSYNC (messaging / infrastructure).]

Fencing
Dealing with Schrödinger's cat
- Goal: preventing corruption
- Storage based: SBD
  - Recommended if possible
  - No special hardware required
- Hardware based: IPMI, iLO, ...
  - Many supported devices


Tools
- crmsh: command line interface
- HAWK: web interface

Learn more
www.suse.com/documentation/sle-ha-12/

Two node cluster in two commands:
    node1 # ha-cluster-init
    node2 # ha-cluster-join -c node1

Introducing HAWK

HAWK - Overview
High Availability Web Konsole
- Monitoring
- Configuration / Administration
- Dashboard

HAWK - Technical details
- Installed by ha-cluster-bootstrap
- Runs on the cluster nodes
- Ruby on Rails
- https://<node>:7630/

HAWK - Security
- Default user is hacluster
  - Remember to change the password
- HTTPS for secure access
  - Replace SSL certificate with your own
  - /etc/hawk/hawk.key
  - /etc/hawk/hawk.pem

HAWK 0.7

[Screenshots: Status; Dashboard]

HAWK 2

A New Look
- Complete visual overhaul
  - More intuitive
  - Similar to other SUSE tools
- Improved features
  - History Explorer
  - More powerful wizards
  - Integrated help
- Supports new cluster features

Upgrading to HAWK 2
    zypper install hawk2

[Screenshots: Login; Status; Dashboard; Graph; Simulator; Simulator, node event; Simulator, results; Creating resources; Command log]

Wizards

Wizards
- Apply a complete cluster configuration
- Helps configuring constraints and groups
- Install and configure required software

[Screenshots: Wizards; Wizard, configuration; Wizard, verify changes; Wizard, advanced options; Wizard, optional steps; Wizard, verify changes (1); Wizard, verify changes (2)]

Command line wizards
    crm script
      list
      show virtual-ip
      verify virtual-ip id=admin-ip ip=10.13.37.42
      run virtual-ip id=...

History Explorer

Cluster Forensics
- Something went wrong
- How can we figure it out?
- Pitfalls
- Understanding the cluster logs
- Use the history explorer
- Get a cluster report

Root Cause Analysis
- Start at the evidence
- Trace backwards
- Know the application
- Assume you know nothing

Jumping To Conclusions
- Always stay on the evidence
- When the evidence runs out, we are guessing
- Guessing is OK!
  - But know when you are guessing

The Evidence
- Failed Cluster Action
  - Software bugs, crashes
  - Configuration error
- Failed Node
  - Hardware failure
  - Communication error

Collecting data
    crm report -f '2015-10-10 12:00' -t '2015-10-10 14:00' strange_event
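When reports are collected from a script, the time window can be computed rather than typed by hand. The following Python sketch only builds the argument list for the command shown above and does not run it. `crm_report_command` is a hypothetical helper name, and the timestamp format is copied from the slide.

```python
from datetime import datetime


def crm_report_command(start, end, name):
    """Build the argument list for `crm report -f <from> -t <to> <name>`."""
    if end <= start:
        raise ValueError("end of the window must be after its start")
    fmt = "%Y-%m-%d %H:%M"  # same format as in the example command
    return ["crm", "report", "-f", start.strftime(fmt), "-t", end.strftime(fmt), name]


cmd = crm_report_command(
    datetime(2015, 10, 10, 12, 0),
    datetime(2015, 10, 10, 14, 0),
    "strange_event",
)
print(cmd)
```

The list can be passed to `subprocess.run` on a cluster node; keeping it as a list avoids shell quoting problems with the space inside each timestamp.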

Understanding the logs
2015-10-11T19:40:11.717167+02:00 sle12sp1a crmd[1590]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
2015-10-11T19:40:19.777412+02:00 sle12sp1a apache(srv2)[20777]: INFO: Successfully retrieved http header at http://localhost:8000
2015-10-11T19:40:24.524292+02:00 sle12sp1a crmd[1590]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
2015-10-11T19:40:24.528651+02:00 sle12sp1a pengine[1589]: notice: Restart admin_addr#011(Started sle12sp1b)
2015-10-11T19:40:24.528851+02:00 sle12sp1a pengine[1589]: notice: Calculated Transition 156: /var/lib/pacemaker/pengine/pe-input-55.bz2
2015-10-11T19:40:24.530055+02:00 sle12sp1a crmd[1590]: notice: Processing graph 156 (ref=pe_calc-dc-1444585224-290) derived from /var/lib/pacemaker/pengine/pe-input-55.bz2
2015-10-11T19:40:24.530701+02:00 sle12sp1a crmd[1590]: notice: Initiating action 16: stop admin_addr_stop_0 on sle12sp1b
2015-10-11T19:40:24.740118+02:00 sle12sp1a crmd[1590]: notice: Initiating action 6: start admin_addr_start_0 on sle12sp1b
2015-10-11T19:40:24.801183+02:00 sle12sp1a crmd[1590]: notice: Initiating action 1: monitor admin_addr_monitor_10000 on sle12sp1b
2015-10-11T19:40:24.836022+02:00 sle12sp1a crmd[1590]: notice: Transition 156 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-55.bz2): Complete
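Log lines like these follow a regular shape, so the transition number and its input file can be pulled out mechanically. The Python sketch below assumes each line has the layout "timestamp host daemon[pid]: message" seen in the excerpt. The regular expressions and function names are illustrative, not part of crmsh or Hawk.

```python
import re

# Assumed layout: "<ISO timestamp> <host> <daemon>[<pid>]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<host>\S+) (?P<daemon>[^\[\s]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)
TRANSITION_RE = re.compile(r"Calculated Transition (?P<num>\d+): (?P<path>\S+)")


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None


def find_transitions(lines):
    """Return (host, transition number, input file) for each calculated transition."""
    found = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        m = TRANSITION_RE.search(rec["msg"])
        if m:
            found.append((rec["host"], int(m.group("num")), m.group("path")))
    return found


sample = [
    "2015-10-11T19:40:24.528851+02:00 sle12sp1a pengine[1589]: notice: "
    "Calculated Transition 156: /var/lib/pacemaker/pengine/pe-input-55.bz2",
    "2015-10-11T19:40:24.530701+02:00 sle12sp1a crmd[1590]: notice: "
    "Initiating action 16: stop admin_addr_stop_0 on sle12sp1b",
]
print(find_transitions(sample))
```

The host that logs "Calculated Transition" is the node running the policy engine at that moment, which is a useful starting point when tracing backwards.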

Internal components
- Cluster Information Base (CIB)
- Cluster Resource Management daemon (crmd)
- Local Resource Management daemon (lrmd)
- Policy Engine (pengine)
- Fencing daemon (stonithd)

Policy Engine
- Designated Controller (DC)
  - Elected automatically
- Calculates ideal cluster state
- Decides on actions to achieve state

Transition
- Sequence of actions to reach new state
- Records state before and after transition
- Saved to /var/lib/pacemaker/pengine/
- Numbered with sequence number
  - Number sequence may reset to 0 if DC is re-elected
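Because the saved files carry a sequence number, a plain alphabetical listing sorts them wrongly (100 before 55). A small Python sketch of numeric ordering follows, assuming file names of the form pe-input-<n>.bz2 as in the log excerpt. `sort_pe_inputs` is a hypothetical helper. Since the sequence may reset when the DC is re-elected, numeric order is only a hint and file modification times remain the safer chronological guide.

```python
import re

# Assumed naming scheme, taken from the log excerpt: pe-input-<n>.bz2
PE_INPUT_RE = re.compile(r"^pe-input-(\d+)\.bz2$")


def sort_pe_inputs(filenames):
    """Return the pe-input files ordered by sequence number, ignoring other files."""
    numbered = []
    for name in filenames:
        m = PE_INPUT_RE.match(name)
        if m:
            numbered.append((int(m.group(1)), name))
    return [name for _, name in sorted(numbered)]


print(sort_pe_inputs(["pe-input-100.bz2", "pe-input-55.bz2", "notes.txt", "pe-input-9.bz2"]))
```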

Cluster Actions
<resource>_<action>_<nn>
Actions:
- start
- stop
- promote
- demote
- monitor
- migrate_to
- migrate_from
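Splitting such a name needs a little care, because both resource names (admin_addr) and action names (migrate_to) can contain underscores. The Python sketch below peels off the numeric suffix first and then matches the known actions from the list above. `parse_action_key` is a hypothetical helper, and it returns the `<nn>` field as a plain integer without interpreting it.

```python
# Actions from the slide; the longer migrate_* names are checked first.
ACTIONS = ("migrate_from", "migrate_to", "monitor", "promote", "demote", "start", "stop")


def parse_action_key(key):
    """Split '<resource>_<action>_<nn>' into (resource, action, nn)."""
    head, sep, nn = key.rpartition("_")
    if not sep or not nn.isdigit():
        raise ValueError(f"no numeric suffix in {key!r}")
    for action in ACTIONS:
        suffix = "_" + action
        # The resource name must be non-empty, hence the length check.
        if head.endswith(suffix) and len(head) > len(suffix):
            return head[: -len(suffix)], action, int(nn)
    raise ValueError(f"unknown action in {key!r}")


print(parse_action_key("admin_addr_monitor_10000"))
print(parse_action_key("admin_addr_stop_0"))
```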

Cluster Actions
Error Codes
0: Success
1: Generic Error
2: Argument Error
3: Unimplemented Action
4: Insufficient Permissions
5: Required Component Is Missing
6: Configuration Error
7: Resource Was Not Running
8: Running As Primary
9: Failed As Primary
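When reading logs, it helps to have the code table at hand as a lookup. The Python sketch below uses the descriptions from the list above. The helper names are hypothetical, and the idea that a result counts as a failure only when it differs from what was expected is a simplification for illustration.

```python
# Return codes and descriptions as listed on the slide.
RETURN_CODES = {
    0: "Success",
    1: "Generic Error",
    2: "Argument Error",
    3: "Unimplemented Action",
    4: "Insufficient Permissions",
    5: "Required Component Is Missing",
    6: "Configuration Error",
    7: "Resource Was Not Running",
    8: "Running As Primary",
    9: "Failed As Primary",
}


def describe_rc(rc):
    """Translate a numeric return code into its description."""
    return RETURN_CODES.get(rc, f"Unknown return code {rc}")


def is_unexpected(rc, expected=0):
    """An action result is a failure when it differs from the expected code."""
    return rc != expected


print(describe_rc(7))
# Assumption: a check on a resource that should be stopped expects 7, so 0 is unexpected.
print(is_unexpected(0, expected=7))
```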

Cluster Action Failure
- Unexpected result when performing action
- Triggers transition
- May also trigger fencing (stop failure)

Node Failure
- Quorum = majority vote
  - Improves availability
  - Avoids fence loops
  - Downside: need more nodes
- Smaller partitions are fenced
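The majority rule, and why it needs more nodes, can be shown with a few lines of Python. This is a minimal sketch assuming one vote per node and a strict majority; `has_quorum` is a hypothetical name, and real clusters may be configured with different voting options.

```python
def has_quorum(partition_size, total_nodes):
    """True if a partition holds a strict majority of the votes (one vote per node)."""
    if not 0 <= partition_size <= total_nodes:
        raise ValueError("partition size must be between 0 and the cluster size")
    return 2 * partition_size > total_nodes


# What happens to the surviving partition when a single node is lost?
for total in (2, 3, 5):
    survivors = total - 1
    print(total, "nodes, one lost: quorum =", has_quorum(survivors, total))
```

With two nodes, the survivor holds exactly half of the votes and therefore has no majority, which is the "need more nodes" downside; an even split of a four-node cluster leaves neither half with quorum.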

Node Failure
- Crash / reboot
- Network issues
- Leads to chaos without fencing
- Uncommunicative nodes are fenced
  - Cluster no longer knows if node is running resources
  - Enforces a known state

History Explorer
- Command line: crm history
- Collect logs from cluster nodes
- Analyse transitions
- Present summary of events
- View configuration
- Transition graph
- Transition diff
- Extract logs during a particular transition

[Screenshots: History Explorer]

Example configuration
[Diagram labels: g-proxy, proxy, proxy-vip, ping, srv1, srv2, demo-node1, demo-node2; scores 200, 50, 200]

Example Description
- Two web servers
  - Port 8000
- HAProxy
  - Port 80
  - Load balancer (round robin)
- Failed action: kill -9 proxy
  - Detected by monitor

[Screenshots: Failed Action; History Explorer]

Pitfalls

Too many logs
- History explorer can get slow
  - Run HAWK in offline mode to avoid burdening cluster
- Find the relevant transitions
- Narrow the scope
  - Command line: timeframe <from> <to>
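Narrowing the scope is also easy to do on raw log files before loading them anywhere. The Python sketch below keeps only lines whose leading timestamp falls inside a window. It assumes ISO 8601 timestamps with a UTC offset at the start of each line, as in the log excerpt, and it illustrates the idea only; it is not how crm history implements timeframe.

```python
from datetime import datetime


def within_timeframe(lines, start, end):
    """Keep log lines whose leading ISO 8601 timestamp lies in [start, end]."""
    kept = []
    for line in lines:
        stamp = line.split(" ", 1)[0]
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not start with a parsable timestamp
        if start <= when <= end:
            kept.append(line)
    return kept


log = [
    "2015-10-11T19:40:11.717167+02:00 sle12sp1a crmd[1590]: notice: first",
    "2015-10-11T19:40:19.777412+02:00 sle12sp1a apache(srv2)[20777]: INFO: second",
    "2015-10-11T19:40:24.524292+02:00 sle12sp1a crmd[1590]: notice: third",
    "not a log line",
]
# The window bounds carry the same kind of UTC offset as the log timestamps.
start = datetime.fromisoformat("2015-10-11T19:40:20+02:00")
end = datetime.fromisoformat("2015-10-11T19:40:25+02:00")
print(within_timeframe(log, start, end))
```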

End of the tracks
- Analysing action failure
  - Example: monitor fails for unknown reasons
- Probes
  - Before starting a resource, Pacemaker checks if it is running
  - Success Is Failure
- Know your application
  - Start at action failure, read application logs backwards
  - At this point, the cluster can't help you

General Confusion
- Which node wrote this log?
  - Was it even running the resource in question?
- Get back to the evidence
  - If in doubt, start over
- Cancelled Transitions
  - Sometimes, the history explorer gets confused
  - Fencing can cancel a transition
  - By default, Pacemaker fences offline nodes at startup

Possible Problems
- Network Latency
  - Does your network fulfill the requirements?
- Disk is full
- Misconfiguration
  - Use csync2 or configuration management tool
- Fencing device failure
  - Is fencing enabled?
  - Does the fencing device work?
  - Use SBD

Resource tracing
    crm resource trace <resource>
    /var/lib/heartbeat/trace_ra/<agent>/
- Note: Trace is written on node where resource runs
- Complete trace of every action
  - Can be a lot of data: remember to untrace!

Summary
- Try The New Hawk
- Use The History Explorer
- Follow The Evidence
- Action Failure Leads To Actions
- Node Failure Leads To Fencing
- Without Fencing, Anything Can Happen

Open Source
https://github.com/clusterlabs/hawk
https://github.com/clusterlabs/crmsh

Questions?
www.suse.com
Thank you.

