IBM Software Group
Troubleshooting of SIB Messaging Engine Failover Problems in a Clustered Environment
Jhansi Kolla (jkolla@us.ibm.com)
Ty Shrake (tyshrake@us.ibm.com)
8th April 2015
WebSphere Support Technical Exchange
Agenda
What are the different possible SIB configurations in a clustered environment?
What are the pros and cons of each configuration?
The importance of core group policies in SIB ME clusters.
Important steps in SIB configuration when using clusters.
Reasons for ME failovers. Why and how does an ME fail over to a different server?
Troubleshooting ME failover issues.
Summary
Different possible SIB configurations in a clustered environment
You can have more than one ME, depending on the policy you choose.
- High Availability (HA): one ME in the entire cluster, with failover (failback is optional).
- Scalability / workload management: multiple messaging engines, one ME dedicated to each cluster member, with no failover.
- High Availability + Scalability: multiple messaging engines, one ME dedicated to each cluster member, with failover (failback is optional).
- Custom: use this policy when High Availability, Scalability, or Scalability with high availability do not provide the messaging engine behavior you require, and you are familiar with creating messaging engines and configuring messaging engine policy settings.
High Availability configuration pros and cons
If you configure your environment with a High Availability policy, only one ME is created in that cluster. This single ME can start on any JVM in the cluster, depending on how you configure the core group settings, such as preferred servers. By default the ME starts on the first server in the cluster to start. The ME can fail over to any JVM in that cluster. You can also configure failback so the ME returns once the preferred server is back up and running.
Pros:
- The messaging engine is highly available. If the messaging engine or the server on which it runs fails, the messaging engine starts on another available server in the cluster.
- You can set preferred servers to give some servers preference for running the ME.
- You can set failback criteria so the ME returns when a preferred server is back.
- All messaging functionality is in one place, which suits low-traffic scenarios.
- The application can connect to any SIB bootstrap server to send or consume messages.
Cons:
- If you set the Preferred servers only option in the core group policy and no preferred server is available, the ME will not start. In particular, if you set only one preferred server together with Preferred servers only and a catastrophic error occurs on that server, the ME cannot fail over to another server in the cluster, because no other preferred servers exist.
- No WLM option. Only one ME handles all messaging functions in the cluster, so the workload cannot be shared.
High Availability configuration
High Availability configuration
When a server that is hosting a messaging engine fails, the messaging engine is activated and runs on another server. The messaging engine is configured to fail over to any of the application servers in the cluster. All the application servers in the cluster are added to the preferred servers list, and this list determines the order in which the servers are used for failover. The earlier a server appears in the preferred servers list, the stronger the preference for that server to run the ME.
Scalability configuration pros and cons
One messaging engine is created for each application server in the cluster. The messaging engines cannot fail over.
Pros:
- One messaging engine is dedicated to each server in the cluster.
- Destinations are partitioned across the messaging engines. The messaging engines share the traffic passing through the destination, reducing the impact of one messaging engine failing.
- Greater message throughput for the application, by spreading the messaging load across multiple servers and, optionally, across multiple hosts.
- Performance improvement.
- Better accessibility for the application (the ME is local to the application).
Cons:
- Message order is no longer guaranteed, due to the workload balancing of messages across queue points.
- The default scalability option does not provide failover of the messaging engine.
Scalability configuration
Scalability configuration
The scalability policy ensures that there is a messaging engine for each server in the cluster. Each messaging engine runs on its dedicated server. There is no failover: if a server is down, its messaging engine is down too. MEs run only on their preferred servers.
High Availability + Scalability configuration pros and cons
This policy provides one ME per JVM, with a failover option.
Pros:
- Improved performance.
- Failover option.
- Better access for the application, because the ME is local to the application.
- Able to configure failback to the preferred servers.
- Workload management.
Cons:
- Message order is no longer guaranteed, due to the workload balancing of messages across queue partitions.
- A little more complex than the other options.
High Availability + Scalability configuration
High Availability + Scalability configuration
In this example there are three servers, each on a separate node, and three messaging engines. Each messaging engine has a preferred server and one other server it can use for failover. Each server is the preferred host for one messaging engine, and the failover host for one other messaging engine.
High Availability + Scalability messaging view
Supported core group policies in ME clustering
Core group policies we support:
a) Static
b) One of N, with no preferred servers (this is the default SIBus policy)
c) One of N, with preferred servers
d) One of N, with preferred servers and the Fail back setting
e) One of N, with preferred servers and the Preferred servers only setting
f) No operation
You cannot use the "All active" or "M of N" policy types, because they are not supported for the service integration bus component (any ME can run on only one JVM at any specific time).
Policy setting for ME clusters
How to find out where the ME is running in the cluster:
Servers > Core groups > Core group settings > Name of the core group > Runtime tab > Show groups > High Availability group name
ME lost its data store connection and is unable to get a new lock on the data store after failover
The ME is running on one of the cluster members and suddenly loses its database connection, due to network issues, the database being hung or shut down, or some other reason. Typical messages:
J2CA0056I: The Connection Manager received a fatal connection error from the Resource Adapter for resource jdbc/<data source name>.
CWSIS1594I: The messaging engine, ME_UUID=<ME_UUID>, INC_UUID=<INC_UUID>, has lost the lock on the data store.
CWSIS1519E: Messaging engine <ME_NAME> cannot obtain the lock on its data store, which ensures it has exclusive access to the data.
CWSID0046E: Messaging engine <ME_NAME>_bus detected an error and cannot continue to run in this server.
HMGR0130I: The local member of group IBM_hc=<cluster_name>, WSAF_SIB_BUS=<bus name>, WSAF_SIB_MESSAGING_ENGINE=<ME_NAME>,type=WSAF_SIB has indicated that it is not alive. The JVM will be terminated. (only in WAS 7.0 and 8.0)
SystemOut O Panic:component requested panic from isAlive (only in WAS 7.0 and 8.0)
SystemOut O java.lang.RuntimeException: emergencyShutdown called: (only in WAS 7.0 and 8.0)
(JVM termination is the default behavior in WAS 7.0 / 8.0. In WAS 8.5 and above the JVM is not terminated; the ME is disabled in that JVM and other applications continue to run. Also note that the new Restrict long running locks feature in WAS 8.5 gets around this locking problem.)
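The message IDs above form the signature of this failure in SystemOut.log. As an illustration only (not an IBM tool), a small script along these lines can flag them while you triage a log:

```python
# Message IDs from the slide that indicate a lost data store
# connection/lock and an impending ME failover.
FAILOVER_IDS = ["J2CA0056I", "CWSIS1594I", "CWSIS1519E",
                "CWSID0046E", "HMGR0130I"]

def scan_systemout(lines):
    """Return (line_number, message_id, text) for each line that
    contains one of the failover-signature message IDs."""
    hits = []
    for n, line in enumerate(lines, start=1):
        for msg_id in FAILOVER_IDS:
            if msg_id in line:
                hits.append((n, msg_id, line.strip()))
    return hits
```

You would feed it the log file, e.g. `scan_systemout(open("SystemOut.log"))`, and inspect the matches in order.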
ME lost the data store connection and is unable to get the lock after failover: sib.msgstore.jdbcfailoverondbconnectionloss
To maintain data integrity, in WAS 7.0 and above we introduced the custom property sib.msgstore.jdbcfailoverondbconnectionloss, with a default value of TRUE. By setting this custom property on a messaging engine, you control the behavior of the messaging engine and its hosting server in the event that the connection to the data store is lost.
If this property is set to TRUE (the default), the HA manager stops the messaging engine and its hosting application server when the next core group service Is Alive check takes place (the default interval is 120 seconds). You will also see a Panic message in SystemOut.log, and the hosting JVM is terminated (which also stops the ME).
If this property is set to FALSE, the messaging engine continues to run and accept work, and periodically attempts to regain the connection to the data store. If work continues to be submitted to the messaging engine while the data store is unavailable, the results can be unpredictable, the messaging engine can be in an inconsistent state when the data store connection is restored, and JMS messages could be lost.
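The property is normally set in the admin console (messaging engine > Custom properties), but it can also be scripted with wsadmin. The following Jython sketch is an assumption-laden illustration, not an IBM-documented script: the choice of the first listed ME and the 'properties' attribute name should be verified against your own configuration (e.g. with AdminConfig.attributes('SIBMessagingEngine')) before use.

```python
# wsadmin -lang jython sketch; runs inside wsadmin, not standalone Python.
# Assumption: ME custom properties attach under the 'properties' attribute.
me = AdminConfig.list('SIBMessagingEngine').splitlines()[0]  # first ME in the cell
AdminConfig.create('Property', me,
                   [['name', 'sib.msgstore.jdbcfailoverondbconnectionloss'],
                    ['value', 'false']],
                   'properties')
AdminConfig.save()
```

Restart the messaging engine afterwards so the new value takes effect.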
ME lost the data store connection and is unable to get the lock after failover
At this point the HA manager will fail the ME over to another cluster member, if one is available, based on the core group policy configuration. From the ME side the lock on the data store is lost, but on the data store (database) side the lock is still alive.
ME lost the data store connection and is unable to get the lock after failover
The HA manager starts the ME on another available server, based on the core group policy configuration. This new instance of the ME tries to get a new lock on the data store. If the attempt fails, it keeps trying for 15 minutes. If the messaging engine is unable to acquire the lock after 15 minutes, it becomes disabled. (In WAS 8.5 the ME will be automatically re-enabled.)
In the case of an abnormal shutdown, the lock on the data store and the OS socket are not removed on the DB server. Keep in mind that the database, not WAS, owns the database lock.
CWSIS1538I: The messaging engine, ME_UUID=<ME_UUID>, INC_UUID=<INC_UUID>, is attempting to obtain an exclusive lock on the data store.
CWSIS1593I: The messaging engine, ME_UUID=<ME_UUID>, INC_UUID=<INC_UUID>, has failed to gain an initial lock on the data store.
This pattern continues for 15 minutes if the ME is unable to get the lock on the data store. After 15 minutes the HA manager disables the ME on this server and moves it to the next available server in the cluster.
The root cause of this locking issue is an orphaned lock that still exists on the SIBOWNER table while the OS socket is still open.
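The 15-minute retry window described above behaves like a simple deadline loop. This is an illustrative sketch only, not the actual message store implementation; the try_lock callback, timeout, and interval are hypothetical stand-ins:

```python
import time

def acquire_datastore_lock(try_lock, timeout=15 * 60, interval=2.0):
    """Keep attempting try_lock() until it succeeds or the deadline
    passes, mirroring the ME's CWSIS1538I/CWSIS1593I retry loop.
    Returns True if the lock was obtained within the window."""
    deadline = time.monotonic() + timeout
    while True:
        if try_lock():
            return True   # lock obtained: ME can finish startup
        if time.monotonic() >= deadline:
            return False  # window elapsed: ME is disabled on this server
        time.sleep(interval)
```

In the real product the False branch corresponds to the HA manager disabling the ME here and moving it to the next available cluster member.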
ME lost the data store connection and is unable to get the lock after failover
The reason for the failure is that when the disconnection originally occurred the messaging engine became aware of it, but the database did not. So while the messaging engine is attempting to reacquire its data store lock on the SIBOWNER table, the database is still holding both the lock and the socket from the connection that existed before the unexpected disconnect. As long as that socket and data store lock remain in place, the new instance of the messaging engine cannot acquire a new lock and resume messaging functions. The old (orphaned) data store lock will not be released by the database until the socket connection to the database is cleaned up. Once the socket is cleaned up, the lock is released by the database and the messaging engine can acquire a fresh lock, start, and resume its messaging functions.
There are 3 solutions for this problem:
1) The DB admin can manually remove the lock from the SIBOWNER table.
2) Restart the DB server. (Most of the time this is not a good option, because many applications use the same DB server, so this is not feasible for most people.)
3) Configure the database server to clean up stale/orphaned sockets (no manual intervention).
The solution for this problem is TCP KeepAlive
If you are running WAS 7 or 8, TCP KeepAlive is a good solution to this problem. KeepAlive is a feature of TCP with 2 main benefits: it can detect that a connection to a peer is no longer valid, and it can prevent a network connection from being terminated due to inactivity. KeepAlive periodically checks for broken connections (stale/orphaned sockets).
The default OS-level KeepAlive setting is 2 hours, so KeepAlive will check the socket connection after 2 hours of inactivity. If the connection is stale, KeepAlive will clean up the socket. To clean up this stale socket connection in a timely manner (2 hours is usually much too long), set this value to approximately 5 minutes. (When you set KeepAlive, consider other timeout settings, such as the transaction timeout, which defaults to 3 minutes.) If you set it to approximately 5 minutes, then after 5 minutes of inactivity the socket is tested to see whether the connection is valid. If it fails the test (because the connection is stale/broken), it is cleaned up. The old lock on the SIBOWNER table is then released by the database, and the messaging engine can acquire a fresh lock and resume messaging.
The KeepAlive setting is done differently on each OS. Check with your OS admin to set it.
NOTE! In WAS version 8.5 a new locking mechanism is implemented that solves this problem without the need for setting the KeepAlive timeout. Just enable the Restrict long running locks feature in the messaging engine's Message Store configuration and restart the messaging engine.
KeepAlive settings for each OS
AIX: TCP_KEEPIDLE. The default value is 14,400 half-seconds (2 hours). Change it with:
no -o tcp_keepidle=600
The example above sets the value to 600 half-seconds (5 minutes).
Linux: tcp_keepalive_time. The default value is 7200 seconds (2 hours). Change it via the procfs interface:
# echo 300 > /proc/sys/net/ipv4/tcp_keepalive_time
The example above sets the value to 300 seconds (5 minutes).
Solaris: TCP_KEEPALIVE_INTERVAL. The default value is 7200000 milliseconds (2 hours). Change it with:
ndd -set /dev/tcp tcp_keepalive_interval 300000
The example above sets the value to 300,000 milliseconds (5 minutes).
Windows: KeepAliveTime. The default value is 7200000 milliseconds (2 hours). You can change this value in the Windows Registry here:
\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
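The settings above change the system-wide defaults, which is what the WAS JDBC sockets inherit. Purely to illustrate what those knobs control, here is the same tuning applied to a single socket in Python (the TCP_KEEPIDLE/KEEPINTVL/KEEPCNT names are Linux-specific, hence the hasattr guards; this is not something WAS itself does):

```python
import socket

def make_keepalive_socket(idle=300, interval=30, probes=5):
    """Create a TCP socket with keepalive enabled and, where the
    platform exposes the options (Linux), a 5-minute idle time before
    the first probe -- the value recommended in these slides."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):   # seconds of idle before probing
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):  # seconds between probes
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):    # failed probes before reset
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return s
```

Once the idle time elapses without traffic, the OS probes the peer and tears down the socket if the peer is gone, which is exactly the cleanup that releases the orphaned SIBOWNER lock.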
JVM hang / termination
When the JVM the ME is running on hangs or is abnormally terminated, the ME is terminated along with the JVM. In this case the HA manager will fail the ME over to another available cluster member (depending on your core group policies). Remember that the lock on the data store side still exists if you are running WAS 7 or 8. On the failover server the ME tries to get a new lock on the data store; if the previous orphaned lock has not been released, the new ME instance cannot get the lock. As previously mentioned, lowering KeepAlive to around 5 minutes removes the OS-level socket and the SIBOWNER table lock, allowing the new ME instance to get the lock, complete startup, and resume messaging functions.
Locking issues with the NFSv3 file system
If you are using a file store instead of a data store, be aware of the following regarding failovers: customers who use an NFSv3 file system to hold the file store files may see locking issues after a failover, and the new instance of the ME may be unable to get the lock on the file store. You may see this exception during ME startup:
CWSIS1579E: The file store's log file is locked by another process.
NFSv3 has some locking issues, and we suggest that our customers use NFSv4.
IBM File System Locking Protocol Test for WebSphere Application Server: http://www.ibm.com/support/docview.wss?uid=swg24010222
Network File System (NFS) Recommendations for WebSphere Application Server: http://www.ibm.com/support/docview.wss?uid=swg21645066
HA Manager is disabled
HMGR0010I: The High Availability Manager has been disabled. Services that rely on the High Availability Manager might not work as expected.
Do not disable the High Availability Manager! It is required for failovers to work correctly. By default every cluster member has an equal opportunity to host and run the ME, but if you disable the HA Manager on some cluster members, the ME cannot fail over to those servers.
Cluster members are in different core groups
In very rare scenarios we have seen ME cluster members in different core groups with no core group bridge defined between them. In this case the HA manager is unable to fail the ME over to a server in the other core group, because there is no communication bridge between the 2 core groups. It is always better to keep all ME cluster members in the same core group: it ensures a smooth ME failover, it is better system management practice, and it also improves failover performance.
ME runs on preferred servers only and no preferred servers are running
If you configure the Preferred servers only core group policy option, make sure that at least one preferred server is available as a standby at all times. When the ME cannot continue on the server where it was running and needs to fail over, and no preferred server is available, the failover will not happen.
Issues with the core group policies
In most cases the default, predefined messaging engine policies support frequently-used cluster configurations, and High Availability is enabled by default. There can be scenarios, such as modifications you make to comply with business requirements, where you create a core group policy that prevents ME failovers:
Static: the messaging engine can run only on the server to which it is restricted, and cannot fail over to any other server in the cluster.
One of N, with preferred servers and the Preferred servers only setting: if no preferred servers are available, the ME cannot fail over to any other server in the cluster.
No operation: the messaging engine is managed by an external high availability framework, so it is outside of WebSphere control.
DCSV* / HMGR* messages
DCS views are groups of servers that 'see' each other and are communicating with each other. Sometimes servers get unexpectedly removed from a view, and this can cause problems: it can result in differences between the views held by the server(s) and by the JVM where the HA coordinator is running. Due to this view instability the HA manager may be unable to find a suitable JVM for an ME failover. Things to check:
- Any HMGR0130I / HMGR0131I / HMGR0132I messages.
- Frequent changes in DCS views: look for DCSV messages such as DCSV8050I / DCSV8051I / DCSV8052I. Common causes are firewall issues, networking issues, and JVM hangs.
- CPU starvation where the HA coordinator is running: look for HMGR0152W messages.
- Hung thread messages in the failover server (WSVR0605W).
DCSV* / HMGR* messages
When there are problems with DCS views, the HA Manager may not be certain where the ME is running. As a result it may try to start a second copy of the same ME. This second copy will fail to start, because it attempts to get a lock on SIB tables that are already locked by the copy that was already running. If you see a second instance of a messaging engine trying (and failing) to start, check for DCSV warning messages from before the failover started, such as DCSV8104W, indicating that one or more servers were removed from the view. If one of the messaging servers was removed from the view before the ME startup attempt, that is the cause of the failover. This kind of situation is usually driven by networking problems, so check for socket exceptions and other TCP-related errors in the logs.
Mixed WebSphere versions (from the info center)
Messaging engine failover is not supported for mixed-version clusters, for example a cluster with both WAS 8.0 and WAS 7.0 servers as ME cluster members (nodes at different versions). A messaging engine that is hosted on WebSphere Application Server Version 8.0 cannot fail over to a server at a different WebSphere Application Server version. If you have a cluster bus member that consists of a mixture of Version 8.0 and other-version servers, you must ensure that the high availability policy is configured to prevent this type of failover. (The same applies with 7.0 and 8.5.)
To prevent failover of a Version 8.0 messaging engine to a server at a different version, configure the high availability policy for the messaging engine so that the cluster is divided into one set of servers at Version 8.0 and another set at other versions, and the Version 8.0 messaging engine is restricted to the Version 8.0 servers only. (The same applies with 7.0 and 8.5.)
Summary
WAS supports a variety of HA and failover options. We usually recommend an HA configuration with multiple servers in your messaging cluster. Core group policies allow you to control where messaging engines run and which servers they fail over to. There are many possible reasons for a failover, but checking the WAS SystemOut.log files will usually provide a clear indication of why the failover occurred. Failovers are usually caused by disconnections between a messaging engine and its message store, or by DCS view problems.
Connect with us!
1. Get notified on upcoming webcasts: send an e-mail to wsehelp@us.ibm.com with subject line "wste subscribe" to get a list of mailing lists and to subscribe.
2. Tell us what you want to learn: send suggestions for future topics, or improvements about our webcasts, to wsehelp@us.ibm.com.
Open Lines for Questions
Additional WebSphere Product Resources
Learn about upcoming WebSphere Support Technical Exchange webcasts, and access previously recorded presentations, at: http://www.ibm.com/software/websphere/support/supp_tech.html
Discover the latest trends in WebSphere technology and implementation, and participate in technically-focused briefings, webcasts and podcasts, at: http://www.ibm.com/developerworks/websphere/community/
IBM Software Events: http://www.ibm.com/software/websphere/events_1.html
Join the Global WebSphere Community: http://www.websphereusergroup.org
Access key product show-me demos and tutorials by visiting IBM Education Assistant: http://www.ibm.com/software/info/education/assistant
View a webcast replay with step-by-step instructions for using the Service Request (SR) tool for submitting problems electronically: http://www.ibm.com/software/websphere/support/d2w.html
Sign up to receive weekly technical My Notifications emails: http://www.ibm.com/software/support/einfo.html