TROUBLESHOOTING PROCEDURE FOR SYSTEM STALL ISSUES
SA, MAG, IC, VA PLATFORMS
Published Date: July 2015

SUMMARY:

This document describes the procedure for monitoring a stalled system and obtaining the logs required to troubleshoot it. The procedure requires setting up a serial console monitor during the first occurrence of the issue and, preferably, leaving the monitor connected for as long as necessary to obtain the critical information needed to root-cause the issue. This may be only until the next freezing incident, provided a good kernel trace is captured.

The stall described here is an SA, VA, MAG, or IC device becoming inaccessible via web and/or serial console, with total loss of access to the device by admin and end users. The device may still respond to pings.

PRE-REQUISITES:

Troubleshooting a system stall requires the following:

- A serial console monitor (common applications are HyperTerminal and PuTTY) connected to the stalled SA/IC, either directly with a null modem cable or through a console server using an IP/port combination. Set the kernel logging level, then leave the monitor connected for continuous capture of serial console output (if extended monitoring is allowed).
- A serial console snapshot, if the console is still responsive. If a local SCP server is available, upload the snapshot to it so that it can be obtained before rebooting. Otherwise, the snapshot can be downloaded later via the Web Admin UI after the system reboots.
- Additional information such as the time of the freeze, whether the box interfaces are pingable, whether the serial console is responsive, and, very importantly, the state of the front panel LED indicators such as power, disk activity, hard disk, FIPS (if a FIPS unit), and network interfaces.

REQUIRED LOGS:

The following logs must be submitted to Pulse Secure when opening a case for a system stall or similar issue:

- Serial console output of everything displayed, saved as a text file
- Serial or admin system snapshot taken from the serial console or the Web Admin UI, respectively
- Other process snapshots seen in the Web Admin UI snapshot page (if any). These are usually prefixed with the process name, e.g. Pulse Secure-state-watchdog-x-xxxxxxxx-xxxxxx, dscsd_xxxxxxxx_xxxxxx
- SA/IC logs such as the Events, Admin, and User access logs, taken as one zipped file, e.g. Pulse Securelogs[1].tar.gz
- Cockpit graph screenshots, including all nodes in the cluster

Note: If the device is in a cluster, always include SA/IC logs and a system snapshot from the other working node(s), as these are often helpful in correlating events between the nodes of the cluster as part of the analysis.
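
The additional information listed in the pre-requisites (time of freeze, ping status, console responsiveness, LED states) is easy to lose during an incident. The following is a minimal sketch of one way to record it in a structured form. The class name, field names, and example values are hypothetical and are not part of any Pulse Secure tooling.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime


@dataclass
class StallObservation:
    """Hypothetical record of what was observed while the device was stalled."""

    time_of_freeze: datetime
    interfaces_pingable: bool
    serial_console_responsive: bool
    # LED name -> observed state, as noted from the front panel.
    led_states: dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the observation so it can be attached to a support case."""
        record = asdict(self)
        record["time_of_freeze"] = self.time_of_freeze.isoformat()
        return json.dumps(record, indent=2)


# Illustrative values only; replace with what was actually observed.
obs = StallObservation(
    time_of_freeze=datetime(2015, 7, 1, 14, 30),
    interfaces_pingable=True,
    serial_console_responsive=False,
    led_states={"power": "on", "hard disk": "no activity"},
)
print(obs.to_json())
```

Saving the resulting text alongside the console log keeps the observations together with the other items to be submitted.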

LOG COLLECTION PROCEDURE:

A. OBTAINING NECESSARY SERIAL CONSOLE LOGS:

IMPORTANT: It is preferred to use a serial console application that can add timestamps to both the screen output and the saved log. One terminal application with this capability is (enhanced) PuTTY, available from: http://www.extraputty.com/download.php

Procedure if using PuTTY:

- In the PuTTY main screen, go to Session, then Logging, and select the file to save output to, along with the other logging options needed to ensure the file is recorded and preserved.
- Select the timestamp options "On terminal" and "All session output" so that all output includes timestamps both in the terminal view and in the saved log. This allows Pulse Secure engineers to correlate the output with the device logs. Make sure the time on the IVE device (SA/MAG/VA) is in sync with the computer used to monitor the serial console.
- Select the location and log file to be used for logging console output.
- Go back to Session and enter the appropriate serial settings as displayed.
- Go to Session, select Serial, and then click Open to connect.
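
If the terminal application in use cannot add timestamps, the same effect can be approximated by a small capture script. The sketch below is an illustration only: the function name and the timestamp format are assumptions, and an in-memory stream stands in for the serial port (in practice the lines would come from whatever serial library or console server connection is in use).

```python
import io
from datetime import datetime
from typing import Callable, Iterable, TextIO


def capture_with_timestamps(
    lines: Iterable[str],
    log_file: TextIO,
    now: Callable[[], datetime] = datetime.now,
) -> int:
    """Write each console line to log_file with a timestamp prefix.

    Returns the number of lines written. The "[YYYY-MM-DD HH:MM:SS]" prefix
    is an assumed format, chosen only for this illustration.
    """
    count = 0
    for line in lines:
        stamp = now().strftime("%Y-%m-%d %H:%M:%S")
        log_file.write(f"[{stamp}] {line.rstrip()}\n")
        # Flush after every line so the saved log stays current even if the
        # monitoring computer is interrupted.
        log_file.flush()
        count += 1
    return count


# Demonstration: an in-memory stream stands in for the serial console, and a
# fixed clock makes the output repeatable.
console = io.StringIO("Loglevel set to 9\nTaking system snapshot\n")
log = io.StringIO()
fixed_clock = lambda: datetime(2015, 7, 1, 14, 30, 0)
written = capture_with_timestamps(console, log, now=fixed_clock)
print(log.getvalue())
```

The timestamps come from the monitoring computer's clock, which is why its time needs to be in sync with the device.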

1. Preparing kernel logging:

To obtain a good kernel trace, kernel logging level 9 must be enabled before the issue occurs. The first occurrence of a stall, however, will not have this logging level set. In that first stall condition, enable this log level, wait 10 to 15 minutes while watching for any errors or messages that appear, and then obtain the kernel stack trace. It is also recommended to keep this kernel logging enabled and monitored after the first reboot, as this gives better information when the issue next occurs.

Setting Kernel Logging to Level 9

The CTL-Break>9 sequence sets the proper kernel logging level for troubleshooting halt or stall issues. Hold the CTL key, hit the Break key momentarily, release the CTL key, and hit the 9 key. The console should respond with "Loglevel set to 9". (If there is no response, try again a few times. If the console is still unresponsive, proceed to step 2 below, Getting the kernel trace during lockup.)

Leave the serial console in this state while capturing to a local text file. The kernel may occasionally print messages; these are the important messages needed to root-cause the issue.

2. Getting the kernel trace during lockup:

The CTL-Break>T sequence outputs a kernel trace/dump to the console. While the console is in any state AND the web UI is not accessible, hold the CTL key, hit the Break key momentarily, release the CTL key, and then hit the T key (lower or upper case). Do not execute this command on a fully operational system, as this could disconnect user sessions.

The console should output a long list of entries similar to the following screenshot:

[Screenshot of kernel trace output not reproduced here]
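
When the console is driven by a script rather than a keyboard, the CTL-Break key sequences have to be sent programmatically. The sketch below assumes that the terminal's Break key corresponds to a serial break condition followed by the key character; this is an assumption and should be confirmed against the terminal application in use. The port interface mirrors the `send_break` and `write` methods of the third-party pyserial `Serial` class, but a recording stand-in is used here so the example runs without hardware. All names are hypothetical.

```python
import time
from typing import Protocol


class SerialPortLike(Protocol):
    """Minimal interface assumed for the port (modelled on pyserial's Serial)."""

    def send_break(self, duration: float = 0.25) -> None: ...

    def write(self, data: bytes) -> object: ...


def send_break_sequence(port: SerialPortLike, key: str, pause: float = 0.2) -> None:
    """Send a serial break followed by a single key, e.g. "9" or "T"."""
    if len(key) != 1:
        raise ValueError("key must be a single character")
    port.send_break()
    time.sleep(pause)  # short gap between the break and the key
    port.write(key.encode("ascii"))


class RecordingPort:
    """Stand-in port that records calls instead of touching real hardware."""

    def __init__(self) -> None:
        self.events: list[tuple[str, object]] = []

    def send_break(self, duration: float = 0.25) -> None:
        self.events.append(("break", duration))

    def write(self, data: bytes) -> int:
        self.events.append(("write", data))
        return len(data)


port = RecordingPort()
send_break_sequence(port, "9", pause=0.0)  # request kernel logging level 9
print(port.events)
```

The same caution applies to a scripted sequence as to the manual one: the T sequence should only be sent to a device that is actually stalled.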

B. OBTAINING NECESSARY SYSTEM SNAPSHOT WHEN CONSOLE MENU IS STILL UP:

1. Getting a system snapshot from the serial console if the console menu is accessible:

Take a system snapshot from the serial console when the menu is still accessible while the Web UI is down.

- Select console menu option 7, System Maintenance, then menu option 1, System snapshot.
- The console should respond with "Taking system snapshot".
- You can either transfer the snapshot via SCP or leave it on the system for downloading from the Admin UI later, by answering y or n at the prompt.

C. OBTAINING LOGS AFTER THE REBOOT:

1. After the reboot, log back in to the Admin UI and obtain the SA/IC logs and all the snapshots:

- Download the SA/IC logs. Go to Log/Monitoring > Events > Log and click Save All Logs. This saves the events, user, and admin logs as a single zipped file.
- Download all snapshots. Go to Maintenance > Troubleshooting > System Snapshot, then download the snapshot that was taken from the serial console, along with any other snapshot (process or watchdog) automatically generated by the system (if any).

D. COMPLETE LOG SUBMISSION:

1. Gather the following from the above steps:

- Serial console output log in text format
- Serial snapshot, system snapshot, and any other process snapshots, including any watchdog snapshots
- SA/IC logs
- Cockpit dashboard graph screenshots
- Date and time of occurrence and status of pings
- Front panel LED indicators for all hardware, most importantly the hard drive status (activity and error status; please also note the drive activity indicator next to the power indicator)
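
Before opening a case, it can help to confirm that every item in the list above has actually been collected. The following sketch checks a local case folder against the list. The file-name patterns are hypothetical conventions chosen for this illustration; the actual names of snapshots and log archives will differ and the patterns would need adjusting.

```python
import tempfile
from pathlib import Path

# Submission item -> glob pattern. The patterns are assumptions made for this
# sketch, not names produced by the device.
REQUIRED_ITEMS = {
    "serial console output": "console*.txt",
    "system snapshot": "*snapshot*",
    "SA/IC logs": "*logs*.tar.gz",
    "cockpit graph screenshots": "cockpit*.png",
    "incident notes (time, pings, LEDs)": "notes*.txt",
}


def missing_items(case_dir: Path) -> list[str]:
    """Return the submission items with no matching file in case_dir."""
    return [
        name
        for name, pattern in REQUIRED_ITEMS.items()
        if not any(case_dir.glob(pattern))
    ]


# Demonstration with a temporary folder holding only two of the five items.
with tempfile.TemporaryDirectory() as tmp:
    case_dir = Path(tmp)
    (case_dir / "console-2015-07-01.txt").write_text("Loglevel set to 9\n")
    (case_dir / "notes.txt").write_text("Observations recorded here\n")
    missing = missing_items(case_dir)

print(missing)
```

An empty result means every listed item has at least one matching file; it does not verify the content of those files.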

SERIAL CONSOLE COMMANDS FOR TROUBLESHOOTING STALL ISSUES:

- CTL-Break>9 sets the console kernel logging to level 9 detail (should output "Loglevel set to 9"). During a perceived hung state of the device, useful kernel and system driver events may still be collected via the serial console at this logging level. Enable this log level and wait 10 to 15 minutes to observe the console output, then proceed with taking the kernel trace using the CTL-Break>T command. This waiting period is very important.
- To monitor and obtain logs for the next failure event, enable level 9 logging with console timestamps enabled and output recorded to a local file. On the next event, proceed with collecting the kernel trace via the CTL-Break>T command.
- Kernel logging level 9 has a very small effect on the performance of the device, so it can be left enabled over a period of monitoring. At this logging level, and during a halt condition, the various system drivers can output important and useful information. If a large volume of output appears every minute while the device is still working, it may cause some performance impact, so please observe and advise Pulse Secure support.

IMPORTANT NOTES:

- In some kernel conditions, CTL-Break>9 may not work and the serial console may appear unresponsive. Try several times, then do CTL-Break>T.
- CTL-Break>0 or rebooting the unit resets the kernel logging to its default setting.
- CTL-Break>T outputs the kernel stack and memory dump to the serial console for kernel-level analysis by Pulse Secure. ONLY execute this command in a halted state, as it will affect users.
- "d" (quotes excluded) outputs the recent kernel messages and some information on the state of the file system, which can be used to track issues related to a file system mounted read-only or running out of space. It may only work in certain conditions, i.e. when prompted with "Do you want to reboot? (y/n)".
- Menu option 7 then 1, System Snapshot (if the serial console is responsive and the menu is available), takes a snapshot of the system in its present condition. The snapshot can be transferred to a local SCP server or left on the SA/IC for downloading after reboot.

NOTE: For any questions about this document, please contact Pulse Secure support.
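
Because the 10 to 15 minute wait between enabling level 9 logging and taking the kernel trace matters, it may be useful to check the captured console log before sending CTL-Break>T. The sketch below looks for the "Loglevel set to 9" confirmation in a timestamped log and reports whether at least 10 minutes have passed. The timestamp format is an assumption about how the terminal application prefixes lines, and the sample lines are placeholders rather than real device output.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed "[YYYY-MM-DD HH:MM:SS] text" prefix
LOGLEVEL_CONFIRMATION = "Loglevel set to 9"  # console response named in this document


def loglevel_set_time(lines: Iterable[str]) -> Optional[datetime]:
    """Return the timestamp of the most recent confirmation line, if any."""
    found = None
    for line in lines:
        if LOGLEVEL_CONFIRMATION not in line or not line.startswith("["):
            continue
        stamp, closing, _ = line[1:].partition("]")
        if not closing:
            continue  # no closing bracket, so no usable timestamp
        try:
            found = datetime.strptime(stamp, STAMP_FORMAT)
        except ValueError:
            continue  # prefix is not in the assumed format
    return found


def waited_long_enough(
    set_time: datetime,
    now: datetime,
    minimum: timedelta = timedelta(minutes=10),
) -> bool:
    """True once the minimum observation period has elapsed."""
    return now - set_time >= minimum


# Placeholder log lines for illustration.
sample = [
    "[2015-07-01 14:30:00] Loglevel set to 9",
    "[2015-07-01 14:33:10] (placeholder kernel message)",
]
set_at = loglevel_set_time(sample)
too_early = waited_long_enough(set_at, datetime(2015, 7, 1, 14, 38, 0))
ready = waited_long_enough(set_at, datetime(2015, 7, 1, 14, 42, 0))
print(set_at, too_early, ready)
```

If no confirmation line is found, the function returns None, which matches the case where CTL-Break>9 received no response and the note above about retrying applies.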