Error recovery through programming

by ALAN N. HIGGINS
International Business Machines Corporation
Kingston, New York

INTRODUCTION

The requirement for error recovery procedures has existed as long as computers themselves. Since the earliest computers, one of the goals of design has been to increase the reliability and availability of the computer to the user. While great strides have been made in this direction, the need for error recovery is as present today as ever; indeed, it is now more pressing than ever before. With the many advanced techniques in programming, such as multiprogramming and multiprocessing, the cost of an error has increased dramatically, so that the consequences of an error are no longer limited "merely" to the loss of a job and the need for a subsequent rerun. An error today can:

Cause the termination of concurrently executing tasks.
Cause an environmental control system to go down.
Cause the loss of teleprocessing messages.
Cause the generation of a report to be delayed.

No longer can rerunning the job be accepted as a prime means of "error recovery." The situation existing when running under an Operating System, and executing a number of jobs in the computer at the same time, makes improved error recovery procedures mandatory.

It is recognized that the Engineering Community is diligently striving to improve the hardware itself. For a complete solution, however, it is necessary to look at the other half of the question of error recovery: what can be done to improve reliability, to improve availability, to improve error recovery through programming? In order to do this, we first have to consider in a general way error recovery procedures, or Recovery Management Support. The next step is to look specifically at some of the work which has been done in Operating System/360 with the Recovery Management Support for the Model 65.
System incidents

An examination of system incidents reveals that such incidents are due to a number of sources. Among these are Hard Core errors (including errors in the CPU, memory and channels), errors from Input/Output devices and control units, and procedural and operational errors. Each of these is made up of a number of different errors, but from a gross point of view it seems reasonable to state that there are three general types of system interruptions:

Hardware malfunctions.
Design errors (both hardware and software).
Operator or user injected errors.

Systems planning must therefore be influenced by the facts that machines will malfunction, that neither hardware nor software is perfect, and that operators are still likely to make as many mistakes as they have in the past.

Recovery management

The primary objective in any error recovery procedure or Recovery Management Support should be to alleviate the burden of system interruptions to the user. In order to accomplish this we must:

1. Reduce the number of interruptions to which the user is exposed, and
2. Minimize the impact of these interruptions when they do occur.

Recovery Management therefore should provide the user with a higher degree of system availability (more time for more jobs) by minimizing the impact of system malfunctions upon his operations. With this objective as the target, error recovery takes on a broader meaning and scope than has been applied to the concept in the past. In an environment of multiprogramming, the system becomes all important, and it is most necessary that, no matter what happens, the system continue to function.

Fall Joint Computer Conference, 1968

It often becomes a situation of sacrificing a part so that the "whole" may survive. In order to accomplish this, Recovery Management facilities may follow a pattern in which the support first attempts to reduce the number of system interruptions, either by retrying the operation which was interrupted by the malfunction or by terminating the task affected and continuing system operation. If this is not possible, then the second step toward accomplishing the primary objective of error recovery becomes of paramount importance: to minimize the impact of the interruption. This is done by preparing the system for a simple restart or by indicating that repair by maintenance personnel is required.

Instruction retry

This pattern, which has just been outlined, suggests a number of functions which can be performed to achieve the objectives of Recovery Management. The first of these functions is instruction retry. The concept of instruction retry is not really new. It is something which IBM has been doing for years, particularly in the I/O area. Instruction retry has been standard procedure whenever an error was encountered in reading or writing a tape. But it is possible to extend this retry capability and to employ it when a CPU or memory malfunction occurs. A relatively large number of malfunctions are intermittent in nature rather than solid failures, and therefore there is a high probability of successful execution and recovery if an instruction retry can be attempted.

The first thing which must be determined, then, is whether instruction retry is feasible and, if it is feasible, to execute the retry. The determination of instruction retry feasibility is usually quite dependent upon the characteristics of the particular machine. Ordinarily, for feasibility to exist, the "environment" of the computer must be valid, or free from error.
Depending upon the specific machine, this environment may include the data contained in general purpose registers, floating point registers, machine log-out areas, permanent storage areas, etc. The criterion of validity can, somewhat arbitrarily, be keyed on parity. If the parity of the data is good, the environment is assumed valid and therefore retry is feasible. If parity is bad, then no further retry action can be taken.

Having ascertained that instruction retry is feasible, it is necessary to continue the analysis and determine whether a specific instruction is retryable. To do this, it is first necessary to locate the failing instruction. The procedure involved here is again dependent upon the particular machine, upon what type of fetch or pre-fetch logic is employed, and upon whether or not the instruction counter is accurate. In one case, a comparison of the internal registers in the machine log-out can provide the clue as to whether the instruction counter is accurate; in another, it may be a function of when the machine check occurred and what updating cycles the instruction counter was executing at the time.

It is obvious, therefore, that it is not always easy or possible to locate the failing instruction. But if the instruction counter is accurate and it is possible to locate the failing instruction, an analysis can be performed to ascertain whether the retry threshold of the interrupted instruction has been exceeded. (The retry threshold is that point in the instruction cycle after which retry cannot be attempted; it is usually indicated by a bit set by the hardware.) The retry threshold has been exceeded when, during the normal instruction cycle, one or more of the original operands has been changed. If the threshold has not been exceeded, it is possible to cause another attempt at executing the failing instruction.
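The sequence of checks just described (a valid environment, an accurate instruction counter, a retry threshold not yet exceeded) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names LogOut, RetryDecision, parity_is_good and assess_retry are hypothetical, the log-out is reduced to a list of (value, parity bit) pairs, and odd parity is assumed. It is not the logic of any particular machine.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class RetryDecision(Enum):
    RETRY = "retry the interrupted instruction"
    INVALID_ENVIRONMENT = "bad parity in the saved environment"
    COUNTER_INDETERMINATE = "failing instruction cannot be located"
    THRESHOLD_EXCEEDED = "an original operand has already been changed"


def parity_is_good(value: int, parity_bit: int) -> bool:
    # Assumption: odd parity, so data bits plus parity bit hold an odd number of ones.
    return (bin(value).count("1") + parity_bit) % 2 == 1


@dataclass
class LogOut:
    # Hypothetical, simplified machine log-out.
    registers: List[Tuple[int, int]]  # (value, parity bit) for each saved register
    counter_accurate: bool            # can the failing instruction be located?
    threshold_exceeded: bool          # hardware bit: operand already changed


def assess_retry(log: LogOut) -> RetryDecision:
    # 1. The environment must be valid; validity is keyed on parity.
    if not all(parity_is_good(value, bit) for value, bit in log.registers):
        return RetryDecision.INVALID_ENVIRONMENT
    # 2. The failing instruction must be locatable.
    if not log.counter_accurate:
        return RetryDecision.COUNTER_INDETERMINATE
    # 3. The retry threshold must not have been exceeded.
    if log.threshold_exceeded:
        return RetryDecision.THRESHOLD_EXCEEDED
    return RetryDecision.RETRY
```

The checks are ordered as in the text: a parity failure ends the analysis before any attempt is made to locate the instruction, and the threshold is examined only once the instruction has been found.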
When the threshold has been exceeded, it may still be possible to extend it by examining the instruction type to determine whether a copy of the original operand might still be intact in some internal register and, if it is, by restoring it. This is accomplished by rebuilding (in a special execution area) the instruction from the contents of the log-out, the internal registers or main storage. Therefore, from an analysis, it is possible to determine that an instruction is either:

1. Retryable; that is, the retry threshold has not been exceeded or, if it has been exceeded, the damaged operand can be restored, and therefore instruction retry can be attempted, or
2. Non-retryable; that is, instruction retry is not possible because the threshold has been exceeded and the damaged operand cannot be restored, because an invalid environment exists owing to incorrect parity, or because the value of the instruction counter is indeterminate.

If the second condition is the case, then it is necessary to look for another way to handle the error recovery.

Refresh main storage

The occurrence of a parity error in main storage precludes instruction retry; therefore, one function which could be of value would be the ability to "Refresh" main storage. By this is meant repairing the damage which either caused or was caused by a malfunction, by loading a new copy of the affected module into main storage. (A module is a program unit that is discrete and identifiable with respect to compiling, combining with other units and loading.) The use of refreshable code requires a good deal of foresight in coding since, in order to be refreshable, a module must not modify itself or be modified by another module; for example, it must not set switches, contain dynamic storage areas, or store registers or address pointers within the body of its code. The foresight is well rewarded, however, when it is possible to load this refreshable code and then continue execution without changing either the sequence or the results of the processing.

The attribute "refreshable" is similar to "reentrant". Most reentrant modules meet the requirements specified above and, in addition, a reentrant module is one that may be utilized by more than one task at a time (some modules classified as reentrant deviate from these requirements by operating in a pseudo-disabled manner, thus actually allowing modifications during a short period of time). The difference between the two is that "reentrant" is based on the operational characteristics of the module within the system, while "refreshable" is based only on the fact that the code is not modified in any manner.

Selective termination

The functions of instruction retry and refreshable code are most desirable since they render the error recovery procedure transparent to the user and require no intervention on his part. Unfortunately, it is not always possible to attain this level of recovery. When this is the case, it is necessary to accept some degradation in order to keep the system operational. One way to accomplish this is to implement a function of Selective Termination. Such a function would enable the system to examine the failing environment, determine what problem program was executing, and then proceed to terminate this program while continuing all other jobs which were executing at the time of the malfunction. This is really a type of job-abort which frees the resources of the system allocated to the job and makes them available for future use. If a problem program was utilizing system code when the malfunction occurred, selective termination could be effective if the system code was transient rather than resident in nature.
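As a rough illustration of selective termination, the following Python sketch removes only the failing job, returns its resources to the system, and leaves the other jobs untouched. The System and Job classes, the method names and the in_resident_system_code flag are all hypothetical simplifications introduced for this example; a real supervisor would track far more state.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Job:
    name: str
    resources: Set[str] = field(default_factory=set)


class System:
    """Hypothetical, highly simplified model of a multiprogramming system."""

    def __init__(self) -> None:
        self.jobs: Dict[str, Job] = {}
        self.free_resources: Set[str] = set()

    def start(self, name: str, resources: Set[str]) -> None:
        self.jobs[name] = Job(name, set(resources))

    def selectively_terminate(self, failing: str,
                              in_resident_system_code: bool = False) -> bool:
        """Abort only the failing job; return True if the system can continue."""
        # Assumption drawn from the text: if the malfunction struck resident
        # system code, terminating the job alone is not treated as sufficient.
        if in_resident_system_code:
            return False
        job = self.jobs.pop(failing)
        # Free the job's resources and make them available for future use.
        self.free_resources |= job.resources
        return True
```

For example, with a "payroll" job holding a tape drive and an "inventory" job holding a disk, terminating "payroll" leaves "inventory" running and places the tape drive back in the free pool.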
Selective termination results in the loss of a specific job, but it does enable the system to continue without interruption.

Another function which would aid in the error recovery process when a memory malfunction occurs is the ability to logically carve out, or remove, that portion of the memory in which the malfunction occurred. Since this type of error recovery would result in job termination and might not return resources (storage, I/O devices, etc.) to the system, such a procedure would obviously introduce undesirable side effects, such as loss of availability of I/O devices, loss of part of core and loss of the terminated job, but it would preserve the system, and operation would continue until an orderly correction could be made.

I/O Recovery

The functions which have been discussed so far have been directed mainly to errors which occur in the CPU or memory. From an examination of system incidents, it is evident that a significant portion of errors occur in the I/O area. Is there anything which can be done to improve error recovery procedures for I/O?

In the first place, there is I/O retry, which is available through the ERPs (Error Recovery Procedures) for the different I/O devices. As indicated earlier, it has been standard procedure to retry I/O instructions when errors occur. A number of errors (unit check, unit exception, wrong-length indication, protection check and some chaining checks) can be corrected by this means. An I/O Supervisor performs an analysis and selects, according to device, the proper ERP to attempt recovery. After retry is attempted, the ERP regains control to determine whether or not the retry has been successful. If it was successful, the I/O retry is transparent to the user.

There is another group of I/O errors, the channel checks (channel control check, channel data check and interface control check), which need not be disastrous and from which, after analysis of the conditions causing the error, it may be possible to recover.
Such an analysis would determine the type of operation that failed, the type of device affected, the sequences which occurred across the I/O interface following the error, and whether a retry can be attempted.

The I/O device or medium can malfunction, and if a retry is not successful there may be other ways to continue the execution of the job. One such way would be to have the ability to switch data sets (devices), that is, to change a tape or disk pack from one drive to another and then to retry the operation with the new drive. Another possibility (if the malfunction was really related to the Channel or Control Unit) would be to try another route to the same device, that is, to address it through a different channel or control unit.

Other system incidents

Another group of system incidents is due to procedural and operator errors. Several things can be done to reduce their number, and the area certainly deserves concentrated attention. The first is, of course, better trained personnel, but from a programming point of view several possibilities exist. It is most desirable to require a minimum of user intervention and interaction in order to accomplish execution. Control information should be minimal. When interaction is required, messages should be clear and concise, to the point of outlining possible choices. A conversation mode could be optional which would permit correction or confirmation of operator action. All these points are generally grouped under a concept of Operator Awareness and have a very definite place in the planning of any error recovery support.

All of these functions are aimed at continuing the operation of the system, but unfortunately this is not always possible to accomplish. Therefore, the next best thing is to minimize the effect of the malfunction. This can be done by attempting to preserve information concerning the malfunction and to make it available to assist knowledgeable personnel in determining what caused the error and what can be done to correct it. This will have the most desirable effect of shortening the Duration of the Unexpected Interrupt and getting the system back in operation as quickly as possible.

RMS/65

The Recovery Management Support for the System/360 Model 65 (RMS/65) has provided a number of these functions in the operating system. These functions are contained in the two programs which make up RMS/65: the Machine Check Handler (MCH), which is directed at CPU and memory malfunctions, and the Channel Check Handler (CCH), which is oriented to I/O problems. RMS/65 has provided a hierarchy of recovery which involves four levels:

I. Functional Recovery
II. System Recovery
III. System-Supported Restart
IV. System Repair

Functional Recovery is the successful retry of an interrupted instruction. MCH handles the operation for the CPU and main storage through its Machine Analysis and Instruction Retry (MAIR) facilities. The MAIR facilities perform an analysis of the machine environment at the time of the machine check interruption to determine the feasibility of retrying the interrupted instruction. MAIR then retries the interrupted instruction when retry is feasible. The CCH performs the analysis function for the channel checks discussed earlier.
This is accomplished by intercepting I/O interruptions before the I/O Supervisor receives them and performing an analysis of the existing conditions. If feasible, the status bits are manipulated to make the channel check look like a failure for which an ERP exists, and then control is transferred to the appropriate ERP for action. Functional recovery is of course the desired goal because in this case the malfunction is transparent to the user.

System Recovery is the second level of recovery and is required when functional recovery is either not feasible or fails. The objective is to preserve the system and to continue processing all unaffected jobs. This is done by means of a Program Damage Assessment and Repair feature which attempts to analyze the malfunction environment, to isolate and repair the program damage if possible, and to report permanent failures to the program and operator. This feature also incorporates the mechanism to provide the capability of selective termination of a task.

The function of System-Supported Restart is called on when both Functional and System Recovery have failed but a stop for repair is not required. The operator is informed that such a condition exists and that it is necessary to restart the system.

The fourth level of recovery support provided by RMS/65 is System Repair. In a way, this is perhaps one of its most important functions, since the detailed error analysis information which is provided can be of great assistance in determining the cause of failure and in suggesting the proper correction for the problem. Once the repair is completed, initialization is required to restart the system.

Figure 1 shows the relationship of these levels of recovery to one another and to the main objective of Recovery Management Support, which is to keep the system in operation.

FIGURE 1

Each level of recovery performs the important function of recording information concerning what happened, the status of the computer at the time of the incident, what action was taken and the results of such an action. This information, which is recorded on a special data set, SYS1.LOGREC, is then available through execution of the Environment Record Editing and Printing utility (EREP), which runs under the control of Operating System/360. This program edits and prints the records generated by MCH and CCH (as well as by several other recording functions) and provides the information for interpretation by the experienced Customer Engineer (CE). A Standard Operating Procedure in a Computer Center using MCH and/or CCH should be to execute EREP on a regular basis, so that the information is available to the CE as an aid or indicator to anticipate serious trouble. For example, if a particular pattern appears indicating possible degradation, preventative maintenance can be performed before the occurrence of a serious incident.

CONCLUSION

RMS/65 is a step in the direction which error recovery must take if the requirements of computer technology are to be met in this area. More and more, the question of error recovery cannot be relegated to hardware or programming alone; rather, these two must form an effective partnership and attack the problem together in order to provide a satisfactory solution. Every sign indicates that this is being accomplished, and it appears that some meaningful steps such as RMS/65 are being taken toward the goal of reducing the number of interruptions to which a user is exposed and minimizing the impact of these interruptions when they do occur.
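The four-level hierarchy described in the RMS/65 section, together with the practice of recording what happened at each level, can be pictured with a short Python sketch. It is a loose illustration only: the Incident fields, the recover function and the plain list standing in for the SYS1.LOGREC data set are hypothetical, and the real MCH and CCH decisions depend on machine detail that is not modelled here.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Level(IntEnum):
    FUNCTIONAL_RECOVERY = 1       # successful retry, transparent to the user
    SYSTEM_RECOVERY = 2           # damage isolated, unaffected jobs continue
    SYSTEM_SUPPORTED_RESTART = 3  # operator is told to restart the system
    SYSTEM_REPAIR = 4             # stop for repair, then re-initialize


@dataclass
class Incident:
    # Hypothetical summary of what the analysis of a malfunction found.
    retry_succeeds: bool
    damage_repairable: bool
    repair_required: bool


def recover(incident: Incident, log: List[str]) -> Level:
    """Try each level in order, recording every action and its result."""
    attempts = [
        (Level.FUNCTIONAL_RECOVERY, incident.retry_succeeds),
        (Level.SYSTEM_RECOVERY, incident.damage_repairable),
        (Level.SYSTEM_SUPPORTED_RESTART, not incident.repair_required),
    ]
    for level, succeeded in attempts:
        # The list stands in for the error-recording data set.
        log.append(f"{level.name}: {'succeeded' if succeeded else 'failed'}")
        if succeeded:
            return level
    log.append(f"{Level.SYSTEM_REPAIR.name}: required")
    return Level.SYSTEM_REPAIR
```

Each level is reached only when the one before it has failed, and the log keeps a record of every attempt, so that the history of an incident can be examined afterwards.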

System/370 integrated emulation under OS and DOS

System/370 integrated emulation under OS and DOS by GARY R. ALLRED International Business Machines Corporation Kingston, New York INTRODUCTION The purpose of this paper is to discuss the design and development