The XCS network is based on the use of shared resources by the connected processors. If a processor fails, the resources utilized by the crashed processor must be made available again so that the XCS network can continue to operate unhindered.
The failure of a partner is determined by the individual sharers using the two mechanisms "disk monitoring" and "connection monitoring". In addition, the processors connected in the XCS network agree on whether all remaining sharers in the network actually consider the partner to be no longer active. Based on the status of the partner with respect to these three factors, the partner can be classified into one of the error classes described below.
Partner crashed
A processor assumes that a partner has crashed if the following conditions apply simultaneously (i.e. within the interval defined for the monitoring algorithms and derived from FAIL-DETECTION-LIMIT):
The disk monitoring mechanism detects the absence of vital-sign messages from the partner on all shared pubsets.
The connection monitoring mechanism detects the absence of responses from the partner to the processor's monitoring telegrams via the MSCF connection.
The partner is no longer considered active by all other processors connected to the XCS network.
Partner status unknown
The partner is no longer considered active by the processor and all other processors connected to the XCS network. However, due to the time difference between the absence of vital-sign messages and the connection failure, or due to another error in relation to the monitoring path, a crash cannot be assumed.
Loss of connection
The partner is no longer considered active by the processor, but is still considered active by another sharer in the XCS network. No automatic action is initiated in this case; instead, a decision is requested from systems support (see section “Loss of connection in an XCS network”).
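The three error classes above can be read as a small decision procedure. The following is a minimal sketch in Python, not MSCF code: the type and field names are hypothetical and only mirror the conditions described in the text, and the way each flag is obtained from the monitoring mechanisms is assumed to happen elsewhere.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorClass(Enum):
    PARTNER_CRASHED = "partner crashed"
    PARTNER_STATUS_UNKNOWN = "partner status unknown"
    LOSS_OF_CONNECTION = "loss of connection"


@dataclass(frozen=True)
class PartnerView:
    # All field names are hypothetical; they only mirror the prose above.
    disk_signs_missing: bool           # no vital-sign messages on all shared pubsets
    connection_replies_missing: bool   # no replies to monitoring telegrams via MSCF
    detected_simultaneously: bool      # both findings fall within the monitoring interval
    inactive_for_this_processor: bool  # local view: partner no longer active
    inactive_for_all_others: bool      # all other sharers agree: partner no longer active


def classify(view: PartnerView) -> Optional[ErrorClass]:
    """Return the error class for a partner, or None if no error is seen locally."""
    if not view.inactive_for_this_processor:
        return None
    if not view.inactive_for_all_others:
        # At least one other sharer still sees the partner as active.
        return ErrorClass.LOSS_OF_CONNECTION
    if (view.disk_signs_missing
            and view.connection_replies_missing
            and view.detected_simultaneously):
        return ErrorClass.PARTNER_CRASHED
    # Everyone considers the partner inactive, but a crash cannot be assumed.
    return ErrorClass.PARTNER_STATUS_UNKNOWN
```

In this sketch the check on the other sharers comes first, so a partner that is still seen as active somewhere in the network is never classified as crashed.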
The behavior in the event of a "partner crash" can be controlled by the MSCF configuration parameter RECOVERY-START (see "Global control parameters"): either the system handles the situation automatically or systems support is requested to decide on the relevant measures. In the event of "partner status unknown", a decision is always requested from systems support.
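The rule in the preceding paragraph can be sketched as follows. This is an illustration only: the function name and the boolean `automatic_start_permitted` are hypothetical stand-ins for whatever the RECOVERY-START setting allows, and the actual parameter values are not modeled here.

```python
from enum import Enum


class Handling(Enum):
    AUTOMATIC_FAIL_RECONFIGURATION = "automatic fail reconfiguration"
    SYSTEMS_SUPPORT_DECISION = "systems support decision"


def choose_handling(error_class: str, automatic_start_permitted: bool) -> Handling:
    """Pick the handling for an error class.

    automatic_start_permitted is a hypothetical flag that stands for
    "the RECOVERY-START setting permits automatic error handling".
    """
    if error_class == "partner crashed":
        if automatic_start_permitted:
            return Handling.AUTOMATIC_FAIL_RECONFIGURATION
        return Handling.SYSTEMS_SUPPORT_DECISION
    if error_class == "partner status unknown":
        # A decision is always requested, regardless of the setting.
        return Handling.SYSTEMS_SUPPORT_DECISION
    raise ValueError(f"error class not covered by this sketch: {error_class!r}")
```

For example, `choose_handling("partner status unknown", True)` still yields a systems support decision, because the setting only influences the "partner crashed" case.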
Automatic error recovery
In the event of a “partner crash”, the fail reconfiguration needed to release the global resources occupied by the crashed processor is started automatically.
Systems support decision
In the event of “partner status unknown”, or if the RECOVERY-START setting does not permit automatic error handling, message MCS1100, which requires an answer, is issued to the consoles of all partners remaining in the network and a systems support decision is requested. Systems support can do the following:
Start the fail reconfiguration with an appropriate response to the message from any of the participants. The following input choices are available:
MXCM-<order code of the console message>.CRASH
MXCM-<order code of the console message>.MTERM
The processors still connected to the XCS network perform a recovery. The failed processor is removed from the XCS network.
A fail reconfiguration must only be started if the processor has actually crashed or if it can be guaranteed that the processor can no longer access the shared resources (communication, shared pubsets, and shared GS).
Reestablish the failed MSCF connection to the partner by issuing the START-MSCF-CONNECTION command, provided only the connection to the partner has failed and not the partner itself.
If the connection cannot be reestablished, disconnect the remote processor from the XCS network by issuing the command STOP-SUBSYSTEM MSCF,SUBSYSTEM-PARAMETER='FORCE=YES' and then answer the message MCS1101 with MTERM.
The fault in the XCS configuration remains until a fail reconfiguration has been performed and the crashed processor has been removed from the XCS network or, if the processor has not crashed, until the connections to the partner have been reestablished.