To monitor the shared pubset configuration constantly, the disk protocol must be written and read on a continuous basis (see "Shared pubset network"). Input/output errors when accessing the disk therefore restrict the monitoring and reconfiguration capabilities of the shared pubset network. Using several shared pubsets makes disk monitoring more effective in the face of errors on a single disk.
The monitoring mechanism tolerates the interruption of one monitoring path. However, a simultaneous error on the other monitoring path (here, loss of the connection) may cause the partner to erroneously detect a crash of the processor. Emergency measures (either immediate export or termination of the system by MSCF) are unavoidable in this case.
Input/output errors can be classified as temporary or permanent (i.e. irrecoverable). A temporary input/output error can be a delayed input/output operation or an input/output error that the system rectifies automatically. With a temporary error, reading and writing of the watchdog file is not halted but continues periodically. Once the input/output error is rectified, the disruption of the disk path is eliminated as well.
A permanent (irrecoverable) input/output error can only be eliminated by the intervention of systems support, possibly together with hardware repairs. The pubset is switched to the WRTERR or READERR state (the SHOW-SHARED-PUBSET command provides information on the disk states). In the event of a permanent input/output error, the pubset should be exported as soon as possible and should not be imported again until the error has been rectified.
Errors when writing vital-sign messages
A write error which occurs when a sharer accesses the watchdog file of the shared pubset decreases the reconfiguration capability of the network. A master change cannot take place and, with an XCS pubset, leave or fail reconfigurations are blocked. Apart from the reconfiguration problem, a write error has no further consequences per se. However, a simultaneous communication failure with the partner may result in erroneous crash detection and may necessitate the implementation of emergency measures. This depends on the BS2000 version running on the partner processor.
Sharers are subject to partner-related failure monitoring. If an error occurs when writing to the watchdog file of a shared pubset, monitoring is only affected if the error occurs with all pubsets shared with a partner.
In this case, if the MSCF connection has already failed or fails shortly afterwards, the partner is at risk of erroneously detecting a crash, depending on the timing of the two events and the RECOVERY-START settings. To avoid the erroneous crash detection, MSCF may terminate the system; message MCS1300 is then output.
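The condition described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not how MSCF is implemented: the function names, the pubset IDs and the boolean `mscf_connected` flag are hypothetical, and the timing aspect governed by the RECOVERY-START settings is deliberately left out.

```python
def write_monitoring_affected(shared_with_partner, write_error_pubsets):
    """True only if every pubset shared with the partner has a write error."""
    shared = set(shared_with_partner)
    # If nothing is shared with the partner, there is no disk path to lose.
    return bool(shared) and shared.issubset(write_error_pubsets)


def erroneous_crash_detection_risk(shared_with_partner, write_error_pubsets,
                                   mscf_connected):
    """True if both monitoring paths to the partner are disrupted at once."""
    disk_path_lost = write_monitoring_affected(shared_with_partner,
                                               write_error_pubsets)
    # Hypothetical simplification: the connection state is a single boolean.
    return disk_path_lost and not mscf_connected
```

With two shared pubsets, a write error on only one of them leaves the disk path intact, so a connection failure alone does not put the partner at risk of a false crash detection in this model.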
Write errors are indicated by message DMS03B9. Rectification is announced by message DMS03BB.
The partner is notified of the occurrence and rectification of a write error by message MCA0110.
Errors when reading vital-sign messages
If a read error prevents a sharer from reading the vital-sign messages entered in the watchdog file by the partners, the sharer is no longer able to detect the failure of these partners. If configuration management considers a partner to be no longer active because a read error coincides with the loss of the MSCF connection, the partner may have crashed. Systems support is then requested to make a decision (message MCS1100), regardless of the RECOVERY-START setting.
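The rule in the preceding paragraph can be illustrated with a minimal Python sketch. The function name, the returned labels and the parameter names are hypothetical and chosen only for this example; the point is that the RECOVERY-START value plays no part in the decision once a read error and a connection loss coincide.

```python
def partner_failure_handling(read_error, mscf_connected, recovery_start):
    """Return a hypothetical label for how a suspected partner failure is handled."""
    if read_error and not mscf_connected:
        # Neither monitoring path can confirm that the partner is alive.
        # recovery_start is intentionally not evaluated in this branch.
        return "ASK-SYSTEMS-SUPPORT"  # corresponds to message MCS1100
    # At least one monitoring path still works; monitoring continues.
    return "CONTINUE-MONITORING"
```

Calling the function with different placeholder values for `recovery_start` yields the same result whenever `read_error` is true and `mscf_connected` is false.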
The read error is indicated by message DMS03B2, and the rectification of the error is indicated by message DMS03B8. If a partner which was suspected of having crashed is identified as being active again, the pending failure inquiry is cancelled.
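For anyone scanning console logs for the messages named in this section, a lookup table can be useful. The following Python sketch is an assumption-laden illustration: the message codes and their meanings are taken from the text above, but the `classify` helper and the sample log line format are hypothetical and do not reflect the actual layout of BS2000 console output.

```python
import re

# Message codes and meanings as described in this section.
WATCHDOG_MESSAGES = {
    "DMS03B9": "write error",
    "DMS03BB": "write error rectified",
    "MCA0110": "partner notified of write error occurrence or rectification",
    "MCS1300": "system terminated by MSCF to avoid erroneous crash detection",
    "DMS03B2": "read error",
    "DMS03B8": "read error rectified",
    "MCS1100": "systems support asked to decide on suspected partner failure",
}

# Match any known code as a whole word anywhere in the line.
_CODE = re.compile(r"\b(" + "|".join(WATCHDOG_MESSAGES) + r")\b")


def classify(line):
    """Return (code, meaning) for the first known code in line, else None."""
    match = _CODE.search(line)
    if match is None:
        return None
    code = match.group(1)
    return code, WATCHDOG_MESSAGES[code]
```

A line containing none of the codes yields `None`, so the helper can be used as a simple filter over a log file.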