If a processor connected to a shared pubset network crashes, the remaining sharers can continue to operate the network after a reconfiguration. If a slave processor crashes, the master processor must release the locks held by that slave processor (slave crash processing); if the master processor crashes, a slave processor (the backup master) must take over the role of the master processor (master change import).
The mechanisms for crash detection and the actions taken in the event of a crash depend on the BS2000 versions of the connected processors. The differences are explained in the individual reconfiguration steps.
Crash detection
Monitoring and crash detection are carried out per partner and apply to all pubsets shared with that partner. For error recovery, it is possible to define whether these tasks are to be carried out automatically or whether they are to be initiated by systems support.
A partner processor has crashed when this is detected by the Live Monitor.
A partner processor may have crashed when the partner’s disk protocol was no longer detected by the processor on all pubsets shared with the partner processor, and the MSCF connection monitoring function has found a connection error or no MSCF connection exists.
If both monitoring paths are free of errors from the point of view of the processor, and if the vital-sign messages of the partner failed to appear on both monitoring paths simultaneously (i.e. within the intervals of the monitoring mechanisms), the partner is automatically declared as crashed, provided this is permitted by the rules described under “Inhibiting the automatic start of fail reconfiguration” (Global control parameters).
In all other cases, systems support must make a decision (message MCS1100); the partner is not declared as crashed until the message is answered with “CRASH” or “MTERM” (or more precisely “MXCM-<order code>.CRASH” or “MXCM-<order code>.MTERM”).
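The decision rules above can be summarized in a short Python sketch. It is only an illustration of the described logic, not the MSCF implementation: the names `PartnerObservation`, `Verdict`, `assess_partner` and `answer_confirms_crash` are hypothetical, and each boolean field simply stands for one condition stated in the text.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    NOT_SUSPECTED = "not suspected"
    CRASHED = "crashed"
    ASK_SYSTEMS_SUPPORT = "MCS1100"


@dataclass
class PartnerObservation:
    # Hypothetical field names; each one labels a condition from the text.
    live_monitor_reports_crash: bool
    disk_protocol_missing: bool       # on all pubsets shared with the partner
    mscf_connection_lost: bool        # connection error found, or no MSCF connection
    own_paths_error_free: bool        # both monitoring paths look healthy locally
    vital_signs_lost_together: bool   # missing on both paths within the intervals
    automatic_start_permitted: bool   # not inhibited by the global control parameters


def assess_partner(obs: PartnerObservation) -> Verdict:
    # A crash reported by the Live Monitor is definite.
    if obs.live_monitor_reports_crash:
        return Verdict.CRASHED
    # Without both symptoms the partner is not even suspected.
    if not (obs.disk_protocol_missing and obs.mscf_connection_lost):
        return Verdict.NOT_SUSPECTED
    # Automatic declaration only under the three conditions named in the text.
    if (obs.own_paths_error_free and obs.vital_signs_lost_together
            and obs.automatic_start_permitted):
        return Verdict.CRASHED
    # In all other cases systems support decides (message MCS1100).
    return Verdict.ASK_SYSTEMS_SUPPORT


def answer_confirms_crash(answer: str) -> bool:
    """True for "CRASH" or "MTERM", also in the form "MXCM-<order code>.CRASH"."""
    return answer.strip().rsplit(".", 1)[-1] in {"CRASH", "MTERM"}
```

With this sketch, a suspected partner whose local monitoring paths are not error-free yields `Verdict.ASK_SYSTEMS_SUPPORT`, and the partner counts as crashed only once `answer_confirms_crash` returns `True` for the reply to MCS1100.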
Failure of a sharer
If the partner processor has crashed, error recovery is started in parallel in the respective local tasks for the individual pubsets concerned. With a master-slave relationship, the failure of the partner is indicated by message DMS03B0; with a slave-slave relationship, message MCA0110 is output on the console.
If a slave processor crashes, slave crash processing is initiated on the master processor. No recovery measures are necessary on the other slave processors. If the master processor crashes, one of the slave processors becomes the backup master. The backup master starts a master change import.
Determining the backup master
If the master processor crashes, each processor determines the backup master on the basis of its local watchdog sharer list, the sharers currently entered in the SVL, and the master change control parameters specified with the SET-PUBSET-ATTRIBUTE command.
If the processor itself is the backup master, it checks whether a master change can be implemented. If a master change is not possible (e.g. there is no connection to a slave processor), it is rejected with message MCA0104. Otherwise, a master change import is started.
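The reaction of each processor once the backup master is known, including the case of a processor that is not the backup master (covered in the next paragraph), might be sketched as follows. This is an illustrative Python model under stated assumptions: processor names are plain strings, `react_to_master_crash` and `Reaction` are hypothetical names, and connectivity to every other slave is assumed to be the only precondition the backup master checks (the text gives it only as an example). How the backup master itself is selected is not modelled here.

```python
from enum import Enum


class Reaction(Enum):
    START_IMPORT = "start master change import"
    WAIT_FOR_IMPORT = "wait for the backup master"
    REJECTED = "MCA0104"


def react_to_master_crash(local, backup_master, slaves, connected_to):
    """Decide what `local` does after the master processor has crashed.

    local, backup_master: processor names (hypothetical strings)
    slaves: the remaining slave sharers, including `local`
    connected_to: set of partners to which `local` currently has a connection
    """
    if local == backup_master:
        others = set(slaves) - {local}
        # Assumption: a connection to every other slave is the only check made.
        if others <= set(connected_to):
            return Reaction.START_IMPORT
        return Reaction.REJECTED
    # Not the backup master: wait, unless the backup master is unreachable.
    if backup_master in connected_to:
        return Reaction.WAIT_FOR_IMPORT
    return Reaction.REJECTED
```

For example, with slaves `B`, `C` and `D` and backup master `B`, processor `B` starts the import only if it is connected to both `C` and `D`, while `C` waits as long as it can reach `B`.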
If the processor is not the backup master, it waits for the start of the master change import on the backup master. However, if no connection exists to the backup master, the master change cannot take place (message MCA0104 is output).
Master change import
With a master change import, the individual DMS components (e.g. CMS, Allocator) must establish their master entity on the new master processor. In this case, the management data which has been lost must be recovered on the other processors in the shared pubset network in cooperation with the slave entities. During the master change import, all new requests to the master processor are deferred until the master change is completed or has been aborted.
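The deferral of new work during a master change import can be pictured as a simple gate in front of the master entity. The following Python sketch is a generic illustration, not the DMS mechanism: `MasterChangeGate` and its methods are hypothetical, and the text does not say how deferred requests are treated after an abort, so the sketch simply hands them to their handler in arrival order in both cases.

```python
from collections import deque


class MasterChangeGate:
    """Holds back new requests while a master change import is running."""

    def __init__(self):
        self.import_running = False
        self.deferred = deque()

    def begin_import(self):
        self.import_running = True

    def submit(self, request, handler):
        """Run the request now, or queue it if an import is in progress."""
        if self.import_running:
            self.deferred.append((request, handler))
            return None
        return handler(request)

    def end_import(self):
        """Call on completion or abort; processes the deferred requests in order."""
        self.import_running = False
        results = []
        while self.deferred:
            request, handler = self.deferred.popleft()
            results.append(handler(request))
        return results
```

A handler passed to `submit` could itself return an error after an aborted master change; that policy is deliberately left to the caller in this sketch.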
Slave crash processing
The master processor releases all the resources held by the slave processor. All locks (file and catalog entry locks) are reset.
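As a rough illustration of the lock reset, assume the master keeps a table that maps each locked resource to the processor holding the lock. The table layout and the function name `release_locks_of` are hypothetical and chosen only for this Python sketch.

```python
def release_locks_of(lock_table, crashed_slave):
    """Remove every lock held by `crashed_slave` and return the freed resources.

    lock_table maps (kind, name) -> holder, where kind is "file" or
    "catalog entry" (an assumed layout, not the real DMS data structure).
    """
    released = [res for res, holder in lock_table.items() if holder == crashed_slave]
    for res in released:
        del lock_table[res]
    return released
```

Locks held by other sharers stay untouched, which matches the statement that no recovery measures are necessary on the other slave processors.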