This section describes the Hot Backup system, including its advantages, its disadvantages, and the different types of recovery.
The hot backup configuration involves two systems: a master system (the system in operation) and a slave system in standby mode. Both machines are connected by a fast TCP/IP link. Users are normally connected to the main system. The backup system is also booted and holds a copy of the main system's database. The two machines do not need to be identical; the backup machine simply needs sufficient resources (disk, memory, connectivity, and so on) to support the applications.
During normal operation, all updates to the database on the main system are also applied to the backup system over the network.
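The update flow can be sketched as follows. This is a minimal, hypothetical illustration: the `Update`, `Master`, and `Node` names and the in-memory journal are assumptions for the sketch, not the actual product mechanism, and the network transfer is simulated by a direct call.

```python
# Sketch of hot-backup log shipping: every update applied on the master
# is recorded in a transaction journal and replayed on the backup.
# All names here are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Update:
    seq: int    # sequence number assigned by the master
    key: str
    value: str

class Node:
    def __init__(self):
        self.db = {}
        self.last_seq = 0

    def apply(self, u: Update):
        self.db[u.key] = u.value
        self.last_seq = u.seq

class Master(Node):
    def __init__(self, backup: Node):
        super().__init__()
        self.backup = backup
        self.journal = []

    def update(self, key, value):
        u = Update(self.last_seq + 1, key, value)
        self.apply(u)           # update the live database
        self.journal.append(u)  # record in the transaction journal
        self.backup.apply(u)    # ship to the backup (network simulated)

backup = Node()
master = Master(backup)
master.update("account:1", "100")
master.update("account:2", "250")
assert master.db == backup.db   # both copies stay in sync
```

In a real deployment the last step would go over the TCP/IP link rather than a direct call, and the backup would acknowledge each sequence number it has applied.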
If the main system fails, the users are switched to the backup machine and the application is restarted. The downtime is limited to the switchover time (possibly just the time for the terminal concentrators to establish an Ethernet connection to the other machine), and the data loss is limited to the updates not yet transmitted to the backup machine. This loss usually amounts to a few seconds' worth of work.
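The bound on data loss can be illustrated with a small sketch: on failover, the only lost work is the tail of the journal that never reached the backup. The sequence numbers below are illustrative.

```python
# Sketch: work lost on failover is bounded by the updates still queued
# on the master that the backup never applied. Numbers are illustrative.

master_journal = [101, 102, 103, 104, 105]  # sequence numbers written on the master
backup_last_seq = 103                       # highest number the backup has applied

lost_updates = [s for s in master_journal if s > backup_last_seq]
print(lost_updates)  # -> [104, 105]
```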
The system administration and recovery procedures require some manual intervention. The system relies heavily on UNIX networking, which the system administrator must understand.
This section outlines the operations required to recover from a failure of the main system. After the failure, the users are switched to the backup system and the application is restarted. Once the main machine is repaired, it must be brought back to the same level as the backup machine.
While the main machine is down, the database on the backup machine naturally continues to evolve. All updates on the backup machine are therefore recorded, using the transaction logger mechanism. If the repair time of the main machine is expected to be short (a few hours), the transaction journal can be left on disk. If the repair time is expected to be longer, it is probably better and safer to write the transactions to tape.
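The disk-versus-tape decision above amounts to a simple policy; the sketch below makes it explicit. The threshold value is an assumption for illustration, not a documented limit.

```python
# Sketch of the journal-destination policy: short repairs keep the
# journal on disk, longer ones spool it to tape. The 4-hour threshold
# is an illustrative assumption.

def journal_destination(expected_repair_hours, threshold_hours=4):
    """Choose where to record the transaction journal during the outage."""
    return "disk" if expected_repair_hours <= threshold_hours else "tape"

print(journal_destination(2))   # -> disk
print(journal_destination(48))  # -> tape
```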
Assuming the main machine's database has been completely destroyed following a multiple disk crash, resynchronizing the main machine involves running a full save on the backup machine, restoring it on the main machine, and switching the users back to the main system. The problem is that the save and restore operation can be very long, potentially taking days, and it would be unacceptable to stop operations during this process. Therefore, while the save and restore proceeds, updates to the database must continue to be logged, and can be stored to tape.
After the restore has been completed on the main machine, the transactions accumulated during the save/restore operation are applied to the main database. While this transaction log is being loaded, more updates are likely to be made on the backup machine, resulting in more transaction tapes. Depending on the volume of data, a few iterations of this process may be needed.
The users are then disconnected from the backup machine, the very last transactions are written to a final tape, and that tape is loaded on the main system. All operations must stop for this short period. The two systems are now in sync: users can be reconnected to the main machine, transaction logging across the network can be restarted from the main machine to the backup, and the system is operational again.
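The iterative catch-up and final cutover can be modeled as a simple loop: each pass replays a batch of logged transactions while new updates keep arriving, shrinking the backlog until it is small enough to absorb during a brief stop. The rates and thresholds below are illustrative assumptions, and the model only converges when the replay rate exceeds the arrival rate.

```python
# Sketch of the resynchronization loop. Each iteration replays a batch
# of accumulated transactions on the main machine while new updates keep
# arriving on the backup. Rates and thresholds are illustrative; the
# loop terminates only if replay_rate > new_update_rate and
# freeze_limit >= new_update_rate.

def resync(backlog, replay_rate=100, new_update_rate=30, freeze_limit=50):
    """Return the number of catch-up iterations before the final cutover.

    backlog          -- transactions accumulated during the save/restore
    replay_rate      -- transactions replayed per iteration on the main machine
    new_update_rate  -- new transactions arriving per iteration on the backup
    freeze_limit     -- backlog small enough to load during a brief stop
    """
    iterations = 0
    while backlog > freeze_limit:
        replayed = min(backlog, replay_rate)
        backlog = backlog - replayed + new_update_rate
        iterations += 1
    # Final step: users are disconnected, the remaining 'backlog'
    # transactions are loaded, and both systems are in sync.
    return iterations

print(resync(backlog=500))  # -> 7
```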
If there is enough disk space on the backup machine, and if the downtime of the main system (including the file save/restore) is expected to be short, all the updates can be left on disk, which makes resynchronizing the two machines easier.
If the backup system fails, a procedure similar to the one described for main system recovery must be applied. The only difference is that the users are never stopped. Essentially, a full save is taken of the main machine, restored on the backup machine, and then all the accumulated updates are applied to the backup machine. The only impacts on normal operations are a higher system load due to the file save and, obviously, a higher risk, since there is temporarily no backup.
This configuration is usually applied to very large databases, where ensuring that everything works and that no data is lost is of the utmost importance. Network reliability is critical. The various processes (servers) involved in the communication constantly check on each other, assign sequence numbers to the messages on the network, and verify that the transaction logging mechanism itself is operating normally by periodically writing test data and confirming that the updates are sent. The system administrator can check the databases by periodically running an application report on both machines and verifying that the results are identical. All network incidents, as well as any unusual circumstances, are reported to a predetermined list of users, so that an incident does not go unnoticed for long.
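The message-numbering check described above can be sketched as simple gap detection: if the sequence numbers received by the backup are not consecutive, messages were lost and an alert should be raised. This is a hypothetical illustration, not the product's actual protocol.

```python
# Sketch: detect lost replication messages by checking that sequence
# numbers arrive without gaps. A non-empty result means messages were
# lost and the predetermined list of users should be alerted.

def check_sequence(seqs):
    """Return the sequence numbers missing between the first and last seen."""
    seen = set(seqs)
    lo, hi = min(seen), max(seen)
    return [n for n in range(lo, hi + 1) if n not in seen]

print(check_sequence([1, 2, 3, 5, 6, 9]))  # -> [4, 7, 8]
print(check_sequence([1, 2, 3]))           # -> []
```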