Introduction to High-Availability Clusters On Linux And Other Systems

Cluster Stability Vs. Fast Failure Recovery

In a standard server configuration, we want many retries when errors occur, so that very short connectivity problems are ridden out.
When using an HA cluster, we want to detect problems fast, in order to switch the service to another node (increasing the 'service uptime').
Most operating systems come configured for best stability, so they perform a lot of retries (e.g. in case of a disk connectivity error on Linux, I/O operations may hang for several minutes).
One always has to weigh the benefit of long timeouts against the long fail-over periods they cause.
Assume a failure too quickly, and both nodes might think they have a problem, leaving the service unavailable.
Assume a failure too slowly, and when the problem is not temporary, the cluster will not do its job.
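
The trade-off can be made concrete with a little arithmetic. The following is only a rough sketch: the RetryPolicy class, its field names and the numbers used are hypothetical and do not correspond to any real operating system or cluster manager setting. It assumes the worst case, where every attempt hangs for its full timeout before the failure is declared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical retry settings; not tied to any real OS or cluster manager."""
    timeout_s: float  # how long one attempt may hang before it is abandoned
    attempts: int     # total attempts before a failure is declared

    def detection_time_s(self) -> float:
        # Worst case: every attempt runs into its full timeout.
        return self.timeout_s * self.attempts

    def tolerates_glitch(self, glitch_s: float) -> bool:
        # A glitch is ridden out only if it ends before a failure is declared.
        return glitch_s < self.detection_time_s()


# Illustrative numbers only (assumptions, not recommended values).
stable = RetryPolicy(timeout_s=30.0, attempts=6)  # standalone-server style
fast = RetryPolicy(timeout_s=2.0, attempts=3)     # HA-cluster style

for name, policy in (("stable", stable), ("fast", fast)):
    print(name, policy.detection_time_s(), policy.tolerates_glitch(10.0))
```

With these made-up values, the "stable" policy survives a 10-second glitch but needs up to 180 seconds to notice a permanent failure, while the "fast" policy notices within 6 seconds but would also declare a failure during that same glitch.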

Originally written by