Cluster Stability Vs. Fast Failure Recovery
- In standard server configurations, we want to retry many times in case
of errors, to ride out very short connectivity problems.
- When using an HA cluster, we want to detect problems fast, in order to
switch the service to another node (increasing the 'service uptime').
- Most operating systems come configured for best stability, so they perform
a lot of retries (e.g. in case of a disk connectivity error on Linux, I/O
operations may hang for several minutes).
- One always has to weigh the stability gained from long timeouts against
the long fail-over periods they cause.
- Assume a failure too quickly - and both nodes might think they have a
problem, leaving the service unavailable.
- Assume a failure too slowly - and when there is a non-temporary problem,
the cluster will not fail over in time to do its job.
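The trade-off above can be illustrated with a small sketch. It is not taken
from any particular cluster product: the function names, the idea of counting
consecutive missed heartbeats, and the sample traces are all assumptions made
for illustration. A low threshold declares failure on a short glitch, while a
high threshold tolerates the glitch but reacts later to a real outage.

```python
from typing import Optional, Sequence


def declare_failure_at(heartbeats: Sequence[bool], max_missed: int) -> Optional[int]:
    """Return the interval index at which the peer is declared dead.

    heartbeats[i] is True if a heartbeat arrived during interval i.
    The peer is declared dead after max_missed consecutive misses.
    Returns None if the threshold is never reached.
    (Hypothetical helper, for illustration only.)
    """
    missed = 0
    for i, arrived in enumerate(heartbeats):
        if arrived:
            missed = 0          # any heartbeat resets the counter
        else:
            missed += 1
            if missed >= max_missed:
                return i
    return None


def approx_detection_delay(interval_s: float, max_missed: int) -> float:
    """Rough time from a real failure to its detection, in seconds."""
    return interval_s * max_missed


# A short connectivity glitch: two heartbeats lost, then recovery.
glitch = [True, False, False, True, True]
# A non-temporary problem: heartbeats stop and never come back.
outage = [True] + [False] * 10

# Aggressive setting: fast detection, but the glitch triggers a fail-over.
print(declare_failure_at(glitch, max_missed=2))   # 2
print(declare_failure_at(outage, max_missed=2))   # 2

# Conservative setting: the glitch is ignored, the outage is caught later.
print(declare_failure_at(glitch, max_missed=5))   # None
print(declare_failure_at(outage, max_missed=5))   # 5
```

With a 1-second heartbeat interval (again an assumed value), the aggressive
setting reacts in roughly 2 seconds and the conservative one in roughly 5, so
the choice of threshold directly sets the length of the fail-over period.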
Originally written by
guy keren