Fail-Over During Active-Node Failure
- The passive node detects that the active node is down.
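- The passive node makes sure the active node is really down, and not merely
unreachable - if both nodes wrote to the shared disks, the data could be
corrupted, and losing the service is better than having data corruption.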
- The passive node takes over the resources of the service, from
bottom to top - first it takes over the disks, then it launches the
service process, and finally it makes the virtual IP address of the
service available.
- The IP address take-over must be last, so that clients never reach the new
node before the service is ready and get an error.
This is relevant for UDP-based services - the client will simply retry
sending commands (it will temporarily 'hang' during the transition), and
then start getting responses from the new node.
- In the case of a TCP-based service, the new node knows nothing about the
old connections, so the client must be able to transparently re-connect to
the server after getting a 'RST' packet.
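
The ordering described in the steps above can be sketched in a few lines of Python. This is only an illustration, not how any particular cluster manager does it: the names take_over, FencingError and demo are hypothetical, and the three steps are stand-ins for whatever really mounts the disks, launches the service process and brings up the virtual IP address.

```python
from typing import Callable, List


class FencingError(RuntimeError):
    """Raised when the active node cannot be confirmed down (hypothetical name)."""


def take_over(
    peer_confirmed_down: Callable[[], bool],
    steps: List[Callable[[], None]],
) -> bool:
    """Run the take-over steps in order, but only if the peer is confirmed down.

    `steps` is expected to be ordered bottom to top:
    disks first, then the service process, and the virtual IP last.
    """
    if not peer_confirmed_down():
        # Losing the service is better than having data corruption,
        # so refuse to touch any resource.
        raise FencingError("active node not confirmed down; refusing take-over")
    for step in steps:
        step()
    return True


def demo() -> List[str]:
    """Record the order in which the placeholder steps run."""
    log: List[str] = []
    take_over(
        peer_confirmed_down=lambda: True,
        steps=[
            lambda: log.append("disks"),       # placeholder: take over the disks
            lambda: log.append("service"),     # placeholder: launch the service
            lambda: log.append("virtual-ip"),  # placeholder: bring up the IP, last
        ],
    )
    return log
```

Calling demo() runs the placeholder steps in the bottom-to-top order; passing a peer_confirmed_down callable that returns False makes take_over raise before any resource is touched.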
Originally written by
guy keren