Increasing The "Uptime" Of Services
- Most of us are proud of the 'uptime' of our machines.
- Fewer of us are proud of the 'uptime' of the services our machines supply.
- Users of our services don't care about the machine's uptime - only about
that of the services it runs.
- If we use several machines to run the same service, we can achieve
better "service uptime".
- High-Availability (HA) cluster software is used to help us manage a set of
services on a set of machines, and make sure that a service outlives
the unavailability of any one of these machines.
- Note: sometimes we can also increase performance, but we will
concentrate on High-Availability today.
High-Availability Vocabulary (Part I)
- High-Availability Cluster - A group of computers, whose purpose is to
increase the uptime of services, by making sure that a failed machine
is automatically and quickly replaced by a different machine, with little
service disruption.
- Cluster service - A computer service that is managed by the cluster
software. The cluster service includes all resources required to deliver
the service - the data (e.g. file systems), virtual IP addresses,
processes...
- Cluster Group - a group of servers that run the cluster software, and
together handle a set of services.
- Cluster node - A server that is a member of a cluster group.
- High-Availability Cluster Software - an application that manages a
high-availability cluster.
High-Availability Vocabulary (Part II)
- Active-Passive Cluster - A cluster where only one node runs a given
service at a time, and the other nodes are in stand-by to take over,
should the need arise.
- Active-Active Cluster - A cluster where a given service runs on more
than one machine at the same time.
- Fail-over - the operation of moving the service from one cluster node
to another.
- I/O fencing - a mechanism used in active-passive clusters, to ensure that
no matter what - at most one node will send I/O requests to a given disk.
HA Cluster Software - Roles
- Allow us to configure services, and spread that configuration info across
several cluster nodes (a rough sketch of a service definition appears after
this list).
- Make it easy to launch those services on any of the cluster nodes.
- Make it easy to move
a running service from one node to another (e.g. to allow upgrading the
operating system or service software on the first node).
- Monitor the availability of the service
- Automatically switch a failed service to a different node.
Note: the switch should be as transparent to users as possible. This
requires some support from the client and server software.
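- A rough sketch of what such a service configuration holds (Python, with
made-up names and values - real cluster software has its own configuration
formats):

    # A minimal sketch of what a cluster "service" definition bundles together.
    # All names and values here are made-up placeholders for illustration.
    from dataclasses import dataclass, field

    @dataclass
    class ClusterService:
        name: str                   # logical service name
        primary_node: str           # preferred node for this service
        virtual_ip: str             # IP address that follows the service around
        filesystems: list = field(default_factory=list)  # shared disks to mount
        start_command: str = ""     # how to launch the service process
        stop_command: str = ""      # how to stop it cleanly

    # Example definition (placeholder values):
    web_service = ClusterService(
        name="webserver",
        primary_node="node1",
        virtual_ip="192.0.2.10",
        filesystems=["/dev/sdb1 on /export/www"],
        start_command="/etc/init.d/httpd start",
        stop_command="/etc/init.d/httpd stop",
    )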
Cluster Hardware
- Clusters require a certain set of hardware in order to operate.
- Some of this is found on normal servers
- Some is specific to cluster setups.
- This hardware will include
- Servers
- Network connectivity
- External disks
- Power-supply control mechanisms
Cluster Server Machines
- Cluster servers are usually standard rack-mounted server machines.
- They need to be reliable (or else the entire cluster will be unreliable).
Disk Storage
- In order for the data to be accessible to several cluster nodes, it is
placed on an external SAN or NAS storage device.
- The disks are normally managed using RAID 5 (striping with distributed
parity), or RAID 1 (mirroring).
- In this setup, a single disk failure will not cause the cluster to fail.
- The disks are housed in a RAID enclosure, which allows replacing a faulty
disk without powering down the other disks. This is called "hot-swap".
- On the other hand, since the storage device is connected via a network,
the server might lose access to it.
- To overcome this, more than one network path may be connected between
each cluster node and the storage device.
Networking Equipment
- Cluster nodes are usually connected using more than one network interface
  - e.g. a standard Ethernet and a cross-cable Ethernet or serial link
  - this helps reduce the chance of a "split brain" problem (we'll explain
this later on).
- Sometimes, each cluster node is connected to a different network switch.
  - this is done in order to avoid a single switch failure making the entire
cluster inaccessible.
Power-Control Hardware
- Sometimes a cluster node needs to reboot another cluster node, in order to
ensure that the service is only running on a single cluster node.
- One way to achieve this is using network-controlled power supplies.
- Another way is using the remote console hardware that is bundled inside
modern server machines
  - which allows you to turn off the server machine even if the server's
operating system is hung.
Active-Passive Cluster Operation
- Choosing the active node
- Administrator-Initiated Fail-Over
- Cluster Monitoring
- Fail-Over During Active-Node Failure
- The Split-Brain Problem
Choosing The Active Node For A Service
- In most A/P cluster software, the administrator defines a "primary" node for
each service.
- When the cluster is in normal mode, each service will run on its primary
node.
- In case the primary node fails, a secondary node will take over the
service.
- In some cluster configurations, when the primary node is working again, it
will immediately take the service back from the secondary node. This
is known as "automatic fail-back".
- When a cluster group runs more than one service, it is wise to choose
a different primary node for different services, to achieve better
overall performance.
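- A sketch of this selection logic (Python; node_is_alive() is a placeholder
for the cluster's node-monitoring mechanism):

    # Sketch of primary-node preference with optional automatic fail-back.
    def choose_active_node(primary, secondaries, current, node_is_alive,
                           automatic_failback=False):
        """Return the node that should run the service right now."""
        if current is not None and node_is_alive(current):
            # The service already runs somewhere; move it back only if
            # automatic fail-back is enabled and the primary has recovered.
            if automatic_failback and current != primary and node_is_alive(primary):
                return primary
            return current
        # The current node failed (or the service is not running yet):
        # prefer the primary node, then any live secondary node.
        for candidate in [primary] + list(secondaries):
            if node_is_alive(candidate):
                return candidate
        return None   # no node is available - the service stays down

    # Example: primary down, service currently on node2, no automatic fail-back.
    alive = {"node1": False, "node2": True}
    print(choose_active_node("node1", ["node2"], "node2", lambda n: alive[n]))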
Administrator-Initiated Fail-Over
- All A/P cluster software allows the administrator to initiate a fail-over
operation (or a fail-back operation).
- This is useful to test the cluster configuration before the primary node
actually fails.
- This is also useful when we want to upgrade the hardware (or operating
system, or service software) of the active node.
Cluster Monitoring
- There are two types of monitoring in a cluster - service monitoring
and node monitoring.
- The active node constantly monitors the availability of the service
and its resources.
- If the service becomes non-responsive, the active node will initiate
a fail-over of the service to the passive node.
- If one of the resources required by the service (e.g. the disks) becomes
inaccessible, the active node will initiate such a fail-over, too.
- The passive node monitors the status of the active node.
- If the passive node sees that the active node is down, it will initiate
a fail-over of the service (i.e. try to take over the service).
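- A simplified sketch of the two monitoring loops (the checks, ports and
timeouts below are placeholders - real cluster software uses dedicated
resource agents and heartbeat links):

    import socket, time

    def service_is_healthy(host="127.0.0.1", port=80, timeout=2.0):
        """Service check: can we open a TCP connection to the service's port?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def active_node_loop(initiate_failover, check_interval=5, max_failures=3):
        """Runs on the active node: monitor the service's availability."""
        failures = 0
        while True:
            failures = 0 if service_is_healthy() else failures + 1
            if failures >= max_failures:   # don't fail over on a single glitch
                initiate_failover()
                return
            time.sleep(check_interval)

    def passive_node_loop(seconds_since_heartbeat, initiate_takeover,
                          dead_after=15.0):
        """Runs on the passive node: monitor heartbeats from the active node."""
        while True:
            if seconds_since_heartbeat() > dead_after:
                initiate_takeover()        # the active node is considered dead
                return
            time.sleep(1)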
Fail-Over During Active-Node Failure
- The passive node detects that the active node is down.
- The passive node makes sure the active node is down - losing the service
is better than having data corruption.
- The passive node starts taking over the resources of the service, from
bottom to top - first the disks, then the service processes, and finally
the virtual IP address of the service.
- The IP address take-over must be last, to avoid clients getting an error.
This is enough for UDP-based services - the client will simply retry
sending commands (it will temporarily 'hang' during the transition), and
then start getting responses from the new node.
- In case of a TCP-based service, the client must be able to transparently
re-connect to the server after getting a 'RST' packet.
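- A sketch of this take-over order (the device names, mount points and
commands are placeholders; fence_active_node() stands for whatever I/O
fencing mechanism the cluster uses - see the following slides):

    import subprocess

    def run(cmd):
        subprocess.run(cmd, shell=True, check=True)

    def take_over_service(fence_active_node):
        fence_active_node()                      # make sure the old node is out
        run("mount /dev/sdb1 /export/www")       # 1. take over the shared disks
        run("/etc/init.d/httpd start")           # 2. launch the service process
        # 3. only now bring up the virtual IP, so clients never reach a node
        #    that cannot actually serve them yet
        run("ip addr add 192.0.2.10/24 dev eth0")
        run("arping -c 3 -U -I eth0 192.0.2.10") # refresh neighbours' ARP caches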
The Split-Brain Problem
- What happens if the network connectivity between the active-node and
passive node is down?
- The active node still works, the passive node tries to take over the
resources - and we end up with two active nodes, corrupting the data on
the disks.
- To solve this, clusters use I/O fencing.
- One common method of fencing is SCSI reservation. Another common
method is STONITH.
The SCSI Reservation Mechanism
- Most SCSI disks support a 'SCSI reservation' command.
- If a machine sends such a command to the disk, it acts as a "lock"
against I/O coming from other machines.
- If a machine sends an I/O (read/write) request to a SCSI Disk reserved
by another machine, it will get an error, with a code of
"reservation conflict".
- If the reserving machine crashes, another machine may send a
"break reservation" (or a "reset target") command, to brutally break the
lock.
- Of course, the second machine now needs to send its own SCSI reservation
before sending any I/O to the device.
- This might cause file-system consistency problems (e.g. if the two
machines play too much with 'break reservation').
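- A sketch of the take-over logic built on top of this mechanism
(scsi_reserve() and scsi_break_reservation() are placeholders for the real
SCSI commands, sent via some SCSI pass-through interface):

    class ReservationConflict(Exception):
        """The disk is currently reserved by another machine."""

    def take_over_disk(dev, scsi_reserve, scsi_break_reservation,
                       owner_seems_dead):
        """Try to become the only machine allowed to do I/O on 'dev'."""
        try:
            scsi_reserve(dev)             # normal case: nobody holds the disk
            return True
        except ReservationConflict:
            if not owner_seems_dead():
                return False              # the other node is alive - back off
            scsi_break_reservation(dev)   # brutally break the stale lock,
            scsi_reserve(dev)             # then immediately re-reserve the
            return True                   # disk to protect our own I/O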
The STONITH Mechanism
- STONITH - Shoot The Other Node In The Head - a mechanism to make sure
the other machine cannot send any I/O, before the passive node becomes
active.
- This is usually done using some network-enabled power-control hardware.
- When the other machine comes up, it will try to establish
communications with the node that killed it, and will see that that node
has already taken over the service.
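- A sketch of this order of operations (power_off_node() stands for whatever
power-control hardware is available - its name and behaviour are assumptions
here):

    def stonith_then_take_over(other_node, power_off_node, take_over_service):
        # "Shoot the other node in the head" before touching its resources.
        if not power_off_node(other_node):
            # If we cannot confirm the other node is off, it is safer to leave
            # the service down than to risk two active nodes corrupting data.
            raise RuntimeError("STONITH failed - refusing to take over")
        take_over_service()   # now at most one node can send I/O to the disks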
Active-Active Cluster Operation
- Active-active clusters require great awareness from the underlying
application - if the application wasn't designed for active-active use,
it will not work.
- Thus, one mostly sees this with very specific applications - e.g.
database servers that were designed to support multiple instances
accessing the same database disk storage.
Shared File-Systems
- In order for an active-active cluster to work, it must use a shared
file-system (or an equivalent application-support for working with
raw devices).
- In a shared file-system, the same file-system is mounted on several
nodes at the same time, and applications on all nodes may access
files at the same time - if they use proper locking (see the sketch after
this list).
- A shared file-system must provide proper distributed locking support,
to avoid data corruption.
- To reduce lock contention, a shared file-system such as GFS avoids placing
the meta-data (e.g. i-nodes) of different files in the same file-system
block. Instead, it may store a file's data (for very small files) in the
same block as its meta-data.
- Further, a shared file-system should have "on-line recovery" (e.g. in
case one node crashes in the middle of a transaction).
- In GFS, this is done by having a separate journal per node - and having
another node re-play the journal in case the first node crashes.
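- The locking sketch referred to above (Python, POSIX advisory locks; the
path is a placeholder). On a shared file-system, these locks are expected
to be cluster-wide, enforced by the file-system's distributed lock manager -
the application only has to take them:

    import fcntl

    def append_record(path, record):
        with open(path, "a") as f:
            fcntl.lockf(f, fcntl.LOCK_EX)      # exclusive lock - blocks other
            try:                               # writers, on this and other nodes
                f.write(record + "\n")
                f.flush()
            finally:
                fcntl.lockf(f, fcntl.LOCK_UN)  # always release the lock

    append_record("/export/www/visits.log", "a record written under the lock")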
Load Balancing
- Active-Active clusters have an added feature on top of active-passive
clusters - load balancing.
- Load balancing may be done at the DNS level. This is useful for TCP-based
applications, where connections are (normally) long-lived.
- It may be done at the router level, in case of a completely stateless
protocol.
- It may be done at the application level, in case the computations take
much longer than the network communications.
- Of course, in case one of the nodes crashes, other nodes need to take
over the clients it serviced.
Cluster Stability Vs. Fast Failure Recovery
- In standard server configurations, we want to do a lot of retries in case
of errors, to overcome very short connectivity problems.
- When using an HA cluster, we want to detect problems fast, in order to
switch the service to another node (increasing the 'service uptime').
- Most operating systems come configured for best stability, so they perform
a lot of retries (e.g. in case of disk connectivity error on Linux, I/O
operations may hang for several minutes).
- One always has to weigh long timeouts against long fail-over
periods.
- Assume a failure too quickly - and a short glitch may make each node
think the other has a problem - and the service becomes unavailable.
- Assume a failure too slowly - and when there is a non-temporary problem,
the cluster will take too long to do its job.
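- A made-up numeric example of this trade-off (all numbers are illustrative):

    heartbeat_interval = 2    # seconds between heartbeats (example value)
    takeover_time      = 30   # seconds to fence, mount disks, start the
                              # service and move the IP (example value)

    for missed_heartbeats in (2, 5, 15):
        detection_time = heartbeat_interval * missed_heartbeats
        outage = detection_time + takeover_time
        print(f"declare the node dead after {missed_heartbeats} missed "
              f"heartbeats: ~{outage}s outage on a real failure, but any "
              f"{detection_time}s network glitch triggers a fail-over")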
Cluster Software On Linux
- HeartBeat
- RedHat Cluster Suite
- Oracle RAC
- (there's a bunch of other commercial cluster software...)
The HeartBeat (V2) Cluster
- An active-passive cluster software (open source), that is commonly used
on Linux systems.
- Supports I/O fencing using the STONITH mechanism.
- Simple to install - not tied to specific OS/kernel/whatever. Does not use
SCSI reservation.
- Has built-in monitoring for various resources (starting with version 2),
but a glance at the built-in monitor scripts shows them to be.... naive
and pointless.
- Originally configured using text files - now there's also a GUI.
- Supports multi-node clusters (formally, 16 nodes or more).
- Works with any kind of block device.
RedHat Cluster Suite
- A part of the RedHat enterprise server Linux distributions (can be used
independently too, it seems).
- Supports active-passive and probably also active-active clusters (e.g.
when used on top of RedHat's GFS shared file-system).
- Very simple - does no proper resource monitoring.
- Has a simple GUI to configure it.
- Supports I/O fencing using the STONITH mechanism.
- Works with any kind of block device.
Oracle RAC
- The Oracle 10g database server (with its RAC - 'Real Application
Clusters' - option) supports both active-passive and active-active
configurations.
- Can run on top of Oracle's OCFS shared file-system (which is now part
of the Linux kernel).
- If you run an Oracle database - you usually don't use the machine for
other things - so you don't need an extra HA cluster software.
Clusters On Other Operating Systems
- The Microsoft Cluster
- Veritas Cluster (Various operating systems)
- Sun Cluster (Solaris)
- HACMP Cluster (AIX)
The Microsoft Cluster
- The Microsoft cluster service supports 2-node and N-node
configurations.
- In a two-node configuration, SCSI reserve commands are used to handle
I/O fencing.
- If there is a "split brains", the formerly-active node will send a SCSI
reserve every 3 seconds, while the formerly-passive node will send it
every 5 seconds.
- The first to manage to get SCSI reservation, becomes active.
- In an N-node configuration, a quorum disk is usually used to overcome
"split-brain" situations.
- A quorum disk is used to handle a 'vote' for the active node. In order to
win, a node must get (N/2)+1 votes. Thus, N is normally an odd number
(e.g. 3).
- Very commonly found in the field.
- (Probably) works only with SCSI disk devices.
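- A sketch of the majority ("quorum") rule mentioned above:

    def has_quorum(votes_seen, total_votes):
        # A node may become (or stay) active only if it sees a majority
        # of the votes: (N/2)+1 in integer arithmetic.
        return votes_seen >= total_votes // 2 + 1

    # Example with N = 3: seeing only your own vote loses quorum; seeing your
    # own vote plus the quorum disk (or another node) wins it.
    print(has_quorum(1, 3), has_quorum(2, 3))   # False True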
Veritas Cluster Server
- Comes from Veritas software (which is now owned by Symantec), and supports
several operating systems (including Solaris and Linux).
- An active-passive cluster. Has a very nice GUI configuration tool
(but cannot handle spaces in various resource names).
- Very common in the field on Solaris systems, until Solaris 10 - when Sun
tried to "boot it off" in favor of their own cluster software (I think
they failed - but they gained some awareness and market-share).
- Often used with the Veritas volume manager - which is logical volume
management (LVM) software, and can properly identify disk statuses.
- Normally uses SCSI reservation for I/O fencing, but can also be used
without it.
- Works only with SCSI disk devices.
Sun Cluster
- Comes as an added product for Solaris operating systems.
- Up to and including Solaris 9, it was scarcely used - everyone used the
Veritas cluster.
- Has an active-passive mode and a phony active-active mode. In the
'active-active' mode, all I/O from the 2nd node flows via the LAN
to the first node, so only one node actually accesses disks.
- Has a nice GUI to configure it - which is somewhat complicated to use.
- (Probably) works only with SCSI disk devices.
HACMP Cluster (AIX)
- The HACMP Cluster software supports both active-passive and active-active
configurations.
- In AIX, a SCSI reservation is sent by default when the device is opened.
This needs to be disabled for active-active configurations.
- Since on AIX everything is done using LVM, there is tight integration
between the logical volume configuration and the cluster configuration.
References
- Wikipedia - High-Availability Clusters -
http://en.wikipedia.org/wiki/High-availability_cluster
- The Heartbeat Cluster Software -
http://www.linux-ha.org
Originally written by
guy keren