Increasing The "Uptime" Of Services
- Most of us are proud of the 'uptime' of our machines.
- Fewer of us are proud of the 'uptime' of the services our machines supply.
- Users of our services don't care about the machine's uptime - only about
that of the services it runs.
- If we use several machines to run the same service, we can achieve
better "service uptime".
- High-Availability (HA) cluster software is used to help us manage a set of
services on a set of machines, and make sure that a service outlives
the unavailability of any one of these machines.
- Note: sometimes we can also increase performance, but we will
concentrate on High-Availability today.
High-Availability Vocabulary (Part I)
- High-Availability Cluster - A group of computers, whose purpose is to
increase the uptime of services, by making sure that a failed machine
is automatically and quickly replaced by a different machine, with little
service disruption.
- Cluster service - A computer service that is managed by the cluster
software. The cluster service includes all resources required to deliver
the service - the data (e.g. file systems), virtual IP addresses,
processes...
- Cluster Group - a group of servers that run the cluster software, and
together handle a set of services.
- Cluster node - A server that is a member of a cluster group.
- High-Availability Cluster Software - an application that manages a
high-availability cluster.
High-Availability Vocabulary (Part II)
- Active-Passive Cluster - A cluster where only one node runs a given
service at a time, and the other nodes are in stand-by to take over,
should the need arise.
- Active-Active Cluster - A cluster where a given service runs on more
than one machine at the same time.
- Fail-over - the operation of moving the service from one cluster node
to another.
- I/O fencing - a mechanism used in active-passive clusters, to ensure that
no matter what - at most one node will send I/O requests to a given disk.
HA Cluster Software - Roles
- Allow us to configure services, and spread that configuration info across
several cluster nodes (a rough sketch of a service definition appears after
this list).
- Make it easy to launch those services on any of the cluster nodes.
- Make it easy to move
a running service from one node to another (e.g. to allow upgrading the
operating system or service software on the first node).
- Monitor the availability of the service
- Automatically switch a failed service to a different node.
Note: the switch should be as transparent to users as possible. This
requires some support from the client and server software.
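- A rough sketch of what such a service configuration holds (Python, with
made-up names and values - real cluster software has its own configuration
formats):

    # A minimal sketch of what a cluster "service" definition bundles together.
    # All names and values here are made-up placeholders for illustration.
    from dataclasses import dataclass, field

    @dataclass
    class ClusterService:
        name: str                   # logical service name
        primary_node: str           # preferred node for this service
        virtual_ip: str             # IP address that follows the service around
        filesystems: list = field(default_factory=list)  # shared disks to mount
        start_command: str = ""     # how to launch the service process
        stop_command: str = ""      # how to stop it cleanly

    # Example definition (placeholder values):
    web_service = ClusterService(
        name="webserver",
        primary_node="node1",
        virtual_ip="192.0.2.10",
        filesystems=["/dev/sdb1 on /export/www"],
        start_command="/etc/init.d/httpd start",
        stop_command="/etc/init.d/httpd stop",
    )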
Cluster Hardware
- Clusters require a certain set of hardware in order to operate.
- Some of this is found on normal servers
- Some is specific to cluster setups.
- This hardware will include
- Servers
- Network connectivity
- External disks
- Power-supply control mechanisms
Cluster Server Machines
- Cluster servers are usually standard rack-mounted server machines.
- They need to be reliable (or else the entire cluster will be unreliable).
Disk Storage
- In order for the data to be accessible to several cluster nodes, it is
placed on an external SAN or NAS storage device.
- The disks are normally managed using RAID 5 (striping with distributed
parity), or RAID 1 (mirroring).
- In this setup, a single disk failure will not cause the cluster to fail.
- The disks are housed in a RAID enclosure, which allows replacing a faulty
disk without powering down the other disks. This is called "hot-swap".
- On the other hand, since the storage device is connected via a network,
the server might lose access to it.
- To overcome this, more than one network path may be connected between
each cluster node and the storage device.
Networking Equipment
- Cluster nodes are usually connected using more than one network interface
  - e.g. a standard Ethernet and a cross-cable Ethernet or serial link
  - this helps reduce the chance of a "split brain" problem (we'll explain
this later on).
- Sometimes, each cluster node is connected to a different network switch.
  - this is done in order to avoid a single switch failure making the entire
cluster inaccessible.
Power-Control Hardware
- Sometimes a cluster node needs to reboot another cluster node, in order to
ensure that the service is only running on a single cluster node.
- One way to achieve this is using network-controlled power supplies.
- Another way is using the remote console hardware that is bundled inside
modern server machines
  - which allows you to turn off the server machine even if the server's
operating system is hung.
Active-Passive Cluster Operation
- Choosing the active node
- Administrator-Initiated Fail-Over
- Cluster Monitoring
- Fail-Over During Active-Node Failure
- The Split-Brain Problem
Choosing The Active Node For A Service
- In most A/P cluster software, the administrator defines a "primary" node for
each service.
- When the cluster is in normal mode, each service will run on its primary
node.
- In case the primary node fails, a secondary node will take over the
service.
- In some cluster configurations, when the primary node is working again, it
will immediately take the service back from the secondary node. This
is known as "automatic fail-back".
- When a cluster group runs more than one service, it is wise to choose
a different primary node for different services, to achieve better
overall performance.
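- A sketch of this selection logic (Python; node_is_alive() is a placeholder
for the cluster's node-monitoring mechanism):

    # Sketch of primary-node preference with optional automatic fail-back.
    def choose_active_node(primary, secondaries, current, node_is_alive,
                           automatic_failback=False):
        """Return the node that should run the service right now."""
        if current is not None and node_is_alive(current):
            # The service already runs somewhere; move it back only if
            # automatic fail-back is enabled and the primary has recovered.
            if automatic_failback and current != primary and node_is_alive(primary):
                return primary
            return current
        # The current node failed (or the service is not running yet):
        # prefer the primary node, then any live secondary node.
        for candidate in [primary] + list(secondaries):
            if node_is_alive(candidate):
                return candidate
        return None   # no node is available - the service stays down

    # Example: primary down, service currently on node2, no automatic fail-back.
    alive = {"node1": False, "node2": True}
    print(choose_active_node("node1", ["node2"], "node2", lambda n: alive[n]))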
Administrator-Initiated Fail-Over
- All A/P cluster software allows the administrator to initiate a fail-over
operation (or a fail-back operation).
- This is useful to test the cluster configuration before the primary node
actually fails.
- This is also useful when we want to upgrade the hardware (or operating
system, or service software) of the active node.
Cluster Monitoring
- There are two types of monitoring in a cluster - service monitoring
and node monitoring.
- The active node constantly monitors the availability of the service
and its resources.
- If the service becomes non-responsive, the active node will initiate
a fail-over of the service to the passive node.
- If one of the resources required by the service (e.g. the disks) becomes
inaccessible, the active node will initiate such a fail-over, too.
- The passive node monitors the status of the active node.
- If the passive node sees that the active node is down, it will initiate
a fail-over of the service (i.e. try to take over the service).
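- A simplified sketch of the two monitoring loops (the checks, ports and
timeouts below are placeholders - real cluster software uses dedicated
resource agents and heartbeat links):

    import socket, time

    def service_is_healthy(host="127.0.0.1", port=80, timeout=2.0):
        """Service check: can we open a TCP connection to the service's port?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def active_node_loop(initiate_failover, check_interval=5, max_failures=3):
        """Runs on the active node: monitor the service's availability."""
        failures = 0
        while True:
            failures = 0 if service_is_healthy() else failures + 1
            if failures >= max_failures:   # don't fail over on a single glitch
                initiate_failover()
                return
            time.sleep(check_interval)

    def passive_node_loop(seconds_since_heartbeat, initiate_takeover,
                          dead_after=15.0):
        """Runs on the passive node: monitor heartbeats from the active node."""
        while True:
            if seconds_since_heartbeat() > dead_after:
                initiate_takeover()        # the active node is considered dead
                return
            time.sleep(1)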
Fail-Over During Active-Node Failure
- The passive node detects that the active node is down.
- The passive node makes sure the active node is down - losing the service
is better than having data corruption.
- The passive node starts taking over the resources of the service, from
bottom to top - first the disks, then the service processes, and finally
the virtual IP address of the service.
- The IP address take-over must be last, to avoid clients getting an error.
This is enough for UDP-based services - the client will simply retry
sending commands (it will temporarily 'hang' during the transition), and
then start getting responses from the new node.
- In case of a TCP-based service, the client must be able to transparently
re-connect to the server after getting a 'RST' packet.
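- A sketch of this take-over order (the device names, mount points and
commands are placeholders; fence_active_node() stands for whatever I/O
fencing mechanism the cluster uses - see the following slides):

    import subprocess

    def run(cmd):
        subprocess.run(cmd, shell=True, check=True)

    def take_over_service(fence_active_node):
        fence_active_node()                      # make sure the old node is out
        run("mount /dev/sdb1 /export/www")       # 1. take over the shared disks
        run("/etc/init.d/httpd start")           # 2. launch the service process
        # 3. only now bring up the virtual IP, so clients never reach a node
        #    that cannot actually serve them yet
        run("ip addr add 192.0.2.10/24 dev eth0")
        run("arping -c 3 -U -I eth0 192.0.2.10") # refresh neighbours' ARP caches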
The Split-Brain Problem
- What happens if the network connectivity between the active-node and
passive node is down?
- The active node still works, the passive node tries to take over the
resources - and we end up with two active nodes, corrupting the data on
the disks.
- To solve this, clusters use I/O fencing.
- One common method of fencing is SCSI reservation. Another common
method is STONITH.
The SCSI Reservation Mechanism
- Most SCSI disks support a 'SCSI reservation' command.
- If a machine sends such a command to the disk, it acts as a "lock"
against I/O coming from other machines.
- If a machine sends an I/O (read/write) request to a SCSI Disk reserved
by another machine, it will get an error, with a code of
"reservation conflict".
- If the reserving machine crashes, another machine may send a
"break reservation" (or a "reset target") command, to brutally break the
lock.
- Of course, the second machine now needs to send its own SCSI reservation
before sending any I/O to the device.
- This might cause file-system consistency problems (e.g. if the two
machines play too much with 'break reservation').
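- A sketch of the take-over logic built on top of this mechanism
(scsi_reserve() and scsi_break_reservation() are placeholders for the real
SCSI commands, sent via some SCSI pass-through interface):

    class ReservationConflict(Exception):
        """The disk is currently reserved by another machine."""

    def take_over_disk(dev, scsi_reserve, scsi_break_reservation,
                       owner_seems_dead):
        """Try to become the only machine allowed to do I/O on 'dev'."""
        try:
            scsi_reserve(dev)             # normal case: nobody holds the disk
            return True
        except ReservationConflict:
            if not owner_seems_dead():
                return False              # the other node is alive - back off
            scsi_break_reservation(dev)   # brutally break the stale lock,
            scsi_reserve(dev)             # then immediately re-reserve the
            return True                   # disk to protect our own I/O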
The STONITH Mechanism
- STONITH - Shoot The Other Node In The Head - a mechanism to make sure
the other machine cannot send any I/O, before the passive node becomes
active.
- This is usually done using some network-enabled power-control hardware.
- When the other machine comes up, it will try to establish
communications with the node that killed it, and will see that that node
has already taken over the service.
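- A sketch of this order of operations (power_off_node() stands for whatever
power-control hardware is available - its name and behaviour are assumptions
here):

    def stonith_then_take_over(other_node, power_off_node, take_over_service):
        # "Shoot the other node in the head" before touching its resources.
        if not power_off_node(other_node):
            # If we cannot confirm the other node is off, it is safer to leave
            # the service down than to risk two active nodes corrupting data.
            raise RuntimeError("STONITH failed - refusing to take over")
        take_over_service()   # now at most one node can send I/O to the disks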
Active-Active Cluster Operation
- Active-active clusters require great awareness from the underlying
application - if the application wasn't designed for active-active use,
it will not work.
- Thus, one mostly sees this with very specific applications - e.g.
database servers that were designed to support multiple instances
accessing the same database disk storage.
Shared File-Systems
- In order for an active-active cluster to work, it must use a shared
file-system (or an equivalent application-support for working with
raw devices).
- In a shared file-system, the same file-system is mounted on several
nodes at the same time, and applications on all nodes may access
files at the same time - if they use proper locking (see the sketch after
this list).
- A shared file-system must provide proper distributed locking support,
to avoid data corruption.
- To reduce lock contention, a shared file-system such as GFS avoids placing
the meta-data (e.g. i-nodes) of different files in the same file-system
block. Instead, it may store a file's data (for very small files) in the
same block as its meta-data.
- Further, a shared file-system should have "on-line recovery" (e.g. in
case one node crashes in the middle of a transaction).
- In GFS, this is done by having a separate journal per node - and having
another node re-play the journal in case the first node crashes.
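- The locking sketch referred to above (Python, POSIX advisory locks; the
path is a placeholder). On a shared file-system, these locks are expected
to be cluster-wide, enforced by the file-system's distributed lock manager -
the application only has to take them:

    import fcntl

    def append_record(path, record):
        with open(path, "a") as f:
            fcntl.lockf(f, fcntl.LOCK_EX)      # exclusive lock - blocks other
            try:                               # writers, on this and other nodes
                f.write(record + "\n")
                f.flush()
            finally:
                fcntl.lockf(f, fcntl.LOCK_UN)  # always release the lock

    append_record("/export/www/visits.log", "a record written under the lock")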
Load Balancing
- Active-Active clusters have an added feature on top of active-passive
clusters - load balancing.
- Load balancing may be done at the DNS level. This is useful for TCP-based
applications, where connections are (normally) long-lived.
- It may be done at the router level, in case of a completely stateless
protocol.
- It may be done at the application level, in case the computations take
much longer than the network communications.
- Of course, in case one of the nodes crashes, other nodes need to take
over the clients it serviced.
Cluster Stability Vs. Fast Failure Recovery
- In standard server configurations, we want to do a lot of retries in case
of errors, to overcome very short connectivity problems.
- When using an HA cluster, we want to detect problems fast, in order to
switch the service to another node (increasing the 'service uptime').
- Most operating systems come configured for best stability, so they perform
a lot of retries (e.g. in case of disk connectivity error on Linux, I/O
operations may hang for several minutes).
- One always has to weigh long timeouts against long fail-over
periods.
- Assume a failure too quickly - and a short glitch may make each node
think the other has a problem - and the service becomes unavailable.
- Assume a failure too slowly - and when there is a non-temporary problem,
the cluster will take too long to do its job.
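- A made-up numeric example of this trade-off (all numbers are illustrative):

    heartbeat_interval = 2    # seconds between heartbeats (example value)
    takeover_time      = 30   # seconds to fence, mount disks, start the
                              # service and move the IP (example value)

    for missed_heartbeats in (2, 5, 15):
        detection_time = heartbeat_interval * missed_heartbeats
        outage = detection_time + takeover_time
        print(f"declare the node dead after {missed_heartbeats} missed "
              f"heartbeats: ~{outage}s outage on a real failure, but any "
              f"{detection_time}s network glitch triggers a fail-over")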
Cluster Software On Linux
- HeartBeat
- RedHat Cluster Suite
- Oracle RAC
- (there's a bunch of other commercial cluster software...)
The HeartBeat (V2) Cluster
- An active-passive cluster software (open source), that is commonly used
on Linux systems.
- Supports I/O fencing using the STONITH mechanism.
- Simple to install - not tied to specific OS/kernel/whatever. Does not use
SCSI reservation.
- Has built-in monitoring for various resources (starting with version 2),
but a glance at the built-in monitor scripts shows them to be.... naive
and pointless.
- Originally configured using text files - now there's also a GUI.
- Supports multi-node clusters (formally, 16 nodes or more).
- Works with any kind of block device.
RedHat Cluster Suite
- A part of the RedHat enterprise server Linux distributions (can be used
independently too, it seems).
- Supports active-passive and probably also active-active clusters (e.g.
when used on top of RedHat's GFS shared file-system).
- Very simple - does no proper resource monitoring.
- Has a simple GUI to configure it.
- Supports I/O fencing using the STONITH mechanism.
- Works with any kind of block device.
Oracle RAC
- The Oracle 10g database server (with its RAC - 'Real Application
Clusters' - option) supports both active-passive and active-active
configurations.
- Can run on top of Oracle's OCFS shared file-system (which is now part
of the Linux kernel).
- If you run an Oracle database - you usually don't use the machine for
other things - so you don't need an extra HA cluster software.
Clusters On Other Operating Systems
- The Microsoft Cluster
- Veritas Cluster (Various operating systems)
- Sun Cluster (Solaris)
- HACMP Cluster (AIX)
The Microsoft Cluster
- The Microsoft cluster service supports 2-node and N-node
configurations.
- In a two-node configuration, SCSI reserve commands are used to handle
I/O fencing.
- If there is a "split brains", the formerly-active node will send a SCSI
reserve every 3 seconds, while the formerly-passive node will send it
every 5 seconds.
- The first to manage to get SCSI reservation, becomes active.
- In an N-node configuration, a quorum disk is usually used to overcome
"split-brain" situations.
- A quorum disk is used to handle a 'vote' for the active node. In order to
win, a node must get (N/2)+1 votes. Thus, N is normally an odd number
(e.g. 3).
- Very commonly found in the field.
- (Probably) works only with SCSI disk devices.
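- A sketch of the majority ("quorum") rule mentioned above:

    def has_quorum(votes_seen, total_votes):
        # A node may become (or stay) active only if it sees a majority
        # of the votes: (N/2)+1 in integer arithmetic.
        return votes_seen >= total_votes // 2 + 1

    # Example with N = 3: seeing only your own vote loses quorum; seeing your
    # own vote plus the quorum disk (or another node) wins it.
    print(has_quorum(1, 3), has_quorum(2, 3))   # False True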
Veritas Cluster Server
- Comes from Veritas software (which is now owned by Symantec), and supports
several operating systems (including Solaris and Linux).
- An active-passive cluster. Has a very nice GUI configuration tool
(but cannot handle spaces in various resource names).
- Very common in the field on Solaris systems, until Solaris 10 - when Sun
tried to "boot it off" in favor of their own cluster software (I think
they failed - but they gained some awareness and market-share).
- Often used with the Veritas volume manager - which is logical volume
management (LVM) software, and can properly identify disk statuses.
- Normally uses SCSI reservation for I/O fencing, but can also be used
without it.
- Works only with SCSI disk devices.
Sun Cluster
- Comes as an added product for Solaris operating systems.
- Up to and including Solaris 9, it was scarcely used - everyone used the
Veritas cluster.
- Has an active-passive mode and a phony active-active mode. In the
'active-active' mode, all I/O from the 2nd node flows via the LAN
to the first node, so only one node actually accesses disks.
- Has a nice GUI to configure it - which is somewhat complicated to use.
- (Probably) works only with SCSI disk devices.
HACMP Cluster (AIX)
- The HACMP Cluster software supports both active-passive and active-active
configurations.
- In AIX, a SCSI reservation is sent by default when the device is opened.
This needs to be disabled for active-active configurations.
- Since on AIX everything is done using LVM, there is tight integration
between the logical volume configuration and the cluster configuration.
References
- Wikipedia - High-Availability Clusters -
http://en.wikipedia.org/wiki/High-availability_cluster
- The Heartbeat Cluster Software -
http://www.linux-ha.org
Originally written by
guy keren