Oracle Grid Infrastructure - Reboot-less Node Fencing

Introduction

Oracle Grid Infrastructure 11.2.0.2 has many features, including cluster node membership, cluster resource management and cluster resource monitoring. One key area where a DBA needs expert knowledge is how cluster node membership works and how the cluster decides to take a node out when there is a heartbeat network, voting disk or node-specific issue. I have written about this before; this article focuses specifically on the 11g R2 features and also tries to explain reboot-less node fencing.

What circumstances can cause node membership issues?

  • Heartbeat failures – any heartbeat communication failure that stops the node from communicating with the other nodes
  • Voting disk ping failures – any type of voting disk access failure
  • Node-specific CPU issues, which we can call CPU starvation
  • Last but not least, a problem with any one of the cluster processes

Why does a cluster need to force a node out of the cluster?

If a node cannot communicate with the other nodes, it may corrupt data by writing without coordinating with them. When that happens, the node must be taken out of the cluster to protect the integrity of the cluster and its data. This situation is called “split brain”: two separate sub-clusters operate on the same set of data and write independently, causing data integrity problems and corruption. Any clustering solution needs to address this issue, and Oracle Grid Infrastructure Clusterware does too.

Heartbeat or Private Interconnect Failures

One of the core design requirements for any mission-critical cluster is failover for the interconnect, which some folks call the heartbeat network. The heartbeat network provides the connectivity between the nodes. It usually runs through redundant switches, redundant NICs and redundant cables, so that the cluster tolerates a cable failure, a NIC failure or a switch failure. Typically two redundant NICs are bonded together using native bonding; on Solaris you can use IPMP to bond them. The switches need to be trunked for communication to work between the two links. Fortunately, if you are using 11g R2, you don’t need to bond the interfaces and can let Oracle manage them. The advantage of using redundant interfaces for the private interconnect is that Oracle can not only load balance the heartbeat TCP / database UDP traffic but also provide failover should one of the networks go bad.

Oracle constantly checks both the network heartbeat and the disk heartbeat between the nodes. The CSS misscount parameter controls how many seconds of missed network heartbeat the cluster tolerates. Its default value depends on the platform; on Linux the default has been 30 seconds since 11g R1.

unc01sys$ crsctl get css misscount

30
The CSS misscount parameter represents the maximum time, in seconds, that a network heartbeat can be missed before the cluster enters a reconfiguration to evict the node. The default misscount values, in seconds, by platform and version when using Oracle Clusterware* are:

OS        10g (R1 & R2)   11g
Linux     60              30
Unix      30              30
VMS       30              30
Windows   30              30

*CSS misscount default value when using vendor (non-Oracle) clusterware is 600 seconds. This is to allow the vendor clusterware ample time to resolve any possible split brain scenarios.

If the network heartbeat between the nodes fails, the voting disk write must complete within the short disk timeout (SDTO). Oracle Clusterware uses two disk timeout parameters: the long disk timeout (LDTO) and the short disk timeout (SDTO).

LDTO is 200 seconds by default. SDTO is calculated from two other parameters, misscount and reboottime. For example, with the default CSS misscount on Linux:

SDTO = misscount - reboottime

=> 30 seconds misscount - 3 seconds reboottime = 27 seconds
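
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula, not Oracle code: the function names are hypothetical, and the constants are the Linux 11g defaults quoted in this article.

```python
# Illustrative sketch of the CSS timeout arithmetic (not Oracle code).
# The constants are the defaults quoted in this article for Linux on 11g.
MISSCOUNT_S = 30    # value returned by "crsctl get css misscount"
REBOOTTIME_S = 3    # reboot time used in the example above
LDTO_S = 200        # long disk timeout


def short_disk_timeout(misscount_s: int, reboottime_s: int) -> int:
    """SDTO = misscount - reboottime."""
    if reboottime_s >= misscount_s:
        raise ValueError("reboottime must be smaller than misscount")
    return misscount_s - reboottime_s


def effective_disk_timeout(network_heartbeat_ok: bool) -> int:
    """LDTO applies while the network heartbeat is healthy, SDTO otherwise."""
    if network_heartbeat_ok:
        return LDTO_S
    return short_disk_timeout(MISSCOUNT_S, REBOOTTIME_S)


print(short_disk_timeout(MISSCOUNT_S, REBOOTTIME_S))  # 27
```

With the 10g Linux default misscount of 60 seconds, the same formula would give an SDTO of 57 seconds.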

The voting disk write must complete within the SDTO when there is a “network heartbeat failure”; in that case Oracle Clusterware no longer considers the LDTO. Once the network heartbeat has been missing for the CSS misscount minus the reboot time (27 seconds) while disk heartbeats still arrive within the SDTO, the voting disks show that the cluster has split into two sub-clusters: one is called the big cluster and the other the small cluster. Clusterware then decides to evict the small cluster.

Once the network heartbeat has been missing for 50% of the timeout interval, Clusterware starts writing warning messages to the cluster logs, which are located under the $CRS_HOME/log/hostname directory.

[cssd(32512)]CRS-1612:Network communication with node unc1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.829 seconds

[cssd(32512)]CRS-1611:Network communication with node unc1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 8.124 seconds

[cssd(32512)]CRS-1610:Network communication with node unc1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 1.802 seconds

[cssd(32512)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity
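
When watching for an impending eviction, it can help to pull the node name, percentage and remaining seconds out of these warnings. The following is a minimal sketch that assumes the message layout matches the sample lines above exactly; the function name parse_heartbeat_warning is hypothetical, and other Clusterware versions may word the messages differently.

```python
import re

# Pattern derived from the sample log lines in this article; it is an
# assumption that other Clusterware versions use the same wording.
WARNING_RE = re.compile(
    r"\[cssd\((?P<pid>\d+)\)\]"
    r"(?P<code>CRS-\d+):Network communication with node "
    r"(?P<node>\S+) \((?P<node_id>\d+)\) missing for "
    r"(?P<percent>\d+)% of timeout interval\.\s+"
    r"Removal of this node from cluster in (?P<seconds>[\d.]+) seconds"
)


def parse_heartbeat_warning(line: str):
    """Return the fields of a missed-heartbeat warning, or None if no match."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    return {
        "code": match.group("code"),
        "node": match.group("node"),
        "node_id": int(match.group("node_id")),
        "percent": int(match.group("percent")),
        "seconds_left": float(match.group("seconds")),
    }


sample = (
    "[cssd(32512)]CRS-1610:Network communication with node unc1 (1) "
    "missing for 90% of timeout interval. "
    "Removal of this node from cluster in 1.802 seconds"
)
print(parse_heartbeat_warning(sample))
```

Lines that do not match, such as the final CRS-1609 message, return None and can be handled separately.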

CRS then checks the disk heartbeat against the SDTO. If disk heartbeats keep arriving during those 27 seconds but the network heartbeat stays missing for the whole of that time, the cluster has to shoot the non-communicating node within the remaining 3 seconds. How does the cluster decide which node to shoot to preserve cluster integrity? The nodes could race to take each other down, so there has to be another mechanism to determine which node has lost connectivity. This is where the voting files, which carry the disk heartbeats, come in.

There are several ways a node can be fenced. One is a STONITH-like approach (“Shoot The Other Node In The Head”). Some hardware can be managed remotely using IPMI (Intelligent Platform Management Interface), so Oracle Clusterware can use that hardware interface to reboot the node remotely.

When network heartbeats are not available but disk heartbeats are, the cluster can package a “poison pill”, the so-called “kill block”, and write it to the voting disk. During a split brain two sub-clusters form: one is called the big cluster and the other the small cluster. The nodes in the smaller sub-cluster receive the kill block; when both sub-clusters are the same size, the sub-cluster containing the lowest node ID (olsnodes -n) survives and the other one receives it. The kill block forces ocssd.bin to commit suicide on those nodes.
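
The survivor decision can be modelled roughly as follows. This is a simplified sketch, not Oracle's actual algorithm: the larger sub-cluster survives and the smaller one is handed the kill block. The tie-break used here (the sub-cluster holding the lowest node ID survives) is an assumption based on how 11g R2 behaviour is commonly described, and the function name choose_survivor is hypothetical.

```python
from typing import List, Set, Tuple


def choose_survivor(subclusters: List[Set[int]]) -> Tuple[Set[int], List[Set[int]]]:
    """Return (surviving sub-cluster, sub-clusters that get the kill block).

    Each sub-cluster is a set of node IDs, as listed by "olsnodes -n".
    """
    if not subclusters or any(not nodes for nodes in subclusters):
        raise ValueError("need at least one non-empty sub-cluster")
    # Larger sub-cluster first; on equal size, the one holding the lowest
    # node ID first (assumed tie-break, see the note above).
    ranked = sorted(subclusters, key=lambda nodes: (-len(nodes), min(nodes)))
    return ranked[0], ranked[1:]


# Three-node cluster where node 1 is cut off from nodes 2 and 3.
survivor, evicted = choose_survivor([{1}, {2, 3}])
print(sorted(survivor), [sorted(nodes) for nodes in evicted])
```

In a two-node cluster both sub-clusters have one member, so under this assumption the node with the lower ID stays up.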

Voting Disk Failures

The next cause of split-brain clusters can be voting disk access. From 11g R2 onwards, voting files can be placed on ASM disk groups. The ASM instance does not need to be up for the cluster to access the voting files. It is typically recommended to keep an odd number of voting files, such as three or five, so that a majority always exists when there is a failure. You can’t find the exact time by having two watches, and a jury that is evenly split ends in a hung verdict, so an odd number of voting files should always be used.
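
The case for an odd number of voting files follows from simple majority arithmetic. The sketch below uses hypothetical function names and only illustrates the counting; it does not reflect how Clusterware implements the check.

```python
def has_majority(accessible: int, total: int) -> bool:
    """True if a node can access strictly more than half of the voting files."""
    if total <= 0 or not 0 <= accessible <= total:
        raise ValueError("expected total > 0 and 0 <= accessible <= total")
    return 2 * accessible > total


def tolerated_failures(total: int) -> int:
    """Number of voting files that can be lost while a majority remains."""
    if total <= 0:
        raise ValueError("expected total > 0")
    return (total - 1) // 2


for total in (2, 3, 4, 5):
    print(total, "voting files tolerate", tolerated_failures(total), "failure(s)")
```

Going from three to four voting files does not increase the number of failures the cluster can tolerate, which is why three or five are the usual choices.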

As with network failures, voting disk failures are tracked using two parameters, LDTO and SDTO. The long disk timeout (200 seconds) is used during normal cluster operation, when network heartbeats are good. The short disk timeout explained earlier (27 seconds) is used during cluster formation or when a node is leaving the cluster. Typically the cluster evicts a node by rebooting it when the node cannot communicate with the voting disks.

A node cannot join the cluster if it cannot access a majority of the voting files, and a node that is already a member must leave if it loses access to a majority of them. Prior to 11g R2 the cluster evicted the node on voting disk failures by rebooting it or having it commit suicide; starting with 11g R2 the cluster protects itself by fencing the node without a reboot. The next section explains what reboot-less fencing is.

Reboot-less Node Fencing

Prior to 11g R2, the node was rebooted during voting disk failures to protect the integrity of the cluster. But the cause is not necessarily just a communication issue: the node may be hanging, or an I/O operation may be hanging, so the decision to reboot can be the wrong one. Oracle Clusterware therefore fences the node without rebooting it. This is a big achievement and a significant change in the way the cluster is designed.

The reason to avoid the reboot is that during a reboot resources need to be re-mastered and the remaining nodes must re-form the cluster. In a big cluster with many nodes this can be a very expensive operation, so Oracle fences the node by killing the offending processes: the cluster stack on that node shuts down, but the node itself does not. Once the I/O path or the network heartbeat is available again, the cluster stack is started again. The data is still protected, but without the pain of rebooting the nodes. In cases where a reboot is needed to protect integrity, the cluster will still decide to reboot the node.

CPU Scheduling & Starvation

The ocssd.bin process is responsible for both the disk heartbeat and the network heartbeat. But there are situations where certain processes chew up the CPU and starve the system, blocking the network and disk heartbeat pings. Two processes, cssdmonitor and cssdagent, are responsible for checking CPU scheduling, ocssd.bin hangs, and hardware and I/O path hangs. In prior versions oprocd was responsible for checking CPU scheduling, and Oracle recommended changing diagwait to 13 seconds to allow enough time both to detect the hang and to dump memory information for debugging why the node went down. The diagwait parameter no longer needs to be set explicitly, as Oracle controls cssdmonitor and cssdagent with undocumented values. As you might have guessed, if cssdmonitor and cssdagent are not running on the cluster (or have been killed) and stay down for more than 27 seconds (misscount - reboottime, i.e. the SDTO), the node will be evicted.

Summary

As we have seen, there are three major areas to look at for node eviction: heartbeat failure, voting disk access failure, and CPU scheduling starvation or a hang. Oracle 11g R2 offers much improved cluster resiliency and can protect the cluster without rebooting during a split brain situation. Reboot-less node fencing helps keep the cluster far more stable than 11g R1 clusters.

Author Michael Rajendran is an Oracle Certified Master who has been working with Oracle technologies for more than 13 years. He can be reached at mikerajendran@gmail.com.

Comments

Hi, Mike. I've been trying to find more information about the implementation of i/o fencing without a reboot that you describe. So far, all I can find is techniques for doing it if you have IPMI hardware installed and configured. Are you saying that this is implemented without IPMI? Can you point me to any docs?
Thank you for bringing this topic up.
John.

Hello John - My apologies for the late response. Reboot-less node fencing in 11g R2 Grid Infrastructure is a member kill from the cluster without a node termination. The cluster re-forms without the victim node, which has lost the heartbeat (or timed out accessing the voting disks). The kill block code is executed by the offending node to take itself out of the cluster without rebooting. In all reboot-less fencing, the member kill is coordinated by Clusterware (on the local or a remote node) and the operating system on the victim node. In certain cases, however, the member kill has to be escalated to a node termination that is executed without waiting for (or in the absence of) Clusterware and the operating system. In such cases the node needs to be terminated through IPMI, which can power cycle the server with remote commands.

IPMI (Intelligent Platform Management Interface) deserves a separate article. IPMI is used to manage the system remotely in the absence of an OS, so Clusterware can use it to reboot a node for I/O fencing. To configure IPMI in 11g R2, the power must be on, the host must be on the network (IPMI needs a separate IP, and the best network is the management network, preferably using DHCP), and the server must have a Baseboard Management Controller (BMC) with firmware compatible with IPMI 1.5. It also needs a username and password, which are used during a node eviction operation. The CSSD in the larger sub-cluster (the evicting node) communicates over the LAN with the BMC of the node in the smaller sub-cluster (the victim node), using the username / password, to reboot that node.

Michael Rajendran
http://www.unbreakablecloud.com