Sunday, January 23, 2011

HACMP verification and Synchronization


A few points about HACMP verification and synchronization, which I think some people still have doubts about.

 Verifying and synchronizing your HACMP cluster assures you that all resources used by HACMP are configured appropriately and that rules regarding resource ownership and resource takeover are in agreement across all nodes. You should verify and synchronize your cluster configuration after making any change within a cluster, for example, any change to the hardware, operating system, node configuration, or cluster configuration.

Whenever you configure, reconfigure, or update a cluster, run the cluster verification procedure to ensure that all nodes agree on the cluster topology, network configuration, and the ownership and takeover of HACMP resources. If the verification succeeds, the configuration can be synchronized. Synchronization takes effect immediately on an active cluster: a dynamic reconfiguration event is run and the changes are committed to the active cluster.


Note:
 If you are using the SMIT Initialization and Standard Configuration path, synchronization automatically follows a successful verification. If you are using the Extended Configuration path, you have more options for types of verification. If you are using the Problem Determination Tools path, you can choose whether to synchronize or not.

Typically, the verification log is written to /var/hacmp/clverify/clverify.log



Running Cluster Verification
After making a change to the cluster, you can perform cluster verification in several ways.

These methods include:

Automatic verification:
 You can automatically verify your cluster:
       Each time you start cluster services on a node
       Each time a node rejoins the cluster
       Every 24 hours.

       By default, automatic verification is enabled to run at midnight.


Manual verification:
 Using the SMIT interface,
       you can either verify the complete configuration,
       or only the changes made since the last time the utility was run.

       Typically, you should run verification whenever you add or change anything in your
       cluster configuration. For detailed instructions, see Verifying the HACMP configuration
       using SMIT.

Automatic Verification:
 You can disable automatic verification during cluster startup under
 Extended Configuration >> Extended Cluster Service Settings. Do not disable it unless you have been advised to.


Understanding the Verification Process

The phases of the verification and synchronization process are as follows:

Verification
Snapshot (optional)
Synchronization.


Phase one: Verification
During the verification process the default system configuration directory (DCD) is compared
with the active configuration. On an inactive cluster node, the verification process compares
the local DCD across all nodes. On an active cluster node, verification propagates a copy of
the active configuration to the joining nodes.

If a node that was once previously synchronized has a DCD that does not match the ACD of an already
active cluster node, the ACD of an active node is propagated to the joining node. This new information
does not replace the DCD of the joining nodes; it is stored in a temporary directory for the purpose
of running verification against it.

HACMP displays progress indicators as the verification is performed.

Note: When you attempt to start a node that has an invalid cluster configuration, HACMP transfers a
valid configuration database data structure to it, which may consume 1-2 MB of disk space. If the
verification phase fails, cluster services will not start.

Phase two: (Optional) Snapshot
A snapshot is only taken if a node's request to start requires an updated configuration. During the
snapshot phase of verification, HACMP records the current cluster configuration to a snapshot file
for backup purposes. HACMP names this snapshot file according to the date of the snapshot and the
name of the cluster. Only one snapshot is created per day. If a snapshot file exists and its filename
contains the current date, it will not be overwritten.

This snapshot is written to the /usr/es/sbin/cluster/snapshots/ directory.

The snapshot filename uses the syntax MM-DD-YYYY-ClusterName-autosnap.odm. For example, a snapshot
taken on April 2, 2006 on a cluster hacluster01 would be named
/usr/es/sbin/cluster/snapshots/04-02-06-hacluster01-autosnap.odm.
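
The naming rule and the one-snapshot-per-day rule can be sketched in a few lines of Python. This is only an illustration, not HACMP's own code: the function names are hypothetical, and the sketch follows the stated MM-DD-YYYY syntax literally, whereas the example above shows a two-digit year, so the real file name may differ.

```python
from datetime import date
from pathlib import PurePosixPath

# Directory named in the text above.
SNAPSHOT_DIR = PurePosixPath("/usr/es/sbin/cluster/snapshots")


def autosnap_path(cluster_name, day, snapshot_dir=SNAPSHOT_DIR):
    """Build a snapshot path from the stated MM-DD-YYYY-ClusterName-autosnap.odm syntax.

    Assumption: a four-digit year, as the syntax says.
    """
    return snapshot_dir / f"{day:%m-%d-%Y}-{cluster_name}-autosnap.odm"


def should_write_snapshot(cluster_name, day, existing_names):
    """Only one snapshot per day: skip if a file with today's name already exists."""
    return autosnap_path(cluster_name, day).name not in existing_names


existing = {"04-02-2006-hacluster01-autosnap.odm"}
print(autosnap_path("hacluster01", date(2006, 4, 2)))
print(should_write_snapshot("hacluster01", date(2006, 4, 2), existing))  # same day: not rewritten
print(should_write_snapshot("hacluster01", date(2006, 4, 3), existing))  # next day: new file
```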

Phase three: Synchronization
During the synchronization phase of verification, HACMP propagates information to all cluster nodes.
For an inactive cluster node, the DCD is propagated to the DCD of the other nodes. For an active
cluster node, the ACD is propagated to the DCD.

If the process succeeds, all nodes are synchronized and cluster services start. If synchronization
fails, cluster services do not start and HACMP issues an error.
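
To make the order of the three phases concrete, here is a toy model in Python. It is a sketch of the control flow described above, not anything HACMP runs: `run_phases` and its callable arguments are hypothetical names invented for the illustration.

```python
def run_phases(verify, needs_updated_config, snapshot, synchronize):
    """Toy model of the phase order: verify, optional snapshot, synchronize.

    verify and synchronize are callables returning True on success;
    snapshot is a callable with no return value. All are placeholders.
    """
    # Phase one: if verification fails, cluster services will not start.
    if not verify():
        return "verification failed: cluster services do not start"
    # Phase two: a snapshot is only taken when an updated configuration is required.
    if needs_updated_config:
        snapshot()
    # Phase three: if synchronization fails, cluster services do not start.
    if not synchronize():
        return "synchronization failed: cluster services do not start"
    return "synchronized: cluster services start"


taken = []
print(run_phases(lambda: True, True, lambda: taken.append("snapshot"), lambda: True))
print(taken)
```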


Conditions that can trigger Corrective Action :

https://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=/com.ibm.aix.hacmp.admngd/ha_admin_trigger_corrective.htm

This topic discusses conditions that can trigger a corrective action.

HACMP shared volume group time stamps are not up-to-date on a node
If the shared volume group time stamp file does not exist on a node, or the time stamp files do not match on all nodes, the corrective action ensures that all nodes have the latest up-to-date VGDA time stamp for the volume group and imports the volume group on all cluster nodes where the shared volume group was out of sync with the latest volume group changes. The corrective action ensures that volume groups whose definitions have changed will be properly imported on a node that does not have the latest definition.

The /etc/hosts file on a node does not contain all HACMP-managed IP addresses
If an IP label is missing, the corrective action modifies the file to add the entry and saves a copy of the old version to /etc/hosts.date. If a backup file already exists for that day, no additional backups are made for that day.

Verification does the following:

If the /etc/hosts entry exists but is commented out, verification adds a new entry; comment lines are ignored.
If the label specified in the HACMP Configuration does not exist in /etc/hosts , but the IP address is defined in /etc/hosts, the label is added to the existing /etc/hosts entry. If the label is different between /etc/hosts and the HACMP configuration, then verification reports a different error message; no corrective action is taken.
If the entry does not exist, meaning both the IP address and the label are missing from /etc/hosts, then the entry is added. This corrective action takes place on a node-by-node basis. If different nodes report different IP labels for the same IP address, verification catches these cases and reports an error. However, this error is unrelated to this corrective action. Inconsistent definitions of an IP label defined to HACMP are not corrected.
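
The /etc/hosts cases above can be restated as a small decision function. This is a sketch of the rules as I read them, not HACMP's implementation; the function names and the sample addresses are made up, and treating "label found on a different IP address" as the report-only case is my interpretation of the text.

```python
def parse_hosts(text):
    """Map each IP address to its list of names; comment lines and trailing comments are ignored."""
    entries = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            entries.setdefault(fields[0], []).extend(fields[1:])
    return entries


def hosts_action(text, ip, label):
    """Return which of the described outcomes applies to one HACMP-managed IP label."""
    entries = parse_hosts(text)
    if label in entries.get(ip, []):
        return "ok"
    for other_ip, names in entries.items():
        if other_ip != ip and label in names:
            # Assumption: a label attached to another address is only reported.
            return "report_error"
    if ip in entries:
        return "append_label"  # IP is defined, label is missing from its entry
    return "add_entry"         # neither IP nor label present (or only commented out)


sample = """127.0.0.1 loopback localhost
#10.0.0.5 svc1
10.0.0.6 nodeb
10.0.0.7 svc3 # service address
"""
for ip, label in [("10.0.0.5", "svc1"), ("10.0.0.6", "svc2"), ("10.0.0.7", "svc3")]:
    print(ip, label, hosts_action(sample, ip, label))
```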

SSA concurrent volume groups need unique SSA node numbers
If verification finds that the SSA node numbers are not unique, the corrective action changes the number of one of the nodes where the number is not unique. See the Installation Guide for more information on SSA configuration.

A file system is not created on a node, although disks are available
If a file system has not been created on one of the cluster nodes, but the volume group is available, the corrective action creates the mount point and file system. The file system must be part of a resource group for this action to take place. In addition, the following conditions must be met:

This is a shared volume group.
The volume group must already exist on at least one node.
One or more node(s) that participate in the resource group where the file system is defined must already have the file system created.
The file system must already exist within the logical volume on the volume group in such a way that simply re-importing that volume group would acquire the necessary file system information.
The mount point directory must already exist on the node where the file system does not exist.

The corrective action handles only those mount points that are on a shared volume group, such that exporting and re-importing of the volume group will acquire the missing file systems available on that volume group. The volume group is varied off on the remote node(s), or the cluster is down and the volume group is then varied off if it is currently varied on, prior to executing this corrective action.

If Mount All File Systems is specified in the resource group, the node with the latest time stamp is used to compare the list of file systems that exists on that node with other nodes in the cluster. If any node is missing a file system, then HACMP imports the file system.
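
The comparison described for Mount All File Systems can be sketched as follows. This is only an illustration of the rule (the node with the latest time stamp is the reference); the data layout, the function name, and the sample node names are assumptions.

```python
def missing_file_systems(nodes):
    """Compare every node against the node with the latest time stamp.

    nodes maps a node name to (time_stamp, set_of_file_systems).
    Returns {node: sorted list of file systems it lacks}.
    """
    reference = max(nodes, key=lambda name: nodes[name][0])
    expected = nodes[reference][1]
    return {
        name: sorted(expected - file_systems)
        for name, (_, file_systems) in nodes.items()
        if expected - file_systems
    }


cluster = {
    "nodea": (200, {"/data", "/logs"}),  # latest time stamp: used as reference
    "nodeb": (100, {"/data"}),
}
print(missing_file_systems(cluster))
```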

Disks are available, but the volume group has not been imported to a node
If the disks are available but the volume group has not been imported to a node that participates in a resource group where the volume group is defined, then the corrective action imports the volume group.

The corrective action gets the information regarding the disks and the volume group major number from a node that already has the volume group available. If the major number is unavailable on a node, the next available number is used.

The corrective action is only performed under the following conditions:

The cluster is down.
The volume group is varied off if it is currently varied on.
The volume group is defined as a resource in a resource group.
The major number and associated PVIDs for the disks can be acquired from a cluster node that participates in the resource group where the volume group is defined.

Note: This functionality will not turn off the auto varyon flag if the volume group has the attribute set. A separate corrective action handles auto varyon.

Shared volume groups configured as part of an HACMP resource group have their automatic varyon attribute set to Yes.
If verification finds that a shared volume group inadvertently has the auto varyon attribute set to Yes on any node, the corrective action automatically sets the attribute to No on that node.

Required /etc/services entries are missing on a node
If a required entry is commented out, missing, or invalid in /etc/services on a node, the corrective action adds it. Required entries are:

Name            Port  Protocol
topsvcs         6178  udp
grpsvcs         6179  udp
clinfo_deadman  6176  udp
clcomd          6191  tcp
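
If you want to check a node by hand before verification does it for you, something like the following works on the text of /etc/services. It is a minimal sketch: the required entries are copied from the table above, while the function name and the sample input are hypothetical.

```python
# Required entries, copied from the table above: name -> (port, protocol).
REQUIRED_SERVICES = {
    "topsvcs": (6178, "udp"),
    "grpsvcs": (6179, "udp"),
    "clinfo_deadman": (6176, "udp"),
    "clcomd": (6191, "tcp"),
}


def missing_services(services_text, required=REQUIRED_SERVICES):
    """Return required service names that are commented out, missing, or invalid."""
    found = set()
    for line in services_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments
        if len(fields) < 2 or "/" not in fields[1]:
            continue
        port, _, proto = fields[1].partition("/")
        if port.isdigit():
            found.add((fields[0], int(port), proto.lower()))
    return sorted(
        name for name, (port, proto) in required.items()
        if (name, port, proto) not in found
    )


sample = """topsvcs 6178/udp
#grpsvcs 6179/udp
clinfo_deadman 6176/tcp
clcomd 6191/tcp # cluster communication daemon
"""
print(missing_services(sample))
```

On a real node you would pass in the contents of /etc/services, for example `missing_services(open("/etc/services").read())`.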

Required HACMP snmpd entries are missing on a node
If a required entry is commented out, missing, or invalid on a node, the corrective action adds it.

Note: The default version of the snmpd.conf file for AIX® is snmpdv3.conf.
In /etc/snmpdv3.conf or /etc/snmpd.conf, the required HACMP snmpd entry is:

smux   1.3.6.1.4.1.2.3.1.2.1.5   clsmuxpd_password # HACMP/ES for AIX clsmuxpd

In /etc/snmpd.peers, the required HACMP snmpd entry is:

clsmuxpd   1.3.6.1.4.1.2.3.1.2.1.5 "clsmuxpd_password" # HACMP/ES for AIX clsmuxpd

If changes are required to the /etc/snmpd.peers or snmpd[v3].conf file, HACMP creates a backup of the original file. A copy of the pre-existing version is saved prior to making modifications in the file /etc/snmpd.{peers | conf}.date. If a backup has already been made of the original file, then no additional backups are made.

HACMP makes one backup per day for each snmpd configuration file. As a result, running verification a number of times in one day only produces one backup file for each file modified. If no configuration files are changed, HACMP does not make a backup.

Required RSCT network options settings
HACMP requires that the nonlocsrcroute, ipsrcroutesend, ipsrcrouterecv, and ipsrcrouteforward network options be set to 1; these are set by RSCT's topsvcs startup script. The corrective action run on inactive cluster nodes ensures these options are not disabled and are set correctly.
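
A quick way to check these four options yourself is to parse the output of the AIX `no` command. The sketch below assumes the output consists of `name = value` lines, as `no -a` prints them; the function names and the sample text are hypothetical, and nothing here changes any setting.

```python
# The four options named in the text above.
RSCT_OPTIONS = ("nonlocsrcroute", "ipsrcroutesend", "ipsrcrouterecv", "ipsrcrouteforward")


def parse_no_output(text):
    """Parse 'name = value' lines (assumed format of `no -a` output) into a dict."""
    options = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            options[name.strip()] = value.strip()
    return options


def options_not_set_to_one(text, names=RSCT_OPTIONS):
    """Return the named options that are absent or not equal to 1."""
    options = parse_no_output(text)
    return [name for name in names if options.get(name) != "1"]


sample = """   nonlocsrcroute = 1
   ipsrcroutesend = 1
   ipsrcrouterecv = 0
"""
print(options_not_set_to_one(sample))
```

The same helper can check any other option that must be 1 by passing a different `names` tuple.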

Required HACMP network options setting
The corrective action ensures that the value of each of the following network options is consistent across all nodes in a running cluster (out-of-sync setting on any node is corrected):

tcp_pmtu_discover
udp_pmtu_discover
ipignoreredirects
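
The consistency rule for these three options can be illustrated with a short comparison across nodes. This is a sketch only: it assumes you have already collected each node's values into a dictionary (how you collect them is up to you), and the function name and node names are invented.

```python
# The three options listed above.
HACMP_OPTIONS = ("tcp_pmtu_discover", "udp_pmtu_discover", "ipignoreredirects")


def out_of_sync_options(per_node, names=HACMP_OPTIONS):
    """per_node maps node name -> {option: value}.

    Returns {option: {node: value}} for every option whose value is not
    the same on all nodes.
    """
    result = {}
    for name in names:
        values = {node: options.get(name) for node, options in per_node.items()}
        if len(set(values.values())) > 1:
            result[name] = values
    return result


collected = {
    "nodea": {"tcp_pmtu_discover": "0", "udp_pmtu_discover": "0", "ipignoreredirects": "1"},
    "nodeb": {"tcp_pmtu_discover": "0", "udp_pmtu_discover": "0", "ipignoreredirects": "0"},
}
print(out_of_sync_options(collected))
```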

Required routerevalidate network option setting
Changing hardware and IP addresses within HACMP changes and deletes routes. Because AIX caches routes, setting the routerevalidate network option is required as follows:

no -o routerevalidate=1

This setting ensures the maintenance of communication between cluster nodes. Verification run with corrective action automatically adjusts this setting for nodes in a running cluster.

Note: No corrective actions take place during a dynamic reconfiguration event.

Corrective actions when using IPv6
If you configure an IPv6 address, the verification process can perform two more corrective actions:

Neighbor discovery (ND). Network interfaces must support this protocol, which is specific to IPv6. The underlying network interface card is checked for compatibility with ND, and the ND-related daemons will be started.
Configuration of link local (LL) addresses. A special LL address is required for every network interface that will be used with IPv6 addresses. If an LL address is not present, the autoconf6 program will be run to configure one.

