This is a follow-up to my DRBD replication posts, covering how to resynchronize the sites after a failed DR site is recovered.
More than a year ago I described in a long tutorial how to use DRBD in a real production environment and the challenges associated with it. See: DRBD based disk replication of a production cluster to a remote cluster on RHEL6
Sometimes the DR site, where the slave drbd service receives the replication data, becomes unavailable. This can happen because of a fatal failure of the DR site itself or because of a communication failure between the sites.
In both cases the DR site is considered failed, and after a time-out the drbd master node will consider the slave node unavailable.
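How quickly this happens is governed by the network time-outs in the net section of the DRBD resource configuration. As a rough sketch, for a resource named repdata replicated with protocol A as in this setup (the values shown are the DRBD 8.4 defaults, not necessarily the ones used here):
resource repdata {
  protocol A;
  net {
    timeout     60;   # in tenths of a second: 6s without a reply and the peer is considered dead
    connect-int 10;   # seconds between reconnection attempts
    ping-int    10;   # seconds between DRBD keep-alive packets
  }
}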
Once the split-brain is detected on the PR node, you will see something like this:
cat /proc/drbd
version: 8.4.4 (api:1/proto:86-101)
GIT-hash: 599f286440bd633d15d5ff985204aff4bccffadd build by phil@Build64R6, 2013-10-14 15:33:06
0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown A r-----
ns:265914648 nr:0 dw:538224448 dr:1510207293 al:79061 bm:2397 lo:0 pe:11 ua:0 ap:0 ep:1 wo:d oos:35371204
Where:
– cs:StandAlone = PR node detected the split brain and is standing alone
– ro:Primary/Unknown = PR knows it is the primary
– ds:UpToDate/DUnknown = PR is up to date and has no information about the DR
On the DR site we have something like this:
cat /proc/drbd
version: 8.4.4 (api:1/proto:86-101)
GIT-hash: 599f286440bd633d15d5ff985204aff4bccffadd build by phil@Build64R6, 2013-10-14 15:33:06
0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown A r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
Where:
– cs:WFConnection = DR node was not able to detect the split-brain before the connection to the PR was severed
– ro:Secondary/Unknown = DR knows it is the secondary
– ds:UpToDate/DUnknown = DR is up to date (in its opinion) and has no information about the PR
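Besides /proc/drbd, the connection state, roles and disk states can also be queried per resource with drbdadm on either node (the resource is named repdata in this setup, as seen in the status output further down):
# drbdadm cstate repdata
# drbdadm role repdata
# drbdadm dstate repdata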
To resynchronize the sites when the DR site is recovered, the following steps must be done.
Make sure that the cluster manager is stopped on the DR site by executing:
# service rgmanager stop
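Since the DR site consists of two cluster nodes, rgmanager should be stopped on both of them; that it is really down can be verified, for example, with:
# service rgmanager status
# clustat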
Add the DR database cluster IP (normally managed as a cluster resource) on the first cluster node on DR:
# ifconfig bond0:1 172.20.101.19/28
Activate the volume group vg_data on the first cluster node on DR:
# vgscan
# vgchange -ay vg_data
Activate the logical volume lv_data on the first cluster node on DR:
# lvscan
# lvchange -a y vg_data/lv_data
Note: In case we cannot see or activate vg_data/lv_data (error: “Not activating vg_data/lv_data since it does not pass activation filter.”) we can force the rediscovery of the VG resource.
Edit lvm.conf, add the entry “vg_data” under the volume_list directive (a sample line is shown after the commands below), and re-execute the above commands:
# vi /etc/lvm/lvm.conf
# vgscan
# vgchange -ay vg_data
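As a sketch, after the edit the volume_list line in /etc/lvm/lvm.conf could look like the following (the root VG name “VolGroup00” and the “@dr-node1” hostname tag are assumptions taken from a typical RHEL6 HA-LVM configuration; “vg_data” is the entry we add):
volume_list = [ "VolGroup00", "@dr-node1", "vg_data" ]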
Start the drbd service on the first cluster node on DR. The drbd service will start as a slave node, the same way it was running before the incident:
# service drbd start
Wait for the two sites to synchronize. The synchronization process can be followed by monitoring the drbd proc device:
# watch -n 1 cat /proc/drbd
When the sites are synchronized, stop the drbd service:
# service drbd stop
Deactivate the logical volume lv_data on the first cluster node on DR:
# lvscan
# lvchange -a n vg_data/lv_data
Deactivate the volume group vg_data on the first cluster node on DR:
# vgscan
# vgchange -an vg_data
Remove the DR database cluster IP from the first cluster node on DR:
# ifconfig bond0:1 down
Start the cluster manager on both cluster nodes on DR:
# service rgmanager start
Check if the DRBD_Slave_Service is started on the DR cluster:
# clustat
If the DRBD_Slave_Service is not started on the DR cluster, force-start it:
# clusvcadm -e DRBD_Slave_Service -F
Note:
In this case we changed the lvm.conf file, so it no longer matches the copy embedded in the initrd.
After a reboot of the node we are going to hit the “HA LVM: Improper setup detected” issue when we try to start any cluster service on this node:
Apr 22 13:07:30 PRODB rgmanager[7864]: [lvm] HA LVM: Improper setup detected
Apr 22 13:07:30 PRODB rgmanager[7884]: [lvm] * initrd image needs to be newer than lvm.conf
This issue is described in Red Hat Solution 21622.
Basically the solution is to regenerate the initrd image for the current kernel on this node.
Make a backup of the image:
$ cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.$(date +%m-%d-%H%M%S).bak
Now rebuild the initramfs for the current kernel version:
$ dracut -f -v
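The rgmanager check is essentially a timestamp comparison, so we can confirm that the new image is indeed newer than the modified lvm.conf:
$ ls -l /etc/lvm/lvm.conf /boot/initramfs-$(uname -r).img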
Update:
There is a case where, during the split-brain recovery, the connection between the sites dies again.
In that situation the PR site will look like this:
cat /proc/drbd
version: 8.4.4 (api:1/proto:86-101)
GIT-hash: 599f286440bd633d15d5ff985204aff4bccffadd build by phil@Build64R6, 2013-10-14 15:33:06
0: cs:Timeout ro:Primary/Unknown ds:UpToDate/DUnknown A r-----
ns:265914648 nr:0 dw:538224448 dr:1510207293 al:79061 bm:2397 lo:0 pe:11 ua:0 ap:0 ep:1 wo:d oos:35371204
Where:
– cs:Timeout = PR node tried to resynchronize the DR site but the operation timed out
– ro:Primary/Unknown = PR knows it is the primary
– ds:UpToDate/DUnknown = PR is up to date and has no information about the DR
In this case we first have to make sure that the connection between the sites is OK, and then simply restart the drbd service on the PR site.
After the restart the PR site will automatically try to resynchronize.
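For example, on the PR node (the status listing below matches the format printed by the drbd init script's status command):
# service drbd restart
# service drbd status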
version: 8.4.4 (api:1/proto:86-101)
GIT-hash: 599f286440bd633d15d5ff985204aff4bccffadd build by phil@Build64R6, 2013-10-14 15:33:06
m:res cs ro ds p mounted fstype
0:repdata SyncSource Primary/Secondary UpToDate/Inconsistent A /data ext4
... sync'ed: 4.7% (32944/34556)M
On the DR site we can also see the progress:
cat /proc/drbd
version: 8.4.4 (api:1/proto:86-101)
GIT-hash: 599f286440bd633d15d5ff985204aff4bccffadd build by phil@Build64R6, 2013-10-14 15:33:06
0: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate A r-----
ns:0 nr:1862072 dw:1862036 dr:1838940 al:0 bm:238 lo:1 pe:0 ua:1 ap:0 ep:1 wo:d oos:33548452
[>...................] sync'ed: 5.3% (32760/34556)M
finish: 2:33:31 speed: 3,632 (3,260) want: 1,120 K/sec