# HAST and ZFS with CARP failover



## gkontos (Feb 8, 2012)

HAST (Highly Available Storage) is a new concept for FreeBSD and it is under constant development. HAST allows to transparently store data on two physically separated machines connected over the TCP/IP network. HAST operates on block level making it transparent for file systems, providing disk-like devices in /dev/hast directory.

In this article we will create two identical HAST nodes, hast1 and hast2. Both devices will use one NIC connected to a vlan for data synchronization and another NIC will be configured via CARP in order to share the same IP address across the network. The first node will be called storage1.hast.test, the second storage2.hast.test and they will both listen to a common IP address which we will bind to storage.hast.test

HAST binds its resource names according to the machine's hostname. Therefore, we will use hast1.freebsd.loc and hast2.freebsd.loc  as the machines hostnames so that HAST can operate without complaining.

For starters, lets set up two identical nodes. For this example I have installed FreeBSD 9.0-RELEASE on two deferent instances using a Linux KVM. Both nodes have 512MB of RAM, one SATA drive containing the OS and three SATA drives which will be used to create our shared Raidz1 pool. 

In order for carp to work we don't have to compile a new kernel. We can just load it as a module by adding to /boot/loader.conf


```
if_carp_load="YES"
```

Our both nodes are set up, it is time to make some adjustments. First a descent /etc/rc.conf for the first node:


```
zfs_enable="YES"

###Primary Interface##
ifconfig_re0="inet 10.10.10.181  netmask 255.255.255.0"

###Secondary Interface for HAST###
ifconfig_re1="inet 192.168.100.100  netmask 255.255.255.0"

defaultrouter="10.10.10.1"
sshd_enable="YES"
hostname="hast1.freebsd.loc"

##CARP INTERFACE SETUP##
cloned_interfaces="carp0"
ifconfig_carp0="inet 10.10.10.180 netmask 255.255.255.0 vhid 1 pass mypassword advskew 0"

hastd_enable=YES
```

The second node we will also much the first except for the IP addressing:


```
zfs_enable="YES"

###Primary Interface##
ifconfig_re0="inet 10.10.10.182  netmask 255.255.255.0"

###Secondary Interface for HAST###
ifconfig_re1="inet 192.168.100.101  netmask 255.255.255.0"

defaultrouter="10.10.10.1"
sshd_enable="YES"
hostname="hast2.freebsd.loc"

##CARP INTERFACE SETUP##
cloned_interfaces="carp0"
ifconfig_carp0="inet 10.10.10.180 netmask 255.255.255.0 vhid 1 pass mypassword advskew 0"

hastd_enable=YES
```

At this point we have assigned re1 with two IPs for HAST synchronization. We have also assigned two IPs to re0 which in turn we share with a third common IP assigned to carp0.
As a result, re1 is being used for HAST synchronization in a vlan while carp0 which is cloned by re0  used under the same vlan with the rest of our clients.  

In order for HAST to function correctly we have to resolve the correct IPs on every node. We don't want to rely on DNS for this because DNS can fail. Instead we will use /etc/hosts same on every node.


```
::1			localhost localhost.freebsd.loc
127.0.0.1		localhost localhost.freebsd.loc
192.168.100.100		hast1.freebsd.loc hast1
192.168.100.101		hast2.freebsd.loc hast2

10.10.10.181          	storage1.hast.test storage1
10.10.10.182          	storage2.hast.test storage2
10.10.10.180	      	storage.hast.test  storage
```

Next, we have to create the /etc/hast.conf file. Here we will declare the resources that we want to create. All resources will eventually create devices located under /dev/hast on the primary node. Every resource indicates a physical device specifying a local and remote IP device. The /etc/hast.conf must be exactly the same on every node.


```
resource disk1 {
        on hast1 {
                local /dev/ad1
                remote hast2
        }
        on  hast2 {
                local /dev/ad1
                remote hast1
        }
}

resource disk2 {
        on  hast1 {
                local /dev/ad2
                remote hast2
        }
        on  hast2 {
                local /dev/ad2
                remote hast1
        }
}

resource disk3 {
        on  hast1 {
                local /dev/ad3
                remote hast2
        }
        on  hast2 {
                local /dev/ad3
                remote hast1
        }
}
```

In this example we are sharing three resources, disk1, disk2 and disk3. Each resource indicates a device the local and the remote IP address. With this configuration in place, we are ready to begin setting up out HAST devices.

Lets start hastd on both nodes first:


```
hast1#/etc/rc.d/hastd start
```


```
hast2#/etc/rc.d/hastd start
```

Now on the primary node we will initialize our resources, create them and finally assign a primary role:


```
hast1#hastctl role init disk1
hast1#hastctl role init disk2
hast1#hastctl role init disk3
hast1#hastctl create disk1
hast1#hastctl create disk2
hast1#hastctl create disk3
hast1#hastctl role primary disk1
hast1#hastctl role primary disk2
hast1#hastctl role primary disk3
```

Next, on the secondary node we will initialize our resources, create them and finally assign a secondary role:


```
hast2#hastctl role init disk1
hast2#hastctl role init disk2
hast2#hastctl role init disk3
hast2#hastctl create disk1
hast2#hastctl create disk2
hast2#hastctl create disk3
hast2#hastctl role secondary disk1
hast2#hastctl role secondary disk2
hast2#hastctl role secondary disk3
```

There are other ways for creating and assigning roles to each resource. Having repeat this procedure a few times, I saw that this usually always works.   

Now check the status on both nodes:


```
hast1# hastctl status
disk1:
  role: primary
  provname: disk1
  localpath: /dev/ada1
  ...
  remoteaddr: hast2
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
disk2:
  role: primary
  provname: disk2
  localpath: /dev/ada2
  ...
  remoteaddr: hast2
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
disk3:
  role: primary
  provname: disk3
  localpath: /dev/ada3
  ...
  remoteaddr: hast2
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
```

The first node looks good. Status is complete.


```
hast2# hastctl status
disk1:
  role: secondary
  provname: disk1
  localpath: /dev/ada1
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
disk2:
  role: secondary
  provname: disk2
  localpath: /dev/ada2
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
disk3:
  role: secondary
  provname: disk3
  localpath: /dev/ada3
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
```

So does the second. Like I mentioned earlier there are different ways for doing this the first time.  You have to look for the word 
status: complete. If you get a degraded status you can always repeat the procedure.

Now it is time to create our ZFS pool. The primary node should have a /dev/hast directory containing our resources. This directory appears only at the active node.


```
hast1# zpool create zhast raidz1 /dev/hast/disk1 /dev/hast/disk2 /dev/hast/disk3
hast1# zpool status zhast
 pool: zhast
 state: ONLINE
 scan: none requested
 config:

	NAME            STATE     READ WRITE CKSUM
	zhast           ONLINE       0     0     0
	  raidz1-0      ONLINE       0     0     0
	    hast/disk1  ONLINE       0     0     0
	    hast/disk2  ONLINE       0     0     0
	    hast/disk3  ONLINE       0     0     0
```

We can now use hastctl status on each node to see if everything looks ok. The magic word we are looking for here is: 
replication: fullsync

At this point both of our nodes should be available for failover. We have storage1 running as primary and sharing a pool called zhast. Our storage2 is currently in a standby mode. If we have set DNS properly we can ssh to storage.hast.test or by using its carp IP to 10.10.10.180.


----------



## gkontos (Feb 8, 2012)

*HAST and ZFS with CARP failover (Part2)*

In order to perform a failover we have to first export our pool from the first node, change the role of each resource to secondary. Then change the role of each resource to primary on the standby node and import the pool. This procedure will be done manually to test if failover really works. But for a real HA solution we will eventually create a script that will take care of this.

First lets export our pool and change our resources role:


```
hast1# zpool export zhast
hast1# hastctl role secondary disk1
hast1# hastctl role secondary disk2
hast1# hastctl role secondary disk3
```

Now, lets reverse the procedure on the standby node:


```
hast2# hastctl role primary disk1
hast2# hastctl role primary disk2
hast2# hastctl role primary disk3
hast2# zpool import zhast
```

The roles have successfully changed, lets see our pool status:


```
hast2# zpool status zhast
 pool: zhast
 state: ONLINE
 scan: none requested
 config:

	NAME            STATE     READ WRITE CKSUM
	zhast           ONLINE       0     0     0
	  raidz1-0      ONLINE       0     0     0
	    hast/disk1  ONLINE       0     0     0
	    hast/disk2  ONLINE       0     0     0
	    hast/disk3  ONLINE       0     0     0

errors: No known data errors
```

Again, by using hastctl status on each node we can verify that the roles have indeed changed and that the status is complete. This is a sample output from the second node now in charge:


```
hast2# hastctl status
disk1:
  role: primary
  provname: disk1
  localpath: /dev/ad1
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  ...
disk2:
  role: primary
  provname: disk2
  localpath: /dev/ad2
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  ...
disk3:
  role: primary
  provname: disk3
  localpath: /dev/ad3
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  ...
```

It is now time to automate this procedure. When do we want our servers to automatically failover?
One reason would be if the primary node is not responding to the external network thus not being able to serve its clients. Using a devd event we can catch a carp interface going up or down and a state change. 

Add the following lines to /etc/devd.conf on both nodes:


```
notify 30 {
	match "system" "IFNET";
	match "subsystem" "carp0";
	match "type" "LINK_UP";
	action "/usr/local/bin/failover master";
};

notify 30 {
	match "system" "IFNET";
	match "subsystem" "carp0";
	match "type" "LINK_DOWN";
	action "/usr/local/bin/failover slave";
};
```

And now lets create the failover script which will be responsible for doing automatically what we did before manually:


```
#!/bin/sh

# Original script by Freddie Cash <fjwcash@gmail.com>
# Modified by Michael W. Lucas <mwlucas@BlackHelicopters.org>
# and Viktor Petersson <vpetersson@wireload.net>
# Modified by George Kontostanos <gkontos.mail@gmail.com>

# The names of the HAST resources, as listed in /etc/hast.conf
resources="disk1 disk2 disk3"

# delay in mounting HAST resource after becoming master
# make your best guess
delay=3

# logging
log="local0.debug"
name="failover"
pool="zhast"

# end of user configurable stuff

case "$1" in
	master)
		logger -p $log -t $name "Switching to primary provider for ${resources}."
		sleep ${delay}

		# Wait for any "hastd secondary" processes to stop
		for disk in ${resources}; do
			while $( pgrep -lf "hastd: ${disk} \(secondary\)" > /dev/null 2>&1 ); do
				sleep 1
			done

			# Switch role for each disk
			hastctl role primary ${disk}
			if [ $? -ne 0 ]; then
				logger -p $log -t $name "Unable to change role to primary for resource ${disk}."
				exit 1
			fi
		done

		# Wait for the /dev/hast/* devices to appear
		for disk in ${resources}; do
			for I in $( jot 60 ); do
				[ -c "/dev/hast/${disk}" ] && break
				sleep 0.5
			done

			if [ ! -c "/dev/hast/${disk}" ]; then
				logger -p $log -t $name "GEOM provider /dev/hast/${disk} did not appear."
				exit 1
			fi
		done

		logger -p $log -t $name "Role for HAST resources ${resources} switched to primary."


		logger -p $log -t $name "Importing Pool"
		# Import ZFS pool. Do it forcibly as it remembers hostid of
                # the other cluster node.
                out=`zpool import -f "${pool}" 2>&1`
                if [ $? -ne 0 ]; then
                    logger -p local0.error -t hast "ZFS pool import for resource ${resource} failed: ${out}."
                    exit 1
                fi
                logger -p local0.debug -t hast "ZFS pool for resource ${resource} imported."

	;;

	slave)
		logger -p $log -t $name "Switching to secondary provider for ${resources}."

		# Switch roles for the HAST resources
		zpool list | egrep -q "^${pool} "
        	if [ $? -eq 0 ]; then
                	# Forcibly export file pool.
                	out=`zpool export -f "${pool}" 2>&1`
               		 if [ $? -ne 0 ]; then
                        	logger -p local0.error -t hast "Unable to export pool for resource ${resource}: ${out}."
                        	exit 1
                	 fi
                	logger -p local0.debug -t hast "ZFS pool for resource ${resource} exported."
        	fi
		for disk in ${resources}; do
			sleep $delay
			hastctl role secondary ${disk} 2>&1
			if [ $? -ne 0 ]; then
				logger -p $log -t $name "Unable to switch role to secondary for resource ${disk}."
				exit 1
			fi
			logger -p $log -t $name "Role switched to secondary for resource ${disk}."
		done
	;;
esac
```

Let's try it and see if it works. Log into both the currently active and standby node. Make sure that you are on the active by issuing a hastctl status command. Then force a failover by bringing the interface which is associated with carp0 down.


```
hast1# ifconfig er0 down
```

Watch at the generated messages:


```
hast1# tail -f /var/log/debug.log

Feb  6 15:01:41 hast1 failover: Switching to secondary provider for disk1 disk2 disk3.
Feb  6 15:01:49 hast1 hast: ZFS pool for resource  exported.
Feb  6 15:01:52 hast1 failover: Role switched to secondary for resource disk1.
Feb  6 15:01:55 hast1 failover: Role switched to secondary for resource disk2.
Feb  6 15:01:58 hast1 failover: Role switched to secondary for resource disk3.
```


```
hast2# tail -f /var/log/debug.log

Feb  6 15:02:15 hast2 failover: Switching to primary provider for disk1 disk2 disk3.
Feb  6 15:02:19 hast2 failover: Role for HAST resources disk1 disk2 disk3 switched to primary.
Feb  6 15:02:19 hast2 failover: Importing Pool
Feb  6 15:02:52 hast2 hast: ZFS pool for resource  imported.
```

Voila! The failover worked like a charm and now hast2 has assumed the primary role.


*Further considerations:​*
What we did today is a basic setup of two nodes sharing a raidz1 pool with automatic role failover in case of a failure that would result in a loss of a carp interface.

Obviously, a similar devd event would be generated in case we loose a HAST replication interface. This is something that needs to be addressed similarly since losing that interface will leave us with no synchronization at all.

Going further, we would have to add scripts that will bring up and down services during a failover.

Original article: http://www.aisecure.net/2012/02/07/hast-freebsd-zfs-with-carp-failover/
Resources: MICHAEL W LUCAS, The Freebsd Handbook


----------



## Sylhouette (Mar 2, 2012)

Did you also test a sudden reboot of the master? If *I* do this, then _I_ get in all kinds of trouble.

Mainly because the CARP interface starts in master mode after a reboot and hence will execute the master script, even if it is not master. Then the trouble starts and you get a split brain scenario.

Regards,
Johan


----------



## gkontos (Mar 2, 2012)

Sylhouette said:
			
		

> Did you also test a sudden reboot of the master.
> If i do this, then i get in all kind of troubles.
> 
> Mainly because the carp interface starts in master mode after a reboot and hence will execute the master script, even if it is not master.
> ...



I did some tests running net/samba36 with both machines sharing /zhast. File sharing service was enabled in both machines. 

The connection was established via the CARP IP. During the reboot of the master there was an obvious delay until the pool becomes available to the secondary machine but that was solved by a client reset. 

After the node came up, CARP did not assign the master role therefore I always had to perform a manual fail back.

Which FreeBSD version are you using?
Do you by any chance have net.inet.carp.preempt=1 in your /etc/sysctl.conf?


----------



## phoenix (Mar 2, 2012)

Sylhouette said:
			
		

> Did you also test a sudden reboot of the master? If *I* do this, then _I_ get in all kinds of trouble.
> 
> Mainly because the CARP interface starts in master mode after a reboot and hence will execute the master script, even if it is not master. Then the trouble starts and you get a split brain scenario.



This is a known bug with CARP and is being worked on.  The interim fix is to *not* enable the preempt sysctl for CARP.


----------



## gkontos (Mar 4, 2012)

A quick update. There is a commit in 9-STABLE which allows the user to set the state of the carp cluster.  

Link: http://svnweb.freebsd.org/base?view=revision&revision=232486


----------



## johnd (Mar 22, 2012)

Great work!
There is a little typo, but it doesnÂ´t affect the script. Look for ${resource} which should be ${resources}.


----------



## balboah (Mar 28, 2012)

*hast or not hast*

I've been using a version of this guide to set up my own replication testing in two Xen guests. Having disk1 and disk2 set up in HAST I've created a pool with mirror devices. This works most of the time, and all of the time if everything is shut down cleanly.

But for testing I've also tried resetting the HAST master in the middle of writing to a new file, which can get me into troubles. Once the ZFS metadata got corrupted which meant it rolled back a couple of minutes after forcing import with *zfs import -F*.

Another time it completely locks up on *zfs import* with state tx->tx. Rendering all zfs tools unusable since they all lock up and wait for this import. The same thing happens on both machines even after reboots etc.

So I'm currently wondering if this method really is reliable enough or if I should go the snapshot sync route without HAST.


----------



## gkontos (Mar 28, 2012)

If you forcibly export a pool during heavy I/O operations then you will eventually end up with corrupted metadata. 

This means that you should never initiate a manual failover during I/O operations. 

What happens though if the primary node crashes?

The secondary node will try to import the pool and most probably it will unless a heave corruption has occurred. In that case you can use different import techniques and heal the pool.


----------



## balboah (Mar 30, 2012)

In my recent tests I've simply done *ifconfig down* or hard reset while a client is copying files to it via NFS.

More than once I've gotten metadata corruption and errors which when trying *zfs import -F* tells me to restore the pool from a backup and refuses to import it.

Seems a bit sketchy to me as the whole point of doing this in my case is to have a reliable backup machine in case the primary burns up. Also to stall NFS clients until the secondary comes up, which works as long as ZFS doesn't get corrupted.

But I have only tried this in a virtual environment using this setup:

two virtual machines running with 1G of ram as Xen guests.
two ZFS mirrored virtual drives via hast devices.
CARP setup which is monitored by devd and that executes my script similar to the one in the article, with additions to start/stop nfsd.
Also I'm having these errors which might break things:

```
Mar 29 10:01:01 storage1 hastd[6690]: [disk2] (primary) Remote request failed (Operation not supported by device): FLUSH.
Mar 29 10:01:02 storage1 hastd[6690]: [disk2] (primary) Unable to flush disk cache on activemap update: Operation not supported by device.
```

My idea was to apply this on real machines with raidz of 3 drives, which later would be expanded by an additional three drives. I'm wondering if anyone has used this setup on real machines and in production?


----------



## mgiammarco (Mar 31, 2012)

Hello,
I am interested too in this setup. I have tried it before with linux/pacemaker and I ask these questions:

1) When I import/export a zfs from master to slave also nfs/cifs setup is imported/exported?

2) Latency of slave server kills write performance (at least with linux/drbd). I plan to put on slave server a battery backupped ram hard disk. Can I tell zfs to use it as zil/log and always write on it, then later copy to zfs volume (slave hdds may for example on standby)

3) Is hast stable?

Thanks,
Mario


----------



## mgiammarco (Mar 31, 2012)

mgiammarco said:
			
		

> 2) Latency of slave server kills write performance (at least with linux/drbd). I plan to put on slave server a battery backupped ram hard disk. Can I tell zfs to use it as zil/log and always write on it, then later copy to zfs volume (slave hdds may for example on standby)


Sorry I made a mistake. ZFS does not replicate data to slave. I need that HAST can quickly copy data to a log and then to HDDs.


----------



## gkontos (Apr 1, 2012)

@mgiammarco,

1) During a failover the resources change roles. This means that your storage becomes unavailable in machine#1 and available in machine#2. Please note that the resources can only be available to one machine only, the primary. This means that some services that depend on that data might complain. So, you might need to start /stop those servers as well.

2) I don't understand.

3) This is very difficult to answer. Why? Because until a technology is used enough, then there is not much of user feedback and error reporting.


----------



## balboah (Apr 4, 2012)

balboah said:
			
		

> In my recent tests I've simply done *ifconfig down* or hard reset while a client is copying files to it via NFS.
> 
> More than once I've gotten metadata corruption and errors which when trying *zfs import -F* tells me to restore the pool from a backup and refuses to import it.
> 
> ...




Narrowing down my issue:

In my virtual Xen environment I get some kind of deadlock with state "tx->tx" and 99.8% idle if I do these steps, also all zfs commands stop working and I'm unable to import the pool again even after a reset of the guest machine:

`dd if=/dev/urandom of=./foo bs=100M count=10 &`
`zpool export -f storage`

This only occurs when using HAST in between, not if I create the pool directly on the virtual drives. However it doesn't seem to occur on my real machine that I'm testing with now. Perhaps it's just a bug from using the virtual environment


----------



## gkontos (Apr 4, 2012)

balboah said:
			
		

> Narrowing down my issue:
> 
> In my virtual Xen environment I get some kind of deadlock with state "tx->tx" and 99.8% idle if I do these steps, also all zfs commands stop working and I'm unable to import the pool again even after a reset of the guest machine:
> 
> ...



This could be an expected behavior for ZFS running on top of a virtual machine given the COW nature. Did you allocate full space to the VMs before conducting the tests?


----------



## balboah (Apr 4, 2012)

gkontos said:
			
		

> This could be an expected behavior for ZFS running on top of a virtual machine given the COW nature. Did you allocate full space to the VMs before conducting the tests?



Actually I get the same behaviour with "tx->tx" lockup for a long while on the real servers as well but that at least recovers from it. I haven't gotten the same issue yet of where it's impossible to import the pool again.

However when split-brain occurs, hastctl says 1.8TB of "dirty" instead of 1-2G that is actually written in total. Is there a way around this?

*systat -io* says tps: 500+ and about 60MB/s on all three drives. While network activity is going at 500KB/s and the dirty counter in hastctl isn't shrinking that fast either. What's causing all the disk activity?


----------



## balboah (Apr 4, 2012)

When I re-create the secondary that is.


----------



## zennybsd (Apr 11, 2012)

@gkontos: Great stuff you shared.  

In Linux, DRBD failover is possible with a single NIC, but


How does it look with HAST with a single NIC with your configurations? What configuration changes are needed if it is possible in your configuration samples above?
Could the two HAST nodes be in two different remote locations, I meant not in the same localnet?

I know that a single NIC is not a failover option, but for the system which lacks expansion slots for NICs, one has to opt for a single NIC.


----------



## gkontos (Apr 11, 2012)

zennybsd said:
			
		

> 1) how does it look with HAST with a single NIC with your configurations? What configuration changes are needed if it is possible in your configuration samples above?



CARP uses only one NIC in this example. The second NIC is used for data replication. You could use only one NIC and just change the resource names. 



			
				zennybsd said:
			
		

> 2) could the two HAST nodes be in two different remote locations, I meant not in the same localnet?



Don't forget that both servers should bind to the same IP address in CARP. This means that you would have to perform some sort of complex ROUTING.


----------



## zennybsd (Apr 11, 2012)

@gkontos: Thanks!

From what you said about CARP, it seems that HAST+CARP is good for storage scalability rather than redundancy, right?

Generally, for enterprise grade operations are done in at least two datacenters keeping in mind if something happens (like fire, earthquake or flood etc.) to one datacenter, the IT operations will switchover to the other one in a different geographical location.

Is there any solution of the kind with HAST+CARP or is it only local solution? In GNU/Linux, DRBD with Heartbeat/Corosync is able to do what I stated.

Is that a possibility with HAST? Just curious!


----------



## gkontos (Apr 11, 2012)

zennybsd said:
			
		

> @gkontos: Thanks!
> 
> From what you said about CARP, it seems that HAST+CARP is good for storage scalability rather than redundancy, right?



Well, it is more for High Available Storage solutions. Meaning that I need my storage space always online. 



			
				zennybsd said:
			
		

> Generally, for enterprise grade operations are done in at least two datacenters keeping in mind if something happens (like fire, earthquake or flood etc.) to one datacenter, the IT operations will switchover to the other one in a different geographical location.



I don't think HAST would fit into a DR category solution yet. In that case, incremental snapshots would work better.



			
				zennybsd said:
			
		

> Is there any solution of the kind with HAST+CARP or is it only local solution? In GNU/Linux, DRBD with Heartbeat/Corosync is able to do what I stated.
> 
> Is that a possibility with HAST? Just curious!



It would if HAST were to support asynchronous replication. For the time being only fullsync is supported. I believe that DRDB uses asynchronous replication for long distance clusters. 

Also, CARP is not mandatory for HAST. If HAST could support async replication then that would work as a solution for DR replication.


----------



## zennybsd (Apr 11, 2012)

DRBD + Heartbeat/Pacemaker or Corosync in GNU/Linux supports synchronous replication too. Maybe such a setup requires a fencing device for more effective implementation. Proxmox is a Debian-based distro which uses such approach (upto 1.9 without any fencing device but only with DRBD+Heartbeat and from 2.0 Proxmox uses DRBD with Corosync. A pretty robust enterprise grade solution. Just for information.


----------



## tuaris (Apr 11, 2012)

My issues thus far with the ZFS + HAST + CARP + DEVD setup are during system startup and shutdown (related forum post is here: http://forums.freebsd.org/showthread.php?t=29996)

Hast1 and hast2 are up, running, and properly replicating.  Hast1's role is primary, hast2's role is secondary. 

*Issue #1*


I pull the (power) plug on hast1 to simulate some type of failure.  
Hast2 takes over and everything is perfect.
I put hast1 back into service (plug the power back in).
FreeBSD boots up
The CARP interface on hast1 switches to MASTER and hast2 switches to BACKUP 
hast2's role is now secondary
hast1's role is stuck at init
storage system is down ​
The cause of this issue is fully explained in the related forum post.  Basically it has to due with the fact that hastd isn't running yet.  I can easily work around this issue by modifying the fail-over script (start hastd if it's not running), but that generates errors/warnings during boot up and is not as elegant as I want it to be.

*Issue #2*


I attempt a clean reboot or shutdown of hast1
Hast1 hangs
Hast2 never takes over
storage system is down ​
Not sure exactly what causes this issue, but it only happens when the role is primary.  Some sources online point to a problem with ZFS and HAST.  I have been unable to find a work around/fix for this.  

Any assistance would be appreciated, and by the way: net.inet.carp.preempt=0 on both hosts.


----------



## gkontos (Apr 12, 2012)

@tuaris

*Issue #1*

When my primary server comes back online it does not automatically assume a MASTER role in CARP.
I have to manually issue on both nodes:

[CMD=""]#ifconfig carp0 down && ifconfig carp0 up[/CMD]

Only then do they switch roles. This way I avoid split brain issues.

*Issue #2*

Very strange!


----------



## tuaris (Apr 12, 2012)

gkontos said:
			
		

> @tuaris
> 
> When my primary server comes back online it does not automatically assume a MASTER role in CARP.



Interesting, when I reboot either server regardless of it's current role, it always assumes the MASTER role in CARP.

For example I have HostA and HostB...

HostA:

```
carp0: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
        inet 1.2.3.4 netmask 0xffffff00
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        carp: MASTER vhid 1 advbase 1 advskew 0
```

HostB:

```
carp0: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
        inet 1.2.3.4 netmask 0xffffff00
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        carp: BACKUP vhid 1 advbase 1 advskew 0
```

I reboot HostB.

HostA:


```
carp0: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
        inet 1.2.3.4 netmask 0xffffff00
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        carp: BACKUP vhid 1 advbase 1 advskew 0
```

HostB:


```
carp0: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
        inet 1.2.3.4 netmask 0xffffff00
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        carp: MASTER vhid 1 advbase 1 advskew 0
```


----------



## gkontos (Apr 12, 2012)

@tuaris,

What FreeBSD version are you running? 

My HAST tests were conducted with FreeBSD 9.0-RELEASE.

I recently deployed a HA solution for a client who is using FreeBSD 8.2-STABLE (mid Jan). The backup server is using FreeBSD  9.0-RELEASE. Both listen on a CARP IP address and they share a mysql server in replication mode, apache and DNS in jails. 

In that case I also got the same results during failover tests.

Regards,
George


----------



## phoenix (Apr 12, 2012)

zennybsd said:
			
		

> @gkontos: Thanks!
> 
> From what you said about CARP, it seems that HAST+CARP is good for storage scalability rather than redundancy, right?



No.  HAST does not provide you any extra storage space, nor does it provide any extra storage speed.  All HAST does is provide 2 copies of the data on two separate servers with the ability to switch which server is accessed for storage.  Meaning, if one server fails, all access switches automatically to the other server and clients carry on like nothing happened.



> Generally, for enterprise grade operations are done in at least two datacenters keeping in mind if something happens (like fire, earthquake or flood etc.) to one datacenter, the IT operations will switchover to the other one in a different geographical location.



Which is exactly what HAST + CARP (plus other things as needed to fix routing, load-balance services, etc) provide.  You get 2 servers (regardless of whether they are 2 inches or 2 km apart) "sharing" a virtual IP and replicating data between the two, such that if one fails, nobody notices as the other carries on.

HAST + CARP can be considered the FreeBSD way of doing similar things to DRBD + FreeVRRPd (or VServer) on Linux.  They provide storage replication, storage failover, and shared virtual IP.

Heartbeat is completely separate, and acts at a much higher layer in the applications/services stack.  And can be used on either FreeBSD or Linux.

HAST + CARP works at the storage layer to provide HA storage.  Heatbeat works at the application layer to provide HA services.  Two very different things.


----------



## phoenix (Apr 12, 2012)

tuaris said:
			
		

> Interesting, when I reboot either server regardless of it's current role, it always assumes the MASTER role in CARP.



Check your CARP sysctls to make sure preempt is disabled.


----------



## gkontos (Apr 15, 2012)

phoenix said:
			
		

> Which is exactly what HAST + CARP (plus other things as needed to fix routing, load-balance services, etc) provide.  You get 2 servers (regardless of whether they are 2 inches or 2 km apart) "sharing" a virtual IP and replicating data between the two, such that if one fails, nobody notices as the other carries on.



I would really appreciate if you could share your experience with HAST replication over long distance servers. 
Like a DR scenario with a 10Mbit line. (Any example you have, regardless of the speed, is very welcomed and much anticipated :e)

George


----------



## phoenix (Apr 16, 2012)

I've only done tests with gigabit links, mostly LAN but one test was across a WAN link, approximately 5 km. But, that was still a gigabit fibre link.


----------



## zennybsd (Apr 17, 2012)

phoenix said:
			
		

> I've only done tests with gigabit links, mostly LAN but one test was across a WAN link, approximately 5 km. But, that was still a gigabit fibre link.



Would like to know how you achieved this? Any pointers or tutorial would be appreciated!


----------



## phoenix (Apr 17, 2012)

We have fibre runs between admin sites and secondary schools within the city.  We setup 1 pair of fibres to extend out DMZ between the server rooms in two buildings.  And I did some HAST+CARP testing using VMs in the two buildings.  The setup was exactly the same as the testing using VMs in the same building, since the two buildings were (essentially) on the same LAN.


----------



## zennybsd (Apr 19, 2012)

phoenix said:
			
		

> We have fibre runs between admin sites and secondary schools within the city.  We setup 1 pair of fibres to extend out DMZ between the server rooms in two buildings.  And I did some HAST+CARP testing using VMs in the two buildings.  The setup was exactly the same as the testing using VMs in the same building, since the two buildings were (essentially) on the same LAN.



Meaning HAST is possible only under the same LAN, not between different LANs in different geographical locations. 

Your setup is already covered by gkontos' howto the first post of this thread. But ..

It is important to have failover storage or network services in different geographical locations in case of natural disasters. Yes, zfs send and receive can backup data, but not sure of failover part. 

Thanks!


----------



## gkontos (Apr 19, 2012)

zennybsd said:
			
		

> Meaning HAST is possible only under the same LAN, not between different LANs in different geographical locations.



That is not entirely correct. CAPR advertises a virtual IP. You don't need CARP for HAST to work. 

And even if you decide to use CARP you can use complex routing scenarios so that those 3 IP address are routable between different LANs 



			
				zennybsd said:
			
		

> Your setup is already covered by gkontos' howto the first post of this thread. But ..
> 
> It is important to have failover storage or network services in different geographical locations in case of natural disasters. Yes, zfs send and receive can backup data, but not sure of failover part.
> 
> Thanks!



My point was that as long as HAST uses full synch only at the moment, it is difficult to implement DR scenarios over long distances.  

DRBD mirroring implementations for DR are implemented in asynchronous mode.


----------



## phoenix (Apr 19, 2012)

zennybsd said:
			
		

> Meaning HAST is possible only under the same LAN, not between different LANs in different geographical locations.



HAST works between any two systems that are accessible via TCP/IP, whether that be on the local network, through a router, over the Internet, wherever.  So long as the two systems are accessible via TCP/IP, you can use HAST to replicate the data between them.

How you do the failover will, of course, depend on the setup.  To failover on a LAN, you can use CARP to share a single local IP.  To failover on a WAN, you'd use something at or above the routing layer.  What you use depends on the setup.  Maybe you tunnel things.  Maybe you update routing tables.  Maybe you failover across IPs.  Maybe you put a load-balance in front to make it transparent.  Maybe you use something else.

All HAST does is replicate data between two systems.  You stack other stuff on top to make it fit your needs.

You won't find a single piece of software that provides everything.  But you will find software that can be stacked together to provide the features you need.


----------



## srdjanrosic (Jun 11, 2012)

*... and NFS?*

Is anyone exporting the ZFS through NFS?

I'm getting 
	
	



```
Stale NFS File Handle
```
 when a failover happens (which means that NFS fsid's are different, I think).


----------



## glocke (Aug 21, 2012)

I'm also getting stale NFS handles.
As far as I understand this, in 8.3-RELEASE the sysctl vfs.typenumhash was introduced to address this issue: http://www.freebsd.org/releases/8.3R/relnotes-detailed.html#FS. 
I tried the setup on two 9.0-RELEASE machines with a third one acting as NFS client, no matter the value of  vfs.typenumhash, when doing a failover the clients quits the cp command with 
	
	



```
Stale NFS file handle
```
Has anybody more insight into this matter? Any help would be greatly appreciated.

Greetings glocke


----------



## gkontos (Aug 22, 2012)

Can you post your NFS configuration?

Also, do you run the NFS service in the standby node as well or do you start it when it becomes active?


----------



## glocke (Aug 24, 2012)

Hi gkontos, 

I copied the NFS switch-over from http://www.erik.eu/carp-hast-nfs/:

```
On both server add to rc.conf:
 nfs_server_enable="YES"
 nfs_server_flags="-u -t -n 4 -h 192.168.0.63"
 rpcbind_enable="YES"
 mountd_flags="-r"
 rpcbind_flags="-h 192.168.0.63"
```
From the failover-script, it justs starts nfsd in master-mode, but it does not stop it, when in slave mode, so *yes* most of the time the nfsd runs on master and slave. Spured from your comment, I modified the script to stop the nfsd and rpcbind when in slave mode and also disabled both in rc.conf. 
Now I don't get any Stale NFS file handle when copying from NFS to local disk on the client. Note that the NFS export is mounted without any parameters (nfs_client is enabled in rc.conf). 
I will have a deeper look into all of that, now it _just works_, (I guess the -h param for nfsd and rpcbind just hassled with eachother due to the same (shared) IP address. Thanks again for the hint 

glocke


----------



## gkontos (Aug 24, 2012)

glocke said:
			
		

> Spured from your comment, I modified the script to stop the nfsd and rpcbind when in slave mode and also disabled both in rc.conf.
> Now I don't get any Stale NFS file handle when copying from NFS to local disk on the client. Note that the NFS export is mounted without any parameters (nfs_client is enabled in rc.conf).



That is something that I have to probably modify in my guide. There are some services that should not run on the standby node.


----------



## dswartz (Sep 27, 2012)

*Corruption issue with HAST?*

I was trying an experiment.  Had two VMs under ESXi.  Each had two 32GB vmdks.  The idea was to set up a HAST for each VM using da2, and then mirror da1 with the HAST device (using ZFS).  I got both guys up and running, without all the scripts (e.g. just testing manually.)  My switchover technique was:

1. Export pool on primary.
2. Set primary to secondary (hastctl role secondary tank)
3. Set secondary to primary (hastctl role primary tank)
4. Import pool on newly-promoted primary.

This works just fine (e.g. no errors logged), however, if I then check the pool, it complains about a bunch of checksum errors on the local disk of the pool (e.g. the zfs local disk, NOT the local disk HAST is using.)  I scrub the pool and do a 'zpool clear tank', and all *seems* well.  Until I flip control back to the other host (using steps 1-4 again).  Again, corruption on the local disk of the newly-promoted host.  Nothing useful is being logged anywhere I can see.  Any ideas where to look?  Thanks!


----------



## vermaden (Sep 27, 2012)

I have tested similar setup under VirtualBox, also with ZFS and everything was ok, maybe its ESXi getting involved somewhere between?


----------



## dswartz (Sep 27, 2012)

*Hmmm*

Well, both disks (the local one and the other local one used by hast) are just vmdks - and this does not happen if not doing hast   I am trying a different approach.  4 virtual disks, each one part of a hast resource.  I will then setup the 4 hast resources in a 2x2 raid10.  Maybe it does not like having a local device explicitly mirrored with the hast device?  I will post my findings...


----------



## FMiralha (Dec 16, 2012)

zennybsd said:
			
		

> .. two datacenters keeping in mind if something happens (like fire, earthquake or flood etc.) to one datacenter, the IT operations will switchover to the other one in a different geographical location...



Hi! Let me ask... Is there a solution using FreeBSD?

Two FreeBSD machines, one on each datacenter, geografically distants?


----------



## vermaden (Dec 16, 2012)

FMiralha said:
			
		

> Hi! Let me ask... Is there a solution using FreeBSD?
> 
> Two FreeBSD machines, one on each datacenter, geografically distants?



From what I know, HAST does not need single VLAN/layer 2 network to be spread between the two different datacenters. CARP may need that, but there is also UCARP which should allow what, so IMHO it should be possible using one network on the first datacenter and some other network from the second one. These networks of course need to 'see' each other.


----------



## gkontos (Dec 17, 2012)

vermaden said:
			
		

> From what I know, HAST does not need single VLAN/layer 2 network to be spread between the two different datacenters. CARP may need that, but there is also UCARP which should allow what, so IMHO it should be possible using one network on the first datacenter and some other network from the second one. These networks of course need to 'see' each other.



Correct, this would not be an issue. However, HAST does not support async mode for synchronization. This means that a descend Internet connection must exist between the 2 datacenters. Of course, this also depends on the amount of data that change on the primary node.


----------



## interrupted (Dec 18, 2012)

balboah said:
			
		

> Mar 29 10:01:01 storage1 hastd[6690]: [disk2] (primary) Remote request failed (Operation not supported by device): FLUSH.
> Mar 29 10:01:02 storage1 hastd[6690]: [disk2] (primary) Unable to flush disk cache on activemap update: Operation not supported by device.[/CODE]



anyone else had this and found a solution for it?


----------



## pol (Feb 16, 2013)

*other script for devd action*

Hello. I wrote another script for service management, disk status, depending on the CARP interface. Maybe it solves the problems encountered in switching node and the order start services at boot time.


```
#!/bin/sh

# Copyright (c) 2013 Pavel I Volkov
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
# OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
# HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
# LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
# OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE.
#

# file location: /usr/local/libexec/carpcontrol.sh
# version 1.5

# file example: /usr/local/etc/devd/carp.conf
# notify 0 {
# 	match "system"          "IFNET";
# 	match "subsystem"       "carp*";
# 	match "type"            "LINK_UP";
# 	action "/usr/local/libexec/carpcontrol.sh $type $subsystem";
# };
#
# notify 0 {
# 	match "system"          "IFNET";
# 	match "subsystem"       "carp*";
# 	match "type"            "LINK_DOWN";
# 	action "/usr/local/libexec/carpcontrol.sh $type $subsystem";
# };

# file example: /etc/fstab
# ...
# /dev/hast/volume /hast/volume ufs rw,noauto 0 0
# ...

# file example: /etc/rc.conf.local
# ...
# hastd_enable="YES"
# carpcontrol_services="samba apache22"
# if [ -d "/dev/hast" ]; then # master mode
# samba_enable="YES"
# apache22_enable="YES"
# else # backup mode
# samba_enable="NO"
# apache22_enable="NO"
# fi
# ...

MY=`basename $0 .sh`
PID=$$
EVENT=$1 # Event type
IF="$2"	 # The network interface
PIDf="/var/run/${MY}.pid" # PID file for background process

. /etc/rc.subr
load_rc_config ${MY}

carp_type() { ifconfig ${IF} | sed -E '1,3d;s/^.*(INIT|MASTER|BACKUP).*$/\1/'; }

get_fstab() { awk -v p1=$1 '/\/dev\/hast\//{print $p1}' /etc/fstab; }

hast_role() { local _ret
	hastctl role "${1}" all; _ret=$?
	[ $_ret -ne 0 ] \
		&& logger -p daemon.err -t "${MY}[${PID}]" "hastd unable to switch role to ${1} (${_ret})" \
		|| logger -p daemon.notice -t "${MY}[${PID}]" "hastd switched to ${1} (${_ret})"
	return $_ret
}

serv_ctl() { local _i _st
	for _i in ${carpcontrol_services}; do
		logger -p daemon.notice -t "${MY}[${PID}]" "attempt to change the status of a service ${_i} to ${1}"
		service $_i onestatus > /dev/null 2>&1; _st=$?
		case ${1} in
			*start) [ $_st -ne 0 ] && service $_i ${1} ;;
			 *stop) [ $_st -eq 0 ] && service $_i ${1} ;;
		esac
	done
}

hast_init() { local _i
	serv_ctl stop # services stop
	# umount all hast volumes
	sync
	[ -d /dev/hast ] && for _i in `ls -1 /dev/hast/*`; do umount -f "${_i}"; done
	[ -e /var/run/hastctl ] && hast_role init # HAST(init)
}

wait_hast_bg() { local _i _j _k PID
	sleep 0.25; [ -e "${PIDf}" ] && PID=`cat ${PIDf}` || PID=$$
	logger -p daemon.notice -t "${MY}[${PID}]" "bg start from [$$] process"
	until hastctl status all > /dev/null 2>&1 # wait hast daemon
	do
	       sleep 0.25
	       logger -p daemon.notice -t "${MY}[${PID}]" "wait hastd"
	done
	for _i in `jot 12`; do # wait until 3 seconds
		case `carp_type` in
			BACKUP) break ;;
			*) sleep 0.25 ;;
		esac
	done
	for _i in `jot 12`; do # wait until 3 second
		case `carp_type` in
			BACKUP) # backup mode
				hast_init
				# HAST(secondary)
				hast_role secondary
				break
				;;
			MASTER) # master mode
				hast_init
				# HAST(primary)
				hast_role primary
				[ $? -ne 0 ] && break
				# mount all hast volumes from /etc/fstab
				for _j in `get_fstab 2`; do [ -d "$_j" ] || mkdir -p "$_j"; done
				for _j in `get_fstab 1`; do
					for _k in `jot 40`; do sleep 0.25; [ -e $_j ] && break; done # wait until 10 second
					fsck -p -y -t ufs $_j
					logger -p daemon.notice -t "${MY}[${PID}]" "trying mount ${_j}"
					mount $_j
					[ $? -ne 0 ] && break 2
					logger -p daemon.notice -t "${MY}[${PID}]" "${_j} mounted"
				done
				serv_ctl start # services start
				break
				;;
			*) # other mode
				sleep 0.25
				;;
		esac
	done
	rm -f "${PIDf}" # remove PID of the background process
	logger -p daemon.notice -t "${MY}[${PID}]" "bg stop from $$ process"
}

case "${EVENT}" in
	"LINK_UP"|"LINK_DOWN") # Carrier status changed to UP or DOWN
		logger -p daemon.notice -t "${MY}[${PID}]" "${IF} ${EVENT}"
		if [ -e "${PIDf}" ]; then # PID exist
			if [ pgrep -qF "${PIDf}" ]; then # process don't exist
				rm -f "${PIDf}"
			else # process exist
				logger -p daemon.err -t "${MY}[${PID}]" "background process already running"
				exit 1
			fi
		fi
		# because it is necessary to wait hast daemon at startup
		wait_hast_bg > /dev/null 2>&1 &
		echo $! > "${PIDf}" # create PID for background process
		;;
	*)	logger -p daemon.err -t "${MY}[${PID}]" "unknown event ${EVENT} for interface ${IF}" ;;
esac
exit 0
```


----------



## Paul-LKW (Sep 29, 2013)

Just tried ZFS with HASTD, it is absolutely not suitable for production as reboot will cause a system halt at;


```
Syncing disks, vnodes remaining ... 0 0 0 done
All buffers synced
```

and then the whole system halts, and you will need to force to push the power button by going to data center by hand (or remote hand) turn off and turn on! !!!


----------

