# iSCSI Multipath target / Network multipath



## onob (Nov 21, 2012)

Hi,

We are in the process of setting up a new iSCSI target based on FreeBSD, ZFS and istgt(1).

The virtual hosts are running ESXi; we have a redundant storage network and use MPIO from the hosts to our current storage.

istgt(1) supports MPIO, but the FreeBSD network stack has no kernel support for multipathing other than setfib(8). Incoming traffic is round-robined across the interfaces, but outgoing traffic always leaves through the default interface.

Has anyone got a working solution for iscsi MPIO / Multipath on a FreeBSD target?

//JO


----------



## VVB16 (Nov 30, 2012)

Hi, onob

I think you can use lagg(4) for multipath/failover network configuration.

--
Vitaly Belekhov


----------



## bjwela (Dec 5, 2012)

Hi onob, 

Were you able to get MPIO working with istgt? I am setting up a similar system with 4 NICs and I see the same thing: round-robin on writes, but a single interface on reads.

Were you able to get lagg(4) to work?


----------



## bjwela (Dec 5, 2012)

I solved it myself.

Just had to make sure all istgt portals were on different subnets.
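For anyone finding this later, a minimal sketch of what that looks like in istgt.conf (the addresses are made-up examples): one portal per subnet in a single portal group, so replies leave through the matching interface.

```
# Hypothetical example: one portal per subnet in a single portal group
[PortalGroup1]
  Comment "MPIO portals, one per subnet"
  Portal DA1 192.168.10.10:3260
  Portal DA2 192.168.11.10:3260
```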


----------



## onob (Dec 6, 2012)

We are up and running with MPIO and lagg. It works very well.

The lagg setup is in failover mode with 2 ports in each lagg interface, using 2 laggs (lagg0 and lagg1) for MPIO. Each lagg has its own subnet.
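For reference, a failover lagg pair like ours can be sketched in /etc/rc.conf roughly like this (the interface names are examples, not our exact hardware):

```
# Example: two failover lagg interfaces, each on its own subnet
ifconfig_igb0="up"
ifconfig_igb1="up"
ifconfig_igb2="up"
ifconfig_igb3="up"
cloned_interfaces="lagg0 lagg1"
ifconfig_lagg0="laggproto failover laggport igb0 laggport igb1 172.16.68.10/24"
ifconfig_lagg1="laggproto failover laggport igb2 laggport igb3 172.16.69.10/24"
```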

Our other SAN from "the big hardware provider" can put all ports on the same subnet, and we tried to do the same thing in FreeBSD using tools like policy routing and setfib. Turns out this is not doable when it comes to daemons like istgt: no problem with incoming traffic, but everything leaving the FreeBSD box takes the route for the destination subnet.

//JO


----------



## gnoma (Dec 6, 2012)

Hello,

If you are running on the same network and the same switch, you can use lagg.

But my experience with VMware and iSCSI shows that if you have a few networks, you can use multipathing (usually round-robin) to get better performance and higher fault tolerance.
Because if the entire switch goes down, not just a single port, then you are in trouble.
But this applies only if you have more than one network and more than one switch.

How did I do this? Just make istgt listen on all interfaces and allow initiators from all your networks. Then, in the iSCSI adapter of your ESX/ESXi, manually add all the IPs of your SAN on all the networks and multipathing will be set up. The initiator just checks the LUN device name, serial number and all that stuff. If the characteristics of the LUN are the same at all the IPs listed there, it will automatically be recognized as a single device with many paths to it.
I know there is a dynamic discovery possibility, because when I used FreeNAS, I set up istgt through the web configuration and on the ESX side just entered a single IP of the SAN, and it discovered them all automatically. Then I switched to FreeBSD, configured istgt manually, dynamic discovery didn't work, and I just listed all the IP addresses of the SAN manually. Failover still works, aggregation still works, so I didn't worry about dynamic discovery.
However, if you are using a single switch and a single network, lagg may be the better way to get higher performance, because it works at OSI layer 2 and you get a sort of hardware acceleration from the switches.
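The manual ESXi step above can also be done from the ESXi shell instead of the GUI; a rough sketch with made-up adapter name, addresses and IQN (ESXi 5.x esxcli syntax):

```
# Hypothetical example: register each SAN IP as a static target
# on the software iSCSI HBA, then rescan
esxcli iscsi adapter discovery statictarget add -A vmhba33 -a 192.168.10.10:3260 -n iqn.2012-12.com.example:target1
esxcli iscsi adapter discovery statictarget add -A vmhba33 -a 192.168.11.10:3260 -n iqn.2012-12.com.example:target1
esxcli storage core adapter rescan -A vmhba33
```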

Hope this was useful. If you need the istgt config file, I can provide it for you. The ESX/ESXi configuration is all done in the GUI, and you can't go wrong there.

Good luck.


----------



## Sylhouette (Dec 6, 2012)

I am interested in the config file!

thanks for your time and explanation!

gr
Johan


----------



## gnoma (Dec 6, 2012)

Hello,

This is what I've got. Please note that this is a slightly risky configuration, because it allows everything from everywhere. I do the restrictions with my firewall, not with the built-in restrictive rules of istgt. So consider adding authentication, or restrictions on the initiators; there are comments in the sample config file on how to do that.

```
san root >cat /usr/local/etc/istgt/istgt.conf | grep -v \#
[Global]
  Comment "Global section"
  NodeBase "san.mysystem.com"
  PidFile /var/run/istgt.pid
  LogFacility "local7"
  Timeout 30
  NopInInterval 20
  DiscoveryAuthMethod Auto
  MaxSessions 64
  MaxConnections 16
  MaxR2T 256
  MaxOutstandingR2T 16
  DefaultTime2Wait 2
  DefaultTime2Retain 60
  FirstBurstLength 262144
  MaxBurstLength 1048576
  MaxRecvDataSegmentLength 262144
  ImmediateData Yes
  DataPDUInOrder Yes
  DataSequenceInOrder Yes
  ErrorRecoveryLevel 0
[UnitControl]
  Comment "Internal Logical Unit Controller"
  AuthMethod Auto
  AuthGroup AuthGroup10000
  Portal UC1 127.0.0.1:3261
  Netmask 127.0.0.1
[PortalGroup1]
  Comment "ANY IP"
  Portal DA1 0.0.0.0:3260
[InitiatorGroup1]
  Comment "Initiator Users"
  InitiatorName "ALL"
[LogicalUnit1]
  Comment "My LUN"
  TargetName Name-of-LUN
  TargetAlias "alias-name-of-lun"
  Mapping PortalGroup1 InitiatorGroup1
  AuthMethod Auto
  AuthGroup AuthGroup1
  UseDigest Auto
  UnitType Disk
  QueueDepth 255
  LUN0 Storage /dev/zvol/datacore/istgt.block.device Auto

san root >cat /usr/local/etc/istgt/auth.conf | grep -v \#
  Comment "Auth Group1"
san root >
```

Hope this was useful.


----------



## Sylhouette (Dec 6, 2012)

Thank you.

I will try things out.

We now use NFS to share our virtual machines, but we have disabled sync on the dataset; ZFS, NFS and ESXi together are really slow when using NFS with sync enabled.
But maybe it is wiser to use iSCSI for the ESXi datastore.

regards
Johan


----------



## gnoma (Dec 6, 2012)

I like iSCSI mostly because of the multipathing. You get fault tolerance and aggregation at the same time. For example, at midnight when the backups are running, I have a cronjob that brings down the interface connected to the switch that rolls the backups. After the backups are finished, the script brings the interface back up, and within a few minutes all HBAs are rescanned automatically and the path is restored.
Best practice says that the LUN should be backed directly by a block device, not by a file inside an already-mounted filesystem on your SAN.
Also, if you are using ESXi 5.1, it supports iSCSI over jumbo frames with MTU 9000. istgt also has no trouble with MTU 9000, so that is a real network performance win.
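On the FreeBSD side, enabling jumbo frames is just an MTU setting in /etc/rc.conf (the interface name is an example; the switch and the ESXi vmkernel ports have to be set to MTU 9000 as well):

```
# Example: jumbo frames on the storage NIC
ifconfig_igb0="up mtu 9000"
```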


----------



## Sylhouette (Dec 7, 2012)

The one thing I did like about NFS was the fact that I can see the files on my filesystem.
So it was quite easy to insert files from FreeBSD itself into the datastore.

But the fact that sync is disabled gives me more and more the shivers.

gr
Johan


----------



## bjwela (Dec 13, 2012)

Has anyone experienced very poor performance with iSCSI (istgt) on FreeBSD 9.0?

I cannot seem to get good performance even with sync=disabled. This is especially true for smaller block sizes < 16k.
I am using zvols as backing for iSCSI. Is it better to use a file?


----------



## Sebulon (Dec 14, 2012)

Sylhouette said:
> Thank you.
> 
> I will try things out.
> 
> ...



Disabling sync is *never* the answer. Period. You are facing a real danger of corrupting your data, and I sincerely hope for your sake that you have good backups for when that time comes. I advise you to shell out a couple of bucks on two Vertex 4 256GB MLC drives to configure as a mirrored SLOG, so that I can sleep better at night, not worrying on your behalf.

/Sebulon


----------



## bjwela (Dec 14, 2012)

Hi Sebulon, 
Thank you for your reply. 

I know that disabling the ZIL is a bad thing; this was done only to test the iSCSI connection.

My setup is 24x Intel 520 480GB SSDs, so there is no need for a separate SSD ZIL device. A ramdisk might give me more performance, but I am banking on my all-SSD array having enough performance.

The setup with FreeBSD is just a test before I take it into production using Nexenta. My plan was to benchmark with FreeBSD for possible future installations. However, I see awful iSCSI performance using istgt.

I share iSCSI to Windows 7 over 1Gb Ethernet; this will be MPIO over 10Gb in production.

At 4k random writes I get about 8MB/s on FreeBSD, measured using IOMeter in Windows. With Nexenta, the same test gives 76MB/s at 100% random 4k writes.

The pool is striped in a RAID0 (again, only for testing; the production setup will be mirrored).

Any ideas on what is causing the bad performance on FreeBSD? Have you seen successful setups like this one before on FreeBSD?


----------



## Sebulon (Dec 14, 2012)

bjwela said:
> Hi Sebulon,
> Thank you for your reply.
> 
> I know that disabling the zil cache is a bad thing, this was done only to test the iscsi connection.
> ...



Wow, a bit different budget than my tinker systems usually have had, haha! In that case you may want to look up STEC's ZeusRAM SSD, it's a real killer.

We are using istgt very successfully in our organisation over 1GbE, so it is definitely possible. Some things to keep in mind though: 1) the documentation is horrible, so you're basically just guessing your way through (and so did we), and 2) be very careful how you design your pool (partitioning/ashift); benchmark your performance locally first, then configure and test remotely, to know if you're on the right path at all.

When creating the zvol it is very important to specify the same block size the remote file system will be using. NTFS defaults to 4k, which you can set with:
`# zfs create -b 4k -o sync=always -o compress=on -s -V 1t pool/lun`
And if you're going to store larger files, you can set the NTFS/zvol block size all the way up to 64k to get even better performance. istgt does not sync by default, which is the reason you get "better" throughput over iSCSI than you'd normally get over NFS, but it is dangerous with regard to data corruption. I usually specify `-s` to have the zvol thin provisioned. I'd also recommend configuring the network interfaces for jumbo frames to lower the packet overhead; that can give you better performance if your switches are stressed enough as it is.

/Sebulon


----------



## gnoma (Dec 14, 2012)

Aloha,


> Has anyone experienced very poor performance on iscsi (istgt) on FreeBSD 9.0?
> 
> Seems like I cannot get good performance even sync=disabled. This is especially true for smaller block-sizes < 16k.
> I am using zvols as backing for iscsi. Is it better to use a file?



I had this problem and it is discussed in this topic.
As a result, it seems that no matter how good your hardware is, GEOM RAID will always be faster than ZFS. You can get very expensive hardware and great performance with ZFS, but even then GEOM will be faster. Of course you lose great features, and from a data integrity point of view ZFS is always the best choice.


----------



## bjwela (Dec 14, 2012)

Hi Sebulon, 

Thanks for your reply.

I have set up all disks with gpart and gnop:


```
gpart create -s GPT /dev/$drive
gpart add -t freebsd-zfs -l disk$i -b 2048 -a 4k /dev/$drive
gnop create -S 4096 /dev/gpt/disk$i
```

Then I created the zpool using the .nop devices.


```
zpool create tank1 disk1.nop disk2.nop ....
```

I will try to setup the zvols with 4k block and sync=always.

Are there any obvious settings in istgt that could cause really bad performance?

istgt.conf: 


```
[Global]
  Comment "Global section"
  NodeBase "iqn.2012-12.com.adq"
  PidFile /var/run/istgt.pid
  AuthFile /usr/local/etc/istgt/auth.conf
  MediaDirectory /var/istgt
  LogFacility "local7"
  Timeout 30
  NopInInterval 20
  DiscoveryAuthMethod Auto
  MaxSessions 16
  MaxConnections 4
  MaxR2T 32
  MaxOutstandingR2T 16
  DefaultTime2Wait 2
  DefaultTime2Retain 60
  FirstBurstLength 262144
  MaxBurstLength 1048576
  MaxRecvDataSegmentLength 262144

  # NOTE: not supported
  InitialR2T Yes
  ImmediateData Yes
  DataPDUInOrder Yes
  DataSequenceInOrder Yes
  ErrorRecoveryLevel 0

[UnitControl]
  Comment "Internal Logical Unit Controller"
  AuthMethod CHAP Mutual
  AuthGroup AuthGroup10000
  Portal UC1 127.0.0.1:3261
  Netmask 127.0.0.1

[PortalGroup1]
  Comment "igb0"
  Portal DA1 10.30.0.165:3260

[InitiatorGroup1]
  Comment "Initiator Group1"
  InitiatorName "ALL"
  Netmask 10.30.0.0/24
  
[LogicalUnit1]
  Comment "zvol1"
  TargetName zvol1
  TargetAlias "DiskTarget zvol1"
  Mapping PortalGroup1 InitiatorGroup1
  AuthMethod Auto
  AuthGroup AuthGroup1
  UseDigest Auto
  UnitType Disk
  QueueDepth 64
  LUN0 Storage /dev/zvol/tank/zvol1 Auto
```

Any ZFS tunables that needs to be set properly?


----------



## Sebulon (Dec 14, 2012)

@bjwela

That looks textbook, nicely done. Tuning is mostly evil; if you have any tunables set, remove them and see whether that actually gives you better performance than you had with them. The next step would be to install benchmarks/bonnie++ to verify your performance locally first:
`# bonnie++ -d /foo/bar -u 0` (the `-u 0` is needed if running as root)

/Sebulon


----------



## onob (Dec 17, 2012)

Hi all,

Our setup works fine when it comes to throughput, but worse when it comes to latency (as in too much of it):

Hardware:
Xeon E3-1270 V2 @ 3.50GHz
32GB RAM
Supermicro Motherboard and chassis
2xSSD for ZIL
1xSSD for cache
2xSATA mirrored vdev

The network uses 2 lagg interfaces in failover mode and should be able to do 2x1Gbit/s using MPIO and round-robin in ESXi 5.

```
[root@storage2 ~]# zpool status
  pool: tank1
 state: ONLINE
  scan: scrub repaired 0 in 0h2m with 0 errors on Thu Nov 22 13:57:59 2012
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0
        logs
          mirror-2  ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
        cache
          ada0      ONLINE       0     0     0

errors: No known data errors
```

When we put some load on the target, our monitoring software (Veeam One) starts alarming on high latency. The network interfaces burst up to ~45MB/s; latency then rises to about 200ms and sometimes as high as 500ms. Nothing strange in top, gstat or netstat.

istgt.conf:

```
[Global]
  Comment "Global section"
  # node name (not include optional part)
  NodeBase "storage2.x.istgt"

  # files
  PidFile /var/run/istgt.pid
  AuthFile /usr/local/etc/istgt/auth.conf

  # syslog facility
  LogFacility "local7"

  # socket I/O timeout sec. (polling is infinity)
  Timeout 30
  # NOPIN sending interval sec.
  NopInInterval 20

  # authentication information for discovery session
  DiscoveryAuthMethod None
  #DiscoveryAuthGroup AuthGroup9999

  # reserved maximum connections and sessions
  # NOTE: iSCSI boot is 2 or more sessions required
  MaxSessions 32
  MaxConnections 8

  # iSCSI initial parameters negotiate with initiators
  # NOTE: incorrect values might crash
  FirstBurstLength 262144
  MaxBurstLength 262144
  MaxRecvDataSegmentLength 262144

[UnitControl]
  Comment "Internal Logical Unit Controller"
  #AuthMethod Auto
  AuthMethod CHAP Mutual
  AuthGroup AuthGroup10000
  # this portal is only used as controller (by istgtcontrol)
  # if it's not necessary, no portal is valid
  #Portal UC1 [::1]:3261
  Portal UC1 127.0.0.1:3261
  # accept IP netmask
  #Netmask [::1]
  Netmask 127.0.0.1

[PortalGroup1]
  Comment "esx-grp1 lagg0"
  Portal DA1 172.16.68.10:3260
  Portal DA2 172.16.69.10:3260

[InitiatorGroup1]
  Comment "Initiator Group esx-grp1"
  InitiatorName "ALL"
  Netmask 172.16.68.0/24
  Netmask 172.16.69.0/24

[LogicalUnit1]
  Comment "esx-grp1-sata2"
  TargetName esx-grp1-sata2
  TargetAlias "esx-grp1-sata2"
  # use initiators in tag1 via portals in tag1
  Mapping PortalGroup1 InitiatorGroup1
  # accept both CHAP and None
  AuthMethod None
  AuthGroup AuthGroup1
  UnitType Disk
  # Queuing 0=disabled, 1-255=enabled with specified depth.
  QueueDepth 128
  #QueueDepth 16
  LUN0 Storage /tank1/sata2/sata2 3TB

[LogicalUnit2]
  Comment "esx-grp1-sata3"
  TargetName esx-grp1-sata3
  TargetAlias "esx-grp1-sata3"
  # use initiators in tag1 via portals in tag1
  Mapping PortalGroup1 InitiatorGroup1
  # accept both CHAP and None
  AuthMethod None
  AuthGroup AuthGroup1
  UnitType Disk
  # Queuing 0=disabled, 1-255=enabled with specified depth.
  QueueDepth 128
  #QueueDepth 16
  LUN0 Storage /tank1/sata3/sata3 3TB
```

Anyone with an idea on how to decrease latency under a not-that-heavy load? Peaks at 2x45MB/s should not be enough to cause this, should they?

//JO


----------



## Sebulon (Dec 18, 2012)

@onob

As I stated earlier, it is very important how you design your pool with regard to partitioning and optimizing for 4k; logs and cache are just as important. bjwela posted an accurate setup you can follow. Benchmark your performance locally first to know whether you are on the right path; you cannot expect better values remotely than you get locally.


```
LUN0 Storage /tank1/sata2/sata2 3TB
```

Looks to me like a file called "sata2" inside a ZFS filesystem, also called "sata2". I suggest you export a zvol instead:
`# zfs create -b 4k -o sync=always -o compress=on -s -V 1t pool/lun`

```
LUN0 Storage /dev/zvol/pool/lun 1TB
```
And be sure to format the LUN on the client (initiator) with the same block size you specified for the LUN, in this case 4096 (4k); with e.g. NTFS you can raise that all the way up to 64k, depending on your application use case.
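On a Windows initiator, matching the NTFS allocation unit to the zvol block size is done at format time; a sketch (the drive letter is an example):

```
:: Hypothetical example: format the iSCSI disk with a 4k allocation
:: unit size, matching the -b 4k used when creating the zvol
format E: /FS:NTFS /A:4096 /Q
```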

/Sebulon


----------



## gnoma (Dec 19, 2012)

Hello,

I was wondering if I could also get better performance by switching istgt to UDP, since /etc/services lists both:

```
iscsi-target    3260/tcp   # iSCSI port
iscsi-target    3260/udp   # iSCSI port
```
But I couldn't find an option for the TCP/UDP protocol anywhere in the configuration files, and Google didn't do much better.

Does anybody know if istgt can work over UDP, or should I try another iSCSI target solution?

Thank you.


----------



## _martin (Dec 20, 2012)

gnoma said:
> Anybody knows if istgt can work on udp, or I should try another iscsi target solution?



Please see RFC 3720:


```
- Connection: A connection is a TCP connection.  Communication
     between the initiator and target occurs over one or more TCP
     connections.  The TCP connections carry control messages, SCSI
     commands, parameters, and data within iSCSI Protocol Data Units
     (iSCSI PDUs)
```
Anything that works differently just breaks the standard.


----------

