# L2ARC and ZIL on SSD - 4K alignment?



## belon_cfy (Apr 21, 2012)

I have added a whole SSD as L2ARC without partitioning it. Do we really need to optimize the alignment of the L2ARC and ZIL on an SSD using gpart and gnop?

I have tried partitioning an L2ARC (45G) and ZIL (10G) with gpart for a performance test, but when I ran *zdb | grep ashift*, the ZIL showed 9 instead of 12. Does that mean the ZIL wasn't aligned?


----------



## Sebulon (Apr 21, 2012)

@belon_cfy

Improper alignment on an SSD can cut performance in *half*, so yes, it is very important:


```
[CMD="#"]diskinfo -v daX[/CMD]
daX
	512         	# sectorsize
	240057409536	# mediasize in bytes (223G)
	468862128   	# mediasize in sectors
	0           	# stripesize
	0           	# stripeoffset
	29185       	# Cylinders according to firmware.
	255         	# Heads according to firmware.
	63          	# Sectors according to firmware.
	ID-SDFG434156XVH	# Disk ident.

[CMD="#"]echo "240057409536 / 1024000 - 1" | bc[/CMD]
234430
[CMD="#"]dd if=/dev/zero of=tmpdsk0 bs=1024000 count=1 seek=234430[/CMD]
[CMD="#"]mdconfig -a -t vnode -f tmpdsk0[/CMD]
[CMD="#"]gnop create -S 4096 md0[/CMD]
[CMD="#"]gpart create -s gpt daX[/CMD]
[CMD="#"]gpart add -t freebsd-zfs -l log1 -b 2048 -a 4k daX[/CMD]
[CMD="#"]zpool add pool log mirror md0.nop gpt/log1[/CMD]
[CMD="#"]zpool detach pool md0.nop[/CMD]
```
This gives you both proper alignment and ashift: 12 on the log vdev. Note that ashift is *not* the same thing as alignment.
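The seek number in the dd line is just the media size divided by the dd block size, minus one, so the sparse file ends right at the SSD's capacity. You can recompute it in plain sh (numbers taken from the diskinfo output above):

```sh
#!/bin/sh
# Recompute the dd seek value from the diskinfo numbers above.
mediasize=240057409536   # mediasize in bytes, from diskinfo -v
bs=1024000               # dd block size used above
seek=$(( mediasize / bs - 1 ))
echo "$seek"             # prints 234430, matching the bc output
```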

/Sebulon


----------



## belon_cfy (Apr 21, 2012)

Are the parameters -b 2048 and -a 4k necessary when adding a partition with gpart? I'm using 4x 2TB 4K-optimized drives and the ZFS volumes were created 4K-optimized. My gpart output is below; is it considered aligned?


```
=>        34  1953525101  da0  GPT  (931G)
          34          94    1  freebsd-boot  (47k)
         128    41943040    2  freebsd-zfs  (20G)
    41943168  1911581967    3  freebsd-zfs  (911G)

=>        34  1953525101  da1  GPT  (931G)
          34          94    1  freebsd-boot  (47k)
         128    41943040    2  freebsd-zfs  (20G)
    41943168  1911581967    3  freebsd-zfs  (911G)

=>        34  1953525101  da2  GPT  (931G)
          34          94    1  freebsd-boot  (47k)
         128    41943040    2  freebsd-zfs  (20G)
    41943168  1911581967    3  freebsd-zfs  (911G)

=>        34  1953525101  da3  GPT  (931G)
          34          94    1  freebsd-boot  (47k)
         128    41943040    2  freebsd-zfs  (20G)
    41943168  1911581967    3  freebsd-zfs  (911G)
```
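One way to check the output above: a GPT start offset (in 512-byte sectors) is 4K-aligned when it is divisible by 8. A small sh sketch over the start sectors shown:

```sh
#!/bin/sh
# A start offset (in 512-byte sectors) is 4K-aligned when divisible by 8.
for start in 34 128 41943168; do
    if [ $(( start % 8 )) -eq 0 ]; then
        echo "$start: aligned"
    else
        echo "$start: not aligned"
    fi
done
```

By that test the freebsd-zfs partitions starting at 128 and 41943168 are aligned; only the table start at 34 (the boot partition) is not, which doesn't matter for data performance.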

How about L2ARC? Do I still need to create a 4K-optimized .nop device before adding it as L2ARC?


----------



## einthusan (May 4, 2012)

belon_cfy said:

> How about L2ARC? Do I still need to create a .nop with 4k optimized before add it as L2ARC?



Bump! I would like to know the answer to that. Anyone care to answer that?


----------



## Sebulon (May 4, 2012)

@einthusan

Not quite sure, to be honest. I mean, a SLOG gets its own vdev that receives an ashift value, but if you run zdb, you won't see that for the L2ARC. It's also different since it's part of the ARC, while the SLOG is part of the pool.

I settled for aligned partitioning on my cache device (a Vertex3) and have seen it reading and writing at over 200MB/s at times in gstat, so I'm not worried.

/Sebulon


----------



## einthusan (May 4, 2012)

Sebulon said:

> @einthusan
> 
> Not quite sure to be honest. I mean, a SLOG gets its own vdev that receives a ashift-value, but if you run zdb, you won't see that for L2ARC... It's also different since it's a part of the ARC, while the SLOG is part of the pool.
> 
> ...



Hi Sebulon,

I'm not sure what a SLOG is, but I have a quick question regarding the L2ARC. I plan on putting in 2x SSDs as L2ARC. I used your "ZFS as root" guide to set up a few FreeBSD boxes, but could you explain what you think is the best way to set up 2x L2ARC devices? I thought L2ARC devices don't need disk alignment at all (4K or not), and that they "just work" once you tell ZFS to use them as L2ARC. I've even heard some people say to mirror L2ARC devices; does that actually provide additional read performance?


----------



## t1066 (May 4, 2012)

The following two drives are 4k aligned.


```
$ gpart show -l ada0
=>       34  250069613  ada0  GPT  (119G)
         34        128     1  (null)  (64k)
        162       1886        - free -  (943k)
       2048   20971520     2  ssd1  (10G)
   20973568   10485760     3  log0  (5.0G)
   31459328  218103808     4  cache0  (104G)
  249563136     506511        - free -  (247M)

$ gpart show -l da6
=>       34  250069613  da6  GPT  (119G)
         34        128    1  (null)  (64k)
        162       1886       - free -  (943k)
       2048   20971520    2  swap1  (10G)
   20973568  228589568    3  cache1  (109G)
  249563136     506511       - free -  (247M)
```

And when accessing a file that is in the cache,


```
extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0     2329.8   0.0 298218.9     0.0    4   2.7  67
da6      2433.7   1.0 311182.9   127.9    5   3.1  80
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0     2833.2   6.0 362309.5   422.1   10   2.7  82
da6      2932.1   0.0 375080.8     0.0   10   3.2  97
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0     2845.1   0.0 364178.7     0.0   10   2.8  82
da6      2990.0   0.0 382608.2     0.0   10   3.2  98
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0     1583.4   2.0 202676.3   143.9    0   2.8  46
da6      1586.4   0.0 203059.9     0.0    0   3.2  52
```

You should also increase the sysctl variables vfs.zfs.l2arc_write_max and vfs.zfs.l2arc_write_boost. Their default of 8MB/s is a bit slow for current-generation SSDs. Finally, you could set vfs.zfs.l2arc_noprefetch to 0 if you also want to cache streaming data.
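For reference, that tuning might look something like this in /etc/sysctl.conf. The numbers here are illustrative examples only, not recommendations; the sysctls take bytes per second, so e.g. 64MB/s is 67108864:

```
# Illustrative values only - tune for your own SSD
vfs.zfs.l2arc_write_max=67108864     # 64MB/s, up from the 8MB/s default
vfs.zfs.l2arc_write_boost=134217728  # 128MB/s until the ARC has warmed up
vfs.zfs.l2arc_noprefetch=0           # also cache prefetched (streaming) reads
```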


----------



## Sebulon (May 4, 2012)

einthusan said:

> I'm not sure what a SLOG is


Separate LOG device.



			
einthusan said:

> I plan on putting 2x SSD as L2ARC.


Why? Is it size? Remember that you still need RAM to be able to allocate all that.



			
einthusan said:

> I thought L2ARC devices don't need disk alignment at all (4k or not) and I also thought that L2ARC devices "just work" by telling ZFS to just use it as L2ARC


It depends on the SSD. Some SSDs have about the same write performance aligned or unaligned, while on others, like the Vertex, unaligned writes go *half* as fast as aligned ones.
Just keep the partitioning aligned and you'll never have any problems, SSDs or otherwise. I'm assuming the cache devices are SSDs, since otherwise, what's the point?



			
einthusan said:

> I even heard of some people saying to mirror L2ARC devices, does that actually provide additional read performance?


_Them people_, up to no good again. That is just incorrect. There is no way to mirror cache devices, because you don't have to; everything on them is redundant, a copy of what's already in the pool. I'm betting you're just confusing the SLOG/ZIL with L2ARC. Log devices can be good to have mirrored. It's not really necessary nowadays, from ZPOOL v19 and up, but if you have a production database machine doing heavy sync writes with only one log device, and it dies, you are going to feel the paaaiiin of going from around 60-70MB/s of writes down to, like, 5.
With two mirrored SLOGs, losing one is not an issue.

/Sebulon


----------



## einthusan (May 4, 2012)

Sebulon said:

> Separate LOG device.
> 
> 
> Why? Is it size? Remember that you still need RAM to be able to allocate all that.
> ...



The reason I am using 2x 32GB SSDs instead of 1x 64GB SSD is that *I* believe two drives will give me more read IOPS and thus better read performance than a single SSD. I need 64GB because of video streaming, so the more cache space the better, right? We have 8GB of RAM; I think that should be enough.

Yes, I will align the SSD using the same procedure for regular hard drives. And yes they will be used as cache devices (L2ARC).

You're right. I got ZIL mixed up with L2ARC. They were mirroring the SLOG devices, not L2ARC.


----------



## Sebulon (May 7, 2012)

@einthusan

SSDs use internal striping across their cells. 1x 64GB is as fast as 2x 32GB as long as it has twice the cells to stripe across, and that's really hard to find out; manufacturers rarely print it, even in the supposedly complete product sheet. With 2x 32GB you are trusting ZFS to be as effective at striping across two disks as a single disk is at striping internally.

In the end, I don't think anyone would notice the difference. One con of 2x disks is that they take up space in the chassis. One pro is that if one of them dies, you'd only lose half of the cached information.

/Sebulon


----------



## gkontos (May 7, 2012)

@Sebulon,

I noticed that you are partitioning an L2ARC device with a ZFS type:



> [CMD=""]# gpart add -t freebsd-zfs -l log1 -b 2048 -a 4k daX[/CMD]



Is there a particular reason why this is done?

Thanks


----------



## Sebulon (May 7, 2012)

@gkontos

No, nothing in particular. Is there a "better" type, you think? Does it even matter?

/Sebulon


----------



## gkontos (May 7, 2012)

Sebulon said:

> @gkontos
> 
> No, nothing in particular. Is there a "better" type, you think? Does it even matter?
> 
> /Sebulon



I guess not. I asked mainly out of curiosity. Usually I don't partition LOG or CACHE SSD devices.

If you want to 4K-align then this is the only way, but do you think it really makes that much of a difference?


----------



## einthusan (May 7, 2012)

Sebulon said:

> @einthusan
> 
> SSDs use internal striping across the cells. 1x64GB is as fast as 2x32GB, as long as it has twice the cells to stripe across, and that's really hard to find out. Manufacturers rarely print that out, even in the supposedly complete product sheet. But with 2x32GB, you have to trust- in case- ZFS to be as effective in striping across 2x disks, as giving only 1x disk to ZFS and trust the disk to stripe internally.
> 
> ...



But what about sustained read rates? For example, a single SSD may have a sustained read rate of 200MB/s. If you have 2 SSDs being read in parallel, wouldn't you get about 400MB/s of sustained reads? And if we add random reads to the picture, wouldn't parallel reads be better? Unless ZFS sucks at managing 2x L2ARC devices.


----------



## Sebulon (May 7, 2012)

gkontos said:

> ...but do you think that it really makes that difference?



I *know* it makes a difference on OCZ's drives, because I have tested quite a few drives now looking for the perfect SLOG device, and one of the things I test is aligned vs. unaligned writes. Please test the difference yourself:

```
[CMD="#"]gpart add -t freebsd-zfs -l log(cache)1 (a)daX[/CMD]
Compared to:
[CMD="#"]gpart add -t freebsd-zfs -l log(cache)1 -b 2048 -a 4k (a)daX[/CMD]

Prepare:
[CMD="#"]mdmfs -s 2048m mdX /mnt/ram[/CMD]
[CMD="#"]umount /mnt/ram[/CMD]
(Or use the [FILE]mdconfig[/FILE]-command to do the same thing)
[CMD="#"]dd if=/dev/random of=/dev/mdX bs=1024000 count=2048[/CMD]

Then test three times:
[CMD="#"]dd if=/dev/mdX of=/dev/gpt/log(cache)1 bs=1024000 count=2048[/CMD]
[CMD="#"]dd if=/dev/mdX of=/dev/gpt/log(cache)1 bs=1024000 count=2048[/CMD]
[CMD="#"]dd if=/dev/mdX of=/dev/gpt/log(cache)1 bs=1024000 count=2048[/CMD]
```

/Sebulon


----------



## Sebulon (May 7, 2012)

einthusan said:

> But what about the sustained data read rates. For example, a single SSD may have a sustained read rate of 200 MB/s. If you have 2 SSD being read in parallel, wouldn't you have about 400 MB/s of sustained read speed? And if we add random reads to the picture, wouldn't parallel reads be better? Unless ZFS sucks at managing 2x L2ARC devices.



The same striping principle applies to reads as well.

Let's just say, hippopotamously, that 1x SSD with 16x 4GB (=64GB) NAND cells can read sustained at 200MB/s; then an SSD with 32x 4GB (=128GB) NAND cells can read sustained at 400MB/s. These numbers were pulled out of my ass, but you get the picture.

/Sebulon


----------



## Sebulon (May 8, 2012)

t1066 said:

> You should also increase the sysctl variables vfs.zfs.l2arc_write_max and vfs.zfs.l2arc_write_boost. Their defaults 8M/s are a bit slow for current generation SSDs. Finally, you could change vfs.zfs.l2arc_noprefetch to 0 if you also want to cache streaming data.



What have you chosen as values for write_max and write_boost? I cranked them up from 8 to 80MB/s, but what would be a good rule of thumb? My Vertex3 can write sustained at about 240MB/s, but going too high may hurt reads if the device gets 100% busy writing. Should we say about half (120MB/s) of the write performance, maybe?

/Sebulon


----------



## t1066 (May 8, 2012)

@Sebulon

I think that the L2ARC should mainly be used as a read cache; you do not want to end up reading from the pool instead of the L2ARC because the cache device is busy writing. So the write speed should be set to at most half of the device's maximum write speed; I would prefer a quarter or a third. There are two sysctls that I had overlooked, vfs.zfs.l2arc_feed_again and vfs.zfs.l2arc_feed_min_ms. They increase the frequency of feeding the L2ARC when writing to it exceeds half of write_max. So there are actually two ways to increase the write speed: set a hard limit with write_max and set l2arc_feed_again=0, or set a lower write_max and let the feed-again mechanism kick in. Fortunately, these settings can be tested without rebooting the system.

But I think cranking write_boost up to almost the maximum write speed should not cause any problems.
Finally, the L2ARC works in a round-robin way instead of striping. However, for streaming data it should perform more or less the same as striping.
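To put rough numbers on that rule of thumb, assuming a hypothetical cache SSD that sustains 240MB/s of writes (the fraction is the judgment call discussed above, and the sysctl takes bytes per second):

```sh
#!/bin/sh
# Candidate l2arc_write_max values (bytes/s) for a 240MB/s cache SSD.
sustained=240   # MB/s, hypothetical sustained write speed
for frac in 2 3 4; do
    mbs=$(( sustained / frac ))
    echo "1/$frac: ${mbs}MB/s -> vfs.zfs.l2arc_write_max=$(( mbs * 1024 * 1024 ))"
done
```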


----------



## stuart (May 9, 2012)

If your array contains a mix of 512-byte and 4K drives, will it do any harm to align everything to 4K? I'm assuming not, as the 512-byte ones won't care either way?

In terms of aligning them, is it effectively sufficient to make sure the GPT partitions are on 4K boundaries? I'm not clear why using gnop helps so much, and why you only need to use it once.


----------



## kpa (May 9, 2012)

Aligning on 4K boundaries without using gnop(8) is not enough, because ZFS would still use 512-byte sectors as its smallest I/O unit. Creating a fake device with 4KB sectors forces ZFS to use 4KB as the smallest I/O unit in the pool.


----------



## stuart (May 9, 2012)

OK, so why is it that you don't need to create the gnop device every time? Most of the walkthroughs I've seen only seem to require it to be created once.


----------



## kpa (May 9, 2012)

It's only needed when you create the pool; once the ashift property is set, it cannot be changed.


----------



## Sebulon (May 9, 2012)

kpa said:

> It's only needed when you create the pool, once the ashift property is set it can not be changed.



Not quite correct. You need a gnop device for every new vdev in the pool. If your pool consists of 8 drives in one raidz(2,3) vdev, you only need one gnop device. If it is 2x 4-drive raidz(2,3) vdevs, you need two gnops; 4x 2-drive mirror vdevs require four gnops. And finally, a pool of 8x single-drive vdevs (no fault tolerance) needs all eight to be gnops.

Also, because the ashift value is set per vdev instead of pool-wide, it is possible to have e.g. a 2-drive mirror with ashift=9, and the next time you buy two more hard drives (most probably AF ones), add another 2-drive mirror to the pool where the new vdev has ashift=12 instead.

/Sebulon


----------



## t1066 (May 10, 2012)

This is an update on how L2ARC works.

I had upgraded x11/nvidia-driver. But when I

`# kldunload nvidia`

this command just hung, so I rebooted the system.

I had set write_max to 25MB/s. First, when ARC was warming up, I got the following result.


```
$ iostat -xz -w 1 ada0 da6

                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0       0.0  19.0     0.0  2317.5    0   0.5   1
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
da6        0.0  20.0     0.0  2557.2    0   1.6   1
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0       0.0  26.0     0.0  3325.3    0   1.2   2
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
da6        0.0  35.0     0.0  4475.7    0   1.4   2
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0       0.0  14.0     0.0  1678.3    0   0.7   1
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
da6        0.0  24.0     0.0  3069.0    0   1.2   1
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0       0.0  19.0     0.0  2429.5    0   0.9   1
```

We can see from above that L2ARC is indeed writing in a round robin way. Each drive got a write in alternate seconds. (ada0 and da6 combined to form the L2ARC.) 

After the ARC warmed up, I dd a 3G file to /dev/null and got the following result.


```
$ iostat -d -w 1 -c 60 ada0 da6
            ada0              da6
  KB/t tps  MB/s   KB/t tps  MB/s
 86.79  18  1.53  105.60  16  1.68
 128.00   1  0.12   0.00   0  0.00
 72.00   2  0.14   0.00   0  0.00
  0.00   0  0.00   0.00   0  0.00
 128.00   1  0.12  128.00   2  0.25
  0.00   0  0.00   0.00   0  0.00
 119.38  39  4.54  128.00  31  3.87
 126.58  45  5.56  127.04  67  8.30
 124.44  54  6.56  114.33  96 10.71
 128.00  43  5.37  105.65  48  4.95
 119.80  41  4.79  128.00  47  5.87
 128.00  88 10.99  123.33 120 14.44
[B] 119.68 148 17.28  128.00 114 14.24
 128.00 293 36.59  125.91 322 39.55
 126.77 729 90.29  128.00 737 92.16
 128.00 543 67.93  126.50 448 55.29[/B]
 128.00   3  0.37  128.00 150 18.73
 128.00   1  0.12  128.00   7  0.87
 128.00   6  0.75  128.00   1  0.12
```

The bold part shows the part when turbo warmup kicked in.


----------



## stuart (May 10, 2012)

Sebulon said:

> Not quite correct. You need to have a gnop-device first for every new vdev in the pool. If your pool consists of 8xdrives raidz(2,3), you only need one gnop-device. If it is 2x4drives raidz(2,3), you need two gnops. 4x2 mirror vdevs require four gnops. And finally, a pool with 8x1 single-drive vdevs (no fault-tolerance), you need all eight to be gnops
> 
> Also, because the ashift-value is set per vdev, instead of pool-wide, it is possible to have, e.g. 2xdrives mirror with ashift=9 and the next time you buy two more hard drives (most probably AF ones), you can add another 2xdrives mirror into the pool but this new vdev can have ashift=12 instead.
> 
> /Sebulon



So we use gnop when creating a vdev to make sure ZFS writes in 4K blocks, but once this is done, that's it? So using -b 4096 with GPT and gnop is enough for alignment?

Sorry, just wanted to make sure I understand this fully.


----------



## Sebulon (May 11, 2012)

@stuart

Using *-b 2048 -a 4k* with GPT and gnop is enough.

/Sebulon


----------



## wblock@ (May 16, 2012)

Careful, -a overrides -b until a very recent -stable.  See PR bin/167567.


----------



## Sebulon (May 16, 2012)

@wblock

To my understanding it was -a -s overriding -b, at least in my own experience.

/Sebulon


----------



## einthusan (May 16, 2012)

Sebulon said:

> @wblock
> 
> To my understanding it was -a -s overriding -b, at least in my own experience.
> 
> /Sebulon



Is it necessary to create gnop devices and add them to the pool first, and only then take them offline and replace them with the real disks, as your method shows? Why can't we add the disks directly to the pool without creating gnop devices at all?


----------



## kpa (May 16, 2012)

The .nop devices are needed to force a 4KB minimum I/O size on disks that have 4KB physical sectors but lie about it and report 512-byte sectors instead. I'm not sure what you mean by "replace with real disks"? You can create .nop devices directly on top of the real disks, if that's what you're asking.


----------



## Sebulon (May 16, 2012)

einthusan said:

> Is it necessary that we create gnop devices and add them to the pool first, and only then take them offline to replace with real disk as your methods show? Why can't we add disk directly to the pool without creating gnop devices at all?



Because if you create the gnop provider on top of the gpt provider and use that to create (or add to) the pool, the label in zpool "falls off" after a reboot. Don't ask me why.

I've asked; no one knows. The "phenomenon" can be read about in, e.g. my thread:
Labels "disappear" after zpool import

And I've seen at least a dozen more people describing the same problem, both here on the forum and on mailing lists. So this procedure of not using the actual drives when you create (or add to) the pool is a workaround that I know works.

/Sebulon


----------



## einthusan (May 16, 2012)

Awesome thanks for the help guys!


----------



## gkontos (May 17, 2012)

Sebulon said:

> I've asked, no one knows. The "phenomenon" can be read about in, e.g. my thread:
> Labels "disappear" after zpool import
> /Sebulon



FYI, I also experienced the same phenomenon up to FreeBSD 9.0-STABLE #1 r235225, where I now get:


```
> zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 1h37m with 0 errors on Wed May 16 00:11:55 2012
config:

	NAME            STATE     READ WRITE CKSUM
	tank            ONLINE       0     0     0
	  raidz1-0      ONLINE       0     0     0
	    gpt/zdisk1  ONLINE       0     0     0
	    gpt/zdisk2  ONLINE       0     0     0
	    gpt/zdisk3  ONLINE       0     0     0
```

This just happened all of a sudden!


----------



## einthusan (May 18, 2012)

gkontos said:

> FYI I also experienced the same phenomenon up to FreeBSD 9.0-STABLE #1 r235225 where I now get:
> 
> 
> ```
> ...



Well, does it matter that these labels fall off? My system seems to be working fine, and my labels look much worse than that:


```
NAME            STATE     READ WRITE CKSUM
pool1           ONLINE       0     0     0
  mirror-0      ONLINE       0     0     0
    ada0p2      ONLINE       0     0     0
    ada2p2      ONLINE       0     0     0
  gpt/disk1     ONLINE       0     0     0
```


----------



## phoenix (May 22, 2012)

Export the pool, then import it using the -d option to point at /dev/gpt. That will force zpool(8) to scan for devices via their GPT labels.

`# zpool import -d /dev/gpt pool1`


----------

