# Realigning ZFS to 4k



## gpeskens (Aug 20, 2012)

Hey all,

I've got a biggish ZFS setup consisting of 6 drives (2 raidz vdevs, 3x1TB and 3x1.5TB)

Recently 1 of the 1TB drives died, and another is on the verge of dying.

Since the new 1TB drive I got out of warranty had exactly the same sector count, I just dd'ed the old MBR from one of the other drives (the 3 1TB drives each consist of a 50GB partition being gmirrored, a 4GB one for swap, and the rest for ZFS), and I noticed horrific performance.

Doing some googling I found out that my new drive has 4K sectors, while all the other old ones are 512-byte.

Since one of the other 1TB drives is also failing I'll get another out-of-warranty replacement with the same 4K specs, and I also bought a new 1TB drive to replace the third before it fails.

I redid the partitioning on the new drive, dropping the swap slice and aligning the 50GB slice to 4K boundaries; I did the same for the ZFS slice.

Will ZFS automagically pick up the correct physical alignment of the drive? If not, is there a "hack" to make it work? (One that does not involve replacing all drives with 4K ones and exporting and recreating the pool.)

`camcontrol` does identify the drive as having a physical sector size of 4K with a logical size of 512.


----------



## gpeskens (Aug 20, 2012)

Sorry, forgot to mention (and cannot edit)

I'm running 9.0-STABLE.


----------



## Sebulon (Aug 21, 2012)

@gpeskens

Partition alignment can be done like this:
`# gpart add -t freebsd-zfs -l diskX -a 4k (a)daX`
or
`# gpart add -t freebsd-zfs -l diskX -b 2048 (a)daX`
for the partition to start at 1MiB.

Do that before replacing the drive with `zpool replace`. But then there's the ashift value, which determines the IO size that ZFS sends to a vdev in a pool, and unfortunately that cannot be modified afterwards. Since you didn't do anything about that (no one thought of this before), your ashift value is most likely 9. You can verify this with:
`# zdb poolname | grep ashift`
This means that ZFS sends 512b IOs to that vdev. That is suboptimal for a drive with a 4k sector size. If you were to create a new pool and use the gnop trick to present the drives as true 4k drives, the ashift value would be 12, which means that ZFS instead sends 4k IOs.
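For completeness, the gnop trick mentioned above is usually done something like this at pool creation time (device and pool names are just examples):

```shell
# Layer a fake 4k-sector provider on top of one of the disks
gnop create -S 4096 /dev/ada0p2

# ZFS sets ashift per vdev from the largest sector size it sees at creation,
# so one .nop device in the vdev is enough to get ashift=12
zpool create tank raidz /dev/ada0p2.nop /dev/ada1p2 /dev/ada2p2

# The gnop layer is only needed during creation
zpool export tank
gnop destroy /dev/ada0p2.nop
zpool import tank

# Verify
zdb tank | grep ashift
```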

For a drive with 512b sectors, receiving a 4k IO is quite alright; it just writes eight 512b sectors. But for a 4k drive receiving only a pesky little 512b IO, what is it to do? Now, how big of a performance impact this has is impossible to say; you might not even notice it. So I would suggest you just do the aligning part, benchmark with e.g. benchmarks/bonnie++, and then decide if you feel the need to back everything up, destroy and rebuild from there.
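A minimal bonnie++ run could look something like this (the path is just an example; `-s` should be about twice your RAM so caching doesn't skew the result):

```shell
# Sequential/random write and read benchmark against a directory on the pool
bonnie++ -d /tank/bench -s 16g -u root
```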

You might also want to keep an eye on how the individual drives are doing during the benchmark with:
`# gstat`
to see if the 4k drive is working its ass off while the others are handling it OK, or vice versa.


/Sebulon


----------



## gpeskens (Aug 21, 2012)

Sebulon said:

> @gpeskens
> 
> Partition alignment can be done like:
> `# gpart add -t freebsd-zfs -l diskX -a 4k (a)daX`
> ...


Yes, this I did, at least somehow (using MBR, aligning to 4k sectors, since I don't know whether my system will properly boot with my current setup if I change to GPT). The gmirror performance is good after switching the drive to 4k alignment.


> Do that before replacing the drive with `zpool replace`. But then there's the ashift value, which determines the IO size that ZFS sends to a vdev in a pool, and unfortunately that cannot be modified afterwards. Since you didn't do anything about that (no one thought of this before), your ashift value is most likely 9. You can verify this with:
> `# zdb poolname | grep ashift`


Yes, my 2 vdevs are unfortunately at an ashift of 9.


> For a drive with 512b sectors, receiving a 4k IO is quite alright; it just writes eight 512b sectors. But for a 4k drive receiving only a pesky little 512b IO, what is it to do? Now, how big of a performance impact this has is impossible to say; you might not even notice it. So I would suggest you just do the aligning part, benchmark with e.g. benchmarks/bonnie++, and then decide if you feel the need to back everything up, destroy and rebuild from there.
> 
> You might also want to keep an eye on how the individual drives are doing during the benchmark with:
> `# gstat`
> ...



So I guess the conclusion is that I'm quite screwed over until I recreate the vdevs with the correct alignment.
Since the pool consists of 2 vdevs, and data usage is slightly less than the size of the second vdev, would it be possible to somehow move all data away from the vdev to be realigned? That way I could destroy the single vdev and recreate it with the correct alignment.

Currently I don't own enough disks to completely migrate data back and forth :/


----------



## kpa (Aug 21, 2012)

You can't remove vdevs from the pool, only add new ones. ZFS stripes data across the vdevs much like RAID 0 and that striping can't be undone.


----------



## gpeskens (Aug 21, 2012)

kpa said:

> You can't remove vdevs from the pool, only add new ones. ZFS stripes data across the vdevs much like RAID 0 and that striping can't be undone.



Yes, I was afraid of this :/ Hmm... destroying some snapshots (after the resilver is done, as I've seen reports of this causing the resilver to restart), deleting lots of unused data, and setting up another PC with old drives I have lying around is an option, though I guess rsyncing ~2.5TB from my system, even over 1 Gbit, would take a lot of time.


----------



## Sebulon (Aug 21, 2012)

gpeskens said:

> So I guess the conclusion is that I'm quite screwed over...



Yes, quite. Sorry. Because as kpa pointed out, you cannot change the properties of a vdev or destroy it. Only back up and recreate.

But as I said, you might not even notice it, so benchmark before deciding anything.


The only looongshot I can think of is to OFFLINE one drive from each vdev, create a temporary striped pool from them (aligned and ashift=12), copy the data over to the temporary pool, destroy the old pool, and recreate it properly aligned and with ashift=12, using fake file-backed drives to cover the parity, which you immediately offline after creation. Then copy the data back, destroy the temporary pool, and replace in the last two drives from the destroyed temporary pool. But the temporary pool probably wouldn't be big enough (1TB+1.5TB=2.5TB) anyway, so... Yes, quite.
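For what it's worth, the file-backed part of that longshot would look roughly like this (device names and sizes are made up, and this is destructive, so treat it strictly as a sketch):

```shell
# A sparse file stands in for the drive that is still holding the temporary pool
truncate -s 1T /tmp/fake0

# Recreate a raidz vdev degraded on purpose, with gnop so it gets ashift=12
gnop create -S 4096 /dev/ada3p2
zpool create tank raidz /dev/ada3p2.nop /dev/ada4p2 /tmp/fake0

# Offline the fake immediately so nothing is actually written to it
zpool offline tank /tmp/fake0

# Later, after copying the data back and destroying the temporary pool:
zpool replace tank /tmp/fake0 /dev/ada5p2
```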

Don't feel bad, you couldn't have known this, no one did. But as a best practice, always create new pools and vdevs with ashift=12 for better compatibility.

/Sebulon


----------



## gpeskens (Aug 21, 2012)

Sebulon said:

> Yes, quite. Sorry. Because as kpa pointed out, you cannot change the properties of a vdev or destroy it. Only back up and recreate.
> 
> But as I said, you might not even notice it, so benchmark before deciding anything.
> 
> ...



Not feeling bad at all. It's a home-server hobby project, though I do run live websites for myself and mail for friends.

I was already planning to "redo" the server sometime this year, as things are getting a bit more messy every month. This just adds to the motivation.

I've already made many mistakes (let's call them poorly informed decisions) on this machine. For one, I can tell you that running dedup on this dataset with "only" 8GB of RAM can grind your system to a complete halt; it took me a week of copying data around, deleting the dedupped copy, and moving it back to get performance OK again.

Since I still have another board lying around with 5 SATA ports, and I still have 4 500GB drives and now 1 spare 1TB, I could rsync the data to it (configuring it without redundancy). I calculated that 2.5TB would take 5.7 hrs under perfect conditions, so I guess it would really take around 10 hrs. I could even set it up to temporarily host my websites and mail, so I could do a full reinstall of the main machine.
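As a sanity check, that estimate is about right; the raw arithmetic (decimal TB, full gigabit line rate, no protocol or disk overhead) works out to:

```shell
# 2.5 TB over 1 Gbit/s; 1 Gbit/s = 125 MB/s at theoretical line rate
bytes=$((2500 * 1000 * 1000 * 1000))   # 2.5 TB in bytes
rate=$((125 * 1000 * 1000))            # bytes per second
secs=$((bytes / rate))                 # 20000 seconds
echo "$((secs / 3600))h $((secs % 3600 / 60))m"   # prints: 5h 33m
```

Real-world rsync over the network will be noticeably slower, hence the ~10 hr guess.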

Currently the main machine is set up the following way:

ada0   : 1 slice 50GB
       : 1 slice 4GB
       : rest ZFS
ada1+2 : 1 slice 50GB (4K-aligned MBR)
       : rest ZFS
ada3-5 : ZFS

The 3 50GB slices are actually gmirrored into one gm0 which holds a few bsdlabels (effectively this makes all the 50GB slices have bsdlabels as well, adding redundant boot disks this way).

The 4GB slice on ada0 was for swap (it was also on the old ada1 and 2, but I moved swap to ZFS since a dying disk holding swap panicked the system).

The main 2 things I'm aiming for are root partition redundancy (the system should work and boot even if 2 disks fail) and a large data tank (redundancy is still important, but less so than system operability).

If I redid the server, what would you advise me? Go for 1 big 4k-aligned raidz2 vdev also holding root?
Keep the same or similar gmirror MBR setup?
Create a GPT setup with gmirror? (I have a legacy BIOS, so this might prove a bit more prone to trouble.)


----------



## Sebulon (Aug 23, 2012)

gpeskens said:

> I've already made many mistakes (let's call them poorly informed decisions) on this machine. For one, I can tell you that running dedup on this dataset with "only" 8GB of RAM can grind your system to a complete halt; it took me a week of copying data around, deleting the dedupped copy, and moving it back to get performance OK again.


Oh god how I outright hate dedup in ZFS. It has caused me nothing but pain. Just thinking about it makes my stomach hurt. Just when you least expect it, it jumps up and bites you in the ass, hard. If it wasn't for dedup, my experience with ZFS would have been wonderful, and instead it's somewhat blemished.



			
gpeskens said:

> If I redid the server, what would you advise me? Go for 1 big 4k-aligned raidz2 vdev also holding root?
> Keep the same or similar gmirror MBR setup?
> Create a GPT setup with gmirror? (I have a legacy BIOS, so this might prove a bit more prone to trouble.)


Well, since I'm paranoid about redundancy (occupational hazard), I opt for better fault tolerance any day. But with your 3x1TB + 3x1.5TB, if you go with a 6-disk raidz2 (or even raidz3) you would be throwing 1.5TB down the drain, which is a waste.

Although, if you partition those 1.5s with a 1TB partition + a 500GB partition, you could do something else.
In that scenario, I would go for a 6-disk raidz2 in one pool and a 3-way mirror in another pool. That would provide you with an equal amount of storage (though not in the same place) but with much better fault tolerance, because you could afford to lose ANY two drives at once without compromising your pools.

I say go for broke and do boot and root from ZFS as well. That way you are also protected against boot problems if any of the drives fail.

/Sebulon


----------

