# Deploying Multiple Systems ==> Drives, Filesystems, Imaging, Etc.



## Dave-D (May 11, 2022)

Hi. I'm new to FreeBSD, but learning.

Want to make sure I'm heading in the right direction.

Currently doing test builds on servers and clients.

Here is what I'm considering, any feedback is greatly appreciated.

SERVERS:
For 3 small businesses.
Each server = 6 hard drives,
1 boot/operating system drive, (UFS or ZFS, not yet sure)
3 mirrored "Data" drives. (ZFS)
2 mirrored "Local Backup" drives. (ZFS)
(each server has a SCSI tape drive for long term storage.)

SERVER QUESTIONS:
1.) Does this seem like a reasonable drive arrangement?
2.) Re: the boot/operating system drive: if I go with UFS I can use Clonezilla to image it, for quick bare-metal restores.
      I've yet to work with ZFS, and I'm not sure if I can clone/image a whole system the way I can with Clonezilla.
      Clonezilla does not support ZFS.
      What's the best choice for this drive, ZFS or UFS, and why?

CLIENTS:
I'll be deploying the same basic client, with minor differences, to 20+ workstations, possibly more.
Would like to store the setups (images) in some way for rapid deployment.
1 (single) or 2 (mirrored) drives containing everything. (ZFS)

CLIENT QUESTIONS:
1.) What's the better file system for clients, ZFS or UFS?
      UFS I can image with Clonezilla.
      I've yet to work with ZFS, and I'm not sure if I can clone/image a whole system the way I can with Clonezilla.
      Clonezilla does not support ZFS.
      What's the best choice for client drives, ZFS or UFS, and why?

Any and all feedback greatly appreciated!
Sharing of similar use-cases greatly appreciated, if possible!!

Thanks,
Dave


----------



## Geezer (May 11, 2022)

ZFS vs UFS
They are both good. It is a personal choice. You should try both and make your own decision. Neither would be a mistake so you cannot go too wrong.

No need to clone with third-party software; all the tools are there in FreeBSD. But better still, if you have to install afresh, then install afresh!
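For example, the native way to clone a whole ZFS system is a recursive snapshot piped through `zfs send`/`zfs receive`. A rough sketch (the pool name `zroot` and a target pool `newroot` on a second disk are assumptions; boot blocks still have to be written to the target disk separately):

```shell
# Snapshot every dataset in the root pool at once (atomic, near-instant)
zfs snapshot -r zroot@clone

# Replicate the whole pool, with properties and all snapshots,
# into an empty pool created on the target disk
zfs send -R zroot@clone | zfs receive -F newroot
```

The same stream can be redirected into a file on backup media instead of a second pool.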


----------



## Dave-D (May 11, 2022)

Geezer said:


> ZFS vs UFS
> They are both good. It is a personal choice. You should try both and make your own decision. Neither would be a mistake so you cannot go too wrong.
> 
> No need to clone with third-party software; all the tools are there in FreeBSD. But better still, if you have to install afresh, then install afresh!



Is there no problem with running UFS on operating system drive, then ZFS on the other 5 drives?

Dave


----------



## richardtoohey2 (May 11, 2022)

Geezer said:


> ZFS vs UFS
> They are both good. It is a personal choice. You should try both and make your own decision. Neither would be a mistake so you cannot go too wrong.


This; personally I've not _yet_ moved off UFS (with/without hardware RAID) but ZFS has a _lot_ of good stuff in it (including snapshots, boot environments etc.).

Every time I start looking at ZFS, though, someone seems to pop up in the forums with a ZFS issue, and I get a bit nervous.  But I think that is true of any technology: the majority of people have no issues so they don't have anything to say, so the bulk of any "noise" comes from the small portion of users who do encounter an issue.

As Geezer says, though, it's probably best to try both and see how you go.


----------



## Phishfry (May 11, 2022)

Dave-D said:


> Is there no problem with running UFS on operating system drive, then ZFS on the other 5 drives?


No problem at all. You could even run a UFS gmirror for the OS drive and ZFS for the tank.

For me it was an easier learning experience to separate the OS from the data.
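As a sketch of that split (device names `ada0`/`ada1` are assumptions; gmirror keeps its metadata in the disk's last sector, so the partitioning must leave that sector free):

```shell
# Load the mirror module now, and at every boot
gmirror load
echo 'geom_mirror_load="YES"' >> /boot/loader.conf

# Create a mirror from the OS disk, then attach the second disk as its twin
gmirror label -v gm0 /dev/ada0
gmirror insert gm0 /dev/ada1

# The UFS filesystems then live on /dev/mirror/gm0* instead of /dev/ada0*
```

The ZFS tank on the remaining disks is created separately with `zpool create`, so the two worlds never touch.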


----------



## gpw928 (May 11, 2022)

ZFS has a great unique feature called boot environments.  Look into that before making up your mind.
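Boot environments are driven by `bectl(8)` on a ZFS root. A minimal sketch of the upgrade-with-a-safety-net workflow (the environment name is arbitrary):

```shell
# Checkpoint the current system before upgrading (instant; it's a ZFS clone)
bectl create pre-upgrade

bectl list            # show all boot environments and which one is active

# If the upgrade misbehaves: activate the checkpoint and reboot into it
bectl activate pre-upgrade
shutdown -r now
```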

But... ZFS has problems if your disks get too full.  So keep that firmly in mind when sizing systems.  Don't let them go past about 80%.

Also, I would never build any "business" system without some sort of redundancy on *all* the disks.

With 40 years in IT, I have seen a *lot* of failed disks.  You don't want one dead disk to ruin your day.

It's easy enough to clone a running ZFS system, provided you can hot-swap the clone's target disk in and out.

I'd be concerned about the SCSI tape drives.  I threw out all my tapes and tape drives a long time ago.  However, we'd need to examine  your data capacities before making any sensible suggestions regarding the backup strategy.


----------



## gpw928 (May 11, 2022)

Phishfry said:


> No problem at all. You could even run a UFS gmirror for the OS drive and ZFS for the tank.


That's exactly what I did at the outset, but I have converted to a ZFS root now to get the advantage of boot environments.


Phishfry said:


> For me it was an easier learning experience to separate the OS from the data.


That's best practice at any time.


----------



## Dave-D (May 11, 2022)

gpw928 said:


> ZFS has a great unique feature called boot environments.  Look into that before making up your mind.
> 
> But... ZFS has problems if your disks get too full.  So keep that firmly in mind when sizing systems.  Don't let them go past about 80%.
> 
> ...



RE:  ZFS Sizes:
Set a ZFS quota at around 75%?

These are all small businesses, so if a server goes down for a day it's not good,
but not catastrophic. Still, best not to go there at all, if possible.
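Something like this, maybe (the pool/dataset names and the 4TB pool size are just assumptions on my part):

```shell
# Cap the data dataset at 3TB, i.e. about 75% of a 4TB pool
zfs set quota=3T tank/data

# Verify, and keep an eye on how close to the cap it gets
zfs get quota,used,available tank/data
```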

Redundancy on ALL disks:
Server motherboard only has 6 ports.

Could add a card?
Do you know of any non-scsi sata cards so I can add another drive?
(I've heard that ZFS does NOT like hardware SCSI. Even though could disable,
best not to have a card with SCSI, if possible)???

Otherwise:
What if I change my drive arrangement . . .
Go with 2 mirrored operating system drives, (ZFS)
and reduce "Local Backups" drive to one drive? (ZFS)
If Local Backups goes down, the data is still on the 3 mirrored "Data Drives"
Local Backups will be for misc. stuff, depending on how it gets used,
non-catastrophic if it goes away.
BUT - if operating system goes away, end of story, until restored.


BACKUPS:

Tape drives:
HP Ultrium 1760 LTO-4 SCSI
800GB/1.6TB
Mainly for incremental backups during the week (non-attended).

Current data size:  Approx. 300GB on each server.

May also use portable drive enclosures (for rotating off-site backups)
Possibly some cloud-based.

Not sure yet, I'm new to this, will have to figure it out as we go.

Dave


----------



## gpw928 (May 11, 2022)

I'd mirror the root, and take the risk on the backups disk.  Or move the backups disk to external USB disk (easier to rotate).

For the price of an LTO tape drive you could buy a bunch of 500 GB USB disks for off-site rotation.

Study the USB spec very carefully.  Understand 3.1, 3.2, Gen 1, Gen 2, and make sure you match accordingly.

Is this a single customer here, or many unrelated?  What's the geographic topology?  Are there network connections between hosts?  Back to base?  How much bandwidth?  [I'm trying to figure out if backups might be run over a network connection to remote sites.]


----------



## Phishfry (May 11, 2022)

Dave-D said:


> Do you know of any non-scsi sata cards so I can add another drive?


NVMe makes for nice addition. Maybe you could use part of it for your ZFS cache too.


----------



## richardtoohey2 (May 11, 2022)

+1 to NVMe, and you can get PCIe cards for M.2, including fancy (too fancy?) ones that will let you fit two M.2 NVMe drives and do RAID 1 for you.  Or two (and _maybe_ 4?) M.2 NVMe drives with no RAID.  Not sure if there's any performance penalty for that sort of set-up, though.


----------



## Dave-D (May 11, 2022)

NVMe is not an option at this point.


----------



## Dave-D (May 11, 2022)

gpw928 said:


> I'd mirror the root, and take the risk on the backups disk. Or move the backups disk to external USB disk (easier to rotate).



I'm planning on building a backup routine using:
External HDs in quick-swap drive enclosures
Internal SCSI tape drive
Cloud (encrypted)


----------



## ralphbsz (May 11, 2022)

You don't need dedicated OS disks for the root filesystem. The root filesystem is likely small enough that it can be done with small partitions on a disk that is otherwise used for data. How about this: two disks are partitioned, with a small partition for root (mirrored) plus a large partition for backup (also mirrored). The other four disks are for data (mirrored, or better, RAID-Z2).
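A rough sketch of that layout with `gpart` (device names, partition sizes, and pool names are all assumptions; an EFI boot partition is included on the two root disks):

```shell
# Two disks each carry a small root partition plus a large backup partition
for d in ada0 ada1; do
  gpart create -s gpt $d
  gpart add -t efi -s 260M $d            # boot loader partition
  gpart add -t freebsd-zfs -s 100G $d    # small root partition
  gpart add -t freebsd-zfs $d            # rest of the disk for backups
done

zpool create zroot  mirror ada0p2 ada1p2         # mirrored root
zpool create backup mirror ada0p3 ada1p3         # mirrored local backup
zpool create tank   raidz2 ada2 ada3 ada4 ada5   # four data disks
```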


----------



## Dave-D (May 11, 2022)

ralphbsz said:


> You don't need dedicated OS disks for the root filesystem. The root filesystem is likely small enough that it can be done with small partitions on a disk that is otherwise used for data. How about this: two disks are partitioned, with a small partition for root (mirrored) plus a large partition for backup (also mirrored). The other four disks are for data (mirrored, or better, RAID-Z2).



You might be onto something.
I'm not doing the dedicated OS disks for size reasons, but rather to keep the OS separate from the data; I've heard this is a better/cleaner/easier way to go, if you can do it.
The "Local Data" drive will possibly serve (really not sure yet, remains to be seen) for database dumps, system images, ZFS snapshots/clones/etc.
Your idea of combining the OS and Local Data drives is brilliant.
But I would have to do the install inside of a partition, and I haven't been able to figure out how to do that. But I really like your idea.

Something to think about:
RAIDZ2 takes 4 drives and can lose up to 2.
Mirror with 4 drives can lose up to 3.
Also, I assume that the surviving mirrored drive has all the data structures intact. With RAIDZ2 the surviving 2 drives have my data scattered across 2 drives, not a direct "look at your original data and work with it directly" type of situation.

Or, what about 3 mirrored drives for "OS+Local Backups" + 3 mirrored drives for "Data"?

Which, in your opinion, would be better?

Thank you for your idea!
Dave


----------



## mer (May 11, 2022)

Just my opinions:
I like the idea of separate OS and Data drives.  Makes upgrading easier.  You can have smaller OS devices, say 250G is plenty and have bigger Data drives.  I acknowledge ralphbsz solution as a valid idea, but since we're talking opinions, I like separate.

ZFS, ZFS mirror for the boot devices.  Why?  Boot Environments.  Best way to do upgrades, simply because rollback is "reboot, stop in boot loader, choose a working BE, boot"  Mirrors because honestly, the OS device is mostly read only after boot (assumes configuration has already been done).  Log files are about the biggest thing that is read/write.
Why Mirror boot device?  Protection against the boot device physically failing.  You may have to be physically present to boot from the other one in the pair, but it should come up and let you replace the failed device.

Mirror with more than 2 vs RAID configurations.  Mirrors are only as big as the smallest device.  The RAIDs give you various multipliers.  Performance:  Mirrors read complete at the fastest device, writes complete at the slowest.  RAIDs are in between.
I have no input on the optimal RAID configuration for 4 or more drives.


----------



## Erichans (May 11, 2022)

Dave-D said:


> RAIDZ2 takes 4 drives and can lose up to 2.
> Mirror with 4 drives can lose up to 3.


Assuming the 4 drives are all the same size:

- RAID-Z2 with 4 drives leaves you with _two_ disks' worth of available disk space for your data.
- A mirror with 4 drives, of which you can afford to lose 3, is a 4-way mirror. That leaves you with _one_ disk's worth of available disk space for your data.

Comparing (short version) 4 disks, both layouts leaving 2 disks' worth for your data:

- A pool consisting of one RAID-Z2 vdev: any 2 disks can fail _and you still have all your data_*.
- A pool consisting of two vdevs, each a 2-disk mirror:
  - any one disk can fail _and you still have all your data_*;
  - 2 disks can fail _if_ they are not part of the same vdev; _you still have all your data_*.

Losing one vdev in a pool means losing the whole pool**. More info: 4 drives coming - raid-z1, or what; also the thread containing my earlier message: pool layout & references
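The two layouts, expressed as `zpool create` commands (the pool and device names are assumptions):

```shell
# One RAID-Z2 vdev: any two of the four disks may fail
zpool create tank raidz2 ada2 ada3 ada4 ada5

# Two striped 2-way mirror vdevs: at most one disk per mirror may fail
zpool create tank mirror ada2 ada3 mirror ada4 ada5
```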

Please take under consideration that of the maximum available disk space for your data, the utilization should—preferably—max out at ca. 80%.

Edit:
 * italic texts added
** losing the pool = losing your data in that pool. There is no (UFS) fsck(8) equivalent for ZFS: you'll have to resort to backups.


----------



## Dave-D (May 11, 2022)

mer said:


> Just my opinions:
> I like the idea of separate OS and Data drives.  Makes upgrading easier.  You can have smaller OS devices, say 250G is plenty and have bigger Data drives.  I acknowledge ralphbsz solution as a valid idea, but since we're talking opinions, I like separate.
> 
> ZFS, ZFS mirror for the boot devices.  Why?  Boot Environments.  Best way to do upgrades, simply because rollback is "reboot, stop in boot loader, choose a working BE, boot"  Mirrors because honestly, the OS device is mostly read only after boot (assumes configuration has already been done).  Log files are about the biggest thing that is read/write.
> ...



How about these two ideas:

*OPTION 1:*

6 Drives total, as follows:

VDEV #1 = 2 drives @ 500GB ea. (mirrored) = 500GB total size, (optionally) partition to use outer 250GB (for speed) (x.8 = 200GB avail.)  ==> [OS]
(can lose 1 drive)
VDEV #2 = 4 drives @ 2TB ea. (RAIDZ2) = 4TB total size, 1 partition for 3TB (x.8 = 2.4TB avail.) [DATA], 1 partition for 1TB (x.8=800GB avail.) [LOCAL_BACKUP] ==> [DATA + LOCAL_BACKUP]
(can lose 2 drives)

(??) (not sure I like that setup: I only end up with 2.4TB Data, and the 3TB Data partition is spread across 2 drives in the pool (since each drive is only 2TB total) - not sure if that might be an issue.)

*OPTION 2:*

6 Drives total, as follows:

VDEV #1 = 2 drives @ 2TB ea. (Mirrored) = 2TB total size, 1 partition for 250GB (x.8=200GB avail.) [OS], 1 partition for 1750GB (x.8=1400GB avail.) [LOCAL_BACKUP] ==> [OS + LOCAL_BACKUP]
(can lose 1 drive)
VDEV #2 = 4 drives @ 2TB ea. (RAIDZ2) = 4TB total size, 1 partition for 4TB (x.8 = 3.2TB avail.) [DATA] ==> [DATA]
(can lose 2 drives)

**HOWEVER**
re: Erichans ... Does the 4TB in my OPTION 2, VDEV #2 come from one VDEV= 4TB available, or two mirrored vdev's = 4TB avail?
The capacity of 4TB is little confusing. I don't want to risk losing any 2 drives *AS LONG AS* they are not in the same VDEV, sounds scary to me.


*Thoughts....?
.
.
.*


----------



## gpw928 (May 12, 2022)

mer said:


> I like the idea of separate OS and Data drives.  Makes upgrading easier.  You can have smaller OS devices, say 250G is plenty and have bigger Data drives.  I acknowledge ralphbsz solution as a valid idea, but since we're talking opinions, I like separate.
> 
> ZFS, ZFS mirror for the boot devices.  Why?  Boot Environments.  Best way to do upgrades, simply because rollback is "reboot, stop in boot loader, choose a working BE, boot"  Mirrors because honestly, the OS device is mostly read only after boot (assumes configuration has already been done).  Log files are about the biggest thing that is read/write.
> Why Mirror boot device?  Protection against the boot device physically failing.  You may have to be physically present to boot from the other one in the pair, but it should come up and let you replace the failed device.


I agree 100% with everything said above.

In addition, with the tank physically separate from the root, you can export the tank (applications and data) and optionally send it anywhere you like.  You can do anything you want to the OS and the media it lives on, knowing that you cannot impact the tank: replace the media, re-provision a new physical system (in another location, if you want), optionally bring the tank back, and import it into the system.  These options *greatly* enhance your capacity to deal with adversity.  And if they are thought out ahead of your system builds, so that you test the processes when you initially deploy the systems, they will set you up for a much easier life with maintenance.  So, keep the root separate from applications and data, and either document or automate provisioning of the root.
I'd also recommend the same approach for the applications in the tank.

My thoughts on the layout options are that you have to figure out your own level of risk aversion.  I run a tank of four striped 2-spindle mirrors, so I can afford to lose one drive in each mirror.  That configuration gives me the best performance possible, with enough redundancy to make me feel comfortable.  But I'm physically present with the system, it's on a UPS, and I have several brand new 4 TB drives ready to deploy at a moment's notice.


----------



## Dave-D (May 12, 2022)

gpw928 said:


> I agree 100% with everything said above.
> 
> In addition, with the tank physically separate from the root, you can export the tank (applications and data) and optionally send it anywhere you like. You can do anything you want to the OS and the media it lives on knowing that you can not impact the tank, including replacement of the media or re-provisioning a new physical system (in another location, if you want), optionally bring the tank back, and import it into the system. These options *greatly* enhance your capacity to deal with adversity. And, if thought out ahead of your system builds, so you test the processes when you initially deploy the systems, will set you up for a much easier life with maintenance. So, keep the root separate from applications and data, and either document or automate provisioning of the root.
> I'd also recommend the same approach for the applications in the tank.
> ...



I agree with you completely.

It just makes sense.

A couple questions:

1.) I've heard about keeping the applications separate from the operating system, but it sounded like there was no clean or simple way to do that short of major surgery by a tech with greater knowledge and ability than what I currently possess.  Can you offer comments on this and possibly point to where I could get more information, if at all possible?

2.) What do you think about the OS and the LOCAL_BACKUP being in 2 partitions on the first 2 mirrored drives?  Then 4 drives for data, either mirrored or more likely RAIDZ2? Or would you make the OS drive pure-OS-only, and nothing else, always?

Thank you for your excellent post.


----------



## gpw928 (May 12, 2022)

Automated provisioning, and configuration management, are ubiquitous at the big end of town.  But there's a lot to master if you have never been there.  And the real benefits come when you have a large fleet of systems.  Puppet was probably the first of the configuration management tools.  But there's now many more.

If you only have a few near-identical systems to deploy, then keeping meticulous records of everything you do (and aiming to script it) may be a satisfactory mechanism.  You start with detailed records of how to install the root.

It's true that the installation of packages and applications will want to make changes in the root, e.g. to create accounts, or install configuration files.  But you can easily identify these:
```
touch stamp
# install something
sudo find / -newer stamp
```
In that way you can identify what your applications have added to the root and immediately document all the changes.

If you keep good documents, repairing broken systems, and automating new builds, gets a whole lot easier.

As Phishfry implies above, if you get enterprise-class low-latency SSDs with end-to-end data protection for the root mirror (think Intel SSD D3 series), you could place a ZFS intent log (ZIL) and an L2ARC on the SSDs.  The benefit would very much depend on the nature of the I/O load, but it could be substantial.  However, it doesn't make sense to buy expensive SSDs for backups...
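A sketch of attaching both to an existing pool (the device names are assumptions; the log is mirrored because losing it at the wrong moment can lose in-flight synchronous writes):

```shell
# Mirrored separate intent log (SLOG) on two SSD partitions
zpool add tank log mirror nvd0p1 nvd1p1

# L2ARC read cache; no redundancy needed, it only holds copies of pool data
zpool add tank cache nvd0p2
```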

I'm guessing the backups (in this explicit context) are lower in value because they are recoverable in other ways, so that means that they can be relegated to less premium storage.

You would need to flesh out your ideas on how the backups are going to work.  i.e. how much data routinely goes offsite, when it goes, how long it stays offsite, and how easily it can be accessed for recovery before it's possible to comment in detail on the local backup question.

Consider that if you can contain your application data in one or more dedicated ZFS file systems, you can pause your application, snapshot the file system (it takes no time), re-start the application and send the snapshot to backup media as a file.
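That pause/snapshot/restart cycle looks roughly like this (the dataset and path names are assumptions):

```shell
SNAP=$(date +%Y%m%d)

# Pause the application, take the (near-instant) snapshot, resume
zfs snapshot tank/appdata@"$SNAP"

# Stream the snapshot to backup media as a compressed file
zfs send tank/appdata@"$SNAP" | gzip > /backup/appdata-"$SNAP".zfs.gz

# Restore later with:
#   gunzip -c /backup/appdata-YYYYMMDD.zfs.gz | zfs receive tank/appdata-restored
```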

The obvious advantage of tape is that, occasionally, you can put one away "for ever".  But that's not quite true, as the media will eventually become unreadable unless it goes through a routine refreshment cycle.  You may be able to get a similar benefit by the long term retention of a compressed snapshot on an external USB disk.  e.g. a 14TB external USB disk costs a few hundred bucks, and I suspect you could get ten of these drives for the price of a single LTO tape drive (ignoring the controller and tapes).

At some stage you should look at what the LTO tape drives and controllers and tapes are going to cost, and the benefits you expect to derive.  Once that's on the table, and we know a little more about data volumes, and if there any network connections available, the arguments can be better prosecuted.


----------



## Erichans (May 12, 2022)

Dave-D said:


> *OPTION 2:*
> 
> 6 Drives total, as follows:
> VDEV #1 = 2 drives @ 2TB ea. (Mirrored) = 2TB total size, 1 partition for 250GB (x.8=200GB avail.) [OS], 1 partition for 1750GB (x.8=1400GB avail.) [LOCAL_BACKUP] ==> [OS + LOCAL_BACKUP]
> ...



I've extended my earlier message #17 a bit for clarification.

"[...] or two mirrored vdev's = 4TB avail?"
With 4 disks you have disk 1 and 2 in one mirror (first vdev); disk 3 and 4 in the other mirror (second vdev). The two vdevs are combined into one pool. With the RAID-Z2 layout you have one pool that consists of one RAID-Z2 vdev of 4 drives. Note the distinction between a pool and a vdev.

As to available space: a pool with a RAID-Z2 layout of 4*2TB disks has the same 4TB available for your data as a pool of two mirrors. With mirrors: if one disk fails, then the vdev to which the failed disk belongs has only one instance of its data (that data isn't anywhere else in the pool) and that vdev has lost its redundancy; the pool as a whole has lost its redundancy. With RAID-Z2: if one disk fails, then the (only) vdev still has one disk of redundancy left; the pool as a whole still has redundancy, even though it is reduced. As you can see, the robustness of RAID-Z2 is better than the mirror alternative when a pool consists of 4 disks.
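Both 4-disk layouts give the same usable space; only the failure tolerance differs. The arithmetic, as a quick sketch (plain shell arithmetic, no ZFS involved):

```shell
# Usable-space arithmetic for 4 x 2TB disks (illustrative only;
# real pools lose a little more to metadata and slop space)
DISKS=4; SIZE_TB=2; PARITY=2

# RAID-Z2: usable = (disks - parity) * disk size
RAIDZ2=$(( (DISKS - PARITY) * SIZE_TB ))

# Two striped 2-way mirrors: usable = (disks / 2) * disk size
MIRRORS=$(( DISKS / 2 * SIZE_TB ))

echo "RAID-Z2 usable: ${RAIDZ2} TB"   # prints: RAID-Z2 usable: 4 TB
echo "mirrors usable: ${MIRRORS} TB"  # prints: mirrors usable: 4 TB

# The ~80% fill guideline then leaves about 3.2TB for actual data
echo "80% of ${RAIDZ2} TB = $(( RAIDZ2 * 8 * 100 )) GB"
```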

A ZFS pool gets its redundancy from the redundancy of each of its constituent vdevs (image omitted).

If vdev _x_ in a pool loses its redundancy, it affects the whole pool. If vdev _x_ has already lost its redundancy completely and another disk in vdev _x_ fails, then the whole pool is lost.

Traditional RAID configurations can be compared to ZFS configurations to a certain extent, but you have to think a little differently because the concepts are not really the same. Perhaps have a look at Dan Langille's _ZFS for newbies_ mentioned here; you might be familiar with a lot of its contents, but its different point of view, of how to look at redundancy in ZFS rather than at parity as in traditional RAID configurations, is important. If you want easy access to valuable and complete information, perhaps have a look at the two ZFS (e)books: FreeBSD Development: Books, Papers, Slides


----------



## Dave-D (May 12, 2022)

Erichans said:


> I've extended my earlier message #17 a bit for clarification.



THANK You!
I'm learning. 
What you wrote helped me a lot.


----------



## ralphbsz (May 14, 2022)

I'm just going to collect a few small replies to minor points that may have been ignored earlier.



Dave-D said:


> Could add a card?
> Do you know of any non-scsi sata cards so I can add another drive?
> (I've heard that ZFS does NOT like hardware SCSI. Even though could disable,
> best not to have a card with SCSI, if possible)???


On the contrary. ZFS is perfectly fine with SCSI cards. There are lots of "large" systems around, running FreeBSD, using LSI/Avago/Broadcom SAS cards, with dozens of disks. One can also use one of those cards to connect extra SATA disks (nearly all SCSI cards can handle SATA disks). And also: you can get multi-port SATA cards too.

On the other hand: With 6 ports on the motherboard, I really don't think that more disk drive ports will be needed. In particular with modern disk capacity.



> Current data size:  Approx. 300GB on each server.


That's very small. Modern disk drives typically come in sizes such as 16 or 20 TB. If you were, for example, to buy four of these drives and use them in a RAID-Z2 layout (which can handle two failures), you would have 40TB of usable disk space, and your file system would be less than 1% full. Even buying small inexpensive disks (I think the sweet spot for new, not used, disks may be 4TB drives), capacity is not a problem for the foreseeable future. So you don't need a huge number of drives, and you only need many because you want redundancy.




gpw928 said:


> For the price of an LTO tape drive you could buy a bunch of 500 GB USB disks for off-site rotation.


True.



Dave-D said:


> I'm planning on building a backup routine using:
> External HDs in quick-swap drive enclosures
> Internal SCSI tape drive
> Cloud (encrypted)


External disks in a professional setting? Risky. Now you're assuming that there are regular visits to the site, you're relying on disks that are transported and thrown around, you are relying on cables that are plugged and unplugged. I know it can be done, but I would try to avoid it. The good news about this approach is that capacity is really cheap: Put a 20TB drive into an external enclosure, using a good interface (USB-3 or eSATA), and for less than $1000 you have an enormous amount of backup capacity.

Tapes? That's all the downsides of external disks, and then some. Again you need site visits. The reliability of tape drives is nasty: on paper they look great, but they have the habit of failing in the real world. If I had to rely on tapes, I would (a) use enterprise-grade drives (3480/3490/3590 style, perhaps LTO if you can tolerate risk, definitely not small cartridges), and (b) write redundant tapes. But look at the cost of drives and media: last I looked, a good LTO-8 or -9 drive plus a 20-pack of cartridges brings you to about $5K or $10K. For that, you can get lots of other stuff.

If you have reasonable bandwidth, backing up to the cloud seems like the best option.

Here's an idea for a hybrid: Set up most of your data disks to be 2-fault tolerant (for example 4 disks, and then make big data partitions on them, which you arrange as RAID-Z2). Also put a small backup partition on each drive, and then make a non-redundant backup partition out of them (you get more capacity out of those). Use the backup disk partition for a first level backup, then copy these backups over the network offsite to the cloud. That gives you relatively cheap unlimited capacity in the cloud, rapid access to a local backup (even when the network is slow or down), and disaster recovery: If something destroys the whole server, you still have a (slightly older) offsite backup.
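A sketch of that hybrid flow (the dataset names, paths, and the `rclone` remote name are all assumptions):

```shell
SNAP=$(date +%Y%m%d)

# First level: stream a recursive snapshot to the local backup partition
zfs snapshot -r tank/data@"$SNAP"
zfs send -R tank/data@"$SNAP" > /backup/data-"$SNAP".zfs

# Second level: copy the stream file offsite to cloud storage
rclone copy /backup/data-"$SNAP".zfs offsite:server-backups/
```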



mer said:


> Just my opinions:
> I like the idea of separate OS and Data drives.  Makes upgrading easier.


A fine opinion to have, and I won't disagree with you. Matter of fact, the moment you start partitioning drives and using the partitions in different ZFS pools, the operator needs to think. For example, if you have four disks, each with 3 partitions (one tiny for OS, somewhat redundant; one big for data, highly redundant; and one small for backups), then if one physical disk fails, you have three slightly sick pools. Orchestrating disk replacement is perfectly possible, but it requires multiple commands, and not getting things wrong. Using extra disks just to simplify the system is not a bad idea. A lot of this is a tradeoff: How much training do your operators have? How much will this system be modified in the future? Are you power- or size-constrained?



> Why Mirror boot device?  Protection against the boot device physically failing.  You may have to be physically present to boot from the other one in the pair, but it should come up and let you replace the failed device.


For a professionally managed system, this is a great idea. For a home system, where your users (in my case spouse and child) can handle a multi-day outage, it's less important.

But one warning: If one of the boot drives fails, you will probably have to be physically present, to convince the BIOS to actually boot. Even worse, I've seen SATA drives that fail so thoroughly, they completely disable the motherboard. So in a failure case, you may have to be physically present to pull disks (one at a time), until the system starts breathing again. Not fun when it happens, but sadly it does.

In the following post, I'm going to skip all the capacity calculations, but those are clearly important.


Dave-D said:


> I don't want to risk losing any 2 drives *AS LONG AS* they are not in the same VDEV, sounds scary to me.


Modern disk drives are so large that the probability of an unpredicted and uncorrectable single-sector error is getting to be significant. And the fastest way to lose data is to get the following double fault: one drive dies completely (rubber side up, shiny side down). It happens. No problem, you have redundancy, meaning ZFS will now read a whole disk's worth of capacity from the other drives to put the data onto the spare. Unfortunately, during that giant rebuild/read operation, you get a single-sector error. You only lose one sector (one file), but by the "a spoonful of sewage in a barrel of wine makes it a barrel of sewage" theorem, the customer is now (justifiably) pissed off.

To guard against that, for enterprise-grade professionally managed systems, one should really have a system that can tolerate two faults.



Dave-D said:


> 1.) I've heard about keeping the applications separate from the operating system, but it sounded like there was no clean or simple way to do that short of major surgery by a tech with greater knowledge and ability than what I currently possess.


If by applications you mean "packages and ports": Those go into /usr/local. You could theoretically create separate file systems (or even pools) for that. In practice, that's probably silly, since they are typically quite small (dozens of GB total). In the past, the tradition was to have many separate file systems (for root, /usr, /usr/local, /var, /var/log and so on); these days pretty much the only splitting of file systems that's still commonly done is OS, user data, and backups.



gpw928 said:


> Automated provisioning, and configuration management, are ubiquitous at the big end of town.  But there's a lot to master if you have never been there.  And the real benefits come when you have a large fleet of systems.
> ...
> If you only have a few near-identical systems to deploy, then keeping meticulous records of everything you do (and aiming to script it) may be a satisfactory mechanism.  You start with detailed records of how to install the root.


Completely agree. Automated install with updates and customization is possible, but really hard. For a half dozen system, it will probably not gain you anything, on the contrary, you will waste much time learning the system.

And I completely agree with the "keep a record" system. The way I do this: Whenever I do system administration, I have a separate window open, and I type into a file exactly what I did (the files are in /root/, and named YYYYMMDD.txt). If I type a command, I cut and paste it into there. If I need to explain something, I do that by adding comments. Like that, the resulting file is sort of usable as a script, which means re-doing it (for example on another machine) becomes super easy. It also means that if I lose my OS or want to re-install, I can just work through all these files, and repeat all the required steps.



> The obvious advantage of tape is that, occasionally, you can put one away "for ever".  But that's not quite true, as the media will eventually become unreadable unless it goes through a routine refreshment cycle.


Media is also REALLY expensive today. I just looked: An LTO-9 cartridge is over $140. Sure, it gives you 45TB, but for that, you can buy an extra disk drive, which is probably more practical.

If your data volume is only 300 GB, and the data doesn't change fast, then I suspect the expected backup volume will be small, and easily handled by things more cost-efficient than tape.


----------



## mer (May 14, 2022)

ralphbsz That's what I like the most about this forum (speaking generally).  Sharing ideas about how to do something.  Everyone "knows" their way is the "best" but listening to others expands the knowledge base.  Maybe my next system I use your ideas because they are a better solution.


----------



## Phishfry (May 14, 2022)

Can I add one point here. I think when configuring and planning your ZFS Pool it is very important to think about the consumers.

How fast is your network? You might expect your ZFS pool to come near the saturation point of your network in terms of throughput, so plan your RAID-Z vdevs according to the output speed you desire.

Many people come here saying my ZFS pool is so slow. That was why I mentioned NVMe and ZIL.
It can make hard disk pools faster. Enterprise class drives are desirable.
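As a back-of-envelope check (my arithmetic, not from the thread), a network's line-rate ceiling in MB/s is its gigabit figure times 1000 divided by 8, ignoring protocol overhead:

```shell
#!/bin/sh
# Line-rate ceilings in MB/s, ignoring protocol overhead
# (real-world throughput lands somewhat below these numbers).
for gbits in 1 10; do
    echo "${gbits}GbE line rate: $((gbits * 1000 / 8)) MB/s"
done
```

A single SATA hard disk streams on the order of 150-250 MB/s, so even a small mirror can saturate 1GbE, while filling 10GbE takes several vdevs (or SSDs).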


----------



## Phishfry (May 14, 2022)

ralphbsz said:


> But one warning: If one of the boot drives fails, you will probably have to be physically present, to convince the BIOS to actually boot.


This is handled by using UEFI: it presents the gmirror as a boot option, so it can boot from a single drive or the full mirror.
I have tested this personally.
When you install FreeBSD, it seems to write an entry for the boot drive into the UEFI BIOS (maybe it uses a UUID).
So instead of a drive label, the EFI entry looks like this: UEFI Disk
I assume `efibootmgr` is at play.









Install UEFI FreeBSD on gmirror (forums.freebsd.org):

> I wanted to post instruction for installing FreeBSD on a GEOM_MIRROR -aka- gmirror(8). This is an advanced topic so I assume you are capable of determining that your two chosen disks are empty. UFS RAID1 on FreeBSD is enabled with the geom_mirror module. I am using ada0 and ada1 as examples...
				




I do agree you will need to pull the dead drive to boot. The SATA Controller is not going to like a dead drive at boot.
That part is hard to test without a dead drive.


----------



## Dave-D (May 14, 2022)

mer said:


> ralphbsz That's what I like the most about this forum (speaking generally).  Sharing ideas about how to do something.  Everyone "knows" their way is the "best" but listening to others expands the knowledge base.  Maybe my next system I use your ideas because they are a better solution.



Exactly.

"Iron sharpens iron."
.
.


----------



## Dave-D (May 14, 2022)

ralphbsz said:


> On the other hand: With 6 ports on the motherboard, I really don't think that more disk drive ports will be needed. In particular with modern disk capacity.



USEFUL TOOL: RAID Capacity Calculator (wintelguy.com):

> RAID Capacity Calculator - evaluates capacity of different RAID types and configurations
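The core arithmetic behind such calculators is simple enough to sketch; the 1000 GB drive size here is a hypothetical example, not from the thread:

```shell
#!/bin/sh
# Usable-capacity rules of thumb:
#   mirror (any width): one drive's worth
#   RAID-Z1 of n disks: (n-1) drives' worth; RAID-Z2: (n-2); RAID-Z3: (n-3)
DRIVE_GB=1000
N=6
echo "2-way mirror:  ${DRIVE_GB} GB usable"
echo "RAID-Z2 of ${N}: $(( (N - 2) * DRIVE_GB )) GB usable"
```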


----------



## ralphbsz (May 15, 2022)

Let's use a simple starting point, then people can make holes in it.

Drives 1 and 2: Make them into a pool using mirroring. That pool has ~2TB capacity, and can handle failure of any 1 drive. Use them as the boot disks, and use that pool for two purposes: first, as the root pool, where you install the OS; second, since you will have lots of free space in there, as space for (temporary) backups. To make sure that backups don't fill the root file system, write the backups as a non-root user, and use quotas to limit how much space backups can consume.

Drives 3 through 6: Make them into a data pool, using RAID-Z2. That pool has ~4TB capacity, and can handle two disk failures. Use that for the~300GB of user data.

Will performance be adequate? Probably. I have no idea what your performance needs are, but most people's needs are amazingly low.

One thing you really need to do: Monitor disk health. Run smartd, and look at the output. Ideally set up an e-mail system that warns you if any disk has unusual metrics in there. Set up regular ZFS scrubbing. Again, set up e-mails that warn you if zpools are not in perfect shape, or scrubbing finds problems.
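Expressed as commands, that starting point might look like the sketch below. Everything here is an assumption for illustration: the device names (ada0-ada5), pool names (zroot, tank), quota size, and the backup user. In practice the mirrored root pool comes from the installer's guided ZFS setup rather than a hand-typed zpool create:

```shell
# Drives 1-2: mirrored root pool (normally built by the installer):
#   zpool create zroot mirror ada0p3 ada1p3
# Drives 3-6: double-parity data pool, ~2 drives' worth of usable space:
zpool create tank raidz2 ada2 ada3 ada4 ada5
# Backup area on the root pool, fenced in by a quota and written by a
# non-root user so it cannot fill the root file system:
zfs create -o mountpoint=/backup zroot/backup
zfs set quota=400G zroot/backup
chown backupuser /backup   # 'backupuser' is a hypothetical account
```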


----------



## mer (May 15, 2022)

Assuming ZFS throughout.
The mirror:  
Create a dataset for the local backups to further separate root/OS from the backed up data.  It also would let you zfs send/receive to offline storage.

Disk health.  Plenty of good knobs to turn in /etc/periodic.conf to run things automatically.  You can also have periodic output go to files instead of emails, but you must remember to look at them.
Scrubbing:  regular interval is good.  It can help catch issues before they become a big problem.  A "rule of thumb" is every 3 months or so, but a lot depends on the quality of the drives and the rest of the system.  Some say that consumer grade you want to run every month, high end enterprise maybe every 6 months.
Scrubs run over the amount of data, so if there's nothing on the disks, they complete quickly; if they are almost full, they take a while. That matters because if a scrub is running during "real work", performance can drop.
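For example, a fragment along these lines (knob names as found in stock /etc/defaults/periodic.conf and the sysutils/smartmontools port; verify them on your release):

```shell
# /etc/periodic.conf -- include 'zpool status' in the daily periodic
# mail, and scrub pools automatically on a fixed interval:
daily_status_zfs_enable="YES"
daily_scrub_zfs_enable="YES"
daily_scrub_zfs_default_threshold="35"   # days between scrubs per pool

# /etc/rc.conf -- SMART monitoring daemon from sysutils/smartmontools:
smartd_enable="YES"
```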


----------



## Dave-D (May 16, 2022)

ralphbsz said:


> Let's use a simple starting point, then people can make holes in it.



Agreed. However, I propose that we start with solving the worst case scenario,
and then move on to find the best way to prevent that scenario.

*Worst Case Scenario*

Total drive failure.
System totally down.
Must get back up within hours.
No time to rebuild everything.

The main reason for all these schemes is to prevent this from happening in the first place.
But let's say it does happen.
For whatever reason ----> All my drives go away.

*Now what?*

*1. First - restore the system (Cold Metal Restore)*

I'm still wondering about cold metal restore with FreeBSD and still haven't heard any good and solid answers.

Formerly, I used Clonezilla to make an image of the (linux or windows) hard drive and stored that image on an external hard drive.
I could restore that image (either the whole drive or any partition) to a hard drive (only had to be same size or larger than the original hard drive.)
Could be up and running again within 15 minutes. Clonezilla does UFS, but it will not do ZFS.

What can be done with ZFS to completely restore the whole operating system and everything on that drive?
Let's say I need to be up and running again in one to three hours.
.
*2. Second - restore the data*

Needs little discussion at this point. Use whatever backups you have to restore data.
.


----------



## ralphbsz (May 16, 2022)

mer said:


> Assuming ZFS throughout.
> The mirror:
> Create a dataset for the local backups to further separate root/OS from the backed up data.  It also would let you zfs send/receive to offline storage.


Good idea.



Dave-D said:


> Agreed. However, I propose that we start with solving the worst case scenario,
> and then move on to find the best way to prevent that scenario.



To begin with: with RAID, that worst case scenario becomes astronomically unlikely. The likely failure scenario is that one disk fails (either completely, or has a single error, or develops many errors), and the other disks serve the data. You need to have a plan for noticing when this happens, identifying the failed disk, and replacing it. ZFS will then handle rebuilding data onto the new disk. The overall reliability of the system depends crucially on how quickly you execute the plan, since the risk of total failure comes from a second (or third) disk failing before you had time to repair the first failure.

The worst case scenario is complete loss of a pool, because its fault tolerance has been overwhelmed by too many disks failing before you had time to repair the first failure.

In the case of the system pool, you are proposing to use a cold metal restore from an image copy. In a nutshell, what you are doing there is having one extra (external) disk in the mirrored RAID group for the system pool, but that extra disk is only updated rarely and manually. This is theoretically possible, and I happen to do the same thing with my system right now (my root disk is non-redundant, using UFS). But this brings up difficult questions. In the worst case, the first thing you have to do is to obtain spare empty disks. How are you going to do this? Ideally, you should have a spare on site. Maybe the spare *IS* the external disk with the copy of the root pool: it is in an external enclosure, updated once in a while, and spends the rest of its time on the shelf?

One of the things you have to plan and test (before releasing the system) is the procedure for recovering from a total disk failure. With ZFS, I do not know what commands would have to be executed. You should run through this once, and take extensive notes on what exactly needs to be done. And then store the notes in such a fashion that you can get to them even when the system is down.

In the case of the data pool, a total failure is much less likely (since we designed it to be 2-fault tolerant). It is so unlikely that other failure modes are now more important, such as user error (rm * for example). The idea of restoring from backups (which is likely to take a long time) is reasonable here. Again, this has to be documented and tested.


----------



## Dave-D (May 16, 2022)

ralphbsz said:


> In the case of the system pool, you are proposing to use a cold metal restore from an image copy. In a nutshell, what you are doing there is having one extra (external) disk in the mirrored RAID group for the system pool, but that extra disk is only updated rarely and manually. This is theoretically possible, and I happen to do the same thing with my system right now (my root disk is non-redundant, using UFS). But this brings up difficult questions. In the worst case, the first thing you have to do is to obtain spare empty disks. How are you going to do this? Ideally, you should have a spare on site. Maybe the spare *IS* the external disk with the copy of the root pool: it is in an external enclosure, updated once in a while, and spends the rest of its time on the shelf?



This is what I used to do with linux:
1. Build a new system, keep a document for each step of the process.
2. Track everything in a spreadsheet, have the step numbers, which steps were completed on the initial build.
3. Insert steps for cloning, and note which completed steps that clone included.
4. Clone when system completed, again, keeping notes.

If that computer failed, I could bring back the server in less than 20 minutes. Never worried about the state of log files, but it worked well, for years.
We're talking low-budget small-business linux servers.

I could also bring back a clone of the data drive, mainly for the folder structure, then bring the data in by tape or a backup stored on external hard drive.

I would also use my clone and restore scheme for building similar client machines.
Bring back a clone of a "template" client, create a record in my spreadsheet, then add/remove a minimum of software on the client,
updating my spreadsheet as necessary, then again, clone at the end of that computer's build.
.
.


----------



## Jose (May 16, 2022)

Dave-D said:


> Agreed. However, I propose that we start with solving the worst case scenario,
> and then move on to find the best way to prevent that scenario.
> 
> *Worst Case Scenario*
> ...


This is why I used a GEOM mirror for my boot drives. The two drives are completely identical down to the boot sector. I use BIOS booting, and I have them set up as primary and secondary boot devices (doesn't matter which is which, they're identical.) Should the current boot drive fail completely, the system will automatically boot from the second drive.

Handling the case where the current boot drive has errors is more complicated. You'll have to detect the errors (look at ralphbsz's excellent suggestions regarding SMART monitoring), and manually swap to the backup drive, probably by turning the system off and physically replacing the bad drive.

I didn't know much about ZFS when I came up with this setup. It's likely you can accomplish something similar with a RAID1 zpool and UEFI booting.


Dave-D said:


> For whatever reason ----> All my drives go away.


Sure; fire, earthquake, etc.


Dave-D said:


> *Now what?
> 
> 1. First - restore the system (Cold Metal Restore)*
> 
> ...


You're looking for ZFS snapshots if you're using a root zpool. Personally I believe setting up the system should be scriptable. There should not be any changes, besides configuration files, from a base FreeBSD install + whatever custom packages I've built. No, I haven't written this script yet.
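A sketch of that snapshot route, with illustrative names (`zroot` is the installer's default root pool name; the external path is hypothetical):

```shell
# Recursive snapshot of the whole root pool...
zfs snapshot -r zroot@baseline
# ...streamed, with all child datasets and properties, to external media:
zfs send -R zroot@baseline > /mnt/external/zroot-baseline.zfs
# Later, restore by receiving into a freshly created pool:
#   zfs receive -F newpool < /mnt/external/zroot-baseline.zfs
```

Note that a true cold-metal restore also needs the partitioning and boot blocks recreated first; the snapshot stream only carries the datasets.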


Dave-D said:


> *2. Second - restore the data*
> 
> Needs little discussion at this point. Use whatever backups you have to restore data.


You'll need off-site backups. I'm interested in Tarsnap.


----------



## Dave-D (May 16, 2022)

ralphbsz said:


> In the case of the data pool, a total failure is much less likely (since we designed it to be 2-fault tolerant). It is so unlikely that other failure modes are now more important, such as user error (rm * for example). The idea of restoring from backups (which is likely to take a long time) is reasonable here. Again, this has to be documented and tested.



Agree completely. Needs to be figured out, tested and documented.
I'm currently doing test builds, so now is a good time for me to do that, if possible.
I'm willing to share info. so all can benefit, and maybe take these ideas further.

.
.


----------



## Dave-D (May 16, 2022)

Jose said:


> You'll need off-site backups. I'm interested in Tarsnap.



I just bought a book about Tarsnap.
"Tarsnap Mastery: Online Backups for the Truly Paranoid" 
by Michael W. Lucas
Pretty sure that was from Ebay for around $20 new.


----------



## Jose (May 16, 2022)

Dave-D said:


> I just bought a book about Tarsnap.
> "Tarsnap Mastery: Online Backups for the Truly Paranoid"
> by Michael W. Lucas
> Pretty sure that was from Ebay for around $20 new.


Buy them direct from Lucas and disintermediate.

Edit: Coz it's hard to convey tone over text. I bought the only Lucas book I own through Amazon. I didn't know better. All I can say in my defense is that I followed an affiliate link from Freshports, 'cause I find that site so useful so often.


----------



## Erichans (May 16, 2022)

ralphbsz said:


> [...] You need to have a plan for noticing when this happens, identifying the failed disk, and replacing it. ZFS will then handle rebuilding data onto the new disk. The overall reliability of the system depends crucially on how quickly you execute the plan, since the risk of total failure comes from a second (or third) disk failing before you had time to repair the first failure.


When a disk is starting to fail but is not completely "unresponsive", it can be very useful to have both the failing disk and the replacement disk in the system during resilvering. In general resilvering is a time-consuming and stressful activity.* Keeping the to-be-replaced disk in the system together with the new disk can speed up the resilvering process, and offers better overall IO performance of the pool during resilvering, versus disconnecting the to-be-replaced disk and exchanging it for the new disk before resilvering. See also: Replacing a failing drive in a ZFS zpool

To accommodate this replacement procedure the ideal situation would be that you have an extra physical disk location (tray/bay etc.) available that is also equipped with an appropriate interface.

___
* for the disks in the pool, and perhaps also for the sysadmin
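In ZFS terms, this is what zpool replace does when the old disk is left attached (pool and device names are illustrative):

```shell
# With both the ailing disk and its successor connected, the resilver can
# read from every surviving source, including the old disk itself:
zpool replace tank da3 da7
zpool status tank   # watch resilver progress
# When the resilver completes, ZFS detaches the old device on its own.
```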


----------



## ralphbsz (May 16, 2022)

Jose said:


> This is why I used a GEOM mirror for my boot drives. ..
> I didn't know much about ZFS when I came up with this setup. It's likely you can accomplish something similar with a RAID1 zpool and UEFI booting.


You can definitely do it with ZFS. A friend of mine does. But: I don't know how to do it (since I'm still running non-redundant UFS for my boot drive at home).

Agree with the suggestion that a failing drive may be helpful during resilvering. But we have to underline MAY be; it is not guaranteed. Good example: a drive on which 99.9% of all the IOs succeed, and the remaining 0.1% fail cleanly with fast error returns: very helpful. Bad example: a drive on which 90% of the IOs succeed, but the remaining 10% take a 5-minute timeout, and then cause a SATA error so severe it crashes the motherboard and requires a reboot. The second drive is not helpful.


----------



## Dave-D (May 17, 2022)

Erichans said:


> it can be very useful to have both the failing disk and the replacement disk in the system during resilvering. In general resilvering is a time-consuming and stressful activity.



I read that you can lose another disk during this stressful process.

So let's say we have RAIDz2 - we can lose 2 disks. But it takes a minimum of 4 disks.
We just lost one, or it's nearly dead. So we can afford to lose 1 more.
We start the resilver process, and now 1 more dies.
Now we're down to nothing.
Or, what if the admin makes a mistake? We're down to nothing.
You almost need a RAIDz3 for peace of mind. So now I'm talking 5 disks, minimum.

Unless you go with mirrors. They are much easier (less stressful on the system) to resilver.
With a 4-way mirror, you can lose 3 disks.

So far I've gathered the following:

RAIDz
(somehow) better for data integrity (not sure how / or if true)
harder on the system to resilver
main benefit is space, while mirror is performance (at the expense of space)
lower performance than mirrors
higher system resource use than mirrors
cannot add to an existing RAIDz.

MIRRORS
easier on the system to resilver
main benefit is performance, while RAIDz is space (at the expense of performance.)
higher performance than RAIDz
lower system resource use than RAIDz
*can* add to an existing mirror.
*can* remove drives from an existing mirror.
*can* expand space in an existing mirror.
Overall more flexible when expanding a system.
And another one - if I lose everything but 1 drive, the whole system is on that 1 drive, which I assume I can access as 1 drive (?).
.
.
I think RAIDZx has a lot more of the "cool" factor.
But plain old mirrors may have a lot to offer for the small business type of server.
I can do a LOT of business inside 2TB  worth of space.
.
.
my 2cents worth...
.
.


----------



## Dave-D (May 17, 2022)

Jose said:


> This is why I used a GEOM mirror for my boot drives.



I'm assuming this is the same basic idea as mirrors on ZFS?

See my last post, mirrors are starting to look like a lot better option for my use case.

I'm thinking:

1 x 3-disk mirror for O/S + LOCAL_BACKUP = 2TB total space
1 x 3-disk mirror for DATA = 2TB total space

-OR-

1 x 2-disk mirror for O/S + LOCAL_BACKUP = 2TB total space
1 x 4-disk mirror for DATA = 2TB total space

THEN

In the future, if I need more space, I can add larger drives to an existing mirror; when I've replaced all drives in that mirror, the mirror will expand to the size of the larger drives (automatically if the pool's autoexpand property is on).
Flexible. Simple. Easy. Nice.

Not so with RAIDz. Other than adding another pool.
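The grow-in-place procedure above, sketched with illustrative device names. One caveat: the pool only grows by itself if its autoexpand property is on (otherwise a manual `zpool online -e` is needed):

```shell
zpool set autoexpand=on tank
# Grow a 2-disk mirror by swapping in larger drives one at a time:
zpool attach tank ada2 ada6    # attach new, larger disk; wait for resilver
zpool detach tank ada2         # then retire the old, smaller disk
# Repeat for the other member; capacity expands once all are replaced.
```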
.
.


----------



## cy@ (May 17, 2022)

Dave-D said:


> Is there no problem with running UFS on operating system drive, then ZFS on the other 5 drives?
> 
> Dave


I do this on all my systems. Not that I planned it that way but I installed the first one about 25 years ago and the rest were dump | restore clones of the first one. The ZFS partitions simply evolved over time. Long story short, you will have no problems.

However the UFS buffer cache and the ZFS ARC will compete for RAM. But since the system slices are rarely referenced (in my case) the UFS buffer cache remains small. The only time I see it grow, causing the ZFS ARC to shrink, is during installworld/installkernel.

Typically this is not what people tend to do but keeping the O/S on UFS does allow a person to clone the system using dump piped to restore (dump | ssh | restore) to another server (booted off ISO or my rescue drive) or simply clone my rescue drive to a machine, change a few settings in rc.conf and fstab and then boot. One may be able to do this with zfs send/receive but I haven't needed to try that.

As to how I chose to clone using dump piped to restore, I used to do this with Solaris UFS and Tru64 UFS back in the day. I also booted Solaris off UFS, using ZFS for data and booted Tru64 from its UFS while using AdvFS for data. It's not a new concept, just something I've done all my career.


----------



## Dave-D (May 17, 2022)

cy@ said:


> I do this on all my systems. Not that I planned it that way but I installed the first one about 25 years ago and the rest were dump | restore clones of the first one. The ZFS partitions simply evolved over time. Long story short, you will have no problems.
> 
> However the UFS buffer cache and the ZFS ARC will compete for RAM. But since the system slices are rarely referenced (in my case) the UFS buffer cache remains small. The only time I see it grow, causing the ZFS ARC to shrink, is during installworld/installkernel.
> 
> ...



Are you aware of clonezilla? I've used it for years with linux & windoz. 
Might be a lot easier than what you're doing??
It supports the UFS file system, but not ZFS.
clonezilla.org


----------



## gpw928 (May 17, 2022)

Never underestimate the danger of replacing the wrong drive when one (or more) spindles fail in a RAID set.

I have seen it happen.  Happily I was just a spectator.

I'm paranoid in dealing with this situation.  It's why the IBM procedures for RAID maintenance walk the engineer through a process that eventually lights a bulb on the broken drive.  However your drives probably won't have lights, and if you have multiple sites, you may have to rely on hired help.

So you need well defined procedures to defend against bad outcomes.  GPT labels that encode disk location and serial number are usually part of the defense.  This is done at system build time. 

On the matter of using a UFS root, I too did that for ages because at the beginning ZFS had no boot option.  Back then I actually had space allocated on the root mirror for two completely separate bootable root file systems.  And I used them for upgrades because I had to have a fallback if something went wrong.

I have since switched to ZFS root.  Boot environments were the reason, and they are great. You just have a different set of procedures to test and document in order to build your systems and recover from problems.

You don't need clonezilla if you know what you are doing.  You do need to understand some first principles.  The process to put a bootable ZFS root on a naked disk is well understood (and I'm happy to send you a well tested script).


----------



## Dave-D (May 17, 2022)

gpw928 said:


> The process to put a bootable ZFS root on a naked disk is well understood (and I'm happy to send you a well tested script).


Yes, please.


gpw928 said:


> Never underestimate the danger of replacing the wrong drive when one (or more) spindles fail in a RAID set.


Exactly what I'm talking about.
I lose 1 drive.
Admin messes up by replacing the wrong drive.
Now we're sitting on pins and needles.
One more issue and its game over.


gpw928 said:


> GPT labels that encode disk location and serial number are usually part of the defense.


Exactly.
Any idea what the character limit is on those labels?
Do you label the GPT partitions or is there a master label for the disk?
Do you use gparted to label the disk or some other method?

THANK YOU for your help!


----------



## Dave-D (May 17, 2022)

gpw928 said:


> Never underestimate the danger of replacing the wrong drive when one (or more) spindles fail in a RAID set.


Would it be impossible to "get into trouble" by pulling the wrong drive while using mirrors?
If I did pull the wrong drive - any single drive would have a complete set of data on it and doesn't "need" any other drive to "reconstruct" the data like RAIDz would?
As long as I had 1 complete and working drive from the mirror, could I not reconstruct the whole mirror from that one drive?

Sorry about peppering you with Q's.

But then, you've got me thinking...

A really appreciate all your help.


----------



## gpw928 (May 17, 2022)

The plan to install a ZFS mirror'd root on a pair of naked disks is here.  It's been used several times, and is configured for "Stage 2".  You probably want to examine and set  ZROOTSRC, ZROOTDST, SWAP, DEV0, and DEV1 appropriately for your needs.  Pay attention to the comments around "zpool get bootfs".  You have to have a running FreeBSD system to execute it.


Dave-D said:


> Any idea what the character limit is on those labels?
> Do you label the GPT partitions or is there a master label for the disk?
> Do you use gparted to label the disk or some other method?


Partition labels are limited to 15 characters.
You have to label partitions, not whole disks.
You use gpart(8) to apply the labels (have a look at the script above).
Root disks need multiple partitions, configured in a variety of ways.
My root disks, created by the script above, are partitioned like this, which is based on the layout the FreeBSD 13.0 installer uses:
```
[sherman.143] $ gpart show ada0
=>       40  781422688  ada0  GPT  (373G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048   33554432     2  freebsd-swap  (16G)
   33556480  180355072     3  freebsd-zfs  (86G)
  213911552   25165824     4  freebsd-zfs  (12G)
  239077376  134217728     5  freebsd-zfs  (64G)
  373295104  408127488     6  freebsd-ufs  (195G)
  781422592        136        - free -  (68K)
```
Boot is not mirror'd, but the boot partition on each disk needs to be identical.
Swap is a GEOM mirror on partition 2:
```
[sherman.149] $ gmirror status
       Name    Status  Components
mirror/swap  COMPLETE  ada0p2 (ACTIVE)
                       ada1p2 (ACTIVE)

[sherman.150] $ grep swap /etc/fstab
#/dev/mirror/swap      none        swap    sw        0    0
# With ".eli" appeded to the swap device, swapon(8) will set up GELI encrypt.
/dev/mirror/swap.eli      none        swap    sw        0    0

[sherman.151] $ swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/mirror/swap.eli  16777212        0 16777212     0%
```
The root is a ZFS mirror on partition 3:
```
[sherman.152] $ zpool status  zroot
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 00:02:52 with 0 errors on Wed Apr 13 14:13:59 2022
config:

    NAME                      STATE     READ WRITE CKSUM
    zroot                     ONLINE       0     0     0
      mirror-0                ONLINE       0     0     0
        gpt/236009L240AGN:p3  ONLINE       0     0     0
        gpt/410008H400VGN:p3  ONLINE       0     0     0
```
Partition 4 is a SLOG (ZFS mirror)  for the tank -- only appropriate if you have "enterprise class" SSDs.
Partition 5 is an L2ARC (ZFS stripe) for the tank.
Partition 6 is unused (over-provisioning for the SSDs).
Data (tank) disks are generally created with one large partition, that partition is labeled, and then used to create the RAID set.
My tank is labeled like this with stack position and serial number encoded in the label:
```
[sherman.155] $ zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 05:06:26 with 0 errors on Thu Apr 14 08:08:14 2022
config:

    NAME                      STATE     READ WRITE CKSUM
    tank                      ONLINE       0     0     0
      mirror-0                ONLINE       0     0     0
        gpt/L1:ZC1564PG       ONLINE       0     0     0
        gpt/L6:WMC1T1408153   ONLINE       0     0     0
      mirror-1                ONLINE       0     0     0
        gpt/L0:ZC135AE5       ONLINE       0     0     0
        gpt/L5:WMC1T2195505   ONLINE       0     0     0
      mirror-2                ONLINE       0     0     0
        gpt/L4:ZC12LHRD       ONLINE       0     0     0
        gpt/L3:WCC4N5CVZ6V4   ONLINE       0     0     0
      mirror-3                ONLINE       0     0     0
        gpt/L2:ZC1AKXQM       ONLINE       0     0     0
        gpt/L7:WE23ZTX9       ONLINE       0     0     0
    logs    
      mirror-4                ONLINE       0     0     0
        gpt/236009L240AGN:p4  ONLINE       0     0     0
        gpt/410008H400VGN:p4  ONLINE       0     0     0
    cache
      gpt/236009L240AGN:p5    ONLINE       0     0     0
      gpt/410008H400VGN:p5    ONLINE       0     0     0
```
This is how I created the tank (it could be improved, but shows what's needed):
```
LABELS="L6:WMC1T1408153
L5:WMC1T2195505
L3:WCC4N5CVZ6V4
L1:ZC1564PG
L0:ZC135AE5
L2:ZC1AKXQM
L4:ZC12LHRD
L7:WE23ZTX9"

n=0
for label in $LABELS
do
    gpart destroy -F /dev/da$n
    gpart create -s gpt /dev/da$n
    gpart add -t freebsd-zfs -l "$label" /dev/da$n
    n=$(($n+1))
done

# I'm pairing these manually, old with new for mirror reliability (not speed :-)
M0="/dev/gpt/L1:ZC1564PG" 
M1="/dev/gpt/L6:WMC1T1408153"
M2="/dev/gpt/L0:ZC135AE5"
M3="/dev/gpt/L5:WMC1T2195505"
M4="/dev/gpt/L4:ZC12LHRD"
M5="/dev/gpt/L3:WCC4N5CVZ6V4"
M6="/dev/gpt/L2:ZC1AKXQM"
M7="/dev/gpt/L7:WE23ZTX9"

# Create the new tank as 4 x 2x3TB mirrors
eval zpool create tank \
    mirror $M0 $M1 \
    mirror $M2 $M3 \
    mirror $M4 $M5 \
    mirror $M6 $M7
zfs set compression=lz4 tank
zpool status
```


----------



## gpw928 (May 17, 2022)

Dave-D said:


> Would it be impossible to "get into trouble" by pulling the wrong drive while using mirrors?
> If I did pull the wrong drive - any single drive would have a complete set of data on it and doesn't "need" any other drive to "reconstruct" the data like RAIDz would?
> As long as I had 1 complete and working drive from the mirror, could I not reconstruct the whole mirror from that one drive?


That would depend on whether your drives are "hot swappable".
If you pull the wrong hot-swap drive from the system, it can get corrupted when you pull it.
I would always choose to shut down a system before pulling a drive, if I had the option (you tend not to have the option in large enterprises).
That way you can compare the serial number on the label to what you expect to see, and get some wiggle room to back track.

[Often, when drives fail, the disk (and its label) will disappear from the status displays. But if all the others can be seen, then the missing one may be deduced.]


----------



## cy@ (May 17, 2022)

Dave-D said:


> Are you aware of clonezilla? I've used it for years with linux & windoz.
> Might be a lot easier than what you're doing??
> It supports the UFS file system, but not ZFS.
> clonezilla.org


I've used clonezilla on Linux at $JOB. But my approach predates clonezilla by about 10-15 years. I conceived it when I switched from MVS (IBM mainframe) to UNIX (Solaris, HP/UX, DG-UX, OSF/1), when patching would keep a server down for 1-3 hours instead of mere minutes. What I did was pretty much what Sun did when they implemented UFS boot environments. I shared the approach with Sun in 1995.

On the mainframe we'd patch the inactive disk, then reboot the inactive disk during a change window. Patching would take many weeks of research, planning, and implementation while the reboot took less than 30 minutes. When I started work on UNIX in 1992 the first thing that came to mind was, how backwards the UNIX patching and system install process was.

Basically it's:

1. Boot from the ISO, slice the disk, and newfs the filesystems.
2. `cd /a && ssh server dump 0f - / | restore xf -`
3. `cd /a/var` and do the same as above for /var.
4. Change /a/etc/rc.conf and /a/etc/hosts with the new IP addresses, and fix up /a/etc/fstab as necessary. Reboot.

As I said, this was developed on Solaris 2.3 at the time (1995), long before clonezilla was a thing. And, works on Tru64 and all BSD variants. If it ain't broke, don't fix it.
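A minimal FreeBSD rendering of that procedure might look like the following. The disk name, GPT label, source host, and the /a mount point are assumptions for illustration, not part of cy@'s exact recipe:

```shell
# Run from a FreeBSD live/install shell on the NEW machine.
# ASSUMPTIONS: target disk is ada0, source host is "server",
# new root gets mounted at /a. Adjust for your hardware.

gpart create -s gpt ada0
gpart add -t freebsd-ufs -l newroot ada0
newfs -U /dev/gpt/newroot

mkdir -p /a
mount /dev/gpt/newroot /a

# Pull a level-0 dump of the source root over ssh, restore it in place.
cd /a && ssh server dump 0f - / | restore xf -

# Give the clone its own identity before the first boot.
vi /a/etc/rc.conf    # hostname, IP addresses
vi /a/etc/hosts
vi /a/etc/fstab      # point / at /dev/gpt/newroot
```

The same pattern repeats for /var (and any other filesystem) before the final reboot.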


----------



## Jose (May 17, 2022)

Dave-D said:


> I'm assuming this is the same basic idea as mirrors on ZFS?


The concepts are the same, but I'm sure the implementations are vastly different.

Clonezilla is superfantastic for migrating Windows drives. I boot from a CD or USB drive, copy the data over to a new drive, and Windows is none the wiser 'cause it wasn't even booted up when I did the clone. I've never felt a need to use it on Unixy machines.


----------



## Dave-D (May 17, 2022)

Jose said:


> Clonezilla is superfantastic for migrating Windows drives. I boot from a CD or USB drive, copy the data over to a new drive, and Windows is none the wiser 'cause it wasn't even booted up when I did the clone. I've never felt a need to use it on Unixy machines.


I'm not tied to it in any way.
Looking forward to learning more about FreeBSD and finding new and better ways to do things.


----------



## mer (May 17, 2022)

Does clonezilla do a bit-for-bit copy from the source to the dest device?  If your source device was, say, 128MB but the dest was 500MB, would booting from the dest look like 128MB?
Of course toss in copying all the "bad" bits too.


----------



## Jose (May 17, 2022)

It's smarter than that. I clone NTFS filesystems to larger drives all the time, and Clonezilla has always done the Right Thing(tm). Ditto with boot drives. It feels like magic.


----------



## mer (May 17, 2022)

Jose, thanks.  I wasn't sure; some cloning tools are less than smart.


----------



## Dave-D (May 17, 2022)

cy@ said:


> I've used clonezilla on Linux at $JOB. But my approach predates clonezilla by about 10-15 years. I conceived it when I switched from MVS (IBM mainframe) to UNIX (Solaris, HP/UX, DG-UX, OSF/1), when patching would keep a server down for 1-3 hours instead of mere minutes. What I did was pretty much what Sun did when they implemented UFS boot environments. I shared the approach with Sun in 1995.
> 
> On the mainframe we'd patch the inactive disk, then reboot the inactive disk during a change window. Patching would take many weeks of research, planning, and implementation while the reboot took less than 30 minutes. When I started work on UNIX in 1992 the first thing that came to mind was, how backwards the UNIX patching and system install process was.



Stories like yours amaze me.

It's an incredible thing to have access to such knowledge and experience.
I am very thankful for the experts on this site who are willing to share their expertise and are patient enough with greenhorns like myself who are starting from nothing and trying to learn.


----------



## ralphbsz (May 18, 2022)

Dave-D said:


> So let's say we have RAIDz2 - we can lose 2 disks. But it takes a min. of 4 disks.
> ...
> Unless you go with mirrors. They are much easier (less stressful on the system) to resilver.
> With a 4-way mirror, you can lose 3 disks.


True. But 4 disks using RAID-Z2 have the capacity of 2 disks worth. A 4-way mirror has the capacity of 1 disk.

In the tradeoff game (more redundancy <-> more capacity), there is no free lunch. With modern disks, and excluding failure modes that introduce correlated failures (more on that below), the sweet spot today is being able to tolerate two faults. In a well-designed system (which ZFS is), in such a situation you will recover back to a single-faulted system pretty quickly after the spare drive is put into service.



gpw928 said:


> Never underestimate the danger of replacing the wrong drive when one (or more) spindles fail in a RAID set.
> ...
> I'm paranoid in dealing with this situation.  It's why the IBM procedures for RAID maintenance walk the engineer through a process that eventually lights a bulb on the broken drive.  However your drives probably won't have lights, and if you have multiple sites, you may have to rely on hired help.


When I worked at IBM, this was actually measured: the #1 source of data loss was ... a field service engineer pulling the wrong disk out. Really, it beat disk failure hands down.

That's why well-designed disk enclosures have a combination of the following: indicator lights that say "you are allowed to remove this disk"; another indicator light that says "please remove this specific disk right now"; battery-backup for those indicator lights so even if the field engineer cuts power, they remain on for about an hour; solenoids that lock the disks in place so the good disks can not be removed without cutting power to the system; and finally a loud alarm beeper that sounds if you use the remove handle on a disk that is not supposed to be removed. This is how you build systems that get good reliability in the real world.



> On the matter of using a UFS root, I too did that for ages because at the beginning ZFS had no boot option.  Back then I actually had space allocated on the root mirror for two completely separate bootable root file systems.  And I used them for upgrades because I had to have a fallback if something went wrong.


Again, having two independent copies of the boot environment is vital for real-world reliability. If your computer costs many M$, and there is risk that an upgrade might break something, you keep the current configuration on one disk, and only upgrade the second copy.



Dave-D said:


> Would it be impossible to "get into trouble" by pulling the wrong drive while using mirrors?


Absolutely. 


> If I did pull the wrong drive - any single drive would have a complete set of data on it and doesn't "need" any other drive to "reconstruct" the data like RAIDz would?
> As long as I had 1 complete and working drive from the mirror, could I not reconstruct the whole mirror from that one drive?


Not if there are writes occurring while you are pulling drives. You can easily end up with a situation where every bit of data is on at least one drive, but no single drive has a complete copy of all data.

Another comment: This whole discussion ignores the "dead moose on the table". We're all talking about data loss that's caused by disk failures (or cable or connector failures), and how to address that using RAID. It ignores that most data loss is caused by users. The joke example is "rm -Rf /", but much more common is "I overwrote that file". Really good backups are in reality more important than RAID.


----------



## Dave-D (May 18, 2022)

ralphbsz said:


> Not if there are writes occurring while you are pulling drives. You can easily end up with a situation where every bit of data is on at least one drive, but no single drive has a complete copy of all data.



So here's where I stand:

I've decided to get an LSI card so I can run 8 drives, the limit of my hot-swap cage.
That gives me a little more to work with.
Not sure what the lights indicate on the hot-swap cages, may just be on/off for disk activity, not yet sure.
This is a small-budget operation. Not a million$+ situation.

If I have to pull a drive, this is a small business, so I would *always* shut the server down before pulling any drive.

So, my big question now, is:  Mirrors, or RAIDz, or a combination of both?
I'm thinking mirrors sound a lot easier for a greenhorn and safer to work with, also more flexible so I can adapt to future needs, as I learn.
I'm thinking RAIDz scares me on some level. Mostly because it has to "reassemble" the data to get a complete set. Pulling wrong drive,
being locked into your RAIDz setup & not able to change it without starting over, etc.

Questions... Questions...
.
.


----------



## ralphbsz (May 18, 2022)

Dave-D said:


> I've decided to get a LSI card so I can run 8 drives, the limit of my hot-swap cage.


I like the 8-drive hot-swap cage. That will make working with the system easier. For example, if you have a disk that gets "sick" (not dead, but unwell and needing to be replaced soon), you can get a spare disk, add it to the ZFS pool, mark the sick disk as needing to be drained of data (re-replicated onto the spare disk), and when the drain is finished, remove the sick one and throw it away. Having more physical slots than disks allows you to add spare disks temporarily, without doing perverse things with extension cables and haphazardly mounted disks.
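A sketch of that drain-and-replace flow in ZFS commands. The pool name and device names here are hypothetical:

```shell
# ASSUMPTIONS: pool "tank", sick disk da3, spare disk da7
# already inserted into an empty hot-swap slot.

zpool status tank           # confirm which device is unhealthy
zpool replace tank da3 da7  # re-replicate da3's data onto da7 (the "drain")
zpool status tank           # watch the resilver progress

# When the resilver completes, da3 is detached from the pool
# automatically and can be physically removed and discarded.
```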

On the other side: while I love the LSI cards, I wonder whether getting one is a little bit of overkill. You already have 6 perfectly good SATA ports on the motherboard. You are probably not planning to run SAS (SCSI) disks anyway, because they are harder to find and sometimes more expensive. It might be cheaper and easier to just get a 2-port SATA card instead; together with the motherboard ports, that gives you 8 for your hot-swap cage.



> Not sure what the lights indicate on the hot-swap cages, may just be on/off for disk activity, not yet sure.


Typically disk enclosures (including hot-swap cages) have one indicator light, often green, that shows power to the disk being on, and/or disk activity. In some cases, those are separate (one power light, which indicates that a disk is present and is getting power, and one activity light). That is the "blinking" light. Those one or two lights are controlled purely by the enclosure and the disk drive itself.

All other indicator lights (and crazy features like disk locking solenoids, beepers) are controlled by the computer, usually using an interface called SES (SCSI Enclosure Services). This is not a science, but somewhere between voodoo and magic. There are two problems you need to solve here: (a) If a disk is called /dev/ada5 or has serial number Hitachi 12345 or has WWN 5000cca228xxx, what physical slot of the enclosure is it in? (b) How do I turn the red/yellow/blue/... light for that slot on and off? Doing this reliably is not easy.
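On FreeBSD, the base-system utility sesutil(8) is the usual way to talk to SES; a quick sketch of attacking problems (a) and (b), assuming your enclosure actually wires up its locate LEDs:

```shell
# (a) Map device names to physical enclosure slots.
sesutil map       # verbose: every enclosure element and its device
sesutil show      # brief table: slot, device name, model, serial

# (b) Blink the locate LED for the slot holding da5, then turn it off.
sesutil locate da5 on
sesutil locate da5 off
```

Whether the lights respond depends entirely on how well the enclosure implements SES, which is exactly the voodoo described above.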



> If I have to pull a drive, this is a small business, so I would *always* shut the server down before pulling any drive.


Good. That automatically removes many possible failure modes. Here is one piece of advice: use good human-readable labels on your disk partitions (using the gpart command). For example, my little server at home has two internal spinning disks, and if you do "gpart show -l /dev/adaX", the partition label is "hd14_home". That means it is the disk named hd14 (meaning it is a Hitachi and I bought it in 2014), and on that physical disk it is the home partition. If you open the server, you will see two physical disks, and one has a big paper label attached which says HD14 (the other one is HD16). That means that if I need to remove or replace HD14, it's pretty obvious which disk it is.
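Setting such a label takes one gpart command; the disk name and partition index below are assumptions for illustration:

```shell
# ASSUMPTIONS: the home partition is index 2 on disk ada1.

# Attach a human-readable GPT label to the partition.
gpart modify -i 2 -l hd14_home ada1

# Verify; the label also shows up as a stable device node.
gpart show -l ada1
ls /dev/gpt/         # should now include hd14_home
```

Using /dev/gpt/hd14_home in the pool (instead of ada1p2) means the name survives controller or cabling changes that renumber the adaX devices.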



> So, my big question now, is:  Mirrors, or RAIDz, or a combination of both?
> I'm thinking mirrors sound a lot easier for a greenhorn and safer to work with, also more flexible so I can adapt to future needs, as I learn.
> I'm thinking RAIDz scares me on some level. Mostly because it has to "reassemble" the data to get a complete set. Pulling wrong drive,
> being locked into your RAIDz setup & not able to change it without starting over, etc.


From the sysadmin point of view, RAID-Zx and mirrors work nearly the same. The "reassemble" problem you refer to is all internal to the ZFS implementation. If you pull the wrong drive, you're screwed in either case. Really the only two advantages of mirroring are: you get more redundancy and therefore reliability (a 4-way mirror can lose 3 drives, while a 4-disk RAID-Z2 can only lose 2), and better read/write performance (which you may not care about). The cost of 4-way mirroring is a loss of capacity, in this case by a factor of two (which you also may not care about).



> Questions... Questions...


The question of how to use RAID efficiently is a super complex question.


----------



## Dave-D (May 18, 2022)

ralphbsz said:


> On the other side: While I love the LSI cards, I wonder whether getting that is a little bit overkill. You already have 6 perfectly good SATA ports on the motherboard. You are probably not planning to run SAS (SCSI) disks anyway, because they are harder to find and sometimes more expensive. It might be cheaper to just get a 2-port SATA card instead, that gives you 8 ports for your hot swap cage, and may be cheaper and easier.



I can get an LSI 9210-8i *new* on ebay for between $49.88 and $59.99, with (2) SAS cables, each breaking out into 4 SATA connectors at the other end = 8 SATA connectors.
Can also set the hardware RAID to "passthrough" so it won't interfere with whatever RAIDz setup I'm running.
Might be hard to pass up for $50.00.
Specs said you can get an "expander" for that card so it supports up to 256 drives (?)
If I get really crazy my case will take 1 more hot-swap cage for a total of 12 hot-swap drives. Worst-case I can just add another $50 LSI card to handle the additional 4 drives, or run them off my motherboard.

Example: LSI 9210-8i 6Gbps SAS HBA FW:P20 9211-8i IT Mode ZFS FreeNAS unRAID 2* SFF SATA (www.ebay.com)

I ordered a couple 4-port SATA cards to play around with, *new* on ebay for about $20.00 ea.

All my current 2TB drives are new Western Digital Enterprise class.
Do you know anything about the "WL" brand of hard drives?
They're said to be made as white label drives then rebranded and sold by many OEM's, etc.
Supposed to be really good enterprise-class drives. Haven't done any research yet.

Example: WL 2TB 64MB Cache 7200RPM Enterprise SATA 6Gb/s 3.5" Hard Drive (www.ebay.com)


Notice the seller specializes in selling drives, has sold 6,129 of this one drive model, and overall feedback is still 100%.
Assuming all the numbers are legit, if the drives weren't any good then someone would be squawking about it.
In fact, someone would be really ticked off and likely throwing a real hissy-fit.

I've also heard that its a good idea to mix in different hard drive brands.
And someone said (in this thread) they mix old and new for reliability.
Of course with my small servers, I'm thinking maybe all new, with two different brands.

Thoughts?


----------



## Dave-D (May 18, 2022)

ralphbsz said:


> For example, if you have a disk that gets "sick" (not dead, but unwell and needing to be replaced soon), then you can get a spare disk, add it to the ZFS pool, mark the sick disk as needing to be drained of data (re-replicated onto the spare disk), and when the drain is finished, remove the sick one and throw it away. Having more physical slots than disk allow you to add spare disks temporarily, without doing perverse things with extension cables and disks mounted hap-hazardly.



Could I do this using an external hard drive in an external drive enclosure attached to a USB port?

This one sounds really interesting.  Do you have the steps listed somewhere that you're willing to share, so I could experiment?
.


----------



## gpw928 (May 18, 2022)

Dave-D said:


> I can get an LSI 9210-8i *new* on ebay for between $49.88 and $59.99, with the (2) SAS cables which look like 4 SATA connectors at the other end of each cable = 8 SATA connectors.


Verify the source.  Is it coming from China?  There's a lot of counterfeits (some of which actually work OK, but YMMV).


----------



## gpw928 (May 18, 2022)

ralphbsz said:


> From the sys admin point of view, RAID-Zx and mirrors work nearly the same.


Agreed, but...  RAID-Z is always slower.  And striped mirrors are always much faster...

Your point about spare hot-swap slots is a _*really*_ good one. It really does help with risk mitigation.  My internal "cold swap" stack is augmented by a small (3-spindle) (normally empty) hot swap cage that facilitates rotation of 12TB off-site backup disks, and RAID re-silvering *prior* to any shutdown to remove "problem" drives (so the RAID set can then survive removal of the wrong drive).  [The ideas behind this approach were seeded by discussions on this list.]


----------



## Phishfry (May 18, 2022)

Dave-D said:


> Supposed to be really good enterprise-class drives.


No. If they were, they would have a 5-year warranty. They are white label.
Maybe they last a year, maybe longer.
No one in an enterprise would risk their job on cheap drives. "Enterprise-class" here is just marketing terminology.


Dave-D said:


> I'm thinking maybe all new, with two different brands.


Not needed. Just buy quality drives.
Look at backblaze stats.


----------



## Dave-D (May 19, 2022)

Phishfry said:


> Not needed. Just buy quality drives.
> Look at backblaze stats.



THANK YOU. Huge help! 
"You can't argue with success."
.


----------



## Phishfry (May 19, 2022)

Success ain't cheap. I understand the temptation of 40 dollar drives.


----------



## Dave-D (May 19, 2022)

gpw928 said:


> Your point about spare hot-swap slots is a _*really*_ good one. It really does help with risk mitigation. My internal "cold swap" stack is augmented by a small (3-spindle) (normally empty) hot swap cage that facilitates rotation of 12TB off-site backup disks, and RAID re-silvering *prior* to any shutdown to remove "problem" drives (so the RAID set can then survive removal of the wrong drive). [The ideas behind this approach were seeded by discussions on this list.]



How many Hot-swap slots would I need (of the 8 available) to pull this off, *if* its even possible with my small rig?
Could I use an external hard drive or two, in an external drive enclosure, attached w/usb cable?

Would the *EMPTY* slots below be enough to pull it off?

I'm starting to think along these lines for my final setup:
Rack server case has maximum of 3 hot swap cages @ 4 drive bays/cage = 12 hot swap drive bays:

*CURRENT SETUP*
Slot 1 - WD 4TB - mirror #1 - OS + LOCAL_BACKUP
Slot 2 - WD 4TB - mirror #1 - OS + LOCAL_BACKUP
Slot 3 - WD 4TB - mirror #1 - OS + LOCAL_BACKUP
Slot 4 - *EMPTY*

Slot 5 - WD 4TB - mirror #2 - DATA
Slot 6 - WD 4TB - mirror #2 - DATA
Slot 7 - WD 4TB - mirror #2 - DATA
Slot 8 - *EMPTY*

*FUTURE EXPANSION:*
Slot 9 - WD 4TB - mirror #3 - DATA --> [Pool with mirror #2]
Slot 10 - WD 4TB - mirror #3 - DATA --> [Pool with mirror #2]
Slot 11- WD 4TB - mirror #3 - DATA --> [Pool with mirror #2]
Slot 12 - *EMPTY*


----------



## Dave-D (May 19, 2022)

Phishfry said:


> Success ain't cheap. I understand the temptation of 40 dollar drives.


Especially with the current bombardment of price increases.


----------



## ralphbsz (May 20, 2022)

Dave-D said:


> Specs said you can get an "expander" for that card so it supports up to 256 drives (?)


SAS expanders are not typically something that one buys separately; they are usually built into larger disk cages or enclosures. In some cases, large disk enclosures will have multiple levels of expanders (if you want to pack 100 disks into an enclosure, a single expander chip won't do).



> Do you know anything about the "WL" brand of hard drives?
> They're said to be made as white label drives then rebranded and sold by many OEM's, etc.


There are only 2.5 manufacturers of disk drives: Seagate (which includes Samsung), Western Digital (sometimes known as WD, and some disks still known as Hitachi), and Toshiba (which I count as 0.5 because they are small). So where do off-brand drives come from? Typically rejected drives from one of the big vendors. They could take two possible paths: either the manufacturer sold them to a big user (90% of all enterprise disks are sold to a dozen big companies, like Amazon/Apple/Baidu/Facebook/Google/Microsoft/Tencent or Dell/HP/Amazon/Oracle), failed QA testing there or showed errors early on, and were sold to unscrupulous resellers that hide their true history. Or they are drives that failed QA testing at the manufacturer, but I don't think the big manufacturers would be willing to sell those.

In all this, one has to remember that the disk manufacturers have very strict QA systems. And they grade the quality of disk drives: the best ones go to preferred customers, who also get access to internal QA information on a per-drive basis and use the drives in-house, and who pay a premium (typically the cloud superscalers); the decent ones go to price-conscious customers (such as Dell) who resell the drives as part of systems, and the not-so-good ones go into the retail channel. The joke used to be that the worst drives are sold at Fry's (a chain of electronics supermarkets on the west coast of the US, in particular in Silicon Valley, infamous for selling junk and having impossible to navigate return systems), but that's no longer true since Fry's has gone out of business.

If I had nothing useful to do, I would waste $50 on one of those drives, find out exactly what kind it really is (the firmware will give it away), and then perhaps bring it to some friend who works for one of the drive makers and we take it apart together. But I have too many useful things to do.

Summary: STAY AWAY.



> I've also heard that its a good idea to mix in different hard drive brands.


Opinions on that differ. At the consumer level, where you have no information about the reliability of individual disks, it might be a good idea, just to guard against the unfortunate coincidence that you buy all drives from one manufacturer/model made roughly at the same time and place, and that kind happens to be unreliable. But modern enterprise-grade drives are so good, with a decent RAID layer on top, they will be reliable enough. At the large-scale user level (the customers who buy a million disks at a time), there are large groups of people who study, measure and forecast drive reliability, and who consciously adjust data placement to maximize reliability and minimize cost. This is one of the reasons that small users simply can't compete with cloud providers: You can't afford the group of 5 PhDs and 10 software engineers that perform such optimizations, but you can rent disk space in the cloud from companies that do.

(about draining a disk that is about to be removed)


Dave-D said:


> This one sounds really interesting.  Do you have the steps listed somewhere that you're willing to share, so I could experiment?


Look at the "zpool remove" command.
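For mirrors specifically, the attach/detach pair is another way to experiment with retiring a disk. The pool and device names below are hypothetical:

```shell
# ASSUMPTIONS: pool "tank" is a 2-way mirror of da1 and da2;
# da2 is the disk to retire, da3 is the spare in an empty slot.

gmirror status 2>/dev/null || true   # (unrelated gmirror noise is ignorable)

zpool attach tank da2 da3   # grow the mirror to 3-way; a resilver starts
zpool status tank           # wait until the resilver finishes
zpool detach tank da2       # then drop the sick disk from the mirror
```

The pool never loses redundancy during the swap, which is the whole point of having the spare slot.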



Phishfry said:


> Not needed. Just buy quality drives.
> Look at backblaze stats.


THIS. For the consumer who buys small quantities of drives from the retail environment, looking at Backblaze is the best idea, because that's where their disks also come from. The problem with this approach is: by the time Backblaze has good high-statistics data (like having used 10,000 disks for 4 years), the disk model is probably obsolete, and can no longer be found in the retail channel, except for used or rejected drives. So the idea here is to look for patterns, like all the disks from manufacturer "Elephant" with model names that start with "Dumbo" are very good, and then follow that pattern. I'll give you a hint: My spinning drives are (WD) Hitachi HGST enterprise-grade drives, with model numbers starting with H.



Dave-D said:


> How many Hot-swap slots would I need (of the 8 available) to pull this off, *if* its even possible with my small rig?
> Could I use an external hard drive or two, in an external drive enclosure, attached w/usb cable?


One spare (empty) slot is enough. Sure, you could do it with external enclosures, but that's a hassle: You have to put the new/spare/old disk in the external enclosure, USB is probably slower, then remove it again and put it into its real location. Easier to have a spare slot.



> I'm starting to think along these lines for my final setup:
> Rack server case has maximum of 3 hot swap cages @ 4 drive bays/cage = 12 hot swap drive bays:


Given that you are planning to use 6 drives (and your assignment looks reasonable), I think two cages, giving you 8 slots and thus two spares, seem adequate. And cheaper.


----------



## Dave-D (May 21, 2022)

ralphbsz said:


> There are only 2.5 manufacturers of disk drives: Seagate (which includes Samsung), Western Digital (sometimes known as WD, and some disks still known as Hitachi), and Toshiba (which I count as 0.5 because they are small). So where do off-brand drives come from? Typically rejected drives from one of the big vendors. They could take two possible paths: either the manufacturer sold them to a big user (90% of all enterprise disks are sold to a dozen big companies, like Amazon/Apple/Baidu/Facebook/Google/Microsoft/Tencent or Dell/HP/Amazon/Oracle), failed QA testing there or showed errors early on, and were sold to unscrupulous resellers that hide their true history. Or they are drives that failed QA testing at the manufacturer, but I don't think the big manufacturers would be willing to sell those.
> 
> In all this, one has to remember that the disk manufacturers have very strict QA systems. And they grade the quality of disk drives: the best ones go to preferred customers, who also get access to internal QA information on a per-drive basis and use the drives in-house, and who pay a premium (typically the cloud superscalers); the decent ones go to price-conscious customers (such as Dell) who resell the drives as part of systems, and the not-so-good ones go into the retail channel. The joke used to be that the worst drives are sold at Fry's (a chain of electronics supermarkets on the west coast of the US, in particular in Silicon Valley, infamous for selling junk and having impossible to navigate return systems), but that's no longer true since Fry's has gone out of business.
> 
> ...



Fascinating. Very good advice.


----------



## Norm (May 28, 2022)

Dave-D said:


> NVME is not an option at this point.


Do you mean NVMe devices are not supported on FreeBSD? I use NVMe cards in my TrueNAS boxes, and those run FreeBSD.
Maybe I'm not understanding something, because I'm trying to enable a PCI adapter with an NVMe SSD installed, to use as a log device, but cannot find anything on how to enable this device.


----------



## PrometheousJames (May 29, 2022)

Is virtualization out of the question? It makes backups/snapshots and such much easier to automate, and to script out.


----------



## Dave-D (May 31, 2022)

Norm said:


> Do you mean nvme devices are not supported on freebsd?


Sorry, I meant this only as a personal preference, for various reasons.
No reference to NVMe's relationship to FreeBSD intended.
Sorry for the confusion.


----------



## Dave-D (May 31, 2022)

PrometheousJames said:


> Is virtualization out of the question? this makes backups/snapshots and such much easier to automate. Also to script out.


Maybe in the future.
But for now, I don't know much about it, and have enough on my plate.
Would have to set up an alternate test environment and play around with it.
Maybe some day...


----------



## Dave-D (Jun 1, 2022)

PROBLEM:

GPT partition tables and gmirror both write metadata at the end of a hard drive, which can cause problems with corrupted data.
The only recommended solution is to use MBR partitions, at least for now.

Does ZFS somehow get around this issue?

Does RaidZ have this same problem?
.
.


----------



## Jose (Jun 1, 2022)

Dave-D said:


> PROBLEM:
> 
> GPT partition tables and gmirror both write metadata at the end of a hard drive, which can cause problems with corrupted data.
> The only recommended solution is to use MBR partitions, at least for now.
> ...


ZFS and gmirror are completely unrelated. You might get spurious messages about a corrupt GPT if you created a ZFS vdev using whole devices (e.g., `zpool create zfspool raidz2 da0 da1 da2 da3`) that had existing GPT partitions on them. The cure for that is to run `gpart destroy -F` on the device before adding it to the pool. The messages are harmless in any case.
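Spelled out as a sequence (the da0-da3 device names are just Jose's examples; `gpart destroy -F` irreversibly wipes the partition table):

```shell
# ASSUMPTIONS: da0-da3 are blank-able disks destined for the pool.

# Wipe any stale GPT so ZFS doesn't warn about leftover metadata.
for d in da0 da1 da2 da3; do
    gpart destroy -F "$d"
done

# Build the RAID-Z2 pool directly on the raw devices.
zpool create zfspool raidz2 da0 da1 da2 da3
```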


Dave-D said:


> Does RaidZ have this same problem?


No.


----------



## Dave-D (Jun 1, 2022)

Jose said:


> ZFS and gmirror are completely unrelated. You might get spurious messages about a corrupt GPT if you created a ZFS vdev using whole devices (e.g., `zpool create zfspool raidz2 da0 da1 da2 da3`) that had existing GPT partitions on them. The cure for that is to run `gpart destroy -F` on the device before adding it to the pool. The messages are harmless in any case.



*So, to clarify, you are saying that:*
*Using GPT partitions with ZFS file system does *NOT* have the same corruption problem as "GPT + gmirror," which is caused by (depending on install order) "one overwriting metadata of the other, at the end of the disk," 
so in other words, 
GPT + ZFS works fine, doesn't matter if I use GPT + ZFS mirror, or GPT + ZFS RAIDz?*
.
.
Am I correct in saying that "gmirror" is used *ONLY* with UFS file system, and not at all, ever, with the ZFS file system?
So that really, the problem is being caused by a GPT + UFS file system issue?
.
.


----------



## Jose (Jun 1, 2022)

Dave-D said:


> *GPT + ZFS works fine, doesn't matter if I use GPT + ZFS mirror, or GPT + ZFS RAIDz?*


No need to shout. Yes, GPT + ZFS will work fine. You'll get harmless warnings about corruption if you use a raw device that had a GPT partition on it in a ZFS vdev. This is because you don't have to use partitions at all with ZFS if you don't want. It will work fine on raw devices.


Dave-D said:


> Am I correct in saying that "gmirror" is used *ONLY* with UFS file system, and not at all, ever, with the ZFS file system?


Gmirror works at the block level, and does not care about what filesystem is used. It should work with msdosfs(5) filesystems as well. Heck, it's probably possible to add a gmirror device to a ZFS vdev. I've never tried this, though.
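A minimal sketch of that filesystem-agnostic behavior (device names are hypothetical, and the label step wipes whatever is on the disks):

```shell
# ASSUMPTIONS: ada1 and ada2 are expendable spare disks.

kldload geom_mirror            # load the gmirror kernel module if needed

# Create a block-level mirror; the combined device appears under
# /dev/mirror/ regardless of what you later put on it.
gmirror label -v gm0 ada1 ada2

# Any filesystem works on top -- UFS here, but msdosfs would too.
newfs -U /dev/mirror/gm0
mount /dev/mirror/gm0 /mnt
```

gmirror never looks inside the blocks it replicates, which is exactly why it doesn't care what filesystem sits above it.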


Dave-D said:


> So that really, the problem is being caused by a GPT + UFS file system issue?


UFS has no problems with GPT.


----------



## Dave-D (Jun 1, 2022)

Jose said:


> No need to shout. Yes, GPT + ZFS will work fine. You'll get harmless warnings about corruption if you use a raw device that had a GPT partition on it in a ZFS vdev. This is because you don't have to use partitions at all with ZFS if you don't want. It will work fine on raw devices.



Sorry, I wanted to make sure that was the case because I just bought 8 4TB WD Enterprise drives, which I would have to return for 2TB drives if forced to run with MBR (capacity limit = 2TB).
Wasn't trying to shout!



Jose said:


> Gmirror works at the block level, and does not care about what filesystem is used. It should work with msdosfs(5) filesystems as well. Heck, it's probably possible to add a gmirror device to a ZFS vdev. I've never tried this, though.



So its a gmirror issue. Good to know.


----------



## Dave-D (Jun 1, 2022)

Jose said:


> This is because you don't have to use partitions at all with ZFS if you don't want. It will work fine on raw devices.


Currently trying to figure out how to create two partitions on first set of 3 mirrored drives, 1 partition for OS, 1 partition for LOCAL_BACKUP.
Second set of 3 mirrored drives will be DATA only, so that can be raw device.

Can't be done with the normal install routine without breaking out to a shell. Then it looks like manually installing everything, so things get complicated fast.

I've heard it's a good idea to always install ZFS into a partition slightly smaller than the actual drive size, in case of a total-size mismatch when moving everything to a new drive.
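One way to sketch that layout from the installer's shell. The disk name, labels, and sizes here are all assumptions (and this is only the partitioning, not a complete install):

```shell
# ASSUMPTIONS: ada0 is one of the 4TB OS+backup disks. Partition
# sizes are deliberately rounded down so a replacement "4TB" disk
# that is a few sectors smaller still fits.

gpart create -s gpt ada0
gpart add -t freebsd-boot -s 512k  -l boot0 ada0
gpart add -t freebsd-zfs  -s 1T    -l os0   ada0   # OS pool partition
gpart add -t freebsd-zfs  -s 2600G -l bak0  ada0   # LOCAL_BACKUP partition
# Remaining space at the end of the disk is intentionally unallocated.

gpart show -l ada0
```

Repeat on the other mirror members, then build two pools (one per partition label set) rather than one pool on raw disks.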


----------



## Erichans (Jun 1, 2022)

Dave-D said:


> [...] Second set of 3 mirrored drives will be DATA only, so that can be raw device.
> [...] I've heard its a good idea to always install ZFS into a partition slightly smaller than the actual drive size in case of total size mis-match when moving everything to a new drive.


Yes, you _can_ use raw disks, the question is: do you want to.

#1 raw disks are not particularly faster than partitioned disks;
#2 raw disks cannot have nice human-friendly labels, because labels are given to partitions, not to raw disks;
#3 raw disks cannot have boot partitions, so you'll never be able to boot from them;
#4 the number of sectors of a raw disk is fixed, and thereby brand-, model- and type-specific.
#1 although raw disks form the shortest way to "communicate directly" with the disk sectors, removing as many software layers as possible, I cannot imagine that this will bring you any measurable speed advantage.

#2 might not seem particularly significant at first, but that changes when you have to decide exactly which disk to pick when, for example, a disk has failed and you have to 1) take it out of the pool and 2) physically remove it. You do not have to deal with an array of 24 drives or even bigger, I know; however, in times of stress picking the right drive is important, and picking the wrong drive may even be disastrous.

#3 matters only, of course, when you want or need the option of booting from that pool, but leaving room for that (reserving and/or tailoring the necessary partitions on the various drives) requires little extra disk space; it is a small administrative overhead when you partition the drives.

#4 you have no influence over the size (number of sectors) that ZFS will use from each disk. When you create a pool of raw disks, ZFS will see to it that within a vdev, the disk space used is exactly the same size for each individual drive, even when the disks themselves are not equal in size. With disks of the same brand, model and type you have identical disks: no problems and no wasted space at creation. However, when you need to replace a raw disk in such a vdev, the new disk must have the same number of sectors or more. A smaller disk will not be accepted, and one new replacement 4TB drive may very well be a bit smaller than the other 4TB drives in the pool. This may result in a prolonged search for a suitable replacement disk, at a time when your pool is in a degraded state.

You said as much in your last sentence ("into a partition slightly smaller"): what holds for a bootable pool holds just as well for data-only disks. When partitioning the drives you can shave off a reasonable number of megabytes; a replacement disk of the same advertised size ("the label on the box") will then not get you into trouble if it happens to be a little smaller.

Think about this and decide if you really want raw disks.
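To illustrate #2 and the shave-off point: a sketch only, where the device names, label names, and the exact partition size are assumptions, not a recipe:

```
# Partition each disk with a slightly undersized freebsd-zfs partition
# and a human-readable GPT label that encodes its physical slot.
gpart create -s gpt ada1
gpart add -t freebsd-zfs -l data-slot1 -s 3720G ada1
# ...same for ada2 (data-slot2) and ada3 (data-slot3)...

# Build the mirror from the labels, not the raw devices, so a failed
# disk can be identified by slot name in "zpool status" output.
zpool create data mirror gpt/data-slot1 gpt/data-slot2 gpt/data-slot3
```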


----------



## Dave-D (Jun 2, 2022)

Erichans said:


> Think about this and decide if you really want raw disks.



I don't want raw disks. Thank you for your excellent input. You just made up my mind.

So I have my desired drives and partitions; the next step is figuring out how to make it happen. The FreeBSD installer won't do this without (at the very least) escaping to a command line, and I doubt it will do it from there. Looks like I may have to escape to the command line and then complete the whole install from there.

Has anybody done this and developed a sequence of commands? Something like that should be easy to modify for anyone having the same basic need.

I'll get to work and see what I can figure out.

Maybe we should share this when we get it figured out.


----------



## Dave-D (Jun 2, 2022)

Erichans said:


> When partitioning the drives you can shave off a reasonable number of Mbytes, a replacement disk of the same size (as advertised by “the label on the box”) will not get you into trouble if it happens to be a little smaller.



Any idea what percentage of total space to allow for "different size disks"?
I've heard 5%, but that might be a bit much? (4TB x 5% = 200GB)
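Working the arithmetic for a couple of margins as a sanity check (pure decimal math, 4TB taken as 4,000,000 MB):

```
# Margin left over when shaving off a given percentage of a 4TB drive.
drive_mb=4000000
for pct in 5 1; do
  echo "${pct}% of 4TB = $(( drive_mb * pct / 100 )) MB"
done
```

Even 1% is a 40GB margin on these drives, so I'm not sure what people actually use in practice.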


----------



## Jose (Jun 2, 2022)

Dave-D said:


> Currently trying to figure out how to create two partitions on first set of 3 mirrored drives, 1 partition for OS, 1 partition for LOCAL_BACKUP.
> Second set of 3 mirrored drives will be DATA only, so that can be raw device.


I honestly don't understand this part of your setup. What's the point of the OS and backup partitions? If your goal is to reserve capacity for backups or impose limits on how much space some filesystem can take up, you can accomplish that with ZFS reservations and quotas.

I would create two mirror vdevs with three drives each*, and add them both to a single zpool. I would then create OS, backup, and data filesystems in that single pool.
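Roughly what I have in mind, with hypothetical pool/dataset names (a reservation guarantees space, a quota caps it):

```
# One pool made of two 3-way mirror vdevs.
zpool create tank mirror da0 da1 da2 mirror da3 da4 da5

# Carve out filesystems with space guarantees/limits instead of partitions.
zfs create -o reservation=200G tank/backup   # backups always have 200G available
zfs create -o quota=2T tank/data             # data can never squeeze out the rest
zfs create tank/os                           # shares whatever space remains
```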


Dave-D said:


> Can't be done with the normal install routine without breaking out to shell.  Then it looks like manually installing everything so things get complicated fast.


Yeah, the installer can really only handle simple setups in my experience.


Dave-D said:


> I've heard its a good idea to always install ZFS into a partition slightly smaller than the actual drive size in case of total size mis-match when moving everything to a new drive.


Sounds like it might be a good idea in theory. I've never had occasion to use it in practice; the replacement disk(s) have always been far bigger than the replaced disk.

* This is more robust than the setups I usually do for home systems. I'm following what you said in post #67.


----------



## Dave-D (Jun 2, 2022)

Jose said:


> I would create two mirror vdevs with three drives each*


Two drives is the minimum for a mirror. One more drive = cheap insurance. One drive fails, I have a safety net while replacing. Otherwise, I'm immediately at risk while replacing.


Jose said:


> * This is more robust than the setups I usually do for home systems.


This is for a small business.


Jose said:


> I honestly don't understand this part of your setup. What's the point of the OS and backup partitions? If your goal is to reserve capacity for backups or impose limits on how much space some filesystem can take up, you can accomplish that with ZFS reservations and quotas.


Correct me if wrong.
With my setup, if I lose the wrong three drives, then I lose either the DATA pool or the OS/LOCAL_BACKUP pool, but not both.
If I make the pair of 3-way mirrors one pool, then if I lose the wrong three drives, I lose everything. Also, wouldn't all the files be scattered across two sets of drives, rather than stored on one set of drives? With one set of drives, each drive is complete in itself; with two sets of drives, no drive is complete in itself?

Again, correct me if wrong.
The point of the OS and LOCAL_BACKUP partitions:
I want LOCAL_BACKUP on a separate drive, for things like ZFS snapshots/clones, DB server dumps, incremental backups, and whatever other needs arise.
General consensus was that it's always best to have the OS on a separate drive.
I was going to have a pair of mirrored drives for the OS, but decided (see other posts) that combining OS and LOCAL_BACKUP on one drive would free up two (of my 8) hot-swap slots for whatever drive-replacement routines I need to implement.


----------



## Jose (Jun 2, 2022)

Dave-D said:


> Two drives is the minimum for a mirror. One more drive = cheap insurance. One drive fails, I have a safety net while replacing. Otherwise, I'm immediately at risk while replacing.


Fair enough, but I think this is usually done using online spare drives.*


Dave-D said:


> Correct me if wrong.
> With my setup, if I lose the wrong three drives, then I lose either DATA drive, or OS / LOCAL_BACKUP drive. But not both.
> If I make the pair of 3-way mirrors a pool, then I lose the wrong three drives, and I lose everything.


Correct. However, in your setup you will still lose a zpool if you lose the wrong three drives, and I wouldn't expect the data pool to be luckier than the other pool. With my setup and some luck you could lose four drives and still be smiling.


Dave-D said:


> Also, wouldn't all the files be scattered across two sets of drives, rather than stored on one set of drives?


Why would you care?


Dave-D said:


> With one set of drives, each drive is complete in itself? With two sets of drives, no drives are complete in itself?


I don't understand this. Edit: OK, I think I understand now. You're thinking you can take a zpool out and put it somewhere else? I'm not sure what the point would be. Also, it seems to me transferring the OS and/or backup filesystem with zfs send/receive would be a whole lot easier.
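Something like this sketch, where the pool and dataset names are made up:

```
# Snapshot the OS dataset tree and replicate it, properties and all,
# into another pool without mounting the received copies.
zfs snapshot -r zroot/ROOT@migrate
zfs send -R zroot/ROOT@migrate | zfs receive -u otherpool/ROOT
```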


Dave-D said:


> Again, correct me if wrong.
> The point of the OS and LOCAL_BACKUP partitions:
> I want a LOCAL_BACKUP drive, on a separate drive, for things like ZFS clones snapshots/clones/etc., DB Server dumps, incremental backups, whatever other need arises.


ZFS snapshots live in the same pool as the filesystem they're a snapshot of, and therefore have the same reliability guarantees. Dumps, backups, etc., are just bytes on a filesystem at the end of the day. Their reliability depends only on the zpool that contains the filesystem.


Dave-D said:


> General consensus was that its always best to have OS on a separate drive.


I dunno about that. Definitely on its own filesystem (ZFS) or partition (everything else).


Dave-D said:


> I was going to have a pair of mirrored drives for OS, but decided (see other posts) that combining OS and LOCAL_BACKUP on one drive would eliminate two (of my 8) hot swap slots for whatever drive replacement routines I needed to implement.


I'm not sure I follow this either.

In any case, I don't see where you're going to put your UEFI partition. It is possible to BIOS-boot from a GPT partition; is that your plan? Also, do you plan on having swap on this array?

* Edit 2: I'm not a fan of this approach. You're hoping the spare drives, which have been sitting there spinning but unused possibly for years, are good and ready for an intense write load when resilvering happens.

I prefer to bake the reliability guarantees into the RAID level I'm using (i.e., RAIDZn). In my experience, drives usually fail either when they're very new or very old. Do some burn-in when you create the array to make sure no drives fall into the first category, and you'll probably have years of trouble-free operation. I do tend to replace my drives preemptively after 2-4 years. I learned to do this the hard way.
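The burn-in I do is nothing fancy; a sketch along these lines (the device name is hypothetical, and the dd pass destroys everything on the disk):

```
# SMART extended self-test first.
smartctl -t long /dev/ada1

# Full-surface write pass -- DESTROYS all data on the disk.
dd if=/dev/zero of=/dev/ada1 bs=1m

# Afterwards, check for reallocated or pending sectors.
smartctl -a /dev/ada1
```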

Please keep in mind I've only ever set up smallish systems for my own home use. I'm a weak-minded Java programmer by day and an amateur FreeBSD sysop by night.


----------



## Dave-D (Jun 2, 2022)

Jose said:


> * Edit 2: I'm not a fan of this approach. You're hoping the spare drives that have been sitting there spinning but unused possibly for years, are good and ready for an intense write load when resilvering happens.


I'm talking empty hot-swap slots. Spare drives sitting on the shelf.



Jose said:


> Correct. However, in your setup you will still lose a zpool if you lose the wrong three drives, and I wouldn't expect the data pool to be luckier than the other pool. With my setup and some luck you could lose four drives and still be smiling.


Not following you here. If you have 2 sets of 3-way mirrors, you can lose 2 drives from each mirror, but if you lose 3 drives from one mirror, you lose everything.
Data is backed up in a mixture (yet to be determined) of cloud, tape, external hard drives, etc. LOCAL_BACKUP is not the main and final backup, only a local means of doing miscellaneous backups.

Also, LOCAL_BACKUP can hold backups of the system, which are then copied to external hard drive, tape, or cloud, so that a "compare" doesn't report mismatches merely because the live data set you would otherwise have used has changed in the meantime. I don't know all the answers, but I see this setup as a good way to experiment with various solutions.


Jose said:


> Dumps, backups, etc., are just bytes on a filesystem at the end of the day. Their reliability depends only on the zpool that contains the filesystem.


Added reliability comes from being on a second set of 3-way mirrored drives. So technically, if something exists on both sets of 3-way mirrored drives, I would have to lose all 6 drives to lose it: the original data exists on DATA (one 3-way mirror), and the backup of the same data exists on LOCAL_BACKUP (a different 3-way mirror).

LOCAL_BACKUP is not for final backups (that would be foolish).


Jose said:


> I'm not sure I follow this either.


If interested, please read the previous 4 pages of posts. I am interested in your feedback.


Jose said:


> I prefer to bake the reliability guarantees into the RAID level I'm using (i.e., RAIDZn


I've thought long and hard about this. The problem with RAID is you never have one complete set of data on one drive; everything is scattered across drives.
It's slower than a mirror.
It's harder to resilver a drive than with a mirror.
It takes more computing power (CPU, hard drive head movement, etc.).
Overall "sexier" than a plain old boring mirror (simple duplication), but I'm not sure it best fits my use case, this being a small-business server and not a multi-terabyte behemoth running some massive business organization. If I get that big, things could change, probably to a mixture of mirrors and RAIDZn.


Jose said:


> I'm a weak-minded Java programmer by day, and an amateur Freebsd sysop by night.


We can all learn something from each other.
"iron sharpens iron"
I appreciate your thoughts and feedback.


----------



## Jose (Jun 2, 2022)

Dave-D said:


> I'm talking empty hot-swap slots. Spare drives sitting on the shelf.


Seems worse to me. You're assuming these brand-new drives will be good when you need them to be, and that's likely to be at a stressful time. Again, I prefer to burn in my drives.



Dave-D said:


> Data is backed up in a mixture (yet to be determined) of cloud, tape, external hard drives, etc. The LOCAL_BACKUP is not the main and final backup, but only a local means of doing misc. backups. Also, LOCAL_BACKUP can be used for backups of system, then backup to ext. hard drive, tape or cloud, so that when you do a "compare" it doesn't say that things don't match because the LIVE data set you otherwise would have used, has changed. I don't know all the answers, but see this setup as a good way to experiment with various solutions.


Still sounds like a ZFS filesystem to me.



Dave-D said:


> Added reliability comes from being on a second set of 3-way mirrored drives. So technically, if something exists on both sets of 3-way mirrored drives, I would have to lose all 6 drives to lose everything. Original data on DATA (3-way mirror). Backup (some form of) on LOCAL_BACKUP (a different 3-way mirror).


Losing the OS should not be a big deal if you have good runbooks. Your backup staging area should be semi-disposable too since you should have at least one copy of what's on there somewhere else. Losing your data is what you should seek to avoid at all costs, and you still have a three drive loss maximum on that pool.



Dave-D said:


> I've thought long and hard about this. Problem with RAID is you never have one complete set of data on one drive. Everything is scattered across drives.


You're hoping to pull one good drive from the wreckage? Fair enough, but supposing there's only one good drive left after some disaster, you only have a one-in-two chance that it will be a drive from your data pool.



Dave-D said:


> It's slower than a mirror.


Only for reads. I wonder about the write overhead of a three-way mirror like you propose. Most mirror setups I've seen only had two members.



Dave-D said:


> It's harder to resilver a drive than with a mirror.


Not sure about this one. Resilvering a mirror means all data must be copied to the new drive.



Dave-D said:


> Takes more computing power (cpu and hard drive/head movement/etc.)


Not sure about this one either. I believe the ZFS implementation does clever things in this area. Do you have a reference?


----------



## Erichans (Jun 2, 2022)

Short comments:

I've thought long and hard about this. >> _good, it's "your" system._
Problem with RAID is you never have one complete set of data on one drive. Everything is scattered across drives. >> _that isn't an actual problem_
It's slower than a mirror. >> _probably (most of the time, for reads); for your use case: does that really matter?_
It's harder to resilver a drive than with a mirror. >> _I don't think so._
Takes more computing power (cpu [...]) >> _Absolutely! And at the same time that hardly ever matters*!_
[...] and hard drive/head movement/etc.) >> _I really don't think so._

Your case, like most others, means juggling all the possibilities of allocation. So my suggestions may not be 100% satisfactory and could be considered variations on previous suggestions, but decide for yourself. I'm also presuming that the OS part contains only the base install and root and other management accounts; all user data is independent and stored in the data pool.

You _can_ use three complete drives to form a separate pool with a 3-way mirror for the OS (and other things, these not being "running data"). Based on what you have written so far about your SMB target environment, that is a very safely configured way to do things, IMO: overly safe. I think the data pool is more important than the OS pool, just as Jose mentions; therefore more redundancy should go to the data pool. Based on this, I'd say a 2-way mirror would suffice for the OS pool. When losing one disk, rebuilding takes about the same time for a 2-way mirror pool as for a 3-way mirror pool, and you do not need the speed of a 3-way mirror for the OS pool; speed-wise that's overkill, IMO.

Taking this reasoning a step further: for the OS pool you do not need very much space; a 2-way mirror of 2 * 512GB (SATA) SSDs would suffice. Here I'm transitioning from spindles to silicon. If you take a competent set of SSDs, they will be more reliable than spinning platters of rust. The extra read/write speed is a pleasant side effect; you'll benefit from it when resilvering to a replacement SSD after a failure, keeping your vulnerability window with a degraded pool narrower than with spinning platters. Also, in case of absolute disaster, when the whole pool must be rebuilt from backups, this will go fast. Added bonus: two or three extra usable physical 3.5" slots in your drive cages, presuming the SSDs can be "tucked away" elsewhere in the system.

This brings me to the last item of the OS pool: your local backups. I don't see a clear basis for that. Perhaps you're thinking of making a quick and efficient backup of the data (perhaps DB data) from the data pool, but you're sort of squandering prime online disk space for local backup purposes. With ZFS snapshots you can take a snapshot of (part of) a pool. After that you can use the snapshot as the source of your backup; no difficulties with a backup window or locked files: ZFS has done that for you already.
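A sketch of that workflow, with a hypothetical dataset name (snapshots appear read-only under the dataset's `.zfs/snapshot` directory):

```
# Freeze a consistent view of the data, back it up at leisure, then drop it.
zfs snapshot data/db@nightly
tar -C /data/db/.zfs/snapshot/nightly -cf /backup/db-nightly.tar .
zfs destroy data/db@nightly
```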

Next: the data pool. You'll have to make a business assessment of how and when a possible expansion might be needed. Start with the number of spindles at your disposal: a 5- or 6-drive RAIDZ2; when using 6 or more drives, RAIDZ3 for ultimate redundancy, which is the same level of redundancy as the 3-way mirror option. You'd be using your disk space a lot more efficiently, unless you have a clear use case where speed is that important.

When expansion becomes necessary in the future, you'll have basically two options. First replace each individual drive by a bigger one (say 8TB instead of 4TB). You'll be doing a replacement and resilver action on a disk-by-disk basis. That will probably take several days. The second is adding an extra vdev of new drives to the data pool. When using the same redundancy type that would mean doubling the pool that already exists.

While technically your data is indeed distributed, scattered if you prefer, I'm unsure as to why that matters to you in particular and otherwise in any practical way. With a data pool with a 3-way mirror, you only have a space utilisation of 33%. What do you think you could be doing with each separate disk and when would you likely need that? When you expand the pool by adding another 3-way mirrored vdev to the pool, this will not be the case any more. Your data is distributed over the two vdevs and while technically the files on the original vdev with the 3-way mirror are only on that vdev: they won't be of any use as a separate unit anymore once the second 3-way mirror vdev has been added.

While you can reason that there is an advantage to two separate pools for data and OS, in such a relatively small setting that would only be separation of concerns, which in and of itself might be valid as a matter of personal preference. When combining the two pools you basically have the option of one big RAIDZn pool of 5 or 6 disks or, as Jose mentions, one pool consisting of (a stripe over) two 3-way mirrors. I've found an overview (a drawing) of the various options with their available space, space utilisation, and redundancy useful.

Finally, your extra HD on standby on the shelf. That is money paid for and not used and, I think, to a certain extent a false sense of security, because that disk, added to the (data) pool, would add extra redundancy, space, or speed, depending on how it is deployed. You only gain a slight advantage in case of a failing disk, in the sense that the sysadmin (that would be you alone, I presume, in the SMB use case), when physically at the system, could start the resilver sooner. When that disk is instead deployed as extra redundancy (RAIDZ3 instead of RAIDZ2) and a disk fails, you fall back to the same level of redundancy you would have had with the disk on the shelf and RAIDZ2 deployed. The system only has to endure a bit of extra power draw while operating.

Weigh your options and make a decision. With your level of preparation and management, you should also consider the two ZFS (e)books (FreeBSD Development: Books, Papers, Slides), at least the first one. When you're concerned about speed issues: Six Metrics for Measuring ZFS Pool Performance: Part 1 - Part 2 - pdf (2018-2020), by iX Systems.

___
* if you had a very low-specced CPU without hardware instruction support for the needed calculations, that might be an issue.


----------



## Dave-D (Jun 2, 2022)

Erichans said:


> 2-way mirror of 2 * 512GB (SATA) SSDs


Realistically, what minimum size should I consider?


Erichans said:


> a competent set of SSDs


What would be the criteria for considering an SSD as "competent"?

I see what you're saying.
By using two 3-way mirrored sets, I'm basically duplicating my redundancy, with no real benefit.
What you are saying is: move the OS to SSDs and free up 2 or 3 hot-swap slots.
Then go for more drives with RAIDZ2 or RAIDZ3 to gain my redundancy for everything but the OS.
I have 8 hot-swap slots now, and can expand to 12 in the future.
I'm thinking I can either go with four drives initially, then add 4 more (8 total for now), then 4 more (12 total, if I add one more bay),
OR
go with six drives initially, then 6 more in the future (12 total, max).
Then place my DATA and LOCAL_BACKUP (or whatever I end up doing) on that one set of drives, keeping them separate using pools or datasets.
List and weigh my options, keeping ease of future expansion in mind.

I hadn't considered that once I add a second 3-way mirrored set of drives and assemble all 6 drives into a pool, data would start to spread across drive sets anyway. If I wanted all data to always be available on one drive, I would be limited to that drive's overall size as the maximum available size for my whole system. Thank you, I wasn't thinking clearly about this; data across multiple drives is a given for any sizable system.

Thank you Erichans and Jose, you've got me thinking along new lines.


----------



## Jose (Jun 3, 2022)

Erichans said:


> Taking this reasoning a step further: for the OS pool you do not need very much space: a 2-way mirror of 2 * 512GB (SATA) SSDs would suffice. Here I'm transitioning from spindles to silicon. If you take a competent set of SSDs then they will be more reliable than spinning platters of rust. The added extra read/write speed is a pleasant side effect; you'll benefit from that when resilvering to a new replacement SSD in case of failure; keeping your vulnerability time window with a degraded pool narrower than with spinning platters. Also, in case of absolute disaster when the whole pool must be rebuilt from backups this will go fast.


This is exactly what I did for my home system. I have two Samsung EVO 250GB SSDs in a gmirror that hosts the OS. I would've used a ZFS mirror if I were to do this again.

Each drive has three partitions, the OS partition which is about 220GB, a swap partition, and a partition that I'm using for the ZFS Intent Log.



Dave-D said:


> Realistically, what minimum size should I consider?


I'm only using 6.3GB of the 220GB in my OS volume. It's a headless server with just these packages installed:

```
databases/postgresql12-server
devel/git
dns/bind916
dns/mDNSResponder_nss
mail/dovecot
mail/postfix
net/netatalk3
net/rsync
ports-mgmt/pkg
security/doas
sysutils/smartmontools
sysutils/tmux
www/dokuwiki
www/nginx
```


----------



## Erichans (Jun 3, 2022)

The 512GB figure is the result of new drives more and more only starting at 0.5TB, and the 250GB models costing more than half the price of a 0.5TB drive. The shift is moving towards 1TB as the standard size.
As Jose mentions, you can easily get by with half of that.

As to competent: my suggestion would be some (semi-)pro SSD, probably MLC/TLC. Don't concentrate on speed; it's not overly important for an OS pool using SATA. I'm no expert in that area and haven't compared reviews either; you'll have to do your own research.


----------



## Dave-D (Jun 9, 2022)

Okay, so it's on to the next battle....

I'll have to install from the command line in order to get the custom partitions I need.

I've watched various YouTube videos, which get very complicated. Much too hard for my current pay grade.
Will I need a 4-year degree before attempting to install from the command line?

Does anyone know of a good custom install script that would serve as a starting point,
which could be adapted to my purpose, rather than re-inventing the wheel?

Or perhaps I should bail on the custom-partition idea for now, do a default ZFS auto-partition install (for my OS drive),
do my build from there, and worry about the custom partitions after gaining knowledge and experience?

I do need to get some servers running soon, rather than months or years down the road.


----------



## J65nko (Jun 9, 2022)

About nine years ago I wrote a Makefile to automate a manual ZFS install described by Vermaden. It supports using memory/RAM disks for testing. Because ZFS, as well as the FreeBSD install procedure, has changed a lot over the years, it would require quite a lot of work to adapt, but it could probably serve as inspiration for your own script. See https://forums.freebsd.org/threads/...s-root-install-adapted-for.41274/#post-229381


----------



## Jose (Jun 9, 2022)

I would do a basic install first to get your feet wet, hopefully on a system with at least two drives. Ignore one of the drives during setup so you can use it to experiment with partitioning, ZFS, etc. after you have a basic system installed.

FreeBSD system administration is done mostly at the command line. It's one of the things I like about it.
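You can even practice pool layouts without touching real disks, using file-backed vdevs (the names here are made up, and this is for experimenting only, not for production pools):

```
# Create three 1GB backing files and build a throwaway 3-way mirror on them.
truncate -s 1g /tmp/d0 /tmp/d1 /tmp/d2
zpool create scratch mirror /tmp/d0 /tmp/d1 /tmp/d2

# Experiment: check health, detach/replace a "disk", watch a resilver...
zpool status scratch

# ...and throw it all away when done.
zpool destroy scratch
rm /tmp/d0 /tmp/d1 /tmp/d2
```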


----------



## Dave-D (Jun 9, 2022)

Here is the solution to custom / command line install with partitioning of your choice:

1. Do a new install, selecting options closest to your desired final install. (In my case, that's "ZFS auto" partitioning; I can select most of the items I need, and only the partition sizes, which default to the full drive, are wrong.)

2. Inspect the contents of the FreeBSD install log at /var/log/bsdinstall_log. Find the commands where the drive partitions were created and mirroring (if any) was set up.

3. Find examples of install scripts.
- See "Chapter 11: Complex Installation" in book "FreeBSD Mastery: Storage Essentials" by Michael W. Lucas.
- Misc. YouTube videos; search "FreeBSD command line install" or "FreeBSD manual install." They can get complicated fast, probably overly and unnecessarily so. We'll see.
- Search the forums. One example, "How to create a scripted installer?" (forums.freebsd.org):
> I've been browsing the forum and the documentation, and it seems like the right kind of information must be there somewhere, but basically what I would like to create is a following:  A custom bootable USB-stick from which FreeBSD installer would launch. I say "custom", because I would like the...
-Etc.

4. Combine #1, #2 and #3 above, and modify as needed to create and test a recipe of your own.
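As a sketch of what step #2 turns up, here is roughly the shape of the partitioning/pool commands for a two-disk ZFS mirror. The device names, sizes, and labels below are my own assumptions, not copied from a real bsdinstall_log:

```
# Partition both disks identically: EFI boot, swap, and the ZFS partition.
gpart create -s gpt ada0
gpart add -t efi -s 260m -l efiboot0 ada0
gpart add -t freebsd-swap -s 4g -l swap0 ada0
gpart add -t freebsd-zfs -l zfs0 ada0
# ...repeat for ada1 with labels efiboot1/swap1/zfs1...

# Create the pool mirrored across the labeled partitions, rooted at /mnt.
zpool create -o altroot=/mnt -O compress=lz4 -O atime=off zroot mirror gpt/zfs0 gpt/zfs1
```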


----------



## Dave-D (Jun 9, 2022)

Jose said:


> I would do a basic install first to get your feet wet, hopefully on a system with at least two drives. Ignore one of the drives during setup so you can use it to experiment with partitioning, ZFS, etc. after you have a basic system installed.



I think this is the best idea to get started.
Install with the provided installer, using ZFS auto partitioning, to two or three mirrored 250GB or 500GB SSDs, OS only, using the default full size of the SSD.
Mess around with partitioning/pools/etc. to set up DATA and LOCAL_BACKUP on the second set of drives.
Get a basic test server running ASAP, with file sharing and backup only. Install it on-site for testing in a real-world environment. I'm especially curious about ZFS file read/write performance, backups, and use with Samba.
I can continue on a second FreeBSD testing server at my location over the summer; nobody cares what I do with that one.
Build out the 3 production servers (with more advanced features) after the summer.


----------



## Jose (Jun 9, 2022)

These are my notes from when I installed my server a couple of years ago:

ZFS

Remove any existing GPT partitions with gpart destroy -F before adding to the pool. Annoying boot messages about a corrupt GPT will happen otherwise.

Pool and dataset creation:

```
zpool create zfspool raidz2 da0 da1 da2 da3 da4 da5
zpool add zfspool log /dev/mirror/gm0s1d
zfs create zfspool/home
cp -rp /home/* /zfspool/home
rm -rf /home /usr/home
ln -s /zfspool/home /home
ln -s /zfspool/home /usr/home
zfs create zfspool/temp zfspool/video zfspool/postgres zfspool/tmachine
```

Misc:

```
zpool status
```


These are for the gmirror creation, and maybe not so applicable for your use case:

GEOM mirror setup

Boots from a GEOM mirror (two Samsung EVO 850 250GB SSDs). Most of the config is from here:
https://www.freebsd.org/doc/handbook/geom-mirror.html

But remember to enable TRIM on the root filesystem:

```
newfs -t -U /dev/mirror/gm0s1a
```

One gotcha is that the installation suddenly stopped seeing the mirror volumes. I think I needed to have the GEOM system "taste" the mirror again:

```
true > /dev/ada0
true > /dev/ada1
```

That last tip is from here:
https://www.ateamsystems.com/tech-blog/installing-freebsd-9-gmirror-gpt-partitions-raid-1/


----------

