# Drive failures



## amilojko (May 12, 2011)

Hi all,
Suddenly I am getting a whole bunch of disk errors like this:


```
May  7 15:56:00 poseidon kernel: ad2: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=191
May  7 15:56:00 poseidon kernel: ad2: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=191
May  7 15:56:00 poseidon kernel: ad2: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=63
May  7 15:56:00 poseidon kernel: ad2: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=63
May  7 15:56:00 poseidon kernel: ad2: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=63
May  7 15:56:00 poseidon kernel: ad2: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=575
May  7 15:56:00 poseidon kernel: ad2: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=575
May  7 15:56:00 poseidon kernel: ad2: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=575
```
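
For anyone triaging similar logs, a rough way to see which devices are affected is to count the messages per device and severity. This is only a sketch: `summarize_ata_errors` is a made-up helper name, and it assumes the kernel message layout shown in the excerpt above.

```sh
# Summarize ATA kernel messages read from stdin.
# Prints one "<device> <WARNING|FAILURE> <count>" line per combination.
# (summarize_ata_errors is a hypothetical helper, not a FreeBSD tool.)
summarize_ata_errors() {
    awk '
        # Match fragments such as "kernel: ad2: FAILURE"
        match($0, /kernel: [a-z]+[0-9]+: (WARNING|FAILURE)/) {
            split(substr($0, RSTART, RLENGTH), f, " ")
            dev = f[2]
            sub(/:$/, "", dev)          # "ad2:" -> "ad2"
            count[dev " " f[3]]++
        }
        END { for (k in count) print k, count[k] }
    ' | sort
}

# Example:
#   summarize_ata_errors < /var/log/messages
```

Run against the eight lines quoted above, this should report `ad2 FAILURE 3` and `ad2 WARNING 5`.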

These are all FreeBSD 8, 8.1 and 8.2 machines, and all disks are over 160 GB. Now it can't be that hard drives are suddenly failing on three boxes. I had one disk 250GB replaced and it's ok for now. Is anyone else encountering new "features" like this in 8-RELEASE?

Regards
A


----------



## da1 (May 12, 2011)

I run 8.2 everywhere and have never seen this "feature" (as you call it) without a serious underlying problem. I would suggest checking your cables and testing your HDDs with smartmontools.
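
For context, the ICRC flag in those messages is the ATA interface CRC error bit: data was corrupted in transit between drive and controller, which is why cabling is a prime suspect. Many drives count such events in SMART attribute 199, which smartmontools usually lists as UDMA_CRC_Error_Count. Below is a minimal sketch for pulling one raw value out of `smartctl -A` output. `smart_raw` is a hypothetical helper, the device name is taken from the log in the first post, and attribute numbering is vendor-specific, so treat the IDs as an assumption.

```sh
# Print the RAW_VALUE column of one SMART attribute.
# Reads `smartctl -A` output on stdin; $1 is the attribute ID.
# (smart_raw is a hypothetical helper name.)
smart_raw() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $10; exit }'
}

# Example (needs root and the sysutils/smartmontools port):
#   smartctl -A /dev/ad2 | smart_raw 199   # interface CRC errors
#   smartctl -A /dev/ad2 | smart_raw 5     # reallocated sectors
#   smartctl -t long /dev/ad2              # start an extended self-test
```

As a rule of thumb, a CRC count that keeps climbing points toward the cable or controller, while growing reallocated-sector counts point toward the disk surface itself.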



> I had one disk 250GB replaced and it's ok for now.


Just to be sure I fully understand: you replaced one of the failing HDDs and everything was fine?
Because if that's the case, it sounds to me like an HDD failure.


----------



## amilojko (May 12, 2011)

Thanks for your reply.

Yes, one externally hosted box running 8.1 (a server only 3 months old) had its disk replaced, and I don't see the errors anymore. Good, for now. But two internal boxes are now showing these errors too, so I became a little paranoid. I will check the disks with fsck and smartmontools.

I thought it might be a driver or BIOS problem. I read somewhere that these errors sometimes show up under heavy load and are not serious. But the server in question should not be under heavy load, so I'll have to go with disk failure.

Thanks again.
A


----------



## gkontos (May 12, 2011)

@amilojko,

Those errors sound a lot like your drive is about to fail. You can use smartmontools, but I would suggest that you back up immediately if you haven't done so already.


----------



## Imanol (May 12, 2011)

You should do what gkontos says. I suggest you plug the drive into the computer and use ddrescue to dump its contents to the 250 GB disk.
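
A sketch of a typical two-pass run, assuming GNU ddrescue (the `ddrescue` binary, not `dd_rescue`). The helper name `ddrescue_plan` and the file names are made up for illustration, and the function only prints the commands so they can be reviewed before anything touches a disk.

```sh
# Print a two-pass GNU ddrescue plan; nothing is executed here.
# $1 = failing source device, $2 = destination image or device,
# $3 = map file that lets ddrescue resume where it left off.
# (ddrescue_plan is a hypothetical helper name.)
ddrescue_plan() {
    src=$1 dst=$2 map=$3
    # Pass 1: grab everything that reads cleanly, skip the slow scraping phase.
    printf 'ddrescue -n %s %s %s\n' "$src" "$dst" "$map"
    # Pass 2: return to the bad areas and retry them up to 3 times.
    printf 'ddrescue -r3 %s %s %s\n' "$src" "$dst" "$map"
}

# Example:
#   ddrescue_plan /dev/ad2 rescue.img rescue.map
# When the destination is a device rather than a file, ddrescue also
# needs -f (--force) before it will overwrite it.
```

Copying the easy areas first matters on a dying disk, since every retry on a bad area adds wear before the readable data is safe.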

If the three HDDs are the same age, it's plausible that they are failing at the same time, especially since I guess they share the same environmental conditions (the same room, or maybe the same building).

If you just want to be sure it's an HDD failure and not something else, try checking the disk under another OS (ddrescue will already report bad sectors while backing up).

I'll tell you something: if you manage to make the backup and there's disk damage, there's a neat way to fix it, which saved my ass once:

HDD firmware is programmed to block access to bad sectors, making the computer "skip" them, if they are detected when writing. If the computer reads a bad sector, it will retry 3 times and then spit out an I/O error, but if you write to a bad sector, it gets fixed. So if you run into that problem, just do

`dd if=/dev/zero of=/dev/YOURDRIVE`

That will zero out the whole drive (DELETING EVERYTHING, ALL PARTITIONS) while correcting any errors. Then make a new partition table and restore the partition backup.


----------



## SirDice (May 12, 2011)

Imanol said:

> HDD firmware is programmed to block access to bad sectors, making the computer "skip" them, if they are detected when writing. If the computer reads a bad sector, it will retry 3 times and then spit out an I/O error, but if you write to a bad sector, it gets fixed. So if you run into that problem, just do
> 
> `dd if=/dev/zero of=/dev/YOURDRIVE`
> 
> That will zero out the whole drive (DELETING EVERYTHING, ALL PARTITIONS) while correcting any errors. Then make a new partition table and restore the partition backup.



You're correct that it happens during a write, but those bad blocks aren't skipped. ATA drives have a "spare" area; you can't access it, but the drive's firmware can. When a bad block is detected, the firmware remaps that block to the spare area. Once the spare area is full, bad blocks will start showing up and the disk will need to be replaced as soon as possible.

The dd command will find and 'fix' any bad blocks, but if there are still bad blocks after that, it means the spare area is full.
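
One way to watch the remapping happen is to save `smartctl -A` output before and after the dd pass and compare the raw counters. This is a sketch under assumptions: `attr_delta` is a hypothetical helper, the snapshot file names are made up, and attribute 5 (Reallocated_Sector_Ct) and 197 (Current_Pending_Sector) are the IDs commonly used for this, though not every drive reports them.

```sh
# Print how much one SMART attribute's raw value changed between two
# saved `smartctl -A` outputs.
# $1 = attribute ID, $2 = snapshot taken before, $3 = snapshot taken after.
# (attr_delta is a hypothetical helper name.)
attr_delta() {
    id=$1
    before=$(awk -v id="$id" '$1 == id && NF >= 10 { print $10; exit }' "$2")
    after=$(awk -v id="$id" '$1 == id && NF >= 10 { print $10; exit }' "$3")
    # Give up if either snapshot lacks the attribute.
    [ -n "$before" ] && [ -n "$after" ] || return 1
    echo $((after - before))
}

# Example:
#   smartctl -A /dev/ad2 > before.txt
#   ... run the dd pass ...
#   smartctl -A /dev/ad2 > after.txt
#   attr_delta 5 before.txt after.txt     # sectors remapped by the pass
#   attr_delta 197 before.txt after.txt   # change in pending sectors
```

A reallocated count that grows during the pass means the firmware is still finding spare sectors to remap to; write errors during the dd itself suggest it has run out.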


----------



## Imanol (May 12, 2011)

SirDice said:

> You're correct that it happens during a write, but those bad blocks aren't skipped. ATA drives have a "spare" area; you can't access it, but the drive's firmware can. When a bad block is detected, the firmware remaps that block to the spare area. Once the spare area is full, bad blocks will start showing up and the disk will need to be replaced as soon as possible.
> 
> The dd command will find and 'fix' any bad blocks, but if there are still bad blocks after that, it means the spare area is full.



Totally right, my mistake.

It's worth giving it a shot, though. I've got a ten-year-old computer that had an HDD failure about a year ago, and that trick let me keep using it without replacing the drive. It's very likely to fail soon, though, so it's only a good solution in the short run.


----------

