# Multiple errors; server dies for 10 minutes; revives itself



## ryanm (Feb 4, 2012)

I've run into some error messages on my server that are beyond my ability to interpret, so I'm hoping some of you can help me out.

I posted a while ago about a SIGBUS problem I was having here. Well, the problem came back again, and a lot more errors accompanied it this time.

Before we jump into the gooey stuff, let me give a little background on the hardware. We have an Intel modular blade server. The chassis has 2x 3-disk RAID arrays. Volume 1 is what the OS (FreeBSD 7.2) is installed on and Volume 2 is mounted at /usr. These two volumes are da0 and da1.

I got email notifications saying the web host I run in a jail hosted on this server was down. I tried to SSH into it, but it failed. I pinged it and got a 50% return rate. So I logged in to the management blade and started a virtual KVM session to get into the blade. Once I was into the basehost blade, I ran cat on dmesg.today and got a slew of errors. Here we go:


```
(da3:mpt0:0:6:1): Logical unit not accessible, target port in standby state
(da3:mpt0:0:6:1): Retrying Command (per Sense Data)
(da3:mpt0:0:6:1): READ(10). CDB: 28 0 0 0 0 0 0 0 1 0
(da3:mpt0:0:6:1): CAM Status: SCSI Status Error
(da3:mpt0:0:6:1): SCSI Status: Check Condition
(da3:mpt0:0:6:1): ILLEGAL REQUEST asc:4,b
(da3:mpt0:0:6:1): Logical unit not accessible, target port in standby state
(da3:mpt0:0:6:1): Retrying Command (per Sense Data)
(da3:mpt0:0:6:1): READ(10). CDB: 28 0 0 0 0 0 0 0 1 0
(da3:mpt0:0:6:1): CAM Status: SCSI Status Error
(da3:mpt0:0:6:1): SCSI Status: Check Condition
(da3:mpt0:0:6:1): ILLEGAL REQUEST asc:4,b
(da3:mpt0:0:6:1): Logical unit not accessible, target port in standby state
(da3:mpt0:0:6:1): Retries Exhausted
```

As mentioned before, our two volumes are da0 and da1. /dev lists da2 and da3 as well, but I have no idea what they are. 
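For anyone reading the da3 block, the CDB line says what the kernel was actually asking the device for. Here is a minimal Python sketch that decodes it, assuming the bytes after `CDB:` are printed in hex and follow the standard READ(10) layout (opcode 0x28, bytes 2-5 the logical block address, bytes 7-8 the transfer length, both big-endian). The function name `decode_read10` and the regex are hypothetical helpers written for this log format only.

```python
import re

# Matches CAM log lines such as
# "(da3:mpt0:0:6:1): READ(10). CDB: 28 0 0 0 0 0 0 0 1 0"
CDB_LINE = re.compile(
    r"^\((?P<dev>\w+):(?P<path>[^)]*)\): READ\(10\)\. CDB: (?P<cdb>[0-9a-fA-F ]+)$"
)


def decode_read10(line):
    """Return device, LBA and block count for a READ(10) CDB log line, or None."""
    m = CDB_LINE.match(line.strip())
    if m is None:
        return None
    # Assumption: each space-separated token is one CDB byte in hex.
    cdb = [int(b, 16) for b in m.group("cdb").split()]
    if len(cdb) != 10 or cdb[0] != 0x28:
        return None
    lba = int.from_bytes(bytes(cdb[2:6]), "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")  # bytes 7-8: transfer length
    return {"device": m.group("dev"), "lba": lba, "blocks": blocks}


if __name__ == "__main__":
    print(decode_read10("(da3:mpt0:0:6:1): READ(10). CDB: 28 0 0 0 0 0 0 0 1 0"))
```

For the lines in the log above, this decodes to a one-block read at LBA 0 on da3, so every retry is the same request for the very first block of that device.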

```
GEOM_LABEL: Label for provider da0s1a is ufsid/4aeb03874c64d9f1.
GEOM_LABEL: Label for provider da0s1d is ufsid/4aeb038ae8ae24cf.
GEOM_LABEL: Label for provider da0s1e is ufsid/4aeb0387d999941a.
GEOM_LABEL: Label for provider da0s1f is ufsid/4aeb038766c4c807.
Trying to mount root from ufs:/dev/da0s1a
GEOM_LABEL: Label ufsid/4aeb03874c64d9f1 removed.
GEOM_LABEL: Label for provider da0s1a is ufsid/4aeb03874c64d9f1.
GEOM_LABEL: Label ufsid/4aeb0387d999941a removed.
GEOM_LABEL: Label ufsid/4bd2077f23a6cc93 removed.
GEOM_LABEL: Label for provider da0s1e is ufsid/4aeb0387d999941a.
GEOM_LABEL: Label for provider da1s1 is ufsid/4bd2077f23a6cc93.
GEOM_LABEL: Label ufsid/4aeb038766c4c807 removed.
GEOM_LABEL: Label for provider da0s1f is ufsid/4aeb038766c4c807.
GEOM_LABEL: Label ufsid/4aeb038ae8ae24cf removed.
GEOM_LABEL: Label for provider da0s1d is ufsid/4aeb038ae8ae24cf.
GEOM_LABEL: Label ufsid/4aeb03874c64d9f1 removed.
GEOM_LABEL: Label ufsid/4aeb0387d999941a removed.
GEOM_LABEL: Label ufsid/4aeb038766c4c807 removed.
GEOM_LABEL: Label ufsid/4aeb038ae8ae24cf removed.
GEOM_LABEL: Label ufsid/4bd2077f23a6cc93 removed.
```

Was root unmounted? What's going on here? Obviously there's some issue with da0, which is mounted at /.


```
pid 93248 (httpd), uid 80: exited on signal 10
pid 95624 (httpd), uid 80: exited on signal 10
pid 97956 (httpd), uid 80: exited on signal 10
pid 97935 (httpd), uid 80: exited on signal 10
pid 96603 (httpd), uid 80: exited on signal 10
pid 93210 (httpd), uid 80: exited on signal 10
pid 98246 (httpd), uid 80: exited on signal 10
```

This is the problem I reported in my other post. Apache just up and dies on me. Everything I've read says it's an issue with Apache trying to access RAM that it shouldn't or that doesn't exist. Is there something in the da0 or da3 errors above that would cause a SIGBUS in httpd?
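To see how many processes died and from which signal across a long dmesg file, the "exited on signal" lines can be tallied with a short script. This is only a sketch: `tally_signal_exits` and the small `FREEBSD_SIGNALS` table are hypothetical names, and the table hard-codes FreeBSD's numbering (10 is SIGBUS there, whereas on Linux 10 is SIGUSR1), so it should not be reused for logs from other systems.

```python
import re
from collections import Counter

# Matches kernel lines such as
# "pid 93248 (httpd), uid 80: exited on signal 10"
EXIT_LINE = re.compile(
    r"^pid (?P<pid>\d+) \((?P<name>[^)]+)\), uid (?P<uid>\d+): "
    r"exited on signal (?P<sig>\d+)"
)

# Assumption: FreeBSD signal numbers. Only a few common ones are listed.
FREEBSD_SIGNALS = {6: "SIGABRT", 10: "SIGBUS", 11: "SIGSEGV"}


def tally_signal_exits(lines):
    """Count signal exits per (process name, signal name)."""
    counts = Counter()
    for line in lines:
        m = EXIT_LINE.match(line.strip())
        if m is None:
            continue  # not a signal-exit line
        sig = int(m.group("sig"))
        sig_name = FREEBSD_SIGNALS.get(sig, f"signal {sig}")
        counts[(m.group("name"), sig_name)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "pid 93248 (httpd), uid 80: exited on signal 10",
        "pid 95624 (httpd), uid 80: exited on signal 10",
        "(da3:mpt0:0:6:1): Retries Exhausted",
    ]
    for (name, sig_name), n in tally_signal_exits(sample).items():
        print(f"{name}: {n} x {sig_name}")
```

Run against the seven httpd lines above, it reports a single bucket: httpd with SIGBUS, seven times.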

After that, the log goes back and repeats that first block of da3 errors a bunch more times. The server was down for about 10 minutes and then it just fixed itself. It's weird, because it seems the Apache child processes all get killed off by the SIGBUS but the parent process doesn't. So once the problem works itself out, it continues operating as normal without me having to restart the daemon or anything.

The management blade in the server chassis is reporting that all the hardware is fine. We have a second blade that boots off of a second partition in Volume 1 and it doesn't have any problems at all.

I'm at a loss here!


----------



## shitson (Feb 5, 2012)

Are da0 and da1 both on the same controller?


----------



## ryanm (Feb 6, 2012)

shitson said:

> Are da0 and da1 both on the same controller?



Yes, they are.


----------

