# Kernel Panic at 3:00 am



## BachiloDmitry (Feb 22, 2012)

Greetings. I have a problem that has already been discussed here, except that the solution there was to replace the motherboard, RAM and processor, which I did, and it did not help.

I am getting this every night:


```
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x326d78
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff8081fb76
stack pointer           = 0x28:0xffffff80c52602e0
frame pointer           = 0x28:0xffffff80c5260310
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 11460 (find)
trap number             = 12
panic: page fault
cpuid = 0
```
It can happen any time from 3 to 4 am. The trap is not always 12 (it can be 9 and other numbers), but the current process is always 'find'.

I have a


```
CPU: AMD Athlon(tm) 64 Processor 3000+ (1809.31-MHz K8-class CPU)
  Origin = "AuthenticAMD"  Id = 0x20ff0  Family = f  Model = 2f  Stepping = 0
real memory  = 4294967296 (4096 MB)
avail memory = 2824007680 (2693 MB)
```
(it's 4 DDR-1 1GB memory modules)

This ASUS motherboard has 8 SATA ports, they are all used. 


```
ada0: <ST31000340NS SN04> ATA-6 SATA 1.x device
ada1: <ST3500320NS SN04> ATA-6 SATA 1.x device
ada2: <WDC WD800AAJS-00PSA0 05.06H05> ATA-7 SATA 2.x device
ada3: <WDC WD800AAJS-00PSA0 05.06H05> ATA-7 SATA 2.x device
ada4: <ST3500320NS SN05> ATA-8 SATA 1.x device
ada5: <ST3500320NS SN04> ATA-6 SATA 1.x device
ada6: <ST3500320NS SN04> ATA-6 SATA 1.x device
ada7: <ST3500630NS 3.AEK> ATA-7 SATA 1.x device
```


```
samba# gmirror status 
          Name    Status  Components
mirror/system0  COMPLETE  ada2 (ACTIVE)
                          ada3 (ACTIVE)
samba# zpool status
  pool: data
 state: ONLINE
 scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada6    ONLINE       0     0     0

errors: No known data errors
```

It all worked perfectly for years on the version-8 FreeBSDs (8.0 through 8.2) until I decided to upgrade. I installed a Socket 775 based ASUS motherboard and a Core 2 processor, plus 4 GB of DDR3 RAM (2 modules, 2 GB each). That's when it started: the machine worked normally during the day and started crashing at night. First I updated to 9.0-RELEASE, but the problem remained. Then I downgraded to what I had before (motherboard, CPU, memory), but the problem still remained. I don't get it anymore. Not believing that disks could cause it, I still checked SMART on all of them; all are fine. Furthermore, this machine receives a dump/dd over SSH from another machine, writing a file which is more than 80 GB by now. It takes 3 hours but works perfectly fine. I tried moving this file from my terabyte disk to the zpool, causing heavy writes on all the other HDDs, and still nothing: everything works fine under heavy load until night, when it comes to 'find'. I don't get it. What should I replace now?


----------



## DutchDaemon (Feb 22, 2012)

This may not be much help, but I remember an occasion where I had a lost+found directory somewhere after a severe crash, and whenever I tried doing something with it, even entering it, let alone moving or copying it, the system rebooted immediately. I don't remember if it caused a panic before the reboot, but it was one touchy directory. And find is typically one of those utilities that will touch every file and directory on the system, so you may have one of those. Maybe try running a manual find from the root of each filesystem with a print statement, and get a clue about exactly where and when it causes a panic? At least you'll know where to start looking.
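A rough sketch of that manual find (the mount points and log paths here are placeholders; substitute your own):

```shell
# Walk each filesystem separately with -x so find(1) does not cross
# mount points; logging every path means the tail of the newest log
# after a panic shows roughly where it died.
find -x / -print > /var/tmp/find-root.log
find -x /usr -print > /var/tmp/find-usr.log

# After the reboot, look at the last paths visited before the crash:
tail /var/tmp/find-root.log
```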


----------



## jalla (Feb 22, 2012)

periodic(8) starts its daily run at 3:01 every night.

Perhaps you can narrow down the problem by running the scripts in /etc/periodic/daily/ one by one.

If you don't find the problem there, also check your crontabs for other jobs that may run around the same time.
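A sketch of that one-by-one run (paths as in the base system):

```shell
# Run each daily script in order, echoing its name first, so the last
# name printed before a panic identifies the culprit.
for f in /etc/periodic/daily/*; do
    echo "=== $f"
    sh "$f"
done

# Jobs scheduled around 03:00 show up in the system and per-user crontabs:
grep -v '^#' /etc/crontab
crontab -l
```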


----------



## wblock@ (Feb 22, 2012)

There have been other threads with the same problem with ZFS, and some type of workaround... which I can't recall.


----------



## kpa (Feb 22, 2012)

Check the snapdir property on all ZFS datasets, it should be set to "hidden".
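For example, checking (and, if needed, resetting) it recursively on the pool named in this thread:

```shell
# List snapdir on every dataset in the pool; 'visible' exposes the
# .zfs/snapshot directory to tools like find(1).
zfs get -r snapdir data

# Set it back to hidden if anything shows 'visible':
zfs set snapdir=hidden data
```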


----------



## BachiloDmitry (Feb 22, 2012)

OK, thanks everyone for the answers. I checked everything you've suggested:

I have no lost+found folders,
my only ZFS dataset 'data' has the *snapdir* property set to 'hidden',
and all scripts in /etc/periodic/daily/ ran normally.

Also, today I got a kernel panic with the 'smbd' process as the "guilty" one. And I have dd'ed all my disks from /dev/adaX to /dev/null successfully, with no errors.

I'll keep searching.


----------



## BachiloDmitry (Feb 22, 2012)

No, wait! All scripts from /etc/periodic/daily/ exited cleanly on their own except for 450.status-security; it took too long, so I Ctrl-C'ed it after a couple of minutes.

But now that I gave it some time, I got a kernel panic. I'll try to reproduce it when the server comes back up.


----------



## jalla (Feb 22, 2012)

450.status-security doesn't do anything directly; it just invokes all the scripts in /etc/periodic/security, so you should try running those instead.


----------



## BachiloDmitry (Feb 22, 2012)

Well, at least I've reproduced it twice. I'll find out which security script causes it and post here.


----------



## BachiloDmitry (Feb 22, 2012)

It's the first one. 


```
sh 100.chksetuid

Checking setuid files and devices:
```

and a panic after about 10 minutes.


----------



## kpa (Feb 22, 2012)

That script only does a simple find(1) operation, nothing out of the ordinary. I would boot the machine in single-user mode, run fsck(8) on the UFS file systems, and follow that with a
`# zpool scrub` on the ZFS pool(s).
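Roughly like this (the mirror device name is a guess based on the gmirror output earlier in the thread; use your own labels):

```shell
# From single-user mode, with filesystems unmounted or mounted read-only:
fsck -y /dev/mirror/system0a   # repair the UFS system; device name is hypothetical
zpool scrub data               # start a full verify of the raidz2 pool
zpool status data              # watch scrub progress and any errors found
```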


----------



## idownes (Feb 23, 2012)

I have an issue where ZFS consumes *all* available memory as a result of that same periodic 100.chksetuid script running. It normally grinds the machine to a halt, but I think I've also had a panic.

Could you set up a log of the ZFS sysctls, like:

```
export ZFSLOG=/tmp/vfs_zfs.log; while true; do date >> $ZFSLOG; sysctl vfs.zfs >> $ZFSLOG; sleep 30; done
```

and see what vfs.zfs.arc_meta_used is before it crashes?


----------



## phoenix (Feb 23, 2012)

The find processes run by a couple of the periodic(8) scripts will run a ZFS system out of RAM. I've had to disable the following on all my large ZFS systems (in /etc/periodic.conf):

```
daily_status_security_chksetuid_enable="NO"
```

Search the forums for "zfs periodic panic" or even just "zfs periodic" for more info.


----------



## BachiloDmitry (Feb 23, 2012)

Yes, *zpool scrub* seems to be causing a kernel panic, so now it falls into one constantly. I did some ARC tuning a couple of years ago, so it should not consume more than 1.5 GB. I used to have panics related to that, but back then the kernel panic message said so clearly, and now it does not. Maybe there really is just something wrong with my pool? Should I try fixing it under Solaris, maybe? As I've already said, I had a panic yesterday which was caused by smbd, not find, and it happened during the day. I'll still try that log idea anyway.


----------



## BachiloDmitry (Feb 23, 2012)

OK, so it seems I have completely lost all 1.5 TB of my valuable data. As I've already said, after the *zpool scrub* command the system panicked immediately, and every reboot after that ended in a panic even before the filesystems were mounted. So I decided to check what Solaris 11 would say about it, and guess what: it panics in just the same way. Furthermore, I have a raidz2 (RAID6) pool with five disks in it, so in theory I can remove any two and should still be able to access my pool. But no: even with one disk missing, both FreeBSD and Solaris say there are not enough replicas for the pool to continue functioning, while showing that four of the five devices are online and that it is a raidz2 array. Is there anything else left to try?


----------



## kpa (Feb 24, 2012)

Ask on the freebsd-fs mailing list. Hopefully it's not too late for you, but RAID is never a substitute for proper backups. If you manage to recover your data, the first thing you should do is make backups of your valuable data.


----------



## kr651129 (Feb 29, 2012)

Shot in the dark here: have you tested your HDD to see if it's bad? On several occasions I've had a bad hard drive cause kernel panics at specific times or when certain programs are run.


----------



## PacketMan (Jul 24, 2014)

I'm not sure whether this is related or not, but my FreeBSD 10 OS is panic dumping at 3:00 am. It's happened once for sure, and maybe even twice. I will have to watch for this to be sure.

```
Jul 23 03:03:56 BSD001-NAS savecore: reboot after panic: page fault
Jul 23 03:03:56 BSD001-NAS savecore: writing core to /var/crash/vmcore.0
```
I'll have to log in locally as root to view the crash files.


----------



## junovitch@ (Jul 27, 2014)

The same advice should apply. The usual suspects would be periodic scripts that are heavy on disk I/O, if there are underlying disk problems. Start by running them one by one until you find the culprit: `cd /etc/periodic/security; env PERIODIC=security ./100.chksetuid` and so forth. There is a bug in 10.0, so you must specify the PERIODIC variable for all the security scripts.
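A sketch of looping over all of them with that workaround applied:

```shell
# Run each security script individually; on 10.0 the PERIODIC variable
# must be set by hand, per the bug mentioned above. The last name
# echoed before a panic is your culprit.
cd /etc/periodic/security
for f in [0-9]*; do
    echo "=== $f"
    env PERIODIC=security sh "./$f"
done
```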


----------



## PacketMan (Aug 13, 2014)

Okay, I will think that through.

But I seem to be missing something. I thought this issue was identified back in release 9, and thus I had assumed it would have been patched/fixed in 10. Is there some sort of patch file I am supposed to download for release 10, or is this issue still unresolved? Given that some of the busiest servers in the world (thus lots of disk usage) are FreeBSD based, I find it hard to believe that this issue would still be outstanding.

I am not running anything on this box other than BitTorrent Sync, and it's only used for syncing family pictures and some other files. Not a busy box at all; typically CPU usage is 10% or less.


----------



## junovitch@ (Aug 13, 2014)

The point is that on a box like yours that is not busy, the burst of disk I/O done during periodic triggers a panic because of some latent hardware issue. There is most likely nothing to fix in the OS. On the busiest servers in the world, as your example mentions, hardware failures likely present themselves much sooner.


----------



## PacketMan (Aug 15, 2014)

junovitch@ said:

> ..... because of some latent hardware issue. There is most likely nothing to fix in the OS.



So do you mean a hardware malfunction, or are you thinking of a hardware driver (software) issue? I'll have to see if I can find some sort of 'disk check' command. Waiting for Christmas for some books.

I intend on building a new machine, software-identical to the one I have now, but with completely different hardware.

Thanks again.


----------



## junovitch@ (Aug 21, 2014)

PacketMan said:
> So you mean a hardware malfunction issue, or are you thinking hardware driver (software) issue?  I'll have to see if I can find some sort of 'disk check' command.  Waiting for Christmas for some books.  -
> 
> I intend on building a new machine software identical to the one I have now, but the hardware will be all different.
> 
> Thanks again.



Search for a memtest86 live CD to see if it's a memory issue. Try running the periodic scripts individually to see if one triggers it consistently. If it is a script that is heavy on disk I/O, take a look at sysutils/smartmontools to query the drive and run a full check on it. Check for loose SATA cables or other cables in the actual machine. It could be any number of things, but those would probably be the usual culprits.
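For the drive check, something along these lines with smartmontools installed (the device name is an example; substitute your own):

```shell
smartctl -a /dev/ada0           # health summary, reallocated/pending sectors
smartctl -t long /dev/ada0      # start the drive's long offline self-test
smartctl -l selftest /dev/ada0  # read the result once the test completes
```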


----------



## PacketMan (Aug 25, 2014)

Righto. Thanks again.


----------



## ethoms (Aug 28, 2014)

Sorry, I haven't had time to read the whole thread, but I had a server reboot quite often at exactly 3:00 am. Basically it was a daily periodic script that checked mount points. On this particular server I had an sshfs remote filesystem mounted, and the security script, run at 3:00 am daily, would cause a panic (I guess).

Thanks to my /etc/periodic.conf (below), it no longer reboots at 3:00 am.


```
$ cat /etc/periodic.conf
# 200.chkmounts
security_status_chkmounts_enable="NO"

# 310.locate
weekly_locate_enable="NO"    # Update locate weekly
```


----------



## ethoms (Aug 28, 2014)

Another suggestion: have you tried monitoring your RAM usage? Do you set your ARC maximum (vfs.zfs.arc_max)? If not, you should. I have had major stability problems when arc_max is not set. The ARC grows too big and the kernel memory management doesn't know about it. The ARC will grab RAM as soon as it is available, which leaves other processes stuck until there is free RAM. Because the virtual memory subsystem isn't aware of the ARC, even swap will not help you.

So it could be that the find command (from periodic) is triggering the ARC to grow.

You must, must, must set vfs.zfs.arc_max in your /boot/loader.conf.

e.g.: on my 32GB machine (running 10 or so VirtualBox VMs):


```
$ cat /boot/loader.conf
## Filesystem Support
zfs_load="YES"
libiconv_load="YES"
libmchain_load="YES"
cd9660_iconv_load="YES"
msdosfs_iconv_load="YES"
ntfs_load="YES"
ntfs_iconv_load="YES"
udf_load="YES"
udf_iconv_load="YES"
fuse_load="YES"

## VirtualBox Support
vboxdrv_load="YES"

# Limit ARC to 12GB of RAM
vfs.zfs.arc_max="12G"

# sys-v style shared memory tuning
kern.ipc.shmall=524288
kern.ipc.shmseg=512
kern.ipc.shmmni=384
kern.ipc.semmni=256
kern.ipc.semmns=512
kern.ipc.semmnu=256
```


----------



## junovitch@ (Aug 30, 2014)

ethoms said:
> Sorry, I haven't had time to read the whole thread. But I had a server reboot quite often at exactly 3:00am. Basically it was a daily periodic script that checked mount points. On this particular server I had an sshfs remote filesystem mounted. The security script, ran at 3:00am daily would cause a panic (I guess).



Sounds like you found a bug in either the FUSE implementation or the SSHFS portion of it. If it's reproducible, you should probably look around on Bugzilla for anything like it, and open a ticket with the details if you can't find anything. I remember hearing that the FUSE implementation had some issues, so that may help in nailing them all down.


----------



## ethoms (Oct 21, 2014)

Well, it's hard to reproduce now. I've reinstalled the server with a much newer version of FreeBSD. I think I disabled that option in /etc/periodic.conf just in case, since I don't need the security check anyway. In any case, the FUSE code has since been replaced by a better implementation, and I have had a much better experience with the new FUSE support.


----------



## PacketMan (Mar 7, 2015)

Earlier in this discussion I posted about this. That was an older laptop I was tinkering with, and it turns out it was the hard drive that finally failed. Thanks for your help with that.

So now one of my new-to-me servers is doing this. And I can reproduce it manually at will by running the `find` command. I suspect that `find` alone is not the culprit, and that it may have to do with another process running at the same time. The interesting thing is that I have two freshly built servers that are very similar, and one is fine so far. Both run FreeBSD 10.1-RELEASE on Dell Dimension boxes. The working one has a 40 GB OS drive, a 2 TB content/NAS drive, and 4 GB of RAM. The failing one has a 140 GB OS drive, a 2 TB content/NAS drive, and 1 GB of RAM. Both have UFS file systems.

There are no crash dumps or reboots when the fault occurs; I just get an OS freeze. The server is not pingable, and I cannot get a console on it. Tapping the power button does not cause a shutdown, so I have no choice but to power cycle. I do wonder if it's because it only has 1 GB of RAM, but it does have 4 GB of swap, and during the freeze only 24M was in use. I cannot find any logs of any sort, but maybe I am not looking in all the right places.

If anyone has any thoughts, please post, otherwise stay tuned. I'll see what I can figure out.


----------



## BachiloDmitry (Mar 7, 2015)

Well, in this thread I had almost all five HDDs faulty under ZFS; that's why find panicked every time periodic(8) started it.


----------



## PacketMan (Mar 9, 2015)

So, running `find` many, many times by itself causes no issues. Running it alongside another program that uses significant CPU resources (and, I assume, disk resources) causes the OS to freeze. The disk being used, though, is the new one, not an old one. I'm going to see if I can figure out what exactly is happening and let you know. I'm now thinking the other program is causing the freeze.


----------



## PacketMan (Mar 21, 2015)

This hasn't occurred since. I believe it was caused by a program that was indexing a lot of stuff.


----------

