# System panic



## Tracker (Dec 9, 2022)

My system was working fine.
All of a sudden Chrome started crashing a couple of days ago.

Now the system won't even boot properly and gets stuck here:

`Fatal trap 9: general protection fault while in kernel mode`

What do I do? Feeling lost


----------



## Alain De Vos (Dec 9, 2022)

Can you boot from usb/dvd then mount/chroot into your system ?


----------



## Tracker (Dec 9, 2022)

Alain De Vos said:


> Can you boot from usb/dvd then mount/chroot into your system ?


Can boot into single user mode on the same machine/hard drive.

Tried to boot into a different boot environment using beadm and bectl but multiuser mode keeps crashing.


----------



## Alain De Vos (Dec 9, 2022)

If you can boot into single user mode, it means the kernel can be loaded and you are able to get to a shell and fix the problems.
Things you can try:
- Mount the root filesystem
- Use chroot
- Perform freebsd-update
- Do a pkg update & upgrade
- Reinstall the bootcode with "gpart bootcode"
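
Spelled out, those steps from single-user mode might look roughly like this (a sketch assuming a default ZFS install with a pool named zroot; adjust dataset and device names to whatever `zfs list` and `gpart show` report):

```shell
# Make the root dataset writable, then mount all ZFS filesystems
zfs set readonly=off zroot/ROOT/default
zfs mount -a

# Update the base system and packages
freebsd-update fetch install
pkg update && pkg upgrade

# Only if the bootcode itself is suspect (example: GPT/BIOS boot on ada0)
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
```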


----------



## ralphbsz (Dec 9, 2022)

Before doing any complicated things that change the state (and may sweep the problem under the rug), diagnose what the problem is. To begin with: when the system panics, what does the console show? What is the call stack? That tells you what part of the kernel is unhappy, which is already valuable. Perhaps even more important: What process caused the panic? I think the normal panic output on the console shows the current process ID and name.

Next question: What action causes it? If you can boot into single user mode, stay there for a while, and try several things. For example: Is your storage OK? Are there disk errors? What unusual stuff do you see in dmesg or /var/log/messages? Are your file systems in good health? Are all the expected peripherals present?

Think back: What was the most recent configuration change you made before it started crashing?

Having explored single user mode, try bringing the system to multi-user mode but do NOT start X windows or the GUI. Does it still work? Is normal user login possible? You said above that multi-user mode crashes, but I don't know whether you really mean GUI login there.
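
For concreteness, a few of those single-user-mode checks could be run like this (ada0 and the log paths are just the usual defaults, not known from this thread):

```shell
dmesg | tail -n 50              # recent kernel messages: disk errors, device resets
tail -n 100 /var/log/messages   # anything unusual logged before the crashes
zpool status -x                 # prints "all pools are healthy" when ZFS is fine
smartctl -a /dev/ada0           # disk SMART health (needs sysutils/smartmontools)
```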


----------



## Tracker (Dec 9, 2022)

Alain De Vos said:


> When you can boot into single user mode. It means the kernel can get loaded, and you are able to go to a shell and fix the problems.
> Things you can try:
> -Mount root filesystem
> -Use chroot
> ...


Thanks. So I was able to mount using
`zfs mount -a`

Now able to see user home directory. Trying to figure out how to chroot


----------



## Tracker (Dec 9, 2022)

ralphbsz said:


> Before doing any complicated things that change the state (and may sweep the problem under the rug), diagnose what the problem is. To begin with: when the system panics, what does the console show? What is the call stack? That tells you what part of the kernel is unhappy, which is already valuable. Perhaps even more important: What process caused the panic? I think the normal panic output on the console shows the current process ID and name.
> 
> Next question: What action causes it? If you can boot into single user mode, stay there for a while, and try several things. For example: Is your storage OK? Are there disk errors? What unusual stuff do you see in dmesg or /var/log/messages? Are your file systems in good health? Are all the expected peripherals present?
> 
> ...


Thanks. Good suggestions. Couldn't find much under /var/log/messages.

And /var/crash only shows a file named minfree - so still not sure how to check what's causing this.

dmesg also doesn't show anything - although I haven't chrooted yet, trying to figure that out.


----------



## Alain De Vos (Dec 9, 2022)

Something like,

```
chroot / /usr/local/bin/zsh
```


----------



## Tracker (Dec 9, 2022)

Alain De Vos said:


> Something like,
> 
> ```
> chroot / /usr/local/bin/zsh
> ```


This worked, but somehow when I try setting the zfs dataset to writable using this it doesn't work:

`zfs set readonly=off zroot/ROOT/default`

It complains saying "cannot open 'zroot/ROOT/default': dataset does not exist".

Update: Got this step to work


----------



## Alain De Vos (Dec 9, 2022)

I'm not a specialist. You could try a legacy mount. I.e.,

```
mount -t zfs zroot/ROOT/default /
```


----------



## _martin (Dec 9, 2022)

Are you able to take a picture of that fault and share it?



Tracker said:


> Can boot into single user mode on the same machine/hard drive.


Meaning the same system/BE that would normally crash when going into multiuser? If we see where the GPF is happening we may be able to narrow it down. Any special driver being loaded in loader.conf or by rc.conf (graphics driver)?

If you had Chrome crashing prior to this boot issue it could be a HW problem. Can you do a memtest+ RAM test booting from USB?


----------



## Alain De Vos (Dec 9, 2022)

/boot/loader.conf
&
/etc/rc.conf
Could be a cause.


----------



## Tracker (Dec 9, 2022)

Alain De Vos said:


> /boot/loader.conf
> &
> /etc/rc.conf
> Could be a cause.


Don't think I've changed them recently


----------



## Tracker (Dec 9, 2022)

_martin said:


> Are you able to take picture of that fault and share?
> 
> 
> Meaning the same system/BE that would normally crash if going into multiuser ? If we see where the GPF is happening we may be able to narrow it down. Any special driver being loaded in loader.conf or by rc.conf (graphics driver) ?
> ...


Yes, same system crashing in multiuser. Tried changing the BE but multiuser is not working.

Unlikely to be a hardware issue - nothing changed really.

The Chrome crash was a few days ago, consistently crashing upon start. And now multiuser won't boot. Strange. Not sure how to diagnose.


----------



## _martin (Dec 9, 2022)

If you show us the picture of that crash we can see at least what code was executing when the GPF occurred. That's a first step. As you have access to the system in single mode you can still configure crashdumps to save this and possibly share it either here or open a PR (the latter being the proper way of reporting issues). `dumpon -l` shows you whether you have any dump devices configured. Depending on your disk layout (`gpart show`) you need to select the swap partition with `dumpon` and add at least `dumpdev="AUTO"` in rc.conf.
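
As a sketch, that crash-dump setup could look like this (ada0p3 is a hypothetical swap partition; check `gpart show` for the real one):

```shell
# Point kernel crash dumps at the swap partition
dumpon /dev/ada0p3
dumpon -l                 # confirm the dump device is active

# Persist across reboots; savecore(8) copies the dump to /var/crash at boot
sysrc dumpdev="AUTO"
```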

Of course Chrome could be crashing due to some SW bug. Without further information your guess is as good as mine. But at least for now it does show a pattern.


----------



## Tracker (Dec 9, 2022)

Alain De Vos said:


> When you can boot into single user mode. It means the kernel can get loaded, and you are able to go to a shell and fix the problems.
> Things you can try:
> -Mount root filesystem
> -Use chroot
> ...


So I tried mounting zfs latest pool
Did chroot
Did freebsd-update fetch and install
pkg update and install

Just haven't tried the "gpart boot code" command


----------



## elgrande (Dec 9, 2022)

Tracker said:


> Don't think I've changed them recently


A module loaded in loader.conf might have changed during a package upgrade or similar.


----------



## _martin (Dec 9, 2022)

If I can share my 2c: please don't take a kitchen-sink approach to this problem, and for sure don't start reinstalling stuff. A GPF trap in the kernel is for sure not due to bad bootcode. You are already in single mode; you're long past the bootcode.
If you share that picture, we can see what's happening. From there we can navigate and give you better suggestions.


----------



## Tracker (Dec 9, 2022)

elgrande said:


> A module loaded in loader.conf might have changed during a package upgrade or similar.


Yea this could be likely - but how do I check?


----------



## elgrande (Dec 9, 2022)

Tracker said:


> Yea this could be likely - but how do I check?


Commenting out everything not 100% required in loader.conf would be worth a try.


----------



## Tracker (Dec 9, 2022)

_martin said:


> If I can share my 2c, please don't do a kitchen sink approach to this problem and for sure don't start reinstalling stuff. GPF trap in kernel is for sure not due to bad bootcode. You are already in single mode, you're long gone after bootcode.
> If you share that picture, we can see what's happening. From there we can navigate and give you better suggestions.


My mobile camera pictures are apparently too large to upload here :/

But the screen it stops on mentions:

```
Current process: 49022 (rm)
Trap no = 9
Panic: general protection fault
......
....
Fatal trap 9: general protection fault while in kernel mode
......
.....
Warning !drm_modeset_is_locked)...
```


----------



## _martin (Dec 9, 2022)

Well, not ideal, but you did already share an important part: "drm_modeset_is_locked". This is graphics-driver related. As I mentioned in my first reply here - locate the video-driver specific lines in /boot/loader.conf and /etc/rc.conf, comment them out and try booting again.
Also, most likely some part of the system did get updated then.


----------



## Tracker (Dec 10, 2022)

_martin said:


> Well, not ideal, but you did share already important part: "drm_modeset_is_locked". This is graphics driver related stuff. As I mentioned in my first reply here - locate video-driver specific lines in /boot/loader.conf and /etc/rc.conf , comment them out and try booting again.
> Also most likely some part of the system did get updated then.


So I was able to see the dmesg logs. It seems like it's the graphics drivers. Starts with:

```
!drm_modeset_is_locked(.... failed at /wrkdirs/usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_8/drivers/gpu/drm/drm_atomic_helper.c:669
kernel trap 12 with interrupts disabled

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
Fault virtual address = .....
```


----------



## Alain De Vos (Dec 10, 2022)

What is the graphics card you use and which driver ?
Do you load the driver from loader.conf or rc.conf.


----------



## Tracker (Dec 10, 2022)

Alain De Vos said:


> What is the graphics card you use and which driver ?
> Do you load the driver from loader.conf or rc.conf.


I vaguely remember installing drm-510-kmod from ports back in the day. I tried `make deinstall delete` by going into ports... Seemed to delete it.

Rebooted but issue persists. Maybe I should reinstall it via pkg?

My loader.conf has had these (commented out + uncommented) for a long time and it worked fine (for a couple of years now, I believe):

```
....
#i915kms_load="YES"

fuse_load="YES"

#kern.vty=vt
.....
```
My rc.conf has these:

```
.....
kld_list="/boot/modules/i915kms.ko"
#gnome
dbus_enable="YES"
hald_enable="YES"
slim_enable="YES"
.....
```


----------



## elgrande (Dec 10, 2022)

I'd comment out fuse_load and kld_list and see if it helps.
Since there is no more hal, you can remove hald_enable anyhow.

Edit: It most probably is the graphics driver, so the kld_list line should do it.


----------



## Tracker (Dec 10, 2022)

elgrande said:


> I'd comment out fuse_load and kld_list and see if it helps.
> Since there is no more hal, you can remove hald_enable anyhow.
> 
> Edit: It most probably is the graphics driver, so the kld_list line should do it.



Thanks. Tried commenting out all 3. Still crashing. Maybe I should try reinstalling the drm-kmod module with these still commented out.

Update: uncommented/commented fuse_load to no effect


----------



## larshenrikoern (Dec 10, 2022)

Hi

I would suggest replacing kld_list="/boot/modules/i915kms.ko" with kld_list="i915kms"

I think this is the common way of loading kernel modules.
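
If it helps, that change can be made with sysrc(8) instead of hand-editing rc.conf (module name only, no path or .ko extension; kldload finds it in the module path):

```shell
sysrc kld_list="i915kms"
```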


----------



## Tracker (Dec 10, 2022)

Tried installing drm-kmod via pkg - didn't work. Restarting leads to crash.

Tried only kld_list="i915kms" - didn't work. Restarting leads to crash.

What next?

PS: FWIW this is quite an old machine; it's been working fine with FreeBSD for a couple of years. I would suspect the Chrome/machine crashing might be due to some recent update. But can't say for sure and unable to figure out how :(


----------



## elgrande (Dec 10, 2022)

Just to be sure:

```
i915kms_load="YES"
```
 is still commented out in loader.conf?


----------



## larshenrikoern (Dec 10, 2022)

Hi again

What version of FreeBSD are you using ?? And what graphics (or in this case cpu) are you using ??


----------



## Tracker (Dec 10, 2022)

elgrande said:


> Just to be sure:
> 
> ```
> i915kms_load="YES"
> ...


It's always been commented out.

In the dmesg | less output after the crash I'm still seeing something to do with the !drm_modeset_is_locked.... error I was referring to earlier.




larshenrikoern said:


> Hi again
> 
> What version of FreeBSD are you using ?? And what graphics (or in this case cpu) are you using ??


I'm a little unsure how to answer this. `uname -a` gives 13.1-RELEASE-p3 after I chroot using single user mode and do:
`zfs set readonly=off latest-zfs-snapshot-i-see-with-zfs-list`

I somewhat remember it being upgraded to the p4 release. I've tried upgrading from single user mode as well but it still seems to show 13.1-RELEASE-p3..... Will try again.

Regarding graphics, it's an Intel CPU with no discrete graphics card, I think. It's been working flawlessly for a couple of years except for minor hiccups.

Also, the BE is set to the latest snapshot that was taken when trying to perform the upgrade in single user mode. Maybe I should set that back? But won't that affect new data not being shown/recovered?


----------



## Tracker (Dec 10, 2022)

Ok, weird. So after installing drm-510-kmod through pkg I thought I should try to compile it through ports (/usr/ports/graphics/drm-510-kmod).

I tried to do `make install clean` and the system rebooted again! Instead of throwing any errors or warnings.

I would highly suspect this pkg/port is the culprit. Am I correct in thinking so? I'm still a bit lost about what to do about it.


----------



## larshenrikoern (Dec 10, 2022)

Tracker said:


> It's always been commented out.
> 
> In the dmesg | less output after crash im still seeing something to do with : !drm_modeset_is_locked.... Error I can was referring to earlier.
> 
> ...


I would make sure you are using a supported FreeBSD version. If not, that could be your problem. But you are on the latest release.

As far as I know, if your old snapshot is booting, your data is still there and you should be able to find it. But as I am on UFS I won't claim to be a ZFS guru.


----------



## Tracker (Dec 10, 2022)

larshenrikoern said:


> I would be sure you was using a supported FreeBSD version. If not that could be your problem. But you are on latest release
> 
> As far I know if your old snapshot is booting your data is still there and should be able to find. But as I am on UFS I will not acclaim myself ZFS guru.


This is definitely nothing exotic. Always upgraded through the standard freebsd-update fetch/install.

Yes, it seems like it's booting into the p3 release snapshot, even though I see a couple more snapshots below it after trying to update/upgrade. I tried setting the BE to the latest snapshot versions. Maybe I should set the BE to the older versions? (just not sure if the latest data will be lost or not)


----------



## Tracker (Dec 10, 2022)

Changing the BE to a month-old BE also DOESN'T work and it still crashes.

Does anyone else think it's the drm-510-kmod issue only?


----------



## _martin (Dec 10, 2022)

Tracker said:


> I tried to do make install clean and the system rebooted again! Instead of throwing any errors or warning.


Rebooted while it was compiling? If so, again, this could be faulty RAM. Creating a memtest USB takes a few minutes and you can verify that the memory modules are ok.

Can you share what is the virtual address you had the page fault on ?


----------



## Tracker (Dec 10, 2022)

_martin said:


> Rebooted when it was compiling? If so again, this could be a faulty RAM. Creating memtest usb takes few minutes and you can verify that modules are ok.
> 
> Can you share what is the virtual address you had the page fault on ?


Yes, it rebooted while compiling. I tried to compile some other game just to see if it was drm-510-kmod specific and same thing. Rebooted again! Seems like it could be RAM (which might also make sense because my Chrome instance was a memory hog). Thanks.

I'll try to use a Ubuntu stick I have for memtest, not sure how to do it with FreeBSD.

Regarding the virtual address, this was the output:

```
iung: iun_read_firmuare: ucode rev=0x89dd8401 ulang: link state changed to UP

WARNING !drm_modeset_is_locked(&crtc->mutex) failed at /wrkdirs/usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_8/drivers/gpu/drm/drm_atomic_helper.c:669
WARNING !drm_modeset_is_locked(&crtc->mutex) failed at /wrkdirs/usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_8/drivers/gpu/drm/drm_atomic_helper.c:669

kernel trap 12 with interrupts disabled

WARNING !drm_modeset_is_locked(&dev->mode_config.connection_mutex) failed at /wrkdirs/usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_8/drivers/...

Fatal trap 12: page fault while in kernel mode
cpuid 8; apic id = 89
fault virtual address = 0x1d8858fdb85
```


----------



## _martin (Dec 10, 2022)

The best way is to boot memtest itself, OS independent. You can download it (the free version is ok) from the Memtest86 website.
That virtual address is bogus (expected with a kernel panic); not sure if you made some mistakes during the manual "copy-paste". But even 0x1d8858fdb85 would be a bogus one.
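
If you end up writing the memtest image from FreeBSD, it would be something like this (the image filename and da0 are assumptions - verify the device first, since dd to the wrong disk destroys data):

```shell
# Write the raw memtest image to the USB stick (da0 is an example device)
dd if=memtest86-usb.img of=/dev/da0 bs=1m conv=sync status=progress
```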


----------



## Alain De Vos (Dec 10, 2022)

Try to comment out the line

```
kld_list="/boot/modules/i915kms.ko"
```
Then

```
pkg update -f
pkg install -f gpu-firmware-kmod
```


----------



## Tracker (Dec 10, 2022)

_martin said:


> The best way is to boot the memtest itself, OS independent. You can download it (free version is ok) from here.
> That virtual address is bogus (expected with kernel panic), not sure if you made some mistakes during manual "copy-paste". But even 0x1d8858fdb85 would be a bogus one.


I am running memtest86 v4.10 currently.

After 1 pass it hasn't shown any errors so far. Will leave it on for some time to complete and report back.

Yes, the address is what you wrote - I made a mistake with the copy/paste.




Alain De Vos said:


> Try to comment out the line
> 
> ```
> kld_list="/boot/modules/i915kms.ko"
> ...


Will try this after the memtest86 cycle is complete. Will report back.

Hope to make some progress


----------



## _martin (Dec 10, 2022)

Ok, let it run through all the tests. I'm assuming you mean "1 pass of the test, not the entire run". It does show a fancy all-over-the-screen "PASS" once it's done.

The virtual address is totally bogus; it doesn't show signs of small buffer overruns, etc. As you had a GPF and now a page fault we can assume it was jumping "all over". It would help to see if the issue is always in the same function, i.e. the stack trace. Maybe you should find out how to decrease the size of the picture and post it that way.

It was mentioned here above: you should comment practically everything out of rc.conf to see if it boots up.

But... you said you are having trouble compiling stuff. That usually is what I mentioned above - a HW issue. That sudden reboot is a sign of a triple fault.


----------



## elgrande (Dec 10, 2022)

Can you boot a fresh 13.1-RELEASE stick into multi user mode?
That would be an enlightening test.


----------



## Tracker (Dec 10, 2022)

_martin said:


> Ok, let it run through all the tests. I'm assuming you mean "1 pass of the test pass, not the entire testing". It does show the fancy all over the screen "PASS" text once it's done.
> 
> The virtual address is totally bogus, it doesn't show signs of small buffer overruns, etc. As you had GPF and now page fault we can assume it was jumping "all over". It would help to see if the issue is always on the same function, i.e. stack trace. Maybe you should find out how to decrease the sizeof the picture and post it that way.
> 
> ...


It's been running for a couple of hours now, 3 passes done - memtest86.... No errors so far. Should I let it continue running? See image below.

Regarding rc.conf (after this memtest86 is done) - what about trying to put in the default rc.conf instead, which I read online is at /boot/defaults/loader.conf?

For the stacktrace please see the last image.


----------



## Tracker (Dec 10, 2022)

elgrande said:


> Can you boot a fresh 13.1-RELEASE stick into multi user mode?
> That would be an enlightening test.


I have a 12.x stick available - would that be fine? Right now just running memtest86, it's been 3-4 hrs now. Not sure how much longer I should let it run.


----------



## elgrande (Dec 10, 2022)

Tracker said:


> I have a 12.x stick available - would that be fine? Right now just running memtest86 since 3-4 hrs now. Not sure how much longer I should let it run.


I guess 12.x is a good first test.
If it boots fine this indicates it might be a borked installation and not a hardware error.
If it boots fine I'd create a 13.1 stick to double check.
Memtest afaik can run reaaaaally long, like 24h long; imho you can do the boot test first, but up to you.


----------



## _martin (Dec 10, 2022)

Tracker said:


> I have a 12.x stick available - would that be fine? Right now just running memtest86 since 3-4 hrs now


Memory modules are OK; memtest does pass, as shown in the picture.

Observe the crashes and check if it always fails at the same place, `intel_bw_cal_min_cdclk+0x89`. You could open a PR (bug report) for this too.
The suggestion to test with other releases is not a bad idea, but if this situation started happening a few days ago then something else is still up.



Tracker said:


> Regarding rc.conf (after this memtest86 is done) - what about trying to put in the default rc.conf instead


On FreeBSD 13/ZFS I think you'd be ok even with empty rc.conf (zfs_enable="YES" is not needed for boot to succeed I think). This way you can test the boot without the intel driver.
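
One low-risk way to run that test is to stash the current rc.conf and boot with an empty one (the backup filename is just a convention):

```shell
cp /etc/rc.conf /etc/rc.conf.backup   # keep the original safe
: > /etc/rc.conf                      # truncate rc.conf to an empty file
# restore afterwards with: cp /etc/rc.conf.backup /etc/rc.conf
```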


----------



## Tracker (Dec 10, 2022)

4:27 hrs and this memtest86 is still running :/
Pass 3.... 85% complete, says the line at the top.... Still no errors though.... Was hoping to find something.


----------



## Tracker (Dec 10, 2022)

elgrande said:


> I guess 12.x is a good first test.
> If it boots fine this indicates it might be a borked installation and not a hardware error.
> If it boots fine I‘d create a 13.1 stick to double check.


I'll try with the current one, 12.x iirc. Last time I tried, it took me to the installation screen and then I turned it off; not sure if there was a multiuser option there - will check again.


_martin said:


> On FreeBSD 13/ZFS I think you'd be ok even with empty rc.conf (zfs_enable="YES" is not needed for boot to succeed I think). This way you can test the boot without the intel driver


Ok. I'll try backing up the current rc.conf as rc.conf.backup and having an empty rc.conf (or one with only the zfs enable option).

Will report back. This bug report process might be too lengthy, not sure I will be able to do it anytime soon - have important work/data on this machine I need to get running. Seems like I might have to buy another machine :/


----------



## elgrande (Dec 10, 2022)

Before buying a new machine I would at least try a reinstall of 13.1 - you never know.


----------



## Tracker (Dec 10, 2022)

Ok, so memtest86 results after running for 5 hrs+..... Think that's enough.

Was hoping to find some errors with the RAM. The theory of a memory-hog Chrome instance starting to crash + ports compiling causing reboots made some sense. But memtest86 doesn't seem to show any issues with memory. Check the image below.

Is this definite enough to conclude the RAM is fine?

Will try some other steps and report back


----------



## Tracker (Dec 10, 2022)

Alain De Vos said:


> Try to comment out the line
> 
> ```
> kld_list="/boot/modules/i915kms.ko"
> ...


Update as I promised on this - was able to install it and comment the line out. Same issue. Panic and not booting.

Will try to have an empty/zfs enable rc.conf next


----------



## chrbr (Dec 10, 2022)

Dear Tracker,


elgrande said:


> Before buying a new machine I would at least try a reinstall of 13.1 - you never know.


I would follow FreeBSD-12 instead. FreeBSD-12.4 will be supported for quite some time. By then the smoke on ZFS might have settled. There is nothing wrong with tracking a mature release - if there is no other show stopper. At least that's what I do.
Kind regards,
Christoph


----------



## Tracker (Dec 10, 2022)

Tracker said:


> Will try to have an empty/zfs enable rc.conf next


Ok, this is interesting! A blank rc.conf allowed me to login as a normal user! But without zfs enabled I don't think I'll be able to mount the filesystem. Will try zfs enabled again and report. Need to go back into single user mode again, FML.

But seems like we are getting closer to finding the cause?


chrbr said:


> I would follow the FreeBSD-12 instead. FreeBSD-12.4 will be supported for quite some time. Then the smoke on ZFS might have settled. There is nothing wrong tracking a mature release - if there is no other show stopper. At least I will do so.


Yes this makes sense, although with the above blank rc.conf I was able to login. So maybe will keep this option as backup if I hit a dead end later.


----------



## Tracker (Dec 10, 2022)

Ok, very weird. A blank rc.conf lets me login into multiuser mode.

While an rc.conf with only
`zfs_enable="YES"`
causes the panic and doesn't boot!

Could zfs be the culprit? Seems like it, but not sure how to go ahead.


----------



## Alain De Vos (Dec 10, 2022)

Just some random commands, check if nothing unusual.

```
zpool status -x
zpool list -v
zfs list
```


----------



## Jose (Dec 10, 2022)

Did you install a ZFS development version, perchance? Either sysutils/openzfs-kmod or sysutils/openzfs.

(Thanks to Erichans for teaching me about these.)


----------



## Tracker (Dec 10, 2022)

Switched rc.conf.backup back to rc.conf with all the values as earlier. Then went on to _comment out_ zfs_enable="YES" and now I can even see the GUI login screen as before! Woohoo.

But it doesn't allow me to login because zfs is the filesystem :/ (failed to execute login)

Now I need to figure out how to fix zfs, I guess? Any pointers?


----------



## Tracker (Dec 10, 2022)

Alain De Vos said:


> Just some random commands, check if nothing unusual.
> 
> ```
> zpool status -x
> ...


I was actually using the `zfs list` output all this while to set the readonly=off property to be able to edit files in single user mode.

This command actually gives some errors that are related to Chrome!!! Had a sneaky feeling something had to do with Chromium.

See the attached image below of
`zpool status -v`
It asks to restore the file in question if possible, or to restore the entire pool from backup. What should I do?



Jose said:


> Did you install a ZFS development version, perchance? Either sysutils/openzfs-kmod or sysutils/openzfs.
> 
> (Thanks to Erichans for teaching me about these.)


No, I had the standard zfs; it might have switched automatically if FreeBSD changed it between versions 12.x and 13.1.

I do however remember doing some operations with zfs and asking about it when it wasn't working earlier. Maybe I messed up something then (however it worked fine for a couple of months) that's come back to bite me now.


----------



## Alain De Vos (Dec 10, 2022)

```
zpool scrub -w zroot
```
May take a lot of time ...


----------



## Tracker (Dec 10, 2022)

Alain De Vos said:


> ```
> zpool scrub -w zroot
> ```
> Make take alot of time ...


Thanks. Started the scrub without the -w option but can see it progressing under `zpool status`.

Quick questions:
1) It shows Chromium files and locations in the image in my last reply. Would the scrub basically be deleting those files? Could I have done whatever scrub does with those files manually?

2) How does this kind of problem arise in the first place? Chrome(ium) interfering and corrupting the zfs pool?


----------



## Tracker (Dec 10, 2022)

Also, most important question: after the scrub, setting zfs_enable="YES" in rc.conf should get things back to normal? I hope.

UPDATE: Scrub finished, but it's still showing errors: "Permanent errors have been detected in the following files"..... like earlier.

Should I do zfs_enable again or do I need to do something else?


----------



## Alain De Vos (Dec 10, 2022)

I'm out of ideas...


----------



## Tracker (Dec 10, 2022)

So I tried enabling `zfs_enable="YES"` in rc.conf again after scrub.

Tried rebooting and same issue again! Won't boot in multiuser mode, panic!


----------



## _martin (Dec 10, 2022)

Memtest was successful; you let it run for more than enough time.

Looking at the stacktrace you provided, it's a bit interesting. It seems the picture doesn't show all the information (the screen had scrolled); there seems to be an issue before frame 5. Reading the output from the bottom up, after the warning you can already see the vpanic() function. Also interesting is that the process that triggered it is rm. That's weird.

I tried to think about it, but without a dump to give more information I don't know. When in single mode, are there any other messages prior to the crash? Any MCA errors, etc.?


----------



## elgrande (Dec 10, 2022)

Repairing a Corrupted File or Directory - Managing ZFS File Systems in Oracle® Solaris 11.2 (docs.oracle.com): "If a file or directory is corrupted, the system might still function, depending on the type of corruption. Any damage is effectively unrecoverable if no good copies..."

Seems multiple runs of scrub and clear may be required.
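
On this system the scrub-and-clear cycle from that page would be roughly (assuming the pool is named zroot, as shown earlier in the thread):

```shell
zpool clear zroot          # reset the pool's error counters
zpool scrub -w zroot       # re-scrub; -w waits until it completes
zpool status -v zroot      # check whether the "Permanent errors" list is gone
```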


----------



## _martin (Dec 10, 2022)

If you configure the dump and you're willing to share it, I can do the bureaucracy and open a PR. If you are up for it, we should trigger the crash as close to GENERIC as possible, without any additional modules. First:
a) can you confirm you're running the GENERIC kernel? i.e. you didn't compile it yourself.

If yes we can move to point 2, and that is triggering the crash without the compiled drm module, with zfs only. Can you comment out everything in rc.conf and leave only `zfs_enable="YES"`? Are you able to trigger the crash this way? If so, please share the stacktrace of the failure.

Third point: configure the dump device. Do you have a swap partition on your system? If you're not sure please show us the `gpart show` output so we can check. If yes we need to do what I mentioned above. Once you have a crash dump in /var/crash we are ready to open a PR.


----------



## cracauer@ (Dec 10, 2022)

Tracker said:


> Ok so memtest86 results after running [....]
> 
> Is this definite enough to conclude RAM is fine?



No, memtest doesn't stress the RAM enough. I use SuperPi to establish that the RAM has the right timings etc. (a Linux binary, but it runs in the Linuxulator).

And mprime/prime95 to establish that the CPU is OK.


----------



## Tracker (Dec 10, 2022)

elgrande said:


> __
> 
> 
> 
> ...


Hmm. So I rebooted the system again into single user mode and the Chromium files that were showing as permanent errors have disappeared automatically. Now the only permanent error is this:

`zroot/tmp:<0x3>`

Trying to run scrub again now. Not sure about clear - should I be running it?


----------



## Alain De Vos (Dec 10, 2022)

After mounting zroot/tmp, everything in /tmp can be safely deleted.

```
rm -fR /tmp/* /tmp/.??*
```
& run scrub again.
Then perform

```
zpool import zroot
zfs mount -a
```
and verify
/boot/loader.conf
/etc/rc.conf


----------



## Tracker (Dec 10, 2022)

cracauer@ said:


> No, memtest doesn't stress the RAM enough. I use SuperPi to establish that the RAM has the right timings etc (Linux binary, but runs in Linuxulator).
> 
> And mprime/prime95 to establish the CPU is OK.


This is interesting. Will try to keep it in mind. Hopefully I solve this issue for now by scrubbing, but yes, the crashing upon compiling earlier does still make me wonder if it could be a RAM issue.


_martin said:


> If you configure the dump and you're willing to share it I can do the bureaucracy, I'll open a PR. If you are up to it we should trigger the crash as close to GENERIC without any additional modules as possible. First:
> a) can you confirm you're running GENERIC kernel? i.e. you didn't compile it yourself.
> 
> If yes, we can move to point 2: triggering the crash without the compiled drm module, with ZFS only. Can you comment out everything in rc.conf and leave only `zfs_enable="YES"`? Are you able to trigger the crash this way? If so, please share the stack trace of the failure.
> ...


Thanks for this offer. I'll see if the issue doesn't get resolved then maybe will go down this route. For now my primary objective is just to get it running.

Yes, I think I'm using GENERIC, although the boot menu shows 2 options - probably something I tried to compile long back as a learning exercise.

zfs_enable still causes issues and a panic when it's the only line enabled in rc.conf, as you suggested.


----------



## Tracker (Dec 10, 2022)

Alain De Vos said:


> After mount of zroot/tmp
> Everything which is in /tmp can be safely deleted.
> 
> ```
> ...


Ok, this is very strange. I just mentioned that the tmp file was the only one showing. However, the Chromium files have reappeared under permanent errors - unsure if it happened after the scrub, but as far as I can tell they weren't there before, as I mentioned earlier. Strange, because Chromium obviously hasn't been run since the system is unable to run.

Should I just delete the Chromium files along with the tmp file as you suggested earlier? Although that Chromium instance might have my new data.


----------



## Alain De Vos (Dec 10, 2022)

It's a good idea to remove all "browser-related-data".


----------



## elgrande (Dec 10, 2022)

Tracker said:


> Ok this is very strange. I just mentioned that the tmp file was the only one showing. However it seems like the Chromium files have reappeared under permanent errors, unsure if it happened after the scrub but as far as I can tell it seemed to not be there before as I mentioned earlier. Strange because Chromium was obviously not run since the system is unable to run.
> 
> Should I just delete the Chromium files along with the tmp file as you suggested earlier? Although that Chromium instance might have my new data.


I am not a zfs pro, but from the link I sent you, I have the impression that the 'clear' is required to reset the error stats.
Anyhow if it is just a Chrome directory, why not delete the whole directory and scrub/clear once after this?
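Roughly, the delete/scrub/clear sequence would be a command sketch like this (pool name `zroot` comes from the thread; the Chromium path is only an example, use whatever paths `zpool status -v` actually lists):

```shell
# delete the files zpool status -v flags as permanently errored (example path)
rm -rf ~/.config/chromium

# re-scrub so ZFS revisits every block, then reset the error log
zpool scrub zroot
zpool status -v zroot   # wait until the scrub completes, review remaining errors
zpool clear zroot       # clear the error counters once the bad files are gone
```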


----------



## Tracker (Dec 10, 2022)

It's apparently oscillating between showing permanent errors for zroot/tmp:<0x3> plus the Chromium files, and just zroot/tmp:<0x3> again.

Regarding the suggestion to delete everything in tmp - when I do `ls` on /tmp it seems to show a couple of files with names matching recent BEs as well.

And that <0x3> in brackets just isn't there at all under the tmp directory.


----------



## cracauer@ (Dec 10, 2022)

Tracker said:


> This is interesting. Will try to keep in mind. Hopefully I solve this issue for now by scrubbing but yes crashing upon compiling earlier does make me still wonder if it could be a RAM issue.



Could also be the CPU mixing up a bit here and there. That's what mprime/prime95 tests.

Even if you clear the ZFS errors the question remains how it got corrupted in the first place.


----------



## Tracker (Dec 10, 2022)

elgrande said:


> I am not a zfs pro, but from the link I sent you, I have the impression that the 'clear' is required to reset the error stats.
> Anyhow if it is just a Chrome directory, why not delete the whole directory and scrub/clear once after this?


Deleted the files that were showing up as permanent errors under scrub - then did zpool clear zroot.

Rebooted. Same panic error, still unable to boot.

Will get some sleep now, try a few other suggestions, and report back.

Thanks everyone.


----------



## _martin (Dec 10, 2022)

In my opinion you should not be chasing stale post-scrub issues on files you most likely don't care about, as they are in tmp or are unimportant (Chrome related).

If you are able to panic the system without the graphics driver, that panic is what you need to be after. Show us what the panic is about.
Yes, passing memtest is not a 100% assurance that all is OK, but it's always a good place to start. Other tests that stress the system are always a good idea. But here you can reliably trigger a panic every time.
RAM is usually the best place to start. CPU is next. That's why I asked about the MCE errors in the syslog.


----------



## Tracker (Dec 11, 2022)

cracauer@ said:


> Could also be the CPU mixing up a bit here and there. That's what mprime/prime95 tests.
> 
> Even if you clear the ZFS errors the question remains how it got corrupted in the first place.


Ok, so I installed mprime and am trying a stress test with the default options now (2 cores instead of 4, I believe - this machine is pretty old so I guess that should be reasonable? It was already having temperature issues when overloaded). Just hoping the CPU is running reasonably fine.

How long does this usually run?

And what should I do to test RAM?


----------



## Tracker (Dec 11, 2022)

Alain De Vos said:


> It's a good idea to remove all "browser-related-data".





elgrande said:


> I am not a zfs pro, but from the link I sent you, I have the impression that the 'clear' is required to reset the error stats.
> Anyhow if it is just a Chrome directory, why not delete the whole directory and scrub/clear once after this?



Tried this, I think - by manually deleting the files scrub was complaining about under the Chromium config. It didn't seem to work and is still causing issues. Panic at boot.

So right now I need to fix the zpool errors with scrub - maybe run it a couple more times? Already ran it twice, I think.

And need to test hardware - running mprime for CPU currently.



_martin said:


> Third point: configure dump device. Do you have swap partition on your system? If you're not sure please show us the `gpart show` so we can check. If yes we need to do what I mentioned above. Once you have a crash in /var/crash we are ready to open PR.


How do I configure a dump device? I have swap on the system, yes.

FYI any output I share might have typos because I'm not able to do it correctly from my phone.


----------



## Tracker (Dec 11, 2022)

Update: Not sure how long this mprime CPU test is supposed to run. But it's been running for a couple of hours now and the (rolling, difficult to follow) output doesn't seem to indicate errors whenever I glance at it.

See image below. How long do I let this run?

I'm assuming a couple of hours should be good enough to catch glaring errors?

UPDATE 2: Closing the mprime testing now; I let it run for about 3 hours, I think. It wasn't indicating any failures whenever I looked at it. Attaching a second image below to show how it was going - please let me know if you spot any errors (mprime noob here). I reckon this should be sufficient to conclude that the CPU isn't the primary fault here?

Oh, also including a 3rd image which says 0 errors, 0 warnings after a 2:45 hr run.


----------



## Tracker (Dec 11, 2022)

I *think* this is the root problem now. When I do `zpool status -v` it's again showing me the permanent error to be in the following file:
`zroot/tmp:<0x3>` like I mentioned earlier.

I can't see any such file when I try `ls /tmp` - it outputs some other files (that seem temporary in nature). I think maybe if I could remove this file then things could possibly work?

The output of `zpool status -v` also points to this link: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A/ . Reading it makes me think this might be metadata-level corruption of the zpool? Just not sure how to fix this.

Anyone?


----------



## Alain De Vos (Dec 11, 2022)

That's not a nice message. It means you need to start from scratch ...


----------



## _martin (Dec 11, 2022)

Tracker said:


> How do I configure dump device? I have swap on the system, yes.


Please look at my comments above, I've already mentioned it twice. /etc/rc.conf: `dumpdev="AUTO"` is sufficient. Check with `dumpdev -l` if you have a dump device enabled. If not, execute `dumpon /dev/diskXpN` to use that swap.

If this is the only computer you have and you are posting from a cell phone, I get it, that's a pain. But you never showed the stack trace of the crash when you had only zfs_enable in rc.conf.

For the sake of clarity I'll reiterate: HW stress testing is a good way to figure out if your HW is OK. The more tests the better. In your case, testing RAM and checking the syslog for MCEs should be enough. You are reliably able to reproduce the issue.
Is it ZFS related? Could be. The only crash you showed is not complete (the screen scrolled), and it seems to have nested issues (a vpanic already happening when drm ran into another issue). Also, being caused by an rm process is a bit weird (maybe pointing back to the fs?).

In either case (ZFS or graphics driver) you'd need a PR, as those projects are really big and informing developers about the issue is the best way to go.


----------



## Tracker (Dec 11, 2022)

Alain De Vos said:


> That's not a nice message. It means you need to start from scratch ...


Holy cow. Do you mean _ALL_ my data is lost? Is there no way to recover? (Snapshots/BEs?)

PS: I am able to see the files in single user mode with `ls` on my home directory.


----------



## Tracker (Dec 11, 2022)

_martin said:


> Please look at my comments above, I've already mention it twice. /etc/rc.conf: `dumpdev="AUTO"` is sufficient. Check with `dumpdev -l` if you have dump device enabled. If not execute `dumpon /dev/diskXpN` to use that swap.
> 
> If this is the only computer you have and you are posting from cell phone I get it, that's pain to do. But you never showed the stack trace of the crash when you had only zfs_enable in rc.conf.
> 
> ...


Thanks for walking me through this and sorry for being unable to pay enough attention. I only have a mobile device now, so I'm not sure how I can accomplish this. I think I can say with some confidence, though, that the panic is reproduced whenever zfs_enable is present, uncommented, in rc.conf. Whenever it's commented out the system boots normally like before (but I'm unable to log in due to the fs missing; even if I manually mount the zfs datasets with readonly=off, the GUI login doesn't work and goes blank).

Regarding the stack trace - the pictures I posted earlier were apparently from the bottom of the panic. It's difficult to capture the scrolling output given I only have a phone. If there's something easy I can do, please let me know.

Thanks again.


----------



## _martin (Dec 11, 2022)

If you have important data on this pool, doing less is better. You can always boot the USB/CD and recover the data from there (i.e. activate the pool in recovery/live mode, mount the fs, copy the data to an external disk, etc.).
As it's not that hard to enable those dumps, please do that. We can only guess what's happening, but without being able to see the crash and/or logs it's pretty much guesswork.


----------



## cracauer@ (Dec 11, 2022)

Tracker said:


> Ok so I installed mprime and trying stress testing with default options now (2 cores instead of 4 i believe - this machine is pretty old so I guess that should be reasonable? Was already having temp issues when overloaded). Just hoping CPU is running reasonably fine.
> 
> How long does this usually run?
> 
> And what should I do to test RAM?



I run mprime for 24 hours, but 3 should be fine, too.

For RAM I use the Linux binary of SuperPi.


----------



## cracauer@ (Dec 11, 2022)

Tracker said:


> I *think* this is the root problem now. When I do `zpool status -v` it's again showing me the permanent errors to be in the following files
> `zroot/tmp:<0x3>` like I mentioned earlier.


Try this:

```
zdb -c poolname
```


----------



## Tracker (Dec 11, 2022)

cracauer@ said:


> Try this:
> 
> ```
> zdb -c poolname
> ```


Seems to show error counts:

```
Error No.  Count
97         1
```

Please check the image below for the full screen output.


----------



## Tracker (Dec 11, 2022)

_martin said:


> If you have important data on this pool doing less is better. You can always boot the USB/cd and recover the data from there (i.e. activate pool in recovery/live mode, mount fs, copy data to an external disk,etc.).
> As it's not that hard to enable those dumps please do that. We can only guess what's happening but without being able to see the crash and/or logs it's pretty much a guess work.


Ok, makes sense. I'm trying to figure out the basics of how to use zfs snapshots/BEs to recover data to another disk. Need to buy that too. Was hoping this would get fixed in software without needing additional hardware.

I just checked: 'dumpdev="AUTO"' was present all this while! However there's no dumpdev installed on the system. Assuming I have to install it and it works - how will I possibly share it here? On mobile.


cracauer@ said:


> I run mprime for 24 hours, but 3 should be fine, too.
> 
> For RAM I use the Linux binary of SuperPi.


Thanks. Somehow `pkg search SuperPi` doesn't return any results.


----------



## elgrande (Dec 11, 2022)

It cannot be emphasized enough.
If you have important data only on this device, NOW is the time to back up as much as you can before fiddling further with ZFS.
Since you can still read the data a backup should be possible.


----------



## _martin (Dec 11, 2022)

Tracker said:


> However there's no dumpdev installed on the system.


Oops, my typo. `dumpon -l` to list the dump device, `dumpon /dev/diskNpX` to use the swap as the dump device. The AUTO part of rc.conf should then automatically use that device.


----------



## Tracker (Dec 11, 2022)



elgrande said:


> It cannot be emphasized enough.
> If you have important data only on this device, NOW is the time to backup as much as you can before fiddling further with ZFS.
> Since you can still read the data a backup should be possible.


Good point. Just trying to figure out how to save the data - using snapshots? - onto another hard drive. Never really faced this situation.

A) Is it possible to use snapshots/BE to get the exact replica of current data?

B) If A is possible, would such a setup require an exact replica of the current machine's setup on the second hard drive? (It's GELI encrypted)


----------



## elgrande (Dec 11, 2022)

Tracker said:


> Good point. Just trying to figure out how to save the data using snapshots? onto another hard drive. Never really faced this situation.
> 
> ...


Since you can mount the zfs volume, you can just copy the data to another device I guess?


----------



## free-and-bsd (Dec 11, 2022)

Tracker said:


> I was actually using `zfs list` output, all this while, to set readonly=off variable to be able to edit files in single user mode.
> 
> This command actually gives some errors that are related to Chrome!!! Had a sneaky feeling something had to do with Chromium
> 
> ...


What kind of hard drive are you using? Not SSD perchance?

Well anyway, I have experienced these permanent errors in a ZFS pool. This has nothing to do with RAM, evidently. In my case (2 cases, actually) it was a failing hard drive. Yes, this happens to hard drives and nothing can be done about it.

You can very well copy whatever files you need from your old pool onto a new hard drive, either with rsync or by using zfs send with a snapshot of your entire pool. Read-only mode is no problem in that case, as "read" is all you will need. If you choose `zfs send`, then your pool errors will, of course, be copied over as well, but on a NEW hard drive they will be easily fixed, because their not being fixed is likely caused by the failing hard drive.
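A rough sketch of the `zfs send` route (pool name `zroot` comes from the thread; `newpool` and the snapshot name are examples):

```shell
# take a recursive snapshot of the entire old pool
zfs snapshot -r zroot@rescue

# replicate everything (datasets, properties, snapshots) into the pool
# created on the new disk
zfs send -R zroot@rescue | zfs receive -F newpool
```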


----------



## free-and-bsd (Dec 11, 2022)

Tracker said:


> Good point. Just trying to figure out how to save the data using snapshots? onto another hard drive. Never really faced this situation.
> 
> ...


Yes, that's the beauty of zfs & snapshots: it will be exactly all the data you have there, including your setup. Not sure about GELI or other kinds of encryption though, never tried that. However, you can mount your decrypted stuff read-only and then copy over all the data using rsync. Then you can use whatever encryption method you prefer on the new hard drive.

But if I were you, I would first get things settled with the data, then worry about encryption etc. Unless, of course, we're talking about TBs of data, which you never mentioned in your messages ))


----------



## free-and-bsd (Dec 11, 2022)

BTW, you can use that Ubuntu USB stick you mentioned earlier and run the HDD diagnostic tool (forgot its name, sorry) they have in every distro. At least you'll see if your disk shows errors...


----------



## Tracker (Dec 11, 2022)

free-and-bsd said:


> What kind of hard drive are you using? Not SSD perchance?


Yes, SSD, a Samsung 860. And I was also using swap on it a fair bit due to system load (8 GB RAM, which used to fall short, so I kept 8 GB swap - it used to run near capacity most of the time). Possibly that might have caused the drive to wear down faster? I mean, it's still not definitely clear that it's a hard drive issue, but I always used to think about the stress on the drive with swap being almost fully used.


free-and-bsd said:


> You can very well copy whatever files you need from your old pool onto a new hard drive by either rsync or by zfs send sending a snapshot of your entire pool. Read-only mode is no problem in that case as "read" is all you will need. If you choose `zfs send` command, then your pool errors will, of course, be copied over as well, but on a NEW hard drive they will be easily fixed. Because their not being fixed is likely caused by failing hard drive.


Interesting. So IF it's due to a failing hard drive, then the errors on the new one should be fixable? How would I go about fixing them? Using scrub?


free-and-bsd said:


> Not sure about GELI or other kind of encryption though, never tried that. However, you can mount your decrypted stuff read-only and then copy over all the data using rsync. Then you will use whatever encryption method you prefer on the new hard drive.


Yes, so I guess I'll have to install a vanilla FreeBSD 13.1 with zfs, then go about copying using zfs send?


free-and-bsd said:


> BTW, you can use that Ubuntu USB stick you've mentioned earlier and use the HDD diagnostic tool (forgot its name, sorry) they have in every distro. At least you'll see if your disk shows errors.


I'll try to boot using that stick. I'm not sure if it would be able to properly check the hard drive for errors given it's zfs+encrypted - would it?


----------



## free-and-bsd (Dec 11, 2022)

Tracker said:


> Yes, SSD, Samsung's 860. And I was also using swap on it a fair bit due to system load. Possibly that might have caused the hard drive to wear down faster? I mean it's still not clear definitely that it's a hard drive issue but I used to always think about the stress on hard drive with swap being almost fully used.
> 
> Interesting. So IF it's due to a failing hard drive then the error on the new one should be fixable? How would I go about fixing it? Using scrub?
> 
> ...


Ok, point by point)))

1) Actually Samsung is a _good reputation_ SSD manufacturer. Still, things happen...

2) Yes, the error will be copied over to the new pool, but there it WILL be fixed with scrub, and you also WILL be able to delete the unfortunate files.

3) Actually, if you're "expert" enough, you may boot from 13.1 installation media, create a zpool on your new HDD, then use zfs send to send your old pool (snapshot) to the new zpool. That will restore ALL you have in the old pool to the new pool. You will then be able to boot from that same as you booted from the old one. Well, some basic setup for booting will be necessary, of course.

4) That tool checks the HDD at a low hardware level. It cares nothing about the stuff written to the disk. We're supposedly dealing here with hardware-level problems, and ZFS + its encryption is, as you know, software level.


----------



## free-and-bsd (Dec 11, 2022)

free-and-bsd said:


> Ok, point by point)))
> 
> 1) Actually Samsung is a _good reputation_ SSD manufacturer. Still, things happen...


Unless I'm mistaken, you should have a 4-5 year guarantee from the manufacturer.


----------



## free-and-bsd (Dec 11, 2022)

free-and-bsd said:


> BTW, you can use that Ubuntu USB stick you've mentioned earlier and use the HDD diagnostic tool (forgot its name, sorry) they have in every distro. At least you'll see if your disk shows errors...


Ok, smartctl is the name. For HDDs with S.M.A.R.T. present. Not sure if it applies to SSDs in any useful way.
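Typical smartctl usage is a command sketch like this (the device name is an example; on FreeBSD it would be something like /dev/ada0 instead):

```shell
smartctl -a /dev/sda           # overall health, attributes, and error log
smartctl -t short /dev/sda     # start a short offline self-test
smartctl -l selftest /dev/sda  # read back the self-test results afterwards
```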


----------



## Tracker (Dec 11, 2022)

free-and-bsd said:


> Ok, smartctl is the name. For HDD's with S.M.A.R.T. present. Not sure if it applies to SSD's in any useful way.


Ok - since everyone seems to think it's a hard disk issue - I ran smartctl from an Ubuntu 16.x stick to see if the SSD is the culprit here... here are the results:

short test :


> ubuntu@ubuntu:~$ sudo smartctl -t short /dev/sda
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-28-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> ...



Now I've started the long test on smartctl as well - will update this post with those results after the 85 mins it says it will take.
UPDATE: Here is the result of the long test below


> ubuntu@ubuntu:~$ sudo smartctl -t long /dev/sda
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-28-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> ...


----------



## Tracker (Dec 11, 2022)

free-and-bsd said:


> Ok, point by point)))
> 
> 1) Actually Samsung is a _good reputation_ SSD manufacturer. Still, things happen...
> 
> ...


This is very helpful - I'm guessing the smartctl tool above should catch any errors on the SSD. I still have to buy a new SSD yet.



_martin said:


> Opps, my typo. `dumpon -l` to list the dump device, `dumpon /dev/diskNpX` to use the swap as swapdevice. AUTO part of the rc.conf should then automatically use that device.


I'm on an Ubuntu 16.x stick now... are there some logs/operations from the Ubuntu stick that might help make things more apparent? This is so much easier than being on my phone and replying. Happy to provide any such info.

cracauer@ I vaguely remember reading online that zdb could maybe fix the issue, as you suggested I use that command - I replied here https://forums.FreeBSD.org/threads/system-panic.87387/post-591149 ..... but apparently (from what I read) the behaviour of zdb isn't well documented ..... did my previous reply show any progress is possible in my situation?


----------



## cracauer@ (Dec 11, 2022)

You can try this zdb option. I would do it after a backup.

--all-reconstruction


----------



## Tracker (Dec 11, 2022)

cracauer@ said:


> You can try this zdb option. Would do it after a backup.
> 
> --all-reconstruction



Ok - so that's a last resort. Will keep it in mind.

The smartctl tests above - do those look OK? I mean, if it is an SSD error as you suspect.


----------



## _martin (Dec 11, 2022)

Well, Ubuntu 16 is not ZFS-aware as far as I know, so you can't do much with the FS itself. My advice to you is still the same: make sure you have dumps available so others can have a look and see what's happening, and do the backup without making too many (if any) changes to the current ZFS setup.


----------



## Tracker (Dec 11, 2022)

_martin said:


> Well, Ubuntu 16 is not ZFS-aware as far as I know, so you can't do much with the FS itself. My advice to you is still the same: make sure you have dumps available so others can have a look and see what's happening, and do the backup without making too many (if any) changes to the current ZFS setup.


I thought someone above mentioned that a hard drive test could be done since zfs+encryption is at the software level and the hard drive test is at the hardware level. Is that incorrect?

Regarding dumps - I'm still not sure how I'll be able to share them from my mobile phone. Being on an Ubuntu stick I can access a browser; going back to single user mode makes things difficult to share.


----------



## _martin (Dec 11, 2022)

The thing is - we are all guessing. We don't know what's happening. The panic/crash has all the information needed. You can't go wrong with a HW test, why not. So far, from the little you shared, we don't see any obvious HW issue.
You could use a USB stick to copy the contents of /var/crash to it, boot Ubuntu and upload that. Just make sure you're using a FS that both FreeBSD and Linux are comfortable with. To not complicate things, fat32 is just fine in your setup.


----------



## Tracker (Dec 11, 2022)

_martin said:


> The thing is - we are all guessing. We don't know what's happening. The panic/crash has all the information needed. You can't go wrong with a HW test, why not. So far, from the little you shared, we don't see any obvious HW issue.
> You could use a USB stick to copy the contents of /var/crash to it, boot Ubuntu and upload that. Just make sure you're using a FS that both FreeBSD and Linux are comfortable with. To not complicate things, fat32 is just fine in your setup.


Ok - let me try this. Will report back soon. I'll probably need to set the dump device, since the AUTO option was already set.

Update _martin: `dumpon -l` outputs `/dev/null`

Also, I had earlier tried to check the contents of /var/crash.... There seems to be only a single file there, named "minfree", containing a single line with the number "2048".

Feeling a little lost here how to proceed


----------



## free-and-bsd (Dec 12, 2022)

_martin said:


> So far from little you shared we don't see any obvious HW issue.


Are there some other possible explanations as to why a ZFS pool would have _permanent_ errors that don't go away and can't be fixed? Like files reported corrupted that can neither be fixed nor deleted? I would like to know, because I've encountered the same situation twice, and in both cases replacing the disk solved the problem.
It is understood that ZFS is extremely difficult to kill. So when it does start failing like this, what could it possibly be, other than HW failure?


----------



## smithi (Dec 12, 2022)

_martin said:


> The thing is - we are all guessing. We don't know what's happening. Panic/crash has all the information needed. You can't go wrong with HW test, why not. So far from little you shared we don't see any obvious HW issue.



The smartctl output looked clean to me, but I'm no expert, and I only observe with awe the issues people have with ZFS.



_martin said:


> You could use USB stick to copy contents of the /var/crash to it, boot Ubuntu and upload that. Just make sure you're using FS that both FreeBSD and Linux is comfortable with. To not to complicate things fat32 is just fine in your setup.



Tracker could maybe use an OTG cable to move files to and from his box with the phone, if the phone does OTG?

Works for me on my 2017 Samsung J5 Pro, so it should on anything as recent or more so...


----------



## _martin (Dec 12, 2022)

Tracker said:


> Feeling a little lost here how to proceed


Please re-read my posts; I've mentioned at least 3 times now how to set that dump device.



free-and-bsd said:


> Are there some other possible explanations as to why ZFS pool would have _permanent_ errors that don't go away nor can be fixed?


I'm not familiar with ZFS internals, so I don't know. Several people, even here on the forums, have reported similar issues over time.

It's up to the OP to do what he/she thinks is the best way to proceed, be it focusing on the files that ZFS reports as problematic. My 2c on the issue is what I said above: 100% reproducible panic on a GENERIC kernel --> collect information and open a PR.


----------



## Tracker (Dec 12, 2022)

_martin said:


> Please re-read my posts; I've mentioned at least 3 times now how to set that dump device


I tried putting into rc.conf:

dumpdir="/var/crash"

But it won't show me anything in the /var/crash directory after reboot/panic.

Next I'm trying dumpon to a USB stick partition - would it remember the partition/USB after a reboot, though?

Update: upon typing:
dumpon /dev/da0s3

It says: da0s3 isn't suitable for kernel dumps (wrong type?)

This partition is fat32 on the remaining space of a bootable FreeBSD 13.1 stick.

What do I do now?


_martin said:


> It's up to OP to do what he/she thinks is the best way to proceed, be it focusing on the files that are reported as problematic by ZFS. My 2c to the issue is what I said above:


I strongly suspect this is a zfs issue but not sure how to proceed with it.

Will have to look for a zfs send/backup guide as well. Running out of time...


----------



## covacat (Dec 12, 2022)

gpart modify -i3 -t freebsd-swap da0


----------



## Tracker (Dec 12, 2022)

covacat said:


> gpart modify -i3 -t freebsd-swap da0


It says invalid argument. Please see image


----------



## covacat (Dec 12, 2022)

Ignore that.
The problem is probably that the partition is smaller than physical memory.


----------



## Tracker (Dec 12, 2022)

covacat said:


> Ignore that.
> The problem is probably that the partition is smaller than physical memory.


You mean to say I need a bigger disk to dump on, or should I go ahead and try it anyway? Yes, mem is 8 GB.


----------



## covacat (Dec 12, 2022)

It won't dump, not because of the partition type, but because the size of the partition is less than the physical memory size.
So ignore the error from gpart modify.
Yes, you need a bigger disk.


----------



## _martin (Dec 12, 2022)

Why da0? Isn't that your USB key? Judging from the size of the disk and the fact that ada0 actually has a zfs partition, ada0 is your system disk. ada0p2 is freebsd-swap, which is a proper type. Use that.

You don't need to specify dumpdir in rc.conf. Use this only:
```
dumpdev="/dev/ada0p2"
```
But in order for it to be usable, make sure you execute (in your case in single user mode): `dumpon /dev/ada0p2`. Then reboot to single user mode again and do `dumpon -l`. You should see this partition as the dump device. Once that's done you're set for the crash.


----------



## Tracker (Dec 12, 2022)

_martin said:


> ada0p2 is freebsd-swap which is a proper type. Use that


Ok. Did `dumpon /dev/ada0p2`.

Next step reboot? How do I access the dump?


----------



## _martin (Dec 12, 2022)

I've updated my post and included again the info I shared above. When the system panics you'll see it saving the crash. When it reboots, go to single user mode. I'm not 100% sure if savecore is executed in single user mode; if not, you'll execute it yourself (we'll see). The crash should be in `/var/crash`.


----------



## Tracker (Dec 12, 2022)

_martin said:


> But in order for it to be usable, make sure you execute (in your case in single user mode): `dumpon /dev/ada0p2`. Then reboot to single user mode again and do `dumpon -l`. You should see this partition as the dump device


Ok, I followed these steps. Upon checking, however, the output shows the same dump destination twice, separated by a comma, i.e.:

```
# dumpon /dev/ada0p2
# dumpon -l
ada0p2,ada0p2
```

Does that look correct, or do I need to fix something? _martin


----------



## _martin (Dec 12, 2022)

I'm not 100% sure why it's there twice; I'm assuming it was caught once by rc.conf and again when you did it manually. But I think you're ready to boot normally and see if it properly dumps.


----------



## Tracker (Dec 12, 2022)

_martin said:


> I'm not 100% sure why it's twice there, I'm assuming it was caught once by rc.conf and now as you did that manually. But I think you're ready to boot normally and see if it properly dumps.


Ok, I rebooted. I saw it say "Dumping...10%....90%"

But NOTHING under /var/crash.

/etc/rc.conf only has dumpdev="/dev/ada0p2" (as was suggested earlier).

What am I missing? Soo close


----------



## _martin (Dec 12, 2022)

Perfect, almost there. Assuming you have enough free space in /var, run this from single user mode: `/etc/rc.d/savecore start` and check /var/crash afterwards.
edit: actually, in single user mode your fileset(s) may be mounted read-only. Do you know how to remount them read-write?
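As a sketch, the remount depends on the filesystem; for the ZFS layout in this thread it would be something like:

```
# ZFS root: clear the read-only property on the pool's datasets
zfs set readonly=off zroot

# For comparison, a UFS root would instead be updated in place:
# mount -u -w /
```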


----------



## Tracker (Dec 12, 2022)

_martin said:


> Perfect, almost there. Assuming you have enough free space in /var run this from single mode: `/etc/rc.d/savecore start` and check the /var/crash afterwards.
> edit: actually in single mode your fileset(s) may be in readonly mode. Do you know how to remount it read-write?


Ok, so I set readonly=off and ran savecore.

I am finally able to see /var/crash/core.txt.0 !!! Alongside vmcore.last plus a couple of other new files

However, less on that file doesn't work. What do I do with it, and how do I access it from, say, a Ubuntu stick?


----------



## _martin (Dec 12, 2022)

What do you mean, it doesn't work? That's the text summary of the crash; it should be readable. Copy the whole contents of /var/crash to that USB stick and share it from there.


----------



## _martin (Dec 12, 2022)

While I've never done this before (there was no need), you can save some time and avoid manual file copying. After a crash, once you are in single user mode, mount the USB key to, let's say, `/a` and run the savecore command manually: `savecore /a /dev/ada0p2` - it will save the dump to that directory directly, so it will be on the USB key right away.
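That sequence, sketched out; the USB device name (da1s1) and its filesystem type are assumptions to verify with `gpart show` first:

```
mkdir -p /a
mount -t msdosfs /dev/da1s1 /a   # or plain mount for a UFS-formatted key
savecore /a /dev/ada0p2          # write the dump from swap straight to the key
umount /a
```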


----------



## Tracker (Dec 12, 2022)


_martin said:


> What does it mean it doesn't work ? That's the text summary of the crash, it should be readable. Copy the whole contents of the /var/crash to that usb stick and share it from there.


When I tried less on one of the files it said there was no debugger or something like that. Now trying to copy the files; one of them is 450+ MB as well. Will post the files soon.


----------



## _martin (Dec 12, 2022)

Ok, you don't have gdb installed. Not a problem. Along with that, can you please run `cksum /boot/kernel/kernel` and post what version of FreeBSD you're running exactly?


----------



## Tracker (Dec 12, 2022)

_martin said:


> Ok, you don't have gdb installed. Not a problem. Along with that please can you do `cksum /boot/kernel/kernel` and post what version of FreeBSD you're running exactly?


Oops, logged out now - really need to focus on recovering data and getting my system back now.

Here are the files:
"bounds" contains only


> 1


core.txt.0 contains only


> Unable to find a kernel debugger.
> Please install the devel/gdb port or gdb package.


info.0 contains only


> Dump header from device: /dev/ada0p2
> Architecture: amd64
> Architecture Version: 2
> Dump Length: 477515776
> ...


info.last contains only


> Dump header from device: /dev/ada0p2
> Architecture: amd64
> Architecture Version: 2
> Dump Length: 477515776
> ...


and then there's vmcore binary files I think which are 450+ mb

Let me know if this helps fix the system please?


----------



## _martin (Dec 12, 2022)

The text summary alone is not enough; the full stack trace should be provided. But this is very important:

`Panic String: VERIFY3(0 == zap_add_int(zfsvfs->z_os, zfsvfs->z_unlinkedobj, zp->z_id, tx)) failed (0 == 97)`

The issue you are having is related to ZFS (the cause of the panic is ZFS) and you need somebody who knows ZFS internals to tell you more (hence the PR).

`freebsd-version -kru`
HW details - at least some description.
+ stack backtrace, and we have all the info needed


----------



## Tracker (Dec 12, 2022)

_martin said:


> `freebsd-version -kru`
> HW details - at least some description.
> +stack backtrace and we have all info needed


Please check image


----------



## Tracker (Dec 12, 2022)

So anyone still following this thread - seems like the culprit IS zfs, as many suspected, see this message: https://forums.freebsd.org/threads/system-panic.87387/post-591355

Now trying to recover data:
I'm trying to boot into a later BE (p4), which I activated via beadm activate - however it seems to be putting me into p3 only (as shown by uname -a, even after reboot). What am I doing wrong? (I want the later BE to recover the latest data.)

See image for reference


----------



## Tracker (Dec 12, 2022)

Strange. The BE is set to p4 but on single user mode login (which is the only thing I can do rn) uname says the p3 version.

Please see image below for this

Why is this happening?


----------



## SirDice (Dec 12, 2022)

p4 didn't involve the kernel, it only had some userland updates. p5 is also just a couple of userland updates. So a p3 kernel is perfectly normal.


----------



## _martin (Dec 12, 2022)

I opened PR 268333.


----------



## Tracker (Dec 12, 2022)

SirDice said:


> p4 didn't involve the kernel, it only had some userland updates. P5 is also just a couple of userland updates. So a p3 kernel is perfectly normal.


So p3 is a month old; I'd like to back up from p4/p5... How can I make that happen?

Should I try running freebsd-update fetch/install?

Edit: Sorry, I'm a bit confused about this. I guess the files/data don't really depend on p3/4/5, or do they?


_martin said:


> I opened PR 268333.


Thank you! Please let me know if there's anything else you need from me or if there's a solution to my issue!


----------



## Tracker (Dec 12, 2022)

Ok, _martin's advice finally cracked it!!

Basically zroot/tmp is now not mounted.

What should I do next? Is there a way to fix this zroot/tmp issue for good, or do I still need to go ahead and back up because this might blow up soon?



> I see system is panicing during /tmp cleanup. Idea is to either disable this fileset or create a new one. The thing is I don't want to touch ZFS too much as we don't know what state is it in. Disabling it, however, should be ok.
> In single mode do zfs set mountpoint=none zroot/tmp and reboot. This dataset would not be mounted but rather /tmp in / would be used. This could be the convenience you need to get to the full system and do backup from there.


----------



## SirDice (Dec 12, 2022)

Can you rename it? Or would that blow up ZFS? `zfs rename zroot/tmp zroot/tmp.broken` If that works I would create a new tmp; `zfs create -o mountpoint=/tmp zroot/tmp`
After it's been mounted make sure to `chmod 1777 /tmp` as it needs the sticky(7) bit there.

Or you can just leave it as-is. It just means /tmp ends up in zroot/ROOT/default
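The rename-and-recreate route above, collected in order (best attempted only after a backup, given the pool's state):

```
zfs rename zroot/tmp zroot/tmp.broken     # park the suspect dataset
zfs create -o mountpoint=/tmp zroot/tmp   # fresh dataset in its place
chmod 1777 /tmp                           # /tmp needs the sticky bit
```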


----------



## _martin (Dec 12, 2022)

I told him to touch ZFS as little as possible. Maybe there are more issues there anyway, but this way he's in the full environment and can do a backup. Setting the mountpoint to none is the least invasive approach.

He doesn't need to do anything to the /tmp directory. FreeBSD's "self-healing" /etc/rc.d/tmp takes care of it.


----------



## Jose (Dec 12, 2022)

_martin said:


> The thing is - we are all guessing. We don't know what's happening.


Yup. I've gleaned from reading too many of the OP's posts that it's an older system prone to overheating. My stab in the dark is that some aging component has started to fail, but only shows symptoms when the system overheats. Not too many paths forward besides new hardware.

I admire the time and effort you and others have spent trying to save the OP's data, though.


----------



## Tracker (Dec 12, 2022)

Somehow I'm not able to mount the other disk that I need to back up to, after doing:

```
geli attach /dev/da0p3
Enter passphrase:
sudo mount /dev/da0p3.eli /mnt
mount: /dev/da0p3.eli: No such file or directory
```

I see the eli active though.

Also the data seems to have taken a hit: Firefox won't start without asking me to create a new profile, when I had multiple windows running. And Chrome won't even start. That's where I had some of my important stuff.



Jose said:


> OP's posts that it's an older system prone to overheating.


What specifically gives it away that it has overheating issues?


Jose said:


> I admire the time and effort you and others have spent trying to save the OP's data, though.


Definitely. All of them are rockstars for having gone out of their way to help me 
Even though my data seems to have taken a hit


----------



## Jose (Dec 12, 2022)

Tracker said:


> What specifically gives it away that it has overheating issues?





Tracker said:


> ...this machine is pretty old so I guess that should be reasonable? Was already having temp issues when overloaded...


Also the reboots during compilations are pretty typical effects of overheating.


----------



## covacat (Dec 12, 2022)

looks a lot like https://support.oracle.com/knowledge/Sun Microsystems/2421977_1.html
just i can't remember my larry support account


----------



## _martin (Dec 12, 2022)

covacat: Sounds interesting, even more so since the KB is not that old. Sadly I don't have a valid MOS account either.

It's up to you how you decide to do a backup; there are more ways to skin a cat. I would opt for a filesystem backup using rsync and would not do zfs send. Since you have a corrupted pool there is risk one way or the other; it's my personal preference though.

In a private chat you mentioned this disk you're using is somewhat of a backup of the original one. Pay attention that you don't have pools with the same name on both disks.
If da0p3.eli doesn't exist after you entered the passphrase, you didn't enter the proper one. Syslog (/var/log/messages) might give you more information about that.
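A quick way to check whether the attach actually produced a provider (paths taken from this thread):

```
geli status                     # is da0p3.eli listed and ACTIVE?
ls -l /dev/da0p3.eli            # the .eli device must exist before mount/import
tail -n 50 /var/log/messages    # GELI logs attach failures here
```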


----------



## _martin (Dec 12, 2022)

Actually, I do have MOS support. I won't blindly copy-paste the contents of the link here though.
The suggested solutions were actually mentioned here already (scrub); if that fails, a restore is needed.

I went through the pictures you shared here again. The one where you share zpool status -v (those 3 errors) is important. That picture is what led me to the suggestion to disable the zroot/tmp dataset in the first place. I suggest you attempt to clean it this way.

a) chromium: I'm not sure how much data you have there (bookmarks, saved passwords, etc.) but I'd rather have chromium recreate everything from scratch. As root (without chromium running), purposely split into two commands:

```
cp -rp /usr/home/c1utt4r/.config/chromium /var/crash
rm -rf /usr/home/c1utt4r/.config/chromium
```

b) zroot/tmp .. As mentioned before, you could probably remove zroot/tmp and recreate it again. But this is something I'd rather do _after_ you have a backup done.
Interesting point: if you can't fix metadata on a dataset you should restore the whole pool, i.e. don't trust the pool at all.

You had reported issues only on a) and b), so I'd say your data are still safe. And as you don't have any other means of backup, this is the only option for you.
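The scrub-then-check cycle mentioned above, as commands (pool name from this thread; run it only once you accept the risk of touching the pool):

```
zpool scrub zroot       # re-read and verify every block in the pool
zpool status -v zroot   # once finished, lists the affected files
```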


----------



## Tracker (Dec 12, 2022)

covacat said:


> looks a lot like https://support.oracle.com/knowledge/Sun Microsystems/2421977_1.html
> just i can't remember my larry support account


This is the link that the result which shows error points to https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A/ .... seems like a metadata level corruption .... not sure how to get rid of it .... but first I guess I need to salvage whatever data remains


_martin said:


> If da0p3.eli doesn't exist after you entered passphrase you didn't enter a proper one then. Syslog (/var/log/messages) might give you more information about that.


Passphrase is correct - it attaches, but it doesn't mount. Here is the output:

```
sudo zdb -l /dev/da0p3.eli
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'zroot'
    state: 0
    txg: 3557598
    pool_guid: 10535025700179738651
    hostid: 2647270205
    hostname: ''
    top_guid: 1525963299974165836
    guid: 1525963299974165836
    vdev_children: 1
    vdev_tree:
        type: 'disk'
        id: 0
        guid: 1525963299974165836
        path: '/dev/ada0p3.eli'
        phys_path: 'id1,enc@n3061686369656d30/type@0/slot@3/elmdesc@Slot_02/p3/eli'
        whole_disk: 1
        metaslab_array: 67
        metaslab_shift: 31
        ashift: 12
        asize: 311476617216
        is_log: 0
        DTL: 284
        create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    labels = 0 1 2 3
```



_martin said:


> It's up to you how you decide to do a backup, there are more ways to skin a cat. I would opt for filesystem backup using rsync and would not do zfs send. I mean as you do have corrupted pool issue is there one way or the other. It would be my personal preference though.


I was hoping to use zfs so file permissions etc. stay the same, and it's possibly easier. If the data is corrupted (as it seems), maybe zfs is a better option than rsync?


----------



## _martin (Dec 12, 2022)

If that eli provider contains a ZFS pool you need to import it; you can't mount it as a regular FS (you are actually trying to mount it as FFS, which is the default filesystem for FreeBSD).
Also, as I had suspected, that pool is also named zroot. If you run `zpool import` you should see the pool.

I'm not particularly proud of editing my posts, but I noticed this:

```
zdb -l /dev/da0p3.eli

path: '/dev/ada0p3.eli'
```
You didn't explain how you got to that disk, but pay attention: it seems those are clones of some sort -- you can make a mess if you try to import it.


----------



## Tracker (Dec 12, 2022)

_martin said:


> Also, as I had suspected, that pool is also named zroot. If you run `zpool import` you should see the pool.




```
sudo zpool import
no pools available to import
```


----------



## _martin (Dec 12, 2022)

I'm only guessing now, but I think because it was a clone it's messing up the status. If you don't need the data on that da0 disk (the backup-disk-to-be) I'd rather delete it and start from scratch.


----------



## Tracker (Dec 12, 2022)

_martin said:


> I'm only guessing now but I think because it was clone it's messing up the status. If you don't need data on that da0 disk (backup disk to be) I'd rather delete it and start from scratch.


Not sure I feel comfortable deleting and starting from scratch, given that some of the data is corrupted on the SSD I'll be backing up from.

Yes, the HDD is an older version - maybe I shouldn't import it, as you said - but then how else can I mount it?


----------



## _martin (Dec 12, 2022)

Tracker said:


> I was hoping to use zfs for file permissions, etc being the same, and possibly easier. If data is corrupted (as it seems) maybe zfs is a better option than rsync ?


If the data is corrupted you're done; nothing will help you. I would personally prefer a filesystem backup at this point rather than zfs send. But in the end it's your choice as sysadmin; zfs send does provide you with this option.

Of course rsync preserves ownership/permissions, even file flags and ACLs. For a basic setup `rsync -avxH ..` is enough. Note the -x not to cross filesystems; maybe something you actually want.

Question is: do you need the data on that da0 disk? The one you want to put your data on. If not, I'm suggesting you delete it and create a new backup pool to push your data to. Don't name it the same as your zroot, so you can import it on the current machine.
Also, if you don't care about the data on da0, you could even do `dd if=/dev/ada0 of=/dev/da0` and let it do a 1:1 backup.
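Both options sketched out; the destination path /backup is an assumption, and note the dd line overwrites da0 entirely:

```
# Option 1: file-level copy preserving owners, permissions and hard links;
# -x stays on one filesystem
rsync -avxH /mnt/source/ /backup/

# Option 2: raw 1:1 clone of the whole disk (destroys everything on da0)
dd if=/dev/ada0 of=/dev/da0 bs=1m status=progress
```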


----------



## free-and-bsd (Dec 13, 2022)

BTW, if you have problems decrypting it, you can use `zfs send --raw` to send it as-is.


----------



## free-and-bsd (Dec 13, 2022)

_martin said:


> Question is: do you need data on that da0 disk? The one you want to put your data on. If not I'm suggesting to delete it and create new backup pool where you push your data. Don't name it the same one as your zroot so you can import it to the current machine.
> Also if you don't care about the data on da0 you could even do dd if=/dev/ada0 of=/dev/da0 and let it do 1:1 backup.


I wonder if his zpool from 468G freebsd-zfs partition can be copied that way to a 7G USB flash drive. It will have to be a different disk.


----------



## _martin (Dec 13, 2022)

When you look at his zdb output you can spot a problem: the pool on the backup is an exact copy of the running pool. The red flag in that output is `path:`, which doesn't correspond to the provider queried in zdb.
The disk that is supposed to be a backup disk is a clone (of sorts) of the currently running disk, hence this problem. In a private conversation I suggested he boot to live FreeBSD media and change the name (zpool import ..) and its guid (zpool reguid).
Then once that's done, boot back into the system and do the zdb comparison again. ZFS will not import the pool if there's a problem, but zdb is a good way to check.

This is all done to a) avoid buying a new backup disk, b) not touch the older data on the old disk, c) try to back up the current data.
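From live media, the rename-and-reguid step might look like this (the new pool name `zbackup` is purely illustrative; -f may be needed since the pool was last in use on another host):

```
# with the geli provider already attached
zpool import -f zroot zbackup   # import the clone under a new name
zpool reguid zbackup            # fresh guid so it can't collide with the original
zpool export zbackup            # export cleanly before rebooting
```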


----------



## free-and-bsd (Dec 13, 2022)

Yes, ZFS _naming_ must be paid attention to first and foremost. With 2 pools named the same (I had just started using ZFS back then) I ended up with an unresponsive system. And there was no way out of it except disconnecting the offending drive.


----------



## Tracker (Dec 20, 2022)

cracauer@ said:


> You can try this zdb option. Would do it after a backup.
> 
> --all-reconstruction


I tried this - didn't work

```
zdb --all-reconstruction
zdb: illegal option -- -
Usage:    zdb [-AbcdDFGhikLMPsvXy] [-e [-V] [-p <path> ...]] [-I <inflight I/Os>]
        [-o <var>=<value>]... [-.....
```



_martin said:


> Of course rsync does preserve ownership/permissions, even file flags and ACLs. For basic setup `rsync -avxH ..` is enough. Note the -x not to cross FS, maybe something you actually want.


I did rsync - with different options - and it seemed to copy the files, on the face of it.


_martin said:


> Question is: do you need data on that da0 disk?


Yes, until I can fix the SSD for good - the da0/hdd is an older version that later became the (now corrupted) SSD - so it has some version history of the data on the corrupted drive.


----------



## Tracker (Dec 21, 2022)

This zfs corruption is really messed up - I can't even see the home folder under previous snapshots when I mount them!

Also it doesn't show me programs like Firefox when I try to run them from the launcher in xmonad - although the terminal correctly recognizes that there's a program called "firefox" on the system


----------



## Alain De Vos (Dec 21, 2022)

A few commands:

```
zpool import
zpool list -v
zpool status -x
zfs mount -a
zfs list
kenv | egrep "currdev|mountfrom|kernel_path|kernelname"
```


----------



## Tracker (Dec 21, 2022)

Alain De Vos said:


> A few commands:
> 
> ```
> zpool import
> ...




```
sudo zpool import
Password:
no pools available to import
```


```
zpool list -v
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zroot          456G   289G   167G        -         -    48%    63%  1.00x    ONLINE  -
  ada0p3.eli   456G   289G   167G        -         -    48%  63.3%      -    ONLINE
```


```
sudo zpool status -x
  pool: zroot
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:27:07 with 2 errors on Wed Dec 21 04:23:43 2022
config:

    NAME          STATE     READ WRITE CKSUM
    zroot         ONLINE       0     0     0
      ada0p3.eli  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list
```
`zfs mount -a`

```
zfs list
NAME                                            USED  AVAIL     REFER  MOUNTPOINT
zroot                                           289G   153G       88K  /zroot
zroot/ROOT                                      116G   153G       88K  none
zroot/ROOT/12.3-RELEASE-p1_2022-03-18_164224      8K   153G     17.5G  /
zroot/ROOT/12.3-RELEASE-p3_2022-03-23_175807      8K   153G     17.4G  /
zroot/ROOT/12.3-RELEASE-p4_2022-04-06_232036      8K   153G     17.9G  /
zroot/ROOT/12.3-RELEASE-p5_2022-08-10_011525      8K   153G     27.0G  /
zroot/ROOT/12.3-RELEASE-p6_2022-09-03_171127      8K   153G     29.3G  /
zroot/ROOT/12.3-RELEASE-p7_2022-09-10_230907      8K   153G     29.7G  /
zroot/ROOT/12.3-to-13.1                           8K   153G     29.6G  /
zroot/ROOT/13.0-RELEASE-p11_2022-07-01_213226     8K   153G     23.7G  /
zroot/ROOT/13.1-RELEASE-p2_2022-09-10_231247    700K   153G     29.8G  /
zroot/ROOT/13.1-RELEASE-p2_2022-09-10_232433    644M   153G     29.0G  /
zroot/ROOT/13.1-RELEASE-p2_2022-09-11_220401    716K   153G     29.1G  /
zroot/ROOT/13.1-RELEASE-p2_2022-11-06_014830    856K   153G     26.7G  /
zroot/ROOT/13.1-RELEASE-p3_2022-11-16_133737    123M   153G     26.8G  /
zroot/ROOT/13.1-RELEASE-p4_2022-12-03_162103    114G   153G     29.0G  /
zroot/ROOT/13.1-RELEASE-p4_2022-12-10_051051    527M   153G     29.0G  /
zroot/ROOT/13.1-RELEASE-p4_2022-12-10_061442    574M   153G     29.0G  /
zroot/ROOT/13.1-RELEASE-p4_2022-12-10_180529      8K   153G     28.8G  /
zroot/ROOT/13.1-p2-after-destroying            65.6M   153G     23.0G  /
zroot/tmp                                      62.5M   153G     62.5M  none
zroot/usr                                       172G   153G       88K  /usr
zroot/usr/home                                  170G   153G      170G  /usr/home
zroot/usr/ports                                1.50G   153G     1.50G  /usr/ports
zroot/usr/src                                   771M   153G      771M  /usr/src
zroot/var                                       269M   153G       88K  /var
zroot/var/audit                                  88K   153G       88K  /var/audit
zroot/var/crash                                 267M   153G      267M  /var/crash
zroot/var/log                                  1.91M   153G     1.91M  /var/log
zroot/var/mail                                  112K   153G      112K  /var/mail
zroot/var/tmp                                    88K   153G       88K  /var/tmp
```


```
kenv | egrep "currdev|mountfrom|kernel_path|kernelname"
currdev="zfs:zroot/ROOT/13.1-RELEASE-p4_2022-12-03_162103:"
kernel_path="/boot/kernel"
kernelname="/boot/kernel/kernel"
vfs.root.mountfrom="zfs:zroot/ROOT/13.1-RELEASE-p4_2022-12-03_162103"
```


----------



## Tracker (Dec 21, 2022)

Does any of this help? 

Think I am going to proceed with reinstalling the OS (with zfs) onto the corrupted disk/SSD in the next couple of hours. So far no resolution has come about, nor does the bug report show any progress.


----------



## Tracker (Dec 21, 2022)

cracauer@ said:


> You can try this zdb option. Would do it after a backup.
> 
> --all-reconstruction


If I understood correctly, the "-Y" option does the same, and the zpool needs to be specified,

so I ran `sudo zdb -Y zroot` and it terminated like this - not sure if there are any zfs/zdb experts:

```
Dataset zroot/ROOT/13.1-RELEASE-p4_2022-12-03_162103 [ZPL], ID 590, cr_txg 6764419, 29.0G, 904448 objects

    ZIL header: claim_txg 0, claim_blk_seq 0, claim_lr_seq 0 replay_seq 0, flags 0x0

        TX_CREATE           len    120, txg 6857403, seq 53
        TX_WRITE            len   4288, txg 6857403, seq 54
        TX_SETATTR          len    184, txg 6857403, seq 55
        TX_CREATE           len    120, txg 6857535, seq 56
        TX_WRITE            len   4288, txg 6857535, seq 57
        TX_SETATTR          len    184, txg 6857535, seq 58
        Total               6
        TX_CREATE           2
        TX_WRITE            2
        TX_SETATTR          2


    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         0    6   128K    16K   350M     512   895M   49.33  DMU dnode
        -1    1   128K  1.50K     8K     512  1.50K  100.00  ZFS user/group/project used
        -2    1   128K     2K     8K     512     2K  100.00  ZFS user/group/project used

    Dnode slots:
    Total used:             0
    Max used:               0
    Percent empty:        nan

dmu_object_next() = 97
Abort trap
```

There was also this command I ran, `sudo zdb -c zroot`, and that too didn't seem to run properly:


```
Traversing all blocks to verify metadata checksums and verify nothing leaked ...

loading concrete vdev 0, metaslab 227 of 228 ...
47.4G completed (1101MB/s) estimated time remaining: 0hr 03min 44sec        zdb_blkptr_cb: Got error 97 reading <94, 3, 0, f>  -- skipping
 288G completed (1672MB/s) estimated time remaining: 0hr 00min 00sec        
Error counts:

    errno  count
       97  1

    No leaks (block sum matches space maps exactly)

    bp count:               7355397
    ganged count:             28006
    bp logical:        421186207744      avg:  57262
    bp physical:       300170215936      avg:  40809     compression:   1.40
    bp allocated:      310002585600      avg:  42146     compression:   1.36
    bp deduped:                   0    ref>1:      0   deduplication:   1.00
    Normal class:      310002225152     used: 63.59%
    Embedded log class         360448     used:  0.02%

    additional, non-pointer bps of type 0:     377741
    Dittoed blocks on same vdev: 640087
```

So something is definitely up with that error 97 on zfs - any experts who know what it is and whether it's fixable?


----------



## Alain De Vos (Dec 21, 2022)

```
zpool status -v
```


----------



## Tracker (Dec 21, 2022)

Alain De Vos said:


> ```
> zpool status -v
> ```


Ended up reinstalling the OS onto the corrupted SSD.

Thanks everyone for the help, even though this problem remained unsolved. Incredibly grateful


----------



## cracauer@ (Dec 21, 2022)

Yeah. Once zdb throws up its hands the pool is ready for re-creation. Sorry.


----------



## free-and-bsd (Dec 23, 2022)

deleted


----------

