# Panic: 12.2 does not work with jails



## PMc (Dec 7, 2020)

Okay, You moved my other thread to ports+packages, but this is base: 12.2 cannot run my jails!

As soon as I start and stop 2-3 jails, there is a kernel panic.
This machine was running for a year with 11.3 and 11.4 in this layout, and the only thing I did was a clean upgrade to 12.2-p1 and a recompile of the ports. This is not related to ports. I first thought that, too, but it isn't.

It is not one specific jail; it seems any of them will crash.
I already tried the GENERIC kernel, with the same fault.
It may be related to VIMAGE, not sure yet. An old-style jail seems to have behaved better, but that needs further testing. (If it is VIMAGE, it might also be Netgraph.)

I had Rel 12.2 running on the other machine for quite a while with no issues, but that doesn't have jails. Now this is my main backend, I don't even have telephone working now. This is really bad.

I'm not sure how to proceed now: should I try to find the flaw, or go back to 11.4?
Has anybody heard about such a problem, anywhere?


----------



## SirDice (Dec 7, 2020)

PMc said:


> Has anybody heard about such a problem, anywhere?


Nothing familiar comes to mind. I have several 12.2 machines with jails, no problems there.


----------



## PMc (Dec 7, 2020)

Thank You. Are Yours in VIMAGE with Netgraph?

From the backtrace it becomes obvious that this is a vnet/VIMAGE problem with the interface:


```
#4 0xffffffff810bbadf at trap_pfault+0x4f
#5 0xffffffff810bb23f at trap+0x4cf
#6 0xffffffff810933f8 at calltrap+0x8
#7 0xffffffff80cdd555 at _if_delgroup_locked+0x465
#8 0xffffffff80cdbfbe at if_detach_internal+0x24e
#9 0xffffffff80ce305c at if_vmove+0x3c
#10 0xffffffff80ce3010 at vnet_if_return+0x50
#11 0xffffffff80d0e696 at vnet_destroy+0x136
#12 0xffffffff80ba781d at prison_deref+0x27d
#13 0xffffffff80c3e38a at taskqueue_run_locked+0x14a
#14 0xffffffff80c3f799 at taskqueue_thread_loop+0xb9
#15 0xffffffff80b9fd52 at fork_exit+0x82
#16 0xffffffff8109442e at fork_trampoline+0xe
```
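To make the call path easier to read, the frames can be flattened into a single chain. The following is only a sketch; `call_chain` is a helper name made up here, and it assumes the `#N address at symbol+offset` frame format shown in the backtrace.

```sh
# call_chain: read backtrace frames on stdin and print the symbols as one
# line, outermost caller first. (Hypothetical helper, not a FreeBSD tool.)
call_chain() {
    awk '$3 == "at" {
             sub(/\+0x[0-9a-f]+$/, "", $4)   # drop the +0x... offset
             names[++n] = $4
         }
         END {
             for (i = n; i >= 1; i--)
                 printf "%s%s", names[i], (i > 1 ? " -> " : "\n")
         }'
}
```

Fed frames #7 to #12 from above, it prints `prison_deref -> vnet_destroy -> vnet_if_return -> if_vmove -> if_detach_internal -> _if_delgroup_locked`: the fault happens while the dying jail's vnet hands its interface back to the host.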

Something must have been changed with this in Rel. 12.

Thank the Gods I was lazy and some vital stuff is still on the old-style jails... this is becoming a rough-style upgrade...


----------



## SirDice (Dec 7, 2020)

PMc said:


> Are Yours in VIMAGE with Netgraph?


No, they're the traditional jails. Let me try to set up a vnet jail to verify.


----------



## PMc (Dec 7, 2020)

I have now posted this to the freebsd-stable list.

There must be lots of people using VIMAGE, so it's strange this went undetected.
But from what I read, not many people use Netgraph for this; that's a smaller user base.

And when I crafted the config, I got crashes, until I figured out how to do it correctly:


```
rail {
        jid = 10;
        devfs_ruleset = 11;
        host.hostname = "xxx.xxx.xxx.org";
        vnet = "new";
        sysvshm;
        $ifname1l = nge_${name}_1l;
        $ifname1l_mac = 00:1d:92:01:01:0a;
        vnet.interface = "$ifname1l";
        exec.prestart = "
            echo -e \"mkpeer eiface crhook ether\nname .:crhook $ifname1l\" \
                | /usr/sbin/ngctl -f -
            /usr/sbin/ngctl connect ${ifname1l}: svcswitch: ether link2
            ifname=`/usr/sbin/ngctl msg ${ifname1l}: getifname | \
                awk '$1 == \"Args:\" { print substr($2, 2, length($2)-2)}'`
            /sbin/ifconfig \$ifname name $ifname1l
            /sbin/ifconfig $ifname1l link $ifname1l_mac
        ";
        exec.poststart = "
            /usr/sbin/jexec $name /sbin/sysctl kern.securelevel=3 ;
        ";
        exec.poststop = "/usr/sbin/ngctl shutdown ${ifname1l}:";
}
```

Maybe "correctly" is now a bit different...


----------



## SirDice (Dec 7, 2020)

PMc said:


> There must be lots of people using VIMAGE, so it's strange this went undetected.


It's more likely it's something with your specific situation. Or it would indeed have been detected a long time ago already. Lots of people are running 12-STABLE, which is basically an alpha of the next minor release.


----------



## PMc (Dec 7, 2020)

SirDice said:


> It's more likely it's something with your specific situation.



Believe me, if I saw something specific, I would immediately remove it and check.
But given the number of possible attributes, practically every configuration is in its own way specific.



SirDice said:


> Or it would indeed have been detected a long time ago already. Lots of people are running 12-STABLE, which is basically an alpha of the next minor release.


I doubt these "lots of people" - why would anybody do that, except on some test systems with fabricated tests?
This was different in the old days, when nobody had a mirrored disk, nobody had a security fetish, and the developers still talked to the common people and told them what they were doing, so it was fun to participate. Even bug reports were handled within 2-3 years then!

So let's conclude: it doesn't work anymore, we do not know why, so I have to revert to the last working release, 11.4. That gives me a few months to look for an alternative. Maybe I have to buy a plastic router with a backdoor and use Windows as a server (the desktop seems to be still fine with FreeBSD).


----------



## PMc (Dec 7, 2020)

SirDice said:


> Or it would indeed have been detected a long time ago already.



And, btw, it is absolutely Not Uncommon that I encounter the most severe flaws in widely used and well tested production releases.

A year ago it was with postgres 12 production release, and it came out that they had entirely ignored the EWOULDBLOCK situation on writes to the client, delivering garbage data. Everybody else had just increased their MTU above the postgres block size, and bought faster networks. I was the only one to complain.

Half a year ago, again with a postgres production release, I figured out that it happily deletes its own redo-logs, rendering any backup worthless. That time I wasn't the only one, and it was already known when I reported it.

And there are more of such examples. I think the problem is that there is no longer anybody who would take an end-to-end responsibility for things. Developers put in their individual projects and test with fabricated testsuites, and the end-user looks just for workarounds to get their things going. There is no root cause analysis happening anymore.


----------



## PMc (Dec 7, 2020)

Hunting the bug: 
1. Remove all applications:

```
# jail.conf:
rail {
     ...
     exec.start = "/bin/sleep 4 &";
}
```

This one works:

```
for i in `count 1 100` ; do
  service jail start rail
  sleep 5
  /usr/sbin/ngctl shutdown nge_rail_1l:
done
```

This one crashes after 2 iterations:

```
for i in `count 1 100` ; do
  service jail start rail
  sleep 2
  service jail stop rail
done
```
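Since the machine panics mid-loop, the iteration count is easy to lose. A small wrapper along these lines could record it before each cycle. This is a sketch only: `cycle_jail`, `RUN`, `LOG` and `JAIL` are names invented for this example, and the log path is an assumption.

```sh
#!/bin/sh
# Sketch only: cycle_jail, RUN, LOG and JAIL are names invented for this example.
JAIL=${JAIL:-rail}                  # jail name from the jail.conf above
LOG=${LOG:-/var/tmp/jailcycle.log}  # assumed location; should survive a reboot
RUN=${RUN:-}                        # set RUN=echo to print commands instead of running them

# cycle_jail COUNT DELAY: start and stop the jail COUNT times, sleeping DELAY
# seconds in between, and record each iteration before it starts.
cycle_jail() {
    i=1
    while [ "$i" -le "$1" ]; do
        echo "iteration $i" >> "$LOG"
        sync                        # ask the kernel to flush the log before risking a panic
        $RUN service jail start "$JAIL" || return 1
        $RUN sleep "$2"
        $RUN service jail stop "$JAIL" || return 1
        i=$((i + 1))
    done
}

# Example: cycle_jail 100 2
```

After the reboot, the last line of the log shows which iteration was running when the panic hit. With `RUN=echo` the function only prints the commands, which is handy for checking the sequence on a machine that should not crash.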


----------



## PMc (Dec 7, 2020)

Continuing with:

```
for i in `count 1 100` ; do
  service jail start rail
  sleep 2
  service jail stop rail
done
```

This one does work:


```
exec.poststop = "
            sleep 2 ;
            /usr/sbin/ngctl shutdown ${ifname1l}: ;
            sleep 2 ;
        ";
```

This one does also work:


```
exec.poststop = "
            sleep 2 ;
            /usr/sbin/ngctl shutdown ${ifname1l}: ;
        ";
```

While this one does crash at the 3rd iteration:


```
exec.poststop = "
            /usr/sbin/ngctl shutdown ${ifname1l}: ;
            sleep 2 ;
        ";
```

And, by the way, this one does also NOT work: it does crash at 17th iteration:


```
exec.release = "
        /usr/sbin/ngctl shutdown ${ifname1l}:
    ";
```

`exec.release` is a new jail.conf option in Rel. 12. It did not exist in Rel. 11. And its rationale and documentation were obviously not met by the code.

Or probably, while putting these finer grained callbacks in, something got broken, and things have been made worse instead of better.


----------



## genneko (Dec 8, 2020)

I'm using vnet + (netgraph or epair) and have seen similar panics on a variety of FreeBSD versions and host environments.
Now I always use the workaround described in PR 238326 (removing the vnet interfaces from a jail in exec.prestop), and it seems to work very well.
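Applied to the `rail` config earlier in this thread, that workaround might look roughly like the fragment below. This is a sketch under assumptions, not something tested on 12.2 here: it assumes `exec.prestop` runs on the host while the jail still exists, and it uses `ifconfig ... -vnet` to take the interface back out of the jail before the vnet is destroyed.

```
rail {
        ...
        # Assumption: runs on the host before the jail is torn down;
        # reclaims the interface from the jail's vnet.
        exec.prestop = "/sbin/ifconfig $ifname1l -vnet $name";
        exec.poststop = "/usr/sbin/ngctl shutdown ${ifname1l}:";
}
```

The idea is that the interface is already back on the host when `vnet_destroy` runs, so the `vnet_if_return` path from the backtrace has nothing left to move.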


----------



## PMc (Dec 8, 2020)

Okay, so this is a known problem since 12.0. Thanks.
(As a German engineer I would have some reluctance to put 'workaround' and 'very well' into the same sentence. But never mind.)


----------

