# 13.1 failed to boot when built with LLVM 14 from ports



## neogeo (Oct 22, 2022)

I'm running a FreeBSD 13.1 amd64 build from stable/13 at changeset b63021e001d. uname:


```
FreeBSD xmin.cloud.thinkum.space 13.1-STABLE FreeBSD 13.1-STABLE #1 build/stable/13-n252436-b63021e001d: Sat Oct 22 08:47:32 PDT 2022     gimbal@xmin.cloud.thinkum.space:/usr/obj/xmin_FreeBSD-13.1-STABLE_amd64/usr/src/amd64.amd64/sys/XMIN amd64
```

I'd noticed some unusual errors when building ports, using ZFS in poudriere with no tmpfs. An example of the error, during a port build for llvm13:

```
===>   Returning to build of llvm13-13.0.1_3
===>   llvm13-13.0.1_3 depends on executable: ninja - not found
===>   Installing existing package /packages/All/ninja-1.11.1,2.pkg
[xmin.bld.cloud.thinkum.space] Installing ninja-1.11.1,2...
[xmin.bld.cloud.thinkum.space] `-- Installing python38-3.8.15...
[xmin.bld.cloud.thinkum.space] |   `-- Installing libffi-3.4.3...
[xmin.bld.cloud.thinkum.space] |   `-- Extracting libffi-3.4.3: .......... done
[xmin.bld.cloud.thinkum.space] |   `-- Installing mpdecimal-2.5.1...
[xmin.bld.cloud.thinkum.space] |   `-- Extracting mpdecimal-2.5.1: .......... done
[xmin.bld.cloud.thinkum.space] |   `-- Installing readline-8.1.2...
[xmin.bld.cloud.thinkum.space] |   `-- Extracting readline-8.1.2: .......... done
[xmin.bld.cloud.thinkum.space] `-- Extracting python38-3.8.15: ...
pkg-static: Fail to chown /usr/local/lib/python3.8/idlelib/idle_test/.pkgtemp.test_squeezer.py.bFIrNzt0LFou:Bad file descriptor
```

In a local ports build, this bad FD error had occurred for a total of 190 ports during one poudriere bulk run. This might be an example of a bug observed by y.freebsd@paritcher.com.

At the time, I was in fact running Cinnamon in an X session while poudriere ran. I'd also noticed that gvfsd-trash had more than 10,000 open file descriptors, visible via 'fstat -p'. I've since sent a patch for devel/gvfs to the ports section of FreeBSD Bugzilla, to make gvfsd-trash an optional feature in the port. After closing the gvfsd-trash process (SIGHUP) and then rebuilding, IIRC during the same boot session, I didn't notice the error again.
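As an aside, a quick way to count a process's open file descriptors on FreeBSD looks roughly like this (a sketch; the pgrep lookup assumes a single matching gvfsd-trash instance):

```shell
# Count open file descriptors for gvfsd-trash
# (fstat prints one line per descriptor, plus a header line)
fstat -p "$(pgrep -x gvfsd-trash)" | wc -l
```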

I later rebooted, and tried to finish the port build. The Bad FD error resurfaced. So, I rebuilt the kernel, using options recommended by y.freebsd:


```
makeoptions     BUILD_OPTIMIZED=NO
makeoptions     COPTFLAGS=-O0
```

The port build completed, running the newer kernel built from that same changeset.

Out of curiosity, I thought I'd try building FreeBSD 13.1 from stable/13 (at the latest changeset, 84b4709f38f) using LLVM 14 from ports. This machine uses ZFS on root, so bectl made it easy to install the upgrade. I rebooted to the new root filesystem, but it failed to boot during VFS mount.
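For reference, the bectl workflow was roughly like this (a sketch; the BE name and the install step are illustrative, not the exact commands used):

```shell
# Create a new boot environment from the current root, mount it,
# install the new build into it, then mark it for the next boot
bectl create 13.1-next
bectl mount 13.1-next /mnt
# install the new world/kernel into /mnt, e.g. make installworld DESTDIR=/mnt
bectl activate 13.1-next
shutdown -r now
```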

This new build used LLVM 14 from ports as XCC. On booting it, I was dropped to the loader to select a suitable root filesystem. I tried to select the filesystem from the earlier build, but the loader reported the same error, "Unknown file system". Both of these filesystems are on ZFS.
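For context, building with the ports toolchain can be done either via the toolchain descriptor that devel/llvm14 installs, or by pointing the cross-compiler variables at the ports clang directly. A sketch of both (flags and parallelism here are illustrative):

```shell
# Using the toolchain descriptor installed by devel/llvm14:
make -j8 buildworld buildkernel CROSS_TOOLCHAIN=llvm14

# Or, setting XCC and friends explicitly:
make -j8 buildworld buildkernel \
    XCC=/usr/local/bin/clang14 \
    XCXX=/usr/local/bin/clang++14 \
    XCPP=/usr/local/bin/clang-cpp14
```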


```
$ bectl list -a
BE/Dataset/Snapshot                               Active Mountpoint Space Created

13.1-STABLE
  mroot/ROOT/13.1-STABLE                          -      -          724K  2022-10-10 00:14
    mroot/ROOT/13.1-next-01@2022-10-22-08:52:52-0 -      -          0     2022-10-22 08:52

13.1-next-00
  mroot/ROOT/13.1-next-00                         NR     /          1.81M 2022-10-22 08:52
    mroot/ROOT/13.1-next-01@2022-10-22-10:43:49-0 -      -          0     2022-10-22 10:43

13.1-next-01
  mroot/ROOT/13.1-next-01                         -      -          16.9G 2022-10-22 10:43
```

The 13.1-next-01 build was built with LLVM 14 from ports, at the newer changeset. It was mounted successfully while booted from the 13.1-next-00 root filesystem, with the uname illustrated above.

When trying to boot from the 13.1-next-01 filesystem, it failed - seemingly unable to recognize the root filesystem on ZFS. I've rebooted successfully from the root filesystem mroot/ROOT/13.1-next-00 on the same ZFS pool.

Using a USB thumb drive with a memstick image from an earlier stable/13 build (built with LLVM 13 from the base system), I was able to boot and set the `bootfs` property on the root ZFS pool, so it boots from the original root filesystem, stable/13 built at changeset b63021e001d.
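The recovery itself was essentially just resetting the pool's `bootfs` property from the live USB environment (a sketch; the dataset name is taken from the bectl listing above):

```shell
# Import the pool without mounting any datasets, repoint bootfs
# at the known-good root dataset, then export before rebooting
zpool import -f -N mroot
zpool set bootfs=mroot/ROOT/13.1-STABLE mroot
zpool export mroot
```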

I'm assuming that the newer stable/13 source tree will probably build and run successfully if it's built with LLVM 13 from the base system, using the existing, bootable stable/13 build. I'll try to test this shortly.

I've rebooted to the earlier stable/13 build for now, with the newer kernel build (same changeset as the base system, kernel options added as above). I'll try to test out some more port builds; hopefully the bad FD error won't occur with this build.

I'm not certain why the build with LLVM 14 from ports would not boot. The same ZFS root filesystem is accessible with that earlier build, which was built with LLVM from the base system (CC and XCC).

Is it odd that a newer stable/13 build, built with LLVM 14 from ports, would not load the root (ZFS) filesystem?


----------



## neogeo (Oct 31, 2022)

Now running a FreeBSD 13.1 stable/13 build at changeset 84b4709f38f; it built fine when using CC and clang from the existing 13.1 base system. I'm still seeing the bad file descriptor errors during poudriere builds on ZFS builder jails, same as illustrated above, and I'm uncertain how to debug this. Another excerpt of a poudriere build log, illustrating part of the error:


```
===>   Returning to build of libical-3.0.8_7
===>   libical-3.0.8_7 depends on file: /usr/local/bin/cmake - not found
===>   Installing existing package /packages/All/cmake-core-3.24.0.pkg
[xmin.bld.cloud.thinkum.space] Installing cmake-core-3.24.0...
[xmin.bld.cloud.thinkum.space] `-- Installing curl-7.85.0...
[xmin.bld.cloud.thinkum.space] |   `-- Installing brotli-1.0.9,1...
[xmin.bld.cloud.thinkum.space] |   `-- Extracting brotli-1.0.9,1: .......... done
[xmin.bld.cloud.thinkum.space] |   `-- Installing c-ares-1.18.1_1...
[xmin.bld.cloud.thinkum.space] |   `-- Extracting c-ares-1.18.1_1: .......... done
[xmin.bld.cloud.thinkum.space] |   `-- Installing ca_root_nss-3.83...
[xmin.bld.cloud.thinkum.space] |   `-- Extracting ca_root_nss-3.83: .......... done
[xmin.bld.cloud.thinkum.space] |   `-- Installing libnghttp2-1.48.0...
[sic]
[xmin.bld.cloud.thinkum.space] `-- Extracting libarchive-3.6.1,1: .......... done
[xmin.bld.cloud.thinkum.space] `-- Installing libuv-1.44.2...
[xmin.bld.cloud.thinkum.space] `-- Extracting libuv-1.44.2: .......... done
[xmin.bld.cloud.thinkum.space] `-- Installing rhash-1.4.3...
[xmin.bld.cloud.thinkum.space] `-- Extracting rhash-1.4.3: .........
pkg-static: Fail to chown /usr/local/share/doc/rhash/.pkgtemp.README.md.bacbhG0QOdJJ:Bad file descriptor
```

Maybe the odd thing with the boot failure (at VFS mount) when it was built with LLVM 14 is unrelated?

The internal drive is an NVMe stick; it's a Minisforum HX90 machine with an AMD Ryzen 9 5900HX CPU. I've not noticed errors with ZFS outside of the bad file descriptors (and subsequent build failures) when installing build-depends in some poudriere builder filesystems. It seems reproducible, though I'm not certain how to isolate the exact cause.

I'm running poudriere with 16 builders in parallel, and most of the builds are completing. The workaround appears to be to rebuild the same ports list repeatedly, until everything builds without bad file descriptor errors in build-depends.
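That rebuild-until-clean workaround can be scripted, assuming poudriere bulk exits non-zero when any port fails (the jail, ports tree, and list names here are hypothetical):

```shell
# Re-run the same bulk build until it completes without failures
while ! poudriere bulk -j 13amd64 -p default -f ~/ports.list; do
    echo "build failed (possibly the bad-FD error), retrying..."
done
```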

This is with poudriere using tmpfs only for wrkdirs and cache, and ZFS for the builder base filesystems and localbase. I'd tried building with tmpfs enabled for the entire builder filesystems, but it bogged this machine down quite a lot. Using ZFS for the rest seems quicker, even with the resulting build failures.

The containing zpool is about a month old, on an NVMe stick. I hope it's not storage decay yet.

I might try out a build from the FreeBSD main branch, here and on a server too. Once the ports are built for it, I guess it's not tough to roll back with bectl.


----------



## neogeo (Oct 31, 2022)

After setting checksum=sha512 in a location where it would be inherited by the ZFS datasets created for the poudriere builders, I'm not seeing the bad FD error now.

With ZPOOL=mroot and ZROOTFS=/opt/poudriere in my poudriere.conf, I've set checksum=sha512 like so:


```
# zfs set checksum=sha512 mroot/opt/poudriere/data/.m  mroot/opt/poudriere/basefs
```

The 'basefs' dataset seems to be where it would actually be used. The builder jail filesystems are created under that dataset, given a certain poudriere config. Those filesystems then inherit the checksum=sha512 setting, and it seems to work out now.
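To check that the setting actually propagates, zfs get shows whether each builder dataset has the value set locally or inherited (dataset path as in my poudriere.conf above):

```shell
# The SOURCE column shows "local" where the property was set,
# and "inherited from ..." on the child datasets beneath it
zfs get -r -t filesystem checksum mroot/opt/poudriere/basefs
```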

These datasets were originally using the default fletcher4. After a short check, I'd also seen the bad FD errors when using skein in this zpool.

119 ports built without a bad FD error; I may have finally found a workaround that works here. I'm not sure whether the bad FD errors should be reported as a bug.

*Update:* no such luck. Still seeing the errors here:

```
[xmin.bld.cloud.thinkum.space] |   |   |   | `-- Extracting llvm14-14.0.6: ..
pkg-static: Fail to chmod /usr/local/llvm14/include/clang/Lex/.pkgtemp.Token.h.EcjcEhd6Y4Xz:Bad file descriptor
[xmin.bld.cloud.thinkum.space] |   |   |   | `-- Extracting llvm14-14.0.6... done
```

It was just a guess that building the system with LLVM 14 might somehow work around it; that did not work out, though.

This may be similar to the bad file descriptor errors seen by Oclair: random Bad file descriptor errors building ports on a VPS from DigitalOcean.

For lack of an idea of how else to approach this, I'll try `checksum=sha256` next ....

... Tried that, no change though. Maybe there has to be a certain amount of activity in how poudriere uses ZFS before the bug surfaces, at least under some configurations.


----------

