# random program crashes, no coredumps, and error 94



## PMc (Feb 27, 2022)

I upgraded one of my nodes from 12.3 to stable/13.

Then I started to get random program crashes from the backup tool sysutils/bareos-client, always after some 10-20 hours of runtime.
I hacked the backup tool to actually produce a core dump (they had the great idea to catch SIGSEGV & friends, for whatever crap might work on Linux, but not here), and set kern.sugid_coredump=1.

Now the crashes are either SIGSEGV or SIGBUS, and they come from arbitrary places in the code. Oh crap.

I looked into the images, and it appears that places in memory (e.g. memory addresses) are overwritten with the string "oSoC" (it seems to always be that string, but at different locations).

I had another look:

```
# strings bareos-fd.0.0.core
FreeBSD
bareos-fd
/usr/local/sbin/bareos-fd -u root -g wheel -v -c /usr/local/etc/bareos
FreeBSD
CoSo
CoSo
+oSo
FreeBSD
```

Now, is that as it should be, or something else? (not sure if it's a stable/13 issue or an issue with the program)

I tried to get some sane core dumps to compare, from 12.3 and stable/13, in the usual way (`kill -SEGV`). And I found I cannot! Instead, I get this:


```
kernel: pid 53977 (sleep), uid (0):  Path `/var/tmp/sleep.0.0.core' failed on initial open test, error = 94
kernel: pid 53977 (sleep), jid 0, uid 0: exited on signal 4
```

Doesn't work in either 12.3 or stable/13, doesn't work in single user, doesn't work in GENERIC kernel.

Error 94 says "Not permitted in capability mode", and this leads to capsicum(4), which doesn't give much of a clue.

Now I am wondering: why is the sleep command in single-user mode running in capability mode?

I finally found a `sysctl` switch, kern.capmode_coredump, that works around the issue, but it doesn't seem to be documented anywhere (not even the commit log gives much insight).

Finally, those oSoC (or CoSo) strings seem not to belong there, so this might be a rogue pointer spamming the memory space. Love it.

Posting this in case anybody else gets error 94.


----------



## ralphbsz (Feb 28, 2022)

It does smell like a random memory overwrite.

On a little-endian machine (I assume this is Intel?), the string CoSo is 0x6f536f43, or about 1.9 billion, which doesn't seem plausible as an integer, and isn't even very likely half of a 64-bit pointer.


----------



## PMc (Mar 2, 2022)

ralphbsz said:


> It does smell like a random memory overwrite.
> 
> On a little-endian machine (I assume this is Intel?), the string CoSo is 0x6f536f43, or about 1.9 billion, which doesn't seem plausible as an integer, and isn't even very likely half of a 64-bit pointer.


No, it's certainly not an address:


```
# lldb -c /tmp/bareos-fd.0.0.core /usr/local/sbin/bareos-fd
(lldb) target create "/usr/local/sbin/bareos-fd" --core "/tmp/bareos-fd.0.0.core"
Core file '/tmp/bareos-fd.0.0.core' (x86_64) was loaded.

(lldb) bt
* thread #1, name = 'bareos-fd', stop reason = signal SIGSEGV
  * frame #0: 0x00000008002f66e2 libbareos.so.20`alist::get(this=0x0000000801188228, index=0) at alist.cc:132:10
    frame #1: 0x000000080029788a libbareosfind.so.20`AcceptFile(ff=0x000000080117e000) at find.cc:347:39

(lldb) var
(alist *) this = 0x0000000801188228
(int) index = 0

(lldb) print *this
(alist) $0 = {
  items = 0x0000000000000000
  num_items = 1867738957
  max_items = 0
  num_grow = 1
  cur_item = 0
  own_items = true
}
```

1867738957 = 0x6f536f4d = MoSo


----------



## Jose (Mar 2, 2022)

PMc said:


> Now the crashes are either SIGSEGV or SIGBUS, and they come from arbitrary places in the code. Oh crap.


The one time I had this problem, it was a stack overflow. A colleague had "fixed" a size problem by increasing the size of an array allocated on the stack by 10x. Worked fine on Solaris. Crashed and burned in random places on Windows.


----------



## PMc (Mar 2, 2022)

Jose said:


> The one time I had this problem, it was a stack overflow. A colleague had "fixed" a size problem by increasing the size of an array allocated on the stack by 10x. Worked fine on Solaris. Crashed and burned in random places on Windows.



It's not the first strange effect there. Two releases back, it happened to run for a few days and then, suddenly and silently, encrypt the backup data in a wrong way. It uses X.509 certificates for encryption and signing, and it became obvious that one half of the public/private keypair had strangely changed in the midst of operation.
I couldn't figure out what might be going on, and then the next release worked well again. But now I'm getting an idea that this might be related.

It seems to happen only under specific conditions - and I am doing rather unusual things also (like running a limited backup every 10 minutes as a kind of "undelete" feature) - but here I have no idea how this could be isolated. One would need to record every memory access, during hours of runtime...


----------



## ralphbsz (Mar 2, 2022)

Debugging random memory overwrites is very, very hard. Personally, I know of two techniques that have worked for me. One is boring and brute force: go through the code, and change it to make sure no pointer variable is used before being initialized (Resource Acquisition Is Initialization, named after the Italian radio station), and to make sure any pointer variables are immediately set to NULL when going out of scope. The second one is more amusing, but less likely to succeed: add lots of honeypots to the memory. For example, add to every struct an extra word which is the checksum of the struct. Update the checksum after every modification of the struct, and check it before every use. Or poison memory before use and after freeing with 0xdeadbeef, and check for those patterns when using freshly allocated memory. The more honeypots you add, the more likely it is that you catch the offender in the act.


----------



## PMc (Mar 2, 2022)

Thank You for confirming, ralphbsz. This is more or less what I already imagined: one way or the other it boils down to: _read and understand the entire code_.


----------



## PMc (Mar 2, 2022)

... or maybe not ...

The node in question, which I upgraded to stable/13, is called Moon. The backup client configuration was copied from another node, called Conn. The backup client configuration file did still carry the name Conn, until I recognized that and changed it to Moon.
At that same time, the bogus memory content changed from CoSo to MoSo.

Searching for something similar, from where the other two letters might come, put me on a track. This line defines two variables ...








bareos/message.cc at 674a61a742de868a441b3395ac68777e8acda6b3 · bareos/bareos (github.com)



... and it seems the bogus content is the first two bytes from each of them. But I am not a native speaker of C++. I don't really understand it.


----------



## ralphbsz (Mar 2, 2022)

PMc said:


> The backup client configuration was copied from another node, called Conn. The backup client configuration file did still carry the name Conn, until I recognized that and changed it to Moon.


That's the kind of smoking gun that makes debugging these things more practical. At this point, you only need to look at the code that uses the names.

Alas ...


> But I am not a native speaker of C++. I don't really understand it.


I looked a little bit. The code is "well written", in the sense of being clear and verbose; alas, there is a great lack of comments. And it uses all its own libraries; the places where the names are touched are in functions named Bvsnprintf() and Mmsg(). So to debug this, one has to first dig down a few levels to see what these functions do. And the problem seems to be in the area of string handling (in particular, there are variable-length printf-style specifications there), which is a notorious source of memory bugs in C and C++. It would take me many hours to dig into this, and I'm not volunteering.

Suggestion: Find out who maintains this thing, and contact them directly. You have a lot of information now, it might be easy for them.


----------



## PMc (Mar 2, 2022)

ralphbsz said:


> That's the kind of smoking gun that makes debugging these things more practical. At this point, you only need to look at the code that uses the names.
> 
> Alas ...
> 
> I looked a little bit. The code is "well written", in the sense of being clear and verbose; alas, there is a great lack of comments. And it uses all its own libraries; the places where the names are touched are in functions named Bvsnprintf() and Mmsg(). So to debug this, one has to first dig down a few levels to see what these functions do. And the problem seems to be in the area of string handling (in particular, there are variable-length printf-style specifications there), which is a notorious source of memory bugs in C and C++. It would take me many hours to dig into this, and I'm not volunteering.


I understand that.  

The important task of this specific function is to print error messages to stderr, which is /dev/null. And it emits error messages, probably because my site can do IPv4+6, but this KVM provider does not give me an IPv6 address. So there is an error 47 for each backup job, before the other address then does bind().

What they are doing with these Mmsg() etc. is maintaining their own malloc pools.
Then there is a comment somewhere in the code saying that they removed these pools because the OS can nowadays do it just as well. But apparently the pools aren't fully removed.

Anyway, these error 47 messages are appended _one behind the other_ in memory, and as there is a job run every 10 minutes, this gets lengthy. And apparently then something overruns. I don't yet know if these memory pools are broken by design, broken by technical debt, or were never supposed to cope with dozens of messages. Anyway, I have now removed that elaborate task of printing those messages to /dev/null. Let's see what happens.



ralphbsz said:


> Suggestion: Find out who maintains this thing, and contact them directly. You have a lot of information now, it might be easy for them.



Yeah, I know who maintains the thing. There are two rivalling gangs, the bacula gang and the bareos gang, and one is said to have stolen the code from the other, and both of them are primarily interested in securing their territories and making money.
I already engaged in one dispute about the definition of the word "bug". It ended up with paperwork that I should sign in order to be allowed to find bugs, but I would have needed an international industrial property attorney to even understand what I was supposed to sign there, so I basically told them to go to hell instead.

On the bright side, I got it all to compile with full debugging symbols (including the libs), so it should now be possible to run it interactively in the debugger, and watch what the code actually does...


----------



## _martin (Mar 2, 2022)

Looking at the dump (using the src from ports), you crashed at alist.cc:132:10:

```
return items[index];
```
in `alist::get()`. From the output you provided, items is set to NULL and index to 0. The segmentation fault is on `items[0]`, i.e. address NULL.
It seems the node has stale data: while `items` is uninitialized, `num_items` is set to some arbitrary value.

Looking at the code, there are several places where malloc() or realloc() is called but no check on its success is made. Some functions do check if items is NULL; some rely on the num_items value (such as this get function). Using a signed int for tracking size can cause some pain too.


----------



## PMc (Mar 3, 2022)

_martin said:


> Looking at the dump (using src from ports) you crashed on  alist.cc:132:10:


It crashes at lots of different places, this is only one example.


_martin said:


> `num_items` is set to some arbitrary value.


It is an ASCII string. 
And it doesn't belong there.



_martin said:


> Looking at the code there are several places where malloc() or realloc() is called but no check on its success is made.


Oh. Hm, why am I not surprised?



_martin said:


> Some functions do check if items is NULL, some rely on num_items value (such as this get function).


Yes. Some things do it this way, some do it the other way, and many do it somehow redundantly, so that it usually works even when it is erroneous.



_martin said:


> Using signed int for tracking size can cause some pain too.


Yeah. I once tried to create a "bug of the week" radio show. I think I had enough material for a few years.

But put it that way: say you want a really versatile semi-enterprise-scale backup solution with an abundance of features. Say you don't want to pay anything for it. Say you want good coding quality.
Choose two, you can't have it all.


----------



## _martin (Mar 3, 2022)

PMc said:


> It is an ASCII string.
> And it doesn't belong there.


I'm not sure what you mean by this. `num_items` is defined as int. If you set it to 26739, i.e. 0x6873, the memory where it's stored would show you "sh" if treated as a string. But that doesn't mean it's a string. There are plenty of numbers in the range of int that could be interpreted as a valid string.

items being NULL while get() was called on it would be my first choice to look at. If you have crashes at other places, most likely they have some common denominator.

Yeah, if that's good enough for what you need I have no argument against it.


----------



## PMc (Mar 3, 2022)

_martin said:


> I'm not sure what you mean by this. `num_items` is defined as int.


Yes, it is defined as int, and it has value 1867738957 - which is an int, but which is a bogus value in the application's context.

Then, if you consider that an int is a 4-byte value (usually), and if you try and read these 4 bytes as a 4-byte string, you get 4 characters. And I think I know from where these 4 characters were copied to this place (erroneously, that is). 



_martin said:


> If you set it to 26739, or 0x6873, memory where it's stored would show you "sh" if treated as string. But that doesn't mean it's a string. There are plenty of numbers in range of int that could be interpreted as valid string.


Yes. But as I find this very same sequence of 4 characters at various places (here as an int, somewhere else as a memory address), and wherever it appears it consistently results in a program crash, I think there is some relation to that.

But we will see - if I am right, there should be no further program crashes now.


----------



## ralphbsz (Mar 3, 2022)

This sounds like so much fun! Clearly, you have it as much under control as can be.



PMc said:


> There are two rivalling gangs, the bacula gang and the bareos gang, and one is said to have stolen the code from the other, and both of them are primarily interested in securing their territories and making money.


They don't sing and dance while doing that? It would really add some flair. Think "West Side Story".


----------



## _martin (Mar 3, 2022)

PMc said:


> as I find this very same sequence of 4 characters at various places (here as an int, somewhere else as a memory address), and wherever it appears it consistently results in a program crash, I think there is some relation to that.


Quite possibly, yes. You do see the bigger picture with the dump, after all.

Did you have these issues on 12 too or did they start to occur on 13 only?


----------



## PMc (Mar 3, 2022)

_martin said:


> Quite possibly, yes. You do see the bigger picture with the dump afterall.
> 
> Did you have these issues on 12 too or did they start to occur on 13 only?


Yes, 13 only. 
I still have the machine's image for 12.3 - I might reinstall that and compare, but only after I know a little bit more about what actually happens.

For now I have verified that, indeed, because the server is dual-stack, we get an error at the first bind(). This is the code where we connect to the server:








bareos/bsock_tcp.cc at 6310c9b5a256fa10c386427e3497c15072f4c950 · bareos/bareos (github.com)



Pmsg2() prints an error message to stdout (which is /dev/null when invoked from rc.d).

When removing that Pmsg2(), no more crashes seem to appear - but with things happening only after 10-30 hours, this doesn't say much. 
So I went the other way round, and repeated that Pmsg2() invocation a hundred times, recompiled, and installed. Now things crash after half an hour, and memory looks like this:


```
* thread #1, name = 'bareos-fd', stop reason = signal SIGBUS
  * frame #0: 0x00000008002f6700 libbareos.so.20`alist::destroy(this=0x00000008011892c8) at alist.cc:141:14
(lldb) var
(alist *) this = 0x00000008011892c8
(int) i = 0
(lldb) print *this
(alist) $0 = {
  items = 0x6f536f4d6f536f4d
  num_items = 1867738957
  max_items = 1867738957
  num_grow = 1867738957
  cur_item = 1867738957
  own_items = true
}
```

That rocks! 
Also with strings:


```
# strings bareos-fd.0.0.core
ekly.l
h_MoSo
MoSo
oMoSo@
MoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSo
MoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoIoSoMoSoMoSo
MoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSo
ipsec
```


----------



## _martin (Mar 3, 2022)

That smells like some sort of use-after-free (UaF) bug.
Reading the bareos-client code is like a punishment. Those guys are not even sticking to the same indentation.
But then, if the client is the same (assuming you are using the 1:1 same binary as on 12), what changed in 13 that is significant to this issue? That's a rhetorical question.


----------



## PMc (Mar 3, 2022)

_martin said:


> That smells like some sort of use-after-free (UaF) bug.
> Reading the bareos-client code is like a punishment.


You're welcome. I usually try to avoid that - fix things w/o reading the code...



_martin said:


> But then, if the client is the same (assuming you are using 1:1 same binary as on 12),


I did. Then it crashed, then I recompiled the port under 13, and that didn't change anything.



_martin said:


> what changed in 13 that is significant to this issue? That's a rhetorical question.


Memory allocation, whatever it might be. It's enough if these bogus writes go to some other place where they are not computationally used (e.g. when they instead overwrite the backup payload, nothing will crash, at least not until after a restore *eg*). And yes, this is a good question.


----------



## PMc (Mar 5, 2022)

This is getting more entertaining. I found the offending line in the code, it is here:









bareos/message.cc at f07e6e9180535dc2d5f27c7ee2bc69704cb38f62 · bareos/bareos (github.com)





```
fputs(buf, stdout);
fflush(stdout);
```

I didn't understand it, so I changed it to get me the return code and errno:
The fflush() fails at the first try. The fputs() also starts failing after writing a total of some 4350 bytes.
These 4350 bytes are what I saw in memory with `strings`, in the core file or in /proc/PID/mem.

The errno is EBADF.

And apparently, when repeating these erroneous calls a couple of times, arbitrary memory gets overwritten.

stdout is present and points to /dev/null, which is writable:


```
[root@moon /tmp]# lsof -p 18050
COMMAND     PID USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
bareos-fd 18050 root    0r  VCHR               0,13      0t0   13 /dev/null
bareos-fd 18050 root    1r  VCHR               0,13      0t0   13 /dev/null
bareos-fd 18050 root    2r  VCHR               0,13      0t0   13 /dev/null
bareos-fd 18050 root    3u  IPv4 0xfffffe0086130870      0t0  TCP moon-
```


----------



## covacat (Mar 5, 2022)

maybe stdout is already borked and points to something else than a FILE struct


----------



## PMc (Mar 5, 2022)

covacat said:


> maybe stdout is already borked and point to something else than a FILE struct


Don't think so: when running the program in a terminal with default stdout, the output appears normally, and there are no errors. So it rather looks like libc does not consider /dev/null a file.


----------



## covacat (Mar 5, 2022)

libc does not know about /dev/null; it knows about file descriptor 1, which may go anywhere
EBADF says that stdout->_file is an invalid handle
i wonder if stdout->_file is 1 or something else when it fails


----------



## _martin (Mar 5, 2022)

Are you able to set a breakpoint at that location and check &buf?
Can you dump the FILE* structure too? The fd is defined in FILE._file (stdio.h).


----------



## PMc (Mar 5, 2022)

covacat said:


> i wonder if stdout->_file is 1 or something else when it fails


fileno(stdout) is 1.


----------



## PMc (Mar 5, 2022)

_martin said:


> Are you able to set breakpoint in that location and check &buf ?


Yes, it's fine.


_martin said:


> Can you dump the FILE* structure too? The fd is defined in FILE.file (stdio.h).


How should I set a breakpoint via /dev/null?
To set breakpoints, I must run the program in the foreground, and then stdout is /dev/tty and everything works fine.


----------



## PMc (Mar 5, 2022)

covacat said:


> libc does not know about /dev/null it knows about file descriptor 1 which may go anywhere


It goes to /dev/null, see lsof output above.

When running in foreground, it looks like this, and there are no errors:

```
# lsof -p 24185
COMMAND     PID USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
bareos-fd 24185 root    0u  VCHR               0,91 0t922928   91 /dev/pts/1
bareos-fd 24185 root    1u  VCHR               0,91 0t922928   91 /dev/pts/1
bareos-fd 24185 root    2u  VCHR               0,91 0t922928   91 /dev/pts/1
bareos-fd 24185 root    3u  IPv4 0xfffffe008612d950      0t0  TCP moon
```

When I remove the fputs/fflush from the code, there are no failures.
When I step through the fputs/fflush, they execute code in libc and libthr.



> EBADF says that stdout->_file is an invalid handle



And how might this result in libc taking the first two bytes of the output string and writing it to some bogus memory location?


----------



## _martin (Mar 5, 2022)

I don't use lldb, I use gdb. I'm assuming they are similar enough. I asked you to set a breakpoint on the line of code you shared (the fflush), not via "/dev/null".
You can do it in two ways. Either run the program under the debugger, set the breakpoint, and continue, or run the program, attach the debugger (gdb -p), set the breakpoint, and let it continue. I'd prefer the latter option.
If you have compiled it via ports you could enable debug mode (CFLAGS+=-g) and set it on the line of code; if you're debugging a binary without debug symbols, you need to find which instruction is calling that fflush and set a breakpoint on that.

FILE* structure has fd assigned to it or -1 if none is used. It would be interesting to see if that FILE* structure has sane values. As covacat mentioned it may be that the FILE* structure is already corrupted, not causing the corruption (or in other words it's a victim of a bug, not a bug).

FILE* (and hence printf,scanf & friends) does use buffers (allocated on heap). It could be that this part of the code rubs the bug the correct way and gets triggered. 

That EBADF could be that stdout is set to -1, i.e. not used. This is not a bug necessarily. Consider this example of code:

```
close(1);
..
..
fprintf(stdout, "hello world\n");
```
Technically there's nothing wrong with this code. But printing to stdout in the terminal will most likely (stdout can be redirected elsewhere, that's why I said most likely) end up in error EBADF. There may be logic in the code where 0,1,2 is either redirected to socket or closed completely. Hence the error. 

But I keep asking myself -- what has changed in FreeBSD that it's being triggered now.


----------



## PMc (Mar 5, 2022)

_martin said:


> I don't use lldb, I use gdb. I'm assuming they are similar enough. I asked to set breakpoint on the line of code you shared (fflush), not via "/dev/null".


Yes, but somehow you must "ask to set breakpoint" - usually by typing the respective debugger command into a terminal via stdin/out. And you cannot do this through /dev/null, and the error only happens when stdin/out is /dev/null.



_martin said:


> But I keep asking myself -- what has changed in FreeBSD that it's being triggered now.



Just figured that one out. Hold on...


----------



## _martin (Mar 5, 2022)

We don't understand each other. You run the bareos client (or whatever that binary is) as usual. Then, from another terminal, you attach to that running command with gdb -p $pid, where $pid is the pid of that bareos client. Then you can set the breakpoint. And you continue to debug it "live".


----------



## PMc (Mar 5, 2022)

covacat said:


> EBADF says that stdout->_file is an invalid handle


Apparently it is an invalid handle only for fflush(), not for fputs(), because fputs() writes the first 4350 bytes successfully, and only fflush() fails from the beginning.

So I think this cannot be it. But apparently it has something to do with buffering.
So I tried what happens when I do
`fclose(stdout);`
and then try to write something out -> that fails right away with EBADF.

And what happens when I do
`close(1);`
and then try to write something to stdout? Now there are differences:



| Version | stdout | Behavior |
| --- | --- | --- |
| 12.3 | /dev/tty | fputs() fails EBADF, fflush() works |
| 12.3 | /dev/null | fputs() works, fflush() fails EBADF |
| 13 | /dev/tty | fputs() fails EBADF, fflush() fails EBADF |
| 13 | /dev/null | fputs() works for ~4300 bytes, then fails EBADF; fflush() always fails EBADF |


----------



## PMc (Mar 5, 2022)

_martin said:


> We don't understand each other. You run the bareos-client or whatever that binary is as usual. Then, from other terminal, you attach to that running command with the gdb -p $pid where $pid is the pid of that bareos client. Then you can set the breakpoint. And you continue to debug it "live".


Sorry, didn't know one can do that.


----------



## PMc (Mar 5, 2022)

It seems this problem is solved. (If you look carefully enough, you could see the nature of the problem already in the quotes above.)









bareos/daemon.cc at f07e6e9180535dc2d5f27c7ee2bc69704cb38f62 · bareos/bareos (github.com)




Now ain't this gorgeous????

Here we close all the stdio handles, and then we open them again, and we open all of them as O_RDONLY. (And one can see this in the lsof output above: it shows the stdout and stderr handles as 1r and 2r.)

This is precisely why I love this code so much. It does lots of superfluous things, and it mostly does them wrong.

But then, there is also a weakness in libc. One probably should not modify the lower-level close()/dup() file handles while also using the upper-level stdio functions. But then, if one tries to write onto the upper level while at the same time having the lower level set read-only, this probably should not result in gross memory corruption.


----------



## covacat (Mar 5, 2022)

PMc said:


> | Version | stdout | Behavior |
> | --- | --- | --- |
> | 12.3 | /dev/tty | fputs() fails EBADF, fflush() works |
> | 12.3 | /dev/null | fputs() works, fflush() fails EBADF |
> | 13 | /dev/tty | fputs() fails EBADF, fflush() fails EBADF |
> | 13 | /dev/null | fputs() works for ~4300 bytes, then fails EBADF; fflush() always fails EBADF |


this shows a different initial buffering (see setvbuf)
if the FILE is line buffered or unbuffered, fflush has nothing to do, because the buffer is always empty (after a fputs)
if the FILE is fully buffered, fputs may only write to memory and never touch the file descriptor
so the failure always occurs at _swrite


----------



## PMc (Mar 5, 2022)

covacat said:


> this shows a different initial buffering (see setvbuf)
> if the FILE is line buffered or unbuffered, fflush has nothing to do, because the buffer is always empty (after a fputs)
> if the FILE is fully buffered, fputs may only write to memory and never touch the file descriptor
> so the failure always occurs at _swrite


Maybe. I tried to somehow reproduce the behaviour with setbuf, but wasn't successful. Maybe I didn't try hard enough; anyway, we can say for certain that there is a difference between 12.3 and stable/13 (as of 2 weeks ago, because I don't update base while hunting a bug, so this may be related to some development work also). But, honestly, this is rather my least concern.

What is of concern to me is, on one hand, the fantastic coding quality as shown above, where I don't know what else is lingering there, ready to stab me in the back. And on the other hand, these memory overwrites, which are probably not well explainable by different buffering behaviour (and probably also not by ongoing development work, but then, one should check the commit logs).
And the third issue is the original question of error 94, i.e. what kind of things happen in relation to capsicum, and where that might be documented.


----------



## PMc (Mar 5, 2022)

Please try reproducing:


```
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>

int main(void) {
    char buf[] = "12345678901234567890123456789012345678901234567890";
    int fd = open("/dev/null", O_RDONLY);
    int i = 0;

    close(1);
    dup2(fd, 1);
    close(fd);

    while(1) {
      fputs(buf, stdout);
      fflush(stdout);
      i++;
      fprintf(stderr, "%d\n", i);
    }
}
```

Here it crashes after 135962 iterations. (stable/13 @ 22ba2970766 )


----------



## _martin (Mar 5, 2022)

Damn, I had too many beers to concentrate now.  13.0-RELEASE (releng/13.0-n244733-ea31abc261f) ok, 14-current - crash after 44781.

Quick check in gdb:

```
root@fbsdc:(/tmp/forums)$ gdb test test.core
(gdb)
..
#0  0x000000082232d82d in memcpy () from /lib/libc.so.7
=> 0x000000082232d82d <memcpy+173>:    48 89 17    mov    QWORD PTR [rdi],rdx
(gdb)

(gdb) i r $rdi
rdi            0x824f07ffa         34979479546
(gdb)

(gdb) x/3i $pc
=> 0x82232d82d <memcpy+173>:    mov    QWORD PTR [rdi],rdx
   0x82232d830 <memcpy+176>:    mov    QWORD PTR [rdi+rcx*1-0x8],r8
   0x82232d835 <memcpy+181>:    ret
(gdb)

(gdb) i r $rdi
rdi            0x824f07ffa         34979479546
(gdb)

(gdb) ip
         0x824d08000        0x824f08000   0x200000        0x0  rw- ----
(gdb)

(gdb) bt
#0  0x000000082232d82d in memcpy () from /lib/libc.so.7
#1  0x00000008222ed03f in ?? () from /lib/libc.so.7
#2  0x00000008222eb5cd in fputs () from /lib/libc.so.7
#3  0x0000000000201b54 in main () at test.c:15
(gdb)
```
So the SIGSEGV is due to memcpy overstepping into unmapped memory: `0x824f07ffa + 8 = 0x824F08002`, which is unmapped. We should check /usr/src/lib/libc and see what changed there within the 13 branch (my 13.0 is working just fine).

edit: btw fflush() is not needed.


----------



## PMc (Mar 5, 2022)

_martin said:


> Damn, I had too many beers to concentrate now.


Hey, didn't want to disturb Your weekend!



_martin said:


> 13.0-RELEASE (releng/13.0-n244733-ea31abc261f) ok, 14-current - crash after 44781.


Thanks, that's valuable (so it's not one of my local patches or such). 

Your crash location is mostly same as here.

I have a stable/13 at fa3cc60e6dc without a crash, but that's a generic build for the default CPU,
so I wasn't sure yet. But if this is not the CPUTYPE, and only libc, then there is now not
much delta:

fa3cc60e6dc   Jan 8 10:22:08 2022
22ba2970766   Feb 10 16:11:22 2022

What do You think about these:
ec2db06d0db22ae11c1b5414446e3aecd71a93e3
afa9a1f5ec9974793a8744c55036ef5c4d08903d


----------



## _martin (Mar 5, 2022)

It's ok, I do like these types of problems. I may not be able to help much tonight, though.
Both seem to be about fflush; we can reproduce it without it (so I didn't check further).

I did spin up 13.0-RELEASE-p7 VM and I can't reproduce it there either.


----------



## PMc (Mar 6, 2022)

_martin said:


> It's ok, I do like these types of problems. I may not be able to help much tonight, though.
> Both seem to be about fflush; we can reproduce it without it (so I didn't check further).


I suppose the fputs() will internally call fflush() (or equivalent) when the buffer is full.



_martin said:


> I did spin up 13.0-RELEASE-p7 VM and I can't reproduce it there either.


It came with afa9a1f5ec9974793a8744c55036ef5c4d08903d into stable/13.


----------



## covacat (Mar 6, 2022)

this is causing it








__sflush(): on write error, if nothing was written, reset FILE state back · freebsd/freebsd-src@afa9a1f (github.com)
PR: 76398 (cherry picked from commit 86a16ada1ea608408cec370171d9f59353e97c77)



i built a 13.0-R libc with that file replaced, and it bombs when fflush is called internally
if you disable buffering with setvbuf, it works
it always bombs at (size of vbuf / size of string output), so when the buffer fills, it bombs
a vbuf of 16k and a string of 50 causes it to bomb at 328
the explicit call to fflush is not needed (like _martin said)


----------



## covacat (Mar 6, 2022)

i suppose that somehow the code in fvwrite.c (fputs) is not aware of this 'reset', and that is causing the bomb


----------



## PMc (Mar 6, 2022)

Folks, 

seems we did it! We improved quality! Thank You all! 





src - FreeBSD source tree (cgit.freebsd.org)


----------



## _martin (Mar 6, 2022)

Nice. That's why I like this type of threads here.


----------



## covacat (Mar 10, 2022)

has anybody filed a PR?
because this is/will be in 13.1


----------



## Erichans (Mar 10, 2022)

I can't find a PR related to PMc's message #43. Going by the commit to -CURRENT (2022-03-06 15:29:51) in




src - FreeBSD source tree (cgit.freebsd.org)
				



and comparing with:
stable/13/lib/libc/stdio/fvwrite.c L135-L142
stable/13/lib/libc/stdio/fvwrite.c L176-L182
releng/13.1/lib/libc/stdio/fvwrite.c L135-L142
releng/13.1/lib/libc/stdio/fvwrite.c L176-L182

At the moment it is not in 13-STABLE (as the precursor of 13.1-RELEASE) and not in the just branched-off releng/13.1 (per the commit "13.1: create releng/13.1 branch" as of 2022-03-10 00:10:32).


----------



## covacat (Mar 10, 2022)

the problem is introduced by the fflush commit of Feb 1


----------



## PMc (Mar 10, 2022)

The PR is 76398, which is not updated, but not closed yet either. Anyway, this here is good enough for me:





Re: Program crashes on stable/13 (but not on 12.3) (lists.freebsd.org)


----------

