# EQ overflow FreeBSD 12.0 + nVidia 1660Ti with 430 driver + Ryzen 2400G



## Rob S (Jun 29, 2019)

Hello.

I am getting EQ overflow errors in my Xorg log using the nVidia 430 driver. These correspond to my system freezing up. The mouse and keyboard work for about 10 seconds before this happens. However, I am able to get a stable display with the VESA driver.

* I'm running FreeBSD 12.0 with the GENERIC kernel (also tried a custom one, without option VESA). I have switched off the on-board graphics in my ASRock AB350 BIOS. My primary display is on an nVidia 1660Ti over HDMI. I was using a KVM switch (Belkin Flip) but have since plugged both mouse and keyboard directly into the PC, and this does not solve the problem.

* The 430 driver was installed from a patched version of nvidia-driver in the /usr/ports tree. The patch was taken from this bug report page: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=232645

* I'm running nvidia_modeset and nvidia kernel modules (specified in /boot/loader.conf) and dbus and hald (specified in rc.conf). Also, I am using the xorg.conf generated automatically by nvidia-xconfig (430). I'm using vga textmode (in loader.conf) but it doesn't make any difference with/without.

Can someone offer advice on how to resolve this, please? I would be grateful.

Thanks,

Rob.

This has also been raised on reddit and bugs.freebsd.org (from where I was redirected here): https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=232645


----------



## shkhln (Jun 30, 2019)

`cat /var/log/messages | grep -i -E "(nvidia|NVRM)"`? `sysctl hw.nvidia.registry.ResmanDebugLevel=0` will get you more verbose debug output from the driver, although I'm not sure how useful it is in practice.


----------



## Rob S (Jun 30, 2019)

Hi. Thanks for the advice. I did as you suggested. There's some interesting output in /var/log/messages (attached) but I'm not sure what it means.

I did note that the BusID is shown as 0:10:0:0 in /var/log/messages but 0:16:0:0 in pciconf -lv. Possibly one is hex and one is dec - so nothing to worry about?
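
A quick way to check the hex/decimal guess, as a minimal sketch: the shell's `printf` accepts a `0x`-prefixed argument, so the bus number `10` from /var/log/messages can be converted and compared with the `16` shown by pciconf. The variable names are only illustrative.

```sh
# Bus number as printed in the NVRM lines of /var/log/messages
# (assumed here to be hexadecimal).
bus_hex=10

# printf treats a 0x-prefixed argument as hexadecimal and prints it in decimal.
bus_dec=$(printf '%d' "0x${bus_hex}")

echo "hex ${bus_hex} = decimal ${bus_dec}"
```

If the result matches the pciconf value, the two tools are most likely naming the same device in different bases.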

I haven't got any further towards identifying the problem.

Thanks,

Rob.


----------



## Rob S (Jun 30, 2019)

Also wondered if it's my choice of hardware. Using MSI Armor OC 1660 Ti. Maybe the overclocking is throwing things off? If so, I don't know how to avoid this.


----------



## shkhln (Jun 30, 2019)

By the way, there's no need to attach < 100 line files, it's totally fine to include them inline.



> Jun 30 22:17:16 robs-pc kernel: NVRM: Xid (PCI:0000:10:00): 79, GPU has fallen off the bus.



Quite generic error, unfortunately. Make sure you don't have hardware issues: PSU is strong enough, power connector is properly attached, GPU isn't overheating. Check whether your video card works properly under Windows/Linux, if you must.


----------



## Rob S (Jul 1, 2019)

PSU is 750W...easily enough for this system. Card works fine under heavy load in Win10.


----------



## shkhln (Jul 1, 2019)

1. You can try to send Nvidia a crash dump as an NVRM message suggests. They are unlikely to react to it, though.
2. Do any non-NVRM messages between "RmInitAdapter succeeded!" and "GPU has fallen off the bus." lines look interesting?
3. Did you test the most basic X11 desktop environment? I usually suggest `X -retro`, that doesn't even start a terminal — just a mouse pointer on a gray background.


----------



## Rob S (Jul 3, 2019)

Interesting. It is apparently stable with the X -retro setup. I can move the mouse pointer and it doesn't freeze, even after a few minutes.

However, vty switching doesn't seem to work. When I do this, the monitor says it stopped getting a signal. So I can't kill X without a reboot. I think that's an issue that others have had and possibly is unrelated to the freezing problem on starting X normally. If I try to vty switch and then do a hard reboot with my power button, I can capture some error messages, as in the log below. However, if I just do a hard reboot from X, I don't get these messages.

Another odd thing is that I get kernel log messages (before I try to vty switch) that say "interrupt storm detected on "irq:259"; throttling interrupt source". Perhaps that's just me moving the mouse a lot when I check if my desktop is frozen? I'm using sysmouse - could that be a source of the problem? Is there an alternative?

Thanks,

Rob S.

```
Jul  3 00:19:00 robs-pc kernel: NVRM: GPU at PCI:0000:10:00: GPU-890b60a8-d9b3-824a-784b-648e84db328b
Jul  3 00:19:00 robs-pc kernel: NVRM: GPU Board Serial Number: 
Jul  3 00:19:00 robs-pc kernel: NVRM: Xid (PCI:0000:10:00): 79, GPU has fallen off the bus.
Jul  3 00:19:00 robs-pc kernel: NVRM: GPU 0000:10:00.0: GPU has fallen off the bus.
Jul  3 00:19:00 robs-pc kernel: NVRM: GPU 0000:10:00.0: GPU is on Board .
Jul  3 00:19:00 robs-pc kernel: NVRM: A GPU crash dump has been created. If possible, please run
Jul  3 00:19:00 robs-pc kernel: NVRM: nvidia-bug-report.sh as root to collect this data before
Jul  3 00:19:00 robs-pc kernel: NVRM: the NVIDIA kernel module is unloaded.
Jul  3 00:19:04 robs-pc kernel: uhub_reattach_port: giving up port reset - device vanished
Jul  3 00:19:16 robs-pc syslogd: last message repeated 10 times
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57d:0:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:1:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:0:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:3:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:5:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:7:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57d:0:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:1:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:0:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:3:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:5:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:7:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:0:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:2:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: uhub_reattach_port: giving up port reset - device vanished
Jul  3 00:19:48 robs-pc syslogd: last message repeated 25 times
Jul  3 00:19:49 robs-pc devd[721]: check_clients:  dropping disconnected client
Jul  3 00:19:50 robs-pc kernel: uhub_reattach_port: giving up port reset - device vanished
```
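
A log like the one above is easier to read once repeated messages are grouped. The following is a rough sketch, not a standard tool: `summarize_log` is a hypothetical helper that strips the syslog timestamp and host name, drops the per-channel code after "channel state", and counts what is left. The sample lines are copied from the log above; on a real system /var/log/messages would be fed to the function instead.

```sh
summarize_log() {
    # Strip the "Mon DD HH:MM:SS host " prefix, then cut the per-channel
    # code that follows "channel state" so identical errors group together.
    sed -E -e 's/^[A-Za-z]+ +[0-9]+ [0-9:]+ [^ ]+ //' \
           -e 's/(channel state):.*/\1/' |
        sort | uniq -c | sort -rn
}

# Sample input: a few lines taken from the log in this post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Jul  3 00:19:00 robs-pc kernel: NVRM: Xid (PCI:0000:10:00): 79, GPU has fallen off the bus.
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57d:0:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:1:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: uhub_reattach_port: giving up port reset - device vanished
EOF

summary=$(summarize_log < "$sample")
rm -f "$sample"
printf '%s\n' "$summary"
```

The most frequent message ends up on the first line, which makes it easier to see which error came first and which ones are just follow-up noise.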


----------



## shkhln (Jul 3, 2019)

Rob S said:


> Interesting. It is apparently stable with the X -retro setup. I can move the mouse pointer and it doesn't freeze, even after a few minutes.



Now we (well, you) need to find what actually crashes the driver. Start a twm session (`startx` with default settings, i.e. without ~/.xinitrc), run `glxgears` from Mesa, then maybe something heavier like the Unigine Valley benchmark.



Rob S said:


> However, vty switching doesn't seem to work. When I do this, the monitor says it stopped getting a signal.



Switching from (modern and relatively new) vt back to syscons might or might not help with it. Also see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237050.



Rob S said:


> Another odd thing is that I get kernel log messages (before I try to vty switch) that say "interrupt storm detected on "irq:259"; throttling interrupt source". Perhaps that's just me moving the mouse a lot when I check if my desktop is frozen?



Probably not.



Rob S said:


> I'm using sysmouse - could that be a source of the problem?



No, nothing of the sort. It doesn't talk to hardware directly, that would be either _ums_ or _psm_ driver.


----------



## Rob S (Jul 7, 2019)

I tried running a full xfce4 desktop. It was actually stable(-ish) this time. The weird thing is that the fans spun up on starting X and remained on even when idling. These fans are supposed to stop when the card is idle (and they do on Win10). I did get a few short freezes before it finally died when I opened a terminal. There were a few clusters of "interrupt storm" in the message log.


----------



## T-Daemon (Jul 7, 2019)

If you haven't done it yet, update to version 12.0-RELEASE-p7. According to dmesg.txt your system is at r341666.
`freebsd-update fetch`
`freebsd-update install`
`reboot`

remove from /boot/loader.conf

```
linux_enable="YES"
nvidia_load="YES"
nvidia-modeset_load="YES"
```

edit /etc/rc.conf, set:

```
linux_enable="YES"
kld_list="nvidia-modeset"
```

rename any xorg.conf file, ex. xorg.conf.nvidia,
create /usr/local/etc/X11/xorg.conf.d/nvidia.conf file, set:

```
Section "Device"
   Identifier "Card0"
   Driver     "nvidia"
EndSection
```

`reboot`

login as user, execute
`startx`

If the problems persist report back with
`dmesg`
`pciconf -lv |grep -B4 VGA`
/var/log/Xorg.0.log
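
The three items requested above can be gathered into one file for posting. This is only a sketch under assumptions: the report goes to a temporary file whose name is chosen by `mktemp`, and each section falls back to a placeholder line if the command or log is not available on the machine where it is run.

```sh
# Collect the requested diagnostics into a single scratch file.
report=$(mktemp)
{
    echo "== dmesg =="
    dmesg 2>/dev/null || echo "(dmesg unavailable)"

    echo "== pciconf =="
    if command -v pciconf >/dev/null 2>&1; then
        # Same filter as above: the VGA device plus the four lines before it.
        pciconf -lv | grep -B4 VGA
    else
        echo "(pciconf unavailable)"
    fi

    echo "== Xorg.0.log =="
    cat /var/log/Xorg.0.log 2>/dev/null || echo "(no /var/log/Xorg.0.log)"
} > "$report"

echo "report written to $report"
```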


----------



## Rob S (Jul 7, 2019)

Thank you for your reply. I have upgraded FreeBSD, changed the /boot/loader.conf and /etc/rc.conf and changed the Xorg config, as you suggested. Of course, it was also necessary to recompile the driver module.

I found that the standard startx setup worked stably for about 5 minutes before I did a soft reset (because I can't switch back to VT). Subsequently, I tried startx and startxfce4. Both of these crashed quite quickly, after 30 seconds or less. I didn't get the fans spinning up as before but I think that was happening randomly anyway.

I attach/paste the requested logs.

Thank you.


```
vgapci0@pci0:16:0:0:    class=0x030000 card=0x37501462 chip=0x218210de rev=0xa1 hdr=0x00
    vendor     = 'NVIDIA Corporation'
    device     = 'TU116 [GeForce GTX 1660 Ti Rev. A]'
    class      = display
    subclass   = VGA
```

For info., the card is an MSI Armor OC GeForce GTX 1660Ti .

There are two Xorg logs below. The "old" one has a suspicious error message in it. The other one did not report any messages but crashed anyway.


----------



## shepper (Jul 7, 2019)

Your nVidia card is a circa 2010 and has no where near the capability of the built in Graphics of your Ryzen 2400G.  Plus nVidia support is not great for older cards.  Someone needs to ask, why the complexity of a separate card?  This forum has several threads reporting success with Ryzen graphics.


----------



## Rob S (Jul 7, 2019)

shepper said:


> Your nVidia card is a circa 2010 and has no where near the capability of the built in Graphics of your Ryzen 2400G.  Plus nVidia support is not great for older cards.  Someone needs to ask, why the complexity of a separate card?  This forum has several threads reporting success with Ryzen graphics.



Hi Shepper. My card was released about 5 months ago.

Edit: I updated the typo from "1600Ti" to "1660Ti" in my post.


----------



## shkhln (Jul 7, 2019)

shkhln said:


> run `glxgears` from Mesa, then maybe something more heavy like Unigine Valley benchmark.



?


----------



## Rob S (Jul 8, 2019)

shkhln said:


> ?



Hi shkhln. Thanks - sorry I didn't follow up on the glxgears thing earlier. I did get glxgears to work once with my setup (this was when X was running for about 5 minutes before I decided to reset). However, the last time I tried (just now), glxgears just hung on the command line. glxinfo also hung on the command line (after printing one line about the display name). I then ran firefox in another window and then the whole of X crashed.

I will try glxgears again.


----------



## Rob S (Jul 8, 2019)

Attempt 1:

startx
Run glxgears in login term - hangs
Run glxinfo in login term - hangs after generic one-line message
Run firefox - X stops responding

Attempt 2:

startx
Run glxgears in login term - glx gears works - reports 60 fps three times before I exit
Run glxinfo in login term - hangs after generic one-line message
Run glxgears in login term - hangs
Run firefox in other term - no response ( ps STAT has state D ).
... try last few lines a few more times with same results...
quit by pressing hw reset

On an earlier attempt I did get glxinfo to work and it reported that 3D rendering was enabled.

What I notice is that in both times I get:

interrupt storm detected "irq276:" - throttling input source

This appears in my dmesg  or /var/log/messages. I got this message about 60 times on attempt 2, before I reset. I was getting these messages (but with irq259) before I upgraded FreeBSD in response to an earlier post in this thread.
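
Counting these messages per IRQ by hand gets tedious. As a minimal sketch, the hypothetical helper `count_storms` below tallies them from log text on standard input. It assumes the kernel's wording is `interrupt storm detected on "irqNNN:"; throttling interrupt source`, as pasted verbatim later in this thread, and it does not expand syslogd's "last message repeated N times" lines.

```sh
count_storms() {
    # Print only the IRQ name from each storm line, then count per IRQ.
    sed -n -E 's/.*interrupt storm detected on "(irq[0-9]+):".*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Sample input in the assumed message format; on a real system,
# /var/log/messages would be used instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
interrupt storm detected on "irq275:"; throttling interrupt source
interrupt storm detected on "irq275:"; throttling interrupt source
interrupt storm detected on "irq276:"; throttling interrupt source
interrupt storm detected on "irq275:"; throttling interrupt source
EOF

storms=$(count_storms < "$sample")
rm -f "$sample"
printf '%s\n' "$storms"
```

Each output line is an IRQ name followed by the number of storm messages seen for it.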


----------



## toorski (Jul 8, 2019)

I also didn’t see anything in your Xorg log that would indicate issues with nvidia’s GPU driver and display output.

All I can think of is your video card’s OC setting/configuration. You should maybe use MSI’s GPU tuner software to reset the video card to its default settings, with no OC, if there’s such an option.

Or else there's some kind of DMA/IRQ hardware conflict that FreeBSD cannot deal with.

Edit:
I would also do:
`kldstat | grep nvidia`
to make sure that the mods are in
I would re-run:
`nvidia-xconfig`

Then, reboot and try *startx* again.


----------



## Rob S (Jul 8, 2019)

Thanks toorski. There are two xorg logs. The "old" one has a driver backtrace in it. 

I will definitely try the overclocking thing. The Win10 MSI tool allows the clocks to be slowed down relative to the current OC setting, so I'll need to look up the values for the stock clocks. Hopefully those settings will persist after a reboot. 
I have to go offline now for about 20 hours.


----------



## shkhln (Jul 8, 2019)

Rob S said:


> Run glxgears in login term - glx gears works - reports 60 fps three times before I exit
> Run glxinfo in login term - hangs after generic one-line message
> Run glxgears in login term - hangs



All in the same session? Can you post `truss glxgears` output (where it hangs)?


----------



## Rob S (Jul 8, 2019)

OK so I tried again with glxgears. It ran OK at 60 FPS for about 1 minute, then the framerate dropped to about 3 FPS and the desktop became barely responsive. I did a ctrl+c to quit. Then I ran glxgears again and it didn't even start. I ran truss glxgears this time (output attached). I also attach the output of truss glxinfo from when it hung.

System is very choppy with random freezes.

I'm going to try reducing the overclock now. Not sure if it is possible to make this persist across a reboot with a factory overclocked card but we'll see. I have no way to adjust the overclock in FreeBSD apparently - I need to reboot to Win10.

I got the usual interrupt storm detected on irq276 (repeated 11 times). Also I get EQ overflow in the Xorg log ( not the "card has fallen off the bus" like I did last time ). Seems to be intermittently freezing / crashing with one of those two errors.


(EE) [mi] EQ overflowing.  Additional events will be discarded until existing events are processed.
(EE) 
(EE) Backtrace:
(EE) 0: /usr/local/bin/X (?+0x0) [0x3dd360]
(EE) 1: /usr/local/bin/X (?+0x0) [0x2a1d30]
(EE) 2: /usr/local/bin/X (?+0x0) [0x2de8b0]
(EE) 3: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0xe06a25990]
(EE) 4: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0xe06a22e10]
(EE) 5: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0xe06a21e90]
(EE) 6: /usr/local/bin/X (?+0x0) [0x2cf780]
(EE) 7: /usr/local/bin/X (?+0x0) [0x2f3030]
(EE) 8: /lib/libthr.so.3 (pthread_sigmask+0x536) [0x800ae9916]
(EE) 9: /lib/libthr.so.3 (pthread_getspecific+0xe12) [0x800ae96f2]
(EE) 10: ? (?+0xe12) [0x7fffffffee15]
(EE) 11: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (nvidiaAddDrawableHandler+0x52c89) [0x80230df82]
(EE) 
(EE) [mi] These backtraces from mieqEnqueue may point to a culprit higher up the stack.
(EE) [mi] mieq is *NOT* the cause.  It is a victim.
[   129.232] [mi] Increasing EQ size to 1024 to prevent dropped events.
[   129.233] [mi] EQ processing has resumed after 43 dropped events.
[   129.233] [mi] This may be caused by a misbehaving driver monopolizing the server's resources.


----------



## Rob S (Jul 8, 2019)

further note: I have been running X for 15 minutes now - possibly a record. However, whenever I open a new window there is a ~10 second freeze. Also happens when I open a new tab in browser and also intermittently. The /var/log/messages interrupt storm on irq276 has now increased "repeated 98 times"


----------



## Rob S (Jul 8, 2019)

> All in the same session?



Yes, all in the same session.


----------



## shkhln (Jul 11, 2019)

Rob S said:


> The /var/log/messages interrupt storm on irq276 has now increased "repeated 98 times"



`vmstat -i`?


----------



## Amzo (Jul 11, 2019)

Did you build the driver yourself from outside of the port tree? I'm just curious as since the issue is only with X, it could be you failed to patch or address something. Ports nvidia-driver is still at 390.87 which was released before the GTX 1660ti if I remember correctly.


----------



## shkhln (Jul 11, 2019)

Amzo said:


> I'm just curious as since the issue is only with X, it could be you failed to patch or address something.



Nothing here is related to packaging.


----------



## Amzo (Jul 11, 2019)

Some time today I'll upgrade to the same driver version with the patches. If it is a driver issue relating to the newest FreeBSD Nvidia release and based on your information I should be able to reproduce it and go from there.


----------



## Rob S (Jul 11, 2019)

Amzo said:


> Did you build the driver yourself from outside of the port tree? I'm just curious as since the issue is only with X, it could be you failed to patch or address something. Ports nvidia-driver is still at 390.87 which was released before the GTX 1660ti if I remember correctly.


Hi Amzo. Thanks for your reply. I built the driver from the ports tree using a patch:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=232645

But yes, perhaps it's possible it's out-of-sync with my version of ports? I installed FreeBSD less than a day after building the driver.

Maybe I could try the 418 version instead.

Thanks,

RobS.


----------



## Rob S (Jul 11, 2019)

shkhln said:


> `vmstat -i`?


interrupt storm detected on "irq275:"; throttling interrupt source
interrupt storm detected on "irq275:"; throttling interrupt source
interrupt storm detected on "irq275:"; throttling interrupt source
interrupt storm detected on "irq275:"; throttling interrupt source
$ vmstat -i
interrupt                          total       rate
cpu0:timer                         17283         97
cpu1:timer                         10466         59
cpu2:timer                         15061         85
cpu3:timer                         11164         63
cpu4:timer                         92594        521
cpu5:timer                          8643         49
cpu6:timer                         12006         68
cpu7:timer                         11184         63
irq259: hdac0                          8          0
irq261: xhci1                        179          1
irq262: ahci0                      11564         65
irq263: ahci1                        268          2
irq265: re0                        16651         94
irq266: nvme0                         14          0
irq267: nvme0                        168          1
irq268: nvme0                         45          0
irq269: nvme0                         44          0
irq270: nvme0                        274          2
irq271: xhci2                       4664         26
irq275: vgapci0                     7181         40
Total                             219461       1235
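
To see at a glance which source has the highest interrupt rate, the table can be sorted on its last column. A rough sketch, assuming the three-column `vmstat -i` layout shown above (name, total, rate); `top_interrupts` is a hypothetical helper and the sample rows are copied from this post.

```sh
top_interrupts() {
    # Skip the header and the Total row; the name may be one or two words,
    # so rebuild it from every field except the last two (total and rate).
    awk '$1 != "interrupt" && $1 != "Total" {
        name = $1
        for (i = 2; i <= NF - 2; i++) name = name " " $i
        print $NF, name
    }' | sort -rn
}

# Sample rows taken from the vmstat -i output above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
interrupt                          total       rate
cpu4:timer                         92594        521
irq265: re0                        16651         94
irq275: vgapci0                     7181         40
Total                             219461       1235
EOF

ranked=$(top_interrupts < "$sample")
rm -f "$sample"
printf '%s\n' "$ranked"
```

On a live system the same function can read `vmstat -i` directly through a pipe.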


----------



## Rob S (Jul 11, 2019)

Amzo said:


> Some time today I'll upgrade to the same driver version with the patches. If it is a driver issue relating to the newest FreeBSD Nvidia release and based on your information I should be able to reproduce it and go from there.


Cool!


----------



## Rob S (Jul 11, 2019)

Also I have this but not sure if it's saying anything useful:

root@robs-pc:/usr/home/robs # nvidia-debugdump -z -D
nvmlInit succeeded
Using ALL devices
Dumping all components.
nvdZip_Open(dump.zip) for writing succeeded
System: Dumping component: system_info.
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpSystemComponent() failed, return code: 0x3e7
System: Dumping component: error_data.
GetCaptureBufferSize succeeded, bufSize: 0x8fb
GetCaptureBuffer succeeded, bufSize: 0x83d
nvdZip_AddFile succeeded
internal_dumpSystemComponent() succeeded
Nvlog: Dumping component(nvlog.log): nvlog.
internal_dumpNvLogComponent() succeeded
Device: GeForce GTX 1660 Ti : 0: Dumping component: debug_buffers.
GetCaptureBufferSize succeeded, bufSize: 0x22
GetCaptureBuffer succeeded, bufSize: 0x2
nvdZip_AddFile succeeded
internal_dumpGpuComponent() succeeded
Device: GeForce GTX 1660 Ti : 0: Dumping component: rm.
GetCaptureBufferSize succeeded, bufSize: 0x41c0
GetCaptureBuffer succeeded, bufSize: 0x3af3
nvdZip_AddFile succeeded
internal_dumpGpuComponent() succeeded
Nvlog: Dumping component(nvlog.gpu000.log): nvlog.
internal_dumpNvLogComponent() succeeded
nvdZip_Close() succeeded


----------



## T-Daemon (Jul 11, 2019)

I have installed on a 12.0-RELEASE test system the 430.34 NVIDIA driver, not from ports but from downloaded tar ball at NVIDIA, without linux compatibility support. The video card is an old GeForce GT 630, passive cooled. So far I haven't had any problems. In your case the issues could be related to the overclocking. Have you tried slowing down the card as you mentioned in your post #19?


----------



## Rob S (Jul 11, 2019)

T-Daemon said:


> I have installed on a 12.0-RELEASE test system the 430.34 NVIDIA driver, not from ports but from downloaded tar ball at NVIDIA, without linux compatibility support. The video card is an old GeForce GT 630, passive cooled. So far I haven't had any problems. In your case the issues could be related to the overclocking. Have you tried slowing down the card as you mentioned in your post #19?



I looked at this but it seemed the only way was to use the MSI Afterburner tool ( dual-boot with Win10 ).  I'm not sure if settings will persist across reboot but I will try now. The card is supposed to have a boost clock of 1860, which is more than stock.

I will also try a different tool now because the version of MSI Afterburner I have seems to make the settings obscure.

For info. ( as you probably saw ) I am running linux compatibility support ( or so I understand ).

Thanks for trying it!


----------



## Rob S (Jul 11, 2019)

Some overclock settings in Win 10 attached.

As far as I can tell, the card is operating at the default clock speeds but it is a model that claims to be overclocked.


----------



## Rob S (Jul 11, 2019)

Rob S said:


> Some overclock settings in Win 10 attached.
> 
> As far as I can tell, the card is operating at the default clock speeds but it is a model that claims to be overclocked.



Perhaps a VBIOS flash would solve it but I don't really know.


----------



## toorski (Jul 12, 2019)

Rob S said:


> For info. ( as you probably saw ) I am running linux compatibility support ( or so I understand ).



I've noticed that your linux kernel module is invoked in */etc/rc.conf*
I'm not sure which is the correct way for enabling the Linux module, especially in 12.0 
In my case, I load the module in */boot/loader.conf*, in *11.2*
In *11.2*, my nvidia driver module is also loaded from */boot/loader.conf*
In 12.0, I don't have nvidia GPU to play with nvidia driver.

Moreover, I would also try this:



T-Daemon said:


> I have installed on a 12.0-RELEASE test system the 430.34 NVIDIA driver, not from ports but from downloaded tar ball at NVIDIA,



I remember, sometime ago, I had to make latest nvidia-driver (from tarball) to play with CUDA  and my GTX960, when 11.*? didn't have it in pkg or ports tree. The driver worked fine and so did CUDA.

I would even try the nvidia-driver 390.* from pkg install,  just to see what would happen 

Edit:
I just verified and corrected, in my 12.0, how linux module is loaded.  It's from */etc/rc.conf*


----------



## shkhln (Jul 12, 2019)

Rob S said:


> Also I have this but not sure if it's saying anything useful:



It doesn't. Only Nvidia has means to analyze crash/debug dumps.



Rob S said:


> The card is supposed to have a boost clock of 1860, which is more than stock.



Factory OC cards are completely meaningless with GPU boost. Each card (including non-OC versions) boosts as much as it can, which should be somewhere in 19xx.



Rob S said:


> I will also try a different tool now because the version of MSI Afterburner I have seems to make the settings obscure.



You can lower the power limit level with _-pl_ option of _nvidia-smi_ utility. For some reason it does require starting Xorg first, though. Another setting you can play with is Coolbits X config option, which unlocks a few things in _nvidia-settings_ utility.


----------



## Amzo (Jul 12, 2019)

Rob S said:


> Hi Amzo. Thanks for your reply. I built the driver from the ports tree using a patch:
> 
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=232645
> 
> ...



The reason I was curious is that the new driver has new IRQ code and I was assuming there was a bug in it. I can't test it yet as I'm busy working on the tensorflow port atm. Since an interrupt storm is when the processor receives too many interrupt requests, I figured it may be a bug.

Try increasing the IRQ limit as a temporary fix, the default is 1000.


```
hw.intr_storm_threshold="9000"
```
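
The post does not say which file the line belongs in. The quoted `name="value"` form is the style used in /boot/loader.conf, so the sketch below assumes that. `set_tunable` is a hypothetical helper that replaces an existing assignment instead of appending a duplicate. It is demonstrated on a scratch copy, not on the real file, and the `autoboot_delay` line only stands in for whatever else the file contains.

```sh
set_tunable() {
    file=$1; key=$2; value=$3
    tmp=$(mktemp)
    # Keep every line that does not already assign this key ...
    if [ -f "$file" ]; then
        awk -v k="$key" 'index($0, k "=") != 1' "$file" > "$tmp"
    fi
    # ... then append the new assignment in the quoted style.
    printf '%s="%s"\n' "$key" "$value" >> "$tmp"
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Scratch file standing in for /boot/loader.conf.
conf=$(mktemp)
printf '%s\n' 'hw.intr_storm_threshold="1000"' 'autoboot_delay="3"' > "$conf"

set_tunable "$conf" hw.intr_storm_threshold 9000
cat "$conf"
```

A setting made this way only takes effect at the next boot. For a quick trial on the running system, `sysctl hw.intr_storm_threshold=9000` as root should have the same effect until reboot.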


----------



## Rob S (Jul 12, 2019)

Thanks all. Won't get the chance to look at it until tomorrow.


----------



## Rob S (Jul 12, 2019)

toorski said:


> I would even try the nvidia-driver 390.* from pkg install, just to see what would happen



I did this accidentally at one point. An error was reported on loading the kernel module, which basically said the driver was incompatible with my card.


----------



## Rob S (Jul 13, 2019)

Amzo said:


> The reason I was curious is that the new driver has new IRQ code and I was assuming there was a bug in it. I can't test it yet as I'm busy working on the tensorflow port atm. Since an interrupt storm is when the processor receives too many interrupt requests, I figured it may be a bug.
> 
> Try increasing the IRQ limit as a temporary fix, the default is 1000.
> 
> ...



Hi Amzo,

Thanks for that. I haven't been able to keep running X windows long enough to see the interrupt storm messages, so I haven't been able to test this fix. The last two times I tried running X, it crashed after about 10 seconds with the EQ overflow. I'll post about that separately.


----------



## Rob S (Jul 13, 2019)

toorski said:


> I've noticed that your linux kernel module is invoked in */etc/rc.conf*
> I'm not sure which is the correct way for enabling the Linux module, especially in 12.0
> In my case, I load the module in */boot/loader.conf*, in *11.2*
> In *11.2*, my nvidia driver module is also loaded from */boot/loader.conf*
> ...



toorski, T-Daemon: I just tried uninstalling the ports tree 430.26 driver (patched from 390) and built + installed the official nvidia driver 430.34 from source. With this driver, I got the EQ overflow error message (in Xorg.0.log) after about 10 seconds and then my system froze. This is the same as the error I got when I tried it with the 430.26 driver. I also get the "GPU has fallen off the bus" message in /var/log/messages.

So it seems the irq interrupt storm messages are causing random slowdowns but not crashes. Then the EQ overflow is causing the actual crash.

I would try re-seating my card inside the PC but it works fine under heavy load in Win10, so it would suggest the hardware is fine.


----------



## shkhln (Jul 13, 2019)

Rob S said:


> I would try re-seating my card inside the PC but it works fine under heavy load in Win10, so it would suggest the hardware is fine.



Test it under Linux.


----------



## shkhln (Jul 13, 2019)

Rob S said:


> installed the official nvidia driver 430.34 from source.



What's the point?



Rob S said:


> Then the EQ overflow is causing the actual crash.



No, the driver crash (eventually) causes EQ overflow, not the other way around.


----------



## Rob S (Jul 13, 2019)

shkhln said:


> Test it under Linux.


Thanks, yes, I was thinking of doing that.


----------



## Rob S (Jul 13, 2019)

shkhln said:


> What's the point?


One of the others reported that their system runs fine with this NVIDIA source build. Also, this is a later driver version, which I thought could contain a bug fix. Alas, it has not solved my issue.


----------



## shkhln (Jul 13, 2019)

Don't do that again.

It's crazy how advice on this forum constantly switches between "never mix packages and ports" and no respect for package management whatsoever. It drives me nuts. (Yes, I know that typically these are different people.)


----------



## Amzo (Jul 13, 2019)

Last thing to try, but have you tried rebuilding Xorg and dependencies. I'm wondering if you have any enabled that could be causing issues, could you post /etc/make.conf? Other users on NVidia forums reported that they solved the issue of "Failed to query display engine channel state", by re-seating the card and memory as the problem was from bad contact / hardware.

Also what is the wattage of your power supply?


----------



## Rob S (Jul 13, 2019)

Amzo said:


> Last thing to try, but have you tried rebuilding Xorg and dependencies. I'm wondering if you have any enabled that could be causing issues, could you post /etc/make.conf? Other users on NVidia forums reported that they solved the issue of "Failed to query display engine channel state", by re-seating the card and memory as the problem was from bad contact / hardware.
> 
> Also what is the wattage of your power supply?


I will try re-seating stuff. I have also removed my Raid controller. When I booted Linux it posted some IO page fault errors relating to that. 

Wattage of PSU is 750W (overkill). As mentioned it's stable under load with Win10.

I'm rebuilding FreeBSD now. Btw, the ports driver install asks "WBINVD Flush CPU caches directly". I left this unselected.

My make.conf is blank.


----------



## Rob S (Jul 13, 2019)

Re


Rob S said:


> I will try re-seating stuff. I have also removed my Raid controller. When I booted Linux it posted some IO page fault errors relating to that.
> 
> Wattage of PSU is 750W (overkill). As mentioned it's stable under load with Win10.
> 
> ...


Reseat of gfx card seems to have made no difference. However, one of the connectors on my power cable seemed to be dead (when I tried swapping connectors, just in case). To be safe, I swapped the entire cable for a different one (modular PSU) which now works apparently (just as the original setup did). 

Rebuild of FreeBSD and driver, from ports tree, doesn't seem to have made much difference.


----------



## Amzo (Jul 13, 2019)

It might be worth asking over at NVidia as this issue has been posted a lot on recent cards and the newest version of the Linux driver. Seems like it could be a bug.


----------



## Rob S (Jul 13, 2019)

Amzo said:


> It might be worth asking over at NVidia as this issue has been posted a lot on recent cards and the newest version of the Linux driver. Seems like it could be a bug.


OK might do that. By the way, do you think it's worth trying 418 driver or whatever the earlier one with 1660ti support is? 

Thanks for looking at it.


----------



## Amzo (Jul 14, 2019)

Could be worth a try, as it seems people on Arch Linux have had the "fallen off the bus" issue with cards such as the 1070, 1080 and newer. Downgrading the driver didn't resolve it for them, though, and I'm not sure if this is fixed yet. See the "Nvidia and Linux driver: fallen off the bus" thread.


----------



## shkhln (Jul 14, 2019)

Amzo said:


> this issue has been posted a lot on recent cards and the newest version of the Linux driver. Seems like it could be a bug.



It's not a single issue. "GPU has fallen off the bus" is a generic catch-all error message.


----------



## Amzo (Jul 14, 2019)

shkhln said:


> It's not a single issue. "GPU has fallen off the bus" is a generic catch-all error message.



I know, I'm just saying it seems to be a common issue across FreeBSD and Linux with the latest Nvidia cards (1070, 1080 and more recent). It may well be a bug in the driver, as a lot of Arch Linux users have also reported this issue on the Arch forums with the newer drivers and cards. That is why it may be best to report it on the Nvidia forums.


----------



## shkhln (Jul 14, 2019)

Amzo said:


> a lot of ArchLinux users have also reported this issue on Arch Forums with the newer drivers and cards



Ok, what makes _this_ more interesting than similar complaints from 2016, 2015, 2014, 2013 or 2012?


----------



## Rob S (Jul 15, 2019)

Seems like this is a driver bug that has been affecting a small percentage of users and that hasn't been fixed for the better part of a decade. I'm wondering now if I've got a valid case for an RMA... Ideally to get an AMD Navi, which I guess is more FreeBSD friendly, in part exchange. 
I also wondered if an alternative would be to run both GPUs (1660 Ti and Ryzen 2400G) and switch the display over while running FreeBSD (the other SSD has Win10 for games).


----------



## shkhln (Jul 15, 2019)

Which PSU model do you have, by the way?



Rob S said:


> Seems like this is a driver bug that has been affecting a small percentage of users and that hasn't been fixed for the better part of a decade.



Nope. It's _not_ a _single_ software bug or hardware issue. It's just a "something is not right" message with a pretty much unlimited number of causes.



Rob S said:


> I'm wondering now if I've got a valid case for an RMA...



Can you get another 1660 Ti? Preferably a different model.


----------



## Rob S (Jul 15, 2019)

Maybe


shkhln said:


> Which PSU model do you have, by the way?
> 
> 
> 
> ...


Well, it's worth a try. Maybe I could get something without OC. It's just that I thought I heard AMD cards work better in FreeBSD/Linux.

PSU is a Corsair 750M. According to the power reqs. calculator I used, it's about twice as much as I need.


----------



## Rob S (Jul 15, 2019)

Rob S said:


> Maybe
> 
> Well, it's worth a try. Maybe I could get something without OC. It's just that I thought I heard AMD cards work better in FreeBSD/Linux.
> 
> PSU is a Corsair 750M. According to the power reqs. calculator I used, it's about twice as much as I need.


The 1660 Ti, of course, has very low power consumption, at 120W.


----------



## shkhln (Jul 15, 2019)

Rob S said:


> It's just that I thought I heard AMD cards work better in FreeBSD/Linux.



No idea, I don't trust AMD fans. You have a Ryzen 2400G APU, thus you can assess the driver (amdgpu) performance and stability yourself.



Rob S said:


> PSU is a Corsair 750M. According to the power reqs. calculator I used, it's about twice as much as I need.



Ah, ok. There were some reports of issues with Seasonic Focus PSUs.


----------



## Rob S (Jul 15, 2019)

shkhln said:


> No idea, I don't trust AMD fans. You have a Ryzen 2400G APU, thus you can assess the driver (amdgpu) performance and stability yourself.


I assume you mean fans in the sense of "advocates", rather than "cooling solution". If the latter, then I believe that's manufacturer-dependent. I've always thought NVidia cards are better made but, apparently, the AMD drivers are better supported in FreeBSD/Linux. 

I've heard that 2400G works well in FreeBSD but it would be easier to use the same card in both FreeBSD and Win10 - the reason I have got a med/high spec card is so I can play modern games in Win10. Switching back to 2400G on both OSs is not a good option.


----------



## Rob S (Jul 16, 2019)

Thanks all for your help. Going to try for an RMA. Will post on NVidia boards when I get the chance.


----------



## Django (Jul 21, 2019)

I have an Nvidia GTX 1660 Zotac card with the GeForce driver 430.34 compiled from source. GNOME and KDE freeze, *but not* the MATE desktop! It works with no problems. I do not believe this is a problem with the nvidia driver. I think some desktops do not work properly with the new driver.


----------



## Rob S (Jul 22, 2019)

Django said:


> I have an Nvidia GTX 1660 Zotac card with the GeForce driver 430.34 compiled from source. GNOME and KDE freeze, *but not* the MATE desktop! It works with no problems. I do not believe this is a problem with the nvidia driver. I think some desktops do not work properly with the new driver.


That's interesting. I will try installing the MATE desktop when I get the chance.


----------



## Rob S (Jul 22, 2019)

Rob S said:


> That's interesting. I will try installing the MATE desktop when I get the chance.


Although, mine doesn't work in twm or xfce4.


----------



## KoMa350 (Aug 24, 2019)

hello,

not to hijack the thread, but i'm thinking since this is pretty much the same thing i'd just jump in here...

i have a geforce gtx 1050 on a freebsd 12 box (12.0-RELEASE-p10 FreeBSD 12.0-RELEASE-p10 GENERIC  amd64) and with some games (games/warzone2100 and games/endless-sky for now) the kernel panics after maybe a minute. using `kgdb`, here's what i get:



```
# kgdb /boot/kernel/kernel /var/crash/vmcore.9
GNU gdb (GDB) 8.3 [GDB v8.3 for FreeBSD]
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd12.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:
NVRM: GPU at PCI:0000:01:00: GPU-66172491-d2d9-dd3b-e25e-8cf4ef024576
NVRM: GPU Board Serial Number:
NVRM: Xid (PCI:0000:01:00): 8, Channel 00000009

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 12
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff82de73a4
stack pointer           = 0x28:0xfffffe00004d1430
frame pointer           = 0x28:0xfffffe004d1fca88
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi4: clock (0))
trap number             = 12
panic: page fault
cpuid = 2
time = 1566588099
KDB: stack backtrace:
#0 0xffffffff80be78d7 at kdb_backtrace+0x67
#1 0xffffffff80b9b4b3 at vpanic+0x1a3
#2 0xffffffff80b9b303 at panic+0x43
#3 0xffffffff81074bff at trap_fatal+0x35f
#4 0xffffffff81074c59 at trap_pfault+0x49
#5 0xffffffff8107427e at trap+0x29e
#6 0xffffffff8104f625 at calltrap+0x8
Uptime: 13m4s
Dumping 657 out of 8031 MB:..3%..13%..22%..32%..42%..52%..61%..71%..81%..91%

__curthread () at ./machine/pcpu.h:234
234             __asm("movq %%gs:%1,%0" : "=r" (td)
(kgdb) backtrace
#0  __curthread () at ./machine/pcpu.h:234
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:366
#2  0xffffffff80b9b09b in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:446
#3  0xffffffff80b9b513 in vpanic (fmt=<optimized out>, ap=0xfffffe00004d1180) at /usr/src/sys/kern/kern_shutdown.c:872
#4  0xffffffff80b9b303 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:799
#5  0xffffffff81074bff in trap_fatal (frame=0xfffffe00004d1370, eva=0) at /usr/src/sys/amd64/amd64/trap.c:929
#6  0xffffffff81074c59 in trap_pfault (frame=0xfffffe00004d1370, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:765
#7  0xffffffff8107427e in trap (frame=0xfffffe00004d1370) at /usr/src/sys/amd64/amd64/trap.c:441
#8  <signal handler called>
#9  0xffffffff82de73a4 in _nv008794rm () from /boot/modules/nvidia.ko
#10 0xfffffe004d2bd000 in ?? ()
#11 0xffffffff82de73b9 in _nv008790rm () from /boot/modules/nvidia.ko
#12 0xfffff801525cbd80 in ?? ()
#13 0xffffffff82de7237 in _nv008791rm () from /boot/modules/nvidia.ko
#14 0xfffffe004d1fca88 in ?? ()
#15 0xfffffe004d1fcb00 in ?? ()
#16 0x0000000000000000 in ?? ()
(kgdb) list *0xffffffff82de73a4
(kgdb)
```

the similarity to the OP's problem is that once i actually saw X throwing all these (EE)s, so this and also the kernel panic backtrace mentioning nvidia module makes me think i'm having the same issue. seeing the (EE)s was under the condition that i'd be able to quickly type ctrl+alt+del once the screen freezes, sometimes that would work and un-freeze the screen after a couple of seconds, after which everything would work alright for as long as i would be willing to test (say, at least another hour). oh, and the driver installed is the x11/nvidia-driver. also, i compiled the 430.40 tarball from nvidia.com, then it would be more time until the kernel panicked, but it still would happen eventually. downside of compiling the 430.40 was that i would get up to mountroot> and then have to mount /dev/ada0p3 manually or comment out all the nvidia modules in /boot/loader.conf to have it booting alright, so that's a bummer. just for the heck of it i also replaced the ram with a brand new one, still the same thing.

not sure if there's anything that could be done about it. i've been running this machine for more than two years now (initially with the 375 driver from nvidia.com and for over a year now with x11/nvidia-driver) and never had a problem, up until now. i don't recall exactly when it started but i'm almost sure it was after i upgraded userland like 2 weeks ago, and x11/nvidia-driver was part of that.
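
For anyone comparing reports like the one above, a minimal sketch for pulling the `NVRM: Xid` lines out of a log file so the PCI address and Xid code can be quoted in a bug report. The function name `nvrm_xids` is made up for illustration, and the pattern only assumes the line format shown in the kernel message buffer above; it says nothing about what a given code means.

```shell
# Print "<pci address> <xid code>" for every NVRM Xid line in a log file.
# Hypothetical helper; usage: nvrm_xids /var/log/messages
nvrm_xids() {
    # Matches lines like: NVRM: Xid (PCI:0000:01:00): 8, Channel 00000009
    sed -n 's/.*NVRM: Xid (PCI:\([0-9A-Fa-f:.]*\)): \([0-9][0-9]*\),.*/\1 \2/p' "$1"
}
```

Run against a saved copy of the log, this prints one line per Xid event and nothing for unrelated lines.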


----------



## KoMa350 (Aug 26, 2019)

an interesting detail i figured yesterday is that the system actually only freezes in fullscreen mode, i had `glxspheres64` running for a while with no problem then i switched the window to fullscreen mode (alt+f11, i'm running icewm btw) and after 10 or so seconds screen freezes and i get a reboot after kernel panic.


----------



## shkhln (Aug 26, 2019)

KoMa350 said:


> not to hijack the thread



If you don't want to hijack this thread, then don't hijack it. OP never got a kernel panic.


----------



## KoMa350 (Aug 27, 2019)

Rob S said:


> Hello.
> 
> I am getting EQ overflow errors in my Xorg log using the nVidia 430 driver. These correspond to my system freezing-up. The mouse and keyboard work for about 10 seconds before this happens.
> ...
> ...



this i misinterpreted then as the same thing, sorry for that. in my case the mouse keeps working as well for some seconds after the screen freezes, haven't figured the keyboard.
for now i just avoid entering fullscreen mode...


----------



## KoMa350 (Nov 25, 2019)

after updating the x11/nvidia-driver to version 440 the problem is gone. only i had to use
`kld_list="nvidia-modeset"`
in /etc/rc.conf instead of
`nvidia_load="YES"
nvidia_name="nvidia"
nvidia_modeset_load="YES"
nvidia_modeset_name="nvidia-modeset"`
in /boot/loader.conf. if i didn't, then X would hang at startup.
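
a rough sketch for double-checking that switch, assuming the entries look like the ones quoted above: it lists any nvidia module lines still active in a loader.conf-style file, so you can see whether something is left over after moving to `kld_list` in /etc/rc.conf. the function name `nvidia_loader_entries` is made up for illustration and the file path is passed in as an argument.

```shell
# List nvidia module entries still active in a loader.conf-style file.
# Hypothetical helper; prints nothing and returns non-zero when none are left.
# Commented-out lines (starting with #) are ignored.
nvidia_loader_entries() {
    grep -E '^[[:space:]]*nvidia(_modeset)?_(load|name)=' "$1"
}
```

for example, `nvidia_loader_entries /boot/loader.conf || echo "no nvidia entries left"`.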


----------



## Rob S (Nov 25, 2019)

KoMa350 said:


> after updating the x11/nvidia-driver to version 440 the problem is gone. only i had to use
> `kld_list="nvidia-modeset"`
> in /etc/rc.conf instead of
> `nvidia_load="YES"
> ...



Thanks for letting me know. I'm not able to test this for a couple of weeks but hopefully the port updates will solve my issue as well.


----------

