GPU Driver Error, No Temps Error

One of my rigs reboots about once every two days due to “GPU driver error, no temps”. When I click on the error, though, I see only the following, with no additional information. I understand from other posts that this could be caused by the memory overclock on one of the cards, a power delivery issue, or even ghost settings (which would need a re-image of the drive). I’d like to figure out which card is responsible, though. Is there a way to do that?

01:00.0 Temp: 61C Fan: 63% Power: 115W
0c:00.0 Temp: 50C Fan: 69% Power: 262W
0d:00.0 Temp: 44C Fan: 30% Power: 109W
0e:00.0 Temp: 55C Fan: 53% Power: 260W

Can you post a screenshot of your worker overview screen?

Not much to go with on the screenshot unfortunately

Try setting the fans on the 3080 Tis to 100% and reducing the memory clocks slightly on those too.

Not having much luck with the memory clock changes. The rig is now rebooting more often. Is there a way to look at the logs to identify at least which card is being reported for this error? It’s not available in the GUI, and I had no luck finding it in the logs available under /var/log.
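One possible way to narrow this down, offered as a rough sketch rather than a confirmed method for this OS: when an NVIDIA card faults, the kernel driver usually writes an "Xid" line to the kernel log that includes the PCI address of the card. If you can get hold of the kernel log from before the reboot (for example saved `dmesg` output, or `journalctl -k -b -1` if persistent journaling happens to be enabled, which is an assumption about your setup), a small script can pull out the addresses. The function name and the sample log lines below are made up for illustration.

```python
import re

# Matches NVIDIA kernel driver Xid lines such as:
#   NVRM: Xid (PCI:0000:0c:00): 79, pid=1234, GPU has fallen off the bus.
# The "PCI:" prefix is treated as optional because older drivers omit it.
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code) tuples found in the log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1).lower(), int(match.group(2))))
    return events


# Hypothetical sample log, not taken from the rig in this thread.
sample_log = """\
[ 5021.113] usb 1-2: new high-speed USB device number 4
[ 5044.871] NVRM: Xid (PCI:0000:0c:00): 79, pid=1234, GPU has fallen off the bus.
"""

for address, code in find_xid_events(sample_log):
    print(f"GPU at {address} reported Xid {code}")
```

The address in the Xid line can then be compared with the bus IDs in the error listing (0c:00.0 and so on) to see which card is involved. If no Xid line was logged before the crash, this will not help.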

Reduce the memory clocks by a lot and see if it’s stable. That’s almost always the issue.

Thanks for linking to the article. I’ll reduce the memory clock settings even further. If I had to suspect any one card, it would be the MSI 3080 Ti. Another card of the same model gave me issues on a separate rig but now runs fine, with settings that possibly aren’t working on this rig.

Interestingly, in all other instances of similar errors posted in this forum, users are able to identify the card from its bus ID, as the error screen clearly shows a “0” temperature for the card in question. I don’t see that in the GUI. Is that because I am running NVIDIA cards, and the NVIDIA drivers on Linux don’t report Tjunction temperatures?
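For reference, the check other users do by eye can be sketched in a few lines, assuming the error text uses the same "bus ID, Temp, Fan, Power" layout as the listing in my first post. The function name is made up, and the line with a 0C reading is a hypothetical example, since my own listing never shows one.

```python
import re

# One line per card, e.g. "0c:00.0 Temp: 50C Fan: 69% Power: 262W"
STAT_RE = re.compile(
    r"^(?P<bus>[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F])\s+"
    r"Temp:\s*(?P<temp>\d+)C\s+Fan:\s*(?P<fan>\d+)%\s+Power:\s*(?P<power>\d+)W"
)


def suspect_cards(report):
    """Return the bus IDs of cards whose reported temperature is 0."""
    suspects = []
    for line in report.splitlines():
        match = STAT_RE.match(line.strip())
        if match and int(match.group("temp")) == 0:
            suspects.append(match.group("bus"))
    return suspects


# The listing from the first post: no card reports 0C, so nothing is flagged.
my_report = """\
01:00.0 Temp: 61C Fan: 63% Power: 115W
0c:00.0 Temp: 50C Fan: 69% Power: 262W
0d:00.0 Temp: 44C Fan: 30% Power: 109W
0e:00.0 Temp: 55C Fan: 53% Power: 260W
"""

# Hypothetical listing of the kind other posters describe.
other_report = """\
01:00.0 Temp: 61C Fan: 63% Power: 115W
0c:00.0 Temp: 0C Fan: 0% Power: 0W
"""

print(suspect_cards(my_report))
print(suspect_cards(other_report))
```

In my case the list comes back empty, which is exactly the problem: every card still reports a plausible temperature at the moment the error is captured.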

It just depends on how it crashes.

The rig worked fine all night, then an error began to appear during the day. The video cards were not mining, although they were online.
Sometimes it just goes offline.

Don’t use core offsets on the 30 series. Locked core clocks will use less power and be more stable.

