I have two HP OEM 3060s in a rig with 6 other Nvidia GPUs. Today I got the dreaded “GPU detected dead” error, and it was one of the HP 3060s.
I couldn’t get the fan speed to go above 34%, regardless of the Hive setting, which causes the GPU to overheat after about 5 minutes. Since I have two of these, I started swapping things around, including risers, fans, etc. At first I thought it must be a bad fan. After swapping, both fans worked the same, but I could isolate the ‘misbehaving’ GPU, which performed the same regardless of which fan I put on it. Still stuck at 34%.
So I tried this bad GPU in Windows. With Afterburner, I was able to take the fan all the way up to 100%, and GPU-Z reported the same fan %. So it sounds like the problem should be with Hive?
Back in the Hive rig, I tried updating my Nvidia drivers, as well as upgrading to the latest version of HiveOS. No difference, although this time (after being in the Windows box) the fan reports 36%. Same difference, I guess.
Next, I tried the Autofan setting in Hive, setting the max memory temp level at 50° to trigger all fans to 80%. No difference: the fan is still stuck at 36%, whether I manually set it in Hive to 80-100% or leave it at 0 for Auto.
It seems easy to blame the GPU, but why would it work correctly in Windows? All the other GPUs in this rig behave normally with regard to their fan settings. And it’s not just a fan speed reporting issue: the GPU really does get hot, overheat, and crash the rig.
I also tried flashing the VBIOS from TechPowerUp (EDIT: that BIOS is for the non-LHR version; I then tried the LHR version as well, but it didn’t help). I also tried flashing the BIOS from the “good” card to the bad one… same fan behavior.
Finally, I tried a brand new HiveOS image, on a fresh SSD. Same behavior. One GPU’s fan is still stuck at 34%!
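For anyone who wants to check the same thing outside the Hive UI, here is a minimal sketch that asks the Nvidia driver directly for fan speed and temperature through `nvidia-smi` and flags any GPU that is hot while its fan stays low. The function names and the two thresholds (`max_fan_pct`, `min_temp_c`) are my own placeholders, not anything from Hive, so adjust them to taste. It also assumes GPU names contain no commas.

```python
import subprocess

# Fields requested from nvidia-smi; all four are standard --query-gpu properties.
QUERY_FIELDS = "index,name,fan.speed,temperature.gpu"


def parse_fan_report(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 4:
            continue  # skip lines that do not match the expected layout
        index, name, fan, temp = parts
        rows.append({
            "index": int(index),
            "name": name,
            # nvidia-smi prints a non-numeric marker when a value is unavailable
            "fan_pct": int(fan) if fan.isdigit() else None,
            "temp_c": int(temp) if temp.isdigit() else None,
        })
    return rows


def stuck_fans(rows, max_fan_pct=40, min_temp_c=70):
    """Return GPUs that are hot while the fan stays low (hypothetical thresholds)."""
    return [
        r for r in rows
        if r["fan_pct"] is not None
        and r["temp_c"] is not None
        and r["fan_pct"] <= max_fan_pct
        and r["temp_c"] >= min_temp_c
    ]


def read_fan_report():
    """Run nvidia-smi once and return the parsed rows."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_fan_report(result.stdout)
```

Running `stuck_fans(read_fan_report())` on the rig while it is mining should list only the misbehaving card if the driver sees the same thing Hive reports. That would suggest the low fan speed is not just a display problem in the Hive dashboard.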