Hello, while I was sleeping my rig ran into trouble, and by the time I noticed, one of my GPUs was no longer recognized by HiveOS. The faulty GPU is an MSI Radeon RX 480 ARMOR 8G OC, and it was connected directly to the motherboard.
“Autofan: GPU temperature 511 is unreal, driver error” message:
Mar 17 05:10:32 MyFirstRig kernel: amdgpu 0000:02:00.0: amdgpu: GPU fault detected: 147 0x03e08802 for process teamredminer pid 2775 thread teamredminer pid 2775
Mar 17 05:10:32 MyFirstRig kernel: amdgpu 0000:02:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0003647C
Mar 17 05:10:32 MyFirstRig kernel: amdgpu 0000:02:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02088002
Mar 17 05:10:32 MyFirstRig kernel: amdgpu 0000:02:00.0: amdgpu: VM fault (0x02, vmid 1, pasid 32772) at page 222332, read from 'TC6' (0x54433600) (136)
Mar 17 05:10:32 MyFirstRig kernel: amdgpu 0000:02:00.0: amdgpu: GPU fault detected: 146 0x06a8080c for process teamredminer pid 2775 thread teamredminer pid 2775
Mar 17 05:10:32 MyFirstRig kernel: amdgpu 0000:02:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0xFFFFFFFF
Mar 17 05:10:32 MyFirstRig kernel: amdgpu 0000:02:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0xFFFFFFFF
Mar 17 05:10:45 MyFirstRig kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=1598687, emitted seq=1598688
Mar 17 05:10:45 MyFirstRig kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process teamredminer pid 2775 thread teamredminer pid 2775
01:00.0 Temp: 64C Fan: 67% Power: 95W
02:00.0 Temp: 511C Fan: 100% Power: 16777215W
05:00.0 Temp: 63C Fan: 67% Power: 62W
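A side note on the readings above: 511 and 16777215 are 2^9 - 1 and 2^24 - 1, i.e. every bit set, the same pattern as the 0xFFFFFFFF fault registers in the kernel log. Reads from a device that has stopped responding commonly come back as all ones, so this would be consistent with the card dropping off the bus, although the logs alone do not confirm it. Below is a minimal Python sketch of that check; the helper name `is_all_ones` and the 8-bit minimum width are assumptions chosen for illustration, not anything HiveOS or the driver does.

```python
def is_all_ones(value: int, min_bits: int = 8) -> bool:
    """Return True if value equals 2**n - 1 for some n >= min_bits.

    min_bits is a hypothetical threshold so that small, realistic
    readings such as 63 (0x3F) are not flagged.
    """
    smallest = (1 << min_bits) - 1
    # value & (value + 1) == 0 holds exactly when every bit below the
    # highest set bit is also set.
    return value >= smallest and (value & (value + 1)) == 0


if __name__ == "__main__":
    # Values copied from the logs in this post.
    readings = {
        "02:00.0 temp": 511,
        "02:00.0 power": 16777215,
        "fault addr": 0xFFFFFFFF,
        "01:00.0 temp": 64,
        "01:00.0 power": 95,
    }
    for name, value in readings.items():
        verdict = "all ones" if is_all_ones(value) else "plausible"
        print(f"{name}: {value} ({hex(value)}) -> {verdict}")
```

Only the three values from the faulty card are flagged; the readings from the healthy cards pass through as plausible.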
“teamredminer: GPU 1: detected DEAD” message:
[2022-03-17 05:10:41] ------------------------------------------------------------------------------------------------
[2022-03-17 05:10:41] Mining ethash with 3 GPU workers
[2022-03-17 05:10:41] GPU PCIe CUs CoreMHz SocMHz MemMHz TEdge TJct TMem FanPct FanRpm VDDC Power ETH Cfg
[2022-03-17 05:10:41] Watchdog GPU 1: Temperature 511C is over limit 85C, stopping GPU work.
[2022-03-17 05:10:41] 0 01:00.0 32 1168 0 2100 63C 0C 0C 67.84% 1191 868 mV 95 W A202
[2022-03-17 05:10:41] 1 02:00.0 36 0 0 0 511C 0C 0C 100.00% 0 65 mV 1073 W A204
[2022-03-17 05:10:41] 2 05:00.0 36 1125 0 2100 63C 0C 0C 67.84% 2962 900 mV 62 W A192
[2022-03-17 05:10:41]
[2022-03-17 05:10:41] Stats Uptime: 10 days, 13:32:23
[2022-03-17 05:10:41] ----------------------------------------- GPU Status -------------------------------------------
[2022-03-17 05:10:41] GPU 0 [63C, fan 67%] ethash: 31.55Mh/s, avg 31.57Mh/s, pool 31.33Mh/s a:6659 r:1 hw:0
[2022-03-17 05:10:41] GPU 1 [511C, fan 99%] ethash: 31.02Mh/s, avg 31.01Mh/s, pool 30.81Mh/s a:6548 r:0 hw:0
[2022-03-17 05:10:41] GPU 2 [63C, fan 67%] ethash: 30.57Mh/s, avg 30.55Mh/s, pool 30.06Mh/s a:6389 r:2 hw:0
[2022-03-17 05:10:41] Total ethash: 93.13Mh/s, avg 93.13Mh/s, pool 92.21Mh/s a:19596 r:3
[2022-03-17 05:10:41] ----------------------------------------- Pool Status ------------------------------------------
[2022-03-17 05:10:41] eu-eth.hiveon.net ethash: 84.76Mh/s, avg 92.43Mh/s, pool 92.21Mh/s a:19596 r:3
[2022-03-17 05:10:41] eu-eth.hiveon.net ethash: 0.000 h/s, avg 0.000 h/s, pool 0.000 h/s a:0 r:0
[2022-03-17 05:10:41] ------------------------------------------------------------------------------------------------
[2022-03-17 05:10:42] Pool eu-eth.hiveon.net received new job. (job_id: 0x42e79a9ddda..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:44] Pool eu-eth.hiveon.net received new job. (job_id: 0xc4db4a95d48..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:46] Pool eu-eth.hiveon.net received new job. (job_id: 0xbba79b6e9ec..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:48] Pool eu-eth.hiveon.net received new job. (job_id: 0x8a2c811de28..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:50] Pool eu-eth.hiveon.net received new job. (job_id: 0x028937d8712..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:50] Pool eu-eth.hiveon.net received new job. (job_id: 0x028937d8712..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:51] Pool eu-eth.hiveon.net received new job. (job_id: 0x868b03cc443..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:51] Pool eu-eth.hiveon.net received new job. (job_id: 0xeda10e7dbe8..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:53] Pool eu-eth.hiveon.net received new job. (job_id: 0x2df30727be9..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:55] Pool eu-eth.hiveon.net received new job. (job_id: 0x121e5c1aa52..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:56] Pool eu-eth.hiveon.net share accepted. (GPU2) (a:19597 r:3) (69 ms) (diff 102.38 GH)
[2022-03-17 05:10:58] Pool eu-eth.hiveon.net received new job. (job_id: 0x7de259c2ed0..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:59] Pool eu-eth.hiveon.net received new job. (job_id: 0x435e3b59e0f..., diff 1.000 / 4295 MH)
[2022-03-17 05:11:01] GPU 1: detected DEAD (02:00.0), will execute restart script watchdog.sh
[2022-03-17 05:11:01] Watchdog script executor thread executing script 'watchdog.sh'
[2022-03-17 05:11:01] Pool eu-eth.hiveon.net received new job. (job_id: 0xa5b67ad257d..., diff 1.000 / 4295 MH)
The second “GPU Dead” message comes from an initialization failure: once the faulty GPU was no longer recognized, GPU 2 was renumbered as GPU 1 and picked up OC settings that prevented it from initializing. To work around that, I put a GPU that works with those OC settings into the faulty card's former slot.
This image shows the settings used, and the temp history:
The faulty GPU was in the GPU 1 position with the settings shown in the image, and it ran consistently under 50ºC (as seen in the image, before the gap where the rig was off). It also drew 68W on average.
I tried switching PCIe slots, using a riser, and cleaning off the dust (there was a little) with cotton swabs. I also opened it up to check whether anything looked burned and found nothing, but it still isn't working. When I turn the rig on, the fans sometimes spin briefly and then stop, and sometimes the same thing happens when I turn the rig off.
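One more check that might help narrow this down is whether the card still enumerates on the PCIe bus at all, independent of the amdgpu driver. On Linux, each detected PCI device gets a directory under `/sys/bus/pci/devices`, with `vendor` and `device` files holding its IDs. The sketch below assumes the card would still appear at `0000:02:00.0`, which may no longer hold after moving it to another slot; the function names are made up for illustration.

```python
from pathlib import Path
from typing import Optional, Tuple

SYSFS_PCI = "/sys/bus/pci/devices"


def pci_device_present(address: str, sysfs_root: str = SYSFS_PCI) -> bool:
    """Return True if the kernel enumerated a PCI device at this address."""
    return (Path(sysfs_root) / address).exists()


def read_pci_ids(address: str, sysfs_root: str = SYSFS_PCI) -> Optional[Tuple[str, str]]:
    """Return (vendor, device) ID strings, or None if they cannot be read."""
    dev = Path(sysfs_root) / address
    try:
        vendor = (dev / "vendor").read_text().strip()
        device = (dev / "device").read_text().strip()
    except OSError:
        return None
    return vendor, device


if __name__ == "__main__":
    # Assumption: the faulty card is expected at the address from the logs.
    address = "0000:02:00.0"
    if pci_device_present(address):
        print(f"{address} is enumerated, IDs: {read_pci_ids(address)}")
    else:
        print(f"{address} is not enumerated by the kernel")
```

If the address is missing even with the card seated directly in a known-good slot, the card is not being detected at the bus level, so miner and OC settings are unlikely to be the cause. If it is present with vendor ID `0x1002` (AMD), the card is at least enumerating and the problem is further up the stack.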
Can anybody please give me some assistance or tell me whether this is a dead GPU? I tried to find information about this, but couldn't find this specific case, where the GPU is no longer recognized after the fault 147 and 146 log entries. Thank you!