Radeon RX 480 ARMOR 8G OC - MSI not being detected after fault 147 and 146

ShadowJ14 · March 17, 2022, 4:21pm

Hello, while I was sleeping my rig ran into some issues, and by the time I noticed, one of my GPUs was no longer being recognized by HiveOS. The faulty GPU is a Radeon RX 480 ARMOR 8G OC - MSI, and it was connected directly to the motherboard.
“Autofan: GPU temperature 511 is unreal, driver error” message:

Mar 17 05:10:32 MyFirstRig kernel: amdgpu 0000:02:00.0: amdgpu: GPU fault detected: 147 0x03e08802 for process teamredminer pid 2775 thread teamredminer pid 2775
Mar 17 05:10:32 MyFirstRig kernel: amdgpu 0000:02:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0003647C
Mar 17 05:10:32 MyFirstRig kernel: amdgpu 0000:02:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02088002
Mar 17 05:10:32 MyFirstRig kernel: amdgpu 0000:02:00.0: amdgpu: VM fault (0x02, vmid 1, pasid 32772) at page 222332, read from 'TC6' (0x54433600) (136)
Mar 17 05:10:32 MyFirstRig kernel: amdgpu 0000:02:00.0: amdgpu: GPU fault detected: 146 0x06a8080c for process teamredminer pid 2775 thread teamredminer pid 2775
Mar 17 05:10:32 MyFirstRig kernel: amdgpu 0000:02:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0xFFFFFFFF
Mar 17 05:10:32 MyFirstRig kernel: amdgpu 0000:02:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0xFFFFFFFF
Mar 17 05:10:45 MyFirstRig kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=1598687, emitted seq=1598688
Mar 17 05:10:45 MyFirstRig kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process teamredminer pid 2775 thread teamredminer pid 2775
01:00.0  Temp: 64C  Fan: 67%  Power: 95W
02:00.0  Temp: 511C  Fan: 100%  Power: 16777215W
05:00.0  Temp: 63C  Fan: 67%  Power: 62W

“teamredminer: GPU 1: detected DEAD” message:

[2022-03-17 05:10:41] ------------------------------------------------------------------------------------------------
[2022-03-17 05:10:41] Mining ethash with 3 GPU workers
[2022-03-17 05:10:41] GPU PCIe      CUs CoreMHz SocMHz MemMHz TEdge TJct  TMem  FanPct  FanRpm  VDDC   Power  ETH Cfg
[2022-03-17 05:10:41] Watchdog GPU 1: Temperature 511C is over limit 85C, stopping GPU work.
[2022-03-17 05:10:41] 0   01:00.0   32  1168    0      2100   63C    0C    0C   67.84%  1191    868 mV  95 W  A202
[2022-03-17 05:10:41] 1   02:00.0   36  0       0      0     511C    0C    0C   100.00%    0     65 mV 1073 W  A204
[2022-03-17 05:10:41] 2   05:00.0   36  1125    0      2100   63C    0C    0C   67.84%  2962    900 mV  62 W  A192
[2022-03-17 05:10:41] 
[2022-03-17 05:10:41] Stats Uptime: 10 days, 13:32:23
[2022-03-17 05:10:41] ----------------------------------------- GPU Status -------------------------------------------
[2022-03-17 05:10:41] GPU  0 [63C, fan 67%]      ethash: 31.55Mh/s, avg 31.57Mh/s, pool 31.33Mh/s a:6659 r:1 hw:0
[2022-03-17 05:10:41] GPU  1 [511C, fan 99%]      ethash: 31.02Mh/s, avg 31.01Mh/s, pool 30.81Mh/s a:6548 r:0 hw:0
[2022-03-17 05:10:41] GPU  2 [63C, fan 67%]      ethash: 30.57Mh/s, avg 30.55Mh/s, pool 30.06Mh/s a:6389 r:2 hw:0
[2022-03-17 05:10:41] Total                      ethash: 93.13Mh/s, avg 93.13Mh/s, pool 92.21Mh/s a:19596 r:3
[2022-03-17 05:10:41] ----------------------------------------- Pool Status ------------------------------------------
[2022-03-17 05:10:41] eu-eth.hiveon.net          ethash: 84.76Mh/s, avg 92.43Mh/s, pool 92.21Mh/s a:19596 r:3
[2022-03-17 05:10:41] eu-eth.hiveon.net          ethash: 0.000 h/s, avg 0.000 h/s, pool 0.000 h/s a:0 r:0
[2022-03-17 05:10:41] ------------------------------------------------------------------------------------------------
[2022-03-17 05:10:42] Pool eu-eth.hiveon.net received new job. (job_id: 0x42e79a9ddda..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:44] Pool eu-eth.hiveon.net received new job. (job_id: 0xc4db4a95d48..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:46] Pool eu-eth.hiveon.net received new job. (job_id: 0xbba79b6e9ec..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:48] Pool eu-eth.hiveon.net received new job. (job_id: 0x8a2c811de28..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:50] Pool eu-eth.hiveon.net received new job. (job_id: 0x028937d8712..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:50] Pool eu-eth.hiveon.net received new job. (job_id: 0x028937d8712..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:51] Pool eu-eth.hiveon.net received new job. (job_id: 0x868b03cc443..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:51] Pool eu-eth.hiveon.net received new job. (job_id: 0xeda10e7dbe8..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:53] Pool eu-eth.hiveon.net received new job. (job_id: 0x2df30727be9..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:55] Pool eu-eth.hiveon.net received new job. (job_id: 0x121e5c1aa52..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:56] Pool eu-eth.hiveon.net share accepted. (GPU2) (a:19597 r:3) (69 ms) (diff 102.38 GH)
[2022-03-17 05:10:58] Pool eu-eth.hiveon.net received new job. (job_id: 0x7de259c2ed0..., diff 1.000 / 4295 MH)
[2022-03-17 05:10:59] Pool eu-eth.hiveon.net received new job. (job_id: 0x435e3b59e0f..., diff 1.000 / 4295 MH)
[2022-03-17 05:11:01] GPU 1: detected DEAD (02:00.0), will execute restart script watchdog.sh
[2022-03-17 05:11:01] Watchdog script executor thread executing script 'watchdog.sh'
[2022-03-17 05:11:01] Pool eu-eth.hiveon.net received new job. (job_id: 0xa5b67ad257d..., diff 1.000 / 4295 MH)

The second “GPU Dead” message comes from an initialization failure: once the rig stopped recognising the aforementioned GPU, it renumbered GPU 2 as GPU 1, and the OC settings assigned to GPU 1 prevented that card from initializing. So I moved a GPU that works with those OC settings into the faulty card’s previous slot.
This image shows the settings used, and the temp history:

The faulty GPU was placed in GPU 1 with the settings presented in the image, and it ran consistently under 50ºC (as seen in the image before the gap where the rig was off). It also consumed an average of 68W.
I tried switching PCIe slots, using a riser, and cleaning the dust (it had a bit) with cotton swabs. I also opened the card up to check whether anything looked burned and didn’t find anything, but it still isn’t working. When I turn on the rig, sometimes the fans rotate a bit and then stop, and sometimes this happens when I turn the rig off.
Can anybody please give me some assistance or let me know if this is a dead GPU? I tried to find info related to this, but couldn’t find this issue specifically, where the GPU is no longer recognized after the fault 147 and 146 logs. Thank you!
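For anyone who wants to pull the fault lines out of a longer kernel log, here is a minimal Python sketch. It assumes the log lines look exactly like the two "GPU fault detected" lines quoted above; the names `FAULT_RE` and `find_gpu_faults` are made up for this example and are not part of HiveOS or the amdgpu driver.

```python
import re

# Pattern modelled on the "GPU fault detected" lines quoted in this post.
# Assumption: other kernel or driver versions may word the message differently.
FAULT_RE = re.compile(
    r"amdgpu (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): amdgpu: "
    r"GPU fault detected: (?P<code>\d+) (?P<data>0x[0-9a-fA-F]+) "
    r"for process (?P<proc>\S+) pid (?P<pid>\d+)"
)


def find_gpu_faults(lines):
    """Return (pci_address, fault_code, data_word, process) for each fault line."""
    faults = []
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            faults.append((m["pci"], int(m["code"]), m["data"], m["proc"]))
    return faults


# The two fault lines from the log excerpt above.
sample = [
    "Mar 17 05:10:32 MyFirstRig kernel: amdgpu 0000:02:00.0: amdgpu: GPU fault detected: 147 0x03e08802 for process teamredminer pid 2775 thread teamredminer pid 2775",
    "Mar 17 05:10:32 MyFirstRig kernel: amdgpu 0000:02:00.0: amdgpu: GPU fault detected: 146 0x06a8080c for process teamredminer pid 2775 thread teamredminer pid 2775",
]

for pci, code, data, proc in find_gpu_faults(sample):
    print(f"{pci}: fault {code} ({data}) while running {proc}")
```

Both faults land on PCI address 02:00.0, the same address that reports 511C in the autofan output, which ties the kernel errors to the card that later disappeared.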

Grea · March 17, 2022, 5:14pm

511 is a “power disconnection” malfunction. Normally the riser path is to blame, but in your case it sounds like the card was plugged directly into a PCIe slot.

None of the data shared so far supports an OS-level issue; it all points to hardware.

Personally, I’d pull the GPU and put it in another validated PC/rig and see if it shows up at all at BIOS, HiveOS, Windows, etc.
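A note on why 511 reads as a lost connection rather than a real temperature: 511 is 0x1FF and the reported 16777215W is 0xFFFFFF, both all-ones bit patterns, and the second fault's registers read 0xFFFFFFFF. Reads from a PCIe device that has stopped responding typically come back as all ones, so these values look like placeholders, not measurements. The Python sketch below checks for that pattern; `all_ones_width` is a hypothetical helper written for this example. Small legitimate values such as 63 also match, so the check is only a hint when the value is implausibly large as well.

```python
def all_ones_width(value: int):
    """Return n if value == 2**n - 1 (binary 111...1), otherwise None."""
    if value > 0 and (value & (value + 1)) == 0:
        return value.bit_length()
    return None


# Values taken from the logs in this thread, plus one healthy reading.
readings = {
    "temp_c (02:00.0)": 511,
    "power_w (02:00.0)": 16777215,
    "fault_addr (second fault)": 0xFFFFFFFF,
    "temp_c (01:00.0)": 64,
}

for name, value in readings.items():
    width = all_ones_width(value)
    verdict = f"all ones, {width} bits" if width else "ordinary value"
    print(f"{name}: {value} = {hex(value)} -> {verdict}")
```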

ShadowJ14 · March 17, 2022, 5:35pm

Thank you for the comment. I don’t have another PC/rig I can test it on, but I’m creating a portable Windows USB drive to see if it shows up there. I can’t find where PCIe devices are listed in my BIOS, but I’ll take another look.

Grea · March 17, 2022, 6:09pm

HiveOS is going to report as “visible” only those GPUs which are assigned PCI addresses by the BIOS.

Sometimes they won’t be identified properly due to Kernel, Drivers, etc., but they’ll be in there somehow if the BIOS sees them and assigns the address.

When one disappears as you describe, Hardware or Power are first places to hunt.
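To check from the Linux side whether a card was assigned a PCI address at all, one option is to read sysfs directly. The sketch below is an illustration, not a HiveOS tool: it assumes the standard `/sys/bus/pci/devices` layout with `class` and `vendor` files, and `list_display_devices` is a name invented for this example. PCI base class 0x03 is the display-controller class and 0x1002 is AMD's vendor ID.

```python
from pathlib import Path


def list_display_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return (pci_address, vendor_id, class_code) for display-class PCI devices."""
    found = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return found  # not Linux, or sysfs is not mounted
    for dev in sorted(root.iterdir()):
        try:
            class_code = (dev / "class").read_text().strip()
            vendor = (dev / "vendor").read_text().strip()
        except OSError:
            continue  # skip entries that cannot be read
        # Class codes look like "0x030000"; base class 0x03 = display controller.
        if class_code.startswith("0x03"):
            found.append((dev.name, vendor, class_code))
    return found


if __name__ == "__main__":
    for addr, vendor, class_code in list_display_devices():
        label = "AMD" if vendor == "0x1002" else vendor
        print(addr, label, class_code)
```

If the faulty card's address (0000:02:00.0 in the logs above, though it may change after moving slots) is missing from this list while the other GPUs appear, the card was never enumerated, which matches the hardware-or-power line of investigation.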

ShadowJ14 · March 18, 2022, 10:23am

That makes sense. I think it may be hardware (GPU) related. The universe is conspiring against me: I’m building my first rig, and the third GPU I bought (used, from eBay) died less than a month after purchase.

I’ve tried plugging it into both the motherboard and the riser, with different power cables that work for my other GPUs, and the faulty one still doesn’t work, even when it’s the only one connected. Most times the fans start and stop almost immediately when I turn on the rig, but sometimes the fans stay on for a bit and then turn off. In neither case is the card recognized. What would you suggest I do? Would changing the thermal paste be worth trying, or should I look into getting it repaired? And is it even worth it? Could the repair cost be almost as high as the card itself (~250€)?

Grea · March 18, 2022, 3:33pm

Put it in a 16x PCIe slot running Windows and see what happens. There are some Windows tools which tend to be more widely described on the internet/YouTube/technical sites, etc.

Let your other GPUs mine while all this is going on.
