12 GPU rack has been fine for months - now having 7 GPUs drop offline. Extensive troubleshooting

rawbar · July 1, 2021, 5:48pm

I’ll try to stick to the facts and make this simple. I have spent probably 20 hours troubleshooting at this point including rebuilding the whole rack.

Since Feb, I’ve had a 12 GPU rig (all 3070 and 3060 Ti) running very stably on HiveOS.
I am using 2x HP server PSUs with a Parallel Miner X11 and ZSX card.

A couple of weeks ago I noticed 7 cards kept dropping off. Reboots weren’t helping so I went into the basement to troubleshoot. While troubleshooting, the ZSX caught on fire! I got things back up by pulling a PSU and ZSX from another rig that I’d recently shut down. Everything came back up fine.

But because I was concerned about the power draw on each PSU, I ordered a 3rd PSU and another ZSX card to spread the load out.

Upon installing the 3rd PSU, I kept having 7 cards drop offline again after a few minutes. Ultimately I ended up rebuilding the whole rig as I couldn’t be sure that risers and GPUs were connected to the same PSU. I made sure when I booted that the 1x PCIe slot LED on the MB came on for each GPU (in some cases I had to replace risers to get it to come on, risers that were just working the day before… replaced 3 sets I think).

I am still having the same issue: 7 cards drop off after a few minutes (they are on multiple PSUs). Example error from syslog:

Jul 1 13:03:43 RAWBAR-RIG1 kernel: [ 447.812938][ C0] NVRM: GPU at PCI:0000:05:00: GPU-258fe041-3674-470b-c91b-b37c750329db
Jul 1 13:03:43 RAWBAR-RIG1 kernel: [ 447.812943][ C0] NVRM: GPU Board Serial Number:
Jul 1 13:03:43 RAWBAR-RIG1 kernel: [ 447.812948][ C0] NVRM: Xid (PCI:0000:05:00): 79, pid=0, GPU has fallen off the bus.
Jul 1 13:03:43 RAWBAR-RIG1 kernel: [ 447.812957][ C0] NVRM: GPU 0000:05:00.0: GPU has fallen off the bus.
Jul 1 13:03:43 RAWBAR-RIG1 kernel: [ 447.812960][ C0] NVRM: GPU 0000:05:00.0: GPU is on Board .
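With 7 cards dropping at once, it can help to list exactly which PCI addresses report Xid 79 so they can be compared against how the risers and PSUs are wired. The following is a minimal sketch, assuming the syslog lines keep the format shown in the excerpt above; the function name `find_xid_events` and the regular expression are illustrative, not part of HiveOS or the NVIDIA driver.

```python
import re

# Matches the "NVRM: Xid (PCI:<address>): <code>," lines seen in the excerpt.
# The format is assumed from this one sample and may differ by driver version.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def find_xid_events(lines, xid_code=79):
    """Return the PCI addresses that reported the given Xid code, in first-seen order."""
    hits = []
    for line in lines:
        match = XID_RE.search(line)
        if match and int(match.group(2)) == xid_code:
            address = match.group(1)
            if address not in hits:
                hits.append(address)
    return hits


sample = [
    "Jul 1 13:03:43 RAWBAR-RIG1 kernel: [ 447.812938][ C0] NVRM: GPU at PCI:0000:05:00: GPU-258fe041-3674-470b-c91b-b37c750329db",
    "Jul 1 13:03:43 RAWBAR-RIG1 kernel: [ 447.812948][ C0] NVRM: Xid (PCI:0000:05:00): 79, pid=0, GPU has fallen off the bus.",
]
print(find_xid_events(sample))
```

Feeding it the full syslog instead of `sample` would give the complete list of addresses that fell off the bus.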

Now, if I power off the 3rd PSU, 8 cards come up and mine fine, no issues at all. I tested for 24 hours.
If I turn on that 3rd PSU (which is a brand new PSU and brand new ZSX), which has 4 cards connected to it, I have 7 cards drop off the system and vanish as far as HiveOS is concerned…

Nothing of substance in the t-rex.log:
-----------------20210701 13:03:49 -----------------
Mining at eth-us-east.flexpool.io:4444, diff: 4.00 G
GPU # 0: EVGA RTX 3070 - 52.05 MH/s, [T:51C, P:119W, F:59%, E:437kH/W], 6/6 R:0%
GPU # 1: EVGA RTX 3070 - 52.05 MH/s, [T:67C, P:119W, F:67%, E:437kH/W], 5/5 R:0%
GPU # 2: EVGA RTX 3060 Ti - 51.73 MH/s, 4/4 R:0%
GPU # 3: Zotac RTX 3060 Ti - 52.20 MH/s, 3/3 R:0%
GPU # 4: Gigabyte RTX 3070 - 51.94 MH/s, 4/4 R:0%
GPU # 5: Gigabyte RTX 3070 - 52.06 MH/s, 6/6 R:0%
GPU # 6: Gigabyte RTX 3070 - 51.94 MH/s, 10/10 R:0%
GPU # 7: RTX 3060 Ti - 51.88 MH/s, 2/2 R:0%
GPU # 8: EVGA RTX 3070 - 51.94 MH/s, 6/6 R:0%
GPU # 9: ASUS RTX 3070 - 52.05 MH/s, [T:61C, P:119W, F:66%, E:437kH/W], 6/6 R:0%
GPU #10: Gigabyte RTX 3070 - 52.05 MH/s, [T:60C, P:123W, F:68%, E:423kH/W], 8/8 R:0%
GPU #11: ASUS RTX 3070 - 52.05 MH/s, [T:59C, P:119W, F:62%, E:437kH/W], 7/7 R:0%
Hashrate: 623.94 MH/s, Shares/min: 11.704 (Avr. 10.691)
Uptime: 6 mins 21 secs | Algo: ethash | T-Rex v0.20.4

20210701 13:03:58 ethash epoch: 424, block: 12743028, diff: 4.00 G
20210701 13:04:06 [ OK ] 68/68 - 623.93 MH/s, 26ms … GPU #0
20210701 13:04:06 [ OK ] 69/69 - 623.93 MH/s, 24ms … GPU #9
20210701 13:04:07 [ OK ] 70/70 - 623.93 MH/s, 22ms … GPU #9
20210701 13:04:17 ethash epoch: 424, block: 12743029, diff: 4.00 G

-----------------20210701 13:04:19 -----------------
Mining at eth-us-east.flexpool.io:4444, diff: 4.00 G
GPU # 0: EVGA RTX 3070 - 52.05 MH/s, [T:51C, P:119W, F:59%, E:437kH/W], 7/7 R:0%
GPU # 1: EVGA RTX 3070 - 52.05 MH/s, [T:67C, P:119W, F:67%, E:437kH/W], 5/5 R:0%
GPU # 2: EVGA RTX 3060 Ti - 0.00 H/s, 4/4 R:0%
GPU # 3: Zotac RTX 3060 Ti - 0.00 H/s, 3/3 R:0%
GPU # 4: Gigabyte RTX 3070 - 0.00 H/s, 4/4 R:0%
GPU # 5: Gigabyte RTX 3070 - 0.00 H/s, 6/6 R:0%
GPU # 6: Gigabyte RTX 3070 - 0.00 H/s, 10/10 R:0%
GPU # 7: RTX 3060 Ti - 0.00 H/s, 2/2 R:0%
GPU # 8: EVGA RTX 3070 - 0.00 H/s, 6/6 R:0%
GPU # 9: ASUS RTX 3070 - 52.05 MH/s, [T:61C, P:119W, F:67%, E:437kH/W], 8/8 R:0%
GPU #10: Gigabyte RTX 3070 - 52.05 MH/s, [T:61C, P:125W, F:68%, E:420kH/W], 8/8 R:0%
GPU #11: ASUS RTX 3070 - 52.05 MH/s, [T:59C, P:119W, F:63%, E:437kH/W], 7/7 R:0%
Hashrate: 260.24 MH/s, Shares/min: 11.478 (Avr. 10.526)
Uptime: 6 mins 51 secs | Algo: ethash | T-Rex v0.20.4
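To track which miner indices stall from one report to the next, the per-GPU lines can be parsed and filtered for a zero hashrate. This is only a sketch, assuming the summary lines look like the ones in the log above; `stalled_gpus` and the regular expression are made up for illustration and are not part of T-Rex.

```python
import re

# Matches lines such as "GPU # 2: EVGA RTX 3060 Ti - 0.00 H/s, 4/4 R:0%".
# The layout is assumed from the log excerpt and may change between miner versions.
GPU_LINE_RE = re.compile(r"GPU #\s*(\d+): (.+?) - ([\d.]+) (H|kH|MH|GH)/s")


def stalled_gpus(report_lines):
    """Return the miner GPU indices whose reported hashrate is zero."""
    stalled = []
    for line in report_lines:
        match = GPU_LINE_RE.search(line)
        if match and float(match.group(3)) == 0.0:
            stalled.append(int(match.group(1)))
    return stalled


report = [
    "GPU # 1: EVGA RTX 3070 - 52.05 MH/s, [T:67C, P:119W, F:67%, E:437kH/W], 5/5 R:0%",
    "GPU # 2: EVGA RTX 3060 Ti - 0.00 H/s, 4/4 R:0%",
    "GPU # 8: EVGA RTX 3070 - 0.00 H/s, 6/6 R:0%",
    "GPU #10: Gigabyte RTX 3070 - 52.05 MH/s, [T:61C, P:125W, F:68%, E:420kH/W], 8/8 R:0%",
]
print(stalled_gpus(report))  # [2, 8]
```

Run against the full 13:04:19 report above, it would list GPUs 2 through 8, the same 7 cards that dropped.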

Cards are currently NOT overclocked. I’ve only set the voltage max so as not to overload the PSUs.

Yesterday I did see an error, which I’m not seeing today, that preceded the 7 cards dropping off. It was the NVRM nvidia tool reporting that it couldn’t read the temperature of card #2, followed by 7 cards immediately falling off the bus. I disconnected #2 and rebooted, but the same problem happened to the other 6 cards.

Note: These screenshots are from yesterday when I still had some OC applied.
[Screenshots not preserved: "hiveos gpu temps 0", "unknown"]

This is just so strange. I have spent so much time on this that I really need another set of eyes. The fact that I’m losing 7 cards across multiple PSUs with 3 PSUs connected, but not losing any cards if I shut off the one new PSU, doesn’t make any sense to me.

I looked at the motherboard BIOS config as well. That’s an H310-F Pro w/8GB memory. I didn’t see anything wrong that jumped out at me.

rawbar · July 1, 2021, 6:21pm

Works fine with 8 cards and 2 PSUs.

rawbar · July 1, 2021, 9:17pm

Another data point. I moved two of the 4 cards off the new PSU and put one each on the other PSUs, leaving 2 on the new PSU.

I booted up 10 cards on two PSUs and let it run for about 30 minutes. Works great.
Added in the 3rd PSU, and 7 cards crashed again. So whenever I introduce PSU #3 into the picture, I have the same problem with 7 cards dropping off. But I’ve swapped out the PSU and ZSX and the same thing happens. I wonder if one of the remaining two cards on that PSU is bad and somehow taking down 6 more with it… I’ll have to test that theory, but probably not tonight.

rkulov · July 2, 2021, 4:29am

Try using the new PSU that you got to power the mobo and a few cards (not many), and the other two PSUs to power the risers and the other power connections of the remaining cards.

rawbar · July 2, 2021, 4:50am

Thanks, I’ve spent several more hours on this. I’ve got it up to 11 cards. For the 12th, I’ve swapped out every single thing including the GPU. If I add a 12th, I end up having 7 cards drop off. With 11 cards there is only 1 card on the new PSU; with 12 there are 2.

I’ve spent enough time on this. Tomorrow I’m going to drop down to 10 cards, move the new PSU over to my decommissioned rig 2, and bring that rig back online.

rkulov · July 2, 2021, 4:54am

I wish you luck. I have only one card to worry about. When I get to 12 cards or another 2 rigs, I will have more experience and be able to help you better.

rawbar · July 2, 2021, 6:27pm

I gave up. I have 10 GPUs on rig 1 and 2 on rig 2 now instead of 12 on rig 1. I don’t know what happened to prevent 12 from working after it’s been fine since Feb, but I’ve spent too much time on this already. The only downside is that the 2nd rig isn’t in a grow tent venting heat outside, but 2 GPUs isn’t bad.

The exact same GPUs, risers, cables, and PSU that introduce the problem on rig 1 work absolutely fine on rig 2.

wutupdogy23 · November 16, 2021, 1:44am

I have the same problem. Don’t know what to do.

rawbar · November 16, 2021, 2:03am

I still have only 10 running on rig 1. Just left it that way and grew a 2nd rig.

michaelbarriesmith · November 30, 2021, 9:15am

Measure the output voltage of your new power supply. I will bet it is a lower voltage and causing issues, just a guess. I would measure all outputs from each supply with a DVM; one will be causing problems, or you have an earth issue between the PSUs.
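Once the DVM readings are written down, a quick check like the one below can flag any supply that sits outside a chosen tolerance. It is a rough sketch only: the 12 V nominal and the 5% tolerance are assumptions to be replaced with the figures from the PSU's own spec, and the readings are placeholder values, not measurements from this thread.

```python
def out_of_tolerance(readings, nominal=12.0, tolerance=0.05):
    """Return the names of supplies whose reading deviates from nominal by more than tolerance.

    nominal and tolerance are assumed defaults; use the values from the PSU datasheet.
    """
    return [
        name
        for name, volts in readings.items()
        if abs(volts - nominal) / nominal > tolerance
    ]


# Placeholder readings for illustration only; substitute your own DVM measurements.
readings = {"psu1": 12.25, "psu2": 12.21, "psu3_new": 11.30}
print(out_of_tolerance(readings))  # ['psu3_new']
```

Measuring under load as well as at idle may matter, since a supply can read fine with nothing drawing from it.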
