3 rigs go offline at the SAME SECOND (HiveOS)

BravoBend · August 2, 2022, 9:47pm

Hello guys and girls,

I have a very big problem. I own 7 rigs, and all are set up via T-rex on ERGO. Out of the 7 rigs, only 3 rigs crash at the SAME second.
They go offline on HiveOS, make a weird fan spin-up noise for 1 second, and then the GPUs flash some color (I have ZOTACs). The rig stays powered, but there is no picture on the monitor and it shows offline in HiveOS. And no, it is not actually mining: on 2miners I see it at 0 MH/s.
I have the rigs setup like this:
2 rigs = 6 x 3080Ti Zotac on 2000w Dell Server PSU Platinum Edition (pulls 1680w from the wall)
1 rig = 4 x 3070 and 3 x 3060Ti on 1600w HP Server PSU Platinum Edition (pulls 980w from the wall)

Measured at the wall, they are WELL within the 20%-25% headroom below the PSU's rated wattage.
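The headroom claim can be checked with a little arithmetic. The sketch below uses only the rated and wall wattages quoted above; the function name `headroom_pct` and the 94% efficiency figure are assumptions for illustration (wall draw includes PSU losses, so the DC load is somewhat lower than the wall reading).

```python
def headroom_pct(rated_w, wall_w, efficiency=1.0):
    """Return the percentage of the PSU's rated output left unused.

    efficiency=1.0 treats the wall draw as the DC load (pessimistic).
    A lower value estimates the DC load as wall_w * efficiency.
    """
    dc_load_w = wall_w * efficiency
    return 100.0 * (rated_w - dc_load_w) / rated_w


# Rated wattage and wall draw as quoted in the post.
rigs = {
    "6 x 3080Ti on 2000w Dell": (2000, 1680),
    "4 x 3070 + 3 x 3060Ti on 1600w HP": (1600, 980),
}

for name, (rated, wall) in rigs.items():
    raw = headroom_pct(rated, wall)
    # 0.94 is an assumed efficiency, not a measured value.
    est = headroom_pct(rated, wall, efficiency=0.94)
    print(f"{name}: {raw:.1f}% by wall draw, {est:.1f}% at assumed 94% efficiency")
```

By raw wall draw the 2000w rigs come out at 16% headroom, and at roughly 21% under the assumed efficiency, so whether they meet the 20%-25% margin depends on which figure is used. The 1600w rig has plenty of margin either way.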
All 7 rigs are flashed on Kingston USB 3.2 (same model) and all have same Kernel/NVIDIA Driver/Trex Miner Version.
The 3 rigs are on the Chinese 8-slot motherboards, which I power via 2x 6-pin connectors on the mobo.
The CPU temp on all Mobos is between 45c and 55c.
What I have tried:
New RISERS on all gpus on the 3 rigs
Reset the CMOS (battery); the BIOS has 4G enabled and is set to always power on after a power loss
PCIe lanes are default, not set
All GPUs are run via separate 6pin->dual 8pin cables (288w rated) AWG18
This does NOT happen on a fixed time, but rather it can happen once or twice a day, or it will not happen even for 2-3 days or even more… What’s SUPER weird is that at the exact same second ONLY these 3 rigs go offline.

They have in common:

Same motherboard: AFHM65-ETH8EX (with 4GB RAM)
Same risers (on all 7 rigs)
Same PSU (2000w Dell Server on 4 rigs)
They are on different power breakers (3 different houses/locations) and on different internet providers.

No, OC is not the problem here. The clocks are LOW, and I run the same OC on different rigs that have NOT had a problem even once. If the OC were the problem, it would display an error or crash log…

PLEASE HELP ME TROUBLESHOOT THIS, IT IS DRIVING ME NUTZ.

keaton_hiveon · August 2, 2022, 9:58pm

are you on the latest stable image on all? what kernel number?

once the rig(s) crashes you are unable to get any picture? do you have to hard reset them to function again?

BravoBend · August 2, 2022, 10:08pm

All 6 rigs run on the same release, and on the same kernel and nvidia driver. Once the rigs crash, if I connect a monitor it shows no picture, but the fans are still spinning at around 20-30%, and I have to unplug the rig and plug it back in to get it working again

BravoBend · August 2, 2022, 10:09pm

5.10.0-hiveos #110
0.6-217@220422

On all 6 rigs…

keaton_hiveon · August 2, 2022, 10:16pm

with multiple showing offline at the same time i would think it would be an api server connection issue, but with them actually hard crashing and requiring a reboot that doesn’t seem to be the case.

start by updating everything, there’s quite a few software updates in the last 3.5 months. do you ever get any other crashes, errors etc on any of the rigs? what clocks are you running?

might also be worth trying another miner on one of the rigs, just to see if it impacts the crash issue. do all 3 rigs crash at the same time every time? or is it sometimes one or two?
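One way to answer the "all 3 every time, or sometimes one or two?" question is to write down when each rig went offline and group the times into incidents. This is only a sketch: the rig names and timestamps below are placeholders, not data from this thread, and `group_incidents` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def group_incidents(crashes, tolerance_s=1):
    """Group (rig, timestamp) pairs into incidents.

    A crash joins the current incident if it happened within tolerance_s
    seconds of the previous crash in that incident; otherwise it starts
    a new incident.
    """
    events = sorted(crashes, key=lambda e: e[1])
    incidents = []
    for rig, ts in events:
        if incidents and ts - incidents[-1][-1][1] <= timedelta(seconds=tolerance_s):
            incidents[-1].append((rig, ts))
        else:
            incidents.append([(rig, ts)])
    return incidents


# Placeholder values for illustration only.
sample = [
    ("rig1", datetime(2022, 8, 1, 3, 14, 7)),
    ("rig2", datetime(2022, 8, 1, 3, 14, 7)),
    ("rig3", datetime(2022, 8, 1, 3, 14, 8)),
    ("rig2", datetime(2022, 8, 2, 19, 40, 0)),
]

for incident in group_incidents(sample):
    rigs = sorted({rig for rig, _ in incident})
    print(incident[0][1].isoformat(), rigs)
```

If every incident lists all three rigs, a shared external trigger is more plausible. If some incidents list only one or two, the rigs are more likely failing independently.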

BravoBend · August 2, 2022, 10:19pm

The specific 3 rigs crash at the SAME SECOND (same way of crashing)… I don’t think it’s a miner issue NOR the version of HiveOS as I have on 3 more rigs the same hiveos version and same miner and they don’t have this issue.

keaton_hiveon · August 2, 2022, 10:29pm

in order to troubleshoot you need to do some trial and error. what i recommend is trying another algo/pool on one rig. eth is more profitable right now anyway, so you could always mine eth and trade for ergo, or get paid in ergo via unMineable or similar. that kills 2 birds with one stone: more profit (almost double) and ruling out your miner/config.

i would also try setting more conservative clocks on at least one rig, you are using locked core clocks already, right?

BravoBend · August 2, 2022, 10:35pm

On the 3060ti/3070 i am using absolute core clocks, and on the others i am using the same settings… it’s very strange cause they can run for a few days and then all of a sudden 3 AT THE SAME TIME…

How do you, me or someone else explain that? At the same fu*king second they do the same thing (fan spin up 100% for 1 second and then cool down)…

If it was the OC (im running same oc on other cards in the other rigs) it would display some sort of error (Temp unreal error, anything related to driver error and the rig would crash)…

This is straight up voltage spike and some problem happens… Maybe the entire batch of motherboards are faulty?

BravoBend · August 2, 2022, 10:37pm

I understand about the trial and error, and I am mining for around 3-4 years now…

What these 3 rigs have in common is:

same motherboard model AFHM65-ETH8EX (AFOX)
two of the rigs have 2000w Server Dell PSU (but then again 2 more rigs have this exact psu and no problems)

keaton_hiveon · August 2, 2022, 10:41pm

maybe there is some spike when the miner switches to the dev wallet, or something like that. its really hard to pinpoint anything without troubleshooting. try what i mentioned above. have you checked the miner log up to the crash? does it crash on the same point every time? (like building dag, switching to dev wallet etc) , you can also try leaving a monitor plugged in and on the miner console to see what happens when it crashes if anything is frozen on the screen, etc.

wouldnt hurt to do a fresh install via hive-replace -s -y either.
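To pin down the exact second a rig hard-crashes, independent of the miner log, a tiny heartbeat writer can help. The sketch below is an assumption-laden example, not a HiveOS feature: the function names and the log path are hypothetical. Each line is forced to disk with `os.fsync`, so the last line in the file should be close to the moment of the crash.

```python
import os
import time
from datetime import datetime, timezone


def write_heartbeat(path):
    """Append one UTC timestamp line and force it to disk."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(path, "a") as f:
        f.write(stamp + "\n")
        f.flush()
        os.fsync(f.fileno())  # make sure the line survives a hard crash
    return stamp


def last_heartbeat(path):
    """Return the last timestamp line written, or None if the file is empty."""
    with open(path) as f:
        lines = f.read().splitlines()
    return lines[-1] if lines else None


def run(path, interval_s=1.0):
    """Write a heartbeat every interval_s seconds until the process dies."""
    while True:
        write_heartbeat(path)
        time.sleep(interval_s)
```

Calling `run("heartbeat.log")` (a hypothetical path) on each rig and comparing `last_heartbeat` after the next crash shows whether the three rigs really stopped in the same second. Frequent synced writes wear out a USB stick, so a longer interval or an SSD/HDD is the safer choice.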

BravoBend · August 2, 2022, 10:51pm

I will do that tonight in hopes that it will show something… I will turn on the logs and see what they will produce

keaton_hiveon · August 2, 2022, 10:54pm

logs will wreck a usb drive pretty quickly, best to use an ssd or hdd for that btw. an ssd wouldnt be a bad thing to try either in the troubleshooting.

BravoBend · August 2, 2022, 10:56pm

Yeah I know… It’s not the USB either, as I have been using USB drives my entire life. How they go down at the SAME exact second is the million dollar mystery

keaton_hiveon · August 2, 2022, 10:58pm

gotta troubleshoot and find out

BravoBend · August 2, 2022, 11:01pm

Could it be related to HIVE’s IP or something? What if I change the worker, delete this one and make a new one…?

keaton_hiveon · August 2, 2022, 11:08pm

you can try creating a new worker/different account etc. but anything on hives side wouldn’t cause the rig to crash in the way youre explaining imo

BravoBend · August 2, 2022, 11:27pm

I will re-flash the usb, but only 0.6-217@220423 image is available for download from the official website. I will also create a new worker as well

keaton_hiveon · August 2, 2022, 11:32pm

let me know how it goes. make sure to update after reflashing, a new software update just came out today with some improvements. (0.6-219@220802)

BravoBend · August 3, 2022, 12:25am

0.6-219@220802

Updated to this, installed the latest T-rex miner. Rig has a different name, different ID and a brand new usb. I also added another 6pin power to the motherboard instead of only one.

Will update when the crash happens and whether it still affects this rig. The other 2 rigs are untouched.

Atanas · August 9, 2022, 8:42am

Hi there! I’ve only skimmed the posts, but I think it’s a connection problem.
Are those 3 problematic rigs connected to the same router/ISP?

I suggest leaving 1 monitor connected to 1 of the rigs all the time until the next crash. It must be connected from the start.