I have a rig with 8 Vega 56 and 64 cards. Vegas are not very stable, so GPU dead errors happen a few times a day. The watchdog recognizes the problem and restarts the miner (TRM), and the system works until the next error.
Unfortunately, sometimes (2-3 times a week) the whole system goes offline, and all I can do is cut the power. I found something in syslog, but to be honest I do not know how to interpret it. Can anybody help?
The rig went offline at 17:47. At 17:54 I cut the power, turned it back on, and the rig came back up.
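
In case it helps anyone narrow things down, here is a minimal Python sketch for pulling only the lines around the crash out of syslog. It rests on a few assumptions: the log uses the traditional timestamp layout (month, day, then `HH:MM:SS` as the third field), it lives at `/var/log/syslog` (the usual Debian/Ubuntu location, which may differ on a mining OS), and the 17:40 to 17:55 window is just a margin I picked around the times above. `lines_in_window` is a made-up helper name, not part of any tool.

```python
from datetime import datetime, time


def lines_in_window(lines, start, end):
    """Return the lines whose time-of-day lies within [start, end].

    Assumes traditional syslog lines where the third whitespace-separated
    field is HH:MM:SS. Lines that do not match are skipped.
    """
    selected = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # too short to carry a timestamp
        try:
            stamp = datetime.strptime(parts[2], "%H:%M:%S").time()
        except ValueError:
            continue  # third field is not a time, ignore the line
        if start <= stamp <= end:
            selected.append(line)
    return selected


if __name__ == "__main__":
    # Assumed path and window; adjust both for your own system and crash time.
    with open("/var/log/syslog", errors="replace") as log:
        for entry in lines_in_window(log, time(17, 40), time(17, 55)):
            print(entry, end="")
```

This only compares the time of day, so if the log covers several days it will also return the same window from the other days.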