Hey there! I’ve had an 8-GPU miner running for a few months now. It’s been running fine, but since last night I’ve been seeing lots of GPU driver errors and restarts. Digging into the log, it seems one of the cards hits a DAG verification error almost immediately on miner startup. It keeps mining for 5-10 minutes before running into a Device 7 CUDA Error (illegal memory access), which causes the miner to restart. While mining, that card has a reject/invalid share rate of approximately 50%. No other GPU in the system is running into this issue.
I’ve tried updating the kernel to the latest version and the GPU drivers to the latest stable version. I’ve also tried significantly reducing the overclock on that card. On top of that, I’ve replaced the riser and power cables and tried different PCIe slots. Nothing seems to work. The only thing that stops the system from crashing is unplugging the card that’s throwing the errors.
For reference, the card throwing the errors is a 3060 Ti and I’m using NBMiner. It has had no issues with a 1550 core lock and a 2300 memory clock for 3+ months. I’ve tried dropping that to a 1350 core clock and a 1600 memory OC, but no dice. Temperature for all cards is ~55 °C.