I currently work around this issue with the reboot function in the autofan configuration.
I would appreciate a real solution ^^
It seems like the miner can’t read the fan speed so it stays low and the gpu overheats. Sometimes I see it happen before the gpu reaches my predefined critical temperature which I have set to 75. Restarting the miner before the system reboots by itself solves the issue. This tells me the problem is with the miner and not the OS. This is random and days can go by before it happens again.
With the new version of T-Rex (0.20.3) I get this error:
TREX: Can’t find nonce with device [ID=0, GPU #0], cuda exception in [synchronize, 52], CUDA_ERROR_ILLEGAL_ADDRESS, try to reduce overclock to stabilize GPU state
I know this means that I need to decrease the overclocking, but with these settings everything was fine in the old version of the miner.
Hello. Same problem with a 3070.
In my case it was caused by the overclocking. I found out only a few days ago. I decided to reduce the memory OC by 100 MHz on all the cards and it’s been running well for two days. Let me know about your OC setting and I’ll help you.
I was just getting the same behaviour right here.
The fan value of one of my 1660 Super cards could not be read, and the T-Rex miner restarted frequently, roughly every 2-3 hours of running.
I haven’t tried the auto fan setting yet; I will give it a try if there are no other choices left.
Below is my OC setting of the card.
Update: even with the auto fan setting it still restarted frequently.
Below are the logs I fetched from /var/log/miner/t-rex.log
PS: after the miner restarted I saw a CRC with a (!) mark on that GPU slot during DAG generation. Not sure whether it has something to do with the process on this card itself or not.
20210514 11:46:42 GPU #4: DAG generated [crc: 1096dfed, time: 27595 ms], memory left: 1.49 GB
20210514 11:46:49 GPU #1: DAG generated [crc: 91ed30ce(!), time: 34313 ms], memory left: 1.49 GB
20210514 11:44:04 [ OK ] 66/66 - 256.05 MH/s, 236ms ... GPU #1
20210514 11:44:05 TREX: Can't find nonce with device [ID=1, GPU #1], cuda exception in [synchronize, 52], CUDA_ERROR_LAUNCH_FAILED, try to reduce overclock to stabilize GPU state
20210514 11:44:05 WARN: Miner is going to shutdown...
20210514 11:44:06 Main loop finished. Cleaning up resources...
20210514 11:44:06 ApiServer: stopped listening on 127.0.0.1:4059
20210514 11:44:06 ApiServer: stopped listening on 127.0.0.1:4058
20210514 11:44:06 [ OK ] 67/67 - 0.00 H/s, 445ms ... GPU #4
20210514 11:44:07 T-Rex finished.
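For anyone who wants to check which card keeps failing without scrolling through the whole file, here is a rough Python sketch that counts the two symptoms visible in the lines above: the "Can't find nonce" CUDA exception and a DAG CRC flagged with (!). The regular expressions are modelled only on the log lines quoted in this thread, and the function name `summarize` is made up for illustration; other T-Rex versions may format these lines differently.

```python
import re
from collections import Counter

# Patterns are based solely on the log lines quoted in this thread (assumption:
# other T-Rex versions may print these messages in a different format).
CUDA_ERR = re.compile(
    r"Can't find nonce with device \[ID=(\d+), GPU #(\d+)\], "
    r"cuda exception in \[[^\]]*\], (CUDA_ERROR_\w+)"
)
BAD_CRC = re.compile(r"GPU #(\d+): DAG generated \[crc: ([0-9a-f]+)\(!\)")


def summarize(lines):
    """Count CUDA errors per (GPU, error code) and flagged DAG CRCs per GPU."""
    cuda_errors = Counter()
    bad_crcs = Counter()
    for line in lines:
        # Text copied from a forum may carry a typographic apostrophe.
        line = line.replace("\u2019", "'")
        match = CUDA_ERR.search(line)
        if match:
            cuda_errors[(int(match.group(2)), match.group(3))] += 1
        match = BAD_CRC.search(line)
        if match:
            bad_crcs[int(match.group(1))] += 1
    return cuda_errors, bad_crcs
```

To run it on a real log, something like `summarize(open("/var/log/miner/t-rex.log", errors="replace"))` should work. A card that shows up in both counters is the first one to try with a lower OC.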
I had the same problem and it was caused by the overclocking. Not all GPUs will OC the same way. Reduce the OC or start from the default settings and see if that solves the problem. Then increase the memory OC slowly until the problem reappears. This way you’ll know what setting to use. Don’t hurry the process. Another way to do it is to lower the OC by 50 or 100 until the issue is gone; this is probably the fastest way to do it. Let me know about your results. Good luck!
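The "lower it by 50 or 100 until the issue is gone" approach can be sketched as a simple loop. This is only an illustration: `find_stable_memory_oc` and the `is_stable` callback are hypothetical names, and in practice "is it stable" means applying the offset by hand and watching the rig for a while, not calling a function.

```python
def find_stable_memory_oc(start_mhz, is_stable, step_mhz=50, floor_mhz=0):
    """Step the memory offset down until is_stable() accepts a value.

    is_stable is a hypothetical callback that takes an offset in MHz and
    returns True if the card ran without errors at that offset.
    Returns the first stable offset, or None if nothing down to floor_mhz works.
    """
    offset = start_mhz
    while offset >= floor_mhz:
        if is_stable(offset):
            return offset
        offset -= step_mhz  # reduce by 50 or 100 and test again
    return None


# Example with a made-up card that only becomes stable at 2250 MHz or lower.
print(find_stable_memory_oc(2400, lambda mhz: mhz <= 2250, step_mhz=100))
```

With a step of 100 the example stops at 2200, one step below the real limit, which is the cost of the faster method: you reach a stable value quickly but may leave a little memory clock unused.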
Thanks for the suggestion though
will try my best not to adjust the value during the monitoring lol
I have been having problems with one of my GPUs for a week. Very low hashrates with my RTX 3090; it works between 40 and 50 MH/s. I have tried everything and I can’t correct it.
What are your OC settings? When did you buy this card? I have a water-cooled EVGA FTW3 3090 in my main system and it does about 118 MH/s while drawing 293 watts and with 1000 memory OC. When it wasn’t water-cooled its hashrate was around 90 MH/s because of memory overheating. Did you change the thermal pads? Is the back of the card well cooled? What happens with stock settings? 3090 is a very tricky card to mine with because the memory modules on the back of the card don’t have good cooling. However, the hashrate you’re getting seems too low.
Update: my card has been stable after adjusting the OC settings to core=1000 / memory=1600.
I will try to push the memory up a little bit and keep observing.
Good to hear that. Remember to do it slowly. Give each new OC setting about a day. Also, sometimes the hashrate improvement is very low at high levels of memory OC. In my opinion, it isn’t worth trading 0.5 MH/s for stability. Keep this in mind because maybe your hashrate is good enough right now and your rig will be more stable if you don’t push the OC too much, your cards will also last longer.
For a week now we have been getting this error message on one of our 3070s. We updated the rig, added our 10th GPU and set everything up, and after a few hours T-Rex drops the card. First it cannot read the fan speed, then after a few tries we get this error again and again. We are now running the MSI Ventus 3070 with a “-200” memory clock, because when we lowered it from 2400 to 2200, 2000 and 1000, T-Rex kept restarting with this error message.
We have already changed the riser and the power cables.
The card’s GPU temperature was never over 49-50 °C; I hope the memory modules have not melted down.
Swap the card to another position and see if it still happens to the same card.
Already tried that, we still have the issue.
Try another miner for a while and see what happens. If you did this already I’d recommend you to put the card alone in a system with windows and run the same miner to see if there is something wrong with the card itself.
We tried to run Phoenix and got a lot of rejected shares on the other cards, so we changed back to T-Rex :D. Yeah, I have already spoken with support. If we have time, we will try benchmarking the MSI Ventus with Superposition and FurMark.
Dunno, maybe we will try the warranty, which we still have. If not, we will try changing the thermal pads; the MSI Ventus 3070 has a really bad plastic backplate.
We tried running T-Rex on Windows and it seems stable for now.
Seeing nothing special, temperature looks OK, waiting for a crash.
*On Windows we have t-rex-0.19.14-win-cuda11.1; Hive already has a newer version. My thought was that the problem might be with the new version of T-Rex.
It crashed after 2 hours. I have no log; Windows restarted and wrote nothing to its log.