GPU Driver Errors and 6600XT?

MalsBrownCoat · October 21, 2021, 12:31am

Thanks for the suggestions, Luigi. I had already tried swapping out components that were known to be working (risers, power cables, pcie slots on the mb, a complete reinstall of Hive on a good SSD, etc). I’m beginning to think that 2 of these gpus are just defective. I can completely understand if a few cards don’t like certain thresholds of over (or under) clocking. But the two in question don’t seem to accept any OC whatsoever, and they take the whole rig down with them when they crash.

Like my earlier comment, I don’t really consider even a full 24 hours of continuous mining to be classified (with certainty) as “stable”. However, after removing those two Powercolor GPUs, the rig has been running since my last post. I noticed that it did stop mining for about an hour early this morning, but it appears that it recovered on its own without any intervention on my part.

I just took these screenshots -

Notice the dip to “0”, but then it recovered, followed by another couple of brief areas where it didn’t report at all.

A screenshot of NBMiner indicates that at some point (likely right when those dips occur), at the very least, NBMiner restarted itself (otherwise it would show ~23 hours of activity).

Note that in these shots, GPUs 0 and 3 are also Powercolors, whereas 1 and 2 are Sapphire Pulses. So far, I have not had any problems with the Sapphires; it has always been a Powercolor that crapped the bed. The two Powercolors that are showing here are different ones than the two that were having the earlier problems though.

So, maybe it’s as simple as “bad cards”. Not sure I’m really buying that though. One? Ok. Two?
…mmm…I dunno about that…

I do have a few more Sapphire Pulses that I think I’ll add to the rig. I suppose if those run for a few days, maybe it really was those two Powercolors. Suffice it to say, I have a few more days left to return those, which I think I’ll just go ahead and do.

MalsBrownCoat · October 21, 2021, 12:46am

mini-miner - when you loaded the (new) NVME, did you load the exact same HiveOS version, or was it by chance an updated (or earlier) version than your rig.conf file was associated with?

One of the reasons that I question this is because (if I recall correctly from another thread) your OC seems to be locked at 1000 MHz, and I believe that was an issue on prior/stable (rather than beta) versions.

I’m just going to spitball here, but to rule out other variables, have you also done the steps that mogliettazza had shared: remove the flight sheet, “clean” the OC, reboot, apply a new flight sheet, apply a new OC?

As for 4G decoding, that’s been a bit of a standard over the last several years with mining motherboards. BAR settings should not be required and may in fact hinder performance (though I have no concrete evidence of this). The point being; motherboards have supported more than 4 GPUs just fine without the need for anything BAR related for years.

And I don’t mean this to sound remedial, but have you been able to rule out a problem with that specific 5th GPU? What happens if you take one of the four that are currently working, and use it in one of their places? I would also repeat that test for the specific 6th GPU in question.

mogliettazza · October 21, 2021, 2:06pm

Very interesting. I was almost sure that you had already tried rebuilding the rig one component at a time, but you know, I had to ask. It could be a bad card; maybe those have a BIOS update?
Maybe it’s better to just return them and buy something else. To me personally, the better card is not the one with more hashrate but the one that is more stable and draws fewer watts, because the last thing I want is a headache. I want to focus more on researching and buying more GPUs (how long we can keep mining ETH is another thing…)

Yes, I loaded the same version/worker/conf file. The only thing I did was swap the ethernet cable for a commercial one, because I run a wifi ethernet splitter and it sometimes gave me problems, but not anymore.

I’m not a fan of the beta version, but I wasn’t able to make the 6600 XT work on the stable version. No way, it just won’t work. So I built a new rig with a B550 Plus and a syoncon USB 4 PCIe splitter on the beta version, and after setting the OC and a new flight sheet, all seems to work fine for 3 days now, with a firm 32 hashrate at 49 W and 20% fan. Not bad at all, but I do not trust this card, period. Temps are 48/58.
Kernel version: 5.10.0-hiveos #60
AMD driver: 20.40 (5.11.0825)
Nvidia driver: 470.63.01
HiveOS: 0.6-210@211010

My favorite setup is a 110 BTC ASRock 13-GPU motherboard with the stable HiveOS version (but there is no way to make the 6600 work on it). To tell you the truth, I only put 12 GPUs on it. Why? No reason, I just never like to overload anything and want to keep it a bit light. I wish I could buy better cards, yes I do, but I don’t want to pay scalper prices for high-end cards; it’s just not fair.

Today or tomorrow I will install another card on the new rig and see what happens. I’m pretty sure the 6600 XT is going to lose its configuration and I will have to redo the OC and flight sheet, but this time, if that happens, no more time wasted: I will return the card and just go with another 1660 S. I’m not lazy, but I can’t and don’t want to be worrying about it every day.
As soon as I do that I’ll let you know what happens. I’m waiting for a 1660 S and a 3060 LHR; they should be coming home to papa any time now.

As of right now, how is your rig going? Which cards did you leave out of the rig (if any)?
Looking at your pictures, did you notice whether any of the 6600s run at different temps?

edwinst31 · October 21, 2021, 10:48pm

Your problem lies in the HiveOS version you are using.

You are using HiveOS stable version 0.6-208, and it does not work well with the RX 6600 XT. The maximum hashrate you can get is 28.5 MH/s; no matter how you set your overclock, you will still get 28.5 MH/s.

Download the latest beta version from the HiveOS website. As far as I know, the latest beta is version 0.6-210.

Use that version and you will be able to adjust your OC. But remember that the RX 6600 XT is a new card, and there is no version that is completely stable for it.

If you use the beta version you will run into problems, since the settings are very sensitive, but the hashrate can reach 32-33 MH/s.

mogliettazza · October 22, 2021, 12:11pm

Perhaps. I get 32, and sure, the card is from about Feb 2021, but the problem here is not the hashrate, only

MalsBrownCoat · October 23, 2021, 8:12pm

Just an update - after removing those two problematic Powercolor Red Devils, I replaced them with a pair of Sapphire Pulse cards and things seemed to be going well for about 3 days. I woke up this morning and saw that at some point last night, the rig had crashed. I hadn’t enabled the watchdog on it yet, so that explains why it never restarted. But, since it did crash at some point, the clocks may need a slight adjustment somewhere.

I don’t really have the time to futz with it right now and I’ll be on a work trip for a week (I made sure to set up the watchdog in the meantime). I’ll provide an update when I return. Appreciate everyone’s insight on things so far. Thank you.

luci77 · October 27, 2021, 2:52am

I feel you, mate. I’ve read all your comments thoroughly and I can relate to the disappointment; I’ve felt more or less the same sort of frustration. I’m in exactly the same situation, trying to run a rig with 4 new MSI RX 6600 XT cards with Samsung GDDR6 memory, driver 20.40 (5.11.0825). I tried another version as well and it seems that I’m not getting anywhere. I’m currently running the latest HiveOS beta, 0.6-210@211025 (of course I tried other versions, stable HiveOS, etc.) and have already tried changing cables, PSU, risers, etc. The most solid continuous run was around 25 hours before another pesky “GPU * detected dead” error.

It seems a bit more stable with the OC settings around the following values: Core 945, VDD 660, MEM 1132 (yes, I tried a ton of other values), approx. 32 MH/s, but after 4 days it seems that I’m back to square one. What I wanted to add, though, is that I really believe it’s a HiveOS-related issue, not a hardware one. In my case, at the beginning, the card detected dead was hit by the error at random (could be 0, 1, 2 or 3), but lately I’ve noticed it’s more often GPU 0. Also, looking at the error log files, the pattern was that whichever card was about to die would show a 7-point increase in its VDDC and a loss in wattage before the error, no matter the initial VDD value (e.g. 640, 660, 675, 685). I have no idea if this means anything or not; it’s just an observation.
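
The pattern described above (a 7-point VDDC rise together with a drop in wattage right before a card is marked dead) could be checked for automatically. Here is a minimal Python sketch, assuming the readings have already been pulled out of the logs as (VDDC, watts) pairs. The sample format, the function name `find_precrash_signs` and the readings themselves are made up for illustration; they are not actual HiveOS log fields.

```python
from typing import List, Tuple

# Hypothetical sample format: (VDDC reading, power draw in watts)
Sample = Tuple[int, float]


def find_precrash_signs(samples: List[Sample], vddc_jump: int = 7) -> List[int]:
    """Return the indices where VDDC rose by at least vddc_jump while the
    wattage fell, compared with the previous sample."""
    flagged = []
    for i in range(1, len(samples)):
        prev_vddc, prev_watts = samples[i - 1]
        vddc, watts = samples[i]
        if vddc - prev_vddc >= vddc_jump and watts < prev_watts:
            flagged.append(i)
    return flagged


# Made-up readings for one GPU: steady at VDD 660, then the suspicious step.
readings = [(660, 49.0), (660, 49.5), (660, 49.0), (667, 41.0)]
print(find_precrash_signs(readings))  # [3]
```

If the same flag keeps showing up shortly before each "detected dead" event, that would at least confirm the observation is a consistent pattern and not a coincidence.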

As a last resort, perhaps we should try using Windows with these cards.

Batfink · November 2, 2021, 12:49pm

Hi MalsBrownCoat, just a suggestion buddy but are your motherboard PCI-E slots set for “Gen1” in your BIOS? (I had ALOT of very random problems/crashes until I set the slots for Gen1)

btw, I am running TRM on HiveOS 0.6-211@211031 with 2x Powercolor 6600XT Red Devils + 4x MSI Dual Gaming OC

I would suggest dropping all your MEMs down to 1125, then increasing by 5 until it crashes, then backing off by 2 until it is stable. You will have to do this for each card individually (if you do them all at once, you won’t know which one does not like higher MEM values).

I have six of these cards and only two of them will run over MEM=1140. MEM=1128 is the lowest one of my 6600 XT cards needs to run at, MEM=1149 is the highest. Hope this helps…
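
The step-up / back-off procedure above can be sketched in Python as a rough illustration. Everything here besides the 1125 start, the +5 step and the -2 back-off is an assumption: the `tune_mem` name, the `ceiling` and `floor` limits, and above all the `is_stable` callback, which in real life would mean "set this MEM value on one card and let it mine long enough to trust it", not a quick function call.

```python
from typing import Callable


def tune_mem(is_stable: Callable[[int], bool], start: int = 1125,
             step_up: int = 5, step_down: int = 2,
             ceiling: int = 1200, floor: int = 1000) -> int:
    """Step-up / back-off search for one card's MEM value.

    ceiling and floor are arbitrary safety limits chosen for this sketch.
    """
    mem = start
    # Raise MEM in steps of 5 until the card crashes (or the ceiling is passed).
    while mem <= ceiling and is_stable(mem):
        mem += step_up
    if mem > ceiling:
        return mem - step_up  # never crashed at or below the ceiling
    # Back off in steps of 2 until the card is stable again.
    while not is_stable(mem):
        mem -= step_down
        if mem < floor:
            raise ValueError("no stable MEM value found above the floor")
    return mem


# Simulated card that crashes above MEM 1140 (a made-up limit for the demo).
print(tune_mem(lambda mem: mem <= 1140))  # 1139
```

Note that the search is run per card, which matches the advice to tune each card individually. With a -2 back-off it can land a point or two under the card's true limit (1139 here instead of 1140).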

mini_miner · November 3, 2021, 2:29pm

HiveOS updated to 0.6-211@211102 last night and I am finally able to get above that 28.5 MH/s limit. Thanks.

MalsBrownCoat · November 5, 2021, 11:31pm

Hey guys, apologies for the delayed response. It seems that while I was away, I didn’t really have many issues. A couple of times, my internet connection died (which I verified was due to my ISP working on the node that I was connected to). So with that, the low hashrate trigger had fired. Aside from that, things seem to be going well.

This leads me to really suspect that those two Powercolor cards were problematic. I have since sent them back to Newegg and was refunded for them. If I end up replacing them, I’ll probably stick to the Sapphires and call it a day.

Overall, I think that the hashrates and power consumptions are fairly on par with what I was expecting. I could probably get the wattage down just a bit, but not sure I’m really going to lose any sleep over the extra maybe $1-2 a month on the electric bill. At least, not enough to warrant messing around with things. I dunno, what do you guys think?
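
For a sense of scale on that "extra $1-2 a month", here is a rough Python sketch of the arithmetic, assuming the rig runs 24/7. The 15 W saving and the $0.12/kWh price are made-up inputs for illustration (neither number comes from this thread), so plug in your own.

```python
def monthly_cost_usd(extra_watts: float, price_per_kwh: float, days: int = 30) -> float:
    """Cost of drawing extra_watts continuously for the given number of days."""
    kwh = extra_watts / 1000 * 24 * days  # watts -> kWh over the period
    return kwh * price_per_kwh


# Hypothetical: shaving 15 W off the whole rig at $0.12 per kWh.
print(round(monthly_cost_usd(15, 0.12), 2))  # 1.3
```

So under those assumed numbers, a saving in the $1-2 range corresponds to trimming roughly 10-25 W across the entire rig.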

Luci77 - while this would be a bit time consuming and mundane, you could disconnect all but a single gpu and let it run for several days (until any error presents itself). Then switch to another gpu and repeat the test. If those errors don’t repeat themselves, you may have identified a culprit GPU. Though, I do see your perspective on it seeming like an OS related issue.

Batfink - most definitely. I’ve been using Gen 1 since BitsBeTrippin even started his channel.
I did end up trying various memory settings and even the sub 1140 range yielded the random crashing.

mini_miner - how has your stability been since the upgrade?

MalsBrownCoat · November 5, 2021, 11:37pm

Also, I meant to ask for general thoughts on these stats. Stale shares seem pleasingly low.

And how do these efficiency ratings look?

mogliettazza · November 27, 2021, 12:47pm

Looks good. Still going strong?

MalsBrownCoat · November 28, 2021, 6:10pm

Hey, thanks for checking in.

I’d say that things are working pretty well. Aside from the random reports lately (which we all seem to be getting) about rigs being offline when they’re actually not, I have not really had any problems after returning those 2 Powercolors. I think I’m going to add one more Sapphire Pulse and call it a day on this rig.
These are the settings that I’ve been running (ETH on NBMiner);

CoinGod · December 6, 2021, 7:12pm

Would you please clarify your settings, as the header columns are not shown in the image?
Are your settings as follows?
Core Clock: 980
Core Voltage: 650
Memory Clock: 1140

MalsBrownCoat · December 8, 2021, 10:50pm

I hadn’t included any other setting details because there were no other configurations (aside from the fan being set at 60%). So yes, what you asked about are the only things that are configured.

Hiveshoe · December 20, 2021, 8:11am

I got 3 Asus RX 6600XT. I had one card that would not hash, so I swapped it. With the swapped card, my rig stops mining around 2am every day. What are the chances of a card mining for 24 hours and then shutting down? I’m a little lost, since I get the “GPU driver error, no temp” message. Of 4 Asus cards, 2 are bad? sigh

ZWorm · January 31, 2022, 7:57am

Did you ever solve this? I have a 5-card 6600XT rig and the past few days have driven me nuts. It was working for ages, but that was because I had two 6700s in there with three of them. Since moving out the 6700s, having an all-6600XT rig has meant I had the GPU 0 dead issue at first, but after lots of troubleshooting I’m thinking it’s maybe not an obvious issue at all.

ZWorm · January 31, 2022, 8:02am

Yes I have two of the same brand (MSI MECH) and the OS seems to pick on one of these - predominantly one of them that tends to be in GPU 0.

Testing with them disconnected and the other cards work OK.

I did read a few months back that someone like Son of a Tech posted about an issue with multiple 6600s in a rig, but at the time I only had a couple so it didn’t affect me then.

ZWorm · January 31, 2022, 8:06am

I seem to get this error now that I wiped Hive and started afresh, and also the error of GPU 0 dead. I first swapped the riser and cables.

I may have to install Windows on a different SSD just to see if I can replicate any problems.

system · March 23, 2023, 11:06pm

This topic was automatically closed 416 days after the last reply. New replies are no longer allowed.