GPU Driver Errors and 6600XT?

mogliettazza · October 19, 2021, 1:28pm

Did you try completely removing the OC?
Go to the OC section, “clean” the OC, and reboot.
Then apply your settings. The ones I posted above have been working fine for 3 days now. I just added another card (a 1660 Super) with no issues; we'll see what happens when I add more. I like the 6600 XT only for its ultra-low wattage, and I don't think I'm going to buy more of this card.

MalsBrownCoat · October 19, 2021, 11:09pm

Before making further changes, I ran the stock settings (everything at 0) for 2 days and everything seems to work just fine.
I just “cleaned” the OC, and removed/recreated the flight sheet, then rebooted.
It’s currently running on LOLMiner and began hashing at 32.27 MH.

…for about 8 minutes, then threw a hardware error.

MalsBrownCoat · October 19, 2021, 11:10pm

A minute later, and I was back to zero with a full crash.

I never had problems like this running nvidia/windows. Very disappointing.

mogliettazza · October 20, 2021, 12:51am

This is a benchmark rig where I test components, software, etc.

I put in a 1660 Super and a 6600 XT and tried both the ETH and RVN algorithms.

I deleted all the OC and started fresh with the HiveOS beta version.
All good so far; at the time of the pictures it had been running fine for about 10 minutes.
Now it's been about 20 minutes with no issues.

I can also switch from ETH to RVN with no problem; I just need to reboot and change the flight sheet and OC.
Hope this helps.

mogliettazza · October 20, 2021, 12:58am

Here are the flight sheet and OC.

26 minutes, no problems so far.

On KAWPOW it worked for 3 days with no issues; now I've put it on Ethash.

MalsBrownCoat · October 20, 2021, 1:15am

These are great details, and I do appreciate them, but I don’t consider 26 minutes to be “stable”.
In my experience, establishing true “stability” can be fairly random.
An OC that worked for months could suddenly decide that it doesn’t want to work anymore.

For example, I switched back to NBMiner (because LOL simply wouldn’t start, even after following the steps you suggested). I went ahead and removed the card that I had previously been testing and instead, connected 4 that seemed to be fine when I first started the initial troubleshooting.
This time, with NBMiner, things were running ok.
And 35 minutes later, it crashed.

mogliettazza · October 20, 2021, 2:44am

Touché, you're absolutely right; even 1 day can't be considered stable.

I bet you already tried another riser, USB cable, etc.

What I would do if nothing works is disassemble the rig, test it part by part, and find the faulty one.

All your cards are 6600 XTs. Could it be that one of the cards isn't working properly? Defective?

I don't like this card. At that hashrate I'm in love with the 1660 Super or the 3060 (even LHR). I know they draw a few more watts, but in my personal experience the 6600 XT gives me trouble and those don't (at least so far).
Plus, they are a bit cheaper and easy to find new and used.

Have you already tried changing the PCIe slot or reinstalling HiveOS?

I'm not trying to waste your time, but I'm out of suggestions.

mini_miner · October 20, 2021, 3:28pm

Hello,

My system was working fine for a couple weeks then the SSD died. The SSD is something like 10 years old… It was only a matter of time.

Once it went out, I grabbed a super cheap SSD and the rig wouldn’t even boot. So, I grabbed a 1 TB NVMe drive, as this is all I had laying around (I’ll need to buy another SSD, but in the meantime I’d like to get my rig in operation).

My issue is that with my original SSD I was able to get 32 MH/s on six 6600 XTs consuming 45-ish watts of power. Now, I can’t get above what the screenshots below are showing. I have highlighted a few things that show that I have adjusted the overclock settings manually; however, the memory settings refuse to change…

Does anyone have any advice?

Oh, and I can only get 4 GPUs to boot with the OS, not the original 6. When I plug in the 5th one, the system refuses to finish POST. I have enabled 4G mining in the BIOS; however, that made things worse. The rig won’t boot with even just one GPU with 4G and BAR size enabled.

I have even edited the miner config file… No Dice Tango.

I am lost.

Any ideas?

MalsBrownCoat · October 21, 2021, 12:31am

Thanks for the suggestions, Luigi. I had already tried swapping out components that were known to be working (risers, power cables, pcie slots on the mb, a complete reinstall of Hive on a good SSD, etc). I’m beginning to think that 2 of these gpus are just defective. I can completely understand if a few cards don’t like certain thresholds of over (or under) clocking. But the two in question don’t seem to accept any OC whatsoever, and they take the whole rig down with them when they crash.

Like my earlier comment, I don’t really consider even a full 24 hours of continuous mining to be classified (with certainty) as “stable”. However, after removing those two Powercolor GPUs, the rig has been running since my last post. I noticed that it did stop mining for about an hour early this morning, but it appears that it recovered on its own without any intervention on my part.

I just took these screenshots -

Notice the dip to “0”, but then it recovered, followed by another couple of brief areas where it didn’t report at all.

A screenshot of NBMiner indicates that at some point (likely right when those dips occurred), at the very least, NBMiner restarted itself (otherwise it would show ~23 hours of activity).

Note that in these shots, GPUs 0 and 3 are also Powercolors, whereas 1 and 2 are Sapphire Pulses. So far, I have not had any problems with the Sapphires; it has always been a Powercolor that crapped the bed. The two Powercolors showing here are different from the two that were having the earlier problems, though.

So, maybe it’s as simple as “bad cards”. Not sure I’m really buying that though. One? Ok. Two?
…mmm…I dunno about that…

I do have a few more Sapphire Pulses that I think I’ll add to the rig. I suppose if those run for a few days, maybe it really was those two Powercolors. Suffice it to say, I have a few more days left to return those, which I think I’ll just go ahead and do.

MalsBrownCoat · October 21, 2021, 12:46am

mini_miner - when you loaded the (new) NVMe, did you load the exact same HiveOS version, or was it by chance an updated (or earlier) version than the one your rig.conf file was associated with?

One of the reasons I ask is that (if I recall correctly from another thread) your OC seems to be locked at 1000 MHz, and I believe that was an issue on prior/stable (rather than beta) versions.

I’m just going to spitball here, but to rule out other variables, have you also done the steps that mogliettazza shared: remove the flight sheet, “clean” the OC, reboot, apply a new flight sheet, apply a new OC?

As for 4G decoding, that’s been a bit of a standard over the last several years with mining motherboards. BAR settings should not be required and may in fact hinder performance (though I have no concrete evidence of this). The point being: motherboards have supported more than 4 GPUs just fine for years without the need for anything BAR related.

And I don’t mean this to sound remedial, but have you been able to rule out a problem with that specific 5th GPU? What happens if you take one of the four that are currently working, and use it in one of their places? I would also repeat that test for the specific 6th GPU in question.

mogliettazza · October 21, 2021, 2:06pm

Very interesting. I was almost sure you had already tried rebuilding the rig one component at a time, but you know, I had to ask. It could be a bad card; maybe there is a BIOS update for those?
Maybe it's better to just return them and buy something else. To me personally, the better card is not the one with the highest hashrate but the one that is more stable and draws fewer watts, because the last thing I want is a headache. I want to focus more on researching and buying more GPUs (how long we can keep mining ETH is another matter…).

Yes, I loaded the same version/worker/conf file. The only thing I did was swap the Ethernet cable for a commercial one, because I run a Wi-Fi Ethernet splitter that sometimes gave me problems, but not anymore.

I'm not a fan of the beta version, but I wasn’t able to make the 6600 XT work on the stable version; it just won't work. So I built a new rig with a B550 Plus and a syoncon USB 4 PCIe splitter on the beta version, and after setting the OC and a new flight sheet, everything has seemed to work fine for 3 days now, with a steady 32 hashrate at 49 W, 20% fan, and 48/58 temps. Not bad at all, but I don't trust this card, period.
Kernel version: 5.10.0-hiveos #60
AMD driver (A): 20.40 (5.11.0825)
Nvidia driver (N): 470.63.01
HiveOS: 0.6-210@211010

My favorite setup is an ASRock 110 BTC 13-GPU motherboard with the stable HiveOS version (but there is no way to make the 6600 work on it). To tell you the truth, I only put 12 GPUs on it. Why? No reason; I never like to overload anything, I just want to keep it a bit light. I do wish I could buy better cards, but I don't want to pay scalper prices for high-end cards; it's just not fair.

Today or tomorrow I will install another card on the new rig and see what happens. I'm pretty sure the 6600 XT is going to lose its configuration and I'll have to redo the OC and flight sheet, but this time, if that happens, no more wasted time: I will return the card and just go with another 1660 S. I'm not lazy, but I can't and don't want to be worrying about it every day.
As soon as I do that, I'll let you know what happens. I'm waiting for a 1660 S and a 3060 LHR;
they should be here with papa any time now.

As of right now, how is your rig going? Which cards did you leave out of the rig (if any)?
Looking at your pictures, did you notice that every 6600 runs at a different temperature?

edwinst31 · October 21, 2021, 10:48pm

Your problem lies in the HiveOS version you are using.

You are using HiveOS stable version 6-208, which does not work well with the RX 6600 XT. The maximum hashrate you can get is 28.5 MH/s; no matter how you set your overclock, it will still get 28.5 MH/s.

Download the latest beta version from the HiveOS website. As far as I know, the latest beta is version 6-210.

Use that version and you will be able to adjust your OC. But remember, the RX 6600 XT is a new card, and no version is completely stable for it yet.

If you use the beta version you will run into problems, because the settings are very sensitive, but the hashrate can reach 32-33 MH/s.

mogliettazza · October 22, 2021, 12:11pm

Perhaps. I get 32, and sure, the card is from about Feb 2021, but the problem here is not the hashrate only.

MalsBrownCoat · October 23, 2021, 8:12pm

Just an update - after removing those two problematic Powercolor Red Devils, I replaced them with a pair of Sapphire Pulse cards and things seemed to be going well for about 3 days. I woke up this morning and saw that at some point last night, the rig had crashed. I hadn’t enabled the watchdog on it yet, so that explains why it never restarted. But, since it did crash at some point, the clocks may need a slight adjustment somewhere.

I don’t really have the time to futz with it right now and I’ll be on a work trip for a week (I made sure to set up the watchdog in the meantime). I’ll provide an update when I return. Appreciate everyone’s insight on things so far. Thank you.

luci77 · October 27, 2021, 2:52am

I feel you, mate. I’ve read all your comments thoroughly and I can relate to the disappointment; I felt more or less the same sort of frustration. I’m in exactly the same situation, trying to run a rig with 4 new MSI RX 6600 XTs with Samsung GDDR6 memory on driver 20.40 (5.11.0825). I tried another version as well, and it seems that I’m not getting anywhere. I'm currently running the latest HiveOS beta, 0.6-210@211025 (of course I tried other versions, stable HiveOS, etc.) and have already tried changing cables, PSU, risers, etc. The most solid continuous run was around 25 hours before another pesky “GPU * detected dead” error.

It seems a bit more stable with the OC settings around the following values: Core 945, VDD 660, MEM 1132 (yes, I tried a ton of other values), approx. 32 MH/s, but after 4 days it seems that I’m back to square one. What I wanted to add, though, is that I really believe it’s a HiveOS-related issue, not a hardware one. In my case, at the beginning, the error hit cards at random (it could be 0, 1, 2, or 3), but lately I've noticed it is more often GPU 0. Also, looking at the error log files, the pattern was that whichever card was about to die would show a 7-point increase in its VDDC and a loss in wattage before the error, no matter the initial VDD value (e.g. 640, 660, 675, 685). I have no idea whether this means anything; it’s just an observation.

As a last resort, perhaps we should try using Windows with these cards.

Batfink · November 2, 2021, 12:49pm

Hi MalsBrownCoat, just a suggestion buddy, but are your motherboard PCI-E slots set to “Gen1” in your BIOS? (I had A LOT of very random problems/crashes until I set the slots to Gen1.)

btw, I am running TRM on HiveOS 0.6-211@211031 with 2x Powercolor 6600XT Red Devils + 4x MSI Dual Gaming OC

I would suggest dropping all your MEMs down to 1125, then increasing by 5 until it crashes, then backing off by 2 until it is stable. You will have to do this for each card individually (if you do them all at once, you won’t know which one does not like higher MEM values).

I have six of these cards and only two of them will run over MEM=1140. MEM=1128 is the lowest one of my 6600 XT cards needs to run at, MEM=1149 is the highest. Hope this helps…
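
The step-up/back-off procedure described above can be sketched in Python. This is only an illustration, not a HiveOS tool: `is_stable` is a hypothetical callback that you would have to supply yourself (for example, apply the MEM value to one card, mine for as long as you consider a fair test, and report whether it crashed), and the `ceiling` value is an arbitrary safety limit that does not come from this thread.

```python
from typing import Callable


def tune_mem_clock(is_stable: Callable[[int], bool],
                   start: int = 1125,
                   step_up: int = 5,
                   step_down: int = 2,
                   ceiling: int = 1200) -> int:
    """Return the highest MEM value found stable for ONE card.

    is_stable(mem) is a hypothetical test supplied by the caller.
    ceiling is an assumed upper bound so the search always terminates.
    """
    if not is_stable(start):
        raise ValueError(f"card is not stable even at MEM={start}")

    last_good = start
    crashed_at = None

    # Phase 1: climb in steps of step_up until the first crash.
    mem = start + step_up
    while mem <= ceiling:
        if is_stable(mem):
            last_good = mem
            mem += step_up
        else:
            crashed_at = mem
            break

    if crashed_at is None:
        return last_good  # never crashed below the ceiling

    # Phase 2: back off from the crash point in steps of step_down.
    candidate = crashed_at - step_down
    while candidate > last_good:
        if is_stable(candidate):
            return candidate
        candidate -= step_down
    return last_good


if __name__ == "__main__":
    # Simulated card that crashes above MEM=1138 (made-up threshold).
    print(tune_mem_clock(lambda mem: mem <= 1138))
```

Run it once per card, as suggested above, so that each card ends up with its own MEM value.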

mini_miner · November 3, 2021, 2:29pm

HiveOS updated to 0.6-211@211102 last night and I am finally able to get above that 28.5 MH/s limit. Thanks.

MalsBrownCoat · November 5, 2021, 11:31pm

Hey guys, apologies for the delayed response. It seems that while I was away, I didn’t really have many issues. A couple of times, my internet connection died (which I verified was due to my ISP working on the node that I was connected to), so the low hashrate trigger fired. Aside from that, things seem to be going well.

This leads me to really suspect that those two Powercolor cards were problematic. I have since sent them back to Newegg and was refunded for them. If I end up replacing them, I’ll probably stick to the Sapphires and call it a day.

Overall, I think that the hashrates and power consumption are fairly on par with what I was expecting. I could probably get the wattage down just a bit, but I'm not sure I’m really going to lose any sleep over the extra maybe $1-2 a month on the electric bill. At least, not enough to warrant messing around with things. I dunno, what do you guys think?
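
As a rough sanity check on that kind of estimate, here is a small Python sketch of the arithmetic. The 20 W of extra draw and the $0.10/kWh rate are made-up illustrative inputs, not figures from this rig; plug in your own.

```python
def monthly_cost_usd(extra_watts: float,
                     usd_per_kwh: float,
                     hours: float = 24 * 30) -> float:
    """Cost of running extra_watts continuously for `hours` (default: a 30-day month)."""
    kwh = extra_watts / 1000 * hours  # watts -> kilowatt-hours
    return kwh * usd_per_kwh


if __name__ == "__main__":
    # Assumed inputs: 20 W extra across the whole rig, $0.10 per kWh.
    print(f"${monthly_cost_usd(20, 0.10):.2f} per month")
```

With those assumed inputs the result lands inside the $1-2 range mentioned above, so shaving a few watts per card is unlikely to change the bill much.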

Luci77 - while this would be a bit time consuming and mundane, you could disconnect all but a single gpu and let it run for several days (until any error presents itself). Then switch to another gpu and repeat the test. If those errors don’t repeat themselves, you may have identified a culprit GPU. Though, I do see your perspective on it seeming like an OS related issue.
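
That one-card-at-a-time test could be organized along these lines. It is a sketch only: `soak_test` is a hypothetical function you would write yourself (run the rig with only that GPU connected for several days and return True if no error showed up), and nothing here talks to HiveOS.

```python
from typing import Callable, Iterable, List


def find_suspect_gpus(gpu_ids: Iterable[int],
                      soak_test: Callable[[int], bool]) -> List[int]:
    """Test each GPU on its own and return the ones whose solo run showed errors."""
    suspects = []
    for gpu in gpu_ids:
        # soak_test(gpu) is hypothetical: True means the solo run was error-free.
        if not soak_test(gpu):
            suspects.append(gpu)
    return suspects


if __name__ == "__main__":
    # Simulated outcome in which only GPU 0 fails its solo run (made-up data).
    print(find_suspect_gpus(range(4), lambda gpu: gpu != 0))
```

If every card passes on its own but the full rig still crashes, that would point away from a single bad GPU and toward something shared, such as the OS, PSU, or motherboard.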

Batfink - most definitely. I’ve been using Gen 1 since BitsBeTrippin even started his channel.
I did end up trying various memory settings, and even the sub-1140 range yielded the random crashing.

mini_miner - how has your stability been since the upgrade?

MalsBrownCoat · November 5, 2021, 11:37pm

Also, I meant to ask for general thoughts on these stats. Stale shares seem pleasingly low.

And how do these efficiency ratings look?

mogliettazza · November 27, 2021, 12:47pm

Looks good. Still going strong?