"Autofan: GPU driver error, rebooting" message

77164 · July 24, 2018, 7:49pm

So after extensive testing, the riser and GPU in question were not defective at all. What actually resolved the issue was reducing the GPU quantity from 12 to 10 in that particular rig. I came to that realization/conclusion yesterday when I re-installed the riser and GPU and the error messages re-appeared, then, again, I removed a riser and GPU, and the error messages stopped.

This particular rig used to function properly with 12 GPUs. However, that may have been on Windows, before I switched the rigs over to Hive OS.

For clarification, my other rig that also produced these error messages has only 6 GPUs. So the 11+ GPU issue doesn’t apply in every situation.

Pavlo_Bilous · July 25, 2018, 5:17am

Getting the same error. I have 5 1060s and 5 1080 Tis connected to my rig with an Asus B250 Mining motherboard. The rig works fine for 24 hrs and then gets this error. Will try to fix it with a manual autofan config…

iLoveLobster · July 25, 2018, 1:26pm

Mine's doing the same thing, on both 8-GPU and 10-GPU rigs.

When the error shows up on the dashboard, all my cards drop from the dashboard in red. While I can still SSH in, the miner runs at 50% hashrate.

ibdmi · July 25, 2018, 2:20pm

I’ve just started using HiveOS (7x 1060 3GB MSI on 270P mobo), but this issue started right away for me. I’ve replaced risers & graphics cards with no change to the error. The error is random in timing, ranging from about 45 minutes to 26 hours.

Using the “Tuning” option & enabling the hashrate watchdog has mitigated the issue for me without requiring manual intervention. I didn’t try the manual autofan config because my experience is that manual fixes are undone by upgrades.

Just my 2¢

Pavlo_Bilous · July 25, 2018, 5:37pm

Here are my results after creating autofan.config: got all of these errors in the last 12hrs…

If anyone has any suggestions on how to fix these, it would be much appreciated. Thanks.

Dmint · July 25, 2018, 7:35pm

I had the same problem with one of my freshly upgraded rigs. For now I’ve downgraded to 0.5-60, which it was running before. Will see how it goes, but I think it’s the right version to use, from before autofan was implemented. So far it has worked great for a couple of hours. Just a solution for you to try.

If it helped, tips are welcome:
3LMaJKvM5UgWqhJ6dgmLMGryuBafUo9gdT

P.S. Version downgrade thread: Version Downgrade

Pavlo_Bilous · July 25, 2018, 8:51pm

I figured it was too much overclocking on some of my gpus

77164 · July 29, 2018, 3:35am

Update:

My main rig’s issue was not related to 10+ GPUs. I removed 6 GPUs, then re-installed the 2 original GPUs I had removed, and the error messages began again. Seems to be a defective riser part.

xroot · July 30, 2018, 2:23pm

I suppose the problem is with the nvidia driver or the nvidia-settings command.
From here:

https://github.com/minershive/hiveos-linux/blob/294985f08cb398d804c415184aad18ca23d7eb30/hive/sbin/autofan#L346


		local brand="Nvidia"
		#event_by_temperature $gpu_temperature
		get_fan_speed $gpu_temperature $gpu_temperature_previous $gpu_fan_speed $card_bus_id $i $brand
		#not set fan_speed if not changed
		[[ -n $target_fan_speed && $target_fan_speed -ne $gpu_fan_speed ]] && 
			args+=" -a [gpu:$i]/GPUFanControlState=1 -a [fan:$i]/GPUTargetFanSpeed=$target_fan_speed"
		i=$(( $i+1 ))
	done
	#[[ -n $args ]] && nvidia-settings $args > /dev/null 2>&1
	if [[ -n $args ]]; then
		timeout 60 nvidia-settings $args > /dev/null 2>&1
		if [[ $? -ne 0 ]]; then
			if [[ $REBOOT_ON_ERROR == 1 ]]; then
				local msg="Autofan: unable to set fan speed, rebooting"
				message warning "$msg"
				nohup bash -c 'sreboot' > /tmp/nohup.log 2>&1 &
			else
				if [[ $unable_to_set_fan_speed == 0 ]]; then
					local msg="Autofan: unable to set fan speed"
					message warning "$msg"
				fi

It looks like the nvidia-settings command fails somehow, which causes the autofan error message.
No recovery is attempted in the code except a reboot when REBOOT_ON_ERROR is set.
When the error happens I see the nvidia-settings process sitting there at 100% CPU, and the system load goes up to 15. Sometimes (not always) killing this process and restarting the miner does the trick. If you kill it with -9, it will most likely require a reboot.
I suggest restarting the X server and retrying the nvidia-settings command first. Maybe rmmod/insmod the NVIDIA drivers? And only then reboot? I don’t know.
If somebody has time to look at what exactly is failing with nvidia-settings, we could figure something out.
Change this line:
timeout 60 nvidia-settings $args > /dev/null 2>&1
to:
timeout 60 nvidia-settings $args > /tmp/nvidia-settings.out 2>&1
and try to catch the error in the output? (Redirecting to a file rather than piping through tee also captures stderr and keeps $? as the status of the timeout/nvidia-settings command, which the check on the following line relies on.)
Be careful not to fill up /tmp or /.

thanks,
-alexm
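
A minimal sketch of the logging idea from the post above, assuming GNU `timeout` is available. The helper name `run_logged`, the `MAX_LINES` cap, and the default log path are illustrative assumptions and are not part of the Hive OS autofan script. It captures stdout and stderr, keeps the log bounded so /tmp does not fill up, and returns the command's own exit status.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from Hive OS): run a command under `timeout`,
# append its stdout and stderr to a bounded log, and return its exit status.

LOG_FILE="${LOG_FILE:-/tmp/nvidia-settings.out}"  # path taken from the post above
MAX_LINES="${MAX_LINES:-200}"                     # assumed cap so the log cannot grow without bound

run_logged() {   # usage: run_logged <seconds> <command> [args...]
	local secs=$1 status
	shift
	{
		echo "--- $(date '+%F %T') running: $*"
		timeout "$secs" "$@" 2>&1
		status=$?
		echo "--- exit status: $status"
	} >> "$LOG_FILE"
	# Keep only the newest MAX_LINES lines of the log.
	tail -n "$MAX_LINES" "$LOG_FILE" > "$LOG_FILE.tmp" && mv "$LOG_FILE.tmp" "$LOG_FILE"
	return "$status"
}

# Example (commented out because it needs the NVIDIA driver and an X display):
# run_logged 60 nvidia-settings $args || echo "nvidia-settings failed, see $LOG_FILE"
```

Because the function returns the original status, an existing `if [[ $? -ne 0 ]]` check after the call would keep working unchanged.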

unkn0wn · August 10, 2018, 7:10pm

After upgrading from 0.5-57 to the latest version, I started getting "Autofan: GPU driver error, no temps", and everything hangs until you reboot.
I had to roll back to 0.5-57.
Please fix this in the latest versions.

peeshypow · August 15, 2018, 4:22am

I am a paying member, CAN WE PLEASE GET A RESPONSE ON THIS.

How do we fix it? I don’t like my rigs not working. I don’t want to go and reinstall Windows on my rigs.

What is the solution?

I am using 5.69, the latest version.

hiveon · September 1, 2018, 12:12am

In Hive 2 the autofan script does not reboot the rig by default. It can give some warnings about failed temperature reads or a driver error, but it will not reboot unless you have enabled that option.

Rub3r · September 5, 2018, 1:50pm

Dear developer, what will happen if I switch off “reboot on errors”? That doesn’t fix the errors! And what fan speed will Hive set for the cards in that case? Just 100%?

zoid009 · September 6, 2018, 11:17am

This is extremely annoying: every 5-30 minutes the rig reboots with the error «Autofan: GPU driver error, rebooting». How do I fix this? Or what are the correct settings to use?
Currently set:
Target temperature: 65
Minimum speed: 25
Maximum speed: 100
Critical temperature: 79
Reboot on errors: on
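
To make the role of these four settings concrete, here is a rough sketch of a clamped step controller. This is only an illustration and is NOT the algorithm Hive OS autofan uses; the variable names, the function `next_fan_speed`, and the `STEP` gain are all hypothetical. Only the numeric values 65, 25, 100 and 79 come from the post above.

```shell
#!/usr/bin/env bash
# Hypothetical controller for illustration only, not the Hive OS implementation.

TARGET_TEMP=65     # value from the post above
MIN_FAN=25         # value from the post above
MAX_FAN=100        # value from the post above
CRITICAL_TEMP=79   # value from the post above
STEP=2             # assumed gain: fan percent per degree of error

next_fan_speed() {   # usage: next_fan_speed <temp_c> <current_fan_percent>
	local temp=$1 fan=$2
	if [ "$temp" -ge "$CRITICAL_TEMP" ]; then
		# At or above the critical temperature, go straight to maximum.
		fan=$MAX_FAN
	else
		# Otherwise move the fan in proportion to the distance from the target.
		fan=$(( fan + (temp - TARGET_TEMP) * STEP ))
	fi
	# Clamp the result to the configured range.
	[ "$fan" -lt "$MIN_FAN" ] && fan=$MIN_FAN
	[ "$fan" -gt "$MAX_FAN" ] && fan=$MAX_FAN
	echo "$fan"
}

next_fan_speed 70 50   # 5 degrees above target: prints 60
```

Note that a controller like this needs a valid temperature reading every cycle, which is why a failed read from the driver leaves it with nothing sensible to do.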

Kirill_Brusov · September 6, 2018, 2:24pm

I also ran into this unpleasant problem and have spent many hours on it. I haven’t fully solved it yet, but I have a hypothesis about the cause. Recent Hive updates did not pull in NVIDIA driver updates. In particular, after updating Hive OS on several machines to 0.5-70, I noticed that the NVIDIA drivers stayed at the old version, 387.34, although the latest stable drivers are 390.59.

I tried updating manually with the “nvidia-driver-update” script, but got this error:

ERROR: An NVIDIA kernel module ‘nvidia-modeset’ appears to already be
loaded in your kernel. This may be because it is in use (for
example, by an X server, a CUDA program, or the NVIDIA Persistence
Daemon), but this may also happen if your kernel was configured
without support for module unloading. Please be sure to exit any
programs that may be using the GPU(s) before attempting to upgrade
your driver. If no GPU-based programs are running, you know that
your kernel supports module unloading, and you still receive this
message, then an error may have occured that has corrupted an NVIDIA
kernel module’s usage count, for which the simplest remedy is to
reboot your computer.
ERROR: Installation has failed.  Please see the file
       '/var/log/nvidia-installer.log' for details.  You may find
       suggestions on fixing installation problems in the README available
       on the Linux driver download page at www.nvidia.com.
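
The installer error above says an NVIDIA kernel module is still loaded. A small sketch for checking that before retrying the update, assuming a Linux system where /proc/modules is readable. The function name `loaded_nvidia_modules` is hypothetical. Note that the kernel lists module names with underscores, so 'nvidia-modeset' appears as nvidia_modeset.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list loaded NVIDIA kernel modules with their use counts.
# /proc/modules columns are: name, size, use count, dependents, state, address.

loaded_nvidia_modules() {   # usage: loaded_nvidia_modules [modules_file]
	awk '$1 ~ /^nvidia/ { printf "%s (use count %s)\n", $1, $3 }' "${1:-/proc/modules}"
}

# Example on a live system (commented out; output depends on the machine):
# loaded_nvidia_modules
```

A non-zero use count means something still holds the module, typically the X server or the miner, and those would need to be stopped before the installer can unload it. How to stop them depends on the setup.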

Nanda · September 7, 2018, 1:32pm

I have the same problem every 2 or 3 months on some rigs.

The solution is to clean all the components on the rig and install the graphics cards one by one.
Usually everything goes back to normal.

I also found that a broken power supply cable or a broken riser can be one of the causes of the problem.

Thank you

adimetrius · October 28, 2018, 10:17am

What causes autofan to fail is this:
/hive/sbin/autofan: line 350: 21712 Segmentation fault timeout -s9 60 nvidia-settings $args > /dev/null 2>&1

When I run nvidia-settings, even without any parameters, it stops with a segmentation fault. No other messages, no other information.

Autofan does not work, gpu-fans-find does not work (same segmentation fault), fan speed settings do not work.
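
For telling a crash apart from a timeout, the exit status already carries the information. A sketch, assuming GNU `timeout` and a POSIX shell: status 124 means the time limit was hit with the default signal, and a status above 128 means death by signal (128 + signal number), so a segmentation fault shows up as 139 and a forced kill via `-s9` as 137. The function name `describe_status` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn the exit status of `timeout <cmd>` into a readable reason.

describe_status() {   # usage: describe_status <exit_status>
	local status=$1 sig name
	case $status in
		0)   echo "ok" ;;
		124) echo "timed out" ;;
		126|127) echo "command not runnable or not found" ;;
		*)
			if [ "$status" -gt 128 ]; then
				sig=$(( status - 128 ))
				# `kill -l <number>` prints the signal name; fall back if it is unknown.
				name=$(kill -l "$sig" 2>/dev/null) || name="unknown"
				echo "killed by signal $sig ($name)"
			else
				echo "failed with status $status"
			fi ;;
	esac
}

# Example (commented out because it needs the NVIDIA driver):
# timeout -s9 60 nvidia-settings $args > /dev/null 2>&1
# echo "nvidia-settings: $(describe_status $?)"
```

Logging this reason next to the warning would show at a glance whether a given rig is hitting hangs or crashes.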

Pro100Tack · October 29, 2018, 11:36am

Yesterday I started the farm. It ran fine for about 6 hours, then towards morning it started shutting down:
autofan.conf didn’t exist, so I created it with the following data,
Claymore Reboot WATCHDOG GPU 2 hangs in OpenCL call, exit.
I tried all the advice. The installed drivers were 39*.* (I don’t remember exactly), so following the advice above I decided to update the NVIDIA drivers via Shellinabox. Current version: 410.66. I also switched the miner to ethermine. One hour in, all good!!

Sifis_Koulourhs · November 20, 2018, 11:33pm

Hi all, I have a big problem with my rig.
The rig restarts all the time because of this error.

I don’t have NVIDIA cards, I have AMD cards.

error
{ "temp" : [ "" ] , "fan" : [ "" ] , "load" : [ "" ] , "power" : [ "" ] , "busids" : [ "" ] , "brand" : [ "" ] }

error
Autofan: GPU temperature 511 is unreal, driver error

{ "temp" : [ "45" , "42" , "511" , "511" , "511" , "511" ] , "fan" : [ "0" , "0" , "0" , "0" , "0" , "0" ] , "load" : [ "0" , "0" , "0" , "0" , "0" , "0" ] , "power" : [ "" , "" , "" , "" , "" , "" ] , "busids" : [ "03:00.0" , "04:00.0" , "08:00.0" , "09:00.0" , "0c:00.0" , "0d:00.0" ] , "brand" : [ "amd" , "amd" , "amd" , "amd" , "amd" , "amd" ] }

How can I fix this error?

I tried what our friend TwoChiefs suggested:

open the config with nano /hive-config/autofan.conf

and put this in it:

# Set to 1 to disable AMD fan control
NO_AMD=1

But the error is displayed again and again.
The rig restarts and stops mining.
Please tell me what I can do.

jeesus · December 20, 2018, 5:26pm

Same problem here. I only have 4 GPUs though. I always suspected the risers, but the guy I’m buying the risers from said his clients have never had problems with them. I’ve tried 6 different risers per GPU. It can’t be that they’re all defective…

Is there any way to detect which GPU temp readings failed exactly? That way I could at least narrow down the problem.
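
One way to narrow this down, sketched under the assumption that the per-GPU stats are available as a single-line JSON object shaped like the one posted earlier in the thread (parallel "temp" and "busids" arrays). Where that JSON comes from on a given rig is not stated in the thread, so the input is passed in as a string. The function names and the `MAX_PLAUSIBLE_TEMP` threshold are hypothetical, and the sed-based parsing is deliberately simple; a real JSON parser would be more robust.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: report GPUs whose temperature reading is empty or implausible.

MAX_PLAUSIBLE_TEMP=120   # assumed threshold; 511 as seen in the thread is clearly a bad read

# Print one array field of the single-line JSON as space-separated values.
# Empty strings are replaced by "-" so that positions stay aligned.
json_array() {   # usage: json_array <field> <json>
	printf '%s\n' "$2" |
		sed -n 's/.*"'"$1"'" *: *\[\([^]]*\)].*/\1/p' |
		sed 's/""/"-"/g' | tr -d '" ' | tr ',' ' '
}

bad_temp_gpus() {   # usage: bad_temp_gpus <json>
	local temps busids t i=0
	read -r -a temps  <<< "$(json_array temp "$1")"
	read -r -a busids <<< "$(json_array busids "$1")"
	for t in "${temps[@]}"; do
		# Flag readings that are not a number ("-" means empty) or are too high.
		if [[ ! $t =~ ^[0-9]+$ ]] || (( 10#$t > MAX_PLAUSIBLE_TEMP )); then
			echo "GPU $i (bus ${busids[$i]:-?}): temp reading '$t'"
		fi
		i=$(( i + 1 ))
	done
}

# Example with a shortened version of the stats posted above:
bad_temp_gpus '{ "temp" : [ "45" , "511" ] , "busids" : [ "03:00.0" , "08:00.0" ] }'
```

The bus ID in each reported line can then be matched to a physical slot or riser to see whether the failures follow one card.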