You have probably heard of the NMI watchdog. An NMI, or Non-Maskable Interrupt, is a type of interrupt that a CPU core can't ignore. Other types of interrupts can be routed to different CPUs depending on the workload, whereas an NMI can't be passed on and has to be handled by the CPU core that received it.
Historically, NMIs were used to interrupt CPUs whenever there was a critical hardware error, but the kernel now also uses them as a watchdog to detect a hung system. The watchdog checks that each CPU's timer interrupt count keeps increasing. If the count stops increasing for a set interval (around 5 seconds on older kernels; the exact value depends on the kernel version and configuration), the system is considered hung, and the NMI handler generates a debug log and initiates a kernel panic or reboot.
However, there can be situations where the NMI watchdog is disabled. If it is disabled, you may see only 0 values in the NMI row of /proc/interrupts. The benefit of the NMI watchdog is that it can trigger a panic with a crash dump, so that we can later analyze the situation that caused the crash.
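To make the /proc/interrupts check concrete, here is a minimal Python sketch that pulls the per-CPU counts out of the NMI row. The function name nmi_counts is my own, and the sketch assumes the usual x86 Linux layout, where the row starts with "NMI:" followed by one count per CPU and a text description. All-zero counts are only a hint that the watchdog may be off, not proof.

```python
from pathlib import Path

INTERRUPTS = Path("/proc/interrupts")


def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts found in /proc/interrupts-style text.

    Returns an empty list when there is no NMI row at all.
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                # The counts come first; the description text follows them.
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    return []


if __name__ == "__main__" and INTERRUPTS.exists():
    counts = nmi_counts(INTERRUPTS.read_text())
    if not counts:
        print("No NMI row found in /proc/interrupts")
    elif any(counts):
        print("NMI counts per CPU:", counts)
    else:
        print("All NMI counts are 0; the NMI watchdog may be disabled")
```

On x86 systems the current setting is usually also visible in /proc/sys/kernel/nmi_watchdog (1 for enabled, 0 for disabled), which is a more direct check where that file exists.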
What if I want to kill some processes automatically whenever the load rises above a certain limit, and bring the system back under control? We used to do this with a cron script, but I have often noticed that cron jobs don't run when the load is heavy. They are simply skipped.
The hangwatch daemon invokes pre-configured sysrq triggers when the system load average exceeds a certain threshold. Hangwatch periodically polls /proc/loadavg, and echoes a user-defined set of characters into /proc/sysrq-trigger if a user-defined load threshold is exceeded.
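The polling idea described above can be sketched in a few lines of Python. This is not hangwatch's actual code, only an illustration of the same mechanism: the function names, the poll interval, and the dry_run and max_polls parameters are all my own assumptions. The sketch writes one SysRq character per write, and it defaults to a dry run so that nothing is sent to the kernel by accident.

```python
import time
from pathlib import Path

LOADAVG = Path("/proc/loadavg")
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")


def one_minute_load(loadavg_text):
    """Parse the 1-minute load average, the first field of /proc/loadavg."""
    return float(loadavg_text.split()[0])


def send_sysrq(keys, trigger=SYSRQ_TRIGGER):
    """Write each SysRq command character with its own write (needs root)."""
    for key in keys:
        with open(trigger, "w") as handle:
            handle.write(key)


def watch(threshold, keys, poll_seconds=10.0, dry_run=True, max_polls=None):
    """Poll the load average and fire the SysRq keys when it is too high.

    All parameters are illustrative; they are not hangwatch's settings.
    max_polls=None means poll forever.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        load = one_minute_load(LOADAVG.read_text())
        if load > threshold:
            if dry_run:
                print(f"load {load} > {threshold}: would send SysRq {keys!r}")
            else:
                send_sysrq(keys)
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(poll_seconds)
```

For example, watch(threshold=50.0, keys="mt") would only report when it would have asked the kernel for a memory report (m) and a task dump (t). Passing dry_run=False as root sends the keys for real, so destructive keys such as e (SIGTERM to all processes except init) should be chosen with care.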
Unlike cron, hangwatch passes a magic sysrq command to /proc/sysrq-trigger, and the kernel then does the needful. The available actions include sending SIGTERM to all running processes (except init).
https://github.com/jumanjiman/hangwatch/
You can set the configuration in the file /etc/sysconfig/hangwatch. A copy of the settings can be found at
github.com/jumanjiman/hangwatch/blob/master/src/etc/sysconfig/hangwatch