
Date: 2022-09-03 20:03:52

My system includes a task which opens a network socket, receives pushed data from the network, processes it, and writes it out to disk or pings other machines depending on the messages. This task is intended to run forever, and the service is designed to have this task always running. But sometimes it crashes.


What's the best practice for keeping a task like this alive? Assume it's okay for the task to be dead for up to 30 seconds before we restart it.


Some obvious ideas include having a watchdog process that checks that the task is still running. The watchdog could be triggered by cron. But how does it know whether the process is alive? Write a pidfile? Touch a heartbeat file? An ideal solution wouldn't keep spinning up more processes if the machine gets bogged down to the point where the watchdog is running faster than the heartbeat.


Are there standard Linux tools for this? I can imagine a solution that uses a message queue, but I'm not sure whether that's a good idea.


4 solutions



Depending on the nature of the task that you wish to monitor, one method is to write a simple wrapper to start up your task in a fork().


The wrapper task can then do a waitpid() on the child and restart it if it is terminated.


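This does depend on modifying the source for the task that you wish to run if the fork() is built into the task itself. A separate wrapper that calls fork() and then exec() on the unmodified binary avoids that.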
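A minimal sketch of such a wrapper in Python, assuming the task is a separate executable that can be launched with exec. The names run_once, describe and supervise are made up for this illustration, and the fixed delay between launches is an assumption added so that a task which dies immediately does not turn into a tight restart loop.

```python
import os
import time


def run_once(argv):
    """Fork, exec argv in the child, and return the child's raw wait status."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the task.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed; never fall back into the parent's code.
            os._exit(127)
    # Parent: block until the child terminates.
    _, status = os.waitpid(pid, 0)
    return status


def describe(status):
    """Turn a wait status into a short human-readable string."""
    if os.WIFSIGNALED(status):
        return f"killed by signal {os.WTERMSIG(status)}"
    return f"exit code {os.WEXITSTATUS(status)}"


def supervise(argv, max_launches=None, delay=1.0):
    """Relaunch argv every time it ends; return how many times it was launched.

    max_launches=None means keep restarting forever.
    """
    launches = 0
    while max_launches is None or launches < max_launches:
        if launches:
            time.sleep(delay)  # pause so a crash loop cannot spin the CPU
        status = run_once(argv)
        launches += 1
        print(f"launch {launches} ended: {describe(status)}")
    return launches
```

Calling supervise(["/bin/my-network-task"]) would keep the task running until the wrapper itself is stopped. With a delay of a second or so, the downtime after a crash stays well inside the 30 seconds the question allows. Note that this only catches a task that exits, not one that hangs.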




sysvinit will restart a process that dies if it is added to /etc/inittab with the respawn action.


If you're worried about the process hanging without actually exiting, you can have it emit a heartbeat and hard-kill the active instance when the heartbeat stops, letting init restart it.
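One way the heartbeat check could look, as a rough sketch in Python. It assumes the task touches a heartbeat file regularly and writes its pid to a pidfile; both file conventions, the function names and the 30 second threshold (taken from the question) are assumptions for illustration. The watchdog only ever kills and never starts anything, so a slow machine cannot cause it to pile up extra copies of the task; restarting is left to init.

```python
import os
import signal
import time


def touch_heartbeat(path):
    """Called by the task itself, e.g. once per processed message batch."""
    with open(path, "a"):
        pass
    os.utime(path, None)  # set mtime to now


def heartbeat_age(path):
    """Seconds since the heartbeat file was last touched, or None if missing."""
    try:
        return time.time() - os.stat(path).st_mtime
    except FileNotFoundError:
        return None


def kill_if_stale(heartbeat_path, pidfile_path, max_age=30.0):
    """SIGKILL the pid named in pidfile_path if the heartbeat is too old.

    Returns True if a kill signal was sent. A missing heartbeat file is
    treated as "task has not started yet" and left alone.
    """
    age = heartbeat_age(heartbeat_path)
    if age is None or age <= max_age:
        return False
    try:
        with open(pidfile_path) as f:
            pid = int(f.read().strip())
        os.kill(pid, signal.SIGKILL)
    except (FileNotFoundError, ValueError, ProcessLookupError):
        # No pidfile, garbage in it, or the process is already gone.
        return False
    return True
```

A cron entry that runs kill_if_stale once a minute would be enough here. One caveat: a stale pidfile can point at an unrelated process that reused the pid, so a real deployment would want the task to remove its pidfile on clean exit or to verify the process name before killing.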




You could use monit along with daemonize. There are lots of tools for this in the *nix world.




Supervisor was designed precisely for this task. From the project website:


Supervisor is a client/server system that allows its users to monitor and control a number of processes on UNIX-like operating systems.


It runs as a daemon (supervisord) controlled by a command line tool, supervisorctl. The configuration file contains a list of programs it is supposed to monitor, among other settings.


The number of options is quite extensive; have a look at the docs for a complete list. In your case, the relevant configuration section might be something like this:


; the section name below is illustrative
[program:my-network-task]
command=/bin/my-network-task   ; where your binary lives
autostart=true                 ; start when supervisord starts
autorestart=true               ; restart automatically whenever it exits
startsecs=10                   ; must stay up this many seconds to count as started
startretries=3                 ; failed start attempts before giving up

I have used Supervisor myself and it worked really well once everything was set up. It requires Python, which should not be a big deal in most environments but might be.




