Maintaining a long-running task on Linux

My system includes a task which opens a network socket, receives pushed data from the network, processes it, and writes it out to disk or pings other machines depending on the messages. This task is intended to run forever, and the service is designed to have this task always running. But sometimes it crashes.

What's the best practice for keeping a task like this alive? Assume it's okay for the task to be dead for up to 30 seconds before we restart it.

One obvious idea is a watchdog process that checks whether the task is still running. The watchdog could be triggered by cron. But how does it know whether the process is alive? Write a pidfile? Touch a heartbeat file? An ideal solution wouldn't keep spinning up more processes if the machine gets bogged down to the point where the watchdog is running faster than the heartbeat.

Are there standard Linux tools for this? I can imagine a solution that uses a message queue, but I'm not sure whether that's a good idea.

4 solutions

#1

Depending on the nature of the task that you wish to monitor, one method is to write a simple wrapper to start up your task in a fork().

The wrapper task can then do a waitpid() on the child and restart it if it is terminated.

As described, this means modifying the source of the task you wish to run. A wrapper that calls fork() and then exec()s the unmodified binary avoids that.
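
A minimal sketch of such a wrapper, written in Python for brevity. It assumes the task can be launched as a separate command. The names supervise, max_runs, and delay are hypothetical and only serve the illustration.

```python
import os
import sys
import time


def supervise(argv, max_runs=None, delay=1.0):
    """Run argv as a child process and start it again whenever it exits.

    max_runs=None loops forever; a number limits the runs (useful for testing).
    Returns the raw wait statuses of the children that were reaped.
    """
    statuses = []
    while max_runs is None or len(statuses) < max_runs:
        pid = os.fork()
        if pid == 0:
            # Child: replace this process image with the task.
            try:
                os.execvp(argv[0], argv)
            finally:
                os._exit(127)  # only reached if exec failed
        # Parent: block until the child terminates.
        _, status = os.waitpid(pid, 0)
        statuses.append(status)
        print(f"child {pid} ended with wait status {status}", file=sys.stderr)
        time.sleep(delay)  # brief pause so a crash loop cannot spin the CPU
    return statuses
```

Calling supervise(["/bin/my-network-task"]) would keep the task running until the wrapper itself is stopped. Because waitpid() blocks, the wrapper uses no CPU while the child is healthy, and at most one child exists at a time.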

#2

sysvinit will restart processes that die if they are listed in /etc/inittab with the respawn action.

If you're worried about the process freezing without crashing and ending the process, you can use a heartbeat and hard kill the active instance, letting init restart it.
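
A rough sketch of the heartbeat idea in Python, assuming the task touches a file periodically and a separate checker (run from cron, for example) knows the task's pid. The function names, the 30-second default, and the choice to ignore a missing heartbeat file are all assumptions made for illustration.

```python
import os
import signal
import time
from pathlib import Path


def beat(path):
    """Called by the task itself: update the heartbeat file's mtime."""
    Path(path).touch()


def heartbeat_age(path, now=None):
    """Seconds elapsed since the heartbeat file was last touched."""
    now = time.time() if now is None else now
    return now - os.stat(path).st_mtime


def kill_if_stale(pid, path, max_age=30.0, now=None):
    """Send SIGKILL to pid if the heartbeat is older than max_age seconds.

    Returns True if the signal was sent, False otherwise.
    """
    try:
        age = heartbeat_age(path, now)
    except FileNotFoundError:
        return False  # assumption: no heartbeat yet means the task is still starting
    if age <= max_age:
        return False
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return False  # process already gone; nothing left to kill
    return True
```

The checker only kills and never starts anything, so a slow machine cannot accumulate extra copies of the task; restarting stays the job of init.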

#3

You could use monit along with daemonize. There are lots of tools for this in the *nix world.

#4

Supervisor was designed precisely for this task. From the project website:

Supervisor is a client/server system that allows its users to monitor and control a number of processes on UNIX-like operating systems.

It runs as a daemon (supervisord) controlled by a command line tool, supervisorctl. The configuration file contains a list of programs it is supposed to monitor, among other settings.

The list of options is extensive; have a look at the docs for the complete set. In your case, the relevant configuration section might look something like this:

[program:my-network-task]
command=/bin/my-network-task   ; path to your binary
autostart=true                 ; start when supervisord starts
autorestart=true               ; restart automatically whenever the process exits
startsecs=10                   ; seconds the process must stay up to count as started
startretries=3                 ; failed start attempts allowed before giving up

I have used Supervisor myself, and it worked really well once everything was set up. It requires Python, which is rarely a problem but may matter in constrained environments.
