A foolproof cross-platform process-killing watchdog daemon.

Time: 2021-07-11 12:15:29

I have some python automation, which spawns telnet sessions that I log with the linux script command; there are two script process IDs (a parent and child) for each logging session.

I need to solve a problem where if the python automation script dies, the script sessions never close on their own; for some reason this is much harder than it should be.

So far, I have implemented watchdog.py (see bottom of the question), which daemonizes itself, and polls the python automation script's PID in a loop. When it sees the python automation PID disappear from the server's process table, it attempts to kill the script sessions.

My problem is:

  • script sessions always spawn two separate processes; one of the script sessions is the parent of the other script session.
  • watchdog.py will not kill the child script sessions if I start the script sessions from the automation script (see AUTOMATION EXAMPLE, below)

AUTOMATION EXAMPLE (reproduce_bug.py)

import pexpect as px
from subprocess import Popen
import code
import time
import sys
import os

def read_pid_and_telnet(_child, addr):
    time.sleep(0.1) # Give the OS time to write the PIDFILE
    # Read the PID in the PIDFILE
    fh = open('PIDFILE', 'r')
    pid = int(''.join(fh.readlines()))
    fh.close()
    time.sleep(0.1)
    # Clean up the PIDFILE
    os.remove('PIDFILE')
    _child.expect(['#', '\$'], timeout=3)
    _child.sendline('telnet %s' % addr)
    return str(pid)

pidlist = list()
child1 = px.spawn("""bash -c 'echo $$ > PIDFILE """
    """&& exec /usr/bin/script -f LOGFILE1.txt'""")
pidlist.append(read_pid_and_telnet(child1, '10.1.1.1'))

child2 = px.spawn("""bash -c 'echo $$ > PIDFILE """
    """&& exec /usr/bin/script -f LOGFILE2.txt'""")
pidlist.append(read_pid_and_telnet(child2, '10.1.1.2'))

cmd = "python watchdog.py -o %s -k %s" % (os.getpid(), ','.join(pidlist))
Popen(cmd.split(' '))
print "I started the watchdog with:\n   %s" % cmd

time.sleep(0.5)
raise RuntimeError, "Simulated script crash.  Note that script child sessions are hung"

Now the example of what happens when I run the automation above... note that PID 30017 spawns 30018 and PID 30020 spawns 30021. All the aforementioned PIDs are script sessions.

[mpenning@Hotcoffee Network]$ python reproduce_bug.py 
I started the watchdog with:
   python watchdog.py -o 30016 -k 30017,30020
Traceback (most recent call last):
  File "reproduce_bug.py", line 35, in <module>
    raise RuntimeError, "Simulated script crash.  Note that script child sessions are hung"
RuntimeError: Simulated script crash.  Note that script child sessions are hung
[mpenning@Hotcoffee Network]$

After I run the automation above, all the child script sessions are still running.

[mpenning@Hotcoffee Models]$ ps auxw | grep script
mpenning 30018  0.0  0.0  15832   508 ?        S    12:08   0:00 /usr/bin/script -f LOGFILE1.txt
mpenning 30021  0.0  0.0  15832   516 ?        S    12:08   0:00 /usr/bin/script -f LOGFILE2.txt
mpenning 30050  0.0  0.0   7548   880 pts/8    S+   12:08   0:00 grep script
[mpenning@Hotcoffee Models]$

I am running the automation under Python 2.6.6, on a Debian Squeeze linux system (uname -a: Linux Hotcoffee 2.6.32-5-amd64 #1 SMP Mon Jan 16 16:22:28 UTC 2012 x86_64 GNU/Linux).

QUESTION:

It seems that the daemon doesn't survive the spawning-process crash. How can I fix watchdog.py to close all the script sessions if the automation dies (as shown in the example above)?

A watchdog.py log that illustrates the problem (sadly, the PIDs do not coincide with the original question)...

[mpenning@Hotcoffee ~]$ cat watchdog.log 
2012-02-22,15:17:20.356313 Start watchdog.watch_process
2012-02-22,15:17:20.356541     observe pid = 31339
2012-02-22,15:17:20.356643     kill pids = 31352,31356
2012-02-22,15:17:20.356730     seconds = 2
[mpenning@Hotcoffee ~]$

Resolution

The problem essentially was a race condition. When I tried to kill the "parent" script processes, they had already died coincidentally with the automation crash...

To solve the problem... first, the watchdog daemon needed to identify the entire list of children to be killed before polling the observed PID (my original script attempted to identify children after the observed PID crashed). Next, I had to modify my watchdog daemon to allow for the possibility that some script processes might die with the observed PID.


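The two fixes described above, building the kill list before polling and tolerating PIDs that die together with the observed process, can be sketched as follows. This is a simplified, stdlib-only illustration with hypothetical names (`kill_if_alive`, `watch`); the real watchdog.py below also walks each PID's children via psutil.

```python
import os
import signal
import time

def kill_if_alive(pid):
    """Send SIGTERM to pid, tolerating processes that already died
    in the race with the observed process (the bug described above)."""
    try:
        os.kill(pid, signal.SIGTERM)
        return True
    except OSError:
        return False   # e.g. ESRCH: already gone; not an error here

def watch(observe_pid, kill_pids, seconds=2):
    """Poll observe_pid; when it disappears, try to kill every PID in
    kill_pids.  The kill list must be built *before* polling starts,
    because children cannot be discovered after the parent dies."""
    while True:
        try:
            os.kill(observe_pid, 0)   # signal 0: existence check only
        except OSError:
            break                     # observed PID is gone
        time.sleep(seconds)
    return [pid for pid in kill_pids if kill_if_alive(pid)]
```

Note that `os.kill(pid, 0)` keeps succeeding for an unreaped zombie child, so the observed process must be one the watchdog is not the parent of (which is exactly the daemonized setup used here).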
watchdog.py:

#!/usr/bin/python
"""
Implement a cross-platform watchdog daemon, which observes a PID and kills 
other PIDs if the observed PID dies.

Example:
--------

watchdog.py -o 29322 -k 29345,29346,29348 -s 2

The command checks PID 29322 every 2 seconds and kills PIDs 29345, 29346, 29348 
and their children, if PID 29322 dies.

Requires:
----------

 * https://github.com/giampaolo/psutil
 * http://pypi.python.org/pypi/python-daemon
"""
from optparse import OptionParser
import datetime as dt
import signal
import daemon
import logging
import psutil
import time
import sys
import os

class MyFormatter(logging.Formatter):
    converter=dt.datetime.fromtimestamp
    def formatTime(self, record, datefmt=None):
        ct = self.converter(record.created)
        if datefmt:
            s = ct.strftime(datefmt)
        else:
            t = ct.strftime("%Y-%m-%d %H:%M:%S")
            s = "%s,%03d" % (t, record.msecs)
        return s

def check_pid(pid):        
    """ Check For the existence of a unix / windows pid."""
    try:
        os.kill(pid, 0)   # Kill 0 raises OSError, if pid isn't there...
    except OSError:
        return False
    else:
        return True

def kill_process(logger, pid):
    try:
        psu_proc = psutil.Process(pid)
    except Exception, e:
        logger.debug('Caught Exception ["%s"] while looking up PID %s' % (e, pid))
        return False

    logger.debug('Sending SIGTERM to %s' % repr(psu_proc))
    psu_proc.send_signal(signal.SIGTERM)
    psu_proc.wait(timeout=None)
    return True

def watch_process(observe, kill, seconds=2):
    """Kill the process IDs listed in 'kill', when 'observe' dies."""
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.DEBUG)
    logfile = logging.FileHandler('%s/watchdog.log' % os.getcwd())
    logger.addHandler(logfile)
    formatter = MyFormatter(fmt='%(asctime)s %(message)s',datefmt='%Y-%m-%d,%H:%M:%S.%f')
    logfile.setFormatter(formatter)


    logger.debug('Start watchdog.watch_process')
    logger.debug('    observe pid = %s' % observe)
    logger.debug('    kill pids = %s' % kill)
    logger.debug('    seconds = %s' % seconds)
    children = list()

    # Get PIDs of all child processes...
    for childpid in kill.split(','):
        children.append(childpid)
        p = psutil.Process(int(childpid))
        for subpsu in p.get_children():
            children.append(str(subpsu.pid))

    # Poll observed PID...
    while check_pid(int(observe)):
        logger.debug('Poll PID: %s is alive.' % observe)
        time.sleep(seconds)
    logger.debug('Poll PID: %s is *dead*, starting kills of %s' % (observe, ', '.join(children)))

    for pid in children:
        # kill all child processes...
        kill_process(logger, int(pid))
    sys.exit(0) # Exit gracefully

def run(observe, kill, seconds):

    with daemon.DaemonContext(detach_process=True, 
        stdout=sys.stdout,
        working_directory=os.getcwd()):
        watch_process(observe=observe, kill=kill, seconds=seconds)

if __name__=='__main__':
    parser = OptionParser()
    parser.add_option("-o", "--observe", dest="observe", type="int",
                      help="PID to be observed", metavar="INT")
    parser.add_option("-k", "--kill", dest="kill",
                      help="Comma separated list of PIDs to be killed", 
                      metavar="TEXT")
    parser.add_option("-s", "--seconds", dest="seconds", default=2, type="int",
                      help="Seconds to wait between observations (default = 2)", 
                      metavar="INT")
    (options, args) = parser.parse_args()
    run(options.observe, options.kill, options.seconds)

4 Answers

#1


2  

Your problem is that script is not detached from the automation script after it spawns, so it works as a child process, and when the parent dies it is left unmanaged.

To handle Python script exit you can use the atexit module. To monitor child process exits you can use os.wait or handle the SIGCHLD signal.

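The atexit suggestion could look like the following minimal sketch (the names `spawned_pids` and `cleanup_sessions` are illustrative, not from the original code). atexit handlers run on normal interpreter exit and after unhandled exceptions such as the RuntimeError in reproduce_bug.py, but not if the process is killed with SIGKILL, which is why a separate watchdog is still a useful backstop:

```python
import atexit
import os
import signal

# PIDs of the script sessions we spawned; filled in as sessions start.
spawned_pids = []

def cleanup_sessions():
    """Terminate any script sessions that are still alive at exit."""
    for pid in spawned_pids:
        try:
            os.kill(pid, signal.SIGTERM)
        except OSError:
            pass  # session already exited

atexit.register(cleanup_sessions)
```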
#2


1  

You can try to kill the complete process group, containing the parent script, the child script, the bash spawned by script, and perhaps even the telnet process.

The kill(2) manual says:

If pid is less than -1, then sig is sent to every process in the process group whose ID is -pid.

So the equivalent of kill -TERM -$PID will do the job.

Oh, the pid you need is that of the parent script.


Edit

The process group killing seems to work for me if I adapt the following two functions in watchdog.py:

def kill_process_group(log, pid):
    log.debug('killing %s' % -pid)
    os.kill(-pid, 15)

    return True

def watch_process(observe, kill, seconds=2):
    """Kill the process IDs listed in 'kill', when 'observe' dies."""
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.DEBUG)
    logfile = logging.FileHandler('%s/watchdog.log' % os.getcwd())
    logger.addHandler(logfile)
    formatter = MyFormatter(fmt='%(asctime)s %(message)s',datefmt='%Y-%m-%d,%H:%M:%S.%f')
    logfile.setFormatter(formatter)

    logger.debug('Start watchdog.watch_process')
    logger.debug('    observe pid = %s' % observe)
    logger.debug('    kill pids = %s' % kill)
    logger.debug('    seconds = %s' % seconds)

    while check_pid(int(observe)):
        logger.debug('PID: %s is alive.' % observe)
        time.sleep(seconds)
    logger.debug('PID: %s is *dead*, starting kills' % observe)

    for pid in kill.split(','):
        # Kill the children...
        kill_process_group(logger, int(pid))
    sys.exit(0) # Exit gracefully
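The group kill only reliably catches the whole tree if the spawned command leads its own process group; otherwise `-pid` can end up signalling the automation script's own group. A minimal stdlib sketch of arranging that (illustrative helper names, using subprocess in place of pexpect):

```python
import os
import signal
import subprocess

def spawn_in_own_group(cmd):
    """Start cmd as the leader of a new session (and process group),
    so its whole tree can later be killed with one os.killpg call."""
    return subprocess.Popen(cmd, preexec_fn=os.setsid)

def kill_group(pid, sig=signal.SIGTERM):
    """Signal every process in pid's group; tolerate a vanished group."""
    try:
        os.killpg(os.getpgid(pid), sig)
        return True
    except OSError:
        return False
```

`kill_group(pid)` is then the Python equivalent of `kill -TERM -$PID` from the answer above.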

#3


0  

Maybe you could use os.system() and do a killall in your watchdog to kill all instances of /usr/bin/script

#4


0  

On inspection it seems that psu_proc.kill() (actually send_signal()) should raise OSError on failure, but just in case - have you tried checking for termination before setting the flag? As in:

if not psu_proc.is_running():
  finished = True
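A stdlib approximation of that termination check (a hypothetical helper, no psutil) sends SIGTERM and then polls the PID until it has actually left the process table before reporting success:

```python
import errno
import os
import signal
import time

def terminate_and_confirm(pid, timeout=5.0, interval=0.1):
    """Send SIGTERM, then poll until the PID is really gone (or the
    timeout expires).  Returns True only once the process has been
    confirmed dead, mirroring the is_running() check suggested above.
    Caveat: an unreaped zombie child still occupies its PID, so a
    caller must wait() on its own children for this to complete."""
    try:
        os.kill(pid, signal.SIGTERM)
    except OSError:
        return True   # already gone
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            os.kill(pid, 0)   # existence check only
        except OSError as e:
            if e.errno == errno.ESRCH:   # no such process
                return True
        time.sleep(interval)
    return False
```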
