I'm running some applications on EC2 spot instances. Such instances can be killed by Amazon with no notice.
In the shutdown process, processes are killed in some order. We have monitoring/recovery programs that should behave differently depending on whether the server is shutting down or the process simply crashed. (Specifically, we don't want to do anything if the server is actually shutting down.)
How can I detect in the recovery process (if it is still alive) that processes were killed because of a shutdown?
(More system details: I'm running unknown/untrusted code in a sandbox that doesn't modify external state. Generally, if sandboxed code crashes, it is the fault of the author of the untrusted code and we will not rerun it. But if the sandboxed code is terminated because the VM is shutting down or failing, we need to rerun it on another instance. The problem I'm having right now is that the user's code is terminated first, so the monitoring program incorrectly believes the crash is user error.)
4 Solutions
#1
5
agent
Run an agent on each machine that spawns sandbox child processes. The agent runs your own "crash-proof" code, while the sandbox runs user code that could crash.
The monitoring system that is in charge of starting a new machine with a new sandbox process checks which processes have been killed (both the agent and sandbox process or only the sandbox child process).
It does that by opening a TCP connection (RMI/RPC/HTTP) to the agent and querying it about its child processes. If the agent responds, the machine is still running, and the agent can be asked about its child sandbox processes. If the agent does not respond, the machine is suspected of having been terminated.
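A minimal sketch of the monitor side of this check, assuming the agent exposes an HTTP endpoint. The `/children` path, the JSON shape (job id mapped to a status string), and the names `query_agent` and `diagnose` are hypothetical, chosen only for illustration:

```python
import http.client
import json
import urllib.request


def query_agent(base_url, timeout=3.0):
    """Return the agent's child-status dict, or None if the agent is unreachable.

    Assumes a hypothetical GET {base_url}/children endpoint that returns JSON
    such as {"job-1": "running", "job-2": "exited"}.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/children", timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or a malformed reply:
        # treat all of them as "agent did not respond".
        return None


def diagnose(children, job_id):
    """Map the agent's answer to a verdict for one sandbox job."""
    if children is None:
        return "machine-suspect"   # agent silent: machine may have been terminated
    if children.get(job_id) == "running":
        return "ok"
    return "sandbox-died"          # agent alive but child gone: likely user error
```

A monitor would call `diagnose(query_agent("http://host:port"), job_id)` and rerun the job elsewhere only on `"machine-suspect"`. A single missed reply can also be a network blip, so retrying a few times before deciding is probably wise.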
agent (variation)
The agent is also in charge of restarting the child sandbox process on the same VM in case it crashes.
lookup service
Use a lookup service (such as ZooKeeper) to keep track of which processes have sent heartbeat keep-alives. If the agent is alive, the machine is still running; if the agent is not alive, the machine is not running.
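To illustrate the heartbeat idea without depending on a real lookup service, here is a small in-memory stand-in. It is not ZooKeeper's API; the class name `HeartbeatRegistry`, the `ttl_seconds` parameter, and the injectable clock are assumptions made for the sketch:

```python
import time


class HeartbeatRegistry:
    """In-memory stand-in for a lookup service that tracks keep-alives."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds      # how long a heartbeat stays valid
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # agent id -> time of last heartbeat

    def beat(self, agent_id):
        """Record a heartbeat from an agent."""
        self.last_seen[agent_id] = self.clock()

    def is_alive(self, agent_id):
        """True if the agent sent a heartbeat within the last ttl seconds."""
        seen = self.last_seen.get(agent_id)
        return seen is not None and self.clock() - seen <= self.ttl
```

Each agent would call `beat()` periodically, and the monitor would treat `is_alive(agent_id) == False` as "the machine is gone". With a real service, an ephemeral node or session expiry would typically play the role of the TTL.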
ec2 api
Poll the EC2 APIs to determine whether the machine is in the running or terminated state.
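A rough sketch of such a poll, assuming the boto3 SDK is used. The client is passed in rather than created here, and the helper names `instance_state` and `machine_is_gone`, as well as the choice of which states count as "gone", are assumptions for illustration:

```python
# States in which the sandbox cannot still be running (an assumption of this
# sketch; adjust if stopped instances are handled differently in your setup).
GONE_STATES = {"shutting-down", "terminated", "stopping", "stopped"}


def instance_state(ec2_client, instance_id):
    """Return the instance state name reported by DescribeInstances."""
    resp = ec2_client.describe_instances(InstanceIds=[instance_id])
    return resp["Reservations"][0]["Instances"][0]["State"]["Name"]


def machine_is_gone(ec2_client, instance_id):
    """True if EC2 reports the instance as shutting down or no longer running."""
    return instance_state(ec2_client, instance_id) in GONE_STATES


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   ec2 = boto3.client("ec2")
#   if machine_is_gone(ec2, "i-..."):   # instance id left elided on purpose
#       ...rerun the job on another instance...
```

The sketch does not handle the error raised when EC2 no longer knows the instance ID; a monitor would likely want to treat that case as "gone" as well.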
#2
2
How does your recovery process work?
If you're using waitpid to monitor the process, when it exits you can determine:
- Whether it exited normally, and what status the process returned if it did, or
- Whether it exited due to a signal, and what that signal was.
Depending on how the process is shut down, I'd expect to see it either exit normally or exit via SIGTERM or SIGKILL. SIGILL, SIGABRT, SIGFPE, SIGBUS, SIGSEGV, and SIGSYS would indicate a crash from a programming error.
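The distinction above can be sketched with Python's wrappers around the waitpid status macros. The function name `classify_status` and the label strings are made up for this example; the signal grouping follows the answer's expectation, not a guarantee about how EC2 shuts processes down:

```python
import os
import signal

# Signals that usually point at a programming error in the child.
CRASH_SIGNALS = {signal.SIGILL, signal.SIGABRT, signal.SIGFPE,
                 signal.SIGBUS, signal.SIGSEGV, signal.SIGSYS}
# Signals typically sent by something else, e.g. init during shutdown.
SHUTDOWN_SIGNALS = {signal.SIGTERM, signal.SIGKILL}


def classify_status(status):
    """Classify a raw wait status as returned by os.waitpid()."""
    if os.WIFEXITED(status):
        return ("exited", os.WEXITSTATUS(status))
    if os.WIFSIGNALED(status):
        sig = os.WTERMSIG(status)
        if sig in CRASH_SIGNALS:
            return ("crashed", sig)
        if sig in SHUTDOWN_SIGNALS:
            return ("killed", sig)
        return ("signaled", sig)
    return ("unknown", status)


# Usage in the monitor, where child_pid is the sandbox process:
#   _, status = os.waitpid(child_pid, 0)
#   kind, detail = classify_status(status)
```

Note that `("killed", SIGKILL)` is ambiguous on its own: the kernel's out-of-memory killer and resource limits can also deliver SIGKILL, so this is a hint to combine with other evidence rather than proof of a shutdown.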
#3
1
That sounds like a very fragile scheme. Don't try to detect the state of the system: have your application write out a validity token (and sync the relevant files!) after a "clean" shutdown/halt/stop of the app, and use that.
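One way the token idea might look, as a sketch only. The function names, the token contents, and the write-to-temp-then-rename approach are assumptions; the answer itself only says to write a token and sync it:

```python
import os


def write_clean_shutdown_token(path, text="clean\n"):
    """Durably write a token file marking a clean stop of the app."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, text.encode())
        os.fsync(fd)                 # flush file contents to disk
    finally:
        os.close(fd)
    os.replace(tmp, path)            # atomic rename into place
    # Sync the directory so the rename itself survives a sudden power-off.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def consume_token(path):
    """Return True (and remove the token) if the last stop was clean."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False
```

For the spot-instance case, the token would need to live somewhere the recovery side can still read after the VM is gone (for example shared or remote storage), since local disk disappears with the instance.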
#4
0
I assume that when the instance is shutting down your monitoring process would receive a SIGTERM signal.
So would it be possible to do something like this: if the monitored process has exited and no SIGTERM is received within the next, say, 5 seconds, assume the process crashed. If a SIGTERM is received, simply exit from the signal handler.