What could cause scheduled Rails Active Jobs to disappear?

I suspect that some of our Active Jobs are disappearing, but I don't know why. Below is one case where I have found evidence of the disappearance, but not the reason for it.

Our site makes use of an external cloud printing service. We kick the jobs off and then check their status. Having successfully created the remote cloud print, we create an Active Job to check its status immediately. If it's finished (successfully or otherwise), it's marked as such. If not, the status check job enqueues another one with a slight delay. The delay increases each time.

On one status check today, the logs show that the wait reached 128 seconds. But the next status check never occurred, and there are no errors in the log either.

We use Active Job backed by Delayed Job. The code for the status check job is below. I can't see any flaw in the logic that would prevent either a correctly collected status check or another attempt with a wait.

class CheckCloudPrintStatusJob < ApplicationJob
  queue_as :default

  def perform(cloud_print, count = 0)
    cloud_print.update_status

    unless cloud_print.finished?
      count += 1
      wait = 2**(count-1)

      if count > 15
        cloud_print.mark_as_failed

        puts "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
        puts "~~~~~~~~~~~~~~~~~~ Cloud printing ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
        puts "Cloud print ##{cloud_print.id} failed"
        puts "Finally waited #{wait} seconds and then cancelled."
        puts "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
      else
        puts "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
        puts "~~~~~~~~~~~~~~~~~~ Cloud printing ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
        puts "Checking status of cloud print ##{cloud_print.id}"
        puts "Waiting #{wait} seconds and then retrying."
        puts "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"

        CheckCloudPrintStatusJob.set(wait: wait.seconds).perform_later(cloud_print, count)
      end
    end
  end
end
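
To put the 128-second figure in context, here is a small plain-Ruby sketch of the schedule the job above produces. It only reuses the 2**(count-1) formula and the count > 15 cutoff from the job; the wait_for helper name is made up for illustration.

```ruby
# Wait (in seconds) that the job logs for a given retry count,
# using the same formula as the job: 2**(count - 1).
def wait_for(count)
  2**(count - 1)
end

# Counts 1..15 schedule another check; count 16 marks the print as failed.
waits = (1..15).map { |count| wait_for(count) }

puts waits.inspect
puts "The 128-second wait is scheduled check number #{waits.index(128) + 1} of #{waits.length}"
puts "Total scheduled wait before giving up: #{waits.sum} seconds"
```

So the chain stopped roughly halfway through: the 128-second wait is the eighth of fifteen scheduled re-checks, and a print that never finishes would be polled for about nine hours before being marked as failed.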

2 Solutions

#1

Correct, there is no flaw in the stated logic that would prevent either a correctly collected status check or another attempt with a wait.

I've verified that your job code works correctly beyond a 128-second wait with the following setup:

rails new project

delayed_job_active_record added to the Gemfile (running bundle install)

rails generate delayed_job:active_record and rake db:migrate to generate the migration and create the Delayed Job DB table

config.active_job.queue_adapter = :delayed_job in config/application.rb

a basic CloudPrint < ApplicationRecord model with update_status, finished? and mark_as_failed methods in app/models/cloud_print.rb

the provided code in app/jobs/check_cloud_print_status_job.rb

Enqueuing a job by running CheckCloudPrintStatusJob.perform_later(CloudPrint.create) via the Rails console (bin/rails c)

Since the above sequence worked without any issue, you'll need to widen your search. Either provide a more complete and verifiable example that actually reproduces the problem (for instance, upload your entire Rails project to a GitHub repo once you can reproduce the issue consistently), or investigate other aspects of your environment and project configuration. Here are some possibilities:

There could be logic in your model class that raises an exception;

The worker-processing daemon could have been aborted or killed;

The job queue could have been cleared (e.g., via rake jobs:clear);

Another process could have modified and/or deleted the model object being processed;

finished? could have returned true after update_status was invoked, in which case nothing further is printed even though processing finished successfully.

N.B.: Delayed Job supports retrying failed jobs with a delay of 5 seconds + N ** 4, where N is the number of attempts, so there's no need to re-implement this logic yourself. Just raise an exception if cloud_print.finished? is false, and you shouldn't need any other custom delay code:

class CheckCloudPrintStatusJob < ApplicationJob
  queue_as :default

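  def perform(cloud_print)
    # Refresh the status first, as the original job does; otherwise
    # finished? would never change between retries.
    cloud_print.update_status
    raise 'Not ready' unless cloud_print.finished?
  end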
end
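
For comparison with the hand-rolled doubling schedule, here is a rough plain-Ruby sketch of the delays that the 5 + N ** 4 formula quoted above produces. The retry_delay helper is only an illustration of that formula, not Delayed Job's own code.

```ruby
# Delay (in seconds) before a job that has failed `attempts` times is
# retried, per the formula quoted above: 5 + attempts**4.
def retry_delay(attempts)
  5 + attempts**4
end

# Print the delay after each of the first eight failures.
(1..8).each do |n|
  puts "after failure #{n}: retry in #{retry_delay(n)} seconds"
end
```

This curve starts out slower than doubling and grows much faster later on, so it is worth checking that the spacing suits how long a cloud print typically takes. A side benefit of raising is that each failed attempt is recorded on the job row, so a job that stops retrying should leave more of a trace than one that simply never re-enqueues itself.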

#2

As the job code implies, the cloud_print argument is an instance of some Ruby class (apparently an ActiveRecord::Base subclass). In general, it is not a good idea to pass complex objects as arguments to a background job, because those arguments have to be serialized to a string, JSON or YAML. Delayed Job uses YAML-serialized objects, and sometimes it is not possible to restore a model instance. For example, if the job is enqueued from a before_create callback, the model object has not been saved yet and cannot be restored later. More information can be found here: https://github.com/collectiveidea/delayed_job/wiki/Common-problems#jobs-are-silently-removed-from-the-database
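
One common way to sidestep this is to enqueue only the record's id and look the record up inside the job. Below is a minimal plain-Ruby sketch of that pattern. STORE and check_status are hypothetical stand-ins for the cloud_prints table and the job's perform method; in a Rails app the lookup would be a normal Active Record query.

```ruby
# Hypothetical in-memory stand-in for the cloud_prints table.
STORE = { 1 => { finished: false }, 2 => { finished: true } }

# Takes a primitive id rather than a model instance, so the queued
# payload stays trivially serializable.
def check_status(cloud_print_id, store = STORE)
  record = store[cloud_print_id]

  # The record may have been deleted between enqueue and execution;
  # report that explicitly instead of letting the job vanish silently.
  return :record_missing if record.nil?

  record[:finished] ? :finished : :retry_later
end

puts check_status(1)
puts check_status(99)
```

The point of the explicit :record_missing branch is that a missing record becomes something you can log or alert on, rather than a job that quietly drops out of the queue.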
