We use the Quartz scheduler in our application to scan a particular folder for new files and, when a new file is found, kick off the associated workflow in the application to process it. For this we created a custom listener object that is associated with a job and a trigger which polls the file location every 5 minutes.
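For context, a minimal sketch of that setup might look like the following, assuming plain Quartz with a simple repeating 5-minute trigger; the class and identity names (FolderScanJob, "folderScanJob", /data/inbound) are placeholders, not our actual ones:

```java
import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

public class FolderPollingSetup {

    public static void main(String[] args) throws SchedulerException {
        // Job that scans the folder; the folder path is passed via the JobDataMap.
        JobDetail job = JobBuilder.newJob(FolderScanJob.class)
                .withIdentity("folderScanJob", "fileIngestion")
                .usingJobData("folderPath", "/data/inbound")
                .build();

        // Trigger that fires every 5 minutes, forever.
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("folderScanTrigger", "fileIngestion")
                .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                        .withIntervalInMinutes(5)
                        .repeatForever())
                .build();

        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }

    // Minimal job stub; the processed-file bookkeeping is sketched further below.
    public static class FolderScanJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            // scan the folder and kick off workflows for new files
        }
    }
}
```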
The requirement is to process only the new files that arrive in that folder while ignoring files that have already been processed. We also don't want the folder to grow to a huge number of files (that would slow down the folder scan), so at the end of the workflow we delete the source file.
To implement this, we decided to maintain the list of processed files in the job metadata. On each polling run we fetch the list of processed files from the job metadata, compare it against the current list of files in the folder, and if a file has not yet been processed we kick off the associated process flow.
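A sketch of that bookkeeping, fleshing out the FolderScanJob stub from the first sketch and assuming the processed-file names are kept as a comma-separated string in the JobDataMap (the key "processedFiles" and the startWorkflow(...) hook are illustrative, not our real code):

```java
import org.quartz.*;
import java.io.File;
import java.util.*;

@PersistJobDataAfterExecution   // persist JobDataMap changes back to the job store after each run
@DisallowConcurrentExecution
public class FolderScanJob implements Job {

    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        JobDataMap dataMap = context.getJobDetail().getJobDataMap();
        String folderPath = dataMap.getString("folderPath");

        // File names processed so far, kept in the job metadata.
        Set<String> processed = new HashSet<>();
        String stored = dataMap.getString("processedFiles");
        if (stored != null && !stored.isEmpty()) {
            processed.addAll(Arrays.asList(stored.split(",")));
        }

        File[] current = new File(folderPath).listFiles(File::isFile);
        if (current == null) {
            return; // folder missing or unreadable
        }
        for (File file : current) {
            if (!processed.contains(file.getName())) {
                startWorkflow(file);            // placeholder for the actual workflow kick-off
                processed.add(file.getName());
            }
        }

        // Write the list back; note that it only ever grows, which is the
        // root of the metadata-size problem described below.
        dataMap.put("processedFiles", String.join(",", processed));
    }

    private void startWorkflow(File file) {
        // hand the file over to the application workflow
    }
}
```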
The problem with this approach is that over the years (and depending on the number of files received per day, which can be in the range of 100K per day), the job metadata that persists the list of processed files grows very large. It started giving us data truncation errors (while persisting the job metadata in the Quartz tables) as well as slowness.
To address this, we decided to refresh the list of processed files in the job metadata with the current snapshot of the folder. Since we delete each processed file from the folder at the end of its workflow, the list of processed files stays bounded. But then we started seeing duplicate processing when a file arrives with the same name the next day.
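The snapshot-refresh variant only changes the loop and the write-back in the execute(...) sketch above; roughly (this is a fragment of that method, same placeholder names):

```java
// Replaces the loop and write-back in the execute(...) sketch above.
Set<String> snapshot = new HashSet<>();
for (File file : current) {
    snapshot.add(file.getName());
    if (!processed.contains(file.getName())) {
        startWorkflow(file);
    }
}
// Overwrite the metadata with the current folder snapshot. Because processed
// files are deleted at the end of each workflow, this stays small -- but a file
// that re-arrives with the same name the next day is no longer recognised as
// already processed, which is exactly the duplicate-processing problem.
dataMap.put("processedFiles", String.join(",", snapshot));
```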
What is the best approach for implementing this requirement while ensuring that we don't reprocess duplicate files that arrive with the same name? Should we consider persisting the processed-file list in an external database instead of the job metadata? I am looking for the recommended approach for implementing this solution. Thanks!
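To make the external-database idea concrete, one possible shape (only a sketch, not a settled design) is a table with a unique key on the file name plus a content hash, where a duplicate-key violation means "already processed". The table and column names (processed_file, file_name, file_hash) are made up for illustration:

```java
import java.sql.*;

public class ProcessedFileStore {

    private final Connection connection;

    public ProcessedFileStore(Connection connection) {
        this.connection = connection;
    }

    /**
     * Returns true if the file was not seen before and is now claimed for processing.
     * Assumes a unique constraint on (file_name, file_hash).
     */
    public boolean tryClaim(String fileName, String fileHash) throws SQLException {
        String sql = "INSERT INTO processed_file (file_name, file_hash, processed_at) VALUES (?, ?, ?)";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setString(1, fileName);
            ps.setString(2, fileHash);
            ps.setTimestamp(3, new Timestamp(System.currentTimeMillis()));
            ps.executeUpdate();
            return true;
        } catch (SQLIntegrityConstraintViolationException duplicate) {
            // Many JDBC drivers signal a duplicate key this way; some only throw a
            // generic SQLException with a duplicate-key SQLState, so check your driver.
            return false;
        }
    }
}
```

Keying on a content hash in addition to the name would let a genuinely different file that happens to reuse an old name still be processed.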
1 Solution
#1
We had a similar requirement recently with our scheduler. If you are on Linux, why not use a solution such as inotify? Other systems have their own ways to monitor file system events.
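Since the application in question is Java/Quartz-based, the closest standard equivalent to inotify is java.nio.file.WatchService, which the JDK backs with inotify on Linux. A minimal sketch of reacting to file-creation events (the folder path and the handleNewFile(...) hook are placeholders):

```java
import java.io.IOException;
import java.nio.file.*;

public class FolderWatcher {

    public static void main(String[] args) throws IOException, InterruptedException {
        Path folder = Paths.get("/data/inbound");   // hypothetical location
        WatchService watchService = FileSystems.getDefault().newWatchService();
        folder.register(watchService, StandardWatchEventKinds.ENTRY_CREATE);

        while (true) {
            WatchKey key = watchService.take();     // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                    Path created = folder.resolve((Path) event.context());
                    handleNewFile(created);
                }
            }
            if (!key.reset()) {
                break;                              // folder no longer accessible
            }
        }
    }

    private static void handleNewFile(Path file) {
        // placeholder: kick off the workflow for the newly created file
    }
}
```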
Our solution was to trigger the file processing on each creation event and then, every x days, remove the older files (similar to Walen's DB suggestion). That way the list does not inflate too much, and duplicate files can be handled as their own specific case. A cleanup sketch follows below.
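One simple way to do the "every x days remove the older files" part is a scheduled cleanup that deletes anything older than a configurable cutoff; a sketch, where the folder path and age threshold are up to you:

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.FileTime;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.stream.Stream;

public class OldFileCleanup {

    /** Deletes regular files in the folder whose last-modified time is older than maxAgeDays. */
    public static void cleanUp(Path folder, int maxAgeDays) throws IOException {
        Instant cutoff = Instant.now().minus(maxAgeDays, ChronoUnit.DAYS);
        try (Stream<Path> files = Files.list(folder)) {
            for (Path file : (Iterable<Path>) files::iterator) {
                FileTime modified = Files.getLastModifiedTime(file);
                if (Files.isRegularFile(file) && modified.toInstant().isBefore(cutoff)) {
                    Files.deleteIfExists(file);
                }
            }
        }
    }
}
```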
(Sorry I do not have the rights to comment yet.)