Processing text files FTP'd into a set of directories on a hosted server

Posted: 2021-10-18 08:01:30

The situation is as follows:

A series of remote workstations collect field data and FTP it to a server. The data is sent as a CSV file, stored in a unique directory for each workstation on the FTP server.

Each workstation sends a new update every 10 minutes, overwriting the previous data. We would like to concatenate or archive this data automatically somehow. The workstations' own processing is limited and cannot be extended, as they are embedded systems.

One suggestion offered was to run a cron job on the FTP server; however, since it's shared hosting, the terms of service only allow cron jobs at 30-minute intervals. Given the number of workstations uploading and the 10-minute interval between uploads, the 30-minute minimum between cron runs looks like it might be a problem.

Is there any other approach that might be suggested? The available server-side scripting languages are Perl, PHP and Python.

Upgrading to a dedicated server might be necessary, but I'd still like to get input on how to solve this problem in the most elegant manner.

4 Answers

#1

Most modern Linux distributions support inotify, which lets your process know when the contents of a directory have changed, so you don't even need to poll.

Edit: With regard to the comment below from Mark Baker:

"Be careful though, as you'll be notified as soon as the file is created, not when it's closed. So you'll need some way to make sure you don't pick up partial files."

That will happen with the inotify watch you set at the directory level. The way to make sure you don't then pick up a partial file is to set a further inotify watch on the new file and look for the IN_CLOSE_WRITE event, so that you know the file has been written completely.

Once your process has seen this, you can delete the inotify watch on this new file, and process it at your leisure.
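
As a concrete illustration, here is a minimal sketch using pyinotify (mentioned in answer #3 below). A single watch on the upload directory already delivers IN_CLOSE_WRITE events for the files inside it, so one directory-level watch is usually enough; the upload root path here is a made-up placeholder.

import pyinotify

WATCH_ROOT = '/home/ftp/uploads'  # hypothetical upload root, one subdirectory per workstation

class UploadHandler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # Fires only after the uploaded file has been written and closed,
        # so partial files are never picked up here.
        print('complete upload:', event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch(WATCH_ROOT, pyinotify.IN_CLOSE_WRITE, rec=True, auto_add=True)
pyinotify.Notifier(wm, UploadHandler()).loop()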

#2

You might consider a persistent daemon that keeps polling the target directories:

#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Exit immediately if another instance already holds the lock.
open(my $lock, '>', '/tmp/poller.lock') or die "lockfile: $!";
flock($lock, LOCK_EX | LOCK_NB) or exit;

while (1) {
    process_new_files() if new_files();   # your detection/processing hooks
    sleep 60;
}

Then your cron job can just try to start the daemon every 30 minutes. If the daemon can't grab the lockfile, it just dies, so there's no worry about multiple daemons running.
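
A hypothetical crontab entry for that (the script path is made up; under the hosting restriction this is the tightest allowed schedule):

*/30 * * * * /usr/bin/perl /home/user/bin/poller.pl >/dev/null 2>&1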

Another approach to consider would be to submit the files via HTTP POST and then process them via a CGI. This way, you guarantee that they've been dealt with properly at the time of submission.
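
A minimal Python CGI sketch of that approach, assuming the workstations could be pointed at a URL instead of an FTP directory; the "station" and "data" field names and the archive path are illustrative, not an existing API:

#!/usr/bin/env python3
import cgi
import os

form = cgi.FieldStorage()
station = os.path.basename(form.getfirst('station', 'unknown'))  # sanitize the id
data = form.getfirst('data', '')

# Append the submitted CSV to a running per-workstation archive.
with open('/home/user/archive/%s.csv' % station, 'a') as f:
    f.write(data)

print('Content-Type: text/plain\n')
print('OK')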

#3

The 30-minute limitation is pretty silly, really. Starting processes on Linux is not an expensive operation, so if all you're doing is checking for new files there's no good reason not to do it more often than that. We have cron jobs that run every minute and they don't have any noticeable effect on performance. However, I realise it's not your rule, and if you're going to stick with that hosting provider you don't have a choice.

You'll need a long-running daemon of some kind. The easy way is to just poll regularly, and that's probably what I'd do. Inotify, which notifies you as soon as a file is created, is a better option.

You can use inotify from Perl with Linux::Inotify, or from Python with pyinotify.

Be careful though, as you'll be notified as soon as the file is created, not when it's closed. So you'll need some way to make sure you don't pick up partial files.

With polling it's less likely you'll see partial files, but it will happen eventually, and when it does it will be a nasty, hard-to-reproduce bug, so it's better to deal with the problem now.
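
One common way to guard against that when polling (a sketch, assuming plain files in a single directory) is to treat a file as complete only once its size has stopped changing between two polls:

import os

seen = {}  # path -> size observed at the previous poll

def stable_files(directory):
    """Yield files whose size is unchanged since the last poll."""
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        size = os.path.getsize(path)
        if seen.get(path) == size:
            yield path  # unchanged across two polls: likely complete
        seen[path] = size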

#4

If you're looking to stay with your existing FTP server setup, then I'd advise using something like inotify or a daemonized process to watch the upload directories. If you're OK with moving to a different FTP server, you might take a look at pyftpdlib, which is a Python FTP server library.

I've been part of the pyftpdlib dev team for a while, and one of the more common requests was for a way to "process" files once they've finished uploading. Because of that, we created an on_file_received() callback method that's triggered on completion of an upload (see issue #79 on our issue tracker for details).

If you're comfortable in Python, then it might work out well for you to run pyftpdlib as your FTP server and run your processing code from the callback method. Note that pyftpdlib is asynchronous, not multi-threaded, so your callback method can't block. If you need to run long-running tasks, I would recommend using a separate Python process or thread for the actual processing work.
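
A minimal sketch of that setup; on_file_received() is the real pyftpdlib callback, while the credentials, port and home directory are made up for illustration:

from pyftpdlib.authorizers import DummyAuthorizer
from pyftpdlib.handlers import FTPHandler
from pyftpdlib.servers import FTPServer

class UploadHandler(FTPHandler):
    def on_file_received(self, file):
        # Called once an upload has completed; keep this non-blocking and
        # hand the path off to a worker process or queue for the real work.
        print('upload finished:', file)

authorizer = DummyAuthorizer()
authorizer.add_user('station', 'secret', '/home/ftp/uploads', perm='elradfmw')
UploadHandler.authorizer = authorizer
FTPServer(('0.0.0.0', 2121), UploadHandler).serve_forever()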
