I have an application that processes files in a directory and moves them to another directory along with the processed output. Nothing special about that. An interesting requirement was introduced:
Implement fault tolerance and processing throughput by allowing multiple remote instances to work on the same file store.
An additional consideration is that we cannot assume a particular file system, as we support both Windows and NFS.
Of course the problem is: how do I make sure that the different instances do not try to process the same work, potentially corrupting it or reducing throughput? File locking can be problematic, especially across network shares. We could use a more sophisticated method, such as a simple database or a messaging framework (à la JMS or similar), but the entire cluster needs to be fault tolerant. We can't have a single database or messaging provider because of the single point of failure that it introduces.
We've implemented a solution that uses multicast messages to self-discover processing instances and elect a supervisor who assigns work. There's a timeout in case the supervisor goes down, after which another election takes place. Our networking library, however, isn't very mature and our implementation of messages is clunky.
My instincts, however, tell me that there is a simpler way.
Thoughts?
1 solution
#1
I think you can safely assume that rename operations are atomic on all network file systems that you care about. So arrange each unit of work as a single file (or keyed to a single file). Have each server first list the directory containing new work, pick a piece of work, and then try to rename that file to carry its own server name (say, machine name or IP address). Among instances that concurrently attempt the same rename, exactly one will succeed, and that instance should then process the work. For the others the rename will fail, so they should pick a different file from the listing they got.
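The claim-by-rename step could be sketched roughly as follows. This is only an illustration under assumptions: the function name `claim_next`, the separate `claimed_dir`, and the `<worker>.<name>` naming scheme are all made up here, and `claimed_dir` is assumed to live on the same file system as `incoming_dir` so that the rename stays a single operation.

```python
import os
import socket


def claim_next(incoming_dir, claimed_dir, worker_id=None):
    """Try to claim one unit of work; return the claimed path, or None."""
    # Hypothetical naming scheme: prefix the file with this worker's name.
    worker_id = worker_id or socket.gethostname()
    for name in sorted(os.listdir(incoming_dir)):
        src = os.path.join(incoming_dir, name)
        dst = os.path.join(claimed_dir, f"{worker_id}.{name}")
        try:
            # Only one instance can move src away; the rest find it missing.
            os.rename(src, dst)
        except FileNotFoundError:
            continue  # another instance claimed it first, try the next entry
        return dst
    return None  # nothing left in the listing we got
```

Because every worker renames to a destination containing its own name, two workers never compete for the same target; they only compete for the source, which disappears for everyone but the winner.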
For creation of new work, assume that directory creation (mkdir) is atomic but file creation is not (with file creation, a second writer might overwrite the existing file). So if there are also multiple producers of work, create a new directory for each piece of work.
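On the producer side, the mkdir idea might look something like this. Again a sketch only: `publish_work` and the `payload` file name are hypothetical, and it does not address a consumer seeing the directory before the payload has been fully written.

```python
import os


def publish_work(incoming_dir, job_name, payload):
    """Create a directory for one piece of work; return False if it already exists."""
    job_dir = os.path.join(incoming_dir, job_name)
    try:
        # mkdir fails if the directory exists, so only one producer wins.
        os.mkdir(job_dir)
    except FileExistsError:
        return False  # another producer already created this piece of work
    # Hypothetical layout: the work data lives in a file named "payload".
    with open(os.path.join(job_dir, "payload"), "wb") as f:
        f.write(payload)
    return True
```

A producer that gets `False` back knows the name is taken and can either skip the job or retry under a different name.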