I have several different locations in a fairly wide area, each with a Linux server storing company data. This data changes every day in different ways at each different location. I need a way to keep this data up-to-date and synced between all these locations.
For example:
In one location someone places a set of images on their local server. In another location, someone else places a group of documents on their local server. A third location adds a handful of both images and documents to their server. In two other locations, no changes are made to their local servers at all. By the next morning, I need the servers at all five locations to have all those images and documents.
My first instinct is to use rsync and a cron job to do the syncing overnight (1 a.m. to 6 a.m. or so), when none of the bandwidth at our locations is being used. It seems to me that it would work best to have one server be the "central" server, pulling in all the files from the other servers first and then pushing those changes back out to each remote server. Or is there another, better way to perform this function?
8 Solutions
#1
3
The way I do it (on Debian/Ubuntu boxes):
- Use dpkg --get-selections to get the list of your installed packages
- Use dpkg --set-selections to load that list on the target machine, then apt-get dselect-upgrade to actually install the selected packages
- Use a source control solution to manage the configuration files. I use git in a centralized fashion, but subversion could be used just as easily.
#2
2
An alternative, if rsync isn't the best solution for you, is Unison. Unison works on Linux as well as Windows, and it has features for handling changes on both sides, so you don't necessarily need to pick one server as the primary, as you've suggested.
Depending on how complex the task is, either may work.
#3
2
One thing you could (theoretically) do is create a script using Python or something similar and the inotify kernel feature (through the pyinotify package, for example).
You run the script, which registers to receive events on certain directory trees. It then watches those directories and updates all the other servers as things change on each one.
For example, if someone uploads spreadsheet.doc to the server, the script sees it instantly; if the document doesn't get modified or deleted within, say, 5 minutes, the script could copy it to the other servers (e.g. through rsync).
A system like this could theoretically implement a sort of limited 'filesystem replication' from one machine to another. Kind of a neat idea, but you'd probably have to code it yourself.
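The "wait until the file has been quiet for 5 minutes" step could be sketched roughly as follows. This is only the settle-timer logic, not the inotify wiring: SettleQueue, record_event, forget and ready_paths are hypothetical names, and the assumption is that whatever watcher you use (pyinotify, for example) calls them on each event.

```python
import time

QUIET_SECONDS = 5 * 60  # the "say, 5 minutes" settle period from the answer


class SettleQueue:
    """Tracks changed paths and reports those that have stopped changing."""

    def __init__(self, quiet_seconds=QUIET_SECONDS, clock=time.monotonic):
        self.quiet_seconds = quiet_seconds
        self.clock = clock          # injectable so the timer can be tested
        self.last_change = {}       # path -> time of the most recent event

    def record_event(self, path):
        # Call on every create/modify event; restarts the timer for this path.
        self.last_change[path] = self.clock()

    def forget(self, path):
        # Call on delete; a deleted file should not be copied anywhere.
        self.last_change.pop(path, None)

    def ready_paths(self):
        # Return (and stop tracking) paths that have been quiet long enough.
        now = self.clock()
        ready = sorted(p for p, t in self.last_change.items()
                       if now - t >= self.quiet_seconds)
        for p in ready:
            del self.last_change[p]
        return ready
```

A main loop would call ready_paths() every so often and hand each returned path to rsync.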
#4
2
AFAIK, rsync is your best choice. It supports partial file updates (only the changed parts of a file are transferred), among a variety of other features. Once set up, it is very reliable. You can even set up the cron job with timestamped log files to track what is updated in each run.
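As a rough sketch of the timestamped-log idea, a small helper could build the rsync command for each nightly run. The function name, paths and host names here are made up for illustration; the rsync options used (-a, -z, --partial, --log-file) are standard ones.

```python
from datetime import datetime


def rsync_command(source, dest, log_dir, now=None):
    """Build an rsync argv that writes to a per-run, timestamped log file."""
    now = now or datetime.now()
    log_file = f"{log_dir}/rsync-{now:%Y%m%d-%H%M%S}.log"
    return [
        "rsync",
        "-az",         # archive mode, compress data in transit
        "--partial",   # keep partially transferred files so a rerun can resume
        f"--log-file={log_file}",
        source,
        dest,
    ]
```

A script started by cron at 1 a.m. could pass the result to subprocess.run(cmd, check=True), leaving one log file per run.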
#5
1
I don't know how practical this is, but a source control system might work here. At some point during the day (perhaps each hour?), a cron job runs a commit, and overnight each machine runs a checkout. You could run into issues with a long commit not being finished when a checkout needs to run, and essentially the same thing could be done with rsync.
I guess what I'm thinking is that a central server would make your sync operation easier: conflicts can be handled once on the central server, and the result then pushed out to the other machines.
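To make the "pull everything to central, then push back out" order concrete, here is a minimal sketch that only plans the rsync invocations. plan_sync and the host/path names are hypothetical, and using rsync's -u (skip files that are newer on the receiver) is an assumption that "newest copy wins"; it is not real conflict handling.

```python
def plan_sync(central_dir, remotes):
    """Return rsync argv lists: pull from every remote first, then push to each."""
    # Phase 1: central collects new and changed files from every location.
    pulls = [["rsync", "-azu", f"{r}/", f"{central_dir}/"] for r in remotes]
    # Phase 2: central sends the merged tree back out to every location.
    pushes = [["rsync", "-azu", f"{central_dir}/", f"{r}/"] for r in remotes]
    return pulls + pushes
```

Running the commands in the returned order guarantees that no push starts before every pull has been planned ahead of it.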
#6
0
rsync would be your best choice. But you need to carefully consider how you are going to resolve conflicts between updates to the same data on different sites. If site-1 has updated 'customers.doc' and site-2 has a different update to the same file, how are you going to resolve it?
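One way to at least detect that situation before syncing is to compare per-site checksums against the state after the last sync. The sketch below assumes each manifest is a plain dict mapping a relative path to a content checksum, produced by some other step; find_conflicts is a hypothetical helper, not part of rsync.

```python
def find_conflicts(baseline, manifests):
    """Return paths that two or more sites changed in different ways.

    baseline:  {path: checksum} as of the last successful sync
    manifests: {site_name: {path: checksum}} as of now
    """
    paths = set()
    for m in manifests.values():
        paths.update(m)

    conflicts = []
    for path in sorted(paths):
        # Distinct new versions of this path, ignoring sites that left it alone.
        changed = {m[path] for m in manifests.values()
                   if path in m and m[path] != baseline.get(path)}
        if len(changed) > 1:
            conflicts.append(path)
    return conflicts
```

A file changed at only one site is not reported, since it can simply be copied everywhere; how to resolve the reported paths is still a policy decision.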
#7
0
I have to agree with Matt McMinn. Especially since it's company data, I'd use source control and, depending on the rate of change, run it more often.
I think the central clearinghouse is a good idea.
#8
0
It depends on the following:
- How many servers/computers need to be synced?
  - If there are too many servers, using rsync becomes a problem.
  - Either you use threads and sync to multiple servers at the same time, or you sync them one after the other. So you are looking at either high load on the source machine or, in the latter case, inconsistent data across the servers (in a cluster) at a given point in time.
- Size of the folders that need to be synced and how often they change
  - If the data is huge, then rsync will take time.
- Number of files
  - If the number of files is large, and especially if they are small files, rsync will again take a lot of time.
So whether to use rsync, NFS, or version control all depends on the scenario.
- If there are few servers and just a small amount of data, then it makes sense to run rsync every hour. You can also package the content into an RPM if the data changes only occasionally.
With the information provided, IMO version control will suit you best.
Rsync/scp might give problems if two people upload different files with the same name. NFS over multiple locations needs to be architected with perfection.
Why not have a single repository (or multiple repositories) that everyone just commits to? All you need to do is keep the repositories in sync. If the data is huge and updates are frequent, then your repository server will need a good amount of RAM and a good I/O subsystem.
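The trade-off mentioned earlier in this answer, syncing to several servers at once versus one after the other, can be sketched with a thread pool whose size is the knob. push_all is a hypothetical helper, and push stands in for whatever actually performs one transfer (an rsync call, for instance).

```python
from concurrent.futures import ThreadPoolExecutor


def push_all(servers, push, max_parallel=1):
    """Call push(server) for each server and return {server: result}.

    max_parallel=1 syncs one server after the other (low load on the source,
    servers briefly out of step); higher values sync several at once.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = list(pool.map(push, servers))
    return dict(zip(servers, results))
```

With five locations, a max_parallel of 2 or 3 would be a middle ground between the two extremes.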