Fastest/best way to copy data between S3 and EC2?

Time: 2022-11-26 23:03:00

I have a fairly large amount of data (~30G, split into ~100 files) I'd like to transfer between S3 and EC2: when I fire up the EC2 instances I'd like to copy the data from S3 to EC2 local disks as quickly as I can, and when I'm done processing I'd like to copy the results back to S3.

I'm looking for a tool that'll do a fast / parallel copy of the data back and forth. I have several scripts hacked up, including one that does a decent job, so I'm not looking for pointers to basic libraries; I'm looking for something fast and reliable.

5 Answers

#1


I think you might be better off using an Elastic Block Store to store your files instead of S3. An EBS is akin to a 'drive' on S3 that can be mounted into your EC2 instance without having to copy the data each time, thereby allowing you to persist your data between EC2 instances without having to write to or read from S3 each time.

http://aws.amazon.com/ebs/
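
If you do go the EBS route, here is a minimal sketch of creating, attaching, and mounting a volume from the command line (the size, volume and instance IDs, device names, and availability zone below are placeholders, not values from the question):

# Create a volume in the same availability zone as the instance
aws ec2 create-volume --size 50 --availability-zone us-east-1a

# Attach it (substitute your real volume and instance IDs)
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/sdf

# On the instance: format once, then mount
# (the device may show up as /dev/xvdf on some Linux AMIs)
sudo mkfs -t ext4 /dev/xvdf
sudo mkdir -p /data
sudo mount /dev/xvdf /data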

#2


Unfortunately, Adam's suggestion won't work, as his understanding of EBS is wrong (although I wish he were right, and I have often thought it should work that way)... EBS has nothing to do with S3; it only gives you an "external drive" for EC2 instances that is separate from, but attachable to, the instances. You still have to copy between S3 and EC2, even though there are no data transfer costs between the two.

You didn't mention your instance's operating system, so I can't give tailored information. A popular command line tool I use is http://s3tools.org/s3cmd ... it is based on Python, so according to its website it should work on Windows as well as Linux, although I use it all the time on Linux. You could easily whip up a quick script that uses its built-in "sync" command, which works similarly to rsync, and trigger it every time you're done processing your data, as sketched below. You could also use the recursive put and get commands to transfer data only when needed.
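
A minimal sketch of that workflow (the bucket name and local paths are placeholders):

# On startup: pull the input data down
s3cmd sync s3://mybucket/input/ /mnt/data/input/

# ... process ...

# When done: push the results back
s3cmd sync /mnt/data/output/ s3://mybucket/output/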

There are also graphical tools like CloudBerry Pro with command line options for Windows, so you can set up scheduled commands. http://s3tools.org/s3cmd is probably the easiest.

#3


Install the s3cmd package with

yum install s3cmd

or

sudo apt-get install s3cmd

depending on your OS

Then copy data like this:

s3cmd get s3://tecadmin/file.txt

You can also list files with s3cmd ls.
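
Since the question involves ~100 files, a recursive fetch is probably more useful than pulling one file at a time; a sketch using the same example bucket as above (the local path is a placeholder):

# List the bucket contents
s3cmd ls s3://tecadmin/

# Fetch everything under the bucket recursively
s3cmd get --recursive s3://tecadmin/ /local/dir/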

For more details, see this.

#4


By now, there is a sync command in the AWS command line tools that should do the trick: http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

On startup: aws s3 sync s3://mybucket /mylocalfolder

Before shutdown: aws s3 sync /mylocalfolder s3://mybucket

Of course, the details are always fun to work out, e.g. how parallel it is (and whether you can make it more parallel, and whether that is any faster given the virtualized nature of the whole setup).
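
As an aside, the AWS CLI already parallelizes S3 transfers with concurrent multipart requests; a sketch of the knobs you can tune via its s3 configuration settings (the values here are arbitrary examples, not recommendations):

# Allow more simultaneous requests (the default is 10 at the time of writing)
aws configure set default.s3.max_concurrent_requests 20

# Use larger chunks for multipart transfers
aws configure set default.s3.multipart_chunksize 16MB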

Btw hope you're still working on this... or somebody is. ;)

#5


For me, the best way is:

wget http://s3.amazonaws.com/my_bucket/my_folder/my_file.ext

from PuTTY.
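
Note that a plain wget like this only works if the object is publicly readable (or you generate a presigned URL). For the ~100 files in the question, a sketch of parallelizing the downloads with xargs (urls.txt is a hypothetical file with one object URL per line):

# Run up to 8 downloads in parallel
xargs -n 1 -P 8 wget -q < urls.txt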
