I'm pretty happy with s3cmd, but there is one issue: How to copy all files from one S3 bucket to another? Is it even possible?
EDIT: I've found a way to copy files between buckets using Python with boto:
import time
from boto.s3.connection import S3Connection

def copyBucket(srcBucketName, dstBucketName, maxKeys=100):
    # awsAccessKey and awsSecretKey are assumed to be defined elsewhere
    conn = S3Connection(awsAccessKey, awsSecretKey)
    srcBucket = conn.get_bucket(srcBucketName)
    dstBucket = conn.get_bucket(dstBucketName)
    resultMarker = ''
    while True:
        keys = srcBucket.get_all_keys(max_keys=maxKeys, marker=resultMarker)
        for k in keys:
            print 'Copying ' + k.key + ' from ' + srcBucketName + ' to ' + dstBucketName
            t0 = time.clock()
            dstBucket.copy_key(k.key, srcBucketName, k.key)
            print time.clock() - t0, ' seconds'
        if len(keys) < maxKeys:
            print 'Done'
            break
        resultMarker = keys[maxKeys - 1].key
Syncing is almost as straightforward as copying. There are fields for ETag, size, and last-modified available for keys.
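For example, here is a minimal sketch (not part of the original script) of how the same loop could skip keys that already exist on the destination with the same size; comparing ETags instead works the same way:

def syncBucket(srcBucketName, dstBucketName, maxKeys=100):
    # Hypothetical variant of copyBucket above: only copy keys that are
    # missing from the destination or whose size differs.
    conn = S3Connection(awsAccessKey, awsSecretKey)  # same assumed credentials as above
    srcBucket = conn.get_bucket(srcBucketName)
    dstBucket = conn.get_bucket(dstBucketName)
    resultMarker = ''
    while True:
        keys = srcBucket.get_all_keys(max_keys=maxKeys, marker=resultMarker)
        for k in keys:
            existing = dstBucket.get_key(k.key)
            if existing is None or existing.size != k.size:
                dstBucket.copy_key(k.key, srcBucketName, k.key)
        if len(keys) < maxKeys:
            break
        resultMarker = keys[maxKeys - 1].key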
Maybe this helps others as well.
11 Answers
#1
85
s3cmd sync s3://from/this/bucket/ s3://to/this/bucket/
For available options, run: s3cmd --help
#2
40
AWS CLI seems to do the job perfectly, and has the bonus of being an officially supported tool.
aws s3 sync s3://mybucket s3://backup-mybucket
http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
#3
29
The answer with the most upvotes as I write this is this one:
s3cmd sync s3://from/this/bucket s3://to/this/bucket
It's a useful answer. But sometimes sync is not what you need (it deletes files, etc.). It took me a long time to figure out this non-scripting alternative to simply copy multiple files between buckets. (OK, in the case shown below it's not between buckets. It's between not-really-folders, but it works between buckets equally well.)
# Slightly verbose, slightly unintuitive, very useful:
s3cmd cp --recursive --exclude=* --include=file_prefix* s3://semarchy-inc/source1/ s3://semarchy-inc/target/
Explanation of the above command:
- --recursive
  In my mind, my requirement is not recursive. I simply want multiple files. But recursive in this context just tells s3cmd cp to handle multiple files. Great.
- --exclude
  It's an odd way to think of the problem. Begin by recursively selecting all files. Next, exclude all files. Wait, what?
- --include
  Now we're talking. Indicate the file prefix (or suffix, or whatever pattern) you want to include.
- s3://sourceBucket/ s3://targetBucket/
  This part is intuitive enough. Though technically it seems to violate the documented example from s3cmd help, which indicates that a source object must be specified: s3cmd cp s3://BUCKET1/OBJECT1 s3://BUCKET2[/OBJECT2]
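For anyone who prefers to do the same filtering from Python, here is a rough boto sketch of the same select-everything-then-include idea using fnmatch patterns (the function name, bucket names, and pattern are placeholders, not part of the s3cmd answer):

import fnmatch
from boto.s3.connection import S3Connection

def copy_matching(conn, src_bucket_name, dst_bucket_name, include_pattern):
    # Walk every key in the source bucket, keep only those matching
    # the include pattern, and copy them to the destination bucket.
    src_bucket = conn.get_bucket(src_bucket_name)
    dst_bucket = conn.get_bucket(dst_bucket_name)
    for key in src_bucket.list():
        if fnmatch.fnmatch(key.key, include_pattern):
            dst_bucket.copy_key(key.key, src_bucket_name, key.key)

# e.g. copy_matching(S3Connection(awsAccessKey, awsSecretKey), 'sourceBucket', 'targetBucket', 'file_prefix*')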
#4
8
I needed to copy a very large bucket, so I adapted the code in the question into a multi-threaded version and put it up on GitHub.
https://github.com/paultuckey/s3-bucket-to-bucket-copy-py
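The repository has the details. Purely as an illustration of the idea (this is not the code from that repo), a multi-threaded copy with boto might be sketched like this, with the credentials and bucket names as placeholders:

import threading
import Queue  # Python 2; use the queue module on Python 3
from boto.s3.connection import S3Connection

def threaded_copy(awsAccessKey, awsSecretKey, src_name, dst_name, workers=10):
    # Hypothetical sketch: the main thread lists keys and feeds a queue;
    # each worker opens its own connection and issues copy_key calls.
    q = Queue.Queue(maxsize=1000)

    def worker():
        conn = S3Connection(awsAccessKey, awsSecretKey)
        dst = conn.get_bucket(dst_name)
        while True:
            key_name = q.get()
            if key_name is None:  # sentinel: no more work
                break
            dst.copy_key(key_name, src_name, key_name)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()

    conn = S3Connection(awsAccessKey, awsSecretKey)
    for key in conn.get_bucket(src_name).list():
        q.put(key.key)
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()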
#5
7
You can also use the web interface to do so:
- Go to the source bucket in the web interface.
- Mark the files you want to copy (use shift and mouse clicks to mark several).
- Press Actions->Copy.
- Go to the destination bucket.
- Press Actions->Paste.
That's it.
#6
3
It's actually possible. This worked for me:
import boto.s3.connection
import boto.s3.bucket

AWS_ACCESS_KEY = 'Your access key'
AWS_SECRET_KEY = 'Your secret key'
# SRC_BUCKET_NAME and DEST_BUCKET_NAME are the source and destination bucket names
conn = boto.s3.connection.S3Connection(AWS_ACCESS_KEY, AWS_SECRET_KEY)
bucket = boto.s3.bucket.Bucket(conn, SRC_BUCKET_NAME)
for item in bucket:
    # Note: you can also put a path inside DEST_BUCKET_NAME
    # if you want the item to be stored inside a folder, like this:
    # bucket.copy(DEST_BUCKET_NAME, '%s/%s' % (folder_name, item.key))
    bucket.copy(DEST_BUCKET_NAME, item.key)
#7
2
Thanks - I use a slightly modified version that only copies files which don't exist on the destination or differ in size, and that also checks each destination key to see whether it still exists in the source (deleting it if not). I found this a bit quicker for preparing the test environment:
def botoSyncPath(path):
    """
    Sync keys under the given path from the source bucket to the destination bucket.
    Assumes S3Connection is imported from boto.s3.connection and the AWS_* constants
    are defined elsewhere.
    """
    try:
        conn = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
        srcBucket = conn.get_bucket(AWS_SRC_BUCKET)
        destBucket = conn.get_bucket(AWS_DEST_BUCKET)
        # Copy keys that are missing on the destination or differ in size.
        for key in srcBucket.list(path):
            destKey = destBucket.get_key(key.name)
            if not destKey or destKey.size != key.size:
                key.copy(AWS_DEST_BUCKET, key.name)
        # Delete destination keys that no longer exist in the source.
        for key in destBucket.list(path):
            srcKey = srcBucket.get_key(key.name)
            if not srcKey:
                key.delete()
    except:
        return False
    return True
#8
2
I wrote a script that backs up an S3 bucket: https://github.com/roseperrone/aws-backup-rake-task
#!/usr/bin/env python
from boto.s3.connection import S3Connection
import re
import datetime
import sys
import time

def main():
    s3_ID = sys.argv[1]
    s3_key = sys.argv[2]
    src_bucket_name = sys.argv[3]
    num_backup_buckets = sys.argv[4]
    connection = S3Connection(s3_ID, s3_key)
    delete_oldest_backup_buckets(connection, num_backup_buckets)
    backup(connection, src_bucket_name)

def delete_oldest_backup_buckets(connection, num_backup_buckets):
    """Deletes the oldest backup buckets so that only the newest NUM_BACKUP_BUCKETS - 1 buckets remain."""
    buckets = connection.get_all_buckets()  # returns a list of bucket objects
    backup_bucket_names = []
    for bucket in buckets:
        if re.search('backup-' + r'\d{4}-\d{2}-\d{2}', bucket.name):
            backup_bucket_names.append(bucket.name)
    backup_bucket_names.sort(key=lambda x: datetime.datetime.strptime(x[len('backup-'):17], '%Y-%m-%d').date())
    # The buckets are sorted earliest to latest, so we keep the last NUM_BACKUP_BUCKETS - 1
    delete = len(backup_bucket_names) - (int(num_backup_buckets) - 1)
    if delete <= 0:
        return
    for i in range(0, delete):
        print 'Deleting the backup bucket, ' + backup_bucket_names[i]
        connection.delete_bucket(backup_bucket_names[i])

def backup(connection, src_bucket_name):
    now = datetime.datetime.now()
    # the month and day must be zero-filled
    new_backup_bucket_name = 'backup-' + ('%04d' % now.year) + '-' + ('%02d' % now.month) + '-' + ('%02d' % now.day)
    print "Creating new bucket " + new_backup_bucket_name
    new_backup_bucket = connection.create_bucket(new_backup_bucket_name)
    copy_bucket(src_bucket_name, new_backup_bucket_name, connection)

def copy_bucket(src_bucket_name, dst_bucket_name, connection, maximum_keys=100):
    src_bucket = connection.get_bucket(src_bucket_name)
    dst_bucket = connection.get_bucket(dst_bucket_name)
    result_marker = ''
    while True:
        keys = src_bucket.get_all_keys(max_keys=maximum_keys, marker=result_marker)
        for k in keys:
            print 'Copying ' + k.key + ' from ' + src_bucket_name + ' to ' + dst_bucket_name
            t0 = time.clock()
            dst_bucket.copy_key(k.key, src_bucket_name, k.key)
            print time.clock() - t0, ' seconds'
        if len(keys) < maximum_keys:
            print 'Done backing up.'
            break
        result_marker = keys[maximum_keys - 1].key

if __name__ == '__main__':
    main()
I use this in a rake task (for a Rails app):
desc "Back up a file onto S3"
task :backup do
S3ID = "*****"
S3KEY = "*****"
SRCBUCKET = "primary-mzgd"
NUM_BACKUP_BUCKETS = 2
Dir.chdir("#{Rails.root}/lib/tasks")
system "./do_backup.py #{S3ID} #{S3KEY} #{SRCBUCKET} #{NUM_BACKUP_BUCKETS}"
end
#9
2
mdahlman's code didn't work for me, but this command copies all the files in bucket1 to a new folder (the command also creates this new folder) in bucket2:

s3cmd cp --recursive --include=file_prefix* s3://bucket1/ s3://bucket2/new_folder_name/
#10
1
s3cmd cp won't accept bare prefixes or wildcards, but you can script the behavior: use 's3cmd ls sourceBucket' and awk to extract each object name, then use 's3cmd cp sourceBucket/name destBucket' to copy every object in the list.
I use these batch files in a DOS box on Windows:
s3list.bat
s3cmd ls %1 | gawk "/s3/{ print \"\\"\"\"substr($0,index($0,\"s3://\"))\"\\"\"\"; }"
s3copy.bat
@for /F "delims=" %%s in ('s3list %1') do @s3cmd cp %%s %2
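The same list-then-copy loop is also easy to express in Python with boto, if you'd rather avoid the batch files. A rough sketch of that approach, with the credentials, bucket names, and prefix as placeholders:

from boto.s3.connection import S3Connection

def copy_by_prefix(awsAccessKey, awsSecretKey, src_bucket_name, dst_bucket_name, prefix):
    # List keys under a prefix in the source bucket and copy each one
    # to the destination bucket, keeping the same key name.
    conn = S3Connection(awsAccessKey, awsSecretKey)
    src_bucket = conn.get_bucket(src_bucket_name)
    dst_bucket = conn.get_bucket(dst_bucket_name)
    for key in src_bucket.list(prefix=prefix):
        dst_bucket.copy_key(key.key, src_bucket_name, key.key)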
#11
1
You can also use s3funnel, which uses multi-threading:
https://github.com/neelakanta/s3funnel
Example (access key and secret key parameters not shown):
s3funnel source-bucket-name list | s3funnel dest-bucket-name copy --source-bucket source-bucket-name --threads=10