How to unload a table on RedShift to a single CSV file?

Time: 2021-08-16 23:07:24

I want to migrate a table from Amazon RedShift to MySQL, but using "unload" will generate multiple data files which are hard to import into MySQL directly.

Is there any approach to unload the table to a single CSV file so that I can import it to MySQL directly?

5 Answers

#1


29  

In order to send to a single file, use PARALLEL OFF:

unload ('select * from venue')
to 's3://mybucket/tickit/unload/venue_' credentials 
'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
parallel off;

I also recommend using GZIP to make the file even smaller for download:

unload ('select * from venue')
to 's3://mybucket/tickit/unload/venue_' credentials 
'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
parallel off
gzip;
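
One caveat: UNLOAD's default delimiter is the pipe character, so the files above are not strictly CSV. A hedged variant for a comma-delimited, quoted file, combining documented UNLOAD options (DELIMITER, ADDQUOTES, PARALLEL OFF, GZIP; the bucket and keys remain placeholders as above):

unload ('select * from venue')
to 's3://mybucket/tickit/unload/venue_' credentials
'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter as ','
addquotes
parallel off
gzip;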

#2


7  

This is an old question at this point, but I feel like all the existing answers are slightly misleading. If your question is, "Can I absolutely 100% guarantee that Redshift will ALWAYS unload to a SINGLE file in S3?", the answer is simply NO.

That being said, for most cases, you can generally limit your query in such a way that you'll end up with a single file. Per the documentation (https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html), the main factor in limiting the number of files you generate is the actual raw size in bytes of your export (NOT the number of rows). The limit on the size of an output file generated by the Redshift UNLOAD command is 6.2GB.

So if you want to try to guarantee that you get a single output file from UNLOAD, here's what you should try:

  • Specify PARALLEL OFF. PARALLEL is "ON" by default and will generally write to multiple files unless you have a tiny cluster (the number of output files with "PARALLEL ON" set is proportional to the number of slices in your cluster). PARALLEL OFF will write files serially to S3 instead of in parallel and will only spill over to multiple files if you exceed the size limit.
  • Limit the size of your output. The raw size of the data must be less than 6.2GB if you want a single file. So you need to give your query a more restrictive WHERE clause or use a LIMIT clause to keep the number of records down (see the size pre-check sketch after this list). Unfortunately, neither of these techniques is perfect, since rows can be of variable size. It's also not clear to me whether the GZIP option affects the output file size spillover limit (i.e., whether 6.2GB is the pre-GZIP or post-GZIP size limit).
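
If you want a rough sense beforehand of whether the table will fit under that cap, the system tables can help. The sketch below is a hint only and not part of the original answer: SVV_TABLE_INFO reports the compressed on-disk size in 1 MB blocks, which is not the same as the raw size UNLOAD will write.

-- Rough pre-check (hint only): compressed on-disk size and row count.
-- SVV_TABLE_INFO.size is in 1 MB blocks; the raw export size will differ.
select "table", tbl_rows, size as size_mb
from svv_table_info
where "table" = '<table>';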

For me, the UNLOAD command that ended up generating a single CSV file in most cases was:

UNLOAD
('SELECT <fields> FROM <table> WHERE <restrict_query>')
TO 's3://<bucket_name>/<filename_prefix>'
CREDENTIALS 'aws_access_key_id=<access_key>;aws_secret_access_key=<secret_key>'
DELIMITER AS ','
ADDQUOTES
NULL AS ''
PARALLEL OFF;
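
For the MySQL side of the migration, once that single file has been downloaded from S3, a matching import could look like the sketch below. This is a hedged addition, not part of the original answer: the local path and table name are placeholders, and it assumes the UNLOAD options above (comma delimiter, added double quotes, empty string for NULL).

-- MySQL: load the CSV produced by the UNLOAD above; the path is a placeholder.
-- Note: NULLs were unloaded as '' and will arrive as empty strings here.
LOAD DATA LOCAL INFILE '/path/to/downloaded_file.csv'
INTO TABLE <table>
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';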

The other nice side effect of PARALLEL OFF is that it will respect your ORDER BY clause if you have one and generate the files in an order that keeps all the records ordered, even across multiple output files.

Addendum: There seems to be some folkloric knowledge around using LIMIT 2147483647 to force the leader node to do all the processing and generate a single output file, but this doesn't seem to be actually documented anywhere in the Redshift documentation and as such, relying on it seems like a bad idea since it could change at any time.

#3


3  

It is a bit of a workaround, but you need to make your query a subquery and include a limit. It will then output to one file. E.g.

select * from (select * from bizdata LIMIT 2147483647);

So basically you are selecting everything from a limited set; that is the only way it works. 2147483647 is your maximum, as the LIMIT clause takes a 32-bit signed integer argument (2147483647 = 2^31 - 1).

So the following will unload to one file:

unload ('select * from (
  select bizid, data
  from biztable
  limit 2147483647
)')
to 's3://.......'
credentials 'aws_access_key_id=<<aws_access_key_id>>;aws_secret_access_key=<<aws_secret_access_key>>'
csv;

#4


1  

Nope. {You can use a manifest and tell Redshift to direct all output to a single file.} My previous answer was wrong; I had used manifests for loading, but not unloading.

There appear to be two possible ways to get a single file:

  1. Easier: Wrap a SELECT … LIMIT query around your actual output query, as per this SO answer, but this is limited to ~2 billion rows.
  2. Harder: Use the Unix cat utility to join the files together: cat File1.txt File2.txt > union.txt. This will require you to download the files from S3 first.

#5


1  

There is no way to guarantee that Redshift generates only a single output file.

Under a standard UNLOAD, the number of output files created equals the number of system slices; i.e., a system with 8 slices will create 8 files for a single unload command. (This is the fastest method to unload.)
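
To see in advance how many files a default parallel unload would produce, you can count the slices on your cluster; STV_SLICES is a standard system table with one row per slice.

-- One row per slice: a default (PARALLEL ON) unload writes roughly one
-- file per slice, more if any single file would exceed the size cap.
select count(*) as slice_count from stv_slices;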

If you add the PARALLEL OFF clause to the UNLOAD command, your output will be created as a single file, up to the point where the data extract size exceeds 6.2GB, after which Redshift will automatically break the output into a new chunk.

The same holds true if you produce compressed output files as well (there, of course, you will have a greater chance of producing a single output file, considering that each file can accommodate more records).
