How do I copy a CSV data file to Amazon Redshift?

Time: 2022-11-20 23:08:55

I'm trying to migrate some MySQL tables to Amazon Redshift, but I've run into some problems.

The steps are simple:

1. Dump the MySQL table to a CSV file (a sketch appears below).
2. Upload the CSV file to S3.
3. Copy the data file to Redshift.
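
For step 1, a rough sketch using MySQL's SELECT ... INTO OUTFILE looks like this (the table name and output path are only placeholders):

    -- Dump the table to a delimited file on the MySQL server host (hypothetical names).
    SELECT *
    INTO OUTFILE '/tmp/TABLE_A.csv'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    FROM TABLE_A;

Note that INTO OUTFILE writes the file on the database server, not on the client.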

Error occurs in step 3:

The SQL command is:

copy TABLE_A from 's3://ciphor/TABLE_A.csv' CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' delimiter ',' csv;

The error info:

An error occurred when executing the SQL command: copy TABLE_A from 's3://ciphor/TABLE_A.csv' CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx ERROR: COPY CSV is not supported [SQL State=0A000] Execution time: 0.53s 1 statement(s) failed.

I don't know if there are any limitations on the format of the CSV file, say the delimiters and quotes; I can't find this in the documentation.

Can anyone help?

7 solutions

#1


12  

The problem was finally resolved by using:

copy TABLE_A from 's3://ciphor/TABLE_A.csv' CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' delimiter ',' removequotes;

More information can be found here: http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html

#2


9  

Amazon Redshift now supports a CSV option for the COPY command. It's better to use this option to import CSV-formatted data correctly. The format is shown below.

COPY [table-name] from 's3://[bucket-name]/[file-path or prefix]' CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' CSV;

The default delimiter is a comma ( , ) and the default quote character is a double quote ( " ). You can also import TSV-formatted data with the CSV and DELIMITER options, like this:

COPY [table-name] from 's3://[bucket-name]/[file-path or prefix]' CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' CSV DELIMITER '\t';

There are some disadvantages to the old way (DELIMITER and REMOVEQUOTES): REMOVEQUOTES does not support a newline or a delimiter character within an enclosed field. If the data can include these kinds of characters, you should use the CSV option.
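
For example, a row like the following (hypothetical data) is handled correctly by the CSV option, but DELIMITER ',' REMOVEQUOTES would split the quoted name field on its embedded comma:

    1,"Doe, John","123 Main St"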

See the following link for the details.

http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html

#3


0  

If you want to save yourself some code, or you have a very basic use case, you can use Amazon Data Pipeline. It starts a spot instance and performs the transformation within the Amazon network, and it's a really intuitive tool (but very simple, so you can't do complex things with it).

#4


0  

You can try this:

copy TABLE_A from 's3://ciphor/TABLE_A.csv' CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' csv;

CSV itself means comma-separated values, so there's no need to provide a delimiter here. Please refer to this link:

http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html#copy-format

#5


0  

It looks like you are trying to load a local file into a Redshift table. The CSV file has to be on S3 for the COPY command to work.

If you can extract data from the table to a CSV file, you have one more scripting option. You can use a Python/boto/psycopg2 combo to script your CSV load to Amazon Redshift.

In my MySQL_To_Redshift_Loader I do the following:

  1. Extract data from MySQL into a temp file.

    from subprocess import Popen, PIPE

    # Build the mysql client command line from the script options.
    loadConf = [db_client_dbshell, '-u', opt.mysql_user, '-p%s' % opt.mysql_pwd,
                '-D', opt.mysql_db_name, '-h', opt.mysql_db_server]
    ...
    # SELECT ... INTO OUTFILE statement that dumps the query result to a delimited file.
    q = """
    %s %s
    INTO OUTFILE '%s'
    FIELDS TERMINATED BY '%s'
    ENCLOSED BY '%s'
    LINES TERMINATED BY '\r\n';
    """ % (in_qry, limit, out_file, opt.mysql_col_delim, opt.mysql_quote)
    # Pipe the statement into the mysql client.
    p1 = Popen(['echo', q], stdout=PIPE, stderr=PIPE, env=env)
    p2 = Popen(loadConf, stdin=p1.stdout, stdout=PIPE, stderr=PIPE)
    ...
    
  2. Compress and load the data to S3 using the boto Python module and a multipart upload.

    import boto
    from boto.s3.key import Key

    # Connect to S3 and upload the file to the target bucket/key.
    conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    bucket = conn.get_bucket(bucket_name)
    k = Key(bucket)
    k.key = s3_key_name
    k.set_contents_from_file(file_handle, cb=progress, num_cb=20,
                             reduced_redundancy=use_rr)
    
  3. Use a psycopg2 COPY command to append the data to the Redshift table.

    sql="""
    copy %s from '%s' 
    CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s' 
    DELIMITER '%s' 
    FORMAT CSV %s 
    %s 
    %s 
    %s;""" % (opt.to_table, fn, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,opt.delim,quote,gzip, timeformat, ignoreheader)
    

#6


0  

Since the resolution has already been provided, I'll not repeat the obvious.

However, if you receive further errors that you're not able to figure out, simply execute the following on your workbench while connected to any of the Redshift accounts:

select * from stl_load_errors [where ...];

stl_load_errors contains all the Amazon Redshift load errors in historical fashion; a normal user can view details corresponding to his or her own account, but a superuser has access to all of them.
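
As a hedged example of narrowing this down (the column selection is just a suggestion; see the documentation for the full column list), the most recent failures can be listed with:

    select starttime, filename, line_number, colname, raw_field_value, err_reason
    from stl_load_errors
    order by starttime desc
    limit 10;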

The details are covered in depth in the Amazon STL Load Errors documentation.

#7


0  

A little late to comment, but it may be useful:

You can use an open-source project to copy tables directly from MySQL to Redshift: sqlshift.

It only requires Spark, and if you have YARN it can also be used.

Benefits: it will automatically decide the distkey and interleaved sortkey using the primary key.
