Tearing down test data using Apache Spark

I am building an ETL process with PySpark, running on Python 3, Apache Spark 2 and Fedora 20.

I am also building automated tests against the framework but am struggling with the tear down of data at the end of the tests.

I can set up specific data in an AWS Redshift cluster using Spark but, short of wiping out all data in a table, don't seem to be able to delete specific data.

If I try to run a DELETE FROM ... WHERE ... statement, I get a "command not allowed" error. It isn't a permissions issue, as the exact same command runs for the exact same user in our DB IDE (Aquafold data studio 17).

Short of installing something like psycopg2 or pyodbc (which feels like overkill), I am not sure how to achieve the equivalent of a DELETE with a WHERE clause.

1 Solution

#1

The AWS Redshift driver has a number of options, two of which are relevant here:

preactions
postactions

These can contain SQL statement(s) to be executed on AWS Redshift.

Where there are multiple statements they must be separated by a semi-colon character.

DO NOT terminate the final statement with a semi-colon: the driver then expects a subsequent query and fails.
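
The two separator rules above can be captured in a small helper. This is only a minimal sketch: the function name join_statements and the table and column names are hypothetical, not part of the driver.

```python
def join_statements(statements):
    """Join SQL statements with ';' and leave no trailing semi-colon."""
    cleaned = []
    for stmt in statements:
        # Drop surrounding whitespace and any trailing semi-colons.
        stmt = stmt.strip().rstrip(";").strip()
        if stmt:
            cleaned.append(stmt)
    return "; ".join(cleaned)


# Hypothetical teardown statements for two test tables.
teardown_sql = join_statements([
    "DELETE FROM test_schema.orders WHERE batch_id = 'test-run-1';",
    "DELETE FROM test_schema.customers WHERE batch_id = 'test-run-1';",
])
print(teardown_sql)
```

The resulting string could then be passed as the value of preactions or postactions.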

When either preactions or postactions is included in the options, it must contain valid SQL; it cannot be blank. A simple SELECT 1 will suffice.
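
As an illustration, a targeted teardown could be attached to a write roughly as follows. This is a sketch under assumptions, not a definitive implementation: it assumes the driver is the spark-redshift data source (format name com.databricks.spark.redshift, with url, dbtable and tempdir options), the helper names are hypothetical, and the authentication options that a real cluster needs are left out.

```python
def redshift_write_options(url, table, tempdir, preactions=None, postactions=None):
    """Build the options dict; a blank pre/post action falls back to SELECT 1."""
    options = {"url": url, "dbtable": table, "tempdir": tempdir}
    if preactions is not None:
        # The option must hold valid SQL once it is present.
        options["preactions"] = preactions.strip() or "SELECT 1"
    if postactions is not None:
        options["postactions"] = postactions.strip() or "SELECT 1"
    return options


def write_with_actions(df, options):
    """Append df to Redshift; the pre/post actions run around the load.

    Assumes df is a PySpark DataFrame and the spark-redshift package
    is available on the cluster.
    """
    (df.write
       .format("com.databricks.spark.redshift")
       .options(**options)
       .mode("append")
       .save())


# Hypothetical connection details and teardown statement.
opts = redshift_write_options(
    url="jdbc:redshift://example-host:5439/dev?user=test_user&password=secret",
    table="test_schema.orders",
    tempdir="s3a://example-bucket/tmp/",
    postactions="DELETE FROM test_schema.orders WHERE batch_id = 'test-run-1'",
)
print(opts["postactions"])
```

In a test teardown, write_with_actions(df, opts) would be called with whatever DataFrame the test writes, so that the DELETE runs on Redshift as part of that write.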

Note that there is one further option that is advisable to use:

extracopyoptions="ACCEPTINVCHARS ' '"

AWS Redshift does not support NVARCHAR characters and treats NVARCHAR as a synonym for VARCHAR. The option above replaces any character that Redshift cannot load with a single space.
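
In PySpark the quoting of this option is easy to get wrong, so a short sketch may help. The helper name add_copy_option and the table name are hypothetical; the option value is the one given above.

```python
def add_copy_option(options, copy_option):
    """Return a copy of options with copy_option appended to extracopyoptions."""
    updated = dict(options)
    existing = updated.get("extracopyoptions", "").strip()
    updated["extracopyoptions"] = (existing + " " + copy_option).strip()
    return updated


base = {"dbtable": "test_schema.orders"}  # hypothetical table name
opts = add_copy_option(base, "ACCEPTINVCHARS ' '")
print(opts["extracopyoptions"])  # ACCEPTINVCHARS ' '
```

The outer double quotes belong to Python; the single quotes around the space are part of the SQL that Redshift receives.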
