spark.sql.crossJoin.enabled for Spark 2.x

Date: 2022-06-27 23:09:46

I am using the 'preview' Google Dataproc Image 1.1 with Spark 2.0.0. To complete one of my operations I have to compute a Cartesian product. Since version 2.0.0 there is a Spark configuration parameter (spark.sql.crossJoin.enabled) that prohibits Cartesian products by default, and an exception is thrown when one is attempted. How can I set spark.sql.crossJoin.enabled=true, preferably by using an initialization action?


3 Answers

#1


4  

For changing default values of configuration settings in Dataproc, you don't even need an init action: you can use the --properties flag when creating your cluster from the command line:


gcloud dataproc clusters create --properties spark:spark.sql.crossJoin.enabled=true my-cluster ...
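If you do want an initialization action instead, a minimal sketch would append the property to Spark's defaults file. The path below is the standard Dataproc location for Spark's configuration, but treat it as an assumption for your image version:

```shell
#!/bin/bash
# Hypothetical Dataproc init action: enable cross joins cluster-wide.
# Assumes the standard Dataproc Spark config path on the cluster nodes.
echo "spark.sql.crossJoin.enabled true" >> /etc/spark/conf/spark-defaults.conf
```

The --properties flag is simpler because Dataproc writes the same defaults file for you; an init action is only worth it if you need additional per-node setup anyway.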

#2


15  

Spark 2.1+


You can use crossJoin:


df1.crossJoin(df2)

It makes your intention explicit and keeps the more conservative default configuration in place to protect you from unintended cross joins.

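As a plain-Python sketch of what df1.crossJoin(df2) computes (the row values here are invented for illustration, not taken from the question), the result pairs every row of the left side with every row of the right:

```python
from itertools import product

# Hypothetical row data standing in for df1 and df2
df1_rows = [("a", 1), ("b", 2)]
df2_rows = [("x",), ("y",)]

# A cross join pairs every left row with every right row: |df1| * |df2| rows
result = [left + right for left, right in product(df1_rows, df2_rows)]
print(len(result))  # 4
print(result[0])    # ('a', 1, 'x')
```

This is why Spark guards the operation behind a flag: the output size is the product of the input sizes, which grows quickly on real tables.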

Spark 2.0


SQL properties can be set dynamically at runtime with the RuntimeConfig.set method, so you should be able to call


spark.conf.set("spark.sql.crossJoin.enabled", true)

whenever you want to explicitly allow a Cartesian product.


#3


1  

The TPC-DS benchmark query set contains queries with cross joins, and unless you either write CROSS JOIN explicitly or set Spark's property at runtime with spark.conf.set("spark.sql.crossJoin.enabled", true), you will run into an exception.


The error appears on TPC-DS queries 28, 61, 88, and 90 because the original query syntax from the Transaction Processing Performance Council (TPC) uses comma-separated FROM clauses, and Spark's default join operation is an inner join. My team also decided to use CROSS JOIN instead of changing Spark's default properties.

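To see why the comma syntax trips Spark's check, note that a comma-separated FROM clause with no join condition is simply a cross join in SQL. A small sketch using SQLite (not Spark; the tables and values are invented) shows that the two spellings produce the same Cartesian product:

```python
import sqlite3

# In-memory SQLite database with two tiny invented tables
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (x INTEGER)")
cur.execute("CREATE TABLE b (y INTEGER)")
cur.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
cur.executemany("INSERT INTO b VALUES (?)", [(10,), (20,)])

# Comma-separated FROM: the implicit cross-join style used by the TPC-DS queries
implicit = cur.execute("SELECT x, y FROM a, b ORDER BY x, y").fetchall()
# Explicit CROSS JOIN: same result, but the intention is stated
explicit = cur.execute("SELECT x, y FROM a CROSS JOIN b ORDER BY x, y").fetchall()

print(implicit == explicit)  # True
print(len(implicit))         # 4 rows: the 2 x 2 Cartesian product
```

Spark parses both forms to the same plan; the flag only controls whether the implicit form is rejected when no join condition narrows it.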
