Delete/update a partition using sparklyr [duplicate]

Date: 2021-12-01 23:09:06

This question already has an answer here:

I'm using the spark_write_table function from sparklyr to write tables into HDFS, using the partition_by parameter to define how to store them:

 R> my_table %>%
       spark_write_table(.,
             name = "mytable",
             mode = "append",
             partition_by = c("col1", "col2")
        )

However, now I want to update the table by altering just one partition, instead of writing the whole table again.

In Hadoop-SQL I would do something like:

INSERT INTO TABLE mytable
PARTITION (col1 = 'my_partition')
VALUES (myvalues..)

Is there an equivalent way to do this correctly in sparklyr? I cannot find it in the documentation.

Note on the duplicate flag: this question is specifically about how to do this in R with the sparklyr functions, while the other question is about general Hive syntax.

1 Answer

#1

Thanks all for the comments.

It seems there is no way to do this with sparklyr directly, but this is what I am going to do.

In short, I'll save the new partition's data in a temporary table, use a Hadoop SQL command to drop the partition, and then another SQL command to insert the contents of the temporary table into it.

> dbGetQuery(con,
    "ALTER TABLE mytable DROP IF EXISTS PARTITION (mycol='partition1')")

> spark_write_table(new_partition, "tmp_partition_table")

> dbGetQuery(con,
    "INSERT INTO TABLE mytable
     PARTITION (mycol='partition1')
     SELECT *
     FROM tmp_partition_table")
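
Putting the three steps together, here is a minimal sketch of a reusable helper. The function name replace_partition and its arguments are only illustrative (they are not part of sparklyr); it assumes con is an open sparklyr connection, new_partition is a tbl_spark containing only the non-partition columns, and the target table is partitioned by a single column.

    library(sparklyr)
    library(DBI)

    # Hypothetical helper: replace one partition of `table` with the rows
    # in `new_partition` (a tbl_spark without the partition column itself).
    replace_partition <- function(con, new_partition, table, part_col, part_val) {
      # 1. Drop the existing partition from the target table
      dbGetQuery(con, sprintf(
        "ALTER TABLE %s DROP IF EXISTS PARTITION (%s='%s')",
        table, part_col, part_val))

      # 2. Stage the new rows in a temporary Spark table
      spark_write_table(new_partition, "tmp_partition_table", mode = "overwrite")

      # 3. Insert the staged rows back into that partition
      dbGetQuery(con, sprintf(
        "INSERT INTO TABLE %s PARTITION (%s='%s') SELECT * FROM tmp_partition_table",
        table, part_col, part_val))
    }

    # Example call (all names are placeholders):
    # replace_partition(con, new_partition, "mytable", "mycol", "partition1")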
