This question already has an answer here:
- Overwrite only some partitions in a partitioned spark Dataset (3 answers)
I'm using the spark_write_table function from sparklyr to write tables into HDFS, with the partition_by parameter defining how they are partitioned:
my_table %>%
  spark_write_table(
    name = "mytable",
    mode = "append",
    partition_by = c("col1", "col2")
  )
However, I now want to update the table by replacing just one partition, instead of rewriting the whole table.
In Hadoop-SQL I would do something like:
INSERT INTO TABLE mytable
PARTITION (col1 = 'my_partition')
VALUES (myvalues..)
Is there an equivalent way to do this correctly in sparklyr? I cannot find one in the documentation.
Regarding the duplicate note: this question is specifically about the way to do this in R with the sparklyr functions, while the other question is about general Hive syntax.
1 Answer
#1
Thanks all for the comments.
It seems there is no way to do this directly with sparklyr, so here is what I am going to do instead.
In short, I'll save the new partition's data in a temporary table, use a Hive SQL command to drop the stale partition, and then another SQL command to insert the contents of the temporary table into it.
# Stage the new partition's rows in a temporary table
spark_write_table(new_partition, "tmp_partition_table")

# Drop the stale partition from the target table
dbGetQuery(con,
  "ALTER TABLE mytable DROP IF EXISTS PARTITION (mycol = 'partition1')")

# Insert the staged rows into that partition
dbGetQuery(con,
  "INSERT INTO TABLE mytable
   PARTITION (mycol = 'partition1')
   SELECT *
   FROM tmp_partition_table")
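For reference, a minimal end-to-end sketch of this workaround follows. It assumes `sc` is the sparklyr connection (which also acts as the DBI connection), that `new_partition` is a Spark DataFrame holding only the non-partition columns of the target table, and that `mytable`, `mycol`, and `tmp_partition_table` are the placeholder names from above.

library(sparklyr)
library(DBI)

sc <- spark_connect(master = "yarn")  # assumed cluster setup

# 1. Stage the replacement rows first, so the old partition is only
#    dropped once the new data has been written successfully.
spark_write_table(new_partition, "tmp_partition_table", mode = "overwrite")

# 2. Drop the stale partition from the target table.
dbGetQuery(sc, "ALTER TABLE mytable DROP IF EXISTS PARTITION (mycol = 'partition1')")

# 3. Insert the staged rows into the recreated partition. With a static
#    PARTITION clause, the SELECT list must cover only the non-partition
#    columns, which is why tmp_partition_table must not contain mycol.
dbGetQuery(sc, "
  INSERT INTO TABLE mytable
  PARTITION (mycol = 'partition1')
  SELECT * FROM tmp_partition_table
")

# 4. Clean up the staging table.
dbGetQuery(sc, "DROP TABLE IF EXISTS tmp_partition_table")

Staging before dropping means a failed write leaves the original partition intact.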