How to join items from different DataFrames into one common DataFrame

Date: 2022-12-14 12:15:14

Suppose we have a DataFrame 'A':

Id    Name    FavColor    Address
1     John    Black       xyz
2     Mathew  Orange      www
3     Russel  Red         xxx

Different datasets arrive that update values in some of these columns. For example, DataFrame 'B':

Id    FavColor
1     Red
2     Black

and DataFrame 'C':

Id    Address
1     aaa
3     bbb

In this case the updates in 'B' and 'C' need to be merged into 'A'. I tried merging 'B' and 'C' first and then merging the result into 'A', but when I merge 'B' and 'C' I get:

Id    FavColor    Address
1     Red         aaa
2     Black       null
3     null        bbb

If I merge this with 'A', the result is wrong: the Address of Id=2 becomes null and the FavColor of Id=3 becomes null. How can I merge the incoming updates into 'A'? The incoming data may also contain a new attribute; in that case the rows of 'A' that have no value for that attribute should show null.

1 Answer

#1

Try merging the data with a left join and taking the updated value only for rows where one exists. The code below merges A with B; you can then merge that result with C in the same way.

scala> A.join(B, A("Id") === B("Id"), "left").
     | withColumn("merged", when(B("FavColor").isNotNull, B("FavColor")).otherwise(A("FavColor"))).
     | drop(B("FavColor")).drop(A("FavColor")).drop(B("Id")).
     | withColumnRenamed("merged", "FavColor").show()

+---+------+-------+--------+
| Id|  Name|Address|FavColor|
+---+------+-------+--------+
|  1|  John|    xyz|     Red|
|  2|Mathew|    www|   Black|
|  3|Russel|    xxx|     Red|
+---+------+-------+--------+
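
The same idea can be generalized so that B and C (and any later update) are applied one after another. The following is only a sketch under some assumptions: the key column is called "Id" in every DataFrame, column names contain no dots, and `applyUpdate` and `ApplyUpdates` are hypothetical names chosen for illustration. It uses `coalesce` in place of `when(...).otherwise(...)`, which picks the first non-null value.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col}

object ApplyUpdates {
  // Hypothetical helper: overlay the non-key columns of `update` onto `base`.
  // Assumes both DataFrames contain the key column and no column name has a dot.
  def applyUpdate(base: DataFrame, update: DataFrame, key: String = "Id"): DataFrame = {
    val updateCols = update.columns.filterNot(_ == key)

    // Columns already in `base`: prefer the updated value when it is not null.
    val existing = base.columns.map { c =>
      if (updateCols.contains(c)) coalesce(col(s"u.$c"), col(s"b.$c")).as(c)
      else col(s"b.$c")
    }
    // Attributes that are new in `update`: rows without a match stay null.
    val added = updateCols.filterNot(base.columns.contains).map(c => col(s"u.$c"))

    base.alias("b")
      .join(update.alias("u"), col(s"b.$key") === col(s"u.$key"), "left")
      .select(existing ++ added: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("apply-updates").getOrCreate()
    import spark.implicits._

    val A = Seq((1, "John", "Black", "xyz"), (2, "Mathew", "Orange", "www"), (3, "Russel", "Red", "xxx"))
      .toDF("Id", "Name", "FavColor", "Address")
    val B = Seq((1, "Red"), (2, "Black")).toDF("Id", "FavColor")
    val C = Seq((1, "aaa"), (3, "bbb")).toDF("Id", "Address")

    // Apply each update in turn instead of merging B and C with each other first.
    val merged = Seq(B, C).foldLeft(A)((acc, upd) => applyUpdate(acc, upd))
    merged.orderBy("Id").show()

    spark.stop()
  }
}
```

Because each update is left-joined onto the current state of A, a null that only comes from a missing row in B or C never overwrites an existing value. Note that with a left join, an Id that appears only in an update and not in A is dropped; if such rows should be kept, a full outer join with a coalesced key would be needed instead.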
