使用另一个数据表中的列中的值更新一个数据表中的一列NA

I have seen a similar post on this topic, but could not comprehend the solution. Essentially I have a data table (DT1) with scores and some NAs. Wherever there is an 'NA' I would like a process that references another data (DT2) table that has a common column in both tables (tract) and then update the NA in DT1 with the value of DT2's score column. I am choosing a data table class because I think it is a an efficient option.

我在这个主题上看过类似的帖子,但无法理解解决方案。基本上我有一个数据表(DT1)与分数和一些NA。只要有'NA',我想要一个引用另一个数据(DT2)表的过程,该表在两个表(通道)中都有一个公共列,然后使用DT2的得分列值更新DT1中的NA。我正在选择一个数据表类,因为我认为它是一个有效的选择。

DT1

         tract CreditScore    
1: 36107020401         635  
2: 36083052403          NA  
3: 36091062602          NA  
4: 36067013000          NA  
5: 36083052304          NA

DT2

         tract CreditScore  
1: 36107020401         635  
2: 36083052403         650  
3: 36091062602         335  
4: 36067013000         777  
5: 36083052304         663

1 个解决方案

#1

We've created new (and more comprehensive) HTML vignettes for some of the data.table concepts. Have a look here for the other vignettes that we are working on. I'm working on vignettes for joins, which when done will hopefully clarify these type of problems better.

我们为一些data.table概念创建了新的(并且更全面的)HTML插图。看看我们正在处理的其他小插曲。我正在研究连接的小插图,这样做完成后可以更好地澄清这些类型的问题。

The idea is to first setkey() on DT1 on the column tract.

我们的想法是首先在列道上的DT1上设置keykey()。

setkey(DT1, tract)

In data.tables, a join of the form x[i] requires key for x, but not necessarily for i. This results in two scenarios:

在data.tables中,x [i]形式的连接需要x的键,但不一定是i的键。这导致两种情况:

If i also has key set -- the first key column of i is matched against first key column of x, second against second and so on..

如果我也有密钥集 - i的第一个键列与x的第一个键列匹配,第二个键与第二个键匹配,依此类推。
If i doesn't have key set -- the first column of i is matched against the first key column of x, second column of i against second key column of x and so on..

如果我没有密钥集 - i的第一列与x的第一个键列相匹配,i的第二列与x的第二个键列匹配,依此类推。

In this case, since your first column in i is also tract, we'll skip setting key on i.

在这种情况下,由于你的第一列也是第二列,我们将跳过设置键i。

Then, we perform a join of the form x[i]. By doing this, for each i the matching row indices in x is computed, and then the join result is materialised. However, we don't want the entire join result as a new data.table. Rather, we want to update DT1's CreditScore column with DT2's on those matching rows..

然后,我们执行x [i]形式的连接。通过这样做,对于每个i,计算x中的匹配行索引,然后实现连接结果。但是,我们不希望整个连接结果作为新的data.table。相反,我们希望在这些匹配行上使用DT2更新DT1的CreditScore列。

In data.tables, we can perform that operation while joining, by providing the expression in j, as follows:

在data.tables中,我们可以通过在j中提供表达式来加入时执行该操作,如下所示:

DT1[DT2, CreditScore := i.CreditScore]
#          tract CreditScore
# 1: 36067013000         777
# 2: 36083052304         663
# 3: 36083052403         650
# 4: 36091062602         335
# 5: 36107020401         635

DT1[DT2 part finds the matching rows in DT1 for each row in DT2. And if there's a match, we want DT2's value to be updated in DT1. We accomplish that by using i.CreditScore -- it refers to DT2's CreditScore column (i. is a prefix used to distinguish columns with identical names between x and i data.tables).

DT1 [DT2部分在DT2中为DT2中的每一行查找匹配的行。如果匹配,我们希望在DT1中更新DT2的值。我们通过使用i.CreditScore实现了这一点 - 它引用了DT2的CreditScore列(i。是用于区分x和i data.tables之间具有相同名称的列的前缀)。

Update: As pointed out under comments, the solution above would also update the non-NA values in DT1. Therefore the way to do it would be:

更新:正如评论中所指出的,上述解决方案还将更新DT1中的非NA值。因此,这样做的方法是:

DT1[is.na(CreditScore), CreditScore := DT2[.(.SD), CreditScore]]

On those rows where CreditScore from DT1 is NA, replace CreditScore from DT1 with the values from CreditScore obtained from the join of DT2[.(.SD)], where .SD corresponds to the subset of data.table that contains all the rows where CreditScore is NA.

在来自DT1的CreditScore为NA的那些行中,将来自DT1的CreditScore替换为从DT2 [。(。SD)]的连接获得的CreditScore的值,其中.SD对应于包含所有行的data.table的子集。 CreditScore是NA。

HTH

#1