I have seen a similar post on this topic, but could not follow the solution. Essentially, I have a data table (DT1) with scores and some NAs. Wherever there is an NA, I would like a process that looks up another data table (DT2), which shares a common column (tract) with DT1, and replaces the NA in DT1 with the value from DT2's score column. I am using the data.table class because I think it is an efficient option.
DT1
tract CreditScore
1: 36107020401 635
2: 36083052403 NA
3: 36091062602 NA
4: 36067013000 NA
5: 36083052304 NA
DT2
tract CreditScore
1: 36107020401 635
2: 36083052403 650
3: 36091062602 335
4: 36067013000 777
5: 36083052304 663
1 solution
#1
We've created new (and more comprehensive) HTML vignettes for some of the data.table concepts. Have a look here for the other vignettes that we are working on. I'm working on vignettes for joins, which, when done, will hopefully clarify these types of problems better.
The idea is to first call setkey() on DT1, using the column tract.
setkey(DT1, tract)
In data.tables, a join of the form x[i] requires a key for x, but not necessarily for i. This results in two scenarios:
- If i also has a key set, the first key column of i is matched against the first key column of x, the second against the second, and so on.
- If i doesn't have a key set, the first column of i is matched against the first key column of x, the second column of i against the second key column of x, and so on.
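The two scenarios above can be tried out on a small toy example. This is only a sketch: the tables x, i_unkeyed and i_keyed and their columns are made up for illustration and are not part of the question's data.

```r
library(data.table)

# Hypothetical toy table; id is x's key column.
x <- data.table(id = c(3L, 1L, 2L), val = c("c", "a", "b"))
setkey(x, id)   # sorts x by id and marks id as its key

# i has no key: its FIRST column (k) is matched against x's key column (id).
i_unkeyed <- data.table(k = c(2L, 3L), other = c("p", "q"))
res_unkeyed <- x[i_unkeyed]

# i has a key: its KEY column (k, here the second column) is matched
# against x's key column (id), regardless of column position.
i_keyed <- data.table(other = c("p", "q"), k = c(2L, 3L))
setkey(i_keyed, k)
res_keyed <- x[i_keyed]

print(res_unkeyed)
print(res_keyed)
```

In both cases the rows of x with id 2 and 3 are picked up, even though the matching column in i is called k and sits in a different position in the keyed case.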
In this case, since the first column in i is also tract, we'll skip setting a key on i.
Then, we perform a join of the form x[i]. By doing this, for each row of i the matching row indices in x are computed, and then the join result is materialised. However, we don't want the entire join result as a new data.table. Rather, we want to update DT1's CreditScore column with DT2's on those matching rows.
In data.tables, we can perform that operation while joining, by providing the expression in j, as follows:
DT1[DT2, CreditScore := i.CreditScore]
# tract CreditScore
# 1: 36067013000 777
# 2: 36083052304 663
# 3: 36083052403 650
# 4: 36091062602 335
# 5: 36107020401 635
The DT1[DT2 part finds the matching rows in DT1 for each row in DT2. If there's a match, we want DT1's value to be updated with DT2's. We accomplish that by using i.CreditScore, which refers to DT2's CreditScore column (i. is a prefix used to distinguish columns with identical names between the x and i data.tables).
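For anyone who wants to run the steps so far end to end, here is a self-contained sketch. It rebuilds the two tables from the question; storing tract as character and CreditScore as integer is an assumption on my part, since the question does not show the column types.

```r
library(data.table)

# Tables from the question. Column types are assumed (tract as character,
# CreditScore as integer); the original types are not shown.
DT1 <- data.table(
  tract = c("36107020401", "36083052403", "36091062602",
            "36067013000", "36083052304"),
  CreditScore = c(635L, NA, NA, NA, NA)
)
DT2 <- data.table(
  tract = c("36107020401", "36083052403", "36091062602",
            "36067013000", "36083052304"),
  CreditScore = c(635L, 650L, 335L, 777L, 663L)
)

setkey(DT1, tract)   # reorders DT1 by tract and marks tract as its key

# Update join: for rows of DT1 matched by DT2, overwrite CreditScore
# with DT2's value (i.CreditScore). DT1 is modified by reference.
DT1[DT2, CreditScore := i.CreditScore]

print(DT1)
```

Note that no assignment with <- is needed: := changes DT1 in place, and the row order of the printed result follows the key set by setkey().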
Update: As pointed out in the comments, the solution above would also update the non-NA values in DT1. Therefore the way to do it would be:
DT1[is.na(CreditScore), CreditScore := DT2[.(.SD), CreditScore]]
On those rows where CreditScore from DT1 is NA, replace CreditScore from DT1 with the values of CreditScore obtained from the join DT2[.(.SD)], where .SD corresponds to the subset of the data.table that contains all the rows where CreditScore is NA.
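A minimal sketch of the NA-only idea, written as a variant rather than the exact call above: it names the join column with on = "tract" (so no key is needed on either table) and uses the x. prefix to pick DT2's column; both require a reasonably recent data.table. The column types are assumed, and DT2's score for the first tract is deliberately changed to a hypothetical 700 so that it is visible the existing non-NA value in DT1 is left alone.

```r
library(data.table)

# Column types are assumed (tract as character, CreditScore as integer).
DT1 <- data.table(
  tract = c("36107020401", "36083052403", "36091062602",
            "36067013000", "36083052304"),
  CreditScore = c(635L, NA, NA, NA, NA)
)
# Hypothetical tweak: 700 instead of 635 for the first tract, to show
# that non-NA values in DT1 are not overwritten.
DT2 <- data.table(
  tract = DT1$tract,
  CreditScore = c(700L, 650L, 335L, 777L, 663L)
)

# Only for rows of DT1 where CreditScore is NA: look those rows (.SD) up
# in DT2 by tract and take DT2's CreditScore (x.CreditScore).
DT1[is.na(CreditScore),
    CreditScore := DT2[.SD, on = "tract", x.CreditScore]]

print(DT1)
```

The first row keeps its original 635 because it never enters the is.na() subset; only the four NA rows are filled from DT2.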
HTH