I'm looking for a Python-like dictionary structure in R to replace values in a large dataset (>100 MB), and I think the data.table package can help with this. However, I can't find an easy way to solve the problem.
For example, I have two data.tables:
Table A:
V1 V2
1: A B
2: C D
3: C D
4: B C
5: D A
Table B:
V3 V4
1: A 1
2: B 2
3: C 3
4: D 4
I want to use B as a dictionary to replace the values in A. So the result I want to get is:
Table R:
V5 V6
1 2
3 4
3 4
2 3
4 1
What I did is:
c2=tB[tA[,list(V2)],list(V4)]
c1=tB[tA[,list(V1)],list(V4)]
Although I specified j=list(V4), the result still included the values of V3. I don't know why.
c2:
V3 V4
1: B 2
2: D 4
3: D 4
4: C 3
5: A 1
c1:
V3 V4
1: A 1
2: C 3
3: C 3
4: B 2
5: D 4
Finally, I combined the two V4 columns and got the result I want.
But I think there should be a much easier way to do this. Any ideas?
2 Solutions
#1
2
Here's an alternative way:
setkey(B, V3)
for (i in seq_len(length(A))) {
    thisA = A[[i]]
    set(A, j=i, value=B[thisA]$V4)
}
# V1 V2
# 1: 1 2
# 2: 3 4
# 3: 3 4
# 4: 2 3
# 5: 4 1
Since thisA is a character column, we don't need the J() (a convenience). Here, A's columns are replaced by reference, so the approach is also memory efficient. If you don't want to modify A, you can use cA <- copy(A) and replace cA's columns instead.
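To make the by-reference point concrete, here is a minimal self-contained sketch. It assumes the tables from the question are named A and B and that V4 is stored as integers; `keys` is just an illustrative variable name. It works on a copy so that the original A can be compared afterwards.

```r
library(data.table)

# Sample tables copied from the question (V4 assumed to be integer)
A <- data.table(V1 = c("A", "C", "C", "B", "D"),
                V2 = c("B", "D", "D", "C", "A"))
B <- data.table(V3 = c("A", "B", "C", "D"), V4 = 1:4)
setkey(B, V3)  # a character vector in i is joined against this key

cA <- copy(A)  # deep copy, so A itself is left untouched
for (i in seq_along(cA)) {
  keys <- cA[[i]]
  # Replace the whole column of cA in place with the looked-up values
  set(cA, j = i, value = B[keys]$V4)
}

print(cA)  # columns now hold the values looked up in B
print(A)   # still holds the original letters
```

Dropping the `copy()` call and looping over A directly would overwrite A's columns in place, which is the memory-efficient variant described above.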
Alternatively, using :=:
A[, names(A) := lapply(.SD, function(x) B[J(x)]$V4)]
# or
ans = copy(A)[, names(A) := lapply(.SD, function(x) B[J(x)]$V4)]
(Following user2923419's comment): You can drop the J() if the lookup is a single column of type character (just for convenience).
In 1.9.3, when j is a single column, it returns a vector (added by user request), so the data.table syntax becomes a bit more natural:
setkey(B, V3)
for (i in seq_len(length(A))) {
    thisA = A[[i]]
    set(A, j=i, value=B[thisA, V4])
}
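For readers on a newer data.table, the same lookup can be sketched without setting a key by using the `on=` argument. This assumes data.table 1.9.6 or later, where `on=` is available, and again assumes integer V4; `col`, `keys` and `miss` are illustrative names, not part of the original answer.

```r
library(data.table)

# Sample tables copied from the question (V4 assumed to be integer)
A <- data.table(V1 = c("A", "C", "C", "B", "D"),
                V2 = c("B", "D", "D", "C", "A"))
B <- data.table(V3 = c("A", "B", "C", "D"), V4 = 1:4)

for (col in names(A)) {
  keys <- A[[col]]
  # Join B to the lookup values on V3 and return V4 as a plain vector,
  # in the order of the lookup values
  set(A, j = col, value = B[.(V3 = keys), V4, on = "V3"])
}
print(A)

# A key that is missing from B yields NA under the default nomatch setting
miss <- B[.(V3 = c("A", "Z")), V4, on = "V3"]
print(miss)
```

Because no key is set, B keeps its original row order, which can matter if B is used elsewhere.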
#2
0
I am not sure how fast this is with big data, but chmatch is supposed to be fast.
tA[ , lapply(.SD,function(x) tB$V4[chmatch(x,tB$V3)])]
V1 V2
1: 1 2
2: 3 4
3: 3 4
4: 2 3
5: 4 1
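Since the question asks for something dictionary-like, a related sketch is to build a named vector from tB and index it by name. This is a base R idiom swapped in for chmatch, not something the answer above proposes, and it has not been benchmarked here; `dict` and `res` are illustrative names and V4 is assumed to be integer.

```r
library(data.table)

# Sample tables copied from the question (V4 assumed to be integer)
tA <- data.table(V1 = c("A", "C", "C", "B", "D"),
                 V2 = c("B", "D", "D", "C", "A"))
tB <- data.table(V3 = c("A", "B", "C", "D"), V4 = 1:4)

# Named vector acting as the dictionary: names are keys, elements are values
dict <- setNames(tB$V4, tB$V3)

# Index the dictionary by name for every column; unname() drops the key labels
res <- tA[, lapply(.SD, function(x) unname(dict[x]))]
print(res)
```

Unlike the set() and := approaches, this returns a new table and leaves tA unchanged.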