I have a data.table on which I'd like to perform the same operation for certain columns. The names of these columns are given in a character vector. In this particular example, I'd like to multiply all of these columns by -1.
Some toy data and a vector specifying relevant columns:
library(data.table)
dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")
Right now I'm doing it this way, looping over the character vector:
for (col in 1:length(cols)) {
dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
}
Is there a way to do this directly without the for loop?
3 Answers
#1
97
This seems to work:
dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]
The result is
a b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3
There are a few tricks here:
- Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols".
- .SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
- lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like cols := list(...).
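To make the first two points concrete, here is a small sketch on fresh copies of the toy data. The names dt1, dt2, dt3 and sd_names are only illustrative and do not appear in the original answer.

```r
library(data.table)

cols <- c("a", "b")

# Without parentheses the left-hand side is taken literally,
# so a new column called "cols" is added.
dt1 <- data.table(a = 1:3, b = 1:3, d = 1:3)
dt1[, cols := 0L]
names(dt1)   # "a" "b" "d" "cols"

# With parentheses the character vector is evaluated,
# so the existing columns a and b are overwritten.
dt2 <- data.table(a = 1:3, b = 1:3, d = 1:3)
dt2[, (cols) := lapply(.SD, "*", -1), .SDcols = cols]
names(dt2)   # "a" "b" "d"

# .SDcols restricts .SD to the listed columns.
dt3 <- data.table(a = 1:3, b = 1:3, d = 1:3)
sd_names <- dt3[, names(.SD), .SDcols = cols]
sd_names     # "a" "b"
```

Note that multiplying by the double -1 turns the integer columns a and b into double columns; multiplying by -1L instead should keep them integer.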
EDIT: Here's another way that is probably faster, as @Arun mentioned:
for (j in cols) set(dt, j = j, value = -dt[[j]])
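As a self-contained sketch of the set() loop above, run on a fresh copy of the toy data (dt2 is an illustrative name, used so the example does not depend on the state of dt):

```r
library(data.table)

dt2 <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")

# set() updates dt2 by reference, one column per iteration;
# no assignment with <- is needed.
for (j in cols) set(dt2, j = j, value = -dt2[[j]])

dt2$a   # -1 -2 -3
dt2$d   #  1  2  3 (untouched)
```

set() skips the overhead of a full [.data.table call, which is why it is usually suggested for loops over many columns.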
#2
7
I'd like to add an answer for the case where you also want to change the names of the columns, that is, write the results to new columns. This comes in quite handy if you want to calculate the logarithm of multiple columns, which is often the case in empirical work.
cols <- c("a", "b")
out_cols = paste("log", cols, sep = ".")
dt[, c(out_cols) := lapply(.SD, function(x){log(x = x, base = exp(1))}), .SDcols = cols]
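For illustration, a shorter variant of the same idea on a fresh copy of the toy data. It relies on two facts: log() computes the natural logarithm by default, and (out_cols) is read the same way as c(out_cols) on the left of :=. This is a sketch, not a replacement for the code above.

```r
library(data.table)

dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")
out_cols <- paste("log", cols, sep = ".")

# New columns log.a and log.b are appended; a and b stay as they were.
dt[, (out_cols) := lapply(.SD, log), .SDcols = cols]

names(dt)   # "a" "b" "d" "log.a" "log.b"
```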
#3
2
UPDATE: The following is a neat way to do it without a for loop:
dt[,(cols):= - dt[,..cols]]
This is neat and easy to read. In terms of performance, however, it lags behind Frank's lapply(.SD, ...) solution, according to the microbenchmark results below:
library(microbenchmark)
mbm = microbenchmark(
  base_solution = for (col in 1:length(cols)) {
    dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
  },
  franks_solution1 = dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols],
  franks_solution2 = for (j in cols) set(dt, j = j, value = -dt[[j]]),
  hannes_solution = dt[, c(out_cols) := lapply(.SD, function(x){log(x = x, base = exp(1))}), .SDcols = cols],
  orhans_solution = for (j in cols) dt[,(j):= -1 * dt[, ..j]],
  orhans_solution2 = dt[,(cols):= - dt[,..cols]],
  times = 1000
)
mbm
Unit: microseconds
             expr      min        lq      mean    median       uq       max neval
    base_solution 3874.048 4184.4070 5205.8782 4452.5090 5127.586 69641.789  1000
 franks_solution1  313.846  349.1285  448.4770  379.8970  447.384  5654.149  1000
 franks_solution2 1500.306 1667.6910 2041.6134 1774.3580 1961.229  9723.070  1000
  hannes_solution  326.154  405.5385  561.8263  495.1795  576.000 12432.400  1000
  orhans_solution 3747.690 4008.8175 5029.8333 4299.4840 4933.739 35025.202  1000
 orhans_solution2  752.000  831.5900 1061.6974  897.6405 1026.872  9913.018  1000
(The original post also showed these timings in a chart, which is not reproduced here.)
My Previous Answer: The following also works
for (j in cols)
dt[,(j):= -1 * dt[, ..j]]
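Both versions in this answer use the .. prefix. As a rough sketch of what it does: inside j, ..cols tells data.table to look up cols as a variable in the calling scope rather than as a column name, much like with = FALSE. The names sub1, sub2 and neg are illustrative only.

```r
library(data.table)

dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")

# Two ways of selecting the columns named in the variable cols.
sub1 <- dt[, ..cols]
sub2 <- dt[, cols, with = FALSE]
all.equal(sub1, sub2)

# Unary minus is applied element-wise to the selected columns,
# giving a new data.table that can then be assigned with (cols) :=.
neg <- -dt[, ..cols]
names(neg)   # "a" "b"
```

Because dt[, ..cols] builds a new data.table before it is negated and assigned back, this form does some extra copying compared with the .SD and set() approaches.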