How to apply the same function to every specified column in a data.table

Date: 2022-10-03 20:22:44

I have a data.table and I'd like to perform the same operation on certain columns. The names of these columns are given in a character vector. In this particular example, I'd like to multiply all of these columns by -1.

Some toy data and a vector specifying relevant columns:

library(data.table)
dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")

Right now I'm doing it this way, looping over the character vector:

for (col in 1:length(cols)) {
   dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
}

Is there a way to do this directly without the for loop?

3 Solutions

#1


97  

This seems to work:

dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]

The result is

    a  b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3

There are a few tricks here:

  • Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols".
  • .SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
  • lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like (cols) := list(...).
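
The three points above can be checked with a minimal sketch on the toy data. The names sd_names, dt_literal and dt_by_name are only illustrative and do not come from the answer:

```r
library(data.table)

dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")

# .SDcols restricts .SD to the named columns
sd_names <- names(dt[, .SD, .SDcols = cols])

# Without parentheses, the left-hand side is read as a literal column name
dt_literal <- copy(dt)
dt_literal[, cols := a * -1]

# With parentheses, the left-hand side is evaluated to the names stored in cols
dt_by_name <- copy(dt)
dt_by_name[, (cols) := lapply(.SD, "*", -1), .SDcols = cols]

print(sd_names)           # columns seen by .SD
print(names(dt_literal))  # a new column called "cols" was added
print(dt_by_name)         # a and b negated, d untouched
```

Working on copies keeps the original dt unchanged, so the two forms of := can be compared side by side.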

EDIT: Here's another way that is probably faster, as @Arun mentioned:

for (j in cols) set(dt, j = j, value = -dt[[j]])
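
As a rough sketch of what the set() loop does: set() modifies the table by reference, one column per call, without going through the [.data.table machinery. The address check below is only an illustration (addr_before is a made-up name), using data.table's address() helper:

```r
library(data.table)

dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")

# Remember where the table lives in memory before the loop
addr_before <- address(dt)

# Negate each listed column in place; j is the column name
for (j in cols) set(dt, j = j, value = -dt[[j]])

print(dt)
print(address(dt) == addr_before)  # same object, updated by reference
```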

#2


7  

I would like to add an answer for the case where you also want to change the names of the columns. This comes in quite handy if you want to calculate the logarithm of multiple columns, which is often the case in empirical work.

cols <- c("a", "b")
out_cols = paste("log", cols, sep = ".")
dt[, c(out_cols) := lapply(.SD, function(x){log(x = x, base = exp(1))}), .SDcols = cols]
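
A small sketch of the same idea on a fresh copy of the toy data (so the inputs are still positive), assuming the natural logarithm is wanted. It writes (out_cols) instead of c(out_cols); both forms make data.table evaluate the left-hand side rather than treat it as a literal name:

```r
library(data.table)

dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")
out_cols <- paste("log", cols, sep = ".")

# New columns log.a and log.b are added; a and b stay as they were
dt[, (out_cols) := lapply(.SD, log), .SDcols = cols]

print(names(dt))
print(dt)
```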

#3


2  

UPDATE: The following is a neat way to do it without a for loop:

dt[,(cols):= - dt[,..cols]]
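
To unpack the one-liner, here is a hedged step-by-step sketch. The intermediate variable selected is only for illustration; the .. prefix tells data.table to look up cols in the calling scope rather than among the columns:

```r
library(data.table)

dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")

# ..cols selects the columns whose names are stored in cols
selected <- dt[, ..cols]
print(names(selected))

# The negated selection is list-like, so it can be assigned column by column
dt[, (cols) := -selected]
print(dt)
```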

This keeps the code easy to read. In terms of performance, however, it lags behind Frank's lapply-based solution (franks_solution1), according to the microbenchmark results below:

library(microbenchmark)

mbm = microbenchmark(
  base_solution = for (col in 1:length(cols)) {
    dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
  },
  franks_solution1 = dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols],
  franks_solution2 =  for (j in cols) set(dt, j = j, value = -dt[[j]]),
  hannes_solution = dt[, c(out_cols) := lapply(.SD, function(x){log(x = x, base = exp(1))}), .SDcols = cols],
  orhans_solution = for (j in cols) dt[,(j):= -1 * dt[,  ..j]],
  orhans_solution2 = dt[,(cols):= - dt[,..cols]],
  times=1000
)
mbm

Unit: microseconds
expr                  min        lq      mean    median       uq       max neval
base_solution    3874.048 4184.4070 5205.8782 4452.5090 5127.586 69641.789  1000  
franks_solution1  313.846  349.1285  448.4770  379.8970  447.384  5654.149  1000    
franks_solution2 1500.306 1667.6910 2041.6134 1774.3580 1961.229  9723.070  1000    
hannes_solution   326.154  405.5385  561.8263  495.1795  576.000 12432.400  1000
orhans_solution  3747.690 4008.8175 5029.8333 4299.4840 4933.739 35025.202  1000  
orhans_solution2  752.000  831.5900 1061.6974  897.6405 1026.872  9913.018  1000

The results are also shown in the chart below:

[Chart of the microbenchmark results; image not available]

My Previous Answer: The following also works

for (j in cols)
  dt[,(j):= -1 * dt[,  ..j]]
