I want to convert a subset of data.table columns to a new class. There's a popular question here (Convert column classes in data.table), but the answer creates a new object rather than operating on the original object.
Take this example:
dat <- data.frame(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))
cols <- c('ID', 'Quarter')
How best to convert just the cols columns to (e.g.) a factor? In a normal data.frame you could do this:
dat[, cols] <- lapply(dat[, cols], factor)
but that doesn't work for a data.table, and neither does this:
dat[, .SD := lapply(.SD, factor), .SDcols = cols]
A comment in the linked question from Matt Dowle (from Dec 2013) suggests the following, which works fine, but seems a bit less elegant.
for (j in cols) set(dat, j = j, value = factor(dat[[j]]))
Is there currently a better data.table answer (i.e. shorter, and one that doesn't leave a counter variable behind), or should I just use the above plus rm(j)?
2 Answers
#1
28
Besides using the option as suggested by Matt Dowle, another way of changing the column classes is as follows:
dat[, (cols) := lapply(.SD, factor), .SDcols = cols]
By using the := operator you update the data.table by reference. To check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
As suggested by @MattDowle in the comments, you can also use a combination of for(...) set(...) as follows:
for (col in cols) set(dat, j = col, value = factor(dat[[col]]))
which will give the same result. A third alternative is:
for (col in cols) dat[, (col) := factor(dat[[col]])]
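The three variants above can be run side by side. The following is a minimal sketch, assuming the data.table package is installed; make_dat is a hypothetical helper introduced here only so that each variant starts from a fresh copy of the example data, and address() is used to illustrate that := leaves the table in place rather than creating a new object.

```r
library(data.table)

# Hypothetical helper: rebuilds the question's toy data as a data.table
make_dat <- function() {
  data.table(ID = c(rep("A", 5), rep("B", 5)),
             Quarter = c(1:5, 1:5),
             value = rnorm(10))
}
cols <- c("ID", "Quarter")

d1 <- make_dat(); d2 <- make_dat(); d3 <- make_dat()
addr_before <- address(d1)  # memory address of d1 before the update

# Variant 1: := with lapply over .SD
d1[, (cols) := lapply(.SD, factor), .SDcols = cols]
# Variant 2: for loop with set()
for (col in cols) set(d2, j = col, value = factor(d2[[col]]))
# Variant 3: for loop with :=
for (col in cols) d3[, (col) := factor(d3[[col]])]

# Column classes after each variant
classes <- lapply(list(d1, d2, d3), function(d) sapply(d, class))
# Compare the address of d1 before and after the update
same_object <- identical(addr_before, address(d1))
print(classes[[1]])
```

The parentheses around cols (or col) on the left-hand side of := tell data.table to use the names stored in the variable rather than create a column literally called cols.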
On smaller datasets, the for(...) set(...) option is about three times faster than the lapply option (but that doesn't really matter, because it is a small dataset). On larger datasets (e.g. 2 million rows), each of these approaches takes about the same amount of time. For testing on a larger dataset, I used:
dat <- data.table(ID = c(rep("A", 1e6), rep("B", 1e6)),
                  Quarter = c(1:1e6, 1:1e6),
                  value = rnorm(2e6))  # one value per row, so no recycling is needed
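A timing comparison along these lines could be sketched as follows. This is only an illustration, not the benchmark used for the figures quoted above: it assumes data.table is installed, uses plain system.time() rather than a benchmarking package, and defaults to a smaller n so it runs quickly (set n to 1e6 to match the 2-million-row test). Timings will vary by machine and data.table version.

```r
library(data.table)

n <- 1e5  # rows per ID; an assumption chosen for a quick run
big <- data.table(ID = c(rep("A", n), rep("B", n)),
                  Quarter = c(1:n, 1:n),
                  value = rnorm(2 * n))
cols <- c("ID", "Quarter")

# Work on copies so every approach starts from the same unconverted columns
d1 <- copy(big); d2 <- copy(big); d3 <- copy(big)

t_lapply <- system.time(d1[, (cols) := lapply(.SD, factor), .SDcols = cols])
t_set    <- system.time(for (col in cols) set(d2, j = col, value = factor(d2[[col]])))
t_loop   <- system.time(for (col in cols) d3[, (col) := factor(d3[[col]])])

# Elapsed seconds for each approach
elapsed <- c(lapply = t_lapply[["elapsed"]],
             set    = t_set[["elapsed"]],
             loop   = t_loop[["elapsed"]])
print(elapsed)
```

copy() is needed here because a plain assignment such as d1 <- big would make all three names point at the same table, and the first conversion would then affect the others.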
Sometimes, you will have to do it a bit differently (for example when numeric values are stored as a factor). Then you have to use something like this:
dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
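The reason for going through as.character() first is that calling as.integer() directly on a factor returns the internal level codes, not the numbers shown in the labels. A small sketch, assuming data.table is installed; the table dt and its column x are made up for illustration:

```r
library(data.table)

# Hypothetical table where numbers ended up stored as a factor
dt <- data.table(x = factor(c("10", "20", "10", "30")))

codes  <- as.integer(dt$x)                # internal level codes: 1 2 1 3
values <- as.integer(as.character(dt$x))  # the numbers themselves: 10 20 10 30

# Convert the column in place, going through character first
num_cols <- "x"
dt[, (num_cols) := lapply(.SD, function(x) as.integer(as.character(x))),
   .SDcols = num_cols]
print(dt$x)
```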
WARNING: The following explanation is not the data.table way of doing things. The data.table is not updated by reference because a copy is made and stored in memory (as pointed out by @Frank), which increases memory usage. It is included mainly to explain how with = FALSE works.
When you want to change the column classes the same way as you would with a data.frame, you have to add with = FALSE as follows:
dat[, cols] <- lapply(dat[, cols, with = FALSE], factor)
A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
If you don't add with = FALSE, data.table will evaluate dat[, cols] as a vector. Compare the output of dat[, cols] and dat[, cols, with = FALSE]:
> dat[, cols]
[1] "ID" "Quarter"
> dat[, cols, with = FALSE]
ID Quarter
1: A 1
2: A 2
3: A 3
4: A 4
5: A 5
6: B 1
7: B 2
8: B 3
9: B 4
10: B 5
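As a side note, column selection by a character vector can be written in other ways besides with = FALSE. The sketch below assumes a reasonably recent data.table release in which the .. prefix is available; it is not part of the original answer and only illustrates that the three forms select the same columns.

```r
library(data.table)

dat <- data.table(ID = c(rep("A", 5), rep("B", 5)),
                  Quarter = c(1:5, 1:5),
                  value = rnorm(10))
cols <- c("ID", "Quarter")

by_with   <- dat[, cols, with = FALSE]   # columns named in cols
by_dotdot <- dat[, ..cols]               # same selection via the .. prefix
by_sdcols <- dat[, .SD, .SDcols = cols]  # same selection via .SDcols
print(names(by_with))
```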
#2
1
You can use .SDcols:
dat[, cols] <- dat[, lapply(.SD, factor), .SDcols=cols]