I have a fairly large data frame in R with two columns. I am trying to create dummy variables from the Code column (a factor with 858 levels). The problem is that RStudio always crashes when I try to do that.
> str(d)
'data.frame': 649226 obs. of 2 variables:
$ User: int 210 210 210 210 269 317 317 317 317 326 ...
$ Code : Factor w/ 858 levels "AA02","AA03",..: 164 494 538 626 464 496 435 464 475 163 ...
The User column is not unique, meaning that several rows can have the same User. It doesn't matter whether the result keeps the same number of rows or whether the rows with the same User are merged into a single row whose non-empty columns hold the counts of each Code.
I found a couple of solutions that work for a smaller dataset, but not for mine.
- Tried using model.matrix, but RStudio just crashes:
m <- model.matrix( ~ Code, data = d)
Found here Automatically expanding an R factor into a collection of 1/0 indicator variables for every factor level
- Tried a for loop with ifelse, but the code ran for 4 hours and then I noticed that RStudio had crashed:
for (t in unique(d$Code)) { d[paste("Code", t, sep = "")] <- ifelse(d$Code == t, 1, 0) }
Found here Create new dummy variable columns from categorical variable
It would be great if you could recommend a method that is fast and works for this kind of data.
Thanks!
1 Answer
#1
1
This worked for me perfectly:
library(reshape2)
m <- acast(data = d, User ~ Code)
The only issue was that it produced NAs instead of 0s, but this can easily be changed with:
m[is.na(m)] <- 0
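As a possible variation on the above, here is a minimal sketch on a small made-up data frame (the values are hypothetical; only the column names User and Code come from the question). It assumes that per-User counts of each Code are the desired result, and asks acast for them explicitly through its fun.aggregate and fill arguments, so absent User/Code pairs come out as 0 rather than NA. The table() line is only a base R cross-check, and the last line shows one way to reduce counts to 0/1 indicators.

```r
library(reshape2)

# Toy stand-in for the real data (hypothetical values, same column names).
d <- data.frame(
  User = c(210, 210, 210, 269, 317, 317),
  Code = factor(c("AA02", "AA03", "AA02", "AA03", "AB01", "AB01"))
)

# fun.aggregate = length counts the rows for each User/Code pair;
# fill = 0 makes absent pairs explicit zeros instead of NA.
m <- acast(d, User ~ Code, value.var = "Code",
           fun.aggregate = length, fill = 0)

# Base R cross-check: table() builds the same User-by-Code count table.
tab <- table(d$User, d$Code)

# If 0/1 indicators are wanted instead of counts, threshold the counts.
ind <- (m > 0) * 1L
```

The result has one row per User and one column per Code level, so on the real data it would be much smaller than a 649226-row dummy matrix. Whether it fits comfortably in memory still depends on the number of distinct users, which the question does not state.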