In R, I have a data table with a column ("prncl_diag") that has values of diagnoses. These diagnosis values (in the prncl_diag column) all appear as columns in the data table as well. There are ~2.5K diagnosis columns of which a subset appear as values in the "prncl_diag" column.
For each row, I want to set the diagnosis indicator column to 1 if that column's name is the value in that row of the "prncl_diag" column.
That isn't explained too well, but here is a minimal working example.
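library(data.table)

dt <- data.table(heart_failure = c(0, 1, 0),
                 kidney_failure = c(1, 0, 0),
                 death = c(1, 1, 1),
                 prncl_diag = c('heart_failure', 'kidney_failure', 'death'))

# Row by row: set the column named in prncl_diag to 1 for that row.
for (i in seq_len(nrow(dt))) {
  name <- dt[i, prncl_diag]
  dt[i, (name) := 1]  # := updates by reference, so no reassignment is needed
}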
This code works and updates row 1 of "heart_failure" to a 1, updates row 2 of "kidney_failure" to a 1, and doesn't change row 3 of "death" column as it is already 1.
However, the code is slow on a data table with 5M rows, and I know I am not taking advantage of the structure of data.table.
Please advise on more efficient solutions. I'm interested in learning about R, data.table, and efficiency from the * community.
3 solutions
#1
2
One option is to subset by unique values in prncl_diag.
for (val in unique(dt$prncl_diag)) {
  dt[prncl_diag == val, (val) := 1]
}
That's the way I would probably go about it, especially if there is a small number of unique values in prncl_diag relative to the number of rows.
Result:
#    heart_failure kidney_failure death     prncl_diag
# 1:             1              1     1  heart_failure
# 2:             1              1     1 kidney_failure
# 3:             0              0     1          death
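For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It swaps the `:=` call for data.table's `set()`, which is my own variation rather than part of the answer above; `set()` skips the `[.data.table` overhead on each iteration, which may matter when there are many distinct diagnoses. The sketch assumes every value in prncl_diag is also a column name, as in the question.

```r
library(data.table)

dt <- data.table(heart_failure  = c(0, 1, 0),
                 kidney_failure = c(1, 0, 0),
                 death          = c(1, 1, 1),
                 prncl_diag     = c("heart_failure", "kidney_failure", "death"))

# One pass per distinct diagnosis rather than one pass per row.
for (val in unique(dt$prncl_diag)) {
  rows <- which(dt$prncl_diag == val)    # rows whose principal diagnosis is `val`
  set(dt, i = rows, j = val, value = 1)  # update column `val` by reference
}

print(dt)
```

The number of loop iterations is bounded by the number of distinct diagnoses (at most ~2.5K in the question), not by the 5M rows.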
#2
1
Here's an answer using the tidyverse:
library(tidyverse)
map_df(1:nrow(dt), ~dt[.x,] %>% mutate_at(vars(.$prncl_diag), function(y) ifelse(y==0,1,y)))
heart_failure kidney_failure death prncl_diag
1 1 1 1 heart_failure
2 1 1 1 kidney_failure
3 0 0 1 death
#3
1
I think this'll achieve what you want.
dt[, .SD
  ][, rID := 1:.N
  ][, melt(.SD, id.vars=c('prncl_diag', 'rID'))
  ][prncl_diag == variable, value := 1
  ][, dcast(.SD, prncl_diag + rID ~ variable, value.var='value')
  ][, rID := NULL
  ][]
prncl_diag heart_failure kidney_failure death
1: death 0 0 1
2: heart_failure 1 1 1
3: kidney_failure 1 1 1
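Note that the output above comes back sorted by prncl_diag, because dcast orders its result by the left-hand side of the formula. If the original row order matters, one possible tweak (my own sketch, not part of the answer) is to put the row id first in the formula. The names `long` and `wide` are just illustrative.

```r
library(data.table)

dt <- data.table(heart_failure  = c(0, 1, 0),
                 kidney_failure = c(1, 0, 0),
                 death          = c(1, 1, 1),
                 prncl_diag     = c("heart_failure", "kidney_failure", "death"))

# Work on a copy so the original table is left untouched.
long <- melt(copy(dt)[, rID := .I], id.vars = c("prncl_diag", "rID"))

# `variable` is a factor after melt, so compare it as character.
long[prncl_diag == as.character(variable), value := 1]

# rID comes first in the formula, so rows return in their original order.
wide <- dcast(long, rID + prncl_diag ~ variable, value.var = "value")
wide[, rID := NULL]

print(wide)
```

Keep in mind that the long table holds one row per cell of the indicator columns, so with ~2.5K columns and 5M rows this approach is likely to be far more memory hungry than looping over the unique diagnoses.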