Say I have (fake) patient data from their visits:
## Create a fake dataframe
foo <- data.frame(PatientNumber=c(11,11,11,22,22,33,33,33,44,55,55),
VisitDate=c("11/03/07","11/03/07","11/20/07","12/20/08",
"12/30/09","09/20/12","09/20/12","10/25/07","05/09/08","06/09/13","06/09/13"),
ICD9=c(10,15,10,30,30,25,60,25,14,40,13))
Which gives:
PatientNumber VisitDate ICD9
1 11 11/03/07 10
2 11 11/03/07 15
3 11 11/20/07 10
4 22 12/20/08 30
5 22 12/30/09 30
6 33 09/20/12 25
7 33 09/20/12 60
8 33 10/25/07 25
9 44 05/09/08 14
10 55 06/09/13 40
11 55 06/09/13 13
I would like to have a unique row for each patient at a given visit date. If the patient has multiple codes for a date, I would like a new column for each ICD code given at that visit. This is what it would look like:
WhatIWant <- data.frame(PatientNumber=c(11,11,22,22,33,33,44,55),
VisitDate=c("11/03/07", "11/20/07", "12/20/08", "12/30/09", "09/20/12","10/25/07","05/09/08","06/09/13"),
ICD9_1=c(10,10,30,30,25,25,14,40),
ICD9_2= c(15,NA,NA,NA,60,NA,NA,13))
> WhatIWant
PatientNumber VisitDate ICD9_1 ICD9_2
1 11 11/03/07 10 15
2 11 11/20/07 10 NA
3 22 12/20/08 30 NA
4 22 12/30/09 30 NA
5 33 09/20/12 25 60
6 33 10/25/07 25 NA
7 44 05/09/08 14 NA
8 55 06/09/13 40 13
I've tried reshape, but it seems to create a separate ICD9 column for each patient and fill it with the code or NA (as shown below). On my real data I would end up with something like 200 columns, whereas I only want 3 (the maximum number of codes per patient per visit in the data set I have, i.e. ICD9_1, ICD9_2, ICD9_3).
test <- reshape(foo, idvar = c("VisitDate"), timevar = c("PatientNumber"), direction = "wide")
> test
VisitDate ICD9.11 ICD9.22 ICD9.33 ICD9.44 ICD9.55
1 0007-11-03 10 NA NA NA NA
3 0007-11-20 10 NA NA NA NA
4 0008-12-20 NA 30 NA NA NA
5 0009-12-30 NA 30 NA NA NA
6 0012-09-20 NA NA 25 NA NA
8 0007-10-25 NA NA 25 NA NA
9 0008-05-09 NA NA NA 14 NA
10 0013-06-09 NA NA NA NA 40
Sorry if the title isn't as specific as it could be, I'm not really sure how to exactly title what I am looking for.
Thanks in advance for your help!
3 Solutions
#1
6
The basic problem for reshape in this case is that it doesn't have a real "time" variable. That's easy to create with ave:
foo$time <- with(foo, ave(rep(1, nrow(foo)),
PatientNumber, VisitDate,
FUN = seq_along))
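To see what that counter looks like on the sample data, here is a minimal self-contained sketch. It rebuilds foo from the question and uses the same column name, time; the values in the final comment were worked out by hand from the sample rows.

```r
# Rebuild the sample data from the question
foo <- data.frame(
  PatientNumber = c(11, 11, 11, 22, 22, 33, 33, 33, 44, 55, 55),
  VisitDate = c("11/03/07", "11/03/07", "11/20/07", "12/20/08", "12/30/09",
                "09/20/12", "09/20/12", "10/25/07", "05/09/08", "06/09/13",
                "06/09/13"),
  ICD9 = c(10, 15, 10, 30, 30, 25, 60, 25, 14, 40, 13)
)

# Number the codes within each patient/visit combination: 1, 2, ...
foo$time <- with(foo, ave(rep(1, nrow(foo)),
                          PatientNumber, VisitDate,
                          FUN = seq_along))

foo$time
# 1 2 1 1 1 1 2 1 1 1 2
```

Rows that share a patient and a visit date get 1, 2, ... in order of appearance, which is exactly the "time" index reshape needs.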
Then, you can use reshape as follows:
reshape(foo, direction = "wide",
idvar=c("PatientNumber", "VisitDate"),
timevar="time")
# PatientNumber VisitDate ICD9.1 ICD9.2
# 1 11 11/03/07 10 15
# 3 11 11/20/07 10 NA
# 4 22 12/20/08 30 NA
# 5 22 12/30/09 30 NA
# 6 33 09/20/12 25 60
# 8 33 10/25/07 25 NA
# 9 44 05/09/08 14 NA
# 10 55 06/09/13 40 13
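If you want the column names to match the ICD9_1, ICD9_2 layout from the question, one option is the sep argument of reshape. The sketch below is self-contained and assumes the same sample data; the object name wide is just an illustrative choice.

```r
# Sample data from the question, plus the within-visit counter
foo <- data.frame(
  PatientNumber = c(11, 11, 11, 22, 22, 33, 33, 33, 44, 55, 55),
  VisitDate = c("11/03/07", "11/03/07", "11/20/07", "12/20/08", "12/30/09",
                "09/20/12", "09/20/12", "10/25/07", "05/09/08", "06/09/13",
                "06/09/13"),
  ICD9 = c(10, 15, 10, 30, 30, 25, 60, 25, 14, 40, 13)
)
foo$time <- with(foo, ave(rep(1, nrow(foo)),
                          PatientNumber, VisitDate,
                          FUN = seq_along))

# sep = "_" makes the wide columns ICD9_1, ICD9_2 instead of ICD9.1, ICD9.2
wide <- reshape(foo, direction = "wide",
                idvar = c("PatientNumber", "VisitDate"),
                timevar = "time", sep = "_")

# Drop the leftover row labels (1, 3, 4, ...) inherited from foo
rownames(wide) <- NULL

names(wide)
# "PatientNumber" "VisitDate" "ICD9_1" "ICD9_2"
```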
Of course, once you have that "time" variable, you can also use dcast from "reshape2".
library(reshape2)
dcast(foo, PatientNumber + VisitDate ~ time, value.var="ICD9")
#2
7
Also,
library(dplyr)
library(tidyr) # See below on how to get tidyr
foo %>%
group_by(PatientNumber, VisitDate) %>%
mutate(n=paste("ICD9",row_number(), sep="_")) %>%
spread(n, ICD9)
#Source: local data frame [8 x 4]
# PatientNumber VisitDate ICD9_1 ICD9_2
#1 11 11/03/07 10 15
#2 11 11/20/07 10 NA
#3 22 12/20/08 30 NA
#4 22 12/30/09 30 NA
#5 33 09/20/12 25 60
#6 33 10/25/07 25 NA
#7 44 05/09/08 14 NA
#8 55 06/09/13 40 13
Package tidyr was not available on CRAN when this answer was written (it has since been released there, so install.packages("tidyr") also works). To install the development version (see tidyr git):
# install.packages("devtools")
devtools::install_github("hadley/tidyr")
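In current tidyr releases, spread is superseded by pivot_wider. The following is a sketch of the same idea with that newer function, not the code from this answer; it assumes dplyr and tidyr (version 1.0.0 or later) are installed, and the helper column n and the object name wide are illustrative.

```r
library(dplyr)
library(tidyr)  # assumes tidyr >= 1.0.0, which provides pivot_wider()

foo <- data.frame(
  PatientNumber = c(11, 11, 11, 22, 22, 33, 33, 33, 44, 55, 55),
  VisitDate = c("11/03/07", "11/03/07", "11/20/07", "12/20/08", "12/30/09",
                "09/20/12", "09/20/12", "10/25/07", "05/09/08", "06/09/13",
                "06/09/13"),
  ICD9 = c(10, 15, 10, 30, 30, 25, 60, 25, 14, 40, 13)
)

wide <- foo %>%
  group_by(PatientNumber, VisitDate) %>%
  mutate(n = row_number()) %>%          # 1, 2, ... within each visit
  ungroup() %>%
  pivot_wider(names_from = n,           # one new column per counter value
              values_from = ICD9,
              names_prefix = "ICD9_")   # columns become ICD9_1, ICD9_2

names(wide)
# "PatientNumber" "VisitDate" "ICD9_1" "ICD9_2"
```

Visits with only one code get NA in ICD9_2, as in the spread output above.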
#3
2
You could use aggregate:
max_visits = 2
aggregate(ICD9 ~ PatientNumber + VisitDate, foo,
function(x) x[seq_len(max_visits)]) #note that output is 3 columns
# PatientNumber VisitDate ICD9.1 ICD9.2
#1 44 05/09/08 14 NA
#2 55 06/09/13 40 13
#3 33 09/20/12 25 60
#4 33 10/25/07 25 NA
#5 11 11/03/07 10 15
#6 11 11/20/07 10 NA
#7 22 12/20/08 30 NA
#8 22 12/30/09 30 NA
If you don't know the maximum possible visits ("max_visits"), you could compute it from the data:
max_visits = max(ave(foo[["ICD9"]],
foo[["PatientNumber"]], foo[["VisitDate"]],
FUN = length))
max_visits
#[1] 2
EDIT:
As noted by @AnandaMahto in the comments, you could turn your 3-column aggregated "foo" (say "aggfoo", the result of the aggregate call above) into 4 columns with something like:
dim(aggfoo)
#[1] 8 3
dim(do.call(data.frame, aggfoo))
#[1] 8 4
dim(data.frame(unclass(aggfoo)))
#[1] 8 4
That's not necessary, though, as even with 3 columns it's still convenient to call each "ICD9" column: aggfoo$ICD9[, 1] and aggfoo$ICD9[, 2] instead of aggfoo$ICD9.1 and aggfoo$ICD9.2.
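For anyone puzzled by the 3 versus 4 columns, here is a small self-contained sketch of what is going on. It assumes the sample data from the question; aggfoo is the name suggested above, and flat is a hypothetical name for the flattened copy.

```r
foo <- data.frame(
  PatientNumber = c(11, 11, 11, 22, 22, 33, 33, 33, 44, 55, 55),
  VisitDate = c("11/03/07", "11/03/07", "11/20/07", "12/20/08", "12/30/09",
                "09/20/12", "09/20/12", "10/25/07", "05/09/08", "06/09/13",
                "06/09/13"),
  ICD9 = c(10, 15, 10, 30, 30, 25, 60, 25, 14, 40, 13)
)

max_visits <- 2

# Each group returns a length-2 vector (padded with NA), so aggregate()
# stores the result as a single matrix column named ICD9
aggfoo <- aggregate(ICD9 ~ PatientNumber + VisitDate, foo,
                    function(x) x[seq_len(max_visits)])

ncol(aggfoo)            # 3: the third column is itself a matrix
is.matrix(aggfoo$ICD9)  # TRUE

# Rebuilding the data frame splits the matrix into ordinary columns
flat <- do.call(data.frame, aggfoo)
names(flat)
# "PatientNumber" "VisitDate" "ICD9.1" "ICD9.2"
```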