在R中使用整形()函数,从宽到长。

时间:2022-01-29 04:28:43

I'm trying to re-arrange the data in R from something like:

我试着重新排列R中的数据

Patient ID,Episode Number,Admission Date (A),Admission Date (H),Admission Time (A),Admission Time (H)
1,5,20/08/2011,21/08/2011,1200,1300
2,6,21/08/2011,22/08/2011,1300,1400
3,7,22/08/2011,23/08/2011,1400,1500
4,8,23/08/2011,24/08/2011,1500,1600

to something like:

类似的:

Record Type,Patient ID,Episode Number,Admission Date,Admission Time
H,1,5,20/08/2011,1200
A,1,5,21/08/2011,1300
H,2,6,21/08/2011,1300
A,2,6,22/08/2011,1400
H,3,7,22/08/2011,1400
A,3,7,23/08/2011,1500
H,4,8,23/08/2011,1500
A,4,8,24/08/2011,1600

(I've used CSV format so it's easier to to use them as test data).

(我使用了CSV格式,因此更容易将其用作测试数据)。

I tried the reshape() function and it kind of worked:

我尝试了整形()功能,它的作用是:

> reshape(foo, direction = "long", idvar = 1, varying = 3:dim(foo)[2], 
> sep = "..", timevar = "dataset")
     Patient.ID Episode.Number dataset Admission.Date Admission.Time
1.A.          1              5      A.     20/08/2011           1200
2.A.          2              6      A.     21/08/2011           1300
3.A.          3              7      A.     22/08/2011           1400
4.A.          4              8      A.     23/08/2011           1500
1.H.          1              5      H.     21/08/2011           1300
2.H.          2              6      H.     22/08/2011           1400
3.H.          3              7      H.     23/08/2011           1500
4.H.          4              8      H.     24/08/2011           1600

But it's not in the exact format I wanted (I want for each "Patient ID", the first row is "H" and second row is "A").

但它不是我想要的格式(我想要每个“病人ID”,第一行是“H”,第二行是“A”)。

In addition, when I extend this to the read data (which has 250+ columns) it failed:

此外,当我将其扩展到读取数据(有250+列)时,它失败了:

> reshape(realdata, direction = "long", idvar = 1, varying = 
> 6:dim(foo)[2], sep = "..", timevar = "dataset")
Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying,  : 
  'varying' arguments must be the same length

I think partly because the colnames look like:

我认为部分原因是,colname是这样的:

> colnames(foo)
  [1] "Unique.Key"                                    
  [2] "Campus.Code"                                   
  [3] "UR"                                            
  [4] "Terminal.digit"                                
  [5] "Admission.date..A."                      
  [6] "Admission.date..H."                     
  [7] "Admission.time..A."                      
  [8] "Admission.time..H."     
  .
  .
  .
 [31] "Medicare.Number"                               
 [32] "Payor"                                         
 [33] "Doctor.specialty"                              
 [34] "Clinic"     
  .
  .
  .
 [202] "Admission.Source..A."                    
 [203] "Admission.Source..H."  

i.e. there are "common columns" (without suffixes) between columns with suffixes (hope this makes sense).

也就是说,在有后缀的列之间有“公共的列”(没有后缀)(希望这是有意义的)。

2 个解决方案

#1


1  

The suggestions to use melt and cast (now dcast and family) from the "reshape" (now "reshape2") package won't get you to the form you are looking for with your data. In particular, you'll need to do some additional processing if your end goal is the "semi-long" format you describe.

从“重塑”(现在的“重塑”)软件包中使用熔体和cast(现在是dcast和family)的建议不会让你得到你想要的数据。特别是,如果您的最终目标是您所描述的“半长”格式,您将需要进行一些额外的处理。

There are two problems you raise in your question:

在你的问题中,你提出了两个问题:

First is the ordering of the results. As @RichieCotton points out in his comment and @mac in his answer, a call to order() is enough to solve that problem.

首先是结果的排序。正如@RichieCotton在他的评论和@mac的回答中指出的那样,调用order()就足以解决这个问题。

Second is the error:

第二是错误的:

Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, :
  'varying' arguments must be the same length

This is because, as you guessed, there are non-varying columns in your list of varying = 6:dim(foo)[2] selection.

这是因为,正如您所猜测的,在您的列表中有一些没有变化的列:dim(foo)[2]选择。

One easy way to fix this is to use grep to identify which columns are varying and use that to specify your columns instead of using an (incorrect) catchall as you did. Here's a worked example:

解决这个问题的一个简单方法是使用grep来识别哪些列是不同的,并使用它来指定您的列,而不是像您那样使用(不正确的)集。这里有一个例子:

set.seed(1)
foo <- data.frame(Unique.Key = 1:4, Campus.Code = LETTERS[1:4], 
                  Admission.Date..A = 11:14, Admission.Date..H = 21:24,
                  Medicare.Number = letters[1:4], Payor = letters[1:4],
                  Admission.Source..A = rnorm(4), 
                  Admission.Source..H = rnorm(4))
foo
#   Unique.Key Campus.Code Admission.Date..A Admission.Date..H Medicare.Number
# 1          1           A                11                21               a
# 2          2           B                12                22               b
# 3          3           C                13                23               c
# 4          4           D                14                24               d
#   Payor Admission.Source..A Admission.Source..H
# 1     a          -0.6264538           0.3295078
# 2     b           0.1836433          -0.8204684
# 3     c          -0.8356286           0.4874291
# 4     d           1.5952808           0.7383247

Figure out which columns are varying and use that as your varying argument:

找出哪些列是变化的,并将其作为你的不同参数:

varyingCols <- grep("\\.\\.A$|\\.\\.H$", names(foo))

out <- reshape(foo, direction = "long", idvar = "Unique.Key", 
               varying = varyingCols, sep = "..")
out[order(out$Unique.Key, rev(out$time)), ]
#     Unique.Key Campus.Code Medicare.Number Payor time Admission.Date Admission.Source
# 1.H          1           A               a     a    H             21        0.3295078
# 1.A          1           A               a     a    A             11       -0.6264538
# 2.H          2           B               b     b    H             22       -0.8204684
# 2.A          2           B               b     b    A             12        0.1836433
# 3.H          3           C               c     c    H             23        0.4874291
# 3.A          3           C               c     c    A             13       -0.8356286
# 4.H          4           D               d     d    H             24        0.7383247
# 4.A          4           D               d     d    A             14        1.5952808

If your data are small (not many columns), you can manually count the location of the varying columns and specify the vector. As you've already noticed, any columns not specified in idvar or varying get recycled appropriately.

如果您的数据很小(不是很多列),您可以手动计算不同列的位置并指定向量。正如您已经注意到的,在idvar中没有指定的任何列都可以适当地回收。

out <- reshape(foo, direction = "long", idvar = "Unique.Key", 
               varying = c(3, 4, 7, 8), sep = "..")

#2


0  

You might be able to get what you're after by using melt and cast or reshape, but you're looking for something pretty specific, so it might be simpler to do the reshaping directly. You can subset the original data into two separate data frames, (one for A, one for H) then glue them back together.

你也许可以通过使用熔体、石膏或整形来得到你想要的东西,但你需要的是一些非常具体的东西,所以直接进行整形可能会更简单。您可以将原始数据集中到两个单独的数据帧中(一个用于一个,一个用于H),然后将它们粘在一起。

The code below works on your sample data, but i also tried to write it flexibly enough so that it will hopefully work on your larger data set, as long as the columns are named consistently with the ..A. and ..H. suffixes.

下面的代码在您的示例数据上工作,但是我也尝试灵活地编写它,以便它能够在您的更大的数据集上工作,只要这些列始终以..A命名。和. . H。后缀。

#grab the common columns and the "A" columns
#(by using grepl to find any column that doesn't end in ".H.")
foo.a <- foo[,!grepl(x=colnames(foo),pattern = "\\.H\\.$")]

#strip the "..A." from the end of the ".A." column names
colnames(foo.a) <- sub(x=colnames(foo.a),
                   pattern="(.*)\\.\\.A\\.$",
                   rep = "\\1")
foo.a$Record.Type <- "A"

#grab the common columns and the "H" columns
#(by using grepl to find any column that doesn't end in ".A.")
foo.h <- foo[,!grepl(x=colnames(foo),pattern = "\\.A\\.$")]

#strip the "..H." from the end of the "..H." column names
colnames(foo.h) <- sub(x=colnames(foo.h),
                   pattern="(.*)\\.\\.H\\.$",
                   rep = "\\1")
foo.h$Record.Type <- "H"

#stick them back together
new.foo <- rbind(foo.a,foo.h)

#order by Patient.ID
new.foo <- new.foo[with(new.foo,order(Patient.ID)),]

#re-order the columns as you like
new.foo <- new.foo[,c(1,2,5,3,4)]

This gives me:

这给我:

> new.foo  
   Patient.ID Episode.Number Record.Type Admission.Date Admission.Time  
1          1              5           A     20/08/2011           1200  
5          1              5           H     21/08/2011           1300  
2          2              6           A     21/08/2011           1300  
6          2              6           H     22/08/2011           1400  
3          3              7           A     22/08/2011           1400  
7          3              7           H     23/08/2011           1500  
4          4              8           A     23/08/2011           1500  
8          4              8           H     24/08/2011           1600  

#1


1  

The suggestions to use melt and cast (now dcast and family) from the "reshape" (now "reshape2") package won't get you to the form you are looking for with your data. In particular, you'll need to do some additional processing if your end goal is the "semi-long" format you describe.

从“重塑”(现在的“重塑”)软件包中使用熔体和cast(现在是dcast和family)的建议不会让你得到你想要的数据。特别是,如果您的最终目标是您所描述的“半长”格式,您将需要进行一些额外的处理。

There are two problems you raise in your question:

在你的问题中,你提出了两个问题:

First is the ordering of the results. As @RichieCotton points out in his comment and @mac in his answer, a call to order() is enough to solve that problem.

首先是结果的排序。正如@RichieCotton在他的评论和@mac的回答中指出的那样,调用order()就足以解决这个问题。

Second is the error:

第二是错误的:

Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, :
  'varying' arguments must be the same length

This is because, as you guessed, there are non-varying columns in your list of varying = 6:dim(foo)[2] selection.

这是因为,正如您所猜测的,在您的列表中有一些没有变化的列:dim(foo)[2]选择。

One easy way to fix this is to use grep to identify which columns are varying and use that to specify your columns instead of using an (incorrect) catchall as you did. Here's a worked example:

解决这个问题的一个简单方法是使用grep来识别哪些列是不同的,并使用它来指定您的列,而不是像您那样使用(不正确的)集。这里有一个例子:

set.seed(1)
foo <- data.frame(Unique.Key = 1:4, Campus.Code = LETTERS[1:4], 
                  Admission.Date..A = 11:14, Admission.Date..H = 21:24,
                  Medicare.Number = letters[1:4], Payor = letters[1:4],
                  Admission.Source..A = rnorm(4), 
                  Admission.Source..H = rnorm(4))
foo
#   Unique.Key Campus.Code Admission.Date..A Admission.Date..H Medicare.Number
# 1          1           A                11                21               a
# 2          2           B                12                22               b
# 3          3           C                13                23               c
# 4          4           D                14                24               d
#   Payor Admission.Source..A Admission.Source..H
# 1     a          -0.6264538           0.3295078
# 2     b           0.1836433          -0.8204684
# 3     c          -0.8356286           0.4874291
# 4     d           1.5952808           0.7383247

Figure out which columns are varying and use that as your varying argument:

找出哪些列是变化的,并将其作为你的不同参数:

varyingCols <- grep("\\.\\.A$|\\.\\.H$", names(foo))

out <- reshape(foo, direction = "long", idvar = "Unique.Key", 
               varying = varyingCols, sep = "..")
out[order(out$Unique.Key, rev(out$time)), ]
#     Unique.Key Campus.Code Medicare.Number Payor time Admission.Date Admission.Source
# 1.H          1           A               a     a    H             21        0.3295078
# 1.A          1           A               a     a    A             11       -0.6264538
# 2.H          2           B               b     b    H             22       -0.8204684
# 2.A          2           B               b     b    A             12        0.1836433
# 3.H          3           C               c     c    H             23        0.4874291
# 3.A          3           C               c     c    A             13       -0.8356286
# 4.H          4           D               d     d    H             24        0.7383247
# 4.A          4           D               d     d    A             14        1.5952808

If your data are small (not many columns), you can manually count the location of the varying columns and specify the vector. As you've already noticed, any columns not specified in idvar or varying get recycled appropriately.

如果您的数据很小(不是很多列),您可以手动计算不同列的位置并指定向量。正如您已经注意到的,在idvar中没有指定的任何列都可以适当地回收。

out <- reshape(foo, direction = "long", idvar = "Unique.Key", 
               varying = c(3, 4, 7, 8), sep = "..")

#2


0  

You might be able to get what you're after by using melt and cast or reshape, but you're looking for something pretty specific, so it might be simpler to do the reshaping directly. You can subset the original data into two separate data frames, (one for A, one for H) then glue them back together.

你也许可以通过使用熔体、石膏或整形来得到你想要的东西,但你需要的是一些非常具体的东西,所以直接进行整形可能会更简单。您可以将原始数据集中到两个单独的数据帧中(一个用于一个,一个用于H),然后将它们粘在一起。

The code below works on your sample data, but i also tried to write it flexibly enough so that it will hopefully work on your larger data set, as long as the columns are named consistently with the ..A. and ..H. suffixes.

下面的代码在您的示例数据上工作,但是我也尝试灵活地编写它,以便它能够在您的更大的数据集上工作,只要这些列始终以..A命名。和. . H。后缀。

#grab the common columns and the "A" columns
#(by using grepl to find any column that doesn't end in ".H.")
foo.a <- foo[,!grepl(x=colnames(foo),pattern = "\\.H\\.$")]

#strip the "..A." from the end of the ".A." column names
colnames(foo.a) <- sub(x=colnames(foo.a),
                   pattern="(.*)\\.\\.A\\.$",
                   rep = "\\1")
foo.a$Record.Type <- "A"

#grab the common columns and the "H" columns
#(by using grepl to find any column that doesn't end in ".A.")
foo.h <- foo[,!grepl(x=colnames(foo),pattern = "\\.A\\.$")]

#strip the "..H." from the end of the "..H." column names
colnames(foo.h) <- sub(x=colnames(foo.h),
                   pattern="(.*)\\.\\.H\\.$",
                   rep = "\\1")
foo.h$Record.Type <- "H"

#stick them back together
new.foo <- rbind(foo.a,foo.h)

#order by Patient.ID
new.foo <- new.foo[with(new.foo,order(Patient.ID)),]

#re-order the columns as you like
new.foo <- new.foo[,c(1,2,5,3,4)]

This gives me:

这给我:

> new.foo  
   Patient.ID Episode.Number Record.Type Admission.Date Admission.Time  
1          1              5           A     20/08/2011           1200  
5          1              5           H     21/08/2011           1300  
2          2              6           A     21/08/2011           1300  
6          2              6           H     22/08/2011           1400  
3          3              7           A     22/08/2011           1400  
7          3              7           H     23/08/2011           1500  
4          4              8           A     23/08/2011           1500  
8          4              8           H     24/08/2011           1600