R: reshaping data with an unknown number of columns

Time: 2021-08-23 19:20:02

I'm sure this is trivial but I can't find how to do it.

I have a data frame of individuals, each of which can have several properties, and each property is classified in one of a number of ways. Currently it's in long format, with records looking like this (in schematic form; the real data is a little more complicated):

IndividualID Property PropClass 
1            X         A 
1            Y         B 
2            X         A 
3            Y         B
3            W         C
3            Z         A

What I want is one row for each individual ID, with the individual ID and then pairs of columns for each property and PropClass that that individual has on the original file, so in this case:

 IndividualID  Prop1   PropClass1 Prop2  PropClass2  Prop3  PropClass3
 1             X       A          Y      B           NA     NA
 2             X       A          NA     NA          NA     NA
 3             Y       B          W      C           Z      A

So there have to be as many Prop and PropClass variables as the maximum number of rows for any individualID in the original data set (which is not large, about 5), and where an individual has fewer rows in the original dataset than that maximum number, the extra columns that don't mean anything for that individual have NAs in them. The order of the Prop and PropClass variables for an individual doesn't matter (though it may as well be the original order on the long format file).
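As a small sketch of that requirement, the number of Prop/PropClass column pairs is simply the largest row count for any one individual. The data frame name `x` and the variable names `rows_per_id` and `n_pairs` are assumptions made here for illustration; the data is the schematic example above.

```r
# Schematic example data from the question, in long format.
x <- data.frame(
  IndividualID = c(1, 1, 2, 3, 3, 3),
  Property     = c("X", "Y", "X", "Y", "W", "Z"),
  PropClass    = c("A", "B", "A", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# Count the rows belonging to each individual.
rows_per_id <- table(x$IndividualID)

# The largest count is the number of Prop/PropClass pairs the wide table needs.
n_pairs <- max(rows_per_id)
n_pairs
```

For the example data this gives 3, matching the Prop1..Prop3 layout shown above.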

Obviously it's easy to do this (e.g. using reshape) if you have one pair of Prop and PropClass columns for every possible value of Prop, but there are several hundred possible values of Prop, so the file gets huge and unhelpful. I can't believe there is not a simple way to do what I want, but I haven't found it despite what seems to me to be assiduous searching. Please tell me I'm being an idiot, and if so, how I might cure my idiocy.

2 Solutions

#1


There's probably a more efficient way to do this, but I can't think of it right now. With two variables that need to be transformed into wide format, I think you may need to cast them separately and then merge the two together. I'd love to be proved wrong though. To do this, I create two new variables which generate a column sequence for each new ID. This will allow them to be filled with NAs easily. With the new columns, it's pretty easy to cast them into the right format and merge them together.

library(plyr)
library(reshape2)

# Assumes your data is read into a data frame named x.
# Number the rows within each individual to build the target column names.
x <- ddply(x, "IndividualID", transform,
      castPropClass = paste0("PropClass", seq_along(PropClass)),
      castProp = paste0("Prop", seq_along(Property)))

# Use these two new variables to cast into wide format. Wrap in merge to join together:
merge(dcast(IndividualID ~ castPropClass, value.var = "PropClass", data = x),
      dcast(IndividualID ~ castProp,      value.var = "Property",  data = x))
# Gives you this:
  IndividualID PropClass1 PropClass2 PropClass3 Prop1 Prop2 Prop3
1            1          A          B       <NA>     X     Y  <NA>
2            2          A       <NA>       <NA>     X  <NA>  <NA>
3            3          B          C          A     Y     W     Z

This obviously doesn't have the right "order" of columns, but the data itself is right.
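A possible alternative to casting twice and merging is base R's `reshape()`, which can spread both value columns in one call once each row is numbered within its individual. This is only a sketch, not part of the original answer: the helper column `idx` is a name chosen here for illustration, and the wide columns come out named `Property1`/`PropClass1` and so on rather than `Prop1`/`PropClass1`.

```r
# Schematic example data from the question, in long format.
x <- data.frame(
  IndividualID = c(1, 1, 2, 3, 3, 3),
  Property     = c("X", "Y", "X", "Y", "W", "Z"),
  PropClass    = c("A", "B", "A", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# Number the rows within each individual (1, 2, ...), keeping the original order.
# idx is a hypothetical helper column introduced for this sketch.
x$idx <- ave(seq_along(x$IndividualID), x$IndividualID, FUN = seq_along)

# Spread Property and PropClass together: one row per IndividualID,
# one Property/PropClass pair per idx value, NA where an individual has fewer rows.
wide <- reshape(x, idvar = "IndividualID", timevar = "idx",
                direction = "wide", sep = "")
wide
```

Because `reshape()` emits the value columns grouped by `idx`, the pairs should appear side by side (Property1, PropClass1, Property2, PropClass2, ...), which is the column order the question asks for.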

#2


Would something like this be acceptable?

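library(reshape)

# Example data from the question
test.dt <- data.frame(id = c(1, 1, 2, 3, 3, 3),
                      property = c("X", "Y", "X", "Y", "W", "Z"),
                      property.clss = c("A", "B", "A", "B", "C", "A"))

# Melt each variable separately, keeping id as the identifier
m <- melt(data = test.dt, id.vars = "id", measure.vars = c("property.clss"))
m
n <- melt(data = test.dt, id.vars = "id", measure.vars = c("property"))
n

# Cast each to wide: one column per distinct value.
# m holds the classes and n holds the properties, so prefix the columns accordingly.
c1 <- data.frame(cast(m, id ~ value))
colnames(c1) <- c("id", paste("property.clss", colnames(c1)[colnames(c1) != "id"], sep = ""))
c1
c2 <- data.frame(cast(n, id ~ value))
colnames(c2) <- c("id", paste("property", colnames(c2)[colnames(c2) != "id"], sep = ""))
c2

# Join the two wide tables on id
merge(c1, c2, by = "id")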
