What I need:
I have a huge data frame with the following columns (and some more, but these are not important). Here's an example:
  user_id video_id group_id   x   y
1       1        0        0  39 108
2       1        0        0  39 108
3       1       10        0 135 180
4       2        0        0  20 123
User, video and group IDs are factors, of course. For example, there are 20 videos, but each of them has several "observations" for each user and group.
I'd like to transform this data frame into the following format, with one pair of columns x.N, y.N for each user N.
video_id x.1 y.1 x.2 y.2 …
       0  39 108  20 123
So, for video 0, the x and y values from user 1 are in columns x.1 and y.1, respectively. For user 2, their values are in columns x.2, y.2, and so on.
What I've tried:
I built a list of data frames, one per user, each containing only the unique x, y observations for each video_id:
library(plyr)  # provides dlply() and .()
summaryList <- dlply(allData, .(user_id), function(x) unique(x[c("video_id", "x", "y")]))
This is what it looks like:
List of 15
$ 1 :'data.frame': 20 obs. of 3 variables:
..$ video_id: Factor w/ 20 levels "0","1","2","3",..: 1 11 8 5 12 9 20 13 7 10 ...
..$ x : int [1:20] 39 135 86 122 28 167 203 433 549 490 ...
..$ y : int [1:20] 108 180 164 103 187 128 185 355 360 368 ...
$ 2 :'data.frame': 20 obs. of 3 variables:
..$ video_id: Factor w/ 20 levels "0","1","2","3",..: 2 14 15 4 20 6 19 3 13 18 ...
..$ x : int [1:20] 128 688 435 218 528 362 299 134 83 417 ...
..$ y : int [1:20] 165 117 135 179 96 328 332 563 623 476 ...
Where I'm stuck:
What's left to do is:
- Merge all the data frames in summaryList with each other, based on video_id. I can't find a nice way to access the actual data frames in the list, which are summaryList[1]$`1`, summaryList[2]$`2`, et cetera.
@James found a partial solution:
Reduce(function(x,y) merge(x,y,by="video_id"),summaryList)
- Ensure the columns are named after the user ID rather than kept as-is. Right now my summaryList doesn't contain any info about the user ID, and the output of Reduce has duplicate column names like x.x y.x x.y y.y x.x y.x and so on.
How do I go about doing this? Or is there any easier way to get to the result than what I'm currently doing?
2 Answers
#1
3
Reduce does the trick:
reducedData <- Reduce(function(x,y) merge(x,y,by="video_id"),summaryList)
… but you need to fix the names afterwards:
names(reducedData)[-1] <- do.call(function(...) paste(...,sep="."),expand.grid(letters[24:25],names(summaryList)))
The result is:
video_id x.1 y.1 x.2 y.2 x.3 y.3 x.4 y.4 x.5 y.5 x.6 y.6 x.7 y.7 x.8
1 0 39 108 899 132 61 357 149 298 1105 415 148 208 442 200 210
2 1 1125 70 128 165 1151 390 171 587 623 623 80 643 866 310 994
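A possible variation on the above: renaming x and y per user before merging means merge() never has to invent its own suffixes, so no repair step is needed afterwards. This is only a sketch in base R. The toy allData values and the names perUser, renamed and wide are made up for illustration, and split() stands in for dlply().

```r
# Toy stand-in for allData; the values are illustrative, not the real data.
allData <- data.frame(
  user_id  = factor(c(1, 1, 2, 2)),
  video_id = factor(c(0, 1, 0, 1)),
  x = c(39, 135, 20, 128),
  y = c(108, 180, 123, 165)
)

# One data frame per user; the list names are the user IDs.
perUser <- split(allData[c("video_id", "x", "y")], allData$user_id)

# Rename x and y to x.<id> and y.<id> before merging.
renamed <- Map(function(df, id) {
  names(df)[names(df) != "video_id"] <- paste(c("x", "y"), id, sep = ".")
  unique(df)
}, perUser, names(perUser))

# Merge all per-user frames on video_id.
wide <- Reduce(function(a, b) merge(a, b, by = "video_id"), renamed)
print(wide)
```

Because the list is named by user ID, [[ ]] (for example perUser[["1"]]) returns the data frame itself, whereas [ ] returns a one-element list.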
#2
4
I am still somewhat confused. However, I guess you simply want to melt and dcast.
library(reshape2)
d <- melt(allData,id.vars=c("user_id","video_id"), measure.vars=c("x","y"))
dcast(d,video_id~user_id+variable,value.var="value",fun.aggregate=mean)
Resulting in:
video_id 1_x 1_y 2_x 2_y 3_x 3_y 4_x 4_y 5_x 5_y 6_x 6_y 7_x 7_y 8_x 8_y 9_x 9_y 10_x 10_y 11_x 11_y 12_x 12_y 14_x 14_y 15_x 15_y 16_x 16_y
1 0 39 108 899 132 61 357 149 298 1105 415 148 208 442 200 210 134 58 244 910 403 152 52 1092 617 1012 114 1105 424 548 394
2 1 1125 70 128 165 1151 390 171 587 623 623 80 643 866 310 994 114 854 129 781 306 672 -1 1096 354 525 524 150
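If the x.N / y.N naming from the question matters, base R's reshape() might be enough on its own, since its wide format appends the timevar value after a "." separator by default. A minimal sketch, assuming each user has at most one distinct (x, y) pair per video; the toy data and the names obs and wide are illustrative only.

```r
# Toy stand-in for allData, including one duplicated observation.
allData <- data.frame(
  user_id  = factor(c(1, 1, 1, 2, 2)),
  video_id = factor(c(0, 0, 1, 0, 1)),
  x = c(39, 39, 135, 20, 128),
  y = c(108, 108, 180, 123, 165)
)

# Drop repeated observations so each (video, user) pair occurs once.
obs <- unique(allData[c("user_id", "video_id", "x", "y")])

# One row per video_id; one x/y column pair per user_id.
wide <- reshape(obs,
                idvar     = "video_id",
                timevar   = "user_id",
                v.names   = c("x", "y"),
                direction = "wide")
print(wide)
```

Unlike the dcast() call with fun.aggregate=mean, this does not average repeated values; if a user has several different observations for one video, reshape() keeps the first and warns.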