ffbase provides the function ffdfdply to split and aggregate data rows. This answer (https://*.com/a/20954315/336311) explains how that basically works. I still cannot figure out how to split by multiple columns.
My challenge is that a split variable is required, and it must be unique for each combination of the two variables I'd like to split by. In my 4-column data frame (about 50M rows), creating such a character vector with paste() would require a lot of memory.
This is where I got stuck...
require("ff")
require("ffbase")
load.ffdf(dir="ffdf.shares.02")

# Aggregation by articleID/measure
levels(ffshares$measure) # "comments", "likes", "shares", "totals", "tw"
splitBy = paste(as.character(ffshares$articleID), ffshares$measure, sep="")
tmp = ffdfdply(ffshares, split=splitBy, FUN=function(x) {
  return(list(
    "articleID" = x[1,"articleID"],
    "measure" = x[1,"measure"],
    # I need vectors for each entry
    "sx" = unlist(x$value),
    "st" = unlist(x$time)
  ))
})
Of course, I could use shorter levels for ffshares$measure or simply use the numeric codes, but this still won't solve the underlying problem that splitBy grows enormously large.
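For illustration, the numeric-code variant could look like the sketch below. It runs on a small in-memory data frame rather than on the ffdf, and the names df, n_measures and key are made up for this example. The key is an integer (4 bytes per row), but it still has one element per row, so for about 50M rows it would still take roughly 200 MB if built in RAM in one go.

```r
# Toy stand-in for the ffdf; the measure levels are the ones listed above
df <- data.frame(
  articleID = c(41L, 41L, 41L, 42L, 42L),
  measure   = factor(c("shares", "tw", "shares", "shares", "likes"),
                     levels = c("comments", "likes", "shares", "totals", "tw"))
)

# Combine both columns into one integer key: each articleID gets its own
# block of n_measures consecutive codes, one per measure level
n_measures <- nlevels(df$measure)
key <- df$articleID * n_measures + (as.integer(df$measure) - 1L)

# Rows with the same articleID/measure pair share a key,
# and different pairs never collide
length(unique(key)) == nrow(unique(df))
```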
Sample Data
articleID measure time value
100 41 shares 2015-01-03 23:20:34 4
101 41 tw 2015-01-03 23:30:30 24
102 41 totals 2015-01-03 23:30:38 6
103 41 likes 2015-01-03 23:30:38 2
104 41 comments 2015-01-03 23:30:38 0
105 41 shares 2015-01-03 23:30:38 4
106 41 tw 2015-01-03 23:40:24 24
107 41 totals 2015-01-03 23:40:35 6
108 41 likes 2015-01-03 23:40:35 2
...
1000 42 shares 2015-01-04 20:10:50 0
1001 42 tw 2015-01-04 21:10:45 24
1002 42 totals 2015-01-04 21:10:35 0
1003 42 likes 2015-01-04 21:10:35 0
1004 42 comments 2015-01-04 21:10:35 0
1005 42 shares 2015-01-04 21:10:35 0
1006 42 tw 2015-01-04 22:10:45 24
1007 42 totals 2015-01-04 22:10:43 0
1008 42 likes 2015-01-04 22:10:43 0
...
1 Answer
# Use this; it makes sure your data does not get into RAM completely
# but only in chunks of 100000 records
ffshares$splitBy <- with(ffshares[c("articleID", "measure")],
                         paste(articleID, measure, sep=""),
                         by = 100000)
length(levels(ffshares$splitBy)) ## how many levels are in there - don't know from your question
tmp <- ffdfdply(ffshares, split=ffshares$splitBy, FUN=function(x) {
  ## In x you are getting a data.frame in RAM with all records of
  ## possibly several articleID/measure combinations
  ## You should write a function which returns a data.frame. E.g. the following
  ## returns the mean value by articleID/measure and the first and last timepoint
  x <- data.table::setDT(x)
  xagg <- x[, list(value = mean(value),
                   first.timepoint = min(time),
                   last.timepoint = max(time)), by = list(articleID, measure)]
  ## the function should return a data frame as indicated in the help of ffdfdply, not a list
  data.table::setDF(xagg)
})
## tmp is an ffdf
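To make the point about FUN concrete, here is a minimal base R sketch of the per-chunk step, without ff or data.table. The chunk below is a hand-made data.frame using a few rows from the sample data, and agg_chunk is a hypothetical name for this example. It only illustrates that a chunk can hold several articleID/measure combinations and that the function has to group them itself and return a data.frame.

```r
# A small in-RAM chunk of the kind FUN might receive:
# it holds more than one articleID/measure combination
chunk <- data.frame(
  articleID = c(41L, 41L, 41L, 41L, 42L, 42L),
  measure   = factor(c("shares", "tw", "shares", "tw", "shares", "shares")),
  time      = as.POSIXct(c("2015-01-03 23:20:34", "2015-01-03 23:30:30",
                           "2015-01-03 23:30:38", "2015-01-03 23:40:24",
                           "2015-01-04 20:10:50", "2015-01-04 21:10:35"),
                         tz = "UTC"),
  value     = c(4, 24, 4, 24, 0, 0)
)

# Hypothetical helper: aggregate one chunk by articleID/measure
agg_chunk <- function(x) {
  # drop = TRUE skips combinations that do not occur in this chunk
  groups <- split(x, list(x$articleID, x$measure), drop = TRUE)
  rows <- lapply(groups, function(g) {
    data.frame(
      articleID       = g$articleID[1],
      measure         = g$measure[1],
      value           = mean(g$value),
      first.timepoint = min(g$time),
      last.timepoint  = max(g$time)
    )
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out  # one row per combination, returned as a data.frame
}

result <- agg_chunk(chunk)
print(result)
```

The data.table version in the answer does the same grouping in one expression and will usually be faster on large chunks.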