How to combine factor levels from two empty data.frames?
如何结合两个空数据框架的因子水平?
I have a big data set splitted into separate files. I need a data.frame that will have all possible levels for factor columns, but I can't load all parts at once, only part by part.
我将大数据集拆分为单独的文件。我需要一个data.frame,它将具有因子列的所有可能级别,但我不能一次加载所有部分,只能逐个加载。
Is there a way to do something like:
有没有办法做这样的事情:
data_structure = NULL
for (chunk_i in chunks){
# load chunk_i data
if(is.null(data_structure)){
data_structure = data_i
} else {
# at this line factor levels will NOT be combined as I expect
# but instead factor levels from 'data' will be stored to 'data_structure'
data_structure = rbind(data_structure, data)
}
rm(data)
# empty data frame, since I can't keep all data in memory
# I want to keep only metadata, like factor levels
data_structure = data_structure[0, ]
}
And this data_structure is needed to later convert factors to binary columns like this:
并且需要此data_structure以便稍后将因子转换为二进制列,如下所示:
result_i = model.matrix(~ . + 0, data=data_i, contrasts.arg =
lapply(data_structure, contrasts, contrasts=FALSE))
If factor levels a gathered from all parts of data then I can be sure that result_i will have exactly same binary columns as all other parts of data, even if in this particular case data_i have less factor levels in some columns.
如果从数据的所有部分收集因子水平,那么我可以确定result_i将具有与所有其他数据部分完全相同的二进制列,即使在这种特定情况下data_i在某些列中具有较少的因子水平。
UPDATE
Right now I use this solution:
现在我使用这个解决方案:
all_levels = list()
for_each_chunk(function(data) {
data_levels = Filter(Negate(is.null), sapply(data, levels))
factor_names = unique(c(names(all_levels), names(data_levels)))
lapply(factor_names, FUN=function(name){
all_levels[[name]] <<- unique(c(all_levels[[name]], data_levels[[name]]))
})
})
Not so elegant as for me, but haven't found nothing better yet.
不像我那么优雅,但还没有找到更好的东西。
1 个解决方案
#1
It might be a very silly solution that I am proposing. Why don't you do a stratified sample of each an every chunk and then read those chunks into a single dataframe. This way I think all the levels will be stored in metadata. You can do a stratified sample using the sampling
package in R, or you can use this function that I had picked up from GIT hub sometime back:
我提议这可能是一个非常愚蠢的解决方案。为什么不对每个块进行分层抽样,然后将这些块读入单个数据帧。这样我认为所有级别都将存储在元数据中。您可以使用R中的采样包进行分层样本,或者您可以使用我之前从GIT中心获取的此功能:
stratified <- function(df, group, size, select = NULL,
replace = FALSE, bothSets = FALSE) {
if (is.null(select)) {
df <- df
} else {
if (is.null(names(select))) stop("'select' must be a named list")
if (!all(names(select) %in% names(df)))
stop("Please verify your 'select' argument")
temp <- sapply(names(select),
function(x) df[[x]] %in% select[[x]])
df <- df[rowSums(temp) == length(select), ]
}
df.interaction <- interaction(df[group], drop = TRUE)
df.table <- table(df.interaction)
df.split <- split(df, df.interaction)
if (length(size) > 1) {
if (length(size) != length(df.split))
stop("Number of groups is ", length(df.split),
" but number of sizes supplied is ", length(size))
if (is.null(names(size))) {
n <- setNames(size, names(df.split))
message(sQuote("size"), " vector entered as:\n\nsize = structure(c(",
paste(n, collapse = ", "), "),\n.Names = c(",
paste(shQuote(names(n)), collapse = ", "), ")) \n\n")
} else {
ifelse(all(names(size) %in% names(df.split)),
n <- size[names(df.split)],
stop("Named vector supplied with names ",
paste(names(size), collapse = ", "),
"\n but the names for the group levels are ",
paste(names(df.split), collapse = ", ")))
}
} else if (size < 1) {
n <- round(df.table * size, digits = 0)
} else if (size >= 1) {
if (all(df.table >= size) || isTRUE(replace)) {
n <- setNames(rep(size, length.out = length(df.split)),
names(df.split))
} else {
message(
"Some groups\n---",
paste(names(df.table[df.table < size]), collapse = ", "),
"---\ncontain fewer observations",
" than desired number of samples.\n",
"All observations have been returned from those groups.")
n <- c(sapply(df.table[df.table >= size], function(x) x = size),
df.table[df.table < size])
}
}
temp <- lapply(
names(df.split),
function(x) df.split[[x]][sample(df.table[x],
n[x], replace = replace), ])
set1 <- do.call("rbind", temp)
if (isTRUE(bothSets)) {
set2 <- df[!rownames(df) %in% rownames(set1), ]
list(SET1 = set1, SET2 = set2)
} else {
set1
}
}
#1
It might be a very silly solution that I am proposing. Why don't you do a stratified sample of each an every chunk and then read those chunks into a single dataframe. This way I think all the levels will be stored in metadata. You can do a stratified sample using the sampling
package in R, or you can use this function that I had picked up from GIT hub sometime back:
我提议这可能是一个非常愚蠢的解决方案。为什么不对每个块进行分层抽样,然后将这些块读入单个数据帧。这样我认为所有级别都将存储在元数据中。您可以使用R中的采样包进行分层样本,或者您可以使用我之前从GIT中心获取的此功能:
stratified <- function(df, group, size, select = NULL,
replace = FALSE, bothSets = FALSE) {
if (is.null(select)) {
df <- df
} else {
if (is.null(names(select))) stop("'select' must be a named list")
if (!all(names(select) %in% names(df)))
stop("Please verify your 'select' argument")
temp <- sapply(names(select),
function(x) df[[x]] %in% select[[x]])
df <- df[rowSums(temp) == length(select), ]
}
df.interaction <- interaction(df[group], drop = TRUE)
df.table <- table(df.interaction)
df.split <- split(df, df.interaction)
if (length(size) > 1) {
if (length(size) != length(df.split))
stop("Number of groups is ", length(df.split),
" but number of sizes supplied is ", length(size))
if (is.null(names(size))) {
n <- setNames(size, names(df.split))
message(sQuote("size"), " vector entered as:\n\nsize = structure(c(",
paste(n, collapse = ", "), "),\n.Names = c(",
paste(shQuote(names(n)), collapse = ", "), ")) \n\n")
} else {
ifelse(all(names(size) %in% names(df.split)),
n <- size[names(df.split)],
stop("Named vector supplied with names ",
paste(names(size), collapse = ", "),
"\n but the names for the group levels are ",
paste(names(df.split), collapse = ", ")))
}
} else if (size < 1) {
n <- round(df.table * size, digits = 0)
} else if (size >= 1) {
if (all(df.table >= size) || isTRUE(replace)) {
n <- setNames(rep(size, length.out = length(df.split)),
names(df.split))
} else {
message(
"Some groups\n---",
paste(names(df.table[df.table < size]), collapse = ", "),
"---\ncontain fewer observations",
" than desired number of samples.\n",
"All observations have been returned from those groups.")
n <- c(sapply(df.table[df.table >= size], function(x) x = size),
df.table[df.table < size])
}
}
temp <- lapply(
names(df.split),
function(x) df.split[[x]][sample(df.table[x],
n[x], replace = replace), ])
set1 <- do.call("rbind", temp)
if (isTRUE(bothSets)) {
set2 <- df[!rownames(df) %in% rownames(set1), ]
list(SET1 = set1, SET2 = set2)
} else {
set1
}
}