Creating multiple histograms starting from bins and frequencies rather than from samples?

Time: 2021-01-20 14:54:15

I have a data frame of size 10^6 x 3, i.e., 1 million samples for three variables. I would like to create three overlaid histograms in the same plot (alpha blending?) using R. Handling that many samples on my PC is possible (they fit in memory and R doesn't hang forever), but it is not lightning fast.

The code that generated the samples also returns the lower and upper bin boundaries and the corresponding frequencies. Of course, this is much less data: I can choose the number of bins, but let's say 30 bins per variable, so 30x2x3=180 doubles. Is there a way in R to create overlaid histograms starting from bin and frequency data? I would like to use ggplot2, but I'm open to solutions with base R or other packages.

Also, what would you do in my situation? Would you use the original samples and not care about the longer computation time and memory use? Or would you go for bins/frequencies? I'd like to use the raw data, but I'm worried that R could get too slow or hog too much memory, and that this could cause problems in later computations. So a solution that uses the raw data but is optimized for speed and memory would be great; otherwise it's fine to use bins/frequencies (if at all possible!).

2 solutions

#1


Yes, of course you can! Using the bins and frequencies you can make a bar graph.

# Example data: two groups, ten bins each, with made-up frequencies
dat <- data.frame(group = rep(c('a', 'b'), each = 10),
                  bin = rep(1:10, 2),
                  frequency = rnorm(20, 5))
library(ggplot2)

Using alpha blending as you suggested:

ggplot(dat, aes(x = bin, y = frequency, fill = group)) + 
  geom_bar(stat = 'identity', position = position_identity(), alpha = 0.4)

Or we dodge the bars:

ggplot(dat, aes(x = bin, y = frequency, fill = group)) + 
  geom_bar(stat = 'identity', position = 'dodge')
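
If the actual lower and upper bin boundaries are available, as described in the question, the bars could also be drawn at their true positions and widths with geom_rect instead of against a bin index. A minimal sketch, assuming a hypothetical data frame `bins` whose columns `lower`, `upper` and `frequency` stand in for the output of the sampling code (the helper `make_bins` and all numbers are made up for illustration):

```r
library(ggplot2)

# Hypothetical bin data: 30 equal-width bins per variable.
# The frequencies are synthetic (a normal curve scaled to 1e6 samples).
make_bins <- function(group, from, to, n_bins = 30) {
  breaks <- seq(from, to, length.out = n_bins + 1)
  mids <- (head(breaks, -1) + tail(breaks, -1)) / 2
  data.frame(group = group,
             lower = head(breaks, -1),
             upper = tail(breaks, -1),
             frequency = 1e6 * dnorm(mids, mean = (from + to) / 2) * diff(breaks))
}
bins <- rbind(make_bins("A", -4, 4),
              make_bins("B", -1, 7),
              make_bins("C",  2, 10))

# One rectangle per bin, overlaid with alpha blending
p <- ggplot(bins, aes(xmin = lower, xmax = upper,
                      ymin = 0, ymax = frequency, fill = group)) +
  geom_rect(alpha = 0.4) +
  labs(x = "value", y = "frequency")
print(p)
```

Because each rectangle takes its own xmin and xmax, this also works when the three variables have different bin boundaries or unequal bin widths.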

#2


I was curious about "not lightning fast". The dataset below (1e6 cases X 3 variables) renders in ~6 sec on my machine (Core i7, Win7 x64). Is that too slow?

set.seed(1)    # for reproducible example
df <- data.frame(matrix(rnorm(3e6, mean=rep(c(0,3,6), each=1e6)), ncol=3))
names(df) <- c("A","B","C")

library(ggplot2)
library(reshape2)
gg.df <- melt(df, variable.name="category")

system.time({
  ggp <- ggplot(gg.df, aes(x=value, fill=category)) + 
    stat_bin(geom="bar", position="identity", alpha=0.7)
  plot(ggp)
})
#    user  system elapsed 
#    5.68    0.53    6.24 
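
If that is still too slow, one possible middle ground is to keep the raw samples but count them once with base R's hist(plot = FALSE) and hand ggplot2 only the binned counts. A minimal sketch, assuming the same kind of simulated data as above; the names `binned`, `lower`, `upper` and `count` are illustrative, and the choice of roughly 30 shared bins is arbitrary:

```r
library(ggplot2)

set.seed(1)    # for reproducible example
df <- data.frame(A = rnorm(1e6, 0), B = rnorm(1e6, 3), C = rnorm(1e6, 6))

# Shared breaks covering all three variables, so the histograms line up
breaks <- pretty(range(as.matrix(df)), n = 30)

# Count each column once; only one row per bin and variable reaches ggplot2
binned <- do.call(rbind, lapply(names(df), function(v) {
  h <- hist(df[[v]], breaks = breaks, plot = FALSE)
  data.frame(category = v,
             lower = head(h$breaks, -1),
             upper = tail(h$breaks, -1),
             count = h$counts)
}))

p <- ggplot(binned, aes(xmin = lower, xmax = upper,
                        ymin = 0, ymax = count, fill = category)) +
  geom_rect(alpha = 0.7)
print(p)
```

The plot should look like the stat_bin version, but the data frame passed to ggplot2 is tiny, so the long-format copy of all 3e6 values created by melt is never needed.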
