Plotting a histogram as probabilities (relative frequencies)

Time: 2021-12-01 14:56:40

I am having trouble plotting a histogram as a PDF (probability).

I want the total area of all the bars to equal one so that it's easier to compare across datasets. For some reason, whenever I specify the breaks (the default of 4 or whatever is terrible), it no longer plots the bins as a probability and instead plots them as a frequency count.

hist(data[,1], freq = FALSE, xlim = c(-1,1), breaks = 800)

What should I change this line to? I need a probability distribution and a large number of bins. (I have 6 million data points)

This is in the R help, but I don't know how to override it:

freq logical; if TRUE, the histogram graphic is a representation of frequencies, the counts component of the result; if FALSE, probability densities, component density, are plotted (so that the histogram has a total area of one). Defaults to TRUE if and only if breaks are equidistant (and probability is not specified).

Thanks

edit: details

Hmm, so my plot goes above 1, which is quite confusing if it's a probability. I see now that it has to do with the bin width. I more or less want to make every bin worth 1 point while still having a lot of bins. In other words, no bin height should be above 1.0 unless it is exactly 1.0 and all the other bins are 0.0. As it stands now, my bins make a hump that peaks around 15.0.

edit: height by % of points in bin

@Dwin: So how do I plot the probability? I realize taking the integral will still give me 1.0 due to the units on the x axis, but this isn't what I want. Say I have 100 points and 5 of them fall into the first bin; then that bin should have a height of 0.05. This is what I want. Am I doing it wrong, and is there another way this is done?

I know how many points I have. Is there a way to divide each bin count in the frequency histogram by this number?

5 answers

#1


35  

To answer the request to plot probabilities rather than densities:

h <- hist(vec, breaks = 100, plot=FALSE)
h$counts=h$counts/sum(h$counts)
plot(h)
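
A self-contained version of the same idea, as a sketch only: the simulated `vec`, the seed, and the axis labels are assumptions for illustration and are not part of the original question.

```r
# Simulated stand-in for the real data (assumption: any numeric vector works)
set.seed(1)
vec <- rnorm(10000)

# Compute the histogram without drawing it
h <- hist(vec, breaks = 100, plot = FALSE)

# Replace the counts with relative frequencies:
# each bar becomes the share of all points that fall in its bin
h$counts <- h$counts / sum(h$counts)

# Draw the modified object; the bar heights now sum to 1
plot(h, freq = TRUE, ylab = "Relative frequency", main = "Histogram of vec")
```

With this scaling a bar can only reach 1.0 if every point falls into that one bin, which seems to be the behaviour the question asks for.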

#2


2  

Are you sure? This is working for me:

> vec <- rnorm(6000000)
> 
> h <- hist(vec, breaks = 800, freq = FALSE)
> sum(h$density)
[1] 100
> unique(zapsmall(diff(h$breaks)))
[1] 0.01

Multiply the last two results (the sum of the densities and the bin width) and you get a total area of 1. Remember that the bin width is important here: a density is a height per unit of x, not a probability.

This is with

> sessionInfo()
R version 3.0.1 RC (2013-05-11 r62732)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.0.1
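
To make the bin-width point concrete, here is a small sketch showing that density heights can be well above 1 while the per-bin probabilities stay at or below 1. The simulated data, its standard deviation, and the names `width` and `prob` are assumptions chosen for illustration.

```r
set.seed(1)
# Tightly concentrated data, so the density is large near the centre
vec <- rnorm(100000, sd = 0.03)

h <- hist(vec, breaks = 50, plot = FALSE)

width <- diff(h$breaks)        # width of each bin
prob  <- h$density * width     # probability mass in each bin

max(h$density)   # far above 1, because the bins are narrow
max(prob)        # at most 1
sum(prob)        # total area: 1 up to rounding
```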

#3


2  

The default number of breaks follows Sturges' rule, roughly log2(N) + 1, where N is 6 million in your case, so you should get about 24 bins, not 4. If you're only seeing 4 bars, that could be because you have xlim in your call. This doesn't change the underlying histogram; it only affects which part of it is plotted. If you do

h <- hist(data[,1], freq=FALSE, breaks=800)
sum(h$density * diff(h$breaks))

you should get a result of 1.
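
A quick sketch of both claims, under the assumption that the default `breaks = "Sturges"` rule is in effect; the simulated `x` and the names `h_full` and `h_zoom` are made up for the example.

```r
# Sturges' rule suggests ceiling(log2(n) + 1) bins for n points
n <- 6e6
ceiling(log2(n) + 1)   # 24

set.seed(1)
x <- rnorm(1000)

# Same data and breaks, once without drawing and once drawn with a narrow xlim
h_full <- hist(x, breaks = 40, plot = FALSE)
h_zoom <- hist(x, breaks = 40, freq = FALSE, xlim = c(-1, 1))

# xlim only zooms the drawing; the bins are still computed from all of x
identical(h_full$breaks, h_zoom$breaks)
identical(h_full$counts, h_zoom$counts)

# The area under the density histogram is 1 either way
sum(h_zoom$density * diff(h_zoom$breaks))
```

Note that `hist()` treats a numeric `breaks` value as a suggestion and rounds to "pretty" break points, so the actual number of bins may differ somewhat from the number requested.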

The density of your data is related to its units of measurement; therefore you want to make sure that "no bin height should be above 1.0" is actually meaningful. For example, suppose we have a bunch of measurements in feet. We plot the histogram of the measurements as a density. We then convert all the measurements to inches (by multiplying by 12) and do another density-histogram. The height of the density will be 1/12th of the original even though the data is essentially the same. Similarly, you could make your bin heights all less than 1 by multiplying all your numbers by 15.

Does the value 1.0 have some kind of significance?
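
The feet-versus-inches argument can be checked directly. This is a sketch with hypothetical measurements (`feet`, `inches`, and the bin edges are invented for the example); the same bin edges are used in both units so the two histograms line up bin for bin.

```r
set.seed(1)
feet   <- runif(1000, min = 5, max = 6)   # hypothetical measurements in feet
inches <- feet * 12                       # same data, different unit

# Matching bin edges in each unit
edges_ft <- seq(5, 6, by = 0.1)
h_ft <- hist(feet,   breaks = edges_ft,      plot = FALSE)
h_in <- hist(inches, breaks = edges_ft * 12, plot = FALSE)

h_ft$density / h_in$density             # 12 in every bin
sum(h_ft$density * diff(h_ft$breaks))   # area in feet:   1 up to rounding
sum(h_in$density * diff(h_in$breaks))   # area in inches: 1 up to rounding
```

The heights change with the unit, but the area does not, which is why a fixed threshold such as 1.0 on the density axis has no unit-free meaning.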

#4


0  

I observed that, in a histogram, density = relative frequency / corresponding bin width.

Example 1:

nums = c(10, 41, 10, 28, 22, 8, 31, 3, 9, 9)
h2 = hist(nums, plot=FALSE)
rf2 = h2$counts / sum(h2$counts)
d2 = rf2 / diff(h2$breaks)

h2$density
[1] 0.06 0.00 0.02 0.01 0.01

d2
[1] 0.06 0.00 0.02 0.01 0.01

Example 2:

nums = c(10, 41, 10, 28, 22, 8, 31, 3, 9, 9)
h3 = hist(nums, plot=FALSE, breaks=c(1,30,40,50))
rf3 = h3$counts / sum(h3$counts)
d3 = rf3 / diff(h3$breaks)

h3$density
[1] 0.02758621 0.01000000 0.01000000

d3
[1] 0.02758621 0.01000000 0.01000000

#5


-1  

R has a bug or something. If you have discrete data in a data.frame (with 1 column) and call hist(DF, freq=FALSE) on it, the relative densities will be wrong (summing to more than 1). This shouldn't happen as far as I can tell.

The solution is to call unlist() on the object first. This fixes the plot. (I changed the text too; data from http://www.electionstudies.org/studypages/anes_timeseries_2012/anes_timeseries_2012.htm)
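
A small sketch of the `unlist()` suggestion, using a made-up data frame (`DF` and its `answer` column are hypothetical and are not the ANES data). It also prints two different sums, because with discrete data and bins narrower than 1 the density values themselves can add up to more than 1 while the area is still 1, which may be what looked like a bug.

```r
set.seed(1)
# Hypothetical one-column data frame of discrete responses
DF <- data.frame(answer = sample(1:5, 1000, replace = TRUE))

# unlist() turns the single column into a plain numeric vector for hist()
x <- unlist(DF)
h <- hist(x, freq = FALSE)

sum(h$density)                    # need not be 1 when the bins are not 1 wide
sum(h$density * diff(h$breaks))   # area: 1 up to rounding
```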
