I am having trouble plotting a histogram as a PDF (probability)
I want the bins to add up to a total area of one so that it's easier to compare across datasets. For some reason, whenever I specify the breaks (the default of 4 or so is terrible), the bins are no longer plotted as probabilities and are plotted as frequency counts instead.
hist(data[,1], freq = FALSE, xlim = c(-1,1), breaks = 800)
What should I change this line to? I need a probability distribution and a large number of bins. (I have 6 million data points)
This is in the R help, but I don't know how to override it:
freq logical; if TRUE, the histogram graphic is a representation of frequencies, the counts component of the result; if FALSE, probability densities, component density, are plotted (so that the histogram has a total area of one). Defaults to TRUE if and only if breaks are equidistant (and probability is not specified).
Thanks
edit: details
Hmm, so my plot goes above 1, which is quite confusing if it's a probability. I see now that it has to do with the bin width. I more or less want to make every bin worth 1 point while still having a lot of bins. In other words, no bin height should be above 1.0, and a bin should only reach 1.0 if all the other bins are 0.0. As it stands now, I have bins that make a hump around 15.0.
edit: height by % of points in bin
@Dwin: So how do I plot the probability? I realize that taking the integral will still give me 1.0 because of the units on the x axis, but this isn't what I want. Say I have 100 points and 5 of them fall into the first bin; then that bin should have a height of 0.05. This is what I want. Am I doing it wrong, or is there another way this is done?
I know how many points I have. Is there a way to divide each bin count in the frequency histogram by this number?
5 answers
#1
35
To answer the request to plot probabilities rather than densities:
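# Compute the histogram without drawing it
h <- hist(vec, breaks = 100, plot = FALSE)
# Rescale the counts to the proportion of points in each bin
h$counts <- h$counts / sum(h$counts)
plot(h)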
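As a minimal sketch of the same idea on simulated data (the contents of `vec`, the seed, and the axis labels are assumptions for illustration, not taken from the question), you can check that the rescaled heights sum to 1 and that no bar exceeds 1:

```r
set.seed(42)                          # assumed seed, only for reproducibility
vec <- rnorm(10000)                   # stand-in for the real data

h <- hist(vec, breaks = 100, plot = FALSE)
h$counts <- h$counts / sum(h$counts)  # proportion of points in each bin

total   <- sum(h$counts)              # all proportions added together
tallest <- max(h$counts)              # largest single proportion

# The breaks are equidistant, so plot() draws h$counts (now proportions)
plot(h, ylab = "Proportion of points", main = "Relative frequency histogram")
```

Note that these heights depend on the number of bins: with more bins, each one holds a smaller share of the points, so two such plots are only comparable when they use the same breaks.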
#2
2
Are you sure? This is working for me:
> vec <- rnorm(6000000)
>
> h <- hist(vec, breaks = 800, freq = FALSE)
> sum(h$density)
[1] 100
> unique(zapsmall(diff(h$breaks)))
[1] 0.01
Multiply the last two results (the sum of the densities and the bin width) and you get a total area of 1. Remember that the bin width is important here.
This is with
> sessionInfo()
R version 3.0.1 RC (2013-05-11 r62732)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.0.1
#3
2
The default number of breaks is around log2(N) + 1 (Sturges' rule), where N is 6 million in your case, so it should be about 24. If you're only seeing 4 breaks, that could be because you have xlim in your call. This doesn't change the underlying histogram; it only affects which part of it is plotted. If you do
h <- hist(data[,1], freq=FALSE, breaks=800)
sum(h$density * diff(h$breaks))
you should get a result of 1.
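Here is a rough sketch of both points on simulated data (the vector `x`, its spread, and the seed are assumptions, since the original data is not available): the area check comes out as 1, and adding xlim leaves the computed bins unchanged.

```r
set.seed(1)                                  # assumed seed for reproducibility
x <- rnorm(100000, sd = 0.3)                 # stand-in for data[,1]

# Histogram object only, no drawing
h_full <- hist(x, breaks = 800, plot = FALSE)
area <- sum(h_full$density * diff(h_full$breaks))   # total area of the bars

# xlim is a plotting argument: it zooms the axis but does not re-bin the data
h_zoom <- hist(x, breaks = 800, freq = FALSE, xlim = c(-1, 1))
same_bins <- identical(h_full$breaks, h_zoom$breaks)
same_dens <- identical(h_full$density, h_zoom$density)
```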
The density of your data is related to its units of measurement; therefore you want to make sure that "no bin height should be above 1.0" is actually meaningful. For example, suppose we have a bunch of measurements in feet. We plot the histogram of the measurements as a density. We then convert all the measurements to inches (by multiplying by 12) and do another density-histogram. The height of the density will be 1/12th of the original even though the data is essentially the same. Similarly, you could make your bin heights all less than 1 by multiplying all your numbers by 15.
Does the value 1.0 have some kind of significance?
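To make the units argument concrete, here is a small sketch with made-up measurements (the uniform data, the seed, and the bin edges are all assumptions chosen for illustration). The same data binned in feet and in inches gives density heights that differ by a factor of 12:

```r
set.seed(7)                               # assumed seed for reproducibility
feet   <- runif(1000, min = 0, max = 1)   # made-up measurements in feet
inches <- feet * 12                       # the same measurements in inches

# Ten matching bins covering the same physical range in both units
h_feet <- hist(feet,   breaks = seq(0, 1,  by = 0.1), plot = FALSE)
h_inch <- hist(inches, breaks = seq(0, 12, by = 1.2), plot = FALSE)

ratio      <- h_feet$density / h_inch$density   # height ratio per bin
area_feet  <- sum(h_feet$density * diff(h_feet$breaks))
area_inch  <- sum(h_inch$density * diff(h_inch$breaks))
```

Both areas are 1, but the bar heights are not comparable across units, which is why a height above 1.0 is not an error for a density.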
#4
0
I observed that, in a histogram, density = relative frequency / corresponding bin width.
Example 1:
nums <- c(10, 41, 10, 28, 22, 8, 31, 3, 9, 9)
h2 <- hist(nums, plot = FALSE)
rf2 <- h2$counts / sum(h2$counts)   # relative frequency per bin
d2 <- rf2 / diff(h2$breaks)         # divide by the bin width
h2$density
[1] 0.06 0.00 0.02 0.01 0.01
d2
[1] 0.06 0.00 0.02 0.01 0.01
Example 2:
nums <- c(10, 41, 10, 28, 22, 8, 31, 3, 9, 9)
h3 <- hist(nums, plot = FALSE, breaks = c(1, 30, 40, 50))
rf3 <- h3$counts / sum(h3$counts)   # relative frequency per bin
d3 <- rf3 / diff(h3$breaks)         # bin widths are unequal here
h3$density
[1] 0.02758621 0.01000000 0.01000000
d3
[1] 0.02758621 0.01000000 0.01000000
#5
-1
R has a bug or something. If you have discrete data in a data.frame (with 1 column), and call hist(DF,freq=FALSE) on it, the relative densities will be wrong (summing to >1). This shouldn't happen as far as I can tell.
The solution is to call unlist() on the object first. This fixes the plot. (I changed the text too, data from http://www.electionstudies.org/studypages/anes_timeseries_2012/anes_timeseries_2012.htm)
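A minimal sketch of the unlist() step, using a hypothetical one-column data frame rather than the ANES data. One thing that may be worth checking before blaming R (this is an assumption, since the breaks used are not shown here): with discrete data the default bins can be narrower than 1, in which case the density values sum to more than 1 while the total area is still 1. Choosing unit-width bins makes each density equal to the proportion of points in that bin.

```r
# Hypothetical one-column data frame of discrete responses (not the ANES data)
DF <- data.frame(answer = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5))

v <- unlist(DF, use.names = FALSE)   # plain numeric vector for hist()

# Unit-width bins centred on the integers, so density equals proportion
h <- hist(v, breaks = seq(0.5, 5.5, by = 1), freq = FALSE)

area  <- sum(h$density * diff(h$breaks))    # total area of the bars
props <- as.numeric(table(v)) / length(v)   # proportion of each value
```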