How to plot a histogram of long-tailed data using R?

Time: 2021-10-07 21:21:12

I have data that is mostly concentrated in a small range (1-10), but a significant share of the points (say, 10%) lies in (10-1000). I would like to plot a histogram of this data that focuses on (1-10) but also shows the (10-1000) data. Something like a log scale for the histogram.

Yes, I know this means not all bins are of equal size.

A simple hist(x) gives [image] while hist(x,breaks=c(0,1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2,3,4,5,7.5,10,15,20,50,100,200,500,1000,10000)) gives [image], neither of which is what I want.

Update: following the answers here, I now produce something that is almost exactly what I want (I went with a continuous plot instead of a bar histogram):

breaks <- c(0,1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2,4,8)
ggplot(t, aes(x)) +
  geom_histogram(colour="darkblue", size=1, fill="blue") +
  scale_x_log10('true size/predicted size', breaks = breaks, labels = breaks)

The only problem is that I'd like the scale to match the actual bars plotted. There are two options for doing that: one is simply to use the actual margins of the plotted bars (how?) and then get "ugly" x-axis labels like 1.1754, 1.2985, etc. The other, which I prefer, is to control the actual bin margins used so that they match the breaks.

3 Answers

#1


7  

Using ggplot2 seems like the easiest option. If you want more control over your axes and your breaks, you can do something like the following:

EDIT: new code provided

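# Simulated long-tailed data: a bulk of small values plus a long tail
x <- c(rexp(1000, 0.5) + 0.5, rexp(100, 0.5) * 100)

# Tick positions on the original scale (0 is left out: log10(0) is -Inf)
breaks <- c(0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 10000)
major <- c(0.1, 1, 10, 100, 1000, 10000)

# Bin the log10-transformed data without plotting
H <- hist(log10(x), plot = FALSE)

# Empty plot region; the x axis is drawn by hand below
plot(H$mids, H$counts, type = "n",
     xaxt = "n",
     xlab = "X", ylab = "Counts",
     main = "Histogram of X")
# Grid lines: dashed at every tick, solid at powers of ten
abline(v = log10(breaks), col = "lightgrey", lty = 2)
abline(v = log10(major), col = "lightgrey")
abline(h = pretty(H$counts), col = "lightgrey")
plot(H, add = TRUE, freq = TRUE, col = "blue")

# Position of ticks (log10 scale)
at <- log10(breaks)

# Draw the x axis, labelled on the original scale
axis(1, at = at, labels = breaks)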

This is as close as I can get to the ggplot2 look. Making the background grey is not that straightforward, but it is doable: draw a rectangle the size of the plot region and fill it with grey.
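
The rectangle idea can be sketched roughly as follows. This is only an illustration, assuming the limits of the plot region are read from par("usr") and filled with rect() before the bars are drawn; the simulated data and the names H, usr and ticks are placeholders rather than part of the code above.

```r
# Sketch: grey plot background in base graphics (illustrative data and names)
set.seed(42)
x <- c(rexp(1000, 0.5) + 0.5, rexp(100, 0.5) * 100)
H <- hist(log10(x), plot = FALSE)

# Set up an empty plot that covers the whole histogram
plot(range(H$breaks), c(0, max(H$counts)), type = "n",
     xaxt = "n", xlab = "X", ylab = "Counts", main = "Histogram of X")

# par("usr") holds the plot-region limits as c(x1, x2, y1, y2)
usr <- par("usr")
rect(usr[1], usr[3], usr[2], usr[4], col = "lightgrey", border = NA)

# Draw the bars on top of the grey rectangle, then redraw the frame
plot(H, add = TRUE, col = "blue")
box()

# Ticks at whole powers of ten, labelled on the original scale
ticks <- seq(floor(min(H$breaks)), ceiling(max(H$breaks)))
axis(1, at = ticks, labels = 10^ticks)
```

Anything drawn before the rect() call would be covered by it, so grid lines belong after the rectangle and before the bars.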

Check all the functions I used, and also ?par. It will allow you to build your own graphs. Hope this helps.

#2


9  

Log scale histograms are easier with ggplot than with base graphics. Try something like

library(ggplot2)
dfr <- data.frame(x = rlnorm(100, sdlog = 3))
ggplot(dfr, aes(x)) + geom_histogram() + scale_x_log10()

If you are desperate for base graphics, you need to plot a log-scale histogram without axes, then manually add the axes afterwards.

h <- hist(log10(dfr$x), axes = FALSE) 
Axis(side = 2)
Axis(at = h$breaks, labels = 10^h$breaks, side = 1)
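
To make the bars line up with round axis labels in base graphics, the bin edges can be chosen on the original scale and handed to hist() as log10 values. A minimal sketch, using simulated data and a hypothetical edges vector (neither comes from the question):

```r
# Illustrative long-tailed data: 90% in (1, 10), 10% in (10, 1000)
set.seed(1)
x <- c(runif(900, 1, 10), runif(100, 10, 1000))

# Bin edges on the original scale (hypothetical choice; must cover range(x))
edges <- c(1, 2, 5, 10, 20, 50, 100, 200, 500, 1000)

# Bin the log10 data at the log10 edges. The bins are unequal in width,
# so hist() plots densities rather than counts by default.
h <- hist(log10(x), breaks = log10(edges), axes = FALSE,
          xlab = "x", main = "Histogram of x (log scale)")

# Add the axes by hand, labelling each bin edge on the original scale
axis(side = 2)
axis(side = 1, at = h$breaks, labels = edges)
```

Because the ticks are placed at h$breaks, every label sits exactly on a bar boundary; base graphics may drop some labels if they would overlap.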

For completeness, the lattice solution would be

library(lattice)
histogram(~x, dfr, scales = list(x = list(log = TRUE)))

AN EXPLANATION OF WHY LOG VALUES ARE NEEDED IN THE BASE CASE:

If you plot the data with no log-transformation, then most of the data are clumped into bars at the left.

hist(dfr$x)

The hist function ignores the log argument (because it interferes with the calculation of breaks), so this doesn't work.

hist(dfr$x, log = "y")

Neither does this.

par(xlog = TRUE)
hist(dfr$x)

That means that we need to log-transform the data before we draw the plot.

hist(log10(dfr$x))

Unfortunately, this messes up the axes, which brings us to the workaround above.
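
The clumping can also be checked numerically, without looking at a plot. A rough sketch, with simulated data standing in for dfr$x (the names raw, logged, raw_share and log_share are illustrative):

```r
# Simulated long-tailed data (illustrative stand-in for dfr$x)
set.seed(1)
x <- rlnorm(1000, sdlog = 3)

# Bin on the raw scale and on the log10 scale without plotting
raw <- hist(x, plot = FALSE)
logged <- hist(log10(x), plot = FALSE)

# Share of observations that fall into the single fullest bin
raw_share <- max(raw$counts) / length(x)
log_share <- max(logged$counts) / length(x)

cat("fullest bin, raw scale:  ", round(raw_share, 2), "\n")
cat("fullest bin, log10 scale:", round(log_share, 2), "\n")
```

On data like this, the raw-scale share should come out far larger than the log-scale share, which is the clumping described above.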

#3


1  

A dynamic graph would also help with this plot. Use the manipulate package from RStudio to draw a histogram over a dynamically selected range:

library(manipulate)
# `data` is your own vector of observations; tabulate it first
data_dist <- table(data)
manipulate(barplot(data_dist[x:y]),
           x = slider(1, length(data_dist)),
           y = slider(10, length(data_dist)))

Then you will be able to use the sliders to view the distribution over a dynamically selected range.
