Main Question
I'm having trouble understanding why dates, labels, and breaks don't behave as I expected when I make a histogram with ggplot2 in R.
I'm looking for:
- A histogram of the frequency of my dates
- Tick marks centered under the matching bars
- Date labels in %Y-%b format
- Appropriate limits; minimal empty space between the edge of the grid and the outermost bars
I've uploaded my data to pastebin to make this reproducible. I've created several columns as I wasn't sure of the best way to do this:
> dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
> head(dates)
        YM       Date Year Month
1 2008-Apr 2008-04-01 2008     4
2 2009-Apr 2009-04-01 2009     4
3 2009-Apr 2009-04-01 2009     4
4 2009-Apr 2009-04-01 2009     4
5 2009-Apr 2009-04-01 2009     4
6 2009-Apr 2009-04-01 2009     4
Here's what I tried:
library(ggplot2)
library(scales)
dates$converted <- as.Date(dates$Date, format="%Y-%m-%d")
ggplot(dates, aes(x=converted)) + geom_histogram() +
  opts(axis.text.x = theme_text(angle=90))
Which yields this graph. I wanted %Y-%b formatting, though, so I hunted around and tried the following, based on this SO question:
ggplot(dates, aes(x=converted)) + geom_histogram() +
  scale_x_date(labels=date_format("%Y-%b"),
               breaks = "1 month") +
  opts(axis.text.x = theme_text(angle=90))
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
That gives me this graph:
- Correct x axis label format
- The frequency distribution has changed shape (binwidth issue?)
- Tick marks don't appear centered under bars
- The xlims have changed as well
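The stat_bin message above hints at why the bars and the monthly ticks cannot line up. Here is a minimal sketch of the arithmetic in base R, assuming the data run from 2008-04-01 to 2012-05-01 (the minimum and maximum dates mentioned further down in this question); the variable names are mine, not from the original code.

```r
# Date range of the data as stated in the question
# (assumed here rather than re-read from pastebin)
first_date <- as.Date("2008-04-01")
last_date  <- as.Date("2012-05-01")

# Dates are stored as days since 1970-01-01, so the range is a number of days
range_days <- as.numeric(last_date - first_date)

# The message says the default binwidth is range/30
default_binwidth <- range_days / 30

range_days        # 1491
default_binwidth  # 49.7
```

Under that assumption each default bin spans roughly 50 days, which is wider than any calendar month, so a tick every "1 month" cannot sit under the middle of every bar.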
I worked through the example in the scale_x_date section of the ggplot2 documentation, and geom_line() appears to break, label, and center ticks correctly when I use it with the same x-axis data. I don't understand why the histogram is different.
Updates based on answers from edgester and gauden
I initially thought gauden's answer helped me solve my problem, but I am now puzzled after looking more closely. Note the differences between the two answers' resulting graphs after the code.
Assume for both:
library(ggplot2)
library(scales)
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
Based on @edgester's answer below, I was able to do the following:
freqs <- aggregate(dates$Date, by=list(dates$Date), FUN=length)
freqs$names <- as.Date(freqs$Group.1, format="%Y-%m-%d")
ggplot(freqs, aes(x=names, y=x)) + geom_bar(stat="identity") +
scale_x_date(breaks="1 month", labels=date_format("%Y-%b"),
limits=c(as.Date("2008-04-30"),as.Date("2012-04-01"))) +
ylab("Frequency") + xlab("Year and Month") +
theme_bw() + opts(axis.text.x = theme_text(angle=90))
Here is my attempt based on gauden's answer:
dates$Date <- as.Date(dates$Date)
ggplot(dates, aes(x=Date)) + geom_histogram(binwidth=30, colour="white") +
scale_x_date(labels = date_format("%Y-%b"),
breaks = seq(min(dates$Date)-5, max(dates$Date)+5, 30),
limits = c(as.Date("2008-05-01"), as.Date("2012-04-01"))) +
ylab("Frequency") + xlab("Year and Month") +
theme_bw() + opts(axis.text.x = theme_text(angle=90))
Plot based on edgester's approach:
Plot based on gauden's approach:
Note the following:
- Gaps in gauden's plot for 2009-Dec and 2010-Mar; table(dates$Date) reveals that there are 19 instances of 2009-12-01 and 26 instances of 2010-03-01 in the data.
- edgester's plot starts at 2008-Apr and ends at 2012-May. This is correct based on a minimum date of 2008-04-01 and a maximum date of 2012-05-01 in the data. For some reason gauden's plot starts in 2008-Mar and still somehow manages to end at 2012-May. After counting bins and reading along the month labels, for the life of me I can't figure out which plot has an extra bin or is missing one!
Any thoughts on the differences here? edgester's method of creating a separate count
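On the 2008-Mar label: a rough sketch of the tick arithmetic from the attempt based on gauden's answer, in base R. It assumes the same minimum and maximum dates as stated above and only reproduces the breaks expression seq(min - 5, max + 5, 30); whether the first tick is actually drawn depends on the limits and on the ggplot2 version, so treat this as a possible contributor rather than a confirmed cause.

```r
# Minimum and maximum dates as stated in the question (assumed, not re-read)
first_date <- as.Date("2008-04-01")
last_date  <- as.Date("2012-05-01")

# Same expression as the breaks argument: one tick every 30 days,
# starting 5 days before the first date
tick_dates <- seq(first_date - 5, last_date + 5, by = 30)

# Each tick is labelled with the month it happens to fall in
tick_labels <- format(tick_dates, "%Y-%b")

format(tick_dates[1])                   # "2008-03-27"
format(tick_dates[length(tick_dates)])  # "2012-05-05"
length(tick_dates)                      # 51
head(tick_labels, 3)
```

The first tick lands on 2008-03-27, so it is labelled with March even though the earliest date in the data is in April. The ticks are 30 days apart while months are 28 to 31 days long, so the labels name the month each tick falls in, not the month a bar represents.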
Related References
As an aside, here are other locations with information about dates and ggplot2, for passers-by looking for help:
- Started here at learnr.wordpress, a popular R blog. It stated that I needed to get my data into POSIXct format, which I now think is false and wasted my time.
- Another learnr post recreates a time series in ggplot2, but wasn't really applicable to my situation.
- r-bloggers has a post on this, but it appears outdated. The simple format= option did not work for me.
- This SO question is playing with breaks and labels. I tried treating my Date vector as continuous and don't think it worked so well. It looked like it was overlaying the same label text over and over, so the letters looked kind of odd. The distribution is sort of correct but there are odd breaks. My attempt based on the accepted answer was like so (result here).
3 Answers
#1
28
UPDATE
Version 2: Using Date class
I updated the example to demonstrate aligning the labels and setting limits on the plot. I also demonstrate that as.Date does indeed work when used consistently (actually it is probably a better fit for your data than my earlier example).
The Target Plot v2
The Code v2
And here is the (somewhat excessively) commented code:
library("ggplot2")
library("scales")
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- as.Date(dates$Date)
# convert the Date to its numeric equivalent
# Note that Dates are stored as number of days internally,
# hence it is easy to convert back and forth mentally
dates$num <- as.numeric(dates$Date)
bin <- 60 # used for aggregating the data and aligning the labels
p <- ggplot(dates, aes(num, ..count..))
p <- p + geom_histogram(binwidth = bin, colour="white")
# The numeric data is treated as a date,
# breaks are set to an interval equal to the binwidth,
# and a set of labels is generated and adjusted in order to align with bars
p <- p + scale_x_date(breaks = seq(min(dates$num)-20, # change -20 term to taste
                                   max(dates$num),
                                   bin),
                      labels = date_format("%Y-%b"),
                      limits = c(as.Date("2009-01-01"),
                                 as.Date("2011-12-01")))
# from here, format at ease
p <- p + theme_bw() + xlab(NULL) + opts(axis.text.x = theme_text(angle=45,
                                                                  hjust = 1,
                                                                  vjust = 1))
p
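For readers on a newer ggplot2: opts() and theme_text() were later superseded by theme() and element_text(), and scale_x_date() gained the date_breaks and date_labels shortcuts. Below is a minimal sketch of the same kind of plot in that newer syntax. The sample data are invented for illustration, the two-month tick spacing is my choice, and the plotting step is skipped if ggplot2 is not installed; it is not a drop-in replacement for the code above.

```r
# Invented sample data standing in for the pastebin file
dates <- data.frame(
  Date = as.Date(c("2009-01-01", "2009-01-01", "2009-03-01",
                   "2009-05-01", "2009-05-01", "2009-05-01",
                   "2009-07-01"))
)

bin <- 60  # bin width in days, as in the answer above

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dates, aes(x = Date)) +
    # with a Date on the x axis, binwidth is measured in days
    geom_histogram(binwidth = bin, colour = "white") +
    # string shortcuts for tick spacing and strftime-style labels
    scale_x_date(date_breaks = "2 months", date_labels = "%Y-%b") +
    theme_bw() + xlab(NULL) +
    # theme() and element_text() take the place of opts() and theme_text()
    theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
  print(p)
}
```

Because the x aesthetic is already a Date here, the as.numeric() conversion from the original code is not needed in this sketch.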
Version 1: Using POSIXct
I tried a solution that does everything in ggplot2, drawing without the aggregation, and setting the limits on the x-axis between the beginning of 2009 and the end of 2011.
The Target Plot v1
The Code v1
library("ggplot2")
library("scales")
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- as.POSIXct(dates$Date)
p <- ggplot(dates, aes(Date, ..count..)) +
    geom_histogram() +
    theme_bw() + xlab(NULL) +
    scale_x_datetime(breaks = date_breaks("3 months"),
                     labels = date_format("%Y-%b"),
                     limits = c(as.POSIXct("2009-01-01"),
                                as.POSIXct("2011-12-01")) )
p
Of course, it could do with some tweaking of the label options on the axis, but this rounds off the plotting with a clean, short routine in the plotting package.
#2
5
I think the key thing is that you need to do the frequency calculation outside of ggplot. Use aggregate() with geom_bar(stat="identity") to get a histogram without the reordered factors. Here is some example code:
require(ggplot2)
# scales goes with ggplot and adds the needed scale* functions
require(scales)
# need the month() function for the extra plot
require(lubridate)
# original data
#df<-read.csv("http://pastebin.com/download.php?i=sDzXKFxJ", header=TRUE)
# simulated data
years=sample(seq(2008,2012),681,replace=TRUE,prob=c(0.0176211453744493,0.302496328928047,0.323054331864905,0.237885462555066,0.118942731277533))
months=sample(seq(1,12),681,replace=TRUE)
my.dates=as.Date(paste(years,months,01,sep="-"))
df=data.frame(YM=strftime(my.dates, format="%Y-%b"),Date=my.dates,Year=years,Month=months)
# end simulated data creation
# sort the list just to make it pretty. It makes no difference in the final results
df=df[do.call(order, df[c("Date")]), ]
# add a dummy column for clarity in processing
df$Count=1
# compute the frequencies ourselves
freqs=aggregate(Count ~ Year + Month, data=df, FUN=length)
# rebuild the Date column so that ggplot works
freqs$Date=as.Date(paste(freqs$Year,freqs$Month,"01",sep="-"))
# I set the breaks for 2 months to reduce clutter
g<-ggplot(data=freqs,aes(x=Date,y=Count))+ geom_bar(stat="identity") + scale_x_date(labels=date_format("%Y-%b"),breaks="2 months") + theme_bw() + opts(axis.text.x = theme_text(angle=90))
print(g)
# don't overwrite the previous graph
dev.new()
# just for grins, here is a faceted view by year
# Add the Month.name factor to have things work. month() keeps the factor levels in order
freqs$Month.name=month(freqs$Date,label=TRUE, abbr=TRUE)
g2<-ggplot(data=freqs,aes(x=Month.name,y=Count))+ geom_bar(stat="identity") + facet_grid(Year~.) + theme_bw()
print(g2)
#3
0
The erroneous graph under the title "Plot based on gauden's approach" is due to the binwidth parameter: ... + geom_histogram(binwidth = 30, color = "white") + ... If we change the value from 30 to something less than 20, such as 10, you will get all the frequencies.
In statistics the values are more important than the presentation: a bland graphic is preferable to a very pretty picture with errors.
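To see how a fixed 30-day binwidth can leave holes even though every month has data, here is a small base R sketch. It assumes one observation date per month from 2008-04-01 to 2012-05-01 (as in the question) and, purely for illustration, that bin edges fall on multiples of 30 days counted from 1970-01-01 with left-closed bins. That alignment is my assumption about the binning, not something confirmed by the question, and the variable names are mine.

```r
# Every first-of-month date in the range covered by the question's data
month_starts <- seq(as.Date("2008-04-01"), as.Date("2012-05-01"), by = "month")
day_number <- as.numeric(month_starts)  # days since 1970-01-01

# Assumed alignment: 30-day bins whose edges are multiples of 30 days
binwidth <- 30
edges <- seq(floor(min(day_number) / binwidth) * binwidth,
             (floor(max(day_number) / binwidth) + 1) * binwidth,
             by = binwidth)

# Count how many month starts land in each left-closed bin [edge, edge + 30)
bin_id <- findInterval(day_number, edges)
per_bin <- tabulate(bin_id, nbins = length(edges) - 1)

# Start dates of the bins that contain no month start (these appear as gaps)
empty_from <- as.Date(edges[-length(edges)][per_bin == 0], origin = "1970-01-01")
empty_from          # "2009-12-02" "2010-03-02"

# Number of bins that swallow two month starts
sum(per_bin == 2)   # 1
```

Under these assumptions the empty bins begin on 2009-12-02 and 2010-03-02, which matches the two gaps reported in the question: the counts for 2009-12-01 and 2010-03-01 are not lost, they fall into the preceding bar. Bins that follow calendar months, or counts computed outside ggplot as in answer #2, avoid this drift.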