用R表示纵向分类数据的好方法

时间:2021-04-28 23:00:17

[Update: Although I've accepted an answer, please add another answer if you have additional visualization ideas (whether in R or another language/program). Texts on categorical data analysis don't seem to say much about visualizing longitudinal data, while texts on longitudinal data analysis don't seem to say much about visualizing within-subject changes over time in category membership. Having more answers to this question will make it a better resource on an issue that doesn't get much coverage in standard references.]

[更新:虽然我已经接受了一个答案,但是如果你有更多的可视化的想法(无论是在R还是其他语言/程序中),请添加另一个答案。关于分类数据分析的文章似乎对可视化纵向数据没有太多的说明,而关于纵向数据分析的文章似乎对可视化类别中随时间变化的主题变化没有多少说明。对这个问题有更多的答案将使它成为一个更好的资源,在一个没有得到很多标准参考的问题。

A colleague just gave me a longitudinal categorical data set to look at and I'm trying to figure out how to capture the longitudinal aspect in a visualization. I'm posting here, because I'd like to do this in R, but please let me know if it makes sense to also cross-post to Cross-Validated, since cross-posting is generally discouraged.

我的一个同事给了我一个纵向分类数据集,我想弄明白如何在可视化中捕捉纵向方面。我在这里发布,因为我想在R中发布,但是请让我知道交叉发布是否有意义,因为交叉发布通常是不允许的。

Quick background: The data track the academic standing from term to term for students who went through an academic advising program. The data are in long format and have five variables: "id", "cohort", "term", "standing", and "termGPA". The first two identify the student and the term in which they were in the advising program. The last three are the terms when the student's academic standing and GPA were recorded. I've pasted in some sample data below using dput.

快速背景:这些数据追踪那些通过学术咨询项目的学生的学术地位。数据格式很长,有五个变量:“id”、“队列”、“term”、“standing”和“termGPA”。前两个是学生和他们在建议计划中的术语。最后三个是记录学生的学业成绩和平均绩点。我使用dput粘贴了下面的一些示例数据。

I've created a mosaic plot (see below) that groups students by cohort, standing, and term. This shows what fraction of students were in each academic-standing category in each term. But this doesn't capture the longitudinal aspect--the fact that individual students are tracked over time. I'd like to track the path that groups of students with a given academic standing take over time.

我创建了一个马赛克图(见下文),根据学生的队列、站姿和学期进行分组。这显示了每个学期每个学术类别中学生的比例。但这并没有反映出纵向方面的问题——学生个体会随着时间的推移而被跟踪。我想追踪一群有一定学术地位的学生走过的道路。

For example: Of students with standing "AP" (academic probation) in Fall 2009 ("F09"), what fraction were still AP in future terms, and what fraction moved into other categories (e.g., GS, "good standing")? Are there differences between cohorts in terms of movement between categories with time since entry into the advising program?

例如:在2009年秋季的“学业缓刑”(“F09”)的学生中,哪些比例在未来仍然是AP,哪些比例会转移到其他类别(例如,GS,“良好信誉”)?不同组别之间是否存在差异?不同组别之间的移动时间是否有差异?

I couldn't quite figure out how to capture this longitudinal aspect in an R graphic. The vcd package has facilities for visualizing categorical data, but doesn't seem to address longitudinal categorical data. Are there "standard" methods for visualizing longitudinal categorical data? Does R have packages designed for this? Is long format appropriate for this type of data or would I be better off with wide format?

我不太明白如何在R图中捕捉这个纵向方面。vcd包有可视化分类数据的工具,但似乎没有处理纵向分类数据。是否有可视化纵向分类数据的“标准”方法?R有为此设计的包装吗?长格式适用于这种类型的数据,还是使用宽格式更好?

I would appreciate suggestions for solving this particular problem and also suggestions for articles, books, etc. for learning more about visualizing longitudinal categorical data.

我希望能对解决这个问题提出建议,也希望能对文章、书籍等提出建议,以了解更多关于可视化纵向分类数据的知识。

Here's the code I used to make the mosaic plot. The code uses the data listed below with dput.

这是我用来制作镶嵌图的代码。代码使用下面列出的数据。

library(RColorBrewer)

# create a table object for plotting
df1.tab = table(df1$cohort, df1$term, df1$standing,
            dnn=c("Cohort\nAcademic Standing", "Term", "Standing"))

# create a mosaic plot
plot(df1.tab, las=1, dir=c("h","v","h"), 
     col=brewer.pal(8,"Dark2"),
     main="Fall 2009 and Fall 2010 Cohorts")

Here's the mosaic plot (side question: is there any way to make the columns for the F10 cohort sit directly under and have the same width as the columns for the F09 cohort, even when there's no data for some terms in the F10 cohort?):

这是mosaic的图表(边问:是否有方法可以让F10队列的列与F09队列的列保持相同的宽度,即使在F10队列中没有数据的情况下?)

用R表示纵向分类数据的好方法

And here's the data used to create the table and the plot:

这是用来创建表格和图表的数据:

df1 =
structure(list(id = c(101L, 102L, 103L, 104L, 105L, 106L, 107L, 
108L, 109L, 110L, 111L, 112L, 113L, 114L, 115L, 116L, 117L, 118L, 
119L, 120L, 121L, 122L, 123L, 124L, 125L, 101L, 102L, 103L, 104L, 
105L, 106L, 107L, 108L, 109L, 110L, 111L, 112L, 113L, 114L, 115L, 
116L, 117L, 118L, 119L, 120L, 121L, 122L, 123L, 124L, 125L, 101L, 
102L, 103L, 104L, 105L, 106L, 107L, 108L, 109L, 110L, 111L, 112L, 
113L, 114L, 115L, 116L, 117L, 118L, 119L, 120L, 121L, 122L, 123L, 
124L, 125L, 101L, 102L, 103L, 104L, 105L, 106L, 107L, 108L, 109L, 
110L, 111L, 112L, 113L, 114L, 115L, 116L, 117L, 118L, 119L, 120L, 
121L, 122L, 123L, 124L, 125L, 101L, 102L, 103L, 104L, 105L, 106L, 
107L, 108L, 109L, 110L, 111L, 112L, 113L, 114L, 115L, 116L, 117L, 
118L, 119L, 120L, 121L, 122L, 123L, 124L, 125L, 101L, 102L, 103L, 
104L, 105L, 106L, 107L, 108L, 109L, 110L, 111L, 112L, 113L, 114L, 
115L, 116L, 117L, 118L, 119L, 120L, 121L, 122L, 123L, 124L, 125L, 
101L, 102L, 103L, 104L, 105L, 106L, 107L, 108L, 109L, 110L, 111L, 
112L, 113L, 114L, 115L, 116L, 117L, 118L, 119L, 120L, 121L, 122L, 
123L, 124L, 125L), cohort = structure(c(1L, 1L, 1L, 1L, 2L, 1L, 
1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 
1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 
2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 
1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 
1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L), .Label = c("F09", "F10"), class = c("ordered", 
"factor")), term = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L), .Label = c("S09", "F09", "S10", 
"F10", "S11", "F11", "S12"), class = c("ordered", "factor")), 
    standing = structure(c(2L, 4L, 1L, 4L, NA, 4L, 1L, NA, NA, 
    NA, NA, 2L, 2L, 1L, 4L, 4L, 1L, 3L, NA, NA, 4L, 3L, 1L, 4L, 
    NA, 2L, 1L, 3L, 3L, NA, 1L, 2L, NA, NA, NA, NA, 2L, 4L, 3L, 
    4L, 4L, 4L, 2L, NA, NA, 4L, 2L, 4L, 4L, NA, 3L, 4L, 6L, 6L, 
    1L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 4L, 6L, 4L, 4L, 1L, 4L, 1L, 
    2L, 4L, 3L, 1L, 4L, 1L, 6L, 1L, 6L, 6L, 7L, 4L, 4L, 2L, 2L, 
    4L, 2L, 6L, 4L, 6L, 7L, 4L, 2L, 4L, 1L, 2L, 4L, 6L, 6L, 4L, 
    2L, 2L, 3L, 6L, 6L, 7L, 4L, 4L, 3L, 4L, 4L, 6L, 2L, 1L, 6L, 
    6L, 4L, 2L, 1L, 7L, 2L, 4L, 6L, 6L, 4L, 4L, 3L, 6L, 4L, 6L, 
    2L, 4L, 4L, 6L, 4L, 4L, 6L, 3L, 2L, 6L, 6L, 4L, 2L, 6L, 3L, 
    4L, 4L, 6L, 6L, 4L, 4L, 5L, 6L, 4L, 6L, 4L, 4L, 4L, 5L, 4L, 
    4L, 6L, 6L, 2L, 6L, 6L, 4L, 3L, 6L, 6L, 4L, 4L, 6L, 6L, 4L, 
    4L), .Label = c("AP", "CP", "DQ", "GS", "DM", "NE", "WD"), class = "factor"), 
    termGPA = c(1.433, 1.925, 1, 1.68, NA, 1.579, 1.233, NA, 
    NA, NA, NA, 2.009, 1.675, 0, 1.5, 1.86, 0.5, 0.94, NA, NA, 
    1.777, 1.1, 1.133, 1.675, NA, 2, 1.25, 1.66, 0, NA, 1.525, 
    2.25, NA, NA, NA, NA, 1.66, 2.325, 0, 2.308, 1.6, 1.825, 
    2.33, NA, NA, 2.65, 2.65, 2.85, 3.233, NA, 1.25, 1.575, NA, 
    NA, 1, 2.385, 3.133, 0, 0, 1.729, 1.075, 0, 4, NA, 2.74, 
    0, 1.369, 2.53, 0, 2.65, 2.75, 0, 0.333, 3.367, 1, NA, 0.1, 
    NA, NA, 1, 2.2, 2.18, 2.31, 1.75, 3.073, 0.7, NA, 1.425, 
    NA, 2.74, 2.9, 0.692, 2, 0.75, 1.675, 2.4, NA, NA, 3.829, 
    2.33, 2.3, 1.5, NA, NA, NA, 2.69, 1.52, 0.838, 2.35, 1.55, 
    NA, 1.35, 0.66, NA, NA, 1.35, 1.9, 1.04, NA, 1.464, 2.94, 
    NA, NA, 3.72, 2.867, 1.467, NA, 3.133, NA, 1, 2.458, 1.214, 
    NA, 3.325, 2.315, NA, 1, 2.233, NA, NA, 2.567, 1, NA, 0, 
    3.325, 2.077, NA, NA, 3.85, 2.718, 1.385, NA, 2.333, NA, 
    2.675, 1.267, 1.6, 1.388, 3.433, 0.838, NA, NA, 0, NA, NA, 
    2.6, 0, NA, NA, 1, 2.825, NA, NA, 3.838, 2.883)), .Names = c("id", 
"cohort", "term", "standing", "termGPA"), row.names = c("101.F09.s09", 
"102.F09.s09", "103.F09.s09", "104.F09.s09", "105.F10.s09", "106.F09.s09", 
"107.F09.s09", "108.F10.s09", "109.F10.s09", "110.F10.s09", "111.F10.s09", 
"112.F09.s09", "113.F09.s09", "114.F09.s09", "115.F09.s09", "116.F09.s09", 
"117.F09.s09", "118.F09.s09", "119.F10.s09", "120.F10.s09", "121.F09.s09", 
"122.F09.s09", "123.F09.s09", "124.F09.s09", "125.F10.s09", "101.F09.f09", 
"102.F09.f09", "103.F09.f09", "104.F09.f09", "105.F10.f09", "106.F09.f09", 
"107.F09.f09", "108.F10.f09", "109.F10.f09", "110.F10.f09", "111.F10.f09", 
"112.F09.f09", "113.F09.f09", "114.F09.f09", "115.F09.f09", "116.F09.f09", 
"117.F09.f09", "118.F09.f09", "119.F10.f09", "120.F10.f09", "121.F09.f09", 
"122.F09.f09", "123.F09.f09", "124.F09.f09", "125.F10.f09", "101.F09.s10", 
"102.F09.s10", "103.F09.s10", "104.F09.s10", "105.F10.s10", "106.F09.s10", 
"107.F09.s10", "108.F10.s10", "109.F10.s10", "110.F10.s10", "111.F10.s10", 
"112.F09.s10", "113.F09.s10", "114.F09.s10", "115.F09.s10", "116.F09.s10", 
"117.F09.s10", "118.F09.s10", "119.F10.s10", "120.F10.s10", "121.F09.s10", 
"122.F09.s10", "123.F09.s10", "124.F09.s10", "125.F10.s10", "101.F09.f10", 
"102.F09.f10", "103.F09.f10", "104.F09.f10", "105.F10.f10", "106.F09.f10", 
"107.F09.f10", "108.F10.f10", "109.F10.f10", "110.F10.f10", "111.F10.f10", 
"112.F09.f10", "113.F09.f10", "114.F09.f10", "115.F09.f10", "116.F09.f10", 
"117.F09.f10", "118.F09.f10", "119.F10.f10", "120.F10.f10", "121.F09.f10", 
"122.F09.f10", "123.F09.f10", "124.F09.f10", "125.F10.f10", "101.F09.s11", 
"102.F09.s11", "103.F09.s11", "104.F09.s11", "105.F10.s11", "106.F09.s11", 
"107.F09.s11", "108.F10.s11", "109.F10.s11", "110.F10.s11", "111.F10.s11", 
"112.F09.s11", "113.F09.s11", "114.F09.s11", "115.F09.s11", "116.F09.s11", 
"117.F09.s11", "118.F09.s11", "119.F10.s11", "120.F10.s11", "121.F09.s11", 
"122.F09.s11", "123.F09.s11", "124.F09.s11", "125.F10.s11", "101.F09.f11", 
"102.F09.f11", "103.F09.f11", "104.F09.f11", "105.F10.f11", "106.F09.f11", 
"107.F09.f11", "108.F10.f11", "109.F10.f11", "110.F10.f11", "111.F10.f11", 
"112.F09.f11", "113.F09.f11", "114.F09.f11", "115.F09.f11", "116.F09.f11", 
"117.F09.f11", "118.F09.f11", "119.F10.f11", "120.F10.f11", "121.F09.f11", 
"122.F09.f11", "123.F09.f11", "124.F09.f11", "125.F10.f11", "101.F09.s12", 
"102.F09.s12", "103.F09.s12", "104.F09.s12", "105.F10.s12", "106.F09.s12", 
"107.F09.s12", "108.F10.s12", "109.F10.s12", "110.F10.s12", "111.F10.s12", 
"112.F09.s12", "113.F09.s12", "114.F09.s12", "115.F09.s12", "116.F09.s12", 
"117.F09.s12", "118.F09.s12", "119.F10.s12", "120.F10.s12", "121.F09.s12", 
"122.F09.s12", "123.F09.s12", "124.F09.s12", "125.F10.s12"), reshapeLong = structure(list(
    varying = list(c("s09as", "f09as", "s10as", "f10as", "s11as", 
    "f11as", "s12as"), c("s09termGPA", "f09termGPA", "s10termGPA", 
    "f10termGPA", "s11termGPA", "f11termGPA", "s12termGPA")), 
    v.names = c("standing", "termGPA"), idvar = c("id", "cohort"
    ), timevar = "term"), .Names = c("varying", "v.names", "idvar", 
"timevar")), class = "data.frame")

3 个解决方案

#1


30  

Here are a few ideas for plotting your data. I've used ggplot2, and I've reformatted the data a bit in places.

下面是一些绘制数据的方法。我使用了ggplot2,并且在一些地方重新格式化了数据。

Figure 1

用R表示纵向分类数据的好方法 I've used a stacked barplot to mimic your mosaic plot and solve the alignment issue.

我使用了一个堆叠的barplot来模拟你的mosaic绘图并解决对齐问题。

Figure 2

用R表示纵向分类数据的好方法 Data points for each student are connected by a gray line, making this reminiscent of a parallel coordinates plot. Coloring the points shows the categorical standing. Using GPA on the y-axis helps spread out the points to reduce overplotting, and shows correlation of standing and GPA. A major problem is that many valid standing datapoints drop out because they lack a matching termGPA value.

每个学生的数据点都被一条灰色的线连接起来,这让人联想到一个平行坐标图。在这些点上色显示了分类的地位。在y轴上使用GPA有助于将分数展开以减少过度绘制,并且显示了站姿和GPA的相关性。一个主要问题是,许多有效的站数据点退出,因为它们缺少匹配的termGPA值。

Figure 3

用R表示纵向分类数据的好方法 Here I've created a new variable called initial_standing to use for facetting. Each panel contains students who match in both cohort and initial_standing. Plotting the id as text makes this figure a bit cluttered, but could be useful in some cases.

在这里,我创建了一个名为initial_standing的新变量,用于facetting。每个小组都包含符合队列和initial_standing的学生。将id标为文本会使这个图有点混乱,但在某些情况下可能有用。

Figure 4

用R表示纵向分类数据的好方法 This plot is like a heatmap where each row is a student. I controlled the order of the id axis to force initial_standing and cohort groupings to stay together. If you have many more rows, you may want to consider sorting rows by some type of clustering.

这幅图就像一张热图,每一行都是学生。我控制了id轴的顺序,以迫使initial_standing和队列分组保持在一起。如果有更多的行,您可能需要考虑通过某种类型的集群对行进行排序。

library(ggplot2)

# Create new data frame for determining initial standing.
standing_data = data.frame(id=unique(df1$id), initial_standing=NA, cohort=NA)

for (i in 1:nrow(standing_data)) {
    id = standing_data$id[i]
    subdat = df1[df1$id == id, ]
    subdat = subdat[complete.cases(subdat), ]
    initial_standing = subdat$standing[which.min(subdat$term)]
    standing_data[i, "initial_standing"] = as.character(initial_standing)
    standing_data[i, "cohort"] = as.character(subdat$cohort[1])
}

standing_data$cohort = factor(standing_data$cohort, levels=levels(df1$cohort))
standing_data$initial_standing = factor(standing_data$initial_standing,
                                        levels=levels(df1$standing))

# Add the new column (initial_standing) to df1.
df1 = merge(df1, standing_data[, c("id", "initial_standing")], by="id")

# Remove rows where standing is missing. Make some plots tidier.
df1 = df1[!is.na(df1$standing), ]

# Create id factor, controlling the sort order of the levels.     
id_order = order(standing_data$initial_standing, standing_data$cohort)
df1$id = factor(df1$id, levels=as.character(standing_data$id)[id_order])


p1 = ggplot(df1, aes(x=term, fill=standing)) +
     geom_bar(position="fill", colour="grey20", size=0.5, width=1.0) +
     facet_grid(cohort ~ .) +
     scale_fill_brewer(palette="Set1")

p2 = ggplot(df1, aes(x=term, y=termGPA, group=id)) + 
     geom_line(colour="grey70") + 
     geom_point(aes(colour=standing), size=4) + 
     facet_grid(cohort ~ .) +
     scale_colour_brewer(palette="Set1")

p3 = ggplot(df1, aes(x=term, y=termGPA, group=id)) +
     geom_line(colour="grey70") + 
     geom_point(aes(colour=standing), size=4) + 
     geom_text(aes(label=id), hjust=-0.30, size=3) +
     facet_grid(initial_standing ~ cohort) +
     scale_colour_brewer(palette="Set1")


p4 = ggplot(df1, aes(x=term, y=id, fill=standing)) + 
     geom_tile(colour="grey20") +
     facet_grid(initial_standing ~ ., space="free_y", scales="free_y") +
     scale_fill_brewer(palette="Set1") +
     opts(panel.grid.major=theme_blank()) +
     opts(panel.grid.minor=theme_blank())

ggsave("plot_1.png", p1, width=10, height=6.25, dpi=80)
ggsave("plot_2.png", p2, width=10, height=6.25, dpi=80)
ggsave("plot_3.png", p3, width=10, height=6.25, dpi=80)
ggsave("plot_4.png", p4, width=10, height=6.25, dpi=80)

#2


9  

In researching my question, I've found a few other options that I'll list here.

在研究我的问题时,我发现了一些其他的选择,我将在这里列出。

A number of relatively new R packages are designed for visualizing and analyzing "life history" or "multistate sequence" data. The idea is that over time people (or objects) enter and exit various categories--for example, career changes, marriage and divorce, health and disease, or, in my case, categories of academic standing in college.

一些相对较新的R包被设计用于可视化和分析“生命历史”或“多状态序列”数据。我们的想法是,随着时间的推移,人们(或对象)进入和退出各种类别——例如,职业变化、婚姻和离婚、健康和疾病,或者,就我而言,大学里的学术地位类别。

R packages for visualizing sequence or life history data include biograph, mentioned by @timriffe in a comment above, and TraMineR. The author of the biograph package, Frans Willekens, has a book on the package, Biograph. Multistate analysis of life histories with R, that will be published by Springer this fall. TraMineR has a detailed user manual at the link above and also a shorter JSS article. JSS also has a special issue on multi-state models in the context of risk analysis that discusses additional R packages for multistate modeling.

用于可视化序列或生命历史数据的R包包括biograph, @timriffe在上面的注释中提到,和TraMineR。传记小说的作者,Frans Willekens,有一本关于传记的书。用R对生命历史的多态分析,施普林格将在今年秋天出版。在上面的链接中,TraMineR有详细的用户手册,还有一篇简短的JSS文章。JSS在风险分析的上下文中也有一个关于多状态模型的特殊问题,该问题讨论了用于多状态建模的附加R包。

I also found some specialized software designed to visualize movements between categories over time. Parallel Sets is a simple, free program for producing basic visualizations, although it has limited flexibility. Lifeflow is more sophisticated. It's also free, but you have to send an email to the creator requesting a copy.

我还发现了一些专门的软件,用于可视化不同类别之间的移动。并行集是一个简单的、免费的用于生成基本可视化的程序,尽管它的灵活性有限。Lifeflow是更复杂的。它也是免费的,但是你必须发送一封电子邮件给创作者,要求拷贝。

I'll add more details to this answer, once I've had a chance to try out these tools.

一旦我有机会尝试这些工具,我将在这个答案中添加更多的细节。

#3


3  

I wish I had found @bdemarest's answer before I wrote an R package to solve this problem, but since the OP requested additional updates, I'll share one more solution. What bdemarest suggested in Figure 4 is what I have been calling a type of horizontal line plot.

我希望在编写一个R包来解决这个问题之前找到@bdemarest的答案,但是由于OP要求进行额外的更新,所以我将共享另一个解决方案。bdemarest在图4中建议的是我所称的一种水平线图。

In developing the longCatEDA R package, we found that sorting the data was crucial to making useful plots (see example(sorter) and the report linked in the comment below for technical details), especially as the size of the problem became large. For example, we started the problem with daily drinking data (abstinent, use, abuse) for several thousand participants over 3 years (>1000 days).

在开发longCatEDA R包时,我们发现,对数据进行排序对于生成有用的图(参见示例(sorter)和下面注释中链接的报告以获取技术细节)至关重要,特别是当问题变得很大时。例如,我们从每天的饮酒数据(戒除、使用、虐待)开始,对几千名参与者进行了3年的调查(>1000天)。

The code to apply the horizontal line plot to @eipi10's data is below. Figure 1 stratifies by term, and Figure 2 stratifies by first status as with Figure 4 of @bdemarest, though the results are not identical due to within strata sorting.

下面是将水平线图应用到@eipi10数据的代码。图1按项分层,图2按第一状态分层,如图4 @bdemarest,但由于地层内部分选,结果并不相同。

Figure 1

用R表示纵向分类数据的好方法

Figure 2

用R表示纵向分类数据的好方法

# libraries
install.packages('longCatEDA')
library(longCatEDA)
library(RColorBrewer)

# transform data long to wide
dfw <- reshape(df1,
           timevar = 'term',
           idvar = c('id', 'cohort'),
           direction = 'wide')

# set up objects required by longCat()
y <- dfw[,seq(3,15,by=2)]
Labels <- levels(df1$standing)
tLabels <- levels(df1$term)
groupLabels <- levels(dfw$cohort)

# use the same colors as bdemarest
cols <- brewer.pal(7, "Set1")

# plot the longCat object
png('plot1.png', width=10, height=6.25, units='in', res=100)
par(bg='cornsilk3', mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
lc <- longCat(y=y, Labels=Labels, tLabels=tLabels, id=dfw$id) 
longCatPlot(lc, cols=cols, xlab='Term', lwd=8, legendBuffer=0)
legend(8.1, 25, legend=Labels, col=cols, lty=1, lwd=4)
dev.off()

# stratify by term
png('plot2.png', width=10, height=6.25, units='in', res=100)
par(bg='cornsilk3', mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
lc.g <- sorter(lc, group=dfw$cohort, groupLabels=groupLabels)
longCatPlot(lc.g, cols=cols, xlab='Term', lwd=8, legendBuffer=0) 
legend(8.1, 25, legend=Labels, col=cols, lty=1, lwd=4)
dev.off()

# stratify by first status, akin to Figure 4 by bdemarest
png('plot2.png', width=10, height=6.25, units='in', res=100)
par(bg='cornsilk3', mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
first <- apply(!is.na(y), 1, function(x) which(x)[1])
first <- y[cbind(seq_along(first), first)]
lc.1 <- sorter(lc, group=factor(first), groupLabels = sort(unique(first)))
longCatPlot(lc.1, cols=cols, xlab='Term', lwd=8, legendBuffer=0) 
legend(8.1, 25, legend=Labels, col=cols, lty=1, lwd=4)
dev.off()

#1


30  

Here are a few ideas for plotting your data. I've used ggplot2, and I've reformatted the data a bit in places.

下面是一些绘制数据的方法。我使用了ggplot2,并且在一些地方重新格式化了数据。

Figure 1

用R表示纵向分类数据的好方法 I've used a stacked barplot to mimic your mosaic plot and solve the alignment issue.

我使用了一个堆叠的barplot来模拟你的mosaic绘图并解决对齐问题。

Figure 2

用R表示纵向分类数据的好方法 Data points for each student are connected by a gray line, making this reminiscent of a parallel coordinates plot. Coloring the points shows the categorical standing. Using GPA on the y-axis helps spread out the points to reduce overplotting, and shows correlation of standing and GPA. A major problem is that many valid standing datapoints drop out because they lack a matching termGPA value.

每个学生的数据点都被一条灰色的线连接起来,这让人联想到一个平行坐标图。在这些点上色显示了分类的地位。在y轴上使用GPA有助于将分数展开以减少过度绘制,并且显示了站姿和GPA的相关性。一个主要问题是,许多有效的站数据点退出,因为它们缺少匹配的termGPA值。

Figure 3

用R表示纵向分类数据的好方法 Here I've created a new variable called initial_standing to use for facetting. Each panel contains students who match in both cohort and initial_standing. Plotting the id as text makes this figure a bit cluttered, but could be useful in some cases.

在这里,我创建了一个名为initial_standing的新变量,用于facetting。每个小组都包含符合队列和initial_standing的学生。将id标为文本会使这个图有点混乱,但在某些情况下可能有用。

Figure 4

用R表示纵向分类数据的好方法 This plot is like a heatmap where each row is a student. I controlled the order of the id axis to force initial_standing and cohort groupings to stay together. If you have many more rows, you may want to consider sorting rows by some type of clustering.

这幅图就像一张热图,每一行都是学生。我控制了id轴的顺序,以迫使initial_standing和队列分组保持在一起。如果有更多的行,您可能需要考虑通过某种类型的集群对行进行排序。

library(ggplot2)

# Create new data frame for determining initial standing.
standing_data = data.frame(id=unique(df1$id), initial_standing=NA, cohort=NA)

for (i in 1:nrow(standing_data)) {
    id = standing_data$id[i]
    subdat = df1[df1$id == id, ]
    subdat = subdat[complete.cases(subdat), ]
    initial_standing = subdat$standing[which.min(subdat$term)]
    standing_data[i, "initial_standing"] = as.character(initial_standing)
    standing_data[i, "cohort"] = as.character(subdat$cohort[1])
}

standing_data$cohort = factor(standing_data$cohort, levels=levels(df1$cohort))
standing_data$initial_standing = factor(standing_data$initial_standing,
                                        levels=levels(df1$standing))

# Add the new column (initial_standing) to df1.
df1 = merge(df1, standing_data[, c("id", "initial_standing")], by="id")

# Remove rows where standing is missing. Make some plots tidier.
df1 = df1[!is.na(df1$standing), ]

# Create id factor, controlling the sort order of the levels.     
id_order = order(standing_data$initial_standing, standing_data$cohort)
df1$id = factor(df1$id, levels=as.character(standing_data$id)[id_order])


p1 = ggplot(df1, aes(x=term, fill=standing)) +
     geom_bar(position="fill", colour="grey20", size=0.5, width=1.0) +
     facet_grid(cohort ~ .) +
     scale_fill_brewer(palette="Set1")

p2 = ggplot(df1, aes(x=term, y=termGPA, group=id)) + 
     geom_line(colour="grey70") + 
     geom_point(aes(colour=standing), size=4) + 
     facet_grid(cohort ~ .) +
     scale_colour_brewer(palette="Set1")

p3 = ggplot(df1, aes(x=term, y=termGPA, group=id)) +
     geom_line(colour="grey70") + 
     geom_point(aes(colour=standing), size=4) + 
     geom_text(aes(label=id), hjust=-0.30, size=3) +
     facet_grid(initial_standing ~ cohort) +
     scale_colour_brewer(palette="Set1")


p4 = ggplot(df1, aes(x=term, y=id, fill=standing)) + 
     geom_tile(colour="grey20") +
     facet_grid(initial_standing ~ ., space="free_y", scales="free_y") +
     scale_fill_brewer(palette="Set1") +
     opts(panel.grid.major=theme_blank()) +
     opts(panel.grid.minor=theme_blank())

ggsave("plot_1.png", p1, width=10, height=6.25, dpi=80)
ggsave("plot_2.png", p2, width=10, height=6.25, dpi=80)
ggsave("plot_3.png", p3, width=10, height=6.25, dpi=80)
ggsave("plot_4.png", p4, width=10, height=6.25, dpi=80)

#2


9  

In researching my question, I've found a few other options that I'll list here.

在研究我的问题时,我发现了一些其他的选择,我将在这里列出。

A number of relatively new R packages are designed for visualizing and analyzing "life history" or "multistate sequence" data. The idea is that over time people (or objects) enter and exit various categories--for example, career changes, marriage and divorce, health and disease, or, in my case, categories of academic standing in college.

一些相对较新的R包被设计用于可视化和分析“生命历史”或“多状态序列”数据。我们的想法是,随着时间的推移,人们(或对象)进入和退出各种类别——例如,职业变化、婚姻和离婚、健康和疾病,或者,就我而言,大学里的学术地位类别。

R packages for visualizing sequence or life history data include biograph, mentioned by @timriffe in a comment above, and TraMineR. The author of the biograph package, Frans Willekens, has a book on the package, Biograph. Multistate analysis of life histories with R, that will be published by Springer this fall. TraMineR has a detailed user manual at the link above and also a shorter JSS article. JSS also has a special issue on multi-state models in the context of risk analysis that discusses additional R packages for multistate modeling.

用于可视化序列或生命历史数据的R包包括biograph, @timriffe在上面的注释中提到,和TraMineR。传记小说的作者,Frans Willekens,有一本关于传记的书。用R对生命历史的多态分析,施普林格将在今年秋天出版。在上面的链接中,TraMineR有详细的用户手册,还有一篇简短的JSS文章。JSS在风险分析的上下文中也有一个关于多状态模型的特殊问题,该问题讨论了用于多状态建模的附加R包。

I also found some specialized software designed to visualize movements between categories over time. Parallel Sets is a simple, free program for producing basic visualizations, although it has limited flexibility. Lifeflow is more sophisticated. It's also free, but you have to send an email to the creator requesting a copy.

我还发现了一些专门的软件,用于可视化不同类别之间的移动。并行集是一个简单的、免费的用于生成基本可视化的程序,尽管它的灵活性有限。Lifeflow是更复杂的。它也是免费的,但是你必须发送一封电子邮件给创作者,要求拷贝。

I'll add more details to this answer, once I've had a chance to try out these tools.

一旦我有机会尝试这些工具,我将在这个答案中添加更多的细节。

#3


3  

I wish I had found @bdemarest's answer before I wrote an R package to solve this problem, but since the OP requested additional updates, I'll share one more solution. What bdemarest suggested in Figure 4 is what I have been calling a type of horizontal line plot.

我希望在编写一个R包来解决这个问题之前找到@bdemarest的答案,但是由于OP要求进行额外的更新,所以我将共享另一个解决方案。bdemarest在图4中建议的是我所称的一种水平线图。

In developing the longCatEDA R package, we found that sorting the data was crucial to making useful plots (see example(sorter) and the report linked in the comment below for technical details), especially as the size of the problem became large. For example, we started the problem with daily drinking data (abstinent, use, abuse) for several thousand participants over 3 years (>1000 days).

在开发longCatEDA R包时,我们发现,对数据进行排序对于生成有用的图(参见示例(sorter)和下面注释中链接的报告以获取技术细节)至关重要,特别是当问题变得很大时。例如,我们从每天的饮酒数据(戒除、使用、虐待)开始,对几千名参与者进行了3年的调查(>1000天)。

The code to apply the horizontal line plot to @eipi10's data is below. Figure 1 stratifies by term, and Figure 2 stratifies by first status as with Figure 4 of @bdemarest, though the results are not identical due to within strata sorting.

下面是将水平线图应用到@eipi10数据的代码。图1按项分层,图2按第一状态分层,如图4 @bdemarest,但由于地层内部分选,结果并不相同。

Figure 1

用R表示纵向分类数据的好方法

Figure 2

用R表示纵向分类数据的好方法

# libraries
install.packages('longCatEDA')
library(longCatEDA)
library(RColorBrewer)

# transform data long to wide
dfw <- reshape(df1,
           timevar = 'term',
           idvar = c('id', 'cohort'),
           direction = 'wide')

# set up objects required by longCat()
y <- dfw[,seq(3,15,by=2)]
Labels <- levels(df1$standing)
tLabels <- levels(df1$term)
groupLabels <- levels(dfw$cohort)

# use the same colors as bdemarest
cols <- brewer.pal(7, "Set1")

# plot the longCat object
png('plot1.png', width=10, height=6.25, units='in', res=100)
par(bg='cornsilk3', mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
lc <- longCat(y=y, Labels=Labels, tLabels=tLabels, id=dfw$id) 
longCatPlot(lc, cols=cols, xlab='Term', lwd=8, legendBuffer=0)
legend(8.1, 25, legend=Labels, col=cols, lty=1, lwd=4)
dev.off()

# stratify by term
png('plot2.png', width=10, height=6.25, units='in', res=100)
par(bg='cornsilk3', mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
lc.g <- sorter(lc, group=dfw$cohort, groupLabels=groupLabels)
longCatPlot(lc.g, cols=cols, xlab='Term', lwd=8, legendBuffer=0) 
legend(8.1, 25, legend=Labels, col=cols, lty=1, lwd=4)
dev.off()

# stratify by first status, akin to Figure 4 by bdemarest
png('plot2.png', width=10, height=6.25, units='in', res=100)
par(bg='cornsilk3', mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
first <- apply(!is.na(y), 1, function(x) which(x)[1])
first <- y[cbind(seq_along(first), first)]
lc.1 <- sorter(lc, group=factor(first), groupLabels = sort(unique(first)))
longCatPlot(lc.1, cols=cols, xlab='Term', lwd=8, legendBuffer=0) 
legend(8.1, 25, legend=Labels, col=cols, lty=1, lwd=4)
dev.off()