Using R to reformat data from cross-tab to one-datum-per-line format

Time: 2021-01-03 04:53:06

I'm using R to pull in data through an API and merge all of it into a single table, which I then write to a CSV file. To graph it properly in Tableau, however, I need to prepare the data by using their reformatting tool for Excel to get it from a cross-tabulated format to a format where each line contains only one piece of data. For example, taking something from the format:

ID,Gender,School,Math,English,Science
1,M,West,90,80,70
2,F,South,50,50,50

To:

ID,Gender,School,Subject,Score
1,M,West,Math,90
1,M,West,English,80
1,M,West,Science,70
2,F,South,Math,50
2,F,South,English,50
2,F,South,Science,50

Are there any existing tools in R or in an R library that would allow me to do this, or that would provide a starting point? I am trying to automate the preparation of data for Tableau so that I just need to run a single script to get it formatted properly, and would like to remove the manual Excel step if possible.

1 solution

#1


In R and several other programs, this process is referred to as "reshaping" data. In fact, the Tableau page that you originally linked to speaks of their "Excel Reshaper plugin".

In base R, there are a few functions for reshaping data, such as the (notorious) reshape() function, which converts panel data between wide and long forms, and stack(), which stacks several columns into a skinny two-column data.frame of values and their source column names.

The "reshape2" package seems to be much more popular for such data transformations, though. Here's an example of "melting" your sample data, which I've stored in a data.frame named "mydf":

library(reshape2)
melt(mydf, id.vars=c("ID", "Gender", "School"), 
     value.name="Score", variable.name="Subject")
#   ID Gender School Subject Score
# 1  1      M   West    Math    90
# 2  2      F  South    Math    50
# 3  1      M   West English    80
# 4  2      F  South English    50
# 5  1      M   West Science    70
# 6  2      F  South Science    50
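
The melted rows come out grouped by subject, whereas the layout in the question is grouped by ID. A minimal sketch of closing that gap, assuming the sample data is rebuilt by hand as "mydf" (the column types are a guess) and using "long" purely as an illustrative name:

```r
library(reshape2)

# Sample data from the question, rebuilt by hand (column types are assumed)
mydf <- data.frame(
  ID      = c(1, 2),
  Gender  = c("M", "F"),
  School  = c("West", "South"),
  Math    = c(90, 50),
  English = c(80, 50),
  Science = c(70, 50)
)

long <- melt(mydf, id.vars = c("ID", "Gender", "School"),
             variable.name = "Subject", value.name = "Score")

# melt() stacks one subject after another; reorder so each ID's rows sit together.
# Subject is a factor whose levels follow the original column order,
# so sorting on it keeps Math, English, Science in that sequence.
long <- long[order(long$ID, long$Subject), ]
rownames(long) <- NULL
long
```

Row order rarely matters to Tableau itself, so the sort is mostly useful when the output has to match an existing file line for line.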

For this example, base R's reshape() isn't really appropriate, but stack() is. Here, I've stacked just the last three columns:

stack(mydf[4:6])
#   values     ind
# 1     90    Math
# 2     50    Math
# 3     80 English
# 4     50 English
# 5     70 Science
# 6     50 Science

To get the data.frame you are looking for, you would cbind the first three columns with the above output.
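
One way that cbind step might look in base R, followed by the CSV export mentioned in the question. This is only a sketch: "mydf" is rebuilt by hand, and "score_cols", "stacked", "ids", "long", and the temporary output path are illustrative names rather than anything from the original post.

```r
# Sample data from the question, rebuilt by hand (column types are assumed)
mydf <- data.frame(
  ID      = c(1, 2),
  Gender  = c("M", "F"),
  School  = c("West", "South"),
  Math    = c(90, 50),
  English = c(80, 50),
  Science = c(70, 50)
)

score_cols <- 4:6
stacked <- stack(mydf[score_cols])

# stack() emits every Math row, then every English row, then every Science row,
# so the identifier rows are repeated once per stacked column to line up with it.
ids <- mydf[rep(seq_len(nrow(mydf)), times = length(score_cols)), 1:3]

long <- cbind(ids, Subject = stacked$ind, Score = stacked$values)
long <- long[order(long$ID), ]   # ties keep their original order: Math, English, Science
rownames(long) <- NULL

# Write the reshaped table out for Tableau; a temporary file stands in for the real path
out_file <- tempfile(fileext = ".csv")
write.csv(long, out_file, row.names = FALSE)
```

Repeating the identifier rows explicitly avoids relying on implicit recycling when the two pieces have different row counts.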

For reference, Hadley Wickham's Tidy Data paper is a good entry point into thinking about how the structure of your data might facilitate further processing and visualization.
