I am trying to clean up some data that has been incorrectly entered. The question for the variable allows for multiple responses out of five choices, numbered as 1 to 5. The data has been entered in the following manner (this is just an example--there are many more variables and many more observations in the actual data frame):
data
V1
1 1, 2, 3
2 1, 2, 4
3 2, 3, 4, 5
4 1, 3, 4
5 1, 3, 5
6 2, 3, 4, 5
Here's some code to recreate that example data:
data = data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5",
"1, 3, 4", "1, 3, 5", "2, 3, 4, 5"))
What I actually need is the data to be treated more... binary--like a set of "yes/no" questions--entered in a data frame that looks more like:
data
V1.1 V1.2 V1.3 V1.4 V1.5
1 1 1 1 NA NA
2 1 1 NA 1 NA
3 NA 1 1 1 1
4 1 NA 1 1 NA
5 1 NA 1 NA 1
6 NA 1 1 1 1
The actual variable names don't matter at the moment--I can easily fix that. Also, it doesn't matter too much whether the missing elements are "0", "NA", or blank--again, that's something I can fix later.
I've tried using the transform function from the reshape package as well as a few different things with strsplit, but I can't get either to do what I am looking for. I've also looked at many other related questions on *, but they don't seem to be quite the same problem.
2 Solutions
#1
8
You just need to write a function and use apply. First, some dummy data:
##Make sure you're not using factors
dd = data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5",
"1, 3, 4", "1, 3, 5", "2, 3, 4, 5"),
stringsAsFactors=FALSE)
Next, create a function that takes in a row and transforms it as necessary:
make_row = function(i, ncol=5) {
##Could make the default NA if needed
m = numeric(ncol)
v = as.numeric(strsplit(i, ",")[[1]])
m[v] = 1
return(m)
}
Then use apply and transpose the result:
t(apply(dd, 1, make_row))
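The question mentions many more variables and asks for columns named like V1.1 to V1.5 with NA for unselected choices. As a rough sketch of how the same strsplit idea could be wrapped up for that case (expand_choices, n_choices, fill, and cols are hypothetical names introduced here, not part of the original answer):

```r
# Hypothetical helper: expand one comma-separated column into
# indicator columns named <prefix>.1 ... <prefix>.n_choices
expand_choices = function(x, prefix, n_choices = 5, fill = NA) {
  # Split each entry on commas; as.numeric() tolerates the leading spaces
  picks = lapply(strsplit(as.character(x), ","), as.numeric)
  # One row per observation, pre-filled with the "not chosen" value
  m = matrix(fill, nrow = length(picks), ncol = n_choices,
             dimnames = list(NULL, paste(prefix, seq_len(n_choices), sep = ".")))
  for (r in seq_along(picks)) {
    v = picks[[r]]
    v = v[!is.na(v)]          # skip entries that could not be parsed
    m[r, v] = 1
  }
  as.data.frame(m)
}

dd = data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5",
                       "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"),
                stringsAsFactors = FALSE)

# List every column that needs expanding, then bind the pieces together
cols = c("V1")
out = do.call(cbind, lapply(cols, function(cl) expand_choices(dd[[cl]], cl)))
out
```

Adding further column names to cols should expand each of them in turn, assuming they all share the same five choices; pass fill = 0 to get zeros instead of NA.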
#2
6
A long time later, I finally got around to creating a package ("splitstackshape") that deals with this kind of data in an efficient manner. So, for the convenience of others (and some self-promotion, of course) here's a compact solution.
The relevant function for this problem is cSplit_e.
First, the default settings, which retain the original column and use NA as the fill:
library(splitstackshape)
cSplit_e(data, "V1")
# V1 V1_1 V1_2 V1_3 V1_4 V1_5
# 1 1, 2, 3 1 1 1 NA NA
# 2 1, 2, 4 1 1 NA 1 NA
# 3 2, 3, 4, 5 NA 1 1 1 1
# 4 1, 3, 4 1 NA 1 1 NA
# 5 1, 3, 5 1 NA 1 NA 1
# 6 2, 3, 4, 5 NA 1 1 1 1
Second, dropping the original column and using 0 as the fill:
cSplit_e(data, "V1", drop = TRUE, fill = 0)
# V1_1 V1_2 V1_3 V1_4 V1_5
# 1 1 1 1 0 0
# 2 1 1 0 1 0
# 3 0 1 1 1 1
# 4 1 0 1 1 0
# 5 1 0 1 0 1
# 6 0 1 1 1 1
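If the indicator columns have already been created with NA as the fill, switching to 0 afterwards does not require re-running the split. A minimal base R sketch, using a small made-up data frame (res and its values are illustrative only, not output from the package):

```r
# Made-up indicator columns with NA marking "not chosen"
res = data.frame(V1_1 = c(1, NA, 1),
                 V1_2 = c(NA, 1, 1))

# is.na() on a data frame returns a logical matrix, which can be used
# to overwrite every missing cell in one assignment
res[is.na(res)] = 0
res
```

This only makes sense when every column in res is an indicator column; otherwise subset to the relevant columns first.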