I'm just starting with R and I thought I'd been doing pretty well, but this one has me stumped! :)
I've got a data frame:
df <- data.frame(
col1=letters[1:5],col2=c(NA,letters[4:1]),col3=letters[11:15],
col1_rr=letters[15:11], col2_rr=letters[2], col3_rr=c(letters[11:14], "oz"))
which looks like this:
  col1 col2 col3 col1_rr col2_rr col3_rr
1    a   NA    k       o       b       k
2    b    d    l       n       b       l
3    c    c    m       m       b       m
4    d    b    n       l       b       n
5    e    a    o       k       b      oz
Note the pattern in the column names: for each column colX, there is a corresponding column colX_rr.
Now I want to check whether the content of df[1,"col1"] is contained in df[1,"col1_rr"].
For example, this statement is false for df[1,"col1"], but true for all col3 cells (even df[5,"col3"], since o is contained in oz).
I know I can use grepl for that:
> grepl(df[1,"col3"], df[1,"col3_rr"])
[1] TRUE
> grepl(df[2,"col1"], df[2,"col1_rr"])
[1] FALSE
> grepl(df[1,"col2"], df[1,"col2_rr"])
[1] NA
> grepl(df[5,"col3"], df[5,"col3_rr"])
[1] TRUE
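The cell-by-cell calls above can be extended to a whole column pair. This is only a minimal sketch: the helper name contains() is made up for illustration, and it assumes the letters should be matched literally (fixed = TRUE) rather than as regular expressions.

```r
df <- data.frame(
  col1 = letters[1:5], col2 = c(NA, letters[4:1]), col3 = letters[11:15],
  col1_rr = letters[15:11], col2_rr = letters[2],
  col3_rr = c(letters[11:14], "oz"),
  stringsAsFactors = FALSE
)

# grepl() only uses the first element of `pattern`, so pair the two
# columns up row by row with mapply(). contains() is a hypothetical helper.
contains <- function(x, x_rr) {
  unname(mapply(grepl, x, x_rr, MoreArgs = list(fixed = TRUE)))
}

contains(df$col3, df$col3_rr)  # one logical value per row
contains(df$col2, df$col2_rr)  # the NA in col2 stays NA
```

A missing value in colX propagates as NA here, just as in the single-cell call on df[1,"col2"].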
Next, the general rule: if the character y in cell [z, colX] is contained in the corresponding cell [z, colX_rr], I want to create a new column y and enter 1 for that row. If the character y in [z, colX] is NOT contained in the corresponding [z, colX_rr], I want to enter 0 in df$y.
So I would have something like this in the end:
  col1 col2 col3 col1_rr col2_rr col3_rr a b (...) k (...)
1    a   NA    k       o       b       k 0 0       1
2    b    d    l       n       b       l 0 0       0
3    c    c    m       m       b       m 0 0       0
4    d    b    n       l       b       n 0 1       0
5    e    a    o       k       b      oz 0 0       0
Each cell in the column range col1:col3 holds exactly one letter, and each letter occurs only once per participant (row). The content of the column range col1_rr:col3_rr is pretty messy and contains strings of different lengths, but there too each letter occurs only once per row.
Note that NAs also occur in the table.
I want this to be automatised, because there are 50 columns in the real data. If necessary, though, I can write a separate line of script for each column.
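With that many columns, the colX / colX_rr pairs could be picked up from the naming pattern instead of being typed out. A minimal sketch, assuming every paired column really does end in "_rr" and that no other column uses that suffix:

```r
df <- data.frame(
  col1 = letters[1:5], col2 = c(NA, letters[4:1]), col3 = letters[11:15],
  col1_rr = letters[15:11], col2_rr = letters[2],
  col3_rr = c(letters[11:14], "oz"),
  stringsAsFactors = FALSE
)

# Columns ending in "_rr", and the base names they belong to.
colrr.vars <- grep("_rr$", names(df), value = TRUE)
col.vars   <- sub("_rr$", "", colrr.vars)

# Sanity check: every colX_rr column has a matching colX column.
stopifnot(all(col.vars %in% names(df)))
```

The two vectors are in the same order, so col.vars[i] and colrr.vars[i] always name one pair.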
The letters in the real data are from the range letters[1:14], so there will be only 14 new columns in the end, each containing values 0 or 1 (or TRUE/FALSE, if this makes the solution any simpler).
I've been trying with ifelse and merge, but I don't know how to make them work for such a complex rule.
Thanks!
1 Answer
Here is a solution, though not a very elegant one. It uses the reshape2 package:
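library(reshape2)  # provides melt()

df <- data.frame(col1=letters[1:5], col2=c(NA,letters[4:1]), col3=letters[11:15],
                 col1_rr=letters[15:11], col2_rr=letters[2], col3_rr=c(letters[11:14], "oz"))
col.vars <- names(df)[1:3]    # the colX columns
colrr.vars <- names(df)[4:6]  # the matching colX_rr columns
df$id <- 1:nrow(df)
# Reshape both column sets to long format; the rows of the two results line up
df.var <- melt(df[,c("id",col.vars)], id.vars="id")
df.var_rr <- melt(df[,c("id",colrr.vars)], id.vars="id")
# Letters that occur in col1:col3
let <- names(table(unlist(df[,1:3])))
# For each letter: is it the value of a colX cell and contained in the paired colX_rr cell?
m <- data.frame(sapply(let, function(l) df.var$value==l & grepl(l, df.var_rr$value)))
# Sum the hits per row id and attach them to the original data
cbind(df, aggregate(m, list(df.var$id), sum))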
Which gives:
  col1 col2 col3 col1_rr col2_rr col3_rr id Group.1 a  b c d e k l m n o
1    a <NA>    k       o       b       k  1       1 0 NA 0 0 0 1 0 0 0 0
2    b    d    l       n       b       l  2       2 0  0 0 0 0 0 1 0 0 0
3    c    c    m       m       b       m  3       3 0  0 0 0 0 0 0 1 0 0
4    d    b    n       l       b       n  4       4 0  1 0 0 0 0 0 0 1 0
5    e    a    o       k       b      oz  5       5 0  0 0 0 0 0 0 0 0 1
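The NA in column b of row 1 comes from the missing col2 value: NA == "b" is NA, and NA & TRUE stays NA. For comparison, here is a base-R sketch of the same idea that needs no extra package. It is an alternative, not the approach above, and it assumes that a missing letter should simply count as "not contained" (0) and that letters are matched literally.

```r
df <- data.frame(
  col1 = letters[1:5], col2 = c(NA, letters[4:1]), col3 = letters[11:15],
  col1_rr = letters[15:11], col2_rr = letters[2],
  col3_rr = c(letters[11:14], "oz"),
  stringsAsFactors = FALSE
)

col.vars   <- c("col1", "col2", "col3")
colrr.vars <- paste0(col.vars, "_rr")

# Letters that occur in the colX columns (sort() drops the NA).
let <- sort(unique(unlist(df[col.vars])))

# hit[i, j]: is the letter in row i of the j-th colX contained in the paired colX_rr?
hit <- mapply(function(x, x_rr) {
  mapply(grepl, x, x_rr, MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
}, df[col.vars], df[colrr.vars])
hit[is.na(hit)] <- FALSE  # assumption: a missing letter counts as "not contained"

# One 0/1 column per letter: 1 if that letter produced a hit anywhere in the row.
vals  <- as.matrix(df[col.vars])
flags <- sapply(let, function(l) {
  as.integer(rowSums(vals == l & hit, na.rm = TRUE) > 0)
})

result <- cbind(df, flags)
```

To always get the 14 columns mentioned in the question, even for letters that never occur in the data, let could be set to letters[1:14] instead of being derived from the data.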