Compare corresponding cells and create a new column named after the content

Time: 2021-06-08 22:54:38

I'm just starting with R and I thought I'd been doing pretty well, but this one killed me! :)

I've got a data frame:

df <- data.frame(
col1=letters[1:5],col2=c(NA,letters[4:1]),col3=letters[11:15],
col1_rr=letters[15:11], col2_rr=letters[2], col3_rr=c(letters[11:14], "oz"))

which looks like this:

     col1      col2    col3    col1_rr  col2_rr  col3_rr
 1     a        NA       k        o        b        k
 2     b        d        l        n        b        l
 3     c        c        m        m        b        m
 4     d        b        n        l        b        n
 5     e        a        o        k        b        oz

Note the pattern in column names. For each column colX, there's an equivalent of colX_rr.

Now, I want to check if the content of df[1,"col1"] is contained in df[1,"col1_rr"].

For example, this statement is false for df[1,"col1"], but true for all col3 cells (even df[5,"col3"], since o is contained in oz).

I know I can use grepl for that:

 > grepl(df[1,"col3"], df[1,"col3_rr"])
 [1] TRUE
 > grepl(df[2,"col1"], df[2,"col1_rr"])
 [1] FALSE
 > grepl(df[1,"col2"], df[1,"col2_rr"])
 [1] NA
 > grepl(df[5,"col3"], df[5,"col3_rr"])
 [1] TRUE
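
Note that grepl() uses only the first element of its pattern argument, so checking a whole pair of columns at once needs an element-wise wrapper. A minimal sketch, assuming the df defined above; the helper name contains_rowwise is hypothetical, and fixed = TRUE is an added choice so that the letters are matched literally rather than as regular expressions:

```r
df <- data.frame(
  col1 = letters[1:5], col2 = c(NA, letters[4:1]), col3 = letters[11:15],
  col1_rr = letters[15:11], col2_rr = letters[2],
  col3_rr = c(letters[11:14], "oz"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: is x[i] contained in y[i], row by row?
contains_rowwise <- function(x, y) {
  unname(mapply(function(p, s) grepl(p, s, fixed = TRUE), x, y))
}

contains_rowwise(df$col3, df$col3_rr)  # TRUE for all five rows
contains_rowwise(df$col2, df$col2_rr)  # NA for row 1, because col2 is NA there
```

The NA result is kept on purpose here, so that missing cells stay visible instead of being silently counted as non-matches.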

Next, the general rule: if the character y in cell [z, colX] is contained in the corresponding cell [z, colX_rr], I want to create a new column named y and enter 1 in it for row z. If the character y from [z, colX] is NOT contained in the corresponding [z, colX_rr], then enter 0 in df$y for that row.

So I would have something like this in the end:

     col1      col2    col3    col1_rr  col2_rr  col3_rr     a        b (...)  k(...)
 1     a        NA       k        o        b        k        0        0        1
 2     b        d        l        n        b        l        0        0        0
 3     c        c        m        m        b        m        0        0        0
 4     d        b        n        l        b        n        0        1        0
 5     e        a        o        k        b        oz       0        0        0

Each cell in the column range col1:col3 holds a single letter, and each letter occurs only once per participant (row). The content of the column range col1_rr:col3_rr is messier and contains strings of different lengths, but each letter still occurs only once per row.

Note that NAs also occur in the table.

I want this to be automated, because there are 50 columns in the real data. If necessary, though, I can write a separate line of script for each column.

The letters in the real data are from the range letters[1:14], so there will be only 14 new columns in the end, each containing values 0 or 1 (or TRUE/FALSE, if this makes the solution any simpler).

I've been trying ifelse and merge, but I don't know how to make them work for such a complex rule.

Thanks!

1 solution

#1


Here is a solution, though not a very elegant one. It uses the reshape2 package:

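library(reshape2)  # provides melt()

df <- data.frame(col1=letters[1:5],col2=c(NA,letters[4:1]),col3=letters[11:15],
                 col1_rr=letters[15:11], col2_rr=letters[2], col3_rr=c(letters[11:14], "oz"))
col.vars <- names(df)[1:3]    # the colX columns
colrr.vars <- names(df)[4:6]  # the matching colX_rr columns
df$id <- 1:nrow(df)
# Reshape both column groups to long format; their rows line up by id and column order
df.var <- melt(df[,c("id",col.vars)], id.vars="id")
df.var_rr <- melt(df[,c("id",colrr.vars)], id.vars="id")
let <- names(table(unlist(df[,1:3])))  # letters that occur in col1..col3
# TRUE where a cell equals letter l and l also occurs in the matching _rr cell
m <- data.frame(sapply(let, function(l) df.var$value==l & grepl(l, df.var_rr$value)))
# Sum the matches per row (id) and append them to the original data
cbind(df, aggregate(m, list(df.var$id), sum))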

Which gives:

  col1 col2 col3 col1_rr col2_rr col3_rr id Group.1 a  b c d e k l m n o
1    a <NA>    k       o       b       k  1       1 0 NA 0 0 0 1 0 0 0 0
2    b    d    l       n       b       l  2       2 0  0 0 0 0 0 1 0 0 0
3    c    c    m       m       b       m  3       3 0  0 0 0 0 0 0 1 0 0
4    d    b    n       l       b       n  4       4 0  1 0 0 0 0 0 0 1 0
5    e    a    o       k       b      oz  5       5 0  0 0 0 0 0 0 0 0 1
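
The NA in column b comes from the NA in col2 for row 1. If missing cells should count as 0 instead, and the column pairs should be found by name rather than by position, the same idea can be sketched in base R. This is an alternative sketch, not the method above; it assumes every colX has a matching colX_rr, and the names rr_cols, base_cols, hit, vals, lets and ind are illustrative:

```r
df <- data.frame(
  col1 = letters[1:5], col2 = c(NA, letters[4:1]), col3 = letters[11:15],
  col1_rr = letters[15:11], col2_rr = letters[2],
  col3_rr = c(letters[11:14], "oz"),
  stringsAsFactors = FALSE
)

# Pair each colX with its colX_rr by name instead of by position.
rr_cols   <- grep("_rr$", names(df), value = TRUE)
base_cols <- sub("_rr$", "", rr_cols)

# hit[i, j] is TRUE when df[i, base_cols[j]] occurs in df[i, rr_cols[j]].
hit <- mapply(function(x, y) {
  ok  <- !is.na(x) & !is.na(y)
  out <- logical(length(x))  # cells with an NA on either side stay FALSE
  out[ok] <- vapply(which(ok),
                    function(i) grepl(x[i], y[i], fixed = TRUE),
                    logical(1))
  out
}, df[base_cols], df[rr_cols])

# One 0/1 column per letter that appears in the colX columns.
vals <- as.matrix(df[base_cols])
lets <- sort(unique(vals[!is.na(vals)]))
ind  <- sapply(lets, function(l) as.integer(rowSums(hit & vals == l) > 0))

result <- cbind(df, ind)
```

Because the columns are selected with the "_rr$" pattern, the same code should carry over to the 50-column data as long as the naming convention holds. Letters that never occur in the colX columns get no indicator column in this sketch.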
