Filtering observations in dplyr in combination with grepl

Time: 2021-08-13 22:26:05

I am trying to work out how to filter some observations from a large dataset using dplyr and grepl. I am not wedded to grepl if another solution would work better.

Take this sample df:

df1 <- data.frame(fruit = c("apple", "orange", "xapple", "xorange",
                            "applexx", "orangexx", "banxana", "appxxle"),
                  group = c("A", "B"))
df1


#     fruit group
#1    apple     A
#2   orange     B
#3   xapple     A
#4  xorange     B
#5  applexx     A
#6 orangexx     B
#7  banxana     A
#8  appxxle     B

I want to:

  1. filter out those cases beginning with 'x'
  2. filter out those cases ending with 'xx'

I have managed to work out how to get rid of everything that contains 'x' or 'xx', but not how to restrict this to strings that begin with 'x' or end with 'xx'. Here is how to get rid of everything with 'xx' anywhere inside (not just at the end):

library(dplyr)

df1 %>% filter(!grepl("xx", fruit))

#    fruit group
#1   apple     A
#2  orange     B
#3  xapple     A
#4 xorange     B
#5 banxana     A

This also removes 'appxxle', which (from my point of view) is wrong.

I have never fully got to grips with regular expressions. I've been trying to modify code such as: grepl("^(?!x).*$", df1$fruit, perl = TRUE) to try and make it work within the filter command, but am not quite getting it.
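
For reference, here is a minimal base R sketch of what that lookahead pattern does on its own. The vector `fruit` is an assumed stand-in for `df1$fruit`, and the other names are illustrative. The pattern `^(?!x).*$` matches any string whose first character is not 'x', so it covers the first requirement without a leading `!`, but it does nothing about the 'xx' ending.

```r
# Stand-in for df1$fruit (assumed name, same values as the sample data)
fruit <- c("apple", "orange", "xapple", "xorange",
           "applexx", "orangexx", "banxana", "appxxle")

# Negative lookahead: TRUE when the string does NOT start with "x".
# perl = TRUE is needed because lookarounds are a PCRE feature.
not_leading_x <- grepl("^(?!x).*$", fruit, perl = TRUE)

# The same logical vector comes from negating a plain anchored match
same_as_anchor <- identical(not_leading_x, !grepl("^x", fruit))

# Strings ending in "xx" are still kept by this pattern
kept <- fruit[not_leading_x]
```

Because the result is a logical vector, the same call can be passed to `filter()` with the bare column name, e.g. `filter(grepl("^(?!x).*$", fruit, perl = TRUE))`.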

Expected output:

#      fruit group
#1     apple     A
#2    orange     B
#3   banxana     A
#4   appxxle     B

I'd like to do this inside dplyr if possible.

1 solution

#1

I didn't understand your second regex, but this more basic regex seems to do the trick:

df1 %>% filter(!grepl("^x|xx$", fruit))
###
    fruit group
1   apple     A
2  orange     B
3 banxana     A
4 appxxle     B

And I assume you know this, but you don't have to use dplyr here at all:

df1[!grepl("^x|xx$", df1$fruit), ]
###
    fruit group
1   apple     A
2  orange     B
7 banxana     A
8 appxxle     B

The regex is looking for strings that start with x OR end with xx. The ^ and $ are regex anchors for the beginning and ending of the string respectively. | is the OR operator. We're negating the results of grepl with the ! so we're finding strings that don't match what's inside the regex.
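
To make the two halves of the pattern visible, here is a small base R sketch that evaluates each alternative separately and then combines them. The names `fruit`, `starts_x`, `ends_xx`, `keep`, and `keep_fixed` are illustrative and not part of the answer above. It also tries `startsWith()` and `endsWith()`, base R functions that match fixed strings, as one possible regex-free alternative.

```r
# Stand-in for df1$fruit (assumed name, same values as the sample data)
fruit <- c("apple", "orange", "xapple", "xorange",
           "applexx", "orangexx", "banxana", "appxxle")

# Each side of the "|" on its own
starts_x <- grepl("^x", fruit)   # ^ anchors the match at the start
ends_xx  <- grepl("xx$", fruit)  # $ anchors the match at the end

# Keep rows where neither condition holds
keep <- !(starts_x | ends_xx)

# The single combined pattern gives the same logical vector
same_as_combined <- identical(keep, !grepl("^x|xx$", fruit))

# Regex-free variant; startsWith()/endsWith() need a character vector
keep_fixed <- !(startsWith(fruit, "x") | endsWith(fruit, "xx"))

fruit[keep]
```

Inside dplyr the regex-free variant would read `filter(!(startsWith(fruit, "x") | endsWith(fruit, "xx")))`, assuming `fruit` is a character column; a factor column would need `as.character()` first.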
