How to remove duplicate elements in a character vector

s <- "height(female), weight, BMI, and BMI."

In the above string, the word BMI is repeated twice. I would like the string to be:

"height (female), weight, and BMI."

I have tried the following to break the string down into unique parts:

> unique(strsplit(s, " ")[[1]])
[1] "height"      "(female),"   "weight,"    "BMI," "and"         "BMI."

But since "BMI," and "BMI." are not the same strings, using unique does not get rid of one of them.

EDIT: How can I go about removing repeated phrases (e.g. "body mass index" instead of "BMI")?

s <- "height (female), weight, weight, body mass index, body mass index." 
s <- stringr::str_replace(s, "(?<=, |^)\\b([()\\w\\s]+),\\s(.*?)((?: and)?(?=\\1))", "\\2") 
> stringr::str_replace(s, "(\\w+)(\\(.*?\\))", "\\1 \\2")
[1] "height (female), weight, body mass index, body mass index."

3 solutions

#1

It might help to replace the unwanted duplicates first using a regex like this:

(?<=, |^)\b([()\w\s]+),\s(.*?)((?: and)?(?=\1))

Explanation

(?<=, |^)\b - front boundary: the match must start at the beginning of the string or right after ", " (\b alone should work too, but would not be properly anchored)
([()\w\s]+), - the list element, captured in group 1, followed by a comma
\s(.*?)((?: and)? - everything in between, including an optional " and"
(?=\1)) - lookahead for the repeated element

Code Sample:

# install.packages("stringr")
library(stringr)
s <- "height(female), weight, BMI, and BMI."
# Drop the first occurrence of the repeated element; assign the result for the next step
s <- stringr::str_replace(s, "(?<=, |^)\\b([()\\w\\s]+),\\s(.*?)((?: and)?(?=\\1))", "\\2")
s

Output:

[1] "height(female), weight, and BMI."

To separate the part in parentheses from the preceding word, use a second replacement like this:

stringr::str_replace(s, "(\\w+)(\\(.*?\\))", "\\1 \\2")

Output:

[1] "height (female), weight, and BMI."

Putting it all together and testing:

s <- c("height(female), weight, BMI, and BMI."
       ,"height(female), weight, whatever it is, and whatever it is."
       ,"height(female), weight, age, height(female), and BMI."
       ,"weight, weight.")
s <- stringr::str_replace(s, "(?<=, |^)\\b([()\\w\\s]+),\\s(.*?)((?: and)?(?=\\1))", "\\2")
stringr::str_replace(s, "(\\w+)(\\(.*?\\))", "\\1 \\2")

Output:

[1] "height (female), weight, and BMI."      "height (female), weight, and whatever it is."
[3] "weight, age, height (female), and BMI." "weight."
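
The question's edit shows that a single str_replace call removes only the first duplicated element. One possible way around that, sketched here in base R rather than stringr, is to re-apply the same pattern until the string stops changing. The helper name remove_all_dups is hypothetical, and the sketch assumes that sub() with perl = TRUE treats this pattern the same way stringr does.

```r
# Pattern from the answer above, unchanged.
pattern <- "(?<=, |^)\\b([()\\w\\s]+),\\s(.*?)((?: and)?(?=\\1))"

# Hypothetical helper: repeat the single replacement until nothing changes.
# Works on one string at a time.
remove_all_dups <- function(s) {
  repeat {
    out <- sub(pattern, "\\2", s, perl = TRUE)
    if (identical(out, s)) return(out)  # no duplicate left
    s <- out
  }
}

s <- "height (female), weight, weight, body mass index, body mass index."
result <- remove_all_dups(s)
print(result)
```

For the phrase example from the question, this should leave one "weight" and one "body mass index". Each pass shortens the string, so the loop terminates.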

#2

You can give this regex a try:

(\b\w+\b)[^\w\r\n]+(?=.*\1)

and replace each match with an empty string.

Input

height(female), weight, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, and BMI.
height(female), weight, BMI, age, and BMI.

Output

height(female), weight, and BMI.
height(female), weight, age, and BMI.

Explanation:

(\b\w+\b) - matches one or more word characters between word boundaries (a whole word) and captures it in group 1
[^\w\r\n]+ - matches one or more characters that are neither word characters nor line breaks, so it matches commas, periods, parentheses, or spaces
(?=.*\1) - positive lookahead that checks that the text captured in group 1 appears again later in the string; only then is the match replaced

Note: This keeps the last occurrence of each repeated word.

Alternatively, if the repeated items contain spaces too, you can use (\b[^,]+)[, ]+(?=.*\1).
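
A minimal sketch of how the first pattern in this answer might be applied in R, assuming base gsub() with perl = TRUE (needed for the lookahead). The variable names are illustrative.

```r
# Pattern from this answer: a whole word plus its trailing separators,
# matched only if the same text occurs again later in the string.
pattern <- "(\\b\\w+\\b)[^\\w\\r\\n]+(?=.*\\1)"

s <- c(
  "height(female), weight, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, and BMI.",
  "height(female), weight, BMI, age, and BMI."
)

# gsub() replaces every match in each string; perl = TRUE enables lookaheads.
result <- gsub(pattern, "", s, perl = TRUE)
print(result)
```

The lookahead does not require a whole-word match, so a short word such as "age" would also be dropped if a later word merely contains it (for example "average"). Writing the lookahead as (?=.*\b\1\b) is one way to tighten this.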

#3

library(stringr)

s <- "height(female), weight, BMI, and BMI, and more even more BMI."
pieces <- unlist(str_split(s, "\\b"))
non_word <- !grepl("\\w", pieces)

# if you want to keep just the last instance of a duplicated word
non_duped <- !duplicated(pieces, fromLast = TRUE)
paste0(pieces[non_word | non_duped], collapse = "")
#> [1] "height(female), weight, ,  , and  even more BMI."

# if you want to keep just the first instance of a duplicated word
non_duped <- !duplicated(pieces, fromLast = FALSE)
paste0(pieces[non_word | non_duped], collapse = "")
#> [1] "height(female), weight, BMI, and ,  more even  ."
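
Both results above keep the separators of the dropped words, which leaves stray commas and spaces. One possible variation, sketched in base R, treats the string as a comma-separated list and deduplicates whole items instead of single words, which also covers multi-word phrases. The helper name dedupe_items and the comparison rules (ignore a leading "and" and trailing punctuation) are assumptions for illustration.

```r
# Hypothetical helper: deduplicate the items of a comma-separated list,
# keeping the last occurrence of each item.
dedupe_items <- function(s) {
  items <- strsplit(s, ",[[:space:]]*")[[1]]
  # Comparison key: ignore a leading "and " and trailing ".", "," or ";"
  key <- sub("^and[[:space:]]+", "", items)
  key <- sub("[.,;]+$", "", key)
  keep <- !duplicated(key, fromLast = TRUE)
  paste(items[keep], collapse = ", ")
}

r1 <- dedupe_items("height(female), weight, BMI, and BMI.")
r2 <- dedupe_items("height (female), weight, weight, body mass index, body mass index.")
print(r1)
print(r2)
```

Because the last occurrence is kept, the final item keeps its "and" and its closing period. Keeping the first occurrence instead (fromLast = FALSE) would need extra handling for both.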

