I have a vector of text strings containing smilies and a dictionary containing only the smilies.
A <- c("This :/ :/ :) ^^","is :/ ^^", "weird^^ :)")
B <- c(":)",":/","^^")
I would like to extract all matches of smilies for each text string including duplicates, so my output should look like this:
[[1]]
[1] ":/" ":/" ":)" "^^"
[[2]]
[1] ":/" "^^"
[[3]]
[1] "^^" ":)"
This is what I tried so far:
# does not return duplicates
sapply(A, function(x) B[str_detect(x, fixed(B))], USE.NAMES = FALSE)
[[1]]
[1] ":)" ":/" "^^"
[[2]]
[1] ":/" "^^"
[[3]]
[1] ":)" "^^"
# Only returns first instance
str_extract_all(A,fixed(B))
[[1]]
[1] ":)"
[[2]]
[1] ":/"
[[3]]
[1] "^^"
# returns error because of unescaped characters
rm_default(A,pattern=B,fixed=TRUE,extract=TRUE)
Error in stringi::stri_extract_all_regex(text.var, pattern) :
Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)
In addition: Warning messages:
1: In if (substring(pattern, 1, 4) == "@rm_") { :
the condition has length > 1 and only the first element will be used
2: In if (substring(pattern, 1, 1) == "@") { :
the condition has length > 1 and only the first element will be used
Any help is much appreciated.
2 Solutions
#1
One option is to split each string with strsplit and then keep only the elements that are contained in 'B':
lapply(strsplit(A, "[A-Za-z ]"), function(x) x[x %in% B])
#[[1]]
#[1] ":/" ":/" ":)" "^^"
#[[2]]
#[1] ":/" "^^"
#[[3]]
#[1] "^^" ":)"
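To make the two steps of this approach easier to follow, here is a minimal sketch that separates the split from the filter. The variable names tokens, smileys and glued, and the test string "nice:)!", are illustrative assumptions and not part of the original answer. The last call shows a limitation that seems to follow from exact matching with %in%: a smiley attached to another non-letter character stays in one token and is not found.

```r
A <- c("This :/ :/ :) ^^", "is :/ ^^", "weird^^ :)")
B <- c(":)", ":/", "^^")

# Step 1: split on letters and spaces. What remains are runs of other
# characters, plus empty strings where split characters are adjacent.
tokens <- strsplit(A, "[A-Za-z ]")

# Step 2: keep only tokens that are exactly equal to a dictionary entry.
# %in% compares whole strings, so empty strings and unknown tokens drop out.
smileys <- lapply(tokens, function(x) x[x %in% B])
print(smileys)

# Caveat (hypothetical input): ":)!" is a single token and is not in B,
# so nothing is extracted from this string.
glued <- lapply(strsplit("nice:)!", "[A-Za-z ]"), function(x) x[x %in% B])
print(glued)
```

If the texts can contain punctuation next to the smileys, the regex-based approach in the next answer is probably the safer choice.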
#2
You may build a regex dynamically from the items in B. First sort the items by length in descending order, so that if you have both :)) and :), the longer one gets a chance to match first. This is a requirement for an unanchored NFA expression, where the first alternative in an alternation group "wins" (see the "Remember That The Regex Engine Is Eager" section). Then escape each item, join the items with |, and call regmatches / stringr::str_extract_all:
# Escape regex metacharacters (including the dot) so each item is matched literally
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]
A <- c("This :/ :/ :) ^^","is :/ ^^", "weird^^ :)")
B <- c(":)",":/","^^")
B <- sort.by.length.desc(B)
pattern <- paste(regex.escape(B), collapse="|")
regmatches(A, gregexpr(pattern, A))
See the R demo online.
In this case, the pattern will be :\)|:/|\^\^ and the output will be:
[[1]]
[1] ":/" ":/" ":)" "^^"
[[2]]
[1] ":/" "^^"
[[3]]
[1] "^^" ":)"
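The sample dictionary has no overlapping entries, so the effect of sorting by length is not visible in the output above. The following sketch uses a hypothetical dictionary containing both :) and :)) to illustrate it. The helper names escape_regex and build_pattern and the test string are assumptions made for this example. It passes perl = TRUE so that PCRE's ordered alternation applies; R's default regex engine may resolve alternatives differently, in which case the sorting is simply harmless.

```r
# Escape regex metacharacters so each smiley is matched literally.
escape_regex <- function(x) gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", x)

# Build an alternation pattern, optionally putting longer items first.
build_pattern <- function(dict, sort_by_length = TRUE) {
  if (sort_by_length) dict <- dict[order(-nchar(dict))]
  paste(escape_regex(dict), collapse = "|")
}

text <- "good :)) fine :)"
dict <- c(":)", ":))")  # hypothetical dictionary with overlapping entries

# Unsorted: ":)" is tried first, so ":))" is never matched as a whole.
unsorted <- regmatches(
  text, gregexpr(build_pattern(dict, FALSE), text, perl = TRUE)
)[[1]]

# Sorted by length: ":))" is tried before ":)".
sorted <- regmatches(
  text, gregexpr(build_pattern(dict, TRUE), text, perl = TRUE)
)[[1]]

print(unsorted)
print(sorted)
```

With the unsorted pattern the double smiley is reported as a plain :), while the sorted pattern returns it intact.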