I am trying to match an exact word in a vector of variable strings. For this I am using word boundaries. However, I would like the hyphen/dash not to be treated as a word boundary. Here is an example:
vector<-c(
"ARNT",
"ACF, ASP, ACF64",
"BID",
"KTN1, KTN",
"NCRNA00181, A1BGAS, A1BG-AS",
"KTN1-AS1")
To match strings that contain "KTN1" I am using:
grep("(?i)(?=.*\\bKTN1\\b)", vector, perl=T)
But this matches both "KTN1" and "KTN1-AS1".
Is there a way I could treat the dash as a word character so that "KTN1-AS1" is considered a whole word?
2 solutions
#1
4
To extract a particular word from a vector element, you need to use functions like regmatches or str_extract_all (from the stringr package) rather than grep, since grep by default returns only the indices of the elements where a match is found.
> vector<-c(
+ "ARNT",
+ "ACF, ASP, ACF64",
+ "BID",
+ "KTN1, KTN",
+ "NCRNA00181, A1BGAS, A1BG-AS",
+ "KTN1-AS1")
> regmatches(vector, regexpr("(?i)\\bKTN1[-\\w]*\\b", vector, perl=T))
[1] "KTN1" "KTN1-AS1"
OR
> library(stringr)
> unlist(str_extract_all(vector, regex("\\bKTN1[-\\w]*\\b", ignore_case = TRUE)))
[1] "KTN1" "KTN1-AS1"
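One point worth illustrating: regexpr reports only the first match in each element, while gregexpr reports all of them. The following is a minimal base-R sketch of that difference; the string `extra` is a made-up example that is not part of the question's data.

```r
# Sample data from the question.
vector <- c("ARNT", "ACF, ASP, ACF64", "BID", "KTN1, KTN",
            "NCRNA00181, A1BGAS, A1BG-AS", "KTN1-AS1")

# A hyphen is allowed inside the matched token via the class [-\w].
pattern <- "\\bKTN1[-\\w]*\\b"

# grep() tells us which elements match (indices by default).
idx <- grep(pattern, vector, perl = TRUE, ignore.case = TRUE)

# gregexpr() + regmatches() returns every matched substring, as a list
# with one entry per element (character(0) where nothing matched).
all_hits <- unlist(regmatches(vector,
                              gregexpr(pattern, vector,
                                       perl = TRUE, ignore.case = TRUE)))

# Hypothetical element holding two tokens that both start with KTN1.
extra <- "KTN1, KTN1-AS1"
first_only <- regmatches(extra, regexpr(pattern, extra, perl = TRUE))
every_hit  <- regmatches(extra, gregexpr(pattern, extra, perl = TRUE))[[1]]
```

Here `first_only` holds just the first token of `extra`, whereas `every_hit` holds both.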
Update:
> grep("\\bKTN1(?=$|,)", vector, perl=T, value=T)
[1] "KTN1, KTN"
Returns the elements that contain the string KTN1 followed by a comma or the end of the line.
OR
> grep("\\bKTN1\\b(?!-)", vector, perl=T, value=T)
[1] "KTN1, KTN"
Returns the elements that contain the string KTN1 not followed by a hyphen.
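Both patterns in the update only look at what follows KTN1. If a hyphen could also appear before the word, one possible variation (a sketch, not something the question asked for) is to put a negative lookbehind and a negative lookahead on the same class `[-\w]`. The elements "AS-KTN1" and "KTN1" added to `extra` below are hypothetical test cases.

```r
# Sample data from the question, plus two hypothetical extra elements.
vector <- c("ARNT", "ACF, ASP, ACF64", "BID", "KTN1, KTN",
            "NCRNA00181, A1BGAS, A1BG-AS", "KTN1-AS1")
extra <- c(vector, "AS-KTN1", "KTN1")

# KTN1 must be neither preceded nor followed by a hyphen or a word character.
symmetric <- "(?<![-\\w])KTN1(?![-\\w])"
sym_hits <- grep(symmetric, extra, perl = TRUE, ignore.case = TRUE,
                 value = TRUE)

# The lookahead-only pattern from the update, for comparison: it still
# accepts "AS-KTN1" because nothing checks the character before KTN1.
ahead_only <- "\\bKTN1\\b(?!-)"
ahead_hits <- grep(ahead_only, extra, perl = TRUE, value = TRUE)
```

With this data, `sym_hits` keeps "KTN1, KTN" and "KTN1", while `ahead_hits` additionally keeps "AS-KTN1".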
#2
3
I would keep this simple and create a DIY Boundary.
grep('(^|[^-\\w])KTN1([^-\\w]|$)', vector, ignore.case = TRUE)
We use capture groups to define the boundaries. On each side of KTN1 we match either a character that is neither a hyphen nor a word character, or the beginning or end of the string, which is closer to the intent of the \b boundary.
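If the same DIY boundary is needed for several words, it could be wrapped in a small helper. The sketch below assumes PCRE (`perl = TRUE`) so that `\Q...\E` can quote the word literally; the function name `has_whole_word` is made up for illustration and is not part of any package.

```r
# Hypothetical helper: TRUE for each element of `x` that contains `word`
# as a whole token, where hyphens count as part of a token.
has_whole_word <- function(word, x) {
  # \Q...\E makes PCRE treat `word` literally
  # (assumes `word` itself does not contain "\E").
  pattern <- paste0("(^|[^-\\w])\\Q", word, "\\E([^-\\w]|$)")
  grepl(pattern, x, perl = TRUE, ignore.case = TRUE)
}

# Sample data from the question.
vector <- c("ARNT", "ACF, ASP, ACF64", "BID", "KTN1, KTN",
            "NCRNA00181, A1BGAS, A1BG-AS", "KTN1-AS1")

ktn1_elements   <- vector[has_whole_word("KTN1", vector)]
a1bg_as_matches <- has_whole_word("A1BG-AS", vector)
```

Because the groups consume the boundary characters, this form suits yes/no tests with grepl or grep better than extraction, where the matched text would include the neighbouring delimiter.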