I am new to R programming and want to extract two kinds of words from strings: alphanumeric words (containing both letters and digits) AND words containing more than one uppercase letter.
Below is an example of the string and my desired output for it.
x <- c("123AB123 Electrical CDe FG123-4 ...",
"12/1/17 ABCD How are you today A123B",
"20.9.12 Eat / Drink XY1234 for PQRS1",
"Going home H123a1 ab-cd1",
"Change channel for al1234 to al5678")
#Desired Output
#[1] "123AB123 CDe FG123-4" "ABCD A123B" "XY1234 PQRS1"
#[4] "H123a1 ab-cd1" "al1234 al5678"
I have come across 2 separate solutions so far on Stack Overflow:
- Extract all words that contain a number --> Not helpful to me, because the column I'm applying the function to contains many date strings, e.g. "12/1/17 ABCD How are you today A123B"
- Identify strings that have more than one caps/uppercase --> Pierre Lafortune has provided the following solution:
how-to-count-capslock-in-string-using-r
library(stringr)
str_count(x, "\\b[A-Z]{2,}\\b")
His code counts, for each string, the words made up of two or more uppercase letters, but I want to extract those words, and the alphanumeric words as well.
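To make the gap concrete, here is a minimal base R sketch of what the two approaches above return for the sample data. The patterns are my own stand-ins for the linked solutions (an assumption, not the original code): the first extracts every word containing a digit, the second is a base R counterpart of the str_count call.

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Approach 1 (assumed pattern): every whitespace-free token with a digit.
# Date-like tokens are picked up too, which is the problem described above.
with_digit <- regmatches(x, gregexpr("\\S*\\d\\S*", x, perl = TRUE))
with_digit[[2]]
# [1] "12/1/17" "A123B"

# Approach 2: count words consisting only of 2+ uppercase letters.
# This yields a count per string, not the words themselves.
upper_counts <- lengths(regmatches(x, gregexpr("\\b[A-Z]{2,}\\b", x, perl = TRUE)))
upper_counts
# [1] 0 1 0 0 0
```

Neither result is the desired output: the first keeps the date, and the second misses mixed tokens such as "CDe" or "A123B".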
Forgive me if my question or research is not comprehensive enough. I will post the solution I found for extracting all words containing a number in 12 hours, when I have access to my workstation, which has R and the dataset.
2 Solutions
#1
A single regex solution will also work:
> res <- str_extract_all(x, "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)")
> unlist(res)
[1] "123AB123" "CDe" "FG123-4" "ABCD" "A123B" "XY1234"
[7] "PQRS1" "H123a1" "ab-cd1" "al1234" "al5678"
This will also work with regmatches in base R using the PCRE regex engine (perl=TRUE):
> res2 <- regmatches(x, gregexpr("(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)", x, perl=TRUE))
> unlist(res2)
[1] "123AB123" "CDe" "FG123-4" "ABCD" "A123B" "XY1234"
[7] "PQRS1" "H123a1" "ab-cd1" "al1234" "al5678"
Why does it work?
- (?<!\\S) - matches a position right after a whitespace char or at the start of the string
- (?: - start of a non-capturing group that has two alternative patterns defined:
  - (?=\\S*\\p{L})(?=\\S*\\d)\\S+
    - (?=\\S*\\p{L}) - makes sure there is a letter after 0+ non-whitespace chars (for better performance, replace \\S* with [^\\s\\p{L}]*)
    - (?=\\S*\\d) - makes sure there is a digit after 0+ non-whitespace chars (for better performance, replace \\S* with [^\\s\\d]*)
    - \\S+ - matches 1 or more non-whitespace chars
  - | - or
  - (?:\\S*\\p{Lu}){2}\\S*
    - (?:\\S*\\p{Lu}){2} - 2 occurrences of 0+ non-whitespace chars (\\S*; for better performance, replace with [^\\s\\p{Lu}]*) followed by 1 uppercase letter (\\p{Lu})
    - \\S* - 0+ non-whitespace chars
- ) - end of the non-capturing group.
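The breakdown above can be checked piece by piece. The following base R sketch tests each alternative on single tokens and then compares the original pattern with a variant that uses the suggested negated character classes. The helper names (has_letter_and_digit, has_two_uppercase, extract) and the token list are illustrative assumptions, not part of the answer.

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Each helper wraps one alternative, anchored to a whole token.
has_letter_and_digit <- function(tok) {
  grepl("^(?=\\S*\\p{L})(?=\\S*\\d)\\S+$", tok, perl = TRUE)
}
has_two_uppercase <- function(tok) {
  grepl("^(?:\\S*\\p{Lu}){2}\\S*$", tok, perl = TRUE)
}

tokens <- c("12/1/17", "ABCD", "CDe", "Electrical", "ab-cd1", "...")
has_letter_and_digit(tokens)
# [1] FALSE FALSE FALSE FALSE  TRUE FALSE
has_two_uppercase(tokens)
# [1] FALSE  TRUE  TRUE FALSE FALSE FALSE

# Original pattern and the variant with negated character classes.
original  <- "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)"
optimized <- "(?<!\\S)(?:(?=[^\\s\\p{L}]*\\p{L})(?=[^\\s\\d]*\\d)\\S+|(?:[^\\s\\p{Lu}]*\\p{Lu}){2}\\S*)"

extract <- function(pattern) regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Both patterns should extract the same words from the sample data.
identical(extract(original), extract(optimized))
```

On the sample data the two patterns extract the same words; any speed difference would only matter on much longer inputs and is not measured here.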
To join the matches found for each input string into a single string, you may use
unlist(lapply(res, function(c) paste(unlist(c), collapse=" ")))
See an online R demo.
Output:
[1] "123AB123 CDe FG123-4" "ABCD A123B" "XY1234 PQRS1"
[4] "H123a1 ab-cd1" "al1234 al5678"
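As a variation on the joining step, here is a sketch that uses only base R and vapply, which guarantees exactly one string per input element. The extra sixth input string is an assumption added here to show what happens when nothing matches.

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678",
       "nothing here to see")  # illustrative extra input with no matches

pattern <- "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)"
res <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# paste() collapses each element's matches; an element without matches
# becomes an empty string, so the result stays aligned with x.
joined <- vapply(res, paste, character(1), collapse = " ")
joined
```

Because the result has the same length as x, it can be assigned straight back to a data frame column; rows without matches end up as "".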
#2
This works:
library(stringr)
# split words from strings into one-word-per element vector
y <- unlist(str_split(x, ' '))
# find strings with at least 2 uppercase
uppers <- str_count(y, '[A-Z]')>1
# find strings with at least 1 letter
alphas <- str_detect(y, '[:alpha:]')
# find strings with at least 1 number
nums <- str_detect(y, '[:digit:]')
# subset vector to those that have 2 uppercase OR a letter AND a number
y[uppers | (alphas & nums)]
[1] "123AB123" "CDe" "FG123-4" "ABCD" "A123B" "XY1234"
[7] "PQRS1" "H123a1" "ab-cd1" "al1234" "al5678"
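Note that the code above returns one flat vector of words. If the words should stay grouped by their original string, as in the desired output, the same three tests can be applied per string. This is a base R sketch under that assumption; extract_words is a hypothetical helper name, and [[:upper:]] is used in place of [A-Z] because ranges can be locale-dependent in base R.

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Hypothetical helper: applies the uppercase/letter/digit tests to one string.
extract_words <- function(s) {
  words <- strsplit(s, "\\s+")[[1]]
  # number of uppercase letters in each word
  n_upper <- lengths(regmatches(words, gregexpr("[[:upper:]]", words)))
  has_alpha <- grepl("[[:alpha:]]", words)
  has_digit <- grepl("[[:digit:]]", words)
  # keep words with 2+ uppercase letters OR with a letter AND a digit
  keep <- n_upper > 1 | (has_alpha & has_digit)
  paste(words[keep], collapse = " ")
}

grouped <- vapply(x, extract_words, character(1), USE.NAMES = FALSE)
grouped
# [1] "123AB123 CDe FG123-4" "ABCD A123B"           "XY1234 PQRS1"
# [4] "H123a1 ab-cd1"        "al1234 al5678"
```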