Extracting words with more than 1 uppercase letter, and alphanumeric words, using R

Date: 2022-02-11 21:38:17

I am new to R programming and want to extract two kinds of words from strings: alphanumeric words (containing both letters and digits) AND words containing more than 1 uppercase letter.

Below is an example input and my desired output for it.

    x <- c("123AB123 Electrical CDe FG123-4 ...", 
           "12/1/17 ABCD How are you today A123B", 
           "20.9.12 Eat / Drink XY1234 for PQRS1",
           "Going home H123a1 ab-cd1",
           "Change channel for al1234 to al5678")

    #Desired Output
    #[1] "123AB123 CDe FG123-4"  "ABCD A123B"  "XY1234 PQRS"  
    #[2] "H123a1 ab-cd1"  "al1234 al5678"

So far I have come across 2 separate solutions on Stack Overflow:

  1. Extract all words that contain a number --> Not helpful to me, because the column I'm applying the function to contains many date strings, e.g. "12/1/17 ABCD How are you today A123B"
  2. Identify strings that have more than one uppercase letter --> Pierre Lafortune has provided the following solution:

how-to-count-capslock-in-string-using-r

    library(stringr)
    str_count(x, "\\b[A-Z]{2,}\\b") 

His code counts the words in each string that consist of 2 or more uppercase letters, but I want to extract the words that contain more than 1 uppercase letter, and extract the alphanumeric words as well.
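To illustrate the gap, here is a minimal base-R sketch (added for illustration, not part of the linked answer) that swaps the count for an extraction with the same pattern, using `regmatches()` and `gregexpr()` with `perl = TRUE`. The variable names `caps_pattern` and `caps_only` are placeholders. Because `\\b...\\b` requires the uppercase run to be a whole word, only the all-uppercase word is returned on the example vector.

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Same pattern as in the counting solution: a whole word of 2+ uppercase letters
caps_pattern <- "\\b[A-Z]{2,}\\b"

# Extract the matches instead of counting them
caps_only <- unlist(regmatches(x, gregexpr(caps_pattern, x, perl = TRUE)))
print(caps_only)
# [1] "ABCD"
```

Words such as "CDe", "123AB123" and "XY1234" are missed, which is why this pattern alone does not produce the desired output.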

Forgive me if my question or research is not comprehensive enough. I will post the solution I found for extracting all words containing a number in 12 hours, when I have access to my workstation, which has R and the dataset.

2 Solutions

#1


A single regex solution will also work:

> library(stringr)
> res <- str_extract_all(x, "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)")
> unlist(res)
 [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"  
 [7] "PQRS1"    "H123a1"   "ab-cd1"   "al1234"   "al5678"

This will also work with regmatches in base R using the PCRE regex engine:

> res2 <- regmatches(x, gregexpr("(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)", x, perl=TRUE))
> unlist(res2)
 [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"  
 [7] "PQRS1"    "H123a1"   "ab-cd1"   "al1234"   "al5678" 

Why does it work?

  • (?<!\\S) - finds a position at the start of the string or right after a whitespace char
  • (?: - start of a non-capturing group that has two alternative patterns defined:
    • (?=\\S*\\p{L})(?=\\S*\\d)\\S+
      • (?=\\S*\\p{L}) - make sure there is a letter after 0+ non-whitespace chars (for better performance, replace \\S* with [^\\s\\p{L}]*)
      • (?=\\S*\\d) - make sure there is a digit after 0+ non-whitespace chars (for better performance, replace \\S* with [^\\s\\d]*)
      • \\S+ - match 1 or more non-whitespace chars
    • | - or
    • (?:\\S*\\p{Lu}){2}\\S*:
      • (?:\\S*\\p{Lu}){2} - 2 occurrences of 0+ non-whitespace chars (\\S*; for better performance, replace with [^\\s\\p{Lu}]*) followed by 1 uppercase letter (\\p{Lu})
      • \\S* - 0+ non-whitespace chars
  • ) - end of the non-capturing group.
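The breakdown above can be checked by running each alternative on its own. The following is a minimal base-R sketch, not part of the original answer: `extract_all`, `alnum_branch`, `upper_branch`, `full_pattern` and `fast_pattern` are names made up for this illustration, and `fast_pattern` simply applies the negated-class replacements suggested in the list.

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Hypothetical helper: return every PCRE match of `pattern` in `text`
extract_all <- function(pattern, text) {
  unlist(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
}

# Alternative 1: a chunk containing at least one letter and at least one digit
alnum_branch <- "(?<!\\S)(?=\\S*\\p{L})(?=\\S*\\d)\\S+"
# Alternative 2: a chunk containing at least two uppercase letters
upper_branch <- "(?<!\\S)(?:\\S*\\p{Lu}){2}\\S*"

alnum_words <- extract_all(alnum_branch, x)
upper_words <- extract_all(upper_branch, x)
print(alnum_words)
# [1] "123AB123" "FG123-4"  "A123B"    "XY1234"   "PQRS1"    "H123a1"   "ab-cd1"
# [8] "al1234"   "al5678"
print(upper_words)
# [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"   "PQRS1"

# Full pattern from the answer, and the variant with negated character classes
full_pattern <- "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)"
fast_pattern <- "(?<!\\S)(?:(?=[^\\s\\p{L}]*\\p{L})(?=[^\\s\\d]*\\d)\\S+|(?:[^\\s\\p{Lu}]*\\p{Lu}){2}\\S*)"
same_result <- identical(extract_all(full_pattern, x), extract_all(fast_pattern, x))
```

"CDe" and "ABCD" are picked up only by the second alternative, while "H123a1", "ab-cd1", "al1234" and "al5678" are picked up only by the first, which is why the alternation is needed.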

To join the matches found in each element of the character vector, you may use

unlist(lapply(res, function(c) paste(unlist(c), collapse=" ")))

See an online R demo.

Output:

[1] "123AB123 CDe FG123-4" "ABCD A123B"           "XY1234 PQRS1"        
[4] "H123a1 ab-cd1"        "al1234 al5678" 
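As an alternative way to do the join, here is a minimal base-R sketch using `vapply()`. It is an illustration rather than the answer's own code, and the names `pattern`, `x2`, `res_base` and `joined` are placeholders. The extra element "nothing to extract" is added only to show what happens when a string has no match.

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

pattern <- "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)"

# Add one string without any qualifying word (illustrative input)
x2 <- c(x, "nothing to extract")

# One character vector of matches per input string
res_base <- regmatches(x2, gregexpr(pattern, x2, perl = TRUE))

# Collapse each set of matches into a single space-separated string;
# vapply() guarantees exactly one string per input element
joined <- vapply(res_base, paste, character(1), collapse = " ")
print(joined)
# [1] "123AB123 CDe FG123-4" "ABCD A123B"           "XY1234 PQRS1"
# [4] "H123a1 ab-cd1"        "al1234 al5678"        ""
```

A string with no match yields an empty string, so the result stays aligned with the input and can be assigned back to a data frame column of the same length.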

#2


This works:

library(stringr)

# split words from strings into one-word-per element vector
y <- unlist(str_split(x, ' '))

# find strings with at least 2 uppercase
uppers <- str_count(y, '[A-Z]')>1

# find strings with at least 1 letter
alphas <- str_detect(y, '[:alpha:]')

# find strings with at least 1 number
nums <- str_detect(y, '[:digit:]')

# subset vector to those that have 2 uppercase OR a letter AND a number
y[uppers | (alphas & nums)]

 [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"  
 [7] "PQRS1"    "H123a1"   "ab-cd1"   "al1234"   "al5678" 
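The approach above flattens all words into one vector, so the link between a word and its source string is lost. A minimal base-R sketch that applies the same three tests per string, and so returns one joined result per input element as in the desired output, might look like this. `keep_words` is a hypothetical helper name, and base-R functions are used in place of the stringr calls.

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Hypothetical helper: keep the qualifying words of a single string
keep_words <- function(s) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  # number of uppercase letters in each word
  n_upper <- lengths(regmatches(words, gregexpr("[A-Z]", words)))
  has_alpha <- grepl("[[:alpha:]]", words)  # at least 1 letter
  has_digit <- grepl("[[:digit:]]", words)  # at least 1 digit
  # 2+ uppercase letters OR (a letter AND a digit)
  paste(words[n_upper > 1 | (has_alpha & has_digit)], collapse = " ")
}

per_string <- vapply(x, keep_words, character(1), USE.NAMES = FALSE)
print(per_string)
# [1] "123AB123 CDe FG123-4" "ABCD A123B"           "XY1234 PQRS1"
# [4] "H123a1 ab-cd1"        "al1234 al5678"
```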
