Count the number of words in a string in R?

Time: 2022-09-13 11:49:46

Is there a function to count the number of words in a string? For example:

str1 <- "How many words are in this sentence"

should return a result of 7.

Thanks.

13 solutions

#1


58  

Use the regular expression symbol \\W to match non-word characters, with + to indicate one or more in a row, along with gregexpr to find all matches in a string. The number of words is the number of word separators plus 1.

lengths(gregexpr("\\W+", str1)) + 1

This will fail with blank strings at the beginning or end of the character vector, or when a "word" doesn't satisfy \\W's notion of non-word (one could work with other regular expressions, such as \\S+ or [[:alpha:]], but there will always be edge cases with a regex approach). It is likely more efficient than strsplit solutions, which allocate memory for each word. Regular expressions are described in ?regex.

Update: As noted in the comments and in a different answer by @Andri, the approach fails with zero- and one-word strings, and with trailing punctuation:

> str1 = c("", "x", "x y", "x y!" , "x y! z")
> lengths(gregexpr("[A-z]\\W+", str1)) + 1L
[1] 2 2 2 3 3

Many of the other answers also fail in these or similar (e.g., multiple spaces) cases. I think the caveat about the 'notion of one word' in my original answer covers problems with punctuation (solution: choose a different regular expression, e.g., [[:space:]]+), but the zero- and one-word cases are a problem; @Andri's solution fails to distinguish between zero and one words. So, taking a 'positive' approach to finding words, one might use:

sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))

Leading to

> sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
[1] 0 1 2 2 3

Again the regular expression might be refined for different notions of 'word'.
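
As one illustration of such a refinement (a sketch only; the helper name count_words and the apostrophe/hyphen pattern are assumptions, not part of the original answer), contractions and hyphenated compounds can be treated as single words:

```r
# Hypothetical helper: count matches of `pattern` in each element of `x`.
# regmatches() yields character(0) when nothing matches, so an empty
# string counts as zero words.
count_words <- function(x, pattern = "[[:alpha:]]+") {
  lengths(regmatches(x, gregexpr(pattern, x)))
}

x <- c("", "x y!", "don't stop", "a well-known fact")

# Plain runs of letters: "don't" and "well-known" are each split in two.
plain <- count_words(x)
plain
# [1] 0 2 3 4

# Assumed refinement: letters, optionally joined by ' or - to more letters.
refined <- count_words(x, "[[:alpha:]]+(['-][[:alpha:]]+)*")
refined
# [1] 0 2 2 3
```

Whether "don't" is one word or two depends on the application, which is the point of the caveat above.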

I like the use of gregexpr() because it's memory efficient. An alternative that uses strsplit() (like @user813966, but with a regular expression to delimit words) and keeps the original notion of delimiting words is:

> lengths(strsplit(str1, "\\W+"))
[1] 0 1 2 2 3

This needs to allocate new memory for each word that is created, and for the intermediate list-of-words. This could be relatively expensive when the data is 'big', but probably it's effective and understandable for most purposes.

#2


29  

The simplest way would be:

require(stringr)
str_count("one,   two three 4,,,, 5 6", "\\S+")

... counting all sequences of non-space characters (\\S+).

But what about a little function that lets us also decide which kind of words we would like to count and which works on whole vectors as well?

library(stringr)
nwords <- function(string, pseudo = FALSE) {
  # pseudo = TRUE counts any run of non-space characters;
  # otherwise only runs of letters count as words
  pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
  str_count(string, pattern)
}

nwords("one,   two three 4,,,, 5 6")
# 3

nwords("one,   two three 4,,,, 5 6", pseudo = TRUE)
# 6

#3


15  

str2 <- gsub(' {2,}',' ',str1)
length(strsplit(str2,' ')[[1]])

The gsub(' {2,}',' ',str1) makes sure all words are separated by one space only, by replacing all occurrences of two or more spaces with one space.

The strsplit(str2,' ') splits the sentence at every space and returns the result in a list. The [[1]] extracts the vector of words from that list, and length counts how many words there are.

> str1 <- "How many words are in this     sentence"
> str2 <- gsub(' {2,}',' ',str1)
> str2
[1] "How many words are in this sentence"
> strsplit(str2,' ')
[[1]]
[1] "How"      "many"     "words"    "are"      "in"       "this"     "sentence"
> strsplit(str2,' ')[[1]]
[1] "How"      "many"     "words"    "are"      "in"       "this"     "sentence"
> length(strsplit(str2,' ')[[1]])
[1] 7
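
The session above handles a single string with repeated spaces. A vectorised variant, sketched here under the assumption that words are separated by any whitespace (the name count_by_split is hypothetical), also trims leading and trailing blanks first:

```r
# Hypothetical helper: trim the ends, then split on runs of whitespace.
# strsplit() returns character(0) for an empty string, so "" counts as 0.
count_by_split <- function(x) {
  lengths(strsplit(trimws(x), "[[:space:]]+"))
}

counts <- count_by_split(c("", "  How many   words ", "one"))
counts
# [1] 0 3 1
```

Punctuation still sticks to the neighbouring word with this approach, so "x y!" counts as two words and "4,,,," as one.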

#4


11  

You can use str_match_all, with a regular expression that would identify your words. The following works with initial, final and duplicated spaces.

library(stringr)
s <-  "
  Day after day, day after day,
  We stuck, nor breath nor motion;
"
m <- str_match_all( s, "\\S+" )  # Sequences of non-spaces
length(m[[1]])

#5


10  

Try this function from the stringi package:

> require(stringi)
> s <- c("Lorem ipsum dolor sit amet, consectetur adipisicing elit.",
+        "nibh augue, suscipit a, scelerisque sed, lacinia in, mi.",
+        "Cras vel lorem. Etiam pellentesque aliquet tellus.",
+        "")
> stri_stats_latex(s)
    CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds        Envirs 
          133             0            30            24             0             0 

#6


9  

I use the str_count function from the stringr library with the escape sequence \w, which represents:

any ‘word’ character (letter, digit or underscore in the current locale: in UTF-8 mode only ASCII letters and digits are considered)

Example:

> str_count("How many words are in this sentence", '\\w+')
[1] 7

Of the 9 other answers that I was able to test, only two (by Vincent Zoonekynd and by petermeissner) worked for all inputs presented here so far, but they also require stringr.

But only this solution works with all inputs presented so far, plus inputs such as "foo+bar+baz~spam+eggs" or "Combien de mots sont dans cette phrase ?".

Benchmark:

library(stringr)

questions <-
  c(
    "", "x", "x y", "x y!", "x y! z",
    "foo+bar+baz~spam+eggs",
    "one,   two three 4,,,, 5 6",
    "How many words are in this sentence",
    "How  many words    are in this   sentence",
    "Combien de mots sont dans cette phrase ?",
    "
    Day after day, day after day,
    We stuck, nor breath nor motion;
    "
  )

answers <- c(0, 1, 2, 2, 3, 5, 6, 7, 7, 7, 12)

score <- function(f) sum(unlist(lapply(questions, f)) == answers)

funs <-
  c(
    function(s) sapply(gregexpr("\\W+", s), length) + 1,
    function(s) sapply(gregexpr("[[:alpha:]]+", s), function(x) sum(x > 0)),
    function(s) vapply(strsplit(s, "\\W+"), length, integer(1)),
    function(s) length(strsplit(gsub(' {2,}', ' ', s), ' ')[[1]]),
    function(s) length(str_match_all(s, "\\S+")[[1]]),
    function(s) str_count(s, "\\S+"),
    function(s) sapply(gregexpr("\\W+", s), function(x) sum(x > 0)) + 1,
    function(s) length(unlist(strsplit(s," "))),
    function(s) sapply(strsplit(s, " "), length),
    function(s) str_count(s, '\\w+')
  )

unlist(lapply(funs, score))

Output:

6 10 10  8  9  9  7  6  6 11
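
The score alone does not show which inputs a candidate gets wrong. A small sketch (base R only; the helper failures and the shortened input set are assumptions for illustration, not part of the benchmark above) lists the mismatching inputs for one candidate:

```r
# Hypothetical helper: return the inputs for which f() disagrees with
# the expected word count.
failures <- function(f, questions, answers) {
  got <- vapply(questions, function(q) as.numeric(f(q)), numeric(1),
                USE.NAMES = FALSE)
  questions[got != answers]
}

qs  <- c("", "x", "x  y", "x y!")
ans <- c(0, 1, 2, 2)

# One of the candidates above: split on single spaces and count the pieces.
split_on_space <- function(s) length(unlist(strsplit(s, " ")))

wrong <- failures(split_on_space, qs, ans)
wrong
# [1] "x  y"
```

Here the double space produces an empty piece between the two words, which is counted as a third word.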

#7


6  

You can use the wc function from the qdap package:

> str1 <- "How many words are in this sentence"
> wc(str1)
[1] 7

#8


6  

You can remove repeated spaces and count the number of " " in the string, then add 1 to get the word count. This uses str_count from stringr and rm_white from qdapRegex:

str_count(rm_white(s), " ") +1

#9


5  

Try this:

length(unlist(strsplit(str1," ")))

#10


4  

Solution 7 does not give the correct result when there is just one word. You should not simply count the elements in gregexpr's result (which is -1 if there were no matches) but count the elements > 0.

Ergo:

sapply(gregexpr("\\W+", str1), function(x) sum(x>0) ) + 1 
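
To see why counting elements is misleading, it can help to look at what gregexpr() returns when nothing matches. A minimal base R sketch:

```r
m <- gregexpr("\\W+", "x")[[1]]  # "x" contains no non-word characters

no_match_value <- as.integer(m)  # -1: the "no match" sentinel
n_elements     <- length(m)      # 1, so length(m) + 1 would report 2 words
n_separators   <- sum(m > 0)     # 0 real separators, so 0 + 1 gives 1 word
```

The same formula also returns 1 for the empty string, which is the zero-versus-one-word limitation mentioned in answer #1.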

#11


3  

You can use the strsplit and sapply functions:

sapply(strsplit(str1, " "), length)

#12


0  

Use nchar

If the vector of strings is called x:

(nchar(x) - nchar(gsub(' ','',x))) + 1

Find out number of spaces then add one
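
A quick check of this formula (a sketch; the helper name count_by_spaces is hypothetical) suggests it assumes exactly one space between words and at least one word per string:

```r
# Hypothetical wrapper around the formula above: spaces removed = spaces counted.
count_by_spaces <- function(x) (nchar(x) - nchar(gsub(" ", "", x))) + 1

res <- count_by_spaces(c("How many words", "a  b", ""))
res
# [1] 3 3 1
```

The double space in "a  b" is counted twice, and the empty string is reported as one word.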

#13


0  

require(stringr)
str_count(x, "\\w+") # will be fine with double/triple spaces between words

All other answers have issues with more than one space between the words.
