Counting the sentences in a string with JavaScript

There are already a couple of similar questions:

Splitting textarea sentences into array and finding out which sentence changed on keyup()
JS RegEx to split text into sentences
Javascript RegExp for splitting text into sentences and keeping the delimiter
Split string into sentences in javascript

My situation is a bit different.

I need to count the number of sentences in a string.

The closest answer to what I need would be:

str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|")

The only problem here is that this RegEx assumes a sentence starts with a capital letter, which may not always be the case.
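To illustrate the limitation, here is a small sketch. The function name countWithCapitalRule and the sample string are made up for this example; only the regex comes from the answer quoted above.

```javascript
// Splitter from the quoted answer: it only splits where the next
// sentence starts with an uppercase letter A-Z.
function countWithCapitalRule(str) {
  return str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|").length;
}

// Four sentences by the definition in this question, but only the
// boundary before "Right" is followed by a capital letter.
const mixed = "It works. but not here. $5 is cheap! Right?";
console.log(countWithCapitalRule(mixed)); // 2
```

The sentences starting with "but" and "$5" are not split off, so the count comes out too low.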

To be more specific, I would define a sentence as:

Starting with a letter (capital or not), a number, or even a symbol (such as $ or €).
Ending with a punctuation mark, such as ".", "?" or "!".

However, if a sentence contains a number, which itself contains a " . " or a " , ", then the sentence should be considered as one sentence and not two.

Last but not least, we can assume that every sentence except the first is preceded by a space.

Given a random string, how can I count the number of sentences it contains with Javascript (or CoffeeScript for that matter)?

1 Answer

#1

One regex to solve your problem is:

\w[.?!](\s|$)

The parts are as follows:

\w - Word character
[.?!] - Punctuation as specified.
(\s|$) - Whitespace character OR the end of the string.
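As a minimal sketch of how this regex could be turned into a counter (the function name countSentences is my own choice for illustration, not something from the question):

```javascript
// Counts matches of: a word character, then . ? or !, then either
// whitespace or the end of the string.
function countSentences(text) {
  const matches = text.match(/\w[.?!](\s|$)/g);
  // String.prototype.match returns null when nothing matches.
  return matches === null ? 0 : matches.length;
}

// The "." inside 3.50 is followed by a digit, so it is not counted.
console.log(countSentences("it costs $3.50 today. is that ok? yes!")); // 3
```

Because the regex requires a word character right before the punctuation, a sentence that ends in something like ")." or "%." would not be counted, and an abbreviation followed by a space (such as "e.g. ") would be counted as a sentence end. Whether that matters depends on your input.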

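You might be tempted to use a character class instead of a group for the final element:

[\s|$]

but that isn't working on https://regex101.com/. Inside a character class, | and $ are literal characters, so the class matches whitespace, a pipe or a dollar sign, and never the end of the string.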
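A quick way to see the difference between the group and the character class is to run both on a short string. This is only a sketch; the variable names and the sample text are made up for the comparison.

```javascript
// Group with alternation: whitespace OR end of string.
const groupRe = /\w[.?!](\s|$)/g;
// Character class: whitespace, "|" or "$", all as literal characters.
const classRe = /\w[.?!][\s|$]/g;

const sample = "First one. second one.";
const countMatches = (re, s) => (s.match(re) || []).length;

console.log(countMatches(groupRe, sample)); // 2
console.log(countMatches(classRe, sample)); // 1, the final sentence is missed
```

The class version needs one more character after the punctuation, so a sentence that ends exactly at the end of the string is never matched.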

Tested on the following:

Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the Renaissance. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.

It finds six sentences (I bolded the end of each sentence, not the actual match). Note that the grouping differs from the regex in your question, which might pose a problem if you're depending on the groups for any reason.
