What is the difference between [\s\S]*? and .*? in Java regular expressions?

I have developed a regular expression to identify a block of XML inside a text file. The expression looks like this (I have removed all Java escape backslashes to make it easier to read):

<\?xml\s+version="[\d\.]+"\s*\?>\s*<\s*rdf:RDF[^>]*>[\s\S]*?<\s*\/\s*rdf:RDF\s*>

Then I optimised it and replaced [\s\S]*? with .*?, and it suddenly stopped recognising the XML.

As far as I know, \s means all whitespace characters and \S means all non-whitespace characters (that is, [^\s]), so [\s\S] should logically be equivalent to the dot (.). I didn't use greedy quantifiers, so what could the difference be?

2 Answers

#1

The regular expressions . and [\s\S] are not equivalent, since . does not match line terminators (such as a newline) by default.
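
The difference is easy to reproduce. The following is a minimal sketch rather than code from the question: the class name `DotVersusCharClass`, the helper `matchesAll`, and the sample strings are made up for illustration. It compiles each pattern with no flags and asks whether the pattern can match the whole input.

```java
import java.util.regex.Pattern;

public class DotVersusCharClass {
    // Hypothetical helper: true if the regex matches the entire input (no flags set).
    static boolean matchesAll(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String multiLine = "first\nsecond";

        // The dot stops at the '\n', so the whole input cannot be matched.
        System.out.println(matchesAll(".*?", multiLine));        // false

        // [\s\S] accepts every character, including the '\n'.
        System.out.println(matchesAll("[\\s\\S]*?", multiLine)); // true

        // Without a line terminator the two patterns agree.
        System.out.println(matchesAll(".*?", "one line"));       // true
    }
}
```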

According to the Oracle website, . matches

Any character (may or may not match line terminators)

while a line terminator is any of the following:

A newline (line feed) character ('\n'),

A carriage-return character followed immediately by a newline character ("\r\n"),

A standalone carriage-return character ('\r'),

A next-line character ('\u0085'),

A line-separator character ('\u2028'), or

A paragraph-separator character ('\u2029').

The two expressions are therefore not equivalent unless the relevant flags are set. Again quoting the Oracle website:

If UNIX_LINES mode is activated, then the only line terminators recognized are newline characters.

The regular expression . matches any character except a line terminator unless the DOTALL flag is specified.
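
The effect of the two flags quoted above can be checked with a short sketch. The class name `DotFlagsDemo`, the helper `matchesAll`, and the sample strings are illustrative assumptions; `Pattern.DOTALL`, `Pattern.UNIX_LINES`, and the embedded flag `(?s)` are part of `java.util.regex`.

```java
import java.util.regex.Pattern;

public class DotFlagsDemo {
    // Hypothetical helper: true if the compiled pattern matches the entire input.
    static boolean matchesAll(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        String withNewline = "a\nb";
        String withCarriageReturn = "a\rb";

        // Default mode: both '\n' and '\r' are line terminators, so the dot rejects them.
        Pattern plain = Pattern.compile(".*");
        System.out.println(matchesAll(plain, withNewline));            // false
        System.out.println(matchesAll(plain, withCarriageReturn));     // false

        // DOTALL: the dot matches every character, line terminators included.
        Pattern dotAll = Pattern.compile(".*", Pattern.DOTALL);
        System.out.println(matchesAll(dotAll, withNewline));           // true

        // (?s) is the embedded form of the same flag.
        Pattern embedded = Pattern.compile("(?s).*");
        System.out.println(matchesAll(embedded, withNewline));         // true

        // UNIX_LINES: only '\n' counts as a line terminator, so the dot now accepts '\r'.
        Pattern unixLines = Pattern.compile(".*", Pattern.UNIX_LINES);
        System.out.println(matchesAll(unixLines, withNewline));        // false
        System.out.println(matchesAll(unixLines, withCarriageReturn)); // true
    }
}
```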

#2

Here is a sheet explaining all the regex commands.

Basically, [\s\S] will pick up all characters, including newlines, whereas . does not pick up line terminators by default (certain flags need to be set for it to pick them up).
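
Applied to the pattern from the question, this suggests two ways to make the match work: keep [\s\S]*?, or use .*? together with the DOTALL flag. The sketch below assumes a made-up sample input and hypothetical names (`RdfBlockFinder`, `HEAD`, `TAIL`, `SAMPLE`, `findsBlock`); only the regular expression itself comes from the question, with the Java string escapes restored.

```java
import java.util.regex.Pattern;

public class RdfBlockFinder {
    // Fixed parts of the question's pattern, with Java string escaping restored.
    static final String HEAD =
        "<\\?xml\\s+version=\"[\\d\\.]+\"\\s*\\?>\\s*<\\s*rdf:RDF[^>]*>";
    static final String TAIL = "<\\s*\\/\\s*rdf:RDF\\s*>";

    // Made-up sample input (not from the question): an RDF block spanning several lines.
    static final String SAMPLE = "log line\n"
        + "<?xml version=\"1.0\"?>\n"
        + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n"
        + "  <rdf:Description rdf:about=\"urn:example:item\"/>\n"
        + "</rdf:RDF>\n"
        + "another log line\n";

    // True if HEAD + body + TAIL is found anywhere in the text under the given flags.
    static boolean findsBlock(String body, int flags, String text) {
        return Pattern.compile(HEAD + body + TAIL, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        // Original version: the character class crosses line breaks.
        System.out.println(findsBlock("[\\s\\S]*?", 0, SAMPLE));         // true

        // "Optimised" version: the dot cannot cross the line breaks inside the block.
        System.out.println(findsBlock(".*?", 0, SAMPLE));                // false

        // Same pattern with DOTALL: the dot now crosses line breaks as well.
        System.out.println(findsBlock(".*?", Pattern.DOTALL, SAMPLE));   // true
    }
}
```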
