What is the difference between [\s\S]*? and .*? in Java regular expressions?

Time: 2022-10-14 20:13:32

I have developed a regular expression to identify a block of XML inside a text file. The expression looks like this (I have removed all Java escape backslashes to make it easier to read):

<\?xml\s+version="[\d\.]+"\s*\?>\s*<\s*rdf:RDF[^>]*>[\s\S]*?<\s*\/\s*rdf:RDF\s*>

Then I optimised it by replacing [\s\S]*? with .*? and it suddenly stopped recognising the XML.

As far as I know, \s means any whitespace character and \S means any non-whitespace character (that is, [^\s]), so [\s\S] should logically be equivalent to a dot (.). I didn't use greedy quantifiers, so what could the difference be?

2 Answers

#1


9  

The regular expressions . and [\s\S] are not equivalent, since . doesn't match line terminators (such as a newline) by default.
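To make the difference concrete, here is a minimal sketch that runs both variants against a two-line string. The class name, the sample input, and the `found` helper are made up for illustration; only `java.util.regex.Pattern` comes from the JDK.

```java
import java.util.regex.Pattern;

class DotVersusAnyChar {
    // Hypothetical sample input: an element whose content spans two lines.
    static final String INPUT = "<a>line one\nline two</a>";

    // Returns true if the regex matches somewhere in INPUT (default flags).
    static boolean found(String regex) {
        return Pattern.compile(regex).matcher(INPUT).find();
    }

    public static void main(String[] args) {
        // [\s\S] accepts every character, so the match can cross the newline.
        System.out.println(found("<a>[\\s\\S]*?</a>")); // true
        // By default the dot refuses the newline, so no match is found.
        System.out.println(found("<a>.*?</a>"));        // false
    }
}
```

Laziness (the trailing ?) plays no part here: the second pattern fails because the dot cannot consume the newline at all.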

According to the Oracle documentation, . matches

Any character (may or may not match line terminators)

while a line terminator is any of the following:

  • A newline (line feed) character ('\n'),
  • A carriage-return character followed immediately by a newline character ("\r\n"),
  • A standalone carriage-return character ('\r'),
  • A next-line character ('\u0085'),
  • A line-separator character ('\u2028'), or
  • A paragraph-separator character ('\u2029').
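The list above can be checked directly. The following is a rough sketch, with a class name and helper methods invented for this example, that tests each single-character terminator against both . and [\s\S] under default flags. The two-character sequence "\r\n" is left out because a single . or [\s\S] can only ever match one character.

```java
import java.util.regex.Pattern;

class LineTerminatorCheck {
    // The single-character line terminators from the list above.
    static final String[] TERMINATORS = {"\n", "\r", "\u0085", "\u2028", "\u2029"};

    // True if a lone dot matches the whole string (default flags).
    static boolean dotMatches(String s) {
        return Pattern.matches(".", s);
    }

    // True if the [\s\S] character class matches the whole string.
    static boolean classMatches(String s) {
        return Pattern.matches("[\\s\\S]", s);
    }

    public static void main(String[] args) {
        for (String t : TERMINATORS) {
            System.out.printf("U+%04X  dot=%b  class=%b%n",
                    (int) t.charAt(0), dotMatches(t), classMatches(t));
        }
    }
}
```

With default flags, every row should report false for the dot and true for the character class.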

The two expressions are not equivalent unless the necessary flags are set. Again quoting the Oracle documentation:

If UNIX_LINES mode is activated, then the only line terminators recognized are newline characters.
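As an illustration of the UNIX_LINES rule, the sketch below (class and helper names are assumptions made for this example) compiles a lone dot with and without `Pattern.UNIX_LINES` and tests it against a carriage return and a newline.

```java
import java.util.regex.Pattern;

class UnixLinesDemo {
    // True if a lone dot, compiled with the given flags, matches the whole string.
    static boolean dotMatches(String s, int flags) {
        return Pattern.compile(".", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        // Default mode: a carriage return is a line terminator, so the dot rejects it.
        System.out.println(dotMatches("\r", 0));                  // false
        // UNIX_LINES: only the newline counts, so the dot now accepts a carriage return...
        System.out.println(dotMatches("\r", Pattern.UNIX_LINES)); // true
        // ...but it still rejects the newline itself.
        System.out.println(dotMatches("\n", Pattern.UNIX_LINES)); // false
    }
}
```

So UNIX_LINES narrows the set of excluded characters, but on its own it does not make . equivalent to [\s\S].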

The regular expression . matches any character except a line terminator unless the DOTALL flag is specified.
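For the pattern in the question, this suggests that .*? can be kept as long as DOTALL is switched on. Here is a minimal sketch using a shortened, hypothetical stand-in for the RDF block. `Pattern.DOTALL` and the embedded `(?s)` flag are standard JDK features; the class name, input string, and `found` helper are invented for the example.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Hypothetical multi-line input standing in for the RDF block.
    static final String INPUT = "<rdf:RDF>\n  <item/>\n</rdf:RDF>";

    // True if the regex, compiled with the given flags, matches somewhere in INPUT.
    static boolean found(String regex, int flags) {
        return Pattern.compile(regex, flags).matcher(INPUT).find();
    }

    public static void main(String[] args) {
        String regex = "<rdf:RDF>.*?</rdf:RDF>";
        System.out.println(found(regex, 0));              // false: dot stops at the newline
        System.out.println(found(regex, Pattern.DOTALL)); // true: dot now matches terminators
        System.out.println(found("(?s)" + regex, 0));     // true: same flag, embedded in the pattern
    }
}
```

The embedded (?s) form is handy when the pattern is passed as a plain string, for example to String.matches or String.replaceAll, where no flags argument is available.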

#2


2  

Here is a sheet explaining all the regex commands.

Basically, [\s\S] will pick up all characters, including newlines, whereas . does not pick up line terminators by default (certain flags need to be set for it to do so).
