I have developed a regular expression to identify a block of XML inside a text file. The expression looks like this (I have removed all Java escape backslashes to make it easier to read):
<\?xml\s+version="[\d\.]+"\s*\?>\s*<\s*rdf:RDF[^>]*>[\s\S]*?<\s*\/\s*rdf:RDF\s*>
Then I optimised it and replaced [\s\S]*? with .*? and it suddenly stopped recognising the XML.
As far as I know, \s means all whitespace characters and \S means all non-whitespace characters (that is, [^\s]), so [\s\S] should logically be equivalent to the dot (.). I didn't use greedy quantifiers, so what could be the difference?
2 solutions
#1
9
The regular expressions . and [\s\S] are not equivalent, since . doesn't match line terminators (such as a newline) by default.
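To make the difference concrete, here is a minimal sketch that runs both forms against a string containing a newline. The class name DotVersusCharClass and the pattern a.b are made up for illustration and are not taken from the question.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: compares '.' with [\s\S] on input containing '\n'.
class DotVersusCharClass {
    // True if the whole input is "a", one character matched by '.', then "b".
    static boolean dotMatches(String input) {
        return Pattern.matches("a.b", input);
    }

    // Same shape, but the middle character is matched by [\s\S].
    static boolean classMatches(String input) {
        return Pattern.matches("a[\\s\\S]b", input);
    }

    public static void main(String[] args) {
        System.out.println(dotMatches("a-b"));    // true: '-' is an ordinary character
        System.out.println(dotMatches("a\nb"));   // false: '.' skips the newline by default
        System.out.println(classMatches("a\nb")); // true: '\n' is whitespace, so \s matches it
    }
}
```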
According to the Oracle website, . matches:
Any character (may or may not match line terminators)
while a line terminator is any of the following:
- A newline (line feed) character ('\n'),
- A carriage-return character followed immediately by a newline character ("\r\n"),
- A standalone carriage-return character ('\r'),
- A next-line character ('\u0085'),
- A line-separator character ('\u2028'), or
- A paragraph-separator character ('\u2029').
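The list above can be checked directly. The following sketch, with a hypothetical class name LineTerminatorProbe, tests the default . against each single-character terminator from the list; under default flags none of them should match.

```java
import java.util.regex.Pattern;

// Hypothetical probe: checks the default '.' against each line terminator character.
class LineTerminatorProbe {
    private static final Pattern DOT = Pattern.compile(".");

    // True if '.' (no flags) matches the given single character.
    static boolean dotMatches(char c) {
        return DOT.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        // LF, CR, next-line, line-separator, paragraph-separator.
        char[] terminators = {'\n', '\r', '\u0085', '\u2028', '\u2029'};
        for (char c : terminators) {
            // Prints the code point and whether '.' matched it.
            System.out.printf("U+%04X -> %b%n", (int) c, dotMatches(c));
        }
    }
}
```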
So the two expressions are not equivalent unless the necessary flags are set. Again quoting the Oracle website:
If UNIX_LINES mode is activated, then the only line terminators recognized are newline characters.
The regular expression . matches any character except a line terminator unless the DOTALL flag is specified.
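As a rough sketch of how those two flags change the behaviour of the dot, the snippet below compiles the same illustrative pattern with different flags. The class and helper names (DotFlagsDemo, matchesWithFlags) are hypothetical; Pattern.DOTALL, Pattern.UNIX_LINES and the inline (?s) form are part of java.util.regex.

```java
import java.util.regex.Pattern;

// Hypothetical demo of how DOTALL and UNIX_LINES change what '.' matches.
class DotFlagsDemo {
    // Compiles the regex with the given flags and matches it against the whole input.
    static boolean matchesWithFlags(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Default: '.' stops at any line terminator.
        System.out.println(matchesWithFlags("a.b", "a\nb", 0));                  // false
        // DOTALL: '.' matches line terminators too.
        System.out.println(matchesWithFlags("a.b", "a\nb", Pattern.DOTALL));     // true
        // (?s) is the inline spelling of DOTALL.
        System.out.println(matchesWithFlags("(?s)a.b", "a\nb", 0));              // true
        // UNIX_LINES: only '\n' counts as a terminator, so '.' now matches '\r'.
        System.out.println(matchesWithFlags("a.b", "a\rb", 0));                  // false
        System.out.println(matchesWithFlags("a.b", "a\rb", Pattern.UNIX_LINES)); // true
    }
}
```

If the flag cannot be passed to Pattern.compile (for example when the pattern comes from a configuration file), the inline (?s) prefix is probably the easier option.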
#2
2
Here is a sheet explaining all the regex commands.
Basically, [\s\S] will pick up all characters, including newlines, whereas . does not pick up line terminators by default (certain flags need to be set to pick them up).
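Applied to the expression from the question, this might look like the sketch below. The regex pieces are the question's own with Java escaping restored; the class name RdfBlockFinder, the helper firstMatch and the sample text are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical finder built from the question's regex, in three variants.
class RdfBlockFinder {
    // XML declaration followed by the opening rdf:RDF tag.
    private static final String HEAD =
        "<\\?xml\\s+version=\"[\\d.]+\"\\s*\\?>\\s*<\\s*rdf:RDF[^>]*>";
    // Closing rdf:RDF tag.
    private static final String TAIL = "<\\s*/\\s*rdf:RDF\\s*>";

    static final Pattern WITH_CLASS = Pattern.compile(HEAD + "[\\s\\S]*?" + TAIL);
    static final Pattern WITH_DOT = Pattern.compile(HEAD + ".*?" + TAIL);
    static final Pattern WITH_DOTALL = Pattern.compile(HEAD + ".*?" + TAIL, Pattern.DOTALL);

    // Made-up sample input: an RDF block spanning several lines inside other text.
    static final String SAMPLE =
        "some text\n"
        + "<?xml version=\"1.0\"?>\n"
        + "<rdf:RDF>\n"
        + "  <rdf:Description/>\n"
        + "</rdf:RDF>\n"
        + "trailing text";

    // Returns the first match of the pattern in the text, or null if there is none.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(WITH_DOT, SAMPLE) != null);    // false: body contains newlines
        System.out.println(firstMatch(WITH_CLASS, SAMPLE) != null);  // true
        System.out.println(firstMatch(WITH_DOTALL, SAMPLE) != null); // true
    }
}
```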