While writing this answer, I had to match exclusively on linebreaks instead of using the s-flag (dotall: dot matches linebreaks).
The sites usually used to test regular expressions behave differently when trying to match on \n or \r\n.
I noticed:
- Regex101 matches linebreaks only on \n (example: delete \r and it matches).
- RegExr matches linebreaks neither on \n nor on \r\n, and I can't find anything that makes it match a linebreak, except for the m-flag and \s (example).
- Debuggex behaves even more differently: in this example it matches only on \r\n, while here it matches only on \n, with the same flags and engine specified.
I'm fully aware of the m-flag (multiline: makes ^ match the start and $ the end of a line), but sometimes this is not an option. The same goes for \s, as it matches tabs and spaces, too.
My idea of using the Unicode newline character (\u0085) wasn't successful, so:
- Is there a failsafe way to integrate the match on a linebreak (preferably regardless of the language used) into a regular expression?
- Why do the above-mentioned sites behave differently (especially Debuggex, matching once only on \n and once only on \r\n)?
3 Answers
#1
75
Gonna answer in opposite direction ;)
2) For a full explanation about \r and \n I have to refer to this question, which is far more complete than I will post here: Difference between \n and \r?
Long story short, Linux uses \n for a new-line, Windows \r\n and old Macs \r. So there are multiple ways to write a newline. Your second tool (RegExr) does for example match on the single \r.
1) [\r\n]+ as Ilya suggested will work, but will also match multiple consecutive new-lines. (\r\n|\r|\n) is more correct.
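To make the difference between the two patterns concrete, here is a minimal sketch using C++ std::regex with its default ECMAScript grammar. That choice is an assumption (the question is language-agnostic), and countMatches, kRunOfBreaks and kSingleBreak are hypothetical names introduced only for illustration. It counts how many separate matches each pattern yields.

```cpp
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: number of non-overlapping matches of `pattern` in `text`.
std::size_t countMatches(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    const auto first = std::sregex_iterator(text.begin(), text.end(), re);
    return static_cast<std::size_t>(std::distance(first, std::sregex_iterator()));
}

// Character class with +: a run of consecutive linebreaks is ONE match.
const std::string kRunOfBreaks = "[\\r\\n]+";

// Alternation: each linebreak is its own match. \r\n is listed first so a
// Windows linebreak is consumed as a unit instead of being split into \r and \n.
const std::string kSingleBreak = "\\r\\n|\\r|\\n";
```

For the text "a\r\n\r\nb\nc", which contains an empty line, the character class treats the two consecutive Windows linebreaks as a single match, while the alternation reports each linebreak separately. That matters whenever empty lines have to be preserved.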
#2
3
You have different line endings in the example texts in Debuggex. What is especially interesting is that Debuggex seems to have identified which line ending style you used first, and it converts all additional line endings entered to that style.
I used Notepad++ to paste sample text in Unix and Windows format into Debuggex, and whichever I pasted first is what that session of Debuggex stuck with.
So, you should wash your text through your text editor before pasting it into Debuggex. Ensure that you're pasting the style you want. Debuggex defaults to Unix style (\n).
Also, NEL (\u0085) is something different entirely: https://en.wikipedia.org/wiki/Newline#Unicode
(\r?\n) will cover Unix and Windows. You'll need something more complex, like (\r\n|\r|\n), if you want to match old Mac too.
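As a rough illustration of what "covering" a line ending style means in practice, the sketch below splits text into lines with a caller-supplied linebreak pattern. It assumes C++ std::regex (ECMAScript grammar), and splitLines is a hypothetical helper name, not something from the answer.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split `text` at every match of the linebreak `pattern`.
std::vector<std::string> splitLines(const std::string& text,
                                    const std::string& pattern) {
    const std::regex re(pattern);
    // Submatch index -1 selects the pieces BETWEEN matches, i.e. the lines.
    std::sregex_token_iterator first(text.begin(), text.end(), re, -1);
    std::sregex_token_iterator last;
    return std::vector<std::string>(first, last);
}
```

Called with "\\r?\\n", this splits "one\ntwo\r\nthree" into three lines, but leaves "one\rtwo" as a single line because a lone \r is never matched. Called with "\\r\\n|\\r|\\n", the old Mac style text is split as well.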
#3
1
This only applies to question 1.
I have an app that runs on Windows and uses a multi-line MFC editor box.
The editor box expects CRLF linebreaks, but I need to parse the text entered
with some really big/nasty regexes.
I didn't want to be stressing about this while writing the regexes, so
I ended up normalizing back and forth between the parser and editor so that
the regexes just use \n. I also trap paste operations and convert them for the boxes.
This does not take much time.
This is what I use.
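#include <string>
#include <boost/regex.hpp>

using std::string;

// MODx is defined elsewhere in the app and is not shown here. The patterns
// contain spaces, so it is assumed to include the extended
// (whitespace-ignoring) flag, e.g. boost::regex_constants::mod_x.
boost::regex CRLFCRtoLF (
    " \\r\\n | \\r(?!\\n) "
    , MODx);

boost::regex CRLFCRtoCRLF (
    " \\r\\n?+ | \\n "
    , MODx);

// Convert (all style) linebreaks to linefeeds
// ---------------------------------------
void ReplaceCRLFCRtoLF( const string& strSrc, string& strDest )
{
    strDest = boost::regex_replace ( strSrc, CRLFCRtoLF, "\\n" );
}

// Convert (all style) linebreaks to CRLF (Windows)
// ---------------------------------------
void ReplaceCRLFCRtoCRLF( const string& strSrc, string& strDest )
{
    strDest = boost::regex_replace ( strSrc, CRLFCRtoCRLF, "\\r\\n" );
}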
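For readers without Boost, the same normalize-then-parse idea can be sketched with the standard library alone. This is not the code from the answer above: std::regex has no whitespace-ignoring mode or possessive quantifiers, so the sketch assumes the plain alternation \r\n|\r|\n instead, and toLF and toCRLF are hypothetical names.

```cpp
#include <regex>
#include <string>

// Any linebreak style (\r\n, lone \r, lone \n) becomes \n.
std::string toLF(const std::string& text) {
    static const std::regex anyBreak("\\r\\n|\\r|\\n");
    return std::regex_replace(text, anyBreak, "\n");
}

// Any linebreak style becomes \r\n, e.g. before handing text back to a
// control that expects Windows linebreaks.
std::string toCRLF(const std::string& text) {
    static const std::regex anyBreak("\\r\\n|\\r|\\n");
    return std::regex_replace(text, anyBreak, "\r\n");
}
```

Running toLF on everything that enters the parser means the parsing regexes only ever need to handle \n, and toCRLF restores the Windows style on the way back out.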