Removing footer text together with its line breaks

Time: 2021-08-04 12:13:54

I am hoping this is quite simple. I am trying to remove a footer from a block of text using a regular expression. The match has to include the two line breaks that precede the footer, which is where my problem lies.

    Message body blah blah balh
    {Line Break}
    {Line Break}
    ----------------------------------
    Custom footer text

I have been experimenting with variations of /\?(\r\n)(\r\n)([-{34}])/.* but nothing is working.

1 Solution

#1


I made a test and this works:

[\r\n]*-{34}[\w\s\n\r]*

Here's the code:

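using System;
using System.Text.RegularExpressions;

// Sample message: body, two blank lines, a 34-dash separator, then the footer.
var input = @"Message body blah blah balh


----------------------------------
Custom footer text";

// Leading line breaks, the separator, then word characters and whitespace after it.
var pattern = @"[\r\n]*-{34}[\w\s\n\r]*";
// RegexOptions.Multiline only changes ^ and $, which this pattern does not use.
var clean = Regex.Replace(input, pattern, "", RegexOptions.Multiline);

Console.WriteLine(clean);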

The output is the expected one:

Message body blah blah balh

There were several problems with the initial approach. Some of them were pointed out by abc667 in the comment above.

Here are two others:

  • When you write (\r\n), the pattern expects the exact character sequence CR, LF. On some operating systems, however, a line break is represented by a single \n (LF). To make the pattern work in both cases, you can use a character class with a quantifier, like so: [\r\n]*. This means: "any run of \n and/or \r characters, in any order, including none at all".

  • By default, the dot (.) matches any single character except \n (see docs). Many regex flavours let it match newlines as well when a special option is set (see "(dot)" here); in .NET that option is RegexOptions.Singleline, or the inline (?s) flag, and the original attempt did not set it. This is why I replaced the .* that was supposed to match everything after the line of dashes with [\w\s\r\n]*, which matches any run of word characters and whitespace characters, including CR and LF.
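One caveat worth noting: [\w\s\r\n]* stops at the first character that is neither a word character nor whitespace, so a footer containing punctuation would only be partly removed. The following is a minimal sketch of a variant that combines both points above, assuming the footer always starts with exactly 34 dashes and runs to the end of the text. The inputs and variable names are made up for illustration; they are not from the original question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical inputs: the same message with LF-only and CRLF line breaks,
// and a footer that contains punctuation.
var lfInput = "Message body blah blah balh\n\n\n" + new string('-', 34) + "\nCustom footer text, v1.0!";
var crlfInput = lfInput.Replace("\n", "\r\n");

// (?:\r?\n)* : any number of LF or CRLF line breaks before the dashes
// -{34}      : the 34-dash separator line
// .*         : everything after it; RegexOptions.Singleline lets the dot match \n too
var pattern = @"(?:\r?\n)*-{34}.*";

var cleanLf = Regex.Replace(lfInput, pattern, "", RegexOptions.Singleline);
var cleanCrlf = Regex.Replace(crlfInput, pattern, "", RegexOptions.Singleline);

Console.WriteLine(cleanLf);
Console.WriteLine(cleanCrlf);
```

Both calls should leave only the message body. If the separator can also appear inside the message body itself, this sketch would cut the text at its first occurrence, so a stricter pattern would be needed in that case.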
