Regex.Replace output is puzzling

Time: 2021-07-16 13:05:30

I've been trying to understand the output of a Regex.Replace call and I am puzzled as to its output.

I have a Dictionary<string, string>. I want to search for the keys in the input string and replace them with the corresponding value if the string exists at the beginning of the string, at the end of the string, or in the middle of the string if it is surrounded by one or more spaces on each side.

My input string is as follows:

North S West N East W South E W S N West South

The regular expression in this code comes out as:

(^| +?)SOUTH($| +?)|(^| +?)NORTH($| +?)|(^| +?)EAST($| +?)|(^| +?)WEST($| +?)|(^| +?)E($| +?)|(^| +?)W($| +?)|(^| +?)N($| +?)|(^| +?)S($| +?)

My expected output is:

N SOUTH W NORTH E WEST S EAST WEST SOUTH NORTH W S

My actual output is:

N S W N E W S E WEST S NORTH WEST S

The code is below. The RegEx pattern is constructed from the keys of the dictionary. I feel I am just misunderstanding something simple about regular expressions. Why does it pick up some of the words but not all of them? For example, why does it not match the word West near the end of the string, but does match the word West near the beginning of the string? I have added code to write each of the matches and the pattern string but I am stumped.

void Main()
{
        var directions = new Dictionary<string, string>
        {
            {"SOUTH", "S"},
            {"NORTH", "N"},
            {"EAST", "E"},
            {"WEST", "W"},
            {"E", "EAST"},
            {"W", "WEST"},
            {"N", "NORTH"},
            {"S", "SOUTH"},
        };

        string input = @"North S West N East W South E W S N West South";

        Console.WriteLine(doReplace(input, directions));
}

private string doReplace(string input, Dictionary<string, string> lookup)
{
    string output = null;

    //Construct the regular expression pattern
    string searchPattern = string.Join(@"|", lookup.Select(s => @"(^| +?)" + s.Key + @"($| +?)").ToArray());
    Console.WriteLine(searchPattern);

    //Perform the replace
    output = Regex.Replace(input.ToUpper(), searchPattern, new MatchEvaluator(m =>
    {
        //Write out each match found
        Console.WriteLine("[{0}]", m.Value);

        string tmp = m.Value.Trim();
        string result = tmp;
        lookup.TryGetValue(tmp, out result);

        //This return statement is for the lambda not the method.
        return m.Value.Replace(tmp, result);
    }), RegexOptions.ExplicitCapture|RegexOptions.Singleline);

    return output;
}

1 Answer

#1



Your problem is that the elements of your regex (unless the matches are at the start/end of the string) require at least one space before and after the match:

(^| +?)SOUTH($| +?)

matches a space, then SOUTH, then another space. Now if the next potential match starts right after that, there would have to be a second space character to start the next match. But you only have single spaces between words, so at most every other word can match.
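
To see the consumed spaces directly, here is a minimal sketch that runs a reduced version of the pattern (four keys instead of eight, an assumption made only to keep the output short) over the first four words of the input. The class name ConsumedSpacesDemo is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

class ConsumedSpacesDemo
{
    static void Main()
    {
        // First four words of the question's input, already upper-cased.
        string input = "NORTH S WEST N";

        // Same shape as the pattern in the question, reduced to four keys.
        string pattern = @"(^| +?)(NORTH|WEST|N|S)($| +?)";

        foreach (Match m in Regex.Matches(input, pattern))
        {
            // The brackets make the spaces that each match consumed visible.
            Console.WriteLine($"{m.Index}: [{m.Value}]");
        }
        // Prints:
        // 0: [NORTH ]
        // 7: [ WEST ]
    }
}
```

The match for NORTH takes the space at index 5, so the S at index 6 has no leading space (and is not at the start of the string) and is skipped. The same thing happens to the final N after WEST takes the space at index 12.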

You can visualize this here, for example.

If your goal is to only match entire words instead of substrings, use \b word boundary anchors. \bSOUTH\b will match SOUTH but not SOUTHERN.
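
As a rough sketch of that suggestion, the pattern can be built as a single alternation wrapped in \b anchors, which are zero-width and therefore leave the spaces for the next match. The class name WordBoundaryDemo and the method name ReplaceWholeWords are hypothetical; the dictionary and input are the ones from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class WordBoundaryDemo
{
    // Replaces every whole-word occurrence of a key with its value in one pass.
    static string ReplaceWholeWords(string input, IDictionary<string, string> lookup)
    {
        // \b(?:SOUTH|NORTH|...)\b : the anchors match positions, not characters,
        // so no space is consumed. Regex.Escape guards against special characters in keys.
        string pattern = @"\b(?:" + string.Join("|", lookup.Keys.Select(k => Regex.Escape(k))) + @")\b";

        // The keys are upper case, so upper-case the input as the question's code does.
        return Regex.Replace(input.ToUpper(), pattern, m => lookup[m.Value]);
    }

    static void Main()
    {
        var directions = new Dictionary<string, string>
        {
            {"SOUTH", "S"}, {"NORTH", "N"}, {"EAST", "E"}, {"WEST", "W"},
            {"E", "EAST"}, {"W", "WEST"}, {"N", "NORTH"}, {"S", "SOUTH"},
        };

        string input = "North S West N East W South E W S N West South";

        Console.WriteLine(ReplaceWholeWords(input, directions));
        // Prints: N SOUTH W NORTH E WEST S EAST WEST SOUTH NORTH W S
    }
}
```

Note that \b is broader than the rule in the question: it also matches next to punctuation, so the N in "N-S" would be replaced. If only spaces and the string ends should count as delimiters, a stricter pattern would be needed.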
