Length of the substring matched by the culture-sensitive String.IndexOf method

Date: 2022-09-13 07:34:46

I tried writing a culture-aware string replacement method:

public static string Replace(string text, string oldValue, string newValue)
{
    int index = text.IndexOf(oldValue, StringComparison.CurrentCulture);
    return index >= 0
        ? text.Substring(0, index) + newValue + text.Substring(index + oldValue.Length)
        : text;
}

However, it chokes on Unicode combining characters:

// \u0301 is Combining Acute Accent
Console.WriteLine(Replace("déf", "é", "o"));       // 1. CORRECT: dof
Console.WriteLine(Replace("déf", "e\u0301", "o")); // 2. INCORRECT: do
Console.WriteLine(Replace("de\u0301f", "é", "o")); // 3. INCORRECT: dóf

To fix my code, I need to know that in the second example, String.IndexOf matched only one character (é) even though it searched for two (e\u0301). Similarly, I need to know that in the third example, String.IndexOf matched two characters (e\u0301) even though it only searched for one (é).
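
The mismatch comes from string.Length counting UTF-16 code units, while the culture-sensitive search treats canonically equivalent sequences as equal. A minimal sketch of the length difference (the class name LengthDemo is only illustrative, not part of the original code):

```csharp
using System;

public static class LengthDemo
{
    public static void Main()
    {
        string precomposed = "\u00E9"; // é as a single code point
        string decomposed = "e\u0301"; // e followed by Combining Acute Accent

        Console.WriteLine(precomposed.Length); // 1
        Console.WriteLine(decomposed.Length);  // 2

        // An ordinal comparison sees two different sequences of code units.
        Console.WriteLine(string.Equals(precomposed, decomposed, StringComparison.Ordinal)); // False
    }
}
```

So whenever the search string and the matched text use different representations, oldValue.Length is the wrong number of characters to skip.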

How can I determine the actual length of the substring matched by String.IndexOf?

NOTE: Performing Unicode normalization on text and oldValue (as suggested by James Keesey) would accommodate combining characters, but ligatures would still be a problem:

Console.WriteLine(Replace("œf", "œ", "i"));  // 4. CORRECT: if
Console.WriteLine(Replace("œf", "oe", "i")); // 5. INCORRECT: i
Console.WriteLine(Replace("oef", "œ", "i")); // 6. INCORRECT: ief

3 Solutions

#1


5  

You will need to directly call FindNLSString or FindNLSStringEx yourself. String.IndexOf uses FindNLSStringEx but all the information you need is available in FindNLSString.

Here is an example of how to rewrite your Replace method so that it works against your test cases. Note that I am using the current user locale; read the API documentation if you want to use the system locale or provide your own. I am also passing in 0 for the flags, which means it will use the default string comparison options for the locale; again, the documentation can help you provide different options.

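using System.Runtime.InteropServices;

public const int LOCALE_USER_DEFAULT = 0x0400; // LCID of the current user's default locale

// Returns the zero-based index of the match, or -1 if nothing matched.
// On success, `found` receives the length of the matched substring in sourceString.
[DllImport("kernel32.dll", SetLastError = true, ExactSpelling = true)]
internal static extern int FindNLSString(int locale, uint flags, [MarshalAs(UnmanagedType.LPWStr)] string sourceString, int sourceCount, [MarshalAs(UnmanagedType.LPWStr)] string findString, int findCount, out int found);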

public static string ReplaceWithCombiningCharSupport(string text, string oldValue, string newValue)
{
    int foundLength;
    int index = FindNLSString(LOCALE_USER_DEFAULT, 0, text, text.Length, oldValue, oldValue.Length, out foundLength);
    return index >= 0 ? text.Substring(0, index) + newValue + text.Substring(index + foundLength) : text;
}
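
On .NET 5 or later, the same information appears to be available without P/Invoke: CompareInfo.IndexOf has a span-based overload that reports the matched length through an out parameter. The following is a sketch under that assumption, not a tested replacement for the code above; the names CultureReplaceSketch and ReplaceFirst are hypothetical, and the exact matches found can differ between the NLS and ICU globalization back ends.

```csharp
using System;
using System.Globalization;

public static class CultureReplaceSketch
{
    // Replaces the first culture-sensitive match of oldValue in text.
    // Assumes .NET 5 or later, where this IndexOf overload exists.
    public static string ReplaceFirst(string text, string oldValue, string newValue)
    {
        CompareInfo compareInfo = CultureInfo.CurrentCulture.CompareInfo;

        // matchLength is the number of chars matched in text,
        // which may differ from oldValue.Length.
        int index = compareInfo.IndexOf(
            text.AsSpan(), oldValue.AsSpan(), CompareOptions.None, out int matchLength);

        return index >= 0
            ? text.Substring(0, index) + newValue + text.Substring(index + matchLength)
            : text;
    }
}
```

A call such as CultureReplaceSketch.ReplaceFirst("de\u0301f", "\u00E9", "o") is the kind of case this overload is meant to handle, since the skipped length comes from the match rather than from oldValue.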

#2


2  

I spoke too soon (and had never seen this method before) but there is an alternative. You can use the StringInfo.ParseCombiningCharacters() method to get the start of each actual character and use that to determine the length of the string to replace.
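
As a rough illustration of that idea, the sketch below uses StringInfo.ParseCombiningCharacters to measure the text element that starts at a given index. The helper name TextElementLengthAt is hypothetical, and the approach only helps when oldValue corresponds to a single text element; it does not address the ligature cases from the question.

```csharp
using System;
using System.Globalization;

public static class TextElementSketch
{
    // Returns the number of UTF-16 code units in the text element
    // (base character plus any combining characters) starting at index.
    public static int TextElementLengthAt(string text, int index)
    {
        // Start index of every text element in the string.
        int[] starts = StringInfo.ParseCombiningCharacters(text);

        int position = Array.IndexOf(starts, index);
        if (position < 0)
            throw new ArgumentException("index is not the start of a text element", nameof(index));

        // The element ends where the next one starts, or at the end of the string.
        int end = position + 1 < starts.Length ? starts[position + 1] : text.Length;
        return end - index;
    }
}
```

For example, TextElementLengthAt("de\u0301f", 1) returns 2, while TextElementLengthAt("d\u00E9f", 1) returns 1, so the result can stand in for oldValue.Length in the question's first three test cases.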


You will need to normalize both strings before you make the IndexOf call. This will make sure that equivalent combining-character sequences have the same length in the source and target strings.

See the String.Normalize() reference page which describes this exact problem.
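
A minimal sketch of the normalization approach, assuming Form C is acceptable for the result (the names NormalizeSketch and ReplaceNormalized are illustrative only). Note that it returns the normalized form of text, and that it does not help with ligatures such as œ, which normalization leaves unchanged.

```csharp
using System;
using System.Text;

public static class NormalizeSketch
{
    // Normalizes both strings to Form C before searching, so that
    // "e\u0301" and "\u00E9" are represented by the same code units.
    public static string ReplaceNormalized(string text, string oldValue, string newValue)
    {
        string normalizedText = text.Normalize(NormalizationForm.FormC);
        string normalizedOld = oldValue.Normalize(NormalizationForm.FormC);

        int index = normalizedText.IndexOf(normalizedOld, StringComparison.CurrentCulture);
        return index >= 0
            ? normalizedText.Substring(0, index) + newValue
              + normalizedText.Substring(index + normalizedOld.Length)
            : normalizedText;
    }
}
```

With this sketch, both ReplaceNormalized("d\u00E9f", "e\u0301", "o") and ReplaceNormalized("de\u0301f", "\u00E9", "o") produce "dof".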

#3


2  

Using the following methods works for your examples. It works by comparing values until it finds how many characters are needed in the source string to equal the oldValue, and using that instead of simply oldValue.Length.

public static string Replace(string text, string oldValue, string newValue)
{
    int index = text.IndexOf(oldValue, StringComparison.CurrentCulture);
    if (index >= 0)
        return text.Substring(0, index) + newValue +
                 text.Substring(index + LengthInString(text, oldValue, index));
    else
        return text;
}
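
// Finds the shortest substring of text starting at index that compares
// equal to oldValue under the current culture, and returns its length.
static int LengthInString(string text, string oldValue, int index)
{
    for (int length = 1; length <= text.Length - index; length++)
        if (string.Equals(text.Substring(index, length), oldValue,
                                            StringComparison.CurrentCulture))
            return length;
    // IndexOf reported a match at index, so this should not be reached.
    throw new InvalidOperationException("No substring at index compares equal to oldValue.");
}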
