
Time: 2022-12-06 21:44:31

I've been trying to split a string by another string using the Regex.Split() method in C#. Either the data or the splitter can contain diacritics.


Let me give you an example:


Data: education

Splitter: du


Expected result: e / du / cation



Data: èdùcation

Splitter: ed

Expected result: èd / ùcation


Is it possible? If so, could you help me write the pattern?


1 solution



There is no option in .NET's regular expression engine to "ignore diacritics". However, it might be possible to work around this by making use of Unicode Normalization Form D (for "decomposed"). This is untested.


Accented characters can be represented in two ways:


  • As a single pre-composed code point. E.g. U+00F9 (Latin Small Letter U with Grave).
  • As a base code point followed by one or more combining characters. E.g. U+0075, U+0300 (Latin Small Letter U, Combining Grave Accent).
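
To make the two representations concrete, here is a minimal sketch that prints the UTF-16 code units of ù before and after decomposition. The class name NormalizationDemo and the helper CodePoints are made up for illustration; String.Normalize and NormalizationForm.FormD are the real .NET APIs.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: formats each UTF-16 code unit as U+XXXX.
    // For the BMP characters used here, code units equal code points.
    public static string CodePoints(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string composed = "\u00F9"; // pre-composed ù
        string decomposed = composed.Normalize(NormalizationForm.FormD);

        Console.WriteLine(CodePoints(composed));   // U+00F9
        Console.WriteLine(CodePoints(decomposed)); // U+0075 U+0300
    }
}
```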

Thus, if you ensure the input data is decomposed (call String.Normalize(NormalizationForm), passing NormalizationForm.FormD), and replace every potentially accented character in the pattern with a base character B followed by zero or more code points in the Unicode category "Mark" (\p{M}, which covers nonspacing marks such as U+0300 as well as spacing combining marks), the pattern will match the text with or without diacritics.
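
One way to build such a pattern from an arbitrary splitter might look like the sketch below. DiacriticPattern.Build is a hypothetical helper, not something .NET provides. It assumes \p{M} (all mark categories) is the class to use, because combining accents such as U+0300 are classified as nonspacing marks (Mn) and would not be matched by \p{Mc} alone.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class DiacriticPattern
{
    // Hypothetical helper: turns a splitter such as "du" into a pattern that
    // allows any number of combining marks after each base character.
    public static string Build(string splitter)
    {
        var sb = new StringBuilder();
        foreach (char c in splitter.Normalize(NormalizationForm.FormD))
        {
            // Skip marks carried by the splitter itself, so "dù" and "du"
            // produce the same pattern.
            UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
            if (cat == UnicodeCategory.NonSpacingMark ||
                cat == UnicodeCategory.SpacingCombiningMark ||
                cat == UnicodeCategory.EnclosingMark)
            {
                continue;
            }

            // Escape the base character in case it is a regex metacharacter.
            sb.Append(Regex.Escape(c.ToString())).Append(@"\p{M}*");
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Build("du"));       // d\p{M}*u\p{M}*
        Console.WriteLine(Build("d\u00F9"));  // d\p{M}*u\p{M}*
    }
}
```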

To include the text that matches the regex in the output, make it a capturing group. So, to match and capture both du and dù, use (d\p{M}*u\p{M}*), where \p{M} matches any combining mark.
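
Putting the pieces together, a rough sketch of the whole approach could look like this. DiacriticSplit.Split is a hypothetical wrapper. It assumes the pattern already allows marks after each base character, uses \p{M} so that nonspacing accents like U+0300 are matched, and recomposes each piece with FormC so the caller gets pre-composed characters back.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class DiacriticSplit
{
    // Hypothetical wrapper: decomposes the data, splits on the pattern while
    // keeping the matched text, then recomposes every piece.
    public static string[] Split(string data, string pattern)
    {
        string decomposed = data.Normalize(NormalizationForm.FormD);

        // The capturing group makes Regex.Split include the match in its result.
        string[] parts = Regex.Split(decomposed, "(" + pattern + ")");

        for (int i = 0; i < parts.Length; i++)
        {
            parts[i] = parts[i].Normalize(NormalizationForm.FormC);
        }
        return parts;
    }

    public static void Main()
    {
        const string pattern = @"d\p{M}*u\p{M}*";

        // e / du / cation
        Console.WriteLine(string.Join(" / ", Split("education", pattern)));

        // è / dù / cation
        Console.WriteLine(string.Join(" / ", Split("\u00E8d\u00F9cation", pattern)));
    }
}
```

Note that Regex.Split returns an empty string as the first or last element when the match sits at the very start or end of the input, so you may want to filter those out.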


