RegEx: Split a string by another string, including diacritics

Time: 2022-12-06 21:44:31

I've been trying to split a string by another string using the Regex.Split() method in C#. Either the data or the splitter can contain diacritics.

Let me give you an example:

Data: education

Splitter:

Expected result: e / du / cation

--or--

Data: èdùcation

Splitter: ed

Expected result: èd / ùcation

Is it possible? If it is, could you help me write the pattern?

1 solution

#1


There is no option in .NET's regular expression engine to "ignore diacritics". However, it might be possible to work around this by using Unicode Normalization Form D (the "D" stands for "decomposed"). This is untested.

Accented characters can be represented in two ways:

  • As a single pre-composed code point, e.g. U+00F9 (Latin Small Letter U with Grave).

  • As a base code point followed by one or more combining characters, e.g. U+0075, U+0300 (Latin Small Letter U, Combining Grave Accent).
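To see the two representations side by side, here is a small sketch. The class and method names (NormalizationDemo, CodeUnits, Decompose) are made up for illustration; only String.Normalize and NormalizationForm come from .NET.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: lists the UTF-16 code units of a string as "U+XXXX" tokens.
    // For characters in the Basic Multilingual Plane these equal the code points.
    public static string CodeUnits(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    // Converts a string to Normalization Form D (base characters followed by combining marks).
    public static string Decompose(string s) => s.Normalize(NormalizationForm.FormD);
}
```

Calling NormalizationDemo.CodeUnits("\u00F9") gives U+00F9, while NormalizationDemo.CodeUnits(NormalizationDemo.Decompose("\u00F9")) gives U+0075 U+0300.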

So if you ensure the input data is decomposed (call String.Normalize(NormalizationForm.FormD)), and replace every potentially accented character in the pattern with

B\p{M}*

that is, a base character B followed by zero or more code points in the Unicode category "Mark", then the pattern should match the character with or without its diacritics. The M category covers Mn, Mc and Me; combining accents such as U+0300 are in "Mark, Nonspacing" (Mn), not "Mark, Spacing Combining" (Mc).

To include the text that matches the regex in the output, make it a capturing group: to match and capture both du and dù, use (du\p{M}*).
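Putting the steps together, a rough sketch might look like the following. The DiacriticSplit class and its BuildPattern and Split methods are hypothetical names, and recomposing the pieces with FormC at the end is my own assumption about what output is wanted. It also ignores case and does nothing special for an empty splitter.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class DiacriticSplit
{
    // Builds a capturing pattern in which every base character of the splitter
    // is followed by \p{M}* (zero or more combining marks).
    public static string BuildPattern(string splitter)
    {
        // Decompose the splitter and drop its own marks, so "dù" and "du" yield the same pattern.
        string decomposed = splitter.Normalize(NormalizationForm.FormD);
        string baseChars = Regex.Replace(decomposed, @"\p{M}", "");

        var sb = new StringBuilder("(");
        foreach (char c in baseChars)
        {
            sb.Append(Regex.Escape(c.ToString())).Append(@"\p{M}*");
        }
        return sb.Append(')').ToString();
    }

    // Splits data on the splitter, ignoring diacritics on either side.
    // Because the pattern is a capturing group, Regex.Split keeps the matched text in the result.
    public static string[] Split(string data, string splitter)
    {
        string decomposedData = data.Normalize(NormalizationForm.FormD);
        string[] parts = Regex.Split(decomposedData, BuildPattern(splitter));

        // Assumption: callers want pre-composed characters back, so recompose each piece.
        for (int i = 0; i < parts.Length; i++)
        {
            parts[i] = parts[i].Normalize(NormalizationForm.FormC);
        }
        return parts;
    }
}
```

With this, DiacriticSplit.Split("education", "dù") returns e, du, cation, and DiacriticSplit.Split("èdùcation", "du") returns è, dù, cation. Note that the matched text comes back as its own element, so getting "èd / ùcation" from the second example in the question would need the pieces joined afterwards.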
