RegEx replacement for invalid numeric character references

I need some serious RegEx help with replacing invalid numeric character references in an XML document.


Some of our XML data used in production has been made unreadable by a known bug in XmlWriter that drops semicolons when writing XML entities. Unfortunately, for some strange reason the production environment was not running the latest .NET Framework, so quite a lot of this data was inserted into the database, and now I have to find a way to read it back and repair it.


An example of the malformed XML (look for &#xE1d& and &#x3A3 in the XML below):


<TestInvalidUnicodeReading Desc="a&#xF1;o &#x20AC;  &#x3A3 &#xC6; Jako efektivn&#x11;B;j&#x161;&#xED; se n&#xE1;m jev&#xED; po&#x159&#xE1d&#xE1;n&#xED; tzv. st&#x159ed;nictv&#xED;m na&#x161;ich an&#xFDc;h dealer&#x16F; v &#x10Cec;h&#xE1c;h a na Morav&#x11;B, kter&#xE9; prob&#x11;Bhnou v pr&#x16Fb;&#x11;Bhu z&#xE1;&#x159;&#xED; a &#x159;&#xEDjna.bddb26e234c5452aab7720c581e137f7" />

To fix this, I devised the following RegEx and use it in C# to find each match and add the missing semicolon. It only partially works:


&((?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+))[?&0-9a-zA-Z ])

Now the problem is with the &#xE1d& section.


Because the RegEx has already consumed the leading & as the last character of the previous match (&#x159&), the next reference, &#xE1d&, gets skipped. Can someone please lend me a hand finding a solution to this RegEx issue?
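The skipping can be reproduced in isolation. The following is a minimal sketch in Python rather than C#, assuming a shortened fragment of the sample attribute; the pattern is the one from the question, unchanged, and the variable names are only illustrative.

```python
import re

# The pattern from the question, unchanged.
original = re.compile(
    r"&((?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+))[?&0-9a-zA-Z ])"
)

# Shortened fragment of the sample attribute (an assumption for illustration).
sample = "po&#x159&#xE1d&"

matches = [m.group(0) for m in original.finditer(sample)]
first = original.search(sample)
skipped_start = sample.find("&#xE1d")

# The trailing character class consumes the '&' that begins the next reference,
# so the first match ends after the position where '&#xE1d' starts.
print(matches)          # ['&#x159&']
print(first.span())     # (2, 9)
print(skipped_start)    # 8
```

The first match ends at index 9, one past the & at index 8 that opens &#xE1d, so the engine resumes scanning after that character and never sees the second reference.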


1 solution

#1

I think you can improve the regex by using a negative lookahead assertion:


&(#[0-9]+(?![0-9;])|#x[0-9a-fA-F]+(?![0-9a-fA-F;]))

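This will only match numeric character references that are not followed by a ;. Because a lookahead consumes no characters, a & that immediately follows a match stays available as the start of the next one.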


Explanation:

&                 # Match &
(                 # Start of capturing group:
 #[0-9]+          # Match either # plus decimal digits
 (?![0-9;])       # as long as they are not followed by a semicolon or more digits
|                 # or
 #x[0-9a-fA-F]+   # match #x plus hex digits
 (?![0-9a-fA-F;]) # as long as they are not followed by a semicolon or more hex digits
)                 # End of group

Test it live on regex101.com.
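To actually repair the data, each match can be replaced with itself plus a semicolon. Here is a minimal sketch in Python, assuming a shortened fragment of the sample attribute; the helper name add_missing_semicolons is hypothetical. The pattern uses only character classes and negative lookaheads, which the .NET regex engine also supports, so the same idea should carry over to Regex.Replace in C#.

```python
import re

# The lookahead-based pattern from the answer, unchanged.
BROKEN_REF = re.compile(r"&(#[0-9]+(?![0-9;])|#x[0-9a-fA-F]+(?![0-9a-fA-F;]))")


def add_missing_semicolons(text: str) -> str:
    """Append ';' to every numeric character reference that lacks one."""
    return BROKEN_REF.sub(lambda m: m.group(0) + ";", text)


# Shortened fragment of the sample attribute (an assumption for illustration).
sample = "a&#xF1;o &#x20AC;  &#x3A3 &#xC6; po&#x159&#xE1d&#xE1;n&#xED;"
fixed = add_missing_semicolons(sample)
print(fixed)
# a&#xF1;o &#x20AC;  &#x3A3; &#xC6; po&#x159;&#xE1d;&#xE1;n&#xED;
```

References that already end in a semicolon are left alone, and &#x159&#xE1d is now handled as two separate matches. One caveat visible in the output: when the character after the dropped semicolon is itself a hex digit, as with the d in &#xE1d, the pattern cannot tell where the reference originally ended, so the semicolon lands after the d.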

