Stripping escape sequences of invalid XML characters

Time: 2021-12-08 22:27:06

According to the XML spec, only the following characters are legal:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
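For reference, the Char production above can be written as a plain code-point test. This is only a minimal sketch: the class name XmlChars and the method name isLegalXmlChar are made up for illustration and are not part of any library.

```java
// Sketch of the XML 1.0 Char production as a code-point predicate.
// XmlChars and isLegalXmlChar are hypothetical names chosen for this example.
final class XmlChars {
    private XmlChars() {}

    /** Returns true if the given code point is allowed by the Char production. */
    static boolean isLegalXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isLegalXmlChar(0x0009)); // tab: true
        System.out.println(isLegalXmlChar(0x0002)); // control character: false
        System.out.println(isLegalXmlChar(0xD800)); // lone surrogate: false
    }
}
```

Note that the surrogate block is excluded only for lone code units; a character such as U+1F600 is legal as a code point even though JSON would escape it as a surrogate pair.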

I have a string named foo containing the JSON representation of an object. Some strings of the JSON object contain escape sequences for characters that are illegal in XML, e.g. \u0002 and \u000b.

I want to strip those escape sequences from foo before passing it to a JSON-to-XML converter, because the converter is a black box that provides no way to handle those invalid characters.

Example of what I would like to do:

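String MAGIC_REGEX = "<here's what needs to be found>";  // TODO

String foo = "\\u0002bar b\\u000baz qu\\u000fx";
// replaceAll() treats its first argument as a regex; replace() would match it literally.
// Pass the Unicode replacement character (U+FFFD) instead of "" to substitute rather than strip.
String clean_foo = foo.replaceAll(MAGIC_REGEX, "");

System.out.println(clean_foo);  // Output is "bar baz qux"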

How can I achieve that? Bonus points for solutions that use a regex instead of parsing the string and comparing Unicode codepoints.

I am aware of this SO question. However, my problem here is the escape sequences of the illegal characters, not the actual characters themselves.

1 solution

#1

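I finally came up with this regex, which matches the escape sequences of almost all characters that are illegal according to the XML spec. The only ones it does not cover are code points above #x10FFFF (#x110000 and onwards), which a four-digit \u escape cannot express anyway. The patterns are written as Java string literals, so \\\\u matches a literal backslash followed by u: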

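# case-sensitive version (accepts upper- and lowercase hex digits explicitly)
\\\\u(00(0[0-8BCEFbcef]|1[0-9A-Fa-f])|[Dd][8-9A-Fa-f][0-9A-Fa-f]{2}|[Ff]{3}[EFef])

# case-insensitive version (compile with the case-insensitive flag)
\\\\u(00(0[0-8bcef]|1[0-9a-f])|d[8-9a-f][0-9a-f]{2}|fff[ef])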
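As a rough illustration of how such a pattern could be applied to the question's example, here is a minimal sketch using java.util.regex. The class name XmlEscapeStripper and the method name strip are hypothetical. The pattern spells out both letter cases, lists the allowed fourth hex digit explicitly instead of using a negated class, and accepts a lowercase d for the surrogate range.

```java
import java.util.regex.Pattern;

// Sketch only: XmlEscapeStripper and strip are hypothetical names for this example.
final class XmlEscapeStripper {
    // Matches a literal backslash, the letter u, and four hex digits that name
    // a code point outside the XML Char production (within the four-digit range).
    private static final Pattern ILLEGAL_ESCAPE = Pattern.compile(
        "\\\\u(00(0[0-8BCEFbcef]|1[0-9A-Fa-f])"    // U+0000..U+001F except 9, A, D
        + "|[Dd][8-9A-Fa-f][0-9A-Fa-f]{2}"         // U+D800..U+DFFF (surrogates)
        + "|[Ff]{3}[EFef])");                      // U+FFFE and U+FFFF

    private XmlEscapeStripper() {}

    /** Removes every matching escape sequence from the JSON text. */
    static String strip(String json) {
        return ILLEGAL_ESCAPE.matcher(json).replaceAll("");
    }

    public static void main(String[] args) {
        String foo = "\\u0002bar b\\u000baz qu\\u000fx";
        System.out.println(strip(foo)); // prints "bar baz qux"
    }
}
```

Two limitations are worth keeping in mind. The pattern removes every surrogate escape, so a supplementary character that JSON encodes as a surrogate pair is stripped even though the code point itself is legal in XML. It also does not check whether the backslash is itself escaped, so a JSON string containing an escaped backslash followed by the text u0002 would be altered as well.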
