According to the XML spec, only the following characters are legal:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
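For reference, the Char production above can also be read as a plain range check on a code point. The following is only a minimal sketch; the class name XmlChars and the method name isLegalXmlChar are hypothetical names chosen for illustration, not part of any library:

```java
final class XmlChars {
    private XmlChars() {}

    /**
     * Returns true if the code point falls into one of the ranges
     * listed in the Char production quoted above.
     */
    static boolean isLegalXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }
}
```

Under this check, 0x2 and 0xB (the characters behind \u0002 and \u000b) are rejected, while tab, line feed and carriage return are accepted.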
I have a string named foo containing the JSON representation of an object. Some strings in the JSON object contain escape sequences for characters that are illegal in XML, e.g. \u0002 and \u000b.
I want to strip those escape sequences from foo before passing it to a JSON-to-XML converter, because the converter is a black box that provides no way to handle those invalid characters.
An example of what I would like to do:
String MAGIC_REGEX = "<here's what needs to be found>"; // TODO
String foo = "\\u0002bar b\\u000baz qu\\u000fx";
// replaceAll (unlike replace) interprets its first argument as a regex
String cleanFoo = foo.replaceAll(MAGIC_REGEX, ""); // strip each matched escape sequence
System.out.println(cleanFoo); // Output is "bar baz qux"
How can I achieve that? Bonus points for solutions that use a regex instead of parsing the string and comparing Unicode codepoints.
I am aware of this SO question. However, my problem here is the escape sequences of the illegal characters, not the actual characters themselves.
1 Answer
I finally came up with this regex, which matches the escape sequences of almost all characters that are illegal according to the XML spec, except those beyond #x10FFFF (#x110000 and onwards):
# case-sensitive version
\\\\u(00(0[^9ADad]|1[0-9A-Fa-f])|[Dd][8-9A-Fa-f][0-9A-Fa-f]{2}|[Ff]{3}[EFef])
# case-insensitive version
\\\\u(00(0[^9ad]|1[0-9a-f])|D[8-9a-f][0-9a-f]{2}|fff[ef])
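As a rough illustration of how the case-insensitive regex above might be applied in Java, here is a minimal sketch. The class name JsonEscapeStripper and the method name strip are hypothetical; only java.util.regex.Pattern from the standard library is used, and the sketch assumes the regex is written as a Java string literal, so "\\\\u" becomes the regex \\u, which matches a literal backslash followed by u:

```java
import java.util.regex.Pattern;

final class JsonEscapeStripper {
    // Case-insensitive variant of the regex, compiled once and reused.
    private static final Pattern ILLEGAL_XML_ESCAPE = Pattern.compile(
        "\\\\u(00(0[^9ad]|1[0-9a-f])|D[8-9a-f][0-9a-f]{2}|fff[ef])",
        Pattern.CASE_INSENSITIVE);

    private JsonEscapeStripper() {}

    /** Removes every JSON \\uXXXX escape sequence matched by the regex. */
    static String strip(String json) {
        return ILLEGAL_XML_ESCAPE.matcher(json).replaceAll("");
    }

    public static void main(String[] args) {
        // The doubled backslashes produce literal "\u0002" etc. in the string.
        String foo = "\\u0002bar b\\u000baz qu\\u000fx";
        System.out.println(strip(foo)); // prints "bar baz qux"
    }
}
```

One caveat worth keeping in mind: the regex also removes surrogate escapes (\uD800 to \uDFFF), which JSON uses in pairs to encode characters above #xFFFF, so such characters would be stripped as well even though they are legal in XML.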