How to remove invalid characters from an XML file using sed or Perl

Date: 2021-09-07 22:23:57

I want to remove all invalid characters (for example, the character with hexadecimal value 0x1A) from an XML file using sed.
What are the regex and the command line?
EDIT: Added the Perl tag hoping to get more responses. I prefer a one-liner solution.
EDIT: These are the valid XML characters:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

3 answers

#1


8  

Assuming UTF-8 XML documents:

perl -CSDA -pe'
   s/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;
' file.xml > file_fixed.xml

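If you want to encode the bad characters as numeric character references instead (note that XML 1.0 still forbids references to characters outside the ranges above, so a strict parser may reject the result):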

perl -CSDA -pe'
   s/([^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}])/
      "&#".ord($1).";"
   /xeg;
' file.xml > file_fixed.xml

You can call it a few different ways:

perl -CSDA     -pe'...' file.xml > file_fixed.xml
perl -CSDA -i~ -pe'...' file.xml     # In place, with backup (file.xml~)
perl -CSDA -i  -pe'...' file.xml     # In place, without backup

#2


2  

The tr command would be simpler. So, try something like:

tr -d '\032' < file.xml > file_fixed.xml

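Note that the ASCII character 0x1A has the octal value 032, which is the form used here: tr accepts octal escapes (\NNN), and hex escapes are not part of the syntax POSIX specifies for it. This command removes only 0x1A, not the other invalid characters.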
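The same idea can be stretched to cover every ASCII control character that XML 1.0 disallows, by giving tr octal ranges. This is only a sketch under the assumption that the problem characters are all below 0x20; it does not touch invalid code points above the ASCII range such as U+FFFE or U+FFFF. The sample input is made up for illustration.

```shell
# Delete all control characters below 0x20 except tab (011), LF (012) and
# CR (015), which XML 1.0 permits.
cleaned=$(printf 'a\001b\032c\td\n' | tr -d '\000-\010\013\014\016-\037')

# 0x01 and 0x1A are gone; the tab between "c" and "d" survives.
printf '%s\n' "$cleaned"
```

For a real file the same filter would read from and write to files, for example `tr -d '\000-\010\013\014\016-\037' < file.xml > file_fixed.xml`.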

#3


0  

Try:

perl -pi -e 's/[^\x9\xA\xD\x20-\x{d7ff}\x{e000}-\x{fffd}\x{10000}-\x{10ffff}]//g' file.xml
