Escaping invalid XML characters in C#

Date: 2021-12-08 22:27:24

I have a string that contains invalid XML characters. How can I escape (or remove) invalid XML characters before I parse the string?

6 answers

#1


87  

To remove invalid XML characters, I suggest using the XmlConvert.IsXmlChar method. It was added in .NET Framework 4 and is available in Silverlight too. Here is a small sample:

void Main() {
    string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False

    content = RemoveInvalidXmlChars(content);
    Console.WriteLine(IsValidXmlString(content)); // True
}

static string RemoveInvalidXmlChars(string text) {
    var validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
    return new string(validXmlChars);
}

static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}

To escape invalid XML characters, I suggest using the XmlConvert.EncodeName method. Here is a small sample:

void Main() {
    const string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False

    string encoded = XmlConvert.EncodeName(content);
    Console.WriteLine(IsValidXmlString(encoded)); // True

    string decoded = XmlConvert.DecodeName(encoded);
    Console.WriteLine(content == decoded); // True
}

static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}

Update: Note that the encoding operation produces a string whose length is greater than or equal to the length of the source string. This can matter when you store the encoded string in a database column with a length limit and validate the length of the source string in your app against that limit.
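The length growth is easy to check before writing to a size-limited column. The following is a minimal sketch; the class name and the EncodedFits helper are made up for illustration and are not part of the framework.

```csharp
using System;
using System.Xml;

public static class EncodeNameDemo
{
    // Hypothetical helper: true when the encoded form of 'source'
    // still fits a column limited to 'maxLength' characters.
    public static bool EncodedFits(string source, int maxLength)
    {
        string encoded = XmlConvert.EncodeName(source);
        return encoded.Length <= maxLength;
    }

    public static void Main()
    {
        const string content = "\v\f\0"; // 3 characters, all invalid in XML
        string encoded = XmlConvert.EncodeName(content);

        Console.WriteLine(content.Length);
        Console.WriteLine(encoded.Length);          // larger than content.Length
        Console.WriteLine(EncodedFits(content, 3)); // False
    }
}
```

Keep in mind that EncodeName targets XML names, so it also escapes characters that are perfectly valid in text content, such as the space in "Order Detail", which becomes "Order_x0020_Detail".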

#2


60  

Use SecurityElement.Escape

using System;
using System.Security;

class Sample {
  static void Main() {
    string text = "Escape characters : < > & \" \'";
    string xmlText = SecurityElement.Escape(text);
    Console.WriteLine(xmlText);
    // Output:
    // Escape characters : &lt; &gt; &amp; &quot; &apos;
  }
}
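Note that SecurityElement.Escape only replaces the five markup characters shown above; control characters such as \0 pass through unchanged. If both problems can occur, one option is to filter first and escape afterwards. This is a sketch only: the StripAndEscape helper is a hypothetical name, and like the simple filter in answer #1 it drops surrogate pairs.

```csharp
using System;
using System.Linq;
using System.Security;
using System.Xml;

public static class EscapeDemo
{
    // Hypothetical helper: drop characters that XmlConvert.IsXmlChar rejects,
    // then escape the remaining markup characters.
    public static string StripAndEscape(string text)
    {
        char[] valid = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
        return SecurityElement.Escape(new string(valid));
    }

    public static void Main()
    {
        Console.WriteLine(StripAndEscape("a\0 < b")); // a &lt; b
    }
}
```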

#3


19  

If you are writing XML, just use the classes provided by the framework to create it. You won't have to bother with escaping at all.

Console.Write(new XElement("Data", "< > &"));

This outputs:

<Data>&lt; &gt; &amp;</Data>
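The escaping is undone again when the XML is parsed, so the original text round-trips without any manual work. A small sketch, assuming System.Xml.Linq is referenced; the class and method names are illustrative only.

```csharp
using System;
using System.Xml.Linq;

public static class XElementDemo
{
    // Serialize a value inside an element, then parse it back.
    public static string RoundTrip(string value)
    {
        string xml = new XElement("Data", value).ToString();
        return XElement.Parse(xml).Value;
    }

    public static void Main()
    {
        Console.WriteLine(RoundTrip("< > &") == "< > &"); // True
    }
}
```

This covers markup characters only; it does not make characters such as \0 legal in an XML 1.0 document.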

If you need to read an XML file that is malformed, do not use regular expressions. Instead, use the Html Agility Pack.

#4


4  

The RemoveInvalidXmlChars method provided by Irishman does not support surrogate characters. To test it, use the following example:

static void Main()
{
    const string content = "\v\U00010330";

    string newContent = RemoveInvalidXmlChars(content);

    Console.WriteLine(newContent);
}

This returns an empty string but it shouldn't! It should return "\U00010330" because the character U+10330 is a valid XML character.
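The reason is that a character outside the Basic Multilingual Plane occupies two UTF-16 code units in a .NET string, and a per-char filter tests each half on its own. A short sketch of what the filter sees (the class name is illustrative):

```csharp
using System;
using System.Xml;

public static class SurrogateDemo
{
    public static void Main()
    {
        const string s = "\U00010330"; // one code point, two UTF-16 chars

        Console.WriteLine(s.Length);                   // 2
        Console.WriteLine(XmlConvert.IsXmlChar(s[0])); // False (high surrogate)
        Console.WriteLine(XmlConvert.IsXmlChar(s[1])); // False (low surrogate)

        // Note the argument order: low surrogate first, then high surrogate.
        Console.WriteLine(XmlConvert.IsXmlSurrogatePair(s[1], s[0])); // True
    }
}
```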

To support surrogate characters, I suggest using the following method:

public static string RemoveInvalidXmlChars(string text)
{
    if (string.IsNullOrEmpty(text))
        return text;

    int length = text.Length;
    StringBuilder stringBuilder = new StringBuilder(length);

    for (int i = 0; i < length; ++i)
    {
        if (XmlConvert.IsXmlChar(text[i]))
        {
            stringBuilder.Append(text[i]);
        }
        else if (i + 1 < length && XmlConvert.IsXmlSurrogatePair(text[i + 1], text[i]))
        {
            stringBuilder.Append(text[i]);
            stringBuilder.Append(text[i + 1]);
            ++i;
        }
    }

    return stringBuilder.ToString();
}

#5


3  

Here is an optimized version of the RemoveInvalidXmlChars method above. It doesn't create a new array on every call, and so avoids stressing the GC unnecessarily:

public static string RemoveInvalidXmlChars(string text)
{
    if (string.IsNullOrEmpty(text)) return text;

    // A bit complicated, but avoids allocating anything when the input is clean.
    // Note: like the LINQ version, this drops surrogate pairs (see answer #4).
    StringBuilder result = null;
    for (int i = 0; i < text.Length; i++)
    {
        var ch = text[i];
        if (XmlConvert.IsXmlChar(ch))
        {
            result?.Append(ch);
        }
        else if (result == null)
        {
            // First invalid character: copy everything that came before it.
            result = new StringBuilder(text.Length);
            result.Append(text, 0, i);
        }
    }

    // No invalid XML chars detected: return the original text.
    return result == null ? text : result.ToString();
}
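The lazy allocation used here and the surrogate handling from answer #4 can be combined. The following is a sketch under that assumption, not a tested library routine; the XmlSanitizer class name is made up.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlSanitizer
{
    public static string RemoveInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        StringBuilder result = null; // stays null while the input is clean
        for (int i = 0; i < text.Length; i++)
        {
            char ch = text[i];
            if (XmlConvert.IsXmlChar(ch))
            {
                result?.Append(ch);
            }
            else if (i + 1 < text.Length
                     && XmlConvert.IsXmlSurrogatePair(text[i + 1], ch))
            {
                // Valid pair: keep both halves and skip the low surrogate.
                result?.Append(ch).Append(text[i + 1]);
                i++;
            }
            else if (result == null)
            {
                // First invalid character: copy everything before it.
                result = new StringBuilder(text.Length);
                result.Append(text, 0, i);
            }
        }

        return result == null ? text : result.ToString();
    }

    public static void Main()
    {
        string clean = "already fine";
        // Clean input comes back as the same instance, without copying.
        Console.WriteLine(ReferenceEquals(clean, RemoveInvalidXmlChars(clean)));
        Console.WriteLine(RemoveInvalidXmlChars("a\v\U00010330b") == "a\U00010330b");
    }
}
```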

#6


0  

// Replace invalid characters with empty strings.
string cleaned = Regex.Replace(inputString, @"[^\w\.@-]", "");

The regular expression pattern [^\w\.@-] matches any character that is not a word character, a period, an @ symbol, or a hyphen. A word character is any letter, decimal digit, or punctuation connector such as an underscore. Any character that matches this pattern is replaced by String.Empty, which is the string defined by the replacement pattern. To allow additional characters in user input, add those characters to the character class in the regular expression pattern. For example, the regular expression pattern [^\w\.@\\%-] also allows a percentage symbol and a backslash in an input string (the hyphen is placed last so that it is read literally rather than as a range).

Regex.Replace(inputString, @"[!@#$%_]", "");
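To see what the [^\w\.@-] whitelist described above does in practice, here is a small sketch (the class and method names are illustrative). It removes control characters, but it also removes spaces and angle brackets, so it is considerably stricter than XML validity requires.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitelistDemo
{
    // Keep only word characters, periods, @ symbols and hyphens.
    public static string Clean(string input) =>
        Regex.Replace(input, @"[^\w\.@-]", "");

    public static void Main()
    {
        Console.WriteLine(Clean("john.doe@example.com <script>"));
        // john.doe@example.comscript
        Console.WriteLine(Clean("a\vb")); // ab
    }
}
```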

See also:

Removing Invalid Characters from XML Name Tag - RegEx C#

Here is a function to remove the characters from a specified XML string:

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

namespace XMLUtils
{
    class Standards
    {
        /// <summary>
        /// Strips "#x..." hexadecimal references to characters that the
        /// patterns below treat as illegal for the given XML version.
        /// Refer to http://www.w3.org/TR/xml11/#charsets for XML 1.1
        /// Refer to http://www.w3.org/TR/2006/REC-xml-20060816/#charsets for XML 1.0
        /// </summary>
        /// <param name="tmpContents">contents</param>
        /// <param name="XMLVersion">XML Specification to use. Can be 1.0 or 1.1</param>
        /// <returns>The contents with all matches removed</returns>
        public static string StripIllegalXMLChars(string tmpContents, string XMLVersion)
        {
            string pattern;
            switch (XMLVersion)
            {
                case "1.0":
                    pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]9[0-9A-F])";
                    break;
                case "1.1":
                    pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF])";
                    break;
                default:
                    throw new ArgumentException("Error: Invalid XML Version!", nameof(XMLVersion));
            }

            // Replace returns the input unchanged when nothing matches,
            // so no separate IsMatch check is needed.
            return Regex.Replace(tmpContents, pattern, String.Empty, RegexOptions.IgnoreCase);
        }
    }
}
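The patterns above match textual "#x..." references rather than the raw characters themselves. If the goal is to strip raw characters with a regular expression, one possible sketch follows. It assumes the XML 1.0 Char production (tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, and properly paired surrogates are allowed); the class name and pattern are my own and not taken from the answer.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class RawCharStripper
{
    // Assumption: XML 1.0 rules. Matches C0 controls other than tab, LF
    // and CR, the noncharacters U+FFFE and U+FFFF, and unpaired surrogates.
    private static readonly Regex InvalidXml10 = new Regex(
        @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]" +
        @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])" +   // high surrogate without a low one
        @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]");  // low surrogate without a high one

    public static string Strip(string text) =>
        InvalidXml10.Replace(text, string.Empty);

    public static void Main()
    {
        string cleaned = Strip("a\v\U00010330b\uD800");
        XmlConvert.VerifyXmlChars(cleaned); // throws if anything invalid is left
        Console.WriteLine(cleaned == "a\U00010330b"); // True
    }
}
```

For most code, the XmlConvert-based loops in the earlier answers are easier to verify than a hand-written character class; the regex form is mainly useful when a pipeline already works with patterns.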
