SQL Server: Replacing invalid XML characters in a VARCHAR(MAX) field

Date: 2022-09-14 10:40:52

I have a VARCHAR(MAX) field which is being interfaced to an external system in XML format. The following errors were thrown by the interface:

mywebsite.com-2015-0202.xml:413005: parser error : xmlParseCharRef: invalid xmlChar value 29
ne and Luke's family in Santa Fe. You know you have a standing invitation,
                                                                               ^
mywebsite.com-2015-0202.xml:455971: parser error : xmlParseCharRef: invalid xmlChar value 25
The apprentice nodded, because frankly, who hadnt? That diseases like chol
                                                      ^
mywebsite.com.com-2015-0202.xml:456077: parser error : xmlParseCharRef: invalid xmlChar value 28
bon mot; a sentimental love of nature and animals; the proverbial British 
                                                                               ^
mywebsite.com-2015-0202.xml:472073: parser error : xmlParseCharRef: invalid xmlChar value 20
"Andyou want that?"
          ^
mywebsite.com-2015-0202.xml:492912: parser error : xmlParseCharRef: invalid xmlChar value 25
She couldnt live like this anymore.

We found that the following characters are invalid:

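[List of raw control characters, which do not display here. The function below targets code points 0-8, 11, 12 and 14-31.]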

I am trying to clean this data, and I found a SQL function to clean these characters here. However, the function takes NVARCHAR(4000) as its input parameter, so I changed it to use VARCHAR(MAX) instead.

Could anyone please advise whether changing NVARCHAR(4000) to VARCHAR(MAX) would produce wrong results? Sorry, I am not able to test this interface locally, so I thought I would ask for opinions and advice.

Original Function:

CREATE FUNCTION fnStripLowAscii (@InputString nvarchar(4000))
RETURNS nvarchar(4000)
AS
BEGIN
IF @InputString IS NOT NULL
BEGIN
  DECLARE @Counter int, @TestString nvarchar(40)

  SET @TestString = '%[' + NCHAR(0) + NCHAR(1) + NCHAR(2) + NCHAR(3) + NCHAR(4) + NCHAR(5) + NCHAR(6) + NCHAR(7) + NCHAR(8) + NCHAR(11) + NCHAR(12) + NCHAR(14) + NCHAR(15) + NCHAR(16) + NCHAR(17) + NCHAR(18) + NCHAR(19) + NCHAR(20) + NCHAR(21) + NCHAR(22) + NCHAR(23) + NCHAR(24) + NCHAR(25) + NCHAR(26) + NCHAR(27) + NCHAR(28) + NCHAR(29) + NCHAR(30) + NCHAR(31) + ']%'

  SELECT @Counter = PATINDEX (@TestString, @InputString COLLATE Latin1_General_BIN)

  WHILE @Counter <> 0
  BEGIN
    SELECT @InputString = STUFF(@InputString, @Counter, 1, NCHAR(164))
    SELECT @Counter = PATINDEX (@TestString, @InputString COLLATE Latin1_General_BIN)
  END
END
RETURN(@InputString)
END

Modified Version:

CREATE FUNCTION [dbo].RemoveInvalidXMLCharacters (@InputString VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    IF @InputString IS NOT NULL
    BEGIN
      DECLARE @Counter INT, @TestString NVARCHAR(40)

      SET @TestString = '%[' + NCHAR(0) + NCHAR(1) + NCHAR(2) + NCHAR(3) + NCHAR(4) + NCHAR(5) + NCHAR(6) + NCHAR(7) + NCHAR(8) + NCHAR(11) + NCHAR(12) + NCHAR(14) + NCHAR(15) + NCHAR(16) + NCHAR(17) + NCHAR(18) + NCHAR(19) + NCHAR(20) + NCHAR(21) + NCHAR(22) + NCHAR(23) + NCHAR(24) + NCHAR(25) + NCHAR(26) + NCHAR(27) + NCHAR(28) + NCHAR(29) + NCHAR(30) + NCHAR(31) + ']%'

      SELECT @Counter = PATINDEX (@TestString, @InputString COLLATE Latin1_General_BIN)

      WHILE @Counter <> 0
      BEGIN
        SELECT @InputString = STUFF(@InputString, @Counter, 1, ' ')
        SELECT @Counter = PATINDEX (@TestString, @InputString COLLATE Latin1_General_BIN)
      END
    END
    RETURN(@InputString)
END
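
Since the interface cannot be exercised locally, one way to sanity-check the modified function is to run it against a small hand-built string in a query window. This is only a sketch: it assumes dbo.RemoveInvalidXMLCharacters has already been created as shown above, and the @Sample value is made up for illustration.

```sql
-- Hypothetical sample: text with control characters 25 and 29 embedded,
-- similar to the values reported in the parser errors.
DECLARE @Sample VARCHAR(MAX) = 'She couldn' + CHAR(25) + 't live like this' + CHAR(29);

-- Clean the string with the modified function.
DECLARE @Cleaned VARCHAR(MAX) = dbo.RemoveInvalidXMLCharacters(@Sample);

-- Wrap the cleaned text in an element and cast it to XML.
-- The cast raises an error if a character that is illegal in XML is still present.
SELECT @Cleaned AS Cleaned,
       CAST((SELECT @Cleaned FOR XML PATH('row')) AS XML) AS AsXml;
```

If the cast succeeds for samples that cover every code point in the pattern, that is reasonable evidence that the cleaned column values will also serialize to well-formed XML.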

3 Answers

#1


7  

There is a trick using the implicit conversion of VARBINARY to base64 and back:

Here is your list of evil characters:

DECLARE @evilChars VARCHAR(MAX)=
  CHAR(0x0)
+ CHAR(0x1)
+ CHAR(0x2)
+ CHAR(0x3)
+ CHAR(0x4)
+ CHAR(0x5)
+ CHAR(0x6)
+ CHAR(0x7)
+ CHAR(0x8)
+ CHAR(0x9)
+ CHAR(0xa)
+ CHAR(0xb)
+ CHAR(0xc)
+ CHAR(0xd)
+ CHAR(0xe)
+ CHAR(0xf)
+ CHAR(0x10)
+ CHAR(0x11)
+ CHAR(0x12)
+ CHAR(0x13)
+ CHAR(0x14)
+ CHAR(0x15)
+ CHAR(0x16)
+ CHAR(0x17)
+ CHAR(0x18)
+ CHAR(0x19)
+ CHAR(0x1a)
+ CHAR(0x1b)
+ CHAR(0x1c)
+ CHAR(0x1d)
+ CHAR(0x1e)
+ CHAR(0x1f)
+ CHAR(0x7f);

This works:

DECLARE @XmlAsString NVARCHAR(MAX)=
(
    SELECT @evilChars FOR XML PATH('test')
);
SELECT @XmlAsString;

The result (some characters, such as tab and line feed, are output literally):

<test>&#x00;&#x01;&#x02;&#x03;&#x04;&#x05;&#x06;&#x07;&#x08;    
&#x0B;&#x0C;&#x0D;&#x0E;&#x0F;&#x10;&#x11;&#x12;&#x13;&#x14;&#x15;&#x16;&#x17;&#x18;&#x19;&#x1A;&#x1B;&#x1C;&#x1D;&#x1E;&#x1F;</test>

The following is not allowed; the cast fails because XML 1.0 does not permit these characters, even as character references:

SELECT CAST(@XmlAsString AS XML)
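
To find out whether a string would fail that cast without aborting the batch, TRY_CONVERT may help (it is available from SQL Server 2012 onwards). A minimal sketch; the two sample strings are made up for illustration:

```sql
-- Hypothetical samples: one with an illegal character reference, one clean.
DECLARE @Bad  NVARCHAR(MAX) = N'<test>&#x1D;</test>';
DECLARE @Good NVARCHAR(MAX) = N'<test>ok</test>';

-- TRY_CONVERT returns NULL instead of raising an error when the conversion fails.
SELECT TRY_CONVERT(XML, @Bad)  AS BadAsXml,
       TRY_CONVERT(XML, @Good) AS GoodAsXml;
```

BadAsXml should come back as NULL, so the same expression could be used in a WHERE clause to locate offending rows.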

But you can use the implicit conversion of VARBINARY to base64

DECLARE @base64 NVARCHAR(MAX)=
(
    SELECT CAST(@evilChars AS VARBINARY(MAX)) FOR XML PATH('test')
);
SELECT @base64;

The result

<test>AAECAwQFBgcICQoLDA0ODxAREhMUFRYXGBkaGxwdHh9/</test>

Now you've got your real XML including the special characters!

SELECT CAST(CAST(@base64 AS XML).value('/test[1]','varbinary(max)') AS VARCHAR(MAX)) FOR XML PATH('reconverted')

#2


1  

It is safe to use VARCHAR(MAX) as my data column is a VARCHAR(MAX) field. Also, there will be an overhead of converting VARCHAR(MAX) to NVARCHAR(MAX) if I pass a VARCHAR(MAX) field to the SQL function which accepts the NVARCHAR(MAX) param.

Thank you very much @RhysJones, @Damien_The_Unbeliever for your comments.

#3


0  

You need to use nvarchar(max) instead of varchar(max) but otherwise the change is fine.
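
A sketch of what that could look like, assuming the same replacement logic as the modified function in the question. The function name RemoveInvalidXMLCharactersN is hypothetical, and the pattern is built in a loop rather than by a long concatenation, which is a stylistic choice and not part of the original.

```sql
-- Hypothetical name; same logic as the modified function, but NVARCHAR(MAX) throughout.
CREATE FUNCTION dbo.RemoveInvalidXMLCharactersN (@InputString NVARCHAR(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
    IF @InputString IS NOT NULL
    BEGIN
        DECLARE @Counter INT, @Code INT = 0, @TestString NVARCHAR(40) = N'%[';

        -- Build the pattern from code points 0-31, skipping tab (9),
        -- line feed (10) and carriage return (13), which XML 1.0 allows.
        WHILE @Code < 32
        BEGIN
            IF @Code NOT IN (9, 10, 13)
                SET @TestString += NCHAR(@Code);
            SET @Code += 1;
        END
        SET @TestString += N']%';

        -- Replace each match with a space until none remain.
        SET @Counter = PATINDEX(@TestString, @InputString COLLATE Latin1_General_BIN);
        WHILE @Counter <> 0
        BEGIN
            SET @InputString = STUFF(@InputString, @Counter, 1, N' ');
            SET @Counter = PATINDEX(@TestString, @InputString COLLATE Latin1_General_BIN);
        END
    END
    RETURN @InputString;
END
```

The loop produces the same set of code points as the explicit NCHAR list in the question (0-8, 11, 12 and 14-31).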
