XML encoding in an XMLTYPE column in Oracle DB

I have a table created like this:

create table b (data timestamp, value XMLTYPE);

I run this script in TOAD 12.6 to store an XML document in the table.

DECLARE
    lc_Soap         CLOB;
    lc_Request      CLOB;
    px_RequestXML   XMLTYPE
        := XMLTYPE ('<test><test1>ABDDÇJJSõ</test1></test>');
BEGIN
    DELETE b;

    lc_Soap :=
        '<?xml version="1.0" encoding="ISO-8859-1"?>
               <s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
                  <s:Header>
                      <h:AxisValues xmlns="urn:/microsoft/multichannelframework/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:h="urn:/microsoft/multichannelframework/">
                          <User xmlns="">TEST</User>
                      </h:AxisValues>
                  </s:Header>
                  <s:Body xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
                      <substr/>
                  </s:Body>
              </s:Envelope>';

    lc_Request :=
        pkg_utils.replace_clob (lc_Soap,
                                '<substr/>',
                                xml_utils.XMLTypeToClob (px_RequestXML));

    px_RequestXML := XMLTYPE.createXML (lc_Request);

    INSERT INTO b
         VALUES (SYSTIMESTAMP, px_RequestXML);

    COMMIT;
END;

When I look at what is stored in the VALUE column, the document is reported as UTF-8 encoded and the accented characters are garbled:

<?xml version="1.0" encoding="UTF-8"?>
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
  <s:Header>
    <h:AxisValues xmlns="urn:/microsoft/multichannelframework/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:h="urn:/microsoft/multichannelframework/">
      <User xmlns="">TEST</User>
    </h:AxisValues>
  </s:Header>
  <s:Body xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <test>
      <test1>ABDDÃ‡JJSÃµ</test1>
    </test>
  </s:Body>
</s:Envelope>

But this script was built to run as a different DB user or in an Oracle job. In those cases, the encoding is different:

<?xml version="1.0" encoding="WINDOWS-1252"?>
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
  <s:Header>
    <h:AxisValues xmlns="urn:/microsoft/multichannelframework/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:h="urn:/microsoft/multichannelframework/">
      <User xmlns="">TEST</User>
    </h:AxisValues>
  </s:Header>
  <s:Body xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <test>
      <test1>ABDDÇJJSõ</test1>
    </test>
  </s:Body>
</s:Envelope>

The NLS_CHARACTERSET parameter of the DB is WE8MSWIN1252. Why does this happen? And how can I always store the XML as UTF-8?

2 Answers

#1

Oracle uses the client character set when it creates an XMLTYPE from a CLOB or a string, and it completely ignores the encoding declared in the XML prolog (see docs). You could even set encoding="blabla" and it would still work. Oracle honors the encoding in the XML prolog only when you create an XMLTYPE from a BLOB.
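
The difference between the two cases can be illustrated outside Oracle. The following Python sketch is only an analogy, not Oracle's implementation: it assumes Python's standard `xml.etree.ElementTree` parser as a stand-in. It shows that a byte payload (the BLOB case) can only be decoded correctly by honoring the encoding declared in its prolog, because the same characters produce different bytes in different encodings.

```python
import xml.etree.ElementTree as ET

text = "ABDDÇJJSõ"
body = "<test><test1>" + text + "</test1></test>"

# The same characters become different byte sequences per encoding,
# so each byte payload carries a prolog that names its encoding.
latin1_doc = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + body).encode("iso-8859-1")
utf8_doc = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")

# When given bytes, the parser reads the prolog to choose the decoder.
from_latin1 = ET.fromstring(latin1_doc).findtext("test1")
from_utf8 = ET.fromstring(utf8_doc).findtext("test1")

print(from_latin1 == text, from_utf8 == text)
```

A character string, by contrast, has already been decoded by the time a parser sees it, which is comparable to the CLOB case where the declared encoding no longer matters.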

The client environment also drives the encoding when reading an XMLTYPE. If you want an XML document to be encoded in UTF-8 regardless of the client encoding, you have to retrieve it as a BLOB.

Either via getBlobVal()

SELECT t.value.getBlobVal(nls_charset_id('AL32UTF8')) FROM b t;

or via xmlserialize()

SELECT xmlserialize(DOCUMENT value AS BLOB ENCODING 'UTF-8') FROM b;
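
The garbled `ABDDÃ‡JJSÃµ` in the question is consistent with UTF-8 bytes being displayed by a tool that assumes Windows-1252. That is an assumption about what happened on the client side, not something the question confirms. The Python sketch below reproduces the effect and shows that the bytes themselves are intact, which is why fetching the raw bytes as a BLOB avoids the problem.

```python
original = "ABDDÇJJSõ"

# Serialize as UTF-8, which is what the prolog in the first output claims.
utf8_bytes = original.encode("utf-8")

# A viewer that wrongly assumes Windows-1252 decodes each byte separately.
misread = utf8_bytes.decode("cp1252")
print(misread)

# Reversing the wrong decoding recovers the text, so no data was lost.
recovered = misread.encode("cp1252").decode("utf-8")
print(recovered)
```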

#2

When you include non-ASCII characters in content sent from a client to the DB (e.g. ABDDÇJJSõ), a conversion from the client character set to the DB character set may be necessary. That can get complicated if the client is wrong about the character set being used, or if the database can't handle the characters. If the content comes from a file, there is also the risk of some other application (e.g. version control) misinterpreting the character set when processing the file.

It is often safer to use encoded versions of any potential problem characters. You can use ASCIISTR to get an unambiguous converted version of the string, and UNISTR to convert it back.

select asciistr('Çõ'), unistr('\00C7\00F5') from dual;
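
To make the escaping idea concrete, here is a rough Python emulation of what the query above does. The helper names `ascii_escape` and `ascii_unescape` are hypothetical, and the sketch only assumes the `\XXXX` form with four hex digits per UTF-16 code unit; it is not a reimplementation of Oracle's functions.

```python
import re

def ascii_escape(text):
    """Replace non-ASCII characters (and backslashes) with \\XXXX code units."""
    out = []
    for ch in text:
        if ord(ch) < 128 and ch != "\\":
            out.append(ch)
            continue
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("\\%04X" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)

def ascii_unescape(escaped):
    """Turn \\XXXX sequences back into characters."""
    with_units = re.sub(r"\\([0-9A-Fa-f]{4})",
                        lambda m: chr(int(m.group(1), 16)), escaped)
    # Re-join surrogate pairs that were escaped as two separate code units.
    return with_units.encode("utf-16-be", "surrogatepass").decode("utf-16-be")

escaped = ascii_escape("ABDDÇJJSõ")
print(escaped)
print(ascii_unescape(escaped))
```

Because the escaped form contains only ASCII, it survives any client character set or file encoding unchanged.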

You can even check that the characters are converted as you expect:

http://www.fileformat.info/info/unicode/char/00c7/index.htm http://www.fileformat.info/info/unicode/char/00f5/index.htm

If there are no non-ASCII characters in the script, you eliminate a lot of potential problems. There may still be issues, but they will be easier to diagnose.
