How do I tell DOMDocument->load() what encoding I want it to use?

Time: 2022-10-20 18:41:03

I fetch and process XML files from elsewhere, and need to transform them with some XSLTs. No problem: with PHP5 and the DOM library, everything's a snap. It worked fine up until now. Today there were funky characters in the XML file, "smart" quotes from Word by the look of it. DOMDocument->load() complained about them, saying that the input wasn't proper UTF-8 and that I should indicate the encoding.

Lo and behold, the encoding is not specified in these XML files. If I add in 'encoding="iso-8859-1"' to the header, it works fine. The rub is I have no control over these XML files.

Reading the file into a string, modifying its header and writing it back out to another location seems to be my only option, but I'd prefer to do it without having to use temporary copies of the XML files at all. Is there any way to simply tell the parser to parse them as if they were iso-8859-1?

3 Answers

#1


9  

Does this work for you?

$doc = new DOMDocument('1.0', 'iso-8859-1');
$doc->load($xmlPath);

Edit: Since it appears that this doesn't work, you could do something similar to your existing method but without the temp file. Read the XML file from your source using standard IO operations (file_get_contents() or similar), convert the string from its actual encoding to UTF-8 (iconv() or mb_convert_encoding(); note that utf8_decode() goes the other way, from UTF-8 to ISO-8859-1), and then use loadXML():

// Assumes the file is ISO-8859-1 and its XML declaration names no encoding.
$myXMLString = file_get_contents($xmlPath);
// Convert the Latin-1 bytes to UTF-8, which the parser expects when no encoding is declared.
$myXMLString = iconv('ISO-8859-1', 'UTF-8', $myXMLString);
$doc = new DOMDocument();
$doc->loadXML($myXMLString);
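
If some of the incoming files are already valid UTF-8 and others are not, the conversion step can be made conditional. The following is a minimal sketch, not part of the original answer: the function name xmlStringToDocument() and the Windows-1252 fallback are assumptions (Word's "smart" quotes are bytes 0x93/0x94 in Windows-1252, which ISO-8859-1 treats as control characters). It also assumes the XML declaration names no encoding, as in the question.

```php
<?php
// Hypothetical helper: the name and the fallback encoding are assumptions.
function xmlStringToDocument(string $xml, string $fallbackEncoding = 'Windows-1252'): DOMDocument
{
    // preg_match() with the /u flag returns 1 only if the subject is valid UTF-8.
    if (preg_match('//u', $xml) !== 1) {
        $converted = iconv($fallbackEncoding, 'UTF-8', $xml);
        if ($converted === false) {
            throw new RuntimeException("Input is not valid $fallbackEncoding");
        }
        $xml = $converted;
    }

    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        throw new RuntimeException('XML could not be parsed');
    }
    return $doc;
}

// Usage: 0x93 and 0x94 are the Windows-1252 curly double quotes.
$doc = xmlStringToDocument('<foo>' . chr(0x93) . 'hi' . chr(0x94) . '</foo>');
echo bin2hex($doc->documentElement->textContent), "\n";
```

Checking for valid UTF-8 first means files that are already correct pass through untouched; only the others are treated as the fallback encoding.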

#2


5  

I haven't found a way to set the default encoding (yet), but recover mode may be feasible in this case.
When libxml encounters an encoding error and no encoding has been explicitly set, it switches from Unicode/UTF-8 to Latin-1 and continues parsing the document. In the parser context, however, the property wellFormed is set to 0/false. PHP's DOM extension accepts the document if wellFormed is true or if the DOMDocument object's recover property is true.

<?php
// german Umlaut ä in latin1 = 0xE4
$xml = '<foo>'.chr(0xE4).'</foo>';

$doc = new DOMDocument;
$b = $doc->loadXML($xml);
echo 'with doc->recover=false(default) : ', ($b) ? 'success':'failed', "\n";

$doc = new DOMDocument;
$doc->recover = true;
$b = $doc->loadXML($xml);
echo 'with doc->recover=true : ', ($b) ? 'success':'failed', "\n";

prints

Warning: DOMDocument::loadXML(): Input is not proper UTF-8, indicate encoding !
Bytes: 0xE4 0x3C 0x2F 0x66 in Entity, line: 1 in test.php on line 6
with doc->recover=false(default) : failed

Warning: DOMDocument::loadXML(): Input is not proper UTF-8, indicate encoding !
Bytes: 0xE4 0x3C 0x2F 0x66 in Entity, line: 1 in  test.php on line 11
with doc->recover=true : success

You still get the warning message (which can be suppressed with @$doc->load()) and it will also show up in the internal libxml errors (only once when the parser switches from utf8 to latin1). The error code for this particular error will be 9 (XML_ERR_INVALID_CHAR).

<?php
$xml = sprintf('<foo>
    <ae>%s</ae>
    <oe>%s</oe>
    &
</foo>', chr(0xE4),chr(0xF6));

libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->recover = true;
libxml_clear_errors();
$b = $doc->loadXML($xml);
$invalidCharFound = false;
foreach(libxml_get_errors() as $error) {
    if ( 9==$error->code && !$invalidCharFound ) {
        $invalidCharFound = true;
        echo "found invalid char, possibly harmless\n";
    }
    else {
        echo "hm, that's probably more severe: ", $error->message, "\n";
    }
}

#3


2  

The only way to specify the encoding is in the XML declaration at the start of the file:

<?xml version="1.0" encoding="ISO-8859-1"?>
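
Because the declaration is what the parser consults, it can also be added to the string in memory rather than to the file on disk. The following is a minimal sketch under stated assumptions: the helper name loadXmlAssumingEncoding() is hypothetical, ISO-8859-1 is only a default guess, and the pattern handles only a declaration at the very start of the string.

```php
<?php
// Hypothetical helper: adds an encoding to the XML declaration in memory.
function loadXmlAssumingEncoding(string $xml, string $encoding = 'ISO-8859-1'): DOMDocument
{
    if (preg_match('/^<\?xml\s[^>]*\?>/', $xml, $m)) {
        // A declaration exists; touch it only if it names no encoding.
        if (stripos($m[0], 'encoding') === false) {
            // The encoding pseudo-attribute must directly follow the version.
            $decl = preg_replace(
                '/(version\s*=\s*(["\'])[^"\']*\2)/',
                '$1 encoding="' . $encoding . '"',
                $m[0],
                1
            );
            $xml = $decl . substr($xml, strlen($m[0]));
        }
    } else {
        // No declaration at all: prepend one.
        $xml = '<?xml version="1.0" encoding="' . $encoding . '"?>' . $xml;
    }

    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        throw new RuntimeException('XML could not be parsed');
    }
    return $doc;
}

// Usage: 0xE4 is "ä" in ISO-8859-1; DOM returns text as UTF-8.
$doc = loadXmlAssumingEncoding('<?xml version="1.0"?><foo>' . chr(0xE4) . '</foo>');
echo bin2hex($doc->documentElement->textContent), "\n";
```

A declaration that already names an encoding is left alone, so files that are correctly labelled keep their own setting.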
