I have an XHTML document being passed to a PHP app via Greasemonkey AJAX. The PHP app uses UTF-8. If I output the POST content straight back to a textarea in the AJAX receiving div, everything is still properly encoded in UTF-8.
When I try to parse it using XPath
$dom = new DOMDocument();
$dom->loadHTML($raw2);
$xpath = new DOMXPath($dom);
$query = '//td/text()';
$nodes = $xpath->query($query);
foreach($nodes as $node) {
var_dump($node->wholeText);
}
the dumped strings are not UTF-8. How do I force DOM/XPath to use UTF-8?
5 solutions
#1
3
If it is a fully fledged, valid XHTML document, you shouldn't use loadHTML() but load()/loadXML().
Given the example XHTML document
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>xhtml test</title>
</head>
<body>
<h1>A Table</h1>
<table>
<tr><th>A</th><th>O</th><th>U</th></tr>
<tr><td>Ä</td><td>Ö</td><td>Ü</td></tr>
<tr><td>ä</td><td>ö</td><td>ü</td></tr>
</table>
</body>
</html>
the script
<?php
// Path to the example XHTML file shown above.
$raw2 = 'test.html';
$dom = new DOMDocument();
$dom->load($raw2);
$xpath = new DOMXPath($dom);
// XHTML elements are namespaced, so register a prefix for the query.
var_dump($xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml'));
$query = '//h:td/text()';
$nodes = $xpath->query($query);
foreach ($nodes as $node) {
    foo($node->wholeText);
}
// Print each byte of the string as two hex digits.
function foo($s) {
    for ($i = 0; $i < strlen($s); $i++) {
        printf('%02X ', ord($s[$i]));
    }
    echo "\n";
}
prints
bool(true)
C3 84
C3 96
C3 9C
C3 A4
C3 B6
C3 BC
i.e. the output strings are UTF-8 encoded.
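The question holds the markup in a string ($raw2) rather than in a file, so the same approach would presumably go through loadXML() instead of load(). A minimal sketch, assuming the POSTed string is well-formed XHTML; the sample markup and the $cells array are hypothetical stand-ins, not part of the original code:

```php
<?php
// Hypothetical stand-in for the POSTed XHTML string ($raw2 in the question).
$raw2 = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<body><table><tr><td>Ä</td><td>ö</td></tr></table></body>
</html>
XML;

$dom = new DOMDocument();
// loadXML() parses a string; load() expects a file path or URL.
if (!$dom->loadXML($raw2)) {
    exit("Input is not well-formed XML\n");
}

$xpath = new DOMXPath($dom);
// XHTML elements live in a namespace, so the query needs a registered prefix.
$xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml');

// Collect the text of every table cell.
$cells = [];
foreach ($xpath->query('//h:td/text()') as $node) {
    $cells[] = $node->nodeValue;
}

// Show the raw bytes of each value as hex.
foreach ($cells as $cell) {
    echo bin2hex($cell), "\n";
}
```

If the input turns out not to be well-formed XML, loadXML() fails, and one of the loadHTML() workarounds in the other answers may be the better fit.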
#2
28
I had the same problem and I couldn't use tidy on my web server. I found this solution and it worked fine:
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
$dom = new DOMDocument();
$dom->loadHTML($html);
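Note that the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2. The same idea (turn every non-ASCII character into a numeric entity before loadHTML() sees it) can probably be expressed with mb_encode_numericentity(). A minimal sketch, not the answerer's code; the sample fragment and the variable names $map, $ascii and $values are assumptions for illustration:

```php
<?php
// Hypothetical HTML fragment standing in for the POSTed markup.
$html = '<table><tr><td>Ä</td><td>ü</td></tr></table>';

// Turn every non-ASCII character into a numeric entity such as &#196;.
// Map format: [first code point, last code point, offset, mask].
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$ascii = mb_encode_numericentity($html, $map, 'UTF-8');

$dom = new DOMDocument();
// The input is now pure ASCII, so the parser's default encoding no longer matters.
$dom->loadHTML($ascii);

$xpath = new DOMXPath($dom);
$values = [];
foreach ($xpath->query('//td/text()') as $node) {
    $values[] = $node->nodeValue;
}

// Show the raw bytes of each value as hex.
foreach ($values as $value) {
    echo bin2hex($value), "\n";
}
```

The DOM stores text as UTF-8 internally, so the entities come back out as UTF-8 bytes in nodeValue.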
#3
1
I have not tried it, but the second parameter of DOMDocument::__construct seems to be related to the encoding; maybe that'll help you :-)
Otherwise, there is an encoding property in DOMDocument, which is writable.
Since the DOMXPath is constructed with the DOMDocument as its parameter, maybe it'll work...
#4
1
A bit late to the game, but perhaps it helps someone...
The problem might be in the output, and not in the DOM/XPath object itself.
If you output the nodeValue directly, you get corrupted characters, e.g.:
ìÂÂì ë¹Â디ì¤
ìì ë¹ë””ì¤ í°ì íì¤
You have to create your DOM object with the second param "utf-8", new \DomDocument('1.0', 'utf-8'), but even then, when you print a value from the DOM node list/element, you get broken characters:
echo $contentItem->item($index)->nodeValue;
You have to wrap it in utf8_decode:
echo utf8_decode($contentItem->item($index)->nodeValue); //output: 者不終朝而會,愚者可浹旬而學
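For context, utf8_decode() converts UTF-8 to ISO-8859-1 and is deprecated as of PHP 8.2, so it mainly helps when the page is served as Latin-1 or the data was encoded twice. If the node values are already valid UTF-8, declaring the response charset may be enough. A minimal sketch under that assumption; $value is a hypothetical node value, not taken from the answer:

```php
<?php
// Declare the charset of the response before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Hypothetical node value as DOM would return it: the UTF-8 bytes for "é".
$value = "\xC3\xA9";

// Check that the bytes really are valid UTF-8 before blaming the parser.
$isUtf8 = mb_check_encoding($value, 'UTF-8');

// Equivalent of utf8_decode() without the deprecated function:
// yields the single Latin-1 byte 0xE9.
$latin1 = mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');

// With a UTF-8 response, the value can be echoed unchanged (escaped for HTML).
echo htmlspecialchars($value, ENT_QUOTES, 'UTF-8'), "\n";
```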
#5
0
I struggled with a similar problem (unable to force XPath to use UTF-8 in combination with loadHTML); in the end this excellent article provided the solution: http://devzone.zend.com/article/8855
Workaround:
Insert an additional <head> section with the appropriate Content-type HTTP-EQUIV meta tag immediately following the opening <html> tag.
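The workaround could look roughly like this. It is a sketch, not the article's code: the sample document, the regular expression, and the names $meta, $patched and $values are assumptions for illustration.

```php
<?php
// Hypothetical HTML document standing in for the POSTed markup.
$html = '<html><body><table><tr><td>Ö</td></tr></table></body></html>';

// Extra head section that tells the HTML parser the input is UTF-8.
$meta = '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>';

// Insert it right after the first opening <html ...> tag.
$patched = preg_replace('/<html\b[^>]*>/i', '$0' . $meta, $html, 1);

$dom = new DOMDocument();
$dom->loadHTML($patched);

$xpath = new DOMXPath($dom);
$values = [];
foreach ($xpath->query('//td/text()') as $node) {
    $values[] = $node->nodeValue;
}

// Show the raw bytes of each value as hex.
foreach ($values as $value) {
    echo bin2hex($value), "\n";
}
```

If the input has no html tag at all, the regular expression matches nothing and the markup is passed through unchanged, so a fragment would need the meta tag prepended instead.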