I'm trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me).
$profile = "<div><p>various japanese characters</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
echo $dom->saveHTML($div);
}
The result of this code is that I get a bunch of characters that are not Japanese. However, if I do:
echo $profile;
it displays correctly. I've tried saveHTML and saveXML, and neither displays correctly. I am using PHP 5.3.
What I see:
ã¤ãªãã¤å·ã·ã«ã´ã«ã¦ãã¢ã¤ã«ã©ã³ãç³»ã®å®¶åºã«ã9人åå¼ã®5çªç®ã¨ãã¦çã¾ãããå½¼ãå«ãã¦4人ã俳åªã«ãªã£ããç¶è¦ªã¯æ¨æã®ã»ã¼ã«ã¹ãã³ã§ãæ¯è¦ªã¯éµä¾¿å±ã®å®¢å®¤ä¿ã ã£ããé«æ ¡æ代ã¯ãã£ãã£ã®ã¢ã«ãã¤ãã«å¤ãã¿ãæè²è³éãåããªããã«ããªãã¯ç³»ã®é«æ ¡ã¸é²å¦ã
What should be shown:
イリノイ州シカゴにて、アイルランド系の家庭に、9人兄弟の5番目として生まれる。彼を含めて4人が俳優になった。父親は木材のセールスマンで、母親は郵便局の客室係だった。高校時代はキャディのアルバイトに勤しみ、教育資金を受けながらカトリック系の高校へ進学
EDIT: I've simplified the code down to five lines so you can test it yourself.
$profile = "<div lang=ja><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
echo $dom->saveHTML();
echo $profile;
Here is the html that is returned:
<div lang="ja"><p>イリノイ州シカゴã«ã¦ã€ã‚¢ã‚¤ãƒ«ãƒ©ãƒ³ãƒ‰ç³»ã®å®¶åºã«ã€</p></div>
<div lang="ja"><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>
10 answers
#1
320
DOMDocument::loadHTML will treat your string as being in ISO-8859-1 unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.
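The mechanism can be reproduced without DOMDocument at all. The sketch below assumes the mbstring extension is loaded; it reinterprets the UTF-8 bytes of a single character as ISO-8859-1 and then converts them back. The variable names are illustrative only.

```php
<?php
// "に" is stored as three bytes in UTF-8: E3 81 AB.
$utf8 = "\xE3\x81\xAB";

// Pretend those three bytes are ISO-8859-1 text and re-encode them as UTF-8.
// Each byte becomes a character of its own, which is the kind of garbage
// shown in the question.
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo strlen($utf8), "\n";                 // bytes in the original
echo mb_strlen($utf8, 'UTF-8'), "\n";     // characters in the original
echo mb_strlen($garbled, 'UTF-8'), "\n";  // characters after misinterpretation

// The damage is reversible as long as no bytes were dropped along the way.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $utf8);
```

One character turning into three is why the garbled output in the question is so much longer than the original text.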
If your string doesn't contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();
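A minimal sketch of the same idea wrapped in a function. The name loadUtf8Html is hypothetical, and two steps are additions that are not part of the answer above: collecting libxml warnings instead of printing them, and removing the processing-instruction node afterwards so the helper declaration does not appear in later output. Behaviour may vary with the libxml2 version PHP is built against.

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML fragment with DOMDocument.
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument();

    // Collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Drop the helper declaration again if it was kept as a
    // processing-instruction node at the top of the document.
    foreach (iterator_to_array($dom->childNodes) as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $dom->removeChild($node);
        }
    }

    return $dom;
}

$profile = '<div lang="ja"><p>イリノイ州シカゴにて、</p></div>';
$dom = loadUtf8Html($profile);

// The DOM extension returns textContent as UTF-8.
echo $dom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

Checking textContent rather than the saveHTML() string keeps the check independent of whether the serializer chooses raw UTF-8 or numeric entities.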
If you cannot know whether the string already contains such a declaration, there's a workaround in SmartDOMDocument which should help you:
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();
This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like these katakana), it's the safest alternative.
#2
38
The problem is with saveHTML() and saveXML(): neither of them works correctly on Unix. They do not save UTF-8 characters correctly when used on Unix, but they work on Windows.
The workaround is very simple:
If you try the default, you will get the error you described:
$str = $dom->saveHTML(); // saves incorrectly
All you have to do is save as follows:
$str = $dom->saveHTML($dom->documentElement); // saves correctly
This line of code will get your UTF-8 characters saved correctly (use the same workaround if you are using saveXML()).
Note
- English characters do not cause any problem when you use saveHTML() without parameters (because English characters are saved as single-byte characters in UTF-8).
- The problem happens when you have multi-byte characters (such as Chinese, Russian, Arabic, Hebrew, etc.).
I recommend reading this article: http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/. You will understand how UTF-8 works and why you have this problem. It will take you about 30 minutes, but it is time well spent.
#3
14
Make sure the real source file is saved as UTF-8 (you may even want to try saving it with the non-recommended UTF-8 BOM to make sure).
Also, in the case of HTML, make sure you have declared the correct encoding using meta tags:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
If it's a CMS (as you've tagged your question with Joomla) you may need to configure appropriate settings for the encoding.
#4
9
You could prefix a line enforcing utf-8 encoding, like this:
@$doc->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $profile);
And you can then continue with the code you already have, like:
$doc->saveXML()
#5
5
You must feed the DOMDocument a version of your HTML with a header that makes sense. Just like HTML5.
$profile ='<?xml version="1.0" encoding="'.$_encoding.'"?>'. $html;
It may be a good idea to keep your HTML as valid as you can, so you don't run into issues when you start querying around :-) and stay away from htmlentities!!!! That's an unnecessary back and forth that wastes resources. Keep your code sane!!!!
#6
3
This took me a while to figure out but here's my answer.
Before using DomDocument I would use file_get_contents to retrieve URLs and then process them with string functions. Perhaps not the best way, but quick. After being convinced that DOM was just as quick, I first tried the following:
$dom = new DomDocument('1.0', 'UTF-8');
if ($dom->loadHTMLFile($url) === false) { // read the URL
    // error message
} else {
    // process
}
This failed spectacularly in preserving UTF-8 encoding despite the proper meta tags, php settings and all the rest of the remedies offered here and elsewhere. Here's what works:
$dom = new DomDocument('1.0', 'UTF-8');
$str = file_get_contents($url);
if ($dom->loadHTML(mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8')) == false) {
}
etc. Now everything's right with the world. Hope this helps.
#7
2
Works fine for me:
$dom = new \DOMDocument;
$dom->loadHTML(utf8_decode($html));
...
return utf8_encode( $dom->saveHTML());
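One caveat worth checking before relying on this: utf8_decode() converts UTF-8 to ISO-8859-1, so it can only carry characters that exist in ISO-8859-1. The sketch below uses mb_convert_encoding() as a stand-in for utf8_decode() and utf8_encode(), an assumption made here because those two functions are deprecated as of PHP 8.2. The function names toLatin1 and fromLatin1 are hypothetical.

```php
<?php
// Stand-ins for utf8_decode() / utf8_encode(): the same
// UTF-8 <-> ISO-8859-1 conversion, done with mbstring.
function toLatin1(string $utf8): string
{
    return mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
}

function fromLatin1(string $latin1): string
{
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}

$accented = "\xC3\xA9";     // "é" in UTF-8, present in ISO-8859-1
$kana     = "\xE3\x81\xAB"; // "に" in UTF-8, absent from ISO-8859-1

var_dump(fromLatin1(toLatin1($accented)) === $accented); // survives the round trip
var_dump(fromLatin1(toLatin1($kana)) === $kana);         // does not survive
```

So this round trip suits Western European text, while the Japanese text from the question would be lost at the decode step.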
#8
0
The problem is that when you pass a parameter to the DOMDocument::saveHTML() function, you lose the encoding. In a few cases, you'll need to avoid using the parameter and use old string functions to find what you are looking for.
I think the previous answer works for you, but since that workaround didn't work for me, I'm adding this answer to help people who may be in my situation.
#9
0
Use this for the correct result:
$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $profile);
echo $dom->saveHTML();
echo $profile;
This operation:
mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8');
That is a bad approach, because special characters such as &lt; and &gt; can be in $profile, and they will not be converted a second time by mb_convert_encoding. It is an opening for XSS and incorrect HTML.