PHP DOMDocument loadHTML not encoding UTF-8 correctly

Time: 2022-10-20 18:27:05

I'm trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me).

$profile = "<div><p>various japanese characters</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile); 

$divs = $dom->getElementsByTagName('div');

foreach ($divs as $div) {
    echo $dom->saveHTML($div);
}

The result of this code is that I get a bunch of characters that are not Japanese. However, if I do:

echo $profile;

it displays correctly. I've tried saveHTML and saveXML, and neither displays correctly. I am using PHP 5.3.

What I see:

ã¤ãªãã¤å·ã·ã«ã´ã«ã¦ãã¢ã¤ã«ã©ã³ãç³»ã®å®¶åº­ã«ã9人åå¼ã®5çªç®ã¨ãã¦çã¾ãããå½¼ãå«ãã¦4人ã俳åªã«ãªã£ããç¶è¦ªã¯æ¨æã®ã»ã¼ã«ã¹ãã³ã§ãæ¯è¦ªã¯éµä¾¿å±ã®å®¢å®¤ä¿ã ã£ããé«æ ¡æ代ã¯ã­ã£ãã£ã®ã¢ã«ãã¤ãã«å¤ãã¿ãæè²è³éãåããªããã«ããªãã¯ç³»ã®é«æ ¡ã¸é²å­¦ã

What should be shown:

イリノイ州シカゴにて、アイルランド系の家庭に、9人兄弟の5番目として生まれる。彼を含めて4人が俳優になった。父親は木材のセールスマンで、母親は郵便局の客室係だった。高校時代はキャディのアルバイトに勤しみ、教育資金を受けながらカトリック系の高校へ進学

EDIT: I've simplified the code down to five lines so you can test it yourself.

$profile = "<div lang=ja><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
echo $dom->saveHTML();
echo $profile;

Here is the html that is returned:

<div lang="ja"><p>イリノイ州シカゴã«ã¦ã€ã‚¢ã‚¤ãƒ«ãƒ©ãƒ³ãƒ‰ç³»ã®å®¶åº­ã«ã€</p></div>
<div lang="ja"><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>

10 Solutions

#1


320  

DOMDocument::loadHTML will treat your string as being in ISO-8859-1 unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.
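
To see why the output turns into those particular characters, here is a minimal sketch (not part of the original answer) that reinterprets the UTF-8 bytes of a single character as ISO-8859-1. It assumes the script file itself is saved as UTF-8 and that the mbstring extension is available; the variable names are only illustrative.

```php
<?php
// One Japanese character, stored as three bytes in UTF-8.
$original = 'に';
echo bin2hex($original), "\n";            // e381ab

// Treat each of those bytes as a separate ISO-8859-1 character and
// re-encode the result as UTF-8. This mimics the misinterpretation.
$misread = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo mb_strlen($misread, 'UTF-8'), "\n";  // 3
echo bin2hex($misread), "\n";             // c3a3c281c2ab
```

One character has become three: "ã", an invisible control character (U+0081) and "«", which is the same pattern as the "ã«" runs in the garbled output from the question.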

If your string doesn't contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();

If you cannot know if the string will contain such a declaration already, there's a workaround in SmartDOMDocument which should help you:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();

This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like these katakana), it's the safest alternative.
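
Note that as of PHP 8.2 the 'HTML-ENTITIES' target of mb_convert_encoding is deprecated. One possible substitute, which is a different technique from the SmartDOMDocument workaround above, is mb_encode_numericentity. The sketch below assumes a UTF-8 source file and the mbstring and dom extensions; the conversion map and variable names are my own choices, not something from the original answer.

```php
<?php
$profile = '<p>イリノイ州</p>';

// Conversion map: [first code point, last code point, offset, mask].
// Everything from U+0080 upward becomes a numeric character reference,
// so ASCII markup such as <p> is left untouched.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$encoded = mb_encode_numericentity($profile, $map, 'UTF-8');
echo $encoded, "\n";

// The string is now pure ASCII, so the ISO-8859-1 default cannot damage it.
$dom = new DOMDocument();
$dom->loadHTML($encoded);
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Like the mb_convert_encoding variant, this leaves existing markup and entities in the input alone and only rewrites non-ASCII characters.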

#2


38  

The problem is with saveHTML() and saveXML(): neither of them works correctly on Unix. They do not save UTF-8 characters correctly there, although they work on Windows.

The workaround is very simple:

If you try the default, you will get the error you described:

$str = $dom->saveHTML(); // saves incorrectly

All you have to do is save as follows:

$str = $dom->saveHTML($dom->documentElement); // saves correctly

This line of code will get your UTF-8 characters to be saved correctly (use the same workaround if you are using saveXML()).


Note

  1. English characters do not cause any problem when you use saveHTML() without parameters (because English characters are saved as single byte characters in UTF-8)

  2. The problem happens when you have multi-byte characters (such as Chinese, Russian, Arabic, Hebrew, ...etc.)

I recommend reading this article: http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/. You will understand how UTF-8 works and why you have this problem. It will take you about 30 minutes, but it is time well spent.

#3


14  

Make sure the real source file is saved as UTF-8 (you may even want to try the non-recommended BOM chars with UTF-8 to make sure).

Also in case of HTML, make sure you have declared the correct encoding using meta tags:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If it's a CMS (as you've tagged your question with Joomla) you may need to configure appropriate settings for the encoding.

#4


9  

You could prefix a line enforcing utf-8 encoding, like this:

@$doc->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $profile);

And you can then continue with the code you already have, like:

$doc->saveXML()

#5


5  

You must feed the DOMDocument a version of your HTML with a header that makes sense. Just like HTML5.

$profile ='<?xml version="1.0" encoding="'.$_encoding.'"?>'. $html;

It may be a good idea to keep your HTML as valid as you can, so you don't get into issues when you start to query... around :-) and stay away from htmlentities!!!! That's an unnecessary back and forth that wastes resources. Keep your code insane!!!!

#6


3  

This took me a while to figure out but here's my answer.

Before using DomDocument I would use file_get_contents to retrieve URLs and then process them with string functions. Perhaps not the best way, but quick. After being convinced Dom was just as quick, I first tried the following:

$dom = new DomDocument('1.0', 'UTF-8');
if ($dom->loadHTMLFile($url) == false) { // read the url
    // error message
}
else {
    // process
}

This failed spectacularly in preserving UTF-8 encoding despite the proper meta tags, php settings and all the rest of the remedies offered here and elsewhere. Here's what works:

$dom = new DomDocument('1.0', 'UTF-8');
$str = file_get_contents($url);
if ($dom->loadHTML(mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8')) == false) {
}

etc. Now everything's right with the world. Hope this helps.

#7


2  

Works fine for me:

$dom = new \DOMDocument;
$dom->loadHTML(utf8_decode($html));
...
return utf8_encode($dom->saveHTML());

#8


0  

The problem is that when you add a parameter to the DOMDocument::saveHTML() function, you lose the encoding. In a few cases, you'll need to avoid the parameter and use old string functions to find what you are looking for.

I think the previous answer works for you, but since that workaround didn't work for me, I'm adding this answer to help people who may be in my situation.

#9


0  

Use this for a correct result:

$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $profile);
echo $dom->saveHTML();
echo $profile;

This operation

mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8');

is a bad approach, because special symbols such as &lt; and &gt; can be in $profile, and they will not be converted twice after mb_convert_encoding. This is a hole for XSS and incorrect HTML.
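
For what it's worth, the meta-tag approach at the top of this answer can be wrapped in a small helper. This is only a sketch under my own assumptions: loadUtf8Html is a hypothetical name, the input is assumed to be a UTF-8 fragment, and parser warnings are buffered with libxml_use_internal_errors so that messy markup does not print notices.

```php
<?php
// Hypothetical helper, not from the thread: loads a UTF-8 HTML fragment
// by prepending a charset meta tag, as in the snippet above.
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument();

    // Buffer libxml warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );
    libxml_clear_errors();                  // discard the buffered warnings
    libxml_use_internal_errors($previous);  // restore the earlier setting

    return $dom;
}

$profile = '<div lang="ja"><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>';
$dom = loadUtf8Html($profile);
$paragraph = $dom->getElementsByTagName('p')->item(0);
echo $paragraph->textContent, "\n";
```

Reading textContent (rather than echoing saveHTML()) is just a convenient way to check that the characters survived the load step.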

#10


-3  

Try using utf8_encode
