Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity

Time: 2022-03-25 07:51:29

$html = file_get_contents("http://www.somesite.com/");

$dom = new DOMDocument();
$dom->loadHTML($html);

echo $dom;

throws

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity,
Catchable fatal error: Object of class DOMDocument could not be converted to string in test.php on line 10

11 answers

#1


117  

To suppress the warning, you can use libxml_use_internal_errors(true):

// create new DOMDocument
$document = new \DOMDocument('1.0', 'UTF-8');

// set error level
$internalErrors = libxml_use_internal_errors(true);

// load HTML
$document->loadHTML($html);

// Restore error level
libxml_use_internal_errors($internalErrors);
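
The save-and-restore pattern above can be wrapped in a small helper that also hands back whatever libxml collected. This is only a sketch: the function name loadHtmlQuietly and the sample markup are made up for illustration, and the exact messages (if any) depend on the libxml2 version PHP was built against.

```php
<?php
// Hypothetical helper (not part of the original answer): load HTML without
// emitting warnings and return the messages libxml collected instead.
function loadHtmlQuietly(DOMDocument $document, string $html): array
{
    // Collect libxml errors internally and remember the previous setting.
    $previous = libxml_use_internal_errors(true);

    $document->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }

    // Empty the error buffer and put the previous setting back.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

$document = new DOMDocument();
$messages = loadHtmlQuietly(
    $document,
    '<p><a href="/script.php?foo=bar&hello=world">link</a></p>'
);
$links = $document->getElementsByTagName('a');
```

Restoring the previous setting keeps the helper from changing error handling for unrelated code that runs later in the same request.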

#2


84  

I would bet that if you looked at the source of http://www.somesite.com/ you would find special characters that haven't been converted to HTML entities. Maybe something like this:

<a href="/script.php?foo=bar&hello=world">link</a>

Should be

<a href="/script.php?foo=bar&amp;hello=world">link</a>

#3


51  

$dom->@loadHTML($html);

This is incorrect; use this instead:

@$dom->loadHTML($html);

#4


12  

The reason for your fatal error is that DOMDocument does not have a __toString() method and thus cannot be echoed.

You're probably looking for

echo $dom->saveHTML();
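
As a minimal sketch of the difference, saveHTML() returns a string, either for the whole document or for a single node passed as an argument. The markup here is invented for the example.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><p>Hello</p></body></html>');

// Serialize the whole document to a string that echo can print.
$markup = $dom->saveHTML();

// Serialize just one node by passing it to saveHTML().
$paragraph = $dom->getElementsByTagName('p')->item(0);
$fragment = $dom->saveHTML($paragraph);

echo $fragment, "\n";
```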

#5


10  

There are 2 errors: the second is because $dom is not a string but an object and thus cannot be "echoed". The first is a warning from loadHTML, caused by invalid syntax in the HTML document being loaded (probably a & used as a parameter separator and not escaped as the entity &amp;).

You can ignore and suppress this error message (not the error, just the message!) by calling the function with the error control operator "@" (http://www.php.net/manual/en/language.operators.errorcontrol.php)

@$dom->loadHTML($html);

#6


8  

Regardless of the echo (which would need to be replaced with print_r or var_dump), if an exception is thrown the object should stay empty:

DOMNodeList Object
(
)

Solution

  1. Set recover to true, and strictErrorChecking to false

    $content = file_get_contents($url);
    
    $doc = new DOMDocument();
    $doc->recover = true;
    $doc->strictErrorChecking = false;
    $doc->loadHTML($content);
    
  2. Use PHP's entity encoding on the markup's contents; unencoded entities are a very common source of this error.

#7


7  

Replace the simple

$dom->loadHTML($html);

with the more robust:

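// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

if (!$dom->loadHTML($html)) {
    // Loading failed: gather the collected messages and report them.
    $errors = "";
    foreach (libxml_get_errors() as $error) {
        $errors .= $error->message . "<br/>";
    }
    libxml_clear_errors();
    print "libxml errors:<br>$errors";
    return;
}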

#8


3  

Another possible solution is

$sContent = htmlspecialchars($sHTML);
$oDom = new DOMDocument();
$oDom->loadHTML($sContent);
echo html_entity_decode($oDom->saveHTML());

#9


1  

I know this is an old question, but if you ever want to fix the malformed '&' signs in your HTML, you can use code similar to this:

$page = file_get_contents('http://www.example.com');
$page = preg_replace('/\s+/', ' ', trim($page));
fixAmps($page, 0);
$dom = new DOMDocument();
$dom->loadHTML($page);


function fixAmps(&$html, $offset) {
    $positionAmp = strpos($html, '&', $offset);
    if ($positionAmp === false) { // No more '&' signs: nothing left to fix.
        return;
    }

    $positionSemiColon = strpos($html, ';', $positionAmp + 1);

    if ($positionSemiColon === false) { // If no ';' can be found.
        $html = substr_replace($html, '&amp;', $positionAmp, 1); // Replace straight away.
        fixAmps($html, $positionAmp + 5); // Keep scanning for later '&' signs.
        return;
    }

    $string = substr($html, $positionAmp, $positionSemiColon - $positionAmp + 1);

    if (preg_match('/^&(#[0-9]+|[A-Za-z0-9]+);$/', $string) === 0) { // If a standard escape cannot be found.
        $html = substr_replace($html, '&amp;', $positionAmp, 1); // This means we need to escape the '&' sign.
        fixAmps($html, $positionAmp + 5); // Recursive call from the new position.
    } else {
        fixAmps($html, $positionAmp + 1); // Recursive call from the new position.
    }
}
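
A non-recursive alternative, offered only as a sketch and not as the approach this answer uses, is a single regular expression with a negative lookahead. The function name escapeBareAmps is made up for illustration, and the pattern treats anything shaped like &name; or &#123; as an existing entity without checking that the name is a real one.

```php
<?php
// Hypothetical alternative to the recursive function: escape every '&' that
// does not already start something shaped like an entity.
function escapeBareAmps(string $html): string
{
    // (?!...) is a negative lookahead: match '&' only when it is NOT followed
    // by an optional '#', one or more letters or digits, and a ';'.
    return preg_replace('/&(?!#?[A-Za-z0-9]+;)/', '&amp;', $html);
}

$fixed = escapeBareAmps('a=1&b=2 &amp; &#38; &nope');
echo $fixed, "\n";
```

Because it makes one pass over the string, this version avoids deep recursion on pages that contain many ampersands.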

#10


1  

It's not always because of the contents of the page and could be because of the URL itself.

I encountered this error recently, and it was due to a carriage return character at the end of the URL. The character was left there by a mistake in how the URLs were split. The fix was to use

$urls_array = explode("\r\n", $urls);

instead of

$urls_array = explode("\n", $urls);
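
If the list may mix line-ending styles, a slightly more defensive split is possible. This is a sketch with made-up sample URLs: the \R escape in PCRE matches "\r\n", "\n", or "\r", and trim() removes any stray whitespace that remains.

```php
<?php
// Made-up input that mixes Windows and Unix line endings.
$urls = "http://www.example.com/a\r\nhttp://www.example.com/b\nhttp://www.example.com/c\r\n";

// Split on any newline sequence and drop empty pieces.
$urls_array = preg_split('/\R/', $urls, -1, PREG_SPLIT_NO_EMPTY);

// Strip leftover whitespace from each URL.
$urls_array = array_map('trim', $urls_array);

print_r($urls_array);
```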

#11


1  

$html = file_get_contents("http://www.somesite.com/");

$dom = new DOMDocument();
$dom->loadHTML(htmlspecialchars($html));

echo $dom->saveHTML();

Try this.
