PHP and handling foreign characters for UTF-8 XML

Time: 2022-10-24 23:30:14

I am currently scraping some data from the internet and converting it into XML documents.

  • The document being scraped is UTF-8 according to its meta tags

The problem is that some of the data contains foreign characters, and I cannot find a way of reliably converting them into XML / UTF-8 friendly entities. The errors below are the ones I have managed to find by reading through the output; ideally I would like a solution that works all the time.

Example 1 works correctly, example 2 fails. My research fixed example 1, but it does not seem to be a blanket solution.

Côte d'Ivoire  
Côte d'Ivoire (correct)  

I managed to get the - ô - to parse correctly by applying the following function to my XPath result.

$w->text(charset_decode_utf_8((string)$match->a));

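function charset_decode_utf_8($string) {
    // Pure ASCII input contains no multi-byte sequences, so return it unchanged.
    if (!preg_match('/[\x80-\xFF]/', $string)) {
        return $string;
    }
    // Three-byte sequences (1110xxxx 10xxxxxx 10xxxxxx) become numeric entities.
    $string = preg_replace_callback('/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/', function ($m) {
        return '&#' . ((ord($m[1]) - 224) * 4096 + (ord($m[2]) - 128) * 64 + (ord($m[3]) - 128)) . ';';
    }, $string);
    // Two-byte sequences (110xxxxx 10xxxxxx) become numeric entities.
    $string = preg_replace_callback('/([\xC0-\xDF])([\x80-\xBF])/', function ($m) {
        return '&#' . ((ord($m[1]) - 192) * 64 + (ord($m[2]) - 128)) . ';';
    }, $string);
    return $string;
}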
ÖFB Stiegl Cup  
ÖFB Stiegl Cup (wrong)  

Unfortunately the - Ö - gets converted into a double entity. I have no idea how to make it convert to a proper HTML entity.

I have tried:

  • using iso-8859-1 encoding when creating my xml document

  • using htmlentities with utf-8 encoding

Any help would be greatly appreciated, as I am tearing my hair out trying to get things to save correctly.

2 solutions

#1


UTF-8 can store any character (proof: it stores them in the web pages you are scraping), so why encode some of them as entities?
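
As a minimal sketch of that idea, assuming the output is built with PHP's XMLWriter (as the `$w->text(...)` call in the question suggests), the scraped strings can be written as they are, with the document declared as UTF-8. The element names `competitions` and `competition` are hypothetical, chosen only for illustration.

```php
<?php
// Sketch: write scraped UTF-8 text straight into the document, with no
// manual entity encoding. Element names here are hypothetical.
$w = new XMLWriter();
$w->openMemory();
$w->startDocument('1.0', 'UTF-8'); // declares encoding="UTF-8" in the prolog
$w->startElement('competitions');
foreach (["Côte d'Ivoire", 'ÖFB Stiegl Cup'] as $name) {
    $w->startElement('competition');
    $w->text($name); // escapes markup characters such as & and <
    $w->endElement();
}
$w->endElement();
$w->endDocument();
$xml = $w->outputMemory();
echo $xml;
```

This assumes the PHP source file and the scraped strings are themselves UTF-8; if the input were in another encoding it would have to be converted first.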

If you see encoding problems when you open the XML documents, check your editor's settings: does it try to read the document as UTF-8? Some editors don't by default. Likewise, if you open a document from your hard disk in a browser, it might fail to recognize it as UTF-8 because there is no server to send a header indicating the encoding.
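
One way to rule out the viewer is to look at the bytes directly. The sketch below is only an illustration: `is_valid_utf8()` is a hypothetical helper name, not a PHP built-in, and it relies on the PCRE `/u` modifier rejecting subjects that are not valid UTF-8.

```php
<?php
// Sketch: inspect the bytes instead of trusting how an editor renders them.
// is_valid_utf8() is a hypothetical helper name, not a PHP built-in.
function is_valid_utf8(string $bytes): bool
{
    // With the /u modifier, preg_match() fails on subjects that are not valid UTF-8.
    return preg_match('//u', $bytes) === 1;
}

$good = "\xC3\x96FB Stiegl Cup"; // "Ö" encoded as UTF-8 (bytes C3 96)
$bad  = "\xD6FB Stiegl Cup";     // "Ö" encoded as ISO-8859-1 (single byte D6)

var_dump(is_valid_utf8($good)); // bool(true)
var_dump(is_valid_utf8($bad));  // bool(false)
echo bin2hex(substr($good, 0, 2)), "\n"; // c396
```

If the saved file passes a check like this, the bytes are fine and the problem is more likely in how the file is being displayed.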

If that is not the problem, can you upload an example of a problematic XML document somewhere?

#2


Don't bother with entity encoding. Use CDATA blocks instead.

PHP's native strings are not UTF-8 aware. To PHP a string is just a byte stream, and it is best to treat it that way. You're shuttling bytes around, so all you need to do is make sure they don't get parsed and that they're labeled correctly.
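
A minimal sketch of the CDATA approach, assuming XMLWriter is used as in the question; the element name `name` and the sample text are hypothetical. Note that a CDATA section ends at the first `]]>`, so text containing that sequence would need separate handling.

```php
<?php
// Sketch: wrap scraped text in a CDATA section so the parser does not
// interpret it. The element name "name" is hypothetical.
$w = new XMLWriter();
$w->openMemory();
$w->startDocument('1.0', 'UTF-8'); // label the bytes as UTF-8
$w->startElement('name');
$w->writeCdata('ÖFB Stiegl Cup & friends'); // "&" needs no escaping inside CDATA
$w->endElement();
$w->endDocument();
$xml = $w->outputMemory();
echo $xml;

// Reading the document back yields the original text.
$doc = new DOMDocument();
$doc->loadXML($xml);
echo $doc->documentElement->textContent, "\n";
```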
