I am currently scraping some data from the internet and converting it into XML documents.
- The document being scraped is UTF-8 according to its meta tags.
The problem is that some of the data contains foreign characters, and I cannot find a reliable way to convert them into XML / UTF-8 friendly entities. The errors below are the ones I have managed to find so far by reading through the output; ideally I would like a solution that works every time.
Example 1 works correctly, example 2 fails. My research fixed example 1, but it does not seem to be a blanket solution.
Example 1: Côte d'Ivoire -> Côte d'Ivoire (correct)
I managed to get the ô to parse correctly by applying the following function to the value from my XPath match:
$w->text(charset_decode_utf_8((string)$match->a));
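// Convert 2- and 3-byte UTF-8 sequences into numeric character references.
function charset_decode_utf_8($string) {
    if (!preg_match('/[\x80-\xFF]/', $string)) {
        return $string; // pure ASCII, nothing to convert
    }
    // Three-byte sequences: U+0800 to U+FFFF.
    $string = preg_replace_callback('/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/', function ($m) {
        return '&#' . ((ord($m[1]) - 224) * 4096 + (ord($m[2]) - 128) * 64 + (ord($m[3]) - 128)) . ';';
    }, $string);
    // Two-byte sequences: U+0080 to U+07FF.
    $string = preg_replace_callback('/([\xC0-\xDF])([\x80-\xBF])/', function ($m) {
        return '&#' . ((ord($m[1]) - 192) * 64 + (ord($m[2]) - 128)) . ';';
    }, $string);
    return $string;
}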
Example 2: ÖFB Stiegl Cup -> ÖFB Stiegl Cup (wrong)
Unfortunately, the Ö gets converted into a double entity. I have no idea how to make it convert to a proper HTML entity.
I have tried:
- using iso-8859-1 encoding when creating my xml document
- using htmlentities with utf-8 encoding
Any help would be greatly appreciated, as I am tearing my hair out trying to get things to save correctly.
2 Solutions
#1
UTF-8 can store any character (proof: it stores them in the web pages you are scraping), so why encode some of them as entities?
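As a rough illustration of writing the scraped text as-is, here is a minimal sketch. It assumes `$w` in the question is an `XMLWriter`, that the source file is saved as UTF-8, and the element names and sample strings are made up for the example.

```php
<?php
// Sketch only: write scraped UTF-8 text straight into the XML, with no
// entity conversion step. Element names and sample data are assumptions.
// This file itself is assumed to be saved as UTF-8.
$names = ["Côte d'Ivoire", "ÖFB Stiegl Cup"];

$w = new XMLWriter();
$w->openMemory();
$w->startDocument('1.0', 'UTF-8'); // label the output as UTF-8
$w->startElement('competitions');
foreach ($names as $name) {
    // Markup characters such as < and & are escaped by the writer;
    // the non-ASCII bytes are passed through unchanged.
    $w->writeElement('name', $name);
}
$w->endElement();
$w->endDocument();

$xml = $w->outputMemory();
echo $xml;
```

The point of the design is that the encoding is declared once in the XML prolog and the characters are never turned into entities, so nothing can be escaped twice.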
If you see encoding problems when you open the XML documents, check your editor's settings: does it try to read the document as UTF-8? (Some editors don't by default. Likewise, if you open a document from your hard disk in a browser, it might fail to recognize it as UTF-8 because there is no server to send a header saying so.)
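One way to take the editor out of the equation is to test the bytes themselves. The sketch below is only an illustration: `is_valid_utf8()` is a hypothetical helper name, and the two sample strings are hand-built byte sequences rather than data from the question.

```php
<?php
// Sketch only: check whether a string is well-formed UTF-8 instead of
// trusting how an editor or browser renders it.
// is_valid_utf8() is a hypothetical helper, not a PHP built-in.
function is_valid_utf8(string $bytes): bool
{
    // With the /u modifier, PCRE refuses subjects that are not valid UTF-8,
    // so the empty pattern matches only when the whole string is well formed.
    return preg_match('//u', $bytes) === 1;
}

$utf8   = "\xC3\x96FB Stiegl Cup"; // "Ö" encoded as UTF-8 (two bytes)
$latin1 = "\xD6FB Stiegl Cup";     // "Ö" encoded as ISO-8859-1 (one byte)

var_dump(is_valid_utf8($utf8));   // bool(true)
var_dump(is_valid_utf8($latin1)); // bool(false)
```

If the generated file passes a check like this, the display problem is more likely in the viewer than in the data.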
If that is not the problem, can you upload an example of a problematic XML document somewhere?
#2
Don't bother with entity encoding. Use CDATA blocks instead.
PHP's native strings don't understand UTF-8; to PHP a string is just a byte stream, and it is best to treat it that way. You're shuttling bytes around, so all you need to do is make sure they don't get parsed and that they're labeled correctly.
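A minimal sketch of the CDATA approach follows. It assumes the `XMLWriter` and DOM extensions are available and that the source file is saved as UTF-8; the element name and sample text are invented for the example.

```php
<?php
// Sketch only: wrap the scraped bytes in a CDATA section so the XML parser
// does not interpret them, and label the document as UTF-8.
// The element name and sample text are assumptions for illustration.
$text = "ÖFB Stiegl Cup & Côte d'Ivoire <b>";

$w = new XMLWriter();
$w->openMemory();
$w->startDocument('1.0', 'UTF-8');
$w->startElement('name');
$w->writeCdata($text); // emitted as <![CDATA[...]]> with no escaping
$w->endElement();
$w->endDocument();

$xml = $w->outputMemory();
echo $xml;

// Round trip: parsing the result should hand back the original bytes.
$doc = new DOMDocument();
$doc->loadXML($xml);
var_dump($doc->documentElement->textContent === $text); // bool(true)
```

One caveat with this approach: text that itself contains the sequence `]]>` would end the CDATA section early, so it would need to be split or escaped before being written.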