So I have a Ruby script that parses HTML pages and saves the extracted strings into a DB, but I'm getting weird characters (usually question marks) instead of plain text.
E.g.: ‘SOME TEXT’ instead of 'Some Text'
I've tried HTML entities and CGI::unescape, but to no avail. I did some googling and set $KCODE = 'u' and added require 'jcode', but it's still not working.
Any suggestions or pointers would be great.
Thanks
PS: using MySQL 5.1
2 Solutions
#1
Your script is storing the UTF-8 bytes of Unicode (curly) quotation marks, rather than ASCII quotation marks, in the database.
That's actually good - it shows that the DB itself is working fine, although for best results you should ensure that the table is set to use a UTF-8 collation (for example 'utf8_general_ci' or 'utf8_unicode_ci') so that string sorting works properly.
The fact that the output is displayed as "‘" just means that your terminal (and/or web page) output encoding is incorrect.
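To illustrate where a sequence like "‘" can come from, here is a minimal Ruby sketch. It assumes Ruby 1.9 or later and assumes the displaying side reads the bytes as Windows-1252; the question does not say which encoding is actually involved, so treat that choice as a guess.

```ruby
# UTF-8 text containing curly quotes (U+2018 and U+2019).
curly = "\u2018SOME TEXT\u2019"

# Reinterpret the same bytes as Windows-1252 (the assumed wrong encoding),
# then transcode to UTF-8 so the result can be printed.
garbled = curly.dup.force_encoding("Windows-1252").encode("UTF-8")
puts garbled

# Undoing the misinterpretation recovers the original text,
# so no data was lost; only the interpretation was wrong.
restored = garbled.encode("Windows-1252").force_encoding("UTF-8")
puts restored == curly
```

The first line printed matches the garbled example in the question, which is consistent with the stored bytes being fine and only the display encoding being off.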
If it's terminal output, make sure that the LANG environment variable (ENV['LANG'] in Ruby) is set to an appropriate UTF-8 locale (e.g. en_US.UTF-8) and that the terminal emulator itself is set the same way.
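A quick way to check the terminal side from Ruby might look like the sketch below. The helper name `utf8_locale?` is hypothetical, and the check only inspects the locale string; note that LC_ALL or LC_CTYPE, if set, take precedence over LANG.

```ruby
# Hypothetical helper: true if a locale string names a UTF-8 codeset,
# e.g. "en_US.UTF-8" or "en_US.utf8" (optionally with an @modifier).
def utf8_locale?(lang)
  !!(lang.to_s =~ /\.utf-?8(@.*)?\z/i)
end

# Report what the current process sees.
puts "LANG = #{ENV['LANG'].inspect}"
puts "UTF-8 locale? #{utf8_locale?(ENV['LANG'])}"
puts "Default external encoding: #{Encoding.default_external}"
```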
If it's HTML output, make sure that the page encoding is set to UTF-8 as well, i.e.:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
#2
Is the DB that you're storing data in capable of handling Unicode? These symptoms seem to imply that it's not. For Unicode support under MySQL, please see this link.
It seems likely that the quotation marks in question are not the standard ASCII quotation marks but the Unicode ones.
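If plain ASCII quotes are what you want in the DB, one option is to map the Unicode quotation marks to their ASCII counterparts before saving. This is only a sketch: `ascii_quotes` is a hypothetical helper, it covers just the four most common curly quotes, and it assumes Ruby 1.9 or later with UTF-8 strings.

```ruby
# Hypothetical helper: replace common curly quotes with ASCII quotes.
# U+2018/U+2019 become ' and U+201C/U+201D become ".
def ascii_quotes(text)
  text.tr("\u2018\u2019\u201C\u201D", "''\"\"")
end

cleaned = ascii_quotes("\u2018SOME TEXT\u2019")
puts cleaned

# String#ascii_only? tells you whether any non-ASCII characters remain.
puts cleaned.ascii_only?
```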
Ruby has an iconv implementation to convert between encoding types. See here for more information.
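Note that iconv was removed from the standard library in Ruby 2.0; on Ruby 1.9 and later, the built-in String#encode covers the same ground. The sketch below is not the asker's code, just an illustration with made-up sample strings. It also shows one way question marks can appear: transcoding to an encoding that has no curly quotes.

```ruby
# Bytes for "café" in ISO-8859-1 (0xE9 is "é"); tag them, then transcode.
latin1 = "caf\xE9".dup.force_encoding("ISO-8859-1")
utf8 = latin1.encode("UTF-8")
puts utf8

# Curly quotes have no US-ASCII equivalent, so with undef: :replace
# each one is swapped for the replacement string "?".
curly = "\u2018SOME TEXT\u2019"
ascii = curly.encode("US-ASCII", undef: :replace, replace: "?")
puts ascii
```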