I'm using Ruby 1.9.2.
I'm trying to parse a CSV file that contains some French words (e.g. spécifié) and place the contents in a MySQL database.
When I read the lines from the CSV file,
file_contents = CSV.read("csvfile.csv", col_sep: "$")
The elements come back as Strings that are ASCII-8BIT encoded (spécifié becomes sp\xE9cifi\xE9), and strings like "spécifié" are then NOT properly saved into my MySQL database.
Yehuda Katz says that ASCII-8BIT is really "binary" data, meaning that CSV has no idea which encoding to read the file in.
So, if I try to make CSV force the encoding like this:
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "UTF-8")
I get the following error:
ArgumentError: invalid byte sequence in UTF-8:
If I go back to my original ASCII-8BIT encoded Strings and examine the String that my CSV read as ASCII-8BIT, it looks like this "Non sp\xE9cifi\xE9" instead of "Non spécifié".
I can't convert "Non sp\xE9cifi\xE9" to "Non spécifié" by doing this "Non sp\xE9cifi\xE9".encode("UTF-8")
because I get this error:
Encoding::UndefinedConversionError: "\xE9" from ASCII-8BIT to UTF-8
which Katz indicated would happen because ASCII-8BIT isn't really a proper String "encoding".
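Both failures can be reproduced without the CSV file. The following is a minimal sketch, assuming the bytes shown in the question; the variable names are only illustrative. It shows that relabeling the bytes as UTF-8 leaves them invalid, and that transcoding from binary raises because there is no character mapping to apply.

```ruby
# The bytes from the question, labeled as binary (ASCII-8BIT).
raw = "Non sp\xE9cifi\xE9".dup.force_encoding(Encoding::BINARY)

# Relabeling the same bytes as UTF-8 does not make them valid UTF-8:
# a lone 0xE9 is not a complete UTF-8 sequence.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)
puts as_utf8.valid_encoding?   # false

# Transcoding from binary fails because binary has no character mapping.
transcode_error = nil
begin
  raw.encode(Encoding::UTF_8)
rescue Encoding::UndefinedConversionError => e
  transcode_error = e
end
puts transcode_error.class     # Encoding::UndefinedConversionError
```

In other words, the string needs to be told which encoding its bytes really are in before it can be converted.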
Questions:
- Can I get CSV to read my file in the appropriate encoding? If so, how?
- How do I convert an ASCII-8BIT string to UTF-8 for proper storage in MySQL?
3 Solutions
#1
52
deceze is right, that is ISO8859-1 (AKA Latin-1) encoded text. Try this:
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1")
And if that doesn't work, you can use Iconv to fix up the individual strings with something like this:
require 'iconv'
utf8_string = Iconv.iconv('utf-8', 'iso8859-1', latin1_string).first
If latin1_string is "Non sp\xE9cifi\xE9", then utf8_string will be "Non spécifié". Also, Iconv.iconv can unmangle whole arrays at a time:
utf8_strings = Iconv.iconv('utf-8', 'iso8859-1', *latin1_strings)
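Iconv was removed from Ruby's standard library in 2.0, so on current Rubies the array case needs another route. One possible sketch, assuming the input strings hold ISO-8859-1 bytes labeled as binary, uses the two-argument form of String#encode, which takes the source encoding explicitly. The sample array is made up for illustration.

```ruby
# Hypothetical sample data: Latin-1 bytes that Ruby currently labels as binary.
latin1_strings = ["Non sp\xE9cifi\xE9", "caf\xE9"].map do |s|
  s.dup.force_encoding(Encoding::BINARY)
end

# encode(destination, source) interprets the bytes as `source`
# and returns a new string transcoded to `destination`.
utf8_strings = latin1_strings.map { |s| s.encode('UTF-8', 'ISO-8859-1') }

utf8_strings.each { |s| puts "#{s} (#{s.encoding})" }
```

This leaves the original strings untouched, which can matter if they are reused elsewhere.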
With newer Rubies, you can do things like this:
utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8')
where latin1_string thinks it is in ASCII-8BIT but is really in ISO-8859-1.
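One detail worth knowing is that force_encoding changes the label on the receiver itself, while encode returns a new string. A small sketch, assuming you want to keep the original string as it was, is to work on a copy:

```ruby
# The question's bytes, labeled as binary the way CSV returned them.
latin1_string = "Non sp\xE9cifi\xE9".dup.force_encoding(Encoding::BINARY)

# Relabel a copy as ISO-8859-1 (no bytes change), then transcode to UTF-8.
utf8_string = latin1_string.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts utf8_string            # Non spécifié
puts utf8_string.encoding   # UTF-8
puts latin1_string.encoding == Encoding::BINARY   # true
```

Each é grows from one byte in ISO-8859-1 to two bytes in UTF-8, so the converted string is two bytes longer than the original.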
#2
21
With Ruby >= 1.9 you can use:
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1:utf-8")
The ISO8859-1:utf-8 value means: the CSV file is ISO8859-1 encoded, but the content should be converted to UTF-8 as it is read.
If you prefer more verbose code, you can use:
file_contents = CSV.read("csvfile.csv", col_sep: "$",
                         external_encoding: "ISO8859-1",
                         internal_encoding: "utf-8")
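To try this without the original file, here is a self-contained sketch. It assumes a made-up two-line, "$"-separated sample written to a temporary file as ISO-8859-1 bytes; the file name prefix and contents are hypothetical, and the csv library must be available.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical sample: "$"-separated rows whose bytes are ISO-8859-1, not UTF-8.
latin1_bytes = "id$label\n1$Non sp\xE9cifi\xE9\n".dup.force_encoding(Encoding::BINARY)

sample = Tempfile.new(["latin1_sample", ".csv"])
sample.binmode
sample.write(latin1_bytes)
sample.close

# "external:internal" means: the file is ISO-8859-1, hand back UTF-8 strings.
rows = CSV.read(sample.path, col_sep: "$", encoding: "ISO-8859-1:UTF-8")
sample.unlink

label = rows[1][1]
puts label            # Non spécifié
puts label.encoding   # UTF-8
```

Because the fields already arrive as UTF-8, no per-string conversion should be needed before storing them.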
#3
1
I have been dealing with this issue for a while and none of the other solutions worked for me.
What did the trick was to write the problematic string to a file in binary mode, then read the file back normally and feed that string to the CSV module:
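require 'csv'
require 'tempfile'

# Write the raw bytes out in binary mode, untouched.
tempfile = Tempfile.new("conflictive_string")
tempfile.binmode
tempfile.write(conflictive_string)
tempfile.close
# Read them back normally so the string is tagged with the default external encoding.
cleaned_string = File.read(tempfile.path)
File.delete(tempfile.path)
csv = CSV.new(cleaned_string)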
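As far as I can tell, the round trip through the file does not transcode anything. It only relabels the same bytes with Encoding.default_external. The sketch below assumes Encoding.default_internal is unset and uses made-up sample bytes that happen to be valid UTF-8; it compares the file round trip with a plain force_encoding on a copy.

```ruby
require 'tempfile'

# Hypothetical sample: UTF-8 bytes for "spécifié" that Ruby has labeled as binary.
conflictive_string = "sp\xC3\xA9cifi\xC3\xA9".dup.force_encoding(Encoding::BINARY)

tempfile = Tempfile.new("conflictive_string")
tempfile.binmode
tempfile.write(conflictive_string)
tempfile.close
cleaned_string = File.read(tempfile.path)
tempfile.unlink

# The same relabeling without touching the disk.
relabeled = conflictive_string.dup.force_encoding(Encoding.default_external)

puts cleaned_string.encoding == Encoding.default_external   # true
puts cleaned_string.bytes == conflictive_string.bytes        # true
```

If that reading is right, this approach helps only when the bytes are already in the default external encoding (often UTF-8); ISO-8859-1 bytes like those in the question would still need to be transcoded.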