I'm trying to read a .txt file in Ruby and split the text line by line.
Here is my code:
def file_read(filename)
  File.open(filename, 'r').read
end

puts f = file_read('alice_in_wonderland.txt')
This works perfectly. But when I add the method line_cutter like this:
def file_read(filename)
  File.open(filename, 'r').read
end

def line_cutter(file)
  file.scan(/\w/)
end

puts f = line_cutter(file_read('alice_in_wonderland.txt'))
I get an error:
`scan': invalid byte sequence in UTF-8 (ArgumentError)
I found a suggestion online for handling input from untrusted websites and tried to adapt it to my own code, but it's not working. How can I get rid of this error?
Link to the file: File
2 Solutions
#1
5
The linked text file contains the following line:
Character set encoding: ISO-8859-1
If converting the file isn't desired or possible, then you have to tell Ruby that it is ISO-8859-1 encoded. Otherwise the default external encoding is used (UTF-8 in your case). A possible way to do that is:
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1')
s.encoding # => #<Encoding:ISO-8859-1>
Or, if you prefer your string to be UTF-8 encoded (see utf8everywhere.org), like this:
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1:UTF-8')
s.encoding # => #<Encoding:UTF-8>
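Since the question asks for a line-by-line split, here is a minimal sketch that combines the transcoding read above with a split into lines. The helper name read_lines and its source_encoding parameter are made up for illustration, and the temporary file only stands in for alice_in_wonderland.txt. Note that scan(/\w/) returns individual word characters, not lines, so the sketch uses String#lines instead.

```ruby
require 'tempfile'

# Hypothetical helper: read a file in the given source encoding,
# transcode it to UTF-8, and return its lines without trailing newlines.
def read_lines(filename, source_encoding: 'ISO-8859-1')
  File.read(filename, encoding: "#{source_encoding}:UTF-8").lines.map(&:chomp)
end

# Demo with a small stand-in file containing an ISO-8859-1 byte (0xE9).
Tempfile.create('latin1_sample') do |tmp|
  tmp.binmode
  tmp.write("caf\xE9\nAlice\n".b)
  tmp.flush

  lines = read_lines(tmp.path)
  p lines.length          # number of lines read
  p lines.first.encoding  # encoding of the transcoded strings
end
```

With the real file, the call would presumably be read_lines('alice_in_wonderland.txt').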
#2
2
It seems to work if you read the file directly from the page. Maybe there's something funny about the local copy you have. Try this:
require 'net/http'
uri = 'http://www.ccs.neu.edu/home/vip/teach/Algorithms/7_hash_RBtree_simpleDS/hw_hash_RBtree/alice_in_wonderland.txt'
scanned = Net::HTTP.get_response(URI.parse(uri)).body.scan(/\w/)
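A likely reason the download behaves differently, stated as an assumption rather than something verified against this server: the response body is probably tagged as binary (ASCII-8BIT) rather than UTF-8, so scan never checks it for valid UTF-8. The sketch below illustrates this with a hand-made byte string in place of the downloaded body. The helper name latin1_to_utf8 is hypothetical.

```ruby
# Hypothetical helper: reinterpret raw bytes as ISO-8859-1 and transcode to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Stand-in for a downloaded body: ISO-8859-1 bytes tagged as binary.
raw = "caf\xE9\n".b

p raw.encoding    # binary strings are not validated as UTF-8
p raw.scan(/\w/)  # scanning the binary string does not raise

# The same bytes tagged as UTF-8 are an invalid byte sequence,
# which is the situation behind the error in the question.
p raw.dup.force_encoding(Encoding::UTF_8).valid_encoding?

# Transcoding yields a valid UTF-8 string that is safe to scan or split.
text = latin1_to_utf8(raw)
p text.encoding
p text.valid_encoding?
```

If the goal is proper text rather than raw bytes, transcoding the body this way keeps the accented characters intact.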