I'm reading data from a remote source, and occasionally get some characters in another encoding. They're not important.
I'd like to get a "best guess" UTF-8 string, and ignore the invalid data.
Main goal is to get a string I can use, and not run into errors such as:
- Encoding::UndefinedConversionError: "\xFF" from ASCII-8BIT to UTF-8:
- invalid byte sequence in utf-8
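For anyone trying to reproduce the two errors above, here is a minimal sketch. It assumes the remote data arrives tagged as binary (ASCII-8BIT); `error_class` is a hypothetical helper that exists only for this illustration.

```ruby
# A byte string as it might arrive from a socket or binary read (ASCII-8BIT).
raw = "caf\xFF".b

# Hypothetical helper: returns the class of the exception a block raises, or nil.
def error_class
  yield
  nil
rescue StandardError => e
  e.class
end

# Transcoding binary data to UTF-8 fails on any byte >= 0x80.
conversion_error = error_class { raw.encode("UTF-8") }

# Relabeling the same bytes as UTF-8 succeeds, but later string operations
# such as regexp matching fail on the invalid sequence.
mislabeled = raw.dup.force_encoding("UTF-8")
match_error = error_class { mislabeled =~ /caf/ }

p conversion_error
p match_error
```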
6 Solutions
#1
15
I thought this was it:
string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
will replace all unknowns with '?'.
To ignore all unknowns, use :replace => '':
string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")
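A quick way to see what those two calls do, as a sketch that assumes the input is tagged as binary (ASCII-8BIT), so that a real conversion to UTF-8 takes place:

```ruby
# Bytes tagged as binary (ASCII-8BIT), e.g. read from a socket. This is an
# assumption about the input; a string already tagged UTF-8 may behave differently.
raw = "abc\xFFdef".b

with_marker = raw.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
stripped    = raw.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")

p with_marker  # "abc?def"
p stripped     # "abcdef"
```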
Edit:
I'm not sure this is reliable. I've gone into paranoid-mode, and have been using:
string.encode("UTF-8", ...).force_encoding('UTF-8')
The script seems to be running OK now. But I'm pretty sure I'd gotten errors with this earlier.
Edit 2:
Even with this, I continue to get intermittent errors. Not every time, mind you. Just sometimes.
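One possible reason the errors keep coming back (a guess, not a diagnosis of the script above): force_encoding only changes the encoding label on the string. It does not validate or alter the bytes. A small sketch:

```ruby
raw = "abc\xFFdef".b                        # binary-tagged input (assumed)
relabeled = raw.dup.force_encoding("UTF-8")

p relabeled.encoding            # the label is now UTF-8
p relabeled.valid_encoding?     # false: the invalid byte is still there
p relabeled.bytes == raw.bytes  # true: no bytes were changed
```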
#2
3
String#chars or String#each_char can also be used.
# Table 3-8. Use of U+FFFD in UTF-8 Conversion
# (http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf)
# The trailing + keeps both lines in one expression; a line that starts
# with + would be parsed as a separate statement.
str = "\x61" + "\xF1\x80\x80" + "\xE1\x80" + "\xC2" +
      "\x62" + "\x80" + "\x63" + "\x80" + "\xBF" + "\x64"
p [
  'abcd' == str.chars.collect { |c| c.valid_encoding? ? c : '' }.join,
  'abcd' == str.each_char.map { |c| c.valid_encoding? ? c : '' }.join
]
String#scrub can be used since Ruby 2.1.
p [
  'abcd' == str.scrub(''),
  'abcd' == str.scrub { |c| '' }
]
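For reference, the three forms of String#scrub side by side. This is only a sketch and needs Ruby 2.1 or later; the sample string is made up for illustration.

```ruby
# A UTF-8 tagged string containing one invalid byte.
dirty = "abc\xFFdef".dup.force_encoding("UTF-8")

replaced = dirty.scrub       # invalid bytes become U+FFFD (the default)
removed  = dirty.scrub("")   # invalid bytes are dropped
# The block form receives the invalid bytes, here rendered as hex.
tagged   = dirty.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

p removed  # "abcdef"
p tagged   # "abc<ff>def"
```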
#3
2
This works great for me:
"String".encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "").force_encoding('UTF-8')
#4
2
To ignore all parts of the string that aren't correctly UTF-8 encoded, the following (as you originally posted) almost does what you want.
string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")
The caveat is that encode doesn't do anything if it thinks the string is already UTF-8. So you need to change encodings, going via an encoding that can still encode the full set of Unicode characters that UTF-8 can encode. (If you don't, you'll corrupt any characters that aren't in that encoding; 7-bit ASCII would be a really bad choice!) So go via UTF-16:
string.encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')
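A small sketch of that round trip on a made-up sample string, checking that valid non-ASCII characters survive while the stray byte is dropped:

```ruby
# Sample input (made up): one invalid byte followed by a valid two-byte character.
dirty = ("abc\xFF".b + "d\u00E9f".b).force_encoding("UTF-8")

via_utf16 = dirty.encode("UTF-16", :invalid => :replace, :replace => "").encode("UTF-8")

p dirty.valid_encoding?      # false
p via_utf16                  # "abcdéf"
p via_utf16.valid_encoding?  # true
```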
#5
0
With a bit of help from @masakielastic I have solved this problem for my personal purposes using the #chars method.
The trick is to break down each character into its own separate block so that Ruby can fail.
Ruby needs to fail when it confronts binary code etc. If you don't allow Ruby to go ahead and fail, it's a tough road when it comes to this stuff. So I use the String#chars method to break the given string into an array of characters. Then I pass each character into a sanitizing method that allows the code to have "microfailures" (my coinage) within the string.
So, given a "dirty" string, let's say you used File#read on a picture (my case):
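# Defined first so the method exists by the time the block below calls it.
def num_or_letter?(char)
  if char =~ /[a-zA-Z0-9]/
    true
  elsif char =~ Regexp.union(" ", ".", "?", "-", "+", "/", ",", "(", ")")
    true
  end
end

# File.read closes the file handle for us.
dirty = File.read(filepath)
clean_chars = dirty.chars.select do |c|
  begin
    num_or_letter?(c)
  rescue ArgumentError
    # Matching a character with an invalid byte sequence raises; skip it.
    next
  end
end
clean = clean_chars.join("")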
Allowing the code to fail somewhere along the way seems to be the best way to move through it. So long as you contain those failures within blocks, you can grab what is readable by the UTF-8-only-accepting parts of Ruby.
#6
0
I have not had luck with the one-line uses of String#encode, à la string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?"). They do not work reliably for me.
But I wrote a pure-Ruby "backfill" of String#scrub for MRI 1.9 or 2.0, or any other Ruby that does not offer String#scrub.
https://github.com/jrochkind/scrub_rb
It makes String#scrub available in rubies that don't have it; if loaded in MRI 2.1, it will do nothing and you'll still be using the built-in String#scrub, so it can allow you to easily write code that will work on any of these platforms.
Its implementation is somewhat similar to some of the other char-by-char solutions proposed in other answers, but it does not use exceptions for flow control (don't do that), is tested, and provides an API compatible with MRI 2.1 String#scrub.
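To illustrate the general idea (this is not the scrub_rb code, just a rough sketch under my own assumptions), a helper can use the built-in String#scrub when it exists and otherwise fall back to a char-by-char check that relies on valid_encoding? instead of rescuing exceptions. `best_guess_utf8` is a hypothetical name.

```ruby
# Hypothetical helper, not part of scrub_rb: returns a valid UTF-8 copy of str,
# substituting `replacement` for every invalid byte sequence.
def best_guess_utf8(str, replacement = "")
  # Work on a copy, and treat the bytes as UTF-8 (an assumption about the input).
  utf8 = str.dup.force_encoding("UTF-8")
  if utf8.respond_to?(:scrub)
    # Ruby 2.1+: use the built-in.
    utf8.scrub(replacement)
  else
    # Older Rubies: check each character's validity, no exceptions involved.
    utf8.each_char.map { |c| c.valid_encoding? ? c : replacement }.join
  end
end

p best_guess_utf8("abc\xFFdef".b)       # "abcdef"
p best_guess_utf8("abc\xFFdef".b, "?")  # "abc?def"
```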