Is there an accepted way to deal with regular expressions in Ruby 1.9 for which the encoding of the input is unknown? Let's say my input happens to be UTF-16 encoded:
x = "foo<p>bar</p>baz"
y = x.encode('UTF-16LE')
re = /<p>(.*)<\/p>/
x.match(re)
=> #<MatchData "<p>bar</p>" 1:"bar">
y.match(re)
Encoding::CompatibilityError: incompatible encoding regexp match (US-ASCII regexp with UTF-16LE string)
My current approach is to use UTF-8 internally and re-encode (a copy of) the input if necessary:
# respond_to? is false on Ruby 1.8, so the encoding check is skipped there.
if y.respond_to?(:encode) && y.encoding.name != 'UTF-8'
  y = y.encode('UTF-8')
end
y.match(/<p>(.*)<\/p>/u)
=> #<MatchData "<p>bar</p>" 1:"bar">
However, this feels a little awkward to me, and I wanted to ask if there's a better way to do it.
2 Answers
#1
9
As far as I am aware, there is no better method to use. However, might I suggest a slight alteration?
Rather than changing the encoding of the input, why not change the encoding of the regex? Translating one regex string every time you meet a new encoding is a lot less work than translating hundreds or thousands of lines of input to match the encoding of your regex.
# Utility function to make transcoding the regex simpler.
def get_regex(pattern, encoding='ASCII', options=0)
  Regexp.new(pattern.encode(encoding), options)
end
# Inside code looping through lines of input.
# The variables 'regex' and 'last_encoding' should be initialized beforehand so
# that they persist across iterations.
if line.respond_to?(:encoding) # Ruby 1.8 compatibility
  if line.encoding != last_encoding
    # Regexp::FIXEDENCODING (16, binary 00010000) is the option bit that //u sets.
    regex = get_regex('<p>(.*)<\/p>', line.encoding, Regexp::FIXEDENCODING)
    last_encoding = line.encoding
  end
end
line.match(regex)
In the pathological case (where the input encoding changes on every line) this would be just as slow, since you would re-encode the regex on every pass through the loop. But in the far more common case, where the encoding is constant for an entire file of hundreds or thousands of lines, this avoids almost all of the re-encoding.
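One way to soften the pathological case is to keep one compiled regex per encoding instead of only the most recent one. The following is a minimal sketch of that idea, assuming Ruby 1.9 or later; the `EncodedPattern` class and its `regexp_for` method are hypothetical names made up for this illustration, not part of Ruby or of the code above.

```ruby
# Hypothetical helper: compiles the pattern once per input encoding and
# reuses the compiled Regexp on later lines with the same encoding.
class EncodedPattern
  def initialize(source, options = 0)
    @source  = source   # pattern as a plain string, e.g. '<p>(.*)</p>'
    @options = options  # passed through to Regexp.new
    @cache   = {}       # Encoding => Regexp
  end

  # Returns the Regexp transcoded to the given encoding, building it on first use.
  def regexp_for(encoding)
    @cache[encoding] ||= Regexp.new(@source.encode(encoding), @options)
  end

  # Matches a line using the Regexp that corresponds to the line's own encoding.
  def match(line)
    regexp_for(line.encoding).match(line)
  end
end

paragraph = EncodedPattern.new('<p>(.*)</p>')

utf8_line  = "foo<p>bar</p>baz"
utf16_line = utf8_line.encode("UTF-16LE")

puts paragraph.match(utf8_line)[1]
# The capture comes back in the encoding of the input, so transcode it for printing.
puts paragraph.match(utf16_line)[1].encode("UTF-8")
```

With this shape, input that alternates between two encodings costs two compilations in total rather than one per line, at the price of holding one Regexp per encoding seen.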
#2
0
Follow the advice of this page: http://gnuu.org/2009/02/02/ruby-19-common-problems-pt-1-encoding/ and add
# encoding: utf-8
to the top of your .rb file.
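For context, here is a small sketch of what the magic comment does and does not change, assuming Ruby 1.9 or later. The comment sets the encoding of literals in the source file, but UTF-16LE is not ASCII-compatible, so a UTF-16LE string still has to be transcoded (or the regex re-encoded) before matching. The `incompatible?` helper is a hypothetical name used only for this demonstration.

```ruby
# encoding: utf-8
# The magic comment only takes effect on the first line of the file
# (or the second, when the first is a shebang).

PARAGRAPH = /<p>(.*)<\/p>/

# Hypothetical helper: true when the match raises because the string and
# the regex have encodings that cannot be mixed.
def incompatible?(input, pattern = PARAGRAPH)
  input.match(pattern)
  false
rescue Encoding::CompatibilityError
  true
end

utf16_input = "foo<p>bar</p>baz".encode("UTF-16LE")

puts incompatible?(utf16_input)                    # prints "true"
puts incompatible?(utf16_input.encode("UTF-8"))    # prints "false"
puts utf16_input.encode("UTF-8").match(PARAGRAPH)[1] # prints "bar"
```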