Why does Ruby raise an error when operating on a string containing invalid UTF-8 data, but Python does not?

In Ruby (2.4), I can create a string whose encoding is UTF-8 but which contains a byte invalid in UTF-8 (let's use the byte E1).

Then when I try to match a regex against this string, I get an error.

2.4.0 :001 > "Hi! \xE1".match?(//)
ArgumentError: invalid byte sequence in UTF-8
        from (irb):1:in `match?'
        from (irb):1

When I do the same thing in Python 3, I do not get an error.

>>> import re; re.match('', "Hi! \xE1")
<_sre.SRE_Match object; span=(0, 0), match=''>

My understanding is that, in both cases, I am in a state of sin because I am creating UTF-8-encoded strings that contain bytes invalid in UTF-8. Given that:

Is it specifically regex comparisons that fail in Ruby, and not other operations? If so, why?

What accounts for the difference between Ruby and Python here?

Is it possible to get Python to give an error of this type? (Without interacting with external resources -- I know this can happen in the context of connecting to a database, for example.)

1 answer

#1

In Ruby, creating a new string using quotes (e.g. 'Hi!') creates an instance of the core String class. In 2.0 or later, Ruby defaults to interpreting strings in source files as UTF-8, and an escape such as \xE1 in a double-quoted literal puts the raw byte E1 into the string. When you then call a method that has to interpret the bytes as characters, Ruby uses the string's encoding to do so and raises if the bytes are not valid in that encoding. So, to answer your first question, it is not specific to regex matches: you would see the same error from gsub or split. Methods that do not need to decode characters, such as bytes or valid_encoding?, still work, and scrub can be used to replace the invalid bytes.

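As this post helpfully details, Python 3 treats strings as sequences of Unicode code points rather than as encoded bytes. In a Python 3 str literal, \xE1 is an escape for the code point U+00E1 (á), not for the raw byte E1, so the resulting string is perfectly valid and there is nothing for the regex engine to reject. How Python stores the string internally (which depends on the version and, before 3.3, on how the interpreter was built) does not change this. The equivalent of the Ruby situation in Python is a bytes object such as b"Hi! \xE1", and decoding that as UTF-8 does raise an error.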
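
A minimal sketch of that difference, and one way to get Python to complain without touching any external resource. The helper name decode_utf8_or_none is made up for this illustration; everything else is standard library behaviour.

```python
import re


def decode_utf8_or_none(raw: bytes):
    """Return raw decoded as UTF-8, or None if the bytes are not valid UTF-8.

    Hypothetical helper for this illustration only.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# In a str literal, \xE1 names the code point U+00E1, not a raw byte.
text = "Hi! \xE1"
print(len(text))                        # number of code points
print(hex(ord(text[-1])))               # the last code point
print(text.encode("utf-8"))             # U+00E1 becomes two bytes in UTF-8
print(re.match("", text) is not None)   # matching works, the str is valid

# In a bytes literal, \xE1 really is the single byte E1.
raw = b"Hi! \xE1"
print(decode_utf8_or_none(raw))         # None: not valid UTF-8
print(decode_utf8_or_none(b"Hi! \xc3\xa1"))
```

Calling raw.decode("utf-8") directly, without the try/except, is what surfaces the UnicodeDecodeError, which is the closest Python counterpart to Ruby's ArgumentError here.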

Interestingly enough, if you give Python an escape for a code point that is not an assigned character (U+FFFF is a noncharacter), it accepts it and simply shows the escape as-is in the repr:

>>> '\uffff'
'\uffff'

whereas if you give it a malformed escape (non-hex digits after \x), it raises an error:

>>> "Hi! \xz1"
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 4-5: truncated \xXX escape
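
The two cases above can be checked programmatically. This is only a sketch: try_encode_utf8 and literal_compiles are hypothetical helper names, and the lone surrogate is an extra case added here to show a str that Python accepts but cannot encode as UTF-8.

```python
import unicodedata


def try_encode_utf8(text: str):
    """Return the UTF-8 bytes for text, or None if it cannot be encoded."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None


def literal_compiles(source: str) -> bool:
    """Return True if source compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


# U+FFFF is a valid code point (a noncharacter), so the literal is accepted.
nonchar = "\uffff"
print(unicodedata.category(nonchar))   # general category of U+FFFF
print(nonchar.isprintable())           # repr() escapes non-printable characters
print(try_encode_utf8(nonchar))        # it still has a UTF-8 encoding

# A lone surrogate is allowed inside a str but has no UTF-8 encoding.
print(try_encode_utf8("\ud800"))

# A malformed escape is rejected when the literal is compiled ...
print(literal_compiles(r'"Hi! \xz1"'))
# ... while a well-formed escape compiles fine.
print(literal_compiles(r'"Hi! \xE1"'))
```

The SyntaxError for \xz1 comes from parsing the source code, before any string exists, so it is a different kind of failure from Ruby's runtime complaint about the bytes in an existing string.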
