I've just run into a strange issue when trying to detect a certain string in an array of strings. Does anyone know what's going on here?
(rdb:1) p magic_string
"Time Period"
(rdb:1) p magic_string.class
String
(rdb:1) p magic_string == "Time Period"
false
(rdb:1) p "Time Period".length
11
(rdb:1) p magic_string.length
14
(rdb:1) p magic_string[0].chr
"\357"
(rdb:1) p magic_string[1].chr
"\273"
(rdb:1) p magic_string[2].chr
"\277"
(rdb:1) p magic_string[3].chr
"T"
2 Solutions
#1
7
Your string begins with 3 extra bytes, a byte order mark (BOM), which signal that the encoding is UTF-8. That is why its length is 14 rather than 11 and why the comparison with "Time Period" returns false.
Q: What is a BOM?
A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.
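Given that diagnosis, one way to make the comparison succeed is to drop the leading BOM before comparing. The following is only a minimal sketch: `strip_bom` is a hypothetical helper name, and it assumes Ruby 1.9 or later with the string tagged as UTF-8 (on a byte-oriented string, as in the debugger session above, you would remove the three raw bytes instead).

```ruby
# Hypothetical helper: returns a copy of str without a leading U+FEFF.
# Assumes str is UTF-8 encoded; strings without a BOM pass through unchanged.
def strip_bom(str)
  str.sub(/\A\uFEFF/, "")
end

# Simulated input: the BOM is written as an explicit escape so it is visible.
magic_string = "\uFEFFTime Period"

cleaned = strip_bom(magic_string)
puts cleaned == "Time Period"   # prints "true"
puts cleaned.length             # prints "11"
```

Stripping at the point where the data enters the program (for example, right after reading the file) keeps every later comparison simple.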
#2
6
This might help you to understand what's happening:
# encoding: UTF-8
RUBY_VERSION # => "1.9.3"
magic_string = "\uFEFFTime Period" # BOM written as an explicit escape; it is invisible when pasted literally
magic_string[0].chr # => "\uFEFF"
The same output is true with Ruby v2.2.2.
Older versions of Ruby didn't default to UTF-8: Ruby 1.8 treated strings as arrays of bytes, and Ruby 1.9 assumed US-ASCII source files unless told otherwise (Ruby 2.0 and later default to UTF-8). The "# encoding: UTF-8" line is important because it tells Ruby what encoding the script's string literals are in.
Ruby now correctly treats Strings as sequences of characters, not bytes, which is why it reports the first character as "\uFEFF", a single character that takes three bytes in UTF-8 (two in UTF-16).
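The character-versus-byte distinction can be checked directly. This is a small sketch, assuming Ruby 1.9 or later and the same simulated string with an explicit BOM escape; the variable names are illustrative only.

```ruby
magic_string = "\uFEFFTime Period"

char_count = magic_string.length     # counts characters: the BOM plus 11 letters
byte_count = magic_string.bytesize   # counts bytes: 3 for the BOM plus 11 ASCII bytes

# The first three bytes, shown in octal to match the debugger output above.
bom_octal = magic_string.bytes.first(3).map { |b| b.to_s(8) }

puts char_count          # prints "12"
puts byte_count          # prints "14"
puts bom_octal.inspect   # prints ["357", "273", "277"]
```

The byte count of 14 and the octal values 357, 273, 277 are exactly what the byte-oriented session in the question reported.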
"\uFEFF" is the BOM itself. In UTF-16 it is stored as the bytes FE FF in big-endian order and FF FE in little-endian order, so a reader that sees "\uFFFE" knows it has the byte order backwards. Endianness is the CPU's convention for whether the most significant or the least significant byte of a word (typically two bytes) comes first. UTF-8 has no byte-order ambiguity, so there the BOM serves only as an encoding signature. Endianness and Unicode are both things you need to understand, at least in a rudimentary way, because we no longer deal only with ASCII, and languages don't consist only of the Latin character set.
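To see how the same BOM character comes out in different byte orders, you can transcode it and inspect the bytes. This is a sketch using `String#encode`; the `hex_bytes` helper and the variable names are made up for illustration.

```ruby
# Hypothetical helper: the bytes of a string as uppercase hex pairs.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }
end

bom = "\uFEFF"

utf8_bytes = hex_bytes(bom)                     # UTF-8 form of the BOM
be_bytes   = hex_bytes(bom.encode("UTF-16BE"))  # most significant byte first
le_bytes   = hex_bytes(bom.encode("UTF-16LE"))  # least significant byte first

puts utf8_bytes.inspect   # prints ["EF", "BB", "BF"]
puts be_bytes.inspect     # prints ["FE", "FF"]
puts le_bytes.inspect     # prints ["FF", "FE"]
```

A reader that decodes the little-endian bytes as big-endian gets U+FFFE, which is how a byte-order mismatch reveals itself.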
UTF-8 is a variable-width encoding of Unicode, a character set that covers a huge number of characters from many languages. You can also run into UTF-16LE, UTF-16BE, or UTF-32. HTML and other documents on the internet can be encoded with different character widths and byte orders depending on the originating system; not being aware of that can drive you nuts and send you down the wrong path when trying to read their content. It's important to read the "IO Encoding" section of the IO class documentation to know the right way to deal with these types of files.
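As an illustration of the IO-level approach, Ruby's file modes accept a "BOM|" prefix on the external encoding, which consumes a leading BOM if one is present. The sketch below writes a throwaway file with `Tempfile` (Ruby 2.1 or later) so it is self-contained; the file name prefix and variable names are assumptions for the example, not part of the original problem.

```ruby
require "tempfile"

raw = nil
clean = nil

Tempfile.create("bom_demo") do |f|
  # Write a small UTF-8 file that starts with a BOM.
  f.write("\uFEFFTime Period\n")
  f.flush

  # Plain UTF-8 read: the BOM stays in the string as its first character.
  raw = File.read(f.path, mode: "r:UTF-8")

  # "BOM|UTF-8" asks Ruby to skip a leading BOM when one is present.
  clean = File.read(f.path, mode: "r:BOM|UTF-8")
end

puts raw.chomp == "Time Period"     # prints "false"
puts clean.chomp == "Time Period"   # prints "true"
```

Opening the file this way means the rest of the program never sees the BOM, so no manual stripping is needed.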