How do I determine the character set of a string?

I have several files that are in several different languages. I thought they were all encoded UTF-8, but now I'm not so sure. Some characters look fine, some do not. Is there a way that I can break out the strings and try to identify the character sets? Perhaps split on white space then identify each word? Finally, is there an easy way to translate characters from one set to UTF-8?

3 solutions

#1

If you don't know the character set for sure, you can basically only guess. utf8::valid might help you with that, but you can't really know for sure. If you know that anything that isn't Unicode must be in one specific character set (like Latin-1), you're lucky. If you have no idea, you're screwed. In any case, you should always assume the whole file is in the same character set unless otherwise specified. You will lose your sanity if you don't.
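
The "UTF-8, or else one known character set" case can be sketched roughly as follows. This is an illustration in Python rather than the Perl functions named above; the helper name `decode_with_fallback` and the choice of Latin-1 as the default fallback are assumptions made for the example.

```python
def decode_with_fallback(data, fallback="latin-1"):
    """Return (text, encoding_used) for raw bytes read from a file.

    Hypothetical helper: tries strict UTF-8 first, then the fallback.
    """
    try:
        # Strict decoding raises UnicodeDecodeError on any malformed sequence.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is in the fallback set.
        # Latin-1 maps every byte to a character, so this step never fails,
        # which also means it cannot tell you the guess was wrong.
        return data.decode(fallback), fallback


if __name__ == "__main__":
    print(decode_with_fallback("caf\u00e9".encode("utf-8")))    # UTF-8 input
    print(decode_with_fallback("caf\u00e9".encode("latin-1")))  # Latin-1 input
```

The whole byte string is decoded in one go, in line with the advice to treat a file as being in a single character set.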

As for your question about how to convert between character sets: Encode is there to do that for you.
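
Once the source character set is known, the conversion itself is a decode followed by an encode. The sketch below shows the idea in Python, not with the Encode module; `to_utf8` and `convert_file` are hypothetical helper names, and the source encoding must be supplied by the caller.

```python
def to_utf8(data, source_encoding):
    """Re-encode bytes from a known source encoding as UTF-8."""
    # decode: bytes -> text, using the encoding the caller says the data is in
    # encode: text -> bytes, this time as UTF-8
    return data.decode(source_encoding).encode("utf-8")


def convert_file(src_path, dst_path, source_encoding):
    """Read src_path as raw bytes and write a UTF-8 copy to dst_path."""
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_utf8(raw, source_encoding))


if __name__ == "__main__":
    # 0xE9 is "e with acute accent" in Latin-1; in UTF-8 it becomes 0xC3 0xA9.
    print(to_utf8(b"caf\xe9", "latin-1"))  # b'caf\xc3\xa9'
```

If the wrong source encoding is passed in, the result is still valid UTF-8 but contains the wrong characters, so this step is only as good as the guess that precedes it.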

#2

Determining whether a file is probably UTF-8 or not should be pretty easy. Determining the encoding if it is not UTF-8 would be very difficult in general.

If the file is encoded with UTF-8, the high bits of each byte should follow a pattern. If a character is one byte, its high bit will be cleared (zero). Otherwise, an n byte character (where n is 2–4) will have the high n bits of the first byte set to one, followed by a single zero bit. The following n - 1 bytes should all have the highest bit set and the second-highest bit cleared.
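
To make the pattern concrete, here is a small Python illustration that prints the bits of a few characters encoded as UTF-8. The helper name `show_bits` and the sample characters are chosen only for this example.

```python
def show_bits(ch):
    """Return the UTF-8 bytes of one character as space-separated bit strings."""
    return " ".join(format(b, "08b") for b in ch.encode("utf-8"))


if __name__ == "__main__":
    print(show_bits("A"))           # 01000001
    print(show_bits("\u00e9"))      # 11000011 10101001
    print(show_bits("\u20ac"))      # 11100010 10000010 10101100
    print(show_bits("\U0001F600"))  # 11110000 10011111 10011000 10000000
```

The first byte of each multi-byte character starts with as many one bits as the character has bytes, followed by a zero, and every later byte starts with `10`.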

If all the bytes in your file follow these rules, it's probably encoded with UTF-8. I say probably, because anyone can invent a new encoding that happens to follow the same rules, deliberately or by chance, but interprets the codes differently.
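
A minimal Python sketch of that check, written directly from the bit rules described above, might look like this. The function name `follows_utf8_bit_pattern` is an assumption for the example. It tests only the high-bit pattern, so it is looser than a real UTF-8 decoder, which also rejects overlong forms and other invalid code points.

```python
def follows_utf8_bit_pattern(data):
    """Return True if every byte in data fits the UTF-8 high-bit pattern."""
    i = 0
    while i < len(data):
        byte = data[i]
        if (byte & 0b10000000) == 0:
            n = 1                      # 0xxxxxxx: one-byte character
        elif (byte & 0b11100000) == 0b11000000:
            n = 2                      # 110xxxxx: start of a 2-byte character
        elif (byte & 0b11110000) == 0b11100000:
            n = 3                      # 1110xxxx: start of a 3-byte character
        elif (byte & 0b11111000) == 0b11110000:
            n = 4                      # 11110xxx: start of a 4-byte character
        else:
            return False               # stray continuation byte or invalid lead

        trailing = data[i + 1:i + n]
        if len(trailing) != n - 1:
            return False               # sequence cut off at end of data
        if any((b & 0b11000000) != 0b10000000 for b in trailing):
            return False               # each following byte must be 10xxxxxx
        i += n
    return True


if __name__ == "__main__":
    print(follows_utf8_bit_pattern("caf\u00e9".encode("utf-8")))    # True
    print(follows_utf8_bit_pattern("caf\u00e9".encode("latin-1")))  # False
```

In practice, a strict decode such as `data.decode("utf-8")` is the stricter test; the loop above is mainly useful for seeing where a file first breaks the pattern.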

Note that a file encoded with US-ASCII will follow these rules too, with the high bit of every byte being zero. It's okay to treat such a file as UTF-8, since the two are compatible in this range. If a file does not follow the rules, it's in some other encoding, and there's no inherent test to tell which one. You'll have to use some contextual knowledge to guess.
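
Putting the three outcomes together, a rough Python sketch could sort raw bytes into "ascii", "utf-8", or "unknown". The function name `classify` and the label strings are assumptions for the example, and "unknown" really does mean unknown: nothing here attempts to guess which other encoding is in use.

```python
def classify(data):
    """Classify raw bytes as 'ascii', 'utf-8', or 'unknown'."""
    # Every byte has its high bit clear: plain US-ASCII, which is also UTF-8.
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        # Strict decode: raises UnicodeDecodeError on any malformed sequence.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    return "utf-8"


if __name__ == "__main__":
    print(classify(b"hello"))                       # ascii
    print(classify("caf\u00e9".encode("utf-8")))    # utf-8
    print(classify("caf\u00e9".encode("latin-1")))  # unknown
```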

#3

Take a look at iconv

http://www.gnu.org/software/libiconv/

Text::Iconv
