What is the regex to check whether a character is Unicode?

Time: 2022-08-09 20:13:28

I'm trying to use the Windows API function IsTextUnicode to check whether a character input is Unicode or not, but it is sort of buggy. I figured it might be better to use a regex. However, I'm new to constructing regular expressions. What would be the regex to check whether a character is Unicode or not?

Thanks...

3 solutions

#1


1  

Well, that depends what you mean by ‘Unicode’. As the answers so far say, pretty much any character “is Unicode”.

Windows abuses the term ‘Unicode’ to mean the UTF-16LE encoding that the Win32 API uses internally. You can detect UTF-16 by looking for the Byte Order Mark at the front, bytes FF FE for UTF-16LE (or FE FF for UTF-16BE). It's possible to have UTF-16 text that is not marked with a BOM, but that's quite bad news as you can only detect it by pure guesswork.

Pure guesswork is what the IsTextUnicode function is all about. It looks at the input bytes and, by seeing how often common patterns turn up in it, guesses how likely it is that the bytes represent UTF-16LE or UTF-16BE-encoded characters. Since every sequence of bytes is potentially a valid encoding of characters(*), you might imagine this isn't very predictable or reliable. And you'd be right.

See Windows i18n guru Michael Kaplan's description of IsTextUnicode and why it's probably not a good idea.

In general you would want a more predictable way of guessing what encoding a set of bytes represents. You could try:

  • if it begins FF FE, it's UTF-16LE, what Windows thinks of as ‘Unicode’;
  • if it begins FE FF, it's UTF-16BE, what Windows equally misleadingly calls ‘reverse’ Unicode;
  • otherwise check the whole string for invalid UTF-8 sequences. If there are none, it's probably UTF-8 (or just ASCII);
  • otherwise try the system default codepage.
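
As a rough illustration, the checklist can be sketched in Python using the BOM values given earlier (FF FE for UTF-16LE, FE FF for UTF-16BE). The function name guess_encoding is made up for this example, and locale.getpreferredencoding is only an assumed stand-in for "the system default codepage"; treat it as a sketch, not a definitive detector.

```python
import codecs
import locale


def guess_encoding(data: bytes) -> str:
    """Guess an encoding name: UTF-16 BOM first, then UTF-8, then a fallback."""
    # Steps 1 and 2: look for a UTF-16 byte order mark at the front.
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    # Step 3: a strict decode raises on any invalid UTF-8 sequence.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 4: fall back to the platform's preferred encoding
    # (assumption: this approximates the system default codepage).
    return locale.getpreferredencoding(False)
```

Note that plain ASCII input is reported as "utf-8" here, since ASCII is a subset of UTF-8 and the two cannot be told apart from the bytes alone.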

(*: actually not quite true. Apart from noncharacters like U+FFFF, there are also many sequences of UTF-16 code units that aren't valid characters, thanks to the ‘surrogates’ approach to encoding characters outside the 16-bit range. However, IsTextUnicode doesn't know about those anyway, as it predates the astral planes.)
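
To make the surrogate point concrete, here is a small Python sketch. It leans on the fact that Python's strict UTF-16 decoder rejects unpaired surrogates; the helper name is_well_formed_utf16le and the sample byte strings are invented for the example.

```python
def is_well_formed_utf16le(data: bytes) -> bool:
    """Return True if data decodes as UTF-16LE with every surrogate paired."""
    try:
        data.decode("utf-16-le")  # strict mode rejects lone surrogates
        return True
    except UnicodeDecodeError:
        return False


# U+1F600 lies outside the 16-bit range, so it needs a surrogate pair.
paired = "\U0001F600".encode("utf-16-le")
# A high surrogate (U+D800) followed by an ordinary 'A' is not valid text.
lone_high = b"\x00\xd8" + "A".encode("utf-16-le")
```

The same check also fails for an odd number of bytes, since a UTF-16 code unit is always two bytes long.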

#2


1  

Every character you'll encounter is part of Unicode. For instance, Latin 'a' is U+0061. This is especially true on Windows, which natively uses Unicode with the UTF-16 encoding.
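
A quick way to see the difference between the code point and its UTF-16 bytes, sketched in Python (the variable names are arbitrary):

```python
ch = "a"
code_point = ord(ch)               # the character's Unicode code point
label = f"U+{code_point:04X}"      # conventional U+XXXX notation
utf16le = ch.encode("utf-16-le")   # two bytes, low byte first
```

Here label is the character's identity in Unicode, while utf16le is just one possible byte representation of it.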

The Microsoft function IsTextUnicode is named rather unfortunately. It could more accurately be described as GuessTextEncodingFromRawBytes(). I suspect that your real problem is not the interpretation of raw bytes, since you already know it's one character.

#3


1  

I think you're mixing up two different concepts. A character and its encoding are not the same. Some characters (like A) are encoded identically in ASCII or latin-1 and UTF-8, some aren't, some can only be encoded in UTF-8 etc.
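
A small Python sketch of that distinction, with three sample characters picked for illustration (A, é as U+00E9, and the euro sign as U+20AC); the helper try_encode is hypothetical:

```python
from typing import Optional


def try_encode(ch: str, encoding: str) -> Optional[bytes]:
    """Return the encoded bytes, or None if the encoding cannot express ch."""
    try:
        return ch.encode(encoding)
    except UnicodeEncodeError:
        return None


encodings = ("ascii", "latin-1", "utf-8")
# Map each sample character to its bytes (or None) in every encoding.
table = {
    ch: {enc: try_encode(ch, enc) for enc in encodings}
    for ch in ("A", "\u00e9", "\u20ac")
}
```

In table, A comes out as the same single byte everywhere, é has different bytes in latin-1 and UTF-8 and none in ASCII, and the euro sign can only be encoded in UTF-8 out of these three.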

IsTextUnicode() tries to guess the encoding from a stream of raw bytes.

If, on the other hand, you already have a character representation, and you wish to find out whether it can be natively expressed as ASCII or latin-1 or some other encoding, then you could indeed look at the character range ([\u0000-\u007F] for ASCII).
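
For example, in Python the range check could look like the sketch below. The ASCII class is the one quoted above; the Latin-1 class ([\u0000-\u00FF]) and the function names are additions for illustration.

```python
import re

# Character classes covering the code points each encoding can express.
ASCII_CHAR = re.compile(r"[\u0000-\u007F]")
LATIN1_CHAR = re.compile(r"[\u0000-\u00FF]")


def fits_ascii(ch: str) -> bool:
    """True if ch is exactly one character in the ASCII range."""
    return ASCII_CHAR.fullmatch(ch) is not None


def fits_latin1(ch: str) -> bool:
    """True if ch is exactly one character in the Latin-1 range."""
    return LATIN1_CHAR.fullmatch(ch) is not None
```

Using fullmatch means an empty string or a multi-character string is rejected, which fits the "single character input" case in the question.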

Lastly, there are some invalid codes (like \uFFFE): possible byte representations that are not allowed as Unicode characters. But I don't think this is what you're looking for.
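
If you ever do need to spot such codes, a sketch in Python follows. It assumes the Unicode definition of noncharacters (U+FDD0 to U+FDEF, plus the last two code points of every plane, such as U+FFFE and U+FFFF); the function name is_noncharacter is made up for the example.

```python
def is_noncharacter(ch: str) -> bool:
    """True if the single character ch is a Unicode noncharacter."""
    cp = ord(ch)
    # U+FDD0..U+FDEF, or a code point ending in FFFE/FFFF in any plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
```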
