How to detect Thai language in a SQL query

I have a string column in a table, and some of its values contain Thai text. An example of a Thai string is:

อักษรไทย

Is there a way to query for strings like this in a column?

2 Solutions

#1

You could search for strings that start with a character in the Thai Unicode block (i.e. between U+0E01 and U+0E5B):

WHERE string BETWEEN 'ก' AND '๛'

Of course, this won't match strings that start with some other character and only contain Thai later on, such as those that start with a number. For that, you would have to use a much less performant regular expression:

WHERE string RLIKE '[ก-๛]'

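Note, however, the warning in the manual (it describes MySQL before 8.0.4; later versions use ICU for regular expressions and are multi-byte safe):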

Warning

The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
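To make the difference between the two predicates concrete, here is a rough client-side sketch in Python. It is not what the database does internally: `BETWEEN` compares whole strings under the column's collation, and the function names here are made up for illustration. It only shows the same U+0E01 to U+0E5B range applied to the first character versus any character.

```python
import re

# Range of assigned Thai code points cited above: U+0E01 through U+0E5B.
THAI_FIRST = "\u0e01"
THAI_LAST = "\u0e5b"
THAI_CHAR = re.compile("[\u0e01-\u0e5b]")


def starts_with_thai(s):
    """Rough analogue of the BETWEEN test: only the first character counts."""
    return bool(s) and THAI_FIRST <= s[0] <= THAI_LAST


def contains_thai(s):
    """Rough analogue of the RLIKE test: a Thai character anywhere in s."""
    return THAI_CHAR.search(s) is not None
```

A value such as '123 อักษรไทย' fails the first check but passes the second, which is exactly the case the regular expression is needed for.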

#2

You can convert back and forth between character sets.

where convert(string, 'AL32UTF8') =
      convert(convert(string, 'TH8TISASCII'), 'AL32UTF8', 'TH8TISASCII' )

This will be true if string consists only of Thai and ASCII characters, so if you add

AND convert(string, 'AL32UTF8') != convert(string, 'US7ASCII')

you filter out the strings made only of ASCII, leaving the strings that contain Thai.

Unfortunately, this will not work if your strings contain something outside of ASCII and Thai.

Note: Some of the convert calls may be superfluous, depending on your database's default encoding.
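The round-trip idea in this answer can also be tried outside the database. Below is a minimal Python sketch of the same logic, assuming Python's `tis_620` codec is a close enough stand-in for the `TH8TISASCII` character set (the two are not guaranteed to match byte for byte, and the function names are hypothetical). Unmappable characters are replaced during encoding, so a string survives the round trip only if every character exists in the target character set.

```python
def survives_round_trip(s, codec):
    """True if encoding s to codec and decoding it back leaves s unchanged."""
    # errors="replace" substitutes "?" for characters the codec cannot
    # represent, so the round-tripped text differs from the original.
    return s.encode(codec, errors="replace").decode(codec) == s


def thai_and_ascii_only_with_thai(s):
    """Mirror of the two SQL conditions combined with AND."""
    only_thai_and_ascii = survives_round_trip(s, "tis_620")
    only_ascii = survives_round_trip(s, "ascii")
    return only_thai_and_ascii and not only_ascii
```

As with the SQL version, a string that mixes Thai with anything outside ASCII and Thai (Japanese text, accented Latin letters) is rejected.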
