Regular expressions in Hive QL (RLIKE) - how is the performance?

I'm wondering whether, and how, I can improve the regex I'm using in a query. I have a set of identifiers for certain user groups. They come in two main formats:

X123 or XY12 (type 1)
any two-letter combo, excluding XY (type 2)

Type 1 groups are always of length 4: either the letter X followed by a number between 100 and 999 (inclusive), or XY followed by a number between 0 and 99 (zero-padded to length 2).

Type 2 groups are 2-letter strings, with any letters allowed, excluding XY (although my query doesn't specify this).

A user can belong to multiple groups, in which case the groups are separated by a pound symbol (#). Here's an example:

groups     user     age
X124       john     23
XY22#AB    mike     33
AB         peter    21
X122#XY01  francis  43

I want to count the rows in which at least one group in the second format appears, i.e. where the user is not exclusively a member of groups in the first format.

I need to catch all rows (i.e. users) which don't belong exclusively to the first type of group. In the example above, I want to exclude users john and francis because they are members only of type 1 groups. On the other hand, mike is OK because he's a member of the AB group (i.e. a group of type 2).

I'm currently doing it like this:


select 
  count(*)
from 
  users
where
  groups not rlike '^(X[Y1-9][0-9]{2,2})(#X[Y1-9][0-9]{2,2})*$'
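To make the filter's behaviour concrete, here is a minimal sketch that applies the same pattern to the four sample rows outside Hive. It uses Python's re module as a stand-in for Hive's Java-based regex engine (an assumption; the two agree on the constructs used here), and the names ROWS, ONLY_TYPE1 and not_only_type1 are made up for illustration.

```python
import re

# Sample rows copied from the table in the question: (groups, user).
ROWS = [
    ("X124", "john"),
    ("XY22#AB", "mike"),
    ("AB", "peter"),
    ("X122#XY01", "francis"),
]

# The pattern from the query: the whole string is a '#'-separated
# list made up only of type 1 groups.
ONLY_TYPE1 = re.compile(r"^(X[Y1-9][0-9]{2,2})(#X[Y1-9][0-9]{2,2})*$")


def not_only_type1(groups):
    # Mirrors `groups not rlike '...'`; re.search stands in for RLIKE here.
    return ONLY_TYPE1.search(groups) is None


kept = [user for groups, user in ROWS if not_only_type1(groups)]
print(kept)  # ['mike', 'peter']
```

On this sample the filter keeps mike and peter and drops john and francis, which is the intended result.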

Is this bad performance-wise? And how should I approach fixing it?

1 solution

#1

I want to count rows in which at least one group in second format appears.


It seems a bit simpler, then, to select the rows where groups rlike:

\b(?:(?!XY)[A-Z]{2})\b

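\b is a word boundary. It doesn't consume a character; instead it asserts a position with a word character (letter, digit or underscore) on one side and a non-word character, or the start or end of the string, on the other. Since # is not a word character, the two \b anchors ensure that the two letters form a whole group rather than part of a longer one. (?!XY) is a negative lookahead that rejects the pair XY.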
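As a quick check of the pattern above, the following sketch runs it against the sample values from the question, plus a bare XY to exercise the lookahead. It uses Python's re module rather than Hive (an assumption; \b, (?:...) and (?!...) behave the same way in Java's regex engine for this input), and TYPE2, has_type2 and samples are illustrative names. In a Hive string literal the backslashes would probably need to be doubled ('\\b...'), which is worth verifying on your own setup.

```python
import re

# The answer's pattern: a standalone pair of capital letters that is not XY.
TYPE2 = re.compile(r"\b(?:(?!XY)[A-Z]{2})\b")


def has_type2(groups):
    # True when at least one type 2 group appears anywhere in the string.
    return TYPE2.search(groups) is not None


samples = ["X124", "XY22#AB", "AB", "X122#XY01", "XY"]
for value in samples:
    print(value, has_type2(value))
```

Only XY22#AB and AB are reported as containing a type 2 group. X124 and X122#XY01 fail because no two letters stand alone between word boundaries, and the bare XY is rejected by the lookahead.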

