How to use the Hive REGEXP_EXTRACT() function to remove non-alphanumeric or non-numeric characters

I've been trying to figure out how to remove multiple non-alphanumeric or non-numeric characters, or return only the numeric characters from a string. I've tried:

SELECT
regexp_extract('X789', '[0-9]', 0)
FROM
table_name

But it returns '7', not '789'.

I've also tried to remove non-numeric characters using NOT MATCH syntax ^((?!regexp).)*$:

SELECT
REGEXP_REPLACE('X789', '^((?![0-9]).)*$', '')
FROM
jav_test_ii

Can regexp_extract return multiple matches? What I'm really trying to do is clean my data so that it contains only numbers, or only alphanumeric characters. The following seems to help remove bad characters, but it's not a range of characters the way [0-9] is: regexp_replace(string, '�','')

EDIT: The query below was able to return '7789', which is exactly what I was looking for.

SELECT
regexp_replace("7X789", "[^0-9]+", "")
FROM
table_name

1 Answer

#1

See also this hive regexp_extract weirdness

I think regexp_extract will only return the group number stated in the 3rd parameter.
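To make the group-number point concrete, here is a rough sketch. It uses Python's re module as a stand-in for Hive's regex engine, which is an assumption on my part, and extract_group is a hypothetical helper, not a Hive function. The two-group pattern is also just an illustration.

```python
import re

def extract_group(text, pattern, group_index):
    """Hypothetical stand-in for regexp_extract(text, pattern, group_index).

    Returns the requested group of the FIRST match only.
    No-match handling (None here) is an assumption, not Hive's documented behavior.
    """
    match = re.search(pattern, text)
    return match.group(group_index) if match else None

# Group 0 is the whole match; groups 1 and 2 are the parenthesized parts.
pattern = r"([a-zA-Z]+)([0-9]+)"
whole = extract_group("XYZ789", pattern, 0)
letters = extract_group("XYZ789", pattern, 1)
digits = extract_group("XYZ789", pattern, 2)

print(whole, letters, digits)
```

So the third argument picks which group of that one match comes back; it does not ask for more matches.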

regexp_extract seems to stop at the first match rather than continuing through the rest of the string.

I don't know about the replace counterpart.

It might work on non-alphanumeric data, though, if you fed it something like this:

REGEXP_REPLACE(error_code, '[^a-zA-Z0-9]+', '')
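As a quick sanity check of the negated character class, here is a small sketch, again using Python's re.sub as an assumed stand-in for REGEXP_REPLACE. The helper name strip_chars and the sample value 'ERR-42_x!' are made up for illustration; '7X789' is the string from the question.

```python
import re

def strip_chars(text, pattern):
    """Hypothetical stand-in for regexp_replace(text, pattern, '').

    Deletes EVERY run of characters that matches the pattern.
    """
    return re.sub(pattern, "", text)

# [^0-9]+ matches runs of anything that is not a digit.
digits_only = strip_chars("7X789", r"[^0-9]+")

# [^a-zA-Z0-9]+ matches runs of anything that is not a letter or digit.
alnum_only = strip_chars("ERR-42_x!", r"[^a-zA-Z0-9]+")

print(digits_only, alnum_only)
```

Because the replacement is applied to every match, this cleans the whole value in one call, which is what the question's EDIT ended up using.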

Also, for extract, see the link above and you can change it to

regexp_extract('X789', '[0-9]+', 0) for multiple numbers.

or

regexp_extract('XYZ789', '[a-zA-Z]+', 0) for multiple letters.
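Here is a rough sketch of why the + matters, and of where extract still falls short. Python's re.search is an assumed stand-in for the first-match behavior of regexp_extract, and first_match is a hypothetical helper name.

```python
import re

def first_match(text, pattern):
    """Hypothetical stand-in for regexp_extract(text, pattern, 0).

    Returns the whole first match; None on no match is an assumption.
    """
    match = re.search(pattern, text)
    return match.group(0) if match else None

# Without a quantifier, the pattern matches exactly one character.
single_digit = first_match("X789", r"[0-9]")

# With +, the first match is the whole run of digits.
digit_run = first_match("X789", r"[0-9]+")

# A non-digit in the middle ends the first run, so only '7' comes back.
split_runs = first_match("7X789", r"[0-9]+")

print(single_digit, digit_run, split_runs)
```

When the digits are split across several runs, as in '7X789', extract only gives the first run, so the regexp_replace approach with a negated class is the better fit for cleaning.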
