Using a HiveQL regular expression to extract all characters before a period?

I have a table that looks like:

bl.ah
foo.bar
bar.fight

And I'd like to use HiveQL's regexp_extract to return

bl
foo
bar

1 solution

#1

Here is what the documentation says about regexp_extract:

regexp_extract(string subject, string pattern, int index)

Returns the string extracted using the pattern. For example, regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc. The 'index' parameter is the Java regex Matcher group() method index. See docs/api/java/util/regex/Matcher.html for more information on the 'index' or Java regex group() method.
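
Since the docs describe 'index' as the Java regex Matcher group() index, the example from the docs can be tried in plain Java. This is only a rough sketch: the class GroupIndexSketch and the helper extract are hypothetical names, not Hive code, and returning an empty string when nothing matches is an assumption of the sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupIndexSketch {
    // Hypothetical stand-in for regexp_extract: find the first match of
    // `pattern` in `subject` and return the capture group numbered `index`.
    // Returning "" when there is no match is an assumption of this sketch.
    static String extract(String subject, String pattern, int index) {
        Matcher m = Pattern.compile(pattern).matcher(subject);
        if (!m.find()) {
            return "";
        }
        String group = m.group(index);
        return group == null ? "" : group;
    }

    public static void main(String[] args) {
        String regex = "foo(.*?)(bar)";
        System.out.println(extract("foothebar", regex, 0)); // foothebar (whole match)
        System.out.println(extract("foothebar", regex, 1)); // the (first group)
        System.out.println(extract("foothebar", regex, 2)); // bar (second group)
    }
}
```

Index 0 is the whole match, and each pair of parentheses in the pattern gets the next index, counted from the left.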

So, if you have a table with a single column (let's call it description for our example) you should be able to use regexp_extract as follows to get the data before a period, if one exists, or the entire string in the absence of a period:

regexp_extract(description,'^([^\.]+)\.?',1)

The components of the regex are as follows:

^ start of string

([^\.]+) any non-period character one or more times, in a capture group

\.? an optional period (zero or one occurrence)

Because the part of the string we're interested in will be in the first (and only) capture group, we refer to it by passing the index parameter a value of 1.
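
To check the pattern against the sample rows without a Hive session, the same regex can be run through java.util.regex, which the docs name as the engine behind the index parameter. Treat this as a sketch under assumptions: PrefixBeforePeriodSketch and beforePeriod are made-up names, and the empty-string result for a non-matching input is an assumption rather than confirmed Hive behavior.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrefixBeforePeriodSketch {
    // The regex from the answer, ^([^\.]+)\.? , written as a Java string
    // literal (each backslash is doubled for the Java compiler).
    private static final Pattern BEFORE_PERIOD = Pattern.compile("^([^\\.]+)\\.?");

    // Hypothetical helper: return capture group 1, i.e. everything before
    // the first period, or the whole value when it contains no period.
    // Returning "" when the pattern does not match is an assumption.
    static String beforePeriod(String description) {
        Matcher m = BEFORE_PERIOD.matcher(description);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // The three rows come from the question; "noperiod" is an extra
        // value added here to exercise the no-period case.
        String[] rows = {"bl.ah", "foo.bar", "bar.fight", "noperiod"};
        for (String row : rows) {
            System.out.println(row + " -> " + beforePeriod(row));
        }
    }
}
```

With these inputs the sketch prints bl, foo, bar and noperiod on the right-hand side, matching the output the question asks for.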
