Using a HiveQL regular expression to extract all characters before a period?

Time: 2022-12-28 20:22:51

I have a table that looks like:

bl.ah
foo.bar
bar.fight

And I'd like to use HiveQL's regexp_extract to return

bl
foo
bar

1 solution

#1

The docs describe regexp_extract as follows:

regexp_extract(string subject, string pattern, int index)

Returns the string extracted using the pattern. For example, regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc. The 'index' parameter is the Java regex Matcher group() method index. See docs/api/java/util/regex/Matcher.html for more information on the 'index' or Java regex group() method.
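
To make the 'index' parameter concrete, here is a small sketch that runs the docs' own example with each group index. The column aliases are made up for illustration, and the FROM-less SELECT assumes a reasonably recent Hive version.

```sql
-- Same subject and pattern as the docs example; only the index changes.
SELECT
  regexp_extract('foothebar', 'foo(.*?)(bar)', 0) AS whole_match,  -- 'foothebar' (index 0 is the entire match)
  regexp_extract('foothebar', 'foo(.*?)(bar)', 1) AS group_1,      -- 'the' (first capture group)
  regexp_extract('foothebar', 'foo(.*?)(bar)', 2) AS group_2;      -- 'bar' (second capture group)
```

Capture groups are numbered by the position of their opening parenthesis, left to right.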

So, if you have a table with a single column (let's call it description for our example) you should be able to use regexp_extract as follows to get the data before a period, if one exists, or the entire string in the absence of a period:

regexp_extract(description,'^([^\.]+)\.?',1)

The components of the regex are as follows:

  • ^ start of string

  • ([^\.]+) one or more non-period characters, in a capture group

  • \.? an optional period (zero or one occurrence)

Because the part of the string we're interested in will be in the first (and only) capture group, we refer to it by passing the index parameter a value of 1.
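
Putting it together, a minimal sketch of a full query over the sample values from the question. The inline subquery, the alias sample_rows, the output column before_period, and the extra 'noperiod' row are assumptions added for illustration; in practice you would select from your own table. It also assumes a Hive version that allows SELECT without FROM inside a subquery.

```sql
-- Inline rows stand in for the real table; 'noperiod' shows the no-period case.
SELECT description,
       regexp_extract(description, '^([^\.]+)\.?', 1) AS before_period
FROM (
  SELECT 'bl.ah' AS description
  UNION ALL SELECT 'foo.bar'
  UNION ALL SELECT 'bar.fight'
  UNION ALL SELECT 'noperiod'
) sample_rows;
-- Expected before_period values: bl, foo, bar, noperiod
```

Depending on how your Hive client treats backslashes in string literals, '\.' may reach the regex engine as a plain '.'. Group 1 should come out the same either way, because a period inside a character class is always literal.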