Redshift regular expression for domain extraction

Time: 2021-08-20 23:05:57

I'm trying to form a regular expression for REGEXP_SUBSTR (Redshift) which will extract the sub-domain & domain part from any given URL.

I tried many suggestions from Stack Overflow (regular-expression-extract-subdomain-domain, getting-parts-of-a-url-regex, how-to-get-domain-name-from-url, and others). Some of them work in a regex validator but don't work on Redshift.

The regular expression should handle URLs both with and without an http/https prefix.

Is there any other way of extracting sub-domain & domain from any given URL using a regular expression?

1 Solution

#1

After a ton of experimentation, this is what I use:

REPLACE(REGEXP_SUBSTR(url,'//[^/\\\,=@\\+]+\\.[^/:;,\\\\\(\\)]+'),'//','')

The pattern has to match the leading double slash, which is then removed with REPLACE, because the regex support in Redshift is quite basic.
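
To make the match-then-strip idea easier to follow, here is a rough Python sketch of the same two steps. It is only an approximation: the pattern is my own transcription into Python's `re` syntax, the name `extract_host` is hypothetical, and Redshift's regex engine and string-literal escaping may behave differently in detail.

```python
import re

# Approximate Python transcription of the pattern above (an assumption:
# Redshift's engine and backslash escaping may differ from Python's `re`).
# First class: host characters up to the last usable dot.
# Second class: the final label, stopping at '/', ':', ';', ',' and similar.
HOST_AFTER_SLASHES = re.compile(r"//[^/\\,=@+]+\.[^/:;,\\()]+")


def extract_host(url):
    """Return the host that follows '//', or '' when nothing matches."""
    match = HOST_AFTER_SLASHES.search(url)
    if match is None:
        # Mirrors REGEXP_SUBSTR, which yields an empty string on no match.
        return ""
    # Same role as the outer REPLACE(..., '//', ''): drop the matched slashes.
    return match.group(0).replace("//", "")


if __name__ == "__main__":
    for sample in (
        "https://sub.example.com/path?x=1",
        "http://example.com:8080/x",
        "example.com/path",
    ):
        print(repr(sample), "->", repr(extract_host(sample)))
```

In this sketch a URL without a scheme (the last sample) produces an empty string, because the pattern insists on the double slash. If scheme-less URLs matter, the input would presumably need a `//` prepended first, or the pattern adjusted.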

FWIW, you'll notice that this is very different from the regex provided by Jeff Barr in the Redshift UDF intro - that regex produces nothing for me.
