Is there a way to split a column into tokens and join them again, like you can in other programming languages such as Python, Java, or Ruby?
I have a column with urls such as "http://www.Yahoo.com", and I want to extract "Yahoo.com" from it (the main domain, NOT the subdomain). The urls can be of the forms:
- http://www.domain.com
- http://domain.com
- http://domain.com/page/page1
- http://www.domain.com/
- http://www.domain.com/page/page2
I was planning on using a regex to extract everything after http:// and before the next slash, then splitting the URL on the period (.) and joining the last 2 tokens.
With the regex, I can extract www.yahoo.com from http://www.yahoo.com. With the splits/joins, I can get yahoo.com from www.yahoo.com. Problem is I don't know how to do split/joins with Postgres.
Anyone know of a way? Or better alternative?
3 Answers
#1 (score: 3)
This isn't quite the approach you asked for, but should get what you want:
vinod=# select * from table;
               url
----------------------------------
 http://www.domain.com
 http://domain.com
 http://domain.com/page/page1
 http://www.domain.com/page/page2
 http://www.domain.com/
(5 rows)

vinod=# select substring(substring(url from 'http[s]*://([^/]+)') from '\w+\.\w+$') from table;
 substring
------------
 domain.com
 domain.com
 domain.com
 domain.com
 domain.com
(5 rows)
The inner substring command pulls out the full domain, and the outer substring command pulls out the last two fragments. The PostgreSQL split and join commands are not as powerful as in your average scripting language, so I tend to do this kind of stuff after I pull things out of the DB, if I can.
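Since the answer suggests doing the split and join after the rows leave the database, here is a minimal Python sketch of that client-side step. It mirrors the two-stage idea above: isolate the host, then keep the last two dot-separated tokens. The helper name last_two_labels and the sample list are made up for illustration, and the sketch assumes every URL starts with http:// or https://.

```python
import re

# Same idea as the inner substring() call: capture everything between
# the scheme and the next slash.
HOST_RE = re.compile(r"https?://([^/]+)")


def last_two_labels(url):
    """Return the last two dot-separated tokens of the URL's host, or None."""
    match = HOST_RE.match(url)
    if match is None:
        return None  # not an http(s) URL
    tokens = match.group(1).split(".")  # split the host on periods
    return ".".join(tokens[-2:])        # join the last two tokens


# Hypothetical rows, as they might come back from the url column.
sample_urls = [
    "http://www.domain.com",
    "http://domain.com",
    "http://domain.com/page/page1",
    "http://www.domain.com/page/page2",
    "http://www.domain.com/",
]

for url in sample_urls:
    print(url, "->", last_two_labels(url))
```

Like the SQL version, this simply keeps the last two tokens, so it makes no attempt to handle hosts whose main domain has more than two parts.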
#2 (score: 0)
You can match them with \w+.[^.]+$
http://www.domain.com -> domain.com
http://domain.com -> domain.com
http://domain.com/page/page1 -> domain.com/page/page1
http://www.domain.com/ -> domain.com/
http://www.domain.com/page/page2 -> domain.com/page/page2
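To check how this pattern behaves, the short sketch below runs it through Python's re module against the same URLs. This is only an illustration: the names tail_match and samples are made up, and Python's regex engine may differ from PostgreSQL's in other cases. Because the pattern is anchored at the end of the string and only excludes periods after the first one, any path stays attached to the result.

```python
import re

# The pattern from the answer, unchanged. Note the unescaped "." matches
# any single character, not only a literal period.
PATTERN = re.compile(r"\w+.[^.]+$")


def tail_match(url):
    """Return the text matched by PATTERN in url, or None if nothing matches."""
    found = PATTERN.search(url)
    return found.group(0) if found else None


samples = [
    "http://www.domain.com",
    "http://domain.com",
    "http://domain.com/page/page1",
    "http://www.domain.com/",
    "http://www.domain.com/page/page2",
]

for url in samples:
    print(url, "->", tail_match(url))
```

If only the domain is wanted, the host would need to be isolated first, as in the nested substring approach of answer #1.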
#3 (score: 0)
Splitting things into tokens can be accomplished in quite a few ways:
- regexp_split_to_table / regexp_split_to_array
- string_to_array (for simple fixed delimiter splits)
- Manual substring extraction or substring(... from 'pattern')
- Full text search's to_tsvector and to_tsquery
- Procedural language libraries, like Perl or Python URL libraries, Python + NLTK for natural language processing, etc.
In this case you could do your URL splitting with a regular expression using regexp_split_..., and that's probably OK for many uses, but probably not this one. Consider:
- My domain, ringerc.id.au (that is the "main" domain)
- www.ecu.edu.au ("main" domain is ecu.edu.au)
- www.transperth.wa.gov.au ("main" domain is transperth.wa.gov.au)
- tartarus.uwa.edu.au ("main" domain is uwa.edu.au)
Good luck dealing with all the national registry and sub-registry variations using a regular expression. Use a proper URL parser to extract the domain, then a proper domain-aware library to work out what the "main" domain is for your purposes. I'd recommend using plperl and the URI::Split or URI modules to start with, or the URL parser of whatever supported procedural language (Python, Tcl, whatever) you want. Then find a suitable library for that language that can identify domains and subdomains meaningfully according to the criteria you want and use that, rather than just relying on a regular expression.
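As a rough illustration of the parser-plus-suffix-knowledge idea, here is a Python sketch that uses the standard library's urlsplit to get the host and then strips it down using a table of known registry suffixes. The function name main_domain and the KNOWN_SUFFIXES set are assumptions made up for this example; the set is hand-written to cover only the domains listed above, whereas a real solution would rely on a maintained source such as the Public Suffix List or a library built on it.

```python
from urllib.parse import urlsplit

# Illustrative only: a tiny hand-written suffix table chosen to cover the
# examples in this answer. It is NOT a complete or authoritative list.
KNOWN_SUFFIXES = {"com", "au", "id.au", "edu.au", "gov.au", "wa.gov.au"}


def main_domain(url):
    """Return the "main" domain of url, or None if the URL has no host."""
    host = urlsplit(url).hostname  # lower-cased host without port, or None
    if not host:
        return None
    labels = host.split(".")
    # Try the longest candidate suffix first, then keep exactly one label
    # in front of the first suffix that is recognised.
    for start in range(1, len(labels)):
        suffix = ".".join(labels[start:])
        if suffix in KNOWN_SUFFIXES:
            return ".".join(labels[start - 1:])
    return host  # no known suffix: fall back to the whole host


for url in [
    "http://www.Yahoo.com",
    "http://ringerc.id.au",
    "http://www.ecu.edu.au/",
    "http://www.transperth.wa.gov.au/page",
    "http://tartarus.uwa.edu.au",
]:
    print(url, "->", main_domain(url))
```

Checking the longest suffix first is what lets www.transperth.wa.gov.au resolve to transperth.wa.gov.au instead of wa.gov.au. The same logic could be wrapped in a procedural-language function inside the database, but that part is left out here.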
When joining you similarly have many options:
- array_to_string
- string_agg
- The || concatenation operator
- Procedural language string operations and libraries
For URL work, again I'd suggest doing this with a PL that has a proper native URL library.