I need a regex pattern that extracts all hashtags from tweets stored in a table. My data looks like this:
select regexp_substr('My twwet #HashTag1 and this is the #SecondHashtag sample','#\S+')
from dual
It only returns #HashTag1, not #SecondHashtag.
I need output like #HashTag1 #SecondHashtag
Thanks
1 solution
#1
2
You can use regexp_replace to remove everything that doesn't match your pattern:
with t (col) as (
select 'My twwet #HashTag1 and this is the #SecondHashtag sample, #onemorehashtag'
from dual
)
select
regexp_replace(col, '(#\S+\s?)|.', '\1')
from t;
Produces:
#HashTag1 #SecondHashtag #onemorehashtag
regexp_substr returns only one match per call. To get every match, you can turn the string into rows using connect by:
with t (col) as (
select 'My twwet #HashTag1 and this is the #SecondHashtag sample, #onemorehashtag'
from dual
)
select
regexp_substr(col, '#\S+', 1, level)
from t
connect by regexp_substr(col, '#\S+', 1, level) is not null;
Returns:
#HashTag1
#SecondHashtag
#onemorehashtag
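One caveat if the tweets live in a real table with more than one row: without an extra condition, connect by also links every row to every other row at each level, so the result fills up with duplicates. A common workaround, sketched here, keeps the hierarchy inside a single row. The id column and the two sample rows are made up for illustration; substitute whatever key your table actually has.

```sql
-- Hypothetical two-row sample; "id" stands in for the table's real key column.
with t (id, col) as (
  select 1, 'My tweet #HashTag1 and the #SecondHashtag sample' from dual
  union all
  select 2, 'Another tweet with #onemorehashtag' from dual
)
select id,
       regexp_substr(col, '#\S+', 1, level) as hashtag
from t
connect by regexp_substr(col, '#\S+', 1, level) is not null
       and prior id = id                   -- stay within the same tweet
       and prior sys_guid() is not null    -- keeps Oracle from reporting a CONNECT BY loop
order by id, level;
```

The prior sys_guid() is not null condition is only there so that Oracle does not treat the row connecting to itself as a cycle.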
EDIT:
\S matches any non-whitespace character. It would be better to use \w, which matches a-z, A-Z, 0-9 and _.
As commented by @mathguy and from this site: a hashtag starts with a letter, followed by any number of alphanumeric characters or underscores.
So the pattern #[[:alpha:]]\w* will work better.
with t (col) as (
select 'My twwet #HashTag1, this is the #SecondHashtag. #onemorehashtag'
from dual
)
select
regexp_substr(col, '#[[:alpha:]]\w*', 1, level)
from t
connect by regexp_substr(col, '#[[:alpha:]]\w*', 1, level) is not null;
Produces:
#HashTag1
#SecondHashtag
#onemorehashtag
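If you want the tags back on a single line, as in the question, one option is to aggregate the rows again with listagg. This is a sketch that assumes Oracle 11gR2 or later (where listagg is available); the hashtags alias is just a name chosen for illustration.

```sql
with t (col) as (
  select 'My twwet #HashTag1, this is the #SecondHashtag. #onemorehashtag'
  from dual
)
select
  -- collapse the per-level matches into one space-separated string,
  -- keeping them in the order they appear in the tweet
  listagg(regexp_substr(col, '#[[:alpha:]]\w*', 1, level), ' ')
    within group (order by level) as hashtags
from t
connect by regexp_substr(col, '#[[:alpha:]]\w*', 1, level) is not null;
```

Unlike the regexp_replace version, this keeps the stricter hashtag pattern and does not depend on the whitespace that follows each tag.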