I have HTML content in the following text:
"This is my text to be parsed which contains url
http://someurl.com?param1=foo&params2=bar
<a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test 1q2w
</a> <img src="http://someasseturl.com/abc.jpeg"/>
<span>i have a link too http://someurlinsidespan.com?xyz=abc </span>
"
I need a regex that converts plain URLs to hyperlinks without tampering with existing hyperlinks.
Expected result:
"This is my text to be parsed which contains url
<a href="http://someurl.com?param1=foo&params2=bar">
http://someurl.com?param1=foo&params2=bar</a>
<a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test
1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
<span>i have a link too <a href="http://someurlinsidespan.com?xyz=abc">http://someurlinsidespan.com?xyz=abc</a> </span> "
4 Solutions
#1
2
Maybe you could do a search-and-replace first to remove the HTML elements. I don't know Ruby, but the regex would be something like /<(\w+).*?>.*?<\/\1>/. But it might be tricky if you have nested elements of the same type.
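A minimal Ruby sketch of that idea, assuming the goal is only to find the plain URLs (it does not rewrite them in place). The method name urls_outside_elements and the simplified URL pattern are assumptions made for illustration, not part of the answer:

```ruby
# Hypothetical helper: drop HTML elements first, then collect the URLs that remain.
def urls_outside_elements(html)
  # Remove paired elements together with their contents, e.g. <a ...>...</a>.
  # The m flag lets . span newlines; nested elements of the same name will confuse this.
  stripped = html.gsub(%r{<(\w+)[^>]*>.*?</\1>}m, ' ')
  # Remove leftover standalone tags such as <img .../>.
  stripped = stripped.gsub(/<[^>]*>/, ' ')
  # Simplified URL pattern (an assumption, not a full URL grammar).
  stripped.scan(%r{https?://[^\s<>"]+})
end

html = 'text http://someurl.com <a href="http://thisshouldnotbetampered.com">' \
       'a url http://someotherurl.com</a> <img src="http://someasseturl.com/abc.jpeg"/>'
p urls_outside_elements(html)
```

Note that this also throws away text inside elements such as <span>, so a URL there would not be found.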
#2
1
Disclaimer: You shouldn't use a regex for this task; use an HTML parser. This is a proof of concept to demonstrate that it's possible if you can expect well-formed HTML (which you won't have anyway).
So here's what I came up with:
(https?:\/\/(?:w{1,3}.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:<\/\w+>|\/?>))
What does this mean?
- ( : start of group 1
  - https? : match http or https
  - :\/\/ : match ://
  - (?:w{1,3}.)? : optionally match w., ww. or www.
  - [^\s]*? : match anything except whitespace, zero or more times, ungreedy
  - (?:\.[a-z]+)+ : match a dot followed by one or more [a-z] characters; repeat this one or more times
- ) : end of group 1
- (?! : negative lookahead
  - [^<]*? : match anything except <, zero or more times, ungreedy
  - (?:<\/\w+>|\/?>) : match a closing tag, /> or >
- ) : end of lookahead
Online demos: regex101, rubular
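To see how the pattern behaves in Ruby, here is a small sketch that applies it with gsub. The pattern is the one from this answer, unchanged apart from using %r{} so the slashes need no escaping. The method name linkify_with_answer_regex is made up for the example:

```ruby
# Hypothetical wrapper around the regex from the answer above.
def linkify_with_answer_regex(text)
  # Group 1 is the URL; the negative lookahead rejects URLs that are followed
  # (before the next "<") by ">" or "/>", or directly by a closing tag.
  pattern = %r{(https?://(?:w{1,3}.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:</\w+>|/?>))}
  text.gsub(pattern) { %(<a href="#{$1}">#{$1}</a>) }
end

html = 'see http://someurl.com and <a href="http://thisshouldnotbetampered.com">' \
       'a url http://someotherurl.com</a>'
puts linkify_with_answer_regex(html)
```

Two things to watch for when trying it: because [^\s]*? is ungreedy, the match ends at the last .tld-like segment, so a query string such as ?param1=foo stays outside the generated link. A URL inside any element that has a closing tag (including the <span> in the question) is skipped as well, so this does not fully reproduce the expected result.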
#3
0
Maybe try http://rubular.com/. It has some regex tips that can help you get the desired output.
#4
0
I would do something like this:
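require 'nokogiri'
require 'uri'

# Parse the text as an HTML fragment.
doc = Nokogiri::HTML.fragment <<EOF
This is my text to be parsed which contains url
http://someurl.com <a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test 1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
EOF

# Replace every element (and everything inside it) with a newline,
# so only the text outside any tag is left.
doc.search('*').each { |n| n.replace "\n" }

# Extract the URLs from the remaining plain text.
URI.extract doc.text
#=> ["http://someurl.com"]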
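The snippet above only extracts the plain URL. To actually produce the output asked for in the question, one possible approach is sketched below. It deliberately swaps in a different technique: a stdlib-only tag tokenizer rather than Nokogiri, so it is not a real HTML parser and assumes reasonably well-formed markup. The method name linkify_outside_anchors and the simplified URL pattern are assumptions for illustration:

```ruby
# Hypothetical helper: wrap bare URLs in <a> tags, leaving tags and existing links alone.
def linkify_outside_anchors(html)
  url = %r{https?://[^\s<>"]+}   # simplified URL pattern (an assumption)
  anchor_depth = 0

  # The capture group makes split keep each tag as its own token.
  html.split(/(<[^>]*>)/).map do |token|
    if token.start_with?('<')
      anchor_depth += 1 if token =~ /\A<a[\s>]/i
      anchor_depth -= 1 if token =~ %r{\A</a\s*>}i && anchor_depth > 0
      token   # tags and their attributes (href, src) pass through untouched
    elsif anchor_depth > 0
      token   # text inside an existing link stays as-is
    else
      token.gsub(url) { |u| %(<a href="#{u}">#{u}</a>) }
    end
  end.join
end

html = 'see http://someurl.com?param1=foo <a href="http://thisshouldnotbetampered.com">' \
       'a url http://someotherurl.com</a> <span>too http://someurlinsidespan.com?xyz=abc </span>'
puts linkify_outside_anchors(html)
```

Unlike the snippet above, this keeps text inside elements such as <span> and only skips text nested in an <a>. Trailing punctuation right after a URL would be swallowed into the link, and for messy real-world HTML walking Nokogiri text nodes would likely be more robust.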