I need a regular expression to find URLs that are not inside any HTML tag or any HTML tag's attribute value

Date: 2021-09-08 18:31:10

I have HTML content in the following text.

    "This is my text to be parsed which contains url
    http://someurl.com?param1=foo&params2=bar
    <a href="http://thisshouldnotbetampered.com">
    some text and a url http://someotherurl.com test 1q2w
    </a> <img src="http://someasseturl.com/abc.jpeg"/>
    <span>i have a link too http://someurlinsidespan.com?xyz=abc </span>
    "

I need a regex that will convert plain URLs into hyperlinks (without tampering with existing hyperlinks).

Expected result:

    "This is my text to be parsed which contains url
    <a href="http://someurl.com?param1=foo&params2=bar">http://someurl.com?param1=foo&params2=bar</a>
    <a href="http://thisshouldnotbetampered.com">
    some text and a url http://someotherurl.com test 1q2w
    </a> <img src="http://someasseturl.com/abc.jpeg"/>
    <span>i have a link too <a href="http://someurlinsidespan.com?xyz=abc">http://someurlinsidespan.com?xyz=abc</a> </span>
    "

4 answers

#1


2  

Maybe you could do a search-and-replace first to remove the HTML elements. I don't know Ruby, but the regex would be something like /<(\w+).*?>.*?<\/\1>/. It might get tricky if you have nested elements of the same type, though.
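A minimal Ruby sketch of this strip-then-search idea, using only the standard library. The helper name `urls_outside_elements`, the extra pass that removes standalone tags such as `<img .../>`, and the simple `https?://\S+` URL pattern are assumptions added for illustration; they are not part of the answer.

```ruby
# Sample input adapted from the question.
SAMPLE_HTML = <<~TEXT
  This is my text to be parsed which contains url
  http://someurl.com?param1=foo&params2=bar
  <a href="http://thisshouldnotbetampered.com">
  some text and a url http://someotherurl.com test 1q2w
  </a> <img src="http://someasseturl.com/abc.jpeg"/>
  <span>i have a link too http://someurlinsidespan.com?xyz=abc </span>
TEXT

# Hypothetical helper: returns the URLs found outside any HTML element.
def urls_outside_elements(html)
  # 1. Drop paired elements together with their content
  #    (the m flag lets the dot match newlines).
  stripped = html.gsub(/<(\w+).*?>.*?<\/\1>/m, ' ')
  # 2. Drop leftover standalone tags such as <img .../>.
  stripped = stripped.gsub(/<[^>]*>/, ' ')
  # 3. Collect whatever still looks like a URL (assumed pattern).
  stripped.scan(%r{https?://\S+})
end

p urls_outside_elements(SAMPLE_HTML)
# => ["http://someurl.com?param1=foo&params2=bar"]
```

Because every element is removed wholesale, the URL inside `<span>` is discarded as well, and the result is a list of URLs rather than the rewritten text the question asks for.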

#2


1  

Disclaimer: You shouldn't use regex for this task; use an HTML parser. This is a POC to demonstrate that it's possible if you can expect well-formed HTML (which you won't have anyway).

So here's what I came up with:

    (https?:\/\/(?:w{1,3}\.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:<\/\w+>|\/?>))

What does this mean?

  • ( : start of group 1

  • https? : match http or https

  • \/\/ : match //

  • (?:w{1,3}\.)? : optionally match w., ww. or www.

  • [^\s]*? : match anything except whitespace, zero or more times, ungreedy

  • (?:\.[a-z]+)+ : match a dot followed by one or more [a-z] characters, and repeat this one or more times

  • ) : end of group 1

  • (?! : negative lookahead
    • [^<]*? : match anything except <, zero or more times, ungreedy

    • (?:<\/\w+>|\/?>) : match a closing tag, /> or >

    • ) : end of lookahead

regex101 online demo, rubular online demo
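To see how the pattern behaves on the question's input, here is a small Ruby sketch. It uses the pattern with the dot after `w{1,3}` escaped, which is an assumption about the intent; the constant and method names (`PLAIN_URL`, `plain_urls`, `linkify_with_regex`) are hypothetical.

```ruby
# The POC pattern; the escaped dot after w{1,3} is an assumption about intent.
PLAIN_URL = /(https?:\/\/(?:w{1,3}\.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:<\/\w+>|\/?>))/

# Sample input adapted from the question.
INPUT = <<~TEXT
  This is my text to be parsed which contains url
  http://someurl.com?param1=foo&params2=bar
  <a href="http://thisshouldnotbetampered.com">
  some text and a url http://someotherurl.com test 1q2w
  </a> <img src="http://someasseturl.com/abc.jpeg"/>
  <span>i have a link too http://someurlinsidespan.com?xyz=abc </span>
TEXT

# Hypothetical helper: list everything group 1 captures.
def plain_urls(text)
  text.scan(PLAIN_URL).flatten
end

# Hypothetical helper: wrap each match in an <a> element.
def linkify_with_regex(text)
  text.gsub(PLAIN_URL) { %(<a href="#{$1}">#{$1}</a>) }
end

p plain_urls(INPUT)
# => ["http://someurl.com"]
puts linkify_with_regex(INPUT)
```

On this input the match stops at `.com`, so the query string is left outside the link, and the URL inside `<span>` is skipped because a closing tag follows it with no `<` in between. That is consistent with the disclaimer: the pattern is a proof of concept, not a full solution.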

#3


0  

Maybe try http://rubular.com/. It has some regex tips that can help you get the desired output.

#4


0  

I would do something like this:


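require 'nokogiri'
require 'uri'

# Parse the text as an HTML fragment.
doc = Nokogiri::HTML.fragment <<EOF
This is my text to be parsed which contains url 
http://someurl.com  <a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test 1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
EOF

# Replace every element (and everything inside it) with a newline,
# so only the text outside of tags remains.
doc.search('*').each { |n| n.replace "\n" }

# Extract the URLs from the remaining plain text.
URI.extract doc.text
#=> ["http://someurl.com"]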
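The snippet above only extracts the URLs; it does not turn them into links. As a rough sketch of the conversion the question asks for, the input can be split into tags and text, with URLs wrapped only in text that is not inside an existing `<a>` element. This is a different, stdlib-only technique rather than the Nokogiri approach; `linkify` and `URL_PATTERN` are hypothetical names, and the code assumes reasonably well-formed markup with no `>` inside attribute values.

```ruby
# Assumed URL pattern: scheme plus anything up to whitespace, <, > or a quote.
URL_PATTERN = %r{https?://[^\s<>"]+}

# Hypothetical helper: wraps bare URLs in <a> tags while leaving tag markup
# and the content of existing <a> elements untouched.
def linkify(html)
  anchor_depth = 0
  # Splitting on a captured group keeps the tags in the resulting array.
  html.split(/(<[^>]+>)/).map do |token|
    if token.start_with?('<')
      # Track whether we are currently inside an <a> element.
      anchor_depth += 1 if token =~ /\A<a\b/i
      anchor_depth -= 1 if token =~ %r{\A</a\s*>}i && anchor_depth > 0
      token
    elsif anchor_depth > 0
      token # text of an existing link: leave as-is
    else
      token.gsub(URL_PATTERN) { |url| %(<a href="#{url}">#{url}</a>) }
    end
  end.join
end

sample = 'see http://a.com?x=1&y=2 <a href="http://b.com">go http://c.com</a> ' \
         '<img src="http://d.com/e.jpeg"/> <span>and http://f.com?q=1 </span>'
puts linkify(sample)
```

Attribute values are never touched because they live inside tag tokens, and the depth counter keeps text inside `<a>...</a>` unchanged while text in other elements such as `<span>` is still linked.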
