I need a regex to find URLs that are not inside any HTML tag or any HTML tag's attribute value

I have HTML content in the following text.

    "This is my text to be parsed which contains url 
    http://someurl.com?param1=foo&params2=bar 
    <a href="http://thisshouldnotbetampered.com">
    some text and a url http://someotherurl.com test 1q2w
    </a> <img src="http://someasseturl.com/abc.jpeg"/>
    <span>i have a link too http://someurlinsidespan.com?xyz=abc </span> 
    "

I need a regex that converts plain URLs into hyperlinks without tampering with existing hyperlinks.

Expected result:

    "This is my text to be parsed which contains url 
    <a href="http://someurl.com?param1=foo&params2=bar">http://someurl.com?param1=foo&params2=bar</a> 
    <a href="http://thisshouldnotbetampered.com">
    some text and a url http://someotherurl.com test 1q2w
    </a> <img src="http://someasseturl.com/abc.jpeg"/>
    <span>i have a link too <a href="http://someurlinsidespan.com?xyz=abc">http://someurlinsidespan.com?xyz=abc</a> </span>
    "

4 solutions

#1

Maybe you could do a search-and-replace first to remove the HTML elements. I don't know Ruby, but the regex would be something like /<(\w+).*?>.*?<\/\1>/ (the slash of the closing tag has to be escaped inside a slash-delimited regex). It might be tricky if you have nested elements of the same type.
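
For what it's worth, here is a minimal Ruby sketch of that idea. The sample string and the variable names are made up for illustration, and the second `gsub` for standalone tags such as `<img/>` is my own addition, not part of the suggestion above. It only finds the plain URLs; it does not put hyperlinks back into the original markup.

```ruby
# Sketch of the "strip the elements first" idea; all names are illustrative.
html = 'See http://someurl.com <a href="http://keep.example">link http://inner.example</a> ' \
       '<img src="http://asset.example/a.jpeg"/> end'

# Remove paired elements such as <a ...>...</a> together with their contents.
# The backreference \1 requires the closing tag name to match the opening one.
without_pairs = html.gsub(/<(\w+).*?>.*?<\/\1>/m, ' ')

# Remove any standalone tags that are left, e.g. <img .../>.
plain_text = without_pairs.gsub(/<[^>]*>/, ' ')

# Collect the URLs that survive in the remaining plain text.
urls = plain_text.scan(%r{https?://\S+})
p urls
```

As the answer says, nested elements of the same type would trip up the first pattern, because the lazy match stops at the first matching closing tag.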

#2

Disclaimer: You shouldn't use regex for this task; use an HTML parser. This is a proof of concept to demonstrate that it's possible if you can expect well-formed HTML (which you won't have anyway).

So here's what I came up with:
(https?:\/\/(?:w{1,3}.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:<\/\w+>|\/?>))

What does this mean?

( : start of group 1

https? : match http or https

\/\/ : match //

(?:w{1,3}.)? : optionally match w., ww. or www. (the dot is unescaped as written, so it matches any character; \. would match only a literal dot)

[^\s]*? : match any non-whitespace character zero or more times, lazily

(?:\.[a-z]+)+ : match a dot followed by one or more [a-z] characters, repeated one or more times

) : end of group 1

(?! : negative lookahead
- [^<]*? : match anything except < zero or more times, lazily
- (?:<\/\w+>|\/?>) : match a closing tag, /> or >
- ) : end of lookahead

Online demos: regex101, Rubular
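
To see the pattern in action from Ruby, something like the following should work. It is only a sketch: the constant name `PLAIN_URL` and the shortened sample string are mine, and the pattern is copied as written above, unescaped dot included.

```ruby
# The pattern from this answer, copied as written (proof of concept only).
PLAIN_URL = /(https?:\/\/(?:w{1,3}.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:<\/\w+>|\/?>))/

html = 'Visit http://someurl.com now <a href="http://thisshouldnotbetampered.com">' \
       'a url http://someotherurl.com test</a> <img src="http://someasseturl.com/abc.jpeg"/>'

# Wrap group 1 of every match in an anchor.
# URLs rejected by the negative lookahead are left exactly as they were.
linked = html.gsub(PLAIN_URL) { %(<a href="#{$1}">#{$1}</a>) }
puts linked
```

Two limitations follow from how the pattern is built, as far as I can tell. The match ends at the last `.[a-z]+` part, so a query string such as `?param1=foo` is left outside the generated link. Also, a URL that is followed by any closing tag before the next `<`, for example one sitting directly inside a `<span>`, is rejected by the lookahead and stays unlinked.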

#3

Maybe try http://rubular.com/. It has some regex tips that can help you get the desired output.

#4

I would do something like this:

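require 'nokogiri'
require 'uri' # URI.extract comes from the standard library

# Parse the input as an HTML fragment.
doc = Nokogiri::HTML.fragment <<EOF
This is my text to be parsed which contains url 
http://someurl.com  <a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test 1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
EOF

# Replace every element (and its contents) with a newline, leaving only the top-level text.
doc.search('*').each{|n| n.replace "\n"}

# Extract the URLs from the remaining text.
URI.extract doc.text
#=> ["http://someurl.com"]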
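
The snippet above extracts the plain URLs but does not write the hyperlinks back. Here is a rough sketch of the conversion step itself. Note that it swaps in a different technique: it uses only the standard library and splits the string into tags and text instead of using Nokogiri, so it assumes reasonably well-formed markup. The helper name `linkify` and the URL pattern are hypothetical choices of mine, not something from the answer.

```ruby
# Hypothetical helper: wraps bare URLs in <a> tags, skipping tags and existing links.
def linkify(html)
  anchor_depth = 0
  # Splitting on a capturing group keeps the tags themselves in the result array.
  html.split(/(<[^>]+>)/).map do |token|
    if token.start_with?('<')
      # Track whether we are inside an existing <a>...</a> element.
      anchor_depth += 1 if token =~ /\A<a\b/i
      anchor_depth -= 1 if token =~ /\A<\/a\s*>/i && anchor_depth > 0
      token # tags (and their attribute values) are never modified
    elsif anchor_depth > 0
      token # text inside an existing link is left untouched
    else
      # Plain text outside any link: wrap each URL in an anchor.
      token.gsub(%r{https?://[^\s<]+}) { |url| %(<a href="#{url}">#{url}</a>) }
    end
  end.join
end

sample = 'url http://someurl.com?param1=foo&params2=bar ' \
         '<a href="http://thisshouldnotbetampered.com">a url http://someotherurl.com test</a> ' \
         '<img src="http://someasseturl.com/abc.jpeg"/> ' \
         '<span>a link http://someurlinsidespan.com?xyz=abc </span>'
puts linkify(sample)
```

Because the URL pattern takes everything up to the next whitespace, trailing punctuation such as a full stop would end up inside the link. For messy real-world HTML, walking the text nodes with a real parser is still the safer route.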
