First, a note to those more experienced than me: this has to be done with a regex. Because of an unusual setup, I have no access to a DOM parser.
So I have a full HTML/XHTML string and would like to strip everything from it except the links. Basically, only the <a> tags are important. The tags need to keep all of their information (href, target, class, etc.), and it should work whether the tag is self-terminating or has a separate end tag, i.e. <a /> or <a></a>.
Thanks for any help, guys!
3 Solutions
#1
2
Of course you can parse HTML in a Firefox extension. Have a look at HTML to DOM, especially the second and third methods.
It might seem to be more complex, but it is less error prone than a regular expression.
As soon as you have a reference to the parsed content, all you have to do is call ref.getElementsByTagName('a') and you are done.
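A minimal sketch of that approach, assuming a DOMParser is available in the environment (it is in browsers; whether it is reachable in the asker's setup is an assumption). The helper name extractAnchorsWithDom is made up for illustration:

```javascript
// Hypothetical helper: parse an HTML string and return every <a> element
// serialized back to a string, attributes included.
function extractAnchorsWithDom(html) {
  // Parse the string into a separate, detached document.
  const doc = new DOMParser().parseFromString(html, 'text/html');
  // getElementsByTagName returns a live collection; copy it into a plain
  // array of outerHTML strings.
  return Array.from(doc.getElementsByTagName('a'), (a) => a.outerHTML);
}

// Usage (in an environment that provides DOMParser):
//   extractAnchorsWithDom('<p><a href="/x" class="c">x</a></p>');
```

Because the parser serializes the tags itself, attribute order and quoting in the output may differ slightly from the input string.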
#2
1
result = subject.match(/<a[^<>]*?(?:\/>|>(?:(?!<\/a>).)*<\/a>)/ig);
gets you an array of all <a> tags in the HTML source (even self-closed tags, which are illegal but which you specifically asked for). Is that sufficient?
Explanation:
<a              # Match <a
[^<>]*?         # Match any characters besides angle brackets, as few as possible
(?:             # Now either match
  />            # /> (self-closed tag)
|               # or
  >             # a closing angle bracket
  (?:           # followed by...
    (?!</a>)    # (if we're not at the closing tag)
    .           # any character
  )*            # any number of times
  </a>          # until the closing tag
)               # End of the alternation
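To show how the pattern might be used for the original goal (keeping only the links), here is a small sketch. The helper name extractAnchors and the sample string are assumptions for illustration; the regex is the one from this answer.

```javascript
// Pattern from the answer above, with the `g` and `i` flags.
const ANCHOR_RE = /<a[^<>]*?(?:\/>|>(?:(?!<\/a>).)*<\/a>)/gi;

// Hypothetical helper: returns every <a> tag found, or [] when there is none
// (String.prototype.match returns null, not an empty array, on no match).
function extractAnchors(html) {
  return html.match(ANCHOR_RE) || [];
}

const subject =
  '<div><p>See <a href="/docs" target="_blank" class="ext">the docs</a>' +
  ' or <a name="top" /></p></div>';

const links = extractAnchors(subject);

// Joining the matches gives the "everything stripped except links" string.
console.log(links.join('\n'));
```

Two caveats worth keeping in mind: the pattern has no word boundary after <a, so it can also start a match at tags such as <abbr>, and `.` does not match newlines, so a link whose content spans several lines is skipped. Using <a\b and [\s\S] in place of `.` would be one way to address both.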
#3
0
The regex will look something like this:
/\<\a.*[\/]{0,1}>(.*<\/\a>){0,1}/gm
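One thing to watch with a pattern of this shape: `.*` is greedy, so when several links share a line, a single match can run from the first <a to the last > on that line. The sketch below compares a greedy and a lazy variant; both patterns and the sample string are illustrative and are not taken from the answer.

```javascript
const sample = '<p><a href="/one">One</a> and <a href="/two">Two</a></p>';

// Greedy: `.*` grabs as much of the line as it can before the final `>`.
const greedy = sample.match(/<a\b.*>(.*<\/a>)?/gi);

// Lazy: `[^>]*` stays inside the opening tag and `.*?` stops at the
// first closing </a>.
const lazy = sample.match(/<a\b[^>]*>(?:.*?<\/a>)?/gi);

console.log(greedy.length); // 1 (one match swallowing both links and </p>)
console.log(lazy.length);   // 2 (one match per link)
```

The lazy variant still assumes each link sits on a single line, since `.` does not match newlines.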