How do I match all content outside HTML tags?
My pseudo-HTML is:
<h1>aaa</h1>
bbb <img src="bla" /> ccc
<div>ddd</div>
I used this regular expression:
(?<=^|>)[^><]+?(?=<|$)
which gives me: "aaa bbb ccc ddd"
What I need is a way to also skip the text inside HTML elements, so that only "bbb ccc" is returned.
3 solutions
#1
Regexes are a clunky and unreliable way to work on markup. I would suggest using a DOM parser such as SimpleHtmlDom:
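// Get the text content of every hyperlink on the given page.
// You can use selectors, e.g. 'a.pretty'; see the docs.
// find() returns an array of elements, so loop over it to read each plaintext.
foreach (file_get_html('http://www.example.org')->find('a') as $link) {
    echo $link->plaintext;
}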
If you want to do that on the client, you can use a library such as jQuery like so:
$('a').each(function() {
    alert($(this).text());
});
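If pulling in a DOM library is not an option, a small scanner that tracks nesting depth behaves more like a parser than a single regex does. The following is only a rough sketch in plain JavaScript: `textOutsideElements` and `VOID_ELEMENTS` are names made up for this example, and it assumes reasonably well-formed markup with no comments, scripts, or `>` characters inside attribute values.

```javascript
// Elements that never take a closing tag, so they must not change the depth.
// (Hypothetical helper list for this sketch.)
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

// Collects the text that sits outside every element (nesting depth 0).
function textOutsideElements(html) {
  // Groups: 1 = "/" for a closing tag, 2 = tag name, 3 = "/" for <tag />.
  const tagPattern = /<(\/?)([a-zA-Z][\w-]*)[^>]*?(\/?)>/g;
  const pieces = [];
  let depth = 0;
  let last = 0; // index just after the previous tag

  for (const m of html.matchAll(tagPattern)) {
    // Text between the previous tag and this one counts only at depth 0.
    if (depth === 0) pieces.push(html.slice(last, m.index));
    last = m.index + m[0].length;

    const [, closing, name, selfClosing] = m;
    if (closing) {
      depth = Math.max(0, depth - 1);
    } else if (!selfClosing && !VOID_ELEMENTS.has(name.toLowerCase())) {
      depth += 1;
    }
  }
  // Trailing text after the last tag.
  if (depth === 0) pieces.push(html.slice(last));

  return pieces.join(" ").replace(/\s+/g, " ").trim();
}

const sample = '<h1>aaa</h1>\nbbb <img src="bla" /> ccc\n<div>ddd</div>';
console.log(textOutsideElements(sample)); // bbb ccc
```

Because it counts depth instead of pairing tag names, this also copes with nested elements of the same name, e.g. `<div><div>x</div>y</div>z` yields just `z`.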
#2
Look for an appropriate regex that matches complete tags (e.g. in a library like http://regexlib.com/) and remove them using the substitution operator s///. Then use whatever is left.
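A minimal sketch of that remove-then-use-the-rest idea, written in JavaScript with `String.prototype.replace` standing in for `s///`. The two patterns and the name `stripElements` are assumptions made up for this example, not taken from regexlib.com, and the first pattern will not cope with nested elements of the same name.

```javascript
// Pass 1: remove paired elements together with their content.
// \1 refers back to the tag name, so <div>...</div> goes away as one unit.
const PAIRED_ELEMENT = /<([a-zA-Z][\w-]*)\b[^>]*>[\s\S]*?<\/\1\s*>/g;

// Pass 2: remove any tags that are still left, e.g. <img ... />.
const ANY_TAG = /<\/?[a-zA-Z][^>]*>/g;

function stripElements(html) {
  return html
    .replace(PAIRED_ELEMENT, " ") // drop elements and their text
    .replace(ANY_TAG, " ")        // drop leftover standalone tags
    .replace(/\s+/g, " ")         // collapse the whitespace left behind
    .trim();
}

const sample = '<h1>aaa</h1>\nbbb <img src="bla" /> ccc\n<div>ddd</div>';
console.log(stripElements(sample)); // bbb ccc
```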
#3
Thanks, everybody.
Combining both expressions into one is dirty work, and it matches the elements themselves, whereas I would like the opposite output.
(\<(.*?)\>)(.*?)(\<\/(.*?)\>)|(<[a-zA-Z\/][^>]*>)
With this pseudo string:
<h1>aaa</h1>
bbb <img src="bla" /> ccc
<div>ddd</div> jhgvjhgjh zhg zt <div>ddd</div>
<div>dsada</div> hbhgjh
For simplification, I use this tool.
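One way to get the opposite output from the combined expression is to delete everything it matches and keep the remainder. A quick sketch in JavaScript, under the assumption that the expression is used unchanged apart from the `g` flag; the variable names are made up for this example.

```javascript
// The combined expression from the post, unchanged apart from the /g flag.
const ELEMENT_OR_TAG = /(\<(.*?)\>)(.*?)(\<\/(.*?)\>)|(<[a-zA-Z\/][^>]*>)/g;

const pseudo = [
  "<h1>aaa</h1>",
  'bbb <img src="bla" /> ccc',
  "<div>ddd</div> jhgvjhgjh zhg zt <div>ddd</div>",
  "<div>dsada</div> hbhgjh",
].join("\n");

// What the expression matches: the paired elements and the standalone tag.
const matched = pseudo.match(ELEMENT_OR_TAG);

// The "opposite": remove every match, then tidy up the whitespace.
const outside = pseudo
  .replace(ELEMENT_OR_TAG, " ")
  .replace(/\s+/g, " ")
  .trim();

console.log(matched);
console.log(outside); // bbb ccc jhgvjhgjh zhg zt hbhgjh
```

Note that `.` does not match a newline here, so an element whose opening and closing tags sit on different lines would not be removed as a unit.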