I have a question about parsing text and removing unwanted HTML parts. I know about functions like strip_tags(), which removes all the tags, but the problem is that it leaves the text inside the tags in place.
Let me show you an example. We have this text:
Hello, how are you? <a href="">Link to my website</a> __Here continues html tags, links, images__
What I want is to remove the whole part where the HTML resides: not only the tags, but also the text inside them (like "Link to my website" above).
Is there an efficient way to do this, or a function that I missed?
7 answers
#1
3
Try this:
function removeTags($str) {
    $result = '';
    // loadHTML() is an instance method; calling it statically fails on PHP 8.
    $doc = new DOMDocument();
    $doc->loadHTML(sprintf('<body>%s</body>', $str));
    $xpath = new DOMXPath($doc);
    // Collect only the text nodes that are direct children of <body>.
    foreach ($xpath->query('//body/text()') as $textNode) {
        $result .= $textNode->nodeValue;
    }
    return $result;
}
echo removeTags(
    'Hello, how are you? <a href="">Link to my website</a> __Here continues html <span>tags</span>, links, images__'
);
Output:
Hello, how are you? __Here continues html , links, images__
#2
1
Why not make it a rule that submitted input is not allowed to contain tags?
function containsIllegalHtml($input, $allowable_tags = '') {
    // If strip_tags() changes the string, it contained tags that are not allowed.
    if ($input != strip_tags($input, $allowable_tags)) {
        return true;
    } else {
        return false;
    }
}
Use this function to check whether the input contains tags or not.
#3
0
You could write a function that takes a string, uses PHP's string functions to find the position of each "<" and of the ">" that follows it, and strips those spans from the input string.
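A minimal sketch of that idea, assuming a helper name (stripAngleBrackets) that is made up here for illustration. Note that, like strip_tags(), it only cuts out the tags themselves, so the link text stays in the result:

```php
<?php
// Hypothetical helper, not part of PHP: removes every "<...>" span from a string.
function stripAngleBrackets(string $input): string
{
    $output = '';
    $offset = 0;
    // Find each "<" starting from where the previous tag ended.
    while (($open = strpos($input, '<', $offset)) !== false) {
        $close = strpos($input, '>', $open);
        if ($close === false) {
            break; // unmatched "<": keep the rest of the string unchanged
        }
        // Keep the text before the tag, then skip past the closing ">".
        $output .= substr($input, $offset, $open - $offset);
        $offset = $close + 1;
    }
    // Append whatever follows the last tag.
    return $output . substr($input, $offset);
}

echo stripAngleBrackets('Hello <a href="">Link to my website</a> there');
```

To drop the inner text as well, you would have to track which closing tag belongs to which opening tag, which is where a real parser such as DOMDocument is probably the safer choice.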
#4
0
Maybe it's not correct, but...
$str = 'Hello, how are you? <a href="">Link to my website</a> __Here continues html tags, links, ';
$rez = preg_replace("/\<.*\>/i",'',$str);
var_dump($rez);
gave me this output:
string 'Hello, how are you? __Here continues html tags, links, ' (length=56)
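The link text disappears here because .* is greedy: it matches from the first "<" to the last ">" on the line. A small sketch comparing it with a lazy quantifier, using the same sample string (the variable names $greedy and $lazy are only for illustration):

```php
<?php
$str = 'Hello, how are you? <a href="">Link to my website</a> __Here continues html tags, links, ';

// Greedy: one match spanning from the first "<" to the last ">".
$greedy = preg_replace('/<.*>/', '', $str);

// Lazy: each tag is matched on its own, so the link text survives.
$lazy = preg_replace('/<.*?>/', '', $str);

var_dump($greedy);
var_dump($lazy);
```

Keep in mind that the greedy version also swallows any plain text sitting between two separate tags, and that "." does not match newlines unless the s modifier is added.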
#5
0
I searched and found this solution:
$txt = "
<html>
<head><title>Something wicked this way comes</title></head>
<body>
This is the interesting stuff I want to extract
</body>
</html>";
$text = preg_replace("/<([^<>]*)>/", "", $txt);
echo htmlentities($text);
#6
0
Some preg magic?
$text = preg_replace('/<[\/\!]*?[^<>]*?>/si', '', $text);
#7
0
Maybe this will work. Here is a tutorial:
http://www.zendcasts.com/writing-custom-zend-filters-with-htmlpurifier/2011/06/
It's for Zend Framework, but I think it may help.