I have a large number of posts generated by an old CMS. They are in HTML markup... almost... the worst I have ever seen. It contains constructs like this:
....<IMG alt="Хит сезона - <b>Лучшие фразы...</b>" src="http://www.example.com/articles/pic.jpg" align=left>...
As you can see, strictly speaking this is not HTML, because it contains tags inside tag attributes.
I need to remove any tags from HTML attributes.
I tried parsing it with DOMDocument, but it cannot output Cyrillic characters correctly if the head, body, and html tags are not in the parsed string. And even if it could, I would have to remove them from the output.
The question is: how do I remove tags from the attributes of an HTML tag in PHP?
Is preg_replace suitable for this?
1 solution
#1
1
You could try this:
preg_replace('#<([^ ]+)((\s+[\w]+=((["\'])[^\5]+\5|[^ ]+))+)>#e', '"<\\1" . str_replace("\\\'", "\'", strip_tags("\\2")) . ">"', $code);
It takes every HTML opening tag (<something>), matches all of its attributes (name="value", name='value', or name=value), and then strips the tags from them. The str_replace is necessary because, when the e modifier is used, PHP applies addslashes to every match before evaluating it. Note that the e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so this call only works on older PHP versions.
I tested it and it seems to work fine. :)
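On PHP 7 and later, where the e modifier is not available, the same idea can be sketched with preg_replace_callback. This is a minimal sketch, not the original answer's code: the function name strip_tags_in_attributes is hypothetical, and the pattern is a simplified variant that assumes every attribute has a value (quoted or unquoted) and that quoted values do not contain their own quote character. Tags that contain a valueless attribute such as disabled are left untouched.

```php
<?php
// Hypothetical helper (name chosen for illustration only).
// Strips tags that appear inside the attribute list of an opening tag.
function strip_tags_in_attributes(string $html): string
{
    // Group 1: tag name.
    // Group 2: the whole attribute list (name="v", name='v', or name=v).
    // Group 3: optional whitespace and self-closing slash before ">".
    $pattern = '#<([a-zA-Z][\w:-]*)'
             . '((?:\s+[\w:-]+\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+))+)'
             . '(\s*/?)>#';

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            // The callback receives the raw match, so the addslashes
            // workaround needed with the e modifier does not apply here.
            return '<' . $m[1] . strip_tags($m[2]) . $m[3] . '>';
        },
        $html
    );

    // preg_replace_callback returns null on a PCRE error;
    // fall back to the unmodified input in that case.
    return $result ?? $html;
}

// Usage with the construct from the question:
$code = '<IMG alt="Хит сезона - <b>Лучшие фразы...</b>" src="http://www.example.com/articles/pic.jpg" align=left>';
echo strip_tags_in_attributes($code), "\n";
// prints: <IMG alt="Хит сезона - Лучшие фразы..." src="http://www.example.com/articles/pic.jpg" align=left>
```

The pattern is deliberately used without the u modifier, so it works byte-wise and does not require the input to be valid UTF-8, which may matter for output from an old CMS.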