I am trying to take a string that contains HTML, strip out some tags entirely (img, object), and remove the attributes from all other HTML tags. For example:
<div id="someId" style="color: #000000">
<p class="someClass">Some Text</p>
<img src="images/someimage.jpg" alt="" />
<a href="somelink.html">Some Link Text</a>
</div>
Would become:
<div>
<p>Some Text</p>
Some Link Text
</div>
I am trying:
string.replaceAll("<\/?[img|object](\s\w+(\=\".*\")?)*\>", ""); //REMOVE img/object
I am not sure how to strip all the attributes inside a tag, though.
Any help would be appreciated.
Thanks.
4 solutions
#1
7
You can remove all attributes like this:
string.replaceAll("(<\\w+)[^>]*(>)", "$1$2");
This expression matches an opening tag, but captures only its header (<div) and the closing > as groups 1 and 2. replaceAll uses references to these groups to join them back together in the output as $1$2, which cuts out the attributes in the middle of the tag.
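To see how this fits the original question, here is a minimal sketch that first drops the img/object tags and then applies the attribute-stripping expression from this answer. The class name StripAttributes and the UNWANTED pattern are assumptions made for illustration, not part of the answer; the sketch also assumes no attribute value contains a literal > character, which would break both patterns.

```java
import java.util.regex.Pattern;

public class StripAttributes {
    // Assumed helper pattern: matches whole <img ...>, <object ...> and </object> tags.
    private static final Pattern UNWANTED =
            Pattern.compile("</?(?:img|object)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // The expression from the answer: group 1 is "<tagname", group 2 is the closing ">".
    private static final Pattern ATTRIBUTES = Pattern.compile("(<\\w+)[^>]*(>)");

    static String clean(String html) {
        // Step 1: remove the unwanted tags entirely.
        String withoutMedia = UNWANTED.matcher(html).replaceAll("");
        // Step 2: rebuild every remaining opening tag from groups 1 and 2 only.
        return ATTRIBUTES.matcher(withoutMedia).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        String html = "<div id=\"someId\" style=\"color: #000000\">"
                + "<p class=\"someClass\">Some Text</p>"
                + "<img src=\"images/someimage.jpg\" alt=\"\" />"
                + "<a href=\"somelink.html\">Some Link Text</a></div>";
        // prints: <div><p>Some Text</p><a>Some Link Text</a></div>
        System.out.println(clean(html));
    }
}
```

Note that closing tags such as </p> are left alone because the expression requires a word character directly after the <, and that the <a> element is kept; unwrapping it as in the question's expected output would need a further step.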
#2
8
I would not recommend regex for this if you want to filter specific tags. It is going to be a hell of a job and will never be fully reliable. Use a real HTML parser like Jsoup. It offers the Whitelist API to clean up HTML (renamed to Safelist in jsoup 1.14.2 and later). See also this cookbook document.
Here's a kickoff example using Jsoup that allows only <div> and <p> tags in addition to the standard set of tags of the chosen Whitelist, which is Whitelist#simpleText() in the example below.
String html = "<div id='someId' style='color: #000000'><p class='someClass'>Some Text</p><img src='images/someimage.jpg' alt='' /><a href='somelink.html'>Some Link Text</a></div>";
Whitelist whitelist = Whitelist.simpleText(); // Whitelist.simpleText() allows b, em, i, strong, u. Use Whitelist.none() instead if you want to start clean.
whitelist.addTags("div", "p");
String clean = Jsoup.clean(html, whitelist);
System.out.println(clean);
This results in:
<div>
<p>Some Text</p>Some Link Text
</div>
See also:
- How to implement a possibility for user to post some html-formatted data in a safe way?
#3
1
/<(/?\w+) .*?>/<\1>/
might work: it captures the tag name (the matching group), consumes any attributes up to the closing bracket, and replaces the whole match with just the brackets and the tag name.
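The substitution above is written in sed/Perl style. As a rough sketch of how it could be expressed in Java (the class and method names here are hypothetical), the back-reference \1 becomes $1 in the replacement string passed to String.replaceAll:

```java
public class SedStyleStrip {
    // Java spelling of /<(/?\w+) .*?>/<\1>/ : "$1" plays the role of "\1".
    static String stripAttributes(String html) {
        return html.replaceAll("<(/?\\w+) .*?>", "<$1>");
    }

    public static void main(String[] args) {
        String html = "<div id=\"someId\" style=\"color: #000000\">"
                + "<p class=\"someClass\">Some Text</p></div>";
        System.out.println(stripAttributes(html));
    }
}
```

Because the pattern requires a space right after the tag name, tags without attributes are left untouched. A self-closing tag such as <img src="a.jpg" alt="" /> loses its trailing slash along with its attributes and becomes <img>.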
#4
-1
This would probably be much easier with a SAX or DOM parser: take each node's name and value, and remove all attributes.
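A minimal sketch of the DOM idea, using only the JDK's built-in XML parser. It assumes the input is well-formed XHTML with a single root element (as in the question's example); real-world HTML often is not, and would need an HTML-aware parser instead. The class name DomAttributeStripper and its methods are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomAttributeStripper {
    // Removes all attributes from this node and its descendants,
    // and drops img/object elements entirely.
    static void strip(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            NamedNodeMap attrs = node.getAttributes();
            // The map is live, so remove from the end to keep indexes valid.
            for (int i = attrs.getLength() - 1; i >= 0; i--) {
                attrs.removeNamedItem(attrs.item(i).getNodeName());
            }
        }
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            String name = child.getNodeName();
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && (name.equalsIgnoreCase("img") || name.equalsIgnoreCase("object"))) {
                node.removeChild(child);
            } else {
                strip(child);
            }
            child = next;
        }
    }

    static String clean(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            strip(doc.getDocumentElement());
            // Serialize the cleaned tree back to a string without the XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(clean("<div id=\"someId\"><p class=\"someClass\">Some Text</p>"
                + "<img src=\"images/someimage.jpg\" alt=\"\" /></div>"));
    }
}
```

Working on the parsed tree avoids the quoting and nesting pitfalls of the regex approaches, at the cost of rejecting input that is not well-formed.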