How to escape specific HTML tags in a string

I have a requirement to escape a blacklist of HTML tags before displaying them in a web page. The reason for the selectivity is to allow formatting to be retained (bold, italics, fonts, etc.) while escaping any tags that would "break" the page (script, meta, etc.).

After thinking about this for a while I came up with two approaches:

1. RegEx: as almost everyone would tell you, using RegEx for manipulating HTML is a bad idea.

2. HtmlAgilityPack

I figured that my best (and really only) solution was to load the string into HtmlAgilityPack and recursively loop through the child nodes. For each node I would check whether it is on the specified blacklist. If it is, I would escape the opening tag (and the closing tag, if one exists), then process the InnerHtml. If it is not on the list, I would output the node's tags as is while still processing the InnerHtml.

So, given the following (very simple) source

The quick <b style='padding: 0 25em;'>brown</b> fox <b>jumped <i>over</i> the <meta http-equiv='refresh' /> moon</b>.

I need the following output

The quick <b style='padding: 0 25em;'>brown</b> fox <b>jumped <i>over</i> the &lt;meta http-equiv='refresh' /&gt; moon</b>.

After a lot of research, I have come across several concerns, questions, and roadblocks.

Is HtmlAgilityPack the best library to use for this requirement?

Is a recursive solution the only way? I thought about using the .Descendants() method, since it returns a flattened list of all the nodes via internal recursion, but that results in repeated content. Using the above example, the <i>over</i> node is part of the InnerHtml of the second b node but then also appears as its own node in the Descendants collection.

I could be missing the proper methods or properties, but I cannot find a way to output just the opening and closing tags without including the InnerHtml. My use case for this is to output the opening tag (including all attributes) as an escaped string, output the recursively processed InnerHtml, then output the escaped closing tag. I guess I could construct my own output by using the different properties (Name, Id, Attributes, etc) but I would think this is already available.

As I see it, the method would look something like this

public string EscapeHtmlTags(string value, ICollection<string> tags) {
   // parse the input string into a document
   var doc = new HtmlAgilityPack.HtmlDocument();
   doc.LoadHtml(value);

   if (tags.Contains(doc.DocumentNode.Name, StringComparer.CurrentCultureIgnoreCase)) {
      // output opening tag as escaped string ????
      EscapeHtmlTags(doc.DocumentNode.InnerHtml, tags);
      // output closing tag as escaped string ????
   }
   else {
      // output opening tag as is ????
      EscapeHtmlTags(doc.DocumentNode.InnerHtml, tags);
      // output closing tag as is ????
   }
}

Of course, I still need to add error handling, probably handle the various NodeTypes differently, probably add a StringBuilder instance to collect the output, and so on. I could even possibly take the approach of cloning and replacing existing nodes in the document.
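For illustration, here is a minimal sketch of how the recursive walk described above might be filled in. It is not a confirmed or complete solution: the class name TagEscaper, the WriteNode helper, and the hard-coded list of void elements are all assumptions made for this example, and the opening tag is rebuilt by hand from Name and Attributes rather than taken from any built-in HtmlAgilityPack method.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;
using HtmlAgilityPack;

public static class TagEscaper // hypothetical name, not part of HtmlAgilityPack
{
    // Assumed list of HTML void elements, which never get a closing tag.
    private static readonly HashSet<string> VoidElements = new HashSet<string>(
        new[] { "area", "base", "br", "col", "embed", "hr", "img", "input",
                "link", "meta", "source", "track", "wbr" },
        StringComparer.OrdinalIgnoreCase);

    public static string EscapeHtmlTags(string value, IEnumerable<string> tags)
    {
        var blacklist = new HashSet<string>(tags, StringComparer.OrdinalIgnoreCase);
        var doc = new HtmlDocument();
        doc.LoadHtml(value);

        // Walk the top-level nodes; DocumentNode itself is only a container.
        var output = new StringBuilder();
        foreach (var child in doc.DocumentNode.ChildNodes)
            WriteNode(child, blacklist, output);
        return output.ToString();
    }

    private static void WriteNode(HtmlNode node, HashSet<string> blacklist, StringBuilder output)
    {
        if (node.NodeType == HtmlNodeType.Text)
        {
            // Neutralize stray angle brackets (e.g. inside a script body),
            // but leave existing entities such as &amp; untouched.
            output.Append(node.InnerHtml.Replace("<", "&lt;").Replace(">", "&gt;"));
            return;
        }
        if (node.NodeType != HtmlNodeType.Element)
            return; // comments are dropped in this sketch

        // Rebuild the opening tag from Name and Attributes.
        var open = new StringBuilder("<").Append(node.Name);
        foreach (var attr in node.Attributes)
        {
            string raw = attr.Value ?? string.Empty;
            // Pick a quote character that does not occur in the raw value.
            char quote = raw.Contains("\"") ? '\'' : '"';
            open.Append(' ').Append(attr.Name).Append('=')
                .Append(quote).Append(raw).Append(quote);
        }
        open.Append('>');
        string close = VoidElements.Contains(node.Name) ? "" : "</" + node.Name + ">";

        // Blacklisted tags are emitted as encoded text; all others as markup.
        bool escape = blacklist.Contains(node.Name);
        output.Append(escape ? WebUtility.HtmlEncode(open.ToString()) : open.ToString());
        foreach (var child in node.ChildNodes)
            WriteNode(child, blacklist, output);
        output.Append(escape ? WebUtility.HtmlEncode(close) : close);
    }
}
```

A call such as TagEscaper.EscapeHtmlTags(source, new[] { "script", "meta" }) on the sample above should leave the b and i tags as markup and emit the meta tag as escaped text. Because the tags are rebuilt, attribute quoting is normalized and a self-closing slash is not preserved. Each node is written exactly once, which avoids the repeated content seen with Descendants(). This sketch does nothing about attributes such as onclick on allowed tags.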

Any thoughts or ideas?

1 solution

#1

You should do this on the back-end side, for example in PHP:

http://www.php.net/manual/en/function.strip-tags.php

This function supports a list of allowed tags, which you can use.
