Remove HTML tags except img src and a href

How can I remove the HTML tags in a string, except for the img src and a href values? I tried the following, but it removes all of the tags.

SELECT 
    REGEXP_REPLACE('lorem <em>ipsum</em><img src="/folder/file.jpg" /> ipsum','<.*?>') 
FROM DUAL;

Result: lorem ipsum (I need something like this: lorem /folder/file.jpg ipsum)

1 solution

#1

You need to protect the value of the <img> tag's src attribute and of the <a> tag's href attribute against deletion. The following nested REGEXP_REPLACE calls preserve only these parts of the HTML tags in the original data:

REGEXP_REPLACE (
    REGEXP_REPLACE (
        REGEXP_REPLACE (
            'lorem <a class="interference" href="http://www.example.com"><em>ipsum</em><img src="/folder/file.jpg" /> ipsum</a> whatever'
          , '<a[^>]*? href="([^"]+)"[^>]*>|<img[^>]*? src="([^"]+)"[^>]*>|<a[^>]*? href=''([^'']+)''[^>]*>|<img[^>]*? src=''([^'']+)''[^>]*>'
          , '<<\1\2\3\4>>'
        )
      , '([^<])<[^<][^>]*>'
      , '\1'
    )
  , '<<([^>]+)>>'
  , ' \1 '
)

Explanation

1. The attribute values to be protected are wrapped in double angle brackets: << and >>. The match allows for interfering attributes between the tag name and the target attribute, and for attribute values delimited by either double or single quotes. Each of the four alternatives has its own capture group, and in every match exactly one of them is filled, so all four can be concatenated in the replacement pattern '<<\1\2\3\4>>' without any further disambiguation logic.
2. All character sequences wrapped in single angle brackets are removed.
3. The double angle brackets << and >> are removed, leaving each protected value surrounded by spaces.
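To make the intermediate strings of the three passes visible, here is a minimal sketch that mirrors them with Python's re module rather than Oracle. It assumes Python 3.5 or later (where unmatched groups in a replacement expand to the empty string); the function name keep_link_targets is made up for this illustration, and the patterns are the ones above with the doubled SQL quotes written as single quotes.

```python
import re

# Pass 1 pattern: one alternative (and one capture group) per tag/quote combination.
PROTECT = re.compile(
    r'<a[^>]*? href="([^"]+)"[^>]*>'
    r'|<img[^>]*? src="([^"]+)"[^>]*>'
    r"|<a[^>]*? href='([^']+)'[^>]*>"
    r"|<img[^>]*? src='([^']+)'[^>]*>"
)
# Pass 2 pattern: a tag that is not preceded or followed by another "<".
STRIP = re.compile(r'([^<])<[^<][^>]*>')
# Pass 3 pattern: the <<...>> markers introduced by pass 1.
UNWRAP = re.compile(r'<<([^>]+)>>')


def keep_link_targets(html):
    """Return the string after each of the three passes (hypothetical helper)."""
    protected = PROTECT.sub(r'<<\1\2\3\4>>', html)   # wrap href/src values
    stripped = STRIP.sub(r'\1', protected)           # drop the remaining tags
    unwrapped = UNWRAP.sub(r' \1 ', stripped)        # remove the markers
    return protected, stripped, unwrapped


sample = ('lorem <a class="interference" href="http://www.example.com">'
          '<em>ipsum</em><img src="/folder/file.jpg" /> ipsum</a> whatever')

for stage in keep_link_targets(sample):
    print(stage)
```

Printing the stages one per line shows where a given input goes wrong, which is harder to see in the nested SQL expression.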

Caveats

In general, using regexes as a substitute for proper parsing is strongly discouraged. It is much more error-prone, much less flexible and extensible, and a nightmare to maintain and debug.

The match does not allow for escaped quotes in the target attribute values. This should not be an issue for src and href; however, expect trouble with target attributes like title or data-...

The substitutions should not interfere with literal data, since < and > must be represented as entities in HTML unless they are used as syntax elements. However, this does not hold for XHTML data with CDATA sections, where occurrences of << and >> would be lost. If that might be an issue, test the original string for such occurrences.
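As an illustration of the first caveat, here is a minimal sketch of what "proper parsing" could look like if the string can be processed in application code instead of inside the database. It uses Python's standard html.parser module; the class name LinkTargetText and the function strip_tags_keep_targets are hypothetical, and the choice to surround each kept value with spaces simply imitates the regex version above.

```python
from html.parser import HTMLParser


class LinkTargetText(HTMLParser):
    """Collects text content plus the href of <a> and the src of <img>."""

    KEEP = {"a": "href", "img": "src"}  # tag name -> attribute to keep

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />, because the
        # default handle_startendtag delegates to handle_starttag.
        wanted = self.KEEP.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.parts.append(" " + value + " ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_keep_targets(html):
    parser = LinkTargetText()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    return "".join(parser.parts)


sample = ('lorem <a class="interference" href="http://www.example.com">'
          '<em>ipsum</em><img src="/folder/file.jpg" /> ipsum</a> whatever')
print(strip_tags_keep_targets(sample))
```

Because the parser tokenizes attributes itself, attribute order and quote style no longer need separate alternatives.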

Supplement

If you wish to preserve the said attribute values in valid markup, protect the whole tags that carry the target attributes. To this end, use the following:

REGEXP_REPLACE (
    REGEXP_REPLACE (
        REGEXP_REPLACE (
            'lorem <a href="http://www.example.com"><em>ipsum</em><img src="/folder/file.jpg" /> ipsum</a> whatever'
          , '(<a href|<img src|</a|</img)'
          , '<\1'
        )
      , '([^<])<[^<][^>]*>'
      , '\1'
    )
  , '<(<a href|<img src|</a|</img)'
  , '\1'
)

Explanation

1. The tags to be protected are prefixed with an additional <.
2. All tags that do not start with a double << are removed.
3. The << sequence is substituted back to <. This substitution is applied in the same contexts as the prefixing.
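The three steps can be mirrored outside Oracle to check what survives. The following is a minimal sketch with Python's re module, not Oracle code; the function name keep_link_tags is hypothetical, and the second call uses a made-up input to show what happens to a tag whose first attribute is not href.

```python
import re

# Tag prefixes that are protected by an extra "<".
PROTECTED = r'(<a href|<img src|</a|</img)'


def keep_link_tags(html):
    marked = re.sub(PROTECTED, r'<\1', html)                 # step 1: prefix
    stripped = re.sub(r'([^<])<[^<][^>]*>', r'\1', marked)   # step 2: drop unmarked tags
    return re.sub('<' + PROTECTED, r'\1', stripped)          # step 3: remove the prefix


sample = ('lorem <a href="http://www.example.com"><em>ipsum</em>'
          '<img src="/folder/file.jpg" /> ipsum</a> whatever')
print(keep_link_tags(sample))

# An attribute in front of href defeats the '<a href' prefix match:
print(keep_link_tags('go <a class="x" href="/y">t</a>'))
```

In the second call the opening tag is removed while the closing </a> is kept, so the output is no longer balanced markup.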

Caveats

The general caveat still holds: better not to use regexes as a stand-in for parsers.

To keep the result valid HTML, matching start and end tags need to be preserved. Unfortunately, this matching cannot be accounted for with Oracle's regexp facilities (and is very complicated even with regex engines that do support recursion). Thus, all closing a and img tags are kept. While the latter rarely occur in the wild (unless the data is XHTML), the former will cause problems for <a name="... tags.

Tags with interfering attributes between the element name and the target attribute will be deleted. Most commonly this applies to class or data- attributes. Catering for this case makes the regex more complicated again, due to the 4 supported variations (tag names a/img, single/double quote delimiters) and the potential interfering attributes:

REGEXP_REPLACE (
    REGEXP_REPLACE (
        REGEXP_REPLACE (
            REGEXP_REPLACE(
                  'lorem <a href="http://www.example.com"><em>ipsum</em><img src="/folder/file.jpg" /> ipsum</a> whatever'
                , '</(a|img)>'
                , '<</\1>'
            )
          , '<(a )[^>]*?(href="[^"]+"|href=''[^'']+'')[^>]*>|<(img )[^>]*?(src="[^"]+"|src=''[^'']+'')[^>]*>'
          , '<<\1\2\3\4>'
        )
      , '([^<])<[^<][^>]*>'
      , '\1'
    )
  , '<(<a href|<img src|</a|</img)'
  , '\1'
)
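For comparison, the four passes of this variant can be sketched in Python's re module as well. This is again an illustration rather than Oracle code; the function name keep_link_tags_tolerant and the short test input are assumptions made for the example.

```python
import re

# Opening <a>/<img> tags with the target attribute anywhere in the tag.
# Groups 1+2 are filled for <a>, groups 3+4 for <img>.
TARGET = (r'''<(a )[^>]*?(href="[^"]+"|href='[^']+')[^>]*>'''
          r'''|<(img )[^>]*?(src="[^"]+"|src='[^']+')[^>]*>''')


def keep_link_tags_tolerant(html):
    s = re.sub(r'</(a|img)>', r'<</\1>', html)        # mark closing tags
    s = re.sub(TARGET, r'<<\1\2\3\4>', s)             # rebuild and mark opening tags
    s = re.sub(r'([^<])<[^<][^>]*>', r'\1', s)        # drop all unmarked tags
    return re.sub(r'<(<a href|<img src|</a|</img)', r'\1', s)  # remove the marks


print(keep_link_tags_tolerant('go <a class="x" href="/y">t</a>'))
```

Note that the second pass rebuilds each opening tag from the tag name and the target attribute only, so every other attribute (and the trailing slash of a self-closing img tag) is dropped from the output.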
