How do I remove all HTML tags except certain ones?

I need to remove all HTML tags except:

it is a <sub> tag

it is preceded by {1 (or more) newline(s) + 4 (or more) spaces}

it is surrounded by "`" characters.

Here is an example:

var str = "something1
           <sub>
             something2
             <div class='myclass'>something3</div>
           </sub>
           <div class='myclass'>something4</div>
           something5

               <div class='myclass'>something6</div>
           <div class='myclass'>something7</div>
           `<div>something8</div>`
           something9";

Expected output:

something1
<sub>
  something2
  something3
</sub>
something4
something5

    <div class='myclass'>something6</div>
something7
`<div>something8</div>`
something9

Here is what I've tried so far:

/\n\s{0,3}<.*[^>]+|<sub>.*?<\/sub>|`.*?`/gm

1 Solution

#1

This is possible with a regex substitution. Use this regex with the m and g modifiers:

(\n\n    .*|`[^`]+`|<\/?sub\b[^>]*>)|<[^>]+>

And use $1 as the substitution.
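
As a minimal sketch of that substitution in JavaScript: the names `keepOrStrip`, `stripTags`, and `sample` are made up for illustration, and the sample is the question's input with the shared leading indentation removed. The sub alternative is written with `[^>]*` so that a bare `<sub>` with no attributes also matches.

```javascript
// Group 1 holds the text we want to keep; the final alternative matches
// every other tag and leaves group 1 empty.
const keepOrStrip = /(\n\n    .*|`[^`]+`|<\/?sub\b[^>]*>)|<[^>]+>/gm;

function stripTags(input) {
  // "$1" puts back whatever group 1 captured. For tags matched by the
  // last alternative, group 1 did not participate, so "$1" is empty
  // and the tag disappears.
  return input.replace(keepOrStrip, "$1");
}

// Hypothetical sample: the question's input without the common indentation.
const sample = [
  "something1",
  "<sub>",
  "  something2",
  "  <div class='myclass'>something3</div>",
  "</sub>",
  "<div class='myclass'>something4</div>",
  "something5",
  "",
  "    <div class='myclass'>something6</div>",
  "<div class='myclass'>something7</div>",
  "`<div>something8</div>`",
  "something9",
].join("\n");

console.log(stripTags(sample));
```

With this input, the div around something6 survives because its line follows a blank line and starts with 4 spaces, while the div around something7 is stripped.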

There are several parts to this. The capturing group finds all the HTML you may want to keep:

\n\n    .* A blank line, followed by a line that starts with 4 spaces.

`[^`]+` Anything enclosed in backticks.

<\/?sub\b[^>]*> This matches sub HTML tags, opening or closing.

Any other HTML tag matches <[^>]+>, which sits outside the capturing group, so $1 is empty for it and the tag is discarded.
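
To make the keep-or-discard decision explicit, the same pattern can be assembled from its parts and used with a replacer callback instead of $1. This is only a sketch: `keepParts`, `tagPattern`, and `stripWithCallback` are hypothetical names, and the sub alternative uses `[^>]*` so that a bare `<sub>` is kept.

```javascript
// The three "keep" alternatives, written out separately.
// String.raw avoids double-escaping the backslashes.
const keepParts = [
  String.raw`\n\n    .*`,     // blank line, then a line indented by 4 spaces
  "`[^`]+`",                  // anything wrapped in backticks
  String.raw`</?sub\b[^>]*>`, // opening or closing <sub> tag
];

// (keep1|keep2|keep3)|<any other tag>
const tagPattern = new RegExp("(" + keepParts.join("|") + ")|<[^>]+>", "g");

function stripWithCallback(input) {
  // `keep` is group 1: defined when one of the keep alternatives matched,
  // undefined when the match was an ordinary tag to be dropped.
  return input.replace(tagPattern, (match, keep) =>
    keep !== undefined ? keep : ""
  );
}

console.log(stripWithCallback("a <b>bold</b> <sub>x</sub> `<i>`"));
// prints: a bold <sub>x</sub> `<i>`
```

Order matters here: because the keep alternatives come first, a `<sub>` tag or a backtick-quoted span is consumed by group 1 before the generic `<[^>]+>` gets a chance to match it.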
