Strip all HTML from a string except mark tags

I have a large string in JavaScript from which I need to strip all the HTML except for specific tags.

I'm currently using

var noHTML = /(<([^>]+)>)/ig;

This strips all the HTML. What can I add to the regex so that it ignores mark tags while doing this?

2 Solutions

#1

As mentioned in the comments above, regex isn't really the right tool for parsing HTML. That being said, one way to do this is to use a negative lookahead for the tags you want to keep:

var noHTML = /(?!(<ul|<\/ul>))(<([^>]+)>)/ig;

In this example, "ul" tags are kept.

So, specific to your case:

var noHTML = /(?!(<mark|<\/mark>))(<([^>]+)>)/ig;

You can see it working here in this fiddle: https://jsfiddle.net/0xgs0u9m/
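
As a minimal sketch of how that pattern might be applied with String.prototype.replace: the function name stripExceptMark and the sample string below are made up for illustration, and the added \b is a small variation on the regex above (an assumption about what is wanted) so that a tag such as <marker> is not kept by accident.

```javascript
// Hypothetical helper: strips every tag except <mark ...> and </mark>.
function stripExceptMark(html) {
  // Negative lookahead: skip any "<" that starts an opening or closing mark tag.
  // The \b makes sure "mark" is the whole tag name, so <marker> is still stripped.
  var noHTML = /(?!<\/?mark\b)<[^>]+>/gi;
  return html.replace(noHTML, '');
}

var sample = '<p>Some <b>bold</b> and <mark>marked</mark> text</p>';
console.log(stripExceptMark(sample)); // Some bold and <mark>marked</mark> text
```

Like any regex-based approach, this works on tag-shaped text only; it does not understand comments, scripts, or a ">" inside an attribute value.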

You may also want to consider using an HTML parser instead, such as htmlparser on npm:

https://www.npmjs.com/package/htmlparser

From their example:

var handler = new Tautologistics.NodeHtmlParser.DefaultHandler(function (error, dom) {
    if (error) {
        // ...do something for errors...
        console.error(error);
    } else {
        // ...parsing done, do something...
    }
});
var parser = new Tautologistics.NodeHtmlParser.Parser(handler);
parser.parseComplete(document.body.innerHTML);
alert(JSON.stringify(handler.dom, null, 2));

Results in:

[ { raw: 'Xyz ', data: 'Xyz ', type: 'text' }
, { raw: 'script language= javascript'
  , data: 'script language= javascript'
  , type: 'script'
  , name: 'script'
  , attribs: { language: 'javascript' }
  , children: 
     [ { raw: 'var foo = \'<bar>\';<'
       , data: 'var foo = \'<bar>\';<'
       , type: 'text'
       }
     ]
  }
, { raw: '<!-- Waah! -- '
  , data: '<!-- Waah! -- '
  , type: 'comment'
  }
]
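
Once the markup is a tree like this, keeping only certain tags becomes a tree walk rather than a regex. The following is a rough sketch, not part of htmlparser: renderKeeping and sampleDom are hypothetical names, the sample nodes are hand-written in the shape shown above, and the 'tag' node type for ordinary elements is an assumption about the parser's output.

```javascript
// Hypothetical helper: rebuilds a string from parser-style nodes,
// keeping only the tags whose names are listed in `keep`.
function renderKeeping(nodes, keep) {
  return nodes.map(function (node) {
    if (node.type === 'text') return node.data;
    // Design choice for this sketch: drop comments, scripts and styles entirely.
    if (node.type === 'comment' || node.type === 'script' || node.type === 'style') return '';
    var inner = renderKeeping(node.children || [], keep);
    if (node.type === 'tag' && keep.indexOf(node.name.toLowerCase()) > -1) {
      // Attributes are not re-emitted here; only the bare tag is kept.
      return '<' + node.name + '>' + inner + '</' + node.name + '>';
    }
    return inner; // any other element: keep its contents, drop the tag itself
  }).join('');
}

// Hand-written sample data (assumed shape), not real parser output.
var sampleDom = [
  { type: 'text', data: 'something in ' },
  { type: 'tag', name: 'b', children: [{ type: 'text', data: 'bold' }] },
  { type: 'text', data: ' ' },
  { type: 'tag', name: 'mark', children: [
    { type: 'text', data: 'mark ' },
    { type: 'tag', name: 'i', children: [{ type: 'text', data: 'italics' }] }
  ] }
];

console.log(renderKeeping(sampleDom, ['mark'])); // something in bold <mark>mark italics</mark>
```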

#2

One way is to use the JavaScript implementation of PHP's strip_tags.

JS strip_tags (from the phpjs project)

function strip_tags(input, allowed) {
  allowed = (((allowed || '') + '')
    .toLowerCase()
    .match(/<[a-z][a-z0-9]*>/g) || [])
    .join(''); // making sure the allowed arg is a string containing only tags in lowercase (<a><b><c>)
  var tags = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi,
    commentsAndPhpTags = /<!--[\s\S]*?-->|<\?(?:php)?[\s\S]*?\?>/gi;
  return input.replace(commentsAndPhpTags, '')
    .replace(tags, function($0, $1) {
      return allowed.indexOf('<' + $1.toLowerCase() + '>') > -1 ? $0 : '';
    });
}

Usage

var x="<html><body>something in <b>bold</b> <mark>mark <i>italics</i> </mark>";
console.log(strip_tags(x,"<mark>")); //"something in bold <mark>mark italics </mark>"
