I have a large string in JavaScript from which I need to strip all HTML except for specific tags.
I'm currently using:
var noHTML = /(<([^>]+)>)/ig;
This strips all the HTML. What can I add to the regex so that it ignores <mark> tags while doing this?
2 Answers
#1
1
As mentioned in the comments above, regex isn't really the right tool for parsing HTML. That being said, one way to do this is to use a negative lookahead for the tags you want to keep:
var noHTML = /(?!(<ul|<\/ul>))(<([^>]+)>)/ig;
In this example, "ul" tags are kept and every other tag is matched.
So, specific to your case:
var noHTML = /(?!(<mark|<\/mark>))(<([^>]+)>)/ig;
You can see it working in this fiddle: https://jsfiddle.net/0xgs0u9m/
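One caveat with the pattern above: the lookahead only checks for the prefix `<mark`, so a tag such as `<marker>` would be kept too. Below is a minimal sketch of a slightly stricter variant that adds a word boundary after the tag name. The helper name `stripTagsExcept` and its `keep` parameter are made up for illustration, and it assumes the tag names contain no regex metacharacters.

```javascript
// Hypothetical helper (not from the original answer): removes every tag
// except those whose name is listed in `keep`.
function stripTagsExcept(html, keep) {
  // For keep = ['mark'] this builds /<(?!\/?(?:mark)\b)[^>]+>/gi
  // The \b stops "mark" from also matching "marker".
  var pattern = new RegExp('<(?!\\/?(?:' + keep.join('|') + ')\\b)[^>]+>', 'gi');
  return html.replace(pattern, '');
}

var input = '<p>plain <mark>kept</mark> <marker>not kept</marker></p>';
console.log(stripTagsExcept(input, ['mark']));
// plain <mark>kept</mark> not kept
```

It is still a regex over HTML, so the usual caveats apply (for instance, a `>` inside an attribute value will confuse it).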
You may instead want to consider using an HTML parser, such as htmlparser on npm:
https://www.npmjs.com/package/htmlparser
From their example:
var handler = new Tautologistics.NodeHtmlParser.DefaultHandler(function (error, dom) {
    if (error) {
        // [...do something for errors...]
    } else {
        // [...parsing done, do something...]
    }
});
var parser = new Tautologistics.NodeHtmlParser.Parser(handler);
parser.parseComplete(document.body.innerHTML);
alert(JSON.stringify(handler.dom, null, 2));
Results in:
[ { raw: 'Xyz ', data: 'Xyz ', type: 'text' }
, { raw: 'script language= javascript'
  , data: 'script language= javascript'
  , type: 'script'
  , name: 'script'
  , attribs: { language: 'javascript' }
  , children:
     [ { raw: 'var foo = \'<bar>\';<'
       , data: 'var foo = \'<bar>\';<'
       , type: 'text'
       }
     ]
  }
, { raw: '<!-- Waah! -- '
  , data: '<!-- Waah! -- '
  , type: 'comment'
  }
]
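Once you have a tree like that, stripping tags becomes a matter of walking it and deciding per node what to emit. Here is a rough sketch of that idea. The node shape (`type`, `name`, `data`, `children`) is assumed from the sample output above, the `'tag'` type for ordinary elements is an assumption about the library, and `serializeKeeping` and `sampleDom` are hypothetical names, with a hand-written tree standing in for `handler.dom`.

```javascript
// Rebuilds a string from a DOM array shaped like the output above,
// keeping only the tags named in `keep`. Kept tags are re-emitted
// without attributes; comments, scripts and other node types are dropped.
function serializeKeeping(nodes, keep) {
  return (nodes || []).map(function (node) {
    if (node.type === 'text') return node.data;
    if (node.type !== 'tag') return '';
    var inner = serializeKeeping(node.children, keep);
    return keep.indexOf(node.name) > -1
      ? '<' + node.name + '>' + inner + '</' + node.name + '>'
      : inner;
  }).join('');
}

// Hand-written sample tree (hypothetical), standing in for handler.dom.
var sampleDom = [
  { type: 'text', data: 'something in ' },
  { type: 'tag', name: 'b', children: [{ type: 'text', data: 'bold' }] },
  { type: 'text', data: ' ' },
  { type: 'tag', name: 'mark', children: [
    { type: 'text', data: 'mark ' },
    { type: 'tag', name: 'i', children: [{ type: 'text', data: 'italics' }] }
  ] },
  { type: 'comment', data: ' dropped ' }
];

console.log(serializeKeeping(sampleDom, ['mark']));
// something in bold <mark>mark italics</mark>
```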
#2
0
One way is to use the JS implementation of PHP's strip_tags (from the phpjs project):
function strip_tags(input, allowed) {
    allowed = (((allowed || '') + '')
        .toLowerCase()
        .match(/<[a-z][a-z0-9]*>/g) || [])
        .join(''); // making sure the allowed arg is a string containing only tags in lowercase (<a><b><c>)
    var tags = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi,
        commentsAndPhpTags = /<!--[\s\S]*?-->|<\?(?:php)?[\s\S]*?\?>/gi;
    return input.replace(commentsAndPhpTags, '')
        .replace(tags, function($0, $1) {
            return allowed.indexOf('<' + $1.toLowerCase() + '>') > -1 ? $0 : '';
        });
}
Usage
var x="<html><body>something in <b>bold</b> <mark>mark <i>italics</i> </mark>";
console.log(strip_tags(x,"<mark>")); //"something in bold <mark>mark italics </mark>"
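One thing to be aware of: the function above returns an allowed tag unchanged, so any attributes on it (including things like onclick) survive. If that matters for your input, a variant along these lines might help. This is only a sketch, the name `stripTagsAndAttributes` is made up here, and it takes the allowed names as a plain array rather than a `<a><b>` string.

```javascript
// Hypothetical variant of the approach above: keeps only the tag names in
// `allowedNames` and rewrites each kept tag without its attributes.
function stripTagsAndAttributes(input, allowedNames) {
  var tags = /<(\/?)([a-z][a-z0-9]*)\b[^>]*>/gi;
  return input.replace(tags, function (match, slash, name) {
    name = name.toLowerCase();
    // Re-emit a bare <name> or </name> for allowed tags, drop the rest.
    return allowedNames.indexOf(name) > -1 ? '<' + slash + name + '>' : '';
  });
}

var dirty = '<p>see <MARK class="hl" onclick="evil()">this</MARK></p>';
console.log(stripTagsAndAttributes(dirty, ['mark']));
// see <mark>this</mark>
```

Like the original, it assumes attribute values do not contain a literal `>`.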