Using Regex to remove script tags

Time: 2021-10-22 00:25:35

I'm trying to use a regular expression I found on this website, and it doesn't seem to work. Any ideas?

Input string:

sFetch = "123<script type=\"text/javascript\">\n\t\tfunction utmx_section(){}function utmx(){}\n\t\t(function()})();\n\t</script>456";

Regex:

sFetch = Regex.Replace(sFetch, "<script.*?>.*?</script>", "", RegexOptions.IgnoreCase);

4 solutions

#1


9  

Add RegexOptions.Singleline

RegexOptions.IgnoreCase | RegexOptions.Singleline

Even then, the regex will never work on input like the following:

<script
>
alert(1)
</script
/**/
>
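
Both points can be checked with a minimal C# sketch, assuming the input string from the question. With RegexOptions.Singleline the original pattern removes the block, while the unusual closing tag shown above is left untouched. The names ScriptStripper and Strip are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Hypothetical helper: the question's pattern plus Singleline,
    // so that '.' also matches '\n'.
    public static string Strip(string html)
    {
        return Regex.Replace(
            html,
            "<script.*?>.*?</script>",
            "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        string sFetch = "123<script type=\"text/javascript\">\n\t\tfunction utmx_section(){}function utmx(){}\n\t\t(function()})();\n\t</script>456";
        Console.WriteLine(Strip(sFetch)); // prints 123456

        // The closing tag is split across lines, so the literal
        // "</script>" never appears and nothing is removed.
        string tricky = "<script\n>\nalert(1)\n</script\n/**/\n>";
        Console.WriteLine(Strip(tricky) == tricky); // prints True
    }
}
```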

So, find an HTML parser such as HTML Agility Pack.

#2


7  

The regex fails because your input contains newlines, and by default the metacharacter . does not match a newline.

To solve this you can use the RegexOptions.Singleline option as S.Mark says, or you can change the regex to:

"<script[\d\D]*?>[\d\D]*?</script>"

which uses [\d\D] instead of the dot (.).

\d is any digit and \D is any non-digit, so [\d\D] is a digit or a non-digit which is effectively any char.
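
As a rough illustration, the sketch below runs both patterns without RegexOptions.Singleline on a shortened, made-up input that contains newlines. The class name AnyCharDemo and its Strip helper are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnyCharDemo
{
    // Hypothetical helper: applies a pattern with IgnoreCase only (no Singleline).
    public static string Strip(string html, string pattern)
    {
        return Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string input = "123<script type=\"text/javascript\">\nalert(1);\n</script>456";

        // '.' stops at '\n', so the dot version finds no match.
        Console.WriteLine(Strip(input, "<script.*?>.*?</script>") == input); // prints True

        // [\d\D] matches every character, newlines included.
        Console.WriteLine(Strip(input, @"<script[\d\D]*?>[\d\D]*?</script>")); // prints 123456
    }
}
```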

#3


4  

If you actually want to sanitize an HTML string (and you're using .NET), then take a look at the Microsoft Web Protection Library:

Sanitizer.GetSafeHtmlFragment(untrustedHtml);

There's a description here.

#4


1  

This is a bit shorter:

 "<script[^<]*</script>"

or

"<[^>]*>[^>]*>"
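
A small sketch of how these two patterns behave, assuming they are used with RegexOptions.IgnoreCase as in the question. Both strip the question's input, but the first stops matching as soon as the script body contains a '<', and the second is not specific to script tags. The class name ShortPatternDemo and the extra test strings are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ShortPatternDemo
{
    // Hypothetical helper: applies a pattern with the question's IgnoreCase option.
    public static string Strip(string html, string pattern)
    {
        return Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        const string first = "<script[^<]*</script>";
        const string second = "<[^>]*>[^>]*>";

        string sFetch = "123<script type=\"text/javascript\">\n\t\tfunction utmx_section(){}function utmx(){}\n\t\t(function()})();\n\t</script>456";
        Console.WriteLine(Strip(sFetch, first));  // prints 123456
        Console.WriteLine(Strip(sFetch, second)); // prints 123456

        // A '<' inside the script body defeats the first pattern.
        string withLessThan = "x<script>if (a < b) {}</script>y";
        Console.WriteLine(Strip(withLessThan, first) == withLessThan); // prints True

        // The second pattern removes any pair of tags, not only <script>.
        Console.WriteLine(Strip("a<b>bold</b>c", second)); // prints ac
    }
}
```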
