Regex to strip attributes and values from an HTML tag

Time: 2022-08-27 17:20:33

Hi guys, I'm very new to regex. Can you help me with this?

I have a string like this: "<input attribute='value' >", where attribute='value' could be anything, and I want to use preg_replace to get just <input />.

How do I specify a wildcard to replace any number of any characters in a string?

Like this? preg_replace("/<input.*>/",$replacement,$string);

Many thanks

4 answers

#1


10  

What you have:

.*

will match "any character, and as many as possible".

What you mean is

[^>]+

which translates to "any character that's not a '>', and there must be at least one",

or alternatively,

.*?

which means "any character, but only enough to make this rule work".
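
To make the difference concrete, here is a small sketch that runs the three patterns against the same input. The sample string and variable names are made up for illustration; only preg_match itself comes from PHP.

```php
<?php
// Hypothetical sample input: an input tag followed by another tag on the same line.
$string = "<input type='text' name='a' ><b>bold</b>";

// Greedy dot: runs on to the LAST '>' in the string.
preg_match("/<input.*>/", $string, $greedy);
// Negated class: cannot cross a '>', so it stops at the first one.
preg_match("/<input[^>]+>/", $string, $negated);
// Lazy dot: takes as little as possible, so it also stops at the first '>'.
preg_match("/<input.*?>/", $string, $lazy);

echo $greedy[0], "\n";   // <input type='text' name='a' ><b>bold</b>
echo $negated[0], "\n";  // <input type='text' name='a' >
echo $lazy[0], "\n";     // <input type='text' name='a' >
```

On this input the greedy version swallows the following tags as well, which is usually not what you want in a replace.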

BUT DON'T.

Parsing HTML with regexps is Bad.

Use any of the existing HTML parsers, DOM libraries, anything. Just NOT NAÏVE REGEX.

For example:

 <foo attr=">">

will get grabbed wrongly by the regex as

'<foo attr=">', leaving '">' behind as trailing text.

Which will lead you to this regex:

 <[a-zA-Z]+( [a-zA-Z]+=['"][^"']*['"])*>  etc etc

at which point you'll discover this lovely gem:

 <foo attr="'>\'\"">

and your head will explode.

(The syntax highlighter proves my point: it matches incorrectly, thinking I've ended the tag.)
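
As a rough sketch of the parser route recommended above, something like the following should work with PHP's bundled DOM extension. It assumes ext-dom is enabled; the sample markup and variable names are invented for illustration.

```php
<?php
// Hypothetical input with the awkward '>' inside an attribute value.
$html = '<p><input type="text" value=">" ></p>';

$doc = new DOMDocument();
// These flags ask libxml not to wrap the fragment in <html>/<body> or add a doctype.
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach ($doc->getElementsByTagName('input') as $input) {
    // Remove attributes one at a time until none are left.
    while ($input->attributes->length > 0) {
        $input->removeAttribute($input->attributes->item(0)->nodeName);
    }
}

$result = $doc->saveHTML();
echo $result;
```

Note that saveHTML() serializes as HTML, so expect a bare <input> rather than the XHTML-style <input />. The point is that the parser, not your pattern, decides where the tag ends.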

#2


1  

Some people were close... but not 100%:

This:

preg_replace("/<input[^>]*>/", $replacement, $string);

should be this:

preg_replace("/<input[^>]*?>/", $replacement, $string);

You don't want that to be a greedy match.
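
For what it's worth, a quick check suggests the two variants behave the same here: because [^>] can never consume a '>', the greedy form already stops at the first one. A minimal sketch, with a made-up sample string:

```php
<?php
// Hypothetical sample input with two input tags back to back.
$string = "<input type='text' ><input name='b' >";

$greedy = preg_replace("/<input[^>]*>/", "<input />", $string);
$lazy   = preg_replace("/<input[^>]*?>/", "<input />", $string);

var_dump($greedy === $lazy); // bool(true)
echo $greedy, "\n";          // <input /><input />
```

Laziness matters with the dot (.*? versus .*), where the quantified item can match '>' itself.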

#3


0  

preg_replace("/<input[^>]*>/", $replacement, $string);
// [^>] means "any character except the greater-than symbol / right tag bracket"

This is really basic stuff; you should catch up on some reading. :-)

#4


0  

If I understand the question correctly, you have the code:

preg_replace("/<input.*>/",$replacement,$string);

and you want us to tell you what to use for $replacement to delete what was matched by .*

You have to go about this the other way around. Use capturing groups to capture what you want to keep, and reinsert that into the replacement. E.g.:

preg_replace("/(<input).*(>)/","$1$2",$string);

Of course, you don't really need capturing groups here, as you're only reinserting literal text. But the above shows the technique, in case you want to do this in a situation where the tag can vary. This is a better solution:

preg_replace("/<input [^>]*>/","<input />",$string);

The negated character class is more specific than the dot. This regex will work if there are two HTML tags in the string. Your original regex won't.
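
Here is one way the "tag can vary" case might look, combining a capturing group with the negated character class. The pattern and sample string are illustrative assumptions, not something taken from the question.

```php
<?php
// Hypothetical input with two different tags, each carrying attributes.
$string = "<input type='text' ><img src='a.png' >";

// (\w+) captures the tag name; [^>]* swallows everything up to the closing '>'.
// Single quotes keep PHP from trying to interpolate $1 as a variable.
$result = preg_replace('/<(\w+)[^>]*>/', '<$1 />', $string);

echo $result, "\n"; // <input /><img />
```

This still trips over a '>' inside an attribute value, so it is only reasonable for input you control.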
