Problem with a regular expression that removes HTML tags

Date: 2022-08-27 17:15:57

In my Ruby app, I've used the following method and regular expression to remove all HTML tags from a string:

str.gsub(/<\/?[^>]*>/,"")

This regular expression did just about all I was expecting it to, except it caused all quotation marks to be transformed into &#8220; and all single quotes to be changed to &#8221; .

What's the obvious thing I'm missing to convert the messy codes back into their proper characters?

Edit: The problem occurs with or without the Regular Expression, so it's clear my problem has nothing to do with it. My question now is how to deal with this formatting error and correct it. Thanks!

5 answers

#1


5  

Use CGI::unescapeHTML after you perform your regular expression substitution:

CGI::unescapeHTML(str.gsub(/<\/?[^>]*>/,""))

See http://www.ruby-doc.org/core/classes/CGI.html#M000547

In the above code snippet, gsub removes all HTML tags. Then, unescapeHTML() reverts all HTML entities (such as &lt; and &#8220;) to their actual characters (<, quotation marks, etc.).
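
As a minimal sketch of those two steps, the snippet below wraps them in a helper. The method name strip_tags_and_decode and the sample string are hypothetical and chosen only for illustration; CGI.unescapeHTML and String#gsub are the calls named above.

```ruby
require "cgi"

# Hypothetical helper: strip tags first, then decode the entities that remain.
def strip_tags_and_decode(html)
  stripped = html.gsub(/<\/?[^>]*>/, "") # the question's regular expression
  CGI.unescapeHTML(stripped)             # &#8220; and &#8221; become curly quotes
end

# Hypothetical input whose quotation marks arrive as numeric entities.
puts strip_tags_and_decode("<p>She said &#8220;hello&#8221; &amp; left.</p>")
```

This assumes the input string is UTF-8 encoded, which is the default for Ruby string literals.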

With respect to another post on this page, note that you will never ever be passed HTML such as

<tag attribute="<value>">2 + 3 < 6</tag>

(which is invalid HTML); what you may receive is, instead:

<tag attribute="&lt;value&gt;">2 + 3 &lt; 6</tag>

The call to gsub will transform the above to:

2 + 3 &lt; 6

And unescapeHTML will finish the job:

2 + 3 < 6
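
The order of the two calls matters. The sketch below runs the escaped sample from this answer through both orderings; the constant TAG_PATTERN and the two method names are hypothetical, introduced only for this comparison.

```ruby
require "cgi"

# The question's tag-stripping pattern, named here only for readability.
TAG_PATTERN = /<\/?[^>]*>/

# Order recommended in this answer: remove tags, then decode entities.
def strip_then_decode(html)
  CGI.unescapeHTML(html.gsub(TAG_PATTERN, ""))
end

# Reversed order: decoding first turns &lt; and &gt; back into angle
# brackets, which the pattern then treats as tag delimiters.
def decode_then_strip(html)
  CGI.unescapeHTML(html).gsub(TAG_PATTERN, "")
end

sample = '<tag attribute="&lt;value&gt;">2 + 3 &lt; 6</tag>'
puts strip_then_decode(sample)
puts decode_then_strip(sample)
```

With the reversed order, part of the attribute leaks into the output and the text after the decoded < is swallowed, which is why the substitution goes first.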

#2


2  

You're going to run into more trouble when you see something like:

<doohickey name="<foobar>">

You'll want to apply something like:

gsub(/<[^<>]*>/, "")

...for as long as the pattern matches.
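
One way to read "for as long as the pattern matches" is a loop around gsub!, which returns nil once nothing is left to replace. This is only a sketch; the method name strip_nested_tags, the constant INNERMOST_TAG, and the sample input are hypothetical.

```ruby
# Matches only a tag that contains no further angle brackets,
# so nested tags are peeled off from the inside out.
INNERMOST_TAG = /<[^<>]*>/

def strip_nested_tags(html)
  text = html.dup # work on a copy so the caller's string is untouched
  loop do
    # gsub! returns nil when no substitution was made on this pass
    break unless text.gsub!(INNERMOST_TAG, "")
  end
  text
end

puts strip_nested_tags('Hello <doohickey name="<foobar>">world')
```

On this input the first pass removes <foobar>, the second removes the now well-formed outer tag, and the third finds nothing and ends the loop.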

#3


2  

This regular expression did just about all I was expecting it to, except it caused all quotation marks to be transformed into &#8220; and all single quotes to be changed to &#8221; .

This doesn't sound like something the RegExp would do. Are you sure the text was different before the substitution?

See this question for information about the problem; it has an excellent answer:
Get non UTF-8 form fields as UTF-8 in php.

#4


0  

I've run into a similar problem with character changes. It happened when my code ran through another module that enforced UTF-8 encoding; when it came back, I had a different file (a slurped array of lines) on my hands.

#5


-3  

You could use a multi-pass system to get the results you are looking for.

After running your regular expression, run an expression to convert &#8220; to quotes and another to convert &#8221; to single quotes.
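
A rough sketch of such a multi-pass replacement follows. The table ENTITY_REPLACEMENTS and the method replace_quote_entities are hypothetical names, and the right-hand sides are an assumption: &#8220; and &#8221; are the code points for the left and right double quotation marks, so substitute whichever characters you actually want.

```ruby
# Hypothetical mapping from numeric entities to replacement characters.
ENTITY_REPLACEMENTS = {
  "&#8220;" => "\u201C", # left double quotation mark
  "&#8221;" => "\u201D"  # right double quotation mark
}.freeze

def replace_quote_entities(text)
  # One gsub pass per entity, each pass feeding the next.
  ENTITY_REPLACEMENTS.reduce(text) do |result, (entity, char)|
    result.gsub(entity, char)
  end
end

puts replace_quote_entities("&#8220;quoted&#8221; text")
```

This handles only the entities listed in the table; any other entity in the input is left as it is.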
