What's the easiest way to escape HTML in Python?

Date: 2022-05-15 20:03:14

cgi.escape seems like one possible choice. Does it work well? Is there something that is considered better?

9 answers

#1


159  

cgi.escape is fine. It escapes:

  • < to &lt;
  • > to &gt;
  • & to &amp;

That is enough for text placed between HTML tags.

EDIT: If you also want to escape non-ASCII characters, for inclusion in a document that uses a different encoding, as Craig says, just use:

data.encode('ascii', 'xmlcharrefreplace')

Don't forget to decode data to unicode first, using whatever encoding it was originally encoded in.

However, in my experience that kind of encoding is unnecessary if you work with unicode throughout. Just encode at the end to the encoding specified in the document header (utf-8 for maximum compatibility).

Example:

>>> cgi.escape(u'<a>bá</a>').encode('ascii', 'xmlcharrefreplace')
'&lt;a&gt;b&#225;&lt;/a&gt;'

Also worth noting (thanks Greg) is the optional quote parameter that cgi.escape takes. With it set to True, cgi.escape also escapes double quote characters (") so you can use the resulting value in an XML/HTML attribute.

EDIT: Note that cgi.escape has been deprecated since Python 3.2 in favor of html.escape (and was removed in Python 3.8). html.escape behaves the same except that quote defaults to True and, when quote is true, single quotes (') are escaped as well.
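On Python 3 the two modes described above (text content versus attribute values) can be sketched with html.escape. The helper names escape_text and escape_attribute are hypothetical, chosen only for this illustration.

```python
import html


def escape_text(value):
    """Escape &, < and > only; enough for text between tags."""
    return html.escape(value, quote=False)


def escape_attribute(value):
    """Also escape double and single quotes, for use inside attribute values."""
    return html.escape(value, quote=True)


print(escape_text('<a href="x">Tom & Jerry</a>'))
# &lt;a href="x"&gt;Tom &amp; Jerry&lt;/a&gt;
print(escape_attribute('say "hi" & \'bye\''))
# say &quot;hi&quot; &amp; &#x27;bye&#x27;
```

Since quote defaults to True in html.escape, calling it with no extra argument gives the attribute-safe behavior.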

#2


65  

In Python 3.2 a new html module was introduced, which is used for escaping reserved characters in HTML markup.

It provides an escape() function:

>>> import html
>>> html.escape('x > 2 && x < 7')
'x &gt; 2 &amp;&amp; x &lt; 7'

#3


8  

cgi.escape is good enough to escape HTML in the limited sense of escaping the characters that start HTML tags and character entities.

But you might also have to consider encoding issues: if the HTML you want to quote has non-ASCII characters in a particular encoding, then you also have to take care to represent those sensibly when quoting. Perhaps you could convert them to entities. Otherwise you should ensure that the correct encoding translations are done between the "source" HTML and the page it's embedded in, to avoid corrupting the non-ASCII characters.
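A minimal sketch of that decode, escape, encode sequence on Python 3 might look like the following. The function name requote and its parameters are hypothetical; the only library calls used are bytes.decode, html.escape and str.encode with the 'xmlcharrefreplace' error handler.

```python
import html


def requote(raw_bytes, source_encoding, target_encoding="utf-8"):
    """Escape HTML held as bytes and re-encode it for the embedding page."""
    # 1. Decode using the encoding the source HTML was actually stored in.
    text = raw_bytes.decode(source_encoding)
    # 2. Escape the markup characters while working on text, not bytes.
    escaped = html.escape(text)
    # 3. Encode for the target page; characters the target encoding cannot
    #    represent become numeric character references instead of errors.
    return escaped.encode(target_encoding, "xmlcharrefreplace")


latin1_source = "<b>café</b>".encode("latin-1")
print(requote(latin1_source, "latin-1"))           # UTF-8 page keeps é as bytes
print(requote(latin1_source, "latin-1", "ascii"))  # b'&lt;b&gt;caf&#233;&lt;/b&gt;'
```

With a UTF-8 target nothing is turned into a numeric reference, which matches the advice to work in unicode and encode once at the end.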

#4


6  

If you wish to escape HTML in a URL:

This is probably NOT what the OP wanted (the question doesn't clearly indicate the context in which the escaping is meant to be used), but Python's standard library module urllib has a function, quote, that percent-encodes characters such as < > & so they can be included safely in a URL.

The following is an example:

#!/usr/bin/python
from urllib import quote

x = '+<>^&'
print quote(x) # prints '%2B%3C%3E%5E%26'

Find docs here
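The snippet above uses the Python 2 import path and print statement. On Python 3 the same function lives in urllib.parse; a small sketch follows, where the wrapper name url_escape is hypothetical.

```python
from urllib.parse import quote


def url_escape(value):
    """Percent-encode a value for use inside a URL component."""
    # safe="" also encodes "/", which quote() leaves untouched by default.
    return quote(value, safe="")


print(url_escape("+<>^&"))  # %2B%3C%3E%5E%26
print(url_escape("a/b c"))  # a%2Fb%20c
```

Note that this is URL escaping, not HTML escaping: a URL placed in an href attribute still needs HTML attribute escaping on top.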

#5


4  

There is also the excellent markupsafe package.

>>> from markupsafe import Markup, escape
>>> escape("<script>alert(document.cookie);</script>")
Markup(u'&lt;script&gt;alert(document.cookie);&lt;/script&gt;')

The markupsafe package is well engineered, and probably the most versatile and Pythonic way to go about escaping, IMHO, because:

  1. the return value (Markup) is an instance of a class derived from unicode, i.e. isinstance(escape('str'), unicode) == True
  2. it properly handles unicode input
  3. it works in Python (2.6, 2.7, 3.3, and pypy)
  4. it respects custom methods of objects (i.e. objects with a __html__ method) and template overloads (__html_format__).
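To illustrate the __html__ convention mentioned in the list, here is a standard-library-only sketch of the idea. It is not markupsafe's implementation: SafeString, escape and Badge are hypothetical names written for this example, and the real package does considerably more.

```python
import html


class SafeString(str):
    """Hypothetical stand-in for markupsafe.Markup: text already safe for HTML."""

    def __html__(self):
        return self


def escape(value):
    """Escape value unless it declares itself safe through __html__()."""
    if hasattr(value, "__html__"):
        return SafeString(value.__html__())
    return SafeString(html.escape(str(value)))


class Badge:
    """Example object that renders its own markup."""

    def __init__(self, label):
        self.label = label

    def __html__(self):
        return "<span>" + html.escape(self.label) + "</span>"


once = escape("<b>")
print(once)                 # &lt;b&gt;
print(escape(once))         # &lt;b&gt;  (not escaped a second time)
print(escape(Badge("R&D"))) # <span>R&amp;D</span>
```

Because the result is a str subclass that carries its own "already safe" marker, passing it through escape again does not double-escape it.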

#6


2  

cgi.escape extended

This version extends cgi.escape: it also preserves whitespace and newlines, and it returns a unicode string.

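import cgi  # Python 2; cgi.escape is not available on Python 3.8+

def escape_html(text):
    """Escape strings for display in HTML, keeping line breaks, tabs and runs of spaces visible."""
    return (cgi.escape(text, quote=True)
            .replace(u'\n', u'<br />')
            .replace(u'\t', u'&emsp;')
            .replace(u'  ', u' &nbsp;'))

For example: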

>>> escape_html('<foo>\nfoo\t"bar"')
u'&lt;foo&gt;<br />foo&emsp;&quot;bar&quot;'
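Since cgi.escape is gone from current Python 3 releases, a port of the same idea might look like the sketch below. It swaps in html.escape, which is an assumption on my part rather than something the answer specifies; one visible difference is that html.escape with quote=True also escapes single quotes.

```python
import html


def escape_html(text):
    """Escape text for display in HTML, keeping line breaks and spacing visible."""
    return (
        html.escape(text, quote=True)
        .replace("\n", "<br />")   # newlines become explicit line breaks
        .replace("\t", "&emsp;")   # tabs become a wide space
        .replace("  ", " &nbsp;")  # double spaces survive HTML whitespace collapsing
    )


print(escape_html('<foo>\nfoo\t"bar"'))
# &lt;foo&gt;<br />foo&emsp;&quot;bar&quot;
```

The escaping must run before the replacements, otherwise the inserted <br /> tags would themselves be escaped.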

#7


2  

Not the easiest way, but still straightforward. The main difference from cgi.escape is that the regex version still works properly if your text already contains entities such as &amp;. For comparison, here is the source of cgi.escape, comments included:

cgi.escape version

def escape(s, quote=None):
    '''Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
is also translated.'''
    s = s.replace("&", "&amp;") # Must be done first!
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s

regex version

import re

# Match a special character unless it already starts one of the listed entities.
QUOTE_PATTERN = re.compile(r"""([&<>"'])(?!(amp|lt|gt|quot|#39);)""")

def escape(word):
    """
    Replaces special characters <>&"' with HTML-safe sequences,
    leaving already escaped entities (&amp; &lt; &gt; &quot; &#39;) alone.
    """
    replace_with = {
        '<': '&lt;',
        '>': '&gt;',
        '&': '&amp;',
        '"': '&quot;',  # should be escaped in attributes
        "'": '&#39;',   # should be escaped in attributes
    }
    return QUOTE_PATTERN.sub(lambda m: replace_with[m.group(0)], word)

#8


0  

Via BeautifulSoup4:

>>> from bs4.dammit import EntitySubstitution
>>> esub = EntitySubstitution()
>>> esub.substitute_html("r&d")
'r&amp;d'

#9


0  

No libraries, pure Python: safely escapes text for use as HTML text content. The & must be replaced first, otherwise the entities produced for < and > get escaped a second time:

text.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;'
        ).encode('ascii', 'xmlcharrefreplace')
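Order matters in a chain of replace calls like this: the ampersand has to be handled before anything that introduces new ampersands. The short sketch below contrasts the two orders; both function names are hypothetical and exist only for the comparison.

```python
def escape_text(text):
    """Escape &, < and > with & handled first."""
    return (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
    )


def escape_wrong_order(text):
    """Same replacements with & handled last, which double-escapes."""
    return (
        text.replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace("&", "&amp;")
    )


print(escape_text("a < b & c"))         # a &lt; b &amp; c
print(escape_wrong_order("a < b & c"))  # a &amp;lt; b &amp; c
```

In the second result the browser would display the literal text "&lt;" instead of "<".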
