Processing escape sequences in a string in Python

Time: 2021-01-04 00:14:23

Sometimes when I get input from a file or the user, I get a string with escape sequences in it. I would like to process the escape sequences in the same way that Python processes escape sequences in string literals.

For example, let's say myString is defined as:

>>> myString = "spam\\neggs"
>>> print(myString)
spam\neggs

I want a function (I'll call it process) that does this:

>>> print(process(myString))
spam
eggs

It's important that the function can process all of the escape sequences in Python (listed in a table in the link above).

Does Python have a function to do this?

6 answers

#1


104  

The correct thing to do is to decode the string with the 'string_escape' codec (Python 2) or the 'unicode_escape' codec (Python 3).

>>> myString = "spam\\neggs"
>>> decoded_string = bytes(myString, "utf-8").decode("unicode_escape")  # Python 3
>>> # decoded_string = myString.decode('string_escape')  # Python 2 equivalent
>>> print(decoded_string)
spam
eggs

Don't use the AST or eval. Using the string codecs is much safer.
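
A minimal sketch of the requested process function built on this codec, assuming the input contains only ASCII characters. The name process_ascii is hypothetical; encoding with 'ascii' first makes the function fail loudly on literal non-ASCII characters rather than return garbled text.

```python
def process_ascii(text):
    # Hypothetical helper: str.encode("ascii") raises UnicodeEncodeError on
    # any literal non-ASCII character, so bad input is rejected up front.
    return text.encode("ascii").decode("unicode_escape")


print(process_ascii("spam\\neggs"))
print(process_ascii("na\\xefve"))  # an escaped non-ASCII character is fine
```

Input such as "naïve" raises UnicodeEncodeError here, which signals that a different approach is needed for that string.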

#2


73  

unicode_escape doesn't work in general

It turns out that the string_escape or unicode_escape solution does not work in general -- particularly, it doesn't work in the presence of actual Unicode.

If you can be sure that every non-ASCII character will be escaped (and remember, anything beyond the first 128 characters is non-ASCII), unicode_escape will do the right thing for you. But if there are any literal non-ASCII characters already in your string, things will go wrong.

unicode_escape is fundamentally designed to convert bytes into Unicode text. But in many places -- for example, Python source code -- the source data is already Unicode text.

The only way this can work correctly is if you encode the text into bytes first. UTF-8 is the sensible encoding for all text, so that should work, right?

The following examples are in Python 3, so that the string literals are cleaner, but the same problem exists with slightly different manifestations on both Python 2 and 3.

>>> s = 'naïve \\t test'
>>> print(s.encode('utf-8').decode('unicode_escape'))
naÃ¯ve   test

Well, that's wrong.

The new recommended way to use codecs that decode text into text is to call codecs.decode directly. Does that help?

>>> import codecs
>>> print(codecs.decode(s, 'unicode_escape'))
naÃ¯ve   test

Not at all. (Also, the above is a UnicodeError on Python 2.)

The unicode_escape codec, despite its name, turns out to assume that all non-ASCII bytes are in the Latin-1 (ISO-8859-1) encoding. So you would have to do it like this:

>>> print(s.encode('latin-1').decode('unicode_escape'))
naïve    test
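
The Latin-1 assumption can be checked directly. This short sketch (variable names are illustrative only) decodes the UTF-8 bytes of a single character both ways and compares the results:

```python
utf8_bytes = "ï".encode("utf-8")  # two bytes: 0xC3 0xAF

# unicode_escape maps each non-escape byte straight to the code point
# with the same number, which is exactly what Latin-1 decoding does.
via_unicode_escape = utf8_bytes.decode("unicode_escape")
via_latin1 = utf8_bytes.decode("latin-1")

print(via_unicode_escape == via_latin1)  # both yield the same two characters
print(len(via_unicode_escape))           # one character became two
```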

But that's terrible. This limits you to the 256 Latin-1 characters, as if Unicode had never been invented at all!

>>> print('Ernő \\t Rubik'.encode('latin-1').decode('unicode_escape'))
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0151'
in position 3: ordinal not in range(256)

Adding a regular expression to solve the problem

(Surprisingly, we do not now have two problems.)

What we need to do is apply the unicode_escape decoder only to things that we are certain are ASCII text. In particular, we can make sure to apply it only to valid Python escape sequences, which are guaranteed to be ASCII text.

The plan is, we'll find escape sequences using a regular expression, and use a function as the argument to re.sub to replace them with their unescaped value.

import re
import codecs

ESCAPE_SEQUENCE_RE = re.compile(r'''
    ( \\U........      # 8-digit hex escapes
    | \\u....          # 4-digit hex escapes
    | \\x..            # 2-digit hex escapes
    | \\[0-7]{1,3}     # Octal escapes
    | \\N\{[^}]+\}     # Unicode characters by name
    | \\[\\'"abfnrtv]  # Single-character escapes
    )''', re.UNICODE | re.VERBOSE)

def decode_escapes(s):
    def decode_match(match):
        return codecs.decode(match.group(0), 'unicode-escape')

    return ESCAPE_SEQUENCE_RE.sub(decode_match, s)

And with that:

>>> print(decode_escapes('Ernő \\t Rubik'))
Ernő     Rubik
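
The pattern above uses `.` for the digits of hex escapes, so a malformed sequence such as \xzz would still match and be handed to the codec, which rejects it. One possible variation, offered as a sketch rather than as part of the original answer, restricts those positions to hex digits so that malformed sequences are left untouched. The names STRICT_ESCAPE_RE and decode_escapes_strict are hypothetical.

```python
import codecs
import re

# Hypothetical stricter variant: hex escapes must contain real hex digits.
STRICT_ESCAPE_RE = re.compile(r'''
    ( \\U[0-9a-fA-F]{8}    # 8-digit hex escapes
    | \\u[0-9a-fA-F]{4}    # 4-digit hex escapes
    | \\x[0-9a-fA-F]{2}    # 2-digit hex escapes
    | \\[0-7]{1,3}         # octal escapes
    | \\N\{[^}]+\}         # Unicode characters by name
    | \\[\\'"abfnrtv]      # single-character escapes
    )''', re.VERBOSE)


def decode_escapes_strict(text):
    def decode_match(match):
        # Each match is pure ASCII, so unicode_escape decodes it safely.
        return codecs.decode(match.group(0), 'unicode_escape')

    return STRICT_ESCAPE_RE.sub(decode_match, text)


print(decode_escapes_strict('Ern\\u0151 \\t Rubik'))
print(decode_escapes_strict('bad escape: \\xzz'))  # left as typed
```

Whether a malformed escape should pass through silently or raise an error depends on the application; Python's own string literals treat it as a syntax error.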

#3


13  

The actually correct and convenient answer for python 3:

>>> import codecs
>>> myString = "spam\\neggs"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
spam
eggs
>>> myString = "naïve \\t test"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
naïve    test

Details regarding codecs.escape_decode:

  • codecs.escape_decode is a bytes-to-bytes decoder
  • codecs.escape_decode decodes ascii escape sequences, such as: b"\\n" -> b"\n", b"\\xce" -> b"\xce".
  • codecs.escape_decode does not care or need to know about the byte object's encoding, but the encoding of the escaped bytes should match the encoding of the rest of the object.
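
The points above can be illustrated with a short sketch. It relies on codecs.escape_decode returning a tuple of the decoded bytes and the number of input bytes consumed, which is why the earlier examples index the result with [0]. Since the function is undocumented, treat this behavior as observed in CPython rather than guaranteed.

```python
import codecs

# Literal UTF-8 text mixed with an escaped tab and the escaped UTF-8
# bytes of the Greek letter gamma (0xCE 0xB3).
raw = "αβ \\t \\xce\\xb3".encode("utf-8")

decoded_bytes, consumed = codecs.escape_decode(raw)
text = decoded_bytes.decode("utf-8")
print(text)

# Escaped bytes in a different encoding (Latin-1 é is 0xE9) break the
# final UTF-8 decode, because the encodings no longer match.
mismatched, _ = codecs.escape_decode(b"caf\\xe9")
try:
    mismatched.decode("utf-8")
except UnicodeDecodeError as exc:
    print("mismatch:", exc.reason)
```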

Background:

  • @rspeer is correct: unicode_escape is the incorrect solution for python3. This is because unicode_escape decodes escaped bytes, then decodes bytes to unicode string, but receives no information regarding which codec to use for the second operation.
  • @Jerub is correct: avoid the AST or eval.
  • I first discovered codecs.escape_decode from this answer to "how do I .decode('string-escape') in Python3?". As that answer states, that function is currently not documented for python 3.

#4


5  

The ast.literal_eval function comes close, but it will expect the string to be properly quoted first.

Of course Python's interpretation of backslash escapes depends on how the string is quoted ("" vs r"" vs u"", triple quotes, etc) so you may want to wrap the user input in suitable quotes and pass to literal_eval. Wrapping it in quotes will also prevent literal_eval from returning a number, tuple, dictionary, etc.

Things still might get tricky if the user types unquoted quotes of the type you intend to wrap around the string.
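
A minimal sketch of the wrapping approach, assuming double quotes are used as the wrapper. The helper name unescape_via_literal_eval is hypothetical, and the sketch makes no attempt to handle input that itself contains an unescaped double quote, a newline, or a trailing backslash.

```python
import ast


def unescape_via_literal_eval(text):
    # Hypothetical helper: wrap the input in double quotes so that
    # literal_eval parses it as a single string literal.
    return ast.literal_eval('"' + text + '"')


print(unescape_via_literal_eval("spam\\neggs"))
print(unescape_via_literal_eval("naïve \\t test"))

# An unescaped double quote in the input ends the literal early.
try:
    unescape_via_literal_eval('say "hi"')
except SyntaxError:
    print("input containing quotes is rejected")
```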

#5


0  

The code below should work when each \n in the string needs to be displayed as an actual newline.

our_str = 'The String is \\n, \\n and \\n!'
# str.replace swaps every literal backslash-n pair for a real newline;
# it handles only this one escape sequence.
new_str = our_str.replace('\\n', '\n')
print(new_str)

#6


-4  

If you trust the source of the data, just slap quotes around it and eval() it?

>>> myString = 'spam\\neggs'
>>> print(eval('"' + myString.replace('"','') + '"'))
spam
eggs

P.S. Added a countermeasure against malicious code execution: it now strips every " before calling eval.
