Python - Unescaping 4-digit escaped Unicode characters

Date: 2020-11-30 00:27:30

I downloaded the source code of a website. After downloading it and converting it into a string, many of the characters (like single quotes ('), double quotes ("), angle brackets (<, >), and forward slashes (/)) are now double escaped.

Example:

s = '\\u2018this \\/ that\\u2019'

The text as shown on the website, and how I want it to appear when printed, is:

‘this / that’

My first instinct was to use regex to find all instances of 2 backslashes, and replace it with a single backslash, then use str.encode('utf-8').decode('utf-8') to convert the 4 digit escaped Unicode characters into their actual characters:

import re
text = '\\u2018this \\/ that\\u2019'
pattern = r'(\\)\\\1'
double_escapes_removed = re.sub(pattern, '', text)
final_text = text.encode('utf-8').decode('utf-8')

print(final_text) should return this / that, but the returned string appears to be completely unaltered: \u2018this \/ that\u2019.

I tested the pattern on its own with re.findall(pattern, text), and it successfully found the 3 instances of double backslashes. Beyond that, I have no idea what is going wrong.

1 Solution

#1


This turns out to be a bit tricky. A big part of the issue is that although '\\u2018' is six characters, '\u2018' represents a single character, so you can't just replace '\\u' with '\u' and have it work.
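To make the distinction concrete, here is a minimal sketch. The variable names `escaped`, `actual`, and `replaced` are illustrative and do not come from the question.

```python
# Six separate characters: backslash, 'u', '2', '0', '1', '8'
escaped = '\\u2018'
# One character: LEFT SINGLE QUOTATION MARK
actual = '\u2018'

print(len(escaped))  # 6
print(len(actual))   # 1

# The string holds a single backslash, so there is no double backslash
# to collapse, and the replacement leaves it untouched.
replaced = escaped.replace('\\\\', '\\')
print(replaced == escaped)  # True

# A UTF-8 round trip only re-encodes the same six characters;
# it never interprets \uXXXX as a code point.
print(escaped.encode('utf-8').decode('utf-8') == escaped)  # True
```

This is likely why the attempt in the question appears to do nothing: neither the substitution nor the UTF-8 round trip ever parses the escape sequence.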

This gets you most of the way there without having to manually iterate over escapes with regex:

>>> s.encode('ascii').decode('unicode-escape')
<<< '‘this \\/ that’'

Python 3 does output a warning about '\/' being an invalid unicode escape sequence, so you'd probably want to take care of those first.
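One way to take care of the '\/' sequences first is sketched below. It assumes the input is pure ASCII and that '\/' is the only escape Python does not recognize. The helper name `unescape` is hypothetical, not part of any library.

```python
def unescape(s):
    # Replace the escaped slash first, since '\/' is not an escape
    # sequence that the unicode-escape codec recognizes.
    s = s.replace('\\/', '/')
    # Interpret the remaining backslash escapes such as \u2018.
    # encode('ascii') raises UnicodeEncodeError if s already contains
    # non-ASCII characters, so this sketch assumes pure-ASCII input.
    return s.encode('ascii').decode('unicode-escape')


s = '\\u2018this \\/ that\\u2019'
print(unescape(s))  # ‘this / that’
```

Note that a plain `replace` would also rewrite an escaped backslash followed by a slash ('\\\\/' in source form), so input containing that sequence would need more careful handling.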
