Using regular expressions to get rid of hex

Time: 2021-09-22 08:55:24

I'm trying to delete some hex (such as \xc3) from strings of text. I plan to use regular expressions to help get rid of those. Here is my code:

import re
tweet = 'b"[/Very seldom~ will someone enter your life] to question\xc3\xa2\xe2\x82\xac\xc2\xa6"'    
tweet1 = re.sub(r'\\x[a-f0-9]{2}', '', tweet)
print(tweet1)

However, instead of removing the hex sequences, the output shows them as decoded characters. Here is my output:

b"[/Very seldom~ will someone enter your life] to questionââ¬Â¦ "

Does somebody know how I can get rid of those hex strings?... Thanks in advance.

4 solutions

#1

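Try tweet1.encode('ascii', 'ignore').decode('ascii') after applying the regex. (In Python 2, where tweet1 is a byte string, tweet1.decode('ascii', 'ignore') does the same; a Python 3 str has no decode method.)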
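
A minimal Python 3 sketch of this idea, assuming tweet is a str as in the question and that dropping every non-ASCII character is acceptable. A str has no decode method, so the text is encoded first:

```python
# Sketch for Python 3, assuming `tweet` is a str as in the question.
tweet = 'b"[/Very seldom~ will someone enter your life] to question\xc3\xa2\xe2\x82\xac\xc2\xa6"'

# Encoding to ASCII with errors='ignore' silently drops every character
# above U+007F; decoding turns the resulting bytes back into a str.
cleaned = tweet.encode('ascii', 'ignore').decode('ascii')
print(cleaned)  # b"[/Very seldom~ will someone enter your life] to question"
```

Note that this discards all non-ASCII text, including legitimate accented letters, so it only fits data where ASCII is all that should survive.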

#2

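You can simply do this:

import re
tweet = 'b"[/Very seldom~ will someone enter your life] to question\xc3\xa2\xe2\x82\xac\xc2\xa6"'
# Use a str pattern (not bytes): in Python 3 the pattern type must match the input type
tweet1 = re.sub('[\xc3\xa2\xe2\x82\xac\xc2\xa6]', '', tweet)
print(tweet1)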

Output:

b"[/Very seldom~ will someone enter your life] to question"

#3

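You can try something like this:

import re
import string

tweet = 'b"[/Very seldom~ will someone enter your life] to question\xc3\xa2\xe2\x82\xac\xc2\xa6"'
# re.escape keeps characters such as ], \ and ^ literal inside the class;
# re.ASCII restricts \w to ASCII so accented letters are removed as well
pattern = r'[^\w\s{}]'.format(re.escape(string.punctuation))
tweet1 = re.sub(pattern, '', tweet, flags=re.ASCII)
print(tweet1)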

Output:

b"[/Very seldom~ will someone enter your life] to question"

Regex:

[^\w\s{}] - Matches every character that is not a \w, a \s, or a punctuation character (format fills the {} placeholder with string.punctuation).

#4

Actually, the issue is how I modeled the problem. tweet doesn't contain the literal characters \xc3\xa2...; Python interprets those escape sequences when the string literal is declared. So the regex is looking for the literal text \xc3, but what tweet contains at that position is actually the single character Ã.
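
The difference can be seen in a small sketch that compares an ordinary string literal with a raw one. The variable names interpreted and literal are only for illustration:

```python
import re

pattern = r'\\x[a-f0-9]{2}'  # a backslash, 'x', then two lowercase hex digits

# Ordinary literal: Python turns each \xNN escape into one character.
interpreted = 'question\xc3\xa2'
# Raw literal: the backslash, 'x' and hex digits stay as separate characters.
literal = r'question\xc3\xa2'

print(len(interpreted), len(literal))            # 10 16
print(re.sub(pattern, '', interpreted) == interpreted)  # True: nothing to match
print(re.sub(pattern, '', literal))              # question
```

Only the raw version contains backslashes for the pattern to find, which is why the regex in the question leaves tweet unchanged.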

The solution is to encode the text as UTF-8, convert the resulting bytes to a string, and only then use the regex to remove the hex escapes. I found the lead in this post (see the first answer, by Martijn Pieters): python regex: how to remove hex dec characters from string
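
One way those steps might look in Python 3 is sketched below. The input text is a hypothetical example (a str ending in U+2026, an ellipsis), not the original tweet data, and the linked answer may do this differently:

```python
import re

# Hypothetical input: a str containing a non-ASCII character (U+2026).
text = 'to question\u2026'

# str() of a bytes object returns its repr, in which every non-ASCII byte
# appears as a literal backslash-x escape: b'to question\xe2\x80\xa6'
as_repr = str(text.encode('utf-8'))

# Now the pattern from the question has literal escapes to match.
cleaned = re.sub(r'\\x[a-f0-9]{2}', '', as_repr)
print(cleaned)  # b'to question'
```

The result still carries the b'...' wrapper from the repr, so it would need to be stripped if only the text itself is wanted.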
