I'm trying to delete some hex sequences (such as \xc3) from strings of text. I plan to use regular expressions to get rid of them. Here is my code:
import re
tweet = 'b"[/Very seldom~ will someone enter your life] to question\xc3\xa2\xe2\x82\xac\xc2\xa6"'
tweet1 = re.sub(r'\\x[a-f0-9]{2}', '', tweet)
print(tweet1)
However, instead of being deleted, the hex sequences show up in the output as the characters they encode. Here is my output:
b"[/Very seldom~ will someone enter your life] to questionââ¬Â¦ "
Does anybody know how I can get rid of those hex sequences? Thanks in advance.
4 solutions
#1
0
Try tweet1.decode('ascii', 'ignore') after applying the regex.
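In Python 3 a str has no decode method, so the suggestion above cannot be applied literally. A minimal sketch of what seems to be the same idea, assuming Python 3 and assuming the goal is simply to drop every non-ASCII character (the name cleaned is made up for illustration):

```python
# Python 3 sketch: drop non-ASCII characters by round-tripping through ASCII.
tweet = 'b"[/Very seldom~ will someone enter your life] to question\xc3\xa2\xe2\x82\xac\xc2\xa6"'

# errors='ignore' silently discards every character ASCII cannot represent;
# decoding the resulting bytes gives back an ordinary str.
cleaned = tweet.encode('ascii', 'ignore').decode('ascii')
print(cleaned)
# b"[/Very seldom~ will someone enter your life] to question"
```

No regex is needed in this variant, but it also removes legitimate accented letters, which may or may not be acceptable for tweets.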
#2
0
You can simply do
import re
tweet = 'b"[/Very seldom~ will someone enter your life] to question\xc3\xa2\xe2\x82\xac\xc2\xa6"'
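# tweet is a str, so the pattern must be a str too (a bytes pattern raises TypeError in Python 3)
tweet1 = re.sub('[\xc3\xa2\xe2\x82\xac\xc2\xa6]', '', tweet)
print(tweet1)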
Output:
b"[/Very seldom~ will someone enter your life] to question"
#3
0
You can try something like this:
import re
import string
tweet = 'b"[/Very seldom~ will someone enter your life] to question\xc3\xa2\xe2\x82\xac\xc2\xa6"'
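# re.escape keeps characters such as ] and \ from breaking the character class;
# re.ASCII makes \w match only ASCII letters, digits and underscore, so characters like Ã are removed as well
tweet1 = re.sub(r'[^\w\s{}]'.format(re.escape(string.punctuation)), '', tweet, flags=re.ASCII)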
print(tweet1)
Output:
b"[/Very seldom~ will someone enter your life] to question"
Regex: [^\w\s{}] matches everything that is not a \w, a \s, or a punctuation character (the {} placeholder is filled in with string.punctuation).
#4
0
Actually, the issue is how I modeled the problem. tweet doesn't contain the literal characters \xc3\xa2...; Python resolves those escape sequences when the string literal is declared. So the regex is looking for the literal text \xc3 (a backslash followed by xc3), but what tweet contains in that position is actually the single character Ã.
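A short check makes the difference visible. This is only an illustration with a shortened sample string; the variable names tweet and raw are chosen here for the example:

```python
import re

# Regular literal: \xc3 and \xa2 are resolved when the literal is parsed.
tweet = 'question\xc3\xa2'
print(len(tweet))                            # 10 (8 letters + 2 characters)
print('\\' in tweet)                         # False, there is no backslash
print(re.search(r'\\x[a-f0-9]{2}', tweet))   # None, nothing for the regex to find

# Raw literal: the backslashes stay in the string as ordinary characters.
raw = r'question\xc3\xa2'
print(len(raw))                              # 16 (8 letters + 2 * 4 characters)
print(re.sub(r'\\x[a-f0-9]{2}', '', raw))    # question
```

The original pattern only has something to match in the second case, where the backslash is really part of the text.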
The solution is to encode the text as UTF-8, convert the resulting bytes to a string (so that the non-ASCII bytes appear as literal \x.. escapes), and only then use the regex to get rid of the hex. I got the lead from this post (see the first answer, by Martijn Pieters): python regex: how to remove hex dec characters from string
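One way to read those steps, as a rough sketch assuming Python 3 (the names as_text, cleaned and inner are made up here, and this is not necessarily the exact code from the linked post):

```python
import re

tweet = 'b"[/Very seldom~ will someone enter your life] to question\xc3\xa2\xe2\x82\xac\xc2\xa6"'

# Step 1: encode to UTF-8 bytes.
# Step 2: str() of a bytes object gives its repr, in which every non-ASCII
# byte is spelled out as a literal backslash, 'x' and two lowercase hex digits.
as_text = str(tweet.encode('utf-8'))

# Step 3: now the original regex has literal \x.. sequences to match.
cleaned = re.sub(r'\\x[a-f0-9]{2}', '', as_text)
print(cleaned)
# b'b"[/Very seldom~ will someone enter your life] to question"'

# The repr adds its own b'...' wrapper, which can be sliced off.
inner = cleaned[2:-1]
print(inner)
# b"[/Very seldom~ will someone enter your life] to question"
```

Note that the repr also doubles any backslash that was already in the text, so this approach suits input like the tweet above better than arbitrary strings.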