Removing non-UTF-8 characters from a string in Python

I am attempting to read in tweets and write them to a file. However, I get a UnicodeEncodeError when I try to write some of these tweets. Is there a way to remove the non-UTF-8 characters so I can write out the rest of the tweet?

For example, a problem tweet may look like this:

Camera? ????

This is the code I am using:

with open("Tweets.txt",'w') as f:
    for user_tws in twitter.get_user_timeline(screen_name='camera',
                                          count = 200):
        try:
            f.write(user_tws["text"] + '\n')
        except UnicodeEncodeError:
            print("skipped: " + user_tws["text"])
            mod_tw = user_tws["text"]
            mod_tw=mod_tw.encode('utf-8','replace').decode('utf-8')
            print(mod_tw)
            f.write(mod_tw)

The error is this:

UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f3a5' in position 56: character maps to <undefined>

1 Answer

#1

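You are not writing a UTF-8 encoded file. Without an explicit encoding, open() uses the platform's default encoding, which is why the error mentions the 'charmap' codec rather than UTF-8. Add the encoding parameter to the open function: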

with open("Tweets.txt",'w', encoding='utf8') as f:
    ...
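
As a rough, self-contained sketch of the fix (the Twitter call is replaced by a hard-coded list, and the temporary path and the strip_unencodable helper are made up for illustration), writing with an explicit UTF-8 encoding keeps the emoji intact. If you really do need to drop characters for a narrower codec, encoding with errors="ignore" is one way to do it; cp1252 is only an assumed example of such a codec.

```python
import os
import tempfile


def strip_unencodable(text, codec="cp1252"):
    """Hypothetical helper: drop characters that `codec` cannot represent."""
    return text.encode(codec, errors="ignore").decode(codec)


# Stand-in for the tweet texts; the first contains U+1F3A5 from the error message.
tweets = ["Camera? \U0001f3a5", "plain text"]

# Illustrative temporary location instead of "Tweets.txt" in the working directory.
path = os.path.join(tempfile.mkdtemp(), "Tweets.txt")

# With encoding="utf-8" every Unicode character can be written, so nothing is skipped.
with open(path, "w", encoding="utf-8") as f:
    for text in tweets:
        f.write(text + "\n")

# Read the file back with the same encoding to confirm nothing was lost.
with open(path, encoding="utf-8") as f:
    round_trip = f.read().splitlines()

# Only needed when the output must stay in a codec that lacks the character.
stripped = strip_unencodable(tweets[0])
```

Note that the characters in the question are valid UTF-8. The failure comes from the file's default encoding, so stripping them is usually unnecessary once the encoding is set.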

Have fun!
