Convert UTF-8 with BOM to UTF-8 without BOM in Python

Date: 2022-06-14 20:12:05

Two questions here. I have a set of files which are usually UTF-8 with BOM. I'd like to convert them (ideally in place) to UTF-8 with no BOM. It seems like codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors) would handle this. But I don't really see any good examples on usage. Would this be the best way to handle this?

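For reference, here is a minimal, untested sketch of how codecs.StreamRecoder might be wired up for this (the codec choices are my assumption): the backend Reader decodes the file as utf-8-sig, which strips the BOM, and the frontend encoder re-encodes to plain UTF-8, so read() yields BOM-free UTF-8 bytes.

import codecs

src = open('brh-m-157.json', 'rb')
recoder = codecs.StreamRecoder(src,
    codecs.getencoder('utf-8'),      # frontend: encodes what read() returns
    codecs.getdecoder('utf-8'),      # frontend: decodes what write() receives
    codecs.getreader('utf-8-sig'),   # backend: decodes the file, dropping the BOM
    codecs.getwriter('utf-8'))       # backend: encodes anything written back
data = recoder.read()                # UTF-8 bytes without the BOM
src.close()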

source files:
Tue Jan 17$ file brh-m-157.json 
brh-m-157.json: UTF-8 Unicode (with BOM) text

Also, it would be ideal if we could handle different input encodings without knowing them explicitly (we've seen ASCII and UTF-16). It seems like this should all be feasible. Is there a solution that can take any known Python encoding and output UTF-8 without a BOM?


edit 1: proposed solution from below (thanks!)


fp = open('brh-m-157.json','rw')
s = fp.read()
u = s.decode('utf-8-sig')
s = u.encode('utf-8')
print fp.encoding  
fp.write(s)

This gives me the following error:


IOError: [Errno 9] Bad file descriptor

Newsflash

I'm being told in comments that the mistake is that I opened the file with mode 'rw' instead of 'r+'/'r+b', so I should eventually re-edit my question and remove the solved part.

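For reference, the corrected snippet would presumably look like this (the seek and truncate calls are my addition; they are needed because the write position sits at end-of-file after read(), and the BOM-free content is three bytes shorter than the original):

fp = open('brh-m-157.json', 'r+b')
s = fp.read()
u = s.decode('utf-8-sig')
s = u.encode('utf-8')
fp.seek(0)       # rewind before overwriting
fp.write(s)
fp.truncate()    # the file shrank by the 3 BOM bytes
fp.close()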

6 Answers

#1 (score 71)

Simply use the "utf-8-sig" codec:


fp = open("file.txt")        # Python 2: read() returns a byte string
s = fp.read()
u = s.decode("utf-8-sig")    # decoding with utf-8-sig strips the BOM

That gives you a unicode string without the BOM. You can then use


s = u.encode("utf-8")

to get a normal UTF-8 encoded string back in s. If your files are big, then you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:


import os, sys, codecs

BUFSIZE = 4096
BOMLEN = len(codecs.BOM_UTF8)

path = sys.argv[1]
with open(path, "r+b") as fp:
    chunk = fp.read(BUFSIZE)
    if chunk.startswith(codecs.BOM_UTF8):
        i = 0
        chunk = chunk[BOMLEN:]
        while chunk:
            fp.seek(i)                    # jump back to the write position
            fp.write(chunk)               # write the chunk BOMLEN bytes earlier
            i += len(chunk)
            fp.seek(BOMLEN, os.SEEK_CUR)  # skip forward to the next unread byte
            chunk = fp.read(BUFSIZE)
        fp.seek(-BOMLEN, os.SEEK_CUR)
        fp.truncate()                     # drop the BOMLEN leftover bytes at the end

It opens the file, reads a chunk, and writes it back out three bytes earlier than where it was read. The file is rewritten in place. An easier solution is to write the shorter content to a new file, as in newtover's answer (a sketch follows below). That would be simpler, but it briefly uses twice the disk space.

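A minimal sketch of that new-file alternative, assuming a temporary file next to the original is acceptable:

import codecs, os, sys

path = sys.argv[1]
tmp = path + '.tmp'                          # temporary name is an assumption
with open(path, 'rb') as src, open(tmp, 'wb') as dst:
    chunk = src.read(4096)
    if chunk.startswith(codecs.BOM_UTF8):
        chunk = chunk[len(codecs.BOM_UTF8):] # drop the BOM from the first chunk
    while chunk:
        dst.write(chunk)
        chunk = src.read(4096)
os.rename(tmp, path)                         # replace the original in one step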

As for guessing the encoding, you can loop over candidate encodings from most to least specific:


def decode(s):
    for encoding in "utf-8-sig", "utf-16":
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s.decode("latin-1") # will always work

A UTF-16 encoded file won't decode as UTF-8, so we try UTF-8 first. If that fails, we try UTF-16. Finally, we fall back to Latin-1; this always works, since all 256 byte values are legal in Latin-1. You may want to return None instead in that case, since it is really a last resort and your code might want to handle it more carefully (if it can).

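Putting the pieces together, a hypothetical end-to-end use of that decode() helper on one of the question's files (Python 2):

raw = open('brh-m-157.json', 'rb').read()
text = decode(raw)                # unicode, BOM stripped if present
open('brh-m-157.json', 'wb').write(text.encode('utf-8'))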

#2 (score 15)

In Python 3 it's quite easy: read the file and rewrite it with utf-8 encoding:


s = open(bom_file, mode='r', encoding='utf-8-sig').read()
open(bom_file, mode='w', encoding='utf-8').write(s)
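Since the question mentions a set of files, presumably this extends to a simple loop (the glob pattern is an assumption):

import glob

for bom_file in glob.glob('*.json'):
    s = open(bom_file, mode='r', encoding='utf-8-sig').read()
    open(bom_file, mode='w', encoding='utf-8').write(s)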

#3 (score 5)

import codecs
import shutil
import sys

# Python 2: sys.stdin/sys.stdout are byte streams.
# On Python 3, use sys.stdin.buffer and sys.stdout.buffer instead.
s = sys.stdin.read(3)          # read the first three bytes
if s != codecs.BOM_UTF8:       # not a BOM, so pass them through
    sys.stdout.write(s)

shutil.copyfileobj(sys.stdin, sys.stdout)   # copy the rest unchanged
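This is a stdin-to-stdout filter, so presumably it would be invoked along these lines (the script name is made up for illustration):

python strip_bom.py < with_bom.txt > without_bom.txt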

#4 (score 4)

This is my implementation to convert any kind of encoding to UTF-8 without a BOM, replacing Windows line endings with the universal format:


import codecs
import os

import chardet  # third-party; detects the input encoding


def utf8_converter(file_path, universal_endline=True):
    '''
    Convert any type of file to UTF-8 without BOM
    and using universal endline by default. (Python 2 style:
    read() returns a byte string here.)

    Parameters
    ----------
    file_path : string, file path.
    universal_endline : boolean (True),
                        by default convert endlines to universal format.
    '''

    # Fix file path
    file_path = os.path.realpath(os.path.expanduser(file_path))

    # Read from file (binary, so newlines reach the replace() below intact)
    file_open = open(file_path, 'rb')
    raw = file_open.read()
    file_open.close()

    # Decode using the encoding guessed by chardet
    raw = raw.decode(chardet.detect(raw)['encoding'])
    # Remove windows end line
    if universal_endline:
        raw = raw.replace('\r\n', '\n')
    # Encode to UTF-8
    raw = raw.encode('utf8')
    # Remove BOM (a u'\ufeff' that survived decoding re-encodes to these bytes)
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw.replace(codecs.BOM_UTF8, '', 1)

    # Write to file
    file_open = open(file_path, 'wb')
    file_open.write(raw)
    file_open.close()
    return 0
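A hypothetical in-place call on one of the question's files:

utf8_converter('brh-m-157.json')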

#5 (score 0)

You can use codecs.


import codecs

fp = open("test.txt", "rb")          # binary read so the BOM bytes compare cleanly
content = fp.read()
fp.close()
if content[:3] == codecs.BOM_UTF8:   # leading bytes are the UTF-8 BOM
    content = content[3:]            # strip them
print content.decode("utf-8")        # Python 2 print statement

#6 (score 0)

I found this question because I was having trouble with configparser.ConfigParser().read(fp) when opening files with a UTF-8 BOM header.


For those looking for a way to handle the header so that ConfigParser can open the config file instead of reporting the error "File contains no section headers", open the file like the following:


        configparser.ConfigParser().read(config_file_path, encoding="utf-8-sig")

This could save you a lot of effort by making it unnecessary to strip the BOM header from the file.


(I know this sounds unrelated, but hopefully this could help people struggling like me.)
