I'm reading a text file:
f = open('data.txt')
data = f.read()
However, the newlines in the data variable are normalized to LF ('\n'), while the file contains CRLF ('\r\n').
How can I instruct Python to read the file as is?
5 solutions
#1
13
In Python 2.x:
f = open('data.txt', 'rb')
As the docs say:
The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don't treat binary and text files differently, where it serves as documentation.)
In Python 3.x, there are three alternatives:
f1 = open('data.txt', 'rb')
This will leave newlines untransformed, but will also return bytes instead of str, which you will have to explicitly decode to Unicode yourself. (Of course the 2.x version also returned bytes that had to be decoded manually if you wanted Unicode, but in 2.x that's what a str object is; in 3.x str is Unicode.)
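As a rough illustration of the binary-mode route in Python 3, the sketch below reads the raw bytes and decodes them by hand. The helper name read_raw_text and the UTF-8 default are assumptions made for this example, not part of the original answer; substitute the file's real encoding if you know it.

```python
import os
import tempfile


def read_raw_text(path, encoding="utf-8"):
    """Read a file in binary mode and decode it explicitly (hypothetical helper)."""
    with open(path, "rb") as f:
        raw = f.read()  # bytes; binary mode performs no newline translation
    return raw.decode(encoding)  # the caller chooses the encoding


# Build a small CRLF file so the example is self-contained.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"first\r\nsecond\r\n")

text = read_raw_text(demo_path)
os.remove(demo_path)
print(repr(text))
```

Because decoding happens after the read, the '\r\n' pairs survive into the resulting str.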
f2 = open('data.txt', 'r', newline='')
This will return str, and leave newlines untranslated. Unlike the 2.x equivalent, however, readline and friends will treat '\r\n' as a newline, instead of a regular character followed by a newline. Usually this won't matter, but if it does, keep it in mind.
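To make the difference concrete, here is a minimal Python 3 sketch that reads the same CRLF file once with the defaults and once with newline=''. The temporary file and the explicit ASCII encoding are assumptions chosen only to keep the example deterministic.

```python
import os
import tempfile

# Create a throwaway file that definitely contains CRLF line endings.
fd, crlf_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"a\r\nb\r\n")

# Default text mode: universal newlines, so CRLF is translated to LF.
with open(crlf_path, "r", encoding="ascii") as f:
    translated = f.read()

# newline='': text mode, still returns str, but line endings are left alone.
with open(crlf_path, "r", encoding="ascii", newline="") as f:
    first_line = f.readline()  # '\r\n' is treated as one line terminator
    rest = f.read()

os.remove(crlf_path)
print(repr(translated), repr(first_line), repr(rest))
```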
f3 = open('data.txt', 'rb', encoding=locale.getpreferredencoding(False))
This treats newlines exactly the same way as the 2.x code, and returns str using the same encoding you'd get if you just used all of the defaults… but it's no longer valid in current 3.x.
When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.
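The quoted rule can be checked with a file that mixes all three line endings. This is only a sketch for Python 3; the sample data and variable names are made up for the illustration.

```python
import os
import tempfile

# One file containing LF, CR, and CRLF terminators, plus an unterminated last line.
fd, mixed_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"one\ntwo\rthree\r\nfour")

# newline=None (the default): every terminator is translated to '\n'.
with open(mixed_path, "r", encoding="ascii", newline=None) as f:
    translated_lines = f.readlines()

# newline='': lines are split at the same places, but terminators are kept as-is.
with open(mixed_path, "r", encoding="ascii", newline="") as f:
    raw_lines = f.readlines()

os.remove(mixed_path)
print(translated_lines)
print(raw_lines)
```

Both calls split the text into the same four lines; only the terminators handed back to the caller differ.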
The reason you need to specify an explicit encoding for f3 is that opening a file in binary mode means the default changes from "decode with locale.getpreferredencoding(False)" to "don't decode, and return raw bytes instead of str". Again, from the docs:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)
However:
'encoding' … should only be used in text mode.
And, at least as of 3.3, this is enforced; if you try it with binary mode, you get ValueError: binary mode doesn't take an encoding argument.
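If you want to confirm this on your own interpreter, a small probe like the following should do. The function name open_binary_with_encoding is hypothetical, and the sketch assumes Python 3.3 or later, as the answer states.

```python
import os
import tempfile


def open_binary_with_encoding(path):
    """Return the error text raised by combining 'rb' with encoding, or None."""
    try:
        f = open(path, "rb", encoding="utf-8")
    except ValueError as exc:
        return str(exc)
    f.close()  # only reached if the interpreter accepted the combination
    return None


# Use a real (empty) file so the only possible failure is the argument check.
fd, probe_path = tempfile.mkstemp()
os.close(fd)
message = open_binary_with_encoding(probe_path)
os.remove(probe_path)
print(message)
```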
So, if you want to write code that works on both 2.x and 3.x, what do you use? If you want to deal in bytes, obviously f and f1 are the same. But if you want to deal in str, as appropriate for each version, the simplest answer is to write different code for each, probably f and f2, respectively. If this comes up a lot, consider writing a wrapper function:
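import sys

if sys.version_info >= (3, 0):
    def crlf_open(path, mode):
        # Python 3: text mode with newline='' leaves line endings untranslated.
        return open(path, mode, newline='')
else:
    def crlf_open(path, mode):
        # Python 2: binary mode is the way to skip newline translation.
        return open(path, mode + 'b')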
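A possible alternative to branching on the version, not mentioned in the answer above, is io.open: it is available from Python 2.6 onward and is the same function as the built-in open in 3.x, so it accepts newline='' on both. Note that on 2.x it returns unicode rather than the native str, which may or may not be what you want. The helper name crlf_open_text and the UTF-8 default are assumptions for this sketch.

```python
import io
import os
import tempfile


def crlf_open_text(path, encoding="utf-8"):
    """Open a text file for reading without newline translation (hypothetical helper)."""
    # io.open takes the same newline argument on 2.6+ and 3.x.
    return io.open(path, "r", encoding=encoding, newline="")


# Self-contained demonstration with a temporary CRLF file.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"x\r\ny\r\n")

with crlf_open_text(sample_path) as f:
    content = f.read()

os.remove(sample_path)
print(repr(content))
```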
Another thing to watch out for in writing multi-version code is that, if you're not writing locale-aware code, locale.getpreferredencoding(False) almost always returns something reasonable in 3.x, but it will usually just return 'US-ASCII' in 2.x. Using locale.getpreferredencoding(True) is technically incorrect, but may be more likely to be what you actually want if you don't want to think about encodings. (Try calling it both ways in your 2.x and 3.x interpreters to see why, or read the docs.)
Of course if you actually know the file's encoding, that's always better than guessing anyway.
In either case, the 'r' means "read-only". If you don't specify a mode, the default is 'r', so the binary-mode equivalent to the default is 'rb'.
#2
5
You need to open the file in the binary mode:
f = open('data.txt', 'rb')
data = f.read()
('r' for "read", 'b' for "binary")
Then everything is returned as is; nothing is normalized.
#3
4
You can use the codecs module to write 'version-agnostic' code:
Underlying encoded files are always opened in binary mode. No automatic conversion of '\n' is done on reading and writing. The mode argument may be any binary mode acceptable to the built-in open() function; the 'b' is automatically added.
import codecs

with codecs.open('foo', mode='r', encoding='utf8') as f:
    # python2: u'foo\r\n'
    # python3: 'foo\r\n'
    f.readline()
#4
1
Just request "read binary" in the open call:
f = open('data.txt', 'rb')
data = f.read()