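Files and I/O

All programs have to handle input and output. These notes cover text and binary files, file encodings, and operations on filenames and directories.

Reading and Writing Text Data

To read or write text data in different encodings, use open() in text mode: rt for reading, wt for writing.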

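By default these operations use the platform's default text encoding, which open() takes from the locale settings (see locale.getpreferredencoding()) rather than from sys.getdefaultencoding(). On most machines it is utf-8. Pass the encoding argument to open() to choose a specific encoding.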
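
A minimal sketch of passing an explicit encoding, so the result does not depend on the platform default. The file name sample.txt and the temporary directory are assumptions made for illustration only.

```python
import os
import tempfile

# Hypothetical file in a throwaway directory, used only for this demo
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample.txt')

    # Write text with an explicit encoding instead of the platform default
    with open(path, 'wt', encoding='utf-8') as f:
        f.write('caf\u00e9\n')

    # Read it back with the same encoding
    with open(path, 'rt', encoding='utf-8') as f:
        text = f.read()

    # Reading with the wrong encoding does not fail here; it yields different text
    with open(path, 'rt', encoding='latin-1') as f:
        mojibake = f.read()

print(repr(text))
print(repr(mojibake))
```

The second read shows why it is worth stating the encoding on both sides: latin-1 can decode any byte, so the mismatch goes unnoticed until someone looks at the text.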

Printing to a File

Redirect the output of print() to a file.

# Pass the file keyword argument; the file must be opened in text mode
with open('d:/work/test.txt', 'wt') as f:
    print('Hello World!', file=f)

Printing with a Different Separator or Line Ending

Use the sep and end keyword arguments of print().

>>> print('ACME', 50, 91.5)
ACME 50 91.5
>>> print('ACME', 50, 91.5, sep=',')
ACME,50,91.5
>>> print('ACME', 50, 91.5, sep=',', end='!!\n')
ACME,50,91.5!!
>>>

Reading and Writing Binary Data

Use open() with mode rb or wb to read or write binary data.

# Read the entire file as a single byte string
with open('somefile.bin', 'rb') as f:
    data = f.read()

# Write binary data to a file
with open('somefile.bin', 'wb') as f:
    f.write(b'Hello World')

Array objects and C structures can be written directly as byte data.
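
As an illustration, the sketch below writes an array.array without converting it to bytes first, then reads the bytes straight back into another array. io.BytesIO stands in for a file opened in binary mode; that substitution is an assumption made to keep the example self-contained.

```python
import array
import io

nums = array.array('i', [1, 2, 3, 4])

# io.BytesIO stands in for a file opened with mode 'wb'
out = io.BytesIO()
written = out.write(nums)        # the array's raw bytes are written as-is

# Read the bytes straight back into a preallocated array of the same type
restored = array.array('i', [0, 0, 0, 0])
out.seek(0)
count = out.readinto(restored)

print(written, count, restored)
```

The byte layout is platform specific (item size and byte order), so data written this way is best read back on the same kind of machine.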

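Writing Only If the File Does Not Exist

To avoid overwriting an existing file, call open() with mode x instead of w. If the file already exists, a FileExistsError is raised.

For binary files, use xb instead of xt.

The x mode is an extension to open() added in Python 3 (version 3.3); earlier versions do not have it.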
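
A short sketch of the behaviour described above. The file name report.txt and the temporary directory are hypothetical and exist only for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'report.txt')

    # First write succeeds because the file does not exist yet
    with open(path, 'xt') as f:
        f.write('first version\n')

    # Second attempt refuses to touch the existing file
    try:
        with open(path, 'xt') as f:
            f.write('second version\n')
        overwritten = True
    except FileExistsError:
        overwritten = False

    with open(path, 'rt') as f:
        content = f.read()

print(overwritten, repr(content))
```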

I/O Operations on Strings

To feed a text or byte string to code that expects a file-like object, use io.StringIO() and io.BytesIO().

>>> import io
>>> s = io.StringIO()
>>> s.write('Hello World\n')
12
>>> print('This is a test', file=s)
>>> # Get all of the data written so far
>>> s.getvalue()
'Hello World\nThis is a test\n'
>>>

>>> # Wrap a file interface around an existing string
>>> s = io.StringIO('Hello\nWorld\n')
>>> s.read(4)
'Hell'
>>> s.read()
'o\nWorld\n'
>>>

io.StringIO works only with text; use io.BytesIO for binary data.

>>> s = io.BytesIO()
>>> s.write(b'binary data')
11
>>> s.getvalue()
b'binary data'
>>>

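Reading and Writing Compressed Files

To read or write a gzip- or bz2-compressed file, use the gzip and bz2 modules. Both provide an open() function for this.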

# gzip compression
import gzip
with gzip.open('somefile.gz', 'rt') as f:
    text = f.read()

# bz2 compression
import bz2
with bz2.open('somefile.bz2', 'rt') as f:
    text = f.read()

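The open() functions in both modules accept the same encoding, errors, and newline arguments as the built-in open(). When writing, the additional compresslevel argument sets the compression level: the default, 9, gives the highest compression, and lower levels run faster at the cost of larger output.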
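
A minimal sketch of writing with two different compresslevel values and reading the result back. The file names and the sample text are assumptions for illustration; no particular size difference between the two levels is claimed here.

```python
import gzip
import os
import tempfile

text = 'Hello World\n' * 1000

with tempfile.TemporaryDirectory() as tmpdir:
    fast_path = os.path.join(tmpdir, 'fast.gz')
    small_path = os.path.join(tmpdir, 'small.gz')

    # Level 1: fastest, least compression
    with gzip.open(fast_path, 'wt', compresslevel=1, encoding='utf-8') as f:
        f.write(text)

    # Level 9 (the default): slowest, most compression
    with gzip.open(small_path, 'wt', compresslevel=9, encoding='utf-8') as f:
        f.write(text)

    # Reading needs no level; it is only used when compressing
    with gzip.open(small_path, 'rt', encoding='utf-8') as f:
        roundtrip = f.read()

    fast_size = os.path.getsize(fast_path)
    small_size = os.path.getsize(small_path)

print(len(text), fast_size, small_size)
```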

Both open() functions can also be layered on top of a file that is already open in binary mode.

import gzip
f = open('somefile.gz', 'rb')
with gzip.open(f, 'rt') as g:
    text = g.read()

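Iterating over Fixed-Size Records

You want to iterate over a collection of fixed-size records or chunks instead of iterating over a file line by line.

Use iter() together with functools.partial().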

from functools import partial

RECORD_SIZE = 32

with open('somefile.data', 'rb') as f:
    records = iter(partial(f.read, RECORD_SIZE), b'')
    for r in records:
        ...
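
The same idiom can be tried on in-memory data. In this sketch io.BytesIO stands in for the binary file and the record size is shortened to 4, both assumptions for illustration. Note that the last record comes back short when the data length is not a multiple of the record size.

```python
import io
from functools import partial

RECORD_SIZE = 4

# io.BytesIO stands in for a binary file; 10 bytes is not a multiple of 4
f = io.BytesIO(b'AAAABBBBCC')

# iter(callable, sentinel) calls f.read(RECORD_SIZE) until it returns b''
records = list(iter(partial(f.read, RECORD_SIZE), b''))

print(records)
```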

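Reading Binary Data into a Mutable Buffer

You want to read binary data directly into a mutable buffer without any intermediate copying, or modify the data in place and write it back to a file.

Use the readinto() method of file objects.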

import os.path

def read_into_buffer(filename):
    buf = bytearray(os.path.getsize(filename))
    with open(filename, 'rb') as f:
        f.readinto(buf)
    return buf

>>> # Write a sample file
>>> with open('sample.bin', 'wb') as f:
...     f.write(b'Hello World')
...
11
>>> buf = read_into_buffer('sample.bin')
>>> buf
bytearray(b'Hello World')
>>> buf[0:5] = b'Hallo'
>>> buf
bytearray(b'Hallo World')
>>> with open('newsample.bin', 'wb') as f:
...     f.write(buf)
...
11
>>>

f.readinto() returns the number of bytes actually read, so check whether that value is smaller than the size you expected.
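
One way to act on that return value is sketched below: a single buffer is reused for every read, and only the first n bytes are treated as valid. The record size of 8 and the io.BytesIO source are assumptions standing in for a real binary file.

```python
import io

RECORD_SIZE = 8
buf = bytearray(RECORD_SIZE)     # allocated once, reused for every read
view = memoryview(buf)           # lets us slice without copying the buffer

source = io.BytesIO(b'0123456789ABCDEFxyz')   # stand-in for a binary file
chunks = []
while True:
    n = source.readinto(buf)
    if n == 0:                   # end of data
        break
    # Only the first n bytes are fresh; the rest is left over from earlier reads
    chunks.append(bytes(view[:n]))

print(chunks)
```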

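Manipulating Pathnames

Use the functions in the os.path module to manipulate pathnames.

os.path.basename(path) returns the last component of the path.

os.path.dirname(path) returns everything except the last component.

os.path.join(path1, path2, path3) joins the components in order, inserting the platform's path separator between them. This is not plain string concatenation: an absolute component discards everything that came before it.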
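
A small sketch of these three functions working together. The path components are made up for the example.

```python
import os.path

# Hypothetical relative path built from components
path = os.path.join('home', 'user', 'data.csv')

name = os.path.basename(path)      # last component
parent = os.path.dirname(path)     # everything before it
rebuilt = os.path.join(parent, name)

# A later absolute component throws away the earlier ones
reset = os.path.join('tmp', '/etc')

print(name, parent, rebuilt, reset)
```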

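Testing Whether a File Exists

os.path.exists(path) tests whether a file or directory exists.

os.path.isfile(path) tests whether the path is a regular file.

os.path.isdir(path) tests whether the path is a directory.

os.path.islink(path) tests whether the path is a symbolic link.

os.path.realpath(path) returns the real path that a symbolic link points to.

os.path.getsize(path) returns the file size in bytes.

os.path.getmtime(path) returns the file's modification time as seconds since the epoch.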
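
The sketch below runs these checks against a file created just for the purpose. The name notes.txt and the temporary directory are assumptions for illustration.

```python
import os
import tempfile
import time

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'notes.txt')
    with open(path, 'wb') as f:
        f.write(b'Hello World')

    exists = os.path.exists(path)
    is_file = os.path.isfile(path)
    is_dir = os.path.isdir(tmpdir)
    is_link = os.path.islink(path)
    size = os.path.getsize(path)

    # getmtime() returns a timestamp; time.ctime() makes it readable
    modified = time.ctime(os.path.getmtime(path))

    # A path that was never created simply tests False
    missing = os.path.exists(os.path.join(tmpdir, 'no-such-file'))

print(exists, is_file, is_dir, is_link, size, modified, missing)
```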

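Getting a Directory Listing

You want a list of the files in some directory on the file system.

os.listdir(path) returns the names of all entries in the directory, including files, subdirectories, and symbolic links.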

import os.path

# Get all regular files
names = [name for name in os.listdir('somedir')
        if os.path.isfile(os.path.join('somedir', name))]

# Get all dirs
dirnames = [name for name in os.listdir('somedir')
        if os.path.isdir(os.path.join('somedir', name))]

You can also filter the names with the string methods startswith() and endswith().

pyfiles = [name for name in os.listdir('somedir')
            if name.endswith('.py')]

Printing Bad Filenames

If printing a filename fails with a UnicodeEncodeError and the message "surrogates not allowed", handle the error with a helper function.

def bad_filename(filename):
    return repr(filename)[1:-1]

try:
    print(filename)
except UnicodeEncodeError:
    print(bad_filename(filename))
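
To see the helper at work without depending on the terminal's encoding, the sketch below builds a filename containing an undecodable byte and uses str.encode() in place of print() to trigger the same error. The byte string b'b\xe4d.txt' is an assumption chosen for the demonstration.

```python
def bad_filename(filename):
    # repr() escapes the surrogate; [1:-1] strips the surrounding quotes
    return repr(filename)[1:-1]

# A name with an invalid UTF-8 byte, decoded with the surrogateescape
# error handler so the stray byte becomes a lone surrogate
name = b'b\xe4d.txt'.decode('utf-8', 'surrogateescape')

try:
    name.encode('utf-8')           # raises: surrogates not allowed
    printable = name
except UnicodeEncodeError:
    printable = bad_filename(name)

print(printable)
```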

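Adding or Changing the Encoding of an Already Open File

You want to add or change the Unicode encoding of a file that is already open, without closing it first.

To add Unicode encoding and decoding to a file opened in binary mode, wrap it with io.TextIOWrapper().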

import urllib.request
import io

u = urllib.request.urlopen('http://www.python.org')
f = io.TextIOWrapper(u, encoding='utf-8')
text = f.read()
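
For changing the encoding of a file that already has a text layer, one approach is to remove that layer with detach() and wrap the underlying binary object again. The sketch below assumes an io.BytesIO object standing in for the open file, and it swaps the layer before any text has been read.

```python
import io

# Latin-1 encoded bytes; io.BytesIO stands in for an open binary file
raw = io.BytesIO('caf\u00e9\n'.encode('latin-1'))

# The file starts out wrapped with the wrong encoding
f = io.TextIOWrapper(raw, encoding='ascii')

# detach() removes the text layer and returns the underlying binary object;
# the old wrapper must not be used afterwards
binary = f.detach()

# Wrap it again with the encoding we actually want
f = io.TextIOWrapper(binary, encoding='latin-1')
text = f.read()

print(repr(text))
```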

Writing Bytes to a Text File

Write the bytes directly to the file's underlying buffer.

>>> import sys
>>> sys.stdout.write(b'Hello\n')
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
TypeError: must be str, not bytes
>>> sys.stdout.buffer.write(b'Hello\n')
Hello
6
>>>

In the same way, binary data can be read from a text file through its buffer attribute.
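
A sketch of both directions on an ordinary file opened in text mode. The file name mixed.txt is hypothetical. The text layer is flushed before writing to the buffer so that the two writes land in order.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'mixed.txt')

    with open(path, 'w', encoding='utf-8') as f:
        f.write('text line\n')
        f.flush()                        # push pending text down first
        f.buffer.write(b'raw bytes\n')   # bypass the text layer

    with open(path, 'r', encoding='utf-8') as f:
        data = f.buffer.read()           # undecoded bytes from a text file

print(data)
```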

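Wrapping a File Descriptor as a File Object

A file descriptor is not the same thing as a normal open file: it is an integer assigned by the operating system to refer to some I/O channel. To wrap it in a file object, pass the descriptor to open() in place of a filename.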

# Open a low-level file descriptor
import os
fd = os.open('somefile.txt', os.O_WRONLY | os.O_CREAT)

# Turn into a proper file
f = open(fd, 'wt')
f.write('hello world\n')
f.close()

If you do not want the underlying file descriptor closed when the high-level file object is closed, pass the optional argument closefd=False.

# Create a file object, but don't close underlying fd when done
f = open(fd, 'wt', closefd=False)
...
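
The effect of closefd=False can be checked by using the descriptor after the file object has been closed. This is a minimal sketch; the path somefile.txt inside a temporary directory is an assumption for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'somefile.txt')
    fd = os.open(path, os.O_WRONLY | os.O_CREAT)

    f = open(fd, 'wt', closefd=False)
    f.write('hello world\n')
    f.close()                        # flushes the text, leaves fd open

    os.write(fd, b'still open\n')    # the descriptor is still usable
    os.close(fd)                     # closing it is now our job

    with open(path, 'rb') as check:
        content = check.read()

print(content)
```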

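Creating Temporary Files and Directories

You need to create a temporary file or directory while the program runs and want it destroyed automatically when you are done with it.

The tempfile module provides several functions for this.

tempfile.TemporaryFile() creates an unnamed temporary file.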

from tempfile import TemporaryFile

with TemporaryFile('w+t') as f:
    # Read/write to the file
    f.write('Hello World\n')
    f.write('Testing\n')

    # Seek back to beginning and read the data
    f.seek(0)
    data = f.read()

# Temporary file is destroyed

tempfile.NamedTemporaryFile() creates a named temporary file; its name is available as f.name.

tempfile.TemporaryDirectory() creates a temporary directory.
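
A brief sketch of the named variants, showing that both the file and the directory disappear once their with block ends. The file name scratch.txt is hypothetical.

```python
import os
from tempfile import NamedTemporaryFile, TemporaryDirectory

with NamedTemporaryFile('w+t') as f:
    name = f.name                     # a real path on the file system
    f.write('Hello World\n')
    existed = os.path.exists(name)
gone = not os.path.exists(name)       # deleted when the file was closed

with TemporaryDirectory() as dirname:
    path = os.path.join(dirname, 'scratch.txt')
    with open(path, 'w') as out:
        out.write('Testing\n')
    inside = os.path.isdir(dirname)
removed = not os.path.exists(dirname) # directory and contents are gone

print(existed, gone, inside, removed)
```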

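Communicating with Serial Ports

To communicate with hardware devices over a serial port, the pySerial package is the best choice.

Serializing Python Objects

The most common approach is the pickle module.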

import pickle

s = pickle.dumps(data)   # serialize an object to a byte string
data = pickle.loads(s)   # rebuild the object from the byte string
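
A self-contained sketch of the same round trip, plus the file-oriented dump() and load() functions. The sample data is made up, and io.BytesIO stands in for a file opened in binary mode.

```python
import io
import pickle

data = {'name': 'ACME', 'shares': 50, 'price': 91.5}

# Round trip through a byte string
s = pickle.dumps(data)
restored = pickle.loads(s)

# Several objects can be dumped to one file, one after another
f = io.BytesIO()                  # stand-in for open('somefile', 'wb')
pickle.dump([1, 2, 3], f)
pickle.dump('hello', f)
pickle.dump({'Apple', 'Pear'}, f)

# They come back in the same order, one load() per object
f.seek(0)
first = pickle.load(f)
second = pickle.load(f)
third = pickle.load(f)

print(restored, first, second, third)
```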

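Never call pickle.load() on untrusted data. As a side effect of loading, pickle automatically imports the modules it needs and constructs instances from them, so a crafted pickle can make Python run arbitrary code.

Some kinds of objects cannot be pickled, typically those that depend on external system state, such as open files, network connections, threads, processes, and stack frames. A user-defined class can work around this limitation by providing __getstate__() and __setstate__() methods.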

# countdown.py
import time
import threading

class Countdown:
    def __init__(self, n):
        self.n = n
        self.thr = threading.Thread(target=self.run)
        self.thr.daemon = True
        self.thr.start()

    def run(self):
        while self.n > 0:
            print('T-minus', self.n)
            self.n -= 1
            time.sleep(5)

    def __getstate__(self):
        return self.n

    def __setstate__(self, n):
        self.__init__(n)

>>> import countdown
>>> c = countdown.Countdown(30)
>>> T-minus 30
T-minus 29
T-minus 28
...

>>> # After a few moments
>>> f = open('cstate.p', 'wb')
>>> import pickle
>>> pickle.dump(c, f)
>>> f.close()


>>> f = open('cstate.p', 'rb')
>>> pickle.load(f)
<countdown.Countdown object at 0x10069e2d0>
T-minus 19
T-minus 18
...

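Pickled data is tied to the source code that defines it, because it records class and module names rather than the code itself. If the data has to live in a database or an archive, a more standard format such as XML, CSV, or JSON is a better choice.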


Python Cookbook notes: Files and I/O