How to read part of a binary file with numpy?

I'm converting a MATLAB script to numpy, but I'm having problems reading data from a binary file. Is there an equivalent to fseek when using fromfile, so that I can skip the beginning of the file? These are the kinds of extractions I need to do:

fid = fopen(fname);
fseek(fid, 8, 'bof');
second = fread(fid, 1, 'schar');
fseek(fid, 100, 'bof');
total_cycles = fread(fid, 1, 'uint32', 0, 'l');
start_cycle = fread(fid, 1, 'uint32', 0, 'l');

Thanks!

3 Answers

#1

You can use seek with a file object in the normal way, and then use this file object in fromfile. Here's a full example:

import numpy as np
import os

data = np.arange(100, dtype=np.int32)  # 4-byte integers
data.tofile("temp")  # save the data

with open("temp", "rb") as f:  # reopen the file
    f.seek(256, os.SEEK_SET)  # skip the first 256 bytes (64 int32 values)
    x = np.fromfile(f, dtype=np.int32)  # read the rest into numpy

print(x)
# [64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
# 89 90 91 92 93 94 95 96 97 98 99]
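
Applied to the layout in the question, the same seek-then-fromfile pattern might look like the sketch below. It is not from the original answer: the temporary file and the values written into it (-5, 1000, 7) are made up so the example runs on its own, and the dtypes 'i1' and '<u4' are my reading of MATLAB's 'schar' and little-endian 'uint32'.

```python
import os
import struct
import tempfile

import numpy as np

# Build a throwaway file with the layout assumed in the question:
# a signed byte at offset 8, two little-endian uint32 values at offset 100.
# The values themselves are invented for illustration.
raw = bytearray(108)
struct.pack_into("<b", raw, 8, -5)
struct.pack_into("<II", raw, 100, 1000, 7)

fd, fname = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(raw)

with open(fname, "rb") as fid:
    fid.seek(8, os.SEEK_SET)                                  # fseek(fid, 8, 'bof')
    second = np.fromfile(fid, dtype="i1", count=1)[0]         # fread(fid, 1, 'schar')
    fid.seek(100, os.SEEK_SET)                                # fseek(fid, 100, 'bof')
    total_cycles = np.fromfile(fid, dtype="<u4", count=1)[0]  # little-endian uint32
    start_cycle = np.fromfile(fid, dtype="<u4", count=1)[0]   # continues where the last read ended

os.remove(fname)
print(second, total_cycles, start_cycle)
```

NumPy 1.17 and later also accept an offset argument (in bytes, relative to the current file position) in np.fromfile, which can replace the explicit seek when only one region is needed.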

#2

There probably is a better answer… But when I've been faced with this problem, I had a file that I already wanted to access different parts of separately, which gave me an easy solution to this problem.

For example, say chunkyfoo.bin is a file consisting of a 6-byte header, a 1024-byte numpy array, and another 1024-byte numpy array. You can't just open the file and seek 6 bytes (because the first thing numpy.fromfile does is lseek back to 0). But you can just mmap the file and use frombuffer (the replacement for the deprecated binary mode of fromstring) instead:

import mmap
from contextlib import closing

import numpy as np

with open('chunkyfoo.bin', 'rb') as f:
    with closing(mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)) as m:
        # slicing the mmap returns bytes; dtype defaults to float64
        a1 = np.frombuffer(m[6:1030])
        a2 = np.frombuffer(m[1030:])

This sounds like exactly what you want to do. Except, of course, that in real life the offsets and lengths of a1 and a2 probably depend on the header, rather than being fixed constants.

The header is just m[:6], and you can parse that by explicitly pulling it apart, using the struct module, or whatever else you'd do once you read the data. But, if you'd prefer, you can explicitly seek and read from f before constructing m, or after, or even make the same calls on m, and it will work, without affecting a1 and a2.
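
As a rough sketch of the struct approach, assume a hypothetical 6-byte header made of a little-endian uint16 version followed by a uint32 element count. That layout, the temporary file, and the names arr_a, arr_b, version, n and split are all invented for illustration; the real header format is whatever your file defines.

```python
import mmap
import os
import struct
import tempfile

import numpy as np

# Hypothetical layout: 6-byte header (uint16 version + uint32 element count),
# followed by two float64 arrays of 1024 bytes each.
arr_a = np.arange(128, dtype="<f8")
arr_b = np.ones(128, dtype="<f8")

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(struct.pack("<HI", 1, arr_a.size))
    out.write(arr_a.tobytes())
    out.write(arr_b.tobytes())

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as m:
        version, n = struct.unpack("<HI", m[:6])   # parse the header
        split = 6 + n * 8                          # 8 bytes per float64
        # Slicing an mmap returns a bytes copy, so a1 and a2 stay valid
        # after the mapping is closed.
        a1 = np.frombuffer(m[6:split], dtype="<f8")
        a2 = np.frombuffer(m[split:], dtype="<f8")

os.remove(path)
print(version, n, a1[:3], a2[:3])
```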

An alternative, which I've done for a different non-numpy-related project, is to create a wrapper file object, like this:

class SeekedFileWrapper(object):
    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.offset = fileobj.tell()
    def seek(self, offset, whence=0):
        if whence == 0:
            offset += self.offset
        return self.fileobj.seek(offset, whence)
    # ... delegate everything else unchanged

I did the "delegate everything else unchanged" by generating a list of attributes at construction time and using that in __getattr__, but you probably want something less hacky. numpy only relies on a handful of methods of the file-like object, and I think they're properly documented, so just explicitly delegate those. But I think the mmap solution makes more sense here, unless you're trying to mechanically port over a bunch of explicit seek-based code. (You'd think mmap would also give you the option of leaving it as a numpy.memmap instead of a numpy.array, which lets numpy have more control over/feedback from the paging, etc. But it's actually pretty tricky to get a numpy.memmap and an mmap to work together.)
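
For illustration, here is one way the delegation could be filled in with a plain __getattr__ fallback. This is a sketch, not the code from the project mentioned above: the tell method and the relative return value of seek are my additions, and the demo uses io.BytesIO with frombuffer rather than np.fromfile, because whether np.fromfile accepts such a wrapper is not something this example checks.

```python
import io

import numpy as np


class SeekedFileWrapper:
    """Treat the position at construction time as offset 0 of the file."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.offset = fileobj.tell()

    def seek(self, offset, whence=0):
        if whence == 0:  # absolute seeks are shifted past the skipped prefix
            offset += self.offset
        # Addition for this sketch: report the position relative to the prefix
        return self.fileobj.seek(offset, whence) - self.offset

    def tell(self):
        # Addition for this sketch: keeps tell() consistent with seek()
        return self.fileobj.tell() - self.offset

    def __getattr__(self, name):
        # Only called for attributes not defined above (read, close, ...)
        return getattr(self.fileobj, name)


payload = np.arange(4, dtype="<u4")
f = io.BytesIO(b"HEADER" + payload.tobytes())
f.seek(6)                    # skip the 6-byte header
w = SeekedFileWrapper(f)

w.read(4)                    # consume part of the payload
w.seek(0)                    # back to the start of the payload, not the header
values = np.frombuffer(w.read(), dtype="<u4")
print(values)
```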

#3

This is what I do when I have to read arbitrary data from a heterogeneous binary file.
Numpy lets you interpret a bit pattern in an arbitrary way by reinterpreting the array with a different dtype. The MATLAB code in the question reads one signed char and two uint32 values.

Read this paper (easy reading at the user level, not aimed at scientists) on what one can achieve by changing the dtype, strides, and dimensionality of an array.

import numpy as np

data = np.arange(10, dtype=np.int32)  # 4-byte integers
data.tofile('f')

x = np.fromfile('f', dtype='u1')  # read the file as raw bytes
print(x.size)
# 40

second = x[8]
print('second', second)
# second 2

total_cycles = x[8:12]
print('total_cycles', total_cycles)
total_cycles = total_cycles.view('u4')  # reinterpret the same 4 bytes as one uint32
print('total_cycles', total_cycles)
# total_cycles [2 0 0 0]       !endianness
# total_cycles [2]

start_cycle = x[12:16].view('u4')
print('start_cycle', start_cycle)
# start_cycle [3]

x = x.view('u4')  # reinterpret the whole buffer as uint32
print('x', x)
# x [0 1 2 3 4 5 6 7 8 9]

x[3] = 423  # views share memory, so start_cycle changes too
print('start_cycle', start_cycle)
# start_cycle [423]
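
Mapped onto the offsets from the question, the same byte-reinterpretation idea might look like this. It is a sketch under assumptions: the buffer is built in memory with made-up values (-5, 1000, 7) so the example runs on its own, whereas with a real file you would start from np.fromfile(fname, dtype='u1').

```python
import struct

import numpy as np

# Stand-in for the file contents: a signed byte at offset 8 and two
# little-endian uint32 values at offset 100 (values invented for the demo).
raw = bytearray(108)
struct.pack_into("<b", raw, 8, -5)
struct.pack_into("<II", raw, 100, 1000, 7)

x = np.frombuffer(bytes(raw), dtype="u1")   # every byte as an unsigned 8-bit value

second = x[8:9].view("i1")[0]                        # one byte read as signed char
total_cycles, start_cycle = x[100:108].view("<u4")   # 8 bytes read as two uint32

print(second, total_cycles, start_cycle)
```

Spelling out the byte order as '<u4' keeps the result independent of the machine running the script, which matches the 'l' flag in the MATLAB fread calls.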
