How do I read consecutive arrays from a binary file using `np.fromfile`?

Question:

I want to read a binary file in Python whose exact layout is stored in the binary file itself.

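The file contains a sequence of two-dimensional arrays, with the row and column dimensions of each array stored as a pair of integers preceding its contents. I want to read all of the arrays in the file, one after another.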

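I know this can be done with f = open("myfile", "rb") and f.read(numberofbytes), but that is clumsy because I then have to convert the output into meaningful data structures myself. I would like to use numpy's np.fromfile with a custom dtype, but I have not found a way to read part of the file, leave it open, and then continue reading with a different dtype.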
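To make the clumsiness concrete, here is a rough sketch of the f.read route. The header layout (two native 32-bit ints followed by native doubles) is only an assumption for illustration, and read_one_array is a made-up name.

```python
import struct
import numpy as np

def read_one_array(f):
    """Read one (rows, cols) header and the matrix that follows it from an open binary file."""
    # Assumed layout: two native 32-bit ints, then rows * cols native doubles.
    header = f.read(struct.calcsize("ii"))
    rows, cols = struct.unpack("ii", header)
    payload = f.read(rows * cols * 8)
    # frombuffer returns a read-only view of the bytes; copy() makes it writable.
    return np.frombuffer(payload, dtype=np.float64).reshape(rows, cols).copy()
```

Every read needs a manual byte count and a manual conversion step, which is what I would like to avoid.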

I know I could call f.seek(numberofbytes, os.SEEK_SET) and np.fromfile multiple times, but that would mean a lot of unnecessary jumping around in the file.

In short, I want something like MATLAB's fread (or at least like C++'s ifstream read).

What is the best way to do this?

Can you describe the format of the file? It is hard to recommend a specific approach without knowing more about the file itself. –

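It is a raw binary file containing double-precision matrices written by a C++ program, along with integers describing the size of each matrix – jacob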

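Does a single file contain multiple arrays, or is there only one array per file? Are the array sizes given in a header at the start of the file? Can you describe the header? –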

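Answer

You can pass an open file object to np.fromfile, read the dimensions of the first array, then read the array contents (again with np.fromfile), and repeat the process for the remaining arrays in the same file.

For example: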

import os
import numpy as np

def iter_arrays(fname, array_ndim=2, dim_dtype=np.int_, array_dtype=np.double):
    # np.int_ replaces the np.int alias, which was removed in NumPy 1.24;
    # pass dim_dtype=np.int32 if the file stores its dimensions as 32-bit ints
    with open(fname, 'rb') as f:
        fsize = os.fstat(f.fileno()).st_size

        # while we haven't yet reached the end of the file...
        while f.tell() < fsize:

            # get the dimensions for this array
            dims = np.fromfile(f, dim_dtype, array_ndim)

            # get the array contents
            yield np.fromfile(f, array_dtype, np.prod(dims)).reshape(dims)

Example usage:

# write some random arrays to an example binary file 
x = np.random.randn(100, 200) 
y = np.random.randn(300, 400) 

with open('/tmp/testbin', 'wb') as f: 
    np.array(x.shape).tofile(f) 
    x.tofile(f) 
    np.array(y.shape).tofile(f) 
    y.tofile(f) 

# read the contents back 
x1, y1 = iter_arrays('/tmp/testbin') 

# check that they match the input arrays 
assert np.allclose(x, x1) and np.allclose(y, y1)
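Since the question asks about a custom dtype, and the real file comes from a C++ program, the size header could also be described with a structured dtype that pins down the integer width and byte order explicitly. The following is only a sketch: the little-endian 32-bit layout, the field names rows and cols, and the helper name read_next are all assumptions, not something the question confirms.

```python
import numpy as np

# Assumed header layout: two little-endian 32-bit ints (hypothetical field names).
header_dtype = np.dtype([('rows', '<i4'), ('cols', '<i4')])

def read_next(f, array_dtype='<f8'):
    """Read one header record and the matrix that follows it from an open file."""
    header = np.fromfile(f, dtype=header_dtype, count=1)
    if header.size == 0:
        return None  # no complete header left to read
    rows, cols = int(header['rows'][0]), int(header['cols'][0])
    # read exactly rows * cols values and restore the 2-D shape
    return np.fromfile(f, dtype=array_dtype, count=rows * cols).reshape(rows, cols)
```

Calling read_next repeatedly on the same open file walks through the arrays in order, because each np.fromfile call continues from the current file position.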

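If the arrays are large, you might consider using np.memmap with its offset= parameter in place of np.fromfile, so that you get the array contents as memory maps rather than loading them into RAM.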
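A minimal sketch of that memory-mapped variant might look like the following. It assumes the same layout as the example file above (two integers followed by the array data) and a NumPy version recent enough for np.fromfile to accept offset= (1.17 or later); iter_memmaps is a hypothetical name.

```python
import os
import numpy as np

def iter_memmaps(fname, dim_dtype=np.int_, array_dtype=np.double):
    """Yield each 2-D array in the file as a read-only np.memmap."""
    dim_dtype = np.dtype(dim_dtype)
    array_dtype = np.dtype(array_dtype)
    fsize = os.path.getsize(fname)
    offset = 0
    while offset < fsize:
        # read the two dimensions stored at the current byte offset
        dims = np.fromfile(fname, dim_dtype, count=2, offset=offset)
        if dims.size < 2:
            break  # truncated header, stop iterating
        offset += dims.nbytes
        shape = tuple(int(d) for d in dims)
        # map the array contents instead of loading them into RAM
        yield np.memmap(fname, dtype=array_dtype, mode='r', offset=offset, shape=shape)
        offset += int(np.prod(shape)) * array_dtype.itemsize
```

Here the byte offset is tracked by hand, since np.memmap takes a position in the file rather than continuing from an open file object.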
