Fastest way to read a file into memory in C++?

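Question:

I am trying to read data from a file more quickly. What I currently do is shown below, but it is very slow for large files. Is there a faster way to do this? I need to store the values in a struct, which I define below.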

std::vector<matEntry> matEntries; 
inputfileA.open(matrixAfilename.c_str()); 

// Read from file to continue setting up sparse matrix A 
while (!inputfileA.eof()) { 
    // Read row, column, and value into vector 
    inputfileA >> (int) row; // row 
    inputfileA >> (int) col; // col 
    inputfileA >> val;  // value 

    // Add row, column, and value entry to the matrix 
    matEntries.push_back(matEntry()); 
    matEntries[index].row = row-1; 
    matEntries[index].col = col-1; 
    matEntries[index].val = val; 

    // Increment index 
    index++; 
}

My struct:

struct matEntry { 
    int row; 
    int col; 
    float val; 
};

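The file is formatted as follows (int, int, float):

More information:

I know the number of lines in the file at runtime.
I am hitting a bottleneck. The profiler says the while() loop is the bottleneck.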

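**Are you actually facing a bottleneck?** That said, [`mmap`](https://en.wikipedia.org/wiki/Mmap) may be an alternative; it is usually considered the "fastest" way, depending on how you consume the contents of the file. – Qix

Which part does your profiler say is slow? The file reading? `vector` reallocation? `fstream` operations? – genpfault

@genpfault, **the while loop is the slow part** – Veridian

Answer

To make things easier, I would define an input stream operator for your struct.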

std::istream& operator>>(std::istream& is, matEntry& e) 
{ 
    is >> e.row >> e.col >> e.val; 
    e.row -= 1; 
    e.col -= 1; 

    return is; 
}

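As for speed, there is not much room for improvement without going down to a very basic level of file IO. I think the only thing you can do is size your vector up front so that it does not keep resizing inside the loop. With the input stream operator defined, the code also looks cleaner: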

std::vector<matEntry> matEntries; 
matEntries.resize(numberOfLines); 
inputfileA.open(matrixAfilename.c_str()); 

// Read from file to continue setting up sparse matrix A 
// index starts at 0 and advances only after a successful read
while (index < numberOfLines && (inputfileA >> matEntries[index]))
{
    ++index;
}

By defining a custom operator like this, you can use [`std::istream_iterator`](http://en.cppreference.com/w/cpp/iterator/istream_iterator) and [`std::back_inserter`](http://en.cppreference.com/w/cpp/iterator/back_inserter) with [`std::copy()`](http://en.cppreference.com/w/cpp/algorithm/copy), for example: `std::vector<matEntry> matEntries; matEntries.reserve(numberOfLines); std::copy(std::istream_iterator<matEntry>(inputfileA), std::istream_iterator<matEntry>(), std::back_inserter(matEntries));` –

Or [`std::copy_n()`](http://en.cppreference.com/w/cpp/algorithm/copy_n) in C++11 and later, for example: `std::vector<matEntry> matEntries; matEntries.reserve(numberOfLines); std::copy_n(std::istream_iterator<matEntry>(inputfileA), numberOfLines, std::back_inserter(matEntries));` –

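Answer

As the comments suggest, you should profile your code before trying to optimize it. If you want to try random things until the performance is good enough, you could try reading the file into memory first. Here is a simple example with some basic profiling written in: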

#include <vector> 
#include <ctime> 
#include <fstream> 
#include <sstream> 
#include <iostream> 

// Assuming something like this... 
struct matEntry 
{ 
    int row, col; 
    double val; 
}; 

std::istream& operator>>(std::istream& is, matEntry& e)
{
    is >> e.row >> e.col >> e.val;
    // Convert the 1-based indices in the file to 0-based indices
    e.row -= 1;
    e.col -= 1;
    return is;
}


std::vector<matEntry> ReadMatrices(std::istream& stream) 
{ 
    auto matEntries = std::vector<matEntry>(); 

    auto e = matEntry(); 
    // For why this is better than your EOF test, see https://isocpp.org/wiki/faq/input-output#istream-and-while 
    while(stream >> e) { 
        matEntries.push_back(e);
    } 
    return matEntries; 
} 

int main() 
{ 
    const auto time0 = std::clock(); 

    // Read file a piece at a time 
    std::ifstream inputFileA("matFileA.txt"); 
    const auto matA = ReadMatrices(inputFileA); 

    const auto time1 = std::clock(); 

    // Read file into memory (from http://stackoverflow.com/a/2602258/201787) 
    std::ifstream inputFileB("matFileB.txt"); 
    std::stringstream buffer; 
    buffer << inputFileB.rdbuf(); 
    const auto matB = ReadMatrices(buffer); 

    const auto time2 = std::clock(); 
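    // std::clock() returns ticks; divide by CLOCKS_PER_SEC to get seconds
    std::cout << "A: " << (static_cast<double>(time1 - time0) / CLOCKS_PER_SEC) << " B: " << (static_cast<double>(time2 - time1) / CLOCKS_PER_SEC) << "\n";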
    std::cout << matA.size() << " " << matB.size(); 
}

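Beware of reading the same file from disk twice in a row, because the disk cache may hide performance differences.

Other options include:

Preallocating space (perhaps by adding the size to the file format, or by estimating it from the file size or something similar)
Changing your file format to binary, or perhaps compressing the data, to minimize read time
Memory-mapped files
Parallelization (easy: process file A and file B in separate threads [see std::async()]; medium: pipeline it so that reading and conversion are done on different threads; hard: process the same file in separate threads)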

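Other higher-level considerations might include:

It looks like you have a 4-D array of data (rows/cols of 2D matrices). In many applications, this is a mistake. Take a moment to reconsider whether this data structure is really what you need.
There are many high-quality matrix libraries available (e.g., Boost.QVM, Blaze, etc.). Use them rather than reinventing the wheel.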

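I was going to give similar advice: loading the whole file into a stringstream (assuming there is enough memory) and then parsing it may be faster, because it avoids having a spinning hard disk seek within the file again. –

I/O usually has two slow parts: the disk read itself, and the translation from text to binary. The disk read does not show up in the profile because it happens in the operating system, and the text-to-binary conversion is the same for both methods presented in this answer. –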

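Answer

In my experience, the slowest part of this kind of code is parsing the numeric values (especially the floating-point ones). Your code is therefore most likely CPU-bound and can be sped up through parallelization, as follows:

Assuming your data is on N lines and you will process it with k threads, each thread has to handle about N/k lines.

mmap() the file.
Scan the whole file for newline characters and determine the range to assign to each thread.
Have each thread process its range in parallel, using an implementation of an std::istream that wraps an in-memory buffer.

Note that this requires making sure the code that populates the data structure is thread-safe.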
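The steps above can be sketched roughly as follows. This is a simplified variant under stated assumptions, not the exact method described: the file contents are assumed to already be in a `std::string` (an mmap()ed region could be used in the same way if it is NUL-terminated), chunks are split by byte offset and then snapped to the next newline rather than by counting lines, and `std::strtol`/`std::strtof` are used instead of an in-memory `std::istream`. The names `parseRange` and `parseParallel` are hypothetical. Each thread fills its own vector, which sidesteps the thread-safety concern; the results are concatenated at the end. Every line is assumed to be well formed.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <thread>
#include <vector>

struct matEntry {
    int row;
    int col;
    float val;
};

// Parse every "row col val" line that starts inside [begin, end).
// The buffer must be NUL-terminated at or after `end` so strtol/strtof stop.
static std::vector<matEntry> parseRange(const char* begin, const char* end)
{
    std::vector<matEntry> out;
    const char* p = begin;
    while (true) {
        // Skip whitespace by hand so parsing never runs into the next chunk.
        while (p < end && std::isspace(static_cast<unsigned char>(*p))) ++p;
        if (p >= end) break;
        char* next = nullptr;
        const long row = std::strtol(p, &next, 10);
        if (next == p) break;  // not a number: stop parsing this chunk
        p = next;
        const long col = std::strtol(p, &next, 10);
        if (next == p) break;
        p = next;
        const float val = std::strtof(p, &next);
        if (next == p) break;
        p = next;
        // Convert the 1-based indices in the file to 0-based indices.
        out.push_back({static_cast<int>(row) - 1, static_cast<int>(col) - 1, val});
    }
    return out;
}

std::vector<matEntry> parseParallel(const std::string& text, unsigned threadCount)
{
    if (threadCount == 0) threadCount = 1;
    const char* const base = text.c_str();
    const std::size_t size = text.size();

    // Chunk boundaries: an even byte split, moved forward to just past a newline.
    std::vector<std::size_t> bounds;
    bounds.push_back(0);
    for (unsigned i = 1; i < threadCount; ++i) {
        std::size_t pos = size / threadCount * i;
        if (pos < bounds.back()) pos = bounds.back();
        const std::size_t nl = text.find('\n', pos);
        bounds.push_back(nl == std::string::npos ? size : nl + 1);
    }
    bounds.push_back(size);

    // Each worker writes only to its own slot, so no locking is needed.
    std::vector<std::vector<matEntry>> parts(threadCount);
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < threadCount; ++i) {
        workers.emplace_back([&parts, &bounds, base, i] {
            parts[i] = parseRange(base + bounds[i], base + bounds[i + 1]);
        });
    }
    for (auto& w : workers) w.join();

    // Concatenate in chunk order so the original line order is preserved.
    std::vector<matEntry> all;
    for (const auto& part : parts) all.insert(all.end(), part.begin(), part.end());
    return all;
}
```

Whether this pays off depends on the file size and the number of cores, so it is worth measuring against the single-threaded version before adopting it.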
