最快的方法来检查一个stl容器中是否已经存在一个值

问题描述：

我持有一个非常大的内存地址列表（大约在400.000），并且需要检查某个地址是否已经存在于每秒400.000次。最快的方法来检查一个stl容器中是否已经存在一个值

A码的例子来说明我的设置：

std::set<uintptr_t> existingAddresses; // this one contains 400.000 entries 

while (true) { 
    // a new list with possible new addresses 
    std::set<uintptr_t> newAddresses; // also contains about ~400.000 entries 

    // in my own code, these represent a new address list 
    for (auto newAddress : newAddresses) { 

     // already processed this address, skip it 
     if (existingAddresses.find(newAddress) != existingAddresses.end()) { 
      continue; 
     } 

     // we didn't have this address yet, so process it. 
     SomeHeavyTask(newAddress); 

     // so we don't process it again 
     existingAddresses.emplace(newAddress); 
    } 

    Sleep(1000); 
}

这是我想出了第一个实现，我认为它可以大大提高。

接下来我想出了一些自定义索引策略，也用于数据库。这个想法是获取价值的一部分，并用它来在它自己的集合集中索引它。如果我将采取例如地址的最后两个数字我会16^2 = 256组把地址在

所以我最终会得到一个地图是这样的：

[FF] -> all address ending with `FF` 
[EF] -> all addresses ending with `EF` 
[00] -> all addresses ending with `00` 
// etc...

有了这个，我只会需要对相应集合中的〜360条目进行查找。导致〜360查询每秒完成400.000次。好多了！

我想知道是否有任何其他技巧或更好的方法来做到这一点？我的目标是尽可能将此地址查找为FAST。

也许[unordered_set]（http://en.cppreference.com/w/cpp/container/unordered_set）可以为你做这项工作吗？ – slawekwin

@slawekwin这确实不会比基于索引的查找更快。另外我认为它会比'set'慢，因为'set'是有序的，所以它可以跳过50％的值，这将导致更快的查找。 –

@SteffenBrem：'std :: unordered_set'是基于哈希的，因此它（理论上）具有〜'O（1）'查找，这比'set'的'O（log n）'查找更好。在实践中，相对于真正的'O（1）'索引，比如'std :: vector'，它的开销会有小幅增加，但它应该非常接近'O（1）'。 – ShadowRanger

答

std::set<uintptr_t>使用平衡树，因此查找时间为O(log N)。

std::unordered_set<uintptr_t>另一方面是基于散列的，查找时间为O(1)。

虽然这只是一个asymptotic complexity措施，这意味着由于涉及的常数因素没有保证的改进，但当集合包含400,000个元素时，差异可能证明是显着的。

好吧，据我了解，为此实现自定义索引策略并不明智，但只需使用'unordered_set'即可。一个问题，但。使用'find'是在无序集上进行查找的最快方法吗？ –

@SteffenBrem是的，'existingAddresses.find（newAddress）！= existingAddresses.end（）'检查是查找的最快方法。总的来说，'std :: unordered_set'应该是你的文章中程序中'std :: set'的一个替代插件。 – dasblinkenlight

答

您可以使用类似的算法来合并：现在

std::set<uintptr_t> existingAddresses; // this one contains 400.000 entries 

while (true) { 
    // a new list with possible new addresses 
    std::set<uintptr_t> newAddresses; // also contains about ~400.000 entries 
    auto existing_it = existingAddresses.begin(); 
    auto new_it = newAddresses.begin(); 

    while (new_it != newAddresses.end() && existing_it != existingAddresses.end()) { 
     if (*new_it < *existing_it) { 
      // we didn't have this address yet, so process it. 
      SomeHeavyTask(*new_it); 
      // so we don't process it again 
      existingAddresses.insert(existing_it, *new_it); 
      ++new_it; 
     } else if (*existing_it < *new_it) { 
      ++existing_it; 
     } else { // Both equal 
      ++existing_it; 
      ++new_it; 
     } 
    } 
    for (new_it != newAddresses.end()) 
     // we didn't have this address yet, so process it. 
     SomeHeavyTask(*new_it); 
     // so we don't process it again 
     existingAddresses.insert(existingAddresses.end(), *new_it); 
     ++new_it; 
    } 
    Sleep(1000); 
}

复杂性是线性方面：O(N + M)代替O(N log M)（与N一些新的地址，并M计数旧的）。

如果两个集合都是有序的，则合并算法是最快的方法，因为您已经使用数据的排序结构。 – xMRi

最快的方法来检查一个stl容器中是否已经存在一个值

相关推荐