MongoDB Aggregation with $sample very slow

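Question:

There are many ways to select a random document from a MongoDB collection (as described in this answer). Comments point out that with MongoDB versions >= 3.2, using $sample in the aggregation framework is preferred. However, on a collection with many small documents, this seems to be extremely slow.

The following code uses mongoengine to reproduce the issue and compares it to the "skip random" method: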

import timeit 
from random import randint 

import mongoengine as mdb 

mdb.connect("test-agg") 


class ACollection(mdb.Document): 
    name = mdb.StringField(unique=True) 

    meta = {'indexes': ['name']} 


ACollection.drop_collection() 

ACollection.objects.insert([ACollection(name="Document {}".format(n)) for n in range(50000)]) 


def agg(): 
    doc = list(ACollection.objects.aggregate({"$sample": {'size': 1}}))[0] 
    print(doc['name']) 

def skip_random(): 
    n = ACollection.objects.count() 
    # skip() takes a 0-based offset, so valid values are 0..n-1
    doc = ACollection.objects.skip(randint(0, n - 1)).limit(1)[0]
    print(doc['name']) 


if __name__ == '__main__': 
    print("agg took {:2.2f}s".format(timeit.timeit(agg, number=1))) 
    print("skip_random took {:2.2f}s".format(timeit.timeit(skip_random, number=1)))

The result is:

Document 44551 
agg took 21.89s 
Document 25800 
skip_random took 0.01s

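Whenever I have had performance problems with MongoDB in the past, my answer has always been to use the aggregation framework, so I was surprised that $sample is this slow.

Am I missing something here? What is it about this example that causes the aggregation to take so long?

What MongoDB version are you running? I found `$sample` was slow in 3.2.5, but it is basically instantaneous in 3.2.7. – JohnnyHK

Ah, 3.2.0 - that would be it. Yes, [this](https://jira.mongodb.org/browse/SERVER-21887?jql=text%20~%20%22%24sample%22) indicates that it is a known bug. –

Right, but I'm not sure why it was still slow in 3.2.5, since the bug is marked as fixed in 3.2.3. – JohnnyHK

Answer

This is the result of a known bug in the WiredTiger engine in MongoDB versions < 3.2.3. Upgrading to the latest version should resolve the problem.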
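Before upgrading, it can help to confirm that the server really is older than the fixed release. The sketch below is only a first diagnostic, assuming the version string has the usual `major.minor.patch` form. The helper names `parse_version` and `sample_is_affected` are hypothetical and not part of MongoDB or mongoengine. Note that one commenter still saw slow sampling on 3.2.5, so a version at or above 3.2.3 does not guarantee good performance.

```python
# Release that the answer above names as containing the fix.
FIXED_IN = (3, 2, 3)


def parse_version(version):
    """Turn a string such as '3.2.0' or '3.2.7-rc1' into a tuple of ints."""
    core = version.split("-", 1)[0]  # drop any pre-release suffix
    return tuple(int(part) for part in core.split(".")[:3])


def sample_is_affected(version):
    """Return True when the server version predates the 3.2.3 fix.

    Assumes a 3.2+ server; $sample does not exist before 3.2.
    """
    return parse_version(version) < FIXED_IN


# With a live server, the string could be read through pymongo, e.g.:
#   from pymongo import MongoClient
#   version = MongoClient().server_info()["version"]
for version in ("3.2.0", "3.2.3", "3.2.7"):
    print(version, sample_is_affected(version))
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as "3.2.10" sorting before "3.2.3".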
