查询的索引表中的Postgres

问题描述:

我有46230行(1个DOC,4623个句子和10块每一个句子)的表太慢:查询的索引表中的Postgres

create table chunks(
doc varchar, 
sentenceid int, 
chunkid int, 
chunk bit(10) 
); 

随着我想比较其他句子块查询

SELECT 
     a.sentenceid, 
     b.sentenceid, a.chunkid, 
     Length(Replace(Cast(a.chunk & b.chunk AS TEXT), '0', ''))::float/Length(a.chunk)::float 
FROM chunks2 a 
INNER JOIN chunks2 b 
     ON a.sentenceid < b.sentenceid and a.chunkid = b.chunkid; 

我跑了解释上没有索引的表,综合指数和两个分别索引分析:

没有索引:

有ID相同的句子块使用

Hash Join (cost=1335.17..4549476.28 rows=71249559 width=26) (actual time=144.376..1156178.110 rows=106837530 loops=1) 
    Hash Cond: (a.chunkid = b.chunkid) 
    Join Filter: (a.sentenceid < b.sentenceid) 
    Rows Removed by Join Filter: 106883760 
    -> Seq Scan on chunks2 a (cost=0.00..757.30 rows=46230 width=15) (actual time=0.039..77.275 rows=46230 loops=1) 
    -> Hash (cost=757.30..757.30 rows=46230 width=15) (actual time=142.954..142.954 rows=46230 loops=1) 
     Buckets: 65536 Batches: 1 Memory Usage: 2680kB 
     -> Seq Scan on chunks2 b (cost=0.00..757.30 rows=46230 width=15) (actual time=0.031..64.340 rows=46230 loops=1) 
Planning time: 1.209 ms 
Execution time: 1212779.012 ms 

我知道他们有相同的操作,并没有索引:

Hash Join (cost=1335.17..4549476.28 rows=71249559 width=26) (actual time=140.532..1160629.611 rows=106837530 loops=1) 
    Hash Cond: (a.chunkid = b.chunkid) 
    Join Filter: (a.sentenceid < b.sentenceid) 
    Rows Removed by Join Filter: 106883760 
    -> Seq Scan on chunks2 a (cost=0.00..757.30 rows=46230 width=15) (actual time=0.043..76.936 rows=46230 loops=1) 
    -> Hash (cost=757.30..757.30 rows=46230 width=15) (actual time=140.056..140.056 rows=46230 loops=1) 
     Buckets: 65536 Batches: 1 Memory Usage: 2680kB 
     -> Seq Scan on chunks2 b (cost=0.00..757.30 rows=46230 width=15) (actual time=0.032..65.781 rows=46230 loops=1) 
Planning time: 0.518 ms 
Execution time: 1217920.271 ms 

指数(sentenceid)&指数(chunkind):

Hash Join (cost=1335.17..4549476.28 rows=71249559 width=26) (actual time=143.719..1155138.691 rows=106837530 loops=1) 
    Hash Cond: (a.chunkid = b.chunkid) 
    Join Filter: (a.sentenceid < b.sentenceid) 
    Rows Removed by Join Filter: 106883760 
    -> Seq Scan on chunks2 a (cost=0.00..757.30 rows=46230 width=15) (actual time=0.038..74.031 rows=46230 loops=1) 
    -> Hash (cost=757.30..757.30 rows=46230 width=15) (actual time=142.160..142.160 rows=46230 loops=1) 
     Buckets: 65536 Batches: 1 Memory Usage: 2680kB 
     -> Seq Scan on chunks2 b (cost=0.00..757.30 rows=46230 width=15) (actual time=0.031..63.628 rows=46230 loops=1) 
Planning time: 1.664 ms 
Execution time: 1213844.696 ms 

指数(sentenceid,块ID)。我的错误在哪里,以及如何加快索引查询?或者在我的情况下如何有效地使用索引?

+0

这种情况的目的是什么:'a.sentenceid

+0

当比较句子时:1与2,1到3相比,相同的结果将是3与1或2相比,与1相比减少比较的数量a.sentenceid Masyaf

+0

你可以尝试一个包含'chunk'('chunkid','sentenceid','chunk')的复合索引,但我不认为PostgreSQL支持索引检索,所以我怀疑这会有所帮助。 –

您没有包含逐字创建索引语句,它们会非常有帮助。

第一件事:一般的规则是index for equality first, then for ranges.

所以,如果你要查询块ID等于和sentenceid少,那么你应该创建索引:

create index chunks2_chunkid_idx on chunks2 (chunkid, sentenceid); 

二:要加入整个表本身。这永远不会是一个便宜的操作,postgres会计算出来并跳过索引使用。如果您的查询只涉及表格的一小部分,索引就可以使用。

我猜你试图找到类似的句子,但我认为你正在接近这不是最好的方式。

+0

其实我做过。我写了'我对未索引表,复合索引和两个索引分别进行了解释分析。我做了'在chunk2(chunkid,sentenceid)'和'(sentenceid,chunkid)'上分别创建索引chunks2_idx。但(chunkid,sentenceid)的时间比(sentenecid,chunkid)差, – Masyaf