查询的索引表中的Postgres

问题描述：

我有46230行（1个DOC，4623个句子和10块每一个句子）的表太慢：查询的索引表中的Postgres

create table chunks(
doc varchar, 
sentenceid int, 
chunkid int, 
chunk bit(10) 
);

随着我想比较其他句子块查询

SELECT 
     a.sentenceid, 
     b.sentenceid, a.chunkid, 
     Length(Replace(Cast(a.chunk & b.chunk AS TEXT), '0', ''))::float/Length(a.chunk)::float 
FROM chunks2 a 
INNER JOIN chunks2 b 
     ON a.sentenceid < b.sentenceid and a.chunkid = b.chunkid;

我跑了解释上没有索引的表，综合指数和两个分别索引分析：

没有索引：

有ID相同的句子块使用

Hash Join (cost=1335.17..4549476.28 rows=71249559 width=26) (actual time=144.376..1156178.110 rows=106837530 loops=1) 
    Hash Cond: (a.chunkid = b.chunkid) 
    Join Filter: (a.sentenceid < b.sentenceid) 
    Rows Removed by Join Filter: 106883760 
    -> Seq Scan on chunks2 a (cost=0.00..757.30 rows=46230 width=15) (actual time=0.039..77.275 rows=46230 loops=1) 
    -> Hash (cost=757.30..757.30 rows=46230 width=15) (actual time=142.954..142.954 rows=46230 loops=1) 
     Buckets: 65536 Batches: 1 Memory Usage: 2680kB 
     -> Seq Scan on chunks2 b (cost=0.00..757.30 rows=46230 width=15) (actual time=0.031..64.340 rows=46230 loops=1) 
Planning time: 1.209 ms 
Execution time: 1212779.012 ms

我知道他们有相同的操作，并没有索引：

Hash Join (cost=1335.17..4549476.28 rows=71249559 width=26) (actual time=140.532..1160629.611 rows=106837530 loops=1) 
    Hash Cond: (a.chunkid = b.chunkid) 
    Join Filter: (a.sentenceid < b.sentenceid) 
    Rows Removed by Join Filter: 106883760 
    -> Seq Scan on chunks2 a (cost=0.00..757.30 rows=46230 width=15) (actual time=0.043..76.936 rows=46230 loops=1) 
    -> Hash (cost=757.30..757.30 rows=46230 width=15) (actual time=140.056..140.056 rows=46230 loops=1) 
     Buckets: 65536 Batches: 1 Memory Usage: 2680kB 
     -> Seq Scan on chunks2 b (cost=0.00..757.30 rows=46230 width=15) (actual time=0.032..65.781 rows=46230 loops=1) 
Planning time: 0.518 ms 
Execution time: 1217920.271 ms

指数（sentenceid）&指数（chunkind）：

Hash Join (cost=1335.17..4549476.28 rows=71249559 width=26) (actual time=143.719..1155138.691 rows=106837530 loops=1) 
    Hash Cond: (a.chunkid = b.chunkid) 
    Join Filter: (a.sentenceid < b.sentenceid) 
    Rows Removed by Join Filter: 106883760 
    -> Seq Scan on chunks2 a (cost=0.00..757.30 rows=46230 width=15) (actual time=0.038..74.031 rows=46230 loops=1) 
    -> Hash (cost=757.30..757.30 rows=46230 width=15) (actual time=142.160..142.160 rows=46230 loops=1) 
     Buckets: 65536 Batches: 1 Memory Usage: 2680kB 
     -> Seq Scan on chunks2 b (cost=0.00..757.30 rows=46230 width=15) (actual time=0.031..63.628 rows=46230 loops=1) 
Planning time: 1.664 ms 
Execution time: 1213844.696 ms

指数（sentenceid，块ID）。我的错误在哪里，以及如何加快索引查询？或者在我的情况下如何有效地使用索引？

这种情况的目的是什么：'a.sentenceid

当比较句子时：1与2,1到3相比，相同的结果将是3与1或2相比，与1相比减少比较的数量a.sentenceid Masyaf

你可以尝试一个包含'chunk'（'chunkid'，'sentenceid'，'chunk'）的复合索引，但我不认为PostgreSQL支持索引检索，所以我怀疑这会有所帮助。 –

答

您没有包含逐字创建索引语句，它们会非常有帮助。

第一件事：一般的规则是index for equality first, then for ranges.

所以，如果你要查询块ID等于和sentenceid少，那么你应该创建索引：

create index chunks2_chunkid_idx on chunks2 (chunkid, sentenceid);

二：要加入整个表本身。这永远不会是一个便宜的操作，postgres会计算出来并跳过索引使用。如果您的查询只涉及表格的一小部分，索引就可以使用。

我猜你试图找到类似的句子，但我认为你正在接近这不是最好的方式。

其实我做过。我写了'我对未索引表，复合索引和两个索引分别进行了解释分析。我做了'在chunk2（chunkid，sentenceid）'和'（sentenceid，chunkid）'上分别创建索引chunks2_idx。但（chunkid，sentenceid）的时间比（sentenecid，chunkid）差， – Masyaf

查询的索引表中的Postgres

相关推荐