查询的索引表中的Postgres
我有46230行(1个DOC,4623个句子和10块每一个句子)的表太慢:查询的索引表中的Postgres
create table chunks(
doc varchar,
sentenceid int,
chunkid int,
chunk bit(10)
);
随着我想比较其他句子块查询
SELECT
a.sentenceid,
b.sentenceid, a.chunkid,
Length(Replace(Cast(a.chunk & b.chunk AS TEXT), '0', ''))::float/Length(a.chunk)::float
FROM chunks2 a
INNER JOIN chunks2 b
ON a.sentenceid < b.sentenceid and a.chunkid = b.chunkid;
我跑了解释上没有索引的表,综合指数和两个分别索引分析:
没有索引:
有ID相同的句子块使用Hash Join (cost=1335.17..4549476.28 rows=71249559 width=26) (actual time=144.376..1156178.110 rows=106837530 loops=1)
Hash Cond: (a.chunkid = b.chunkid)
Join Filter: (a.sentenceid < b.sentenceid)
Rows Removed by Join Filter: 106883760
-> Seq Scan on chunks2 a (cost=0.00..757.30 rows=46230 width=15) (actual time=0.039..77.275 rows=46230 loops=1)
-> Hash (cost=757.30..757.30 rows=46230 width=15) (actual time=142.954..142.954 rows=46230 loops=1)
Buckets: 65536 Batches: 1 Memory Usage: 2680kB
-> Seq Scan on chunks2 b (cost=0.00..757.30 rows=46230 width=15) (actual time=0.031..64.340 rows=46230 loops=1)
Planning time: 1.209 ms
Execution time: 1212779.012 ms
我知道他们有相同的操作,并没有索引:
Hash Join (cost=1335.17..4549476.28 rows=71249559 width=26) (actual time=140.532..1160629.611 rows=106837530 loops=1)
Hash Cond: (a.chunkid = b.chunkid)
Join Filter: (a.sentenceid < b.sentenceid)
Rows Removed by Join Filter: 106883760
-> Seq Scan on chunks2 a (cost=0.00..757.30 rows=46230 width=15) (actual time=0.043..76.936 rows=46230 loops=1)
-> Hash (cost=757.30..757.30 rows=46230 width=15) (actual time=140.056..140.056 rows=46230 loops=1)
Buckets: 65536 Batches: 1 Memory Usage: 2680kB
-> Seq Scan on chunks2 b (cost=0.00..757.30 rows=46230 width=15) (actual time=0.032..65.781 rows=46230 loops=1)
Planning time: 0.518 ms
Execution time: 1217920.271 ms
指数(sentenceid)&指数(chunkind):
Hash Join (cost=1335.17..4549476.28 rows=71249559 width=26) (actual time=143.719..1155138.691 rows=106837530 loops=1)
Hash Cond: (a.chunkid = b.chunkid)
Join Filter: (a.sentenceid < b.sentenceid)
Rows Removed by Join Filter: 106883760
-> Seq Scan on chunks2 a (cost=0.00..757.30 rows=46230 width=15) (actual time=0.038..74.031 rows=46230 loops=1)
-> Hash (cost=757.30..757.30 rows=46230 width=15) (actual time=142.160..142.160 rows=46230 loops=1)
Buckets: 65536 Batches: 1 Memory Usage: 2680kB
-> Seq Scan on chunks2 b (cost=0.00..757.30 rows=46230 width=15) (actual time=0.031..63.628 rows=46230 loops=1)
Planning time: 1.664 ms
Execution time: 1213844.696 ms
指数(sentenceid,块ID)。我的错误在哪里,以及如何加快索引查询?或者在我的情况下如何有效地使用索引?
您没有包含逐字创建索引语句,它们会非常有帮助。
第一件事:一般的规则是index for equality first, then for ranges.
所以,如果你要查询块ID等于和sentenceid少,那么你应该创建索引:
create index chunks2_chunkid_idx on chunks2 (chunkid, sentenceid);
二:要加入整个表本身。这永远不会是一个便宜的操作,postgres会计算出来并跳过索引使用。如果您的查询只涉及表格的一小部分,索引就可以使用。
我猜你试图找到类似的句子,但我认为你正在接近这不是最好的方式。
其实我做过。我写了'我对未索引表,复合索引和两个索引分别进行了解释分析。我做了'在chunk2(chunkid,sentenceid)'和'(sentenceid,chunkid)'上分别创建索引chunks2_idx。但(chunkid,sentenceid)的时间比(sentenecid,chunkid)差, – Masyaf
这种情况的目的是什么:'a.sentenceid
当比较句子时:1与2,1到3相比,相同的结果将是3与1或2相比,与1相比减少比较的数量a.sentenceid
Masyaf
你可以尝试一个包含'chunk'('chunkid','sentenceid','chunk')的复合索引,但我不认为PostgreSQL支持索引检索,所以我怀疑这会有所帮助。 –