This is my query:
explain analyze SELECT levenshtein('google', lower(s."Name"), 2, 2, 1), d."Domain"
FROM analyst_sld s, analyst_domain d
WHERE levenshtein('google', lower(s."Name"), 2, 2, 1) < 4 AND s.id = d."SLDk_id"
ORDER BY 1;
This is the output:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=5340874.17..5383497.72 rows=17049420 width=46) (actual time=136245.943..138709.585 rows=1022346 loops=1)
   Sort Key: (levenshtein('google'::text, lower((s."Name")::text), 2, 2, 1))
   Sort Method: external sort  Disk: 78656kB
   ->  Hash Join  (cost=122111.24..1195078.39 rows=17049420 width=46) (actual time=16730.865..133020.419 rows=1022346 loops=1)
         Hash Cond: (d."SLDk_id" = s.id)
         ->  Seq Scan on analyst_domain d  (cost=0.00..417631.20 rows=17049420 width=38) (actual time=0.036..64677.170 rows=17041042 loops=1)
         ->  Hash  (cost=103151.93..103151.93 rows=1090665 width=16) (actual time=16730.443..16730.443 rows=1071 loops=1)
               ->  Seq Scan on analyst_sld s  (cost=0.00..103151.93 rows=1090665 width=16) (actual time=14.742..16726.358 rows=1071 loops=1)
                     Filter: (levenshtein('google'::text, lower(("Name")::text), 2, 2, 1) < 4)
 Total runtime: 139557.853 ms
Why does it use a sequential scan instead of an index? Also, what do "Hash Join" and "Hash Cond" mean?
EDIT_1: Indexes:
Table "public.analyst_domain"
 Column  |          Type          |                          Modifiers
---------+------------------------+-------------------------------------------------------------
 ID      | integer                | not null default nextval('analyst_domain_id_seq'::regclass)
 Domain  | character varying(255) | not null
 SLDk_id | integer                |
Indexes:
    "analyst_domain_pkey" PRIMARY KEY, btree ("ID")
    "analyst_domain_Domain_key" UNIQUE, btree ("Domain")
    "analyst_domain_sldk" btree ("SLDk_id")
Table "public.analyst_sld"
 Column |          Type          |                        Modifiers
--------+------------------------+----------------------------------------------------------
 id     | integer                | not null default nextval('analyst_sld_id_seq'::regclass)
 Name   | character varying(255) | not null
Indexes:
    "analyst_sld_pkey" PRIMARY KEY, btree (id)
    "analyst_sld_Name_key" UNIQUE, btree ("Name") CLUSTER
    "analyst_sld_upper_idx" btree (upper("Name"::text))
3 Solutions
#1
2
It uses a sequential scan on analyst_sld because, with no index on that expression, evaluating levenshtein for every row is the only way to apply the filter. If you think the filter is selective enough to matter, you can create an index on the expression:
CREATE INDEX lev_index on
analyst_sld (levenshtein('google', lower("Name"), 2, 2, 1));
As far as the hash goes: Postgres has decided that the best way to join your tables is a hash join. It builds a hash table on the join column of one input (here the filtered analyst_sld rows) and then, for each row of the other input, looks for entries with an equal hash, comparing the actual values in case a bucket holds more than one entry. "Hash Cond" is the join condition used for that lookup. How many rows do your tables have, and how large do you expect the join to be?
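To make the build and probe steps concrete, here is a minimal Python sketch of the idea. It is a toy illustration, not how Postgres implements the join: the function name `hash_join` and the sample rows are made up, only the column names come from the question. A Python dict plays the role of the hash table and takes care of hash collisions by itself.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Toy in-memory hash join: yields (build_row, probe_row) pairs."""
    # Build phase: hash the smaller input on its join column.
    buckets = defaultdict(list)
    for row in build_rows:
        buckets[row[build_key]].append(row)
    # Probe phase: one pass over the larger input, one lookup per row.
    for row in probe_rows:
        for match in buckets.get(row[probe_key], ()):
            yield match, row


# Made-up sample rows; the column names follow the question's tables.
sld = [{"id": 1, "Name": "google"}, {"id": 2, "Name": "goggle"}]
domain = [
    {"Domain": "google.com", "SLDk_id": 1},
    {"Domain": "goggle.net", "SLDk_id": 2},
    {"Domain": "example.org", "SLDk_id": 7},
]
joined = [(s["Name"], d["Domain"]) for s, d in hash_join(sld, domain, "id", "SLDk_id")]
```

In the plan above, the "Hash" node corresponds to the build phase over the 1071 filtered analyst_sld rows, and the sequential scan of analyst_domain feeds the probe phase, with d."SLDk_id" = s.id as the lookup condition.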
#2
1
As to the DDL: in pgAdmin III, when you select an object (table, index, etc.) in the tree, its DDL is shown in the pane on the right.
The DDL would allow a better answer than the assumption-based guess below.
As to the reason: with the information given, and assuming a primary key index on the join column, here is my GUESS. d is ~17 times larger than s, and s is being filtered on a function that is not indexed, so the optimizer has no idea how selective that filter is. Which is quicker: an index scan with row lookups for s.name, or a sequential scan? According to the optimizer, the sequential scan wins.
#3
0
While I'm far from being a PostgreSQL expert, it looks like it's using a sequential scan because of the levenshtein function. I don't know the specifics of this function, but usually one can create an index on the result of a function; however, I would guess that the "Google" string varies in your case, so I don't know how useful such an index would be...
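On the specifics of the function: the five-argument form in the query matches the fuzzystrmatch signature levenshtein(source, target, ins_cost, del_cost, sub_cost), so 2, 2, 1 means insertions and deletions cost 2 and substitutions cost 1. The Python sketch below is a rough re-implementation of that weighted distance for illustration only (the name `weighted_levenshtein` is made up, and this is not the Postgres source). It shows that the result depends on both strings, which is why an index built for one constant cannot serve a different search string.

```python
def weighted_levenshtein(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Cheapest cost of turning `source` into `target` (illustrative sketch)."""
    # prev[j]: cost of turning the already processed prefix of source
    # into target[:j]; an empty prefix needs j insertions.
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_ch in enumerate(source, start=1):
        curr = [i * del_cost]  # turning source[:i] into "" takes i deletions
        for j, t_ch in enumerate(target, start=1):
            curr.append(min(
                prev[j] + del_cost,      # delete s_ch
                curr[j - 1] + ins_cost,  # insert t_ch
                prev[j - 1] + (0 if s_ch == t_ch else sub_cost),  # keep or substitute
            ))
        prev = curr
    return prev[-1]


# Same costs as the query: insert 2, delete 2, substitute 1.
same = weighted_levenshtein("google", "google", 2, 2, 1)
one_sub = weighted_levenshtein("google", "goggle", 2, 2, 1)
one_del = weighted_levenshtein("google", "googl", 2, 2, 1)
```

With these costs, the filter `< 4` admits names that are up to three substitutions away from 'google', or one insertion or deletion plus at most one substitution.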