I have tried everything I could think of to speed up this query, but it still takes about 2.5 seconds.
The table is images_tags (~4 million rows). Here is the table EXPLAIN:
Field      Type             Null  Key  Default
image_ids  int(7) unsigned  NO    PRI  NULL
tags_id    int(7) unsigned  NO    PRI  NULL
Here are the Indexes:
Table        Non_unique  Key_name   Seq_in_index  Column_name  Collation  Cardinality  Sub_part  Packed  Null  Index_type
images_tags  0           PRIMARY    1             image_ids    A          NULL         NULL      NULL          BTREE
images_tags  0           PRIMARY    2             tags_id      A          4408605      NULL      NULL          BTREE
images_tags  1           image_ids  1             image_ids    A          734767       NULL      NULL          BTREE
And here is the query:
select image_ids
from images_tags
where tags_id in (1, 2, 21, 846, 3175, 4290, 6591, 9357, 9594, 14289, 43364, 135019, 151295, 208803, 704452)
group by image_ids
order by count(*) desc
limit 10
And here is the query EXPLAIN:
select_type  table        type   possible_keys  key                  key_len  ref   rows     Extra
SIMPLE       vids_x_tags  index  join_tags_id   join_vids_id_unique  8        NULL  4408605  Using where; Using index; Using temporary; Using filesort
The goal is to get the 10 images that match those tags the most. I have tried messing around with these variables with little to no improvement:
- max_heap_table_size
- tmp_table_size
- myisam_sort_buffer_size
- read_buffer_size
- sort_buffer_size
- read_rnd_buffer_size
- net_buffer_length
- preload_buffer_size
- key_buffer_size
Is there any way to speed up this query considerably? There are about 700K images and it's always growing, so I wouldn't want to cache the result for more than a day or 2, and it has to be done for each image, so re-caching that many queries would be impossible.
2 Solutions
#1
1
In this kind of link (junction, many-to-many) table, it's almost always useful to have two compound indices, on both (a, b) and (b, a). You have only one of them (the primary index) and not the other.
And if there are no other columns in the table, you don't need any other index at all.
So, you should add the (tags_id, image_ids) index and remove the (image_ids) one, which is redundant:
ALTER TABLE images_tags
DROP INDEX image_ids,
ADD INDEX tag_image_IDX -- choose a name for the index
(tags_id, image_ids) ;
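To check whether the optimizer actually picks up the new index, the query from the question can be run through EXPLAIN again. This is only a sketch: it assumes the index was created under the name tag_image_IDX as above, and the plan you get may still differ depending on the data and the MySQL version.

```sql
-- Re-run EXPLAIN on the original query after adding (tags_id, image_ids).
-- If the new index is chosen, the "key" column should show tag_image_IDX
-- and "rows" should be well below the full 4.4M row count.
EXPLAIN
SELECT image_ids
FROM images_tags
WHERE tags_id IN (1, 2, 21, 846, 3175, 4290, 6591, 9357, 9594,
                  14289, 43364, 135019, 151295, 208803, 704452)
GROUP BY image_ids
ORDER BY COUNT(*) DESC
LIMIT 10;
```

The GROUP BY on image_ids combined with ORDER BY COUNT(*) will most likely still need a temporary table and a filesort; the index mainly reduces how many rows feed into that step.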
The efficiency of the index regarding the specific query depends on a lot of factors and mainly on the distribution of images and tags (how popular are the 15 tags you have in the IN list?)
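One way to get a feel for that distribution is to count how many rows each of the 15 tags contributes. The query below is a small sketch that reuses the tag list from the question; the alias images_with_tag is just an illustrative name.

```sql
-- Count how many images carry each tag from the IN list.
-- The sum of these counts is roughly the number of rows the
-- main query has to group, even with a perfect index.
SELECT tags_id, COUNT(*) AS images_with_tag
FROM images_tags
WHERE tags_id IN (1, 2, 21, 846, 3175, 4290, 6591, 9357, 9594,
                  14289, 43364, 135019, 151295, 208803, 704452)
GROUP BY tags_id
ORDER BY images_with_tag DESC;
```

If a few of the tags turn out to cover a large share of the 700K images, the grouping step will stay expensive regardless of which index is used.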
#2
1
In the EXPLAIN output from your query, you see that the key column does not match any item from the possible_keys list. This means that although the data was fetched from the index (which in many cases is smaller than the actual table, as it spans fewer columns), the engine still had to traverse all rows.
If you want to properly use an index to speed up this query, you should add one with the tag as its first (and probably only) component.
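A minimal sketch of that suggestion, assuming the table from the question and a made-up index name (tags_id_IDX):

```sql
-- Add an index whose first (here: only) column is the tag.
-- The index name tags_id_IDX is an arbitrary choice for illustration.
ALTER TABLE images_tags
  ADD INDEX tags_id_IDX (tags_id);

-- List the indexes afterwards to confirm it was created.
SHOW INDEX FROM images_tags;
```

Whether a single-column index is enough to answer the query from the index alone depends on the storage engine: InnoDB secondary indexes also carry the primary key columns, while MyISAM indexes do not, so on MyISAM a two-column (tags_id, image_ids) index may be the safer choice.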
By the way, the index on image_ids alone is of little use, as the primary key can provide that information just as well. In general, an index over multiple columns can be used to speed up queries which provide explicit values (or ranges) for either all of these columns, or a contiguous set of columns starting at the first. In other words, a two-column index will serve as a single-column index for its first column as well, but won't be much use for its second column all by itself, which is what you have here.
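The prefix rule described above can be illustrated with three lookups against the existing primary key (image_ids, tags_id). The literal values 12345 and 21 are arbitrary examples, not taken from real data.

```sql
-- Uses the first column of the primary key: the index can seek directly.
SELECT tags_id FROM images_tags WHERE image_ids = 12345;

-- Uses both columns: the index can seek to a single entry.
SELECT 1 FROM images_tags WHERE image_ids = 12345 AND tags_id = 21;

-- Uses only the second column: there is no usable prefix, so the
-- engine has to scan the whole index, as in the EXPLAIN output
-- from the question.
SELECT image_ids FROM images_tags WHERE tags_id = 21;
```

Prefixing each statement with EXPLAIN should show the difference in the type and rows columns.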
As an alternative to adding a key on tags_id and dropping the key on image_ids, you could keep the key on image_ids as it is, and reverse the order of columns for the primary key. Then the primary key could be used to answer tag-only queries as well. If you query the table more often by tag than by image, then I'd suggest this approach.
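As a rough sketch, reversing the primary key could look like the statement below. It assumes nothing else (such as a foreign key) depends on the current column order, and since it rebuilds the key over roughly 4 million rows, it is worth trying on a copy of the table first.

```sql
-- Replace PRIMARY KEY (image_ids, tags_id) with (tags_id, image_ids).
-- The separate index on image_ids is kept, so lookups by image
-- can still use an index.
ALTER TABLE images_tags
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (tags_id, image_ids);
```

Both operations are placed in a single ALTER TABLE so the table is never left without a primary key between steps.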