How can I speed up a query that joins a table to itself?

We have a 'users' table that holds information about our users. One of the fields in this table is called 'query'. I am trying to SELECT the user IDs of all users that have the same query, so my output should look like this:

user1_id    user2_id    common_query
   43          2            "foo"
   117         433          "bar"
   1           119          "baz"
   1           52           "qux"

Unfortunately, I can't get this query to finish in under an hour (the users table is pretty big). This is my current query:

SELECT u1.id,
       u2.id,
       u1.query
FROM users u1
INNER JOIN users u2
        ON u1.query = u2.query
       AND u1.id <> u2.id

My explain:

+----+-------------+-------+-------+----------------------+----------------------+---------+---------------------------------+----------+--------------------------+
| id | select_type | table | type  | possible_keys        | key                  | key_len | ref                             | rows     | Extra                    |
+----+-------------+-------+-------+----------------------+----------------------+---------+---------------------------------+----------+--------------------------+
|  1 | SIMPLE      | u1    | index | index_users_on_query | index_users_on_query | 768     | NULL                            | 10905267 | Using index              |
|  1 | SIMPLE      | u2    | ref   | index_users_on_query | index_users_on_query | 768     | u1.query                        |       11 | Using where; Using index |
+----+-------------+-------+-------+----------------------+----------------------+---------+---------------------------------+----------+--------------------------+

As you can see from the EXPLAIN output, the users table is indexed on query, and the index appears to be used in my SELECT. I'm wondering why the 'rows' column for table u2 has a value of 11 and not 1. Is there anything I can do to speed this query up? Is the '<>' comparison within my join bad practice? Also, the id field is the primary key.

4 Answers

#1

My biggest concern is the key_len, which indicates that MySQL must compare up to 768 bytes in order to look up each index entry.

For this query, a hash index on query could be much more performant (as it would involve substantially shorter comparisons, at the cost of calculating hashes and being unable to sort records using that index):

ALTER TABLE users ADD INDEX (query) USING HASH

You might also consider making this a composite on (query, id) so that MySQL need not scan into the record itself to test the <> criterion.
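
One caveat worth checking before relying on USING HASH: as far as I know, only the MEMORY and NDB storage engines build real hash indexes, while InnoDB and MyISAM accept the clause but still create a B-tree. If the table is on one of those engines, the shorter-comparison idea can be approximated with a manually maintained hash column. The following is only a sketch under that assumption; the column name query_crc and the index name idx_query_crc_id are made up for illustration.

```sql
-- Hypothetical helper column holding a 32-bit checksum of query.
-- It is nullable because CRC32(NULL) is NULL.
ALTER TABLE users
    ADD COLUMN query_crc INT UNSIGNED NULL;

-- One-off backfill; on a table of this size it may take a while.
UPDATE users SET query_crc = CRC32(query);

-- Short composite index: 4-byte checksum plus the id.
ALTER TABLE users
    ADD INDEX idx_query_crc_id (query_crc, id);

-- Join on the checksum first, then re-check the full string,
-- because different strings can share the same CRC32 value.
-- Using < instead of <> returns each pair once rather than twice.
SELECT u1.id,
       u2.id,
       u1.query
FROM users u1
INNER JOIN users u2
        ON u2.query_crc = u1.query_crc
       AND u2.query = u1.query
       AND u1.id < u2.id;
```

The checksum column has to be kept in sync on every INSERT and UPDATE (by the application, a trigger, or a stored generated column on MySQL versions that support one). Re-checking u2.query also means the row itself must be read, so whether this beats the existing index is something to measure rather than assume.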

#2

The main driver of the query is the equality on the query field, provided it is indexed. The <> condition on id is probably not very selective, which shows in the access type used for that table, 'ref'.

The following only applies if 'query' is not indexed.

If id is the primary key you could just do this:

CREATE INDEX index_1 ON users (query);

Assuming the table uses InnoDB, a secondary index on query also stores the primary key (id), so this index covers every column the query touches and should give the fastest execution for it.

#3

How many queries do you have? You could add a UsersInQueries table:

id   queryId   userId
0      5         453   
1      23        732 
2      15        761

Then select from this table and group by queryId.
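
A minimal sketch of what that could look like. The column types, the index name idx_query_user, and the choice to count users per queryId are all assumptions for illustration; only the table and column names come from the layout above.

```sql
-- Hypothetical definition of the mapping table shown above.
CREATE TABLE UsersInQueries (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    queryId INT UNSIGNED NOT NULL,
    userId  INT UNSIGNED NOT NULL,
    -- Lets the GROUP BY below be answered from the index alone.
    KEY idx_query_user (queryId, userId)
);

-- One row per query that is shared by more than one user.
SELECT queryId,
       COUNT(*) AS user_count
FROM UsersInQueries
GROUP BY queryId
HAVING COUNT(*) > 1;
```

The point of the design is that the grouping key becomes a small integer instead of a string of up to 768 bytes, at the cost of maintaining a separate table that maps each distinct query string to a queryId.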

#4

If you only have up to two users per query, you could do this instead:

select query, min(id) as FirstID, max(id) as SecondId
from users
group by query
having count(*) > 1

If you have more than two users with the same query, can you explain why you would want all pairs of such users?
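
If a list of users per query is enough and the explicit pairs are not needed, one possible variation is to collect the IDs into a single row per query. This is a sketch rather than a drop-in replacement, and the aliases user_count and user_ids are made up here.

```sql
-- One row per shared query, with all matching user ids in one column.
SELECT query,
       COUNT(*) AS user_count,
       GROUP_CONCAT(id ORDER BY id) AS user_ids
FROM users
GROUP BY query
HAVING COUNT(*) > 1;
```

Keep in mind that GROUP_CONCAT output is cut off at group_concat_max_len (1024 bytes by default), so very popular queries may need that setting raised.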
