I've been doing some research and testing on how to do fast random selection in MySQL. In the process I've faced some unexpected results and now I am not fully sure I know how ORDER BY RAND() really works.
I always thought that when you do ORDER BY RAND() on a table, MySQL adds a new column filled with random values, sorts the data by that column, and then you take, for example, the top row, which got there randomly. I've done lots of googling and testing and finally found that the query Jay offers in his blog is indeed the fastest solution:
SELECT * FROM Table T JOIN (SELECT CEIL(MAX(ID)*RAND()) AS ID FROM Table) AS x ON T.ID >= x.ID LIMIT 1;
While a plain ORDER BY RAND() takes 30-40 seconds on my test table, his query does the work in 0.1 seconds. He explains how it works in his blog, so I'll skip that and move on to the odd thing.
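For readers who want to play with the idea behind that query, here is a minimal sketch, not Jay's actual code. It assumes an in-memory SQLite database as a stand-in for MySQL and a hypothetical users table, and it computes the random threshold in Python instead of with CEIL(MAX(ID)*RAND()). The ORDER BY id is my addition so that "first row at or above the threshold" is well defined.

```python
import math
import random
import sqlite3


def random_row_by_id_threshold(conn, rng=random):
    """Pick a random threshold in 1..MAX(id) and return the first row at or above it."""
    (max_id,) = conn.execute("SELECT MAX(id) FROM users").fetchone()
    if max_id is None:  # empty table
        return None
    # Same idea as CEIL(MAX(ID) * RAND()); clamp to 1 in case random() returns 0.0.
    threshold = max(1, math.ceil(max_id * rng.random()))
    return conn.execute(
        "SELECT id, username, age FROM users WHERE id >= ? ORDER BY id LIMIT 1",
        (threshold,),
    ).fetchone()


# Hypothetical test data: ids 1..1000 with every multiple of 7 missing (gaps).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users (id, username, age) VALUES (?, ?, ?)",
    [(i, f"user{i}", 20 + i % 50) for i in range(1, 1001) if i % 7 != 0],
)

row = random_row_by_id_threshold(conn)
print(row)
```

Note that with gaps in the ids the pick is not uniform: a row that directly follows a gap is returned for every threshold that falls into the gap, so it comes up more often than its neighbours.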
My table is an ordinary table with a PRIMARY KEY id and other non-indexed columns such as username, age, etc. Here's the thing I am struggling to explain:
SELECT * FROM table ORDER BY RAND() LIMIT 1; /*30-40 seconds*/
SELECT id FROM table ORDER BY RAND() LIMIT 1; /*0.25 seconds*/
SELECT id, username FROM table ORDER BY RAND() LIMIT 1; /*90 seconds*/
I was expecting to see approximately the same time for all three queries, since I am always sorting on a single column. But for some reason this didn't happen. Please let me know if you have any ideas about this. I have a project where I need a fast ORDER BY RAND(), and personally I would prefer to use
SELECT id FROM table ORDER BY RAND() LIMIT 1;
SELECT * FROM table WHERE id=ID_FROM_PREVIOUS_QUERY LIMIT 1;
which, yes, is slower than Jay's method, but it is smaller and easier to understand. My queries are rather big, with several JOINs and a WHERE clause, and while Jay's method still works, the query grows really big and complex because I need to repeat all the JOINs and the WHERE clause in the joined subquery (called x in his query).
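The two-step approach can be sketched roughly like this. It is only an illustration under assumptions: SQLite in memory stands in for MySQL (so RANDOM() replaces RAND()), and the users table and its contents are made up.

```python
import sqlite3


def random_row_two_step(conn):
    """Step 1: pick a random id using only the key. Step 2: fetch that single row."""
    picked = conn.execute("SELECT id FROM users ORDER BY RANDOM() LIMIT 1").fetchone()
    if picked is None:  # empty table
        return None
    return conn.execute(
        "SELECT id, username, age FROM users WHERE id = ?", (picked[0],)
    ).fetchone()


# Hypothetical test data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users (id, username, age) VALUES (?, ?, ?)",
    [(i, f"user{i}", 20 + i % 50) for i in range(1, 501)],
)

row = random_row_two_step(conn)
print(row)
```

The appeal is that the expensive random sort only ever touches the id, and any JOINs or WHERE conditions need to appear just once, in the first statement.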
Thanks for your time!
4 Solutions
#1
12
While there's no such thing as a "fast order by rand()", there is a workaround for your specific task.
For getting any single random row, you can do what this German blogger does: http://www.roberthartung.de/mysql-order-by-rand-a-case-study-of-alternatives/ (I couldn't see a hotlink url. If anyone sees one, feel free to edit the link.)
The text is in German, but the SQL code is a bit further down the page in big white boxes, so it's not hard to find.
Basically, what he does is write a procedure that does the job of getting a valid row. It generates a random number between 0 and max_id and tries to fetch the row with that id; if it doesn't exist, it keeps going until it hits one that does. He allows for fetching x random rows by storing them in a temp table, so you can probably rewrite the procedure to be a bit faster when fetching only one row.
The downside is that if you have deleted a lot of rows and there are huge gaps in the ids, chances are high that it will miss many times, making it inefficient.
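To make the retry idea and its downside concrete, here is a rough sketch in Python rather than a MySQL stored procedure, so it is not the blogger's code. It assumes an in-memory SQLite database and a hypothetical users table in which only every tenth id exists. I draw from 1 to max_id instead of 0 to max_id, since id 0 would never match, and I cap the number of attempts so a very sparse table cannot loop forever.

```python
import random
import sqlite3


def random_row_with_retries(conn, max_attempts=1000, rng=random):
    """Guess random ids until one exists; return (row, attempts_used)."""
    (max_id,) = conn.execute("SELECT MAX(id) FROM users").fetchone()
    if max_id is None:  # empty table, nothing to guess
        return None, 0
    for attempt in range(1, max_attempts + 1):
        candidate = rng.randint(1, max_id)  # inclusive on both ends
        row = conn.execute(
            "SELECT id, username FROM users WHERE id = ?", (candidate,)
        ).fetchone()
        if row is not None:
            return row, attempt
    return None, max_attempts  # gave up


# Hypothetical sparse table: only ids 10, 20, ..., 1000 exist (90% gaps).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.executemany(
    "INSERT INTO users (id, username) VALUES (?, ?)",
    [(i, f"user{i}") for i in range(10, 1001, 10)],
)

row, attempts = random_row_with_retries(conn, rng=random.Random(42))
print(row, "found after", attempts, "attempt(s)")
```

Each guess is a cheap primary-key lookup, but with 90% of the ids missing you need about ten guesses per row on average, which is exactly the inefficiency described above.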
Update: Different execution times
SELECT * FROM table ORDER BY RAND() LIMIT 1; /*30-40 seconds*/
SELECT id FROM table ORDER BY RAND() LIMIT 1; /*0.25 seconds*/
SELECT id, username FROM table ORDER BY RAND() LIMIT 1; /*90 seconds*/
I was sort of expecting to see approximately the same time for all three queries since I am always sorting on a single column. But for some reason this didn't happen. Please let me know if you have any ideas about this.
It may have to do with indexing. id is indexed and quick to access, whereas adding username to the result means it needs to read that from each row and put it in the memory table. With the * it also has to read everything into memory, but it doesn't need to jump around the data file, meaning there's no time lost seeking.
This makes a difference only if there are variable length columns (varchar/text), which means it has to check the length, then skip that length, as opposed to just skipping a set length (or 0) between each row.
#2
2
It may have to do with indexing. id is indexed and quick to access, whereas adding username to the result means it needs to read that from each row and put it in the memory table. With the * it also has to read everything into memory, but it doesn't need to jump around the data file, meaning there's no time lost seeking. This makes a difference only if there are variable length columns, which means it has to check the length, then skip that length, as opposed to just skipping a set length (or 0) between each row.
Practice is better than any theory! Why not just check the plans? :)
mysql> explain select name from avatar order by RAND() limit 1;
+----+-------------+--------+-------+---------------+-----------------+---------+------+-------+----------------------------------------------+
| id | select_type | table  | type  | possible_keys | key             | key_len | ref  | rows  | Extra                                        |
+----+-------------+--------+-------+---------------+-----------------+---------+------+-------+----------------------------------------------+
|  1 | SIMPLE      | avatar | index | NULL          | IDX_AVATAR_NAME |     302 | NULL | 30062 | Using index; Using temporary; Using filesort |
+----+-------------+--------+-------+---------------+-----------------+---------+------+-------+----------------------------------------------+
1 row in set (0.00 sec)
mysql> explain select * from avatar order by RAND() limit 1;
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows  | Extra                           |
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+
|  1 | SIMPLE      | avatar | ALL  | NULL          | NULL | NULL    | NULL | 30062 | Using temporary; Using filesort |
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+
1 row in set (0.00 sec)
mysql> explain select name, experience from avatar order by RAND() limit 1;
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows  | Extra                           |
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+
|  1 | SIMPLE      | avatar | ALL  | NULL          | NULL | NULL    | NULL | 30064 | Using temporary; Using filesort |
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+
#3
0
I can tell you why SELECT id FROM ... is much faster than the other two, but I am not sure why SELECT id, username is 2-3 times slower than SELECT *.
When you have an index (the primary key in your case) and the result includes only columns from that index, the MySQL optimizer is able to use the data from the index alone and does not even look into the table itself. The more expensive each row is to read, the bigger the effect you will observe, since you replace filesystem IO operations with pure in-memory operations. If you add an additional index on (id, username), you should see similar performance in the third case as well.
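If you want to watch an index-only plan appear and disappear, here is a small sketch. It uses SQLite's EXPLAIN QUERY PLAN purely as a stand-in, so the plan text will not look like MySQL's EXPLAIN output. The avatar table layout (including the bio column) and the index names are my own assumptions, loosely modelled on the plans posted above.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the readable detail strings of EXPLAIN QUERY PLAN, joined into one line."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[-1] for row in rows)  # the last column holds the detail text


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE avatar (id INTEGER PRIMARY KEY, name TEXT, experience INTEGER, bio TEXT)"
)
conn.execute("CREATE INDEX idx_avatar_name ON avatar (name)")

name_only = "SELECT name FROM avatar ORDER BY RANDOM() LIMIT 1"
two_columns = "SELECT name, experience FROM avatar ORDER BY RANDOM() LIMIT 1"

# One selected column, covered by idx_avatar_name.
plan_name_only = query_plan(conn, name_only)
# Two selected columns, no single index holds both yet.
plan_two_before = query_plan(conn, two_columns)

# Add an index that contains every selected column, then look again.
conn.execute("CREATE INDEX idx_avatar_name_experience ON avatar (name, experience)")
plan_two_after = query_plan(conn, two_columns)

print("name only:          ", plan_name_only)
print("two columns, before:", plan_two_before)
print("two columns, after: ", plan_two_after)
```

The point to look for is whether a plan mentions a covering index: the query is answered from the index alone only when every selected column is in it. The exact wording of the plan lines differs between SQLite versions, and the random sort itself still happens in every case.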
#4
0
Why don't you add an index on (id, username) to the table and see if that forces MySQL to use the index rather than just a filesort and temp table?