I know that if I run this query
select top 100 * from mytable order by newid()
it will get 100 random records from my table.
However, I'm a bit confused as to how it works, since I don't see newid() in the select list. Can someone explain? Is there something special about newid() here?
5 answers
#1
31
I know what NewID() does, I'm just trying to understand how it would help in the random selection. Is it that (1) the select statement will select EVERYTHING from mytable, (2) for each row selected, tack on a uniqueidentifier generated by NewID(), (3) sort the rows by this uniqueidentifier and (4) pick off the top 100 from the sorted list?
Yes, this is pretty much exactly correct (except that it doesn't necessarily need to sort all the rows). You can verify this by looking at the actual execution plan.
SELECT TOP 100 *
FROM master..spt_values
ORDER BY NEWID()
The Compute Scalar operator adds a NEWID() column to each row (2506 rows in the table in my example query). The rows are then sorted by this column and the top 100 are selected.
SQL Server doesn't actually need to sort the rows that fall beyond position 100, so it uses a Top N Sort operator, which attempts to perform the entire sort operation in memory (for small values of N).
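If you want to check the plan yourself without the graphical tools, one option is SET STATISTICS XML, which returns the actual execution plan as XML alongside the query results. A minimal sketch, reusing the example query above (GO is the batch separator understood by SSMS and sqlcmd, not a T-SQL statement):

```sql
-- Ask SQL Server to return the actual execution plan (as XML) with the results.
SET STATISTICS XML ON;
GO

SELECT TOP 100 *
FROM master..spt_values
ORDER BY NEWID();
GO

-- Switch the plan output off again for the rest of the session.
SET STATISTICS XML OFF;
GO
```

In the returned plan you should be able to find the Compute Scalar operator that produces the NEWID() value and the Sort operator that performs the Top N Sort.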
#2
10
In general it works like this:
- All rows from mytable are read ("looped" over)
- NEWID() is executed for each row
- The rows are sorted by the random value from NEWID()
- The first 100 rows are selected
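The steps above can be spelled out as two separate statements. This is only an illustrative sketch: the temporary table #tagged and the column name sort_key are made-up names, and it assumes mytable has no column already called sort_key. The original one-line query does the same work without materializing anything.

```sql
-- Steps 1 and 2: read every row and tag it with a freshly generated GUID.
SELECT *, NEWID() AS sort_key
INTO #tagged
FROM mytable;

-- Steps 3 and 4: sort by the GUID and keep the first 100 rows.
SELECT TOP (100) *
FROM #tagged
ORDER BY sort_key;

-- Clean up the illustrative temporary table.
DROP TABLE #tagged;
```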
#3
7
The key here is the NEWID function, which generates a globally unique identifier (GUID) in memory for each row. By definition, the GUID is unique and fairly random; so, when you sort by that GUID with the ORDER BY clause, you get a random ordering of the rows in the table. Taking the top 10 percent (or whatever percentage you want) will give you a random sampling of the rows in the table.
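The percentage-based sampling just described might look like the following. It is a minimal sketch that assumes the table name mytable from the question:

```sql
-- Give each row a fresh GUID, sort by it, and keep the first 10 percent of rows.
SELECT TOP (10) PERCENT *
FROM mytable
ORDER BY NEWID();
```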
The NEWID query is often proposed; it is simple and works very well for small tables. However, it has a big drawback when you use it for large tables. The ORDER BY clause can cause all of the rows in the table to be copied into the tempdb database, where they are sorted. This causes two problems:
- The sorting operation usually has a high cost associated with it. Sorting can use a lot of disk I/O and can run for a long time.
- In the worst-case scenario, tempdb can run out of space. In the best-case scenario, tempdb can take up a large amount of disk space that will never be reclaimed without a manual shrink command.
What you need is a way to select rows randomly that will not use tempdb and will not get much slower as the table gets larger. Here is a new idea on how to do that:
SELECT *
FROM master..spt_values
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) AS int)) % 100) < 10
The basic idea behind this query is that we want to generate a random number between 0 and 99 for each row in the table, and then choose all of those rows whose random number is less than the value of the specified percent. In this example, we want approximately 10 percent of the rows selected randomly; therefore, we choose all of the rows whose random number is less than 10.
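One detail worth knowing about the query above: RAND() without a seed is evaluated once per query rather than once per row, so the row-to-row variation comes from BINARY_CHECKSUM(*). Rows with identical column values therefore get the same number and are selected or rejected together. If that matters, a possible variant (a sketch, not part of the original answer) takes the per-row randomness from NEWID() instead:

```sql
-- CHECKSUM(NEWID()) yields a different pseudo-random int for every row.
-- Casting to bigint before ABS avoids an overflow on the smallest int value.
SELECT *
FROM master..spt_values
WHERE ABS(CAST(CHECKSUM(NEWID()) AS bigint)) % 100 < 10;
```

Like the original, this returns approximately 10 percent of the rows, and the exact count varies from run to run.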
#4
4
As MSDN says:
NEWID() creates a unique value of type uniqueidentifier.
So your table will be sorted by these random values.
#5
1
Use
select top 100 randid = newid(), * from mytable order by randid
and it will become clear: the random value each row is sorted by shows up as the randid column.