Select n random rows from a SQL Server table

I've got a SQL Server table with about 50,000 rows in it. I want to select about 5,000 of those rows at random. I've thought of a complicated way, creating a temp table with a "random number" column, copying my table into that, looping through the temp table and updating each row with RAND(), and then selecting from that table where the random number column < 0.1. I'm looking for a simpler way to do it, in a single statement if possible.

This article suggests using the NEWID() function. That looks promising, but I can't see how I could reliably select a certain percentage of rows.

Anybody ever do this before? Any ideas?

15 Solutions

#1

323

select top 10 percent * from [yourtable] order by newid()

In response to the "pure trash" comment concerning large tables: you could do it like this to improve performance.

select  * from [yourtable] where [yourPk] in 
(select top 10 percent [yourPk] from [yourtable] order by newid())

The cost of this will be the key scan of values plus the join cost, which on a large table with a small percentage selection should be reasonable.
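
A minimal sketch of the same "sort only the keys, then fetch the rows" idea, written in Python against an in-memory SQLite database so that it is self-contained. This is an assumption for illustration, not SQL Server: SQLite has no NEWID() or TOP, so ORDER BY RANDOM() LIMIT stands in for them. The table name `yourtable` and the columns `id` and `payload` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO yourtable (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 1001)],
)

# Shuffle only the narrow key column, keep 10 percent of the keys,
# then fetch the full rows for just those keys.
sample_size = 100  # 10 percent of 1,000 rows
rows = conn.execute(
    """
    SELECT * FROM yourtable
    WHERE id IN (
        SELECT id FROM yourtable ORDER BY RANDOM() LIMIT ?
    )
    """,
    (sample_size,),
).fetchall()
```

The sort still touches every key, but it no longer has to carry the wide row data along with it.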

#2

Depending on your needs, TABLESAMPLE will get you nearly as random a sample with better performance. It is available in MS SQL Server 2005 and later.

TABLESAMPLE returns data from random pages instead of random rows, and therefore does not even retrieve data that it will not return.

On a very large table I tested

select top 1 percent * from [tablename] order by newid()

took more than 20 minutes.

select * from [tablename] tablesample(1 percent)

took 2 minutes.

Performance will also improve on smaller samples in TABLESAMPLE whereas it will not with newid().

Please keep in mind that this is not as random as the newid() method but will give you a decent sampling.

See the MSDN page.

#3

newid()/order by will work, but will be very expensive for large result sets because it has to generate an id for every row, and then sort them.

TABLESAMPLE() is good from a performance standpoint, but you will get clumping of results (all rows on a page will be returned).
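
The clumping can be illustrated with a small Python simulation. This is only a sketch of the idea, not of SQL Server's actual implementation: the page capacity of 100 rows and the function names `page_sample` and `row_sample` are assumptions made for illustration.

```python
import random

ROWS_PER_PAGE = 100   # assumed page capacity, for illustration only
TOTAL_ROWS = 100_000

def page_sample(total_rows, rows_per_page, fraction, rng):
    """Keep each page with probability `fraction`; return all rows on kept pages."""
    picked = []
    for first in range(0, total_rows, rows_per_page):
        if rng.random() < fraction:
            picked.extend(range(first, min(first + rows_per_page, total_rows)))
    return picked

def row_sample(total_rows, fraction, rng):
    """Keep each row independently with probability `fraction`."""
    return [r for r in range(total_rows) if rng.random() < fraction]

rng = random.Random(42)
by_page = page_sample(TOTAL_ROWS, ROWS_PER_PAGE, 0.01, rng)
by_row = row_sample(TOTAL_ROWS, 0.01, rng)

# Count how many distinct pages each sample draws its rows from.
pages_touched_by_page = len({r // ROWS_PER_PAGE for r in by_page})
pages_touched_by_row = len({r // ROWS_PER_PAGE for r in by_row})
```

Both samples aim at roughly 1 percent of the rows, but the page-based one reads far fewer pages and returns its rows in contiguous runs, which is where both the speed and the clumping come from.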

For a better performing true random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article Limiting Results Sets by Using TABLESAMPLE:

If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(),SalesOrderID) & 0x7fffffff AS float)
              / CAST (0x7fffffff AS int)

The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int) evaluates to a random float value between 0 and 1.
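
The arithmetic in that expression can be sketched in Python. This is a hedged illustration only: `rng.randint` stands in for CHECKSUM(NEWID(), SalesOrderID), and the helper names `to_unit_interval` and `filter_sample` are hypothetical.

```python
import random

MASK = 0x7FFFFFFF  # keeps the low 31 bits, so the result is never negative

def to_unit_interval(checksum_value: int) -> float:
    """Map a signed 32-bit integer to a float in [0, 1]."""
    return (checksum_value & MASK) / MASK

def filter_sample(row_ids, fraction, rng):
    """Keep each row whose per-row random value is at most `fraction`."""
    kept = []
    for row_id in row_ids:
        # Stand-in for CHECKSUM(NEWID(), SalesOrderID): a fresh signed 32-bit int per row.
        pseudo_checksum = rng.randint(-2**31, 2**31 - 1)
        if fraction >= to_unit_interval(pseudo_checksum):
            kept.append(row_id)
    return kept

rng = random.Random(2024)
sample = filter_sample(range(100_000), 0.01, rng)
```

Because every row is kept or dropped independently, the sample size is only approximately 1 percent and changes from run to run.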

When run against a table with 1,000,000 rows, here are my results:

SET STATISTICS TIME ON
SET STATISTICS IO ON

/* newid()
   rows returned: 10000
   logical reads: 3359
   CPU time: 3312 ms
   elapsed time: 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()

/* TABLESAMPLE
   rows returned: 9269 (varies)
   logical reads: 32
   CPU time: 0 ms
   elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)

/* Filter
   rows returned: 9994 (varies)
   logical reads: 3359
   CPU time: 641 ms
   elapsed time: 627 ms
*/    
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float) 
              / CAST (0x7fffffff AS int)

SET STATISTICS IO OFF
SET STATISTICS TIME OFF

If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise use the newid()/filter method. newid()/order by should be a last resort if you have a large result set.

#4

Selecting Rows Randomly from a Large Table on MSDN has a simple, well-articulated solution that addresses the large-scale performance concerns.

  SELECT * FROM Table1
  WHERE (ABS(CAST(
  (BINARY_CHECKSUM(*) *
  RAND()) as int)) % 100) < 10

#5

Just order the table by a random number and obtain the first 5,000 rows using TOP.

SELECT TOP 5000 * FROM [Table] ORDER BY newid();

UPDATE

Just tried it, and a newid() call is sufficient - no need for all the casts and all the math.

#6

If you (unlike the OP) need a specific number of records (which makes the CHECKSUM approach difficult) and desire a more random sample than TABLESAMPLE provides by itself, and also want better speed than CHECKSUM, you may make do with a merger of the TABLESAMPLE and NEWID() methods, like this:

DECLARE @sampleCount int = 50
SET STATISTICS TIME ON

SELECT TOP (@sampleCount) * 
FROM [yourtable] TABLESAMPLE(10 PERCENT)
ORDER BY NEWID()

SET STATISTICS TIME OFF

In my case this is the most straightforward compromise between randomness (it's not truly random, I know) and speed. Vary the TABLESAMPLE percentage (or rows) as appropriate - the higher the percentage, the more random the sample, but expect a linear drop-off in speed. (Note that TABLESAMPLE will not accept a variable.)

#7

This link has an interesting comparison between ORDER BY NEWID() and other methods for tables with 1, 7, and 13 million rows.

Often, when questions about how to select random rows are asked in discussion groups, the NEWID query is proposed; it is simple and works very well for small tables.

SELECT TOP 10 PERCENT *
  FROM Table1
  ORDER BY NEWID()

However, the NEWID query has a big drawback when you use it for large tables. The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database, where they are sorted. This causes two problems:

The sorting operation usually has a high cost associated with it. Sorting can use a lot of disk I/O and can run for a long time.
In the worst-case scenario, tempdb can run out of space. In the best-case scenario, tempdb can take up a large amount of disk space that never will be reclaimed without a manual shrink command.

What you need is a way to select rows randomly that will not use tempdb and will not get much slower as the table gets larger. Here is a new idea on how to do that:

SELECT * FROM Table1
  WHERE (ABS(CAST(
  (BINARY_CHECKSUM(*) *
  RAND()) as int)) % 100) < 10

The basic idea behind this query is that we want to generate a random number between 0 and 99 for each row in the table, and then choose all of those rows whose random number is less than the value of the specified percent. In this example, we want approximately 10 percent of the rows selected randomly; therefore, we choose all of the rows whose random number is less than 10.
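
The "random number between 0 and 99 per row" idea can be tried out with a runnable sketch in Python and an in-memory SQLite database. This is an assumption for illustration, not the MSDN query itself: SQLite's RANDOM() replaces the BINARY_CHECKSUM(*) * RAND() expression, and `Table1` with a single `id` column is a hypothetical table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO Table1 (id) VALUES (?)", [(i,) for i in range(20_000)])

percent = 10
# RANDOM() is evaluated once per row in SQLite; masking to 31 bits keeps the
# value non-negative before reducing it to the range 0..99.
rows = conn.execute(
    "SELECT id FROM Table1 WHERE ((RANDOM() & 2147483647) % 100) < ?",
    (percent,),
).fetchall()
```

No sort is involved, so the query is a single pass over the table, at the price of a row count that is only approximately 10 percent.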

Please read the full article in MSDN.

#8

In MySQL you can do this:

SELECT `PRIMARY_KEY`, rand() FROM table ORDER BY rand() LIMIT 5000;

#9

This is a combination of the initial seed idea and a checksum, which looks to me to give properly random results without the cost of NEWID():

SELECT TOP [number] *
FROM table_name
ORDER BY RAND(CHECKSUM(*) * RAND())

#10

Try this:

SELECT TOP 10 Field1, ..., FieldN
FROM Table1
ORDER BY NEWID()

#11

Didn't quite see this variation in the answers yet. I had an additional constraint where I needed, given an initial seed, to select the same set of rows each time.

For MS SQL:

Minimal example:

select top 10 percent *
from table_name
order by rand(checksum(*))

Normalized execution time: 1.00

NewId() example:

select top 10 percent *
from table_name
order by newid()

Normalized execution time: 1.02

NewId() is insignificantly slower than rand(checksum(*)), so you may not want to use it against large record sets.

Selection with Initial Seed:

declare @seed int
set @seed = Year(getdate()) * month(getdate()) /* any other initial seed here */

select top 10 percent *
from table_name
order by rand(checksum(*) % @seed) /* any other math function here */

If you need to select the same set given a seed, this seems to work.
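
For comparison, here is a sketch of a repeatable, seeded selection done in Python. Note that it swaps in a different technique from the query above: rows are ordered by a SHA-256 hash of the seed and the row key rather than by rand(checksum(*) % @seed). The function name `seeded_sample` and the integer keys are hypothetical.

```python
import hashlib

def seeded_sample(keys, seed, fraction):
    """Return the same subset of `keys` every time for a given `seed`."""
    def sort_key(key):
        # Hash the seed together with the key so each seed gives its own ordering.
        return hashlib.sha256(f"{seed}:{key}".encode()).digest()
    ordered = sorted(keys, key=sort_key)
    count = int(len(ordered) * fraction)
    return ordered[:count]

keys = list(range(1, 1001))
first = seeded_sample(keys, seed=2024 * 6, fraction=0.10)
again = seeded_sample(reversed(keys), seed=2024 * 6, fraction=0.10)
other = seeded_sample(keys, seed=2024 * 7, fraction=0.10)
```

Because the ordering depends only on the seed and each key, the result does not change when the rows arrive in a different order, and a different seed gives a different subset.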

#12

This works for me:

SELECT * FROM table_name
ORDER BY RANDOM()
LIMIT [number]

#13

It appears newid() can't be used in a WHERE clause, so this solution requires an inner query:

SELECT *
FROM (
    SELECT *, ABS(CHECKSUM(NEWID())) AS Rnd
    FROM MyTable
) vw
WHERE Rnd % 100 < 10        --10%

#14

I was using it in a subquery, and the subquery returned the same row every time:

 SELECT  ID ,
            ( SELECT TOP 1
                        ImageURL
              FROM      SubTable 
              ORDER BY  NEWID()
            ) AS ImageURL,
            GETUTCDATE() ,
            1
    FROM    Mytable

I then solved it by referencing the parent table in the subquery's WHERE clause:

SELECT  ID ,
            ( SELECT TOP 1
                        ImageURL
              FROM      SubTable 
              Where Mytable.ID>0
              ORDER BY  NEWID()
            ) AS ImageURL,
            GETUTCDATE() ,
            1
    FROM    Mytable

Note the WHERE condition.

#15

The server-side processing language in use (e.g. PHP, .NET) isn't specified, but if it's PHP, grab the required number of records (or all of them) and, instead of randomising in the query, use PHP's shuffle function. I don't know whether .NET has an equivalent function, but if it does, use that if you're using .NET.

ORDER BY RAND() can have quite a performance penalty, depending on how many records are involved.
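
As a rough sketch of the application-side approach, here is the same idea in Python rather than PHP, with an in-memory SQLite database standing in for the real server (both are assumptions for illustration). The table `items` and its `id` column are hypothetical.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO items (id) VALUES (?)", [(i,) for i in range(50_000)])

# Fetch only the keys, then pick 5,000 distinct ones in application code.
all_ids = [row[0] for row in conn.execute("SELECT id FROM items")]
chosen = random.sample(all_ids, 5_000)
```

The chosen keys can then be used to fetch the full rows, which avoids sorting in the database at the cost of transferring every key to the application.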
