TABLESAMPLE返回错误的行数?

时间:2022-11-27 18:32:00

I've just discovered the TABLESAMPLE clause but surprisingly it doesn't return the number of rows i've specified.

我刚刚发现了TABLESAMPLE子句,但令人惊讶的是它没有返回我指定的行数。

The table that i've used has ~14M rows and i wanted an arbitrary sample of 10000 rows.

我使用的表有~14M行,我想要一个10000行的任意样本。

select * from tabData TABLESAMPLE(10000 ROWS)

I get not 10000 but a different number everytime i execute it(between 8000 and 14000).

每次执行它(8000到14000之间),我得到的不是10000而是一个不同的数字。

What's going on here, have i misunderstood the intended purpose of TABLESAMPLE?

这里发生了什么,我是否误解了TABLESAMPLE的预期目的?

Edit:

编辑:

David's link explains it pretty well.

大卫的链接很好地解释了它。

This returns always 10000 roughly random rows in an efficient way:

这将以有效的方式返回10000个大致随机的行:

select TOP 10000 * from tabData TABLESAMPLE(20000 ROWS);

and the REPEATABLE option helps to get always the same (unless data has changed)

并且REPEATABLE选项有助于始终保持相同(除非数据已更改)

select TOP 10000 * from tabData TABLESAMPLE(10000 ROWS) REPEATABLE(100);

Since i wanted to know if it's more expensive to use TABLESAMPLE with a large number of rows to ensure(?) that i get the correct row-number, i've measured it;

因为我想知道使用带有大量行的TABLESAMPLE来确保(?)我得到正确的行号是否更昂贵,我已经测量过了;

1.loop (20 times):

1.环(20次):

select TOP 10000 * from tabData TABLESAMPLE(10000 ROWS);

(9938 row(s) affected)
(10000 row(s) affected)
(9383 row(s) affected)
(9526 row(s) affected)
(10000 row(s) affected)
(9545 row(s) affected)
(9560 row(s) affected)
(9673 row(s) affected)
(9608 row(s) affected)
(9476 row(s) affected)
(9766 row(s) affected)
(10000 row(s) affected)
(9500 row(s) affected)
(9941 row(s) affected)
(9769 row(s) affected)
(9547 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(9478 row(s) affected)
First batch(only 10000 rows) completed in: 14 seconds!

2.loop (20 times):

2.loop(20次):

select TOP 10000 * from tabData TABLESAMPLE(10000000 ROWS);

(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
Second batch(max rows) completed in: 13 seconds!

3.loop: counterscheck with 100% random rows using ORDER BY NEWID():

3.loop:使用ORDER BY NEWID()进行100%随机行的反序验:

select TOP 10000 * from tabData ORDER BY NEWID();

(10000 row(s) affected)

Cancelled after one execution that lasted 23 minutes

在持续23分钟的一次执行后取消

Conclusion:

结论:

So suprisingly the approach with an exact TOP clause and a large number in TABLESAMPLE is not slower. Hence it's a very efficient alternative to ORDER BY NEWID() if it doesn't matter that the rows are not random per row but per page level(Each 8K page for the table is given a random value).

令人惊讶的是,在TABLESAMPLE中使用精确的TOP子句和大量数字的方法并不慢。因此,它是ORDER BY NEWID()的一个非常有效的替代方法,如果行不是每行随机而是每页级别无关紧要(表的每个8K页面都给出一个随机值)。

4 个解决方案

#1


4  

See the article here. You need to add a top clause and/or use the repeatable option to get the number of rows you want.

请参阅此处的文章。您需要添加一个top子句和/或使用repeatable选项来获取所需的行数。

#2


2  

From the documentation.

从文档中。

The actual number of rows that are returned can vary significantly. If you specify a small number, such as 5, you might not receive results in the sample.

返回的实际行数可能会有很大差异。如果指定较小的数字(例如5),则可能无法在样本中收到结果。

http://msdn.microsoft.com/en-us/library/ms189108(v=sql.90).aspx

http://msdn.microsoft.com/en-us/library/ms189108(v=sql.90).aspx

#3


1  

This behavior has been documented before. There is a good writeup on it here.

此行为已在之前记录过。这里有一篇很好的文章。

I believe you can fix it by passing REPEATABLE with the same seed each time. Here is a snippit from the writeup:

我相信你可以通过每次都使用相同的种子传递REPEATABLE来解决它。这是来自写作的snippit:

...you will notice that different numbers of rows are returned everytime. Without any data changing, re-running the identical query keeps giving different results. This is non -deterministic factor of TABLESAMEPLE clause. If table is static and rows are not changed what could be the reason to return different numbers of the rows to return in each execution. The factor is 10 PERCENT is not the percentages of the table rows or tables records, it is percentages of the table’s data pages. Once the sample pages of data selected, all the rows from the selected pages are returned, it will not limit the number of rows sampled from that page. Fill factor of all the pages varies depends on the data of the table. This makes script to return different row count in result set everytime it is executed. The REPEATABLE option causes a selected sample to be returned again. When REPEATABLE is specified with the same repeat_seed value, SQL Server returns the same subset of rows, as long as no changes have been made to the table. When REPEATABLE is specified with a different repeat_seed value, SQL Server will typically return a different sample of the rows in the table. .

...您会注意到每次都会返回不同数量的行。如果没有任何数据更改,重新运行相同的查询将继续提供不同的结果。这是TABLESAMEPLE子句的非确定性因素。如果table是静态的并且没有更改行,则可能是在每次执行中返回不同数量的行以返回的原因。因子是10 PERCENT不是表行或表记录的百分比,它是表的数据页的百分比。选择数据的示例页面后,将返回所选页面中的所有行,但不会限制从该页面采样的行数。所有页面的填充因子取决于表的数据。这使脚本在每次执行时返回结果集中的不同行数。 REPEATABLE选项导致再次返回选定的样本。如果使用相同的repeat_seed值指定REPEATABLE,则只要未对表进行任何更改,SQL Server就会返回相同的行子集。当使用不同的repeat_seed值指定REPEATABLE时,SQL Server通常会返回表中行的不同样本。 。

#4


1  

I've observed the same.

我也观察到了同样的情况。

The page explanation definitely makes sense and rings a bell - You should see much more predictable row counts when your row size is fixed. Try it on a table with no nullable or variable-length columns.

页面说明肯定是有意义的并敲响了一个响铃 - 当您的行大小固定时,您应该会看到更多可预测的行数。在没有可空列或可变长度列的表上尝试它。

In fact I just used it to prove a theory about using it to update (you were probably spurred by the same question I was), and choosing TABLESAMPLE (50000 ROWS) actually affected 49,849 rows.

事实上,我只是用它来证明一个关于使用它进行更新的理论(你可能受到我同样的问题的刺激),选择TABLESAMPLE(50000 ROWS)实际上影响了49,849行。

#1


4  

See the article here. You need to add a top clause and/or use the repeatable option to get the number of rows you want.

请参阅此处的文章。您需要添加一个top子句和/或使用repeatable选项来获取所需的行数。

#2


2  

From the documentation.

从文档中。

The actual number of rows that are returned can vary significantly. If you specify a small number, such as 5, you might not receive results in the sample.

返回的实际行数可能会有很大差异。如果指定较小的数字(例如5),则可能无法在样本中收到结果。

http://msdn.microsoft.com/en-us/library/ms189108(v=sql.90).aspx

http://msdn.microsoft.com/en-us/library/ms189108(v=sql.90).aspx

#3


1  

This behavior has been documented before. There is a good writeup on it here.

此行为已在之前记录过。这里有一篇很好的文章。

I believe you can fix it by passing REPEATABLE with the same seed each time. Here is a snippit from the writeup:

我相信你可以通过每次都使用相同的种子传递REPEATABLE来解决它。这是来自写作的snippit:

...you will notice that different numbers of rows are returned everytime. Without any data changing, re-running the identical query keeps giving different results. This is non -deterministic factor of TABLESAMEPLE clause. If table is static and rows are not changed what could be the reason to return different numbers of the rows to return in each execution. The factor is 10 PERCENT is not the percentages of the table rows or tables records, it is percentages of the table’s data pages. Once the sample pages of data selected, all the rows from the selected pages are returned, it will not limit the number of rows sampled from that page. Fill factor of all the pages varies depends on the data of the table. This makes script to return different row count in result set everytime it is executed. The REPEATABLE option causes a selected sample to be returned again. When REPEATABLE is specified with the same repeat_seed value, SQL Server returns the same subset of rows, as long as no changes have been made to the table. When REPEATABLE is specified with a different repeat_seed value, SQL Server will typically return a different sample of the rows in the table. .

...您会注意到每次都会返回不同数量的行。如果没有任何数据更改,重新运行相同的查询将继续提供不同的结果。这是TABLESAMEPLE子句的非确定性因素。如果table是静态的并且没有更改行,则可能是在每次执行中返回不同数量的行以返回的原因。因子是10 PERCENT不是表行或表记录的百分比,它是表的数据页的百分比。选择数据的示例页面后,将返回所选页面中的所有行,但不会限制从该页面采样的行数。所有页面的填充因子取决于表的数据。这使脚本在每次执行时返回结果集中的不同行数。 REPEATABLE选项导致再次返回选定的样本。如果使用相同的repeat_seed值指定REPEATABLE,则只要未对表进行任何更改,SQL Server就会返回相同的行子集。当使用不同的repeat_seed值指定REPEATABLE时,SQL Server通常会返回表中行的不同样本。 。

#4


1  

I've observed the same.

我也观察到了同样的情况。

The page explanation definitely makes sense and rings a bell - You should see much more predictable row counts when your row size is fixed. Try it on a table with no nullable or variable-length columns.

页面说明肯定是有意义的并敲响了一个响铃 - 当您的行大小固定时,您应该会看到更多可预测的行数。在没有可空列或可变长度列的表上尝试它。

In fact I just used it to prove a theory about using it to update (you were probably spurred by the same question I was), and choosing TABLESAMPLE (50000 ROWS) actually affected 49,849 rows.

事实上,我只是用它来证明一个关于使用它进行更新的理论(你可能受到我同样的问题的刺激),选择TABLESAMPLE(50000 ROWS)实际上影响了49,849行。