SQL Server "<>" operator is very slow compared to "=" on a table with a few million rows

Date: 2021-07-03 00:50:01

I have two tables. Forms has ~77000 rows. Logs has ~2.7 million rows.

The following query returns "30198" in less than a second:

SELECT COUNT(DISTINCT logs.DOCID) FROM logs, forms WHERE logs.DOCID = forms.DOCID;

And this query has been running for ~15 minutes so far, and still hasn't finished:

SELECT COUNT(DISTINCT logs.DOCID) FROM logs, forms WHERE logs.DOCID <> forms.DOCID;

Why is the "not equal" query so much slower?

3 Answers

#1


28  

Because = reduces the join operation to one single matching row from each table (presuming those docids are unique).

Think of it this way: you've got a dance with 5 boys and 5 girls:

Adam      Alice
Bob       Betty
Charly    Cathy
Dick      Deb
Evan      Elly

You pair them up by first letter, so:

Adam->Alice
Bob->Betty
etc...

Each person ends up in exactly one pairing.

But if you pair them up by "First letters do NOT match", you end up with:

Adam->Betty
Adam->Cathy
Adam->Deb
Adam->Elly
Bob->Alice
etc...

You've MASSIVELY increased the number of pairings. This is why your <> query is taking so long. You're essentially trying to fetch close to m x n rows, rather than just min(m,n). With the dancers above, you end up with 20 rows (25 possible pairings minus the 5 that match), rather than 5. For your specified table sizes, you're working with 77,000 * 2,700,000 = 207.9 billion candidate rows, minus the pairs where the two ids match up (77,000 if every form matched exactly one log row), for a total of roughly 207,899,923,000 rows in the joined data set.
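
The dance example can be checked directly. The following is only a sketch: the `boys` and `girls` CTEs, and the output column names, are made up for illustration and stand in for the two joined tables.

```sql
-- Hypothetical stand-ins for the two tables being joined.
WITH boys(name) AS (
    SELECT name
    FROM (VALUES ('Adam'), ('Bob'), ('Charly'), ('Dick'), ('Evan')) AS b(name)
),
girls(name) AS (
    SELECT name
    FROM (VALUES ('Alice'), ('Betty'), ('Cathy'), ('Deb'), ('Elly')) AS g(name)
)
-- Build every possible pairing, then count how many satisfy each condition.
SELECT
    SUM(CASE WHEN LEFT(b.name, 1) =  LEFT(g.name, 1) THEN 1 ELSE 0 END) AS matching_pairs,
    SUM(CASE WHEN LEFT(b.name, 1) <> LEFT(g.name, 1) THEN 1 ELSE 0 END) AS non_matching_pairs
FROM boys AS b
CROSS JOIN girls AS g;
-- matching_pairs = 5, non_matching_pairs = 20
```

The "=" condition keeps one row per dancer, while the "<>" condition keeps almost the whole cross product, and that gap grows with the product of the table sizes.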


Given your query requirements, try a left join and look for null right-side records:

SELECT DISTINCT logs.DOCID
FROM logs
LEFT JOIN forms ON logs.DOCID = forms.DOCID
WHERE forms.DOCID IS NULL
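
Since the original question asked for a count, the same idea can also be written with NOT EXISTS. This is a sketch that assumes the `logs` and `forms` tables from the question; the aliases and the `unmatched_docids` column name are only illustrative.

```sql
-- Count the distinct DOCIDs in logs that have no matching row in forms.
SELECT COUNT(DISTINCT l.DOCID) AS unmatched_docids
FROM logs AS l
WHERE NOT EXISTS (
    SELECT 1
    FROM forms AS f
    WHERE f.DOCID = l.DOCID   -- equality, so an index on forms.DOCID can help if one exists
);
```

Either form only needs an equality comparison between the two tables, so it avoids building the huge "<>" pairing set. Note that COUNT(DISTINCT ...) ignores NULL DOCIDs, whereas the SELECT DISTINCT above would list a NULL as one row.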

#2


2  

Two reasons:

  • Queries for equality can generally use indexes (if available), while queries for inequality generally cannot use them to narrow the search.

  • <> returns so much more data.

Your query with <> is bogus. What should it return?
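
To see why the result is not useful, here is a small sketch with made-up data. The `logs` and `forms` CTEs below are tiny hypothetical stand-ins for the real tables, not the asker's data.

```sql
-- Hypothetical miniature versions of the two tables.
WITH logs(DOCID) AS (
    SELECT DOCID FROM (VALUES (1), (2), (3), (4)) AS l(DOCID)
),
forms(DOCID) AS (
    SELECT DOCID FROM (VALUES (1), (2)) AS f(DOCID)
)
-- Same shape as the slow query from the question.
SELECT COUNT(DISTINCT logs.DOCID) AS docids_returned
FROM logs, forms
WHERE logs.DOCID <> forms.DOCID;
-- docids_returned = 4
```

Every DOCID in logs is counted, including 1 and 2, which do exist in forms, because each of them differs from at least one other forms row. As long as forms holds two or more distinct DOCIDs, the "<>" query just counts all the distinct non-NULL DOCIDs in logs, only very slowly.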

#3


1  

This is totally dependent on the distribution of values in the table. If, for example, the column you are searching had the same value (= forms.DOCID) in 99.99% of the rows and only one row with a different value, you would see exactly the opposite behavior.
