Let's suppose I have a table in my database with 1.000.000 records.
If I execute:
SELECT * FROM [Table] LIMIT 1000
Will this query take the same time as if the table had only 1000 records and I just ran:
SELECT * FROM [Table]
?
I'm not asking whether it will take exactly the same time. I just want to know whether the first query will take much longer to execute than the second one.
I said 1.000.000 records, but it could be 20.000.000. That was just an example.
Edit:
Of course, on the same table a query with LIMIT should execute faster than the same query without it, but that's not what I'm asking...
To make it generic:
Table1: X records
Table2: Y records
(X << Y)
What I want to compare is:
SELECT * FROM Table1
and
SELECT * FROM Table2 LIMIT X
Edit 2:
Here is why I'm asking this:
I have a database with 5 tables and relationships between some of them. One of those tables will (I'm 100% sure) contain about 5.000.000 records. I'm using SQL Server CE 3.5, Entity Framework as the ORM and LINQ to SQL to make the queries.
I basically need to perform three kinds of non-simple queries, and I was thinking about showing the user a limited number of records (just like a lot of websites do). If the user wants to see more records, the option he/she has is to narrow the search further.
So, the question came up because I was deciding between doing this (limiting each query to X records) and storing only X results (the most recent ones) in the database, which would require doing some deletions in the database, but I was just thinking...
So, that table could contain 5.000.000 records or more, and what I don't want is to show the user 1000 or so and, even then, have the query be as slow as if it were returning all 5.000.000 rows.
3 solutions
#1
1
This assumes both tables are equivalent in terms of indexes, row size and other structures, and that you are running that simple SELECT statement. If you have an ORDER BY clause in your SQL statements, then obviously the larger table will be slower. I suppose you're not asking that.
If X = Y, then obviously they should run at similar speed, since the query engine will be going through the records in exactly the same order (basically a table scan) for this simple SELECT statement. There will be no difference in query plan.
If Y > X only by a little bit, then the speed will also be similar.
However, if Y >> X (meaning Table2 has many more rows than Table1), then the LIMIT version MAY be slower. This is not because of the query plan, which again should be the same, but simply because the internal structure of the data layout may have several more levels. For example, if data is stored as leaves of a tree, there may be more tree levels, so it may take slightly more time to access the same number of pages.
In other words, 1000 rows may be stored in 1 tree level in 10 pages, say, while 1000000 rows may be stored in 3-4 tree levels in 10000 pages. Even when taking only 10 pages from those 10000 pages, the storage engine still has to go through 3-4 tree levels, which may take slightly longer.
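To put rough numbers on the tree-level argument, here is a minimal sketch that estimates how many levels a B-tree-like structure needs. The `tree_levels` helper, the 100 rows per page (taken from the 1000 rows / 10 pages figure above) and the fan-out of 100 are illustrative assumptions, not properties of any particular storage engine. Note that this sketch counts the root as a level, so the small table comes out as 2 levels rather than 1.

```python
import math

def tree_levels(total_rows: int, rows_per_page: int, fanout: int) -> int:
    """Estimate the number of levels (root included) in a B-tree-like layout.

    rows_per_page and fanout are assumed, illustrative values.
    """
    pages = math.ceil(total_rows / rows_per_page)  # leaf pages
    levels = 1  # the leaf level itself
    # Each level above needs one entry per page in the level below it.
    while pages > 1:
        pages = math.ceil(pages / fanout)
        levels += 1
    return levels

small = tree_levels(1_000, rows_per_page=100, fanout=100)
large = tree_levels(1_000_000, rows_per_page=100, fanout=100)
print(small, large)  # prints: 2 3
```

Under these assumptions, a table a thousand times larger costs only one extra level to traverse, which is why the difference is described as slight.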
Now, if the storage engine stores data pages sequentially or as a linked list, say, then there will be no difference in execution speed.
#2
3
TAKE 1000 from a table of 1000000 records will be about 1000000/1000 (= 1000) times faster than selecting all of them, because it only needs to look at (and return) 1000 of the 1000000 records. Since it does less, it is naturally faster.
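One way to check this claim on your own data is simply to time the three queries. The sketch below uses Python's built-in `sqlite3` module as a stand-in engine, so it says nothing definitive about SQL Server CE; the table names, the `timed` helper and the row counts are all assumptions made for illustration.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE small_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, payload TEXT)")

X, Y = 1_000, 200_000  # assumed sizes, with X much smaller than Y
conn.executemany("INSERT INTO small_table (payload) VALUES (?)",
                 (("row",) for _ in range(X)))
conn.executemany("INSERT INTO big_table (payload) VALUES (?)",
                 (("row",) for _ in range(Y)))
conn.commit()

def timed(sql, params=()):
    """Run a query, fetch every row, and return (row count, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return len(rows), time.perf_counter() - start

small_count, small_secs = timed("SELECT * FROM small_table")
limit_count, limit_secs = timed("SELECT * FROM big_table LIMIT ?", (X,))
full_count, full_secs = timed("SELECT * FROM big_table")

print(f"small table, all rows: {small_count} rows in {small_secs:.4f}s")
print(f"big table, LIMIT {X}:  {limit_count} rows in {limit_secs:.4f}s")
print(f"big table, all rows:   {full_count} rows in {full_secs:.4f}s")
```

The timings will vary by machine and engine, so treat them as a rough comparison rather than a benchmark.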
The result will be pretty (pseudo-)random, since you haven't specified any order in which to TAKE. However, if you do introduce an order, then one of the two cases below applies:
- The ORDER BY clause follows an index - the above statement is still true.
- The ORDER BY clause cannot use any index - it will be only marginally faster than without the TAKE, because
  - it has to inspect ALL records, and sort by ORDER BY
  - deliver only a subset (TAKE count)
  - so it is not faster in the first step, but the 2nd step involves less IO/network than ALL records
If you TAKE 1000 records from a table of 1000 records, it will be equivalent (with no significant difference) to taking 1000 records from a table of 1 billion, as long as you are in the case of (1) no ORDER BY, or (2) ORDER BY against an index.
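The two ORDER BY cases can be seen in a query plan. The sketch below again uses SQLite through Python's `sqlite3` module purely as an illustration; the `events` table, its columns and the index name are hypothetical, and other engines report their plans differently. In SQLite, a plan step mentioning a temporary B-tree indicates that the rows must be sorted before the limit can be applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, created INTEGER, score INTEGER)"
)
# Only `created` is indexed; `score` deliberately has no index.
conn.execute("CREATE INDEX idx_events_created ON events (created)")
conn.executemany(
    "INSERT INTO events (created, score) VALUES (?, ?)",
    ((i, (i * 7919) % 1000) for i in range(10_000)),
)

def plan(sql):
    """Return the human-readable detail column of each EXPLAIN QUERY PLAN row."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

def needs_sort(steps):
    """True if SQLite reports building a temporary B-tree to sort the rows."""
    return any("TEMP B-TREE" in step for step in steps)

indexed_plan = plan("SELECT * FROM events ORDER BY created LIMIT 1000")
unindexed_plan = plan("SELECT * FROM events ORDER BY score LIMIT 1000")

print("ORDER BY indexed column needs a sort:  ", needs_sort(indexed_plan))
print("ORDER BY unindexed column needs a sort:", needs_sort(unindexed_plan))
```

When the ordering follows the index, the engine can walk the index and stop after the first 1000 entries; without one, it has to read every row before it knows which 1000 come first.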
#3
0
It would be approximately linear, as long as you specify no fields, no ordering, and all the records. But that doesn't buy you much. It falls apart as soon as your query wants to do something useful.
This would be quite a bit more interesting if you intended to draw some useful conclusion and tell us about the way it would be used to make a design choice in some context.
Thanks for the clarification.
In my experience, real applications with real users seldom have interesting or useful queries that return entire million-row tables. Users want to know about their own activity, or a specific forum thread, etc. So unless yours is an unusual case, by the time you've really got their selection criteria in hand, you'll be talking about reasonable result sizes.
In any case, users wouldn't be able to do anything useful with more than several hundred rows: transporting them would take a long time, and they couldn't scroll through them in any reasonable way.
MySQL has the LIMIT and OFFSET (starting record #) modifiers primarily for the exact purpose of creating chunks of a list for paging as you describe.
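As a minimal sketch of that paging pattern, the example below runs LIMIT and OFFSET against SQLite via Python's `sqlite3` module, which accepts the same two modifiers. The `posts` table and the `fetch_page` helper are hypothetical names chosen for illustration, and the exact syntax on other engines may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO posts (title) VALUES (?)",
    ((f"post {i}",) for i in range(1, 26)),  # 25 rows, ids 1..25
)

def fetch_page(page, page_size=10):
    """Return one page of rows; pages are numbered from 1."""
    offset = (page - 1) * page_size
    # ORDER BY on the primary key keeps the pages stable between calls.
    return conn.execute(
        "SELECT id, title FROM posts ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()

first = fetch_page(1)  # ids 1..10
last = fetch_page(3)   # ids 21..25, a short final page
print(len(first), len(last))  # prints: 10 5
```

Ordering by an indexed column matters here: without a defined order, consecutive pages are not guaranteed to be consistent with each other.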
It's counterproductive to start thinking about schema design and record purging before you've exhausted this and a bunch of other strategies. In this case, don't solve problems you don't have yet. Several-million-row tables are not big, practically speaking, as long as they are correctly indexed.