SQLite: COUNT is slow on big tables

Date: 2023-01-20 15:31:40

I'm having a performance problem in SQLite with a SELECT COUNT(*) on large tables.

As I haven't yet received a usable answer and have done some further testing, I have edited my question to incorporate my new findings.

I have 2 tables:

CREATE TABLE Table1 (
Key INTEGER NOT NULL,
... several other fields ...,
Status CHAR(1) NOT NULL,
Selection VARCHAR NULL,
CONSTRAINT PK_Table1 PRIMARY KEY (Key ASC))

CREATE TABLE Table2 (
Key INTEGER NOT NULL,
Key2 INTEGER NOT NULL,
... a few other fields ...,
CONSTRAINT PK_Table2 PRIMARY KEY (Key ASC, Key2 ASC))

Table1 has around 8 million records and Table2 has around 51 million records, and the database file is over 5 GB.

Table1 has 2 more indexes:

CREATE INDEX IDX_Table1_Status ON Table1 (Status ASC, Key ASC)
CREATE INDEX IDX_Table1_Selection ON Table1 (Selection ASC, Key ASC)

"Status" is a required field but has only 6 distinct values. "Selection" is not required: only around 1.5 million rows have a non-null value, and there are only around 600k distinct values.

I did some tests on both tables. The timings are below, together with the "explain query plan" output (QP) for each request. I placed the database file on a USB memory stick so I could remove it after each test and get reliable results without interference from the disk cache. Some requests are faster on USB (I suppose due to the lack of seek time), but some are slower (table scans).

SELECT COUNT(*) FROM Table1
    Time: 105 sec
    QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Selection(~1000000 rows)
SELECT COUNT(Key) FROM Table1
    Time: 153 sec
    QP: SCAN TABLE Table1 (~1000000 rows)
SELECT * FROM Table1 WHERE Key = 5123456
    Time: 5 ms
    QP: SEARCH TABLE Table1 USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
    Time: 16 sec
    QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Status (Status=?) (~3 rows)
SELECT * FROM Table1 WHERE Selection = 'SomeValue' AND Key > 5123456 LIMIT 1
    Time: 9 ms
    QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Selection (Selection=?) (~3 rows)

As you can see, the counts are very slow, but normal selects are fast (except for the second one, which took 16 seconds).
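
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch in Python using the standard sqlite3 module. It is not the setup used for the timings above (those came from SQLiteSpy). The helper name plan_and_time, the in-memory database, and the 10,000 generated rows are assumptions made purely for illustration.

```python
import sqlite3
import time


def plan_and_time(conn, sql, params=()):
    """Return (plan_details, elapsed_seconds, rows) for one statement."""
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail text.
    plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed = time.perf_counter() - start
    return plan, elapsed, rows


# Hypothetical, much smaller stand-in for Table1 from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Table1 (Key INTEGER NOT NULL, Status CHAR(1) NOT NULL, "
    "Selection VARCHAR NULL, CONSTRAINT PK_Table1 PRIMARY KEY (Key ASC))"
)
conn.execute("CREATE INDEX IDX_Table1_Status ON Table1 (Status ASC, Key ASC)")
conn.executemany(
    "INSERT INTO Table1 (Key, Status, Selection) VALUES (?, ?, ?)",
    [(i, "ABCDEF"[i % 6], None) for i in range(1, 10001)],
)
conn.commit()

count_plan, count_secs, count_rows = plan_and_time(conn, "SELECT COUNT(*) FROM Table1")
lookup_plan, lookup_secs, lookup_rows = plan_and_time(
    conn, "SELECT * FROM Table1 WHERE Key = ?", (5123,)
)
print(count_plan, count_secs, count_rows)
print(lookup_plan, lookup_secs, lookup_rows)
```

On a table this small the timings mean little; the point is only to show how the plan text and the elapsed time can be collected side by side.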

The same goes for Table2:

SELECT COUNT(*) FROM Table2
    Time: 528 sec
    QP: SCAN TABLE Table2 USING COVERING INDEX sqlite_autoindex_Table2_1(~1000000 rows)
SELECT COUNT(Key) FROM Table2
    Time: 249 sec
    QP: SCAN TABLE Table2 (~1000000 rows)
SELECT * FROM Table2 WHERE Key = 5123456 AND Key2 = 0
    Time: 7 ms
    QP: SEARCH TABLE Table2 USING INDEX sqlite_autoindex_Table2_1 (Key=? AND Key2=?) (~1 rows)

Why is SQLite not using the automatically created index on the primary key of Table1? And why, when it uses the auto-index on Table2, does it still take a lot of time?

I created the same tables with the same content and indexes on SQL Server 2008 R2, and there the counts are nearly instantaneous.

One of the comments below suggested executing ANALYZE on the database. I did, and it took 11 minutes to complete. After that, I ran some of the tests again:

SELECT COUNT(*) FROM Table1
    Time: 104 sec
    QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Selection(~7848023 rows)
SELECT COUNT(Key) FROM Table1
    Time: 151 sec
    QP: SCAN TABLE Table1 (~7848023 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
    Time: 5 ms
    QP: SEARCH TABLE Table1 USING INTEGER PRIMARY KEY (rowid>?) (~196200 rows)
SELECT COUNT(*) FROM Table2
    Time: 529 sec
    QP: SCAN TABLE Table2 USING COVERING INDEX sqlite_autoindex_Table2_1(~51152542 rows)
SELECT COUNT(Key) FROM Table2
    Time: 249 sec
    QP: SCAN TABLE Table2 (~51152542 rows)

As you can see, the queries took the same time (except that the query plan now shows the real number of rows); only the previously slow select is now fast as well.

Next, I created an extra index on the Key field of Table1, which should correspond to the auto-index. I did this on the original database, without the ANALYZE data. It took over 23 minutes to create this index (remember, this is on a USB stick).

CREATE INDEX IDX_Table1_Key ON Table1 (Key ASC)

Then I ran the tests again:

SELECT COUNT(*) FROM Table1
    Time: 4 sec
    QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Key(~1000000 rows)
SELECT COUNT(Key) FROM Table1
    Time: 167 sec
    QP: SCAN TABLE Table1 (~1000000 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
    Time: 17 sec
    QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Status (Status=?) (~3 rows)

As you can see, the index helped with the count(*), but not with the count(Key).

Finally, I created the table using a column constraint instead of a table constraint:

CREATE TABLE Table1 (
Key INTEGER PRIMARY KEY ASC NOT NULL,
... several other fields ...,
Status CHAR(1) NOT NULL,
Selection VARCHAR NULL)

Then I ran the tests again:

SELECT COUNT(*) FROM Table1
    Time: 6 sec
    QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Selection(~1000000 rows)
SELECT COUNT(Key) FROM Table1
    Time: 28 sec
    QP: SCAN TABLE Table1 (~1000000 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
    Time: 10 sec
    QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Status (Status=?) (~3 rows)

Although the query plans are the same, the times are a lot better. Why is this?

The problem is that ALTER TABLE does not permit converting an existing table, and I have a lot of existing databases which I cannot convert to this form. Besides, using a column constraint instead of a table constraint won't work for Table2.

Does anyone have any idea what I am doing wrong and how to solve this problem?

I used System.Data.SQLite version 1.0.74.0 to create the tables, and SQLiteSpy 1.9.1 to run the tests.

Thanks,

Marc

8 answers

#1


19  

From http://old.nabble.com/count(*)-slow-td869876.html

SQLite always does a full table scan for count(*). It
does not keep meta information on tables to speed this
process up.

Not keeping meta information is a deliberate design
decision. If each table stored a count (or better, each
node of the btree stored a count) then much more updating
would have to occur on every INSERT or DELETE. This
would slow down INSERT and DELETE, even in the common
case where count(*) speed is unimportant.

If you really need a fast COUNT, then you can create
a trigger on INSERT and DELETE that updates a running
count in a separate table then query that separate
table to find the latest count.

Of course, it's not worth keeping a FULL row count if you
need COUNTs dependent on WHERE clauses (e.g. WHERE field1 > 0 and field2 < 1000000000).
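
The trigger approach described in the quoted post could look roughly like the following sketch (Python with the standard sqlite3 module). The side table RowCounts, its columns, and the trigger names are hypothetical; the quoted post does not give a schema, so treat this as one possible arrangement rather than the recommended one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (Key INTEGER PRIMARY KEY, Status CHAR(1) NOT NULL);

-- Hypothetical side table holding one running count per tracked table.
CREATE TABLE RowCounts (TableName TEXT PRIMARY KEY, N INTEGER NOT NULL);
INSERT INTO RowCounts (TableName, N) VALUES ('Table1', 0);

-- Keep the running count in step with every inserted row.
CREATE TRIGGER Table1_CountInsert AFTER INSERT ON Table1
BEGIN
    UPDATE RowCounts SET N = N + 1 WHERE TableName = 'Table1';
END;

-- Keep the running count in step with every deleted row.
CREATE TRIGGER Table1_CountDelete AFTER DELETE ON Table1
BEGIN
    UPDATE RowCounts SET N = N - 1 WHERE TableName = 'Table1';
END;
""")

conn.executemany(
    "INSERT INTO Table1 (Key, Status) VALUES (?, ?)",
    [(i, "A") for i in range(1, 101)],
)
conn.execute("DELETE FROM Table1 WHERE Key <= 10")
conn.commit()

# Reading the stored count touches one row instead of scanning Table1.
fast_count = conn.execute(
    "SELECT N FROM RowCounts WHERE TableName = 'Table1'"
).fetchone()[0]
slow_count = conn.execute("SELECT COUNT(*) FROM Table1").fetchone()[0]
print(fast_count, slow_count)
```

For a table that already contains rows, the counter would have to be seeded once with a regular COUNT(*) before the triggers take over. Every INSERT and DELETE then pays for one extra UPDATE, which is the trade-off the post describes.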

#2


18  

If you haven't DELETEd any records, doing:

SELECT MAX(_ROWID_) FROM "table" LIMIT 1;

will avoid the full-table scan. Note that _ROWID_ is a SQLite identifier.
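
A small sketch of why the "no DELETEs" condition matters, in Python with the standard sqlite3 module. The table t and its Payload column are made up for the demonstration, and it additionally assumes the rowids were assigned automatically starting from 1 rather than inserted explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (Key INTEGER PRIMARY KEY, Payload TEXT)")
# Let SQLite assign the rowids (1..1000 on a fresh table).
conn.executemany("INSERT INTO t (Payload) VALUES (?)", [("x",)] * 1000)

max_before = conn.execute("SELECT MAX(_ROWID_) FROM t").fetchone()[0]
count_before = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# Deleting rows leaves gaps, so the largest rowid stops matching the row count.
conn.execute("DELETE FROM t WHERE Key % 2 = 0")
max_after = conn.execute("SELECT MAX(_ROWID_) FROM t").fetchone()[0]
count_after = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

print(max_before, count_before)
print(max_after, count_after)
```

Before the delete the two numbers agree; afterwards MAX(_ROWID_) overstates the number of rows, so the shortcut only fits tables that are append-only.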

#3


2  

Do not count the stars, count the records! Or, in other words, never issue

SELECT COUNT(*) FROM tablename;

use

SELECT COUNT(ROWID) FROM tablename;

Call EXPLAIN QUERY PLAN for both to see the difference. Make sure you have an index in place containing all columns mentioned in the WHERE clause.

#4


1  

This may not help much, but you can run the ANALYZE command to rebuild statistics about your database. Try running "ANALYZE;" to rebuild statistics about the entire database, then run your query again and see if it is any faster.

#5


0  

On the matter of the column constraint, SQLite maps columns that are declared to be INTEGER PRIMARY KEY to the internal row id (which in turn admits a number of internal optimizations). Theoretically, it could do the same for a separately-declared primary key constraint, but it appears not to do so in practice, at least with the version of SQLite in use. (System.Data.SQLite 1.0.74.0 corresponds to core SQLite 3.7.7.1. You might want to try re-checking your figures with 1.0.79.0; you shouldn't need to change your database to do that, just the library.)
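
One way to check how a particular SQLite library treats the two declaration styles is to look at which indexes it creates and whether Key and rowid hold the same value. The sketch below uses Python's standard sqlite3 module, so it reports on whatever SQLite version is bundled with Python, not on System.Data.SQLite 1.0.74.0. The table names ColumnForm and TableForm and both helper functions are invented for the check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ColumnForm (Key INTEGER PRIMARY KEY ASC NOT NULL, Status CHAR(1));
CREATE TABLE TableForm (Key INTEGER NOT NULL, Status CHAR(1),
                        CONSTRAINT PK_TableForm PRIMARY KEY (Key ASC));
CREATE TABLE Table2 (Key INTEGER NOT NULL, Key2 INTEGER NOT NULL,
                     CONSTRAINT PK_Table2 PRIMARY KEY (Key ASC, Key2 ASC));
""")


def index_names(table):
    """Names of all indexes SQLite keeps for `table`, automatic ones included."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = ? ORDER BY name",
        (table,),
    ).fetchall()
    return [name for (name,) in rows]


def key_matches_rowid(table):
    """Insert Key = 42 into an empty table and compare it with the row's rowid.

    If Key were not an alias for the rowid, the first row would get rowid 1.
    """
    conn.execute(f"INSERT INTO {table} (Key) VALUES (42)")
    key, rowid = conn.execute(f"SELECT Key, rowid FROM {table}").fetchone()
    return key == rowid


results = {
    table: (index_names(table), key_matches_rowid(table))
    for table in ("ColumnForm", "TableForm")
}
print(results)
print(index_names("Table2"))
```

A table whose Key is an alias for the rowid needs no separate primary-key index, while the composite key of Table2 gets an automatic one. The plan line "USING INTEGER PRIMARY KEY (rowid=?)" in the question is the same kind of evidence for Table1.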

#6


0  

The output for the fast queries all starts with the text "QP: SEARCH", whilst the output for the slow queries starts with "QP: SCAN", which suggests that sqlite is performing a scan of the entire table in order to generate the count.

Googling for "sqlite table scan count" finds the following, which suggests that using a full table scan to retrieve a count is just the way sqlite works, and is therefore probably unavoidable.

As a workaround, and given that status has only eight values, I wondered if you could get a count quickly using a query like the following?

select 1 from Table1 where status=1 union all select 1 from Table1 where status=2 ...

then count the rows in the result. This is clearly ugly, but it might work if it persuades sqlite to run the query as a search rather than a scan. The idea of returning "1" each time is to avoid the overhead of returning real data.

#7


0  

Here's a potential workaround to improve the query performance. From the context, it sounds like your query takes about a minute and a half to run.

Assuming you have a date_created column (or can add one), run a full count query in the background each day shortly after midnight (say at 00:05) and persist the value somewhere, along with the last_updated date at which it was calculated (I'll come back to that in a bit).

Then, running against your date_created column (with an index), you can avoid a full table scan by doing a query like SELECT COUNT(*) FROM TABLE WHERE date_created > "[TODAY] 00:05:00".

Add the count value from that query to your persisted value, and you have a reasonably fast count that's generally accurate.

The only catch is that from 12:05am to 12:07am (the period during which your total count query is running) you have a race condition, which you can handle by checking the last_updated value of your full table scan count. If it's more than 24 hours old, then your incremental count query needs to pull a full day's count plus the time elapsed today. If it's less than 24 hours old, then your incremental count query needs to pull a partial day's count (just the time elapsed today).
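
A rough sketch of this scheme in Python with the standard sqlite3 module. The CountSnapshot table, the index name, and both function names are hypothetical. It also deviates slightly from the description: instead of comparing against a fixed time of day, the incremental query uses the cutoff stored with the snapshot, which is one way to sidestep the 24-hour check. It assumes rows are never deleted and that date_created holds ISO-8601 text, so string comparison follows time order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (Key INTEGER PRIMARY KEY, date_created TEXT NOT NULL);
CREATE INDEX IDX_Table1_Created ON Table1 (date_created);

-- Hypothetical single-row table holding the last full count and its cutoff.
CREATE TABLE CountSnapshot (
    id INTEGER PRIMARY KEY CHECK (id = 1),
    total INTEGER NOT NULL,
    last_updated TEXT NOT NULL
);
""")


def refresh_snapshot(conn, now):
    """Nightly job: count everything created up to `now` and store the result."""
    with conn:  # commit the snapshot atomically
        total = conn.execute(
            "SELECT COUNT(*) FROM Table1 WHERE date_created <= ?", (now,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR REPLACE INTO CountSnapshot (id, total, last_updated) "
            "VALUES (1, ?, ?)",
            (total, now),
        )


def fast_count(conn):
    """Stored total plus the rows created after the stored cutoff."""
    total, cutoff = conn.execute(
        "SELECT total, last_updated FROM CountSnapshot WHERE id = 1"
    ).fetchone()
    recent = conn.execute(
        "SELECT COUNT(*) FROM Table1 WHERE date_created > ?", (cutoff,)
    ).fetchone()[0]
    return total + recent


with conn:
    conn.executemany(
        "INSERT INTO Table1 (date_created) VALUES (?)",
        [("2023-01-19 10:00:00",)] * 3,
    )
refresh_snapshot(conn, "2023-01-20 00:05:00")
with conn:
    conn.executemany(
        "INSERT INTO Table1 (date_created) VALUES (?)",
        [("2023-01-20 09:00:00",)] * 2,
    )

estimate = fast_count(conn)
exact = conn.execute("SELECT COUNT(*) FROM Table1").fetchone()[0]
print(estimate, exact)
```

Only the rows newer than the cutoff are counted through the index on date_created, so the cost grows with the day's inserts rather than with the size of the table.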

#8


0  

I had the same problem; in my situation the VACUUM command helped. After running it on the database, COUNT(*) became nearly 100 times faster. However, the command itself takes some minutes on my database (20 million records). I solved this by running VACUUM when my software exits, after the main window has been destroyed, so the delay doesn't cause problems for the user.
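
A minimal sketch of what VACUUM does to the file, in Python with the standard sqlite3 module. The table t, the payload size, and the temporary file are invented for the demonstration; it shows the page count shrinking after scattered deletes and makes no claim about how much faster a COUNT(*) becomes on any real database.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "demo.db"))
    # Make sure freed space is not reclaimed automatically (the usual default).
    conn.execute("PRAGMA auto_vacuum = NONE")
    conn.execute("CREATE TABLE t (Key INTEGER PRIMARY KEY, Payload TEXT)")
    conn.executemany("INSERT INTO t (Payload) VALUES (?)", [("x" * 500,)] * 5000)
    # Scattered deletes leave the remaining rows spread over half-empty pages.
    conn.execute("DELETE FROM t WHERE Key % 2 = 0")
    conn.commit()

    pages_before = conn.execute("PRAGMA page_count").fetchone()[0]
    # VACUUM rebuilds the database file; it cannot run inside an open transaction.
    conn.execute("VACUUM")
    pages_after = conn.execute("PRAGMA page_count").fetchone()[0]
    rows_left = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()

print(pages_before, pages_after, rows_left)
```

A scan has fewer pages to read after the rebuild, which is one plausible reason a full-table COUNT(*) can speed up on a file that has seen many deletes.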
