Efficient way of ensuring unique rows in SQLite3

Time: 2022-06-22 04:19:03

I am using SQLite3 in one of my projects and I need to ensure that the rows that are inserted into a table are unique with regard to a combination of some of their columns. In most cases the rows inserted will differ in that respect, but in case of a match the new row must update/replace the existing one.

The obvious solution was to use a composite primary key, with a conflict clause to handle collisions. Therefore this:

CREATE TABLE Event (Id INTEGER, Fld0 TEXT, Fld1 INTEGER, Fld2 TEXT, Fld3 TEXT, Fld4 TEXT, Fld5 TEXT, Fld6 TEXT);

became this:

CREATE TABLE Event (Id INTEGER, Fld0 TEXT, Fld1 INTEGER, Fld2 TEXT, Fld3 TEXT, Fld4 TEXT, Fld5 TEXT, Fld6 TEXT, PRIMARY KEY (Fld0, Fld2, Fld3) ON CONFLICT REPLACE);

This does indeed enforce the uniqueness constraint as I need it to. Unfortunately, this change also incurs a performance penalty that is way beyond what I expected. I did a few tests using the sqlite3 command line utility to ensure that there is no fault in the rest of my code. The tests involve inserting 100,000 rows, either in a single transaction, in 10 transactions of 10,000 rows each, or in 100 transactions of 1,000 rows each. I got the following results:

                                | 1 * 100,000   | 10 * 10,000   | 100 * 1,000   |
                                |---------------|---------------|---------------|
                                | Time  | CPU   | Time  | CPU   | Time  | CPU   |
                                | (sec) | (%)   | (sec) | (%)   | (sec) | (%)   |
--------------------------------|-------|-------|-------|-------|-------|-------|
No primary key                  | 2.33  | 80    | 3.73  | 50    | 15.1  | 15    |
--------------------------------|-------|-------|-------|-------|-------|-------|
Primary key: Fld3               | 5.19  | 84    | 23.6  | 21    | 226.2 | 3     |
--------------------------------|-------|-------|-------|-------|-------|-------|
Primary key: Fld2, Fld3         | 5.11  | 88    | 24.6  | 22    | 258.8 | 3     |
--------------------------------|-------|-------|-------|-------|-------|-------|
Primary key: Fld0, Fld2, Fld3   | 5.38  | 87    | 23.8  | 23    | 232.3 | 3     |

My application currently performs transactions of at most 1,000 rows and I was surprised by the 15-fold drop in performance. I expected at most a 3-fold drop in throughput and a rise in CPU usage, as seen in the 100k-transaction case. I guess the indexing involved in maintaining the primary key constraints requires a significantly larger number of synchronous DB operations, thus making my hard drives the bottleneck in this case.

Using WAL mode does have some effect - a performance increase of about 15%. Unfortunately that is not enough on its own. PRAGMA synchronous = NORMAL did not seem to have any effect.
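
For reference, a minimal sketch of the two settings mentioned; journal_mode persists in the database file once set, while synchronous must be set on every connection:

PRAGMA journal_mode = WAL;    -- write-ahead logging
PRAGMA synchronous = NORMAL;  -- fewer fsync() calls than the default FULL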

I might be able to recover some performance by increasing the transaction size, but I'd rather not do that, due to the increased memory usage and concerns about responsiveness and reliability.

The text fields in each row have variable lengths of about 250 bytes on average. The query performance does not matter too much, but the insert performance is very important. My application code is in C and is (supposed to be) portable to at least Linux and Windows.

Is there a way to improve the insert performance without increasing the transaction size? Either some setting in SQLite (anything but permanently forcing the DB into asynchronous operation, that is) or programmatically in my application code? For example, is there a way to ensure row uniqueness without using an index?

BOUNTY:

By using the hashing/indexing method described in my own answer, I managed to somewhat moderate the performance drop to a point where it's probably acceptable for my application. It seems, however, that as the number of rows in the table increases, the presence of the index makes inserts slower and slower.

I am interested in any technique or fine-tuning setting that will increase performance in this particular use case, as long as it does not involve hacking the SQLite3 code or otherwise cause the project to become unmaintainable.

5 Answers

#1


15  

I have used sqlite to insert millions of rows at runtime and this is what I have used to increase performance:

  • Use as few transactions as possible.
  • Use parameterized commands for inserting the data (prepare the command once and just change parameter values in the loop) - see the sketch after this list.
  • Set PRAGMA synchronous = OFF (not sure how it works with WAL).
  • Increase the page size of the database.
  • Increase the cache size. This is an important setting, as it will cause sqlite to write the data to disk fewer times and run more operations in memory, making the whole process faster.
  • If you need an index, add it after inserting the rows by running the necessary sqlite command. In this case you will need to ensure uniqueness yourself, as you are currently doing.
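
A hedged sketch of these suggestions in SQL; the values are illustrative, the ? placeholders are bound from the host language (e.g. sqlite3_bind_*() in C, re-executing the prepared statement for each row), and Event_Key is a hypothetical index name:

PRAGMA synchronous = OFF;   -- fastest, but risks data loss on power failure
PRAGMA page_size = 8192;    -- only takes effect on a new database (or after VACUUM)
PRAGMA cache_size = 10000;  -- number of pages to keep in memory

BEGIN;
-- prepare once, then bind new values and re-execute for each row
INSERT INTO Event (Id, Fld0, Fld1, Fld2, Fld3, Fld4, Fld5, Fld6)
VALUES (?, ?, ?, ?, ?, ?, ?, ?);
COMMIT;

-- add the index only after the bulk insert is done
CREATE INDEX Event_Key ON Event (Fld0, Fld2, Fld3);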

If you try these please post your test results. I believe it will be interesting for everyone.

#2


8  

The ON CONFLICT REPLACE clause will make SQLite delete existing rows, then insert new rows. That means that SQLite is probably going to spend some of its time

  • deleting existing rows
  • updating the indexes
  • inserting new rows
  • updating the indexes
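
For illustration, a tiny demo of this delete-then-insert behavior against the question's Event table (the values are made up):

INSERT INTO Event (Fld0, Fld2, Fld3, Fld1) VALUES ('a', 'b', 'c', 1);
INSERT INTO Event (Fld0, Fld2, Fld3, Fld1) VALUES ('a', 'b', 'c', 2);
-- the second insert replaced the first row, so only Fld1 = 2 survives
SELECT COUNT(*) FROM Event WHERE Fld0 = 'a' AND Fld2 = 'b' AND Fld3 = 'c';  -- 1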

That's my take on it, based on SQLite documentation and reading about other database management systems. I didn't look at the source code.

SQLite has two ways of expressing uniqueness constraints: PRIMARY KEY and UNIQUE. Both of them create an index, though.
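
Either form creates an implicit index; for example, the question's constraint could equally be declared as (a sketch):

CREATE TABLE Event (Id INTEGER, Fld0 TEXT, Fld1 INTEGER, Fld2 TEXT, Fld3 TEXT, Fld4 TEXT, Fld5 TEXT, Fld6 TEXT, UNIQUE (Fld0, Fld2, Fld3) ON CONFLICT REPLACE);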

Now the really important stuff . . .

It's great that you did tests. Most developers don't do that. But I think your test results are badly misleading.

In your case, it doesn't matter how fast you can insert rows into a table that doesn't have a primary key. A table that doesn't have a primary key doesn't satisfy your basic requirements for data integrity. That means you can't rely on your database to give you the right answers.

If it doesn't have to give the right answers, I can make it really, really fast.

To get a meaningful timing for inserting into a table that has no key, you need to either

  • run code before inserting new data to make sure you don't violate the undeclared primary key constraint, and to make sure you update existing rows with the right values (instead of inserting), or
  • run code after inserting into that table to clean up duplicates on (Fld0, Fld2, Fld3) and to reconcile conflicts (a sketch follows this list).
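
For the second option, a possible single-statement cleanup, assuming the newest row (largest rowid) should win:

DELETE FROM Event
WHERE rowid NOT IN (SELECT MAX(rowid) FROM Event
                    GROUP BY Fld0, Fld2, Fld3);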

And, of course, the time those processes take has to be taken into account, too.

FWIW, I did a test by running 100K SQL insert statements into your schema in transactions of 1000 statements, and it only took 30 seconds. A single transaction of 1000 insert statements, which seems to be what you expect in production, took 149 msec.

Maybe you can speed things up by inserting into an unkeyed temporary table, then updating the keyed table from that.
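
A hedged sketch of that idea; EventStage is a hypothetical staging table mirroring the Event columns:

CREATE TEMP TABLE EventStage (Id INTEGER, Fld0 TEXT, Fld1 INTEGER, Fld2 TEXT, Fld3 TEXT, Fld4 TEXT, Fld5 TEXT, Fld6 TEXT);
-- bulk-insert the batch into EventStage here, then merge:
INSERT OR REPLACE INTO Event SELECT * FROM EventStage;
DELETE FROM EventStage;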

#3


4  

(I don't normally answer my own questions, but I would like to document a few ideas/partial solutions for this.)

The main problem with a composite primary key is the way the indexes are handled. Composite keys imply an index on the composite value, which in my case means indexing strings. While comparing string values isn't that slow, indexing a value with a length of, say, 500 bytes means that the B-tree nodes in the index can fit far fewer row/node pointers than a B-tree that indexes a 64-bit integer value. This means loading far more DB pages for each index search, as the height of the B-tree increases.

In order to deal with this issue I modified my code so that:

  • It uses WAL mode. The performance increase was certainly worth such a small change, since I do not have any issues with the DB file not being self-contained.

  • I used the MurmurHash3 hash function - after re-writing it in C and adapting it - to produce a single 32-bit hash value from the values of the fields that would form the key. I stored this hash in a new indexed column (see the schema sketch after this list). Since this is an integer value, the index is quite fast. This is the only index for this table. Since there are going to be at most 10,000,000 rows in the table, hash collisions will not be a performance concern - although I cannot really consider the hash value to be UNIQUE, the index will only return a single row in the general case.
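
A sketch of what the resulting schema might look like; Event_Hash is a hypothetical index name, and the Hash value is computed in C (via the adapted MurmurHash3, over Fld0, Fld2 and Fld3) before being bound:

CREATE TABLE Event (Id INTEGER, Hash INTEGER, Fld0 TEXT, Fld1 INTEGER, Fld2 TEXT, Fld3 TEXT, Fld4 TEXT, Fld5 TEXT, Fld6 TEXT);
CREATE INDEX Event_Hash ON Event (Hash);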

At this point there are two alternatives that I have coded and am currently testing:

  • DELETE FROM Event WHERE Hash=? AND Fld0=? AND Fld2=? AND Fld3=?, followed by an INSERT.

  • UPDATE Event SET Fld1=?,... WHERE Hash=? AND Fld0=? AND Fld2=? AND Fld3=?, followed by an INSERT if no rows were updated (a sketch follows this list).
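
A sketch of the second alternative; in the C code one would check sqlite3_changes() after the UPDATE and only run the INSERT when it returns 0:

UPDATE Event SET Fld1 = ?, Fld4 = ?, Fld5 = ?, Fld6 = ?
WHERE Hash = ? AND Fld0 = ? AND Fld2 = ? AND Fld3 = ?;
-- if no row was updated (sqlite3_changes() == 0), insert instead:
INSERT INTO Event (Id, Hash, Fld0, Fld1, Fld2, Fld3, Fld4, Fld5, Fld6)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?);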

I expect the second alternative to be faster, but I will have to complete the testing first. In any case, it seems that with these changes the performance drop (as compared to the original index-less table) has been lessened to a factor of 5 or so, which is far more manageable.

EDIT:

At this point I have settled on using the second variation, which is indeed slightly faster. It seems, however, that any kind of index slows down SQLite3 dramatically as the indexed table gets larger. Increasing the DB page size to 8192 bytes seems to help somewhat, but not nearly as drastically as I would like.
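
(As an aside, on an existing database the new page size only takes effect after a VACUUM, and it cannot be changed at all while the file is in WAL mode:)

PRAGMA page_size = 8192;
VACUUM;  -- rebuilds the file with the new page size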

#4


2  

INSERT INTO Event (Id, Fld0, Fld1, Fld2, Fld3, Fld4, Fld5, Fld6)
SELECT :id, :f0, :f1, :f2, :f3, :f4, :f5, :f6
WHERE NOT EXISTS (SELECT 1 FROM Event
                  WHERE Fld0 = :f0 AND Fld2 = :f2 AND Fld3 = :f3);

I'm not 100% sure that the insert works like that in SQLite, but I think it should. With proper indexing on the WHERE fields this should be rather quick. However, this is effectively two operations, which is something to consider.

#5


2  

In addition to all the other great answers, one thing that you can do is partition data into several tables.

SQLite INSERTs get slower and slower as the number of rows increases, but if you can split a table into several smaller ones, that effect is diminished (e.g. "names" -> "names_a", "names_b", ..., one table per starting letter). Later on, you can do CREATE VIEW "names" AS SELECT * FROM "names_a" UNION SELECT * FROM "names_b" UNION ....
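
A hedged sketch of the idea; the table names and the one-column schema are illustrative:

CREATE TABLE names_a (name TEXT);  -- names starting with 'a'
CREATE TABLE names_b (name TEXT);  -- names starting with 'b'
-- ... one table per leading letter ...
CREATE VIEW names AS
    SELECT * FROM names_a
    UNION ALL  -- ALL skips the duplicate-elimination pass, safe if the partitions are disjoint
    SELECT * FROM names_b;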
