Is adding an index to a table with 2 million records twice as slow as on the same table with 1 million records?

Time: 2022-06-28 00:49:44

I have a table with 70 million records and there is an index missing. I want to estimate how long adding the index will take, without first backing up the table and building the index on the backup.

I am just wondering if it will be twice as slow (linear) or if it is exponential.

database: mysql 5.0

Thanks a lot

2 Answers

#1


4  

(Disclaimer: I have minimal experience on MySQL)

It should be somewhere in-between.

The absolute lowest complexity of the whole operation would be that of simply reading all records in order, which is a linear process - O(n). This is an I/O-bound operation and there is not much that can be done about it - modern caching systems in most OSes may help, but only for a DB that is in active use and fits in the available memory.

In most SQL engines, indexes are some variation of a B-tree. The CPU complexity of inserting a single record into such a tree is roughly O(log(n)), where n is its size. For n records we get a complexity of O(n log(n)). The total complexity of the operation should be O(n log(n)).
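To put that O(n log(n)) estimate in perspective for the question's numbers, here is a quick back-of-the-envelope check (a sketch of the idealized cost model above, not a benchmark of any real engine):

```python
import math

def nlogn_cost(n):
    """Idealized cost of building a B-tree index via n inserts: O(n log n)."""
    return n * math.log2(n)

# Doubling from 1M to 2M rows, as in the question's title:
ratio_small = nlogn_cost(2_000_000) / nlogn_cost(1_000_000)
# Doubling from 35M to the table's actual 70M rows:
ratio_large = nlogn_cost(70_000_000) / nlogn_cost(35_000_000)

print(f"1M -> 2M:   {ratio_small:.3f}x")  # ~2.10x, barely worse than linear
print(f"35M -> 70M: {ratio_large:.3f}x")  # ~2.08x
```

So even under this pessimistic per-row-insert model, doubling the table costs only a few percent more than a straight 2x - nothing remotely exponential.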

Of course, it's not quite that simple. Computing the index tree is not really CPU-heavy, and since the index pages should fit in RAM on any modern system, inserting a single node when the tree is not rebalanced would be close to O(1) time-wise: a single disk operation to update a leaf page of the index.

Since the tree does get rebalanced, however, things are probably a bit more complex. Multiple index pages may have to be committed to disk, thus increasing the necessary time. As a rough guess, I'd say O(n log(n)) is a good start...

It should never come anywhere close to an exponential complexity, though.

EDIT:

It just occurred to me that 70,000,000 B-tree entries may not, in fact, fit in the in-memory cache. It depends heavily on what is being indexed. INTEGER columns would probably be fine, but TEXT columns are another story altogether. If the average field length is 100 bytes (e.g. HTTP links or 30 characters of non-English UTF-8 text), you'd need more than 7GB of memory to store the index.
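That 7GB figure comes from straight multiplication; a minimal sketch of the estimate (it ignores B-tree page overhead, fill factor, and row pointers, all of which only push the number higher):

```python
# Back-of-the-envelope index-size estimate from the numbers above.
entries = 70_000_000
avg_key_bytes = 100  # e.g. a URL, or ~30 chars of non-English UTF-8 text

key_data_bytes = entries * avg_key_bytes
print(key_data_bytes)                       # 7000000000 bytes of raw key data
print(f"{key_data_bytes / 1024**3:.2f} GiB")
```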

Bottom line:

  • If the index fits in the cache, then since building the index should be a single DB transaction, it would be I/O-bound and roughly linear, as all the records have to be parsed and then the index itself has to be written out to permanent storage.

  • If the index does not fit in the cache, then the complexity rises, as I/O wait-times on the index itself become involved in each operation.

#2


1  

What thkala describes is true for inserting individual rows, but when creating a new index, no reasonable RDBMS will just do n inserts; rather, it will construct the index directly, starting with the leaf nodes. This process will almost certainly be I/O-bound.

So, in practical terms, re-indexing time should be linear: twice as long for twice as many records.
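The "construct the index directly, starting with the leaf nodes" idea can be sketched as follows. This is a toy illustration of a sort-based bottom-up build, not how MySQL is actually implemented: sort the keys once, pack them into leaf pages, then stack internal levels on top - no per-row tree descent or rebalancing ever happens.

```python
def build_bottom_up(keys, fanout=4):
    """Toy sort-based B-tree build: one O(n log n) sort, then sequential packing."""
    ordered = sorted(keys)                 # the single expensive step
    # Pack the sorted keys into leaf pages of at most `fanout` entries.
    level = [ordered[i:i + fanout] for i in range(0, len(ordered), fanout)]
    tree = [level]                         # tree[0] holds the leaf level
    # Build each internal level from the first key of every child page.
    while len(level) > 1:
        seps = [page[0] for page in level]
        level = [seps[i:i + fanout] for i in range(0, len(seps), fanout)]
        tree.append(level)
    return tree                            # tree[-1] is the root level

tree = build_bottom_up([42, 7, 99, 3, 56, 18, 71, 64, 25])
print(tree[0])   # leaf pages: [[3, 7, 18, 25], [42, 56, 64, 71], [99]]
print(tree[-1])  # root level: [[3, 42, 99]]
```

Because pages are filled sequentially from already-sorted data, the writes amount to one linear streaming pass, which is why the linear, I/O-bound estimate is the realistic one in practice.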
