PG CREATEINDEX CONCURRENTLY
官方说法
根据9.1的文档
Creating an index can interfere with regular operation of a database. Normally PostgreSQL locks the table to be indexed against writes and performs the entire index build with a single scan of the table. Other transactions can still read the table, but if they try to insert, update, or delete rows in the table they will block until the index build is finished. This could have a severe effect if the system is a live production database. Very large tables can take many hours to be indexed, and even for smaller tables, an index build can lock out writers for periods that are unacceptably long for a production system.
PostgreSQL supports building indexes without locking out writes. This method is invoked by specifying the CONCURRENTLY option of CREATE INDEX. When this option is used, PostgreSQL must perform two scans of the table, and in addition it must wait for all existing transactions that could potentially use the index to terminate. Thus this method requires more total work than a standard index build and takes significantly longer to complete. However, since it allows normal operations to continue while the index is built, this method is useful for adding new indexes in a production environment. Of course, the extra CPU and I/O load imposed by the index creation might slow other operations.
正常的create index会锁表阻止所有写操作,但是变通的方法是 create index concurrently。当然这样做会慢。PG会启动两次表扫描并且等待现存的所有事务结束。
In a concurrent index build, the index is actually entered into the system catalogs in one transaction, then two table scans occur in two more transactions. Any transaction active when the second table scan starts can block concurrent index creation until it completes, even transactions that only reference the table after the second table scan starts. Concurrent index creation serially waits for each old transaction to complete using the method outlined in section Section 45.56.
在concurrent build的过程中,索引实际上进入system catalogs,然后另外启动两个table scan。任何在第二个table scan开始时活动的事务都会block concurrent index build,甚至事务只是reference 了那个表。Concurrently index creation等待所有上述事务结束,用的方法是45.56节中。(基本上依赖pg_locks:The view pg_locks provides access to information about the locks held by open transactions within the database server. )
If a problem arises while scanning the table, such as a uniqueness violation in a unique index, the CREATE INDEX command will fail but leave behind an "invalid" index. This index will be ignored for querying purposes because it might be incomplete; however it will still consume update overhead. The psql \d command will report such an index as INVALID:
postgres=# \d tab
Table "public.tab" Column | Type | Modifiers
--------+---------+----------- col | integer | Indexes:
"idx" btree (col) INVALID
The recommended recovery method in such cases is to drop the index and try again to perform CREATE INDEX CONCURRENTLY. (Another possibility is to rebuild the index with REINDEX. However, since REINDEX does not support concurrent builds, this option is unlikely to seem attractive.)
如果build失败了,就会有一个invalid的index,这个index查询的时候用不上,但是update的时候还要更新它。推荐做法是drop之后重新create index。
Another caveat when building a unique index concurrently is that the uniqueness constraint is already being enforced against other transactions when the second table scan begins. This means that constraint violations could be reported in other queries prior to the index becoming available for use, or even in cases where the index build eventually fails. Also, if a failure does occur in the second scan, the "invalid" index continues to enforce its uniqueness constraint afterwards.
有一个问题:如果是一个unique的index的话,unique约束已经对其他事务可见了。这就意味着其他事务中可能在索引可用之前就报constraint violation,哪怕索引建失败了。另外,如果在第二次扫描的时候出错,这个invalid的索引依然会强加一个unique的constraint。
Concurrent builds of expression indexes and partial indexes are supported. Errors occurring in the evaluation of these expressions could cause behavior similar to that described above for unique constraint violations.
expression和partial index同样支持,但是还是有上述的unique的问题。
Regular index builds permit other regular index builds on the same table to occur in parallel, but only one concurrent index build can occur on a table at a time. In both cases, no other types of schema modification on the table are allowed meanwhile. Another difference is that a regular CREATE INDEX command can be performed within a transaction block, but CREATE INDEX CONCURRENTLY cannot.
正常的索引容许在同一个表上同时建多个索引,currently build的index不行,同时只容许一个。另外一个区别是create index可以在事务中但是create index concurrently不行。
上述的缺陷本质上是因为,create index concurrently 包含了事务。
代码分析
一个正常的create index流程应该是,DefineIndex->index_create->index_build。
index_build
/*
* index_build - invoke access-method-specific index build procedure
*
* On entry, the index's catalog entries are valid, and its physical disk
* file has been created but is empty. We call the AM-specific build
* procedure to fill in the index contents. We then update the pg_class
* entries of the index and heap relation as needed, using statistics
* returned by ambuild as well as data passed by the caller.
*
* isprimary tells whether to mark the index as a primary-key index.
* isreindex indicates we are recreating a previously-existing index.
*
* Note: when reindexing an existing index, isprimary can be false even if
* the index is a PK; it's already properly marked and need not be re-marked.
*
* Note: before Postgres 8.2, the passed-in heap and index Relations
* were automatically closed by this routine. This is no longer the case.
* The caller opened 'em, and the caller should close 'em.
*/
void
index_build(Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
bool isprimary,
bool isreindex)
```
这个函数的参数大概是本文里面最少的一个,比较单纯。重要的东西见注释。
### index_create
这个函数完成create index的工作。
1. 校验错误,
2. build元数据数据,并插入pg_class. index,constraint 等元数据表。
3. 如果是正常的索引:调用index_build真的建索引;否则只更新元数据表。
### DefineIndex
create index语句会辗转调到DefineIndex里面来,这个函数的参数很多,其中专门有个参数是bool concurrent。就是控制是不是concurrently create index的。实际上控制concurrently build的逻辑大部分都在这个函数里。
#### 各种校验
目前没有指的关注的。
#### index create
在各种校验之后,
```
/*
* Make the catalog entries for the index, including constraints. Then, if
* not skip_build || concurrent, actually build the index.
*/
indexRelationId =
index_create(rel, indexRelationName, indexRelationId,
indexInfo, indexColNames,
accessMethodId, tablespaceId,
collationObjectId, classObjectId,
coloptions, reloptions, primary,
isconstraint, deferrable, initdeferred,
allowSystemTableMods,
skip_build || concurrent,
concurrent);
```
注意倒数第二行,暗示index_create不要真的去建索引。
后面的流程,一遍看代码一遍解释吧
```
if (!concurrent)
{
/* Close the heap and we're done, in the non-concurrent case */
heap_close(rel, NoLock);
return;
}
```
正常直接退出。下面都是concurrently的流程。
#### index concurrently create
```
/* save lockrelid and locktag for below, then close rel */
heaprelid = rel->rd_lockInfo.lockRelId;
SET_LOCKTAG_RELATION(heaplocktag, heaprelid.dbId, heaprelid.relId);
heap_close(rel, NoLock);
```
这个时候,table的锁已经释放,其他事务不会被block住了。
```
/*
* For a concurrent build, it's important to make the catalog entries
* visible to other transactions before we start to build the index. That
* will prevent them from making incompatible HOT updates. The new index
* will be marked not indisready and not indisvalid, so that no one else
* tries to either insert into it or use it for queries.
*
* We must commit our current transaction so that the index becomes
* visible; then start another. Note that all the data structures we just
* built are lost in the commit. The only data we keep past here are the
* relation IDs.
*
* Before committing, get a session-level lock on the table, to ensure
* that neither it nor the index can be dropped before we finish. This
* cannot block, even if someone else is waiting for access, because we
* already have the same lock within our transaction.
*
* Note: we don't currently bother with a session lock on the index,
* because there are no operations that could change its state while we
* hold lock on the parent table. This might need to change later.
*/
```
关于上面这段注释,先解释两个词,它们不是一个英文单词,它们是pg_index的两个字段。
1. indisvalid bool If true, the index is currently valid for queries. False means the index is possibly incomplete: it must still be modified by INSERT/UPDATE operations, but it cannot safely be used for queries. If it is unique, the uniqueness property is not true either.
1. indisready bool If true, the index is currently ready for inserts. False means the index must be ignored by INSERT/UPDATE operations.
这里涉及到了session level lock的概念,这个锁不会block其他的insert update之类。
```
LockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);
PopActiveSnapshot();
CommitTransactionCommand();
StartTransactionCommand();
```
这里是第一次commit。注意,ShareUpdateExclusiveLock不阻止insert/update/delete
```
/*
* Phase 2 of concurrent index build (see comments for validate_index()
* for an overview of how this works)
*
* Now we must wait until no running transaction could have the table open
* with the old list of indexes. To do this, inquire which xacts
* currently would conflict with ShareLock on the table -- ie, which ones
* have a lock that permits writing the table. Then wait for each of
* these xacts to commit or abort. Note we do not need to worry about
* xacts that open the table for writing after this point; they will see
* the new index when they open it.
*
* Note: the reason we use actual lock acquisition here, rather than just
* checking the ProcArray and sleeping, is that deadlock is possible if
* one of the transactions in question is blocked trying to acquire an
* exclusive lock on our table. The lock code will detect deadlock and
* error out properly.
*
* Note: GetLockConflicts() never reports our own xid, hence we need not
* check for that. Also, prepared xacts are not reported, which is fine
* since they certainly aren't going to do anything more.
*/
old_lockholders = GetLockConflicts(&heaplocktag, ShareLock);
```
取得当前正在运行的事务
```
while (VirtualTransactionIdIsValid(*old_lockholders))
{
VirtualXactLockTableWait(*old_lockholders);
old_lockholders++;
}
/*
* At this moment we are sure that there are no transactions with the
* table open for write that don't have this new index in their list of
* indexes. We have waited out all the existing transactions and any new
* transaction will have the new index in its list, but the index is still
* marked as "not-ready-for-inserts". The index is consulted while
* deciding HOT-safety though. This arrangement ensures that no new HOT
* chains can be created where the new tuple and the old tuple in the
* chain have different index keys.
*
* We now take a new snapshot, and build the index using all tuples that
* are visible in this snapshot. We can be sure that any HOT updates to
* these tuples will be compatible with the index, since any updates made
* by transactions that didn't know about the index are now committed or
* rolled back. Thus, each visible tuple is either the end of its
* HOT-chain or the extension of the chain is HOT-safe for this index.
*/
```
仔细读注释
```
/* Open and lock the parent heap relation */
rel = heap_openrv(heapRelation, ShareUpdateExclusiveLock);
/* And the target index relation */
indexRelation = index_open(indexRelationId, RowExclusiveLock);
/* Set ActiveSnapshot since functions in the indexes may need it */
PushActiveSnapshot(GetTransactionSnapshot());
/* We have to re-build the IndexInfo struct, since it was lost in commit */
indexInfo = BuildIndexInfo(indexRelation);
Assert(!indexInfo->ii_ReadyForInserts);
indexInfo->ii_Concurrent = true;
indexInfo->ii_BrokenHotChain = false;
/* Now build the index */
index_build(rel, indexRelation, indexInfo, primary, false);
/* Close both the relations, but keep the locks */
heap_close(rel, NoLock);
index_close(indexRelation, NoLock);
/*
* Update the pg_index row to mark the index as ready for inserts. Once we
* commit this transaction, any new transactions that open the table must
* insert new entries into the index for insertions and non-HOT updates.
*/
pg_index = heap_open(IndexRelationId, RowExclusiveLock);
indexTuple = SearchSysCacheCopy1(INDEXRELID,
ObjectIdGetDatum(indexRelationId));
if (!HeapTupleIsValid(indexTuple))
elog(ERROR, "cache lookup failed for index %u", indexRelationId);
indexForm = (Form_pg_index) GETSTRUCT(indexTuple);
Assert(!indexForm->indisready);
Assert(!indexForm->indisvalid);
indexForm->indisready = true;
simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
CatalogUpdateIndexes(pg_index, indexTuple);
heap_close(pg_index, RowExclusiveLock);
/* we can do away with our snapshot */
PopActiveSnapshot();
/*
* Commit this transaction to make the indisready update visible.
*/
CommitTransactionCommand();
StartTransactionCommand();
```
第一次scan结束,这个时候index已经对上层可见了。
```
/*
* Phase 3 of concurrent index build
*
* We once again wait until no transaction can have the table open with
* the index marked as read-only for updates.
*/
old_lockholders = GetLockConflicts(&heaplocktag, ShareLock);
while (VirtualTransactionIdIsValid(*old_lockholders))
{
VirtualXactLockTableWait(*old_lockholders);
old_lockholders++;
}
```
again,等所有不认识index的事务结束。
```
/*
* Now take the "reference snapshot" that will be used by validate_index()
* to filter candidate tuples. Beware! There might still be snapshots in
* use that treat some transaction as in-progress that our reference
* snapshot treats as committed. If such a recently-committed transaction
* deleted tuples in the table, we will not include them in the index; yet
* those transactions which see the deleting one as still-in-progress will
* expect such tuples to be there once we mark the index as valid.
*
* We solve this by waiting for all endangered transactions to exit before
* we mark the index as valid.
*
* We also set ActiveSnapshot to this snap, since functions in indexes may
* need a snapshot.
*/
snapshot = RegisterSnapshot(GetTransactionSnapshot());
PushActiveSnapshot(snapshot);
/*
* Scan the index and the heap, insert any missing index entries.
*/
validate_index(relationId, indexRelationId, snapshot);
```
第二次扫描结束,这个时候索引可用了。
```
/*
* The index is now valid in the sense that it contains all currently
* interesting tuples. But since it might not contain tuples deleted just
* before the reference snap was taken, we have to wait out any
* transactions that might have older snapshots. Obtain a list of VXIDs
* of such transactions, and wait for them individually.
*
* We can exclude any running transactions that have xmin > the xmin of
* our reference snapshot; their oldest snapshot must be newer than ours.
* We can also exclude any transactions that have xmin = zero, since they
* evidently have no live snapshot at all (and any one they might be in
* process of taking is certainly newer than ours). Transactions in other
* DBs can be ignored too, since they'll never even be able to see this
* index.
*
* We can also exclude autovacuum processes and processes running manual
* lazy VACUUMs, because they won't be fazed by missing index entries
* either. (Manual ANALYZEs, however, can't be excluded because they
* might be within transactions that are going to do arbitrary operations
* later.)
*
* Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
* check for that.
*
* If a process goes idle-in-transaction with xmin zero, we do not need to
* wait for it anymore, per the above argument. We do not have the
* infrastructure right now to stop waiting if that happens, but we can at
* least avoid the folly of waiting when it is idle at the time we would
* begin to wait. We do this by repeatedly rechecking the output of
* GetCurrentVirtualXIDs. If, during any iteration, a particular vxid
* doesn't show up in the output, we know we can forget about it.
*/
old_snapshots = GetCurrentVirtualXIDs(snapshot->xmin, true, false,
PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
&n_old_snapshots);
for (i = 0; i < n_old_snapshots; i++)
{
if (!VirtualTransactionIdIsValid(old_snapshots[i]))
continue; /* found uninteresting in previous cycle */
if (i > 0)
{
/* see if anything's changed ... */
VirtualTransactionId *newer_snapshots;
int n_newer_snapshots;
int j;
int k;
newer_snapshots = GetCurrentVirtualXIDs(snapshot->xmin,
true, false,
PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
&n_newer_snapshots);
for (j = i; j < n_old_snapshots; j++)
{
if (!VirtualTransactionIdIsValid(old_snapshots[j]))
continue; /* found uninteresting in previous cycle */
for (k = 0; k < n_newer_snapshots; k++)
{
if (VirtualTransactionIdEquals(old_snapshots[j],
newer_snapshots[k]))
break;
}
if (k >= n_newer_snapshots) /* not there anymore */
SetInvalidVirtualTransactionId(old_snapshots[j]);
}
pfree(newer_snapshots);
}
if (VirtualTransactionIdIsValid(old_snapshots[i]))
VirtualXactLockTableWait(old_snapshots[i]);
}
```
请读注释,基本意思是在做validate_index,这个时间可预见的会很长,so需要修改这段时间的数据,让它正确。
```
/*
* Index can now be marked valid -- update its pg_index entry
*/
pg_index = heap_open(IndexRelationId, RowExclusiveLock);
indexTuple = SearchSysCacheCopy1(INDEXRELID,
ObjectIdGetDatum(indexRelationId));
if (!HeapTupleIsValid(indexTuple))
elog(ERROR, "cache lookup failed for index %u", indexRelationId);
indexForm = (Form_pg_index) GETSTRUCT(indexTuple);
Assert(indexForm->indisready);
Assert(!indexForm->indisvalid);
indexForm->indisvalid = true;
simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
CatalogUpdateIndexes(pg_index, indexTuple);
heap_close(pg_index, RowExclusiveLock);
/*
* The pg_index update will cause backends (including this one) to update
* relcache entries for the index itself, but we should also send a
* relcache inval on the parent table to force replanning of cached plans.
* Otherwise existing sessions might fail to use the new index where it
* would be useful. (Note that our earlier commits did not create reasons
* to replan; relcache flush on the index itself was sufficient.)
*/
CacheInvalidateRelcacheByRelid(heaprelid.relId);
/* we can now do away with our active snapshot */
PopActiveSnapshot();
/* And we can remove the validating snapshot too */
UnregisterSnapshot(snapshot);
/*
* Last thing to do is release the session-level lock on the parent table.
*/
UnlockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);
```
让索引对查询可见。
#### validate_index
```
* validate_index() works by first gathering all the TIDs currently in the
* index, using a bulkdelete callback that just stores the TIDs and doesn't
* ever say "delete it". (This should be faster than a plain indexscan;
* also, not all index AMs support full-index indexscan.) Then we sort the
* TIDs, and finally scan the table doing a "merge join" against the TID list
* to see which tuples are missing from the index. Thus we will ensure that
* all tuples valid according to the reference snapshot are in the index.
*
* Building a unique index this way is tricky: we might try to insert a
* tuple that is already dead or is in process of being deleted, and we
* mustn't have a uniqueness failure against an updated version of the same
* row. We could try to check the tuple to see if it's already dead and tell
* index_insert() not to do the uniqueness check, but that still leaves us
* with a race condition against an in-progress update. To handle that,
* we expect the index AM to recheck liveness of the to-be-inserted tuple
* before it declares a uniqueness error.
```
如注释所示,第二次扫面,实际上执行了一个tid的收集以及一个merge join操作。
## 实验
本实验的目的在于测试,concurrent对性能的影响
### 机器配置
普通台式机,PG 8.4,默认配置。
### 数据格式
```sql
CREATE TABLE test2 (
did integer PRIMARY KEY,
val float,
name varchar(40)
);
insert into test2 select generate_series(1, 10000000), random()*100., md5(random()::text);
create index i2v on test2(val);
create index concurrently i2vc on test2(val);
```
### 结果
| 元组个数| val | concurrently on value | name | concurrently on name |
| :--: | :--:| :--: | :--: | :--: |
| 1,000,000 | 4597.145 ms | 8678.217 ms | 13179.993 ms | 17327.401 ms |
| 10,000,000 | 58315.700 ms | 102393.426 ms |164888.703 | 208067.063 ms |
| 20,000,000 | 117594.772 ms | 235908.396 ms | 167957.984 ms | 488424.852 ms |
### 小结
在一下条件下:
1. 上述数据的取得是在没有查询的情况下得出的。
1. 平均来看,使用concurrently create index的时间是不适用concurently build时间的2倍上下。
PG CREATEINDEX CONCURRENTLY的更多相关文章
-
简析服务端通过GT导入SHP至PG的方法
文章版权由作者李晓晖和博客园共有,若转载请于明显处标明出处:http://www.cnblogs.com/naaoveGIS/ 1.背景 项目中需要在浏览器端直接上传SHP后服务端进行数据的自动入PG ...
-
PG 中 JSON 字段的应用
13 年发现 pg 有了 json 类型,便从 oracle 转 pg,几年下来也算比较熟稔了,总结几个有益的实践. 用途一:存储设计时无法预料的文档性的数据.比如,通常可以在人员表准备一个 json ...
-
pg gem 安装(postgresql94)
使用下面命令安装报错 gem install pg 错误: [root@AS-test middle_database]# gem install pgBuilding native extensio ...
-
#pg学习#postgresql的安装
1.按照官网给的步骤编译安装(Mac安装是比较容易的,相比Liunx) cd /Users/renlipeng/Desktop/postgresql-9.5.1 ./configure --prefi ...
-
PG 函数的易变性(Function Volatility Categories)
此概念的接触是在做分区表的时候碰到的,分区表按时间字段分区,在查询时当where条件中时间为now()或者current_time()等时是无法查询的,即使进行格式转换也不行,只有是时间格式如‘201 ...
-
mysql 序列与pg序列的比较
mysql序列(这里只谈innodb引擎): 在使用mysql的AUTO_INCREMENT时,使用AUTO_INCREMENT的字段必须建有索引,也可以为索引的一部分.当没有索引时会报错: ...
-
使用zfs进行pg的pitr恢复测试
前段时间做了一下zfs做pg的增量恢复测试,mark一下. 服务器信息: 主机:192.168.173.43 备机:192.168.173.41 主备使用流复制搭建,在备机上面进行了zfs快照备份. ...
-
PG, Pool之间的一些数量关系
先说一下我的环境: Ceph cluster中包含6台OSD节点 (osd.0 - 5), 一共有10个Pool (0 - 9), 这些Pool共享了144个PG (这个数字是所有Pool的PG_SI ...
-
ruby on rails gem install pg时无法安装
gem install pg -v '0.18.2' Building native extensions. This could take a while... ERROR: Error insta ...
随机推荐
-
关于如何在cenos7.0上实现mysql数据库远程连接
设置mysql允许别的客户机控制的权限 mysql -uroot -p #此处为本地linux帐号密码 select user,host from mysql.user; #查看mysql表对应use ...
-
CustomerConfigHelper
public static class CustomerConfigHelper { public static object _lockObject = new object(); private ...
-
flexjson 的使用
日期转换 JSONSerializer serializer = new JSONSerializer(); serializer.exclude(new St ...
-
《初识PE》导入表
最近听别人讲的我晕晕乎乎的,于是上网上百度下,感觉这篇还不错. 链接:http://www.blogfshare.com/pe-export.html 一.导入表简介 在编程中常常用到"导 ...
-
ES6中的Symbol类型
前面的话 ES5中包含5种原始类型:字符串.数字.布尔值.null和undefined.ES6引入了第6种原始类型——Symbol ES5的对象属性名都是字符串,很容易造成属性名冲突.比如,使用了一个 ...
-
Quartz入门例子简介 从入门到菜鸟(一)
转: Quartz入门例子简介 从入门到菜鸟(一) 2016年11月19日 22:58:24 爱种鱼的猫 阅读数:4039 刚接触quartz这个词并不是在学习过程中...而是WOW里面的界面插件 ...
-
TP5中的小知识
在TP5中如果想用select 查询后,变成数组,用toArray()这个函数的话,必须在连接数据库中把 数据集返回类型变成 'resultset_type'=>'\think\Collecti ...
-
内存溢出和内存泄漏 mark下
https://jingyan.baidu.com/article/495ba841e4423438b30edeb5.html https://www.cnblogs.com/hyh-test/p/8 ...
-
npm 5.4.2 更新后就不能用了
今天刚,npm run dev 就出现更新提示,没多想就更了, 更新用了49S,下来npm 的所以命令包一个semer的插件 ... 最后下载新node 8.5覆盖安装, 就解决了, node 8.5 ...
-
PHP 结合实例认识 Socket
一.简介 (本部分参考自 揭开Socket的神秘面纱 - 源码工作室) 请结合下图 Socket 抽象层的位置来理解.Socket 是应用层和 TCP/IP 协议族进行通信的中间软件抽象层,是一组接口 ...