greenplum全量恢复gprecoverseg -F出现Unable to connect to database时的相关分析及解决方法

时间:2022-06-27 02:30:56

之前有两位朋友碰到过在对greenplum的系统构架更改后,出现全量恢复gprecoverseg -F也无法正常执行的情况。

报错信息为Unable to connect to database. Retrying 1

gprecoverseg failed. (Reason='Unable to connect to database and start transaction') exiting...

有幸拷得一份虚拟机上的所有文件。对其进行分析。

发现事实上出现这个问题仅仅须要改动pg_changetracking下的CT_METADATA。或者说从其它正常的主事例上拷贝一份到出问题的主事例上就可以。



下面为大致分析的过程,感兴趣的能够看一下。

--启动数据库。有一个mirror出错。

[gpadmin@gpmaster ~]$ gpstart -a
20150727:22:28:21:001922 gpstart:gpmaster:gpadmin-[INFO]:-Starting gpstart with args: -a
20150727:22:28:21:001922 gpstart:gpmaster:gpadmin-[INFO]:-Gathering information and validating the environment...
20150727:22:28:28:001922 gpstart:gpmaster:gpadmin-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.3.4.1 build 2'
20150727:22:28:28:001922 gpstart:gpmaster:gpadmin-[INFO]:-Greenplum Catalog Version: '201310150'
20150727:22:28:29:001922 gpstart:gpmaster:gpadmin-[INFO]:-Starting Master instance in admin mode
20150727:22:28:33:001922 gpstart:gpmaster:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information
20150727:22:28:33:001922 gpstart:gpmaster:gpadmin-[INFO]:-Obtaining Segment details from master...
20150727:22:28:34:001922 gpstart:gpmaster:gpadmin-[INFO]:-Setting new master era
20150727:22:28:34:001922 gpstart:gpmaster:gpadmin-[INFO]:-Master Started...
20150727:22:28:34:001922 gpstart:gpmaster:gpadmin-[INFO]:-Shutting down master
20150727:22:28:36:001922 gpstart:gpmaster:gpadmin-[WARNING]:-Skipping startup of segment marked down in configuration: on gpslave-2 directory /data/mirror/gpseg0 <<<<<
20150727:22:28:36:001922 gpstart:gpmaster:gpadmin-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait...
...............................................
20150727:22:29:23:001922 gpstart:gpmaster:gpadmin-[INFO]:-Process results...
20150727:22:29:23:001922 gpstart:gpmaster:gpadmin-[INFO]:-----------------------------------------------------
20150727:22:29:23:001922 gpstart:gpmaster:gpadmin-[INFO]:- Successful segment starts = 3
20150727:22:29:23:001922 gpstart:gpmaster:gpadmin-[INFO]:- Failed segment starts = 0
20150727:22:29:23:001922 gpstart:gpmaster:gpadmin-[WARNING]:-Skipped segment starts (segments are marked down in configuration) = 1 <<<<<<<<
20150727:22:29:23:001922 gpstart:gpmaster:gpadmin-[INFO]:-----------------------------------------------------
20150727:22:29:23:001922 gpstart:gpmaster:gpadmin-[INFO]:-
20150727:22:29:23:001922 gpstart:gpmaster:gpadmin-[INFO]:-Successfully started 3 of 3 segment instances, skipped 1 other segments
20150727:22:29:23:001922 gpstart:gpmaster:gpadmin-[INFO]:-----------------------------------------------------
20150727:22:29:23:001922 gpstart:gpmaster:gpadmin-[WARNING]:-****************************************************************************
20150727:22:29:23:001922 gpstart:gpmaster:gpadmin-[WARNING]:-There are 1 segment(s) marked down in the database
20150727:22:29:23:001922 gpstart:gpmaster:gpadmin-[WARNING]:-To recover from this current state, review usage of the gprecoverseg
20150727:22:29:23:001922 gpstart:gpmaster:gpadmin-[WARNING]:-management utility which will recover failed segment instance databases.
20150727:22:29:23:001922 gpstart:gpmaster:gpadmin-[WARNING]:-****************************************************************************
20150727:22:29:23:001922 gpstart:gpmaster:gpadmin-[INFO]:-Starting Master instance gpmaster directory /data/master/gpseg-1
20150727:22:29:24:001922 gpstart:gpmaster:gpadmin-[INFO]:-Command pg_ctl reports Master gpmaster instance active
20150727:22:29:26:001922 gpstart:gpmaster:gpadmin-[INFO]:-No standby master configured. skipping...
20150727:22:29:26:001922 gpstart:gpmaster:gpadmin-[WARNING]:-Number of segments not attempted to start: 1
20150727:22:29:26:001922 gpstart:gpmaster:gpadmin-[INFO]:-Check status of database with gpstate utility

--运行全量恢复,5次尝试超时,提示Unable to connect to database and start transaction。

[gpadmin@gpmaster ~]$ gprecoverseg -F
20150727:22:38:04:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-Starting gprecoverseg with args: -F
20150727:22:38:04:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.3.4.1 build 2'
20150727:22:38:04:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.2.15 (Greenplum Database 4.3.4.1 build 2) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Feb 10 2015 14:15:10'
20150727:22:38:04:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-Checking if segments are ready
20150727:22:38:04:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-Obtaining Segment details from master...
20150727:22:38:04:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-Unable to connect to database. Retrying 1
20150727:22:38:09:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-Checking if segments are ready
20150727:22:38:09:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-Obtaining Segment details from master...
20150727:22:38:09:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-Unable to connect to database. Retrying 2
20150727:22:38:14:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-Checking if segments are ready
20150727:22:38:14:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-Obtaining Segment details from master...
20150727:22:38:15:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-Unable to connect to database. Retrying 3
20150727:22:38:20:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-Checking if segments are ready
20150727:22:38:20:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-Obtaining Segment details from master...
20150727:22:38:20:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-Unable to connect to database. Retrying 4
20150727:22:38:25:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-Checking if segments are ready
20150727:22:38:25:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-Obtaining Segment details from master...
20150727:22:38:25:002478 gprecoverseg:gpmaster:gpadmin-[INFO]:-Unable to connect to database. Retrying 5
20150727:22:38:30:002478 gprecoverseg:gpmaster:gpadmin-[CRITICAL]:-gprecoverseg failed. (Reason='Unable to connect to database and start transaction') exiting...

--下面为相关节点的配置。

template1=# select * from gp_segment_configuration ;
dbid | content | role | preferred_role | mode | status | port | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+-----------+-----------+------------------+------------
1 | -1 | p | p | s | u | 5432 | gpmaster | gpmaster | |
2 | 0 | p | p | c | u | 40000 | gpslave-1 | gpslave-1 | 41000 |
4 | 0 | m | m | s | d | 50000 | gpslave-2 | gpslave-2 | 51000 |
3 | 1 | p | p | s | u | 40000 | gpslave-2 | gpslave-2 | 41000 |
5 | 1 | m | m | s | u | 50000 | gpslave-1 | gpslave-1 | 51000 |
(5 rows)

--在这里将content=1的mirror也停掉,该mirror恢复时应该是没有问题的。将其对照content=0上的mirror,来发现相关问题。

--又一次启动数据库。

[gpadmin@gpmaster ~]$ gpstart -a
20150727:22:49:41:002958 gpstart:gpmaster:gpadmin-[INFO]:-Starting gpstart with args: -a
20150727:22:49:41:002958 gpstart:gpmaster:gpadmin-[INFO]:-Gathering information and validating the environment...
20150727:22:49:41:002958 gpstart:gpmaster:gpadmin-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.3.4.1 build 2'
20150727:22:49:42:002958 gpstart:gpmaster:gpadmin-[INFO]:-Greenplum Catalog Version: '201310150'
20150727:22:49:42:002958 gpstart:gpmaster:gpadmin-[INFO]:-Starting Master instance in admin mode
20150727:22:49:43:002958 gpstart:gpmaster:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information
20150727:22:49:43:002958 gpstart:gpmaster:gpadmin-[INFO]:-Obtaining Segment details from master...
20150727:22:49:43:002958 gpstart:gpmaster:gpadmin-[INFO]:-Setting new master era
20150727:22:49:43:002958 gpstart:gpmaster:gpadmin-[INFO]:-Master Started...
20150727:22:49:43:002958 gpstart:gpmaster:gpadmin-[INFO]:-Shutting down master
20150727:22:49:44:002958 gpstart:gpmaster:gpadmin-[WARNING]:-Skipping startup of segment marked down in configuration: on gpslave-2 directory /data/mirror/gpseg0 <<<<<
20150727:22:49:44:002958 gpstart:gpmaster:gpadmin-[WARNING]:-Skipping startup of segment marked down in configuration: on gpslave-1 directory /data/mirror/gpseg1 <<<<<
20150727:22:49:44:002958 gpstart:gpmaster:gpadmin-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait.....
20150727:22:49:46:002958 gpstart:gpmaster:gpadmin-[INFO]:-Process results...
20150727:22:49:46:002958 gpstart:gpmaster:gpadmin-[INFO]:-----------------------------------------------------
20150727:22:49:46:002958 gpstart:gpmaster:gpadmin-[INFO]:- Successful segment starts = 2
20150727:22:49:46:002958 gpstart:gpmaster:gpadmin-[INFO]:- Failed segment starts = 0
20150727:22:49:46:002958 gpstart:gpmaster:gpadmin-[WARNING]:-Skipped segment starts (segments are marked down in configuration) = 2 <<<<<<<<
20150727:22:49:46:002958 gpstart:gpmaster:gpadmin-[INFO]:-----------------------------------------------------
20150727:22:49:46:002958 gpstart:gpmaster:gpadmin-[INFO]:-
20150727:22:49:46:002958 gpstart:gpmaster:gpadmin-[INFO]:-Successfully started 2 of 2 segment instances, skipped 2 other segments
20150727:22:49:46:002958 gpstart:gpmaster:gpadmin-[INFO]:-----------------------------------------------------
20150727:22:49:46:002958 gpstart:gpmaster:gpadmin-[WARNING]:-****************************************************************************
20150727:22:49:46:002958 gpstart:gpmaster:gpadmin-[WARNING]:-There are 2 segment(s) marked down in the database
20150727:22:49:46:002958 gpstart:gpmaster:gpadmin-[WARNING]:-To recover from this current state, review usage of the gprecoverseg
20150727:22:49:46:002958 gpstart:gpmaster:gpadmin-[WARNING]:-management utility which will recover failed segment instance databases.
20150727:22:49:46:002958 gpstart:gpmaster:gpadmin-[WARNING]:-****************************************************************************
20150727:22:49:46:002958 gpstart:gpmaster:gpadmin-[INFO]:-Starting Master instance gpmaster directory /data/master/gpseg-1
20150727:22:49:47:002958 gpstart:gpmaster:gpadmin-[INFO]:-Command pg_ctl reports Master gpmaster instance active
20150727:22:49:47:002958 gpstart:gpmaster:gpadmin-[INFO]:-No standby master configured. skipping...
20150727:22:49:47:002958 gpstart:gpmaster:gpadmin-[WARNING]:-Number of segments not attempted to start: 2
20150727:22:49:47:002958 gpstart:gpmaster:gpadmin-[INFO]:-Check status of database with gpstate utility

--如今的配置信息。

template1=# select * from gp_segment_configuration ;
dbid | content | role | preferred_role | mode | status | port | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+-----------+-----------+------------------+------------
1 | -1 | p | p | s | u | 5432 | gpmaster | gpmaster | |
2 | 0 | p | p | c | u | 40000 | gpslave-1 | gpslave-1 | 41000 |
4 | 0 | m | m | s | d | 50000 | gpslave-2 | gpslave-2 | 51000 |
3 | 1 | p | p | c | u | 40000 | gpslave-2 | gpslave-2 | 41000 |
5 | 1 | m | m | s | d | 50000 | gpslave-1 | gpslave-1 | 51000 |
(5 rows)

--运行全量恢复。并查看相关的具体信息。

[gpadmin@gpmaster ~]$ gprecoverseg -F -v
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-Starting gprecoverseg with args: -F -v
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Checking if MASTER_DATA_DIRECTORY env variable is set.
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Obtaining master's port from master data directory
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Read from postgresql.conf port=5432
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Read from postgresql.conf max_connections=200
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Checking if GPHOME env variable is set.
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.3.4.1 build 2'
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Connecting to dbname=template1
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.2.15 (Greenplum Database 4.3.4.1 build 2) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Feb 10 2015 14:15:10'
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-Checking if segments are ready
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-Obtaining Segment details from master...
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Connecting to dbname=template1
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Validating configuration...
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Sending msg getStatus and cmdStr $GPHOME/bin/gp_primarymirror -h gpslave-1 -p 40000
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Adding cmd to work_queue: $GPHOME/bin/gp_primarymirror -h gpslave-1 -p 40000
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Sending msg getStatus and cmdStr $GPHOME/bin/gp_primarymirror -h gpslave-2 -p 40000
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Adding cmd to work_queue: $GPHOME/bin/gp_primarymirror -h gpslave-2 -p 40000
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker0] got cmd: $GPHOME/bin/gp_primarymirror -h gpslave-1 -p 40000
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker1] got cmd: $GPHOME/bin/gp_primarymirror -h gpslave-2 -p 40000
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker0] finished cmd: Get segment status cmdStr='ssh -o 'StrictHostKeyChecking no' gpslave-1 ". /usr/local/greenplum-db/./greenplum_path.sh; $GPHOME/bin/gp_primarymirror -h gpslave-1 -p 40000"' had result: cmd had rc=1 completed=True halted=False
stdout=''
stderr='mode: PrimarySegment
segmentState: ChangeTrackingDisabled --content=0的primary的状态是ChangeTrackingDisabled
dataState: InChangeTracking
faultType: NotInitialized
mode: PrimarySegment
segmentState: ChangeTrackingDisabled
dataState: InChangeTracking
faultType: NotInitialized
'
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker1] finished cmd: Get segment status cmdStr='ssh -o 'StrictHostKeyChecking no' gpslave-2 ". /usr/local/greenplum-db/./greenplum_path.sh; $GPHOME/bin/gp_primarymirror -h gpslave-2 -p 40000"' had result: cmd had rc=1 completed=True halted=False
stdout=''
stderr='mode: PrimarySegment
segmentState: Ready --content=1的primary的状态是Ready
dataState: InChangeTracking
faultType: NotInitialized
mode: PrimarySegment
segmentState: Ready
dataState: InChangeTracking
faultType: NotInitialized
'
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Encountered error Not ready to connect to database mode: PrimarySegment
segmentState: ChangeTrackingDisabled --由此可见,之前无法运行全量恢复是因为content=0的primary有问题,与mirror无关
dataState: InChangeTracking
faultType: NotInitialized
mode: PrimarySegment
segmentState: ChangeTrackingDisabled
dataState: InChangeTracking
faultType: NotInitialized 20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-Unable to connect to database. Retrying 1
^C20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-WorkerPool haltWork()
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker0] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker1] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker2] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker3] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker4] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker5] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker6] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker7] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker8] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker9] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker10] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker11] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker12] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker13] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker14] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker15] haltWork
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker5] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker3] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker6] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker4] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker2] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker7] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker9] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker10] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker12] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker8] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker11] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker14] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker13] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker15] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker0] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker1] stopping User Interrupted

--查看content=0的primary此次启动的日志记录。与以下content=1的primary此次启动的日志记录对照,发现还是CHANGETRACKING有问题。并且gprecoverseg -F运行后没有产生日志记录,说明对应文件仅仅在数据库启动时生效。

[gpadmin@gpslave-1 pg_log]$ cat gpdb-2015-07-27_224945.csv
2015-07-27 22:49:46.228113 CST,"gpadmin","postgres",p2367,th2061358880,"[local]",,2015-07-27 22:49:46 CST,0,,,seg-1,,,,,"FATAL","57M01","the database system is in mirror or uninitialized mode",,,,,,,0,,"postmaster.c",2994,
2015-07-27 22:49:46.248685 CST,,,p2370,th2061358880,,,,0,,,seg-1,,,,,"LOG","00000","mirror transition, primary address(port) 'gpslave-1(41000)' mirror address(port) 'gpslave-2(51000)'",,,,,"mirroring role 'primary role' mirroring state 'change tracking' segment state 'not initialized' process name(pid) 'filerep main process(2370)' filerep state 'not initialized' ",,0,,"cdbfilerep.c",3472,
2015-07-27 22:49:46.252539 CST,,,p2371,th2061358880,,,,0,,,seg-1,,,,,"LOG","00000","CHANGETRACKING: ChangeTracking_RetrieveIsTransitionToInsync() found insync_transition_completed:'false' full resync:'true'",,,,,,,0,,"cdbresynchronizechangetracking.c",2545,
2015-07-27 22:49:46.252589 CST,,,p2371,th2061358880,,,,0,,,seg-1,,,,,"LOG","00000","CHANGETRACKING: ChangeTracking_RetrieveIsTransitionToResync() found resync_transition_completed:'false' full resync:'true'",,,,,,,0,,"cdbresynchronizechangetracking.c",2582,
2015-07-27 22:49:46.254921 CST,,,p2372,th2061358880,,,,0,,,seg-1,,,,,"LOG","00000","database system was shut down at 2015-07-27 22:46:49 CST",,,,,,,0,,"xlog.c",6372,
2015-07-27 22:49:46.255156 CST,,,p2372,th2061358880,,,,0,,,seg-1,,,,,"LOG","00000","checkpoint record is at 0/8C75AD8",,,,,,,0,,"xlog.c",6560,
2015-07-27 22:49:46.255194 CST,,,p2372,th2061358880,,,,0,,,seg-1,,,,,"LOG","00000","redo record is at 0/8C75AD8; undo record is at 0/0; shutdown TRUE",,,,,,,0,,"xlog.c",6617,
2015-07-27 22:49:46.255219 CST,,,p2372,th2061358880,,,,0,,,seg-1,,,,,"LOG","00000","next transaction ID: 0/3051; next OID: 43252",,,,,,,0,,"xlog.c",6621,
2015-07-27 22:49:46.255241 CST,,,p2372,th2061358880,,,,0,,,seg-1,,,,,"LOG","00000","next MultiXactId: 1; next MultiXactOffset: 0",,,,,,,0,,"xlog.c",6624,
2015-07-27 22:49:46.255280 CST,,,p2372,th2061358880,,,,0,,,seg-1,,,,,"LOG","00000","end of transaction log location is 0/8C75B70",,,,,,,0,,"xlog.c",6988,
2015-07-27 22:49:46.255863 CST,,,p2372,th2061358880,,,,0,,,seg-1,,,,,"LOG","00000","Oldest active transaction from prepared transactions 3051",,,,,,,0,,"xlog.c",6033,
2015-07-27 22:49:46.292261 CST,,,p2372,th2061358880,,,,0,,,seg-1,,,,,"LOG","00000","database system is ready",,,,,,,0,,"xlog.c",6084,
2015-07-27 22:49:46.292308 CST,,,p2372,th2061358880,,,,0,,,seg-1,,,,,"LOG","00000","PostgreSQL 8.2.15 (Greenplum Database 4.3.4.1 build 2) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Feb 10 2015 14:10:54",,,,,,,0,,"xlog.c",6095,
2015-07-27 22:49:46.294384 CST,,,p2372,th2061358880,,,,0,,,seg-1,,,,,"LOG","00000","Finished normal startup for clean shutdown case",,,,,,,0,,"xlog.c",7202,
2015-07-27 22:49:46.300763 CST,,,p2373,th2061358880,,,,0,,,seg-1,,,,,"LOG","00000","Finished BASIC startup integrity checking",,,,,,,0,,"xlog.c",7647,
2015-07-27 22:49:46.303294 CST,,,p2365,th2061358880,,,,0,,,seg-1,,,,,"LOG","00000","PostgreSQL 8.2.15 (Greenplum Database 4.3.4.1 build 2) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Feb 10 2015 14:13:47",,,,,,,0,,"postmaster.c",4640,
2015-07-27 22:49:46.303322 CST,,,p2365,th2061358880,,,,0,,,seg-1,,,,,"LOG","00000","database system is ready to accept connections","PostgreSQL 8.2.15 (Greenplum Database 4.3.4.1 build 2) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Feb 10 2015 14:13:47",,,,,,0,,"postmaster.c",4647,
2015-07-27 22:49:47.847981 CST,"gpadmin","template1",p2381,th2061358880,"192.168.22.110","5303",2015-07-27 22:49:47 CST,3054,con5,,seg0,,,x3054,sx1,"LOG","00000","gp_max_distributed_xid returning 1",,,,,,"select gp_max_distributed_xid()",0,,"cdbmaxdistributedxid.c",29, [gpadmin@gpslave-2 pg_log]$ cat gpdb-2015-07-27_224944.csv
2015-07-27 22:49:45.707976 CST,"gpadmin","postgres",p2302,th-240429280,"[local]",,2015-07-27 22:49:45 CST,0,,,seg-1,,,,,"FATAL","57M01","the database system is in mirror or uninitialized mode",,,,,,,0,,"postmaster.c",2994,
2015-07-27 22:49:45.726077 CST,,,p2305,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","mirror transition, primary address(port) 'gpslave-2(41000)' mirror address(port) 'gpslave-1(51000)'",,,,,"mirroring role 'primary role' mirroring state 'change tracking' segment state 'not initialized' process name(pid) 'filerep main process(2305)' filerep state 'not initialized' ",,0,,"cdbfilerep.c",3472,
2015-07-27 22:49:45.735230 CST,,,p2306,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","last checkpoint location for generating initial changetracking log 0/CC12DD8",,,,,,,0,,"xlog.c",11560,
2015-07-27 22:49:45.735376 CST,,,p2306,th-240429280,,,,0,,,seg-1,,,,,"WARNING","XX000","unable to read change tracking 'full log' file : Success (cdbresynchronizechangetracking.c:3382)",,,,,,,0,,"cdbresynchronizechangetracking.c",3382,
2015-07-27 22:49:45.735423 CST,,,p2306,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","last changetracked location in changetracking full log 0/0",,,,,,,0,,"xlog.c",11572,
2015-07-27 22:49:45.735475 CST,,,p2306,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","record with zero length at 0/CC12E70",,,,,,,0,,"xlog.c",4155,
2015-07-27 22:49:45.735505 CST,,,p2306,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","scanned through 1 initial xlog records since last checkpoint for writing into the resynchronize change log",,,,,,,0,,"cdbresynchronizechangetracking.c",206,
2015-07-27 22:49:45.736987 CST,,,p2307,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","database system was shut down at 2015-07-27 22:46:49 CST",,,,,,,0,,"xlog.c",6372,
2015-07-27 22:49:45.737260 CST,,,p2307,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","checkpoint record is at 0/CC12DD8",,,,,,,0,,"xlog.c",6560,
2015-07-27 22:49:45.737298 CST,,,p2307,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","redo record is at 0/CC12DD8; undo record is at 0/0; shutdown TRUE",,,,,,,0,,"xlog.c",6617,
2015-07-27 22:49:45.737322 CST,,,p2307,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","next transaction ID: 0/3006; next OID: 43252",,,,,,,0,,"xlog.c",6621,
2015-07-27 22:49:45.737343 CST,,,p2307,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","next MultiXactId: 1; next MultiXactOffset: 0",,,,,,,0,,"xlog.c",6624,
2015-07-27 22:49:45.737382 CST,,,p2307,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","end of transaction log location is 0/CC12E70",,,,,,,0,,"xlog.c",6988,
2015-07-27 22:49:45.738000 CST,,,p2307,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","Oldest active transaction from prepared transactions 3006",,,,,,,0,,"xlog.c",6033,
2015-07-27 22:49:45.743458 CST,,,p2307,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","database system is ready",,,,,,,0,,"xlog.c",6084,
2015-07-27 22:49:45.743497 CST,,,p2307,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","PostgreSQL 8.2.15 (Greenplum Database 4.3.4.1 build 2) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Feb 10 2015 14:10:54",,,,,,,0,,"xlog.c",6095,
2015-07-27 22:49:45.745424 CST,,,p2307,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","Finished normal startup for clean shutdown case",,,,,,,0,,"xlog.c",7202,
2015-07-27 22:49:45.751648 CST,,,p2308,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","Finished BASIC startup integrity checking",,,,,,,0,,"xlog.c",7647,
2015-07-27 22:49:45.754195 CST,,,p2300,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","PostgreSQL 8.2.15 (Greenplum Database 4.3.4.1 build 2) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Feb 10 2015 14:13:47",,,,,,,0,,"postmaster.c",4640,
2015-07-27 22:49:45.754222 CST,,,p2300,th-240429280,,,,0,,,seg-1,,,,,"LOG","00000","database system is ready to accept connections","PostgreSQL 8.2.15 (Greenplum Database 4.3.4.1 build 2) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Feb 10 2015 14:13:47",,,,,,0,,"postmaster.c",4647,
2015-07-27 22:49:47.296961 CST,"gpadmin","template1",p2316,th-240429280,"192.168.22.110","39217",2015-07-27 22:49:47 CST,3009,con5,,seg1,,,x3009,sx1,"LOG","00000","gp_max_distributed_xid returning 1",,,,,,"select gp_max_distributed_xid()",0,,"cdbmaxdistributedxid.c",29,

--查看/data/primary/gpseg0/pg_changetracking下的文件,就这么三个文件。而造成上面gprecoverseg无法运行的正是当中的CT_METADATA。

[gpadmin@gpslave-1 pg_changetracking]$ ll

total 102900

-rw-------. 1 gpadmin gpadmin       128 Jul 27 22:49 CT_METADATA  --二进制文件,就是这个出问题了的,关于这个文件的信息,我也没有查到。

-rw-------. 1 gpadmin gpadmin    322636 Jul 27 22:46 FILEREP_CONFIG_LOG --记录配置变化的文件。

-rw-------. 1 gpadmin gpadmin 105028298 Jul 27 22:46 FILEREP_LOG   --记录数据变化的文件。



--对照content=0的primary和content=1的primary,这两个文件还是不同。但详细信息我也不是非常清楚。

&#1;[gpadmin@gpslave-1 pg_changetracking]$ od -tx1 -tc -Ax CT_METADATA 

000000  00  00  00  00  00  00  00  00  01  00  00  00  00  00  00  00

        \0  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0  \0  \0  \0  \0

000010  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00

        \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0

*

000080



[gpadmin@gpslave-2 pg_changetracking]$ od -tx1 -tc -Ax CT_METADATA 

000000  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00

        \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0

*

000080



--将content=0的primary上的CT_METADATA重命名。并将content=1的primary上的CT_METADATA传到content=0的primary上。

[gpadmin@gpslave-2 pg_changetracking]$ scp CT_METADATA gpadmin@gpslave-1:/data/primary/gpseg0/pg_changetracking

CT_METADATA                                                                                                                             100%  129     0.1KB/s   00:00



[gpadmin@gpslave-1 pg_changetracking]$ ll

total 102904

-rw-------. 1 gpadmin gpadmin       129 Jul 27 23:09 CT_METADATA

-rw-------. 1 gpadmin gpadmin       128 Jul 27 22:49 CT_METADATA_original

-rw-------. 1 gpadmin gpadmin    322636 Jul 27 22:46 FILEREP_CONFIG_LOG

-rw-------. 1 gpadmin gpadmin 105028298 Jul 27 22:46 FILEREP_LOG



--重新启动数据库。运行全量恢复成功。

[gpadmin@gpmaster ~]$ gpstart -a
[gpadmin@gpmaster ~]$ gprecoverseg -F
20150727:23:29:48:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Starting gprecoverseg with args: -F
20150727:23:29:48:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.3.4.1 build 2'
20150727:23:29:48:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.2.15 (Greenplum Database 4.3.4.1 build 2) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Feb 10 2015 14:15:10'
20150727:23:29:48:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Checking if segments are ready
20150727:23:29:48:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Obtaining Segment details from master...
20150727:23:29:49:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Obtaining Segment details from master...
20150727:23:29:50:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Performing persistent table check
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Greenplum instance recovery parameters
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:----------------------------------------------------------
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Recovery type = Standard
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:----------------------------------------------------------
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Recovery 1 of 2
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:----------------------------------------------------------
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Synchronization mode = Full
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Failed instance host = gpslave-2
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Failed instance address = gpslave-2
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Failed instance directory = /data/mirror/gpseg0
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Failed instance port = 50000
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Failed instance replication port = 51000
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Recovery Source instance host = gpslave-1
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Recovery Source instance address = gpslave-1
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Recovery Source instance directory = /data/primary/gpseg0
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Recovery Source instance port = 40000
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Recovery Source instance replication port = 41000
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Recovery Target = in-place
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:----------------------------------------------------------
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Recovery 2 of 2
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:----------------------------------------------------------
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Synchronization mode = Full
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Failed instance host = gpslave-1
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Failed instance address = gpslave-1
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Failed instance directory = /data/mirror/gpseg1
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Failed instance port = 50000
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Failed instance replication port = 51000
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Recovery Source instance host = gpslave-2
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Recovery Source instance address = gpslave-2
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Recovery Source instance directory = /data/primary/gpseg1
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Recovery Source instance port = 40000
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Recovery Source instance replication port = 41000
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:- Recovery Target = in-place
20150727:23:30:30:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:---------------------------------------------------------- Continue with segment recovery procedure Yy|Nn (default=N):
> y
20150727:23:30:34:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-2 segment(s) to recover
20150727:23:30:34:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Ensuring 2 failed segment(s) are stopped 20150727:23:30:36:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Ensuring that shared memory is cleaned up for stopped segments
20150727:23:30:41:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Cleaning files from 2 segment(s)
..
20150727:23:30:43:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Building template directory
20150727:23:30:46:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Validating remote directories
.
20150727:23:30:47:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Copying template directory file
.
20150727:23:30:48:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Configuring new segments
.
20150727:23:30:49:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Cleaning files
.
20150727:23:30:50:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Starting file move procedure for gpslave-2:/data/mirror/gpseg0:content=0:dbid=4:mode=s:status=d
20150727:23:30:50:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Starting file move procedure for gpslave-1:/data/mirror/gpseg1:content=1:dbid=5:mode=s:status=d
updating flat files
20150727:23:30:50:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Updating configuration with new mirrors
20150727:23:30:50:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Updating mirrors
.
20150727:23:30:51:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Starting mirrors
20150727:23:30:51:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait...
............................................................................
20150727:23:32:10:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Process results...
20150727:23:32:10:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Updating configuration to mark mirrors up
20150727:23:32:10:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Updating primaries
20150727:23:32:10:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Commencing parallel primary conversion of 2 segments, please wait...
....
20150727:23:32:15:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Process results...
20150727:23:32:15:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Done updating primaries
20150727:23:32:15:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-******************************************************************
20150727:23:32:15:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Updating segments for resynchronization is completed.
20150727:23:32:15:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-For segments updated successfully, resynchronization will continue in the background.
20150727:23:32:15:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-
20150727:23:32:15:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-Use gpstate -s to check the resynchronization progress.
20150727:23:32:15:004102 gprecoverseg:gpmaster:gpadmin-[INFO]:-****************************************************************** [gpadmin@gpmaster ~]$ gpstate -e
20150727:23:34:45:005069 gpstate:gpmaster:gpadmin-[INFO]:-Starting gpstate with args: -e
20150727:23:34:45:005069 gpstate:gpmaster:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.3.4.1 build 2'
20150727:23:34:45:005069 gpstate:gpmaster:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.2.15 (Greenplum Database 4.3.4.1 build 2) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Feb 10 2015 14:15:10'
20150727:23:34:45:005069 gpstate:gpmaster:gpadmin-[INFO]:-Obtaining Segment details from master...
20150727:23:34:45:005069 gpstate:gpmaster:gpadmin-[INFO]:-Gathering data from segments...
..
20150727:23:34:47:005069 gpstate:gpmaster:gpadmin-[INFO]:-----------------------------------------------------
20150727:23:34:47:005069 gpstate:gpmaster:gpadmin-[INFO]:-Segment Mirroring Status Report
20150727:23:34:47:005069 gpstate:gpmaster:gpadmin-[INFO]:-----------------------------------------------------
20150727:23:34:47:005069 gpstate:gpmaster:gpadmin-[INFO]:-All segments are running normally

--总结一下解决方法:当全量恢复gprecoverseg -F无法正常执行,且报错信息为Unable to connect to database. Retrying 1

gprecoverseg failed. (Reason='Unable to connect to database and start transaction') exiting...时。

从其它正常的主事例上拷贝一份data/primary/gpseg*/pg_changetracking下的CT_METADATA到出问题的主事例上,覆盖原CT_METADATA,然后启动数据库,就能够正常进行全量恢复了。