MySQL failure recovery: a corrupted data file for one table in a multi-database instance made database xxx inaccessible

Time: 2021-08-09 21:43:36

I. Discovering the problem

  While logged in to the database instance from the command line to run a manual ALTER on a table, I got the following errors.

mysql> use xx_xxx;
No connection. Trying to reconnect...
Connection id:
Current database: *** NONE ***
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> show tables;
ERROR (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:
Current database: xx_xxx
ERROR (HY000): MySQL server has gone away
No connection. Trying to reconnect...
ERROR (HY000): Can't connect to local MySQL server through socket '/tmp/mysql.sock' (111)
ERROR:
Can't connect to the server

II. Locating the problem

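  This kind of error usually means that the MySQL instance was shut down or crashed, that the connection timed out, or that the request thread was killed. In this case the SQL review platform could connect to the instance normally and could even create tables in the problem database. The MySQL service was available to clients, the error log showed no abnormal output, and no developer or tester had reported trouble using the problem database. Opening the problem database in Navicat, however, failed with "MySQL server has gone away". From the command line, every command worked in the other databases, but inside the problem database even the most basic SHOW of variables and status values failed.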

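  I first tracked down the source of some brute-force connection attempts against the database to narrow things down (this turned out to be a detour), and the problem was still there. What was hard to explain was that creating a table in the problem database through the review platform succeeded, which was different from an earlier case where a whole database's data files were corrupted. I therefore suspected that the data file of a single table was corrupted.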

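  The log showed that the instance had gone through an abnormal shutdown and crash recovery, but that did not yet pin down the cause. What was clear was that only one database was affected and that it could be restored by other means. Still, a DBA is there to add value, so it is worth first trying to fix the problem at the lowest possible cost. I had previously dealt with several data file corruptions caused by hardware failure, which could be recovered from other cluster instances and backups without much impact on the business. In my own tests I had also seen a single database's files damaged very badly and still managed to recover by dumping the data out.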

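  Tracing back after the fix, I found the following failure point in the log. The entries are old. When a problem hits there may be little time for analysis, and you need to locate and fix the immediate issue quickly, so you will not necessarily spot this important evidence while fixing it. The log is included here only for reference and for reviewing the root cause.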

2018-12-18T12:30:29.409505Z 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 3514141201
2018-12-18T12:30:29.409520Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 3514141210
2018-12-18T12:30:29.409677Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 3514141210
2018-12-18T12:30:29.409682Z 0 [Note] InnoDB: Database was not shutdown normally!
2018-12-18T12:30:29.409685Z 0 [Note] InnoDB: Starting crash recovery.
2018-12-18T12:30:30.026781Z 0 [ERROR] InnoDB: In file './xx_xxxxxx/xx_xxxxxx_fans_person.ibd', tablespace id and flags are 2760 and 33, but in the InnoDB data dictionary they are 974 and 33. Have you moved InnoDB .ibd files around without using the commands DISCARD TABLESPACE and IMPORT TABLESPACE? Please refer to http://dev.mysql.com/doc/refman/5.7/en/innodb-troubleshooting-datadict.html for how to resolve the issue.
2018-12-18T12:30:30.026819Z 0 [ERROR] InnoDB: Operating system error number 2 in a file operation.
2018-12-18T12:30:30.026823Z 0 [ERROR] InnoDB: The error means the system cannot find the path specified.
2018-12-18T12:30:30.026827Z 0 [ERROR] InnoDB: If you are installing InnoDB, remember that you must create directories yourself, InnoDB does not create them.
2018-12-18T12:30:30.026831Z 0 [ERROR] InnoDB: Could not find a valid tablespace file for `xx_xxxxxx/xx_xxxxxx_fans_person`. Please refer to http://dev.mysql.com/doc/refman/5.7/en/innodb-troubleshooting-datadict.html for how to resolve the issue.
2018-12-18T12:30:30.026841Z 0 [Warning] InnoDB: Ignoring tablespace `xx_xxxxxx/xx_xxxxxx_fans_person` because it could not be opened.
2018-12-18T12:30:32.199013Z 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2018-12-18T12:30:32.199035Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2018-12-18T12:30:32.199088Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2018-12-18T12:30:32.286423Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
[ERROR] [FATAL] InnoDB: Tablespace id is 974 in the data dictionary but in file ./xx_xxxxxx/xx_xxxxxx_fans_person.ibd it is 2760!
2018-12-18 20:30:29 0x7f014872b700 InnoDB: Assertion failure in thread 139643487172352 in file ut0ut.cc line 916
InnoDB: We intentionally generate a memory trap.
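
The tablespace ID mismatch in the log above is the key piece of evidence, and it is easy to miss in a long error log. As a rough sketch, a small shell function could pull such lines out. The function name find_tablespace_mismatch and its output format are hypothetical, and the pattern only covers the message wording shown in this 5.7 log.

```shell
#!/bin/sh
# Hypothetical helper: list tablespace-ID mismatches reported in a MySQL error log.
# It matches the InnoDB message quoted above and prints one line per affected file:
#   <ibd file> file_id=<id found in the file> dict_id=<id in the data dictionary>
find_tablespace_mismatch() {
    sed -n "s/.*InnoDB: In file '\([^']*\)', tablespace id and flags are \([0-9]*\) and [0-9]*, but in the InnoDB data dictionary they are \([0-9]*\) and [0-9]*\..*/\1 file_id=\2 dict_id=\3/p" "$1" | sort -u
}
```

It takes the path of the error log as its only argument, for example find_tablespace_mismatch /path/to/error.log, where the path is a placeholder. Repeated crash-recovery attempts log the same message many times, so the output is de-duplicated.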

III. Resolving the problem

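  When a corrupted data file makes data inaccessible, the usual fix is to restore from a backup, including backing up and restoring just the damaged part. The idea is sound, but putting it into practice is not always easy. Sure enough, when I tried to dump the data before attempting a repair, the dump failed with "MySQL server has gone away". A good problem is worth sharing: when a problem is broad and hard to pin down, it is easy to overlook the right direction. After a reminder from friends, I noticed that the output of the use command mentions the -A option, and found that with it you can enter a database without loading its metadata.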

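What the -A option does
When we open a database with use dbname, the client pre-reads the database's information (the table and column names used for completion). If the database has a very large number of tables, this pre-read is very slow and the session appears to hang; with only a few tables there is no problem. The -A option (long form --no-auto-rehash) skips the pre-read.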
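
As a minimal sketch, the client can be started with the pre-read disabled through a small wrapper. The function name mysql_noprefetch and the MYSQL_BIN variable are hypothetical and only serve to make the client binary swappable; --no-auto-rehash is the long form of -A.

```shell
#!/bin/sh
# Hypothetical wrapper: start the mysql client without pre-reading table and
# column names. MYSQL_BIN is an assumed override for the client binary.
mysql_noprefetch() {
    "${MYSQL_BIN:-mysql}" --no-auto-rehash "$@"
}
```

For example, mysql_noprefetch -u root -p xx_xxx opens the problem database directly; the user name is a placeholder.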

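  Fortunately, with the pre-read disabled I could list all tables in the current database and view the system variables and status values. I then tried a batch repair of the InnoDB and MyISAM tables. The data should be backed up first, for example with select ... into. Because this was a test environment with a redundant copy, I went ahead and repaired without a backup. The queries below list all base tables and build the SQL statements. Sure enough, they turned up a broken table that could not be repaired, which matched the information in the MySQL error log.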

## Batch repair of MyISAM tables
select concat('repair table ',table_name,';') from information_schema.tables where table_schema = 'xx_xxxx' and table_type = 'BASE TABLE' and engine = 'MyISAM';
## Batch repair of InnoDB tables
select concat('optimize table ',table_name,';') from information_schema.tables where table_schema = 'xx_xxxx' and table_type = 'BASE TABLE' and engine = 'InnoDB';
##### optimize table xx_xxxxxx_fans_person;
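
The two queries above only print the statements, which then have to be run by hand. A rough sketch of scripting that step: the hypothetical function gen_repair_sql reads tab-separated "table_name, engine" lines, such as the mysql client prints with -N -B, and emits one statement per table. The schema name and connection details in the usage comment are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: read "table_name<TAB>engine" lines on stdin and print
# REPAIR TABLE for MyISAM tables and OPTIMIZE TABLE for InnoDB tables.
# Tables using any other engine are skipped.
gen_repair_sql() {
    awk -F '\t' '
        $2 == "MyISAM" { printf "REPAIR TABLE `%s`;\n", $1 }
        $2 == "InnoDB" { printf "OPTIMIZE TABLE `%s`;\n", $1 }
    '
}

# Assumed usage (not run here; needs a reachable server):
#   mysql -A -N -B -e "SELECT table_name, engine FROM information_schema.tables
#       WHERE table_schema = 'xx_xxxx' AND table_type = 'BASE TABLE'" |
#     gen_repair_sql |
#     mysql -A --force xx_xxxx
```

The --force option keeps the second client running after a statement fails, so one broken table does not hide the results for the rest.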

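  The commands above identified the table whose repair result was not OK, and checking its row count confirmed that its data could no longer be exported. I dropped the broken table and recreated it (drop table may report that the table does not exist, and the create may fail with error 1068), imported the most recent backup, and restarted the MySQL instance. That resolved the problem, and the problem database became accessible again.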

1. Drop the broken table xx_xxxxxx_fans_person
2. Recreate the table
mysql> CREATE TABLE `xx_xxxxxx_fans_person` (
-> `person_id` int() NOT NULL AUTO_INCREMENT,
-> `person_circle_id` int() NOT NULL,
-> `person_user_id` int() NOT NULL,
-> `person_time` datetime NOT NULL,
-> `type` int() DEFAULT '' COMMENT '1. group leader 2. member',
-> `merchant_id` int() DEFAULT '',
-> `leave_type` int() DEFAULT '' COMMENT 'leave status 0. not on leave 1. on leave',
-> `leave_start_time` datetime DEFAULT NULL COMMENT 'leave start time',
-> `leave_end_time` datetime DEFAULT NULL COMMENT 'leave end time',
-> `is_invalid` int() DEFAULT '' COMMENT 'invalidated or not 0 valid 1 invalid',
-> `invalid_id` int() DEFAULT '' COMMENT 'ID of the related invalidation record ',
-> PRIMARY KEY (`person_id`),
-> KEY `person_circle_id` (`person_circle_id`) USING BTREE,
-> KEY `person_user_id` (`person_user_id`) USING BTREE
-> ) ENGINE=InnoDB;
ERROR (HY000): Got error from storage engine
3. Restart
mysql> select count(*) from xx_xxxxxx_fans_person;
ERROR (HY000): Tablespace is missing for table `xx_xxxxxx`.`xx_xxxxxx_fans_person`.
mysql> drop table xx_xxxxxx_fans_person;
Query OK, rows affected (0.00 sec)
mysql> CREATE TABLE `xx_xxxxxx_fans_person` (
-> `person_id` int() NOT NULL AUTO_INCREMENT,
-> `person_circle_id` int() NOT NULL,
-> `person_user_id` int() NOT NULL,
-> `person_time` datetime NOT NULL,
-> `type` int() DEFAULT '' COMMENT '1. group leader 2. member',
-> `merchant_id` int() DEFAULT '',
-> `leave_type` int() DEFAULT '' COMMENT 'leave status 0. not on leave 1. on leave',
-> `leave_start_time` datetime DEFAULT NULL COMMENT 'leave start time',
-> `leave_end_time` datetime DEFAULT NULL COMMENT 'leave end time',
-> `is_invalid` int() DEFAULT '' COMMENT 'invalidated or not 0 valid 1 invalid',
-> `invalid_id` int() DEFAULT '' COMMENT 'ID of the related invalidation record ',
-> PRIMARY KEY (`person_id`),
-> KEY `person_circle_id` (`person_circle_id`) USING BTREE,
-> KEY `person_user_id` (`person_user_id`) USING BTREE
-> ) ENGINE=InnoDB;
ERROR (HY000): Tablespace '`xx_xxxxxx`.`xx_xxxxxx_fans_person`' exists.
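
The last error says that a tablespace still exists even though the table has just been dropped. One possible explanation, offered here as an assumption and not something the log confirms, is a leftover .ibd file in the database directory with no matching .frm file. A minimal sketch for spotting such files on MySQL 5.7 with file-per-table tablespaces follows. The function name list_orphan_ibd and its directory argument are hypothetical, and partitioned tables are not handled.

```shell
#!/bin/sh
# Hypothetical helper: print every .ibd file in a database directory that has
# no .frm file of the same base name (MySQL 5.7 layout, file-per-table assumed).
list_orphan_ibd() {
    dbdir=$1
    for ibd in "$dbdir"/*.ibd; do
        [ -e "$ibd" ] || continue      # the glob matched nothing
        frm="${ibd%.ibd}.frm"
        [ -e "$frm" ] || printf '%s\n' "$ibd"
    done
}
```

It would be pointed at the problem database's directory under the data directory, for example list_orphan_ibd /var/lib/mysql/xx_xxxxxx, where the path is a placeholder.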

IV. Summary

  1. Back up databases on a regular schedule to guard against data file corruption caused by hardware or other problems.

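  2. Analyze the problem first and rule out the basic impossibilities. Always check the logs, and pay attention to the hints in a command's error output, since they may help with troubleshooting or repair.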

  3. The -A option lets you try to repair tables without loading the database information. Make a backup first.