1、故障描述
某次,用户MySQL数据库出现异常宕机,MySQL错误日志中出现断言失败,实例崩溃并且不断尝试重启恢复,发现业务异常后,紧急进行主从切换,先恢复业务。对于MySQL而言,如果实例出现奔溃,一般来讲,问题已经非常严重,虽然做了容灾切换,但是如果是数据库本身的问题,可能在未来很短的时间内复现,因此需要高度关注,并做根因排查。
MySQL错误日志的断言输出,如下:
2022-10-29T09:14:08.044233+08:00 458807751 [ERROR] [MY-013183] [InnoDB] Assertion failure: btr0cur.cc:3660:rec thread 140137891268352 InnoDB: We intentionally generate a memory trap. InnoDB: Submit a detailed bug report to InnoDB: If you get repeated assertion failures or crashes, even InnoDB: immediately after the mysqld startup, there may be InnoDB: corruption in the InnoDB tablespace. Please refer to InnoDB: http://dev.mysql.com/doc/refman/8.0/en/forcing-innodb-recovery.html InnoDB: about forcing recovery. 01:14:08 UTC - mysqld got signal 6 ; Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware. Thread pointer: 0x7f745407b860 Attempting backtrace. You can use the following information to find out where mysqld died. If you see no messages after this, something went terribly wrong... stack_bottom = 7f74653a7bf0 thread_stack 0x100000 /usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x200097d] /usr/sbin/mysqld(print_fatal_signal(int)+0x2df) [0xf16eaf] /usr/sbin/mysqld(my_server_abort()+0x7e) [0xf16ffe] /usr/sbin/mysqld(my_abort()+0xa) [0x1ffacaa] /usr/sbin/mysqld(ut_dbg_assertion_failed(char const*, char const*, unsigned long)+0x31f) [0x22b508f] /usr/sbin/mysqld(btr_cur_optimistic_update(unsigned long, btr_cur_t*, unsigned long**, mem_block_info_t**, upd_t const*, unsigned long, que_thr_t*, unsigned long, mtr_t*)+0x932) [0x22f0be2] /usr/sbin/mysqld() [0x2464ebd] /usr/sbin/mysqld() [0x2465390] /usr/sbin/mysqld(row_undo_mod(undo_node_t*, que_thr_t*)+0xadf) [0x246766f] /usr/sbin/mysqld(row_undo_step(que_thr_t*)+0x65) [0x2246d75] /usr/sbin/mysqld(que_run_threads(que_thr_t*)+0x46c) [0x21f37ec] /usr/sbin/mysqld() [0x229775a] /usr/sbin/mysqld() [0x2298299] /usr/sbin/mysqld(trx_rollback_for_mysql(trx_t*)+0xf1) [0x2299da1] /usr/sbin/mysqld() [0x20fcb34] /usr/sbin/mysqld(ha_rollback_low(THD*, bool)+0x127) [0x1021737] /usr/sbin/mysqld(trx_coordinator::rollback_in_engines(THD*, bool)+0x25) [0xec8b65] /usr/sbin/mysqld(MYSQL_BIN_LOG::rollback(THD*, bool)+0xcf) [0x1c3f8bf] /usr/sbin/mysqld(ha_rollback_trans(THD*, bool)+0x79) [0x1021919] /usr/sbin/mysqld(trans_rollback(THD*)+0x4e) [0xecbd1e] /usr/sbin/mysqld(mysql_execute_command(THD*, bool)+0x225e) [0xdae21e] /usr/sbin/mysqld(dispatch_sql_command(THD*, Parser_state*)+0x41a) [0xdb069a] /usr/sbin/mysqld(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0xd3e) [0xdb193e] /usr/sbin/mysqld(do_command(THD*)+0x207) [0xdb3b37] /usr/sbin/mysqld() [0xf07610] /usr/sbin/mysqld() [0x2511155] /lib64/libpthread.so.0(+0x7ea5) [0x7f7cc0770ea5] /lib64/libc.so.6(clone+0x6d) [0x7f7cbeb2c8dd] Trying to get some variables. Some pointers may be invalid and cause the dump to abort. Query (7f6eac049e70): rollback Connection ID (thread ID): 458807751 Status: NOT_KILLED The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
2、根因分析
如上,首先检查MySQL错误日志发生崩溃时,报错前后是否有有用信息。根据错误日志信息可知,MySQL崩溃之前并未有任何记录,可以判断在报错前数据库并未出现损坏。
然后又去查看了操作系统日志,了解MySQL崩溃时操作系统有无异常情况。根据message日志信息可知,MySQL进程突然退出,操作系统并未记录到异常。
Oct 29 09:14:01 rmssdb systemd: Started Session 779337 of user root. Oct 29 09:14:01 rmssdb systemd: Started Session 779336 of user root. Oct 29 09:14:14 rmssdb systemd: mysqld.service: main process exited, code=exited, status=2/INVALIDARGUMENT Oct 29 09:14:14 rmssdb systemd: Unit mysqld.service entered failed state. Oct 29 09:14:14 rmssdb systemd: mysqld.service failed. Oct 29 09:14:14 rmssdb systemd: mysqld.service holdoff time over, scheduling restart. Oct 29 09:14:14 rmssdb systemd: Stopped MySQL Server. Oct 29 09:14:14 rmssdb systemd: Starting MySQL Server... Oct 29 09:14:25 rmssdb systemd: mysqld.service: main process exited, code=exited, status=2/INVALIDARGUMENT Oct 29 09:14:25 rmssdb systemd: Failed to start MySQL Server.
既然如此,那就需要对错误信息进行分析。截至目前,其实可以大致猜测,MySQL可能是因为bug导致崩溃。根据错误信息作为关键词,可以到MySQL的Bug网站和Oracle内部网站MOS中查找是否有相关的记录。
2022-10-29T09:14:08.044233+08:00 458807751 [ERROR] [MY-013183] [InnoDB] Assertion failure: btr0cur.cc:3660:rec thread 140137891268352
果然,在MOS中,匹配到了类似的故障描述。详细可参考文档Instant DDL Causes Assertion Failure After Mysql Upgrade (Doc ID 2903537.1)。
根据文档的说明,当8.0.28版本以下的MySQL升级至8.0.30,运行DDL更改就有可能会导致数据损坏,引起该BUG。描述与实际故障情况符合。
3、解决方案
1、此BUG仅影响 8.0.29 和 8.0.30 版本,可下载8.0.31,根据标准流程对数据库执行升级,如果可能,请在测试环境中重新测试问题。
2、MySQL在升级前建议评估好版本的稳定性以及对系统的影响,生产系统不建议随便频繁升级。
3 、由于MySQL 8.0 系列版本,未经过市场充分的验证,可能存在很多未知的 BUG ,适当情况下,可以将版本回退至该系列更低的稳定版本。