记一次数据库bug故障恢复

从一件小事说起

最近一个月不知何故,博客的VPS IP无法ping通。提交ticket要求更换ip, 毫无意外的被拒绝了。于是直接从当前VPS clone 一台间接实现了IP的更换。clone的文章参考官方指引即可。因为是克隆操作,因此所有的环境和数据都保持不变,配置成本约定于0。整个过程加上修改DNS,10分钟搞定。正准备来杯清茶休息一下的时候,发现新机器的mysql挂了:

2020-03-21T12:00:03.913175Z 0 [Note] InnoDB: Apply batch completed
2020-03-21T12:00:04.016847Z 0 [Note] InnoDB: Removed temporary tablespace data file: “ibtmp1”
2020-03-21T12:00:04.016974Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2020-03-21T12:00:04.017051Z 0 [Note] InnoDB: Setting file ‘./ibtmp1’ size to 12 MB. Physically writing the file full; Please wait …
2020-03-21T12:00:04.020703Z 0 [Note] InnoDB: Starting in background the rollback of uncommitted transactions
2020-03-21T12:00:04.020779Z 0 [Note] InnoDB: Rolling back trx with id 35499526, 1 rows to undo
12:00:04 UTC – mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=0
max_threads=151
thread_count=0
connection_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 68195 K bytes of memory
Hope that’s ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong…
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2c)[0xed2ebc]
/usr/sbin/mysqld(handle_fatal_signal+0x451)[0x7b4ea1]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7f85fd1a5340]
/usr/sbin/mysqld(_Z32page_cur_search_with_match_bytesPK11buf_block_tPK12dict_index_tPK8dtuple_t15page_cur_mode_tPmS9_S9_S9_P10page_cur_t+0x168)[0xf89528]
/usr/sbin/mysqld(_Z27btr_cur_search_to_nth_levelP12dict_index_tmPK8dtuple_t15page_cur_mode_tmP9btr_cur_tmPKcmP5mtr_t+0x1121)[0x10b2d31]
/usr/sbin/mysqld(_Z22row_search_index_entryP12dict_index_tPK8dtuple_tmP10btr_pcur_tP5mtr_t+0x11f)[0xffd4ff]
/usr/sbin/mysqld[0x11cb0d6]
/usr/sbin/mysqld(_Z12row_undo_insP11undo_node_tP9que_thr_t+0x22b)[0x11cc0bb]
/usr/sbin/mysqld(_Z13row_undo_stepP9que_thr_t+0x405)[0x1019385]
/usr/sbin/mysqld(_Z15que_run_threadsP9que_thr_t+0x871)[0xfa97c1]
/usr/sbin/mysqld(_Z31trx_rollback_or_clean_recoveredm+0xcdc)[0x106c8dc]
/usr/sbin/mysqld(trx_rollback_or_clean_all_recovered+0x4e)[0x106d7ee]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7f85fd19d182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f85fc6aa47d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
2020-03-21T12:00:04.056733Z mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

从错误信息 mysqld got signal 11 看,应该是启动过程是尝试fix table触发了bug,因此启动失败。

如何解决 mysqld got signal 11

既然是fix corrupted table导致的问题,那么思路自然是先把mysql启动起来,备份数据,然后找到存在问题的table, 再逐步恢复数据(我的mysql server版本是:5.7.23,不同版本可能稍有差异):

  1. 如果服务器未正常停止,先停止mysql: service mysql stop
  2. 备份文件/var/lib/mysql/ib*: mkdir /tmp/fixdb && cp -rp /var/lib/mysql/ib* /tmp/fixdb
  3. 在配置文件/etc/mysql/my.cnf中添加如下内容:

[mysqld]

innodb_force_recovery = 3

  1. 启动mysql: service mysql start. 如果启动失败,尝试改大步骤3的数字,直到启动成功。
  2. 备份数据表: mysqldump -A -uUSERNAME -p > dump.sql
  3. 删除所有用户数据库:

mysql -uUSERNAME -p

drop database xxx…

  1. 停止mysql: service mysql stop
  2. 删除ib文件:rm /var/lib/mysql/ib*
  3. 删除步骤3中添加的两行内容
  4. 启动mysql: service mysql start. 在mysql错误文件(/var/log/mysql/error.log)中应该可以看到最开始导致启动失败的corrupted table.
  5. 恢复数据:mysql -uUSERNAME -p < dump.sql. 如果恢复过程中出现 Tablespace exists错误,则尝试删除ibd文件,然后重新导入数据:

ERROR 1813 (HY000) at line 268: Tablespace ‘DBNAME.TABLENAME‘ exists.

rm /var/lib/mysql/DBNAME/*.ibd

mysql -uUSERNAME -p < dump.sql

终于大功告成。恢复过程中小紧张了一下,使用了10年的VPS终于满血复活。

参考文献