Data loss on a slave caused by using pt-online-schema-change to add an index to a large table online

Date: 2022-10-16 18:39:27

  Today a colleague added a new index to a column of a large table on the master. Because the table is large, he used the pt-osc tool. Not long after the index was added, the developers reported that data had been lost.

  This table used to be written on the master, but at some point (for reasons unknown) the writes were moved to the slave. In other words, the master held only some old historical data, while all new data was being written to the slave. When my colleague ran pt-osc to add the index, he ran it on the master, so the master's historical data was synced over and the new data on the slave was lost.

  The following example reproduces why this happens:


>select count(*) from test;

+----------+

| count(*) |

+----------+

|  1426200 |

+----------+


CREATE TABLE `test` (

  `rec_id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,

  `user_id` mediumint(8) unsigned NOT NULL DEFAULT '0',

  `goods_id` mediumint(8) unsigned NOT NULL DEFAULT '0',

  `add_time` int(11) unsigned NOT NULL DEFAULT '0',

  `is_attention` tinyint(1) NOT NULL DEFAULT '0',

  `wid` tinyint(3) unsigned NOT NULL DEFAULT '1' COMMENT '仓id,默认为1表示中国仓',

  PRIMARY KEY (`rec_id`),

  KEY `user_id` (`user_id`),

  KEY `goods_id` (`goods_id`),

  KEY `is_attention` (`is_attention`)

) ENGINE=InnoDB AUTO_INCREMENT=1721748 DEFAULT CHARSET=utf8  


First, drop the user_id index on the test table:

alter table test drop key user_id;


Next, delete some rows on the master with binary logging turned off, so the delete does not replicate. This leaves the slave with more rows than the master, mimicking the split-write situation from the incident:

>set sql_log_bin=off;

>delete from test where rec_id>721747;

Query OK, 848254 rows affected (20.24 sec)

>set sql_log_bin=on;

>select count(*) from test;

+----------+

| count(*) |

+----------+

|   577946 |

+----------+



Test 1:

First, add the index on the user_id column of the test table with a plain ALTER TABLE statement:

>alter table  test add key (user_id);


On the master:

show create table test\G

*************************** 1. row ***************************

       Table: test

Create Table: CREATE TABLE `test` (

  `rec_id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,

  `user_id` mediumint(8) unsigned NOT NULL DEFAULT '0',

  `goods_id` mediumint(8) unsigned NOT NULL DEFAULT '0',

  `add_time` int(11) unsigned NOT NULL DEFAULT '0',

  `is_attention` tinyint(1) NOT NULL DEFAULT '0',

  `wid` tinyint(3) unsigned NOT NULL DEFAULT '1' COMMENT '仓id,默认为1表示中国仓',

  PRIMARY KEY (`rec_id`),

  KEY `goods_id` (`goods_id`),

  KEY `is_attention` (`is_attention`),

  KEY `user_id` (`user_id`)

) ENGINE=InnoDB AUTO_INCREMENT=1721748 DEFAULT CHARSET=utf8

1 row in set (0.00 sec)


>select count(*) from test;

+----------+

| count(*) |

+----------+

|   577946 |

+----------+

1 row in set (0.22 sec)


On the slave:

show create table test\G

*************************** 1. row ***************************

       Table: test

Create Table: CREATE TABLE `test` (

  `rec_id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,

  `user_id` mediumint(8) unsigned NOT NULL DEFAULT '0',

  `goods_id` mediumint(8) unsigned NOT NULL DEFAULT '0',

  `add_time` int(11) unsigned NOT NULL DEFAULT '0',

  `is_attention` tinyint(1) NOT NULL DEFAULT '0',

  `wid` tinyint(3) unsigned NOT NULL DEFAULT '1' COMMENT '仓id,默认为1表示中国仓',

  PRIMARY KEY (`rec_id`),

  KEY `goods_id` (`goods_id`),

  KEY `is_attention` (`is_attention`),

  KEY `user_id` (`user_id`)

) ENGINE=InnoDB AUTO_INCREMENT=1721748 DEFAULT CHARSET=utf8

1 row in set (0.00 sec)


>select count(*) from test;

+----------+

| count(*) |

+----------+

|  1426200 |

+----------+

1 row in set (0.47 sec)


As you can see, when the index is added with a plain ALTER TABLE, the data on the slave is unchanged.
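The reason is that a plain ALTER TABLE is written to the binary log as a DDL statement, so the slave simply replays the same ALTER against its own copy of the table; no row data crosses the wire. A toy Python model of this behavior (not real MySQL internals, just an illustration):

```python
# Toy model: DDL replicates as a statement, so each server applies it
# to its own data; row counts are untouched.

def apply_ddl(server, index_name):
    # A DDL statement only changes the schema, never the rows.
    server["indexes"].add(index_name)

master = {"rows": set(range(577946)), "indexes": {"PRIMARY"}}
# The slave has extra rows because applications wrote to it directly.
slave = {"rows": set(range(1426200)), "indexes": {"PRIMARY"}}

apply_ddl(master, "user_id")   # executed on the master
apply_ddl(slave, "user_id")    # the same statement replayed on the slave

print(len(master["rows"]), len(slave["rows"]))  # 577946 1426200
```

Both servers end up with the index, and each keeps its own rows, which is exactly what Test 1 shows.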


Test 2:

Add the index on the user_id column of the test table on the master with the pt-osc tool.

First drop the index on the user_id column; run this on the master:

alter table  test drop index user_id;


Add the index with pt-osc:

pt-online-schema-change --nocheck-replication-filters --recursion-method=none --charset=utf8 --alter "add index (user_id) " h=localhost,P=3306,u=root,p=123456,D=test,t=test --print --execute


Successfully altered `test`.`test`. — the tool reports that the index was added successfully.

On the master:

>select count(*) from test;

+----------+

| count(*) |

+----------+

|   577946 |

+----------+

On the slave:

>select count(*) from test;

+----------+

| count(*) |

+----------+

|   577946 |

+----------+

As you can see, the data on the slave has been overwritten by the master's data.


Why does this happen? Let's look at the index-creation process in detail:

# pt-online-schema-change --nocheck-replication-filters --recursion-method=none --charset=utf8 --alter "add index (user_id) " h=localhost,P=3306,u=root,p=123456,D=test,t=test --print --execute

No slaves found.  See --recursion-method if host master has slaves.

Not checking slave lag because no slaves were found and --check-slave-lag was not specified.

Operation, tries, wait:

  copy_rows, 10, 0.25

  create_triggers, 10, 1

  drop_triggers, 10, 1

  swap_tables, 10, 1

  update_foreign_keys, 10, 1

Altering `test`.`test`...

Creating new table...

CREATE TABLE `test`.`_test_new` (

  `rec_id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,

  `user_id` mediumint(8) unsigned NOT NULL DEFAULT '0',

  `goods_id` mediumint(8) unsigned NOT NULL DEFAULT '0',

  `add_time` int(11) unsigned NOT NULL DEFAULT '0',

  `is_attention` tinyint(1) NOT NULL DEFAULT '0',

  `wid` tinyint(3) unsigned NOT NULL DEFAULT '1' COMMENT '仓id,默认为1表示中国仓',

  PRIMARY KEY (`rec_id`),

  KEY `goods_id` (`goods_id`),

  KEY `is_attention` (`is_attention`)

) ENGINE=InnoDB AUTO_INCREMENT=1721748 DEFAULT CHARSET=utf8

Created new table test._test_new OK.

Altering new table...

ALTER TABLE `test`.`_test_new` add index (user_id) 

Altered `test`.`_test_new` OK.

2016-02-01T18:11:48 Creating triggers...

CREATE TRIGGER `pt_osc_test_test_del` AFTER DELETE ON `test`.`test` FOR EACH ROW DELETE IGNORE FROM `test`.`_test_new` WHERE `test`.`_test_new`.`rec_id` <=> OLD.`rec_id`

CREATE TRIGGER `pt_osc_test_test_upd` AFTER UPDATE ON `test`.`test` FOR EACH ROW REPLACE INTO `test`.`_test_new` (`rec_id`, `user_id`, `goods_id`, `add_time`, `is_attention`, `wid`) VALUES (NEW.`rec_id`, NEW.`user_id`, NEW.`goods_id`, NEW.`add_time`, NEW.`is_attention`, NEW.`wid`)

CREATE TRIGGER `pt_osc_test_test_ins` AFTER INSERT ON `test`.`test` FOR EACH ROW REPLACE INTO `test`.`_test_new` (`rec_id`, `user_id`, `goods_id`, `add_time`, `is_attention`, `wid`) VALUES (NEW.`rec_id`, NEW.`user_id`, NEW.`goods_id`, NEW.`add_time`, NEW.`is_attention`, NEW.`wid`)

2016-02-01T18:11:48 Created triggers OK.

2016-02-01T18:11:48 Copying approximately 578405 rows...

INSERT LOW_PRIORITY IGNORE INTO `test`.`_test_new` (`rec_id`, `user_id`, `goods_id`, `add_time`, `is_attention`, `wid`) SELECT `rec_id`, `user_id`, `goods_id`, `add_time`, `is_attention`, `wid` FROM `test`.`test` FORCE INDEX(`PRIMARY`) WHERE ((`rec_id` >= ?)) AND ((`rec_id` <= ?)) LOCK IN SHARE MODE /*pt-online-schema-change 30910 copy nibble*/

SELECT /*!40001 SQL_NO_CACHE */ `rec_id` FROM `test`.`test` FORCE INDEX(`PRIMARY`) WHERE ((`rec_id` >= ?)) ORDER BY `rec_id` LIMIT ?, 2 /*next chunk boundary*/

2016-02-01T18:12:09 Copied rows OK.

2016-02-01T18:12:09 Swapping tables...

RENAME TABLE `test`.`test` TO `test`.`_test_old`, `test`.`_test_new` TO `test`.`test`

2016-02-01T18:12:09 Swapped original and new tables OK.

2016-02-01T18:12:09 Dropping old table...

DROP TABLE IF EXISTS `test`.`_test_old`

2016-02-01T18:12:09 Dropped old table `test`.`_test_old` OK.

2016-02-01T18:12:09 Dropping triggers...

DROP TRIGGER IF EXISTS `test`.`pt_osc_test_test_del`;

DROP TRIGGER IF EXISTS `test`.`pt_osc_test_test_upd`;

DROP TRIGGER IF EXISTS `test`.`pt_osc_test_test_ins`;

2016-02-01T18:12:09 Dropped triggers OK.

Successfully altered `test`.`test`.


While adding the index, pt-osc first creates a new shadow table (`_test_new`), creates three triggers to keep rows that change during the copy in sync, copies the original table's data into the shadow table (the index is added on the shadow table before the copy starts), and finally renames the shadow table into place. All of these operations replicate to the slave, so the slave's copy of the table is rebuilt from the master's rows, and the rows that existed only on the slave are discarded at the rename.
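The overwrite can be modeled like this (a toy sketch: it assumes the rows copied on the master are what reaches the slave, as with row-based replication, rather than the slave re-reading its own table):

```python
# Toy model of the pt-osc copy: the INSERT ... SELECT on the master
# determines which rows are replicated into the slave's shadow table,
# then the RENAME swaps the shadow table in on both servers.

master_test = set(range(577946))       # master's rows after the delete
slave_test = set(range(1426200))       # slave still has the extra rows

# Copy phase on the master; the copied rows are what gets replicated.
master_new = set(master_test)
replicated_rows = master_new           # row data sent to the slave

# The slave builds its shadow table from the replicated rows,
# not from its own test table.
slave_new = set(replicated_rows)

# RENAME TABLE swaps the shadow tables in on both servers.
master_test, slave_test = master_new, slave_new

print(len(master_test), len(slave_test))  # 577946 577946
```

After the swap both servers hold only the master's 577946 rows, matching the counts seen in Test 2.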


In fact, if you can keep all writes on the master and never write to the slaves, you will avoid a lot of traps like this one.
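A cheap safeguard before running pt-osc on a table: compare row counts (or, better, real checksums via pt-table-checksum) between the master and every slave, and refuse to run the DDL if they diverge. A minimal sketch — `safe_to_alter` is a hypothetical helper, and the counts are assumed to come from `SELECT COUNT(*)` run on each server:

```python
def safe_to_alter(master_count, slave_count):
    # Row counts are only a crude proxy for consistency;
    # pt-table-checksum gives a definitive per-chunk answer.
    return master_count == slave_count

# With the divergence from this incident, the check would have failed:
print(safe_to_alter(577946, 1426200))  # False
```

Had such a check (or a pt-table-checksum run) been done before the ALTER, the master/slave divergence would have been caught before any data was overwritten.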


This article is from the "一直在路上" blog. Please keep this attribution: http://chenql.blog.51cto.com/8732050/1740518