Is Postgres better than MySQL when adding columns to a table with millions of rows?

Time: 2021-09-15 09:18:34

We're having problems with MySQL. When I search around, I see many people having the same problem.

I have joined a product whose database has some tables with as many as 150 million rows. One example of our problem is that one of these tables has over 30 columns, and about half of them are no longer used. When we try to drop or rename columns, MySQL wants to copy the entire table and then rename it. With this amount of data, that would take many hours, and the site would be offline pretty much the whole time. This is just the first of several large migrations to improve the schema. These aren't intended to be a regular thing; it's just a lot of cleanup I inherited.
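
The kind of statements involved look roughly like this (table and column names are hypothetical); on the MySQL version we're running, each one triggers a full copy of the table:

alter table big_table drop column legacy_col;
alter table big_table change column old_name new_name varchar(255);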

I tried searching to see whether people have the same problem with Postgres, and by comparison I found almost nothing discussing this issue. Is this because Postgres is a lot better at it, or just that fewer people are using Postgres?

3 Answers

#1


18  

In PostgreSQL, adding a new column without a default value to a table is instantaneous, because the new column is only registered in the system catalog, not actually added on disk.
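
A minimal sketch (table and column names are hypothetical):

-- adding a nullable column with no default only touches the system
-- catalog, so it returns almost instantly even on a huge table
alter table big_table add column new_flag text;

Note that since PostgreSQL 11, adding a column with a constant default is instantaneous too, because the default is stored in the catalog rather than written into every existing row.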

#2


11  

When the only tool you know is a hammer, all your problems look like nails. For this problem, PostgreSQL is much, much better at handling these types of changes. And the fact is, it doesn't matter how well you designed your app, you WILL have to change the schema on a live database someday. While MySQL's various engines really are amazing for certain corner cases, here none of them help. PostgreSQL's very close integration between its various layers means you can have things like transactional DDL, which lets you roll back anything that isn't a create / alter database or tablespace operation. Or very, very fast alter tables. Or non-impeding create indexes. And so on. That tight integration also limits PostgreSQL to the things it does well (traditional transactional DB load handling is a strong point), and it's not so great at the things MySQL often fills the gaps on, like live networked clustered storage with the NDB engine.
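
To make the transactional DDL point concrete, a sketch with made-up names:

begin;
alter table big_table drop column unused_col;
-- run sanity checks against the new schema here ...
rollback;  -- the schema change is undone; nothing was committed

-- and the non-impeding index build (this one must run outside a transaction):
create index concurrently idx_big_table_user_id on big_table (user_id);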

In this case, none of MySQL's various engines let you solve this problem easily. The very versatility of multiple storage engines means the lexer / parser / top layer of the DB cannot be as tightly integrated with the storage engines, so a lot of the cool things pgsql can do here, mysql can't.

I've got a 118 GB table in my stats DB. It has 1.1 billion rows in it. It really should be partitioned, but it isn't read a whole lot, and when it is, we can wait on it. At 300 MB/sec (the read speed of the array it's on), it takes roughly 118 GB / 300 MB/sec ≈ 400 seconds to read, or between six and seven minutes. This machine has 32 GB of RAM, so it cannot hold the table in memory.

When I ran the simple statement on this table:

alter table mytable add test text;

it hung waiting for a vacuum. I killed the vacuum (select pg_cancel_backend(12345), with the vacuum's pid in there) and the alter finished immediately. A vacuum on this table takes a long time to run, by the way. Normally that's not a big deal, but when making changes to the table structure, you have to wait on vacuums, or kill them.
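
For reference, on current PostgreSQL versions you can find and cancel the blocking vacuum like this (the pid is just an example):

-- look for the (auto)vacuum running against the table
select pid, query from pg_stat_activity where query ilike '%vacuum%';

-- cancel it so the alter table can take its lock
select pg_cancel_backend(12345);  -- pid of the vacuum backend from above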

Dropping a column is just as simple and fast.
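
The drop is equally catalog-only (same hypothetical table as above):

alter table mytable drop column test;  -- the column is just marked dropped; its space is reclaimed lazily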

Now we come to the problem with PostgreSQL, and that is its in-heap MVCC storage. If you add that column and then do an update mytable set test = 'abc', it updates every row and exactly doubles the size of the table. HOT can sometimes update rows in place instead, but that requires something like a 50% fill factor, which means the table is double-sized to begin with. The only way to get the space back is either to wait and let vacuum reclaim it over time, reusing it one update at a time, or to run cluster or vacuum full to shrink it back down.
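
A sketch of that failure mode (table name as above):

alter table mytable add test text;                -- instant, catalog-only
update mytable set test = 'abc';                  -- rewrites every row: the heap roughly doubles
select pg_size_pretty(pg_table_size('mytable'));  -- observe the bloat
vacuum full mytable;                              -- shrinks it back down, but holds an exclusive lock the whole time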

You can get around this by updating part of the table at a time (update ... where pkid between 1 and 10000000; ...) and running vacuum between each batch to reclaim the space.
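
A minimal sketch of that batching (the batch boundaries are illustrative, and pkid is assumed to be the primary key):

update mytable set test = 'abc' where pkid between 1 and 10000000;
vacuum mytable;  -- reclaim the dead row versions so the next batch can reuse the space
update mytable set test = 'abc' where pkid between 10000001 and 20000000;
vacuum mytable;
-- ... and so on until the whole key range has been updated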

So, both systems have warts and bumps to deal with.

#3


-4  

Maybe because this should not be a regular occurrence.

Perhaps, reading between the lines, you need to be adding rows to another table instead of adding columns to a large existing table...?
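
For example (all names hypothetical): rather than altering the 150-million-row table, keep the new attributes in a narrow side table keyed by its primary key:

create table big_table_extras (
    big_table_id bigint primary key references big_table (id),
    test         text
);

-- "adding a column" then becomes inserting rows here; the big table is never rewritten
insert into big_table_extras (big_table_id, test) values (42, 'abc');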
