How to deal with large tables in MySQL?

Date: 2021-06-26 16:56:35

I have a database used to store items and the properties of these items. The number of properties is extensible, so there is a join table that stores each property value associated with an item.

CREATE TABLE `item_property` (
    `property_id` int(11) NOT NULL,
    `item_id` int(11) NOT NULL,
    `value` double NOT NULL,
    PRIMARY KEY  (`property_id`,`item_id`),
    KEY `item_id` (`item_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

This database has two goals: storing (first priority, and it has to be very quick: I would like to perform many inserts (hundreds) within a few seconds), and retrieving data (selects using item_id and property_id) (second priority: it can be slower, but not too much slower, because that would ruin my usage of the DB).

Currently this table holds 1.6 billion entries and a simple count can take up to 2 minutes... Inserting isn't fast enough to be usable.

I'm using Zend_Db to access my data, and I would really appreciate answers that don't require me to develop anything on the PHP side.

7 Answers

#1


10  

If you can't go for solutions using a different database management system or partitioning over a cluster for some reason, there are still three main things you can do to radically improve your performance (and they work in combination with clusters too, of course):

  • Set up the MyISAM storage engine
  • Use "LOAD DATA INFILE filename INTO TABLE tablename"
  • Split your data over several tables

That's it. Read the rest only if you're interested in the details :)

Still reading? OK then, here goes: MyISAM is the cornerstone, since it's by far the fastest engine. Instead of inserting data rows using regular SQL statements, you should batch them up in a file and load that file at regular intervals (as often as you need to, but as seldom as your application allows would be best). That way you can insert on the order of a million rows per minute.

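As a minimal sketch of what that looks like (the file path and the tab-separated format here are my assumptions; adapt them to whatever your batching code writes, and note that the server's secure_file_priv setting may restrict where the file can live):

LOAD DATA INFILE '/tmp/item_property_batch.tsv'
INTO TABLE item_property
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(property_id, item_id, value);
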
The next thing that will limit you is your keys/indexes. When those can't fit in memory (because they're simply too big), you'll experience a huge slowdown in both inserts and queries. That's why you split the data over several tables, all with the same schema. Every table should be as big as possible without its indexes overflowing memory when loaded one at a time. The exact size depends on your machine and indexes, of course, but should be somewhere between 5 and 50 million rows per table. You'll find it if you simply measure the time taken to insert one huge batch of rows after another, looking for the moment it slows down significantly. When you know the limit, create a new table on the fly every time your last table gets close to it.

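Creating the next table on the fly is a one-liner, since CREATE TABLE ... LIKE copies the columns, indexes and engine. A sketch, where the numbered names are my own convention rather than anything from the question:

-- item_property_1 is the table approaching the measured size limit
CREATE TABLE item_property_2 LIKE item_property_1;
-- point subsequent batch loads at item_property_2
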
The consequence of the multi-table solution is that you'll have to query all your tables instead of just a single one when you need some data, which will slow your queries down a bit (but not too much if you "only" have a billion or so rows). Obviously there are optimizations to do here as well. If there's something fundamental you could use to separate the data (like date, client, or something similar), you could split it into different tables using a structured pattern that lets you know where certain types of data are without even querying the tables. Use that knowledge to query only the tables that might contain the requested data, etc.

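Querying across the shards can then be expressed as a UNION ALL over the tables that might hold the data; a sketch using the assumed numbered names and made-up IDs:

SELECT value FROM item_property_1 WHERE item_id = 42 AND property_id = 7
UNION ALL
SELECT value FROM item_property_2 WHERE item_id = 42 AND property_id = 7;
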
If you need even more tuning, go for partitioning, as suggested by Eineki and oedo.

Also, so you'll know all of this isn't wild speculation: I'm doing some scalability tests like this on our own data at the moment, and this approach is doing wonders for us. We're managing to insert tens of millions of rows every day, and queries take ~100 ms.

#2


0  

First of all, don't use InnoDB, as you don't seem to need its principal features over MyISAM (locking, transactions, etc.). So do use MyISAM; it will already make some difference. Then, if that's still not speedy enough, look into indexing, but you should already see a radical difference.

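If you want to try that on the existing table, the engine switch is a single statement; note that it rebuilds the whole table, which will take a long time at this size (a sketch, not something to run casually on 1.6 billion rows):

ALTER TABLE item_property ENGINE=MyISAM;
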
#3


0  

wow, that is quite a large table :)

if you need storing to be fast, you could batch up your inserts and send them in a single multi-row INSERT statement. however this would definitely require extra client-side (php) code, sorry!

INSERT INTO `table` (`col1`, `col2`) VALUES (1, 2), (3, 4), (5, 6)...

also disable any indexes that you don't NEED, as indexes slow down insert commands.

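on MyISAM you can do this around a bulk load with DISABLE KEYS / ENABLE KEYS (this only defers non-unique indexes, such as the item_id key here); a sketch:

ALTER TABLE item_property DISABLE KEYS;
-- ... run the batched INSERTs here ...
ALTER TABLE item_property ENABLE KEYS;
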
alternatively you could look at partitioning your table: linky

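for this schema, hash-partitioning on item_id would be one natural choice, since item_id is part of the primary key; a sketch (the partition count of 16 is arbitrary):

ALTER TABLE item_property PARTITION BY HASH(item_id) PARTITIONS 16;
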
#4


0  

Look into memcache to see where it can be applied. Also look into horizontal partitioning to keep table sizes/indexes smaller.

#5


0  

First: one table with 1.6 billion entries seems to be a little too big. I work on some pretty heavily loaded systems where even the logging tables that keep track of all actions don't get this big over years. So, if possible, think about whether you can find a more optimal storage method. I can't give much more advice since I don't know your DB structure, but I'm sure there will be plenty of room for optimization. 1.6 billion entries is just too big.

A few things on performance:

If you don't need referential integrity checks, which is unlikely, you could switch to the MyISAM storage engine. It's a bit faster but lacks integrity checks and transactions.

For anything else, more info would be necessary.

#6


0  

Have you considered the option of partitioning the table?

#7


-2  

One important thing to remember is that a default installation of MySQL is not configured for heavy work like this. Make sure that you have tuned it for your workload.

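For example, on MyISAM the key buffer is usually the first knob to look at; a sketch, with the 256 MB value being purely illustrative (size it to your RAM and your indexes):

SET GLOBAL key_buffer_size = 256 * 1024 * 1024;
-- for InnoDB, innodb_buffer_pool_size is the analogous setting, best set in my.cnf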