MySQL - Handling table size and performance

Date: 2022-08-05 16:52:16

We have an Analytics product. We give each of our customers a JavaScript snippet that they put on their web sites. When a user visits a customer's site, the JavaScript code hits our server so that we store that page visit on behalf of the customer. Each customer has a unique domain name.

We store these page visits in a MySQL table.

Following is the table schema.

CREATE TABLE `page_visits` (
  `domain` varchar(50) DEFAULT NULL,
  `guid` varchar(100) DEFAULT NULL,
  `sid` varchar(100) DEFAULT NULL,
  `url` varchar(2500) DEFAULT NULL,
  `ip` varchar(20) DEFAULT NULL,
  `is_new` varchar(20) DEFAULT NULL,
  `ref` varchar(2500) DEFAULT NULL,
  `user_agent` varchar(255) DEFAULT NULL,
  `stats_time` datetime DEFAULT NULL,
  `country` varchar(50) DEFAULT NULL,
  `region` varchar(50) DEFAULT NULL,
  `city` varchar(50) DEFAULT NULL,
  `city_lat_long` varchar(50) DEFAULT NULL,
  `email` varchar(100) DEFAULT NULL,
  KEY `sid_index` (`sid`) USING BTREE,
  KEY `domain_index` (`domain`),
  KEY `email_index` (`email`),
  KEY `stats_time_index` (`stats_time`),
  KEY `domain_statstime` (`domain`,`stats_time`),
  KEY `domain_email` (`domain`,`email`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

We do not have a primary key on this table.

MySQL server details

It is Google Cloud MySQL (version 5.6) with a storage capacity of 10 TB.

As of now we have 350 million rows in the table, and the table size is 300 GB. We store all of our customers' data in the same table even though there is no relation between one customer and another.

Problem 1: A few of our customers have a huge number of rows in the table, so the performance of queries against these customers is very slow.

Example Query 1:

SELECT count(DISTINCT sid) AS count, count(sid) AS total
FROM page_visits
WHERE domain = 'aaa'
  AND stats_time BETWEEN CONVERT_TZ('2015-02-05 00:00:00','+05:30','+00:00')
                     AND CONVERT_TZ('2016-01-01 23:59:59','+05:30','+00:00');
+---------+---------+
| count   | total   |
+---------+---------+
| 1056546 | 2713729 |
+---------+---------+
1 row in set (13 min 19.71 sec)

I will update more queries here. We need results in under 5-10 seconds; will that be possible?

Problem 2: The table size is increasing rapidly; we might hit a table size of 5 TB by the end of this year, so we want to shard our table. We want to keep all records related to one customer on one machine. What are the best practices for this sharding?

We are considering the following approaches for the above issues; please suggest best practices for overcoming them.

Create a separate table for each customer

1) What are the advantages and disadvantages of creating a separate table for each customer? As of now we have 30k customers, and we might hit 100k by the end of this year, which would mean 100k tables in the DB. We access all tables simultaneously for reads and writes.

2) We will go with the same table and create partitions based on date range.

UPDATE: Is a "customer" determined by the domain? Answer: yes.

Thanks

2 solutions

#1

First, a critique of the excessively large datatypes:

  `domain` varchar(50) DEFAULT NULL,  -- normalize to MEDIUMINT UNSIGNED (3 bytes)
  `guid` varchar(100) DEFAULT NULL,  -- what is this for?
  `sid` varchar(100) DEFAULT NULL,  -- varchar?
  `url` varchar(2500) DEFAULT NULL,
  `ip` varchar(20) DEFAULT NULL,  -- too big for IPv4, too small for IPv6; see below
  `is_new` varchar(20) DEFAULT NULL,  -- flag?  Consider `TINYINT` or `ENUM`
  `ref` varchar(2500) DEFAULT NULL,
  `user_agent` varchar(255) DEFAULT NULL,  -- normalize! (add new rows as new agents are created)
  `stats_time` datetime DEFAULT NULL,
  `country` varchar(50) DEFAULT NULL,  -- use standard 2-letter code (see below)
  `region` varchar(50) DEFAULT NULL,  -- see below
  `city` varchar(50) DEFAULT NULL,  -- see below
  `city_lat_long` varchar(50) DEFAULT NULL,  -- unusable in current format; toss?
  `email` varchar(100) DEFAULT NULL,

For IP addresses, use INET6_ATON(), then store the result in BINARY(16).
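
For illustration, a minimal migration sketch (the ip_bin column name is made up, not part of the answer). VARBINARY(16) is used here so that the 4-byte values INET6_ATON() returns for IPv4 addresses round-trip cleanly; BINARY(16) as suggested also works if IPv4 addresses are first mapped to ::ffff: form.

ALTER TABLE page_visits ADD COLUMN ip_bin VARBINARY(16) DEFAULT NULL;

-- INET6_ATON() accepts both IPv4 and IPv6 text
-- (in practice this backfill would be done in chunks, not one UPDATE)
UPDATE page_visits SET ip_bin = INET6_ATON(ip);

-- Reading it back as text
SELECT INET6_NTOA(ip_bin) AS ip_text FROM page_visits LIMIT 10;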

For country, use CHAR(2) CHARACTER SET ascii -- only 2 bytes.

country + region + city + (maybe) latlng -- normalize this to a "location".
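
A hedged sketch of what that normalization might look like (table and column names are illustrative, not from the answer):

CREATE TABLE locations (
  location_id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
  country CHAR(2) CHARACTER SET ascii NOT NULL,
  region  VARCHAR(50) NOT NULL,
  city    VARCHAR(50) NOT NULL,
  lat DECIMAL(8,5) DEFAULT NULL,   -- keep only if lat/long is actually used
  lng DECIMAL(8,5) DEFAULT NULL,
  PRIMARY KEY (location_id),
  UNIQUE KEY (country, region, city)
) ENGINE=InnoDB;

page_visits would then carry a 3-byte location_id instead of roughly 200 bytes of repeated location strings.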

All these changes may cut the disk footprint in half. Smaller --> more cacheable --> less I/O --> faster.

Other issues...

To greatly speed up your sid counter, change

KEY `domain_statstime` (`domain`,`stats_time`),

to

KEY dss (domain_id,`stats_time`, sid),

That will be a "covering index", hence won't have to bounce between the index and the data 2713729 times -- the bouncing is what cost 13 minutes. (domain_id is discussed below.)

This is redundant with the above index; DROP it: KEY domain_index (domain)
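
In DDL terms, the two index changes might look like this (a sketch; if the domain_id normalization below is adopted, substitute domain_id for domain):

ALTER TABLE page_visits
  DROP INDEX domain_statstime,
  DROP INDEX domain_index,                    -- redundant with the new composite index
  ADD INDEX dss (domain, stats_time, sid);    -- covers the count query entirely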

Is a "customer" determined by the domain?

是由域名决定的“客户”吗?

Every InnoDB table must have a PRIMARY KEY. There are 3 ways to get a PK; you picked the 'worst' one -- a hidden 6-byte integer fabricated by the engine. I assume there is no 'natural' PK available from some combination of columns? Then, an explicit BIGINT UNSIGNED is called for. (Yes that would be 8 bytes, but various forms of maintenance need an explicit PK.)

If most queries include WHERE domain = '...', then I recommend the following. (And this will greatly improve all such queries.)

id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
domain_id MEDIUMINT UNSIGNED NOT NULL,   -- normalized to `Domains`
PRIMARY KEY(domain_id, id),  -- clustering on customer gives you the speedup
INDEX(id)  -- this keeps AUTO_INCREMENT happy
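
A fuller sketch of how that could be put together, assuming a Domains lookup table as mentioned in the comment above (column list abbreviated; names other than domain_id and id are illustrative):

CREATE TABLE Domains (
  domain_id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
  domain    VARCHAR(50) NOT NULL,
  PRIMARY KEY (domain_id),
  UNIQUE KEY (domain)
) ENGINE=InnoDB;

CREATE TABLE page_visits_new (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  domain_id  MEDIUMINT UNSIGNED NOT NULL,
  sid        VARCHAR(100) DEFAULT NULL,
  stats_time DATETIME DEFAULT NULL,
  -- ... remaining (shrunken) columns ...
  PRIMARY KEY (domain_id, id),            -- clusters each customer's rows together
  KEY (id),                               -- satisfies AUTO_INCREMENT's index requirement
  KEY dss (domain_id, stats_time, sid)    -- covering index for the count query
) ENGINE=InnoDB;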

I recommend looking into pt-online-schema-change for making all these changes. However, I don't know whether it can work without an explicit PRIMARY KEY.

"Separate table for each customer"? No. This is a common question; the resounding answer is No. I won't repeat all the reasons for not having 100K tables.

“每个客户的独立桌子”?不,这是一个常见的问题;响亮的答案是否定的。我不会重复没有100K表的所有原因。

Sharding

"Sharding" is splitting the data across multiple machines.

“分片”是将数据分割到多台机器上。

To do sharding, you need to have code somewhere that looks at domain and decides which server will handle the query, then hands it off. Sharding is advisable when you have write scaling problems. You did not mention such, so it is unclear whether sharding is advisable.

When sharding on something like domain (or domain_id), you could use (1) a hash to pick the server, (2) a dictionary lookup (of 100K rows), or (3) a hybrid.

I like the hybrid -- hash to, say, 1024 values, then look up into a 1024-row table to see which machine has the data. Since adding a new shard and migrating a user to a different shard are major undertakings, I feel that the hybrid is a reasonable compromise. The lookup table needs to be distributed to all clients that redirect actions to shards.
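
A minimal sketch of the hybrid lookup (the table and the choice of CRC32 as the hash are illustrative assumptions, not part of the answer):

-- 1024-row map from hash bucket to shard
CREATE TABLE shard_map (
  bucket SMALLINT UNSIGNED NOT NULL,   -- 0..1023
  shard  VARCHAR(100) NOT NULL,        -- e.g. host name / DSN of the shard
  PRIMARY KEY (bucket)
) ENGINE=InnoDB;

-- Which shard holds a given customer's data
SELECT shard FROM shard_map
WHERE bucket = CRC32('customer-domain.example') % 1024;

The client (or a routing layer) runs this lookup, then sends the query to the shard it names.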

If your 'writing' is running out of steam, see high speed ingestion for possible ways to speed that up.

PARTITIONing

PARTITIONing is splitting the data across multiple "sub-tables".

There are only a limited number of use cases where partitioning buys you any performance. You have not indicated that any of them apply to your use case. Read that blog and see whether you think partitioning might be useful.

You mentioned "partition by date range". Will most of the queries include a date range? If so, such partitioning may be advisable. (See the link above for best practices.) Some other options come to mind:

你提到了“按日期范围划分”。大多数查询是否包含日期范围?如果是这样,那么这种分区可能是可取的。 (请参阅上面的链接以获取最佳实践。)其他一些选项会浮现在脑海中:

Plan A: PRIMARY KEY(domain_id, stats_time, id) But that is bulky and requires even more overhead on each secondary index. (Each secondary index silently includes all the columns of the PK.)

Plan B: Have stats_time include microseconds, then tweak the values to avoid having dups. Then use stats_time instead of id. But this requires some added complexity, especially if there are multiple clients inserting data. (I can elaborate if needed.)

Plan C: Have a table that maps stats_time values to ids. Look up the id range before doing the real query, then use both WHERE id BETWEEN ... AND stats_time .... (Again, messy code.)
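
For reference, a range-partitioning sketch on the current table (monthly boundaries are illustrative). Note that MySQL requires the partitioning column to be part of every unique key, so this interacts with the PRIMARY KEY choices above (see Plan A):

ALTER TABLE page_visits
  PARTITION BY RANGE (TO_DAYS(stats_time)) (
    PARTITION p2015_12 VALUES LESS THAN (TO_DAYS('2016-01-01')),
    PARTITION p2016_01 VALUES LESS THAN (TO_DAYS('2016-02-01')),
    PARTITION p2016_02 VALUES LESS THAN (TO_DAYS('2016-03-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
  );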

Summary tables

Are many of the queries counting things over date ranges? If so, consider Summary Tables, perhaps aggregated per hour. More discussion.

COUNT(DISTINCT sid) is especially difficult to fold into summary tables. For example, the unique counts for each hour cannot be added together to get the unique count for the day. But I have a technique for that, too.
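
A hedged sketch of an hourly summary table, assuming the domain_id normalization above (names and the example time range are illustrative). It rolls up total visits only; as noted, COUNT(DISTINCT sid) cannot be rebuilt by summing hourly rows:

CREATE TABLE page_visits_hourly (
  domain_id MEDIUMINT UNSIGNED NOT NULL,
  hr        DATETIME NOT NULL,           -- stats_time truncated to the hour
  visits    INT UNSIGNED NOT NULL,
  PRIMARY KEY (domain_id, hr)
) ENGINE=InnoDB;

-- Periodic roll-up of one hour of raw data
INSERT INTO page_visits_hourly (domain_id, hr, visits)
SELECT domain_id,
       DATE_FORMAT(stats_time, '%Y-%m-%d %H:00:00'),
       COUNT(*)
FROM page_visits
WHERE stats_time >= '2016-01-01 10:00:00'   -- example hour
  AND stats_time <  '2016-01-01 11:00:00'
GROUP BY domain_id, DATE_FORMAT(stats_time, '%Y-%m-%d %H:00:00')
ON DUPLICATE KEY UPDATE visits = visits + VALUES(visits);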

#2

I wouldn't do this if I were you. The first thing that comes to mind: on receiving a pageview message, send it to a queue so that a worker can pick it up and insert it into the database later (perhaps in bulk); also increment a counter for siteid:date in Redis (for example). Doing the count in SQL is just a bad idea for this scenario.
