Efficient MySQL table setup and query

Time: 2022-05-19 21:03:45

Suppose I have the following database setup (a simplified version of what I actually have):

Table: news_posting (500,000+ entries)
| --------------------------------------------------------------|
| posting_id  | name      | is_active   | released_date | token |
| 1           | posting_1 | 1           | 2013-01-10    | 123   |
| 2           | posting_2 | 1           | 2013-01-11    | 124   |
| 3           | posting_3 | 0           | 2013-01-12    | 125   |
| --------------------------------------------------------------|
PRIMARY posting_id
INDEX sorting ON (is_active, released_date, token)

Table: news_category (500 entries)
| ------------------------------|
| category_id   | name          |
| 1             | category_1    |
| 2             | category_2    |
| 3             | category_3    |
| ------------------------------|
PRIMARY category_id

Table: news_cat_match (1,000,000+ entries)
| ------------------------------|
| category_id   | posting_id    |
| 1             | 1             |
| 2             | 1             |
| 3             | 1             |
| 2             | 2             |
| 3             | 2             |
| 1             | 3             |
| 2             | 3             |
| ------------------------------|
UNIQUE idx (category_id, posting_id)
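
For reference, the setup above could be declared roughly as follows. This is only a sketch: the post gives column names, keys, and indexes, so every column type, NOT NULL constraint, and default below is an assumption.

```sql
-- Sketch of the three tables described above.
-- Column types and constraints are assumptions; only the names,
-- primary keys, and indexes come from the original description.
CREATE TABLE news_posting (
    posting_id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name          VARCHAR(255) NOT NULL,
    is_active     TINYINT(1)   NOT NULL DEFAULT 0,
    released_date DATE         NOT NULL,
    token         INT UNSIGNED NOT NULL,
    PRIMARY KEY (posting_id),
    INDEX sorting (is_active, released_date, token)
);

CREATE TABLE news_category (
    category_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name        VARCHAR(255) NOT NULL,
    PRIMARY KEY (category_id)
);

-- Many-to-many link between categories and postings.
CREATE TABLE news_cat_match (
    category_id INT UNSIGNED NOT NULL,
    posting_id  INT UNSIGNED NOT NULL,
    UNIQUE KEY idx (category_id, posting_id)
);
```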

My task is as follows. I must get a list of the 50 latest news postings (at some offset) that are active, that were released before today's date, and that belong to at least one of the 20 or so categories specified in the request. Before I choose the 50 news postings to return, I must sort the matching postings by token in descending order. My query is currently similar to the following:

SELECT DISTINCT np.posting_id
FROM news_posting np
INNER JOIN news_cat_match ncm ON (ncm.posting_id = np.posting_id AND ncm.category_id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20))
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
ORDER BY np.token DESC LIMIT 50

With just one category_id specified, the query does not involve a filesort and is reasonably fast, because it does not have to remove duplicate results. However, running EXPLAIN on the above query with multiple category_id values shows that a filesort is needed, and the query is extremely slow on my data set.
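
To reproduce that check, the statement can be prefixed with EXPLAIN, as sketched below with a shortened category list. In MySQL's output, a sort that cannot be served from an index shows up as "Using filesort" in the Extra column, and duplicate removal often appears as "Using temporary". Selecting np.token as well is my own adjustment, not part of the original query: it keeps the result equivalent (token is determined by posting_id) while letting the statement pass on servers where ONLY_FULL_GROUP_BY is enabled.

```sql
-- Inspect the execution plan of the multi-category query.
-- Look at the "Extra" column for "Using temporary" / "Using filesort".
-- np.token is added to the select list (an assumption, see above) so that
-- DISTINCT combined with ORDER BY np.token is accepted in strict SQL modes.
EXPLAIN
SELECT DISTINCT np.posting_id, np.token
FROM news_posting np
INNER JOIN news_cat_match ncm
    ON ncm.posting_id = np.posting_id
   AND ncm.category_id IN (1, 2, 3)
WHERE np.is_active = 1
  AND np.released_date < '2013-01-28'
ORDER BY np.token DESC
LIMIT 50;
```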

Is there any way to optimize the table setup and/or the query?

1 solution

#1

I was able to get the above query to run even faster than the single-category version by rewriting it as follows:

SELECT posting_id
FROM news_posting np
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
AND EXISTS (
    SELECT ncm.posting_id
    FROM news_cat_match ncm 
    WHERE ncm.posting_id = np.posting_id
    AND ncm.category_id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
    LIMIT 1
)
ORDER BY np.token DESC LIMIT 50

This now takes under a second on my data set.

The sad part is that this is even faster than when just one category_id is specified. That's because more news items match a list of categories than match a single category_id, so the query finds its 50 results more quickly.
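
One thing that may be worth trying with the EXISTS rewrite: the subquery probes news_cat_match once per candidate posting and filters on posting_id first, while the only index on that table leads with category_id. A second index that leads with posting_id might make each probe cheaper. This is an untested assumption rather than something I measured, and the index name idx_posting_category is made up for illustration, so compare the EXPLAIN output before and after.

```sql
-- Hypothetical additional index for the per-posting EXISTS probe.
-- The name idx_posting_category is illustrative, not from the original setup.
ALTER TABLE news_cat_match
    ADD INDEX idx_posting_category (posting_id, category_id);
```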

My next question is whether this can be optimized for cases where a category has only a few news postings that are spread out over time.

The following query is still pretty slow on my development machine. Although it is fast enough on the production server, I would like to optimize it if possible.

SELECT DISTINCT np.posting_id
FROM news_posting np
INNER JOIN news_cat_match ncm ON (ncm.posting_id = np.posting_id AND ncm.category_id = 1)
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
ORDER BY np.token DESC LIMIT 50

Does anyone have any further suggestions?
