SELECT DISTINCT statement in MySQL takes 10 minutes

Time: 2021-12-27 02:50:40

I'm reasonably new to MySQL and I'm trying to select a distinct set of rows using this statement:

SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
INNER JOIN `vehicledata`.gtfsagencys as agency ON route.agency_id = agency.agency_id
WHERE agency.agency_id IN (1,2,3,4);

However, the select statement is taking around 10 minutes, so something is clearly afoot.

One significant factor is that the table gtfsstop_times is huge. (~250 million records)

Indexes seem to be set up properly; all the above joins are using indexed columns. Table sizes are, roughly:

gtfsagencys - 4 rows
gtfsroutes - 56,000 rows
gtfstrips - 5,500,000 rows
gtfsstop_times - 250,000,000 rows
`transportdata`.stoppoints - 400,000 rows

The server has 22 GB of memory, I've set the InnoDB buffer pool to 8 GB, and I'm using MySQL 5.6.

Can anybody see a way of making this run faster? Or indeed, at all!

Does it matter that the stoppoints table is in a different schema?

EDIT: EXPLAIN SELECT... returns this:

(EXPLAIN output not preserved in this copy.)

4 answers

#1


6  

It looks like you are trying to find a collection of stop points, based on certain criteria. And, you're using SELECT DISTINCT to avoid duplicate stop points. Is that right?

It looks like atcoCode is a unique key for your stoppoints table. Is that right?

If so, try this:

SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
  FROM `transportdata`.stoppoints AS sp
  JOIN ( 
     SELECT DISTINCT st.fk_atco_code AS atcoCode
       FROM `vehicledata`.gtfsroutes AS route
       JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
       JOIN `vehicledata`.gtfsstop_times AS st  ON trip.trip_id = st.trip_id
       WHERE route.agency_id BETWEEN 1 AND 4
  ) ids ON sp.atcoCode = ids.atcoCode

This does a few things: It eliminates a table (agency) which you don't seem to need. It changes the search on agency_id from IN(a,b,c) to a range search, which may or may not help. And finally it relocates the DISTINCT processing from a situation where it has to handle a whole ton of data to a subquery situation where it only has to handle the ID values.

(JOIN and INNER JOIN are the same. I used JOIN to make the query a bit easier to read.)

This should speed you up a bit. But, it has to be said, a quarter gigarow table is a big table.
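
If the derived table is still slow, it may be worth checking whether the inner query can be answered from an index alone. The sketch below assumes gtfsstop_times does not already have an index that starts with trip_id and also contains fk_atco_code. The index name idx_trip_atco is made up, and building any index on a table of this size will itself take a long time.

```sql
-- Hypothetical composite index: lets MySQL read fk_atco_code straight from
-- the index entries it locates by trip_id, without visiting the table rows.
ALTER TABLE `vehicledata`.gtfsstop_times
  ADD INDEX idx_trip_atco (trip_id, fk_atco_code);

-- Inspect the plan for the inner query on its own.
EXPLAIN
SELECT DISTINCT st.fk_atco_code AS atcoCode
  FROM `vehicledata`.gtfsroutes AS route
  JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
  JOIN `vehicledata`.gtfsstop_times AS st ON trip.trip_id = st.trip_id
 WHERE route.agency_id BETWEEN 1 AND 4;
```

If the Extra column for the st row shows "Using index", the stop times are being read from the index only. Whether the optimizer actually picks this index depends on your data, so treat the EXPLAIN output as the deciding evidence.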

#2


3  

With 250M records, I would shard the gtfsstop_times table on one column. Each shard can then be joined in a separate query, and those queries can run in parallel in separate threads; you only need to merge the result sets.
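
Short of physically sharding the table, a lighter variant of the same idea is to slice the work by agency_id and send each slice on its own connection. This is a sketch of that substitute, not of sharding proper. It reuses the subquery shape from answer #1, and the slices are joined with UNION only to illustrate that a merge with duplicate removal yields the same set of stop codes.

```sql
-- Each SELECT is one slice. In practice each would run on a separate
-- connection and the client would merge the results, dropping duplicates.
-- UNION (without ALL) performs that same de-duplicating merge.
SELECT st.fk_atco_code AS atcoCode
  FROM `vehicledata`.gtfsroutes AS route
  JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
  JOIN `vehicledata`.gtfsstop_times AS st ON trip.trip_id = st.trip_id
 WHERE route.agency_id = 1
UNION
SELECT st.fk_atco_code
  FROM `vehicledata`.gtfsroutes AS route
  JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
  JOIN `vehicledata`.gtfsstop_times AS st ON trip.trip_id = st.trip_id
 WHERE route.agency_id = 2
UNION
SELECT st.fk_atco_code
  FROM `vehicledata`.gtfsroutes AS route
  JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
  JOIN `vehicledata`.gtfsstop_times AS st ON trip.trip_id = st.trip_id
 WHERE route.agency_id = 3
UNION
SELECT st.fk_atco_code
  FROM `vehicledata`.gtfsroutes AS route
  JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
  JOIN `vehicledata`.gtfsstop_times AS st ON trip.trip_id = st.trip_id
 WHERE route.agency_id = 4;
```

Run as a single statement, this gains nothing, because MySQL 5.6 executes one statement on one thread. Any speed-up would come only from issuing the four SELECTs concurrently and merging on the client side.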

#3


2  

The trick is to reduce how many rows of gtfsstop_times SQL has to evaluate. In this case SQL first evaluates every row in the inner join of gtfsstop_times and transportdata.stoppoints, right? How many rows does transportdata.stoppoints have? Then SQL evaluates the WHERE clause, then it evaluates DISTINCT. How does it do DISTINCT? By looking at every single row multiple times to determine if there are other rows like it. That would take forever, right?

However, GROUP BY quickly squishes all the matching rows together, without evaluating each one. I normally use joins to quickly reduce the number of rows the query needs to evaluate, then I look at my grouping.

In this case you want to replace DISTINCT with grouping.

Try this:

SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode

FROM `transportdata`.stoppoints as sp
    INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
    INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
    INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
    INNER JOIN `vehicledata`.gtfsagencys as agency ON route.agency_id = agency.agency_id

WHERE agency.agency_id IN (1,2,3,4)

GROUP BY sp.name
    , sp.longitude
    , sp.latitude
    , sp.atcoCode
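
Whether GROUP BY really beats DISTINCT here is worth verifying rather than assuming. The MySQL manual describes DISTINCT as, in most cases, a special case of GROUP BY, so the two forms may well get the same plan. A quick check is to run EXPLAIN on both and compare. The ORDER BY NULL is an addition that is not in the answer above: in MySQL 5.6, GROUP BY also sorts its result, and ORDER BY NULL is the documented way to skip that sort when the order does not matter.

```sql
-- Plan for the GROUP BY form (ORDER BY NULL suppresses the implicit sort).
EXPLAIN
SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
  FROM `transportdata`.stoppoints AS sp
  INNER JOIN `vehicledata`.gtfsstop_times AS st ON sp.atcoCode = st.fk_atco_code
  INNER JOIN `vehicledata`.gtfstrips AS trip ON st.trip_id = trip.trip_id
  INNER JOIN `vehicledata`.gtfsroutes AS route ON trip.route_id = route.route_id
  INNER JOIN `vehicledata`.gtfsagencys AS agency ON route.agency_id = agency.agency_id
 WHERE agency.agency_id IN (1,2,3,4)
 GROUP BY sp.name, sp.longitude, sp.latitude, sp.atcoCode
 ORDER BY NULL;

-- Plan for the original DISTINCT form, for comparison.
EXPLAIN
SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
  FROM `transportdata`.stoppoints AS sp
  INNER JOIN `vehicledata`.gtfsstop_times AS st ON sp.atcoCode = st.fk_atco_code
  INNER JOIN `vehicledata`.gtfstrips AS trip ON st.trip_id = trip.trip_id
  INNER JOIN `vehicledata`.gtfsroutes AS route ON trip.route_id = route.route_id
  INNER JOIN `vehicledata`.gtfsagencys AS agency ON route.agency_id = agency.agency_id
 WHERE agency.agency_id IN (1,2,3,4);
```

If both plans show the same join order and the same "Using temporary" notes, swapping the keyword alone is unlikely to change the run time much.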

#4


1  

There are other valuable answers to your question, and mine is an addition to them. I assume sp.atcoCode and st.fk_atco_code are indexed columns in their tables.

If you can verify that the agency IDs in the WHERE clause are valid, you can drop the join to `vehicledata.gtfsagencys`, since you are not selecting any columns from that table.

SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
WHERE route.agency_id IN (1,2,3,4);
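
One way to do that validation, sketched under the assumption that route.agency_id is meant to reference gtfsagencys.agency_id, is to count the routes for those agencies that have no matching agency row. The alias orphan_routes is only illustrative.

```sql
-- Routes whose agency_id is in the filter list but has no row in gtfsagencys.
-- The LEFT JOIN keeps such routes and leaves agency.agency_id as NULL for them.
SELECT COUNT(*) AS orphan_routes
  FROM `vehicledata`.gtfsroutes AS route
  LEFT JOIN `vehicledata`.gtfsagencys AS agency
         ON agency.agency_id = route.agency_id
 WHERE route.agency_id IN (1,2,3,4)
   AND agency.agency_id IS NULL;
```

A count of 0 means the inner join to gtfsagencys was not filtering anything out for these IDs, so removing it does not change the result set.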
