I have a table containing the runtimes for generators on different sites, and I want to select the most recent entry for each site. Each generator is run once or twice a week.
I have a query that will do this, but I wonder if it's the best option. I can't help thinking that using WHERE x IN (SELECT ...) is lazy and not the best way to formulate the query - any query.
The table is as follows:
CREATE TABLE generator_logs (
    id integer NOT NULL,
    site_id character varying(4) NOT NULL,
    start timestamp without time zone NOT NULL,
    "end" timestamp without time zone NOT NULL,
    duration integer NOT NULL
);
And the query:
SELECT id, site_id, start, "end", duration
FROM generator_logs
WHERE start IN (SELECT MAX(start) AS start
                FROM generator_logs
                GROUP BY site_id)
ORDER BY start DESC
There isn't a huge amount of data, so I'm not worried about optimizing the query. However, I do have to do similar things on tables with tens of millions of rows (big tables as far as I'm concerned!), and there optimisation is more important.
So is there a better query for this, and are inline queries generally a bad idea?
5 answers
#1
1
I would use a join, as joins can perform much better than an IN clause:
select gl.id, gl.site_id, gl.start, gl."end", gl.duration
from
    generator_logs gl
    inner join (
        select max(start) as start, site_id
        from generator_logs
        group by site_id
    ) gl2
        on gl.site_id = gl2.site_id
        and gl.start = gl2.start
Also, as Tony pointed out, you were missing the correlation in your original query.
#2
4
Should your query not be correlated? i.e.:
SELECT id, site_id, start, "end", duration
FROM generator_logs g1
WHERE start = (SELECT MAX(g2.start) AS start
               FROM generator_logs g2
               WHERE g2.site_id = g1.site_id)
ORDER BY start DESC
Otherwise you will potentially pick up non-latest logs whose start value happens to match the latest start for a different site.
Or alternatively:
SELECT id, site_id, start, "end", duration
FROM generator_logs g1
WHERE (site_id, start) IN (SELECT site_id, MAX(g2.start) AS start
                           FROM generator_logs g2
                           GROUP BY site_id)
ORDER BY start DESC
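To make the difference concrete, here is a small sketch that runs the original uncorrelated query and both corrected forms against three invented rows. It uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in so the example is self-contained; the sample rows, the TEXT column types, and the helper name ids are assumptions for illustration, not part of the original schema.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption, for a
# self-contained demo). Timestamps are stored as ISO-8601 text, which
# sorts correctly as plain strings.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE generator_logs (
        id       INTEGER NOT NULL,
        site_id  TEXT    NOT NULL,
        start    TEXT    NOT NULL,
        "end"    TEXT    NOT NULL,
        duration INTEGER NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO generator_logs VALUES (?, ?, ?, ?, ?)",
    [
        # Invented sample data.
        (1, "A", "2009-01-05 10:00:00", "2009-01-05 10:30:00", 30),  # latest run for A
        (2, "B", "2009-01-05 10:00:00", "2009-01-05 10:20:00", 20),  # older run for B, same start as A's latest
        (3, "B", "2009-01-08 09:00:00", "2009-01-08 09:25:00", 25),  # latest run for B
    ],
)


def ids(sql):
    """Hypothetical helper: run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]


# Original form: start is compared against every site's maximum.
uncorrelated_ids = ids(
    """
    SELECT id FROM generator_logs
    WHERE start IN (SELECT MAX(start) FROM generator_logs GROUP BY site_id)
    ORDER BY id
    """
)

# Correlated subquery: the maximum is computed per site.
correlated_ids = ids(
    """
    SELECT id FROM generator_logs g1
    WHERE start = (SELECT MAX(g2.start) FROM generator_logs g2
                   WHERE g2.site_id = g1.site_id)
    ORDER BY id
    """
)

# Row-value IN: site_id and start are matched as a pair.
pair_ids = ids(
    """
    SELECT id FROM generator_logs
    WHERE (site_id, start) IN (SELECT site_id, MAX(start)
                               FROM generator_logs GROUP BY site_id)
    ORDER BY id
    """
)

print(uncorrelated_ids)  # includes the stale row for site B
print(correlated_ids)
print(pair_ids)
```

Row 2 is not the latest log for site B, but the uncorrelated query still returns it because its start happens to equal site A's maximum.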
#3
0
In MySQL this could be problematic because, last I checked, it was unable to optimise subqueries effectively (i.e. by query rewriting).
Many DBMSs have genetic query planners, which will do the same thing regardless of how your input query is structured.
In some cases MySQL will create a temp table for that situation and in other cases not, and depending on the circumstances, indexing, and conditions, subqueries can still be rather quick.
Some complain that subqueries are hard to read, but they're perfectly fine if you split them out into local variables.
$maxids = 'SELECT MAX(start) AS start FROM generator_logs GROUP BY site_id';
$q ="
SELECT id, site_id, start, \"end\", duration
FROM generator_logs
WHERE start IN ($maxids)
ORDER BY start DESC
";
#4
0
This problem, finding not just the MAX but the rest of the corresponding row, is a common one. Luckily, Postgres provides a nice way to do this with one query, using DISTINCT ON:
SELECT DISTINCT ON (site_id)
id, site_id, start, "end", duration
FROM generator_logs
ORDER BY site_id, start DESC;
DISTINCT ON (site_id) means "return one record per site_id". The ORDER BY clause determines which record that is. Note, however, that this is subtly different from your original query: if you have two records for the same site with the same start, your query would return two records, while this returns only one.
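The tie behaviour can be checked with a small sketch. DISTINCT ON is Postgres-specific, so the example below swaps in ROW_NUMBER() as a portable stand-in with the same "one row per site_id" effect, and runs it in an in-memory SQLite database (window functions need SQLite 3.25 or later). The sample rows and the extra id DESC tiebreaker are assumptions added for illustration; without a tiebreaker, which of the tied rows is kept is not guaranteed.

```python
import sqlite3

# In-memory SQLite as a stand-in database (assumption); invented sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE generator_logs (
        id       INTEGER NOT NULL,
        site_id  TEXT    NOT NULL,
        start    TEXT    NOT NULL,
        "end"    TEXT    NOT NULL,
        duration INTEGER NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO generator_logs VALUES (?, ?, ?, ?, ?)",
    [
        (1, "A", "2009-01-01 08:00:00", "2009-01-01 08:30:00", 30),
        (2, "A", "2009-01-05 10:00:00", "2009-01-05 10:30:00", 30),  # tied for latest on A
        (3, "A", "2009-01-05 10:00:00", "2009-01-05 10:45:00", 45),  # tied for latest on A
        (4, "B", "2009-01-08 09:00:00", "2009-01-08 09:25:00", 25),
    ],
)

# MAX-based query: every row that matches the per-site maximum is returned.
max_based_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT id FROM generator_logs
        WHERE (site_id, start) IN (SELECT site_id, MAX(start)
                                   FROM generator_logs GROUP BY site_id)
        ORDER BY id
        """
    )
]

# ROW_NUMBER() stand-in for DISTINCT ON (site_id): exactly one row per site.
# The id DESC tiebreaker makes the choice between tied rows deterministic.
one_per_site_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY site_id
                                      ORDER BY start DESC, id DESC) AS rn
            FROM generator_logs
        ) AS ranked
        WHERE rn = 1
        ORDER BY id
        """
    )
]

print(max_based_ids)     # both tied rows for site A appear
print(one_per_site_ids)  # a single row per site
```

In Postgres itself, the equivalent way to pin down the tied case would be to extend the DISTINCT ON query's ORDER BY, for example ORDER BY site_id, start DESC, id DESC.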
#5
0
A way to find records having the MAX value per group is to select those records for which there is no record within the same group having a higher value:
SELECT id, site_id, "start", "end", duration
FROM generator_logs g1
WHERE NOT EXISTS (
SELECT 1
FROM generator_logs g2
WHERE g2.site_id = g1.site_id
AND g2."start" > g1."start"
);
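As a quick check, the sketch below runs the NOT EXISTS form next to the MAX-based form on a few invented rows and confirms they pick the same records, including both rows when a site has a tie. It uses an in-memory SQLite database as a stand-in; the sample rows and the index name generator_logs_site_start are assumptions. The composite index on (site_id, start) is the kind of index that usually helps this pattern on large tables, but whether a planner actually uses it should be confirmed with EXPLAIN on the real database.

```python
import sqlite3

# In-memory SQLite as a stand-in database (assumption); invented sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE generator_logs (
        id       INTEGER NOT NULL,
        site_id  TEXT    NOT NULL,
        start    TEXT    NOT NULL,
        "end"    TEXT    NOT NULL,
        duration INTEGER NOT NULL
    )
    """
)
# Hypothetical index name; supports lookups by site and start.
conn.execute(
    "CREATE INDEX generator_logs_site_start ON generator_logs (site_id, start)"
)
conn.executemany(
    "INSERT INTO generator_logs VALUES (?, ?, ?, ?, ?)",
    [
        (1, "A", "2009-01-01 08:00:00", "2009-01-01 08:30:00", 30),
        (2, "A", "2009-01-05 10:00:00", "2009-01-05 10:30:00", 30),  # latest for A
        (3, "B", "2009-01-05 10:00:00", "2009-01-05 10:20:00", 20),
        (4, "B", "2009-01-08 09:00:00", "2009-01-08 09:25:00", 25),  # tied for latest on B
        (5, "B", "2009-01-08 09:00:00", "2009-01-08 09:40:00", 40),  # tied for latest on B
    ],
)

# Anti-join: keep a row only if no row for the same site starts later.
not_exists_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT id FROM generator_logs g1
        WHERE NOT EXISTS (
            SELECT 1 FROM generator_logs g2
            WHERE g2.site_id = g1.site_id
              AND g2."start" > g1."start"
        )
        ORDER BY id
        """
    )
]

# Per-site MAX for comparison.
max_based_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT id FROM generator_logs g1
        WHERE start = (SELECT MAX(g2.start) FROM generator_logs g2
                       WHERE g2.site_id = g1.site_id)
        ORDER BY id
        """
    )
]

print(not_exists_ids)
print(max_based_ids)
```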