Mysql Query:返回组中所有前面的行匹配条件的行

时间:2022-09-23 12:06:48

Given a table like the following, which contains a list of names, tasks, priority of task, and status of task:

给定如下表格,其中包含名称,任务,任务优先级和任务状态的列表:

mysql> select * from test;
+----+------+--------+----------+--------+
| id | name | task   | priority | status |
+----+------+--------+----------+--------+
|  1 | bob  | start  |        1 | done   |
|  2 | bob  | work   |        2 | NULL   |
|  3 | bob  | finish |        3 | NULL   |
|  4 | jim  | start  |        1 | done   |
|  5 | jim  | work   |        2 | done   |
|  6 | jim  | finish |        3 | NULL   |
|  7 | mike | start  |        1 | done   |
|  8 | mike | work   |        2 | failed |
|  9 | mike | finish |        3 | NULL   |
| 10 | joan | start  |        1 | NULL   |
| 11 | joan | work   |        2 | NULL   |
| 12 | joan | finish |        3 | NULL   |
+----+------+--------+----------+--------+
12 rows in set (0.00 sec)

I want to build a query which returns only the next task to be run for each name. Specifically, I want to return the row containing the lowest number priority which has a NULL status per person.

我想构建一个查询,它只返回每个名称要运行的下一个任务。具体来说,我想返回包含每个人具有NULL状态的最低优先级的行。

But here's the catch: I want to only return the row if all preceding tasks have a status of "done".

但是这里有一个问题:如果所有前面的任务都具有“完成”状态,我想只返回该行。

Given the above table and query logic, the end result of this query should look like this:

给定上面的表和查询逻辑,此查询的最终结果应如下所示:

+----+------+--------+----------+--------+
| id | name | task   | priority | status |
+----+------+--------+----------+--------+
|  2 | bob  | work   |        2 | NULL   |
|  6 | jim  | finish |        3 | NULL   |
+----+------+--------+----------+--------+

Initially, this was being done with a whole mess of sub-queries and derived tables, which extremely inefficient and slow. I have managed to speed it up considerably by using several temporary tables to get the result I want.

最初,这是通过一堆子查询和派生表来完成的,这些表非常低效且速度慢。通过使用几个临时表来获得我想要的结果,我已经成功地加快了速度。

In the real world, this will be run on a table with about 200k records, and multiple servers will each be executing this query several times per minute. My current solution takes about 2 seconds to run, which simply won't do.

在现实世界中,这将在具有大约200k记录的表上运行,并且每个服务器将每分钟多次执行该查询。我目前的解决方案需要大约2秒才能运行,这根本不会。

Here is the DML/DDL to get my example data:

这是获取我的示例数据的DML / DDL:

DROP TABLE IF EXISTS `test`;
CREATE TABLE `test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(20) DEFAULT NULL,
  `task` varchar(20) DEFAULT NULL,
  `priority` int(11) DEFAULT NULL,
  `status` varchar(20) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

INSERT INTO `test` VALUES 
(1,'bob','start',1,'done'),
(2,'bob','work',2,NULL),
(3,'bob','finish',3,NULL),
(4,'jim','start',1,'done'),
(5,'jim','work',2,'done'),
(6,'jim','finish',3,NULL),
(7,'mike','start',1,'done'),
(8,'mike','work',2,'failed'),
(9,'mike','finish',3,NULL),
(10,'joan','start',1,NULL),
(11,'joan','work',2,NULL),
(12,'joan','finish',3,NULL);

Here's what I am currently doing to get the desired result (which works, but is slow):

这是我目前正在做的事情,以获得所需的结果(这是有效的,但很慢):

drop table if exists tmp1;
create temporary table tmp1 as 
select 
    name, 
    min(priority) as priority 
from test t 
where status is null 
group by name;
create index idx_pri on tmp1(priority);
create index idx_name on tmp1(name);

drop table if exists tmp2;
create temporary table tmp2 as 
select tmp.* 
from test t 
join tmp1 tmp 
    on t.name = tmp.name 
    and t.priority < tmp.priority 
group by name having sum(
    case when status = 'done' 
    then 0 
    else 1 
    end
) = 0;
create index idx_pri on tmp2(priority);
create index idx_name on tmp2(name);


select 
    t.*
from test t 
join tmp2 t2
    on t.name = t2.name
    and t.priority = t2.priority;

I have the DDL/DML in SQL Fiddle as well, but I can't put my solution in there because technically the creation of these temp tables is DDL, and it doen't allow DDL in the query box. http://sqlfiddle.com/#!2/2d9e2/1

我在SQL Fiddle中也有DDL / DML,但我不能把我的解决方案放在那里,因为从技术上讲,这些临时表的创建是DDL,并且它不允许在查询框中使用DDL。 http://sqlfiddle.com/#!2/2d9e2/1

Please help me with coming up with a better way to do this. I am open to modifying schema or logic to accommodate outside of the box solutions as well, so long as said solution is efficient.

请帮助我提出一个更好的方法来做到这一点。我愿意修改模式或逻辑以适应开箱即用的解决方案,只要所述解决方案有效。

4 个解决方案

#1


1  

You can turn your logic pretty directly into a query like this:

您可以将您的逻辑直接转换为如下查询:

select t.*
from test t 
where t.status is null and
      not exists (select 1
                  from test t2
                  where t2.name = t.name and
                        t2.id < t.id and
                        (t2.status <> 'done' or
                         t2.status is null
                        )
                 ) and
      exists (select 1
              from test t2
              where t2.name = t.name and
                    t2.id < t.id and
                    t2.status = 'done'
             );

For performance, create an index on test(name, id, status).

为了提高性能,请在test(name,id,status)上创建索引。

Here is a SQL Fiddle.

这是一个SQL小提琴。

#2


1  

This query determines whether all tasks before a given task are done by verifying that the # of tasks that are not done before the given task are 0

此查询通过验证在给定任务之前未完成的任务数为0来确定是否完成给定任务之前的所有任务

SELECT t1.name, t1.id, t1.priority, t1.task
FROM test t1
JOIN test t2 
    ON t2.name = t1.name
    AND t2.priority < t1.priority
WHERE t1.status IS NULL
GROUP BY t1.name, t1.priority, t1.id, t1.task
HAVING COUNT(CASE WHEN t2.status = 'done' THEN NULL ELSE 1 END) = 0 

CREATE INDEX test_index1 ON test (name,status,priority,id,task);

http://sqlfiddle.com/#!2/c912f/7

#3


1  

I have no way of testing the speed against a really large table, but this at least returns the correct answer from your smaller sample table. It should be competitive with other answers, however, as it performs only one subquery without a join:

我无法针对一个非常大的表测试速度,但这至少会从较小的样本表中返回正确的答案。但是,它应该与其他答案竞争,因为它只执行一个没有连接的子查询:

select  *
from    test  t1
where   t1.Status is null
  and exists (
    select 1
    from   test
    where  Name = t1.Name and
           Priority < t1.Priority
    group by Name
    having count(*) = sum( case when Status = 'done' then 1 else 0 end )
);

#4


0  

SELECT a.* 
  FROM test a 
  JOIN 
     ( SELECT x.name
            , MIN(x.priority) priority 
         FROM test x 
         LEFT 
         JOIN test y 
           ON y.name = x.name 
          AND y.priority < x.priority 
          AND y.status <> 'done' 
        WHERE y.id IS NULL 
          AND x.status IS NULL
        GROUP BY x.name
     ) b 
    ON b.name = a.name 
   AND b.priority = a.priority
   AND a.priority > 1;

#1


1  

You can turn your logic pretty directly into a query like this:

您可以将您的逻辑直接转换为如下查询:

select t.*
from test t 
where t.status is null and
      not exists (select 1
                  from test t2
                  where t2.name = t.name and
                        t2.id < t.id and
                        (t2.status <> 'done' or
                         t2.status is null
                        )
                 ) and
      exists (select 1
              from test t2
              where t2.name = t.name and
                    t2.id < t.id and
                    t2.status = 'done'
             );

For performance, create an index on test(name, id, status).

为了提高性能,请在test(name,id,status)上创建索引。

Here is a SQL Fiddle.

这是一个SQL小提琴。

#2


1  

This query determines whether all tasks before a given task are done by verifying that the # of tasks that are not done before the given task are 0

此查询通过验证在给定任务之前未完成的任务数为0来确定是否完成给定任务之前的所有任务

SELECT t1.name, t1.id, t1.priority, t1.task
FROM test t1
JOIN test t2 
    ON t2.name = t1.name
    AND t2.priority < t1.priority
WHERE t1.status IS NULL
GROUP BY t1.name, t1.priority, t1.id, t1.task
HAVING COUNT(CASE WHEN t2.status = 'done' THEN NULL ELSE 1 END) = 0 

CREATE INDEX test_index1 ON test (name,status,priority,id,task);

http://sqlfiddle.com/#!2/c912f/7

#3


1  

I have no way of testing the speed against a really large table, but this at least returns the correct answer from your smaller sample table. It should be competitive with other answers, however, as it performs only one subquery without a join:

我无法针对一个非常大的表测试速度,但这至少会从较小的样本表中返回正确的答案。但是,它应该与其他答案竞争,因为它只执行一个没有连接的子查询:

select  *
from    test  t1
where   t1.Status is null
  and exists (
    select 1
    from   test
    where  Name = t1.Name and
           Priority < t1.Priority
    group by Name
    having count(*) = sum( case when Status = 'done' then 1 else 0 end )
);

#4


0  

SELECT a.* 
  FROM test a 
  JOIN 
     ( SELECT x.name
            , MIN(x.priority) priority 
         FROM test x 
         LEFT 
         JOIN test y 
           ON y.name = x.name 
          AND y.priority < x.priority 
          AND y.status <> 'done' 
        WHERE y.id IS NULL 
          AND x.status IS NULL
        GROUP BY x.name
     ) b 
    ON b.name = a.name 
   AND b.priority = a.priority
   AND a.priority > 1;