Select first row in each GROUP BY group?

Date: 2023-01-19 13:02:49

As the title suggests, I'd like to select the first row of each set of rows grouped with a GROUP BY.

Specifically, if I've got a purchases table that looks like this:

SELECT * FROM purchases;

My Output:

id | customer | total
---+----------+------
 1 | Joe      | 5
 2 | Sally    | 3
 3 | Joe      | 2
 4 | Sally    | 1

I'd like to query for the id of the largest purchase (total) made by each customer. Something like this:

SELECT FIRST(id), customer, FIRST(total)
FROM  purchases
GROUP BY customer
ORDER BY total DESC;

Expected Output:

FIRST(id) | customer | FIRST(total)
----------+----------+-------------
        1 | Joe      | 5
        2 | Sally    | 3

10 solutions

#1


749  

On Oracle 9.2+ (not 8i+ as originally stated), SQL Server 2005+, PostgreSQL 8.4+, DB2, Firebird 3.0+, Teradata, Sybase, Vertica:

WITH summary AS (
    SELECT p.id, 
           p.customer, 
           p.total, 
           ROW_NUMBER() OVER(PARTITION BY p.customer 
                                 ORDER BY p.total DESC) AS rk
      FROM PURCHASES p)
SELECT s.*
  FROM summary s
 WHERE s.rk = 1

Supported by any database, but you need to add logic to break ties:

  SELECT MIN(x.id),  -- change to MAX if you want the highest
         x.customer, 
         x.total
    FROM PURCHASES x
    JOIN (SELECT p.customer,
                 MAX(total) AS max_total
            FROM PURCHASES p
        GROUP BY p.customer) y ON y.customer = x.customer
                              AND y.max_total = x.total
GROUP BY x.customer, x.total
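
As a quick sanity check, both variants can be run against an in-memory SQLite database via Python's sqlite3 module (the window-function form needs SQLite 3.25+, bundled with recent Python builds). A minimal sketch, using the question's sample data:

```python
import sqlite3

# The question's sample table, recreated in memory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (id INTEGER, customer TEXT, total INTEGER);
INSERT INTO purchases VALUES (1,'Joe',5),(2,'Sally',3),(3,'Joe',2),(4,'Sally',1);
""")

# Window-function variant (needs SQLite 3.25+).
rows = conn.execute("""
    WITH summary AS (
        SELECT id, customer, total,
               ROW_NUMBER() OVER (PARTITION BY customer
                                  ORDER BY total DESC) AS rk
        FROM purchases)
    SELECT id, customer, total FROM summary WHERE rk = 1
    ORDER BY customer
""").fetchall()
print(rows)   # [(1, 'Joe', 5), (2, 'Sally', 3)]

# Portable JOIN-on-MAX variant with the MIN(id) tiebreak.
rows2 = conn.execute("""
    SELECT MIN(x.id), x.customer, x.total
    FROM purchases x
    JOIN (SELECT customer, MAX(total) AS max_total
          FROM purchases GROUP BY customer) y
      ON y.customer = x.customer AND y.max_total = x.total
    GROUP BY x.customer, x.total
    ORDER BY x.customer
""").fetchall()
print(rows2)  # [(1, 'Joe', 5), (2, 'Sally', 3)]
```

On the sample data there are no ties, so both variants agree; with ties, only the MIN(x.id) variant pins down a single deterministic row per customer.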

#2


758  

In PostgreSQL this is typically simpler and faster (more performance optimization below):

SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;

Or shorter (if not as clear) with ordinal numbers of output columns:

SELECT DISTINCT ON (2)
       id, customer, total
FROM   purchases
ORDER  BY 2, 3 DESC, 1;

If total can be NULL (won't hurt either way, but you'll want to match existing indexes):

...
ORDER  BY customer, total DESC NULLS LAST, id;

Major points

  • DISTINCT ON is a PostgreSQL extension of the standard (where only DISTINCT on the whole SELECT list is defined).

  • List any number of expressions in the DISTINCT ON clause, the combined row value defines duplicates. The manual:

    Obviously, two rows are considered distinct if they differ in at least one column value. Null values are considered equal in this comparison.

    Bold emphasis mine.

  • DISTINCT ON can be combined with ORDER BY. Leading expressions have to match leading DISTINCT ON expressions in the same order. You can add additional expressions to ORDER BY to pick a particular row from each group of peers. I added id as last item to break ties:

    "Pick the row with the smallest id from each group sharing the highest total."

    If total can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST like demonstrated. Details:

  • The SELECT list is not constrained by expressions in DISTINCT ON or ORDER BY in any way. (Not needed in the simple case above):

    • You don't have to include any of the expressions in DISTINCT ON or ORDER BY.

    • You can include any other expression in the SELECT list. This is instrumental for replacing much more complex queries with subqueries and aggregate / window functions.

  • I tested with versions 8.3 – 10. But the feature has been there at least since version 7.1, so basically always.

Index

The perfect index for the above query would be a multi-column index spanning all three columns in matching sequence and with matching sort order:

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

May be too specialized for real world applications. But use it if read performance is crucial. If you have DESC NULLS LAST in the query, use the same in the index so Postgres knows sort order matches.

Effectiveness / Performance optimization

You have to weigh cost and benefit before you create a tailored index for every query. The potential of above index largely depends on data distribution.

The index is used because it delivers pre-sorted data, and in Postgres 9.2 or later the query can also benefit from an index only scan if the index is smaller than the underlying table. The index has to be scanned in its entirety, though.

  • For few rows per customer, this is very efficient (even more so if you need sorted output anyway). The benefit shrinks with a growing number of rows per customer.
    Ideally, you have enough work_mem to process the involved sort step in RAM and not spill to disk. Generally setting work_mem too high can have adverse effects. Consider SET LOCAL for singular queries on big sets. Find how much you need with EXPLAIN ANALYZE. Mention of "Disk:" in the sort step indicates the need for more:

  • For many rows per customer, a loose index scan would be (much) more efficient, but that's not currently implemented in Postgres (up to v10).
    There are faster query techniques to substitute for this. In particular if you have a separate table holding unique customers, which is the typical use case. But also if you don't:

Benchmark

I had a simple benchmark here for Postgres 9.1, which was outdated by 2016. So I ran a new one with a better, reproducible setup for Postgres 9.4 and 9.5 and added the detailed results in another answer.

#3


69  

Benchmark

Testing the most interesting candidates with Postgres 9.4 and 9.5 with a halfway realistic table of 200k rows in purchases and 10k distinct customer_id (avg. 20 rows per customer).

For Postgres 9.5 I ran a 2nd test with effectively 86446 distinct customers. See below (avg. 2.3 rows per customer).

Setup

Main table

CREATE TABLE purchases (
  id          serial
, customer_id int  -- REFERENCES customer
, total       int  -- could be amount of money in Cent
, some_column text -- to make the row bigger, more realistic
);

I use a serial (PK constraint added below) and an integer customer_id since that's a more typical setup. Also added some_column to make up for typically more columns.

Dummy data, PK, index - a typical table also has some dead tuples:

INSERT INTO purchases (customer_id, total, some_column)    -- insert 200k rows
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total     
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,200000) g;

ALTER TABLE purchases ADD CONSTRAINT purchases_id_pkey PRIMARY KEY (id);

DELETE FROM purchases WHERE random() > 0.9; -- some dead rows

INSERT INTO purchases (customer_id, total, some_column)
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total     
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,20000) g;  -- add 20k to make it ~ 200k

CREATE INDEX purchases_3c_idx ON purchases (customer_id, total DESC, id);

VACUUM ANALYZE purchases;

customer table - for superior query

CREATE TABLE customer AS
SELECT customer_id, 'customer_' || customer_id AS customer
FROM   purchases
GROUP  BY 1
ORDER  BY 1;

ALTER TABLE customer ADD CONSTRAINT customer_customer_id_pkey PRIMARY KEY (customer_id);

VACUUM ANALYZE customer;

In my second test for 9.5 I used the same setup, but with random() * 100000 to generate customer_id to get only few rows per customer_id.

Object sizes for table purchases

Generated with this query.

               what                | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+----------+--------------+---------------
 core_relation_size                | 20496384 | 20 MB        |           102
 visibility_map                    |        0 | 0 bytes      |             0
 free_space_map                    |    24576 | 24 kB        |             0
 table_size_incl_toast             | 20529152 | 20 MB        |           102
 indexes_size                      | 10977280 | 10 MB        |            54
 total_size_incl_toast_and_indexes | 31506432 | 30 MB        |           157
 live_rows_in_text_representation  | 13729802 | 13 MB        |            68
 ------------------------------    |          |              |
 row_count                         |   200045 |              |
 live_tuples                       |   200045 |              |
 dead_tuples                       |    19955 |              |

Queries

1. row_number() in CTE, (see other answer)

WITH cte AS (
   SELECT id, customer_id, total
        , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   )
SELECT id, customer_id, total
FROM   cte
WHERE  rn = 1;

2. row_number() in subquery (my optimization)

SELECT id, customer_id, total
FROM   (
   SELECT id, customer_id, total
        , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   ) sub
WHERE  rn = 1;

3. DISTINCT ON (see other answer)

SELECT DISTINCT ON (customer_id)
       id, customer_id, total
FROM   purchases
ORDER  BY customer_id, total DESC, id;

4. rCTE with LATERAL subquery (see here)

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, customer_id, total
   FROM   purchases
   ORDER  BY customer_id, total DESC
   LIMIT  1
   )
   UNION ALL
   SELECT u.*
   FROM   cte c
   ,      LATERAL (
      SELECT id, customer_id, total
      FROM   purchases
      WHERE  customer_id > c.customer_id  -- lateral reference
      ORDER  BY customer_id, total DESC
      LIMIT  1
      ) u
   )
SELECT id, customer_id, total
FROM   cte
ORDER  BY customer_id;

5. customer table with LATERAL (see here)

SELECT l.*
FROM   customer c
,      LATERAL (
   SELECT id, customer_id, total
   FROM   purchases
   WHERE  customer_id = c.customer_id  -- lateral reference
   ORDER  BY total DESC
   LIMIT  1
   ) l;

6. array_agg() with ORDER BY (see other answer)

SELECT (array_agg(id ORDER BY total DESC))[1] AS id
     , customer_id
     , max(total) AS total
FROM   purchases
GROUP  BY customer_id;

Results

Execution time for above queries with EXPLAIN ANALYZE (and all options off), best of 5 runs.

All queries used an Index Only Scan on purchases_3c_idx (among other steps); some of them benefit just from the smaller size of the index, others more substantially.

A. Postgres 9.4 with 200k rows and ~ 20 per customer_id

1. 273.274 ms  
2. 194.572 ms  
3. 111.067 ms  
4.  92.922 ms  
5.  37.679 ms  -- winner
6. 189.495 ms

B. The same with Postgres 9.5

1. 288.006 ms
2. 223.032 ms  
3. 107.074 ms  
4.  78.032 ms  
5.  33.944 ms  -- winner
6. 211.540 ms  

C. Same as B., but with ~ 2.3 rows per customer_id

1. 381.573 ms
2. 311.976 ms
3. 124.074 ms  -- winner
4. 710.631 ms
5. 311.976 ms
6. 421.679 ms

Original (outdated) benchmark from 2011

I ran three tests with PostgreSQL 9.1 on a real life table of 65579 rows and single-column btree indexes on each of the three columns involved and took the best execution time of 5 runs.
Comparing @OMGPonies' first query (A) to the above DISTINCT ON solution (B):

  1. Select the whole table, results in 5958 rows in this case.

    A: 567.218 ms
    B: 386.673 ms
    
  2. Use condition WHERE customer BETWEEN x AND y resulting in 1000 rows.

    A: 249.136 ms
    B:  55.111 ms
    
  3. Select a single customer with WHERE customer = x.

    A:   0.143 ms
    B:   0.072 ms
    

Same test repeated with the index described in the other answer

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

1A: 277.953 ms  
1B: 193.547 ms

2A: 249.796 ms -- special index not used  
2B:  28.679 ms

3A:   0.120 ms  
3B:   0.048 ms

#4


35  

This is a common problem, which already has well-tested and highly optimized solutions. Personally, I prefer the LEFT JOIN solution by Bill Karwin (the original post, with lots of other solutions).

Note that a bunch of solutions to this common problem can surprisingly be found in one of the most official sources, the MySQL manual! See Examples of Common Queries :: The Rows Holding the Group-wise Maximum of a Certain Column.

#5


17  

In Postgres you can use array_agg like this:

SELECT  customer,
        (array_agg(id ORDER BY total DESC))[1],
        max(total)
FROM purchases
GROUP BY customer

This will give you the id of each customer's largest purchase.

Some things to note:

  • array_agg is an aggregate function, so it works with GROUP BY.
  • array_agg lets you specify an ordering scoped to just itself, so it doesn't constrain the structure of the whole query. There is also syntax for how you sort NULLs, if you need to do something different from the default.
  • Once we build the array, we take the first element. (Postgres arrays are 1-indexed, not 0-indexed).
  • You could use array_agg in a similar way for your third output column, but max(total) is simpler.
  • Unlike DISTINCT ON, using array_agg lets you keep your GROUP BY, in case you want that for other reasons.
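
The "order inside the aggregate, then take the first element" idea can be mimicked in plain Python, which may help when reasoning about what array_agg builds per group (hypothetical in-memory rows, not Postgres):

```python
from itertools import groupby

# Hypothetical in-memory rows: (id, customer, total).
purchases = [(1, "Joe", 5), (2, "Sally", 3), (3, "Joe", 2), (4, "Sally", 1)]

result = {}
for customer, grp in groupby(sorted(purchases, key=lambda r: r[1]),
                             key=lambda r: r[1]):
    grp = list(grp)
    # Analogue of (array_agg(id ORDER BY total DESC))[1] ...
    ids_by_total_desc = [r[0] for r in sorted(grp, key=lambda r: -r[2])]
    # ... combined with max(total) per group.
    result[customer] = (ids_by_total_desc[0], max(r[2] for r in grp))

print(result)  # {'Joe': (1, 5), 'Sally': (2, 3)}
```
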

#6


11  

This solution is not very efficient, as pointed out by Erwin, due to the presence of subqueries:

select * from purchases p1 where total in
(select max(total) from purchases where p1.customer=customer) order by total desc;
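
A minimal sketch of this correlated-subquery form, run against SQLite via Python's sqlite3 on the question's sample data (note that it keeps all tied rows, since there is no tiebreak):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (id INTEGER, customer TEXT, total INTEGER);
INSERT INTO purchases VALUES (1,'Joe',5),(2,'Sally',3),(3,'Joe',2),(4,'Sally',1);
""")

# The correlated subquery is re-evaluated per outer row.
rows = conn.execute("""
    SELECT * FROM purchases p1
    WHERE total IN (SELECT MAX(total) FROM purchases
                    WHERE p1.customer = customer)
    ORDER BY total DESC
""").fetchall()
print(rows)  # [(1, 'Joe', 5), (2, 'Sally', 3)]
```
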

#7


6  

I use this approach (PostgreSQL only): https://wiki.postgresql.org/wiki/First/last_%28aggregate%29

-- Create a function that always returns the first non-NULL item
CREATE OR REPLACE FUNCTION public.first_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
        SELECT $1;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.first (
        sfunc    = public.first_agg,
        basetype = anyelement,
        stype    = anyelement
);

-- Create a function that always returns the last non-NULL item
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
        SELECT $2;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.last (
        sfunc    = public.last_agg,
        basetype = anyelement,
        stype    = anyelement
);

Then your example should work almost as is:

SELECT FIRST(id), customer, FIRST(total)
FROM  purchases
GROUP BY customer
ORDER BY FIRST(total) DESC;

CAVEAT: It ignores NULL values.


Edit 1 - Use the postgres extension instead

Now I use this way: http://pgxn.org/dist/first_last_agg/

To install on ubuntu 14.04:

apt-get install postgresql-server-dev-9.3 git build-essential -y
git clone git://github.com/wulczer/first_last_agg.git
cd first_last_agg
make && sudo make install
psql -c 'create extension first_last_agg'

It's a postgres extension that gives you first and last functions; apparently faster than the above way.


Edit 2 - Ordering and filtering

If you use aggregate functions (like these), you can order the results, without the need to have the data already ordered:

http://www.postgresql.org/docs/current/static/sql-expressions.html#SYNTAX-AGGREGATES

So the equivalent example, with ordering, would be something like:

SELECT first(id order by id), customer, first(total order by id)
  FROM purchases
 GROUP BY customer
 ORDER BY first(total);

Of course you can order and filter as you deem fit within the aggregate; it's very powerful syntax.

#8


5  

Very fast solution

SELECT a.* 
FROM
    purchases a 
    JOIN ( 
        SELECT customer, min( id ) as id 
        FROM purchases 
        GROUP BY customer 
    ) b USING ( id );

and really very fast if table is indexed by id:

create index purchases_id on purchases (id);

#9


2  

The accepted "Supported by any database" solution from OMG Ponies has good speed in my test.

Here I provide the same approach, but as a more complete and clean any-database solution. Ties are considered (assuming the desire to get only one row for each customer, even when there are multiple records with the max total per customer), and other purchase fields (e.g. purchase_payment_id) will be selected for the real matching rows in the purchase table.

Supported by any database:

select * from purchase
join (
    select min(id) as id from purchase
    join (
        select customer, max(total) as total from purchase
        group by customer
    ) t1 using (customer, total)
    group by customer
) t2 using (id)
order by customer

This query is reasonably fast especially when there is a composite index like (customer, total) on the purchase table.

Remark:

  1. t1, t2 are subquery aliases, which could be removed depending on the database.

  2. Caveat: the USING (...) clause is currently not supported in MS-SQL and Oracle DB as of this edit in Jan 2017. You have to expand it yourself to e.g. ON t2.id = purchase.id etc. The USING syntax works in SQLite, MySQL and PostgreSQL.

#10


2  

The Query:

SELECT purchases.*
FROM purchases
LEFT JOIN purchases as p 
ON 
  p.customer = purchases.customer 
  AND 
  purchases.total < p.total
WHERE p.total IS NULL

HOW DOES THAT WORK? (I've been there)

We want to make sure that we only keep the row with the highest total for each customer.


Some Theoretical Stuff (skip this part if you only want to understand the query)

Let Total be a function T(customer, id) that returns the total for a given customer and id. To prove that a given total T(customer, id) is the highest, we want to prove either:

  • ∀x T(customer,id) > T(customer,x) (this total is higher than all other totals for that customer)

OR

  • ¬∃x T(customer,id) < T(customer,x) (there exists no higher total for that customer)

The first approach would require us to get all the records for that customer, which I do not really like.

The second one will need a smart way to say there can be no record higher than this one.


Back to SQL

If we LEFT JOIN the table to itself on the customer, requiring the original row's total to be less than the joined row's total:

      LEFT JOIN purchases as p 
      ON 
      p.customer = purchases.customer 
      AND 
      purchases.total < p.total

we make sure that every record which has another record with a higher total for the same customer gets joined to it:

purchases.id, purchases.customer, purchases.total, p.id, p.customer, p.total
1           , Tom           , 200             , 2   , Tom   , 300
2           , Tom           , 300
3           , Bob           , 400             , 4   , Bob   , 500
4           , Bob           , 500
5           , Alice         , 600             , 6   , Alice   , 700
6           , Alice         , 700

That will help us filter for the highest total for each customer, with no grouping needed:

WHERE p.total IS NULL

purchases.id, purchases.customer, purchases.total, p.id, p.customer, p.total
2           , Tom           , 300
4           , Bob           , 500
6           , Alice         , 700

And that's the answer we need.

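A minimal sketch of this self-LEFT-JOIN filter, run on the question's sample data with SQLite via Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (id INTEGER, customer TEXT, total INTEGER);
INSERT INTO purchases VALUES (1,'Joe',5),(2,'Sally',3),(3,'Joe',2),(4,'Sally',1);
""")

# Rows with no higher-total partner row survive the anti-join filter.
rows = conn.execute("""
    SELECT purchases.*
    FROM purchases
    LEFT JOIN purchases AS p
      ON p.customer = purchases.customer AND purchases.total < p.total
    WHERE p.total IS NULL
    ORDER BY purchases.customer
""").fetchall()
print(rows)  # [(1, 'Joe', 5), (2, 'Sally', 3)]
```
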
#1


749  

On Oracle 9.2+ (not 8i+ as originally stated), SQL Server 2005+, PostgreSQL 8.4+, DB2, Firebird 3.0+, Teradata, Sybase, Vertica:

WITH summary AS (
    SELECT p.id, 
           p.customer, 
           p.total, 
           ROW_NUMBER() OVER(PARTITION BY p.customer 
                                 ORDER BY p.total DESC) AS rk
      FROM PURCHASES p)
SELECT s.*
  FROM summary s
 WHERE s.rk = 1

Supported by any database:

But you need to add logic to break ties:

但是你需要添加逻辑来打破联系:

  SELECT MIN(x.id),  -- change to MAX if you want the highest
         x.customer, 
         x.total
    FROM PURCHASES x
    JOIN (SELECT p.customer,
                 MAX(total) AS max_total
            FROM PURCHASES p
        GROUP BY p.customer) y ON y.customer = x.customer
                              AND y.max_total = x.total
GROUP BY x.customer, x.total

#2


758  

In PostgreSQL this is typically simpler and faster (more performance optimization below):

在PostgreSQL中,这通常是更简单和更快的(更多的性能优化):

SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;

Or shorter (if not as clear) with ordinal numbers of output columns:

或更短(如果不清楚)与序数的输出列:

SELECT DISTINCT ON (2)
       id, customer, total
FROM   purchases
ORDER  BY 2, 3 DESC, 1;

If total can be NULL (won't hurt either way, but you'll want to match existing indexes):

如果total可以为NULL(不会受到任何影响,但您将希望匹配现有索引):

...
ORDER  BY customer, total DESC NULLS LAST, id;

Major points

  • DISTINCT ON is a PostgreSQL extension of the standard (where only DISTINCT on the whole SELECT list is defined).

    不同的是标准的PostgreSQL扩展(在整个选择列表中只定义了不同的地方)。

  • List any number of expressions in the DISTINCT ON clause, the combined row value defines duplicates. The manual:

    在不同的子句中列出任意数量的表达式,组合行值定义了重复项。手册:

    Obviously, two rows are considered distinct if they differ in at least one column value. Null values are considered equal in this comparison.

    显然,如果两行在至少一个列值上有差异,则被认为是不同的。在这个比较中,空值被认为是相等的。

    Bold emphasis mine.

    我大胆的重点。

  • DISTINCT ON can be combined with ORDER BY. Leading expressions have to match leading DISTINCT ON expressions in the same order. You can add additional expressions to ORDER BY to pick a particular row from each group of peers. I added id as last item to break ties:

    不同的可以结合顺序。在相同的顺序中,引导表达式必须匹配不同的表达式。您可以添加额外的表达式,以从每组对等点中选择一个特定的行。我添加了id作为最后一项来断开连接:

    "Pick the row with the smallest id from each group sharing the highest total."

    “从每个组中选择最小id的行,共享最高的总数。”

    If total can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST like demonstrated. Details:

    如果total可以为NULL,那么您很可能希望该行具有最大的非空值。最后像演示一样添加null。细节:

  • The SELECT list is not constrained by expressions in DISTINCT ON or ORDER BY in any way. (Not needed in the simple case above):

    选择列表不受不同的表达式或任何方式的顺序的约束。(上述简单情况不需要):

    • You don't have to include any of the expressions in DISTINCT ON or ORDER BY.

      您不需要将任何表达式包含在不同的或ORDER BY中。

    • You can include any other expression in the SELECT list. This is instrumental for replacing much more complex queries with subqueries and aggregate / window functions.

      您可以在选择列表中包含任何其他表达式。这有助于用子查询和聚合/窗口函数替换更复杂的查询。

  • I tested with versions 8.3 – 10. But the feature has been there at least since version 7.1, so basically always.

    我测试了版本8.3 - 10。但至少从7.1版本开始,这个功能就一直存在,所以基本上一直如此。

Index

The perfect index for the above query would be a multi-column index spanning all three columns in matching sequence and with matching sort order:

上述查询的完美索引将是一个多列索引,该索引涵盖匹配序列中的所有三列,以及匹配排序顺序:

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

May be too specialized for real world applications. But use it if read performance is crucial. If you have DESC NULLS LAST in the query, use the same in the index so Postgres knows sort order matches.

对于现实世界的应用程序来说,可能太专业了。但是,如果阅读性能是至关重要的,请使用它。如果在查询中有DESC NULLS,那么在索引中使用相同的代码,所以Postgres知道排序匹配。

Effectiveness / Performance optimization

You have to weigh cost and benefit before you create a tailored index for every query. The potential of above index largely depends on data distribution.

在为每个查询创建一个定制的索引之前,您必须权衡成本和收益。上述指标的潜力很大程度上取决于数据的分布。

The index is used because it delivers pre-sorted data, and in Postgres 9.2 or later the query can also benefit from an index only scan if the index is smaller than the underlying table. The index has to be scanned in its entirety, though.

使用索引是因为它提供预先排序的数据,在Postgres 9.2或更高版本中,如果索引比底层表小,查询也可以从索引中获益。不过,该指数必须全部扫描。

  • For few rows per customer, this is very efficient (even more so if you need sorted output anyway). The benefit shrinks with a growing number of rows per customer.
    Ideally, you have enough work_mem to process the involved sort step in RAM and not spill to disk. Generally setting work_mem too high can have adverse effects. Consider SET LOCAL for singular queries on big sets. Find how much you need with EXPLAIN ANALYZE. Mention of "Disk:" in the sort step indicates the need for more:

    对于每个客户的几行,这是非常有效的(如果您需要排序输出的话,更是如此)。随着每个客户的数量不断增加,收益也会减少。理想情况下,您有足够的work_mem来处理内存中涉及的排序步骤,而不会溢出到磁盘。一般情况下,过高的工件会产生不利的影响。考虑在大集合上设置单一查询的本地设置。找出你需要多少解释分析。提到“磁盘”,在排序步骤中表示需要更多:

  • For many rows per customer, a loose index scan would be (much) more efficient, but that's not currently implemented in Postgres (up to v10).
    There are faster query techniques to substitute for this. In particular if you have a separate table holding unique customers, which is the typical use case. But also if you don't:

    对于每个客户的许多行,一个松散的索引扫描将会(非常)高效,但这并不是在Postgres(直到v10)中实现的。有更快的查询技术来替代它。特别是如果您有一个单独的表,它拥有独特的客户,这是典型的用例。但如果你没有:

Benchmark

I had a simple benchmark here for Postgres 9.1, which was outdated by 2016. So I ran a new one with a better, reproducible setup for Postgres 9.4 and 9.5 and added the detailed results in another answer.

对于Postgres 9.1,我有一个简单的基准,到2016年已经过时了。因此,我运行了一个新的更好的、可复制的Postgres 9.4和9.5的设置,并在另一个答案中添加了详细的结果。

#3


69  

Benchmark

Testing the most interesting candidates with Postgres 9.4 and 9.5 with a halfway realistic table of 200k rows in purchases and 10k distinct customer_id (avg. 20 rows per customer).

用Postgres 9.4和9.5测试最有趣的候选者,其中包含了200k行和10k截然不同的customer_id(每个客户20行)的折中方案。

For Postgres 9.5 I ran a 2nd test with effectively 86446 distinct customers. See below (avg. 2.3 rows per customer).

对于Postgres 9.5,我对86446个不同的客户进行了第二次测试。见下文(avg. 2.3每客户行)。

Setup

Main table

主要表

CREATE TABLE purchases (
  id          serial
, customer_id int  -- REFERENCES customer
, total       int  -- could be amount of money in Cent
, some_column text -- to make the row bigger, more realistic
);

I use a serial (PK constraint added below) and an integer customer_id since that's a more typical setup. Also added some_column to make up for typically more columns.

我使用了一个串行(PK约束)和一个整数customer_id,因为这是一个更典型的设置。还添加了some_column以弥补通常更多的列。

Dummy data, PK, index - a typical table also has some dead tuples:

虚拟数据,PK,索引-一个典型的表也有一些死的元组:

INSERT INTO purchases (customer_id, total, some_column)    -- insert 200k rows
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total     
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,200000) g;

ALTER TABLE purchases ADD CONSTRAINT purchases_id_pkey PRIMARY KEY (id);

DELETE FROM purchases WHERE random() > 0.9; -- some dead rows

INSERT INTO purchases (customer_id, total, some_column)
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total     
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,20000) g;  -- add 20k to make it ~ 200k

CREATE INDEX purchases_3c_idx ON purchases (customer_id, total DESC, id);

VACUUM ANALYZE purchases;

customer table - for superior query

客户表-高级查询。

CREATE TABLE customer AS
SELECT customer_id, 'customer_' || customer_id AS customer
FROM   purchases
GROUP  BY 1
ORDER  BY 1;

ALTER TABLE customer ADD CONSTRAINT customer_customer_id_pkey PRIMARY KEY (customer_id);

VACUUM ANALYZE customer;

In my second test for 9.5 I used the same setup, but with random() * 100000 to generate customer_id to get only few rows per customer_id.

在我对9.5的第二次测试中,我使用了相同的设置,但是使用random() * 100000来生成customer_id,以获得每个customer_id的几行。

Object sizes for table purchases

Generated with this query.

生成的查询。

               what                | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+----------+--------------+---------------
 core_relation_size                | 20496384 | 20 MB        |           102
 visibility_map                    |        0 | 0 bytes      |             0
 free_space_map                    |    24576 | 24 kB        |             0
 table_size_incl_toast             | 20529152 | 20 MB        |           102
 indexes_size                      | 10977280 | 10 MB        |            54
 total_size_incl_toast_and_indexes | 31506432 | 30 MB        |           157
 live_rows_in_text_representation  | 13729802 | 13 MB        |            68
 ------------------------------    |          |              |
 row_count                         |   200045 |              |
 live_tuples                       |   200045 |              |
 dead_tuples                       |    19955 |              |

Queries

1. row_number() in CTE, (see other answer)

WITH cte AS (
   SELECT id, customer_id, total
        , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   )
SELECT id, customer_id, total
FROM   cte
WHERE  rn = 1;

2. row_number() in subquery (my optimization)

SELECT id, customer_id, total
FROM   (
   SELECT id, customer_id, total
        , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   ) sub
WHERE  rn = 1;

3. DISTINCT ON (see other answer)

SELECT DISTINCT ON (customer_id)
       id, customer_id, total
FROM   purchases
ORDER  BY customer_id, total DESC, id;

4. rCTE with LATERAL subquery (see here)

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, customer_id, total
   FROM   purchases
   ORDER  BY customer_id, total DESC
   LIMIT  1
   )
   UNION ALL
   SELECT u.*
   FROM   cte c
   ,      LATERAL (
      SELECT id, customer_id, total
      FROM   purchases
      WHERE  customer_id > c.customer_id  -- lateral reference
      ORDER  BY customer_id, total DESC
      LIMIT  1
      ) u
   )
SELECT id, customer_id, total
FROM   cte
ORDER  BY customer_id;

5. customer table with LATERAL (see here)

SELECT l.*
FROM   customer c
,      LATERAL (
   SELECT id, customer_id, total
   FROM   purchases
   WHERE  customer_id = c.customer_id  -- lateral reference
   ORDER  BY total DESC
   LIMIT  1
   ) l;

6. array_agg() with ORDER BY (see other answer)

SELECT (array_agg(id ORDER BY total DESC))[1] AS id
     , customer_id
     , max(total) AS total
FROM   purchases
GROUP  BY customer_id;

Results

Execution time for above queries with EXPLAIN ANALYZE (and all options off), best of 5 runs.

All queries used an Index Only Scan on purchases_3c_idx (among other steps). Some of them only benefit from the smaller size of the index, others use it more effectively.

A. Postgres 9.4 with 200k rows and ~ 20 per customer_id

1. 273.274 ms  
2. 194.572 ms  
3. 111.067 ms  
4.  92.922 ms  
5.  37.679 ms  -- winner
6. 189.495 ms

B. The same with Postgres 9.5

1. 288.006 ms
2. 223.032 ms  
3. 107.074 ms  
4.  78.032 ms  
5.  33.944 ms  -- winner
6. 211.540 ms  

C. Same as B., but with ~ 2.3 rows per customer_id

1. 381.573 ms
2. 311.976 ms
3. 124.074 ms  -- winner
4. 710.631 ms
5. 311.976 ms
6. 421.679 ms

Original (outdated) benchmark from 2011

I ran three tests with PostgreSQL 9.1 on a real life table of 65579 rows and single-column btree indexes on each of the three columns involved and took the best execution time of 5 runs.
Comparing @OMGPonies' first query (A) to the above DISTINCT ON solution (B):

  1. Select the whole table. It results in 5958 rows in this case.

    A: 567.218 ms
    B: 386.673 ms
    
  2. Use condition WHERE customer BETWEEN x AND y resulting in 1000 rows.

    A: 249.136 ms
    B:  55.111 ms
    
  3. Select a single customer with WHERE customer = x.

    A:   0.143 ms
    B:   0.072 ms
    

Same test repeated with the index described in the other answer

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

1A: 277.953 ms  
1B: 193.547 ms

2A: 249.796 ms -- special index not used  
2B:  28.679 ms

3A:   0.120 ms  
3B:   0.048 ms

#4


35  

This is a common problem which already has well-tested and highly optimized solutions. Personally I prefer the left join solution by Bill Karwin (the original post with lots of other solutions).

Note that a bunch of solutions to this common problem can, surprisingly enough, be found in one of the most official sources, the MySQL manual! See Examples of Common Queries :: The Rows Holding the Group-wise Maximum of a Certain Column.
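Since this answer shows no code of its own, here is a minimal runnable sketch of the left-join pattern it prefers, driven through Python's stdlib sqlite3 for portability (the table and data mirror the question; a real tie would need an extra tie-breaker on id):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER);
INSERT INTO purchases VALUES (1,'Joe',5),(2,'Sally',3),(3,'Joe',2),(4,'Sally',1);
""")
rows = conn.execute("""
SELECT p1.id, p1.customer, p1.total
FROM   purchases p1
LEFT   JOIN purchases p2
       ON  p2.customer = p1.customer
       AND p2.total > p1.total   -- a bigger purchase by the same customer
WHERE  p2.id IS NULL             -- none exists: p1 holds the group maximum
ORDER  BY p1.customer
""").fetchall()
print(rows)  # [(1, 'Joe', 5), (2, 'Sally', 3)]
```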

#5


17  

In Postgres you can use array_agg like this:

SELECT  customer,
        (array_agg(id ORDER BY total DESC))[1],
        max(total)
FROM purchases
GROUP BY customer

This will give you the id of each customer's largest purchase.

Some things to note:

  • array_agg is an aggregate function, so it works with GROUP BY.
  • array_agg lets you specify an ordering scoped to just itself, so it doesn't constrain the structure of the whole query. There is also syntax for how you sort NULLs, if you need to do something different from the default.
  • Once we build the array, we take the first element. (Postgres arrays are 1-indexed, not 0-indexed).
  • You could use array_agg in a similar way for your third output column, but max(total) is simpler.
  • Unlike DISTINCT ON, using array_agg lets you keep your GROUP BY, in case you want that for other reasons.

#6


11  

This solution is not very efficient, as pointed out by Erwin, because of the presence of subqueries:

SELECT *
FROM   purchases p1
WHERE  total IN (SELECT max(total)
                 FROM   purchases
                 WHERE  customer = p1.customer)
ORDER  BY total DESC;
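As a sanity check, the query runs unchanged in SQLite; a sketch via Python's sqlite3, using the sample data from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER);
INSERT INTO purchases VALUES (1,'Joe',5),(2,'Sally',3),(3,'Joe',2),(4,'Sally',1);
""")
rows = conn.execute("""
SELECT * FROM purchases p1
WHERE  total IN (SELECT max(total) FROM purchases WHERE customer = p1.customer)
ORDER  BY total DESC
""").fetchall()
print(rows)  # [(1, 'Joe', 5), (2, 'Sally', 3)]
```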

#7


6  

I use this way (postgresql only): https://wiki.postgresql.org/wiki/First/last_%28aggregate%29

-- Create a function that always returns the first non-NULL item
CREATE OR REPLACE FUNCTION public.first_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
        SELECT $1;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.first (
        sfunc    = public.first_agg,
        basetype = anyelement,
        stype    = anyelement
);

-- Create a function that always returns the last non-NULL item
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
        SELECT $2;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.last (
        sfunc    = public.last_agg,
        basetype = anyelement,
        stype    = anyelement
);

Then your example should work almost as is:

SELECT FIRST(id), customer, FIRST(total)
FROM  purchases
GROUP BY customer
ORDER BY FIRST(total) DESC;

CAVEAT: It ignores NULL rows.


Edit 1 - Use the postgres extension instead

Now I use this way: http://pgxn.org/dist/first_last_agg/

To install on ubuntu 14.04:

apt-get install postgresql-server-dev-9.3 git build-essential -y
git clone git://github.com/wulczer/first_last_agg.git
cd first_last_agg
make && sudo make install
psql -c 'create extension first_last_agg'

It's a postgres extension that gives you first and last functions; apparently faster than the above way.


Edit 2 - Ordering and filtering

If you use aggregate functions (like these), you can order the results, without the need to have the data already ordered:

http://www.postgresql.org/docs/current/static/sql-expressions.html#SYNTAX-AGGREGATES

So the equivalent example, with ordering, would be something like:

SELECT first(id order by id), customer, first(total order by id)
  FROM purchases
 GROUP BY customer
 ORDER BY first(total);

Of course you can order and filter as you deem fit within the aggregate; it's very powerful syntax.

#8


5  

Very fast solution

SELECT a.* 
FROM
    purchases a 
    JOIN ( 
        SELECT customer, min( id ) as id 
        FROM purchases 
        GROUP BY customer 
    ) b USING ( id );

and it is really very fast if the table is indexed by id. (Note, though, that min(id) picks each customer's earliest purchase, not necessarily the one with the highest total; it matches the question's expected output only when those coincide.)

create index purchases_id on purchases (id);
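A quick check of what this actually returns, sketched via Python's sqlite3: it picks each customer's lowest-id row, which matches the question's expected output here only because the first purchase per customer also happens to be the largest:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER);
INSERT INTO purchases VALUES (1,'Joe',5),(2,'Sally',3),(3,'Joe',2),(4,'Sally',1);
""")
rows = conn.execute("""
SELECT a.*
FROM   purchases a
JOIN  (SELECT customer, min(id) AS id
       FROM   purchases
       GROUP  BY customer) b USING (id)
ORDER  BY a.id
""").fetchall()
print(rows)  # [(1, 'Joe', 5), (2, 'Sally', 3)]
```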

#9


2  

The accepted "Supported by any database" solution by OMG Ponies has good speed in my test.

在我的测试中,被接受的OMG Ponies“支持的任何数据库”解决方案都有良好的速度。

Here I provide a same-approach, but more complete and clean, any-database solution. Ties are considered (assuming the desire to get only one row for each customer, even when there are multiple records with the max total per customer), and other purchase fields (e.g. purchase_payment_id) will be selected for the real matching rows in the purchase table.

Supported by any database:

select * from purchase
join (
    select min(id) as id from purchase
    join (
        select customer, max(total) as total from purchase
        group by customer
    ) t1 using (customer, total)
    group by customer
) t2 using (id)
order by customer

This query is reasonably fast especially when there is a composite index like (customer, total) on the purchase table.

Remark:

  1. t1, t2 are subquery aliases which could be removed depending on the database.

  2. Caveat: the USING (...) clause is not supported in MS-SQL and Oracle as of this edit in Jan 2017. You have to expand it yourself, e.g. to ON t2.id = purchase.id. The USING syntax works in SQLite, MySQL and PostgreSQL.
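To make the tie-handling concrete, here is the full query run end to end via Python's sqlite3 (SQLite supports USING; a deliberate tie row, id 5 for Joe, is added so min(id) has work to do):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchase (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER);
INSERT INTO purchase VALUES
  (1,'Joe',5),(2,'Sally',3),(3,'Joe',2),(4,'Sally',1),
  (5,'Joe',5);  -- tie with id 1 on Joe's max total
""")
rows = conn.execute("""
SELECT * FROM purchase
JOIN (SELECT min(id) AS id FROM purchase
      JOIN (SELECT customer, max(total) AS total
            FROM   purchase GROUP BY customer) t1 USING (customer, total)
      GROUP BY customer) t2 USING (id)
ORDER BY customer
""").fetchall()
print(rows)  # [(1, 'Joe', 5), (2, 'Sally', 3)]  one row per customer, tie broken by min(id)
```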

#10


2  

The Query:

SELECT purchases.*
FROM purchases
LEFT JOIN purchases as p 
ON 
  p.customer = purchases.customer 
  AND 
  purchases.total < p.total
WHERE p.total IS NULL

HOW DOES THAT WORK! (I've been there)

We want to make sure that we only keep the row with the highest total for each customer.


Some Theoretical Stuff (skip this part if you only want to understand the query)

Let Total be a function T(customer, id) that returns the total given the customer and id. To prove that a given total T(customer, id) is the highest, we want to prove either:

  • ∀x T(customer,id) > T(customer,x) (this total is higher than all other totals for that customer)

OR

  • ¬∃x T(customer, id) < T(customer, x) (there exists no higher total for that customer)

The first approach will need us to get all the records for that customer, which I do not really like.

The second one will need a smart way to say there can be no record higher than this one.


Back to SQL

If we left join the table on the customer, and on the total being less than the joined table's total:

      LEFT JOIN purchases as p 
      ON 
      p.customer = purchases.customer 
      AND 
      purchases.total < p.total

we make sure that every record that has another record with a higher total for the same customer gets joined:

purchases.id, purchases.customer, purchases.total, p.id, p.customer, p.total
1           , Tom               , 200            , 2   , Tom       , 300
2           , Tom               , 300            ,     ,           ,
3           , Bob               , 400            , 4   , Bob       , 500
4           , Bob               , 500            ,     ,           ,
5           , Alice             , 600            , 6   , Alice     , 700
6           , Alice             , 700            ,     ,           ,

That lets us filter for the highest total for each customer, with no grouping needed:

WHERE p.total IS NULL

purchases.id, purchases.customer, purchases.total, p.id, p.customer, p.total
2           , Tom               , 300            ,     ,           ,
4           , Bob               , 500            ,     ,           ,
6           , Alice             , 700            ,     ,           ,

And that's the answer we need.

这就是我们需要的答案。