Updating a PostgreSQL table with custom values

Time: 2021-02-18 00:04:32

I am attempting to update multiple columns on a table with values from another row in the same table:


CREATE TEMP TABLE person (
  pid INT
 ,name VARCHAR(40)
 ,dob DATE
 ,younger_sibling_name VARCHAR(40)
 ,younger_sibling_dob DATE
);

INSERT INTO person (pid, name, dob) VALUES
(1, 'John', '1980-01-05'),
(2, 'Jimmy', '1975-04-25'),
(3, 'Sarah', '2004-02-10'),
(4, 'Frank', '1934-12-12');

The task is to populate younger_sibling_name and younger_sibling_dob with the name and date of birth of the person who is closest to them in age, but not older or the same age.

I can set the younger sibling dob easily because this is the value that determines the record to use with a correlated subquery (I think this is an example of that?):


UPDATE person SET younger_sibling_dob=(
SELECT MAX(dob)
FROM person AS sibling
WHERE sibling.dob < person.dob);

I just can't see any way to get the name as well.
The real query will run over about 1M records, in groups of 100-500 rows for each MAX selection, so performance is a concern.

EDIT:


After trying many different approaches, I've decided on this one, which I think strikes a good balance: the data can be verified with the intermediate result, the intention of the logic is clear, and it performs adequately:

WITH sibling AS (
  SELECT person.pid, sibling.dob, sibling.name,
         row_number() OVER (PARTITION BY person.pid
                            ORDER BY sibling.dob DESC) AS age_closeness
  FROM person
  JOIN person AS sibling ON sibling.dob < person.dob
)
UPDATE person
  SET younger_sibling_name = sibling.name
     ,younger_sibling_dob  = sibling.dob
FROM sibling
WHERE person.pid = sibling.pid
   AND sibling.age_closeness = 1;

SELECT * FROM person ORDER BY dob;
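
One way to look at the intermediate result before touching the table is to run the CTE body as a plain SELECT. This is only a sketch, assuming the person table and sample rows from the question; the derived-table alias candidates is made up for illustration. The filter sits in an outer query because a window function result cannot be referenced in the WHERE clause of the same query level.

```sql
-- Preview which sibling each person would receive, without modifying the table.
SELECT pid, name AS sibling_name, dob AS sibling_dob
FROM  (
   SELECT person.pid, sibling.dob, sibling.name,
          row_number() OVER (PARTITION BY person.pid
                             ORDER BY sibling.dob DESC) AS age_closeness
   FROM   person
   JOIN   person AS sibling ON sibling.dob < person.dob
   ) AS candidates          -- hypothetical alias
WHERE  age_closeness = 1
ORDER  BY pid;
```

The person with the earliest dob has no row to join to, so it does not appear in this preview and keeps NULL in both columns after the UPDATE.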

3 Solutions

#1


4  

Correlated subqueries are infamous for abysmal performance. Doesn't matter much for small tables, matters a lot for big tables. Use one of these instead, preferably the second:


Query 1

WITH cte AS (
   SELECT *, dense_rank() OVER (ORDER BY dob) AS drk
   FROM   person
    )
UPDATE person p
SET    younger_sibling_name = y.name
      ,younger_sibling_dob  = y.dob
FROM   cte x
JOIN   (SELECT DISTINCT ON (drk) * FROM cte) y ON y.drk = x.drk + 1
WHERE  x.pid = p.pid;

-> SQLfiddle (with extended test case)


  • In the CTE cte use the window function dense_rank() to get a rank without gaps according to the dob for every person.

  • Join cte to itself, but remove duplicates on dob from the second instance. Thereby everybody gets exactly one UPDATE. If more than one person shares the same dob, the same one is selected as younger sibling for all persons on the next dob. I do this with:

    (SELECT DISTINCT ON (drk) * FROM cte)

    Add ORDER BY drk, ... if you want to pick a particular person for every dob.

  • If no younger person exists, no UPDATE happens and the columns stay NULL.


  • Indices on dob and pid make this fast.
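
The tie-breaker and index remarks above can be sketched as follows. The choice of pid as tie-breaker and the index names are assumptions for illustration, not part of the answer.

```sql
-- Among persons sharing a dob, deterministically keep the one with the lowest pid
-- (assumed tie-breaker; any other column could follow drk in ORDER BY).
WITH cte AS (
   SELECT *, dense_rank() OVER (ORDER BY dob) AS drk
   FROM   person
   )
SELECT DISTINCT ON (drk) *
FROM   cte
ORDER  BY drk, pid;

-- Hypothetical index names; IF NOT EXISTS requires PostgreSQL 9.5 or later.
CREATE INDEX IF NOT EXISTS person_dob_idx ON person (dob);
CREATE INDEX IF NOT EXISTS person_pid_idx ON person (pid);
```

The leading ORDER BY expression has to match the DISTINCT ON expression; the columns after it decide which row survives per group.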


Query 2

WITH cte AS (
   SELECT dob, min(name) AS name
         ,row_number() OVER (ORDER BY dob) rn
   FROM   person p
   GROUP  BY dob
   )
UPDATE person p
SET    younger_sibling_name = y.name
      ,younger_sibling_dob  = y.dob
FROM   cte x
JOIN   cte y ON y.rn = x.rn + 1
WHERE  x.dob = p.dob;

-> SQLfiddle


  • This works, because aggregate functions are applied before window functions. And it should be very fast, since both operations agree on the sort order.


  • Obviates the need for a later DISTINCT like in query 1.


  • Result is the same as query 1, exactly.
    Again, you can add more columns to ORDER BY to pick a particular person for every dob.


  • Only needs an index on dob to be fast.
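
To see the "aggregates before window functions" point in isolation, the CTE body can be run on its own. A small sketch, assuming the person table from the question; the extra count(*) column is added here only to make the grouping visible.

```sql
-- GROUP BY collapses the rows to one per dob first;
-- row_number() then numbers those grouped rows, not the original ones.
SELECT dob
     , min(name) AS name
     , count(*)  AS persons_with_this_dob   -- illustrative extra column
     , row_number() OVER (ORDER BY dob) AS rn
FROM   person
GROUP  BY dob
ORDER  BY dob;
```

Because rn is assigned per distinct dob, it has no gaps and no duplicates, which is what lets the self-join on y.rn = x.rn + 1 work without DISTINCT.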


#2


2  

1) Finding the MAX() can always be rewritten in terms of NOT EXISTS (...):

UPDATE person dst
SET younger_sibling_name = src.name
        ,younger_sibling_dob = src.dob
FROM person src
-- src sorts before dst by (dob, pid); parentheses are needed because AND binds tighter than OR
WHERE (src.dob < dst.dob
   OR src.dob = dst.dob AND src.pid < dst.pid)
AND NOT EXISTS (
        -- no row nx lies strictly between src and dst
        SELECT * FROM person nx
        WHERE (nx.dob < dst.dob
           OR nx.dob = dst.dob AND nx.pid < dst.pid)
        AND (nx.dob > src.dob
           OR nx.dob = src.dob AND nx.pid > src.pid)
        );
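
The ordering test used here, "earlier dob, or same dob and lower pid", can also be written with PostgreSQL's row-wise comparison, which sidesteps AND/OR precedence questions. This is a sketch of an equivalent formulation, assuming dob and pid are never NULL; it is not taken from the answer.

```sql
-- (a, b) < (c, d) compares a with c first and falls back to b versus d on ties.
UPDATE person dst
SET    younger_sibling_name = src.name
     , younger_sibling_dob  = src.dob
FROM   person src
WHERE  (src.dob, src.pid) < (dst.dob, dst.pid)          -- src comes before dst
AND    NOT EXISTS (
          SELECT 1 FROM person nx
          WHERE  (nx.dob, nx.pid) < (dst.dob, dst.pid)  -- nx also before dst
          AND    (nx.dob, nx.pid) > (src.dob, src.pid)  -- but after src
          );
```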

2) Instead of rank() / row_number(), you could also use a LAG() function over the WINDOW:


UPDATE person dst
SET younger_sibling_name = src.name
        ,younger_sibling_dob = src.dob
FROM    (
        SELECT pid
        , LAG(name) OVER win AS name
        , LAG(dob) OVER win AS dob 
        FROM person
        WINDOW win AS (ORDER BY dob, pid)
        ) src
WHERE src.pid = dst.pid
        ;

Both versions require a self-joined subquery (or CTE) because UPDATE does not allow window functions.
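
To check what the LAG() variant would write, the subquery can be run as a standalone SELECT. A minimal sketch, assuming the four sample rows from the question; the prev_ column aliases are made up for illustration.

```sql
-- Preview the previous row (by dob, then pid) for every person.
SELECT pid, name, dob
     , LAG(name) OVER win AS prev_name
     , LAG(dob)  OVER win AS prev_dob
FROM   person
WINDOW win AS (ORDER BY dob, pid)
ORDER  BY dob, pid;

-- With the four sample rows:
--  pid | name  | dob        | prev_name | prev_dob
--    4 | Frank | 1934-12-12 | NULL      | NULL
--    2 | Jimmy | 1975-04-25 | Frank     | 1934-12-12
--    1 | John  | 1980-01-05 | Jimmy     | 1975-04-25
--    3 | Sarah | 2004-02-10 | John      | 1980-01-05
```

Note that the first row in the window gets NULLs, so the LAG() UPDATE writes NULL there instead of skipping the row.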


#3


1  

To get the dob and name, you can do:


update person
    set younger_sibling_dob = (select p2.dob
                               from person p2
                               where p2.dob < person.dob
                               order by p2.dob desc
                               limit 1),
       younger_sibling_name = (select p2.name
                               from person p2
                               where p2.dob < person.dob
                               order by p2.dob desc
                               limit 1);

If you have an index on dob, then the query will run faster.
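
The two subqueries are identical apart from the selected column, so they can be folded into one with a multi-column SET. This is a sketch of that variant plus the index, assuming PostgreSQL 9.5 or later; the index name is hypothetical.

```sql
-- One correlated subquery feeds both columns.
-- If it returns no row (nobody born earlier), both columns are set to NULL.
UPDATE person
SET   (younger_sibling_dob, younger_sibling_name) = (
         SELECT p2.dob, p2.name
         FROM   person p2
         WHERE  p2.dob < person.dob
         ORDER  BY p2.dob DESC
         LIMIT  1
         );

-- Hypothetical index name; supports the "largest dob below X" lookup.
CREATE INDEX IF NOT EXISTS person_dob_idx ON person (dob);
```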

