I am attempting to update multiple columns on a table with values from another row in the same table:
CREATE TEMP TABLE person (
   pid INT
  ,name VARCHAR(40)
  ,dob DATE
  ,younger_sibling_name VARCHAR(40)
  ,younger_sibling_dob DATE
);
INSERT INTO person (pid, name, dob) VALUES
  (1, 'John', '1980-01-05'),
  (2, 'Jimmy', '1975-04-25'),
  (3, 'Sarah', '2004-02-10'),
  (4, 'Frank', '1934-12-12');
The task is to populate younger_sibling_name and younger_sibling_dob with the name and birthday of the person who is closest to them in age, but not older or the same age.
I can set younger_sibling_dob easily with a correlated subquery (I think this is an example of one?), because dob is the value that determines which record to use:
UPDATE person SET younger_sibling_dob=(
SELECT MAX(dob)
FROM person AS sibling
WHERE sibling.dob < person.dob);
I just can't see any way to get the name.
The real query will run over about 1M records, in groups of 100-500 for each MAX selection, so performance is a concern.
EDIT:
After trying many different approaches, I've decided on this one. I think it strikes a good balance: the intermediate result lets me verify the data, the query shows the intent of the logic, and it performs adequately:
WITH sibling AS (
SELECT person.pid, sibling.dob, sibling.name,
row_number() OVER (PARTITION BY person.pid
ORDER BY sibling.dob DESC) AS age_closeness
FROM person
JOIN person AS sibling ON sibling.dob < person.dob
)
UPDATE person
SET younger_sibling_name = sibling.name
,younger_sibling_dob = sibling.dob
FROM sibling
WHERE person.pid = sibling.pid
AND sibling.age_closeness = 1;
SELECT * FROM person ORDER BY dob;
3 Answers
#1
4
Correlated subqueries are infamous for abysmal performance. Doesn't matter much for small tables, matters a lot for big tables. Use one of these instead, preferably the second:
Query 1
WITH cte AS (
SELECT *, dense_rank() OVER (ORDER BY dob) AS drk
FROM person
)
UPDATE person p
SET younger_sibling_name = y.name
,younger_sibling_dob = y.dob
FROM cte x
JOIN (SELECT DISTINCT ON (drk) * FROM cte) y ON y.drk = x.drk + 1
WHERE x.pid = p.pid;
-> SQLfiddle (with extended test case)
- In the CTE cte, use the window function dense_rank() to get a rank without gaps according to dob for every person.
- Join cte to itself, but remove duplicates on dob from the second instance. Thereby everybody gets exactly one UPDATE. If more than one person shares the same dob, the same one is selected as younger sibling for all persons on the next dob. I do this with:
  (SELECT DISTINCT ON (drk) * FROM cte)
  Add ORDER BY drk, ... if you want to pick a particular person for every dob.
- If no younger person exists, no UPDATE happens and the columns stay NULL.
- Indices on dob and pid make this fast.
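As a sketch of the last points, assuming PostgreSQL and the person table from the question, a deterministic tie-break and the two indexes could look like this. The index names and the choice of name as the tie-breaker are made up for illustration:

```sql
-- DISTINCT ON keeps the first row of each drk group, so the ORDER BY
-- decides which person that is (here: the alphabetically first name).
WITH cte AS (
   SELECT *, dense_rank() OVER (ORDER BY dob) AS drk
   FROM   person
   )
SELECT DISTINCT ON (drk) *
FROM   cte
ORDER  BY drk, name;

-- Hypothetical index names; any names will do.
CREATE INDEX person_dob_idx ON person (dob);
CREATE INDEX person_pid_idx ON person (pid);
```

The leading ORDER BY expression has to match the DISTINCT ON expression; the columns after it only decide which row wins within each group.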
Query 2
WITH cte AS (
SELECT dob, min(name) AS name
,row_number() OVER (ORDER BY dob) rn
FROM person p
GROUP BY dob
)
UPDATE person p
SET younger_sibling_name = y.name
,younger_sibling_dob = y.dob
FROM cte x
JOIN cte y ON y.rn = x.rn + 1
WHERE x.dob = p.dob;
-> SQLfiddle
- This works because aggregate functions are applied before window functions. It should also be very fast, since both operations agree on the sort order.
- Obviates the need for a later DISTINCT like in query 1.
- The result is exactly the same as query 1. Again, you can add more columns to ORDER BY to pick a particular person for every dob.
- Only needs an index on dob to be fast.
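To see the "aggregate first, window second" order in action, it can help to run the CTE body on its own. A minimal sketch; the rows in the comment assume exactly the four sample rows from the question:

```sql
-- GROUP BY collapses persons sharing a dob into one row,
-- then row_number() numbers those grouped rows by dob.
SELECT dob, min(name) AS name
     , row_number() OVER (ORDER BY dob) AS rn
FROM   person
GROUP  BY dob
ORDER  BY dob;

--     dob     | name  | rn
-- ------------+-------+----
--  1934-12-12 | Frank |  1
--  1975-04-25 | Jimmy |  2
--  1980-01-05 | John  |  3
--  2004-02-10 | Sarah |  4
```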
#2
2
1) Finding the MAX() can always be rewritten in terms of NOT EXISTS (...)
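UPDATE person dst
SET younger_sibling_name = src.name
  , younger_sibling_dob = src.dob
FROM person src
-- src sorts before dst in (dob, pid) order; parentheses are needed
-- because AND binds tighter than OR
WHERE (src.dob < dst.dob
    OR (src.dob = dst.dob AND src.pid < dst.pid))
-- and no other row nx lies strictly between src and dst
AND NOT EXISTS (
    SELECT * FROM person nx
    WHERE (nx.dob < dst.dob
        OR (nx.dob = dst.dob AND nx.pid < dst.pid))
      AND (nx.dob > src.dob
        OR (nx.dob = src.dob AND nx.pid > src.pid))
    );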
2) Instead of rank() / row_number(), you could also use a LAG() function over the WINDOW:
UPDATE person dst
SET younger_sibling_name = src.name
,younger_sibling_dob = src.dob
FROM (
SELECT pid
, LAG(name) OVER win AS name
, LAG(dob) OVER win AS dob
FROM person
WINDOW win AS (ORDER BY dob, pid)
) src
WHERE src.pid = dst.pid
;
Both versions require a self-joined subquery (or CTE) because UPDATE does not allow window functions.
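One way to check the LAG() variant before touching any data is to run its subquery as a plain SELECT. This is only a sketch; the output column aliases are chosen here for readability, and the rows in the comment assume the four sample rows from the question:

```sql
-- For each row, LAG() returns the value from the previous row
-- in (dob, pid) order, or NULL for the first row.
SELECT pid, name, dob
     , LAG(name) OVER win AS sibling_name
     , LAG(dob)  OVER win AS sibling_dob
FROM   person
WINDOW win AS (ORDER BY dob, pid)
ORDER  BY dob, pid;

--  pid | name  |    dob     | sibling_name | sibling_dob
-- -----+-------+------------+--------------+-------------
--    4 | Frank | 1934-12-12 | (null)       | (null)
--    2 | Jimmy | 1975-04-25 | Frank        | 1934-12-12
--    1 | John  | 1980-01-05 | Jimmy        | 1975-04-25
--    3 | Sarah | 2004-02-10 | John         | 1980-01-05
```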
#3
1
To get the dob and name, you can do:
update person
set younger_sibling_dob = (select p2.dob
                           from person p2
                           where p2.dob < person.dob
                           order by p2.dob desc
                           limit 1),
    younger_sibling_name = (select p2.name
                            from person p2
                            where p2.dob < person.dob
                            order by p2.dob desc
                            limit 1);
If you have an index on dob, then the query will run faster.
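The two subqueries are identical apart from the selected column, so they can probably be folded into one. A minimal sketch, assuming PostgreSQL 9.5 or later, which accepts a sub-SELECT on the right-hand side of a multi-column SET:

```sql
-- One correlated subquery fills both columns at once.
-- If no matching row exists, both columns are set to NULL.
UPDATE person
SET    (younger_sibling_dob, younger_sibling_name) = (
          SELECT p2.dob, p2.name
          FROM   person p2
          WHERE  p2.dob < person.dob
          ORDER  BY p2.dob DESC
          LIMIT  1
       );
```

This still runs one subquery per updated row, so the index on dob matters just as much as before.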