We have a group of patients in one table, and we want to match each of them to a patient exactly like them in another table. We want pairs of patients, so a patient cannot be matched to more than one other patient.
A LEFT OUTER JOIN returns a row for every occurrence of a match, which pairs each patient with every possible match, so we need some other approach.
We see lots of answers on SO about matching to the first row, but that leaves a single patient matched to multiple other patients, not the pairs we need.
Is there any way to create pair matches without duplication between tables in Google BigQuery? (Even if it takes multiple steps.)
ADDENDUM: Here are example tables. It would be great to see a SQL example using them.
Here is what is needed.
Example Source Tables:
Table A
PatientID  Race  Gender
1          A     F
2          B     M
3          A     F
Table B
PatientID  Race  Gender
4          A     F
5          A     F
6          B     M
Results Table Desired:
Table C
A.PatientID  B.PatientID_Match
1            4
2            6
3            5
CLARIFICATION: Patients in Table A must match patients from Table B. (They cannot match patients in their own table.)
3 Answers
#1
2
select min(case tab when 'A' then patientID end) as A_patientID
      ,min(case tab when 'B' then patientID end) as B_patientID
from  (select tab
             ,patientID
              -- r: same value for all rows sharing (race, gender), across both tables
             ,rank() over (order by race, gender) as r
              -- rn: position of the row within its table and (race, gender) group
             ,row_number() over (partition by tab, race, gender order by patientID) as rn
       from  (          select 'A' as tab, A.* from A
              union all select 'B' as tab, B.* from B
             ) t
      ) t
group by t.r
        ,t.rn
-- having count(*) = 2   -- uncomment to keep only complete pairs
;
+-------------+-------------+
| a_patientid | b_patientid |
+-------------+-------------+
| 3           | 5           |
| 2           | 6           |
| 1           | 4           |
+-------------+-------------+
The main idea:
Rows from both tables are divided into groups by their attributes (race, gender). This is done using the RANK function.
Within each group of attributes (race, gender), the rows are ordered, per table, by their patientid. This is done using the ROW_NUMBER function.
+-----+-----------+------+--------+---+----+
| tab | patientid | race | gender | r | rn |
+-----+-----------+------+--------+---+----+
| A   | 1         | A    | F      | 1 | 1  |
| B   | 4         | A    | F      | 1 | 1  |
+-----+-----------+------+--------+---+----+
| A   | 3         | A    | F      | 1 | 2  |
| B   | 5         | A    | F      | 1 | 2  |
+-----+-----------+------+--------+---+----+
| A   | 2         | B    | M      | 5 | 1  |
| B   | 6         | B    | M      | 5 | 1  |
+-----+-----------+------+--------+---+----+
In the final phase, the rows are divided into groups (GROUP BY) by their RANK (r) and ROW_NUMBER (rn) values, which means each group has one row from each table (or only a single row if there is no matching row from the other table).
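The grouping logic described above can be sketched outside SQL. The following is a minimal pure-Python illustration, not the answer's actual query: the function name `pair_patients` and the tuple layout are assumptions made for the example, and the data is the sample from the question. A per-group counter stands in for ROW_NUMBER, and the dictionary key stands in for the GROUP BY on (r, rn).

```python
from collections import defaultdict

# Sample rows copied from the question: (PatientID, Race, Gender).
table_a = [(1, "A", "F"), (2, "B", "M"), (3, "A", "F")]
table_b = [(4, "A", "F"), (5, "A", "F"), (6, "B", "M")]


def pair_patients(rows_a, rows_b):
    """Pair rows that share (race, gender) and the same per-table position."""
    # Key: (attributes, position within the table). Value: [A id, B id].
    groups = defaultdict(lambda: [None, None])
    for side, rows in enumerate((rows_a, rows_b)):
        seen = defaultdict(int)  # stands in for ROW_NUMBER per attribute group
        for patient_id, race, gender in sorted(rows):  # ORDER BY patientID
            seen[(race, gender)] += 1
            groups[((race, gender), seen[(race, gender)])][side] = patient_id
    # Sorting the keys only makes the output order deterministic.
    return [tuple(groups[key]) for key in sorted(groups)]


pairs = pair_patients(table_a, table_b)
print(pairs)  # [(1, 4), (3, 5), (2, 6)]
```

A patient with no counterpart comes back paired with `None`, which corresponds to the single-row groups mentioned above; filtering those out plays the role of the commented-out `having count(*) = 2`. The sketch assumes no attribute is missing, since sorting keys that contain `None` would fail in Python.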
#2
1
In many databases, a lateral join would be the way to go. In Google BigQuery, you can use row_number(). The query looks something like this:
select p.*, pp.patient_id as other_patient_id
from patients p cross join
     (select p.*,
             row_number() over (partition by col1, col2, col3 order by col1) as seqnum
      from patients p
     ) pp
where pp.seqnum = 1;
The columns in the partition by are the columns used for similarity.
#3
0
SELECT
  a.PatientID AS PatientID,
  b.PatientID AS PatientID_Match
FROM (
  SELECT PatientID, Race, Gender,
    ROW_NUMBER() OVER(PARTITION BY Race, Gender) AS Pos
  FROM TableA
) AS a
JOIN (
  SELECT PatientID, Race, Gender,
    ROW_NUMBER() OVER(PARTITION BY Race, Gender) AS Pos
  FROM TableB
) AS b
ON a.Race = b.Race AND a.Gender = b.Gender AND a.Pos = b.Pos
The above will leave out those patients from TableA who either have no match in TableB, or whose potential match in TableB was already used as the match for another patient in TableA (as per your requirement: "we want pairs of patients so we cannot match a patient to more than one other patient").
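To make the "left out" behaviour concrete, here is a small pure-Python sketch of the same join-on-position idea. It is an illustration under assumptions, not the query itself: the helper names `number_rows` and `inner_join_pairs` are made up, patient 7 is hypothetical extra data, and rows are numbered in PatientID order, whereas the query above has no ORDER BY inside the window and so may number rows in any order.

```python
from collections import defaultdict


def number_rows(rows):
    """Map (race, gender, pos) -> PatientID, numbering rows per attribute group."""
    counters = defaultdict(int)
    numbered = {}
    for patient_id, race, gender in sorted(rows):  # assumed ordering by PatientID
        counters[(race, gender)] += 1
        numbered[(race, gender, counters[(race, gender)])] = patient_id
    return numbered


def inner_join_pairs(rows_a, rows_b):
    """Keep only keys present on both sides, like an inner JOIN on all three columns."""
    a, b = number_rows(rows_a), number_rows(rows_b)
    return sorted((a[key], b[key]) for key in a.keys() & b.keys())


# Hypothetical data: TableA has three (A, F) patients, TableB only two.
table_a = [(1, "A", "F"), (3, "A", "F"), (7, "A", "F"), (2, "B", "M")]
table_b = [(4, "A", "F"), (5, "A", "F"), (6, "B", "M")]

result = inner_join_pairs(table_a, table_b)
print(result)  # [(1, 4), (2, 6), (3, 5)]
```

Patient 7 does not appear in the result because position 3 of the (A, F) group has no counterpart in TableB, and no TableB patient is used twice.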
To address Dudu's comments about NULL for attributes:
SELECT
  a.PatientID AS PatientID,
  b.PatientID AS PatientID_Match
FROM (
  SELECT
    PatientID, IFNULL(Race, 'null') AS Race, IFNULL(Gender, 'null') AS Gender,
    ROW_NUMBER() OVER(PARTITION BY Race, Gender) AS Pos
  FROM TableA
) AS a
JOIN (
  SELECT
    PatientID, IFNULL(Race, 'null') AS Race, IFNULL(Gender, 'null') AS Gender,
    ROW_NUMBER() OVER(PARTITION BY Race, Gender) AS Pos
  FROM TableB
) AS b
ON a.Race = b.Race AND a.Gender = b.Gender AND a.Pos = b.Pos
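The reason IFNULL helps is presumably that an equality comparison involving NULL is never true in SQL, so rows with a NULL attribute would never satisfy the ON clause. The sketch below mimics that rule in pure Python; `sql_equal`, `if_null`, `rows_join`, and the two sample rows are all hypothetical names and data chosen for illustration.

```python
def sql_equal(left, right):
    """Mimic SQL '=' in a join condition: a NULL (None) operand never matches."""
    return left is not None and right is not None and left == right


def if_null(value, default):
    """Mimic IFNULL(value, default)."""
    return default if value is None else value


def rows_join(a, b, coalesce):
    """Evaluate the ON clause for two (PatientID, Race, Gender, Pos) rows."""
    race_a, gender_a, race_b, gender_b = a[1], a[2], b[1], b[2]
    if coalesce:
        # Replace NULL attributes with the sentinel string before comparing.
        race_a, gender_a = if_null(race_a, "null"), if_null(gender_a, "null")
        race_b, gender_b = if_null(race_b, "null"), if_null(gender_b, "null")
    return (sql_equal(race_a, race_b)
            and sql_equal(gender_a, gender_b)
            and sql_equal(a[3], b[3]))


# Hypothetical rows with an unknown race.
row_a = (8, None, "F", 1)
row_b = (9, None, "F", 1)

without_ifnull = rows_join(row_a, row_b, coalesce=False)
with_ifnull = rows_join(row_a, row_b, coalesce=True)
print(without_ifnull, with_ifnull)  # False True
```

One thing to keep in mind with this approach is that the sentinel string 'null' would be treated as equal to a genuine attribute value of 'null', so it should be a value that cannot occur in the data.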