My table has two integer columns: a and b. For each row, I want to select the nth smallest value of b among the rows whose a is less than or equal to that row's a. Here's a sample input/output, with n=2.
Input:
a | b
-------
1 | 4
2 | 2
3 | 5
4 | 3
5 | 9
6 | 1
7 | 7
8 | 6
9 | 0
Output:
a | 2nd min b
-------------
1 | null ← only 1 element in [4], no 2nd min
2 | 4 ← 2nd min between [4,2]
3 | 4 ← 2nd min between [4,2,5]
4 | 3 ← 2nd min between [4,2,5,3]
5 | 3 ← etc.
6 | 2
7 | 2
8 | 2
9 | 1
I used n=2 here to keep it simple, but in practice I want the 2000th smallest value (or some other large-ish constant). The column a can be assumed to contain distinct integers (and even 1, 2, 3, … if that's easier).
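To pin down the requirement, here is a minimal brute-force sketch in Python (not BigQuery) that reproduces the sample output above. The function name `running_nth_smallest` and the list-of-tuples input are assumptions made for illustration only. It re-sorts the prefix for every row, so it is meant as a checker for small inputs, not as a scalable solution.

```python
def running_nth_smallest(rows, n):
    """For each (a, b) row, pair a with the nth smallest b among rows
    whose a is <= the current a, or None if fewer than n rows exist."""
    ordered = sorted(rows)  # sort by a; a is assumed to be distinct
    result = []
    for i, (a, _) in enumerate(ordered):
        # All b values seen up to and including the current row, sorted.
        seen = sorted(b for _, b in ordered[: i + 1])
        result.append((a, seen[n - 1] if len(seen) >= n else None))
    return result


sample = [(1, 4), (2, 2), (3, 5), (4, 3), (5, 9), (6, 1), (7, 7), (8, 6), (9, 0)]
for a, value in running_nth_smallest(sample, 2):
    print(a, value)
```

Any candidate query can be compared against this on a small table before running it at n=2000.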
The problem is that if I use ORDER BY b in my window clause with NTH_VALUE, it computes the answer on the wrong set of values, and with ORDER BY a it picks the wrong element:
WITH data AS (
  SELECT 1 AS a, 4 AS b
  UNION ALL SELECT 2 AS a, 2 AS b
  UNION ALL SELECT 3 AS a, 5 AS b
  UNION ALL SELECT 4 AS a, 3 AS b
  UNION ALL SELECT 5 AS a, 9 AS b
  UNION ALL SELECT 6 AS a, 1 AS b
)
SELECT NTH_VALUE(b, 2) OVER (ORDER BY a)
FROM data
This returns [null, 2, 2, 2, 2, 2]: the values are ordered by a (so in the same order as they appear), so the value b=2 is always the one in second place. I want to order by a and then take the nth smallest value of b. Any idea how to write this in BigQuery (preferably Standard SQL)?
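The behaviour described above can be mimicked outside SQL. The following Python sketch is only a simulation of how NTH_VALUE(b, 2) OVER (ORDER BY a) picks its value, assuming the default frame that runs from the first row to the current row; the function name is hypothetical. The frame stays in a-order and is never sorted by b, which is why the second row's b keeps being returned.

```python
def nth_value_over_order_by_a(rows, n):
    """Simulate NTH_VALUE(b, n) OVER (ORDER BY a) with the default frame
    (start of partition up to the current row)."""
    ordered = sorted(rows)  # window ordering: by a
    out = []
    for i in range(len(ordered)):
        # The frame keeps the window's ordering (by a); it is NOT sorted by b.
        frame = [b for _, b in ordered[: i + 1]]
        out.append(frame[n - 1] if len(frame) >= n else None)
    return out


question_rows = [(1, 4), (2, 2), (3, 5), (4, 3), (5, 9), (6, 1)]
print(nth_value_over_order_by_a(question_rows, 2))
```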
1 Answer
#1
3
The query below is for BigQuery Standard SQL and produces the correct result for the given example.
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 a, 4 b UNION ALL
  SELECT 2, 2 UNION ALL
  SELECT 3, 5 UNION ALL
  SELECT 4, 3 UNION ALL
  SELECT 5, 9 UNION ALL
  SELECT 6, 1 UNION ALL
  SELECT 7, 7 UNION ALL
  SELECT 8, 6 UNION ALL
  SELECT 9, 0
)
SELECT
  a,
  (SELECT b FROM
    (SELECT b FROM UNNEST(c) b ORDER BY b LIMIT 2)
    ORDER BY b DESC LIMIT 1
  ) b2
FROM (
  SELECT a, IF(ARRAY_LENGTH(c) > 1, c, [NULL]) c
  FROM (
    SELECT a, ARRAY_AGG(b) OVER (ORDER BY a) c
    FROM `project.dataset.table`
  )
)
-- ORDER BY a
with the expected result below:
Row | a | b2
--------------
1   | 1 | null
2   | 2 | 4
3   | 3 | 4
4   | 4 | 3
5   | 5 | 3
6   | 6 | 2
7   | 7 | 2
8   | 8 | 2
9   | 9 | 1
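Note: to make it work for the 2000th element, change 2 to 2000 in LIMIT 2, and change the guard ARRAY_LENGTH(c) > 1 to ARRAY_LENGTH(c) > 1999. Without the second change, rows with fewer than 2000 values so far would return the largest value seen instead of null.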
Meanwhile, I admit it looks a little ugly/messy to me, and I'm not sure about scalability, but you can give it a shot.
Quick Update
Below is a slightly less ugly version (same output, of course):
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 a, 4 b UNION ALL
  SELECT 2, 2 UNION ALL
  SELECT 3, 5 UNION ALL
  SELECT 4, 3 UNION ALL
  SELECT 5, 9 UNION ALL
  SELECT 6, 1 UNION ALL
  SELECT 7, 7 UNION ALL
  SELECT 8, 6 UNION ALL
  SELECT 9, 0
)
SELECT a, c[SAFE_ORDINAL(2)] b2 FROM (
  SELECT x.a, ARRAY_AGG(y.b ORDER BY y.b LIMIT 2) c
  FROM `project.dataset.table` x
  CROSS JOIN `project.dataset.table` y
  WHERE y.a <= x.a
  GROUP BY x.a
)
-- ORDER BY a
For the 2000th element, replace 2 with 2000 in both LIMIT 2 and SAFE_ORDINAL(2).
This still potentially has the same scalability issue, because of the (now) explicit CROSS JOIN.
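On the scalability concern: the CROSS JOIN pairs every row with all rows that have a smaller or equal a, so the number of joined pairs grows quadratically with the table size. If the rows can be exported and processed outside BigQuery (an assumption, not something the question states), a single pass with a bounded heap avoids that. The sketch below is Python, not SQL, and the function name is hypothetical. It keeps only the n smallest b values seen so far in a max-heap (stored negated, since `heapq` is a min-heap), so each row costs O(log n).

```python
import heapq


def running_nth_smallest_heap(rows, n):
    """Single pass in a-order, keeping the n smallest b values seen so far."""
    heap = []  # negated b values, so heap[0] is minus the largest of the kept values
    out = []
    for a, b in sorted(rows):  # a is assumed to be distinct
        if len(heap) < n:
            heapq.heappush(heap, -b)
        elif b < -heap[0]:
            # b is smaller than the current nth smallest: swap it in.
            heapq.heapreplace(heap, -b)
        # Once n values are held, the largest of them is the nth smallest overall.
        out.append((a, -heap[0] if len(heap) == n else None))
    return out


table = [(1, 4), (2, 2), (3, 5), (4, 3), (5, 9), (6, 1), (7, 7), (8, 6), (9, 0)]
for a, b2 in running_nth_smallest_heap(table, 2):
    print(a, b2)
```

For n=2000 only the argument changes. Memory stays at n values regardless of how many rows are processed.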