Example:
row_number | id | firstname | middlename | lastname
0          | 1  | John      | NULL       | Doe
1          | 1  | John      | Jacob      | Doe
2          | 2  | Alison    | Marie      | Smith
3          | 2  | NULL      | Marie      | Smith
4          | 2  | Alison    | Marie      | Smith
I'm trying to figure out how to group by id and then, for each group, keep the row with the fewest NULL values. If several rows tie for the fewest NULLs, dropping the extras is fine (for example, dropping row_number 4, since it ties row_number 2 for the fewest NULLs where id = 2).
The answer for this example would be row_numbers 1 and 2.
Preferably this would be ANSI SQL, but I can translate from other languages (such as Python with pandas) if you can think of a way to do it.
Edit: Added a row for the case of tie-breaking.
3 Solutions
#1
1
If you want to do this in pandas, you can do it this way:
# NC = number of NULLs per row; keep the rows whose NC equals the minimum within their id group
df[df.assign(NC=df.isnull().sum(axis=1)).groupby('id')['NC'].transform(lambda x: x == x.min())]
Output:
   row_number  id firstname middlename lastname
1           1   1      John      Jacob      Doe
2           2   2    Alison      Marie    Smith
For the tiebreaker, add a row:
df.loc[4,['row_number','id','firstname','middlename','lastname']] = ['4',2,'Mary','Maxine','Maxwell']
Then use groupby, transform, and idxmin:
df[df.index == df.assign(NC = df.isnull().sum(1)).groupby('id')['NC'].transform('idxmin')]
Output:
   row_number  id firstname middlename lastname
1           1   1      John      Jacob      Doe
2           2   2    Alison      Marie    Smith
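If you want to check this against the question's five-row table (including the tie row), here is a minimal, self-contained sketch. The frame construction and the names null_count, best_index and result are my own assumptions, not part of the answer above. It leans on idxmin returning the first index at which the minimum occurs, so the earlier row wins a tie.

```python
import pandas as pd

# Rebuild the example table from the question; None stands in for NULL.
df = pd.DataFrame({
    "row_number": [0, 1, 2, 3, 4],
    "id": [1, 1, 2, 2, 2],
    "firstname": ["John", "John", "Alison", None, "Alison"],
    "middlename": [None, "Jacob", "Marie", "Marie", "Marie"],
    "lastname": ["Doe", "Doe", "Smith", "Smith", "Smith"],
})

# Count the NULLs in each row.
null_count = df.isnull().sum(axis=1)

# For each id, take the index label of the first row with the fewest NULLs.
best_index = null_count.groupby(df["id"]).idxmin()

# Select exactly one row per id.
result = df.loc[best_index]
print(result)
```

Because only one index label is picked per id, tied rows such as row_number 4 are dropped without any extra step.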
#2
1
Oh, you want the rows with the fewest null values. I would suggest:
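-- partition by id so rows are ranked within each id, not across the whole table
-- dense_rank() keeps every row tied for the fewest NULLs; use row_number() to keep just one per id
select t.*
from (select t.*,
             dense_rank() over (partition by id
                                order by (case when firstname is null then 1 else 0 end) +
                                         (case when middlename is null then 1 else 0 end) +
                                         (case when lastname is null then 1 else 0 end)
                               ) as seqnum
      from t
     ) t
where seqnum = 1;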
This is ANSI-standard SQL.
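To try a per-id version of this kind of query without a database server, here is a rough sketch using Python's built-in sqlite3 module. It is a variant, not the query exactly as posted: it assumes the ranking is partitioned by id, swaps in row_number() so that only one row survives a tie, and breaks ties on the row_number column. The table name t and the sample rows come from the question; everything else is my assumption. Window functions need SQLite 3.25 or later.

```python
import sqlite3

# In-memory database holding the example table from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (row_number integer, id integer, "
    "firstname text, middlename text, lastname text)"
)
conn.executemany(
    "insert into t values (?, ?, ?, ?, ?)",
    [
        (0, 1, "John", None, "Doe"),
        (1, 1, "John", "Jacob", "Doe"),
        (2, 2, "Alison", "Marie", "Smith"),
        (3, 2, None, "Marie", "Smith"),
        (4, 2, "Alison", "Marie", "Smith"),
    ],
)

# Rank rows within each id by their NULL count, lowest first.
# Ties are broken by the row_number column so the result is deterministic.
query = """
select t.row_number
from (select t.*,
             row_number() over (
                 partition by id
                 order by (case when firstname is null then 1 else 0 end) +
                          (case when middlename is null then 1 else 0 end) +
                          (case when lastname is null then 1 else 0 end),
                          t.row_number
             ) as seqnum
      from t
     ) t
where seqnum = 1
order by t.row_number
"""

kept = [row[0] for row in conn.execute(query)]
print(kept)
conn.close()
```

The list holds the row_number values that survive, one per id.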
#3
1
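-- count the non-NULL columns per row, find the highest count per id,
-- then keep the rows that reach it (tied rows are all returned)
select t.id, t.firstname, t.middlename, t.lastname
from tablename t
join (
    select id,
           max(case when firstname is null then 0 else 1 end
             + case when middlename is null then 0 else 1 end
             + case when lastname is null then 0 else 1 end) as max_filled
    from tablename
    group by id
) a on a.id = t.id
where case when t.firstname is null then 0 else 1 end
    + case when t.middlename is null then 0 else 1 end
    + case when t.lastname is null then 0 else 1 end = a.max_filled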