How do I select the row with the fewest null values in each groupby group?

Date: 2021-08-23 09:09:16

Example:

row_number |id |firstname | middlename | lastname |
0          | 1 | John     | NULL       | Doe      |
1          | 1 | John     | Jacob      | Doe      |
2          | 2 | Alison   | Marie      | Smith    |
3          | 2 | NULL     | Marie      | Smith    |
4          | 2 | Alison   | Marie      | Smith    |

I'm trying to figure out how to group by id and then grab the row with the fewest NULL values in each group. Dropping any extra rows that tie for the fewest NULLs is fine (for example, dropping row_number 4, since it ties with row_number 2 for the fewest NULLs where id = 2).

The answer for this example would be row_numbers 1 and 2.

Preferably this would be ANSI SQL, but I can translate from other languages (such as Python with pandas) if you can think of a way to do it.

Edit: Added a row for the case of tie-breaking.

3 solutions

#1


If you want to do this in pandas, you can do it this way (rows that tie for the minimum within an id are all kept):

df[df.assign(NC=df.isnull().sum(axis=1)).groupby('id')['NC'].transform(lambda x: x == x.min())]

Output:

   row_number  id firstname middlename lastname
1           1   1      John      Jacob      Doe
2           2   2    Alison      Marie    Smith

To break ties:

Add a row:

df.loc[4,['row_number','id','firstname','middlename','lastname']] = ['4',2,'Mary','Maxine','Maxwell']

Then use groupby, transform, and idxmin, which keeps only the first row with the fewest nulls in each group:

df[df.index == df.assign(NC=df.isnull().sum(axis=1)).groupby('id')['NC'].transform('idxmin')]

Output:

  row_number id firstname middlename lastname
1          1  1      John      Jacob      Doe
2          2  2    Alison      Marie    Smith
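
A variation on the same idea, as a minimal sketch: count the nulls per row once, then let idxmin pick the first row with the lowest count in each id group. The frame below rebuilds the question's five rows by hand, and the names null_count and best_rows are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question (None stands in for NULL).
df = pd.DataFrame({
    "row_number": [0, 1, 2, 3, 4],
    "id": [1, 1, 2, 2, 2],
    "firstname": ["John", "John", "Alison", None, "Alison"],
    "middlename": [None, "Jacob", "Marie", "Marie", "Marie"],
    "lastname": ["Doe", "Doe", "Smith", "Smith", "Smith"],
})

# Number of null cells in each row.
null_count = df.isna().sum(axis=1)

# Index label of the first row with the fewest nulls in each id group.
best_rows = null_count.groupby(df["id"]).idxmin()

# One row per id; ties go to the earliest row in the frame.
result = df.loc[best_rows]
print(result)
```

Because idxmin returns the first occurrence of the minimum, row 4 is dropped in favor of row 2 without any extra tie-breaking step.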

#2


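Oh, you want the rows with the fewest null values. I would suggest ranking the rows within each id by their NULL count. dense_rank() keeps every row that ties for the fewest NULLs; use row_number() instead if you want exactly one row per id:

select t.*
from (select t.*,
             dense_rank() over (partition by id
                                order by (case when firstname is null then 1 else 0 end) +
                                         (case when middlename is null then 1 else 0 end) +
                                         (case when lastname is null then 1 else 0 end)
                               ) as seqnum
      from t
     ) t
where seqnum = 1;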

This is ANSI-standard SQL.
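
To try the window-function approach without a database server, here is a rough sketch that runs a per-id ranking query through Python's built-in sqlite3 module. It assumes the bundled SQLite is 3.25 or newer (needed for window functions), uses row_number() so that ties collapse to one row, and renames the question's row_number column to row_num (my own choice) to avoid clashing with the function name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (row_num int, id int, "
    "firstname text, middlename text, lastname text)"
)
# Sample rows from the question; None is stored as NULL.
conn.executemany(
    "insert into t values (?, ?, ?, ?, ?)",
    [
        (0, 1, "John", None, "Doe"),
        (1, 1, "John", "Jacob", "Doe"),
        (2, 2, "Alison", "Marie", "Smith"),
        (3, 2, None, "Marie", "Smith"),
        (4, 2, "Alison", "Marie", "Smith"),
    ],
)

query = """
select row_num, id, firstname, middlename, lastname
from (select t.*,
             row_number() over (
                 partition by id
                 order by (case when firstname is null then 1 else 0 end) +
                          (case when middlename is null then 1 else 0 end) +
                          (case when lastname is null then 1 else 0 end),
                          row_num  -- tie-breaker: earliest row wins
             ) as seqnum
      from t
     ) ranked
where seqnum = 1
order by id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Ordering by row_num after the NULL count makes the tie-break deterministic; without a second sort key, which of the tied rows gets seqnum = 1 is up to the database.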

#3


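-- Count the non-NULL name columns per row, then keep the rows that reach
-- the maximum count for their id (rows that tie are all returned).
with counted as (
    select id, firstname, middlename, lastname,
           (case when firstname is null then 0 else 1 end) +
           (case when middlename is null then 0 else 1 end) +
           (case when lastname is null then 0 else 1 end) as non_nulls
    from tablename
)
select c.id, c.non_nulls, c.firstname, c.middlename, c.lastname
from counted c
join (select id, max(non_nulls) as max_non_nulls
      from counted
      group by id) m
  on m.id = c.id and m.max_non_nulls = c.non_nulls;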
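
One detail worth checking when counting NULLs in SQL: the simple form "case col when null" compares with "=", and NULL = NULL is not true, so that branch never fires. The searched form "case when col is null" is the one that detects missing values. A small sketch using Python's built-in sqlite3 module (my choice of engine, purely for illustration) shows the difference:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simple CASE: the WHEN NULL branch is compared with "=", so it never matches
# and the ELSE value is returned even though the input is NULL.
simple_case = conn.execute(
    "select case null when null then 0 else 1 end"
).fetchone()[0]

# Searched CASE with IS NULL: the NULL input is detected.
searched_case = conn.execute(
    "select case when null is null then 0 else 1 end"
).fetchone()[0]

print(simple_case, searched_case)
```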
