How to get the n most frequent values of each column in pandas

Time: 2021-05-13 13:09:43

I know how to get the most frequent value of each column in a DataFrame using mode(). For example:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3]})
df.mode()
   A
0  2

But I can't find a way to get the n most frequent values of each column of a DataFrame. For example, for the DataFrame above, I would like the following output for n=2:

   A
0  2
1  1

Any pointers?

2 Solutions

#1


One way is to use pd.Series.value_counts and extract the index:

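import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3]})

# value_counts() sorts by descending frequency, so the first 2 index entries are the 2 most frequent values
res = pd.DataFrame({col: df[col].value_counts().head(2).index for col in df})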

#    A
# 0  2
# 1  1
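
To make it clearer why the index is what gets extracted, here is a minimal sketch of the intermediate result for a single column. The variable names counts and top2 are only illustrative and are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3]})

# value_counts() returns a Series whose index holds the distinct values
# and whose data holds how often each one occurs, most frequent first.
counts = df['A'].value_counts()

# Keep the two most frequent entries.
top2 = counts.head(2)

print(top2.index.tolist())  # the two most frequent values
print(top2.tolist())        # how often each of them occurs
```

The values you want are in the index, and the counts are the data, which is why the answer takes .index rather than the Series itself.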

#2


Use value_counts and select the index values by slicing. Because it works on each column separately, you need apply or a dict comprehension with the DataFrame constructor. Wrapping each result in a Series is necessary for a more general solution, because a column may have fewer than N distinct values, e.g.:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
                   'B': [1, 1, 1, 1, 1, 1]})

N = 2
# pd.Series lets columns with fewer than N distinct values be padded with NaN
res = df.apply(lambda x: pd.Series(x.value_counts().index[:N]))

Or:

N = 2
res = pd.DataFrame({x: pd.Series(df[x].value_counts().index[:N]) for x in df.columns})

print(res)
   A    B
0  2  1.0
1  1  NaN

For a more general solution, select only the numeric columns first with select_dtypes:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
                   'B': [1, 1, 1, 1, 1, 1],
                   'C': list('abcdef')})

N = 2
# Keep only numeric columns, then take the N most frequent values of each
res = df.select_dtypes([np.number]).apply(lambda x: pd.Series(x.value_counts().index[:N]))

# Equivalent dict comprehension over the numeric column names
cols = df.select_dtypes([np.number]).columns
res = pd.DataFrame({x: pd.Series(df[x].value_counts().index[:N]) for x in cols})

print(res)
   A    B
0  2  1.0
1  1  NaN
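
If you need this in more than one place, the two variants above could be folded into a small helper. This is only a sketch: the function name top_n_values and its numeric_only flag are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd


def top_n_values(df, n, numeric_only=False):
    """Return the n most frequent values of each column (hypothetical helper)."""
    if numeric_only:
        # Restrict to numeric columns, as in the select_dtypes variant
        df = df.select_dtypes([np.number])
    # pd.Series pads columns that have fewer than n distinct values with NaN
    return pd.DataFrame(
        {col: pd.Series(df[col].value_counts().index[:n]) for col in df.columns}
    )


df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
                   'B': [1, 1, 1, 1, 1, 1],
                   'C': list('abcdef')})

res = top_n_values(df, 2, numeric_only=True)
print(res)
```

Column B has only one distinct value, so its second row comes back as NaN, which also turns that column into floats.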
