Pandas - Add a column containing metadata about each row

Time: 2021-04-08 22:58:13

I want to add a column to a DataFrame that contains a number derived from the number of NaN values in each row, specifically: one less than the number of non-NaN values in the row.

I tried:

for index, row in df.iterrows():
    count = row.value_counts()
    val = sum(count) - 1
    df['Num Hits'] = val

This produces a warning:

-c:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead

and puts the first val value into every cell of the new column. I've tried reading about .loc and indexing in the pandas documentation but could not make sense of it. I gather that .loc wants a row_index and a column_index, but I don't know whether these are predefined in every DataFrame and I just have to specify them, or whether I need to "set" an index on the DataFrame before telling the loop where to place the new value, val.

2 Solutions

#1


1  

You can do this in a vectorized way, without a loop, which is likely to be faster than the loop version:

In [89]:

print(df)
          0         1         2         3
0  0.835396  0.330275  0.786579  0.493567
1  0.751678  0.299354  0.050638  0.483490
2  0.559348  0.106477  0.807911  0.883195
3  0.250296  0.281871  0.439523  0.117846
4  0.480055  0.269579  0.282295  0.170642
In [90]:
# number of finite (non-NaN) values in each row, minus 1; assumes numpy is imported as np
df.apply(lambda x: np.isfinite(x).sum() - 1, axis=1)
Out[90]:
0    3
1    3
2    3
3    3
4    3
dtype: int64

@DSM brought up a good point: the above solution is still not fully vectorized, because apply calls the lambda once per row. A fully vectorized form is simply (~df.isnull()).sum(axis=1) - 1.
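
As a minimal sketch of the vectorized form, assuming a small made-up DataFrame (the column names and values below are illustrative, not from the question), the count can be computed first and then assigned as the new column. Here df.notna() is used as an equivalent spelling of ~df.isnull():

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with some missing values
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

# Count the non-NaN values in each row, then subtract one
num_hits = df.notna().sum(axis=1) - 1

# Assign only after computing, so the new column is not counted itself
df["Num Hits"] = num_hits
print(df)
```

df.count(axis=1) - 1 should give the same result on the original columns, since count also skips NaN values.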

#2


0  

You can use the index variable that you define as part of the for loop as the row_index that .loc is looking for:

for index, row in df.iterrows():
    count = row.value_counts()
    val = sum(count) - 1
    df.loc[index, 'Num Hits'] = val
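
To illustrate the difference between the two assignments, here is a small sketch on made-up data (the frame and the names broadcast and per_row are hypothetical). Assigning a scalar to a column broadcasts it to every row, while .loc with a row label and a column label sets one cell. Every DataFrame already has an index; when none is specified, the labels default to 0, 1, 2, and so on, so nothing needs to be "set" first.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with some missing values
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

# Assigning a scalar to a column writes that value into every row
broadcast = df.copy()
broadcast["Num Hits"] = 5

# .loc[row_label, column_label] writes a single cell per iteration
per_row = df.copy()
for index, row in df.iterrows():
    # row.count() is the number of non-NaN values in this row
    per_row.loc[index, "Num Hits"] = row.count() - 1

print(broadcast)
print(per_row)
```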
