Modifying a subset of rows in a pandas DataFrame

Date: 2022-12-13 22:58:13

Assume I have a pandas DataFrame with two columns, A and B. I'd like to modify this DataFrame (or create a copy) so that B is always NaN whenever A is 0. How would I achieve that?

I tried the following

df['A'==0]['B'] = np.nan

and

df['A'==0]['B'].values.fill(np.nan)

without success.

5 solutions

#1


176  

Update

.ix is deprecated (and was removed in pandas 1.0); use .loc for label-based indexing:

df.loc[df.A==0, 'B'] = np.nan

Original answer (written before .ix was deprecated):

df.ix[df.A==0, 'B'] = np.nan

The df.A==0 expression creates a boolean Series that selects the rows, and 'B' selects the column. You can also use this to transform a subset of a column, e.g.:

df.loc[df.A==0, 'B'] = df.loc[df.A==0, 'B'] / 2

I don't know enough about pandas internals to know exactly why that works, but the basic issue is that indexing into a DataFrame sometimes returns a copy of the result and sometimes returns a view on the original object. According to the pandas documentation, this behavior depends on the underlying numpy behavior. I've found that accessing everything in one operation (rather than [one][two]) is more likely to work for setting.
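
To make the difference concrete, here is a minimal sketch of the two access patterns. The sample data and the names wrong_key and subset are assumptions made for illustration; B is created as a float column so that assigning NaN needs no dtype change.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; B is float so NaN can be stored directly.
df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 0.0, 5.0]})

# In the question, 'A' == 0 compares a string with an int, which is just False,
# so df['A' == 0] looks up a column labelled False instead of filtering rows.
wrong_key = ("A" == 0)

# Two-step access: df[mask] builds a separate object first, so writing to it
# does not change df. The explicit .copy() makes that intent visible.
subset = df[df["A"] == 0].copy()
subset["B"] = np.nan
unchanged = df["B"].tolist()  # df still holds its original values here

# One-step access: a single .loc call selects rows and column together
# and writes into df itself.
df.loc[df["A"] == 0, "B"] = np.nan
```

After the last line, only df has NaN in column B for the rows where A is 0; subset was a separate object all along.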

#2


66  

This is covered in the pandas docs on advanced indexing:

That section explains exactly what you need! It turns out df.loc (.ix has been deprecated, as many have pointed out below) can be used for cool slicing and dicing of a DataFrame. It can also be used to set things.

df.loc[selection criteria, columns I want] = value

So Bren's answer is saying 'find me all the places where df.A == 0, select column B and set it to np.nan'
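
The question also allows for creating a copy instead of modifying in place. A minimal sketch of that variant, assuming a hypothetical helper named blank_b_where_a_is_zero and float values in B:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; B is float so NaN can be stored directly.
df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 0.0, 5.0]})

def blank_b_where_a_is_zero(frame):
    """Return a copy of frame with B set to NaN wherever A is 0."""
    out = frame.copy()
    # Pattern: out.loc[selection criteria, column I want] = value
    out.loc[out["A"] == 0, "B"] = np.nan
    return out

result = blank_b_where_a_is_zero(df)
```

Here result carries the NaN values while df keeps its original contents.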

#3


21  

Starting from pandas 0.20, .ix is deprecated. The right way is to use .loc.

Here is a working example:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A    B
0  0  NaN
1  1  0.0
2  0  NaN

Explanation:

As explained in the doc here, .loc is primarily label based, but may also be used with a boolean array.

So, what we are doing above is applying df.loc[row_index, column_index] by:

  • Using the fact that .loc accepts a boolean array as a mask, which tells pandas which subset of rows to change (row_index)
  • Using the fact that .loc is also label based, so the label 'B' selects the column (column_index)

We can use any logical or conditional operation that returns a Series of booleans to construct the mask. In the example above we want the rows where A is 0, so we use df.A == 0. As the example below shows, this returns a Series of booleans.

>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df 
   A  B
0  0  2
1  1  0
2  0  5
>>> df.A == 0 
0     True
1    False
2     True
Name: A, dtype: bool
>>> 

Then, we use the above array of booleans to select and modify the necessary rows:

>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A    B
0  0  NaN
1  1  0.0
2  0  NaN
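
The mask does not have to be a single comparison. A short sketch of a few ways to build and combine boolean masks, using made-up data and variable names chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; B is float so NaN can be stored directly.
df = pd.DataFrame({"A": [0, 1, 0, 2], "B": [2.0, 0.0, 5.0, 7.0]})

is_zero = df["A"] == 0                          # plain comparison
zero_and_big = (df["A"] == 0) & (df["B"] > 3)   # element-wise AND; parentheses are required
not_zero = ~is_zero                             # element-wise NOT
in_set = df["A"].isin([1, 2])                   # membership test

# Any of these masks can be used as the row selector in .loc.
df.loc[zero_and_big, "B"] = np.nan
```

Only the row where A is 0 and B exceeds 3 is changed; the other rows keep their values.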

For more information check the advanced indexing documentation here.

#4


2  

For a massive speed increase, use NumPy's where function.

Setup

Create a two-column DataFrame with 100,000 rows with some zeros.

df = pd.DataFrame(np.random.randint(0,3, (100000,2)), columns=list('ab'))

Fast solution with numpy.where

df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)

Timings

%timeit df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
685 µs ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.loc[df['a'] == 0, 'b'] = np.nan
3.11 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

NumPy's where is about 4.5x faster in this benchmark (685 µs vs. 3.11 ms).
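
Timings like these depend on the pandas and NumPy versions and on the machine, so it is worth checking that the approaches give the same result before choosing one for speed. The following is a sketch under assumed data (a smaller, seeded frame with b cast to float); Series.mask is included as a third option that is not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data shaped like the benchmark above, seeded for repeatability.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 3, size=(1000, 2)), columns=list("ab"))
df["b"] = df["b"].astype("float64")  # float column so NaN fits without upcasting

# 1) numpy.where on the raw arrays, as in the answer above.
via_where = np.where(df["a"].to_numpy() == 0, np.nan, df["b"].to_numpy())

# 2) Boolean-mask assignment with .loc, done on a copy.
loc_copy = df.copy()
loc_copy.loc[loc_copy["a"] == 0, "b"] = np.nan
via_loc = loc_copy["b"].to_numpy()

# 3) Series.mask replaces values where the condition is True (NaN by default).
via_mask = df["b"].mask(df["a"] == 0).to_numpy()

# All three should agree, treating NaN as equal to NaN.
all_agree = (
    np.array_equal(via_where, via_loc, equal_nan=True)
    and np.array_equal(via_where, via_mask, equal_nan=True)
)
```

If all_agree is True, the choice between the three comes down to readability and measured speed on your own data.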

#5


1  

To replace multiple columns, convert to a NumPy array using .values:

df.loc[df.A==0, ['B', 'C']] = df.loc[df.A==0, ['B', 'C']].values / 2
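
A runnable version of this multi-column update might look as follows. The column C and the sample values are assumptions, and .to_numpy() is used in place of .values, which returns the same array here.

```python
import pandas as pd

# Hypothetical sample data with a third column C.
df = pd.DataFrame({
    "A": [0, 1, 0],
    "B": [2.0, 4.0, 6.0],
    "C": [10.0, 20.0, 30.0],
})

mask = df["A"] == 0
cols = ["B", "C"]

# The right-hand side is a plain 2x2 array, so pandas assigns it by position
# to the selected rows and columns instead of aligning on labels.
df.loc[mask, cols] = df.loc[mask, cols].to_numpy() / 2
```

Rows where A is 0 now hold half their previous B and C values; the middle row is untouched.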
