Pandas DataFrame view vs copy: how do I tell?

What's the difference between:

pandas df.loc[:,('col_a','col_b')]

and

df.loc[:,['col_a','col_b']]

The link below doesn't mention the latter, though it works. Do both pull a view? Does the first pull a view and the second pull a copy? Love learning Pandas.

http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Thanks

1 answer

#1

If your DataFrame has a simple column index, then there is no difference. For example,

In [8]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=list('ABC'))

In [9]: df.loc[:, ['A','B']]
Out[9]: 
   A   B
0  0   1
1  3   4
2  6   7
3  9  10

In [10]: df.loc[:, ('A','B')]
Out[10]: 
   A   B
0  0   1
1  3   4
2  6   7
3  9  10

But if the DataFrame has a MultiIndex, there can be a big difference:

df = pd.DataFrame(np.random.randint(10, size=(5,4)),
                  columns=pd.MultiIndex.from_arrays([['foo']*2+['bar']*2,
                                                     list('ABAB')]),
                  index=pd.MultiIndex.from_arrays([['baz']*2+['qux']*3,
                                                   list('CDCDC')]))

#       foo    bar   
#         A  B   A  B
# baz C   7  9   9  9
#     D   7  5   5  4
# qux C   5  0   5  1
#     D   1  7   7  4
#     C   6  4   3  5

In [27]: df.loc[:, ('foo','B')]
Out[27]: 
baz  C    9
     D    5
qux  C    0
     D    7
     C    4
Name: (foo, B), dtype: int64

In [28]: df.loc[:, ['foo','B']]
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'

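The KeyError is saying that the column MultiIndex has to be lexsorted. (The df.sortlevel method used here was later removed from pandas; df.sort_index(axis=1) is its replacement.) If we sort the columns, we still get a different result: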

In [29]: df.sortlevel(axis=1).loc[:, ('foo','B')]
Out[29]: 
baz  C    9
     D    5
qux  C    0
     D    7
     C    4
Name: (foo, B), dtype: int64

In [30]: df.sortlevel(axis=1).loc[:, ['foo','B']]
Out[30]: 
      foo   
        A  B
baz C   7  9
    D   7  5
qux C   5  0
    D   1  7
    C   6  4

Why is that? df.sortlevel(axis=1).loc[:, ('foo','B')] is selecting the column where the first column level equals foo, and the second column level is B.

In contrast, df.sortlevel(axis=1).loc[:, ['foo','B']] is selecting the columns where the first column level is either foo or B. With respect to the first column level, there are no B columns, but there are two foo columns.
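
The tuple-versus-list distinction can be reproduced with a small sketch. It assumes a reasonably recent pandas, so it uses sort_index(axis=1) in place of sortlevel, and it uses fixed values instead of random ones so the result is predictable. The list here contains only 'foo', because how a label missing from the first level (such as 'B') is handled may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Two-level column index: first level foo/bar, second level A/B.
columns = pd.MultiIndex.from_arrays([["foo", "foo", "bar", "bar"], list("ABAB")])
df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=columns)
df = df.sort_index(axis=1)  # lexsort the column MultiIndex

# Tuple: one key per level, so this picks the single column (foo, B).
one_column = df.loc[:, ("foo", "B")]

# List: each entry is a label on the FIRST level, so this picks every foo column.
foo_block = df.loc[:, ["foo"]]

print(type(one_column).__name__, one_column.shape)  # a Series with 5 rows
print(type(foo_block).__name__, foo_block.shape)    # a DataFrame with 5 rows, 2 columns
```

The tuple collapses both levels and returns a Series, while the list keeps both column levels and returns a DataFrame.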

I think the operating principle with Pandas is that if you use df.loc[...] as an expression, you should assume df.loc may be returning a copy or a view. The Pandas docs do not specify any rules about which you should expect. However, if you make an assignment of the form

df.loc[...] = value

then you can trust Pandas to alter df itself.

The documentation warns about the distinction between views and copies so that you are aware of the pitfall of chained assignments of the form

df.loc[...][...] = value

Here, Pandas evaluates df.loc[...] first, which may be a view or a copy. Now if it is a copy, then

df.loc[...][...] = value

is altering a copy of some portion of df, and thus has no effect on df itself. To add insult to injury, the effect on the copy is lost as well since there are no references to the copy and thus there is no way to access the copy after the assignment statement completes, and (at least in CPython) it is therefore soon-to-be garbage collected.
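
As a rough illustration of the pitfall, the sketch below compares a chained assignment with a single df.loc[...] = value assignment. The frame and the mask are made up for the example. A boolean-mask selection is used because it returns a copy, and warnings are silenced only to keep the output short.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))
mask = df["A"] > 3  # selects the last two rows

# Chained form: df.loc[mask] is evaluated first and yields a temporary copy,
# so the assignment lands on that temporary object and df is left alone.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about exactly this pattern
    df.loc[mask]["B"] = -1
chained_changed_df = bool((df["B"] == -1).any())

# Single .loc assignment: pandas writes into df itself.
df.loc[mask, "B"] = -1
direct_changed_df = bool((df.loc[mask, "B"] == -1).all())

print(chained_changed_df, direct_changed_df)
```

Selecting the rows and the column in one df.loc call is what lets pandas apply the assignment to df.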

I do not know of a practical fool-proof a priori way to determine if df.loc[...] is going to return a view or a copy.

However, there are some rules of thumb which may help guide your intuition (but note that we are talking about implementation details here, so there is no guarantee that Pandas needs to behave this way in the future):

If the resultant NDFrame cannot be expressed as a basic slice of the underlying NumPy array, then it probably will be a copy. Thus, a selection of arbitrary rows or columns will lead to a copy. A selection of sequential rows and/or sequential columns (which may be expressed as a slice) may return a view.
If the resultant NDFrame has columns of different dtypes, then df.loc will again probably return a copy.

However, there is an easy way to determine a posteriori whether x = df.loc[...] is a view: simply see if changing a value in x affects df. If it does, x is a view; if not, x is a copy.
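
One way to sketch that check is below. The helper name looks_like_view is hypothetical (it is not part of pandas). The helper assumes numeric data and unique row and column labels, and it restores the value it changes. What it reports for a given selection depends on the pandas version and settings, so no expected result is shown for the slice case.

```python
import warnings

import numpy as np
import pandas as pd


def looks_like_view(df, x):
    """Hypothetical helper: report whether writing to x also changes df."""
    row, col = x.index[0], x.columns[0]
    original = df.loc[row, col]
    sentinel = original + 1  # any value different from the original will do
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # pandas may warn about writing to x
        x.loc[row, col] = sentinel
        changed = bool(df.loc[row, col] == sentinel)
        x.loc[row, col] = original  # undo the write (also undoes it in df for a view)
    return changed


df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))

explicit_copy = df.loc[:, ["A", "B"]].copy()
print(looks_like_view(df, explicit_copy))  # an explicit copy never writes through

column_slice = df.loc[:, "A":"B"]
print(looks_like_view(df, column_slice))  # result depends on the pandas version
```

Because the check works by mutating x, it is better suited to experimenting in an interactive session than to production code.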
