What does the term "broadcasting" mean in the Pandas documentation?

Time: 2021-06-01 16:58:42

I'm reading through the Pandas documentation, and the term "broadcasting" is used extensively, but never really defined or explained.

What does it mean?

1 answer

#1


38  

The term broadcasting comes from numpy. Simply put, it describes the rules that determine the output you get when you perform operations between n-dimensional arrays (these could be panels, dataframes or series) or scalar values.

Broadcasting using a scalar value

The simplest case is just multiplying by a scalar value:

In [4]:
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))
s

Out[4]:
0    0
1    1
2    2
3    3
4    4
dtype: int32

In [5]:    
s * 10

Out[5]:
0     0
1    10
2    20
3    30
4    40
dtype: int32

We get the same expected result with a dataframe:

In [6]:    
df = pd.DataFrame({'a':np.random.randn(4), 'b':np.random.randn(4)})
df

Out[6]:
          a         b
0  0.216920  0.652193
1  0.968969  0.033369
2  0.637784  0.856836
3 -2.303556  0.426238

In [7]:    
df * 10

Out[7]:
           a         b
0   2.169204  6.521925
1   9.689690  0.333695
2   6.377839  8.568362
3 -23.035557  4.262381

What is technically happening here is that the scalar value has been broadcast to the same dimensions as the Series and DataFrame above.
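
One way to picture this is to expand the scalar by hand. The following is a minimal sketch, not what pandas does internally: it uses np.broadcast_to to build an array of 10s with the same shape as the Series and checks that multiplying by it gives the same values as multiplying by the scalar. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))

# Broadcasting the scalar 10 behaves as if it were first expanded
# to an array with the same shape as the Series.
expanded = np.broadcast_to(10, s.shape)  # read-only view of shape (5,)

via_scalar = s * 10
via_expanded = s * expanded

# Both routes produce the same values.
print((via_scalar == via_expanded).all())
```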

Broadcasting using a 1-D array

Say we have a 2-D dataframe of shape 4 x 3 (4 rows x 3 columns). We can perform an operation across every row by using a 1-D Series whose length is the same as the length of a row (the number of columns):

In [8]:
df = pd.DataFrame({'a':np.random.randn(4), 'b':np.random.randn(4), 'c':np.random.randn(4)})
df

Out[8]:
          a         b         c
0  0.122073 -1.178127 -1.531254
1  0.011346 -0.747583 -1.967079
2 -0.019716 -0.235676  1.419547
3  0.215847  1.112350  0.659432

In [26]:    
df.iloc[0]

Out[26]:
a    0.122073
b   -1.178127
c   -1.531254
Name: 0, dtype: float64

In [27]:    
df + df.iloc[0]

Out[27]:
          a         b         c
0  0.244146 -2.356254 -3.062507
1  0.133419 -1.925710 -3.498333
2  0.102357 -1.413803 -0.111707
3  0.337920 -0.065777 -0.871822

The above looks funny at first until you understand what is happening: I took the first row of values and added it row-wise to the df. It can be visualised using this pic (sourced from scipy):

[Image: broadcasting diagram, sourced from scipy]
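
The direction of the broadcast can also be spelled out with the axis argument of DataFrame.add. This is a small sketch on made-up random data (the seed is arbitrary): axis="columns" matches the Series labels against the column names, which is what the + operator does by default, while axis="index" matches them against the row labels instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
df = pd.DataFrame(rng.standard_normal((4, 3)), columns=["a", "b", "c"])

row = df.iloc[0]  # Series labelled a, b, c
col = df["a"]     # Series labelled 0, 1, 2, 3

# Row-wise: the Series is matched against the columns and added to every row.
row_wise = df.add(row, axis="columns")

# Column-wise: the Series is matched against the index and added to every column.
col_wise = df.add(col, axis="index")

print(row_wise.equals(df + row))
```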

The general rule is this:

In order to broadcast, the size of the trailing axes for both arrays in an operation must either be the same size or one of them must be one.
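
The rule can be written out as a few lines of Python. This is only an illustrative sketch: broadcast_shape is a hypothetical helper, not a numpy or pandas function. It compares the two shapes axis by axis from the trailing end, after padding the shorter one with leading 1s.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Return the shape two arrays broadcast to, or raise ValueError."""
    # Pad the shorter shape with leading 1s so both have the same length.
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for dim_a, dim_b in zip(a, b):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"incompatible shapes {shape_a} and {shape_b}")
    return tuple(result)

print(broadcast_shape((4, 3), (3,)))  # (4, 3)
print(broadcast_shape((4, 1), (3,)))  # (4, 3)

# numpy agrees with the hand-written rule.
print(np.broadcast(np.empty((4, 1)), np.empty(3)).shape)
```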

So what happens if I try to add a 1-D array that doesn't match in length, say one with 4 elements? numpy will raise a ValueError. Pandas instead aligns the Series index against the df's column labels, and because the labels 0 to 3 match none of a, b and c, you get a df full of NaN values:

In [30]:
df + pd.Series(np.arange(4))

Out[30]:
    a   b   c   0   1   2   3
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
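
The difference between the two libraries can be reproduced side by side. This is a sketch on a made-up df of ones, so the numbers themselves don't matter: pandas aligns on labels and fills the gaps with NaN, while numpy only looks at shapes and refuses (4, 3) + (4,).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((4, 3)), columns=["a", "b", "c"])
mismatched = pd.Series(np.arange(4))  # index labels 0, 1, 2, 3

# pandas: labels are aligned, nothing matches, so every cell is NaN.
aligned = df + mismatched
print(aligned.shape)
print(aligned.isna().all().all())

# numpy: there are no labels, so the shape mismatch is an error.
try:
    df.to_numpy() + mismatched.to_numpy()
except ValueError as exc:
    print("numpy raised:", exc)
```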

One of the great things about pandas is that it will try to align using existing column names and row labels. This can get in the way when trying to perform fancier broadcasting like this:

[Image: broadcasting diagram]

In [55]:
df[['a']] + df.iloc[0]

Out[55]:
          a   b   c
0  0.244146 NaN NaN
1  0.133419 NaN NaN
2  0.102357 NaN NaN
3  0.337920 NaN NaN

In the above I use double subscripting to force the shape to be (4, 1), but we see a problem when trying to broadcast using the first row: the column alignment only aligns on the first column, a. To get the same form of broadcasting as the diagram above shows, we have to decompose to numpy arrays, which are anonymous data without labels:

In [56]:
df[['a']].values + df.iloc[0].values

Out[56]:
array([[ 0.24414608, -1.05605392, -1.4091805 ],
       [ 0.13341899, -1.166781  , -1.51990758],
       [ 0.10235701, -1.19784299, -1.55096957],
       [ 0.33792013, -0.96227987, -1.31540645]])

It's also possible to broadcast in 3 dimensions. I don't go near that stuff often, but the numpy, scipy and pandas book have examples that show how that works.

Generally speaking, the thing to remember is that, aside from scalar values, which are simple, for n-D arrays the lengths of the minor/trailing axes must match or one of them must be 1.
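
For the 3-D case, here is a short numpy-only sketch on an array of zeros (the shapes are picked purely for illustration). It shows the same trailing-axes rule: two operands that are compatible, and one that is not.

```python
import numpy as np

cube = np.zeros((2, 3, 4))  # 3-D array

# Trailing axes match (4 and 4): broadcast across the first two axes.
r1 = cube + np.arange(4)
print(r1.shape)  # (2, 3, 4)

# Shape (3, 1): the length-1 axis is stretched to 4 and a leading axis is added.
r2 = cube + np.arange(3).reshape(3, 1)
print(r2.shape)  # (2, 3, 4)

# Shape (3,) fails: the trailing axes are 4 and 3, and neither is 1.
try:
    cube + np.arange(3)
except ValueError as exc:
    print("not broadcastable:", exc)
```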

Update

It seems that the above now leads to ValueError: Unable to coerce to Series, length must be 1: given 3 in the latest version of pandas (0.20.2)

So you have to call .values on the df first:

In[42]:
df[['a']].values + df.iloc[0].values

Out[42]: 
array([[ 0.244146, -1.056054, -1.409181],
       [ 0.133419, -1.166781, -1.519908],
       [ 0.102357, -1.197843, -1.55097 ],
       [ 0.33792 , -0.96228 , -1.315407]])

To turn this back into a df like the original, we can construct a df from the np array and pass the original columns to the constructor:

In[43]:
pd.DataFrame(df[['a']].values + df.iloc[0].values, columns=df.columns)

Out[43]: 
          a         b         c
0  0.244146 -1.056054 -1.409181
1  0.133419 -1.166781 -1.519908
2  0.102357 -1.197843 -1.550970
3  0.337920 -0.962280 -1.315407
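
The two steps can be wrapped in a small function that also keeps the row labels. This is only a sketch: broadcast_add is a hypothetical helper name, the data is made up, and it uses .to_numpy(), which newer pandas versions offer as an alternative to .values.

```python
import numpy as np
import pandas as pd

def broadcast_add(column_frame, row):
    """Add an (n, 1) DataFrame and a length-m Series by position, ignoring labels."""
    values = column_frame.to_numpy() + row.to_numpy()  # (n, 1) + (m,) -> (n, m)
    # Re-attach labels: rows from the frame, columns from the Series.
    return pd.DataFrame(values, index=column_frame.index, columns=row.index)

df = pd.DataFrame(np.arange(12.0).reshape(4, 3),
                  columns=["a", "b", "c"],
                  index=["w", "x", "y", "z"])

result = broadcast_add(df[["a"]], df.iloc[1])
print(result)
```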
