Returning multiple values from pandas apply on a DataFrame

Date: 2021-02-18 00:04:26

I'm using a Pandas DataFrame to do a row-wise t-test as per this example:

import numpy
import pandas

df = pandas.DataFrame(numpy.log2(numpy.random.randn(1000, 4)),
                      columns=["a", "b", "c", "d"])

df = df.dropna()

Now, suppose "a" and "b" form one group and "c" and "d" the other, and I want to perform the t-test row-wise. This is fairly trivial with pandas, using apply with axis=1. However, apply returns either a DataFrame of the same shape if my function doesn't aggregate, or a Series if it does.

Normally I would just output the p-value (so, aggregation) but I would like to generate an additional value based on other calculations (in other words, return two values). I can of course do two runs, aggregating the p-values first, then doing the other work, but I was wondering if there is a more efficient way to do so as the data is reasonably large.

As an example of the calculation, a hypothetical function would be:

from scipy.stats import ttest_ind

def t_test_and_mean(series, first, second):
    first_group = series[first]
    second_group = series[second]
    _, pvalue = ttest_ind(first_group, second_group)

    mean_ratio = second_group.mean() / first_group.mean()

    return (pvalue, mean_ratio)

Then invoked with

df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1)

Of course, in this case it returns a single Series whose values are (pvalue, mean_ratio) tuples.

Instead, my expected output would be a DataFrame with two columns, one for the first result and one for the second. Is this possible, or do I have to do two runs for the two calculations and then merge them together?

1 solution

#1


Returning a Series, rather than a tuple, should produce a new multi-column DataFrame, with one column per entry in the returned Series. For example,

return pandas.Series({'pvalue': pvalue, 'mean_ratio': mean_ratio})
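Putting the pieces together, a minimal end-to-end sketch might look like the following. It reuses the question's t_test_and_mean with only the return statement changed. The sample data is an assumption made for illustration (uniform values between 1 and 2, so the mean ratio is always well defined), and the groups are converted with to_numpy() before being handed to ttest_ind, which is a precaution rather than something the original code did.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def t_test_and_mean(row, first, second):
    # Plain NumPy arrays for each group (a precaution, not in the original).
    first_group = row[first].to_numpy()
    second_group = row[second].to_numpy()
    _, pvalue = ttest_ind(first_group, second_group)

    mean_ratio = second_group.mean() / first_group.mean()

    # Returning a Series makes apply build one output column per key.
    return pd.Series({"pvalue": pvalue, "mean_ratio": mean_ratio})


# Hypothetical sample data: strictly positive so the ratio is well defined.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(1.0, 2.0, size=(100, 4)),
                  columns=["a", "b", "c", "d"])

result = df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1)
print(result.head())
```

Here result is a DataFrame that shares df's index and has the columns "pvalue" and "mean_ratio", so both values come out of a single pass over the rows.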
