I'm trying to pivot a table whose result values are strings.
import pandas as pd
df1 = pd.DataFrame({'index' : range(8),
'variable1' : ["A","A","B","B","A","B","B","A"],
'variable2' : ["a","b","a","b","a","b","a","b"],
'variable3' : ["x","x","x","y","y","y","x","y"],
'result': ["on","off","off","on","on","off","off","on"]})
df1.pivot_table(values='result',rows='index',cols=['variable1','variable2','variable3'])
But I get: DataError: No numeric types to aggregate.
This works as intended when I change the result values to numbers:
df2 = pd.DataFrame({'index' : range(8),
'variable1' : ["A","A","B","B","A","B","B","A"],
'variable2' : ["a","b","a","b","a","b","a","b"],
'variable3' : ["x","x","x","y","y","y","x","y"],
'result': [1,0,0,1,1,0,0,1]})
df2.pivot_table(values='result',rows='index',cols=['variable1','variable2','variable3'])
And I get what I need:
variable1    A                   B
variable2    a         b         a    b
variable3    x    y    x    y    x    y
index
0            1  NaN  NaN  NaN  NaN  NaN
1          NaN  NaN    0  NaN  NaN  NaN
2          NaN  NaN  NaN  NaN    0  NaN
3          NaN  NaN  NaN  NaN  NaN    1
4          NaN    1  NaN  NaN  NaN  NaN
5          NaN  NaN  NaN  NaN  NaN    0
6          NaN  NaN  NaN  NaN    0  NaN
7          NaN  NaN  NaN    1  NaN  NaN
I know I can map the strings to numerical values and then reverse the operation, but maybe there is a more elegant solution?
2 solutions
#1
24
My original reply was based on pandas 0.14.1. Since then, several things have changed in the pivot_table function (for example, rows --> index and cols --> columns).
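As a minimal sketch of the renamed keywords, here is the numeric example from the question written with index= and columns= instead of rows= and cols=. It assumes a reasonably recent pandas; the variable name pivoted is only illustrative.

```python
import pandas as pd

# Numeric version of the data from the question.
df2 = pd.DataFrame({
    'index': range(8),
    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
    'result': [1, 0, 0, 1, 1, 0, 0, 1],
})

# index= takes the place of the old rows=, columns= the place of the old cols=.
pivoted = df2.pivot_table(
    values='result',
    index='index',
    columns=['variable1', 'variable2', 'variable3'],
)
print(pivoted)
```

The default aggregation is the mean, so each cell holds the single matching value as a float.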
Additionally, the lambda trick I originally posted no longer appears to work on pandas 0.18. You have to provide a reducing function (even if it is min, max or mean). But even that seemed improper, because we are not reducing the data set, just transforming it. So I looked harder at unstack.
import pandas as pd
df1 = pd.DataFrame({'index' : range(8),
'variable1' : ["A","A","B","B","A","B","B","A"],
'variable2' : ["a","b","a","b","a","b","a","b"],
'variable3' : ["x","x","x","y","y","y","x","y"],
'result': ["on","off","off","on","on","off","off","on"]})
# these are the columns to end up in the multi-index columns.
unstack_cols = ['variable1', 'variable2', 'variable3']
First, set an index on the data using the index plus the columns you want to unstack, then call unstack with the level argument.
df1.set_index(['index'] + unstack_cols).unstack(level=unstack_cols)
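The resulting dataframe keeps the original strings, with 'index' as the row labels and variable1, variable2 and variable3 as column levels under a top-level 'result' label. Cells with no matching row are NaN.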
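Put together, the unstack approach can be sketched as one runnable script. The cross-check at the end uses pivot_table with aggfunc='first'; that particular reducing function is my own choice for illustration, not something the answer prescribes, and the names wide and via_pivot are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({
    'index': range(8),
    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
    'result': ["on", "off", "off", "on", "on", "off", "off", "on"],
})

# These columns end up as levels of the column MultiIndex.
unstack_cols = ['variable1', 'variable2', 'variable3']

# Move everything except 'result' into the index, then unstack the variables.
wide = df1.set_index(['index'] + unstack_cols).unstack(level=unstack_cols)

# 'result' is the top column level; selecting it leaves the three variables.
print(wide['result'])

# Cross-check (assumption: 'first' as the reducing function).
via_pivot = df1.pivot_table(
    values='result',
    index='index',
    columns=unstack_cols,
    aggfunc='first',
)
print(via_pivot)
```

Because every (index, variable1, variable2, variable3) combination is unique here, unstack has nothing to aggregate, which is why it fits a pure reshape better than pivot_table does.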
#2
2
I think the best compromise is to replace on/off with True/False, which lets pandas "understand" the data better and behave in the expected way.
df2 = df1.replace({'on': True, 'off': False})
You essentially conceded this in your question. My answer is that I don't think there's a better way, and you should replace 'on'/'off' anyway for whatever comes next.
As Andy Hayden points out in the comments, you'll get better performance if you replace on/off with 1/0.
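A rough sketch of the 1/0 variant, including the reverse mapping the question mentions. The dictionaries to_num and to_str and the names wide_num and wide_str are hypothetical, and the index=/columns= keywords assume a recent pandas.

```python
import pandas as pd

df1 = pd.DataFrame({
    'index': range(8),
    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
    'result': ["on", "off", "off", "on", "on", "off", "off", "on"],
})

# Forward mapping to numbers, and the reverse mapping back to strings.
# The reverse keys are floats because the pivoted values come back as floats.
to_num = {'on': 1, 'off': 0}
to_str = {1.0: 'on', 0.0: 'off'}

df2 = df1.assign(result=df1['result'].map(to_num))

wide_num = df2.pivot_table(
    values='result',
    index='index',
    columns=['variable1', 'variable2', 'variable3'],
)

# Optional: map each column back to the original labels; NaN stays NaN.
wide_str = wide_num.apply(lambda col: col.map(to_str))
print(wide_str)
```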