复杂的(对我来说)从大到长,在熊猫。

时间:2021-12-23 04:28:51

Individuals (indexed from 0 to 5) choose between two locations: A and B. My data has a wide format containing characteristics that vary by individual (ind_var) and characteristics that vary only by location (location_var).

个人(从0到5)在两个位置之间进行选择:A和b。我的数据有一个宽格式,包含不同的个体(ind_var)和不同的特征(location_var)。

For example, I have:

例如,我有:

In [281]:

df_reshape_test = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})

df_reshape_test

Out[281]:
    dist_to_A   dist_to_B   ind_var location location_var
0    0            50             3   A       10
1    0            50             8   A       10
2    0            50            10   A       10
3    50           0              1   B       14
4    50           0              3   B       14
5    50           0              4   B       14

The variable 'location' is the one chosen by the individual. dist_to_A is the distance to location A from the location chosen by the individual (same thing with dist_to_B)

变量的位置是由个人选择的。dist_to_A是从个人选择的位置(与dist_to_B相同的)到位置A的距离

I'd like my data to have this form:

我希望我的数据有这样的形式:

    choice  dist_S  ind_var location    location_var
0    1        0       3         A           10
0    0       50       3         B           14
1    1        0       8         A           10
1    0       50       8         B           14
2    1        0      10         A           10
2    0       50      10         B           14
3    0       50       1         A           10
3    1        0       1         B           14
4    0       50       3         A           10
4    1        0       3         B           14
5    0       50       4         A           10
5    1        0       4         B           14

where choice == 1 indicates individual has chosen that location and dist_S is the distance from the location chosen.

选择== 1表示个人选择了位置,而dist_S是距离所选位置的距离。

I read about the .stack method but couldn't figure out how to apply it for this case. Thanks for your time!

我读过。stack方法,但不知道如何应用它。谢谢你的时间!

NOTE: this is just a simple example. The datasets I'm looking have varying numbers of location and number of individuals per location, so I'm looking for a flexible solution if possible

注意:这只是一个简单的例子。我所看到的数据集有不同的位置和每个位置的个人数量,所以如果可能的话,我正在寻找一个灵活的解决方案。

3 个解决方案

#1


6  

In fact, pandas has a wide_to_long command that can conveniently do what you intend to do.

事实上,熊猫有一个wide_to_long命令,可以方便地做你想做的事情。

df = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 
                'dist_to_A' : [0, 0, 0, 50, 50, 50], 
                'dist_to_B' : [50, 50, 50, 0, 0, 0], 
                'location_var': [10, 10, 10, 14, 14, 14], 
                'ind_var': [3, 8, 10, 1, 3, 4]})

df['ind'] = df.index

#The `location` and `location_var` corresponds to the choices, 
#record them as dictionaries and drop them 
#(Just realized you had a cleaner way, copied from yous). 

ind_to_loc = dict(df['location'])
loc_dict = dict(df.groupby('location').agg(lambda x : int(np.mean(x)))['location_var'])
df.drop(['location_var', 'location'], axis = 1, inplace = True)
# now reshape
df_long = pd.wide_to_long(df, ['dist_to_'], i = 'ind', j = 'location') 

# use the dictionaries to get variables `choice` and `location_var` back.

df_long['choice'] = df_long.index.map(lambda x: ind_to_loc[x[0]])
df_long['location_var'] = df_long.index.map(lambda x : loc_dict[x[1]])
print df_long.sort()

This gives you the table you asked for:

这是你要的表格:

              ind_var  dist_to_ choice  location_var
ind location                                        
0   A               3         0      A            10
    B               3        50      A            14
1   A               8         0      A            10
    B               8        50      A            14
2   A              10         0      A            10
    B              10        50      A            14
3   A               1        50      B            10
    B               1         0      B            14
4   A               3        50      B            10
    B               3         0      B            14
5   A               4        50      B            10
    B               4         0      B            14

Of course you can generate a choice variable that takes 0 and 1 if that's what you want.

当然,你可以生成一个选择变量,如果你想要的话,你可以选择0和1。

#2


3  

I'm a bit curious why you'd like it in the format. There's probably a much better way to store your data. But here goes.

我有点好奇为什么你会喜欢它的格式。可能有更好的方法来存储数据。但在这里。

In [137]: import numpy as np

In [138]: import pandas as pd

In [139]: df_reshape_test = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B
', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 
0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})

In [140]: print(df_reshape_test)
   dist_to_A  dist_to_B  ind_var location  location_var
0          0         50        3        A            10
1          0         50        8        A            10
2          0         50       10        A            10
3         50          0        1        B            14
4         50          0        3        B            14
5         50          0        4        B            14

In [141]: # Get the new axis separately:

In [142]: idx = pd.Index(df_reshape_test.index.tolist() * 2)

In [143]: df2 = df_reshape_test[['ind_var', 'location', 'location_var']].reindex(idx)

In [144]: print(df2)
   ind_var location  location_var
0        3        A            10
1        8        A            10
2       10        A            10
3        1        B            14
4        3        B            14
5        4        B            14
0        3        A            10
1        8        A            10
2       10        A            10
3        1        B            14
4        3        B            14
5        4        B            14

In [145]: # Swap the location for the second half

In [146]: # replace any 6 with len(df) / 2 + 1 if you have more rows.d 

In [147]: df2['choice'] = [1] * 6 + [0] * 6  # may need to play with this.

In [148]: df2.iloc[6:].location.replace({'A': 'B', 'B': 'A'}, inplace=True)

In [149]: df2 = df2.sort()

In [150]: df2['dist_S'] = np.abs((df2.choice - 1) * 50)

In [151]: print(df2)
   ind_var location  location_var  choice  dist_S
0        3        A            10       1       0
0        3        B            10       0      50
1        8        A            10       1       0
1        8        B            10       0      50
2       10        A            10       1       0
2       10        B            10       0      50
3        1        B            14       1       0
3        1        A            14       0      50
4        3        B            14       1       0
4        3        A            14       0      50
5        4        B            14       1       0
5        4        A            14       0      50

It's not going to generalize well, but there are probably alternative (better) ways to get around the uglier parts like generating the choice col.

它不会很好地推广,但是可能有更好的方法来绕过更丑陋的部分,比如生成选择col。

#3


2  

Ok, this took longer that I expected, but here's a more general answer that works with an arbitrary number of choices per individual. I'm sure there are simpler ways, so it would be great if somebody can chime in with something better for some of the following code.

好吧,这比我预期的要长,但这里有一个更普遍的答案,它适用于每个个体的任意数量的选择。我相信有更简单的方法,所以如果有人能在下面的代码中找到更好的东西,那就太好了。

df = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})

which gives

这给了

    dist_to_A   dist_to_B   ind_var location   location_var
0    0           50          3     A            10
1    0           50          8     A            10
2    0           50         10     A            10
3    50          0           1     B            14
4    50          0           3     B            14
5    50          0           4     B            14

Then we do:

然后我们做什么:

df.index.names = ['ind']

# Add choice var

df['choice'] = 1

# Create dictionaries we'll use later

ind_to_loc = dict(df['location'])
# gives ind_to_loc equal to {0 : 'A', 1 : 'A', 2 : 'A', 3 : 'B', 4 : 'B', 5: 'B'}

ind_dict = dict(df['ind_var'])
#gives  { 0: 3, 1 : 8, 2 : 10, 3: 1, 4 : 3, 5: 4}

loc_dict = dict(  df.groupby('location').agg(lambda x : int(np.mean(x)) )['location_var']  )
# gives  {'A' : 10, 'B' : 14}

Now I create a Multi-Index and do a re-index to get a long shape

现在我创建一个多索引,并做一个重新索引以获得一个长形状。

df = df.set_index( [df.index, df['location']] )

df.index.names = ['ind', 'location']

# re-index to long shape

loc_list = ['A', 'B']
ind_list = [0, 1, 2, 3, 4, 5]
new_shape = [  (ind, loc) for ind in ind_list for loc in loc_list]
idx = pd.Index(new_shape)
df_long = df.reindex(idx, method = None)
df_long.index.names = ['ind', 'loc']

The long shape looks like this:

长形像这样:

         dist_to_A  dist_to_B  ind_var location  location_var  choice
ind loc                                                              
0   A            0         50        3        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
1   A            0         50        8        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
2   A            0         50       10        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
3   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        1        B            14       1
4   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        3        B            14       1
5   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        4        B            14       1

So now fill the NaN values with the dictionaries:

现在用字典来填充NaN的值:

df_long['ind_var'] = df_long.index.map(lambda x : ind_dict[x[0]] )
df_long['location']  = df_long.index.map(lambda x : ind_to_loc[x[0]] )
df_long['location_var'] = df_long.index.map(lambda x : loc_dict[x[1]] )

# Fill in choice
df_long['choice'] = df_long['choice'].fillna(0)

Finally, all that is left is creating dist_S
I'll cheat here and assume I can create a nested dictionary like this one

最后,剩下的就是创建dist_S,我在这里作弊,假设我可以创建一个像这样的嵌套字典。

nested_loc = {'A' : {'A' : 0, 'B' : 50}, 'B' : {'A' : 50, 'B' : 0}}

(This reads: if you're in location A, then location A is at 0 km and location B at 50 km)

(这句话的意思是:如果你在A位置,那么A位置在0公里处,地点B在50公里处)

def nested_f(x):    
    return nested_loc[x[0]][x[1]]

df_long = df_long.reset_index()
df_long['dist_S'] = df_long[['loc', 'location']].apply(nested_f, axis=1)

df_long = df_long.drop(['dist_to_A', 'dist_to_B', 'location'], axis = 1 )

df_long

gives the desired result

给出了期望的结果

    ind loc ind_var location_var    choice  dist_S
0    0   A   3         10            1      0
1    0   B   3         14            0      50
2    1   A   8         10            1      0
3    1   B   8         14            0      50
4    2   A   10        10            1      0
5    2   B   10        14            0      50
6    3   A   1         10            0      50
7    3   B   1         14            1      0
8    4   A   3         10            0      50
9    4   B   3         14            1      0
10   5   A   4         10            0      50
11   5   B   4         14            1      0

#1


6  

In fact, pandas has a wide_to_long command that can conveniently do what you intend to do.

事实上,熊猫有一个wide_to_long命令,可以方便地做你想做的事情。

df = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 
                'dist_to_A' : [0, 0, 0, 50, 50, 50], 
                'dist_to_B' : [50, 50, 50, 0, 0, 0], 
                'location_var': [10, 10, 10, 14, 14, 14], 
                'ind_var': [3, 8, 10, 1, 3, 4]})

df['ind'] = df.index

#The `location` and `location_var` corresponds to the choices, 
#record them as dictionaries and drop them 
#(Just realized you had a cleaner way, copied from yous). 

ind_to_loc = dict(df['location'])
loc_dict = dict(df.groupby('location').agg(lambda x : int(np.mean(x)))['location_var'])
df.drop(['location_var', 'location'], axis = 1, inplace = True)
# now reshape
df_long = pd.wide_to_long(df, ['dist_to_'], i = 'ind', j = 'location') 

# use the dictionaries to get variables `choice` and `location_var` back.

df_long['choice'] = df_long.index.map(lambda x: ind_to_loc[x[0]])
df_long['location_var'] = df_long.index.map(lambda x : loc_dict[x[1]])
print df_long.sort()

This gives you the table you asked for:

这是你要的表格:

              ind_var  dist_to_ choice  location_var
ind location                                        
0   A               3         0      A            10
    B               3        50      A            14
1   A               8         0      A            10
    B               8        50      A            14
2   A              10         0      A            10
    B              10        50      A            14
3   A               1        50      B            10
    B               1         0      B            14
4   A               3        50      B            10
    B               3         0      B            14
5   A               4        50      B            10
    B               4         0      B            14

Of course you can generate a choice variable that takes 0 and 1 if that's what you want.

当然,你可以生成一个选择变量,如果你想要的话,你可以选择0和1。

#2


3  

I'm a bit curious why you'd like it in the format. There's probably a much better way to store your data. But here goes.

我有点好奇为什么你会喜欢它的格式。可能有更好的方法来存储数据。但在这里。

In [137]: import numpy as np

In [138]: import pandas as pd

In [139]: df_reshape_test = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B
', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 
0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})

In [140]: print(df_reshape_test)
   dist_to_A  dist_to_B  ind_var location  location_var
0          0         50        3        A            10
1          0         50        8        A            10
2          0         50       10        A            10
3         50          0        1        B            14
4         50          0        3        B            14
5         50          0        4        B            14

In [141]: # Get the new axis separately:

In [142]: idx = pd.Index(df_reshape_test.index.tolist() * 2)

In [143]: df2 = df_reshape_test[['ind_var', 'location', 'location_var']].reindex(idx)

In [144]: print(df2)
   ind_var location  location_var
0        3        A            10
1        8        A            10
2       10        A            10
3        1        B            14
4        3        B            14
5        4        B            14
0        3        A            10
1        8        A            10
2       10        A            10
3        1        B            14
4        3        B            14
5        4        B            14

In [145]: # Swap the location for the second half

In [146]: # replace any 6 with len(df) / 2 + 1 if you have more rows.d 

In [147]: df2['choice'] = [1] * 6 + [0] * 6  # may need to play with this.

In [148]: df2.iloc[6:].location.replace({'A': 'B', 'B': 'A'}, inplace=True)

In [149]: df2 = df2.sort()

In [150]: df2['dist_S'] = np.abs((df2.choice - 1) * 50)

In [151]: print(df2)
   ind_var location  location_var  choice  dist_S
0        3        A            10       1       0
0        3        B            10       0      50
1        8        A            10       1       0
1        8        B            10       0      50
2       10        A            10       1       0
2       10        B            10       0      50
3        1        B            14       1       0
3        1        A            14       0      50
4        3        B            14       1       0
4        3        A            14       0      50
5        4        B            14       1       0
5        4        A            14       0      50

It's not going to generalize well, but there are probably alternative (better) ways to get around the uglier parts like generating the choice col.

它不会很好地推广,但是可能有更好的方法来绕过更丑陋的部分,比如生成选择col。

#3


2  

Ok, this took longer that I expected, but here's a more general answer that works with an arbitrary number of choices per individual. I'm sure there are simpler ways, so it would be great if somebody can chime in with something better for some of the following code.

好吧,这比我预期的要长,但这里有一个更普遍的答案,它适用于每个个体的任意数量的选择。我相信有更简单的方法,所以如果有人能在下面的代码中找到更好的东西,那就太好了。

df = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})

which gives

这给了

    dist_to_A   dist_to_B   ind_var location   location_var
0    0           50          3     A            10
1    0           50          8     A            10
2    0           50         10     A            10
3    50          0           1     B            14
4    50          0           3     B            14
5    50          0           4     B            14

Then we do:

然后我们做什么:

df.index.names = ['ind']

# Add choice var

df['choice'] = 1

# Create dictionaries we'll use later

ind_to_loc = dict(df['location'])
# gives ind_to_loc equal to {0 : 'A', 1 : 'A', 2 : 'A', 3 : 'B', 4 : 'B', 5: 'B'}

ind_dict = dict(df['ind_var'])
#gives  { 0: 3, 1 : 8, 2 : 10, 3: 1, 4 : 3, 5: 4}

loc_dict = dict(  df.groupby('location').agg(lambda x : int(np.mean(x)) )['location_var']  )
# gives  {'A' : 10, 'B' : 14}

Now I create a Multi-Index and do a re-index to get a long shape

现在我创建一个多索引,并做一个重新索引以获得一个长形状。

df = df.set_index( [df.index, df['location']] )

df.index.names = ['ind', 'location']

# re-index to long shape

loc_list = ['A', 'B']
ind_list = [0, 1, 2, 3, 4, 5]
new_shape = [  (ind, loc) for ind in ind_list for loc in loc_list]
idx = pd.Index(new_shape)
df_long = df.reindex(idx, method = None)
df_long.index.names = ['ind', 'loc']

The long shape looks like this:

长形像这样:

         dist_to_A  dist_to_B  ind_var location  location_var  choice
ind loc                                                              
0   A            0         50        3        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
1   A            0         50        8        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
2   A            0         50       10        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
3   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        1        B            14       1
4   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        3        B            14       1
5   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        4        B            14       1

So now fill the NaN values with the dictionaries:

现在用字典来填充NaN的值:

df_long['ind_var'] = df_long.index.map(lambda x : ind_dict[x[0]] )
df_long['location']  = df_long.index.map(lambda x : ind_to_loc[x[0]] )
df_long['location_var'] = df_long.index.map(lambda x : loc_dict[x[1]] )

# Fill in choice
df_long['choice'] = df_long['choice'].fillna(0)

Finally, all that is left is creating dist_S
I'll cheat here and assume I can create a nested dictionary like this one

最后,剩下的就是创建dist_S,我在这里作弊,假设我可以创建一个像这样的嵌套字典。

nested_loc = {'A' : {'A' : 0, 'B' : 50}, 'B' : {'A' : 50, 'B' : 0}}

(This reads: if you're in location A, then location A is at 0 km and location B at 50 km)

(这句话的意思是:如果你在A位置,那么A位置在0公里处,地点B在50公里处)

def nested_f(x):    
    return nested_loc[x[0]][x[1]]

df_long = df_long.reset_index()
df_long['dist_S'] = df_long[['loc', 'location']].apply(nested_f, axis=1)

df_long = df_long.drop(['dist_to_A', 'dist_to_B', 'location'], axis = 1 )

df_long

gives the desired result

给出了期望的结果

    ind loc ind_var location_var    choice  dist_S
0    0   A   3         10            1      0
1    0   B   3         14            0      50
2    1   A   8         10            1      0
3    1   B   8         14            0      50
4    2   A   10        10            1      0
5    2   B   10        14            0      50
6    3   A   1         10            0      50
7    3   B   1         14            1      0
8    4   A   3         10            0      50
9    4   B   3         14            1      0
10   5   A   4         10            0      50
11   5   B   4         14            1      0