我如何使用熔体()将熊猫DataFrame变成一个列表,从交叉表创建索引,并在其位置创建一个新的变量?

时间:2022-07-13 22:56:59

I have a matrix of data 29523 rows x 503 cols of which 3 cols are indices (below is a subset for example).

我有一个数据矩阵,数据为29523行x 503,其中3个cols是索引(下面是一个子集)。

IDX1|  IDX2  | IDX3 | 1983 Q4   |  X  | Y |  Z  |1984 Q1 |   X  | Y | Z 
---------------------------------------------------------------------------
A   |   A1   |  Q   |   10      |  A  | F | NaN | 110    |   A  | F | NaN
A   |   A2   |  Q   |   20      |  B  | C | 40  | 120    |   B  | C | 240
A   |   A3   |  Q   |   30      |  A  | F | NaN | 130    |   A  | F | NaN
A   |   A4   |  Q   |   40      |  B  | C | 80  | 140    |   B  | C | 280
A   |   A5   |  Q   |   50      |  A  | F | NaN | 150    |   A  | F | NaN
A   |   A6   |  Q   |   60      |  B  | F | 120 | 160    |   B  | F | 320

I read this into a DataFrame with:

我把这个读入了一个DataFrame:

>>> df = pd.read_csv(C:\filename.csv, low_memory=False, mangle_dupe_cols=False)

and then use pandas.melt() to pivot the data:

然后使用pandas.melt()来转换数据:

df1 = pd.melt(df, id_vars=['IDX1', 'IDX2', 'IDX3'], var_name='ValueType',
              value_name = 'Value')

I have also tried stack() but melt() proved better here.

我也尝试过stack(),但是在这里被证明是更好的。

IDX1    |   IDX2    |   IDX3    |   ValueType   |   Value
---------------------------------------------------------------
A       |   A1      |   Q       |   1983 Q4     |   10
A       |   A1      |   Q       |   X           |   A
A       |   A1      |   Q       |   Y           |   F
A       |   A1      |   Q       |   Z           |   NaN
A       |   A1      |   Q       |   1984 Q1     |   110
A       |   A1      |   Q       |   X           |   A
A       |   A1      |   Q       |   Y           |   F
A       |   A1      |   Q       |   Z           |   NaN
A       |   A2      |   Q       |   1983 Q4     |   20
A       |   A2      |   Q       |   X           |   B
A       |   A2      |   Q       |   Y           |   C
A       |   A2      |   Q       |   Z           |   40

The option mangle_dupe_cols on the read_csv if True will place a .int suffix against all ValueTypes that are duplicated. This is not ideal, but without it there is no way of linking the values for the variables to the correct period.

在read_csv中选择mangle_dupe_cols,如果True将对复制的所有valuetype设置一个.int后缀。这并不理想,但是没有它,就无法将变量的值与正确的周期联系起来。

What I would prefer to do is instead of having the Period (1984 Q1) as a ValueType, give the Periods corresponding Value a variable 'W' and have each period form part of the IDX as below:

我更愿意做的是将周期(1984年Q1)作为一个ValueType,给出相应的周期对应的变量“W”,并将每个周期作为IDX的一部分,如下所示:

IDX1    |   IDX2    |   IDX3 | IDX4    |    ValueType   |   Value
---------------------------------------------------------------
A       |   A1      |   Q    |  1983 Q4|    W           |   10
A       |   A1      |   Q    |  1983 Q4|    X           |   A
A       |   A1      |   Q    |  1983 Q4|    Y           |   F
A       |   A1      |   Q    |  1983 Q4|    Z           |   NaN
A       |   A1      |   Q    |  1984 Q1|    W           |   110
A       |   A1      |   Q    |  1984 Q1|    X           |   A
A       |   A1      |   Q    |  1984 Q1|    Y           |   F
A       |   A1      |   Q    |  1984 Q1|    Z           |   NaN
A       |   A2      |   Q    |  1983 Q4|    W           |   20
A       |   A2      |   Q    |  1983 Q4|    X           |   B
A       |   A2      |   Q    |  1983 Q4|    Y           |   C
A       |   A2      |   Q    |  1983 Q4|    Z           |   40

Is the above possible with pandas or numpy?

上面的可能是熊猫还是数字?

My final DataFrame is going to be 14,761,500 rows x 6 cols.

最后一个DataFrame是14761,500行x 6 cols。

1 个解决方案

#1


2  

Given

鉴于

In [189]: df
Out[189]: 
  IDX1 IDX2 IDX3  1983 Q4  X  Y    Z  1984 Q1 X.1 Y.1  Z.1
0    A   A1    Q       10  A  F  NaN      110   A   F  NaN
1    A   A2    Q       20  B  C   40      120   B   C  240
2    A   A3    Q       30  A  F  NaN      130   A   F  NaN
3    A   A4    Q       40  B  C   80      140   B   C  280
4    A   A5    Q       50  A  F  NaN      150   A   F  NaN
5    A   A6    Q       60  B  F  120      160   B   F  320

Let us first set ['IDX1', 'IDX2', 'IDX3'] as the index.

让我们首先设置['IDX1', 'IDX2', 'IDX3']作为索引。

df = df.set_index(['IDX1', 'IDX2', 'IDX3'])

The other columns have a periodic quality to them; we want to handle every 4 columns as a group. This idea of "handling as a group" leads naturally to assigning a new index level to the column index; some value which is the same for every 4 columns. This would be ideal:

其他列对它们有周期性的质量;我们想把每4列作为一个组来处理。这种“作为一组处理”的思想自然会将一个新的索引级别分配给列索引;每个4列的值是相同的。这将是理想:

               1983 Q4            1984 Q1           
                     W  X  Y    Z       W  X  Y    Z
IDX1 IDX2 IDX3                                      
A    A1   Q         10  A  F  NaN     110  A  F  NaN
     A2   Q         20  B  C  240     120  B  C  240
     A3   Q         30  A  F  NaN     130  A  F  NaN
     A4   Q         40  B  C  280     140  B  C  280
     A5   Q         50  A  F  NaN     150  A  F  NaN
     A6   Q         60  B  F  320     160  B  F  320

We can achieve this by building a MultiIndex and assigning it to df.columns:

我们可以通过构建一个多索引并将其分配给df列来实现这一点:

columns = [col for col in df.columns if col[0] not in set(list('XYZ'))]
df.columns = pd.MultiIndex.from_product([columns, list('WXYZ')])

Now the desired long-format DataFrame can be obtained by calling df.stack to move the column levels into the row index:

现在,可以通过调用df来获得所需的长格式DataFrame。栈将列级别移动到行索引:

df.columns.names = ['IDX4', 'ValueType']
series = df.stack(['IDX4', 'ValueType'], dropna=False)

Note also that when mangle_dupe_cols=False, the duplicate columns, X, Y, Z, get overwritten. So you lose data with mangle_dupe_cols=False. For example, when you use mangle_dupe_cols=False the last row's Z value gets assigns to every Z column regardless of the period.

还要注意,当mangle_dupe_cols=False时,重复的列,X, Y, Z,被覆盖。因此,您丢失了mangle_dupe_cols=False的数据。例如,当您使用mangle_dupe_cols=False时,不管周期如何,最后一行的Z值都会被分配给每个Z列。

So we must use mangle_dupe_cols=True, (or just leave it out since that is the default) and adjust the code accordingly. That, fortunately, is not hard to do since we are reassigning df.columns to a custom-build MultiIndex anyway.

因此,我们必须使用mangle_dupe_cols=True,(或者干脆省略它,因为这是默认值),并相应地调整代码。幸运的是,这并不难做到,因为我们重新分配了df。无论如何,列到定制构建的多索引。


Putting it all together:

把它放在一起:

import numpy as np
import pandas as pd
df = pd.read_table('data', sep=r'\s*[|]\s*')
df = df.set_index(['IDX1', 'IDX2', 'IDX3'])
columns = [col for col in df.columns if col[0] not in set(list('XYZ'))]
df.columns = pd.MultiIndex.from_product([columns, list('WXYZ')])
df.columns.names = ['IDX4', 'ValueType']
series = df.stack(['IDX4', 'ValueType'], dropna=False)
print(series.head())

yields

收益率

IDX1  IDX2  IDX3  IDX4     ValueType
A     A1    Q     1983 Q4  W             10
                           X              A
                           Y              F
                           Z            NaN
                  1984 Q1  W            110
dtype: object

Note that since we've removed all the column levels, the result is a Series. If you want a DataFrame with 6 columns, then we should follow it up with:

注意,由于我们已经删除了所有的列级别,结果是一个系列。如果您想要一个有6个列的DataFrame,那么我们应该继续使用:

series.name = 'Value'
df = series.reset_index()
print(df.head())

which yields

的收益率

  IDX1 IDX2 IDX3     IDX4 ValueType Value
0    A   A1    Q  1983 Q4         W    10
1    A   A1    Q  1983 Q4         X     A
2    A   A1    Q  1983 Q4         Y     F
3    A   A1    Q  1983 Q4         Z   NaN
4    A   A1    Q  1984 Q1         W   110
...

#1


2  

Given

鉴于

In [189]: df
Out[189]: 
  IDX1 IDX2 IDX3  1983 Q4  X  Y    Z  1984 Q1 X.1 Y.1  Z.1
0    A   A1    Q       10  A  F  NaN      110   A   F  NaN
1    A   A2    Q       20  B  C   40      120   B   C  240
2    A   A3    Q       30  A  F  NaN      130   A   F  NaN
3    A   A4    Q       40  B  C   80      140   B   C  280
4    A   A5    Q       50  A  F  NaN      150   A   F  NaN
5    A   A6    Q       60  B  F  120      160   B   F  320

Let us first set ['IDX1', 'IDX2', 'IDX3'] as the index.

让我们首先设置['IDX1', 'IDX2', 'IDX3']作为索引。

df = df.set_index(['IDX1', 'IDX2', 'IDX3'])

The other columns have a periodic quality to them; we want to handle every 4 columns as a group. This idea of "handling as a group" leads naturally to assigning a new index level to the column index; some value which is the same for every 4 columns. This would be ideal:

其他列对它们有周期性的质量;我们想把每4列作为一个组来处理。这种“作为一组处理”的思想自然会将一个新的索引级别分配给列索引;每个4列的值是相同的。这将是理想:

               1983 Q4            1984 Q1           
                     W  X  Y    Z       W  X  Y    Z
IDX1 IDX2 IDX3                                      
A    A1   Q         10  A  F  NaN     110  A  F  NaN
     A2   Q         20  B  C  240     120  B  C  240
     A3   Q         30  A  F  NaN     130  A  F  NaN
     A4   Q         40  B  C  280     140  B  C  280
     A5   Q         50  A  F  NaN     150  A  F  NaN
     A6   Q         60  B  F  320     160  B  F  320

We can achieve this by building a MultiIndex and assigning it to df.columns:

我们可以通过构建一个多索引并将其分配给df列来实现这一点:

columns = [col for col in df.columns if col[0] not in set(list('XYZ'))]
df.columns = pd.MultiIndex.from_product([columns, list('WXYZ')])

Now the desired long-format DataFrame can be obtained by calling df.stack to move the column levels into the row index:

现在,可以通过调用df来获得所需的长格式DataFrame。栈将列级别移动到行索引:

df.columns.names = ['IDX4', 'ValueType']
series = df.stack(['IDX4', 'ValueType'], dropna=False)

Note also that when mangle_dupe_cols=False, the duplicate columns, X, Y, Z, get overwritten. So you lose data with mangle_dupe_cols=False. For example, when you use mangle_dupe_cols=False the last row's Z value gets assigns to every Z column regardless of the period.

还要注意,当mangle_dupe_cols=False时,重复的列,X, Y, Z,被覆盖。因此,您丢失了mangle_dupe_cols=False的数据。例如,当您使用mangle_dupe_cols=False时,不管周期如何,最后一行的Z值都会被分配给每个Z列。

So we must use mangle_dupe_cols=True, (or just leave it out since that is the default) and adjust the code accordingly. That, fortunately, is not hard to do since we are reassigning df.columns to a custom-build MultiIndex anyway.

因此,我们必须使用mangle_dupe_cols=True,(或者干脆省略它,因为这是默认值),并相应地调整代码。幸运的是,这并不难做到,因为我们重新分配了df。无论如何,列到定制构建的多索引。


Putting it all together:

把它放在一起:

import numpy as np
import pandas as pd
df = pd.read_table('data', sep=r'\s*[|]\s*')
df = df.set_index(['IDX1', 'IDX2', 'IDX3'])
columns = [col for col in df.columns if col[0] not in set(list('XYZ'))]
df.columns = pd.MultiIndex.from_product([columns, list('WXYZ')])
df.columns.names = ['IDX4', 'ValueType']
series = df.stack(['IDX4', 'ValueType'], dropna=False)
print(series.head())

yields

收益率

IDX1  IDX2  IDX3  IDX4     ValueType
A     A1    Q     1983 Q4  W             10
                           X              A
                           Y              F
                           Z            NaN
                  1984 Q1  W            110
dtype: object

Note that since we've removed all the column levels, the result is a Series. If you want a DataFrame with 6 columns, then we should follow it up with:

注意,由于我们已经删除了所有的列级别,结果是一个系列。如果您想要一个有6个列的DataFrame,那么我们应该继续使用:

series.name = 'Value'
df = series.reset_index()
print(df.head())

which yields

的收益率

  IDX1 IDX2 IDX3     IDX4 ValueType Value
0    A   A1    Q  1983 Q4         W    10
1    A   A1    Q  1983 Q4         X     A
2    A   A1    Q  1983 Q4         Y     F
3    A   A1    Q  1983 Q4         Z   NaN
4    A   A1    Q  1984 Q1         W   110
...