Summing the rows of columns in a pandas DataFrame, chunk-wise

Time: 2022-08-09 21:41:01

With the following code:

import pandas as pd
df = pd.DataFrame({'ProbeGenes' : ['1431492_at Lipn', '1448678_at Fam118a','1452580_a_at Mrpl21'],
                   '(5)foo.ID.LN.x1' : [20.3, 25.3,3.1],
                   '(5)foo.ID.LN.x2' : [130, 150,173],        
                   '(5)foo.ID.LN.x3' : [1.0, 2.0,12.0],         
                   '(3)bar.ID.LN.x1' : [1,2,3],
                   '(3)bar.ID.LN.x2' : [4,5,6],        
                   '(3)bar.ID.LN.x3' : [7,8,9]        
                   })


new_cols = df.pop("ProbeGenes").str.split().apply(pd.Series)
new_cols.columns = ["Probe","Gene"]
df = df.join(new_cols)
cols = df.columns.tolist()
cols = cols[-2:] + cols[:-2]
df = df[cols]
df

I can make the following data frame:

          Probe     Gene  (3)bar.ID.LN.x1  (3)bar.ID.LN.x2  (3)bar.ID.LN.x3  \
0    1431492_at     Lipn                1                4                7
1    1448678_at  Fam118a                2                5                8
2  1452580_a_at   Mrpl21                3                6                9

   (5)foo.ID.LN.x1  (5)foo.ID.LN.x2  (5)foo.ID.LN.x3
0             20.3              130                1
1             25.3              150                2
2              3.1              173               12

Notice that the data frame contains two chunks (named foo and bar); in turn, each chunk contains x1, x2 and x3. What I want to do is sum the values within each chunk, resulting in this data frame:

          Probe     Gene    foo  bar
     1431492_at     Lipn  151.3   12
     1448678_at  Fam118a  177.3   15
   1452580_a_at   Mrpl21  188.1   18

The actual data can contain more than two chunk names, and each chunk will contain two or three members (x1,x2 or x1,x2,x3).

The chunk name can be captured with the following regex: /\(\d+\)(\w+)\..*/
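
For reference, a quick check that this pattern really does pull the chunk name out of one of the column labels above (the first capture group is the chunk name):

import re

m = re.match(r'\(\d+\)(\w+)\..*', '(5)foo.ID.LN.x1')
print(m.group(1))  # prints 'foo'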

How can I achieve that?

3 Solutions

#1


This is a way to get started finding such "chunks":

import re

# collect the unique chunk names ('foo', 'bar') from the dotted column labels
chunks = {re.split(r'\(\d+\)', i)[1].split('.')[0] for i in df.columns if '.' in i}

for each_chunk in chunks:
    # sum every column belonging to this chunk, row by row
    df[each_chunk] = df[[i for i in df.columns if each_chunk in i]].sum(axis=1)

In [1298]: df.head()
Out[1298]: 
          Probe     Gene  (3)bar.ID.LN.x1  (3)bar.ID.LN.x2  (3)bar.ID.LN.x3  \
0    1431492_at     Lipn                1                4                7   
1    1448678_at  Fam118a                2                5                8   
2  1452580_a_at   Mrpl21                3                6                9   

   (5)foo.ID.LN.x1  (5)foo.ID.LN.x2  (5)foo.ID.LN.x3    foo  bar  
0             20.3              130                1  151.3   12  
1             25.3              150                2  177.3   15  
2              3.1              173               12  188.1   18  

Benchmarks:

In [1266]: %timeit df[bar_cols].sum(axis=1)
1000 loops, best of 3: 476 µs per loop

In [1267]: %timeit df[[i for i in df.columns if 'bar' in i]].sum(axis=1)
1000 loops, best of 3: 483 µs per loop

In [1268]: %timeit df.filter(regex='foo').sum(axis=1)
1000 loops, best of 3: 483 µs per loop
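
Once the loop has added one total column per chunk, the frame the question asks for is just a column selection. A small follow-up sketch (the name summary is only illustrative, and the chunk column order may differ because chunks is a set):

summary = df[['Probe', 'Gene'] + sorted(chunks)]  # e.g. Probe, Gene, bar, foo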

#2


One option, if the data size is small:

df['foo'] = df.filter(regex='foo').sum(axis=1)  # selects every column whose name contains 'foo'
df['bar'] = df.filter(regex='bar').sum(axis=1)

Please don't use this if your data has more than 10,000 rows; summing with axis=1 is generally slow.
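
If the frame is wide, one way to avoid filtering and summing once per chunk name is to label every dotted column up front and sum each group in a single pass. The following is only a sketch built on the question's column layout, not part of this answer:

import re

chunk_cols = [c for c in df.columns if '.' in c]                        # the "(n)name.ID.LN.xk" columns
labels = [re.match(r'\(\d+\)(\w+)\.', c).group(1) for c in chunk_cols]  # chunk name ('foo' or 'bar') per column
sums = df[chunk_cols].T.groupby(labels).sum().T                         # one row-wise total per chunk
result = df[['Probe', 'Gene']].join(sums)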

#3


If you're doing this for many columns, I would suggest using a MultiIndex rather than a dot-separated string:

In [11]: new_cols = df.pop("ProbeGenes").str.split().apply(pd.Series)  # do something with this later

In [12]: df.columns = pd.MultiIndex.from_tuples(df.columns.map(lambda x: tuple(x.split("."))))

In [13]: df
Out[13]:
  (3)bar       (5)foo
      ID           ID
      LN           LN
      x1 x2 x3     x1   x2  x3
0      1  4  7   20.3  130   1
1      2  5  8   25.3  150   2
2      3  6  9    3.1  173  12

In [14]: df.loc[:, "(3)bar"].sum(axis=1)
Out[14]:
0    12
1    15
2    18
dtype: int64
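
Building on that, every chunk total can be produced at once by grouping the MultiIndex columns on their first level. A hedged sketch on top of the answer's frame (the "(n)" prefix still has to be stripped, and new_cols from In [11] still holds the probe and gene names):

# sum x1..x3 within each top-level group, i.e. "(3)bar" and "(5)foo"
sums = df.T.groupby(level=0).sum().T

# drop the "(n)" prefix and reattach the probe/gene columns that were split off earlier
sums.columns = sums.columns.str.replace(r'^\(\d+\)', '', regex=True)
result = new_cols.rename(columns={0: 'Probe', 1: 'Gene'}).join(sums)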
