With the following code:
import pandas as pd
df = pd.DataFrame({'ProbeGenes' : ['1431492_at Lipn', '1448678_at Fam118a','1452580_a_at Mrpl21'],
'(5)foo.ID.LN.x1' : [20.3, 25.3,3.1],
'(5)foo.ID.LN.x2' : [130, 150,173],
'(5)foo.ID.LN.x3' : [1.0, 2.0,12.0],
'(3)bar.ID.LN.x1' : [1,2,3],
'(3)bar.ID.LN.x2' : [4,5,6],
'(3)bar.ID.LN.x3' : [7,8,9]
})
new_cols = df.pop("ProbeGenes").str.split().apply(pd.Series)
new_cols.columns = ["Probe","Gene"]
df = df.join(new_cols)
cols = df.columns.tolist()
cols = cols[-2:] + cols[:-2]
df = df[cols]
df
I can make the following data frame:
Probe Gene (3)bar.ID.LN.x1 (3)bar.ID.LN.x2 (3)bar.ID.LN.x3 \
0 1431492_at Lipn 1 4 7
1 1448678_at Fam118a 2 5 8
2 1452580_a_at Mrpl21 3 6 9
(5)foo.ID.LN.x1 (5)foo.ID.LN.x2 (5)foo.ID.LN.x3
0 20.3 130 1
1 25.3 150 2
2 3.1 173 12
Notice that the data frame contains two chunks (named foo and bar), and in turn each chunk contains x1, x2, x3. What I want to do is sum up the values within each chunk, resulting in this data frame:
Probe Gene foo bar
1431492_at Lipn 151.3 12
1448678_at Fam118a 177.3 15
1452580_a_at Mrpl21 188.1 18
The actual data can contain more than two chunk names, and each chunk will contain 2 or 3 members (x1,x2 or x1,x2,x3).
The chunk name can be captured with the following regex /\(\d+\)(\w+)\..*/
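For example, a quick sanity check of that pattern in Python, using one of the column names above:
import re

# the single capture group picks out the chunk name
m = re.match(r'\(\d+\)(\w+)\..*', '(5)foo.ID.LN.x1')
print(m.group(1))  # prints: foo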
How can I achieve that?
3 Answers
#1
A way to get started finding such "chunks":
import re

# collect the distinct chunk names: strip the "(n)" prefix, keep the piece before the first dot
chunks = {re.split(r'\(\d+\)', i)[1].split('.')[0] for i in df.columns if '.' in i}
for each_chunk in chunks:
    # sum every column whose label mentions this chunk
    df[each_chunk] = df[[i for i in df.columns if each_chunk in i]].sum(axis=1)
In [1298]: df.head()
Out[1298]:
Probe Gene (3)bar.ID.LN.x1 (3)bar.ID.LN.x2 (3)bar.ID.LN.x3 \
0 1431492_at Lipn 1 4 7
1 1448678_at Fam118a 2 5 8
2 1452580_a_at Mrpl21 3 6 9
(5)foo.ID.LN.x1 (5)foo.ID.LN.x2 (5)foo.ID.LN.x3 foo bar
0 20.3 130 1 151.3 12
1 25.3 150 2 177.3 15
2 3.1 173 12 188.1 18
Benchmarks (here bar_cols would be a precomputed list of the bar column labels):
In [1266]: %timeit df[bar_cols].sum(axis=1)
1000 loops, best of 3: 476 µs per loop
In [1267]: %timeit df[[i for i in df.columns if 'bar' in i]].sum(axis=1)
1000 loops, best of 3: 483 µs per loop
In [1268]: %timeit df.filter(regex='foo').sum(axis=1)
1000 loops, best of 3: 483 µs per loop
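If the goal is exactly the frame shown in the question (just Probe, Gene and the per-chunk totals), the intermediate columns can be dropped afterwards; a minimal sketch building on the loop above:
# keep only the identifier columns plus the chunk sums computed above;
# sorted() merely fixes a column order, since `chunks` is an unordered set
result = df[["Probe", "Gene"] + sorted(chunks)]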
#2
One option, if the data size is small:
df['foo'] = df.filter(regex='foo').sum(axis=1)  # filter keeps every column whose name contains 'foo'
df['bar'] = df.filter(regex='bar').sum(axis=1)
Please don't use this if your data has more than 10,000 rows; summing with axis=1 is generally slow.
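Since the data can contain more than two chunk names, the same filter idea can be driven by the regex from the question instead of hardcoding 'foo' and 'bar'. A sketch, assuming the column layout shown above:
import re

# collect the distinct chunk names from the column labels
names = {m.group(1)
         for m in (re.match(r'\(\d+\)(\w+)\..*', c) for c in df.columns)
         if m}

for name in names:
    # re.escape guards against regex metacharacters in a chunk name
    df[name] = df.filter(regex=re.escape(name)).sum(axis=1)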
#3
If you're doing this for many columns, I would suggest using a MultiIndex rather than a dot-separated string:
In [11]: new_cols = df.pop("ProbeGenes").str.split().apply(pd.Series) # do something with this later
In [12]: df.columns = pd.MultiIndex.from_tuples(df.columns.map(lambda x: tuple(x.split("."))))
In [13]: df
Out[13]:
(3)bar (5)foo
ID ID
LN LN
x1 x2 x3 x1 x2 x3
0 1 4 7 20.3 130 1
1 2 5 8 25.3 150 2
2 3 6 9 3.1 173 12
In [14]: df.loc[:, "(3)bar"].sum(axis=1)
Out[14]:
0 12
1 15
2 18
dtype: int64
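From here a single groupby over the first column level sums every chunk at once. A sketch (the prefix stripping assumes the "(n)name" labels shown above; the transpose sidesteps groupby(axis=1), which newer pandas deprecates):
# sum each top-level group of the MultiIndex columns
sums = df.T.groupby(level=0).sum().T

# strip the "(n)" prefix to get plain chunk names (the regex= keyword needs pandas >= 1.2)
sums.columns = sums.columns.str.replace(r'^\(\d+\)', '', regex=True)

# reattach the Probe/Gene columns popped earlier
new_cols.columns = ["Probe", "Gene"]
result = new_cols.join(sums)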