I have a dataframe in the following form:
W1  W2  W3  W4  W5  W6  W7  W8
0   0   1   0   1   1   1   1
0   0   1   0   0   1   1   1
0   1   0   0   1   1   0   0
1   0   0   0   1   1   0   1
There is a parameter DIFF = 3. For every row, I look at columns W1 to W4 and search for the last 1. Here it is in columns W3, W3, W2, W1 respectively. I then set the next 3 (DIFF) elements to the right of that 1 to 0, anywhere in the row. In the example below I marked those elements with x:
W1  W2  W3  W4  W5  W6  W7  W8
0   0   1   x   x   x   1   1
0   0   1   x   x   x   1   1
0   1   x   x   x   1   0   0
1   x   x   x   1   1   0   1
And final result:
W1  W2  W3  W4  W5  W6  W7  W8
0   0   1   0   0   0   1   1
0   0   1   0   0   0   1   1
0   1   0   0   0   1   0   0
1   0   0   0   1   1   0   1
Now, I have a very convoluted solution that uses iterrows(), but I'm looking for a pandastic one.
2 Solutions
#1
1
Use:
df = df.mask(df.cumsum(axis=1).ge(1).cumsum(axis=1).isin([2,3,4]), 0)
print (df)
W1 W2 W3 W4 W5 W6 W7 W8
0 0 0 1 0 0 0 1 1
1 0 0 1 0 0 0 1 1
2 0 1 0 0 0 1 0 0
3 1 0 0 0 1 1 0 1
Explanation:
Use cumsum per row:
print (df.cumsum(axis=1))
W1 W2 W3 W4 W5 W6 W7 W8
0 0 0 1 1 2 3 4 5
1 0 0 1 1 1 2 3 4
2 0 1 1 1 2 3 3 3
3 1 1 1 1 2 3 3 4
Compare with >= 1 using ge:
print (df.cumsum(axis=1).ge(1))
W1 W2 W3 W4 W5 W6 W7 W8
0 False False True True True True True True
1 False False True True True True True True
2 False True True True True True True True
3 True True True True True True True True
Apply cumsum again, this time to the boolean mask:
print (df.cumsum(axis=1).ge(1).cumsum(axis=1))
W1 W2 W3 W4 W5 W6 W7 W8
0 0 0 1 2 3 4 5 6
1 0 0 1 2 3 4 5 6
2 0 1 2 3 4 5 6 7
3 1 2 3 4 5 6 7 8
Compare with 2, 3, 4 using isin to select the next 3 values, skipping the first:
print (df.cumsum(axis=1).ge(1).cumsum(axis=1).isin([2,3,4]))
W1 W2 W3 W4 W5 W6 W7 W8
0 False False False True True True False False
1 False False False True True True False False
2 False False True True True False False False
3 False True True True False False False False
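For reference, the steps above can be put together as one runnable sketch. The DataFrame is rebuilt from the question's table; `pos`, `extra`, `extra_pos` and `extra_result` are illustrative names that are not part of the original answer. Note that this chain anchors on the first 1 in the whole row. For the sample data that coincides with the last 1 in W1 to W4, but the hypothetical `extra` row shows a case where it does not, so treat the one-liner as valid only under that assumption.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame(
    [[0, 0, 1, 0, 1, 1, 1, 1],
     [0, 0, 1, 0, 0, 1, 1, 1],
     [0, 1, 0, 0, 1, 1, 0, 0],
     [1, 0, 0, 0, 1, 1, 0, 1]],
    columns=[f"W{i}" for i in range(1, 9)],
)

DIFF = 3

# 1-based position of each cell counted from the first 1 in its row (0 before it)
pos = df.cumsum(axis=1).ge(1).cumsum(axis=1)

# Positions 2 .. DIFF + 1 are the DIFF cells right after that first 1
result = df.mask(pos.isin(range(2, DIFF + 2)), 0)
print(result)

# Hypothetical extra row with two 1s in W1..W4 (W1 and W3):
# the chain anchors on W1, the first 1, so the 1 in W3 is overwritten
extra = pd.DataFrame([[1, 0, 1, 0, 1, 1, 1, 1]], columns=df.columns)
extra_pos = extra.cumsum(axis=1).ge(1).cumsum(axis=1)
extra_result = extra.mask(extra_pos.isin(range(2, DIFF + 2)), 0)
print(extra_result)
```

If rows can contain more than one 1 in W1 to W4, the more dynamic variant is the safer choice.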
A more dynamic solution, if you want to define the n and DIFF values:
import pandas as pd

df = pd.DataFrame({'W1': [0, 0, 0, 0], 'W2': [0, 0, 1, 0],
'W3': [1, 1, 0, 0], 'W4': [0, 0, 0, 0],
'W5': [1, 0, 1, 0], 'W6': [1, 1, 1, 0],
'W7': [1, 1, 0, 0], 'W8': [1, 1, 0, 1]})
print (df)
W1 W2 W3 W4 W5 W6 W7 W8
0 0 0 1 0 1 1 1 1
1 0 0 1 0 0 1 1 1
2 0 1 0 0 1 1 0 0
3 0 0 0 0 0 0 0 1
DIFF = 4
n = 3
# select the first n columns to check, by position
subset = df.iloc[:, :n]
# replace 0 with NaN, back fill, reverse the column order, then take the cumsum
last_1 = subset.mask(subset == 0).bfill(axis=1).iloc[:, ::-1].cumsum(axis=1)
print (last_1)
W3 W2 W1
0 1.0 2.0 3.0
1 1.0 2.0 3.0
2 NaN 1.0 2.0
3 NaN NaN NaN
# add the missing columns and forward fill, so every column from the last 1 onward holds 1
df1 = last_1.reindex(index=df.index, columns=df.columns).ffill(axis=1)
print (df1)
W1 W2 W3 W4 W5 W6 W7 W8
0 3.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0
1 3.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0
2 2.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
3 NaN NaN NaN NaN NaN NaN NaN NaN
# compare with 1 and take the cumsum
print (df1.eq(1).cumsum(axis=1))
W1 W2 W3 W4 W5 W6 W7 W8
0 0 0 1 2 3 4 5 6
1 0 0 1 2 3 4 5 6
2 0 1 2 3 4 5 6 7
3 0 0 0 0 0 0 0 0
# finally, set to 0 the cells whose counter falls in the range 2 .. DIFF + 1
df = df.mask(df1.eq(1).cumsum(axis=1).isin(range(2, DIFF + 2)), 0)
print (df)
W1 W2 W3 W4 W5 W6 W7 W8
0 0 0 1 0 0 0 0 1
1 0 0 1 0 0 0 0 1
2 0 1 0 0 0 0 0 0
3 0 0 0 0 0 0 0 1
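As a cross-check of the result above, here is a minimal sketch of the same operation written with NumPy index arithmetic instead of the cumsum chain. This is a different technique from the one in the answer, and `zero_after_last_one` is a hypothetical helper name. It assumes the frame holds only 0 and 1, and it leaves a row unchanged when its first n columns contain no 1.

```python
import numpy as np
import pandas as pd


def zero_after_last_one(df, n, diff):
    """Zero the `diff` cells right of the last 1 found in the first n columns."""
    values = df.to_numpy()
    head = values[:, :n] == 1
    has_one = head.any(axis=1)
    # Column index of the last 1 in the first n columns:
    # argmax on the reversed block finds the first True from the right
    last = n - 1 - head[:, ::-1].argmax(axis=1)
    cols = np.arange(values.shape[1])
    # Cells strictly right of the last 1 and at most `diff` away,
    # restricted to rows that actually contain a 1 in the first n columns
    to_zero = (
        (cols > last[:, None])
        & (cols <= last[:, None] + diff)
        & has_one[:, None]
    )
    return df.mask(to_zero, 0)


df = pd.DataFrame({'W1': [0, 0, 0, 0], 'W2': [0, 0, 1, 0],
                   'W3': [1, 1, 0, 0], 'W4': [0, 0, 0, 0],
                   'W5': [1, 0, 1, 0], 'W6': [1, 1, 1, 0],
                   'W7': [1, 1, 0, 0], 'W8': [1, 1, 0, 1]})

out = zero_after_last_one(df, n=3, diff=4)
print(out)
```

The helper returns a new DataFrame, so the input frame is not modified.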
#2
2
Here is a mixture of a functional and a pandastic approach:
import pandas as pd

df = pd.DataFrame({'w1': [0, 1, 1, 0],
'w2': [1, 1, 0, 1],
'w3': [1, 0, 0, 0],
'w4': [0, 1, 1, 0],
'w5': [1, 1, 0, 1],
'w6': [0, 0, 1, 1],
'w7': [0, 1, 1, 0],
'w8': [1, 1, 1, 1]})
def erase_diff(row, n=4, diff=3):
    """
    Return the row with the `diff` values after the last 1
    in the first n columns set to 0.
    """
    # positions of the 1s within the first n columns
    ones = [i for i, v in enumerate(row.iloc[:n]) if v == 1]
    if ones:  # a row without a 1 in the first n columns is left unchanged
        start = ones[-1] + 1
        # a scalar assignment also works when fewer than diff columns remain
        row.iloc[start:start + diff] = 0  # note: changes the row in place
    return row

df.apply(erase_diff, axis=1)
w1 w2 w3 w4 w5 w6 w7 w8
0 0 1 1 0 0 0 0 1
1 1 1 0 1 0 0 0 1
2 1 0 0 1 0 0 0 1
3 0 1 0 0 0 1 0 1
Be aware that this solution may erase data in your original df, because the function changes each row in place.
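If the original frame must stay intact, one option is to work on a copy of each row. The following is a minimal sketch under that assumption; `erase_after_last_one` and `before` are hypothetical names introduced here for illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'w1': [0, 1, 1, 0],
                   'w2': [1, 1, 0, 1],
                   'w3': [1, 0, 0, 0],
                   'w4': [0, 1, 1, 0],
                   'w5': [1, 1, 0, 1],
                   'w6': [0, 0, 1, 1],
                   'w7': [0, 1, 1, 0],
                   'w8': [1, 1, 1, 1]})


def erase_after_last_one(row, n=4, diff=3):
    """Return a modified copy of the row; the input row is not touched."""
    out = row.copy()
    # positions of the 1s within the first n columns
    ones = [i for i, v in enumerate(out.iloc[:n]) if v == 1]
    if ones:
        start = ones[-1] + 1
        out.iloc[start:start + diff] = 0
    return out


before = df.copy()  # snapshot, used only to confirm df is unchanged
result = df.apply(erase_after_last_one, axis=1)
print(result)
print(df.equals(before))
```

The extra copy per row costs some time on large frames, so this is a trade of speed for safety.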