How to change DataFrame values when a different column has to be selected for each row

Time: 2022-09-13 13:20:36

I have a dataframe in the following form:

W1  W2  W3  W4  W5  W6  W7  W8
 0   0   1   0   1   1   1   1
 0   0   1   0   0   1   1   1
 0   1   0   0   1   1   0   0
 1   0   0   0   1   1   0   1

There is a parameter DIFF = 3. For every row I look at columns W1 to W4 and find the last 1; for the four rows above it is in columns W3, W3, W2 and W1 respectively. I then set the next 3 (DIFF) elements to the right of that 1 to 0, across the whole row. In the example below those elements are marked with x:

W1  W2  W3  W4  W5  W6  W7  W8
 0   0   1   x   x   x   1   1
 0   0   1   x   x   x   1   1
 0   1   x   x   x   1   0   0
 1   x   x   x   1   1   0   1

And final result:

W1  W2  W3  W4  W5  W6  W7  W8
 0   0   1   0   0   0   1   1
 0   0   1   0   0   0   1   1
 0   1   0   0   0   1   0   0
 1   0   0   0   1   1   0   1

Right now I have a very convoluted solution that uses iterrows(), but I'm looking for a more pandastic one.
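For reference, the rule described in the question can be written as an explicit loop. This is only a sketch to pin down the expected behaviour, not the asker's actual iterrows() code (which is not shown); the helper name `zero_after_last_one_loop` and the constant `N` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

DIFF = 3  # number of values to zero out, as in the question
N = 4     # assumed name: only W1..W4 are searched for the last 1

df = pd.DataFrame(
    [[0, 0, 1, 0, 1, 1, 1, 1],
     [0, 0, 1, 0, 0, 1, 1, 1],
     [0, 1, 0, 0, 1, 1, 0, 0],
     [1, 0, 0, 0, 1, 1, 0, 1]],
    columns=[f"W{i}" for i in range(1, 9)],
)


def zero_after_last_one_loop(frame, n=N, diff=DIFF):
    """Hypothetical helper: plain-loop version of the rule in the question."""
    values = frame.to_numpy().copy()          # work on a copy of the data
    for row in values:                        # each row is a view into `values`
        ones = np.flatnonzero(row[:n] == 1)   # positions of the 1s in the first n columns
        if ones.size:                         # skip rows without a 1 in that window
            last = ones[-1]
            row[last + 1:last + 1 + diff] = 0
    return pd.DataFrame(values, index=frame.index, columns=frame.columns)


result = zero_after_last_one_loop(df)
print(result)
```

A loop like this is slow on large frames, but it is handy as a baseline to compare any vectorised answer against.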

2 solutions

#1



Use:

df = df.mask(df.cumsum(axis=1).ge(1).cumsum(axis=1).isin([2,3,4]), 0)
print (df)
   W1  W2  W3  W4  W5  W6  W7  W8
0   0   0   1   0   0   0   1   1
1   0   0   1   0   0   0   1   1
2   0   1   0   0   0   1   0   0
3   1   0   0   0   1   1   0   1
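One thing worth checking before relying on this one-liner: it keys on the first 1 in the whole row, not on the last 1 within W1 to W4. The two coincide on the sample data because every row there has exactly one 1 in its first four columns. The sketch below uses a made-up row (not from the question) to show where they differ.

```python
import pandas as pd

# Hypothetical row with two 1s inside W1..W4 (not part of the question's data).
df = pd.DataFrame([[1, 0, 1, 0, 1, 1, 1, 1]],
                  columns=[f"W{i}" for i in range(1, 9)])

one_liner = df.mask(df.cumsum(axis=1).ge(1).cumsum(axis=1).isin([2, 3, 4]), 0)

# The one-liner zeroes the three cells after the FIRST 1 (W2, W3, W4).
print(one_liner.iloc[0].tolist())  # [1, 0, 0, 0, 1, 1, 1, 1]

# The rule in the question (last 1 in W1..W4 is at W3) would zero W4, W5, W6 instead.
expected_by_rule = [1, 0, 1, 0, 0, 0, 1, 1]
print(one_liner.iloc[0].tolist() == expected_by_rule)  # False
```

If rows can contain more than one 1 in the searched columns, a version that takes the search window into account is needed.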

Explanation:

First take the cumulative sum along each row:

print (df.cumsum(axis=1))
   W1  W2  W3  W4  W5  W6  W7  W8
0   0   0   1   1   2   3   4   5
1   0   0   1   1   1   2   3   4
2   0   1   1   1   2   3   3   3
3   1   1   1   1   2   3   3   4

Compare with >= 1 using ge, so that every cell from the first 1 onwards is True:

print (df.cumsum(axis=1).ge(1))
      W1     W2    W3    W4    W5    W6    W7    W8
0  False  False  True  True  True  True  True  True
1  False  False  True  True  True  True  True  True
2  False   True  True  True  True  True  True  True
3   True   True  True  True  True  True  True  True

Take the cumulative sum of the boolean mask again, which numbers the cells starting from the first 1:

print (df.cumsum(axis=1).ge(1).cumsum(axis=1))
   W1  W2  W3  W4  W5  W6  W7  W8
0   0   0   1   2   3   4   5   6
1   0   0   1   2   3   4   5   6
2   0   1   2   3   4   5   6   7
3   1   2   3   4   5   6   7   8

Select the values 2, 3 and 4 with isin; these mark the next 3 cells, skipping the first 1 itself:

print (df.cumsum(axis=1).ge(1).cumsum(axis=1).isin([2,3,4]))
      W1     W2     W3    W4     W5     W6     W7     W8
0  False  False  False  True   True   True  False  False
1  False  False  False  True   True   True  False  False
2  False  False   True  True   True  False  False  False
3  False   True   True  True  False  False  False  False
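The four steps above can be wrapped in a small function so that the number of zeroed cells is no longer hard-coded. This is a sketch only: the function name `zero_after_first_one` and the `diff` parameter are assumptions for illustration, and it keeps the one-liner's behaviour of keying on the first 1 in each row.

```python
import pandas as pd


def zero_after_first_one(frame, diff=3):
    """Hypothetical helper wrapping the four steps; diff plays the role of DIFF."""
    running_total = frame.cumsum(axis=1)         # step 1: cumulative sum per row
    seen_one = running_total.ge(1)               # step 2: True from the first 1 onwards
    position = seen_one.cumsum(axis=1)           # step 3: 1 at the first 1, then 2, 3, ...
    to_zero = position.isin(range(2, diff + 2))  # step 4: the diff cells right after it
    return frame.mask(to_zero, 0)


df = pd.DataFrame(
    [[0, 0, 1, 0, 1, 1, 1, 1],
     [0, 0, 1, 0, 0, 1, 1, 1],
     [0, 1, 0, 0, 1, 1, 0, 0],
     [1, 0, 0, 0, 1, 1, 0, 1]],
    columns=[f"W{i}" for i in range(1, 9)],
)

out = zero_after_first_one(df, diff=3)
print(out)
```

With `diff=3` the call selects the positions 2, 3 and 4, which is exactly the `isin([2,3,4])` of the one-liner.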

A more dynamic solution, if you want to define n (the number of leading columns to search) and DIFF yourself:

import pandas as pd

df = pd.DataFrame({'W1': [0, 0, 0, 0], 'W2': [0, 0, 1, 0],
                   'W3': [1, 1, 0, 0], 'W4': [0, 0, 0, 0], 
                   'W5': [1, 0, 1, 0], 'W6': [1, 1, 1, 0], 
                   'W7': [1, 1, 0, 0], 'W8': [1, 1, 0, 1]})

print (df)
   W1  W2  W3  W4  W5  W6  W7  W8
0   0   0   1   0   1   1   1   1
1   0   0   1   0   0   1   1   1
2   0   1   0   0   1   1   0   0
3   0   0   0   0   0   0   0   1

DIFF = 4
n = 3

# select the first n columns by position
subset = df.iloc[:, :n]
# replace 0 with NaN and back-fill, then reverse the column order and take the cumsum,
# so that the column holding the last 1 of each row gets the value 1
last_1 = subset.mask(subset == 0).bfill(axis=1).iloc[:, ::-1].cumsum(axis=1)
print (last_1)
    W3   W2   W1
0  1.0  2.0  3.0
1  1.0  2.0  3.0
2  NaN  1.0  2.0
3  NaN  NaN  NaN

# add the missing columns and forward-fill, so every column from the last 1 onwards holds 1
df1 = last_1.reindex(index=df.index, columns=df.columns).ffill(axis=1)
print (df1)
    W1   W2   W3   W4   W5   W6   W7   W8
0  3.0  2.0  1.0  1.0  1.0  1.0  1.0  1.0
1  3.0  2.0  1.0  1.0  1.0  1.0  1.0  1.0
2  2.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN

# compare with 1 and take the cumsum to number the cells from the last 1 onwards
print (df1.eq(1).cumsum(axis=1))
   W1  W2  W3  W4  W5  W6  W7  W8
0   0   0   1   2   3   4   5   6
1   0   0   1   2   3   4   5   6
2   0   1   2   3   4   5   6   7
3   0   0   0   0   0   0   0   0

# finally, zero out the cells numbered 2 to DIFF + 1
df = df.mask(df1.eq(1).cumsum(axis=1).isin(range(2, DIFF + 2)), 0)
print (df)
   W1  W2  W3  W4  W5  W6  W7  W8
0   0   0   1   0   0   0   0   1
1   0   0   1   0   0   0   0   1
2   0   1   0   0   0   0   0   0
3   0   0   0   0   0   0   0   1
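The same result can also be reached by locating the last 1 directly with NumPy and building the mask through broadcasting. The sketch below is an alternative technique, not the approach used in this answer; the function name `zero_after_last_one` is an assumption for illustration.

```python
import numpy as np
import pandas as pd


def zero_after_last_one(frame, n, diff):
    """Hypothetical NumPy variant: zero diff cells after the last 1 in the first n columns."""
    values = frame.to_numpy()
    head = values[:, :n] == 1                    # boolean view of the search window
    has_one = head.any(axis=1)                   # rows without a 1 are left unchanged
    # argmax on the reversed window finds the last True; convert to a forward position
    last = n - 1 - head[:, ::-1].argmax(axis=1)
    cols = np.arange(values.shape[1])
    to_zero = (cols > last[:, None]) & (cols <= last[:, None] + diff) & has_one[:, None]
    return pd.DataFrame(np.where(to_zero, 0, values),
                        index=frame.index, columns=frame.columns)


df = pd.DataFrame({'W1': [0, 0, 0, 0], 'W2': [0, 0, 1, 0],
                   'W3': [1, 1, 0, 0], 'W4': [0, 0, 0, 0],
                   'W5': [1, 0, 1, 0], 'W6': [1, 1, 1, 0],
                   'W7': [1, 1, 0, 0], 'W8': [1, 1, 0, 1]})

out = zero_after_last_one(df, n=3, diff=4)
print(out)
```

The `has_one` guard matters because `argmax` returns 0 for an all-False row, which would otherwise be mistaken for a 1 in the last searched column.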

#2



Here is a mixture of a functional and a pandastic approach:

import pandas as pd

df = pd.DataFrame({'w1': [0, 1, 1, 0],
                   'w2': [1, 1, 0, 1],
                   'w3': [1, 0, 0, 0],
                   'w4': [0, 1, 1, 0],
                   'w5': [1, 1, 0, 1],
                   'w6': [0, 0, 1, 1],
                   'w7': [0, 1, 1, 0],
                   'w8': [1, 1, 1, 1]})



def erase_diff(row, n=4, diff=3):
    """
    Return a copy of row in which the diff values after the last 1
    found in the first n columns are set to 0.
    """
    row = row.copy()  # work on a copy so the original df is left untouched
    # positions of the 1s within the first n columns
    ones = [i for i, v in enumerate(row.iloc[:n]) if v == 1]
    if ones:  # rows without a 1 in the first n columns stay unchanged
        last_positive_id = ones[-1]
        row.iloc[last_positive_id + 1: last_positive_id + 1 + diff] = 0
    return row


df.apply(erase_diff, axis=1)

   w1  w2  w3  w4  w5  w6  w7  w8
0   0   1   1   0   0   0   0   1
1   1   1   0   1   0   0   0   1
2   1   0   0   1   0   0   0   1
3   0   1   0   0   0   1   0   1

Be aware that without the copy this solution can overwrite data in your original df, because each row is then modified in place.
