This is a general question, but I will use an example to help ask it. I have a dataframe (df) where df[col_1] is all True or False. In df[col_2], I would like to return another True or False based on whether the prior 5 rows of column 1 (df[col_1][i-6:i-1]) contain a match for df[col_1][i].
This is the loop I am using now, but it is one of many, so I suspect the loops are what slow things down as the data grows.
for i in df.index:
    if i < 6:
        df.loc[i, col_2] = 0.
    else:
        # True when the current value does not appear in the preceding window
        df.loc[i, col_2] = df[col_1][i] not in tuple(df[col_1].iloc[i-6:i-1])
Should look like this:
    col_1  col_2
0   TRUE
1   TRUE
2   TRUE
3   TRUE
4   FALSE
5   FALSE  FALSE
6   FALSE  FALSE
7   FALSE  FALSE
8   FALSE  FALSE
9   TRUE   TRUE
10  FALSE  FALSE
11  FALSE  FALSE
12  FALSE  FALSE
13  FALSE  FALSE
14  TRUE   FALSE
15  TRUE   FALSE
16  TRUE   FALSE
17  TRUE   FALSE
18  TRUE   FALSE
19  TRUE   FALSE
20  FALSE  TRUE
I am wondering if there is a way to do something clever (or basic) with pandas to make use of vectorization - maybe using shift or an offset function?
I hope I haven't missed an answer that already exists - I wasn't exactly sure how to phrase the question. Thanks in advance.
1 solution
#1
Here's a simple vectorized solution that should be pretty fast, although there is probably a more elegant way to write it. You can just ignore the first 5 rows or overwrite them to NaN if you prefer.
import pandas as pd

df = pd.DataFrame({'col_1': [True, True, True, True, False, False, False, False,
                             False, True, False, False, False, False, True, True,
                             True, True, True, True, False]})
# True where the current value differs from each of the 5 preceding rows
s = df['col_1']
df['col_2'] = ((s != s.shift(1)) & (s != s.shift(2)) & (s != s.shift(3)) &
               (s != s.shift(4)) & (s != s.shift(5)))
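The chained shifts can also be written for an arbitrary window size. The following is only a sketch: the helper name `differs_from_previous` and the parameter `n` are made up for illustration and are not part of the original answer. Like the hand-written version, it leaves the first `n` rows unmasked, so they should be ignored or overwritten afterwards.

```python
from functools import reduce
import operator

import pandas as pd


def differs_from_previous(s, n=5):
    """Return True where s[i] differs from each of the n preceding values.

    Hypothetical helper for illustration. Rows with fewer than n
    predecessors are compared against NaN, which never equals a value.
    """
    # One boolean Series per lag: s != s.shift(k) for k = 1..n
    comparisons = [s != s.shift(k) for k in range(1, n + 1)]
    # AND all of them together, as in the hand-written expression
    return reduce(operator.and_, comparisons)


df = pd.DataFrame({'col_1': [True, True, True, True, False, False, False, False,
                             False, True, False, False, False, False, True, True,
                             True, True, True, True, False]})
df['col_2'] = differs_from_previous(df['col_1'], n=5)
```

With `n=5` this should give the same `col_2` as the explicit five-shift expression.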
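If speed really matters, you could do something like the following. It's more than 3x faster than the above and probably about as efficient as you can get here. It relies on the fact that rolling_sum() interprets booleans as 0/1: subtracting the current row from the 6-row rolling sum leaves the number of True values in the 5 preceding rows, so you just need to know whether that count is 0 or 5. (pd.rolling_sum() was removed from later pandas releases; the equivalent there is Series.rolling(6).sum().)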
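# Count of True values in the 5 preceding rows: 6-row rolling sum minus the current row.
# (Older pandas spelled this pd.rolling_sum(df.col_1, 6); newer versions use .rolling(6).sum().)
col_1_int = df.col_1.astype(int)
df['rollsum'] = col_1_int.rolling(6).sum() - col_1_int
df['col_3'] = ((df.col_1 & (df.rollsum == 0)) |
               (~df.col_1 & (df.rollsum == 5)))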
    col_1  col_2  rollsum  col_3
0    True   True      NaN  False
1    True  False      NaN  False
2    True  False      NaN  False
3    True  False      NaN  False
4   False   True      NaN  False
5   False  False        4  False
6   False  False        3  False
7   False  False        2  False
8   False  False        1  False
9    True   True        0   True
10  False  False        1  False
11  False  False        1  False
12  False  False        1  False
13  False  False        1  False
14   True  False        1  False
15   True  False        1  False
16   True  False        2  False
17   True  False        3  False
18   True  False        4  False
19   True  False        5  False
20  False   True        5   True
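In the output above, col_2 and col_3 differ only in rows 0 and 4, which lie in the first five rows where no full 5-row window exists. A quick way to check that the two approaches agree everywhere else is sketched below. The helper name `differs_from_previous_rolling` is hypothetical, and the code assumes a pandas version that provides `Series.rolling`.

```python
import pandas as pd


def differs_from_previous_rolling(s, n=5):
    """Hypothetical helper: rolling-sum version for a boolean Series s."""
    as_int = s.astype(int)
    # Number of True values among the n preceding rows (NaN for the first n rows)
    prev_true = as_int.rolling(n + 1).sum() - as_int
    # A True row preceded by no True, or a False row preceded only by True
    return (s & (prev_true == 0)) | (~s & (prev_true == n))


col_1 = pd.Series([True, True, True, True, False, False, False, False,
                   False, True, False, False, False, False, True, True,
                   True, True, True, True, False])

# Shift-based result, built up one lag at a time
shift_based = pd.Series(True, index=col_1.index)
for k in range(1, 6):
    shift_based = shift_based & (col_1 != col_1.shift(k))

rolling_based = differs_from_previous_rolling(col_1, n=5)

# Compare only the rows that have a full 5-row window behind them
agree = bool((shift_based.iloc[5:] == rolling_based.iloc[5:]).all())
```

If `agree` is True, either column can be used once the first five rows are ignored or overwritten.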