I'm currently iterating to compute this, but I'd like to get the result in a vectorized manner and I'm drawing a blank. It's probably best to start with the DataFrame:
actual_sum expected_to_date Late
id
1 11086 11086 0
1 22172 22172 0
1 33258 33258 0
1 33258 44344 1
1 33258 55430 2
1 33258 66516 3
1 33258 77602 4
1 33258 88688 5
3 0 25748 1****
3 0 51496 2
3 0 77244 3
3 0 102992 4
3 0 128740 5
3 0 154488 6
10005 19818 19818 0
10005 19818 39636 1
10005 59454 59454 0
10005 79272 79272 0
10005 79272 99090 1
10005 99090 118908 1
10005 118908 138726 1
10005 138726 158544 1
10005 164544 178362 1
10005 184362 198180 1
10005 184362 217998 2
10005 184362 237824 3
10006 26425 26425 0
10006 52850 52850 0
10006 79275 79275 0
10006 79275 105700 1
10006 132125 132125 0
10006 158550 158550 0
10006 158550 184975 1
10006 158550 211400 2
10006 158550 237825 3
10006 158550 264250 4
10006 158550 290666 5
10006 158550 317091 6
10009 21217 21217 0
10009 42434 42434 0
10009 63651 63651 0
As you can see, here is what I'm doing:
- If actual_sum and expected_to_date are equal, put a 0.
- If expected_to_date > actual_sum, find the last time expected_to_date was <= the current actual_sum within the same id, and take the difference in periods between the two rows.
- This is done on a per-id basis. Check out id 3: the very first row already has a difference, so it is immediately Late by 1.
Any ideas on a vectorized approach to something like this? I can't think of anything. Currently most of my run time is spent finding the last time within this id that we have something less than our current actual_sum:
last_current = d[(d.id==cur_id)&(d.expected_to_date > cur_sum)][:idx]
I have to add 1 to this result to get what I want, but it does work.
1 Solution
#1
I think this comes close to what you want. It does not get rid of every for-loop, but it reduces the number of them. The data is in a DataFrame called df.
df.reset_index(inplace=True)  # get a unique index
df['ones'] = 1   # temp column of ones (for use in cumulative sum)
df['Later'] = 0  # this is where our result will be put
for label, group in df.groupby('id'):
    # for each group, select the late rows and number them from 1 using cumsum()
    late = group[group.expected_to_date > group.actual_sum]
    # assign through df.loc; a chained df['Later'].update(...) may not write back to df
    df.loc[late.index, 'Later'] = late['ones'].cumsum()
del df['ones']  # remove temporary working column
df.set_index('id', inplace=True)  # restore original index
Which yields:
In [66]: df
Out[66]:
actual_sum expected_to_date Late Later
id
1 11086 11086 0 0
1 22172 22172 0 0
1 33258 33258 0 0
1 33258 44344 1 1
1 33258 55430 2 2
1 33258 66516 3 3
1 33258 77602 4 4
1 33258 88688 5 5
3 0 25748 1**** 1
3 0 51496 2 2
3 0 77244 3 3
3 0 102992 4 4
3 0 128740 5 5
3 0 154488 6 6
10005 19818 19818 0 0
10005 19818 39636 1 1
10005 59454 59454 0 0
10005 79272 79272 0 0
10005 79272 99090 1 2
10005 99090 118908 1 3
10005 118908 138726 1 4
10005 138726 158544 1 5
10005 164544 178362 1 6
10005 184362 198180 1 7
10005 184362 217998 2 8
10005 184362 237824 3 9
10006 26425 26425 0 0
10006 52850 52850 0 0
10006 79275 79275 0 0
10006 79275 105700 1 1
10006 132125 132125 0 0
10006 158550 158550 0 0
10006 158550 184975 1 2
10006 158550 211400 2 3
10006 158550 237825 3 4
10006 158550 264250 4 5
10006 158550 290666 5 6
10006 158550 317091 6 7
10009 21217 21217 0 0
10009 42434 42434 0 0
10009 63651 63651 0 0
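The explicit loop over groups above can probably be dropped as well. The following is a minimal sketch, assuming id is still a regular column (that is, after reset_index) and using a small sample of the question's rows; the name behind is only illustrative. It should give the same numbers as the Later column, which, as the output shows, keeps counting within an id instead of resetting the way Late does for ids 10005 and 10006.

```python
import pandas as pd

# Small sample copied from the question's data (two rows of id 3, five of id 10005)
df = pd.DataFrame({
    "id": [3, 3, 10005, 10005, 10005, 10005, 10005],
    "actual_sum": [0, 0, 19818, 19818, 59454, 79272, 79272],
    "expected_to_date": [25748, 51496, 19818, 39636, 59454, 79272, 99090],
})

# True where the row is behind schedule
behind = df["expected_to_date"] > df["actual_sum"]

# Running count of behind-schedule rows within each id;
# rows that are on schedule are set back to 0, as in the loop version
df["Later"] = behind.astype(int).groupby(df["id"]).cumsum().where(behind, 0)

print(df)
```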
For completeness, this is how I got your data into a DataFrame called df:
data="""id actual_sum expected_to_date Late
1 11086 11086 0
1 22172 22172 0
1 33258 33258 0
1 33258 44344 1
1 33258 55430 2
1 33258 66516 3
1 33258 77602 4
1 33258 88688 5
3 0 25748 1****
3 0 51496 2
3 0 77244 3
3 0 102992 4
3 0 128740 5
3 0 154488 6
10005 19818 19818 0
10005 19818 39636 1
10005 59454 59454 0
10005 79272 79272 0
10005 79272 99090 1
10005 99090 118908 1
10005 118908 138726 1
10005 138726 158544 1
10005 164544 178362 1
10005 184362 198180 1
10005 184362 217998 2
10005 184362 237824 3
10006 26425 26425 0
10006 52850 52850 0
10006 79275 79275 0
10006 79275 105700 1
10006 132125 132125 0
10006 158550 158550 0
10006 158550 184975 1
10006 158550 211400 2
10006 158550 237825 3
10006 158550 264250 4
10006 158550 290666 5
10006 158550 317091 6
10009 21217 21217 0
10009 42434 42434 0
10009 63651 63651 0
"""
import pandas as pd
from io import StringIO  # on Python 2: from StringIO import StringIO
df = pd.read_csv(StringIO(data), header=0, index_col=0, sep=r'\s+')
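The Later column above only agrees with the question's Late column while an id never catches up again. One way to get closer to the rule described in the question is a sorted lookup per id. This is a sketch, not a tested general solution: it assumes id is a regular column, that rows are in period order, and that expected_to_date never decreases within an id (true for the sample data). The function name periods_late is made up for illustration. It still loops over ids, but the work inside each id is vectorized with numpy.searchsorted.

```python
import numpy as np
import pandas as pd

def periods_late(df):
    """Periods elapsed since expected_to_date was last covered by actual_sum, per id."""
    expected = df["expected_to_date"].to_numpy()
    actual = df["actual_sum"].to_numpy()
    late = np.zeros(len(df), dtype=int)
    # .indices maps each id to the integer positions of its rows
    for pos in df.groupby("id").indices.values():
        pos = np.sort(pos)  # make sure rows are handled in their original order
        # number of periods whose expected value is <= the current actual_sum
        covered = np.searchsorted(expected[pos], actual[pos], side="right")
        # current period number (1-based) minus the last covered period
        late[pos] = np.arange(1, len(pos) + 1) - covered
    return late

# Rows for ids 3 (first three) and 10005 (all twelve), copied from the question
sample = pd.DataFrame({
    "id": [3] * 3 + [10005] * 12,
    "actual_sum": [0, 0, 0,
                   19818, 19818, 59454, 79272, 79272, 99090, 118908,
                   138726, 164544, 184362, 184362, 184362],
    "expected_to_date": [25748, 51496, 77244,
                         19818, 39636, 59454, 79272, 99090, 118908, 138726,
                         158544, 178362, 198180, 217998, 237824],
})
sample["Late"] = periods_late(sample)
print(sample)
```

If actual_sum ever runs ahead of expected_to_date, this sketch would produce negative values; the question does not say how that case should be handled.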