Vectorized approach in pandas to compare two cumulative sums

I'm currently iterating to do this, but I'd like to get the results in a vectorized manner and I'm drawing a blank. It's probably best to start with the DataFrame:

         actual_sum  expected_to_date           Late
id
1             11086             11086              0
1             22172             22172              0
1             33258             33258              0
1             33258             44344              1
1             33258             55430              2
1             33258             66516              3
1             33258             77602              4
1             33258             88688              5
3                 0             25748              1****
3                 0             51496              2
3                 0             77244              3
3                 0            102992              4
3                 0            128740              5
3                 0            154488              6
10005         19818             19818              0
10005         19818             39636              1
10005         59454             59454              0
10005         79272             79272              0
10005         79272             99090              1
10005         99090            118908              1
10005        118908            138726              1
10005        138726            158544              1
10005        164544            178362              1
10005        184362            198180              1
10005        184362            217998              2
10005        184362            237824              3
10006         26425             26425              0
10006         52850             52850              0
10006         79275             79275              0
10006         79275            105700              1
10006        132125            132125              0
10006        158550            158550              0
10006        158550            184975              1
10006        158550            211400              2
10006        158550            237825              3
10006        158550            264250              4
10006        158550            290666              5
10006        158550            317091              6
10009         21217             21217              0
10009         42434             42434              0
10009         63651             63651              0

As you can see, here is what I'm doing:

If actual_sum and expected_to_date are equal, put a 0.

If expected_to_date is greater than actual_sum, find the last row within the same id where expected_to_date was <= the current actual_sum, and take the difference in periods between that row and the current one.

This is done on a per-id basis. Look at id 3: its very first row already has a difference, so it is immediately Late by 1.
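To make the rule concrete, here is a plain-Python sketch of it for a single id. The function name periods_late is hypothetical, and treating "actual ahead of expected" as 0 is an assumption, since the table never shows that case. It is meant only as a reference for checking faster versions, not as the code actually in use:

```python
def periods_late(actual, expected):
    """Late counter for ONE id, following the rule described above."""
    late = []
    for i, (a, e) in enumerate(zip(actual, expected)):
        if e <= a:
            # On schedule (assumption: being ahead also counts as 0).
            late.append(0)
            continue
        # Position of the last period, up to the current one, whose
        # expected value was <= the current actual_sum; -1 if none exists.
        last_ok = -1
        for j in range(i + 1):
            if expected[j] <= a:
                last_ok = j
        late.append(i - last_ok)
    return late


# id 3 (first three rows): nothing has been paid, so it is late from row one.
id3 = periods_late([0, 0, 0], [25748, 51496, 77244])

# id 10006 (first four rows): falls one period behind on the fourth row.
id10006 = periods_late([26425, 52850, 79275, 79275],
                       [26425, 52850, 79275, 105700])
```

For id 3 no earlier row qualifies, which is why the "last good" position starts at -1 and the first row is already 1 period late.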

Any ideas for a vectorized approach to something like this? I can't think of anything. Currently most of my run time is spent on finding the last time within this id that we have something less than our current actual_sum:

last_current = d[(d.id==cur_id)&(d.expected_to_date > cur_sum)][:idx]

I have to add 1 to this result to get what I want, but it does work.


1 Answer

#1

I think this comes close to what you want. It does not get rid of every for-loop, but it reduces the number. The data is in a DataFrame called df:

df.reset_index(inplace=True)   # get a unique index
df['ones'] = 1  # temp column of ones (for use in cumulative sum)
df['Later'] = 0 # this is where our result will be put 
for label, group in df.groupby('id'):
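    # for each group select the late rows and number them from 1 using cumsum()
    late = group[group.expected_to_date > group.actual_sum]
    # assign through .loc: under copy-on-write the chained
    # df['Later'].update(...) would not modify df
    df.loc[late.index, 'Later'] = late['ones'].cumsum()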
del df['ones']  # remove temporary working column
df.set_index('id', inplace=True) # restore original index

Which yields ...


In [66]: df
Out[66]: 
       actual_sum  expected_to_date   Late  Later
id                                               
1           11086             11086      0      0
1           22172             22172      0      0
1           33258             33258      0      0
1           33258             44344      1      1
1           33258             55430      2      2
1           33258             66516      3      3
1           33258             77602      4      4
1           33258             88688      5      5
3               0             25748  1****      1
3               0             51496      2      2
3               0             77244      3      3
3               0            102992      4      4
3               0            128740      5      5
3               0            154488      6      6
10005       19818             19818      0      0
10005       19818             39636      1      1
10005       59454             59454      0      0
10005       79272             79272      0      0
10005       79272             99090      1      2
10005       99090            118908      1      3
10005      118908            138726      1      4
10005      138726            158544      1      5
10005      164544            178362      1      6
10005      184362            198180      1      7
10005      184362            217998      2      8
10005      184362            237824      3      9
10006       26425             26425      0      0
10006       52850             52850      0      0
10006       79275             79275      0      0
10006       79275            105700      1      1
10006      132125            132125      0      0
10006      158550            158550      0      0
10006      158550            184975      1      2
10006      158550            211400      2      3
10006      158550            237825      3      4
10006      158550            264250      4      5
10006      158550            290666      5      6
10006      158550            317091      6      7
10009       21217             21217      0      0
10009       42434             42434      0      0
10009       63651             63651      0      0
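
Note that Later is a running count of behind-schedule rows per id, so it agrees with Late only while an id never catches up (compare id 10005, where Late stays at 1 but Later climbs to 9). If that running count is what you need, the remaining loop over groups can probably be dropped as well. A minimal sketch, assuming id is an ordinary column rather than the index, shown on the first five rows of id 10005:

```python
import pandas as pd

# First five rows of id 10005 from the table above.
df = pd.DataFrame({
    "id": [10005] * 5,
    "actual_sum": [19818, 19818, 59454, 79272, 79272],
    "expected_to_date": [19818, 39636, 59454, 79272, 99090],
})

# True where the row is behind schedule.
behind = df["expected_to_date"] > df["actual_sum"]

# Running count of behind-schedule rows within each id,
# set back to 0 on the rows that are on schedule.
df["Later"] = behind.astype(int).groupby(df["id"]).cumsum().where(behind, 0)
```

This reproduces the Later values for those rows (0, 1, 0, 0, 2) without the temporary ones column.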

For completeness, this is how I got your data into a dataframe called df ...


data="""id       actual_sum  expected_to_date           Late
1             11086             11086              0
1             22172             22172              0
1             33258             33258              0
1             33258             44344              1
1             33258             55430              2
1             33258             66516              3
1             33258             77602              4
1             33258             88688              5
3                 0             25748              1****
3                 0             51496              2
3                 0             77244              3
3                 0            102992              4
3                 0            128740              5
3                 0            154488              6
10005         19818             19818              0
10005         19818             39636              1
10005         59454             59454              0
10005         79272             79272              0
10005         79272             99090              1
10005         99090            118908              1
10005        118908            138726              1
10005        138726            158544              1
10005        164544            178362              1
10005        184362            198180              1
10005        184362            217998              2
10005        184362            237824              3
10006         26425             26425              0
10006         52850             52850              0
10006         79275             79275              0
10006         79275            105700              1
10006        132125            132125              0
10006        158550            158550              0
10006        158550            184975              1
10006        158550            211400              2
10006        158550            237825              3
10006        158550            264250              4
10006        158550            290666              5
10006        158550            317091              6
10009         21217             21217              0
10009         42434             42434              0
10009         63651             63651              0
"""
import pandas as pd
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
df = pd.read_csv(StringIO(data), header=0, index_col=0, sep=r'\s+')
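
If the goal is to reproduce the Late column exactly, one possible direction is np.searchsorted. It relies on expected_to_date being sorted within each id, which should hold for a cumulative sum but is an assumption here. It also assumes the rows are in period order, that id is a regular column with a unique row index, and that a row ahead of schedule counts as 0. This still loops over the groups, but each group is handled with array operations. The column name Late_calc is made up for the sketch:

```python
import numpy as np
import pandas as pd

# id 3 (first three rows) and id 10006 (first eight rows) from the table.
df = pd.DataFrame({
    "id": [3] * 3 + [10006] * 8,
    "actual_sum": [0, 0, 0,
                   26425, 52850, 79275, 79275,
                   132125, 158550, 158550, 158550],
    "expected_to_date": [25748, 51496, 77244,
                         26425, 52850, 79275, 105700,
                         132125, 158550, 184975, 211400],
})

pieces = []
for _, group in df.groupby("id", sort=False):
    expected = group["expected_to_date"].to_numpy()
    actual = group["actual_sum"].to_numpy()
    # For each row: how many periods of this id have an expected value
    # <= that row's actual_sum. Only valid because expected is sorted.
    covered = np.searchsorted(expected, actual, side="right")
    # 1-based period number of each row within its id.
    period = np.arange(1, len(group) + 1)
    # Periods behind; clipped at 0 in case actual is ahead of expected.
    pieces.append(pd.Series(np.clip(period - covered, 0, None),
                            index=group.index))

# Aligns on the row index, so the index must be unique.
df["Late_calc"] = pd.concat(pieces)
```

On these rows the result matches the hand-filled Late values, including id 3 being late by 1 on its very first row.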
