pandas: remove all rows within a time interval of another series' time index (i.e. time range exclusion)

Suppose I have two dataframes:

#df1
time
2016-09-12 13:00:00.017    1.0
2016-09-12 13:00:03.233    1.0
2016-09-12 13:00:10.256    1.0
2016-09-12 13:00:19.605    1.0

#df2
time
2016-09-12 13:00:00.017    1.0
2016-09-12 13:00:00.233    0.0
2016-09-12 13:00:01.016    1.0
2016-09-12 13:00:01.505    0.0
2016-09-12 13:00:06.017    1.0
2016-09-12 13:00:07.233    0.0
2016-09-12 13:00:08.256    1.0
2016-09-12 13:00:19.705    0.0

I want to remove every row in df2 whose timestamp falls within 1 second after any timestamp in df1's index, yielding:

#result
time
2016-09-12 13:00:01.505    0.0
2016-09-12 13:00:06.017    1.0
2016-09-12 13:00:07.233    0.0
2016-09-12 13:00:08.256    1.0

What's the most efficient way to do this? I don't see anything useful for time range exclusions in the API.

3 solutions

#1

You can use pd.merge_asof, which was added in pandas 0.19.0. Its tolerance argument limits how far apart in time two keys may be and still match.

# Assuming time is set as the index of both DataFrames
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)

# df2 rows with no df1 match inside the tolerance come back as NaN; keep those
df2.loc[pd.merge_asof(df2, df1, on='time', tolerance=pd.Timedelta('1s')).isnull().any(axis=1)]

Note that matching defaults to the backward direction: each row of the left DataFrame (df2) is matched to the last row of the right DataFrame (df1) whose "on" key ("time") is less than or equal to the left key. The tolerance therefore extends only backward, so a df2 row is matched when a df1 timestamp lies at most 1 second before it.
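
As a self-contained check of the backward-matching behaviour, here is a minimal sketch that rebuilds the question's sample data and applies the same merge. The column names "flag" and "value" are assumptions made for illustration, since the original frames do not name their value columns.

```python
import pandas as pd

# Sample data from the question; "flag" and "value" are hypothetical column names.
df1 = pd.DataFrame({
    "time": pd.to_datetime([
        "2016-09-12 13:00:00.017",
        "2016-09-12 13:00:03.233",
        "2016-09-12 13:00:10.256",
        "2016-09-12 13:00:19.605",
    ]),
    "flag": 1.0,
})
df2 = pd.DataFrame({
    "time": pd.to_datetime([
        "2016-09-12 13:00:00.017",
        "2016-09-12 13:00:00.233",
        "2016-09-12 13:00:01.016",
        "2016-09-12 13:00:01.505",
        "2016-09-12 13:00:06.017",
        "2016-09-12 13:00:07.233",
        "2016-09-12 13:00:08.256",
        "2016-09-12 13:00:19.705",
    ]),
    "value": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
})

# Backward as-of merge: each df2 row looks for the latest df1 time that is
# at or before it and no more than 1 second earlier.
merged = pd.merge_asof(df2, df1, on="time", tolerance=pd.Timedelta("1s"))

# Rows without a match have NaN in the column that came from df1; keep those.
unmatched = merged["flag"].isna().to_numpy()
result = df2.loc[unmatched]
print(result)
```

Both frames must already be sorted by "time", which merge_asof requires.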

To look both forward and backward, pass direction='nearest', which is available from 0.20.0. The tolerance then applies on both sides, giving a +/- 1 second matching window.
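
The difference between the two directions can be seen on a small made-up example. This is only a sketch: the frames "events" and "samples" and their column names are hypothetical and not part of the question.

```python
import pandas as pd

# Hypothetical data: one event at 13:00:10, and samples 0.5 s before it,
# 0.5 s after it, and 2 s after it.
events = pd.DataFrame({
    "time": pd.to_datetime(["2016-09-12 13:00:10.000"]),
    "event": [1.0],
})
samples = pd.DataFrame({
    "time": pd.to_datetime([
        "2016-09-12 13:00:09.500",
        "2016-09-12 13:00:10.500",
        "2016-09-12 13:00:12.000",
    ]),
    "value": [1.0, 2.0, 3.0],
})
tol = pd.Timedelta("1s")

# Default direction ("backward"): only events at or before the sample count.
backward = pd.merge_asof(samples, events, on="time", tolerance=tol)
# direction="nearest": events on either side of the sample count.
nearest = pd.merge_asof(samples, events, on="time", tolerance=tol,
                        direction="nearest")

# Keep the samples that found no event inside the tolerance.
kept_backward = samples.loc[backward["event"].isna().to_numpy()]
kept_nearest = samples.loc[nearest["event"].isna().to_numpy()]
```

With the backward default, the sample 0.5 s before the event is kept because the event lies after it. With direction="nearest" that sample is excluded as well, so only the sample 2 s away survives.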

#2

Similar idea to @Nickil Maveli's, but using reindex to build a Boolean indexer:

df2 = df2[df1.reindex(df2.index, method='nearest', tolerance=pd.Timedelta('1s')).isnull()]

The resulting output:

time
2016-09-12 13:00:01.505    0.0
2016-09-12 13:00:06.017    1.0
2016-09-12 13:00:07.233    0.0
2016-09-12 13:00:08.256    1.0
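
One point worth noting: method='nearest' looks in both directions, so it also drops rows that sit up to 1 second before a df1 timestamp. If only the +1 second window from the question is wanted, method='ffill' may be closer to the intent. The sketch below compares the two on hypothetical Series (the names "events" and "samples" are made up for illustration) and assumes both objects are Series with sorted, unique datetime indexes.

```python
import pandas as pd

# Hypothetical data: one event at 13:00:10, and samples 0.5 s before it,
# 0.5 s after it, and 2 s after it.
events = pd.Series(1.0, index=pd.to_datetime(["2016-09-12 13:00:10.000"]))
samples = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime([
        "2016-09-12 13:00:09.500",
        "2016-09-12 13:00:10.500",
        "2016-09-12 13:00:12.000",
    ]),
)
tol = pd.Timedelta("1s")

# method="nearest": an event within 1 s on either side counts as a match.
near = events.reindex(samples.index, method="nearest", tolerance=tol)
# method="ffill": only an event at or before the sample, within 1 s, counts.
prev = events.reindex(samples.index, method="ffill", tolerance=tol)

# NaN means no event was found inside the tolerance, so the sample is kept.
kept_nearest = samples[near.isna()]
kept_ffill = samples[prev.isna()]
```

On the question's sample data both methods give the same four rows, because no df2 timestamp lies within 1 second before a df1 timestamp.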

#3

One way to do it is to look up via time indexing (assuming both time columns are indices):

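td = pd.to_timedelta(1, unit='s')
# True where df1 has at least one timestamp in [row time - 1s, row time]
mask = df2.apply(lambda row: df1.loc[row.name - td:row.name].size > 0, axis=1)
df2[~mask]  # keep only the rows outside every window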
