Suppose I have two dataframes:
#df1
time
2016-09-12 13:00:00.017 1.0
2016-09-12 13:00:03.233 1.0
2016-09-12 13:00:10.256 1.0
2016-09-12 13:00:19.605 1.0
#df2
time
2016-09-12 13:00:00.017 1.0
2016-09-12 13:00:00.233 0.0
2016-09-12 13:00:01.016 1.0
2016-09-12 13:00:01.505 0.0
2016-09-12 13:00:06.017 1.0
2016-09-12 13:00:07.233 0.0
2016-09-12 13:00:08.256 1.0
2016-09-12 13:00:19.705 0.0
I want to remove all rows in df2 that fall within +1 second of any time index in df1, yielding:
#result
time
2016-09-12 13:00:01.505 0.0
2016-09-12 13:00:06.017 1.0
2016-09-12 13:00:07.233 0.0
2016-09-12 13:00:08.256 1.0
What's the most efficient way to do this? I don't see anything useful for time range exclusions in the API.
3 solutions
#1
11
You can use pd.merge_asof, which was added in pandas 0.19.0 and accepts a tolerance argument that limits how far apart the matched keys may be.
# Assuming time is set as the index of both DataFrames
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
# Keep the df2 rows for which no df1 row was matched (NaN in the merged columns)
df2.loc[pd.merge_asof(df2, df1, on='time', tolerance=pd.Timedelta('1s')).isnull().any(axis=1)]
Note that matching is carried out in the backward direction by default: for each row of the left DataFrame (df2), the last row of the right DataFrame (df1) whose "on" key ("time") is less than or equal to the left key is selected. The tolerance parameter therefore extends only backward, so a df2 row is matched only if a df1 time lies at most 1 second before it.
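As a rough, self-contained sketch of the backward match described above, the sample data from the question can be run end to end. The column name "value" and the helper column "df1_time" are assumptions made for illustration; the original frames do not show a column name.

```python
import pandas as pd

base = "2016-09-12 13:00:"
# Sample data from the question; the column name "value" is an assumption.
df1 = pd.DataFrame(
    {"value": [1.0, 1.0, 1.0, 1.0]},
    index=pd.DatetimeIndex(
        [base + s for s in ["00.017", "03.233", "10.256", "19.605"]], name="time"
    ),
)
df2 = pd.DataFrame(
    {"value": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]},
    index=pd.DatetimeIndex(
        [base + s for s in ["00.017", "00.233", "01.016", "01.505",
                            "06.017", "07.233", "08.256", "19.705"]],
        name="time",
    ),
)

left = df2.reset_index()
# Carry df1's key under a separate (hypothetical) name so that a missing
# value in that column marks "no df1 time within the tolerance".
right = df1.reset_index()[["time"]]
right["df1_time"] = right["time"]

# Default direction is backward: look for a df1 time at most 1s before each df2 time.
merged = pd.merge_asof(left, right, on="time", tolerance=pd.Timedelta("1s"))

# Keep only the df2 rows that found no match.
result = left[merged["df1_time"].isna().to_numpy()].set_index("time")
print(result)
```

Checking a dedicated key column rather than any NaN in the merged frame avoids dropping or keeping rows by accident when the data columns themselves contain missing values.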
To look both forward and backward, pass direction='nearest' in the function call (available from 0.20.0). The tolerance then extends both ways, giving a +/- matching window.
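To make the difference between the two directions concrete, here is a small sketch. The timestamps are hypothetical, chosen so that one left-hand row falls just before the single right-hand time, and "df1_time" is an illustrative helper column.

```python
import pandas as pd

# Hypothetical data: one reference time and three candidate times around it.
right = pd.DataFrame({"time": pd.to_datetime(["2016-09-12 13:00:10.000"])})
right["df1_time"] = right["time"]
left = pd.DataFrame({"time": pd.to_datetime([
    "2016-09-12 13:00:09.500",   # 0.5s before the reference time
    "2016-09-12 13:00:10.500",   # 0.5s after the reference time
    "2016-09-12 13:00:12.000",   # 2s after, outside the tolerance
])})
tol = pd.Timedelta("1s")

backward = pd.merge_asof(left, right, on="time", tolerance=tol)
nearest = pd.merge_asof(left, right, on="time", tolerance=tol, direction="nearest")

print(backward["df1_time"].notna().tolist())  # [False, True, False]
print(nearest["df1_time"].notna().tolist())   # [True, True, False]
```

Only the "nearest" call matches the row that precedes the reference time, so for the question as asked (rows up to +1 second after a df1 time) the default backward direction is the closer fit.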
#2
4
A similar idea to @Nickil Maveli's, but using reindex to build a Boolean indexer:
df2 = df2[df1.reindex(df2.index, method='nearest', tolerance=pd.Timedelta('1s')).isnull()]
The resulting output:
time
2016-09-12 13:00:01.505 0.0
2016-09-12 13:00:06.017 1.0
2016-09-12 13:00:07.233 0.0
2016-09-12 13:00:08.256 1.0
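Two caveats may be worth keeping in mind with the reindex approach. The one-liner filters rows when df1 and df2 are Series; if they are DataFrames, a Boolean DataFrame masks cell by cell instead. Also, method='nearest' looks in both directions. The following sketch assumes DataFrames with a column named "value" (an assumption, since the question does not show a column name) and uses method='pad' to look backward only:

```python
import pandas as pd

base = "2016-09-12 13:00:"
# Sample data from the question; the column name "value" is an assumption.
df1 = pd.DataFrame(
    {"value": [1.0, 1.0, 1.0, 1.0]},
    index=pd.DatetimeIndex(
        [base + s for s in ["00.017", "03.233", "10.256", "19.605"]], name="time"
    ),
)
df2 = pd.DataFrame(
    {"value": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]},
    index=pd.DatetimeIndex(
        [base + s for s in ["00.017", "00.233", "01.016", "01.505",
                            "06.017", "07.233", "08.256", "19.705"]],
        name="time",
    ),
)

# Reindex a single column (a Series) so the result is a 1-D Boolean mask.
# method="pad" takes the last df1 time at or before each df2 time;
# tolerance discards matches that are more than 1 second old.
matched = df1["value"].reindex(
    df2.index, method="pad", tolerance=pd.Timedelta("1s")
)

# NaN means no df1 time in the preceding second, so the row is kept.
# This relies on df1["value"] itself containing no missing values.
result = df2[matched.isna()]
print(result)
```

reindex with a fill method requires df1's index to be sorted and free of duplicates.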
#3
1
One way to do it is to look up via time indexing (assuming both time columns are indices):
td = pd.to_timedelta(1, unit='s')
# True for df2 rows that have a df1 time in the preceding 1 second
mask = df2.apply(lambda row: df1.loc[row.name - td:row.name].size > 0, axis=1)
df2[~mask]
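The row-by-row apply above performs one slice of df1 per row of df2, which can be slow on large frames. Below is a vectorized sketch of the same backward 1-second lookup using numpy.searchsorted. This is a different technique from the one in the answer, and it assumes both indices are sorted and that the data column is named "value" (not shown in the question).

```python
import numpy as np
import pandas as pd

base = "2016-09-12 13:00:"
# Sample data from the question; the column name "value" is an assumption.
df1 = pd.DataFrame(
    {"value": [1.0, 1.0, 1.0, 1.0]},
    index=pd.DatetimeIndex(
        [base + s for s in ["00.017", "03.233", "10.256", "19.605"]], name="time"
    ),
)
df2 = pd.DataFrame(
    {"value": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]},
    index=pd.DatetimeIndex(
        [base + s for s in ["00.017", "00.233", "01.016", "01.505",
                            "06.017", "07.233", "08.256", "19.705"]],
        name="time",
    ),
)

t1 = df1.index.values  # sorted datetime64 array
t2 = df2.index.values

# Position of the last df1 time <= each df2 time (-1 when there is none).
pos = np.searchsorted(t1, t2, side="right") - 1
has_prev = pos >= 0

# Gap to that previous df1 time; entries with no previous time are masked out.
gap = t2 - t1[np.clip(pos, 0, None)]
within = has_prev & (gap <= np.timedelta64(1, "s"))

# Drop the df2 rows that fall within 1 second after a df1 time.
result = df2[~within]
print(result)
```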