Merge pandas dataframes where one value is between two others [duplicate]

Date: 2022-05-28 22:54:58

I need to merge two pandas dataframes on an identifier and a condition where a date in one dataframe is between two dates in the other dataframe.

Dataframe A has a date ("fdate") and an ID ("cusip").

I need to merge this with dataframe B on A.cusip == B.ncusip, with A.fdate between B.namedt and B.nameenddt.

In SQL this would be trivial, but the only way I can see how to do this in pandas is to first merge unconditionally on the identifier, and then filter on the date condition:

df = pd.merge(A, B, how='inner', left_on='cusip', right_on='ncusip')
df = df[(df['fdate']>=df['namedt']) & (df['fdate']<=df['nameenddt'])]

Is this really the best way to do this? It seems that it would be much better if one could filter within the merge so as to avoid having a potentially very large dataframe after the merge but before the filter has completed.
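
For concreteness, the merge-then-filter pattern from the question can be run on a small made-up dataset. The rows below, including the permno values, are invented for illustration; Series.between is used here as an equivalent way to write the two inclusive comparisons.

```python
import pandas as pd

# Invented sample data using the column names from the question.
A = pd.DataFrame({
    "cusip": ["001", "001", "002"],
    "fdate": pd.to_datetime(["2010-03-31", "2012-03-31", "2011-06-30"]),
})
B = pd.DataFrame({
    "ncusip": ["001", "001", "002"],
    "namedt": pd.to_datetime(["2009-01-01", "2011-01-01", "2012-01-01"]),
    "nameenddt": pd.to_datetime(["2010-12-31", "2013-12-31", "2012-12-31"]),
    "permno": [10, 11, 20],
})

# Step 1: unconditional merge on the identifier (every A row meets every
# B row that shares its identifier).
merged = A.merge(B, how="inner", left_on="cusip", right_on="ncusip")

# Step 2: keep only rows whose fdate falls inside [namedt, nameenddt].
in_range = merged["fdate"].between(merged["namedt"], merged["nameenddt"])
result = merged.loc[in_range].reset_index(drop=True)

print(len(merged), len(result))
```

The intermediate frame here has five rows for a final result of two, which is the blow-up the question is worried about.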

4 answers

#1


17  

As you say, this is pretty easy in SQL, so why not do it in SQL?

from datetime import datetime
import sqlite3

import pandas as pd

#We'll use firelynx's tables:
presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"],
                           "president_id":[43, 44, 45]})
terms = pd.DataFrame({'start_date': pd.date_range('2001-01-20', periods=5, freq='48M'),
                      'end_date': pd.date_range('2005-01-21', periods=5, freq='48M'),
                      'president_id': [43, 43, 44, 44, 45]})
war_declarations = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)],
                                 "name": ["War in Afghanistan", "Iraq War"]})
#Make the db in memory
conn = sqlite3.connect(':memory:')
#write the tables
terms.to_sql('terms', conn, index=False)
presidents.to_sql('presidents', conn, index=False)
war_declarations.to_sql('wars', conn, index=False)

qry = '''
    select  
        start_date PresTermStart,
        end_date PresTermEnd,
        wars.date WarStart,
        presidents.name Pres
    from
        terms join wars on
        date between start_date and end_date join presidents on
        terms.president_id = presidents.president_id
    '''
df = pd.read_sql_query(qry, conn)

df:

         PresTermStart          PresTermEnd             WarStart  Pres
0  2001-01-31 00:00:00  2005-01-31 00:00:00  2001-09-14 00:00:00  Bush
1  2001-01-31 00:00:00  2005-01-31 00:00:00  2003-03-03 00:00:00  Bush
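
The same idea can be applied to the question's A and B tables. This is a minimal sketch with made-up rows; the dates are written to SQLite as ISO strings (an assumption made here so that BETWEEN compares them chronologically as text) and parsed back into datetimes on read.

```python
import sqlite3

import pandas as pd

# Invented sample data using the column names from the question.
A = pd.DataFrame({
    "cusip": ["001", "001", "002"],
    "fdate": pd.to_datetime(["2010-03-31", "2012-03-31", "2011-06-30"]),
})
B = pd.DataFrame({
    "ncusip": ["001", "001", "002"],
    "namedt": pd.to_datetime(["2009-01-01", "2011-01-01", "2012-01-01"]),
    "nameenddt": pd.to_datetime(["2010-12-31", "2013-12-31", "2012-12-31"]),
    "permno": [10, 11, 20],
})

conn = sqlite3.connect(":memory:")

# Store dates as YYYY-MM-DD text; for this format string order equals date order.
A.assign(fdate=A["fdate"].dt.strftime("%Y-%m-%d")).to_sql("A", conn, index=False)
B.assign(
    namedt=B["namedt"].dt.strftime("%Y-%m-%d"),
    nameenddt=B["nameenddt"].dt.strftime("%Y-%m-%d"),
).to_sql("B", conn, index=False)

query = """
    SELECT A.cusip, A.fdate, B.permno
    FROM A
    JOIN B
      ON A.cusip = B.ncusip
     AND A.fdate BETWEEN B.namedt AND B.nameenddt
    ORDER BY A.cusip, A.fdate
"""
result = pd.read_sql_query(query, conn, parse_dates=["fdate"])
conn.close()

print(result)
```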

#2


9  

You should be able to do this now using the package pandasql

import pandasql as ps

sqlcode = '''
select A.cusip
from A
inner join B on A.cusip=B.ncusip
where A.fdate >= B.namedt and A.fdate <= B.nameenddt
group by A.cusip
'''

newdf = ps.sqldf(sqlcode,locals())

I think the answer from @ChuHo is good, and I believe pandasql is doing much the same thing for you. I haven't benchmarked the two, but the pandasql version is easier to read.

#3


6  

There is no pandas-native way of doing this at the moment.

This answer used to be about tackling the problem with polymorphism, which turned out to be a very bad idea.

Then the numpy.piecewise function appeared in another answer, but with little explanation, so I thought I would clarify how this function can be used.

Numpy way with piecewise (Memory heavy)

The np.piecewise function can be used to emulate the behavior of a custom join. There is a lot of overhead involved and it is not very efficient per se, but it does the job.

Producing conditions for joining

from datetime import datetime

import numpy as np
import pandas as pd


presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"],
                           "president_id":[43, 44, 45]})
terms = pd.DataFrame({'start_date': pd.date_range('2001-01-20', periods=5, freq='48M'),
                      'end_date': pd.date_range('2005-01-21', periods=5, freq='48M'),
                      'president_id': [43, 43, 44, 44, 45]})
war_declarations = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)],
                                 "name": ["War in Afghanistan", "Iraq War"]})

start_end_date_tuples = zip(terms.start_date.values, terms.end_date.values)
conditions = [(war_declarations.date.values >= start_date) &
              (war_declarations.date.values <= end_date) for start_date, end_date in start_end_date_tuples]

> conditions
[array([ True,  True], dtype=bool),
 array([False, False], dtype=bool),
 array([False, False], dtype=bool),
 array([False, False], dtype=bool),
 array([False, False], dtype=bool)]

This is a list of arrays, one per term, where each array tells us whether that term's time span contains each of the two war declarations we have. The conditions can blow up with larger datasets, since their total size is the length of the left df multiplied by the length of the right df.
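
To see how large the conditions grow, the same comparison can be written as one broadcast NumPy array with a row per term and a column per date. The arrays below are illustrative values typed in by hand, not pulled from the dataframes above.

```python
import numpy as np
import pandas as pd

# Hand-written interval bounds (one per term) and dates to test (one per war).
starts = pd.to_datetime(["2001-01-31", "2005-01-31", "2009-01-31"]).to_numpy()
ends = pd.to_datetime(["2005-01-31", "2009-01-31", "2013-01-31"]).to_numpy()
dates = pd.to_datetime(["2001-09-14", "2003-03-03"]).to_numpy()

# Broadcasting (n_terms, 1) against (1, n_dates) yields an
# (n_terms, n_dates) boolean array: one cell per pair.
mask = (dates[None, :] >= starts[:, None]) & (dates[None, :] <= ends[:, None])

print(mask.shape, mask.size)
```

Each row of mask corresponds to one entry of the conditions list, so memory scales with the product of the two lengths.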

The piecewise "magic"

Now piecewise will take the president_id from the terms and place it in the war_declarations dataframe for each of the corresponding wars.

war_declarations['president_id'] = np.piecewise(np.zeros(len(war_declarations)),
                                                conditions,
                                                terms.president_id.values)

    date        name                president_id
0   2001-09-14  War in Afghanistan          43.0
1   2003-03-03  Iraq War                    43.0
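
As a rough illustration of what np.piecewise does with constant values, here is a sketch on plain integers instead of dates (the numbers are invented). It also shows two behaviours worth knowing before relying on it as a join: entries that match no condition stay at 0, and when conditions overlap the later one wins.

```python
import numpy as np

points = np.array([1, 5, 9, 20])

# Two inclusive intervals that overlap at 4..5, each mapped to a constant.
conditions = [
    (points >= 0) & (points <= 5),   # interval [0, 5]  -> 100
    (points >= 4) & (points <= 10),  # interval [4, 10] -> 200
]
values = [100, 200]

# The first argument only sets the output shape and dtype (float here).
out = np.piecewise(np.zeros(len(points)), conditions, values)

print(out)
```

The point 5 lies in both intervals and ends up with 200, while 20 lies in neither and keeps the default 0, so a result of 0 cannot be told apart from a genuine id of 0.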

Now, to finish this example, we just need a regular merge to bring in the presidents' names.

war_declarations.merge(presidents, on="president_id", suffixes=["_war", "_president"])

    date        name_war            president_id    name_president
0   2001-09-14  War in Afghanistan          43.0    Bush
1   2003-03-03  Iraq War                    43.0    Bush

Polymorphism (does not work)

I wanted to share my research efforts, so even though this does not solve the problem, I hope it will be allowed to live on here as a useful reply at least. Since the error is hard to spot, someone else may try this and think they have a working solution when in fact they don't.

The only other way I could figure out is to create two new classes, one PointInTime and one Timespan.

Both should have __eq__ methods that return true if a PointInTime is compared to a Timespan which contains it.

After that you can fill your DataFrame with these objects, and join on the columns they live in.

Something like this:

class PointInTime(object):

    def __init__(self, year, month, day):
        self.dt = datetime(year, month, day)

    def __eq__(self, other):
        return other.start_date < self.dt < other.end_date

    def __ne__(self, other):
        return not self.__eq__(other)

    def __repr__(self):
        return "{}-{}-{}".format(self.dt.year, self.dt.month, self.dt.day)

class Timespan(object):
    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date = end_date

    def __eq__(self, other):
        return self.start_date < other.dt < self.end_date

    def __ne__(self, other):
        return not self.__eq__(other)

    def __repr__(self):
        return "{}-{}-{} -> {}-{}-{}".format(self.start_date.year, self.start_date.month, self.start_date.day,
                                             self.end_date.year, self.end_date.month, self.end_date.day)

Important note: I do not subclass datetime because pandas will consider the dtype of a column of datetime objects to be a datetime dtype, and since the timespan is not, pandas silently refuses to merge on them.

If we instantiate two objects of these classes, they can now be compared:

pit = PointInTime(2015,1,1)
ts = Timespan(datetime(2014,1,1), datetime(2015,2,2))
pit == ts
True

We can also fill two DataFrames with these objects:

df = pd.DataFrame({"pit":[PointInTime(2015,1,1), PointInTime(2015,2,2), PointInTime(2015,3,3)]})

df2 = pd.DataFrame({"ts":[Timespan(datetime(2015,2,1), datetime(2015,2,5)), Timespan(datetime(2015,2,1), datetime(2015,4,1))]})

And then the merge kind of works:

pd.merge(left=df, left_on='pit', right=df2, right_on='ts')

        pit                    ts
0  2015-2-2  2015-2-1 -> 2015-2-5
1  2015-2-2  2015-2-1 -> 2015-4-1

But only kind of.

PointInTime(2015,3,3) should also have been included in this join, matched to Timespan(datetime(2015,2,1), datetime(2015,4,1)).

But it is not.

I figure pandas compares PointInTime(2015,3,3) to PointInTime(2015,2,2) and assumes that, since they are not equal, PointInTime(2015,3,3) cannot be equal to Timespan(datetime(2015,2,1), datetime(2015,4,1)), since this timespan was equal to PointInTime(2015,2,2).

Sort of like this:

Rose == Flower
Lilly != Rose

Therefore:

Lilly != Flower

Edit:

I tried to make all PointInTime objects equal to each other. This changed the behaviour of the join to include 2015-3-3, but 2015-2-2 was then only included for the Timespan 2015-2-1 -> 2015-2-5, which strengthens my hypothesis above.

If anyone has any other ideas, please comment and I can try them.
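
One plausible contributing factor, offered as an assumption rather than a confirmed reading of pandas internals, is that joins are usually implemented by hashing the keys instead of calling __eq__ on every pair, so an equality that is not backed by matching hashes is never consulted. The sketch below uses a plain Python dict, not pandas, and Span is a hypothetical class made up for the demonstration.

```python
class Span:
    """Hypothetical interval that claims equality with any number it contains."""

    def __init__(self, lo, hi):
        self.lo = lo
        self.hi = hi

    def __eq__(self, other):
        return self.lo <= other <= self.hi

    def __hash__(self):
        # Hash of the bounds; unrelated to the hash of any contained number.
        return hash((self.lo, self.hi))


span = Span(1, 10)
lookup = {5: "five"}

# Direct comparison goes through __eq__ and succeeds.
equal_by_eq = (span == 5)

# The dict looks up by hash first; the hashes differ, so __eq__ is never
# reached and the key is reported as missing.
found_by_hash = (span in lookup)

print(equal_by_eq, found_by_hash)
```

Any containment-style equality runs into this, because an interval cannot share a hash with every point inside it.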

#4


3  

A pandas solution implemented along the lines of foverlaps() from the data.table package in R would be great. So far I've found numpy's piecewise() to be efficient. I've provided the code based on an earlier discussion, "Merging dataframes based on date range".

import numpy as np

# One condition per row of B: same identifier and fdate inside [namedt, nameenddt].
conditions = [(A['cusip'].values == ncusip) & (A['fdate'].values >= start) & (A['fdate'].values <= end)
              for ncusip, start, end in zip(B['ncusip'].values, B['namedt'].values, B['nameenddt'].values)]
A['permno'] = np.piecewise(np.zeros(len(A)), conditions, B['permno'].values).astype(int)
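
A different technique, not used in any of the answers above, is pd.merge_asof: it matches each row of A to the most recent B interval start for the same identifier, after which a single comparison against the interval end finishes the job. This sketch assumes the intervals for a given identifier do not overlap and uses invented data; it never builds the full identifier-only merge.

```python
import pandas as pd

# Invented sample data using the column names from the question.
A = pd.DataFrame({
    "cusip": ["001", "001", "002"],
    "fdate": pd.to_datetime(["2010-03-31", "2012-03-31", "2011-06-30"]),
})
B = pd.DataFrame({
    "ncusip": ["001", "001", "002"],
    "namedt": pd.to_datetime(["2009-01-01", "2011-01-01", "2012-01-01"]),
    "nameenddt": pd.to_datetime(["2010-12-31", "2013-12-31", "2012-12-31"]),
    "permno": [10, 11, 20],
})

# merge_asof requires both frames to be sorted by their "on" keys.
A_sorted = A.sort_values("fdate")
B_sorted = B.sort_values("namedt")

# For each A row, take the last B row with the same identifier whose
# namedt is <= fdate (direction="backward"); unmatched rows get NaN.
matched = pd.merge_asof(
    A_sorted, B_sorted,
    left_on="fdate", right_on="namedt",
    left_by="cusip", right_by="ncusip",
    direction="backward",
)

# Discard candidates whose interval already ended before fdate
# (comparisons against missing values are False, so unmatched rows drop too).
result = matched[matched["fdate"] <= matched["nameenddt"]].reset_index(drop=True)

print(result[["fdate", "permno"]])
```

If intervals for one identifier can overlap, this returns at most one match per A row, so the merge-then-filter or SQL approaches would be the safer choice.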
