Pandas: chunking a dictionary of dataframes by time

I have a dictionary of dataframes where each dataframe has price and timestamp columns, something like this: {'A': df1, 'B': df2}

I need to build a function that slices the dataframes inside the dict into chunks of H hours of the timestamp and then, for every chunk, passes the resulting dict of dataframes to another function (which does some computation).

How do I go forward with this?

For example

def foo(dict_of_dataframes):
    for id, df in dict_of_dataframes.items():
        do_something()

def boo(dict_of_dataframes, chunksize):
    """
    Needs to chunk up the @dict_of_dataframes in @chunksize hours
    and needs to call foo function on these chunks of
    @dict_of_dataframes
    """

Sample data:

df1:
Time                       Price
2017-03-07 09:47:31+00:00  100
2017-03-07 11:27:31+00:00  120
2017-03-07 14:47:31+00:00  150
2017-03-07 17:17:31+00:00  135
2017-03-07 20:57:31+00:00  200
2017-03-08 03:27:31+00:00  120
2017-03-08 09:57:31+00:00  100
2017-03-08 11:27:31+00:00  150

df2:
Time                       Price
2017-03-07 09:07:31+00:00  200
2017-03-07 10:27:31+00:00  300
2017-03-07 12:47:31+00:00  100
2017-03-07 17:47:31+00:00  250
2017-03-07 22:27:31+00:00  300
2017-03-08 01:57:31+00:00  500
2017-03-08 02:57:31+00:00  500
2017-03-08 10:27:31+00:00  100

I need help with the boo function. How does one go forward with this?

Also, is there a specific term for these kinds of boo functions that drive calls to another function? I've seen them a few times. If you could point to a resource that explains how to design these 'function caller' functions, I'd really appreciate it.

1 Solution

#1

I think what you actually want can be achieved using resample, which is basically a groupby for datetimes. Assuming you need the sum of transactions within each 6-hour window, you can use this:

def boo(dict_dfs, hours):
    # resample needs a DatetimeIndex, or on='Time' to point it at the timestamp column
    return {k: v.resample(f'{hours}h', on='Time').sum() for k, v in dict_dfs.items()}
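
To make the resample idea concrete, here is a minimal sketch on df1 from the question. It assumes that Time is an ordinary column (hence on="Time") and that the timestamps are parsed as UTC; the dataframe is rebuilt by hand from the sample data.

```python
import pandas as pd

# df1 from the question, rebuilt by hand (timestamps parsed as UTC)
df1 = pd.DataFrame({
    "Time": pd.to_datetime([
        "2017-03-07 09:47:31", "2017-03-07 11:27:31",
        "2017-03-07 14:47:31", "2017-03-07 17:17:31",
        "2017-03-07 20:57:31", "2017-03-08 03:27:31",
        "2017-03-08 09:57:31", "2017-03-08 11:27:31",
    ], utc=True),
    "Price": [100, 120, 150, 135, 200, 120, 100, 150],
})

# Sum the prices in 6-hour bins; by default the bins are anchored at midnight
# of the first day, so they start at 00:00, 06:00, 12:00 and 18:00.
sums = df1.resample("6h", on="Time")["Price"].sum()
print(sums.tolist())  # [220, 285, 200, 120, 250]
```

The first bin starts at 06:00 on 2017-03-07 and holds the 09:47 and 11:27 rows, which is where the 220 comes from.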

Now, if you are 100% sure you need dicts instead, use groupby:

import pandas as pd

def boo(dict_dfs, hours):
    return {k: {start: chunk for start, chunk in df.groupby(pd.Grouper(key='Time', freq=f'{hours}h'))} for k, df in dict_dfs.items()}
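
If the goal is literally what the question describes (one call to foo per time window, each call receiving a dict of the matching slices), a sketch along the following lines might work. The body of foo here is a stand-in I made up for illustration, the make_df helper is hypothetical, and the windows are shared across all dataframes by flooring each timestamp to the chunk size.

```python
import pandas as pd

def make_df(rows):
    # Hypothetical helper: rows is a list of (timestamp string, price) pairs.
    times, prices = zip(*rows)
    return pd.DataFrame({"Time": pd.to_datetime(list(times), utc=True),
                         "Price": list(prices)})

def foo(dict_of_dataframes):
    # Stand-in computation (an assumption): total price per key in this chunk.
    return {key: int(df["Price"].sum()) for key, df in dict_of_dataframes.items()}

def boo(dict_of_dataframes, chunksize):
    """Call foo once per chunksize-hour window, passing a dict of the slices."""
    freq = f"{chunksize}h"
    # Label every row with the start of the window it falls into.
    labels = {key: df["Time"].dt.floor(freq)
              for key, df in dict_of_dataframes.items()}
    # Every window start seen in at least one dataframe, in time order.
    windows = pd.concat(list(labels.values())).drop_duplicates().sort_values()
    results = {}
    for start in windows:
        # A key with no rows in this window gets an empty dataframe.
        chunk = {key: df[labels[key] == start]
                 for key, df in dict_of_dataframes.items()}
        results[start] = foo(chunk)
    return results

df1 = make_df([
    ("2017-03-07 09:47:31", 100), ("2017-03-07 11:27:31", 120),
    ("2017-03-07 14:47:31", 150), ("2017-03-07 17:17:31", 135),
    ("2017-03-07 20:57:31", 200), ("2017-03-08 03:27:31", 120),
    ("2017-03-08 09:57:31", 100), ("2017-03-08 11:27:31", 150),
])
df2 = make_df([
    ("2017-03-07 09:07:31", 200), ("2017-03-07 10:27:31", 300),
    ("2017-03-07 12:47:31", 100), ("2017-03-07 17:47:31", 250),
    ("2017-03-07 22:27:31", 300), ("2017-03-08 01:57:31", 500),
    ("2017-03-08 02:57:31", 500), ("2017-03-08 10:27:31", 100),
])

results = boo({"A": df1, "B": df2}, 6)
for start, value in results.items():
    print(start, value)
```

Windows that contain no rows in any dataframe are skipped entirely, so foo is never called on a dict of all-empty frames. For UTC timestamps and a chunk size that divides 24, the floored window starts coincide with resample's default bins.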

Btw, if you want to loop through (key, value) pairs of a dict, use dict.items(), not the dict itself.

And one more note: I have often seen people overcomplicate their data structures. Most of the time you don't need a dict of dataframes. You can use one dataframe with a category column, or even a multi-index (a [category, Time] multi-index in your case). With that, you'll get more reusable, faster and cleaner code!
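
As a rough illustration of the single-dataframe layout, the sketch below stacks the two sample frames into one long dataframe and groups by category and 6-hour window in one step. The column name category, the make_df helper and the combine function are my own choices for the example.

```python
import pandas as pd

def make_df(rows):
    # Hypothetical helper: rows is a list of (timestamp string, price) pairs.
    times, prices = zip(*rows)
    return pd.DataFrame({"Time": pd.to_datetime(list(times), utc=True),
                         "Price": list(prices)})

def combine(dict_of_dataframes):
    # Stack the frames; the dict keys become a "category" column.
    return (
        pd.concat(dict_of_dataframes, names=["category"])
        .reset_index(level="category")
        .reset_index(drop=True)
    )

df1 = make_df([
    ("2017-03-07 09:47:31", 100), ("2017-03-07 11:27:31", 120),
    ("2017-03-07 14:47:31", 150), ("2017-03-07 17:17:31", 135),
    ("2017-03-07 20:57:31", 200), ("2017-03-08 03:27:31", 120),
    ("2017-03-08 09:57:31", 100), ("2017-03-08 11:27:31", 150),
])
df2 = make_df([
    ("2017-03-07 09:07:31", 200), ("2017-03-07 10:27:31", 300),
    ("2017-03-07 12:47:31", 100), ("2017-03-07 17:47:31", 250),
    ("2017-03-07 22:27:31", 300), ("2017-03-08 01:57:31", 500),
    ("2017-03-08 02:57:31", 500), ("2017-03-08 10:27:31", 100),
])

long_df = combine({"A": df1, "B": df2})

# One groupby covers both the category and the 6-hour window.
sums = long_df.groupby(
    ["category", pd.Grouper(key="Time", freq="6h")]
)["Price"].sum()
print(sums)
```

The result is a Series indexed by (category, Time), so sums.loc["A"] gives the per-window totals for the first dataframe without any dict bookkeeping.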
