How to group a dataframe by hour using a Pandas timestamp

I have the following dataframe structure that is indexed with a timestamp:

              neg    neu     norm       pol    pos  date
time
1520353341  0.000  1.000   0.0000  0.000000  0.000
1520353342  0.121  0.879  -0.2960  0.347851  0.000
1520353342  0.217  0.783  -0.6124  0.465833  0.000

I create a date from the timestamp:

data_frame['date'] = [datetime.datetime.fromtimestamp(d) for d in data_frame.time]

Result:

              neg    neu     norm       pol    pos                 date
time
1520353341  0.000  1.000   0.0000  0.000000  0.000  2018-03-06 10:22:21
1520353342  0.121  0.879  -0.2960  0.347851  0.000  2018-03-06 10:22:22
1520353342  0.217  0.783  -0.6124  0.465833  0.000  2018-03-06 10:22:22
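
The list comprehension above can also be written as a single vectorized call. This is only a minimal sketch, assuming `time` holds Unix epoch seconds and is the index; the rows are copied from the question and trimmed to two columns. Note that `pd.to_datetime(..., unit='s')` produces UTC-based times, while `datetime.datetime.fromtimestamp` uses the machine's local time zone, so the clock hours can differ from the ones shown above.

```python
import pandas as pd

# Sample rows copied from the question; 'time' is Unix epoch seconds.
data_frame = pd.DataFrame(
    {"neg": [0.000, 0.121, 0.217], "neu": [1.000, 0.879, 0.783]},
    index=pd.Index([1520353341, 1520353342, 1520353342], name="time"),
)

# Vectorized conversion; the result is timezone-naive and expressed in UTC.
data_frame["date"] = pd.to_datetime(data_frame.index, unit="s")
print(data_frame)
```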

I want to group by hour and take the mean of every column. The timestamp is the exception: instead of being averaged, it should be the start of the hour that each group covers. This is the result I want to achieve:

                 neg       neu      norm       pol       pos
time
1520352000  0.027989  0.893233  0.122535  0.221079  0.078779
1520355600  0.028861  0.899321  0.103698  0.209353  0.071811

The closest I have gotten so far has been with this answer:

data = data.groupby(data.date.dt.hour).mean()

Results:

           neg       neu      norm       pol       pos
date
0     0.027989  0.893233  0.122535  0.221079  0.078779
1     0.028861  0.899321  0.103698  0.209353  0.071811

But I can't figure out how to keep a timestamp that reflects the hour where each group started.

3 solutions

#1

I came across this gem, pd.DataFrame.resample, after I posted my round-to-hour solution.

import pandas as pd

# Construct example dataframe
times = pd.date_range('1/1/2018', periods=5, freq='25min')
values = [4,8,3,4,1]
df = pd.DataFrame({'val':values}, index=times)

# Resample by hour and calculate medians
# ('h' is the hourly alias; 'H' is deprecated since pandas 2.2)
df.resample('h').median()

Or you can use groupby with Grouper if you don't want times as index:

df = pd.DataFrame({'val':values, 'times':times})
# 'times' is a column here, so pass it as key= (level= is for index levels)
df.groupby(pd.Grouper(key='times', freq='h')).median()
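
Applied to the question's layout, the resample approach might look like the sketch below. It assumes the index holds Unix epoch seconds; the sample values and the name `hourly` are made up for illustration. The epoch index is converted to a `DatetimeIndex`, resampled by hour with `mean()`, and then converted back to epoch seconds so the labels match the desired output.

```python
import pandas as pd

# Hypothetical sample spanning two hours; 'time' is Unix epoch seconds.
df = pd.DataFrame(
    {"neg": [0.0, 0.2, 0.4], "pos": [1.0, 0.0, 0.5]},
    index=pd.Index([1520353341, 1520353342, 1520356000], name="time"),
)

# Turn the epoch index into a DatetimeIndex so that resample() can be used.
df.index = pd.to_datetime(df.index, unit="s")

# Mean per hourly bin; each bin is labelled with the start of its hour.
hourly = df.resample("h").mean()

# Convert the labels back to epoch seconds to match the original index.
hourly.index = (hourly.index - pd.Timestamp(0)) // pd.Timedelta(seconds=1)
hourly.index.name = "time"
print(hourly)
```

One thing to keep in mind: `resample` also emits a row (filled with NaN for `mean()`) for every empty hour between the first and last timestamp, so a trailing `.dropna(how="all")` may be wanted when the data has gaps.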

#2

You can round the timestamp column down to the start of its hour:

import math
df.time = [math.floor(t/3600) * 3600 for t in df.time]

Or even simpler, using integer division:

df.time = [(t//3600) * 3600 for t in df.time]

You can group by this column and thus preserve the timestamp.
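
Put together, the rounding and the grouping could look like this. It is a sketch under the assumption that `time` is the index and holds Unix epoch seconds; the sample values and the names `hour_start` and `hourly` are illustrative only.

```python
import pandas as pd

# Hypothetical sample; 'time' is Unix epoch seconds stored in the index.
df = pd.DataFrame(
    {"neg": [0.0, 0.2, 0.4], "pos": [1.0, 0.0, 0.5]},
    index=pd.Index([1520353341, 1520353342, 1520356000], name="time"),
)

# Floor every timestamp to the start of its hour (3600 seconds).
hour_start = df.index // 3600 * 3600

# Group on the floored values; the group keys become the new 'time' index.
hourly = df.groupby(hour_start).mean()
print(hourly)
```

Because the arithmetic is done on the whole index at once, no Python-level loop is needed, and only hours that actually contain data appear in the result.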

#3

Did you try creating an hour column by:

data_frame['hour'] = data_frame.date.dt.hour

Then grouping by hour like:

data = data.groupby(data.hour).mean()
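
A caveat with grouping on `dt.hour`: it keeps only the hour of day (0 to 23), so rows from the same clock hour on different days end up in one group and the starting timestamp is lost. The sketch below contrasts it with `dt.floor`, which keeps the full timestamp of each hour; the sample data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample: the same clock hour (10:00) on two different days.
data_frame = pd.DataFrame({
    "date": pd.to_datetime(["2018-03-06 10:22:21", "2018-03-07 10:05:00"]),
    "neg": [0.1, 0.3],
})

# dt.hour keeps only the hour of day, so both days collapse into group 10.
by_hour_of_day = data_frame.groupby(data_frame.date.dt.hour)["neg"].mean()

# dt.floor('h') keeps the full timestamp, so each day stays a separate group.
by_hour_start = data_frame.groupby(data_frame.date.dt.floor("h"))["neg"].mean()

print(by_hour_of_day)
print(by_hour_start)
```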
