I have the following dataframe structure that is indexed with a timestamp:
neg neu norm pol pos date
time
1520353341 0.000 1.000 0.0000 0.000000 0.000
1520353342 0.121 0.879 -0.2960 0.347851 0.000
1520353342 0.217 0.783 -0.6124 0.465833 0.000
I create a date from the timestamp:
data_frame['date'] = [datetime.datetime.fromtimestamp(d) for d in data_frame.time]
Result:
neg neu norm pol pos date
time
1520353341 0.000 1.000 0.0000 0.000000 0.000 2018-03-06 10:22:21
1520353342 0.121 0.879 -0.2960 0.347851 0.000 2018-03-06 10:22:22
1520353342 0.217 0.783 -0.6124 0.465833 0.000 2018-03-06 10:22:22
I want to group by hour and take the mean of every column. The timestamp should not be averaged; it should be the start of the hour that each group covers. This is the result I want to achieve:
neg neu norm pol pos
time
1520352000 0.027989 0.893233 0.122535 0.221079 0.078779
1520355600 0.028861 0.899321 0.103698 0.209353 0.071811
The closest I have gotten so far has been with this answer:
data = data.groupby(data.date.dt.hour).mean()
Results:
neg neu norm pol pos
date
0 0.027989 0.893233 0.122535 0.221079 0.078779
1 0.028861 0.899321 0.103698 0.209353 0.071811
But I can't figure out how to keep a timestamp that reflects the hour in which each group starts.
3 Solutions
#1
3
I came across this gem, pd.DataFrame.resample, after I posted my round-to-hour solution.
import pandas as pd
# Construct example dataframe
times = pd.date_range('1/1/2018', periods=5, freq='25min')
values = [4,8,3,4,1]
df = pd.DataFrame({'val':values}, index=times)
# Resample by hour and calculate medians
df.resample('H').median()
Or you can use groupby with Grouper if you don't want times as index:
df = pd.DataFrame({'val':values, 'times':times})
# 'times' is a column here, so pass it as key= (level= refers to an index level)
df.groupby(pd.Grouper(key='times', freq='H')).median()
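To connect this to the question's frame, here is a minimal sketch that assumes the index holds epoch seconds in UTC. The sample values and the names `by_time` and `hourly` are made up for illustration. It uses the lowercase `'h'` alias, which newer pandas releases prefer over `'H'`.

```python
import pandas as pd

# Hypothetical sample shaped like the question's frame: epoch seconds as the index.
data_frame = pd.DataFrame(
    {"neg": [0.0, 0.2, 0.4], "pos": [1.0, 0.0, 0.5]},
    index=pd.Index([1520353341, 1520353342, 1520355700], name="time"),
)

# Give a copy a DatetimeIndex so resample can bin it (assumes the epochs are UTC).
by_time = data_frame.copy()
by_time.index = pd.to_datetime(data_frame.index, unit="s")

# Mean per hour; dropna removes hours that contain no rows at all.
hourly = by_time.resample("h").mean().dropna(how="all")

# Turn the bin labels back into epoch seconds marking the start of each hour.
epoch = (hourly.index - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
hourly.index = pd.Index(epoch, name="time")
print(hourly)
```

The resulting index holds the first second of each hour, which is the shape the question asks for.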
#2
1
You can round the timestamp column down to the start of its hour:
import math
df.time = [math.floor(t/3600) * 3600 for t in df.time]
Or even simpler, using integer division:
df.time = [(t//3600) * 3600 for t in df.time]
You can group by this column and thus preserve the timestamp.
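Put together, the rounding and the grouping might look like the sketch below. It assumes `time` is an ordinary column of epoch seconds; the sample rows are invented for illustration.

```python
import pandas as pd

# Hypothetical sample: epoch seconds in a regular "time" column.
df = pd.DataFrame({
    "time": [1520353341, 1520353342, 1520355700],
    "neg": [0.0, 0.2, 0.4],
    "pos": [1.0, 0.0, 0.5],
})

# Floor each timestamp to the start of its hour (vectorized integer division).
df["time"] = df["time"] // 3600 * 3600

# Group on the floored value; each group key is the epoch second at which the hour starts.
hourly = df.groupby("time").mean()
print(hourly)
```

Because the arithmetic works on the raw epoch, the hours line up with UTC hour boundaries.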
#3
-1
Have you tried creating an hour column like this:
data_frame['hour'] = data_frame.date.dt.hour
Then group by that column:
data = data_frame.groupby(data_frame.hour).mean()
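One caveat worth checking: `dt.hour` yields only the hour of day (0 to 23), so rows from the same hour on different days end up in one group and the original timestamp is lost. The sketch below, on made-up sample rows, contrasts it with `dt.floor('h')`, which keeps the full date and hour; the names `by_hour_of_day` and `by_hour_start` are illustrative.

```python
import pandas as pd

# Hypothetical sample: two rows at 10:xx on different days, one row at 11:xx.
data_frame = pd.DataFrame({
    "date": pd.to_datetime(
        ["2018-03-06 10:22:21", "2018-03-07 10:05:00", "2018-03-06 11:15:00"]
    ),
    "neg": [0.0, 0.2, 0.4],
})

# Hour of day only: both 10:xx rows share a group even though the days differ.
by_hour_of_day = data_frame.groupby(data_frame["date"].dt.hour)["neg"].mean()

# Floored timestamp: one group per distinct calendar hour.
by_hour_start = data_frame.groupby(data_frame["date"].dt.floor("h"))["neg"].mean()

print(by_hour_of_day)
print(by_hour_start)
```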