Get the hourly average for each month from a netCDF file

Time: 2021-01-19 16:59:07

I have a netCDF file with the time dimension containing data by the hour for 2 years. I want to average it to get an hourly average for each hour of the day for each month. I tried this:


import xarray as xr
ds = xr.open_mfdataset('ecmwf_usa_2015.nc')    
ds.groupby(['time.month', 'time.hour']).mean('time')

but I get this error:


*** TypeError: `group` must be an xarray.DataArray or the name of an xarray variable or dimension

How can I fix this? If I do this:


ds.groupby('time.month', 'time.hour').mean('time')

I do not get an error but the result has a time dimension of 12 (one value for each month), whereas I want an hourly average for each month i.e. 24 values for each of 12 months. Data is available here: https://www.dropbox.com/s/yqgg80wn8bjdksy/ecmwf_usa_2015.nc?dl=0


3 solutions

#1


3  

You are getting TypeError: group must be an xarray.DataArray or the name of an xarray variable or dimension because ds.groupby() expects a single xarray variable or array, but you passed a list of variables.

You have two options:

1. xarray bins --> group by hour

Refer to the group by documentation, convert the dataset into splits or bins, and then apply groupby('time.hour').

This is because applying groupby on month and hour, one after the other or together, aggregates over all the data at once. If you first split the data by month, you can then apply a group-by-hour mean within each month.

You can try the approach described in the documentation:

GroupBy: split-apply-combine


xarray supports “group by” operations with the same API as pandas to implement the split-apply-combine strategy:


  • Split your data into multiple independent groups. => split them by month using groupby_bins
  • Apply some function to each group. => apply the group by
  • Combine your groups back into a single data object. => apply the aggregate function mean('time')
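The split-apply-combine steps above can be sketched as follows. This is my own illustration on a synthetic dataset standing in for the file in the question (since xarray's groupby takes only one key here, the split by month is done as an explicit loop):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the 2-year hourly dataset in the question
time = pd.date_range('2015-01-01', '2016-12-31 23:00', freq='h')
ds = xr.Dataset({'t2m': ('time', np.random.rand(time.size))},
                coords={'time': time})

# Split by month, apply a group-by-hour mean within each month,
# then combine the 12 pieces along a new 'month' dimension
pieces = []
for month, month_ds in ds.groupby('time.month'):
    hourly = month_ds.groupby('time.hour').mean('time')
    pieces.append(hourly.expand_dims(month=[month]))

result = xr.concat(pieces, dim='month')
print(dict(result.sizes))  # 12 months x 24 hours
```

This yields one value per (month, hour) pair, i.e. the 12 x 24 = 288 averages the question asks for.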

2. Convert it into a pandas dataframe and use group by

Warning: not all netCDF files are convertible to a pandas dataframe; there may be metadata loss during conversion.

Convert ds into a pandas dataframe with df = ds.to_dataframe() and group as you require using pandas.Grouper, like:

df.set_index('time').groupby([pd.Grouper(freq='1M'), 't2m']).mean()
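The Grouper above buckets by calendar month; to get the 24-hour profile for each month, a plain pandas group-by on the index's month and hour also works. A sketch on synthetic data standing in for the file in the question:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 2-year hourly t2m series in the question
time = pd.date_range('2015-01-01', '2016-12-31 23:00', freq='h')
df = pd.DataFrame({'t2m': np.random.rand(time.size)}, index=time)

# One mean per (month, hour) pair: 12 x 24 = 288 rows
out = df.groupby([df.index.month, df.index.hour]).mean()
print(out.shape)  # (288, 1)
```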

Note: I saw a couple of answers using pandas.TimeGrouper, but it is deprecated and one has to use pandas.Grouper now.

Since your dataset is big and the question does not include a minimized example (working on the full file consumes heavy resources), I would suggest looking at these pandas examples:

  1. group by weekdays
  2. group by time
  3. groupby-date-range-depending-on-each-row
  4. group-and-count-rows-by-month-and-year

#2


0  

Not a Python solution, but I think this is how you could do it using CDO in a bash script loop:

# loop over months:
for i in {1..12}; do
   # This gives the hourly mean for each month separately 
   cdo yhourmean -selmon,${i} datafile.nc mon${i}.nc
done
# merge the files
cdo mergetime mon*.nc hourlyfile.nc
rm -f mon*.nc # clean up the files

Note that if your data doesn't start in January then you will get a "jump" in the final file's time axis... I think that can be sorted out by setting the year after the yhourmean command, if that is an issue for you.

#3


0  

With this:

import xarray as xr
ds = xr.open_mfdataset('ecmwf_usa_2015.nc')
print(ds.groupby('time.hour').mean('time'))

I get something like this:

Dimensions:  (hour: 24, latitude: 93, longitude: 281)
Coordinates:
  * longitude  (longitude)  float32 230.0 230.25 230.5 230.75 231.0 231.25 ...
  * latitude   (latitude)   float32 48.0 47.75 47.5 47.25 47.0 46.75 46.5 ...
  * hour       (hour)       int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ...

I think that is what you want.

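Note that this averages each hour over all months together (24 values total). To get 24 values for each of the 12 months, one workaround is to group on a single combined month-hour key; the coordinate name month_hour below is my own, and the data is a synthetic stand-in for the file in the question:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the 2-year hourly dataset in the question
time = pd.date_range('2015-01-01', '2016-12-31 23:00', freq='h')
ds = xr.Dataset({'t2m': ('time', np.random.rand(time.size))},
                coords={'time': time})

# Encode month and hour into one grouping key, e.g. 101 = January, 01:00
ds = ds.assign_coords(month_hour=ds['time.month'] * 100 + ds['time.hour'])
result = ds.groupby('month_hour').mean('time')
print(result.sizes)  # 288 month-hour groups
```

Decoding the key afterwards (month = key // 100, hour = key % 100) recovers the 12 x 24 layout.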
