Possible error when joining pandas DataFrames with different timezones

Date: 2020-12-30 22:59:06

I am trying to join two pandas DataFrames with the pd.DataFrame.join function, but I run into a problem when I change the timezone of an existing datetime index. I used:

import pandas as pd
import pytz
import numpy as np

# Creating date ranges
sim_date = pd.date_range('1980-01-01', '1980-12-31', freq='1H', tz='UTC')
obs_date = pd.date_range('1980-01-01', '1980-12-31', freq='1H')

# Creating DataFrames with the time ranges as indices
sim = pd.DataFrame(np.random.rand(len(sim_date)), index=sim_date, columns=['Sim'])
obs = pd.DataFrame(np.random.rand(len(obs_date)), index=obs_date, columns=['Obs'])

# Changing the timezone of the observed data index
obs.index = obs.index.tz_localize('UTC').tz_convert('America/Phoenix')

# Printing the result of the join
print(pd.DataFrame.join(sim, obs).dropna())

Which yields:

                                Sim       Obs
1980-01-01 00:00:00+00:00  0.844345  0.117649
1980-01-01 01:00:00+00:00  0.505349  0.755907
1980-01-01 02:00:00+00:00  0.799555  0.169102
1980-01-01 03:00:00+00:00  0.194750  0.704400
1980-01-01 04:00:00+00:00  0.459079  0.241803
1980-01-01 05:00:00+00:00  0.496936  0.726264
1980-01-01 06:00:00+00:00  0.515039  0.989569
1980-01-01 07:00:00+00:00  0.271105  0.488859
1980-01-01 08:00:00+00:00  0.545269  0.434904
1980-01-01 09:00:00+00:00  0.817365  0.067979
1980-01-01 10:00:00+00:00  0.051024  0.068993
1980-01-01 11:00:00+00:00  0.170346  0.510406
1980-01-01 12:00:00+00:00  0.518609  0.583602
1980-01-01 13:00:00+00:00  0.725753  0.402805
1980-01-01 14:00:00+00:00  0.134059  0.879183
1980-01-01 15:00:00+00:00  0.304070  0.773884
1980-01-01 16:00:00+00:00  0.742448  0.158367
1980-01-01 17:00:00+00:00  0.539499  0.067725
1980-01-01 18:00:00+00:00  0.349432  0.027337
1980-01-01 19:00:00+00:00  0.549015  0.078190
1980-01-01 20:00:00+00:00  0.089871  0.878931
1980-01-01 21:00:00+00:00  0.100849  0.359007
1980-01-01 22:00:00+00:00  0.290280  0.168759
1980-01-01 23:00:00+00:00  0.074420  0.881724
1980-01-02 00:00:00+00:00  0.091413  0.820616

but when I use this:

import pandas as pd
import pytz
import numpy as np

interim_date = pd.date_range('1980-01-01', '1980-01-02', freq='1H', tz='UTC')
obs_date = pd.date_range('1980-01-01', '1980-01-02', freq='1H', tz='America/Phoenix')

sim = pd.DataFrame(np.random.rand(len(interim_date)), index=interim_date, columns=['Sim'])
obs = pd.DataFrame(np.random.rand(len(obs_date)), index=obs_date, columns=['Obs'])

print(pd.DataFrame.join(sim, obs).dropna())

it yields:

                                Sim       Obs
1980-01-01 07:00:00+00:00  0.894766  0.509333
1980-01-01 08:00:00+00:00  0.805764  0.564251
1980-01-01 09:00:00+00:00  0.996807  0.856853
1980-01-01 10:00:00+00:00  0.494817  0.088286
1980-01-01 11:00:00+00:00  0.716468  0.947045
1980-01-01 12:00:00+00:00  0.808407  0.332764
1980-01-01 13:00:00+00:00  0.554688  0.959215
1980-01-01 14:00:00+00:00  0.389542  0.462384
1980-01-01 15:00:00+00:00  0.039566  0.850724
1980-01-01 16:00:00+00:00  0.634998  0.097579
1980-01-01 17:00:00+00:00  0.169957  0.390812
1980-01-01 18:00:00+00:00  0.113913  0.519487
1980-01-01 19:00:00+00:00  0.521354  0.260055
1980-01-01 20:00:00+00:00  0.910717  0.693063
1980-01-01 21:00:00+00:00  0.907878  0.190714
1980-01-01 22:00:00+00:00  0.625534  0.048584
1980-01-01 23:00:00+00:00  0.926966  0.815481
1980-01-02 00:00:00+00:00  0.841386  0.573255

It seems as though both methods should yield the same result, but in the first case the observed DataFrame's index does not seem to actually change, even though printing it shows the converted timezone... Any suggestions?

1 solution

#1


Your two methods of construction are actually different. For example, look at the first element of each index:

In [2]: obs_date1 = pd.date_range('1980-01-01', '1980-12-31', freq='1H') \
                      .tz_localize('UTC') \
                      .tz_convert('America/Phoenix')

In [3]: obs_date2 = pd.date_range('1980-01-01', '1980-12-31', freq='1H', 
                                  tz='America/Phoenix')

In [4]: obs_date1[0]
Out[4]: Timestamp('1979-12-31 17:00:00-0700', tz='America/Phoenix', freq='H')

In [5]: obs_date2[0]
Out[5]: Timestamp('1980-01-01 00:00:00-0700', tz='America/Phoenix', freq='H')

By doing .tz_localize('UTC'), you're basically attaching the UTC timezone to each timestamp. Then tz_convert('America/Phoenix') shifts those UTC timestamps back seven hours into the America/Phoenix timezone, without changing the instants they represent. When you use this method and join against the UTC timestamps, this index gets shifted back to UTC for alignment, both indexes match exactly, and hence nothing gets dropped in your join.

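To make that alignment concrete, here is a minimal sketch (not from the original answer; `utc_range` simply stands in for the UTC index used in the question) showing that the converted index still refers to the same instants as a plain UTC range:

# Sketch: tz_convert only relabels timestamps; the underlying instants are unchanged,
# so aligning on UTC matches every row of the first join.
utc_range = pd.date_range('1980-01-01', '1980-12-31', freq='1H', tz='UTC')

print((obs_date1.tz_convert('UTC') == utc_range).all())   # True: identical instants
print(obs_date1[0].tz_convert('UTC'))                     # 1980-01-01 00:00:00+00:00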

I think what you want for the first option is to just directly use tz_localize('America/Phoenix'):

In [6]: obs_date3 = pd.date_range('1980-01-01', '1980-12-31', freq='1H') \
   ...:               .tz_localize('America/Phoenix')

In [7]: obs_date3[0]
Out[7]: Timestamp('1980-01-01 00:00:00-0700', tz='America/Phoenix', freq='H')
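
As a quick usage check (a sketch, not part of the original answer; the frame names and random data are placeholders, and it assumes the imports from the question), joining a DataFrame built on obs_date3 against a UTC-indexed one behaves like the second snippet, dropping the seven non-overlapping hours at the start:

# Sketch: with the index localized directly to America/Phoenix, only the instants
# shared by both indexes survive the join + dropna.
sim_utc = pd.DataFrame(np.random.rand(len(obs_date3)),
                       index=pd.date_range('1980-01-01', '1980-12-31', freq='1H', tz='UTC'),
                       columns=['Sim'])
obs_phx = pd.DataFrame(np.random.rand(len(obs_date3)), index=obs_date3, columns=['Obs'])

print(sim_utc.join(obs_phx).dropna().head(3))   # first row at 1980-01-01 07:00:00+00:00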
