根据Pandas with Groupby中列中的值,从DataFrame中选择CONSECUTIVE行

时间:2023-02-11 21:40:35

I have 5 years of S & P 500 data that I’m trying to group into specific time chunks to run some analysis on. My data is in 5 minute increments. After reading it into a DataFrame called dated, I thought I could groupby it into chunks, encompassing consecutive rows, by the time increments between the Globex open and close for each trading day. The Globex open is at 3:00PM (15:00) PST and the close is 1:15PM PST (13:15), 22.25 hours later. So I would like to chunk the data from the 3:00PM open THROUGH to the 1:15PM close--that's approximately 275 rows per chunk.

我有5年的标准普尔500指数数据,我试图将其分组到特定的时间块来进行一些分析。我的数据以5分钟为增量。在将其读入名为日期的DataFrame之后,我想我可以将它组合成块,包含连续的行,按照Globex开放和关闭的时间增量每个交易日。 Globex开放时间为太平洋标准时间下午3:00(15:00),收盘时间为太平洋标准时间下午1:15(13:15),22.25小时后。因此,我希望将数据从下午3点开放到下午1:15关闭 - 每块大约275行。

The difficulty is that a ‘trading day’ spans 2 different dates, for eaxmple, a session opens on Sunday 3-6-2016 at 15:00 and closes on Monday, 3-7-2016 at 13:15. I’m not able to groupby the ‘Date’ column because of course all of 3-6 would be in one chunk and 3-7 in another, when I need the data chunk to span both dates, in order to get the whole Globex day in one chunk.

困难在于“交易日”跨越2个不同的日期,例如,会话将于2016年6月3日星期日15:00开始,并于2016年3月7日星期一13:15结束。我无法通过'Date'列进行分组,因为当我需要数据块跨越两个日期时,所有3-6将在一个块中而3-7在另一个块中,以便获得整个Globex一天中的一天。

Being relatively new to both Pandas and Python I don’t know what method to use to group the data into the chunks that I want. Once the data is segregated, I’d like to extract the high and low of each session/chunk into separate columns and also a column for just the 15:05 open price.

对于Pandas和Python都比较新,我不知道用什么方法将数据分组到我想要的块中。一旦数据被隔离,我想将每个会话/块的高点和低点提取到单独的列中,并且还将列表仅用于15:05的开放价格。

Here is a sample of what my data looks like:

以下是我的数据的示例:

    Date    Time    Open    High    Low Close   Up  Down
0   2015-08-03  15:05   2073.50 2074.00 2073.25 2073.25 210 563
1   2015-08-03  15:10   2073.25 2073.25 2072.25 2072.75 118 632
2   2015-08-03  15:15   2072.75 2072.75 2072.25 2072.50 132 85
3   2015-08-03  15:20   2072.50 2072.75 2072.25 2072.50 95  312
4   2015-08-03  15:25   2072.50 2074.00 2072.50 2073.50 372 264

Originally, the ‘Date’ column values looked like this: 8/3/2015. I thought that it may not be read as an actual date object, so I changed the values using to_datetime() to get the ‘Date’ column values to read like an actual date object as seen in my sample DataFrame above.

最初,“日期”列值看起来像这样:2015年8月3日。我认为它可能不会被读作实际的日期对象,所以我使用to_datetime()更改了值,以使“Date”列值读取为实际日期对象,如上面的示例DataFrame中所示。

dated['Date'] =pd.to_datetime(dated['Date'])

When I tried to change the values in the ‘Time’ column using to_datetime(), it successfully changes the time from 15:05 to 15:05:00, but it also added the day, so it looked like this: ‘2016-03-05 15:05:00’, the problem being that it used today’s date for the date. This obviously won’t work for me as my data is historical and the dates and times are references to historical prices.
The reason I was trying to change the ‘Time’ column to a datetime object is I thought I would be able to slice it into the chunks that I need during a groupby operation:

当我尝试使用to_datetime()更改“时间”列中的值时,它成功地将时间从15:05更改为15:05:00,但它也添加了当天,所以它看起来像这样:'2016- 03-05 15:05:00',问题在于它使用了今天的约会日期。这显然对我不起作用,因为我的数据是历史数据,日期和时间是历史价格的参考。我试图将'Time'列更改为datetime对象的原因是我认为我可以将它切成我在groupby操作中需要的块:

dated = dated['Date'].groupby(dated['15:05' : '13:20'])  

This produced and error:

这产生了错误:

IndexError: invalid slice

So I’d appreciate any help solving this problem--pointing me in the correct areas of research. I've been essentially reading the pandas documentation piece by piece, trying out different methods, but as I'm not sure what steps to start with, I've been picking topics at random to read and not finding the answer.

所以我很感激任何帮助解决这个问题 - 指出我正确的研究领域。我基本上一直在阅读熊猫文档,尝试不同的方法,但由于我不知道从哪个步骤开始,我一直在随机选择主题阅读而没有找到答案。

Thanks, Ana

1 个解决方案

#1


2  

This is actually very complicated.

这实际上非常复杂。

First, you can convert your time as follows:

首先,您可以按如下方式转换时间:

df['Datetime'] = pd.to_datetime(df.Date + ' ' + df.Time)

Here, I will create a larger sample datframe:

在这里,我将创建一个更大的样本数据帧:

np.random.seed(0)
idx = pd.date_range('2015-1-1', '2016-1-1', freq='5min')
df = pd.DataFrame(np.random.randn(len(idx), 6),    
                  columns=['Open', 'High', 'Low', 'Close', 'Up', 'Down'])
df['Datetime'] = idx

Let's add a boolean flag to indicate when the market is open.

让我们添加一个布尔标志来指示市场何时开放。

# Create a market open flag.
df['market_open'] = False
mask = (df.Datetime.dt.time > dt.time(15)) | (df.Datetime.dt.time < dt.time(13, 15))
df.loc[mask, 'market_open'] = True

Here we create a function than returns the open, high, low, close, etc. on the grouped bars:

在这里,我们创建一个函数,而不是返回分组条上的open,high,low,close等:

def ohlc(df):
    return (
        df.Datetime.iat[-1], # last timestamp in group.
        df.Open.iat[0], # First Open.
        df.High.max(), 
        df.Low.min(), 
        df.Close.iat[-1], # Last Close.
        df.Up.sum(), 
        df.Down.sum(),
        df.Close.count(), # Count number of closing bars.
        df.market_open.iat[0])   # Take first True/False indicator.

Now we do our groupby based on the change in market_open (i.e. on changes in the True/False flag) and then we apply our function to these grouped results.

现在我们根据market_open的变化(即真/假标志的变化)进行groupby,然后我们将函数应用于这些分组结果。

bars = pd.DataFrame(
    zip(*df.groupby(
            (df.market_open != df.market_open.shift())
            .cumsum()
             ).apply(ohlc))).T

bars.columns = ['bar_close_time', 'Open', 'High', 'Low', 'Close', 'Up', 'Down', 'bar_count', 'market_open']

We have bars for both open and closed sessions. We can remove the ones for when the market is closed.

我们有开放和非公开会议的酒吧。我们可以在市场关闭时删除那些。

# Remove bars when market is closed
bars = bars[bars.market_open].iloc[:, :-1]

>>> bars.tail()
          bar_close_time      Open     High      Low      Close        Up     Down bar_count
722  2015-12-28 13:10:00   1.23175  2.88569  -2.7143  -0.785648  -13.3166  14.6094       266
724  2015-12-29 13:10:00 -0.900675   2.6483 -2.61698    -0.8265  0.825872  4.98565       266
726  2015-12-30 13:10:00   1.65299  2.57881 -2.85199  -0.376141  -4.32867  3.62123       266
728  2015-12-31 13:10:00  0.435619  2.93638 -2.74758  -0.461525  -20.0928 -15.8205       266
730  2016-01-01 00:00:00  0.293165  2.39097  -2.1234  0.0684124  -7.83721  1.69182       108

#1


2  

This is actually very complicated.

这实际上非常复杂。

First, you can convert your time as follows:

首先,您可以按如下方式转换时间:

df['Datetime'] = pd.to_datetime(df.Date + ' ' + df.Time)

Here, I will create a larger sample datframe:

在这里,我将创建一个更大的样本数据帧:

np.random.seed(0)
idx = pd.date_range('2015-1-1', '2016-1-1', freq='5min')
df = pd.DataFrame(np.random.randn(len(idx), 6),    
                  columns=['Open', 'High', 'Low', 'Close', 'Up', 'Down'])
df['Datetime'] = idx

Let's add a boolean flag to indicate when the market is open.

让我们添加一个布尔标志来指示市场何时开放。

# Create a market open flag.
df['market_open'] = False
mask = (df.Datetime.dt.time > dt.time(15)) | (df.Datetime.dt.time < dt.time(13, 15))
df.loc[mask, 'market_open'] = True

Here we create a function than returns the open, high, low, close, etc. on the grouped bars:

在这里,我们创建一个函数,而不是返回分组条上的open,high,low,close等:

def ohlc(df):
    return (
        df.Datetime.iat[-1], # last timestamp in group.
        df.Open.iat[0], # First Open.
        df.High.max(), 
        df.Low.min(), 
        df.Close.iat[-1], # Last Close.
        df.Up.sum(), 
        df.Down.sum(),
        df.Close.count(), # Count number of closing bars.
        df.market_open.iat[0])   # Take first True/False indicator.

Now we do our groupby based on the change in market_open (i.e. on changes in the True/False flag) and then we apply our function to these grouped results.

现在我们根据market_open的变化(即真/假标志的变化)进行groupby,然后我们将函数应用于这些分组结果。

bars = pd.DataFrame(
    zip(*df.groupby(
            (df.market_open != df.market_open.shift())
            .cumsum()
             ).apply(ohlc))).T

bars.columns = ['bar_close_time', 'Open', 'High', 'Low', 'Close', 'Up', 'Down', 'bar_count', 'market_open']

We have bars for both open and closed sessions. We can remove the ones for when the market is closed.

我们有开放和非公开会议的酒吧。我们可以在市场关闭时删除那些。

# Remove bars when market is closed
bars = bars[bars.market_open].iloc[:, :-1]

>>> bars.tail()
          bar_close_time      Open     High      Low      Close        Up     Down bar_count
722  2015-12-28 13:10:00   1.23175  2.88569  -2.7143  -0.785648  -13.3166  14.6094       266
724  2015-12-29 13:10:00 -0.900675   2.6483 -2.61698    -0.8265  0.825872  4.98565       266
726  2015-12-30 13:10:00   1.65299  2.57881 -2.85199  -0.376141  -4.32867  3.62123       266
728  2015-12-31 13:10:00  0.435619  2.93638 -2.74758  -0.461525  -20.0928 -15.8205       266
730  2016-01-01 00:00:00  0.293165  2.39097  -2.1234  0.0684124  -7.83721  1.69182       108