Start and stop of time periods in pandas

Time: 2022-09-15 15:49:17

I have the following dataframe in pandas:

+----------+-------------------+---------------------+----------+------------+
| UserName | MainOperationName |  Submission_Ended   |  delta   | new period |
+----------+-------------------+---------------------+----------+------------+
| User1    | Record submission | 2017-07-31 00:08:25 | 00:00:00 | False      |
| User1    | Record submission | 2017-07-31 00:12:02 | 00:03:37 | False      |
| User1    | Record submission | 2017-07-31 00:14:51 | 00:02:49 | False      |
| User1    | Record submission | 2017-07-31 00:17:27 | 00:02:36 | False      |
| User1    | Record submission | 2017-07-31 00:23:42 | 00:06:15 | False      |
| User1    | Record submission | 2017-07-31 00:25:35 | 00:01:53 | False      |
| User1    | Record submission | 2017-07-31 00:26:01 | 00:00:26 | False      |
| User1    | Record submission | 2017-07-31 01:59:11 | 01:33:10 | True       |
| User1    | Record submission | 2017-07-31 02:00:37 | 00:01:26 | False      |
| User1    | Record submission | 2017-07-31 02:03:12 | 00:02:35 | False      |
| User1    | Record submission | 2017-07-31 02:21:22 | 00:18:10 | False      |
| User1    | Record submission | 2017-07-31 02:30:28 | 00:09:06 | False      |
| User1    | Record submission | 2017-07-31 02:36:03 | 00:05:35 | False      |
| User1    | Record submission | 2017-07-31 03:25:43 | 00:49:40 | True       |
+----------+-------------------+---------------------+----------+------------+

The delta column is just the difference between consecutive rows of Submission_Ended. new period is True when that difference is greater than 20 minutes. I think I would also force the first row's value to be True, because that is when a new period starts. I assume that when the delta is 20 minutes or less the user is using the application; otherwise he/she is taking a break. I would like to visualize that with a timeline/Gantt chart (like in the last section here). But for that I need the start and stop of each period, which in this case would be:

  • start: 2017-07-31 00:08:25 ; stop: 2017-07-31 00:26:01
  • start: 2017-07-31 01:59:11 ; stop: 2017-07-31 02:36:03
  • start: 2017-07-31 03:25:43 ; stop: ...

Any idea how I could get that from data structured like this? I should mention that my real dataframe contains hundreds of users.
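
For reference, the delta and new period columns described above could be derived per user along these lines. This is only a sketch: the sample rows, the second user (User2) and the choice to leave delta as NaT on each user's first row (the table above shows 00:00:00 there) are assumptions for illustration, not part of the original data.

```python
import pandas as pd

# Hypothetical sample: a few rows for User1 plus an invented User2.
df = pd.DataFrame({
    "UserName": ["User1"] * 4 + ["User2"] * 2,
    "Submission_Ended": pd.to_datetime([
        "2017-07-31 00:08:25", "2017-07-31 00:12:02",
        "2017-07-31 01:59:11", "2017-07-31 02:00:37",
        "2017-07-31 00:30:00", "2017-07-31 00:35:00",
    ]),
})

# Sort so that diff() compares each submission with the previous one of the same user.
df = df.sort_values(["UserName", "Submission_Ended"]).reset_index(drop=True)

# Gap to the previous submission of the same user (NaT on each user's first row).
df["delta"] = df.groupby("UserName")["Submission_Ended"].diff()

# A new period starts on a user's first row or after a gap of more than 20 minutes.
df["new period"] = df["delta"].isna() | (df["delta"] > pd.Timedelta(minutes=20))
```

Grouping by UserName before calling diff() keeps one user's last submission from being compared with the next user's first one, which matters once there are hundreds of users.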

1 solution

#1

Taylor's comment was very useful. I did the following:

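import pandas as pd

session_id = 0

def get_session_id(is_new_period):
    # Running counter kept in a module-level variable; bumped on every True flag.
    global session_id
    if is_new_period:
        session_id += 1
    return session_id

# Label every row with the session it belongs to.
df_ben["session id"] = df_ben["new period"].apply(get_session_id)
# first()/last() keep "session id" as the index; since pandas 2.0, nth() keeps the original row index instead.
start_time = df_ben.groupby("session id")["Submission_Ended"].first()
stop_time = df_ben.groupby("session id")["Submission_Ended"].last()
df_final = pd.DataFrame({"start": start_time, "stop": stop_time})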

And my final result is this:

+------------+---------------------+---------------------+
|            |        start        |        stop         |
+------------+---------------------+---------------------+
| session id |                     |                     |
| 1          | 2017-07-31 00:08:25 | 2017-07-31 00:26:01 |
| 2          | 2017-07-31 01:59:11 | 2017-07-31 02:36:03 |
| 3          | 2017-07-31 03:25:43 | 2017-07-31 03:48:40 |
| 4          | 2017-07-31 04:12:03 | 2017-07-31 04:12:03 |
| 5          | 2017-07-31 04:36:09 | 2017-07-31 05:23:26 |
| 6          | 2017-07-31 05:59:04 | 2017-07-31 06:24:34 |
+------------+---------------------+---------------------+

So I can work with that!

The only thing I don't like is the way I assigned session_id using a global statement. Any idea how to do it in a neater way?
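
One possible way to avoid the global counter is a cumulative sum over the boolean new period column: every True adds one to the running total, so the total itself can serve as the session id. The sketch below is not from the original answer. It assumes the rows are already sorted by time within each user, that each user's first row is flagged True, and it uses a small hypothetical sample built from the timestamps above.

```python
import pandas as pd

# Hypothetical sample with the first row already forced to True.
df_ben = pd.DataFrame({
    "UserName": ["User1"] * 6,
    "Submission_Ended": pd.to_datetime([
        "2017-07-31 00:08:25", "2017-07-31 00:12:02", "2017-07-31 00:26:01",
        "2017-07-31 01:59:11", "2017-07-31 02:36:03", "2017-07-31 03:25:43",
    ]),
    "new period": [True, False, False, True, False, True],
})

# Running count of True flags, restarted for every user: 1, 1, 1, 2, 2, 3.
df_ben["session id"] = (
    df_ben["new period"].astype(int).groupby(df_ben["UserName"]).cumsum()
)

# Earliest and latest submission of each (user, session) pair.
df_final = (
    df_ben.groupby(["UserName", "session id"])["Submission_Ended"]
    .agg(start="min", stop="max")
)
```

Because the counter restarts per user, session ids are only unique in combination with UserName, which is why the final groupby uses both columns.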
