I have the following dataframe in pandas:
+----------+-------------------+---------------------+----------+------------+
| UserName | MainOperationName | Submission_Ended | delta | new period |
+----------+-------------------+---------------------+----------+------------+
| User1 | Record submission | 2017-07-31 00:08:25 | 00:00:00 | False |
| User1 | Record submission | 2017-07-31 00:12:02 | 00:03:37 | False |
| User1 | Record submission | 2017-07-31 00:14:51 | 00:02:49 | False |
| User1 | Record submission | 2017-07-31 00:17:27 | 00:02:36 | False |
| User1 | Record submission | 2017-07-31 00:23:42 | 00:06:15 | False |
| User1 | Record submission | 2017-07-31 00:25:35 | 00:01:53 | False |
| User1 | Record submission | 2017-07-31 00:26:01 | 00:00:26 | False |
| User1 | Record submission | 2017-07-31 01:59:11 | 01:33:10 | True |
| User1 | Record submission | 2017-07-31 02:00:37 | 00:01:26 | False |
| User1 | Record submission | 2017-07-31 02:03:12 | 00:02:35 | False |
| User1 | Record submission | 2017-07-31 02:21:22 | 00:18:10 | False |
| User1 | Record submission | 2017-07-31 02:30:28 | 00:09:06 | False |
| User1 | Record submission | 2017-07-31 02:36:03 | 00:05:35 | False |
| User1 | Record submission | 2017-07-31 03:25:43 | 00:49:40 | True |
+----------+-------------------+---------------------+----------+------------+
The delta column is just the difference between consecutive rows of Submission_Ended. new period is True when that difference is greater than 20 minutes. I think I would also force the first row's value to be True, because that is where a new period starts. I assume that when the delta is smaller than 20 minutes the user is using the application; otherwise he/she is taking a break. I would like to visualize that with a timeline/Gantt chart (like in the last section here). For that I need the start and stop of each period, which in this case would be:
- start: 2017-07-31 00:08:25 ; stop: 2017-07-31 00:26:01
- start: 2017-07-31 01:59:11 ; stop: 2017-07-31 02:36:03
- start: 2017-07-31 03:25:43 ; stop: ...
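For reference, the delta and new period columns described above could be derived roughly as follows. This is only a sketch: the small sample frame is hypothetical (a few timestamps taken from the table), the name `gap` is introduced for illustration, and it assumes the first row of every user should be flagged as the start of a new period.

```python
import pandas as pd

# Hypothetical sample: a few of the timestamps from the table above.
df = pd.DataFrame({
    "UserName": ["User1"] * 4,
    "Submission_Ended": pd.to_datetime([
        "2017-07-31 00:08:25",
        "2017-07-31 00:12:02",
        "2017-07-31 00:26:01",
        "2017-07-31 01:59:11",
    ]),
})

# Make sure each user's submissions are in chronological order.
df = df.sort_values(["UserName", "Submission_Ended"]).reset_index(drop=True)

# Gap to the previous submission of the same user; NaT on each user's first row.
gap = df.groupby("UserName")["Submission_Ended"].diff()

# Show a zero delta on the first row, as in the table.
df["delta"] = gap.fillna(pd.Timedelta(0))

# A new period starts on a user's first row or after a gap of more than 20 minutes.
df["new period"] = gap.isna() | (gap > pd.Timedelta(minutes=20))
```

Computing the gap per user keeps one user's last submission from being compared with the next user's first one, which matters once there are hundreds of users.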
Any idea how I could get that from data structured like this? I should mention that my real dataframe contains hundreds of users.
1 solution
#1
0
Taylor's comment was very useful. I did the following:
import pandas as pd

# Running counter of the sessions seen so far
session_id = 0

def get_session_id(is_new_period):
    """Return the current session number, incrementing it when a new period starts."""
    global session_id
    if is_new_period:
        session_id += 1
    return session_id

df_ben["session id"] = df_ben["new period"].apply(get_session_id)

# First and last submission of each session
start_time = df_ben.groupby("session id").nth(0)["Submission_Ended"]
stop_time = df_ben.groupby("session id").nth(-1)["Submission_Ended"]
df_final = pd.DataFrame({"start": start_time, "stop": stop_time})
And my final result is this:
+------------+---------------------+---------------------+
| | start | stop |
+------------+---------------------+---------------------+
| session id | | |
| 1 | 2017-07-31 00:08:25 | 2017-07-31 00:26:01 |
| 2 | 2017-07-31 01:59:11 | 2017-07-31 02:36:03 |
| 3 | 2017-07-31 03:25:43 | 2017-07-31 03:48:40 |
| 4 | 2017-07-31 04:12:03 | 2017-07-31 04:12:03 |
| 5 | 2017-07-31 04:36:09 | 2017-07-31 05:23:26 |
| 6 | 2017-07-31 05:59:04 | 2017-07-31 06:24:34 |
+------------+---------------------+---------------------+
So I can work with that!
The only thing I don't like is the way I assigned session_id using a global statement. Any idea how to do it in a neater way?
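One possible way to avoid the global counter is a cumulative sum over the new period flags: every True adds one to the running total, so the total itself can serve as the session number. The sketch below is not the approach used above, just an alternative under stated assumptions: the sample frame is hypothetical, the first row of each user is assumed to be flagged True already, and sessions are numbered per user so it also covers the many-users case.

```python
import pandas as pd

# Hypothetical sample in the same shape as df_ben, with the first row of
# each user already flagged as the start of a new period.
df_ben = pd.DataFrame({
    "UserName": ["User1", "User1", "User1", "User2", "User2"],
    "Submission_Ended": pd.to_datetime([
        "2017-07-31 00:08:25",
        "2017-07-31 00:26:01",
        "2017-07-31 01:59:11",
        "2017-07-31 00:10:00",
        "2017-07-31 00:15:00",
    ]),
    "new period": [True, False, True, True, False],
})

# Cast the flags to integers and take a running total within each user:
# the total goes up by one at every True, which numbers the sessions 1, 2, ...
df_ben["session id"] = (
    df_ben["new period"].astype(int).groupby(df_ben["UserName"]).cumsum()
)

# Earliest and latest submission of every (user, session) pair.
df_final = (
    df_ben.groupby(["UserName", "session id"])["Submission_Ended"]
    .agg(start="min", stop="max")
)
```

Using min and max instead of the first and last row means the result does not depend on the rows being sorted; on chronologically sorted data the two give the same start and stop.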