I have a very large DataFrame with a datetime column called time and a float column called dist. The DataFrame is already sorted by time and dist. I want to split it into several DataFrames, starting a new one each time dist stops increasing monotonically.
Split
               dt  dist
0  20160811 11:10   1.0
1  20160811 11:15   1.4
2  20160811 12:15   1.8
3  20160811 12:32   0.6
4  20160811 12:34   0.8
5  20160811 14:38   0.2
into
               dt  dist
0  20160811 11:10   1.0
1  20160811 11:15   1.4
2  20160811 12:15   1.8

               dt  dist
0  20160811 12:32   0.6
1  20160811 12:34   0.8

               dt  dist
0  20160811 14:38   0.2
2 solutions
#1
6
You can compute the row-to-row difference of the dist column and then take a cumsum() of the condition diff < 0. This produces a new id each time dist decreases from the previous value.
df['id'] = (df.dist.diff() < 0).cumsum()
print(df)
# dt dist id
#0 20160811 11:10 1.0 0
#1 20160811 11:15 1.4 0
#2 20160811 12:15 1.8 0
#3 20160811 12:32 0.6 1
#4 20160811 12:34 0.8 1
#5 20160811 14:38 0.2 2
for _, g in df.groupby((df.dist.diff() < 0).cumsum()):
    print(g)
# dt dist
#0 20160811 11:10 1.0
#1 20160811 11:15 1.4
#2 20160811 12:15 1.8
# dt dist
#3 20160811 12:32 0.6
#4 20160811 12:34 0.8
# dt dist
#5 20160811 14:38 0.2
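If the goal is to keep the pieces rather than print them, the same grouping key can be used to build a list of DataFrames. The following is a minimal, self-contained sketch of that idea; the names run_id and parts are illustrative and not part of the original answer, and the sample frame is rebuilt from the question's data with the column names dt and dist.

```python
import pandas as pd

# Sample data copied from the question (column names dt and dist).
df = pd.DataFrame({
    "dt": pd.to_datetime([
        "2016-08-11 11:10", "2016-08-11 11:15", "2016-08-11 12:15",
        "2016-08-11 12:32", "2016-08-11 12:34", "2016-08-11 14:38",
    ]),
    "dist": [1.0, 1.4, 1.8, 0.6, 0.8, 0.2],
})

# diff() is NaN for the first row, and NaN < 0 is False,
# so the first run starts with id 0.
run_id = (df["dist"].diff() < 0).cumsum()

# One DataFrame per run, each with a fresh 0-based index
# (matching the desired output shown in the question).
parts = [g.reset_index(drop=True) for _, g in df.groupby(run_id)]

for part in parts:
    print(part)
```

Dropping reset_index(drop=True) keeps the original row labels in each piece, as in the printed output above.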
#2
1
You can do it using the np.split() method:
In [92]: df
Out[92]:
dt dist
0 2016-08-11 11:10:00 1.0
1 2016-08-11 11:15:00 1.4
2 2016-08-11 12:15:00 1.8
3 2016-08-11 12:32:00 0.6
4 2016-08-11 12:34:00 0.8
5 2016-08-11 14:38:00 0.2
In [93]: dfs = np.split(df, df[df.dist.diff().fillna(0) < 0].index)
In [94]: [print(x) for x in dfs]
dt dist
0 2016-08-11 11:10:00 1.0
1 2016-08-11 11:15:00 1.4
2 2016-08-11 12:15:00 1.8
dt dist
3 2016-08-11 12:32:00 0.6
4 2016-08-11 12:34:00 0.8
dt dist
5 2016-08-11 14:38:00 0.2
Out[94]: [None, None, None]
Explanation:
In [97]: df.dist.diff().fillna(0) < 0
Out[97]:
0 False
1 False
2 False
3 True
4 False
5 True
Name: dist, dtype: bool
In [98]: df[df.dist.diff().fillna(0) < 0]
Out[98]:
dt dist
3 2016-08-11 12:32:00 0.6
5 2016-08-11 14:38:00 0.2
In [99]: df[df.dist.diff().fillna(0) < 0].index
Out[99]: Int64Index([3, 5], dtype='int64')
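Note that np.split() interprets the split points as positions, so passing index labels as above only works when the DataFrame has the default 0-based integer index. Recent pandas versions may also emit a deprecation warning when a DataFrame is passed to np.split(). As a hedged alternative, here is a minimal sketch that finds the break positions explicitly and slices with iloc; the names breaks, bounds and parts are illustrative, and the non-default index is an assumption chosen to show that positions are used rather than labels.

```python
import numpy as np
import pandas as pd

# Same dist values as the question, but with a non-default index (assumed
# here for illustration) to show that the split uses positions, not labels.
df = pd.DataFrame(
    {"dist": [1.0, 1.4, 1.8, 0.6, 0.8, 0.2]},
    index=[10, 11, 12, 13, 14, 15],
)

# Positions of the rows where dist is lower than in the previous row.
breaks = np.flatnonzero((df["dist"].diff() < 0).to_numpy()).tolist()

# Turn the break positions into consecutive [start, stop) slices.
bounds = [0, *breaks, len(df)]
parts = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

for part in parts:
    print(part)
```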