Pandas: split a dataframe by monotonic increase of a column's values

Date: 2021-06-28 21:44:14

I have a gigantic dataframe with a datetime column called time and a float column called dist. The dataframe is already sorted by time and dist. I want to split it into several dataframes based on the monotonic increase of dist.

Split

   dt                    dist
0  20160811 11:10        1.0
1  20160811 11:15        1.4
2  20160811 12:15        1.8
3  20160811 12:32        0.6
4  20160811 12:34        0.8
5  20160811 14:38        0.2

into

   dt                    dist
0  20160811 11:10        1.0
1  20160811 11:15        1.4
2  20160811 12:15        1.8

   dt                    dist
0  20160811 12:32        0.6
1  20160811 12:34        0.8

   dt                    dist
0  20160811 14:38        0.2

2 solutions

#1


6  

You can calculate the difference vector of the dist column and then do a cumsum() on the condition diff < 0 (this creates a new id whenever dist decreases from the previous value):

df['id'] = (df.dist.diff() < 0).cumsum()

print(df)

#               dt  dist  id
#0  20160811 11:10   1.0   0
#1  20160811 11:15   1.4   0
#2  20160811 12:15   1.8   0
#3  20160811 12:32   0.6   1
#4  20160811 12:34   0.8   1
#5  20160811 14:38   0.2   2

for _, g in df.groupby((df.dist.diff() < 0).cumsum()):
    print(g)

#               dt  dist
#0  20160811 11:10   1.0
#1  20160811 11:15   1.4
#2  20160811 12:15   1.8
#               dt  dist
#3  20160811 12:32   0.6
#4  20160811 12:34   0.8
#               dt  dist
#5  20160811 14:38   0.2
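
As a self-contained sketch of the same idea, the snippet below rebuilds the sample frame from the question and collects the groups into a list, resetting each index so the pieces match the layout asked for. The names `run_id` and `parts` are illustrative and do not come from the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "dt": pd.to_datetime([
        "2016-08-11 11:10", "2016-08-11 11:15", "2016-08-11 12:15",
        "2016-08-11 12:32", "2016-08-11 12:34", "2016-08-11 14:38",
    ]),
    "dist": [1.0, 1.4, 1.8, 0.6, 0.8, 0.2],
})

# diff() is NaN on the first row; NaN < 0 is False, so the first row starts run 0.
# Every later decrease adds 1 to the running total and so opens a new run.
run_id = (df["dist"].diff() < 0).cumsum()

# One DataFrame per run, each re-indexed from 0.
parts = [g.reset_index(drop=True) for _, g in df.groupby(run_id)]

for part in parts:
    print(part)
```

With `< 0`, equal consecutive values stay in the same piece. If ties should start a new piece (strictly increasing runs), `<= 0` would presumably be the condition to use instead.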

#2


1  

You can do it using the np.split() method:

In [92]: df
Out[92]:
                   dt  dist
0 2016-08-11 11:10:00   1.0
1 2016-08-11 11:15:00   1.4
2 2016-08-11 12:15:00   1.8
3 2016-08-11 12:32:00   0.6
4 2016-08-11 12:34:00   0.8
5 2016-08-11 14:38:00   0.2

In [93]: dfs = np.split(df, df[df.dist.diff().fillna(0) < 0].index)

In [94]: [print(x) for x in dfs]
                   dt  dist
0 2016-08-11 11:10:00   1.0
1 2016-08-11 11:15:00   1.4
2 2016-08-11 12:15:00   1.8
                   dt  dist
3 2016-08-11 12:32:00   0.6
4 2016-08-11 12:34:00   0.8
                   dt  dist
5 2016-08-11 14:38:00   0.2
Out[94]: [None, None, None]

Explanation:

In [97]: df.dist.diff().fillna(0) < 0
Out[97]:
0    False
1    False
2    False
3     True
4    False
5     True
Name: dist, dtype: bool

In [98]: df[df.dist.diff().fillna(0) < 0]
Out[98]:
                   dt  dist
3 2016-08-11 12:32:00   0.6
5 2016-08-11 14:38:00   0.2

In [99]: df[df.dist.diff().fillna(0) < 0].index
Out[99]: Int64Index([3, 5], dtype='int64')
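
One caveat worth noting: np.split() interprets the values taken from `.index` as row positions, so the call above lines up only when the frame has the default 0..n-1 index. A position-based variant, sketched here as an assumption rather than as part of the original answer, finds the break positions with `np.flatnonzero` and slices with `iloc`. The helper name `split_on_decrease` is hypothetical.

```python
import numpy as np
import pandas as pd

def split_on_decrease(frame, column):
    """Split frame into pieces, starting a new piece wherever column decreases."""
    # Positions (not labels) of rows whose value is lower than the previous row.
    breaks = np.flatnonzero((frame[column].diff() < 0).to_numpy())
    # Piece boundaries: start of frame, each break, end of frame.
    edges = [0, *breaks.tolist(), len(frame)]
    return [frame.iloc[start:stop] for start, stop in zip(edges[:-1], edges[1:])]

# Sample dist values from the question, with a non-default index
# to show that the labels do not matter here.
df = pd.DataFrame(
    {"dist": [1.0, 1.4, 1.8, 0.6, 0.8, 0.2]},
    index=list("abcdef"),
)
dfs = split_on_decrease(df, "dist")
for piece in dfs:
    print(piece)
```

Each piece keeps its original index labels; calling `reset_index(drop=True)` on a piece would renumber it from 0 if that is preferred.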
