
Time: 2021-01-01 21:40:11

I am working with a very large DataFrame (3.5 million rows by 150 columns, about 25 GB of memory when unpickled). For each id, I need to find the maximum of one column and keep only the row that holds that maximum. Each row is a recorded observation for one id on a certain date, and when several rows share the maximum I need the one with the latest date.

This is animal test data. For each id and date there are twenty additional columns, seg1-seg20, that are filled with test-day information consecutively: the first test fills seg1, the second test fills seg2, etc. The "value" field indicates how many segments have been filled, in other words how many tests have been done, so the row with the maximum "value" has the most test data. Ideally I want only these rows and not the earlier ones. For example:

df= DataFrame({'id':[1000,1000,1001,2000,2000,2000], 
       date    id  seg1 seg2 seg3 seg4 seg5 seg6  value
0  20010101  1000    22   23   90                     3
1  20010201  1000    76                               1
2  20010115  1001    23   34   32   32                4
3  20010203  2000    45   52                          2
4  20010223  2000    12   24   34   43   43   41      6
5  20010220  2000    12   24   34   43   44   35      6

And eventually it should be:


       date    id  seg1 seg2 seg3 seg4 seg5 seg6  value
0  20010101  1000    22   23   90                     3
2  20010115  1001    23   34   32   32                4
4  20010223  2000    12   24   34   43   43   41      6

I first tried to use .groupby('id').max() but couldn't find a way to use it to drop rows. The resulting DataFrame MUST contain the ORIGINAL ROWS and not just the maximum value of each column for each id. My current solution is:

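for i in df.id.unique():
    # Within each id, sort by value then date and drop every row except the last one.
    # (sort_values replaces the old DataFrame.sort, which newer pandas no longer has.)
    df = df.drop(df.loc[df.id == i].sort_values(['value', 'date']).index[:-1])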

But each pass through the loop takes around 10 seconds, I assume because it scans the entire DataFrame every time. There are 760,000 unique ids, each 17 digits long, so at this rate it would take far too long to be feasible.


Is there another method that would be more efficient? Currently every column is read in as "object", but converting the relevant columns to the smallest possible integer type doesn't seem to help either.


1 Answer



I tried groupby('id').max() and it works, and it also drops the rows. Did you remember to reassign the df variable? This operation (like almost all pandas operations) is not in-place.


If you do:


df.groupby('id', sort = False).max()

You will get:


          date  value
1000  20010201      3
1001  20010115      4
2000  20010223      6

And if you don't want id as the index, you do:


df.groupby('id', sort = False, as_index = False).max()

And you will get:


     id      date  value
0  1000  20010201      3
1  1001  20010115      4
2  2000  20010223      6

I don't know if that's going to be much faster, though.
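One caveat is worth checking on the sample data: `max()` is computed column by column, so the result for an id can combine values that come from different rows. A minimal sketch, assuming only the `id`, `date` and `value` columns of the question's example (the variable names are illustrative):

```python
import pandas as pd

# Sample data from the question, restricted to id, date and value.
df = pd.DataFrame({
    'id':    [1000, 1000, 1001, 2000, 2000, 2000],
    'date':  [20010101, 20010201, 20010115, 20010203, 20010223, 20010220],
    'value': [3, 1, 4, 2, 6, 6],
})

# max() is taken per column, independently of the other columns.
per_column_max = df.groupby('id', sort=False, as_index=False).max()

# Look each aggregated row up in the original data; rows flagged
# 'left_only' are combinations that never occurred as a single observation.
check = per_column_max.merge(
    df, on=['id', 'date', 'value'], how='left', indicator=True
)
rows_not_in_original = check.loc[
    check['_merge'] == 'left_only', ['id', 'date', 'value']
]
print(rows_not_in_original)
```

For id 1000 the aggregated row pairs the date of row 1 with the value of row 0, so it does not match any original row.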



This way the original rows are kept and the index will not be reset (idxmax returns index labels, so select with loc rather than iloc):


df.loc[df.groupby('id')['value'].idxmax()]

And you will get:


       date    id  seg1 seg2 seg3 seg4 seg5 seg6  value
0  20010101  1000    22   23   90                     3
2  20010115  1001    23   34   32   32                4
4  20010223  2000    12   24   34   43   43   41      6
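Note that `idxmax` returns the first row holding the maximum, so when several rows of an id share the top `value` (rows 4 and 5 for id 2000), the row kept depends on row order rather than on `date`. The following is a hedged sketch of an alternative that is not part of the answer above: sort once, then keep the last row per id. It assumes `date` is stored as an integer in YYYYMMDD form so that it sorts chronologically, and it uses only the `id`, `date` and `value` columns of the sample.

```python
import pandas as pd

# Sample data from the question, restricted to id, date and value.
df = pd.DataFrame({
    'id':    [1000, 1000, 1001, 2000, 2000, 2000],
    'date':  [20010101, 20010201, 20010115, 20010203, 20010223, 20010220],
    'value': [3, 1, 4, 2, 6, 6],
})

latest_max = (
    df.sort_values(['id', 'value', 'date'])   # highest value, then latest date, comes last per id
      .drop_duplicates('id', keep='last')     # keep that last row for each id
      .sort_index()                           # restore the original row order
)
print(latest_max)
```

This keeps the original rows and index labels and avoids the Python-level loop over ids. How it performs on the full 3.5 million rows has not been measured here.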



