I am working with a very large DataFrame (3.5 million rows × 150 columns, which takes 25 GB of memory when unpickled) and I need to find the maximum of one column over an id number and a date, keeping only the row with that maximum value. Each row is a recorded observation for one id on a certain date, and I also need the latest date.
This is animal test data where there are twenty additional columns, seg1-seg20, for each id and date that are filled with test-day information consecutively; for example, the first test's data fills seg1, the second test's data fills seg2, etc. The "value" field indicates how many segments have been filled, in other words how many tests have been done, so the row with the maximum "value" has the most test data. Ideally I want only these rows and not the earlier ones. For example:
df = DataFrame({'id':    [1000, 1000, 1001, 2000, 2000, 2000],
                'date':  [20010101, 20010201, 20010115, 20010203, 20010223, 20010220],
                'value': [3, 1, 4, 2, 6, 6],
                'seg1':  [22, 76, 23, 45, 12, 53],
                'seg2':  [23, '', 34, 52, 24, 45],
                'seg3':  [90, '', 32, '', 34, 54],
                'seg4':  ['', '', 32, '', 43, 12],
                'seg5':  ['', '', '', '', 43, 21],
                'seg6':  ['', '', '', '', 43, 24]})
df
       date    id seg1 seg2 seg3 seg4 seg5 seg6  value
0  20010101  1000   22   23   90                     3
1  20010201  1000   76                               1
2  20010115  1001   23   34   32   32                4
3  20010203  2000   45   52                          2
4  20010223  2000   12   24   34   43   43   43      6
5  20010220  2000   53   45   54   12   21   24      6
And eventually it should be:
       date    id seg1 seg2 seg3 seg4 seg5 seg6  value
0  20010101  1000   22   23   90                     3
2  20010115  1001   23   34   32   32                4
4  20010223  2000   12   24   34   43   43   43      6
I first tried to use .groupby('id').max() but couldn't find a way to use it to drop rows. The resulting DataFrame MUST contain the ORIGINAL ROWS, not just the maximum value of each column for each id. My current solution is:
for i in df.id.unique():
    # keep only the row with the highest value (latest date breaks ties) for this id
    df = df.drop(df.loc[df.id == i].sort_values(['value', 'date']).index[:-1])
But this takes around 10 seconds for each pass, I assume because it has to scan the entire DataFrame every time through. There are 760,000 unique ids, each 17 digits long, so at this rate it will take far too long to be feasible.
Is there another method that would be more efficient? Currently every column is read in as an "object" dtype, but converting the relevant columns to the smallest possible integer width doesn't seem to help either.
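For reference, this is roughly the conversion I tried (a sketch only, assuming blank strings should be treated as missing values):

import pandas as pd

# coerce the object columns to numeric, turning blank strings into NaN
seg_cols = ['seg%d' % i for i in range(1, 21)]
for col in ['value'] + seg_cols:
    # columns that contain NaN stay float64, since a plain int column cannot
    # hold NaN, so downcast='integer' only shrinks the fully populated ones
    df[col] = pd.to_numeric(df[col], errors='coerce', downcast='integer')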
1 Answer
#1
I tried with groupby('id').max() and it works, and it also drops the rows. Did you remember to reassign the df variable? This operation (like almost all Pandas operations) is not in-place.
If you do:
df.groupby('id', sort=False).max()
You will get:
          date  value
id
1000  20010201      3
1001  20010115      4
2000  20010223      6
And if you don't want id as the index, you do:
df.groupby('id', sort=False, as_index=False).max()
And you will get:
     id      date  value
0  1000  20010201      3
1  1001  20010115      4
2  2000  20010223      6
I don't know if that's going to be much faster, though.
Update
This way the index will not be reset:
df.loc[df.groupby('id').apply(lambda x: x['value'].idxmax())]
And you will get:
       date    id seg1 seg2 seg3 seg4 seg5 seg6  value
0  20010101  1000   22   23   90                     3
2  20010115  1001   23   34   32   32                4
4  20010223  2000   12   24   34   43   43   43      6
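One caveat: idxmax() returns the first row that reaches the group maximum, so when several rows of an id are tied at the maximum value, it keeps whichever comes first in the frame, which is not necessarily the latest date the question asks for. A sketch of one way to honor the latest-date tie-break, assuming the yyyymmdd dates compare correctly as integers:

# sort newest-first so that, within each id, the first occurrence of the
# maximum value is also the row with the latest date
df_sorted = df.sort_values('date', ascending=False)
result = df_sorted.loc[df_sorted.groupby('id')['value'].idxmax()].sort_index()

An alternative that vectorizes the original per-id loop directly is df.sort_values(['id', 'value', 'date']).drop_duplicates('id', keep='last'), which likewise keeps the original rows.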