Python Pandas DataFrames: sorting, summing and getting the maximum

I have just started learning Python, Pandas and NumPy and I want to find out what is the cleanest and most efficient way to solve the following problem.

I have data which holds CarManufacturer, Car, TotalCarSales, bearing in mind that the data is not small:

CarManufacturer  Car    TotalCarSales
Volkswagen       Polo   100
Volkswagen       Golf   50
Honda            Jazz   40
Honda            Civic  100

Question: Which manufacturer sold the most cars according to its top 3 best sellers?

I'm struggling to solve this efficiently. I want to avoid iterating over the data.

My thoughts:

- Load the data into a DataFrame.
- Index the data according to CarManufacturer, Car, TotalCarSales. Do I want to do a sort here? Would that be slow?
- Create a new DataFrame which has CarManufacturer, TotalSales. For each CarManufacturer I would need to get the top 3 TotalCarSales and take their sum. Is there a way of doing this without iterating over all records in the DataFrame? What is the best way to fetch the top 3?
- Then if I sort the TotalSales and take the top 3, wouldn't the sort be slow? Is there a more efficient way?

3 Answers

#1

The best way to learn is to try it.

It's very unlikely your data will be too large (there aren't millions of car models), but in any case, you can use df.head(N) to take the top N rows to try your method and see if it's slow.

Other useful functions include df.groupby, df.nlargest and df.sort_values.
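
As a rough sketch of what those calls do, here they are on the four sample rows from the question. The variable names are only illustrative.

```python
import pandas as pd

# The four sample rows from the question.
df = pd.DataFrame({
    "CarManufacturer": ["Volkswagen", "Volkswagen", "Honda", "Honda"],
    "Car": ["Polo", "Golf", "Jazz", "Civic"],
    "TotalCarSales": [100, 50, 40, 100],
})

# head(N) keeps the first N rows, handy for trying an idea on a small slice.
sample = df.head(3)

# Rows ordered by sales, largest first.
by_sales = df.sort_values("TotalCarSales", ascending=False)

# Total sales per manufacturer.
totals = df.groupby("CarManufacturer")["TotalCarSales"].sum()

# The two best-selling rows overall, without sorting the whole frame yourself.
top_rows = df.nlargest(2, "TotalCarSales")

print(totals)
print(top_rows)
```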

#2

Do I want to do a sort here? That would be slow?

Yes, sorting within each group is a good way to get what you want. Moreover, sorting is an O(n log n) operation, so it shouldn't be too slow.

Is there a way of doing this without iterating over all records in DataFrame? What is best way to fetch the top 3?

Yes, you can use GroupBy.head, which returns the first n rows of each group, so the data has to be sorted first. An alternative that may save you some time is SeriesGroupBy.nlargest, which gives you the n largest elements of each group's series, so that you won't need to sort first.
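
A minimal sketch of the nlargest route, assuming the column names from the question. The sample values are borrowed from the larger example elsewhere on this page, and top_n, top_per_maker, top_sums and best are just illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "CarManufacturer": ["Volkswagen", "Volkswagen", "Volkswagen",
                        "Toyota", "Toyota", "Toyota",
                        "Honda", "Honda", "Honda",
                        "Seat", "Seat"],
    "Car": ["Polo", "Sharan", "Golf",
            "Auris", "Aygo", "Avensis",
            "Civic", "Jazz", "Civic",
            "Toledo", "Leon"],
    "TotalCarSales": [100, 100, 50, 200, 10, 50, 40, 40, 100, 200, 400],
})

top_n = 3

# The n largest sales per manufacturer, no explicit sort needed.
# The result is indexed by (CarManufacturer, original row label).
top_per_maker = df.groupby("CarManufacturer")["TotalCarSales"].nlargest(top_n)

# Collapse back to one number per manufacturer.
top_sums = top_per_maker.groupby(level="CarManufacturer").sum()

# Manufacturer with the highest top-n total.
best = top_sums.idxmax()
print(top_sums)
print(best)
```

A manufacturer with fewer than top_n models (Seat here) simply contributes all of its rows.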

#3

I think you need:

print (df)
   CarManufacturer      Car  TotalCarSales
0       Volkswagen     Polo            100
1       Volkswagen   Sharan            100
2       Volkswagen     Golf             50
3           Toyota    Auris            200
4           Toyota     Aygo             10
5           Toyota  Avensis             50
6            Honda    Civic             40
7            Honda     Jazz             40
8            Honda    Civic            100
9             Seat   Toledo            200
10            Seat     Leon            400

a = (df.sort_values('TotalCarSales', ascending=False)
      .groupby('CarManufacturer')['TotalCarSales']
      .apply(lambda x: x.head(2).sum())  # top 2 per manufacturer; use head(3) for the top 3
      .nlargest(3).index.tolist())
print (a)
['Seat', 'Toyota', 'Volkswagen']

Explanation:

First, sort the DataFrame by the TotalCarSales column with sort_values.

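Then group by CarManufacturer and sum the top values of each group selected with head (the example uses head(2); use head(3) for the top 3 the question asks about).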

To get the top 3 manufacturers, add nlargest and extract the index values.

Details:

print (df.sort_values('TotalCarSales', ascending=False))
   CarManufacturer      Car  TotalCarSales
10            Seat     Leon            400
3           Toyota    Auris            200
9             Seat   Toledo            200
0       Volkswagen     Polo            100
1       Volkswagen   Sharan            100
8            Honda    Civic            100
2       Volkswagen     Golf             50
5           Toyota  Avensis             50
6            Honda    Civic             40
7            Honda     Jazz             40
4           Toyota     Aygo             10

print (df.sort_values('TotalCarSales', ascending=False)
      .groupby('CarManufacturer')['TotalCarSales']
      .apply(lambda x: x.head(2).sum()))
CarManufacturer
Honda         140
Seat          600
Toyota        250
Volkswagen    200
Name: TotalCarSales, dtype: int64
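
For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the sample frame and wraps the same chain in a function. The function name top_manufacturers and its parameters per_maker and how_many are hypothetical, added only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "CarManufacturer": ["Volkswagen", "Volkswagen", "Volkswagen",
                        "Toyota", "Toyota", "Toyota",
                        "Honda", "Honda", "Honda",
                        "Seat", "Seat"],
    "Car": ["Polo", "Sharan", "Golf",
            "Auris", "Aygo", "Avensis",
            "Civic", "Jazz", "Civic",
            "Toledo", "Leon"],
    "TotalCarSales": [100, 100, 50, 200, 10, 50, 40, 40, 100, 200, 400],
})


def top_manufacturers(frame, per_maker=3, how_many=3):
    """Rank manufacturers by the summed sales of their best-selling models."""
    sums = (frame.sort_values("TotalCarSales", ascending=False)
                 .groupby("CarManufacturer")["TotalCarSales"]
                 # Rows stay in sorted order inside each group,
                 # so head() picks the largest values.
                 .apply(lambda s: s.head(per_maker).sum()))
    return sums.nlargest(how_many).index.tolist()


print(top_manufacturers(df, per_maker=2))  # same setting as the example above
print(top_manufacturers(df, per_maker=3))  # the question's top 3 best sellers
```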
