Add a column index to a dataframe based on another column (in this case, User)

Time: 2021-05-02 23:12:27

I have a dataframe as given below, where the last column represents the number of times the user has searched for the location and stay:

|  Hanks|         Rotterdam|      airbnb7|                     1|
|Sanders|         Rotterdam|      airbnb2|                     1|
|  Hanks|         Rotterdam|      airbnb2|                     3|
|  Hanks|             Tokyo|      airbnb8|                     2|
|  Larry|             Hanoi|             |                     2|
|  Mango|             Seoul|      airbnb5|                     1|
|  Larry|             Hanoi|      airbnb1|                     2|

which I want to transform as follows:

|  Hanks|         Rotterdam|      airbnb7|                     1|    1|
|Sanders|         Rotterdam|      airbnb2|                     1|    1|
|  Hanks|         Rotterdam|      airbnb2|                     3|    2|
|  Hanks|             Tokyo|      airbnb8|                     2|    3|
|  Larry|             Hanoi|             |                     2|    0|
|  Mango|             Seoul|      airbnb5|                     1|    1|
|  Larry|             Hanoi|      airbnb1|                     2|    1|

Notice that column 5 represents the index of the unique combination of options (location + stay) that the user selected, e.g.

|  Hanks|         Rotterdam|      airbnb7|                     1|    1|
|  Hanks|         Rotterdam|      airbnb2|                     3|    2|
|  Hanks|             Tokyo|      airbnb8|                     2|    3|

I tried to do this using groupBy/agg, implementing a UDF as follows and calling it inside the agg function:

val df2 = df1.groupBy("User", "clickedDestination", "clickedAirbnb")
  .agg(indexUserDetailsUDF(col("clickedAirbnb")) as "clickedAirbnbIndex")

And the UDF is as follows:

import org.apache.spark.sql.functions.udf

var cnt = 0 // mutable counter captured in the UDF's closure
val airbnbClickIndex: (String) => String = (airbnb) => {
  if (airbnb == "") "null" // return "null" when airbnb is empty (the desired output shows 0 here)
  else { cnt += 1; cnt.toString() } // otherwise return the incremented counter
}
val indexUserDetailsUDF = udf(airbnbClickIndex)

But this is not working. Any input is much appreciated. Thanks.


Update 1: Daniel's suggestion of dense_rank does the following for one user:

|Meera|         Amsterdam|     airbnb12|         1|     1|
|Meera|         Amsterdam|      airbnb2|         1|     2|
|Meera|         Amsterdam|      airbnb7|         1|     3|
|Meera|         Amsterdam|      airbnb8|         1|     4|
|Meera|         Bangalore|             |         1|     5|
|Meera|         Bangalore|     airbnb11|         1|     6|
|Meera|         Bangalore|      airbnb8|         1|     7|
|Meera|             Hanoi|      airbnb1|         2|     8|
|Meera|             Hanoi|      airbnb2|         1|     9|
|Meera|             Hanoi|      airbnb7|         1|    10|
|Meera|            Mumbai|             |         1|    11|
|Meera|              Oslo|             |         2|    12|
|Meera|              Oslo|      airbnb8|         1|    13|
|Meera|             Paris|             |         1|    14|
|Meera|             Paris|     airbnb11|         1|    15|
|Meera|             Paris|      airbnb6|         1|    16|
|Meera|             Paris|      airbnb7|         1|    17|
|Meera|             Paris|      airbnb8|         2|    18|
|Meera|         Rotterdam|      airbnb2|         1|    19|

I assumed dense_rank would push the records with empty field values (in this case, the empty 3rd field) to the end. Is this correct?
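For reference, dense_rank just numbers rows in the window's sort order, and an empty string sorts before any non-empty stay, which is why the Bangalore row with no stay ranks ahead of airbnb11 above; empties are not pushed to the end. A minimal sketch that instead gives empty stays an index of 0 and ranks only the non-empty combinations (assuming a dataframe df1 with the column names from the groupBy above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, lit, when}

val isEmptyStay = col("clickedAirbnb") === ""

// Sort empty stays last so the real combinations get ranks 1..n,
// then overwrite the rank of the empty rows with 0.
val window = Window.partitionBy("User")
  .orderBy(isEmptyStay.asc, col("clickedDestination"), col("clickedAirbnb"))

val result = df1
  .withColumn("rk", dense_rank().over(window))
  .withColumn("clickedAirbnbIndex", when(isEmptyStay, lit(0)).otherwise(col("rk")))
  .drop("rk")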

1 Answer

#1


If I got it right, you probably want a windowed rank. You could try the following:


import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Rank each (destination, stay) combination within one user's rows;
// the orderBy columns determine which combination gets which index
val window = Window.partitionBy("User").orderBy("User", "clickedDestination", "clickedAirbnb")

val result = df.withColumn("clickedAirbnbIndex", dense_rank().over(window))
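For example, to inspect the indices per user (a small usage sketch; it assumes the df and column names above):

// Ranks restart at 1 for each user because the window is partitioned by "User";
// ordering by "User" inside the partition is redundant but harmless.
result.orderBy("User", "clickedAirbnbIndex").show()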

If needed, you can find some good reading about window functions in Spark here.


Also, the functions package API documentation is very useful.
