spark add_month无法按预期工作[重复]

This question already has an answer here:

这个问题在这里已有答案:

Spark Dataframe Random UUID changes after every transformation/action 1 answer

每次转换/操作1回答后,Spark Dataframe随机UUID都会更改

in a dataframe, I'm generating a column based on column A in DateType format "yyyy-MM-dd". Column A is generated from a UDF (udf generates a random date from the last 24 months).

在数据框中,我正在生成基于DateType格式“yyyy-MM-dd”中的列A的列。 A列是从UDF生成的(udf生成过去24个月的随机日期)。

from that generated date I try to calculate column B. Column B is column A minus 6 months. ex. 2017-06-01 in A is 2017-01-01 in B. To achieve this I use function add_months(columname, -6)

从生成日期开始,我尝试计算B列.B列是A栏减去6个月。恩。 2017-06-01在A中是2017-01-01在B.要实现这一点我使用函数add_months(columname,-6)

when I do this using another column (not generated by udf) I get the right result. But when I do it on that generated column I get random values, totally wrong.

当我使用另一列(不是由udf生成)执行此操作时,我得到了正确的结果。但是,当我在生成的列上执行此操作时,我会得到随机值,完全错误。

I checked the schema, column is from DateType

我检查了架构,列是从DateType

this is my code :

这是我的代码:

val test = df.withColumn("A", to_date(callUDF("randomUDF")))
val test2 = test.select(col("*"), add_months(col("A"), -6).as("B"))

code of my UDF :

我的UDF的代码:

sqlContext.udf.register("randomUDF", () => {

//prepare dateformat
val formatter = new SimpleDateFormat("yyyy-MM-dd")

//get today's date as reference 
val today = Calendar.getInstance()
val now = today.getTime()

//set "from" 2 years from now
val from = Calendar.getInstance()
from.setTime(now)
from.add(Calendar.MONTH, -24)

// set dates into Long
val valuefrom = from.getTimeInMillis()
val valueto = today.getTimeInMillis()

//generate random Long between from and to
val value3 = (valuefrom + Math.random()*(valueto - valuefrom))

// set generated value to Calendar and format date
val calendar3 = Calendar.getInstance()
calendar3.setTimeInMillis(value3.toLong)
formatter.format(calendar3.getTime()
}

UDF works as expected, but I think there is something going wrong here. I tried the add_months function on another column (not generated) and it worked fine.

UDF按预期工作,但我认为这里出了问题。我在另一列(未生成)上尝试了add_months函数,它运行正常。

example of results I get with this code :

我得到这个代码的结果示例:

A            |      B
2017-10-20   |   2016-02-27
2016-05-06   |   2015-05-25
2016-01-09   |   2016-03-14
2016-01-04   |   2017-04-26

using spark version 1.5.1 using scala 2.10.4

使用scala 2.10.4使用spark版本1.5.1

1 个解决方案

#1

The creation of test2 dataframe in your code

在代码中创建test2数据帧

val test2 = test.select(col("*"), add_months(col("A"), -6).as("B"))

is treated by spark as

被火花视为

val test2 = df.withColumn("A", to_date(callUDF("randomUDF"))).select(col("*"), add_months(to_date(callUDF("randomUDF")), -6).as("B"))

So you can see that udf function is called twice. df.withColumn("A", to_date(callUDF("randomUDF"))) is generating the date that comes in column A. And add_months(to_date(callUDF("randomUDF")), -6).as("B") is calling udf function again and generating a new date and subtracting 6 months from it and showing that date in column B.

所以你可以看到udf函数被调用两次。 df.withColumn(“A”,to_date(callUDF(“randomUDF”)))生成A列中的日期。并且add_months(to_date(callUDF(“randomUDF”)), - 6)。as(“B” )再次调用udf函数并生成一个新日期并从中减去6个月并在B列中显示该日期。

Thats the reason you are getting random dates.

这就是你得到随机日期的原因。

The solution to this would be to use persist or cache in test dataframe as

对此的解决方案是在测试数据帧中使用持久化或缓存

val test = df.withColumn("A", callUDF("randomUDF")).cache()
val test2 = test.as("table").withColumn("B", add_months($"table.A", -6))

#1