I have a DataFrame in Spark in which one of the columns contains an array. I have written a separate UDF that converts the array into another array containing only its distinct values. See the example below:
Example: [24, 23, 27, 23] should be converted to [24, 23, 27].
Code:
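import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def uniq_array(col_array):
    x = np.unique(col_array)
    return x

uniq_array_udf = udf(uniq_array, ArrayType(IntegerType()))
Df3 = Df2.withColumn("age_array_unique", uniq_array_udf(Df2.age_array))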
In the above code, Df2.age_array is the array column to which I apply the UDF to produce a new column, "age_array_unique", which should contain only the unique values of the array.
However, as soon as I run Df3.show(), I get the error:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
Can anyone please let me know why this is happening?
Thanks!
1 Answer
#1
18
The source of the problem is that the object returned from the UDF doesn't conform to the declared type. np.unique not only returns a numpy.ndarray but also converts the numerics to the corresponding NumPy types, which are not compatible with the DataFrame API. You can try something like this:
udf(lambda x: list(set(x)), ArrayType(IntegerType()))
or this (to keep order)
udf(lambda xs: list(OrderedDict((x, None) for x in xs)),
ArrayType(IntegerType()))
instead.
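To make the order-preserving variant concrete, here is a small sketch that can be tried in plain Python before wiring it into Spark. The helper names uniq_keep_order and add_unique_column are hypothetical, the column names are the ones from the question, and the PySpark imports are done lazily inside the wrapper only so that the pure-Python part runs without Spark installed.

```python
from collections import OrderedDict


def uniq_keep_order(values):
    """Distinct values as a plain Python list, in first-seen order."""
    if values is None:  # a null array column arrives in the UDF as None
        return None
    return list(OrderedDict.fromkeys(values))


def add_unique_column(df, src="age_array", dst="age_array_unique"):
    """Hypothetical wrapper: append a de-duplicated copy of an array column."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType

    uniq_udf = udf(uniq_keep_order, ArrayType(IntegerType()))
    return df.withColumn(dst, uniq_udf(df[src]))


print(uniq_keep_order([24, 23, 27, 23]))  # [24, 23, 27]
```

With the question's data this would be called as Df3 = add_unique_column(Df2). The set-based variant makes no promise about element order, so it fits only when the order of the values does not matter.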
If you really want np.unique, you have to convert the output:
udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))
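The effect of the conversion can be checked outside Spark. The following is only an illustrative sketch, assuming NumPy is installed; the variable names raw and converted are made up for the demonstration. It prints the container and element types before and after .tolist().

```python
import numpy as np

raw = np.unique([24, 23, 27, 23])  # ndarray holding NumPy integer scalars
converted = raw.tolist()           # plain Python list of plain Python ints

# Container type and element type, before and after the conversion.
print(type(raw).__name__, type(raw[0]).__name__)
print(type(converted).__name__, type(converted[0]).__name__)
print(converted)  # [23, 24, 27]
```

Note that np.unique returns its result sorted, so this variant yields [23, 24, 27] rather than the [24, 23, 27] asked for in the question. If your cluster runs Spark 2.4 or newer, the built-in array_distinct function in pyspark.sql.functions should make the UDF unnecessary.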