Dot product of combinations of RDD elements using PySpark

Time: 2022-11-24 21:25:12

I have an RDD where each element is a tuple of the form

[ (index1,SparseVector({idx1:1,idx2:1,idx3:1,...})) , (index2,SparseVector() ),... ]

I would like to compute the dot product of every pair of values in this RDD using the SparseVector1.dot(SparseVector2) method provided by the mllib.linalg.SparseVector class. I am aware that Python's itertools module has a combinations function that can generate the pairs whose dot products need to be calculated. Could someone provide a code snippet that does this? The only approach I can think of is to call RDD.collect() to get a list of all the elements in the RDD and then run itertools.combinations on that list, but as I understand it this would perform all the calculations on the driver and would not be distributed. Could someone please suggest a more distributed way of achieving this?

1 solution

#1


def computeDot(sparseVectorA, sparseVectorB):
    """
    Compute the dot product of two SparseVectors.
    """
    return sparseVectorA.dot(sparseVectorB)

# Use the cartesian function on the RDD to pair every row of the
# original RDD with every row (including itself)

combinationRDD = originalRDD.cartesian(originalRDD)

# The records in combinationRDD have the form
# ((IndexA, SVA), (IndexB, SVB)), and include pairs where both halves
# are the same row. Filter those out by keeping only records whose two
# indices differ, then use map to apply the SparseVector dot function.
# Note that != keeps both (A, B) and (B, A); comparing with < instead
# would keep each unordered pair once, assuming the indices are orderable.

dottedRDD = (combinationRDD
             .filter(lambda x: x[0][0] != x[1][0])
             .map(lambda x: computeDot(x[0][1], x[1][1]))
             .cache())

The solution to this question should be along these lines.
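As a variation on the cartesian-and-filter idea above, here is a minimal sketch that keeps the two indices alongside each dot product and emits each unordered pair only once. The function name pairwise_dots is hypothetical, and the sketch assumes the indices are unique and orderable and that each vector exposes a dot method, as SparseVector does.

```python
def pairwise_dots(rdd):
    """
    Sketch: given an RDD of (index, vector) tuples, return an RDD of
    ((index_a, index_b), dot_product) with index_a < index_b.

    Assumptions (not from the original post): indices are unique and
    support "<", and every vector has a .dot(other) method.
    """
    return (rdd.cartesian(rdd)
            # Keep each unordered pair once and drop self-pairs.
            .filter(lambda pair: pair[0][0] < pair[1][0])
            # Carry both indices along so each result can be identified.
            .map(lambda pair: ((pair[0][0], pair[1][0]),
                               pair[0][1].dot(pair[1][1]))))
```

With a live SparkContext named sc (not shown here), something like pairwise_dots(sc.parallelize(rows)).collect() would return a list of ((i, j), dot) tuples. The work is distributed across the executors, but cartesian still generates all n^2 pairs before the filter runs, so the cost grows quadratically with the number of rows.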
