
Time: 2022-11-24 21:25:12

I have an RDD where each element is a tuple of the form


[ (index1,SparseVector({idx1:1,idx2:1,idx3:1,...})) , (index2,SparseVector() ),... ]

I would like to compute the dot product of every pair of values in this RDD using the SparseVector1.dot(SparseVector2) method provided by the mllib.linalg.SparseVector class. I am aware that Python's itertools module has a combinations function that can generate the pairs whose dot products need to be calculated. Could someone provide a code snippet that does this? The only approach I can think of is calling RDD.collect() to get a list of all elements in the RDD and then running itertools.combinations on that list, but as I understand it this would perform all the calculations on the driver and would not be distributed. Could someone please suggest a more distributed way of achieving this?


1 Answer


def computeDot(sparseVectorA, sparseVectorB):
    """Compute the dot product of two SparseVectors."""
    return sparseVectorA.dot(sparseVectorB)

# Use Cartesian function on the RDD to create tuples containing 
# 2-combinations of all the rows in the original RDD

combinationRDD = (originalRDD.cartesian(originalRDD))

# Each record in combinationRDD has the form
# ((Index1, SV1), (Index2, SV2)), and the Cartesian product also
# pairs every row with itself. Filter out the records whose two
# indices are equal, then use map to apply the SparseVector's
# dot function to each remaining pair. Note that this keeps both
# orderings, (row A, row B) and (row B, row A).

dottedRDD = (combinationRDD
             .filter(lambda x: x[0][0] != x[1][0])
             .map(lambda x: computeDot(x[0][1], x[1][1])))

A solution to this question should be along these lines.
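The question asks for itertools.combinations-style pairs, while a `!=` filter on a Cartesian product keeps every pair in both orders. The sketch below illustrates the difference locally in plain Python, without Spark. It is only a stand-in under stated assumptions: `itertools.product` plays the role of `cartesian`, plain dicts play the role of SparseVectors, and `sparse_dot`, `rows`, and the sample values are hypothetical names and data made up for illustration.

```python
from itertools import combinations, product


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    # Iterate over the smaller dict and look keys up in the larger one.
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[key] for key, value in a.items() if key in b)


# Hypothetical sample data: (row index, sparse vector) tuples.
rows = [
    (0, {0: 1.0, 2: 2.0}),
    (1, {2: 3.0, 3: 1.0}),
    (2, {0: 4.0, 3: 1.0, 4: 1.0}),
]

# Local analogue of a Cartesian product of the data with itself:
# every ordered pair of rows, including each row paired with itself.
all_pairs = list(product(rows, rows))

# A "!=" filter drops self-pairs but keeps both (a, b) and (b, a).
ordered = [((a[0], b[0]), sparse_dot(a[1], b[1]))
           for a, b in all_pairs if a[0] != b[0]]

# A "<" filter keeps each unordered pair exactly once.
unordered = [((a[0], b[0]), sparse_dot(a[1], b[1]))
             for a, b in all_pairs if a[0] < b[0]]

# Reference result computed with itertools.combinations.
expected = [((a[0], b[0]), sparse_dot(a[1], b[1]))
            for a, b in combinations(rows, 2)]

print(unordered == expected)
print(len(ordered), len(unordered))
```

If each pair is wanted only once, the same idea would presumably carry over to the Spark pipeline by filtering on `x[0][0] < x[1][0]` instead of `!=`, assuming the row indices are unique and comparable. Keeping the index pair alongside each result, as done here, also makes it possible to tell which rows a given dot product belongs to.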



