KMeans|| for sentiment analysis on Spark

时间:2022-06-27 23:09:34

I'm trying to write a sentiment analysis program based on Spark. To do this I'm using word2vec and KMeans clustering. From word2vec I've got a collection of 20k word vectors in a 100-dimensional space, and now I'm trying to cluster this vector space. When I run KMeans with the default parallel implementation, the algorithm ran for 3 hours! But with the random initialization strategy it took about 8 minutes. What am I doing wrong? I have a MacBook Pro with a 4-core processor and 16 GB of RAM.


K ~= 4000, maxIterations was 20


val vectors: Iterable[org.apache.spark.mllib.linalg.Vector] =
  model.getVectors.map(entry => new VectorWithLabel(entry._1, entry._2.map(_.toDouble)))
val data = sc.parallelize(vectors.toIndexedSeq).persist(StorageLevel.MEMORY_ONLY_2)
log.info("Clustering data size {}", data.count())
log.info("==================Train process started==================")
val clusterSize = modelSize / 5

val kmeans = new KMeans()
kmeans.setInitializationMode(KMeans.K_MEANS_PARALLEL)
kmeans.setK(clusterSize)
kmeans.setRuns(1)
kmeans.setMaxIterations(50)
kmeans.setEpsilon(1e-4)

time = System.currentTimeMillis()
val clusterModel: KMeansModel = kmeans.run(data)
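If you want to keep k-means|| seeding, one knob worth trying (my own suggestion, not something tested in the question) is reducing the number of k-means|| oversampling rounds via MLlib's `setInitializationSteps`, since fewer rounds produce fewer candidate centers for the slow local seeding pass. A minimal sketch of the configuration change, reusing the question's `clusterSize`; the value 2 is an illustrative guess, not a tuned recommendation:

```scala
// Sketch only: same pipeline as above, but with fewer k-means|| rounds.
val kmeansFewerSteps = new KMeans()
  .setInitializationMode(KMeans.K_MEANS_PARALLEL)
  .setInitializationSteps(2) // fewer oversampling rounds -> fewer candidate centers
  .setK(clusterSize)
  .setRuns(1)
  .setMaxIterations(50)
  .setEpsilon(1e-4)
```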

And the Spark context initialization is here:


val conf = new SparkConf()
  .setAppName("SparkPreProcessor")
  .setMaster("local[4]")
  .set("spark.default.parallelism", "8")
  .set("spark.executor.memory", "1g")
val sc = SparkContext.getOrCreate(conf)

Also, a few updates about running this program: I'm running it inside IntelliJ IDEA, and I don't have a real Spark cluster. But I thought a personal machine could serve as a Spark cluster.


I saw that the program hangs inside this loop in the Spark source file LocalKMeans.scala:


// Initialize centers by sampling using the k-means++ procedure.
centers(0) = pickWeighted(rand, points, weights).toDense
for (i <- 1 until k) {
  // Pick the next center with a probability proportional to cost under current centers
  val curCenters = centers.view.take(i)
  val sum = points.view.zip(weights).map { case (p, w) =>
    w * KMeans.pointCost(curCenters, p)
  }.sum
  val r = rand.nextDouble() * sum
  var cumulativeScore = 0.0
  var j = 0
  while (j < points.length && cumulativeScore < r) {
    cumulativeScore += weights(j) * KMeans.pointCost(curCenters, points(j))
    j += 1
  }
  if (j == 0) {
    logWarning("kMeansPlusPlus initialization ran out of distinct points for centers." +
      s" Using duplicate point for center k = $i.")
    centers(i) = points(0).toDense
  } else {
    centers(i) = points(j - 1).toDense
  }
}
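Why this loop is so slow for k ≈ 4000 is mostly arithmetic: picking center i scans every candidate point against all i centers chosen so far, so the seeding as a whole performs on the order of n·k²/2 point-to-center distance computations, single-threaded on the driver. A rough, self-contained cost model (my own sketch; `seedingCost` is not a Spark function, and it ignores the fact that the loop actually evaluates `pointCost` twice per pick, once for the sum and once for the cumulative scan):

```scala
// Counts point-to-center distance evaluations performed by the
// k-means++ seeding loop above, assuming pointCost(curCenters, p)
// compares p against each of the i current centers:
// picking center i costs n * i, so the total is n * (1 + 2 + ... + (k-1)).
def seedingCost(n: Long, k: Long): Long =
  (1L until k).map(i => n * i).sum

// With roughly 20k candidate points (the k-means|| oversampling can
// easily cover most of this small dataset) and k = 4000, this is about
// 1.6e11 evaluations, each over a 100-dimensional vector -- hours of
// single-threaded work is plausible.
val evals = seedingCost(20000L, 4000L)
println(s"~$evals distance evaluations")
```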

2 Answers

#1



Initialisation using KMeans.K_MEANS_PARALLEL is more complicated than random. However, it shouldn't make such a big difference. I would recommend investigating whether it is the parallel initialisation algorithm that takes too much time (it should actually be more efficient than the KMeans iterations themselves).


For information on profiling see: http://spark.apache.org/docs/latest/monitoring.html


If it is not the initialisation that takes up the time, there is something seriously wrong. However, using random initialisation shouldn't be any worse for the final result (just less efficient!).


Actually, when you use KMeans.K_MEANS_PARALLEL to initialise, you should get reasonable results with 0 iterations. If this is not the case, there might be some regularities in the distribution of the data which send KMeans off track. Hence, if you haven't distributed your data randomly, you could also change this. However, such an impact would surprise me, given a fixed number of iterations.


#2



I've run Spark on AWS with 3 slaves (c3.xlarge) and the result is the same: the problem is that parallel KMeans initializes the algorithm in N parallel runs, but it's still extremely slow for a small amount of data. My solution is to continue using random initialization. Data size, approximately: 4k clusters for 21k 100-dim vectors.
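For completeness, the workaround this answer describes is a one-line change on the builder. A minimal sketch against the asker's setup (`clusterSize` and `data` come from the question's code; the timings are the asker's measurements, not mine):

```scala
// Sketch: same training as in the question, but with random seeding,
// which the asker measured at ~8 minutes instead of ~3 hours.
val kmeansRandom = new KMeans()
  .setInitializationMode(KMeans.RANDOM)
  .setK(clusterSize)
  .setRuns(1)
  .setMaxIterations(50)
  .setEpsilon(1e-4)
val clusterModel = kmeansRandom.run(data)
```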
