
Time: 2022-09-13 09:49:06

I want to extract special bigrams (not, word2) from every document and replace the two words with a single number (1) if word2 exists in the (words.txt) file; otherwise the bigram should be left unchanged.


Here is my data (data.txt):


fit perfectly clie . purchased not instructions install helpful . improvement battery life not hoped .

product returned not fit nor solve problem ordered . company honest credited account .

cable good not work . cable extremely hot not recognize devices .

And the (words.txt) file:



I've tried:

   import org.apache.spark.{SparkConf, SparkContext} 

   object test {

   def main(args: Array[String]): Unit = {
    val conf1 = new SparkConf().setAppName("test").setMaster("local")
    val sc = new SparkContext(conf1)
    val searchList = sc.textFile("data/words.txt")
    val searchBigram = searchList.map(word => ("not", word)).collect.toSet
    val sample1 = sc.textFile("data/data.txt")
    val sample2 = sample1.map(s => s.split( """\.""") // split on .
      .map(_.split(" ") // split on space
      .sliding(2) // take continuous pairs
      .map { case Array(a, b) => (a, b)}
      ).map(elem => if (searchBigram.contains(elem)) ("1", "1") else elem)
      .map { case (e1, e2) => e1}.mkString(" "))

The expected output is:


fit perfectly clie . purchased 1 install helpful . improvement battery life 1 .

product returned 1 nor solve problem ordered . company honest credited account .

cable good 1 . cable extremely hot 1 devices . 

My code above is incomplete and doesn't work. Can anybody help me?


1 Answer


If you want to stick with the bigram approach, I think it would work better to turn the search items into bigrams as well.


val searchList  = sc.textFile("input_file")
// let's make this also into bigrams and collect as a set
// assuming that this list is relatively small and fits in memory
val searchBigram = searchList.map(word => ("not", word)).collect.toSet

Now, starting from the result of '.sliding(2)', we can transform the arrays into tuples:


val sample = "improvement battery life not hoped".split(" ")
// bigrams is an iterator of (improvement,battery), (battery,life), (life,not), (not,hoped)
val bigrams = sample.sliding(2).map{case Array(e1,e2) => (e1,e2)}

// Now we use our bigram search set to find/replace the matching bigrams
// -> (improvement,battery), (battery,life), (life,not), (1,1)
val replaced = bigrams.map(elem => if (searchBigram.contains(elem)) ("1", "1") else elem)

// We undo the tuples to obtain the modified string
val result = replaced.map{case (e1,e2) => e1}.mkString(" ")
// result:String = improvement battery life 1

Integrate that idea in the larger program and that should result in a working process.
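
One thing to watch when integrating: keeping only the first element of each tuple drops the last word of every line, and when a matching bigram sits in the middle of a sentence, word2 shows up again as the first element of the next pair ("purchased not instructions install" would come out as "purchased 1 instructions"). A possible way around this, as a minimal sketch rather than a definitive implementation, is to walk the token list and consume both words whenever a match is found. This is a different technique from '.sliding(2)'. The object name NotBigramReplacer and the function replaceNotBigrams are hypothetical, and the word set is only inferred from the expected output in the question, since the contents of words.txt are not shown.

```scala
object NotBigramReplacer {

  // Replaces every "not <word>" pair with a single "1" when <word> is in `words`.
  // All other tokens are kept unchanged and in order.
  def replaceNotBigrams(line: String, words: Set[String]): String = {
    val tokens = line.trim.split("\\s+").toList

    @annotation.tailrec
    def loop(rest: List[String], acc: List[String]): List[String] = rest match {
      // "not" followed by a listed word: emit "1" and skip both tokens
      case "not" :: w :: tail if words.contains(w) => loop(tail, "1" :: acc)
      // anything else: keep the token and move on by one
      case head :: tail => loop(tail, head :: acc)
      case Nil => acc.reverse
    }

    loop(tokens, Nil).mkString(" ")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: this word list is inferred from the expected output,
    // it is not the real content of words.txt
    val words = Set("instructions", "hoped", "fit", "work", "recognize")
    val line = "fit perfectly clie . purchased not instructions install helpful . " +
      "improvement battery life not hoped ."
    println(replaceNotBigrams(line, words))
  }
}
```

In the Spark job this could presumably be applied per line, for example sample1.map(line => NotBigramReplacer.replaceNotBigrams(line, searchWords)), where searchWords would be searchList.collect.toSet (again assuming the list is small enough to fit in memory).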


