I want to find special bigrams of the form (not, word2) in every document and replace the two words with the single token 1 if word2 exists in the words.txt file; otherwise the words should not be replaced.
Here is my data (data.txt):
fit perfectly clie . purchased not instructions install helpful . improvement battery life not hoped .
product returned not fit nor solve problem ordered . company honest credited account .
cable good not work . cable extremely hot not recognize devices .
...
And the words.txt file:
hoped
instructions
work
fit
...
I've tried:
import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    val conf1 = new SparkConf().setAppName("test").setMaster("local")
    val sc = new SparkContext(conf1)
    val searchList = sc.textFile("data/words.txt")
    val searchBigram = searchList.map(word => ("not", word)).collect.toSet
    val sample1 = sc.textFile("data/data.txt")
    val sample2 = sample1.map(s => s.split("""\.""") // split on .
      .map(_.split(" ") // split on space
        .sliding(2) // take continuous pairs
        .map { case Array(a, b) => (a, b) }
      ).map(elem => if (searchBigram.contains(elem)) ("1", "1") else elem)
      .map { case (e1, e2) => e1 }.mkString(" "))
    sample2.foreach(println)
  }
}
The expected output is:
fit perfectly clie . purchased 1 install helpful . improvement battery life 1 .
product returned 1 nor solve problem ordered . company honest credited account .
cable good 1 . cable extremely hot 1 devices .
...
My code above is incomplete and doesn't work. Can anybody help me?
1 Answer
#1
If you want to stick to the bigram approach, I think it would work better to turn the search items into bigrams as well.
val searchList = sc.textFile("input_file")
// turn the search words into ("not", word) bigrams and collect them as a set,
// assuming that this list is relatively small and fits in memory
val searchBigram = searchList.map(word => ("not", word)).collect.toSet
Now, starting from the result of '.sliding(2)', we can transform the arrays into tuples:
val sample = "improvement battery life not hoped".split(" ")
// bigrams is an iterator of (improvement,battery), (battery,life), (life,not), (not,hoped)
val bigrams = sample.sliding(2).map{case Array(e1,e2) => (e1,e2)}
//Now we use our bigram search set to find/replace the matching bigrams
// -> (improvement,battery), (battery,life), (life,not), (1,1)
val replaced = bigrams.map(elem => if (searchBigram.contains(elem)) ("1", "1") else elem)
// We undo the tuples to obtain the modified string
val result = replaced.map{case (e1,e2) => e1}.mkString(" ")
// result:String = improvement battery life 1
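One caveat worth checking before integrating: keeping only the first element of each pair gives the expected string here because the match is the last bigram of the sample. The sketch below wraps the same steps in a hypothetical helper, `viaSliding` (the name and the `SlidingDemo` object are not from the original code), so the behaviour can be tried on a sentence where the match sits in the middle. It uses `collect` instead of `map` so that a one-word sentence does not raise a `MatchError`.

```scala
object SlidingDemo {
  // Hypothetical helper: the same sliding(2) steps as above, for one sentence.
  def viaSliding(sentence: String, searchBigram: Set[(String, String)]): String =
    sentence.split(" ")
      .sliding(2)                                  // consecutive windows of two tokens
      .collect { case Array(e1, e2) => (e1, e2) }  // skip a window that has only one token
      .map(elem => if (searchBigram.contains(elem)) ("1", "1") else elem)
      .map { case (e1, _) => e1 }                  // keep the first element of each pair
      .mkString(" ")

  def main(args: Array[String]): Unit = {
    // Match at the end of the sentence
    println(viaSliding("improvement battery life not hoped", Set(("not", "hoped"))))
    // Match in the middle of the sentence
    println(viaSliding("purchased not instructions install helpful", Set(("not", "instructions"))))
  }
}
```

With the match in the middle, word2 ("instructions") is emitted again by the following pair and the final word ("helpful") is dropped, so this step probably needs adjusting to produce the expected output from the question.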
Integrate that idea into the larger program and that should result in a working process.
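As a possible way to do that integration, here is a minimal sketch that swaps the sliding windows for a single pass over the tokens, which avoids emitting word2 a second time when the match is not the last pair of a sentence. This is a different technique from the sliding approach above, and `BigramReplace` and `replaceNotBigrams` are hypothetical names. It assumes tokens are separated by single spaces, as in data.txt, so the " . " separators pass through as ordinary tokens.

```scala
object BigramReplace {
  // Hypothetical helper: replaces every ("not", w) pair with "1" when w is in `words`.
  def replaceNotBigrams(line: String, words: Set[String]): String = {
    @annotation.tailrec
    def loop(tokens: List[String], acc: List[String]): List[String] = tokens match {
      // "not" followed by a listed word: emit "1" and consume both tokens
      case "not" :: w :: rest if words.contains(w) => loop(rest, "1" :: acc)
      // any other token is copied through unchanged
      case t :: rest => loop(rest, t :: acc)
      case Nil => acc.reverse
    }
    loop(line.split(" ").toList, Nil).mkString(" ")
  }

  def main(args: Array[String]): Unit = {
    val words = Set("hoped", "instructions", "work", "fit")
    println(replaceNotBigrams(
      "fit perfectly clie . purchased not instructions install helpful . improvement battery life not hoped .",
      words))
  }
}
```

In the Spark job this could be applied per line, for example `sc.textFile("data/data.txt").map(line => BigramReplace.replaceNotBigrams(line, words))`, with `words` built as `sc.textFile("data/words.txt").collect.toSet`, under the same assumption that the word list fits in memory.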