1. Introduction
The previous post covered training a Word2Vec model for synonym mining. With large volumes of data, Spark is the natural choice for training, so this post describes how we implemented the model training on Spark.
2. Word Segmentation
The input to model training is a corpus that has already been segmented into words, so the first step is to implement word segmentation on Spark.
import jieba

def split(jieba_list, iterator):
    """Run jieba segmentation over every record in a partition and
    return a list of space-joined word strings."""
    sentences = []
    for row in iterator:
        try:
            # Concatenate the non-null columns of the row (Python 2:
            # encode the unicode column values to utf-8 byte strings).
            s = ""
            for c in row:
                if c is not None:
                    s += c.encode('utf-8')
            # Records look like "<id>__<content>"; drop the id prefix
            # and keep only the content part for segmentation.
            s = s.split("__")[1]
            # Accurate-mode (non-full) segmentation.
            word_list = jieba.cut(s, cut_all=False)
            sentences.append(" ".join(word_list))
        except Exception:
            # Skip malformed records instead of failing the whole partition.
            continue
    return sentences
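Note that the tag list passed in as jieba_list is not used in the body above. One plausible use, sketched below purely as an assumption, is to register the tag names as custom jieba vocabulary on each executor so that multi-character labels are kept as single tokens. The helper name load_user_words and the field access on label1_name/label2_name are illustrative, not part of the original code.

import jieba

def load_user_words(tag_rows):
    """Hypothetical helper: add tag names to jieba's user dictionary so
    they are not split apart during segmentation."""
    for row in tag_rows:
        for name in (row.label1_name, row.label2_name):
            if name:
                jieba.add_word(name.encode('utf-8'))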
3. Model Training
Here, the segmented RDD is used directly as the input to Word2Vec:
from pyspark.mllib.feature import Word2Vec

word2vec = Word2Vec().setNumPartitions(50)
spark.sql("use jkgj_log")
# Tag names from the dimension table, broadcast so every executor reuses one copy.
df_list = spark.sql("select label1_name, label2_name from mid_dim_tag").collect()
bc_tags = spark.sparkContext.broadcast(df_list)
diagnosis_text_in = spark.sql(
    "select main_suit, msg_content from diagnosis_text_in where pt >= '20170101'")
# Segment each partition, then split the space-joined strings into word lists.
inp = (diagnosis_text_in.rdd
       .repartition(1200)
       .mapPartitions(lambda it: split(bc_tags.value, it))
       .map(lambda line: line.split(" ")))
model = word2vec.fit(inp)
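Once fit returns, the resulting Word2VecModel can be queried for nearest neighbours and persisted for later reuse. A minimal sketch follows; the query word and HDFS path are placeholders, not values from the original pipeline.

# Look up the 10 words most similar to a given term (placeholder word).
for word, cosine_sim in model.findSynonyms(u"感冒", 10):
    print(word, cosine_sim)

# Persist the trained model to HDFS (placeholder path) for later reuse.
model.save(spark.sparkContext, "hdfs:///tmp/word2vec_model")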