1. Download the Chinese corpus
Download the Chinese data following the "Word2Vec experiments on Chinese/English corpora" post on http://www.52nlp.cn/, or fetch the Chinese Wikipedia dump directly from: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2.
2. Convert the XML wiki dump to plain text
Based on this post: http://www.mamicode.com/info-detail-1699780.html
import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print("Usage: python process_wiki.py <input xml.bz2> <output text file>")
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    space = b' '
    i = 0
    output = open(outp, 'w', encoding='utf-8')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        # each article becomes one line of space-separated tokens
        # (newer gensim versions yield str tokens; use ' '.join(text) there)
        s = space.join(text).decode('utf-8') + "\n"
        output.write(s)
        i = i + 1
        if i % 10000 == 0:
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished. Saved " + str(i) + " articles")
Then open cmd and run: python G:\*语料\process_wiki.py G:\*语料\zhwiki-latest-pages-articles.xml.bz2 G:\*语料\wiki.zh.text (since I opened cmd directly rather than inside the *语料 folder, every file path has to be spelled out in full; otherwise you get the error: No such file or directory).
3. Convert Traditional Chinese to Simplified
First install OpenCC. Download the Windows binary release (opencc-1.0.1-win64.7z) from https://bintray.com/package/files/byvoid/opencc/OpenCC.
A blog I consulted says to open a command window in the folder containing wiki.zh.text and run: C:\Users\xiaolin\opencc-1.0.1-win64\opencc.exe -i wiki.zh.text -o wiki.zh.text.jian -c zht2hs.ini
It turned out that opencc-1.0.1-win64 contains no such config file.
From http://blog.sina.com.cn/s/blog_703521020102zb5v.html I then learned that t2s.json is the config for "Traditional Chinese to Simplified Chinese" (繁體到簡體).
So the command becomes: C:\Users\xiaolin\opencc-1.0.1-win64\opencc.exe -i wiki.zh.text -o wiki.zh.text.jian -c C:\Users\xiaolin\opencc-1.0.1-win64\t2s.json
Here -i is the input file, -o the output file, and t2s.json selects Traditional-to-Simplified conversion.
That completes the Traditional-to-Simplified conversion.
4. Word segmentation
Segment the text with the jieba library (a third-party Python package, installed with pip install jieba; it is not part of the standard library).
import codecs

import jieba

f = codecs.open('wiki.zh.text.jian', 'r', encoding='utf-8')
target = codecs.open('wiki.zh.jian.fenci.txt', 'w', encoding='utf-8')
print('open files')

line_num = 1
line = f.readline()
while line:
    print('---- processing', line_num, 'article ----------------')
    # jieba.cut returns a generator of tokens; join them with spaces
    line_seg = " ".join(jieba.cut(line))
    target.writelines(line_seg)
    line_num = line_num + 1
    line = f.readline()

f.close()
target.close()
5. Train the Word2vec model
Based on this post: https://magicly.me/word2vec-first-try-md/
With the 52nlp training script — perhaps due to insufficient memory, perhaps something else — I kept getting the warning "slow version of gensim is being used" and never got a result. Following the blog above, I tried the code below in Spyder3 and it ran through.
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

inp = 'G:\\*语料\\wiki.zh.jian.fenci.txt'
outp1 = 'G:\\*语料\\wiki-zh-model'
outp2 = 'G:\\*语料\\wiki-zh-vector'

model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                 workers=multiprocessing.cpu_count())
model.save(outp1)  # save in binary format
model.save_word2vec_format(outp2, binary=False)  # save in text format, one word vector per line
After the run finishes you get three files: wiki-zh-model, wiki-zh-model.syn1neg.npy and wiki-zh-model.wv.syn0.npy.
6. Test the word2vec model
from gensim.models import Word2Vec

model = Word2Vec.load('G:\\*语料\\wiki-zh-model')
# model = Word2Vec.load_word2vec_format('./wiki-zh-vector', binary=False)  # use this loader if the model was saved in text format
res = model.most_similar('游泳')
print(res)
The test returned the nearest neighbours of 游泳. Task complete!!!
One remaining improvement: before segmentation, the documents could be cleaned up a bit — stripping English text, stop words, punctuation and so on.
See: http://www.cnblogs.com/little-horse/p/6701911.html
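A minimal sketch of such a pre-cleaning step, keeping only CJK characters (the character range and the function name are my own choices, not taken from the posts above; stop-word removal would additionally need a stop-word list):

```python
import re

def keep_chinese(line):
    # replace every run of non-CJK characters (English letters,
    # digits, punctuation) with a single space, then trim
    return re.sub(r"[^\u4e00-\u9fa5]+", " ", line).strip()

print(keep_chinese("Word2Vec实验, 去掉English和标点!"))
```

Applied line by line before jieba segmentation, this drops everything outside the basic CJK Unified Ideographs block.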