Training a Chinese word2vec model

Date: 2024-07-07 17:33:20

--  This article is a study/analysis blog post ---

1. Preparing and preprocessing the data

First, you need a fairly large Chinese corpus; Chinese Wikipedia is a good choice (Sogou's news corpus is also worth trying). The Chinese Wikipedia dump can be downloaded from
https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

The Chinese Wikipedia data is not very large: the compressed XML file is roughly 1 GB. First, process this XML archive with process_wiki_data.py by running: python process_wiki_data.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # process_wiki_data.py: parse the XML dump and convert the wiki articles to plain text
    import logging
    import os.path
    import sys

    from gensim.corpora import WikiCorpus

    if __name__ == '__main__':
        program = os.path.basename(sys.argv[0])
        logger = logging.getLogger(program)
        logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
        logging.root.setLevel(level=logging.INFO)
        logger.info("running %s" % ' '.join(sys.argv))

        # check and process input arguments
        if len(sys.argv) < 3:
            print(globals()['__doc__'] % locals())
            sys.exit(1)
        inp, outp = sys.argv[1:3]

        space = " "
        i = 0
        output = open(outp, 'w')
        wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
        for text in wiki.get_texts():
            output.write(space.join(text) + "\n")
            i = i + 1
            if i % 10000 == 0:
                logger.info("Saved " + str(i) + " articles")
        output.close()
        logger.info("Finished Saved " + str(i) + " articles")

The run produces the following log:

    2016-08-11 20:39:22,739: INFO: running process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
    2016-08-11 20:40:08,329: INFO: Saved 10000 articles
    2016-08-11 20:40:45,501: INFO: Saved 20000 articles
    2016-08-11 20:41:23,659: INFO: Saved 30000 articles
    2016-08-11 20:42:01,748: INFO: Saved 40000 articles
    2016-08-11 20:42:33,779: INFO: Saved 50000 articles
    ......
    2016-08-11 20:55:23,094: INFO: Saved 200000 articles
    2016-08-11 20:56:14,692: INFO: Saved 210000 articles
    2016-08-11 20:57:04,614: INFO: Saved 220000 articles
    2016-08-11 20:57:57,979: INFO: Saved 230000 articles
    2016-08-11 20:58:16,621: INFO: finished iterating over Wikipedia corpus of 232894 documents with 51603419 positions (total 2581444 articles, 62177405 positions before pruning articles shorter than 50 words)
    2016-08-11 20:58:16,622: INFO: Finished Saved 232894 articles

In Python, jieba can handle the word segmentation, producing the segmented file wiki.zh.text.seg (a minimal sketch of this step is shown below).
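
The original post does not show the segmentation script itself, so here is a minimal sketch of that step, assuming the plain-text output wiki.zh.text from the previous step and the jieba package; the script name segment_wiki.py is only illustrative.

    # segment_wiki.py -- illustrative name, not from the original post
    # A minimal sketch of the jieba segmentation step: reads wiki.zh.text
    # (one article per line) and writes space-separated tokens to wiki.zh.text.seg.
    import codecs
    import jieba

    with codecs.open('wiki.zh.text', 'r', encoding='utf-8') as fin, \
         codecs.open('wiki.zh.text.seg', 'w', encoding='utf-8') as fout:
        for line in fin:
            words = jieba.cut(line.strip())  # generator of segmented tokens
            fout.write(' '.join(words) + '\n')
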
Then train the model with the word2vec tool:
python train_word2vec_model.py wiki.zh.text.seg wiki.zh.text.model wiki.zh.text.vector

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # train_word2vec_model.py: train the word2vec model
    import logging
    import os.path
    import sys
    import multiprocessing

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    if __name__ == '__main__':
        program = os.path.basename(sys.argv[0])
        logger = logging.getLogger(program)
        logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
        logging.root.setLevel(level=logging.INFO)
        logger.info("running %s" % ' '.join(sys.argv))

        # check and process input arguments
        if len(sys.argv) < 4:
            print(globals()['__doc__'] % locals())
            sys.exit(1)
        inp, outp1, outp2 = sys.argv[1:4]

        model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                         workers=multiprocessing.cpu_count())

        # trim unneeded model memory = use (much) less RAM
        # model.init_sims(replace=True)
        model.save(outp1)
        model.save_word2vec_format(outp2, binary=False)
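
Note that the script above targets the gensim API of 2016, which is what the training log below reflects. On gensim 4.x and later the same training step would look roughly like the following sketch (an assumption, not part of the original post): the size parameter was renamed to vector_size, and the plain-text export now lives on model.wv.

    # Rough gensim 4.x equivalent of the training step above (an assumption, not from the original post)
    import multiprocessing
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    model = Word2Vec(LineSentence('wiki.zh.text.seg'), vector_size=400, window=5,
                     min_count=5, workers=multiprocessing.cpu_count())
    model.save('wiki.zh.text.model')
    model.wv.save_word2vec_format('wiki.zh.text.vector', binary=False)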

Training log:

    2016-08-12 09:50:02,586: INFO: running python train_word2vec_model.py wiki.zh.text.seg wiki.zh.text.model wiki.zh.text.vector
    2016-08-12 09:50:02,592: INFO: collecting all words and their counts
    2016-08-12 09:50:02,592: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
    2016-08-12 09:50:12,476: INFO: PROGRESS: at sentence #10000, processed 12914562 words and 254662 word types
    2016-08-12 09:50:20,215: INFO: PROGRESS: at sentence #20000, processed 22308801 words and 373573 word types
    2016-08-12 09:50:28,448: INFO: PROGRESS: at sentence #30000, processed 30724902 words and 460837 word types
    ...
    2016-08-12 09:52:03,498: INFO: PROGRESS: at sentence #210000, processed 143804601 words and 1483608 word types
    2016-08-12 09:52:07,772: INFO: PROGRESS: at sentence #220000, processed 149352283 words and 1521199 word types
    2016-08-12 09:52:11,639: INFO: PROGRESS: at sentence #230000, processed 154741839 words and 1563584 word types
    2016-08-12 09:52:12,746: INFO: collected 1575172 word types from a corpus of 156430908 words and 232894 sentences
    2016-08-12 09:52:13,672: INFO: total 278291 word types after removing those with count<5
    2016-08-12 09:52:13,673: INFO: constructing a huffman tree from 278291 words
    2016-08-12 09:52:29,323: INFO: built huffman tree with maximum node depth 25
    2016-08-12 09:52:29,683: INFO: resetting layer weights
    2016-08-12 09:52:38,805: INFO: training model with 4 workers on 278291 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
    2016-08-12 09:52:49,504: INFO: PROGRESS: at 0.10% words, alpha 0.02500, 15008 words/s
    2016-08-12 09:52:51,935: INFO: PROGRESS: at 0.38% words, alpha 0.02500, 44434 words/s
    2016-08-12 09:52:54,779: INFO: PROGRESS: at 0.56% words, alpha 0.02500, 53965 words/s
    2016-08-12 09:52:57,240: INFO: PROGRESS: at 0.62% words, alpha 0.02491, 52116 words/s
    2016-08-12 09:52:58,823: INFO: PROGRESS: at 0.72% words, alpha 0.02494, 55804 words/s
    2016-08-12 09:53:03,649: INFO: PROGRESS: at 0.94% words, alpha 0.02486, 58277 words/s
    2016-08-12 09:53:07,357: INFO: PROGRESS: at 1.03% words, alpha 0.02479, 56036 words/s
    ......
    2016-08-12 19:22:09,002: INFO: PROGRESS: at 98.38% words, alpha 0.00044, 85936 words/s
    2016-08-12 19:22:10,321: INFO: PROGRESS: at 98.50% words, alpha 0.00044, 85971 words/s
    2016-08-12 19:22:11,934: INFO: PROGRESS: at 98.55% words, alpha 0.00039, 85940 words/s
    2016-08-12 19:22:13,384: INFO: PROGRESS: at 98.65% words, alpha 0.00036, 85960 words/s
    2016-08-12 19:22:13,883: INFO: training on 152625573 words took 1775.1s, 85982 words/s
    2016-08-12 19:22:13,883: INFO: saving Word2Vec object under wiki.zh.text.model, separately None
    2016-08-12 19:22:13,884: INFO: not storing attribute syn0norm
    2016-08-12 19:22:13,884: INFO: storing numpy array 'syn0' to wiki.zh.text.model.syn0.npy
    2016-08-12 19:22:20,797: INFO: storing numpy array 'syn1' to wiki.zh.text.model.syn1.npy
    2016-08-12 19:22:40,667: INFO: storing 278291x400 projection weights into wiki.zh.text.vector

Testing the model:

    In [1]: import gensim
    In [2]: model = gensim.models.Word2Vec.load("wiki.zh.text.model")
    In [3]: model.most_similar(u"足球")
    Out[3]:
    [(u'\u8054\u8d5b', 0.6553816199302673),
     (u'\u7532\u7ea7', 0.6530429720878601),
     (u'\u7bee\u7403', 0.5967546701431274),
     (u'\u4ff1\u4e50\u90e8', 0.5872289538383484),
     (u'\u4e59\u7ea7', 0.5840631723403931),
     (u'\u8db3\u7403\u961f', 0.5560152530670166),
     (u'\u4e9a\u8db3\u8054', 0.5308005809783936),
     (u'allsvenskan', 0.5249762535095215),
     (u'\u4ee3\u8868\u961f', 0.5214947462081909),
     (u'\u7532\u7ec4', 0.5177896022796631)]
    In [4]: result = model.most_similar(u"足球")
    In [5]: for e in result:
       ....:     print e[0], e[1]
       ....:
    联赛 0.65538161993
    甲级 0.653042972088
    篮球 0.596754670143
    俱乐部 0.587228953838
    乙级 0.58406317234
    足球队 0.556015253067
    亚足联 0.530800580978
    allsvenskan 0.52497625351
    代表队 0.521494746208
    甲组 0.51778960228
    In [6]: result = model.most_similar(u"男人")
    In [7]: for e in result:
       ....:     print e[0], e[1]
       ....:
    女人 0.77537125349
    家伙 0.617369174957
    妈妈 0.567102909088
    漂亮 0.560832381248
    잘했어 0.540875017643
    谎言 0.538448691368
    爸爸 0.53660941124
    傻瓜 0.535608053207
    예쁘다 0.535151124001
    mc刘 0.529670000076
    In [8]: result = model.most_similar(u"女人")
    In [9]: for e in result:
       ....:     print e[0], e[1]
       ....:
    男人 0.77537125349
    我的某 0.589010596275
    妈妈 0.576344847679
    잘했어 0.562340974808
    美丽 0.555426716805
    爸爸 0.543958246708
    新娘 0.543640494347
    谎言 0.540272831917
    妞儿 0.531066179276
    老婆 0.528521537781
    In [10]: result = model.most_similar(u"青蛙")
    In [11]: for e in result:
       ....:     print e[0], e[1]
       ....:
    老鼠 0.559612870216
    乌龟 0.489831030369
    蜥蜴 0.478990525007
    猫 0.46728849411
    鳄鱼 0.461885392666
    蟾蜍 0.448014199734
    猴子 0.436584025621
    白雪公主 0.434905380011
    蚯蚓 0.433413207531
    螃蟹 0.4314712286
    In [12]: result = model.most_similar(u"姨夫")
    In [13]: for e in result:
       ....:     print e[0], e[1]
       ....:
    堂伯 0.583935439587
    祖父 0.574735701084
    妃所生 0.569327116013
    内弟 0.562012672424
    早卒 0.558042645454
    曕 0.553856015205
    胤祯 0.553288519382
    陈潜 0.550716996193
    愔之 0.550510883331
    叔父 0.550032019615
    In [14]: result = model.most_similar(u"衣服")
    In [15]: for e in result:
       ....:     print e[0], e[1]
       ....:
    鞋子 0.686688780785
    穿着 0.672499775887
    衣物 0.67173999548
    大衣 0.667605519295
    裤子 0.662670075893
    内裤 0.662210345268
    裙子 0.659705817699
    西装 0.648508131504
    洋装 0.647238850594
    围裙 0.642895817757
    In [16]: result = model.most_similar(u"*局")
    In [17]: for e in result:
       ....:     print e[0], e[1]
       ....:
    司法局 0.730189085007
    *厅 0.634275555611
    * 0.612798035145
    房管局 0.597343325615
    商业局 0.597183346748
    军管会 0.59476184845
    体育局 0.59283208847
    财政局 0.588721752167
    戒毒所 0.575558543205
    新闻办 0.573395550251
    In [18]: result = model.most_similar(u"铁道部")
    In [19]: for e in result:
       ....:     print e[0], e[1]
       ....:
    盛光祖 0.565509021282
    交通部 0.548688530922
    批复 0.546967327595
    刘志军 0.541010737419
    立项 0.517836689949
    报送 0.510296344757
    计委 0.508456230164
    水利部 0.503531932831
    国务院 0.503227233887
    经贸委 0.50156635046
    In [20]: result = model.most_similar(u"清华大学")
    In [21]: for e in result:
       ....:     print e[0], e[1]
       ....:
    北京大学 0.763922810555
    化学系 0.724210739136
    物理系 0.694550514221
    数学系 0.684280991554
    中山大学 0.677202701569
    复旦 0.657914161682
    师范大学 0.656435549259
    哲学系 0.654701948166
    生物系 0.654403865337
    中文系 0.653147578239
    In [22]: result = model.most_similar(u"卫视")
    In [23]: for e in result:
       ....:     print e[0], e[1]
       ....:
    湖南 0.676812887192
    中文台 0.626506924629
    収蔵 0.621356606483
    黄金档 0.582251906395
    cctv 0.536769032478
    安徽 0.536752820015
    非同凡响 0.534517168999
    唱响 0.533438682556
    最强音 0.532605051994
    金鹰 0.531676828861
    In [24]: result = model.most_similar(u"习1*")  # note: the original blog deliberately obfuscated state leaders' names in this query and its results
    In [25]: for e in result:
       ....:     print e[0], e[1]
       ....:
    胡2* 0.809472680092
    江3泽民 0.754633367062
    李4克强 0.739740967751
    贾5庆林 0.737033963203
    曾6庆红 0.732847094536
    吴7邦国 0.726941585541
    总书记 0.719057679176
    李8瑞环 0.716384887695
    温9家宝 0.711952567101
    王10岐山 0.703570842743
    In [26]: result = model.most_similar(u"林丹")
    In [27]: for e in result:
       ....:     print e[0], e[1]
       ....:
    黄综翰 0.538035452366
    蒋燕皎 0.52646958828
    刘鑫 0.522252976894
    韩晶娜 0.516120731831
    王晓理 0.512289524078
    王适 0.508560419083
    杨影 0.508159279823
    陈跃 0.507353425026
    龚智超 0.503159761429
    李敬元 0.50262516737
    In [28]: result = model.most_similar(u"语言学")
    In [29]: for e in result:
       ....:     print e[0], e[1]
       ....:
    社会学 0.632598280907
    人类学 0.623406708241
    历史学 0.618442356586
    比较文学 0.604823827744
    心理学 0.600066184998
    人文科学 0.577783346176
    社会心理学 0.575571238995
    政治学 0.574541330338
    地理学 0.573896467686
    哲学 0.573873817921
    In [30]: result = model.most_similar(u"计算机")
    In [31]: for e in result:
       ....:     print e[0], e[1]
       ....:
    自动化 0.674171924591
    应用 0.614087462425
    自动化系 0.611132860184
    材料科学 0.607891201973
    集成电路 0.600370049477
    技术 0.597518980503
    电子学 0.591316461563
    建模 0.577238917351
    工程学 0.572855889797
    微电子 0.570086717606
    In [32]: model.similarity(u"计算机", u"自动化")
    Out[32]: 0.67417196002404789
    In [33]: model.similarity(u"女人", u"男人")
    Out[33]: 0.77537125129824813
    In [34]: model.doesnt_match(u"早餐 晚餐 午餐 中心".split())
    Out[34]: u'\u4e2d\u5fc3'
    In [35]: print model.doesnt_match(u"早餐 晚餐 午餐 中心".split())
    中心
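
The session above was recorded on Python 2 with the gensim version of the time. As a rough sketch (not from the original post), the same checks on Python 3 with gensim 4.x would look like the code below; most_similar, similarity, and doesnt_match now live on model.wv, and the exported text vectors can also be loaded on their own via KeyedVectors.

    # Rough Python 3 / gensim 4.x version of the queries above (an assumption, not from the original post)
    from gensim.models import Word2Vec
    from gensim.models import KeyedVectors

    model = Word2Vec.load("wiki.zh.text.model")  # full model saved by train_word2vec_model.py
    # or load only the exported plain-text vectors:
    # wv = KeyedVectors.load_word2vec_format("wiki.zh.text.vector", binary=False)

    for word, score in model.wv.most_similar(u"足球"):
        print(word, score)

    print(model.wv.similarity(u"女人", u"男人"))
    print(model.wv.doesnt_match(u"早餐 晚餐 午餐 中心".split()))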

Source: https://www.zybuluo.com/hanxiaoyang/note/472184