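1. Analysis
1) Chinese word segmentation: jieba (结巴分词)
2) Chinese-to-English mapping: Chinese Open Wordnet (汉语开放词网), which is available for download
3) Sentiment analysis: SentiWordNet, the sentiment lexicon built on WordNet
4) Stop words: a stop-word list taken from the web, with common punctuation marks added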
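Item 4 can be sketched on its own, without the segmenter. The snippet below is a minimal illustration, not the code used in this post: `filter_tokens`, the `PUNCTUATION` set, and the sample tokens are all hypothetical, and the punctuation set is deliberately incomplete.

```python
import string

# ASCII punctuation plus a few common full-width Chinese marks
# (an illustrative, incomplete set).
PUNCTUATION = set(string.punctuation) | set("，。！？；：、（）《》“”‘’…")


def filter_tokens(tokens, stopwords):
    """Drop stop words, punctuation, and whitespace-only tokens, keeping order."""
    kept = []
    for tok in tokens:
        tok = tok.strip()
        if not tok or tok in stopwords or tok in PUNCTUATION:
            continue
        kept.append(tok)
    return kept


# Hypothetical segmenter output for "我喜欢这部电影。"
tokens = ["我", "喜欢", "这部", "电影", "。", "\n"]
stopwords = {"我", "这部"}
print(filter_tokens(tokens, stopwords))  # ['喜欢', '电影']
```

Keeping punctuation in a separate set means the stop-word file itself does not have to list every mark.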
2. Code
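# encoding=utf-8
import sys

import jieba
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn

# Set this to the downloaded Chinese Open Wordnet data file
# (tab-separated: synset ID, lemma, optional status).
WORDNET_FILE = "./"


def doSeg(filename):
    """Segment a UTF-8 text file with jieba and drop stop words and blanks."""
    with open(filename, "r", encoding="utf-8") as f:
        text = f.read()
    # One stop word (or punctuation mark) per line.
    with open("./stop_words.txt", "r", encoding="utf-8") as f:
        stopwords = set(line.strip() for line in f)
    ll = []
    for seg in jieba.cut(text):
        if seg not in stopwords and seg.strip():
            ll.append(seg)
    return ll


def loadWordNet():
    """Load (synset ID, lemma) pairs from the Chinese wordnet file."""
    known = set()
    with open(WORDNET_FILE, "r", encoding="utf-8") as f:
        for l in f:
            # Skip comments and blank lines.
            if l.startswith("#") or not l.strip():
                continue
            row = l.strip().split("\t")
            if len(row) == 3:
                (synset, lemma, status) = row
            elif len(row) == 2:
                (synset, lemma) = row
                status = "Y"
            else:
                print("illformed line: ", l.strip())
                continue
            # Keep only entries whose status is 'Y' or 'O'.
            if status in ["Y", "O"]:
                known.add((synset.strip(), lemma.strip()))
    return known


def findWordNet(known, key):
    """Return the IDs of every synset that lists `key` as a lemma."""
    ll = []
    for kk in known:
        if kk[1] == key:
            ll.append(kk[0])
    return ll


def id2ss(ID):
    """Turn an ID (8-digit offset, then the POS letter last) into an NLTK synset."""
    return wn.synset_from_pos_and_offset(str(ID[-1:]), int(ID[:8]))


def getSenti(word):
    """Look up the SentiWordNet entry for an NLTK synset."""
    return swn.senti_synset(word.name())


if __name__ == "__main__":
    known = loadWordNet()
    words = doSeg(sys.argv[1])
    n = 0
    p = 0
    for word in words:
        ll = findWordNet(known, word)
        if len(ll) != 0:
            n1 = 0.0
            p1 = 0.0
            # Sum the scores of every synset the word belongs to.
            for wid in ll:
                desc = id2ss(wid)
                swninfo = getSenti(desc)
                p1 = p1 + swninfo.pos_score()
                n1 = n1 + swninfo.neg_score()
            if p1 != 0.0 or n1 != 0.0:
                print(word, "-> n ", (n1 / len(ll)), ", p ", (p1 / len(ll)))
            # Add the word's average score to the document totals.
            p = p + p1 / len(ll)
            n = n + n1 / len(ll)
    print("n", n, ", p", p)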
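`findWordNet` scans the whole `known` set once for every word in the text. One possible alternative, sketched here under assumptions, is to build a lemma-to-synset index once and look words up in it. The helper names `build_lemma_index` and `parse_synset_id` are hypothetical, and the IDs below are made up; they only follow the shape the code above relies on (an 8-digit offset, a hyphen, and a POS letter).

```python
from collections import defaultdict


def parse_synset_id(synset_id):
    """Split an ID of the form '<8-digit offset>-<pos>' into (pos, offset)."""
    offset, pos = synset_id.split("-")
    return pos, int(offset)


def build_lemma_index(pairs):
    """Map each lemma to the sorted list of synset IDs it belongs to."""
    index = defaultdict(set)
    for synset_id, lemma in pairs:
        index[lemma].add(synset_id)
    return {lemma: sorted(ids) for lemma, ids in index.items()}


# Made-up (synset ID, lemma) pairs in the shape loadWordNet() returns.
known = {("00000001-a", "好"), ("00000002-n", "好"), ("00000003-n", "电影")}
index = build_lemma_index(known)
print(index.get("好", []))            # ['00000001-a', '00000002-n']
print(parse_synset_id("00000002-n"))  # ('n', 2)
```

With the index in place, each lookup is a single dictionary access, and a word that is missing from the wordnet simply yields an empty list.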
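3. Open problems
1) The segments produced by jieba do not map one-to-one onto the words in Chinese Open Wordnet
2) Polysemy and synonymy: one word can have several senses, and one sense can be expressed by several words
3) Semantic issues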
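Problem 2 shows up directly in the numbers, because the main loop averages the scores of every synset a word belongs to. The sketch below mirrors that averaging on made-up data: `average_scores` is a hypothetical helper and the scores are invented for illustration, not taken from SentiWordNet.

```python
def average_scores(sense_scores):
    """Average (pos, neg) pairs over all senses of one word."""
    if not sense_scores:
        return 0.0, 0.0
    count = len(sense_scores)
    pos = sum(p for p, _ in sense_scores) / count
    neg = sum(n for _, n in sense_scores) / count
    return pos, neg


# Invented scores for a word with four senses, only one of them evaluative.
senses = [(0.75, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
print(average_scores(senses))  # (0.1875, 0.0)
```

A single strongly positive sense is diluted to a quarter of its score, even when that sense is the one the text actually uses.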
4. References
1) Learning lexical scales: WordNet and SentiWordNet
2) SentiWordNet Interface