Word Embedding Source Code Analysis: (5.2) ngram2vec source code, the uni_uni script


The workflow of the ngram2vec toolkit is very similar to that of hyperwords: a series of Python scripts that, step by step, turn a corpus into word embeddings. So we will first look at the overall execution flow through the shell scripts. There are three of them: uni_uni, uni_bi and bi_bi. Each one runs the complete pipeline; they differ only in which features they use. The part before the underscore names the features used for the center word, and the part after the underscore names the features used for the context. For example, uni_bi means the center word uses plain unigram (single-word) features, while the context uses unigram features plus bigram features. Let us first look at the contents of uni_uni.sh.
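To make the naming concrete, here is a small illustration (not code from the toolkit; the pair-extraction rules are simplified and the bigram joining with "_" is chosen only for display) of which contexts a uni_uni setting and a uni_bi setting would extract for one center word:

# Toy illustration: contexts for the center word "sat" in a short sentence, win=2.
sentence = "the cat sat on the mat".split()
center_pos, win = 2, 2

# uni_uni: both the center word and its contexts are plain unigrams.
uni_contexts = [sentence[i]
                for i in range(center_pos - win, center_pos + win + 1)
                if 0 <= i < len(sentence) and i != center_pos]

# uni_bi: the center word stays a unigram, but the context additionally contains
# the bigrams that fall inside the window (joined with "_" purely for display).
bi_contexts = ["_".join(sentence[i:i + 2])
               for i in range(center_pos - win, center_pos + win)
               if i >= 0 and i + 1 < len(sentence)]

print("uni_uni contexts:", uni_contexts)              # ['the', 'cat', 'on', 'the']
print("uni_bi  contexts:", uni_contexts + bi_contexts)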

#!/bin/sh

#First, some parameter settings: win is the context window size, size is the dimensionality of the word vectors, thr is the threshold for filtering low-frequency words, sub is the subsampling parameter (see the short sketch after this parameter block), iters is the number of word2vec training iterations, threads is the number of threads (many of the programs below are multi-threaded), negative is the number of negative samples for word2vec, and memsize is the memory budget: the toolkit is designed to produce word vectors within a limited amount of memory.
win=2
size=300
thr=100
sub=1e-3
iters=3
threads=8
negative=5
memsize=32.0
corpus=wiki2010.clean
output_path=outputs/uni_uni/win${win}
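Regarding the sub parameter: subsampling randomly drops occurrences of very frequent words when pairs are extracted, which both speeds up training and tends to improve vector quality. A minimal sketch of the standard word2vec subsampling rule (ngram2vec's corpus2pairs.py may implement a slightly different variant):

import math, random

def keep_occurrence(word_count, total_count, t=1e-3):
    # Drop an occurrence of a word with probability 1 - sqrt(t / f),
    # where f is the word's relative frequency and t is the "sub" threshold above.
    f = word_count / total_count
    discard_prob = max(0.0, 1.0 - math.sqrt(t / f))
    return random.random() >= discard_prob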

#Word vectors from the different models are stored under different paths
mkdir -p ${output_path}/sgns
mkdir -p ${output_path}/ppmi
mkdir -p ${output_path}/svd
mkdir -p ${output_path}/glove

#Next, run corpus2vocab and corpus2pairs; the latter uses the former's output (the vocabulary). corpus2pairs is multi-threaded and writes several partial files, which are concatenated at the end into the complete pairs file (a simplified sketch of the pair extraction follows the concatenation loop below).
python ngram2vec/corpus2vocab.py --ngram 1 --memory_size ${memsize} --min_count ${thr} ${corpus} ${output_path}/vocab
python ngram2vec/corpus2pairs.py --win ${win} --sub ${sub} --ngram_word 1 --ngram_context 1 --threads_num ${threads} ${corpus} ${output_path}/vocab ${output_path}/pairs
#concatenate pair files 
if [ -f "${output_path}/pairs" ]
then
rm ${output_path}/pairs
fi
for i in $(seq 0 $((${threads}-1)) )
do
cat ${output_path}/pairs_${i} >> ${output_path}/pairs
rm ${output_path}/pairs_${i}
done
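Ignoring subsampling, dynamic-window tricks and the multi-threaded splitting into the pairs_0 ... pairs_N partial files, the core of corpus2pairs.py for the uni_uni case can be pictured like this (a simplified sketch, not the actual implementation):

# Simplified sketch: emit one "center context" pair per line, as in the pairs file.
def extract_pairs(tokens, win=2):
    for i, center in enumerate(tokens):
        for j in range(max(0, i - win), min(len(tokens), i + win + 1)):
            if j != i:
                yield center, tokens[j]

for w, c in extract_pairs("the cat sat on the mat".split()):
    print(w, c)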

#Once the pairs file exists, build the center-word vocabulary and the context vocabulary from it
#generate (center) word vocabulary and context vocabulary, which are used as vocabulary files for all models
python ngram2vec/pairs2vocab.py ${output_path}/pairs ${output_path}/words.vocab ${output_path}/contexts.vocab
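pairs2vocab.py simply counts how often each word occurs in the first (center) column and in the second (context) column of the pairs file, so both vocabularies reflect exactly the data the models will be trained on. The idea, in a few lines (a sketch assuming the one-pair-per-line format above; file paths are illustrative):

from collections import Counter

word_counts, context_counts = Counter(), Counter()
with open("pairs") as f:
    for line in f:
        w, c = line.split()
        word_counts[w] += 1
        context_counts[c] += 1
# words.vocab and contexts.vocab then list the entries, typically sorted by frequency.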

#word2vecf (SGNS) is trained on the pairs
#SGNS, learn representation upon pairs
./word2vecf/word2vecf -train ${output_path}/pairs -pow 0.75 -cvocab ${output_path}/contexts.vocab -wvocab ${output_path}/words.vocab -dumpcv ${output_path}/sgns/sgns.contexts -output ${output_path}/sgns/sgns.words -threads ${threads} -negative ${negative} -size ${size} -iters ${iters}

#The code below evaluates the vectors produced by word2vecf: the embeddings are first converted from text format to numpy array format, then evaluated on analogy and word-similarity tasks.
#SGNS evaluation
cp ${output_path}/words.vocab ${output_path}/sgns/sgns.words.vocab
cp ${output_path}/contexts.vocab ${output_path}/sgns/sgns.contexts.vocab
python ngram2vec/text2numpy.py ${output_path}/sgns/sgns.words
python ngram2vec/text2numpy.py ${output_path}/sgns/sgns.contexts
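text2numpy.py only changes the storage format: the text embedding file (typically a header line followed by one word and its vector per line) becomes a dense numpy matrix plus a vocabulary file, which makes the evaluation scripts below much faster to run repeatedly. Roughly (a sketch, not the actual implementation; the path is illustrative):

import numpy as np

vocab, rows = [], []
with open("sgns.words") as f:
    next(f)                                   # skip a "vocab_size dim" header line, if present
    for line in f:
        parts = line.rstrip().split(" ")
        vocab.append(parts[0])
        rows.append([float(x) for x in parts[1:]])
np.save("sgns.words.npy", np.asarray(rows, dtype=np.float32))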
for dataset in testsets/analogy/google.txt testsets/analogy/semantic.txt testsets/analogy/syntactic.txt testsets/analogy/msr.txt
do
python ngram2vec/analogy_eval.py SGNS ${output_path}/sgns/sgns ${dataset}
done
for dataset in testsets/ws/ws353_similarity.txt testsets/ws/ws353_relatedness.txt testsets/ws/bruni_men.txt testsets/ws/radinsky_mturk.txt testsets/ws/luong_rare.txt testsets/ws/sim999.txt
do
python ngram2vec/ws_eval.py SGNS ${output_path}/sgns/sgns ${dataset}
done
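For reference, the analogy evaluation answers questions of the form "a is to b as c is to ?" by ranking all words against the vector b - a + c (the 3CosAdd scheme; analogy_eval.py may also support variants such as 3CosMul), while the word-similarity evaluation computes the Spearman correlation between human similarity scores and the cosine similarities of the vectors. A minimal 3CosAdd sketch, assuming a row-normalized embedding matrix W, a word-to-row dict w2i and the reverse list i2w:

import numpy as np

def answer_analogy(a, b, c, W, w2i, i2w):
    # With L2-normalized rows, the dot product below equals
    # cos(x, b) - cos(x, a) + cos(x, c) for every candidate word x.
    scores = W.dot(W[w2i[b]] - W[w2i[a]] + W[w2i[c]])
    for idx in np.argsort(-scores):
        if i2w[idx] not in (a, b, c):         # never return one of the question words
            return i2w[idx]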

#Next, build the counts, i.e. the co-occurrence matrix, from the pairs. GloVe, PPMI and SVD all derive word vectors from this co-occurrence matrix
#generate co-occurrence matrix from pairs
python ngram2vec/pairs2counts.py --memory_size ${memsize} ${output_path}/pairs ${output_path}/words.vocab ${output_path}/contexts.vocab ${output_path}/counts
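pairs2counts.py aggregates identical (word, context) pairs into sparse co-occurrence triples, and the --memory_size option makes it do the aggregation in chunks that are merged afterwards, so the full count table never has to sit in memory at once. Ignoring the chunking, the aggregation itself is just (a sketch; the exact on-disk format of the counts file may differ):

from collections import Counter

counts = Counter()
with open("pairs") as f:                      # illustrative path
    for line in f:
        counts[tuple(line.split())] += 1
with open("counts", "w") as out:
    for (w, c), n in counts.items():
        out.write("%s %s %d\n" % (w, c, n))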

#Build the PPMI matrix from the co-occurrence matrix
#PPMI, learn representation upon counts (co-occurrence matrix)
python ngram2vec/counts2ppmi.py ${output_path}/words.vocab ${output_path}/contexts.vocab ${output_path}/counts ${output_path}/ppmi/ppmi
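PPMI is positive pointwise mutual information: for each cell of the co-occurrence matrix, PMI(w, c) = log( #(w,c) * |D| / (#(w) * #(c)) ), where |D| is the total number of pairs, and negative values are clipped to zero. The real counts2ppmi.py works on scipy sparse matrices (and, like hyperwords, may support refinements such as context-distribution smoothing), but cell by cell the computation is just:

import math

def ppmi_cell(n_wc, n_w, n_c, n_total):
    # n_wc: co-occurrence count of (w, c); n_w, n_c: marginal counts; n_total: |D|
    pmi = math.log(n_wc * n_total / (n_w * n_c))
    return max(0.0, pmi)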

#Evaluate the PPMI word representations
#PPMI evaluation
cp ${output_path}/words.vocab ${output_path}/ppmi/ppmi.words.vocab
cp ${output_path}/contexts.vocab ${output_path}/ppmi/ppmi.contexts.vocab
for dataset in testsets/analogy/google.txt testsets/analogy/semantic.txt testsets/analogy/syntactic.txt testsets/analogy/msr.txt
do
python ngram2vec/analogy_eval.py PPMI ${output_path}/ppmi/ppmi ${dataset}
done
for dataset in testsets/ws/ws353_similarity.txt testsets/ws/ws353_relatedness.txt testsets/ws/bruni_men.txt testsets/ws/radinsky_mturk.txt testsets/ws/luong_rare.txt testsets/ws/sim999.txt
do
python ngram2vec/ws_eval.py PPMI ${output_path}/ppmi/ppmi ${dataset}
done

#Factorize the PPMI matrix with SVD to obtain the SVD word representations
#SVD, factorize PPMI matrix
python ngram2vec/ppmi2svd.py ${output_path}/ppmi/ppmi ${output_path}/svd/svd 
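ppmi2svd.py factorizes the sparse PPMI matrix with a truncated SVD and keeps only the top singular vectors; the word vectors are then typically taken from U, optionally scaled by a power of the singular values. A self-contained sketch with scipy (the toolkit may rely on a different SVD backend, such as the sparsesvd package used by hyperwords):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy stand-in for the PPMI matrix (rows: words, columns: contexts).
ppmi = csr_matrix(np.random.rand(100, 80))
k = 10                                        # "size" in the script, e.g. 300

u, s, vt = svds(ppmi, k=k)
word_vectors = u * (s ** 0.5)                 # a common singular-value weighting (p = 0.5)
context_vectors = vt.T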

#Evaluate the SVD word representations
#SVD evaluation
cp ${output_path}/words.vocab ${output_path}/svd/svd.words.vocab
cp ${output_path}/contexts.vocab ${output_path}/svd/svd.contexts.vocab
for dataset in testsets/analogy/google.txt testsets/analogy/semantic.txt testsets/analogy/syntactic.txt testsets/analogy/msr.txt
do
python ngram2vec/analogy_eval.py SVD ${output_path}/svd/svd ${dataset}
done
for dataset in testsets/ws/ws353_similarity.txt testsets/ws/ws353_relatedness.txt testsets/ws/bruni_men.txt testsets/ws/radinsky_mturk.txt testsets/ws/luong_rare.txt testsets/ws/sim999.txt
do
python ngram2vec/ws_eval.py SVD ${output_path}/svd/svd ${dataset}
done

#glovef (the modified GloVe) needs the co-occurrence matrix shuffled and converted to binary format as its input
#GloVe, learn representation upon counts (co-occurrence matrix)
python ngram2vec/counts2shuf.py ${output_path}/counts ${output_path}/counts.shuf
python ngram2vec/counts2bin.py ${output_path}/counts.shuf ${output_path}/counts.shuf.bin
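The GloVe trainer does not read text triples: it expects a shuffled stream of fixed-size binary records (the CREC struct in the GloVe sources: two integer indices followed by a double-precision co-occurrence value). counts2shuf.py shuffles the triples and counts2bin.py packs them into that layout. Roughly, and under the assumption that the triples already carry the integer indices GloVe expects (the real counts2bin.py may map words to indices itself):

import struct

with open("counts.shuf") as fin, open("counts.shuf.bin", "wb") as fout:
    for line in fin:
        w, c, val = line.split()
        # Native-order CREC record: int word index, int context index, double value.
        fout.write(struct.pack("iid", int(w), int(c), float(val)))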

#Train glovef
./GloVe/build/glove -save-file ${output_path}/glove/glove.words -threads ${threads} -input-file ${output_path}/counts.shuf.bin -vector-size ${size} -words-file ${output_path}/words.vocab -contexts-file ${output_path}/contexts.vocab 

#Convert the vectors produced by glovef from text format to numpy array format
cp ${output_path}/words.vocab ${output_path}/glove/glove.words.vocab
python ngram2vec/text2numpy.py ${output_path}/glove/glove.words

#Evaluate glovef
for dataset in testsets/analogy/google.txt testsets/analogy/semantic.txt testsets/analogy/syntactic.txt testsets/analogy/msr.txt
do
python ngram2vec/analogy_eval.py GLOVE ${output_path}/glove/glove ${dataset}
done
for dataset in testsets/ws/ws353_similarity.txt testsets/ws/ws353_relatedness.txt testsets/ws/bruni_men.txt testsets/ws/radinsky_mturk.txt testsets/ws/luong_rare.txt testsets/ws/sim999.txt
do
python ngram2vec/ws_eval.py GLOVE ${output_path}/glove/glove ${dataset}
done

The remaining two shell scripts are almost identical to uni_uni.sh; only the corpus2pairs part changes, and everything after it is exactly the same. This also shows that the corpus2pairs stage determines the source of information for all of the models: the window size, how high- and low-frequency words are handled, and which features are used are all decided there. Because the toolkit has such a clear pipeline, it is very easy to extend: if we want to introduce a new kind of feature, we only need to change the corpus2pairs stage.