I am using the pre-trained Google News vectors to get word vectors with the Gensim library in Python.
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
After loading the model, I convert the words of each training review sentence into vectors:
#reading all sentences from training file
with open('restaurantSentences', 'r') as infile:
    x_train = infile.readlines()
#cleaning sentences
x_train = [review_to_wordlist(review,remove_stopwords=True) for review in x_train]
train_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])
During the word2vec step I get a lot of errors for words in my corpus that are not in the model. The problem is: how can I retrain an already pre-trained model (e.g. GoogleNews-vectors-negative300.bin) in order to get word vectors for those missing words?
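For the lookup errors themselves, a minimal sketch of one common workaround is to skip words the model does not know and average only the vectors that were found. The question's own buildWordVector is not shown, so build_word_vector below is a hypothetical stand-in, and a plain dict with made-up 3-dimensional vectors takes the place of the loaded model.

```python
def build_word_vector(tokens, vectors, n_dim):
    """Average the vectors of the tokens that are present in `vectors`.

    `vectors` is any mapping from word to a sequence of floats. A plain
    dict stands in for the loaded model here (an assumption made for
    illustration). Returns (mean_vector, missing_tokens).
    """
    total = [0.0] * n_dim
    found = 0
    missing = []
    for token in tokens:
        if token in vectors:
            vec = vectors[token]
            for i in range(n_dim):
                total[i] += vec[i]
            found += 1
        else:
            # Out-of-vocabulary word: record it instead of raising KeyError.
            missing.append(token)
    if found:
        total = [value / found for value in total]
    return total, missing


# Toy vectors, made up for this example.
toy_vectors = {
    "food": [1.0, 0.0, 2.0],
    "rice": [3.0, 2.0, 0.0],
}
mean_vec, missing = build_word_vector(["food", "rice", "biryanii"], toy_vectors, 3)
print(mean_vec)  # [2.0, 1.0, 1.0]
print(missing)   # ['biryanii']
```

This does not give the missing words a vector, it only keeps the pipeline running and tells you which words are affected.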
Here is what I have tried. First, I trained a new model from the training sentences that I had:
# Set values for various parameters
num_features = 300 # Word vector dimensionality
min_word_count = 10 # Minimum word count
num_workers = 4 # Number of threads to run in parallel
context = 10 # Context window size
downsampling = 1e-3 # Downsample setting for frequent words
sentences = gensim.models.word2vec.LineSentence("restaurantSentences")
# Initialize and train the model (this will take some time)
print("Training model...")
model = gensim.models.Word2Vec(sentences, workers=num_workers, size=num_features,
                               min_count=min_word_count, window=context,
                               sample=downsampling)
model.build_vocab(sentences)
model.train(sentences)
model.n_similarity(["food"], ["rice"])
It worked! But the problem is that I have a really small dataset and limited resources to train a large model.
The second approach I am looking at is to extend an already trained model such as GoogleNews-vectors-negative300.bin:
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
sentences = gensim.models.word2vec.LineSentence("restaurantSentences")
model.train(sentences)
Is this possible, and is it a good approach? Please help me out.
3 solutions
#1
1
Some folks have been working on extending gensim to allow online training.
A couple of GitHub pull requests you might want to watch for progress on that effort:
- https://github.com/piskvorky/gensim/pull/435
- https://github.com/piskvorky/gensim/pull/615
It looks like this improvement could make it possible to update the GoogleNews-vectors-negative300.bin model.
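To make the idea concrete, here is a rough sketch of what an online vocabulary update amounts to: words that already have a vector keep it, and words seen for the first time get a fresh small random vector that later training can adjust. This is not gensim code. extend_vocabulary is a hypothetical helper, the dict stands in for a model, and the initialization scheme is an assumption.

```python
import random


def extend_vocabulary(vectors, new_sentences, n_dim, seed=0):
    """Add a small random vector for every word not already in `vectors`.

    Existing entries are left untouched. Returns the added words in the
    order they were first seen.
    """
    rng = random.Random(seed)
    added = []
    for sentence in new_sentences:
        for word in sentence:
            if word not in vectors:
                # Small random start values (assumed scheme, for illustration).
                vectors[word] = [(rng.random() - 0.5) / n_dim for _ in range(n_dim)]
                added.append(word)
    return added


# Toy "pre-trained" vectors, made up for this example.
pretrained = {"food": [0.1, 0.2, 0.3], "rice": [0.4, 0.5, 0.6]}
new_sentences = [["food", "paneer"], ["rice", "naan", "paneer"]]
added = extend_vocabulary(pretrained, new_sentences, n_dim=3)
print(added)               # ['paneer', 'naan']
print(pretrained["food"])  # [0.1, 0.2, 0.3]
```

The new vectors only become meaningful after further training on sentences that contain those words.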
#2
1
This is how I technically solved the issue:
Preparing the input data with the sentence iterable from Radim Rehurek's tutorial: https://rare-technologies.com/word2vec-tutorial/
sentences = MySentences('newcorpus')
Setting up the model
model = gensim.models.Word2Vec(sentences)
Intersecting the vocabulary with the Google word vectors
model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin',
                                lockf=1.0,
                                binary=True)
Finally, training the model to update the vectors
model.train(sentences)
A note of warning: from a substantive point of view, it is of course highly debatable whether a corpus that is likely to be very small can actually "improve" the Google word vectors, which were trained on a massive corpus.
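For readers wondering what the intersection step does, here is a conceptual sketch in plain Python, not gensim's implementation. As far as I understand it, words that exist in both vocabularies take over the pre-trained weights, words that only exist in the new corpus keep their own vectors, and no words are added from the pre-trained file. The lock factor decides whether the imported vectors may still change during training (1.0) or stay frozen (0.0). intersect_vectors and the toy dicts are hypothetical.

```python
def intersect_vectors(local, pretrained, lockf=1.0):
    """Overwrite local vectors with pre-trained ones for shared words.

    Returns a dict of per-word lock factors: imported words get `lockf`,
    all other words stay fully trainable (1.0).
    """
    lock = {}
    for word in list(local):
        if word in pretrained:
            if len(pretrained[word]) != len(local[word]):
                # Both models must use the same dimensionality.
                raise ValueError("vector sizes differ for %r" % word)
            local[word] = list(pretrained[word])
            lock[word] = lockf
        else:
            lock[word] = 1.0
    return lock


# Toy 2-dimensional vectors, made up for this example.
local = {"food": [0.0, 0.0], "paneer": [0.01, -0.02]}
pretrained = {"food": [0.5, 0.7], "news": [0.9, 0.1]}
lock = intersect_vectors(local, pretrained, lockf=1.0)
print(local["food"])    # [0.5, 0.7]
print(local["paneer"])  # [0.01, -0.02]
print(sorted(local))    # ['food', 'paneer']
```

The size check is worth noting: the Google News vectors are 300-dimensional, so the model built from the new corpus would presumably need the same dimensionality for the intersection to work.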
#3
0
It is possible if the model builder didn't finalize the model training. In Python, finalizing is done with:
model.init_sims(replace=True)  # finalize the model
If the model was not finalized, this is a good way to get a model backed by a large dataset.
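A small sketch of why finalizing matters, assuming it means what gensim's init_sims(replace=True) does to my knowledge: every vector is L2-normalized and the original values are thrown away. The code below is plain Python with made-up vectors, and normalize_in_place is a hypothetical helper, not a gensim function.

```python
import math


def normalize_in_place(vectors):
    """Replace every vector with its unit-length version."""
    for word in list(vectors):
        vec = vectors[word]
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            vectors[word] = [x / norm for x in vec]


# Two vectors with the same direction but different lengths (5 and 10).
vectors = {"food": [3.0, 4.0], "rice": [6.0, 8.0]}
normalize_in_place(vectors)
print(vectors["food"])  # [0.6, 0.8]
print(vectors["rice"])  # [0.6, 0.8]
```

After this step the original magnitudes are gone, which is why a finalized model is fine for similarity queries but is not a sensible starting point for further training.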