
Date: 2022-07-10 20:23:07

I am trying to load a pretrained Word2Vec (or GloVe) embedding in my TensorFlow code, but I have trouble understanding one step and cannot find many examples. The question is not about getting and loading the embedding matrix, which I understand, but about looking up the word ids. Currently I am using the code from https://ireneli.eu/2017/01/17/tensorflow-07-word-embeddings-2-loading-pre-trained-vectors/. There, the embedding matrix is loaded first (understood). Then a vocabulary processor is used to convert a sentence x to a list of word ids:


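import numpy as np
from tensorflow.contrib import learn  # tf.contrib exists only in TensorFlow 1.x

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
# fit the vocabulary on the word list read from the GloVe file
pretrain = vocab_processor.fit(vocab)
# transform the raw input sentences into rows of word ids
x = np.array(list(vocab_processor.transform(your_raw_input)))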

This works and gives me a list of word ids, but I do not know whether it is correct. What bothers me most is how the vocabulary processor gets the word ids that match the embedding I just read (otherwise the embedding lookup would return the wrong vectors). Does the fit step do this?


Or is there another way to do this lookup?


Thanks! Oliver


1 Answer



Yes, the fit step tells the vocab_processor the index of each word in the vocab array, starting from 1. transform then applies this mapping in the other direction: it produces the index for each word in the input and uses 0 to pad the output to max_document_length.


You can see that in a short example here:


import numpy as np
from tensorflow.contrib import learn  # tf.contrib exists only in TensorFlow 1.x

vocab_processor = learn.preprocessing.VocabularyProcessor(5)
vocab = ['a', 'b', 'c', 'd', 'e']
pretrain = vocab_processor.fit(vocab)

pretrain == vocab_processor
# True

np.array(list(pretrain.transform(['a b c', 'b c d', 'a e', 'a b c d e'])))

# array([[1, 2, 3, 0, 0],
#        [2, 3, 4, 0, 0],
#        [1, 5, 0, 0, 0],
#        [1, 2, 3, 4, 5]])
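
To make the link to the embedding matrix explicit, here is a minimal plain-Python sketch of the same mapping. It is not the VocabularyProcessor implementation: fit_vocab, transform and build_embedding_matrix are hypothetical helpers, the vectors are made up, and the sketch assumes the words in vocab are unique and split on whitespace. The point it illustrates is that, because ids start at 1 and 0 is used for padding, the ids only match the embedding rows if the matrix keeps the same word order as vocab and has one extra row at index 0.

```python
# Hypothetical helpers that mimic the behaviour described above; they are
# not the VocabularyProcessor implementation.

def fit_vocab(vocab):
    """Assign ids 1..N to words in the order they appear; 0 is kept for padding."""
    return {word: idx for idx, word in enumerate(vocab, start=1)}


def transform(sentences, word_to_id, max_document_length):
    """Map whitespace-separated words to ids, truncating or padding with 0."""
    rows = []
    for sentence in sentences:
        ids = [word_to_id.get(word, 0) for word in sentence.split()]
        ids = ids[:max_document_length]
        ids += [0] * (max_document_length - len(ids))
        rows.append(ids)
    return rows


def build_embedding_matrix(vocab, vectors):
    """Put a zero row at index 0 so that row i holds the vector of word id i."""
    dim = len(vectors[vocab[0]])
    return [[0.0] * dim] + [list(vectors[word]) for word in vocab]


vocab = ['a', 'b', 'c', 'd', 'e']
# Made-up 2-dimensional vectors standing in for the pretrained embedding.
vectors = {word: [float(i), float(i) * 10] for i, word in enumerate(vocab, start=1)}

word_to_id = fit_vocab(vocab)
ids = transform(['a b c', 'a e'], word_to_id, 5)
matrix = build_embedding_matrix(vocab, vectors)

print(ids)                      # [[1, 2, 3, 0, 0], [1, 5, 0, 0, 0]]
print(matrix[word_to_id['c']])  # [3.0, 30.0]
```

If the matrix is built without the extra row at index 0, every lookup is shifted by one word, so it is worth checking a few known words this way before training.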


