Understanding tf.nn.nce_loss() in TensorFlow

Posted: 2021-11-14 12:51:01

I am trying to understand the NCE loss function in TensorFlow. The NCE loss is used for a word2vec task, for instance:

# Look up embeddings for inputs.
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# Construct the variables for the NCE loss
nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Compute the average NCE loss for the batch.
# tf.nce_loss automatically draws a new sample of the negative labels each
# time we evaluate the loss.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))

For more details, please refer to TensorFlow's word2vec_basic.py.

  1. What are the input and output matrices in the NCE function?

In a word2vec model, we are interested in building representations for words. In the training process, given a sliding window, every word will have two embeddings: 1) when the word is a centre word; 2) when the word is a context word. These two embeddings are called the input and output vectors, respectively (more explanations of input and output matrices).

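For example, each word id simply indexes one row in each matrix. A small sketch reusing the variables from the snippet above (the word id is made up for illustration):

word_id = 42  # hypothetical word id, just for illustration

# Row 42 of `embeddings`: the word's vector when it acts as the centre word.
input_vector = tf.nn.embedding_lookup(embeddings, word_id)

# Row 42 of `nce_weights`: the word's vector when it acts as a context word.
output_vector = tf.nn.embedding_lookup(nce_weights, word_id)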

In my opinion, the input matrix is embeddings and the output matrix is nce_weights. Is it right?


  2. What is the final embedding?

According to a post by s0urcer, also relating to NCE, the final embedding matrix is just the input matrix. Others, however, say that final_embedding = input_matrix + output_matrix. Which is right, or more common?

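Both conventions are easy to realise. A rough sketch using the variables from the snippet above (the summed/averaged variant is just one common choice, not the only one):

# Option 1: use the input matrix only (what the TensorFlow example does).
final_embeddings = embeddings

# Option 2: combine the input and output matrices, e.g. by summing or averaging.
final_embeddings = embeddings + nce_weights            # sum
# final_embeddings = (embeddings + nce_weights) / 2.0  # or average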

5 Answers

#1


8  

Let's look at the relevant code in the word2vec example (examples/tutorials/word2vec).

embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

These two lines create the embedding representations. embeddings is a matrix in which each row represents a word vector. embedding_lookup is a quick way to get the vectors corresponding to train_inputs. In the word2vec example, train_inputs consists of int32 numbers representing the ids of the target words. Basically, embed plays the role of a hidden-layer feature and can be replaced by one.

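To make the lookup concrete, embedding_lookup here is essentially a row gather. A small sketch with made-up ids, reusing the embeddings variable above:

train_inputs = tf.constant([3, 0, 7], dtype=tf.int32)     # hypothetical word ids
embed = tf.nn.embedding_lookup(embeddings, train_inputs)  # shape [3, embedding_size]

# Equivalent to selecting rows 3, 0 and 7 of `embeddings`:
same_rows = tf.gather(embeddings, train_inputs)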

# Construct the variables for the NCE loss
nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

These two lines create the parameters. They will be updated by the optimizer during training. We can use tf.matmul(embed, tf.transpose(nce_weights)) + nce_biases to get the final output scores. In other words, the last inner-product layer of a classifier can be replaced by it.

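For instance, at evaluation time one could score every word in the vocabulary with that expression and turn the scores into probabilities or predictions. A sketch, not part of the original example:

# Scores over the whole vocabulary, shape [batch_size, vocabulary_size].
logits = tf.matmul(embed, tf.transpose(nce_weights)) + nce_biases

# If probabilities or predictions are needed at test time:
probs = tf.nn.softmax(logits)
predicted_ids = tf.argmax(logits, axis=1)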

loss = tf.reduce_mean(
      tf.nn.nce_loss(weights=nce_weights,     # [vocab_size, embed_size]
                   biases=nce_biases,         # [vocab_size]
                   labels=train_labels,       # [bs, 1]
                   inputs=embed,              # [bs, embed_size]
                   num_sampled=num_sampled, 
                   num_classes=vocabulary_size))

These lines create the NCE loss; @garej has given a very good explanation of it. num_sampled refers to the number of negative samples drawn in the NCE algorithm.


To illustrate the usage of NCE, we can apply it to the MNIST example (examples/tutorials/mnist/mnist_deep.py) with the following 2 steps:

Replace embed with the hidden layer output. The dimension of the hidden layer is 1024 and num_output is 10. The minimum value of num_sampled is 1. Remember to remove the last inner-product layer in deepnn().

y_conv, keep_prob = deepnn(x)                                            

num_sampled = 1                                                          
vocabulary_size = 10                                                     
embedding_size = 1024                                                    
with tf.device('/cpu:0'):                                                
  embed = y_conv                                                         
  # Construct the variables for the NCE loss                             
  nce_weights = tf.Variable(                                             
      tf.truncated_normal([vocabulary_size, embedding_size],             
                          stddev=1.0 / math.sqrt(embedding_size)))       
  nce_biases = tf.Variable(tf.zeros([vocabulary_size])) 

Create the loss and compute the output. After computing the output, we can use it to calculate the accuracy. Note that the labels here are not one-hot vectors as used with softmax; they are the original class labels of the training samples.

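Note that y_idx used below does not exist in the original MNIST script; assuming y_ is the usual one-hot label placeholder, it could be derived roughly like this:

# Convert one-hot labels of shape [batch_size, 10] into class indices
# of shape [batch_size, 1], which is what tf.nn.nce_loss expects.
y_idx = tf.reshape(tf.argmax(y_, axis=1), [-1, 1])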

loss = tf.reduce_mean(                                   
    tf.nn.nce_loss(weights=nce_weights,                           
                   biases=nce_biases,                             
                   labels=y_idx,                                  
                   inputs=embed,                                  
                   num_sampled=num_sampled,                       
                   num_classes=vocabulary_size))                  

output = tf.matmul(y_conv, tf.transpose(nce_weights)) + nce_biases
correct_prediction = tf.equal(tf.argmax(output, 1), tf.argmax(y_, 1))

When we set num_sampled=1, the validation accuracy ends up at around 98.8%, and if we set num_sampled=9, we get almost the same validation accuracy as training with softmax. But note that NCE is different from softmax.

The full code for training MNIST with NCE can be found here. Hope it is helpful.

#2


1  

The embeddings Tensor is your final output matrix. It maps words to vectors. Use this in your word prediction graph.


The input matrix is a batch of centre-word : context-word pairs (train_input and train_label respectively) generated from the training text.


While the exact workings of the nce_loss op are not yet known to me, the basic idea is that it uses a single-layer network (with parameters nce_weights and nce_biases) to map an input vector (selected from embeddings using the embed op) to an output word. It then compares the output to the training label (an adjacent word in the training text) and also to a random sub-sample (num_sampled) of all the other words in the vocab, and then modifies the input vector (stored in embeddings) and the network parameters to minimise the error.

#3


1  

What are the input and output matrices in the NCE function?


Take the skip-gram model for example. For this sentence:

the quick brown fox jumped over the lazy dog


the input and output pairs are:


(quick, the), (quick, brown), (brown, quick), (brown, fox), ...

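A minimal sketch of how such pairs can be generated with a window size of 1 (plain Python, not taken from the tutorial):

sentence = "the quick brown fox jumped over the lazy dog".split()
window = 1

pairs = []
for i, centre in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((centre, sentence[j]))

print(pairs[:5])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ('brown', 'fox')]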

For more information please refer to the tutorial.


What is the final embedding?


The final embedding you should extract is usually the W between the input and hidden layers.

To illustrate this more intuitively, take a look at the following picture (a one-hot input layer, the weight matrix W, and the resulting word embedding):

The one-hot vector [0, 0, 0, 1, 0] is the input layer in the graph above, the output is the word embedding [10, 12, 19], and W (in the graph above) is the matrix in between.

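The row-selection nature of that multiplication can be checked with a tiny NumPy sketch (the numbers mirror the picture; the matrix values are hypothetical):

import numpy as np

one_hot = np.array([0, 0, 0, 1, 0])           # input layer: the 4th word
W = np.array([[ 1,  2,  3],                   # hypothetical 5 x 3 weight matrix
              [ 4,  5,  6],
              [ 7,  8,  9],
              [10, 12, 19],
              [13, 14, 15]])

print(one_hot @ W)   # [10 12 19] -- simply row 3 of W, i.e. the word's embedding
print(W[3])          # the same vector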

For a detailed explanation, please read this tutorial.

#4


0  

1) In short, it is right in general, but only partly right for the function in question. See the tutorial:

The noise-contrastive estimation loss is defined in terms of a logistic regression model. For this, we need to define the weights and biases for each word in the vocabulary (also called the output weights as opposed to the input embeddings).


So the inputs to the nce_loss function are the output weights and a small part of the input embeddings, among other things.

2) The 'final' embedding (aka word vectors, aka vector representations of words) is what you call the input matrix. The embeddings are the rows (vectors) of that matrix, one per word.

Warning: in fact, this terminology is confusing because of how the input and output concepts are used in an NN setting. The embeddings matrix is not an input to the NN, as the input to the NN is technically the input layer. You obtain the final state of this matrix during the training process. Nonetheless, the matrix has to be initialised in the program, because the algorithm must start from some random state of it and gradually update it during training.

The same is true for the weights - they also have to be initialised. That happens in the following line:

nce_weights = tf.Variable(
        tf.truncated_normal([50000, 128], stddev=1.0 / math.sqrt(128)))

Each embedding vector can be multiplied by a vector from the weights matrix (row by column). This gives a scalar in the NN output layer. The norm of this scalar is interpreted as the probability that the target word (from the input layer) will be accompanied by the label [or context] word corresponding to the position of that scalar in the output layer.

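In code, the score for a single (target, candidate) pair looks roughly like this (a conceptual sketch, not the actual internals of tf.nn.nce_loss; the variable names are mine):

# embed_vec: input vector of the target word, shape [embedding_size]
# out_vec:   the row of nce_weights for the candidate context word
# out_bias:  the corresponding entry of nce_biases
score = tf.reduce_sum(embed_vec * out_vec) + out_bias   # a scalar logit
prob = tf.sigmoid(score)   # probability that this is a genuine (target, context) pair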

Warning: ironically, there is a confusion (mistake) in the TensorFlow tutorial, in the Building the Graph section - they mixed up targets and contexts. What is worse, in the Python file you refer to, they use labels for the context words. I believe it has puzzled a lot of people struggling with this poor tutorial.


So, if we are talking about the inputs (arguments) to the function, then both matrices are involved: the weights and a batch-sized extraction from the embeddings:

tf.nn.nce_loss(weights=nce_weights,            # Tensor of shape(50000, 128)
               biases=nce_biases,              # vector of zeros; len(50000)
               labels=train_labels,            # labels == context words enums
               inputs=embed,                   # Tensor of shape(128, 128)
               num_sampled=num_sampled,        # 64: randomly chosen negative (rare) words
               num_classes=vocabulary_size))   # 50000: by construction

This nce_loss function outputs a vector of length batch_size - in the TensorFlow example, a tensor of shape (128,). Then reduce_mean() reduces this result to a scalar by taking the mean of those 128 values, which is in fact the objective for further minimization.

Hope this helps.


#5


0  

From the paper Learning word embeddings efficiently with noise-contrastive estimation:


NCE is based on the reduction of density estimation to probabilistic binary classification. The basic idea is to train a logistic regression classifier to discriminate between samples from the data distribution and samples from some “noise” distribution.

We can see that, in word embedding, NCE is actually negative sampling. (For the difference between the two, see the paper Notes on Noise Contrastive Estimation and Negative Sampling.)

Therefore, you do not need to supply the noise distribution yourself. And, as the quote shows, it is actually logistic regression: the weight and bias are exactly what logistic regression needs. If you are familiar with word2vec, it just adds a bias.

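A rough sketch of that logistic-regression view for one training pair (conceptual only; tf.nn.nce_loss additionally applies sampling corrections that plain negative sampling drops):

# pos_score:  logit for the true (target, context) pair, a scalar
# neg_scores: logits for num_sampled noise words, shape [num_sampled]
pos_loss = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=tf.ones_like(pos_score), logits=pos_score)
neg_loss = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=tf.zeros_like(neg_scores), logits=neg_scores)
pair_loss = pos_loss + tf.reduce_sum(neg_loss)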
