Blog Logo
Research Scientist @ Foundation AI
Image Source:
· · ·

Distributed Vector Representation Series


Improvements on Word2Vec

· · ·


  • Computing the continuous vector representations of words from very large data sets.
  • Current state-of-the-art performance on semantic and syntactic word similarities.
  • Classical techniques treat words as atomic units without any notion of similarities between them because they are represented using indices in a vocabulary (bag-of-words).
  • Advantages of classical techniques lie in simplicity, robustness and accuracy of simple model when trained on large data sets over complex models trained on less data.
  • Disadvantage of these methods is observed when the amount of data available to train is limited in certain fields like say, automatic speech recognition and machine translations.

Previous Works

  • Neural Network Language Model (NNLM):
    • Consists of input, projection, hidden and output layers.
    • Input layer has N previous words encoded using 1-in-V coding, where V is the size of Vocabulary.
    • Projection layer, P has a projection of input layer has a dimensionality of \(N * D\) and uses a projection matrix.
    • High complexity between projection and hidden layer due to dimensions of the dense projection layer.
    • Computational complexity of NNLM per training example is given by

    \[Q = N * D + N * D * H + H * V\]

    • Where
      • Q is the computational cost
      • N is the number of previous words used for learning
      • D is the dimensionality of the projection layer
      • H is the size of hidden layer
      • V is the size of the vocabulary and output layer.
    • \(H * V\) is the dominating term above which was proposed to be reduced to as less as \(H * log_2(V)\) using
      • Hierarchical softmax
      • Avoiding normalized models for training
      • Binary tree representations of the vocabulary using Huffman Trees
    • So, the major complexity is dominated by \(N * D * H\)
  • Recurrent Neural Network Language Model (RNNLM):
    • Overcome the limitations of NNLM such as need to specify the context length, N (order of the model N)
    • Theoretically RNNs can efficiently represent more complex patterns than shallow neural networks.
    • No projection layer
    • Consists of Input, hidden and output layers.
    • Develops a short term memory of seen data in the self-fed time delayed hidden layer.
    • Computational complexity of NNLM per training example is given by

    \[Q = H * H + H * V\]

    • Where
      • Q is the computational cost
      • H is the size of hidden layer
      • V is the size of the vocabulary and output layer.
    • Word representations D have the same dimensionality as the hidden layer H.
    • Again, \(H * V\) will be reduced to \(H * log_2(V)\) using Hierarchical softmax.
    • So, the major complexity is dominated by \(H * H\)
  • It’s observed that most complexity is contributed by the non-linearity of the hidden layer in the networks.

Continuous Bag-of-Words Model (CBOW)

  • Similar to feedforward NNLM, but the non-linear hidden layer is removed.

  • Projection layer is shared for all the words. So all words are projected into the same position and their vectors are averaged.

  • Model is called bag-of-words model because the order of words in the history or future does not influence the projections.

  • Unlike NNLM, words from future are used to with the best result found with 4 history and 4 future words in context.

  • Training criterion is the correct classification of the current(middle) word.

  • Training complexity is given by

\[Q = N * D + D * log_2(V)\]

  • Model is continuous bag-of-words because unlike standard bag-of-words it uses continuous distributed representations of the context.

  • Weights between the input and the projection layer is shared for all words positions in the same way as in NNLM.

Continuous Skip-Gram Model

  • Similar to CBOW but slight changes in training criterion.

  • Instead of predicting current word from the surrounding words in the window, current word is used to predict the words surrounding the current word.

  • Accuracy and quality of vector is found to increase as the number of context words predicted is increased, but that increased the computational complexity as well.

  • Training complexity is given by

\[Q = C * (D + D * log_2(V))\]

  • Where
    • C is the maximum distance of the words. Say, C=5 is chosen then a number \(R \in [1, C]\) is selected randomly and then R words from history and R from future are correct labels of the current word.

Model Architectures

CBOW and Skip-Gram Model Architectures


  • Algebraic operations on the vector representations actually give meaningful results like cosine similary of \(vector(X)\) is closest to \(vector(‘smallest’)\) where

\[vector(X) = vector(‘biggest’) - vector(‘big’) + vector(‘small’)\]

  • Subtle relationships are learnt when accurate data is used. For example, France is to Paris as Germany is to Berlin.

  • After a certain point adding more dimensionality to the word vectors or adding more training data provides diminishing improvements.

  • NNLM vectors work better than RNNLM because word vectors in RNNLM are directly connected to non-linear hidden layer.

  • CBOW is better than NNLM on syntactic tasks and about the same on semantic tasks.

  • Skip-Gram works slightly worse than CBOW but better than NNLM on syntactic tasks and much better on semantic tasks.

  • Training time for Skip-Gram model is greater than CBOW model.


Efficient Estimation of Word Representations in Vector Space

· · ·