Blog Logo
Research Assistant @ Purdue University
Image Source:
· · ·

Distributed Vector Representation Series


Improvements on Word2Vec

· · ·

Skip-Gram Model

  • Training objective of skip-gram model is to deduce word representations that help in predicting the surrounding words in a sentence or a document, i.e. give a sequence of training words \(w_1, w_2, w_3, … , w_T\), the objective is to maximize the average log probability,
  • Where c is the size of the training context
  • Larger c results in more training examples and thus higher accuracy, at the expense of the training time.

  • Basic skip-gram formulation defines \(p(w_{t+j}|w_t)\) using softmax function
  • Where
    • \(v_w\) and \(v_w^{'}\) are the input and output vector representations of \(w\)
    • \(W\) is the number of words in the vocabulary
    • cost of computing \(\nabla log\,p(w_O| w_I)\) is proportional to \(W\) which is quite large in order of \(10^5 - 10^7\) terms

Drawbacks of Initial Word2Vec Proposed

  • Training time
  • Indifference to word order and inability to represent idiomatic phrases.


  • Hierarchical Softmax
    • Computationally efficient approximation of full softmax.
    • Advantageous because instead of evaluating W output nodes in the neural network, only need to evaluate \(log_2(W)\) nodes.
    • Uses binary tree representation of the output layer with W nodes as its leaves. Each node represents the relative probability of its child nodes. So, probability is assigned to a leaf node through random walk from root node
    • Each word can be reached by following an appropriate path from the root. Let \(n(w, j)\) be the j-th node on the path from the root to w, and let \(L(w)\) be the length of this path. So,

    \[n(w, 1) = root\]

    \[n(w, L(w)) = w\]

    • For any inner child n, let \(ch(n)\) be arbitrary fixed child of n and let \(\unicode{x27E6} x \unicode{x27E7}\) be 1 if x is true and -1 otherwise, then hierarchical softmax defines \(p(w_O|w_I)\) as
    • Where \(\sigma (x) = {1 \over (1 + exp(-x))}\)
    • \(\sum_{w=1}^W p(w|w_I) = 1\) is verifiable.
  • Negative Sampling
    • Alternative to Hierarchical softmax.
    • Based on Noise Contrastive Estimation(NCE) which posits that a good model should be able to differentiate data from noise by means of logistic regression.
    • NCE approximately maximizes the log probability of softmax, but skip-gram only aims at learning high quality word representations and hence NCE can be simplified.
    • Negative Sampling (NEG) is defined as
    • It replaces the \(log\, p(w_O | w_I)\) term in skip-gram objective

    • Aiming at distinguishing between \(w_O\) from draws from noise distribution \(P_n(w)\) using logistic regression, where k is the number of negative samples for each data sample, because this use case does not require maximization of the softmax log probability.

    • k value ranges 5-20 for small datasets and 2-5 for large datasets.

    • NCE differs from NEG in that NCE needs the sample and numerical probabilities of the noise distribution, but NEG uses only samples.

    • Both NCE and NEG have \(P_n(w)\) as a free parameter but unigram distribution U(w) raised to 3/4 power i.e. \(U(w)^{3/4}/Z\) is found to outperform other options like unigram and uniform distribution.

  • Subsampling of frequent words
    • In a corpus, most frequent words can occur hundreds of millions of time such as the stopwords.
    • These words give little information by co-occuring with other words. Consequently, vector representations of frequent words do not change significantly after training on several million examples.
    • Imbalance between rare and frequent words are thus countered using sub-sampling approach.
    • Each word \(w_i\) in the training set is discarded with a probability given by
    • Where
      • \(f(w_i)\) is frequency of word \(w_i\)
      • t is a chosen threshold, around \(10^{-5}\)
    • Subsampling formula is chosen heuristically
  • Learning Phrases
    • Aim is to learn phrases where in individual words meaning is entirely different when compared to the group of words.
    • Start off by finding words that frequently occur together and infrequently in other contexts.
    • Theoretically, skip-gram model can be trained with all n-grams, but would be a very memory intensive operation.
    • A data driven approach is put forward based on unigram and bigram counts, given by
    • Where \(\delta\) is a discounting coefficient and prevents phrases formed by very infrequent words. The bigrams with score above a given threshold only are used as phrases.

    • Often it is needed to run the process 2-4 times changing the threshold values and seeing the quality of phrases formed.


  • Subsampling results in faster training and significantly better representations of uncommon words.

  • Negative sampling helps accurately train for frequent words using a simple method.


Distributed Representations of Words and Phrases and their Compositionality

· · ·