Distributed Vector Representation Series
SkipGram Model
 The training objective of the skipgram model is to learn word representations that are useful for predicting the surrounding words in a sentence or a document, i.e. given a sequence of training words \(w_1, w_2, w_3, \ldots, w_T\), the objective is to maximize the average log probability
\[{1 \over T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)\]
 Where \(c\) is the size of the training context

A larger \(c\) results in more training examples and thus higher accuracy, at the expense of training time.
 The basic skipgram formulation defines \(p(w_{t+j} \mid w_t)\) using the softmax function
\[p(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^{\top} v_{w_I})}{\sum_{w=1}^{W} \exp({v'_w}^{\top} v_{w_I})}\]
 Where
 \(v_w\) and \(v'_w\) are the input and output vector representations of \(w\)
 \(W\) is the number of words in the vocabulary
 The cost of computing \(\nabla \log p(w_O \mid w_I)\) is proportional to \(W\), which is quite large, on the order of \(10^5\)–\(10^7\) terms
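To make the \(O(W)\) cost concrete, here is a minimal NumPy sketch of the full softmax over a toy vocabulary; all names and shapes are hypothetical, not from the paper:

```python
import numpy as np

def full_softmax_prob(v_in, V_out):
    """Full softmax p(w_O | w_I): the cost is O(W) because the
    normalizer sums a dot product over every word in the vocabulary.
    v_in:  input vector of the centre word w_I, shape (d,)
    V_out: output vectors for all W vocabulary words, shape (W, d)
    """
    scores = V_out @ v_in              # one dot product per vocabulary word
    scores -= scores.max()             # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
W, d = 1000, 50                        # toy vocabulary size and embedding dim
V_out = rng.normal(size=(W, d))
v_in = rng.normal(size=d)
p = full_softmax_prob(v_in, V_out)     # a proper distribution over all W words
```

Every evaluation touches all \(W\) rows of `V_out`, which is what hierarchical softmax and negative sampling avoid.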
Drawbacks of the Initially Proposed Word2Vec
 Training time
 Indifference to word order and inability to represent idiomatic phrases.
Improvements
 Hierarchical Softmax
 Computationally efficient approximation of full softmax.
 Advantageous because instead of evaluating all \(W\) output nodes in the neural network, only about \(\log_2(W)\) nodes need to be evaluated.
 Uses a binary tree representation of the output layer with the \(W\) words as its leaves. Each inner node represents the relative probabilities of its child nodes, so a probability is assigned to a leaf through a random walk from the root node.
 Each word can be reached by following an appropriate path from the root. Let \(n(w, j)\) be the jth node on the path from the root to w, and let \(L(w)\) be the length of this path. So,
\[n(w, 1) = root\]
\[n(w, L(w)) = w\]
 For any inner node \(n\), let \(ch(n)\) be an arbitrary fixed child of \(n\) and let \(\unicode{x27E6} x \unicode{x27E7}\) be 1 if \(x\) is true and \(-1\) otherwise; then hierarchical softmax defines \(p(w_O \mid w_I)\) as
\[p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\left( \unicode{x27E6} n(w, j+1) = ch(n(w, j)) \unicode{x27E7} \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \right)\]
 Where \(\sigma(x) = {1 \over 1 + \exp(-x)}\)
 It can be verified that \(\sum_{w=1}^{W} p(w \mid w_I) = 1\)
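The product over the root-to-leaf path can be sketched as follows; the tiny 4-word tree and all variable names are illustrative assumptions, and the final sum demonstrates that the leaf probabilities form a valid distribution because \(\sigma(x) + \sigma(-x) = 1\) at every inner node:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_prob(path, v_in, node_vecs):
    """Hierarchical-softmax probability of one leaf word.
    path: list of (inner_node_id, goes_to_ch) pairs along root -> word,
          where goes_to_ch says whether the next step is to ch(n),
          i.e. the [[.]] indicator evaluates to +1 (True) or -1 (False).
    """
    p = 1.0
    for node_id, goes_to_ch in path:
        sign = 1.0 if goes_to_ch else -1.0
        p *= sigmoid(sign * (node_vecs[node_id] @ v_in))
    return p

rng = np.random.default_rng(1)
node_vecs = rng.normal(size=(3, 8))    # 3 inner nodes for a 4-leaf tree, dim 8
v_in = rng.normal(size=8)
# Complete binary tree: node 0 is the root, nodes 1 and 2 are its children.
paths = {
    "w0": [(0, True), (1, True)],
    "w1": [(0, True), (1, False)],
    "w2": [(0, False), (2, True)],
    "w3": [(0, False), (2, False)],
}
total = sum(hs_prob(p, v_in, node_vecs) for p in paths.values())  # sums to 1
```

Only \(L(w)-1 \approx \log_2 W\) inner-node vectors are touched per word, versus all \(W\) output vectors in the full softmax.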
 Negative Sampling
 Alternative to Hierarchical softmax.
 Based on Noise Contrastive Estimation (NCE), which posits that a good model should be able to differentiate data from noise by means of logistic regression.
 NCE approximately maximizes the log probability of the softmax, but the skipgram model only aims at learning high-quality word representations, so NCE can be simplified.
 Negative Sampling (NEG) is defined by the objective
\[\log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^{\top} v_{w_I}) \right]\]
which replaces every \(\log p(w_O \mid w_I)\) term in the skipgram objective.

The task is thus to distinguish the target word \(w_O\) from draws from the noise distribution \(P_n(w)\) using logistic regression, where \(k\) is the number of negative samples for each data sample. This simplification is possible because this use case does not require maximizing the softmax log probability, only learning good representations.

The value of \(k\) typically ranges 5–20 for small datasets and 2–5 for large datasets.

NCE differs from NEG in that NCE needs both samples and the numerical probabilities of the noise distribution, whereas NEG uses only samples.

Both NCE and NEG have the noise distribution \(P_n(w)\) as a free parameter; the unigram distribution \(U(w)\) raised to the 3/4 power, i.e. \(U(w)^{3/4}/Z\), is found to outperform other options such as the plain unigram and the uniform distribution.
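The NEG objective and the \(U(w)^{3/4}\) noise distribution can be sketched as below; the toy counts and vectors are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_objective(v_in, v_out_pos, v_out_negs):
    """NEG objective for one (w_I, w_O) pair with k sampled negatives:
    log sigma(v'_O . v_I) + sum_i log sigma(-v'_i . v_I)."""
    pos = np.log(sigmoid(v_out_pos @ v_in))
    neg = sum(np.log(sigmoid(-(v @ v_in))) for v in v_out_negs)
    return pos + neg

def noise_distribution(counts):
    """P_n(w) proportional to U(w)^{3/4}: flattens the unigram
    distribution, boosting rare words relative to frequent ones."""
    p = np.asarray(counts, dtype=float) ** 0.75
    return p / p.sum()

Pn = noise_distribution([100, 50, 10, 5])  # toy unigram counts

# With zero input vector every sigmoid is 0.5, so with k = 2 negatives
# the objective is exactly 3 * log(0.5).
v_in = np.zeros(4)
loss = neg_objective(v_in, np.ones(4), [np.ones(4), np.ones(4)])
```

Negatives would be drawn with e.g. `np.random.default_rng().choice(len(Pn), size=k, p=Pn)`; only \(k+1\) output vectors are touched per training pair.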
 Subsampling of frequent words
 In a corpus, the most frequent words, such as stopwords, can occur hundreds of millions of times.
 These words provide little information when co-occurring with other words. Consequently, the vector representations of frequent words do not change significantly after training on several million examples.
 The imbalance between rare and frequent words is thus countered using a subsampling approach.
 Each word \(w_i\) in the training set is discarded with a probability given by
\[P(w_i) = 1 - \sqrt{t \over f(w_i)}\]
 Where
 \(f(w_i)\) is the frequency of word \(w_i\)
 \(t\) is a chosen threshold, typically around \(10^{-5}\)
 The subsampling formula was chosen heuristically
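A minimal sketch of this subsampling step, with hypothetical frequencies; clipping the discard probability at 0 for rare words (where the formula goes negative) is my assumption, not stated in the source:

```python
import random

def discard_prob(freq, t=1e-5):
    """P(w_i) = 1 - sqrt(t / f(w_i)), clipped at 0 so that rare words
    (f(w_i) < t) are always kept."""
    return max(0.0, 1.0 - (t / freq) ** 0.5)

def subsample(tokens, freqs, t=1e-5, rng=random.Random(0)):
    """Keep each token with probability 1 - P(w_i)."""
    return [w for w in tokens if rng.random() >= discard_prob(freqs[w], t)]

# Hypothetical corpus frequencies: a stopword vs. a rare word.
freqs = {"the": 0.05, "learning": 1e-7}
kept = subsample(["the", "learning", "the", "the"], freqs)
```

Here `discard_prob(0.05)` is about 0.986, so almost all occurrences of "the" are dropped, while "learning" is always kept.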
 Learning Phrases
 The aim is to learn phrases whose meaning as a group is entirely different from the meanings of the individual words.
 Start off by finding words that frequently occur together and infrequently in other contexts.
 Theoretically, the skipgram model could be trained with all n-grams, but that would be a very memory intensive operation.
 A data driven approach based on unigram and bigram counts is put forward, with the score given by
\[score(w_i, w_j) = \frac{count(w_i w_j) - \delta}{count(w_i) \times count(w_j)}\]
Where \(\delta\) is a discounting coefficient that prevents too many phrases consisting of very infrequent words from being formed. Only bigrams with a score above a chosen threshold are used as phrases.

Often the process is run 2–4 times with decreasing threshold values, checking the quality of the phrases formed at each pass.
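The bigram scoring above can be sketched in a few lines; the token list and \(\delta\) value are toy assumptions for illustration:

```python
from collections import Counter

def phrase_scores(tokens, delta=1.0):
    """score(w_i, w_j) = (count(w_i w_j) - delta) / (count(w_i) * count(w_j)).
    delta discounts bigrams made of very infrequent words."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    return {pair: (c - delta) / (uni[pair[0]] * uni[pair[1]])
            for pair, c in bi.items()}

# Toy corpus: "new york" recurs as a unit, "a b" occurs once.
tokens = ["new", "york", "new", "york", "new", "york", "a", "b"]
scores = phrase_scores(tokens, delta=1.0)
```

In this toy run `("new", "york")` scores \((3-1)/(3 \times 3) = 2/9\), while the one-off `("a", "b")` is discounted to 0, so only the recurring bigram would survive a positive threshold.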
Conclusions

Subsampling results in faster training and significantly better representations of uncommon words.

Negative sampling helps accurately train for frequent words using a simple method.
REFERENCES
Distributed Representations of Words and Phrases and their Compositionality