
### Distributed Vector Representation Series

Word2Vec

Improvements on Word2Vec

· · ·

### Skip-Gram Model

• The training objective of the skip-gram model is to learn word representations that are useful for predicting the surrounding words in a sentence or document, i.e. given a sequence of training words $w_1, w_2, w_3, … , w_T$, the objective is to maximize the average log probability

${1 \over T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c,\, j \neq 0} log\, p(w_{t+j} | w_t)$

• Where $c$ is the size of the training context
• A larger $c$ results in more training examples and thus higher accuracy, at the expense of training time (see the sketch below).
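
To make the effect of $c$ concrete, here is a minimal sketch in plain Python (the function name, toy sentence, and whitespace tokenization are assumptions for illustration, not from the paper) that enumerates (center, context) training pairs for a given window size:

```python
# Minimal sketch: enumerate (center, context) skip-gram training pairs for window size c.
# Real implementations typically also sample a random window size per position.

def skipgram_pairs(tokens, c):
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(len(skipgram_pairs(sentence, 2)))  # 30 pairs
print(len(skipgram_pairs(sentence, 5)))  # 60 pairs -> larger c yields more training examples
```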

• The basic skip-gram formulation defines $p(w_{t+j}|w_t)$ using the softmax function

$p(w_O | w_I) = {exp({v_{w_O}^{'}}^{\top} v_{w_I}) \over \sum_{w=1}^{W} exp({v_{w}^{'}}^{\top} v_{w_I})}$

• Where
• $v_w$ and $v_w^{'}$ are the input and output vector representations of $w$
• $W$ is the number of words in the vocabulary
• The cost of computing $\nabla log\,p(w_O| w_I)$ is proportional to $W$, which is quite large, on the order of $10^5$ - $10^7$ terms
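
The $O(W)$ cost is easy to see in a rough numpy sketch (vocabulary size, dimensions, and variable names are illustrative assumptions): computing a single probability already requires a dot product against every output vector in the vocabulary.

```python
import numpy as np

# Illustrative only: the full softmax over the whole vocabulary.
# Both the forward pass and the gradient touch all W output vectors.
W, d = 100_000, 300                       # vocabulary size, embedding dimension
v_in = 0.01 * np.random.randn(W, d)       # input vectors v_w
v_out = 0.01 * np.random.randn(W, d)      # output vectors v'_w

def softmax_prob(w_O, w_I):
    scores = v_out @ v_in[w_I]            # O(W * d) work for one (input, output) pair
    scores -= scores.max()                # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[w_O]

print(softmax_prob(42, 7))                # roughly 1/W for random vectors
```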

### Drawbacks of Initial Word2Vec Proposed

• Training time
• Indifference to word order and inability to represent idiomatic phrases.

### Improvements

• Hierarchical Softmax
• Computationally efficient approximation of full softmax.
• Advantageous because instead of evaluating $W$ output nodes in the neural network, only about $log_2(W)$ nodes need to be evaluated.
• Uses a binary tree representation of the output layer with the $W$ words as its leaves; each inner node represents the relative probabilities of its child nodes, so a probability is assigned to each leaf word by a random walk from the root node.
• Each word can be reached by following an appropriate path from the root. Let $n(w, j)$ be the j-th node on the path from the root to w, and let $L(w)$ be the length of this path. So,

$n(w, 1) = root$

$n(w, L(w)) = w$

• For any inner node $n$, let $ch(n)$ be an arbitrary fixed child of $n$ and let $\unicode{x27E6} x \unicode{x27E7}$ be 1 if $x$ is true and -1 otherwise; then hierarchical softmax defines $p(w_O|w_I)$ as

$p(w | w_I) = \prod_{j=1}^{L(w)-1} \sigma \left( \unicode{x27E6} n(w, j+1) = ch(n(w, j)) \unicode{x27E7} \cdot {v_{n(w,j)}^{'}}^{\top} v_{w_I} \right)$

• Where $\sigma (x) = {1 \over (1 + exp(-x))}$
• It can be verified that $\sum_{w=1}^W p(w|w_I) = 1$ (a small sketch of the path product follows below).
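
As a minimal sketch (pure Python; the tree encoding, variable names, and toy numbers are assumptions for illustration, not the paper's implementation), the probability of a word is a product of $L(w)-1$ sigmoids along its root-to-leaf path, i.e. roughly $log_2(W)$ terms instead of $W$:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_probability(path_nodes, path_codes, v_wI, node_vectors):
    """path_nodes: inner-node ids on the root-to-leaf path of w (length L(w)-1).
    path_codes: +1 where the next node on the path equals ch(n), -1 otherwise.
    Only ~log2(W) sigmoids are evaluated instead of a W-term softmax."""
    p = 1.0
    for n, code in zip(path_nodes, path_codes):
        score = sum(a * b for a, b in zip(node_vectors[n], v_wI))
        p *= sigmoid(code * score)
    return p

# Toy example: a 2-step path in a tiny tree, with 2-dimensional vectors.
node_vectors = {0: [0.1, -0.2], 1: [0.05, 0.3]}
print(hs_probability([0, 1], [+1, -1], [0.2, 0.1], node_vectors))
```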
• Negative Sampling
• Alternative to Hierarchical softmax.
• Based on Noise Contrastive Estimation (NCE), which posits that a good model should be able to differentiate data from noise by means of logistic regression.
• NCE approximately maximizes the log probability of the softmax, but skip-gram only aims at learning high-quality word representations, so NCE can be simplified.
• Negative Sampling (NEG) is defined by the objective

$log\, \sigma ({v_{w_O}^{'}}^{\top} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ log\, \sigma (-{v_{w_i}^{'}}^{\top} v_{w_I}) \right]$

• which replaces every $log\, p(w_O | w_I)$ term in the skip-gram objective.

• The task is to distinguish the target word $w_O$ from draws from the noise distribution $P_n(w)$ using logistic regression, where $k$ is the number of negative samples for each data sample; this simplification is possible because the use case only requires good word representations, not maximization of the softmax log probability.

• Useful values of $k$ are in the range 5-20 for small datasets and 2-5 for large datasets.

• NCE differs from NEG in that NCE needs both samples and the numerical probabilities of the noise distribution, whereas NEG uses only samples.

• Both NCE and NEG have the noise distribution $P_n(w)$ as a free parameter; the unigram distribution $U(w)$ raised to the 3/4th power, i.e. $U(w)^{3/4}/Z$ (with $Z$ a normalization constant), was found to outperform other choices such as the plain unigram and uniform distributions (a small sampling sketch follows below).
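
A small numpy sketch of this (the counts, dimensions, and $k$ below are toy assumptions) draws $k$ negative words from $U(w)^{3/4}/Z$ and evaluates the NEG objective for a single (input, output) pair:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary counts; the noise distribution is P_n(w) = U(w)^{3/4} / Z.
counts = np.array([900.0, 500.0, 200.0, 50.0, 10.0])
noise = counts ** 0.75
noise /= noise.sum()

d = 8
v_in = rng.normal(scale=0.1, size=(len(counts), d))    # input vectors v_w
v_out = rng.normal(scale=0.1, size=(len(counts), d))   # output vectors v'_w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_objective(w_I, w_O, k=3):
    # log sigma(v'_{w_O} . v_{w_I}) + sum_i log sigma(-v'_{w_i} . v_{w_I}), w_i ~ P_n(w)
    positive = np.log(sigmoid(v_out[w_O] @ v_in[w_I]))
    samples = rng.choice(len(counts), size=k, p=noise)
    negative = np.log(sigmoid(-v_out[samples] @ v_in[w_I])).sum()
    return positive + negative

print(neg_objective(w_I=0, w_O=1))
```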

• Subsampling of frequent words
• In a corpus, the most frequent words (e.g. stopwords such as "in", "the", and "a") can occur hundreds of millions of times.
• Such words provide little information when co-occurring with other words; consequently, the vector representations of frequent words do not change significantly after training on several million examples.
• The imbalance between rare and frequent words is thus countered using a subsampling approach.
• Each word $w_i$ in the training set is discarded with a probability given by

$P(w_i) = 1 - \sqrt{t \over f(w_i)}$
• Where
• $f(w_i)$ is frequency of word $w_i$
• t is a chosen threshold, around $10^{-5}$
• The subsampling formula was chosen heuristically; a small sketch of the discard rule follows below.
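
A tiny sketch of the discard rule (toy frequencies and threshold; clamping at zero for words rarer than $t$ is an implementation convenience, not part of the formula):

```python
import math
import random

# Illustrative sketch of the subsampling rule P(w_i) = 1 - sqrt(t / f(w_i)).
t = 1e-5
freq = {"the": 0.05, "model": 1e-4, "skipgram": 1e-6}   # toy relative frequencies f(w_i)

def discard_probability(word):
    return max(0.0, 1.0 - math.sqrt(t / freq[word]))

def keep(word):
    return random.random() >= discard_probability(word)

for w in freq:
    print(w, round(discard_probability(w), 4))
# "the" is discarded ~98.6% of the time, while rare words are always kept.
```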
• Learning Phrases
• The aim is to learn representations for phrases whose meaning as a group is quite different from the composition of the meanings of their individual words (e.g. "New York Times" is not a natural combination of "new", "york", and "times").
• Start by finding words that frequently occur together but infrequently in other contexts.
• Theoretically, the skip-gram model could be trained on all n-grams, but that would be very memory intensive.
• Instead, a data-driven approach based on unigram and bigram counts is used, given by

$score(w_i, w_j) = {count(w_i w_j) - \delta \over count(w_i) \times count(w_j)}$

• Where $\delta$ is a discounting coefficient that prevents forming phrases out of very infrequent words. Only bigrams with a score above a chosen threshold are used as phrases (see the sketch below).

• The process is typically run 2-4 times with decreasing threshold values, checking the quality of the phrases formed and allowing longer phrases of several words to emerge.
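
A short sketch of the phrase-scoring pass (the toy corpus, $\delta$, and threshold are illustrative assumptions):

```python
from collections import Counter

# Illustrative sketch of score(w_i, w_j) = (count(w_i w_j) - delta) / (count(w_i) * count(w_j)).
corpus = "new york times reports new york weather new fashion times square".split()

unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def phrase_score(w1, w2, delta=1.0):
    return (bigram[(w1, w2)] - delta) / (unigram[w1] * unigram[w2])

threshold = 0.1
phrases = [pair for pair in bigram if phrase_score(*pair) > threshold]
print(phrases)  # only ('new', 'york') survives; one-off co-occurrences are discounted away
```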

### Conclusions

• Subsampling results in faster training and significantly better representations of uncommon words.

• Negative sampling is an extremely simple training method that learns accurate representations, especially for frequent words.

## REFERENCES

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26 (NIPS 2013).

· · ·