Distributed Vector Representation Series
Introduction
 Computes continuous vector representations of words from very large data sets.
 Achieves state-of-the-art performance on semantic and syntactic word similarity tasks.
 Classical techniques treat words as atomic units with no notion of similarity between them, because each word is represented only by its index in a vocabulary (bag-of-words), as sketched below.
 The advantage of classical techniques lies in their simplicity and robustness, and in the observation that simple models trained on large data sets outperform complex models trained on less data.
 The disadvantage of these methods shows up when the amount of training data available is limited, as in fields like automatic speech recognition and machine translation.
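To make the atomic-unit point concrete, here is a minimal sketch (the vocabulary and words are made-up assumptions, not from the paper) of the index-based 1-of-V / bag-of-words view: under it, every pair of distinct words looks equally unrelated.

```python
import numpy as np

# Minimal sketch (illustrative vocabulary, not from the paper): words as
# atomic units, each represented only by its index in the vocabulary.
vocab = ["king", "queen", "apple"]
V = len(vocab)

def one_hot(word):
    """Encode a word as a 1-of-V vector: a single 1 at the word's index."""
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

# The dot product between any two *different* one-hot vectors is always 0,
# so this representation carries no notion of similarity between words.
print(one_hot("king") @ one_hot("queen"))  # 0.0
print(one_hot("king") @ one_hot("apple"))  # 0.0
```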
Previous Works
 Neural Network Language Model (NNLM):
 Consists of input, projection, hidden and output layers.
 The input layer encodes the N previous words using 1-of-V coding, where V is the size of the vocabulary.
 The projection layer P has dimensionality \(N * D\); the input layer is projected into it using a shared projection matrix.
 Complexity is high between the projection and hidden layers because the projection layer is dense.
 Computational complexity of NNLM per training example is given by
\[Q = N * D + N * D * H + H * V\]
 Where
 Q is the computational cost
 N is the number of previous words used for learning
 D is the dimensionality of the projection layer
 H is the size of hidden layer
 V is the size of the vocabulary and output layer.
 \(H * V\) is the dominating term above; it was proposed to reduce it to as little as \(H * log_2(V)\) using
 Hierarchical softmax
 Avoiding normalized models for training
 Binary tree representations of the vocabulary using Huffman Trees
 So, the major complexity is dominated by \(N * D * H\)
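As a rough sanity check, the term sizes can be compared by plugging illustrative values (assumed here, roughly in the ranges the paper discusses) into the formula for Q:

```python
import math

# Illustrative (assumed) sizes: N previous words, projection size D,
# hidden size H, vocabulary size V.
N, D, H, V = 10, 500, 500, 1_000_000

q_full = N * D + N * D * H + H * V               # full softmax output layer
q_hier = N * D + N * D * H + H * math.log2(V)    # hierarchical softmax output

print(f"full softmax:         {q_full:,.0f}")    # H*V dominates (~5e8)
print(f"hierarchical softmax: {q_hier:,.0f}")    # now N*D*H dominates (~2.5e6)
```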
 Recurrent Neural Network Language Model (RNNLM):
 Overcomes limitations of the NNLM, such as the need to specify the context length N (the order of the model).
 Theoretically RNNs can efficiently represent more complex patterns than shallow neural networks.
 No projection layer
 Consists of Input, hidden and output layers.
 Develops a short-term memory of previously seen data in the hidden layer, which feeds back into itself through time-delayed connections.
 Computational complexity of the RNNLM per training example is given by
\[Q = H * H + H * V\]
 Where
 Q is the computational cost
 H is the size of hidden layer
 V is the size of the vocabulary and output layer.
 Word representations D have the same dimensionality as the hidden layer H.
 Again, \(H * V\) will be reduced to \(H * log_2(V)\) using Hierarchical softmax.
 So, the major complexity is dominated by \(H * H\)
 It is observed that most of the complexity in these models is caused by the non-linear hidden layer.
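A minimal sketch of one RNNLM step (shapes and initialization are assumptions for illustration, not the paper's code) showing where the two terms in Q come from: the recurrent hidden-to-hidden update costs about \(H * H\), and the output layer about \(H * V\).

```python
import numpy as np

H, V = 64, 1000                       # illustrative (assumed) sizes
U = np.random.randn(H, V) * 0.01      # input weights (word -> hidden)
W = np.random.randn(H, H) * 0.01      # recurrent weights (hidden -> hidden), the H*H term
O = np.random.randn(V, H) * 0.01      # output weights (hidden -> vocabulary), the H*V term

def step(word_index, h_prev):
    """Consume one word and update the short-term memory h."""
    x = np.zeros(V)
    x[word_index] = 1.0               # 1-of-V input
    h = np.tanh(U @ x + W @ h_prev)   # non-linear hidden layer fed by its own past state
    y = O @ h                         # unnormalized scores over the vocabulary
    return h, y

h = np.zeros(H)
for w in [1, 5, 7]:                   # a toy sequence of word indices
    h, y = step(w, h)
```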
Continuous Bag-of-Words Model (CBOW)

 Similar to the feed-forward NNLM, but the non-linear hidden layer is removed.

 The projection layer is shared for all words, so all words are projected into the same position and their vectors are averaged.

 The model is called a bag-of-words model because the order of words in the history or future does not influence the projection.

 Unlike the NNLM, words from the future are also used; the best results were found with 4 history and 4 future words as context.

 The training criterion is the correct classification of the current (middle) word.

Training complexity is given by
\[Q = N * D + D * log_2(V)\]

 The model is a continuous bag-of-words model because, unlike a standard bag-of-words model, it uses continuous distributed representations of the context.

 The weight matrix between the input and the projection layer is shared for all word positions, in the same way as in the NNLM.
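A minimal CBOW forward-pass sketch (matrix names, sizes and the toy context are assumptions, not the paper's code): context word vectors are looked up in a shared projection matrix, averaged, and used to score the middle word.

```python
import numpy as np

V, D = 1000, 50                        # illustrative (assumed) vocabulary and vector sizes
P = np.random.randn(V, D) * 0.01       # shared input -> projection weights (the word vectors)
O = np.random.randn(D, V) * 0.01       # projection -> output weights

def cbow_scores(context_indices):
    """Average the context word vectors and score every word as the middle word."""
    projection = P[context_indices].mean(axis=0)   # word order does not matter here
    return projection @ O                          # unnormalized scores over the vocabulary

# Example: 4 history + 4 future word indices around the (unknown) middle word.
context = [3, 17, 42, 8, 91, 5, 60, 2]
predicted = int(np.argmax(cbow_scores(context)))   # index of the highest-scoring middle word
```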
Continuous Skip-gram Model

 Similar to CBOW, but with a change in the training criterion.

 Instead of predicting the current word from the surrounding words in the window, the current word is used to predict its surrounding words.

 The accuracy and quality of the resulting vectors are found to increase as the number of context words predicted increases, but so does the computational complexity.

Training complexity is given by
\[Q = C * (D + D * log_2(V))\]
 Where
 C is the maximum distance of the context words from the current word. If, say, C = 5 is chosen, then for each training word a number \(R \in [1, C]\) is selected at random, and the R words from the history and R words from the future are used as the correct labels for the current word.
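A minimal sketch (one common way to implement it; the sentence and window size are made up) of how Skip-gram training pairs could be generated with the random window R described above:

```python
import random

def skipgram_pairs(sentence, C=5):
    """Emit (current word, context word) training pairs with a random window R in [1, C]."""
    pairs = []
    for i, center in enumerate(sentence):
        R = random.randint(1, C)                      # random window size for this position
        for j in range(max(0, i - R), min(len(sentence), i + R + 1)):
            if j != i:
                pairs.append((center, sentence[j]))   # the current word predicts its context
    return pairs

print(skipgram_pairs(["the", "quick", "brown", "fox", "jumps"], C=2))
```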
Model Architectures
Results
 Algebraic operations on the vector representations give meaningful results: by cosine similarity, \(vector(X)\) is found to be closest to \(vector('smallest')\) (see the sketch after these results), where
\[vector(X) = vector('biggest') - vector('big') + vector('small')\]

 Subtle relationships are learnt when good training data is used; for example, France is to Paris as Germany is to Berlin.

After a certain point adding more dimensionality to the word vectors or adding more training data provides diminishing improvements.

 NNLM vectors work better than RNNLM vectors because the word vectors in the RNNLM are directly connected to a non-linear hidden layer.

CBOW is better than NNLM on syntactic tasks and about the same on semantic tasks.

 Skip-gram works slightly worse than CBOW on syntactic tasks (but still better than the NNLM), and much better than the other models on semantic tasks.

 Training time for the Skip-gram model is greater than for the CBOW model.
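A toy sketch of the word-offset test from the results above (the 2-D vectors are hand-made assumptions, not learned embeddings): compute \(vector('biggest') - vector('big') + vector('small')\) and find the word whose vector is closest by cosine similarity.

```python
import numpy as np

# Hand-made toy vectors (assumptions for illustration only).
vectors = {
    "big":      np.array([1.0, 0.1]),
    "biggest":  np.array([1.0, 0.9]),
    "small":    np.array([-1.0, 0.1]),
    "smallest": np.array([-1.0, 0.9]),
    "banana":   np.array([0.3, -0.5]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

X = vectors["biggest"] - vectors["big"] + vectors["small"]

# The word whose vector has the highest cosine similarity to X is the answer.
best = max(vectors, key=lambda w: cosine(X, vectors[w]))
print(best)  # 'smallest' with these toy vectors
```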
REFERENCES:
Mikolov, T., Chen, K., Corrado, G., Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781, 2013.