[Notes] CS224N Natural Language Processing with Deep Learning
CS224N (2) Word Vectors
CS224n: Natural Language Processing with Deep Learning (Winter 2017), 1080p: https://www.bilibili.com/video/av28030942/
How to represent the meaning of words
meaning = denotation:
signifier (symbol) ⟺ signified (idea or thing)
Usable meaning in a computer:
WordNet (synonym sets & hypernyms)
Shortcomings:
- misses nuance
- misses new meanings of words
- subjective
- requires human labour to create and adapt
- hard to compute accurate word similarity
Representing words as discrete symbols:
one-hot vectors
Vector dimension = number of words in vocabulary
Shortcoming:
All one-hot vectors are orthogonal, so there is no natural notion of similarity
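As a minimal sketch of this shortcoming (the toy vocabulary below is made up), the one-hot vectors of two different words always have dot product 0, so their measured similarity is 0 no matter how related their meanings are:

```python
import numpy as np

# hypothetical toy vocabulary
vocab = ["motel", "hotel", "banking", "crises"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word (dimension = vocabulary size)."""
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# "motel" and "hotel" are semantically similar, but their one-hot
# vectors are orthogonal: the dot product is always 0.
print(one_hot("motel") @ one_hot("hotel"))  # 0.0
```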
Solution:
learn to encode similarity in the vectors themselves
Representing words by their context:
Since the meaning of a word can be represented by the words that appear around it, the nearby context can be used to build up a representation of the centre word.
Method of building word vectors: (word embedding/representation)
Build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts (words that appear in similar contexts share similar vectors).
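For intuition only (the numbers below are made up, not trained vectors), dense vectors let similarity be measured directly, e.g. with the dot product or cosine similarity:

```python
import numpy as np

# made-up dense vectors; real word2vec vectors would be learned from a corpus
hotel = np.array([0.28, -0.51, 0.33, 0.17])
motel = np.array([0.31, -0.47, 0.30, 0.12])
banking = np.array([-0.40, 0.22, -0.05, 0.61])

def cosine(a, b):
    """Cosine similarity between two dense word vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(hotel, motel))    # close to 1: the words appear in similar contexts
print(cosine(hotel, banking))  # much lower: different contexts
```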
Word2vec
Overview:
steps:
- Start with a large corpus of text
- Represent every word in a fixed vocabulary with a vector
- Go through each position t in the text, with a centre word c and context/outside words o
- Use the similarity of the word vectors of c and o to calculate the probability of o given c
- Keep adjusting the word vectors to maximise this probability
example:
Here, the centre word c is “into”, and the words within the window on either side of it are the context words o.
The centre word keeps changing as the position t moves through the text (centre word “into” → “banking” → “crises” …)
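A small sketch of this sliding-window step (the sentence and window size below are just for illustration): at each position t, the word at t is the centre word and the words within m positions on either side are its context words.

```python
def centre_context_pairs(tokens, window=2):
    """Yield (centre, context) pairs as the window slides over the text."""
    for t, centre in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield centre, tokens[t + j]

sentence = "problems turning into banking crises as".split()
for centre, context in centre_context_pairs(sentence):
    print(centre, "->", context)
# the centre word moves "into" -> "banking" -> "crises" as t increases
```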
Objective function:
The likelihood of the context words given the centre words:
$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)$$
Here, $\theta$ represents all the variables to be optimised (the word vectors), $w_t$ is the centre word, $w_{t+j}$ are the context words, and m is the size of the window.
The objective function is the average negative log likelihood:
$$J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$
Tip: minimising the objective function $\Leftrightarrow$ maximising predictive accuracy
How to calculate $P(w_{t+j} \mid w_t; \theta)$:
- Use two vectors for each word w: $v_w$ when w is a centre word, $u_w$ when w is a context word
- For a centre word c and a context/outside word o:
$$P(o \mid c) = \frac{\exp(u_o^{T} v_c)}{\sum_{w \in V} \exp(u_w^{T} v_c)}$$
example:
Here “into” is the centre word, so $v_{\text{into}}$ is used; the other words within the window are context words, so their $u$ vectors are used; the window size is 2.
The probability of o given c above is an example of the softmax function $\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}$.
(“max”: it amplifies the probability of the largest value; “soft”: it still assigns some probability to smaller values)
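A minimal numpy sketch of this probability (U and v_c below are random placeholders, not trained vectors): each candidate context word's score is the dot product $u_w^{T} v_c$, and the softmax turns the scores into a distribution over the vocabulary.

```python
import numpy as np

def softmax(scores):
    """Softmax over a score vector; subtract the max for numerical stability."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
V, d = 10000, 100             # placeholder vocabulary size and vector dimension
U = rng.normal(size=(V, d))   # context ("outside") vectors u_w, one row per word
v_c = rng.normal(size=d)      # centre vector v_c for the current centre word

prob = softmax(U @ v_c)       # P(o | c) for every candidate context word o
assert np.isclose(prob.sum(), 1.0)
```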
Parameter optimisation (gradients are used):
The dimension of $\theta$: $\theta \in \mathbb{R}^{2dV}$
Here, d is the dimension of the word vectors, V is the number of words in the vocabulary, and each word has 2 vectors ($v_w$ as centre and $u_w$ as context)
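As a sketch of the bookkeeping (the sizes below are arbitrary placeholders): stacking every word's centre vector and context vector gives a single parameter vector $\theta$ with $2dV$ entries.

```python
import numpy as np

V, d = 10000, 100               # placeholder vocabulary size and dimension
centre_vecs = np.zeros((V, d))  # v_w for every word
context_vecs = np.zeros((V, d)) # u_w for every word
theta = np.concatenate([centre_vecs.ravel(), context_vecs.ravel()])
print(theta.shape)              # (2 * d * V,) = (2000000,)
```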
Derivation of the gradient:
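For reference, differentiating $\log P(o \mid c)$ (from the softmax expression above) with respect to the centre vector $v_c$ gives:

$$\frac{\partial}{\partial v_c} \log P(o \mid c) = \frac{\partial}{\partial v_c} \left( u_o^{T} v_c - \log \sum_{w \in V} \exp\!\left(u_w^{T} v_c\right) \right) = u_o - \sum_{x \in V} P(x \mid c)\, u_x$$

That is, the observed context vector minus the expected context vector under the current model; the gradient vanishes when the model's expectation matches the observation.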
Why two vectors?
Easier optimization. Average both at the end.
Two models:
- Skip-grams (SG): Predict context/outside words (position independent) given center word (used in the above example)
- Continuous Bag of Words (CBOW): Predict center word from (bag of) context words
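For comparison, a common way to write the CBOW prediction (note the roles of the two vector sets swap relative to skip-gram: the context words' vectors are averaged into $\hat{v}$, and the score is taken against the candidate centre word's other vector):

$$P(w_t \mid w_{t-m}, \dots, w_{t+m}) = \frac{\exp\!\left(u_{w_t}^{T}\, \hat{v}\right)}{\sum_{w \in V} \exp\!\left(u_w^{T}\, \hat{v}\right)}, \qquad \hat{v} = \frac{1}{2m} \sum_{\substack{-m \le j \le m \\ j \ne 0}} v_{w_{t+j}}$$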
Additional efficiency in training:
Negative sampling (the discussion above used the naive softmax)
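As a pointer forward (not derived here), the negative-sampling loss for a single (centre, context) pair replaces the full softmax with a binary logistic term for the true pair plus K sampled “negative” words:

$$J_{\text{neg-sample}}(u_o, v_c) = -\log \sigma\!\left(u_o^{T} v_c\right) - \sum_{k=1}^{K} \log \sigma\!\left(-u_{w_k}^{T} v_c\right)$$

where $\sigma$ is the sigmoid function and the $w_k$ are drawn from a noise distribution over the vocabulary.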
Gradient Descent:
This part is not covered in detail here; just pay attention to the learning rate, and use stochastic gradient descent (SGD) to save time.
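A minimal sketch of one SGD update under the naive-softmax loss above (the sizes, learning rate, and word indices are arbitrary placeholders; a real run would loop over sampled windows from the whole corpus):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, lr = 5000, 50, 0.05                 # placeholder sizes and learning rate
U = rng.normal(scale=0.1, size=(V, d))    # context vectors u_w
W = rng.normal(scale=0.1, size=(V, d))    # centre vectors v_w

def sgd_step(c, o):
    """One SGD update on -log P(o | c) for a single (centre c, context o) pair."""
    v_c = W[c]
    scores = U @ v_c
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # P(x | c) for every word x
    # gradient of the loss: expected context vector minus the observed one
    grad_vc = U.T @ p - U[o]
    grad_U = np.outer(p, v_c)
    grad_U[o] -= v_c
    W[c] -= lr * grad_vc
    U -= lr * grad_U

sgd_step(c=42, o=7)                       # hypothetical word indices
```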