[Notes] CS224N Natural Language Processing with Deep Learning

CS224n: Natural Language Processing with Deep Learning (Winter 2017), 1080p: https://www.bilibili.com/video/av28030942/

How to represent the meaning of words

meaning = denotation:
signifier (symbol) $\Leftrightarrow$ signified (idea or thing)

Usable meaning in a computer:
WordNet (synonym sets & hypernyms)

Shortcomings:

  1. misses nuance
  2. misses new meanings of words
  3. subjective
  4. requires human labour
  5. hard to compute accurate word similarity

Representing words as discrete symbols:
one-hot vectors
Vector dimension = number of words in the vocabulary

Shortcomings:
All vectors are orthogonal, so there is no natural notion of similarity.
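
A minimal NumPy sketch of this problem (the toy vocabulary below is made up purely for illustration):

```python
import numpy as np

# Toy vocabulary; in a real setting the vocabulary has hundreds of thousands of words.
vocab = ["motel", "hotel", "banking", "crisis"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot vector for a word; dimension = vocabulary size."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# Any two distinct words have dot product 0, so "motel" and "hotel"
# look completely unrelated even though their meanings are close.
print(one_hot("motel") @ one_hot("hotel"))  # 0.0
```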

Solution:
learn to encode similarity in the vectors themselves

Representing words by their context:
Since the meaning of a word can be represented by the words that appear around it, the nearby context can be used to build up a representation of the centre word.

Method of building word vectors: (word embedding/representation)
Build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts (words appearing in similar contexts share similar vectors).

Word2vec

Overview:
steps:

  1. Take a large corpus of text
  2. Represent every word in a fixed vocabulary by a vector
  3. Go through each position t (with a centre word c and context/outside words o) in the text (see the sketch after the example below)
  4. Use the similarity of the word vectors for c and o to calculate the probability $P(o|c)$ (the probability of o given c)
  5. Adjust the word vectors to maximize this probability

example:
(Figure from the slides: a sliding window moving through the text, with “into” as the current centre word.)
Here, $w_t$ = “into” is the previously mentioned centre word c.
The centre word keeps changing as the position changes (centre word “into” --> “banking” --> “crises”…)
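
To make steps 3 and 4 concrete, here is a minimal sketch (the helper name extract_pairs and the whitespace-tokenised toy sentence are illustrative assumptions) that walks over each position t and yields (centre, context) pairs for a window of size m = 2:

```python
def extract_pairs(tokens, window=2):
    """Yield (centre word, context word) pairs for every position t in the corpus."""
    for t, centre in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield centre, tokens[t + j]

# Toy sentence echoing the lecture example; the centre word slides from
# "problems" to "turning" to "into" and so on.
corpus = "problems turning into banking crises as".split()
for centre, context in extract_pairs(corpus):
    print(centre, "->", context)
```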

Objective function:
The likelihood of the context words given the centre word:
$$L(\theta)=\prod_{t=1}^{T}\prod_{-m\leq j\leq m,\,j\neq 0}P(w_{t+j}\mid w_t;\theta)$$
Here, $\theta$ represents all the variables to be optimised, $w_t$ is the centre word, $w_{t+j}$ is a context word, and $m$ is the window size.
The objective function $J(\theta)$ is the average negative log likelihood:
$$J(\theta)=-\frac{1}{T}\log L(\theta)=-\frac{1}{T}\sum_{t=1}^{T}\sum_{-m\leq j\leq m,\,j\neq 0}\log P(w_{t+j}\mid w_t;\theta)$$
Tip: minimising the objective function $\Leftrightarrow$ maximising predictive accuracy.
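
Numerically, the relation between $L(\theta)$ and $J(\theta)$ looks like this (a toy sketch; the probability values and the tiny corpus are invented purely for illustration):

```python
import numpy as np

# Hypothetical P(w_{t+j} | w_t; theta) values for a corpus of T = 2 positions
# with window m = 1, i.e. two context words per position.
probs = [[0.05, 0.10],   # position t = 1
         [0.02, 0.08]]   # position t = 2

T = len(probs)
L = np.prod([p for row in probs for p in row])                 # likelihood L(theta)
J = -(1.0 / T) * sum(np.log(p) for row in probs for p in row)  # average negative log likelihood J(theta)

print(L, J)  # making the probabilities larger increases L and decreases J
```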

How to calculate $P(w_{t+j}\mid w_t;\theta)$:

  1. Use two vectors for each word w:
    $v_w$ when w is a centre word
    $u_w$ when w is a context word
  2. For centre word c and context/outside word o:
    $$P(o\mid c)=\frac{\exp(u_o^{\top}v_c)}{\sum_{w\in V}\exp(u_w^{\top}v_c)}$$

example:
Here “into” is the centre word, so $v_{into}$ is used; the other words are context words, so their $u$ vectors are used; the window size is 2.

The probability of o given c above is an example of the softmax function.
(max: amplifies the probability of the largest value; soft: still assigns some probability to smaller values)
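
A minimal NumPy sketch of this softmax (the vectors are random placeholders; only the shape of the computation matters):

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 4, 6                       # toy embedding dimension and vocabulary size
U = rng.normal(size=(V, d))       # u_w: one context ("outside") vector per word
v_c = rng.normal(size=d)          # v_c: the centre word's vector, e.g. v_into

scores = U @ v_c                           # u_w^T v_c for every word w in the vocabulary
P = np.exp(scores) / np.exp(scores).sum()  # softmax: P(o | c) for every candidate context word o

print(P, P.sum())  # a probability distribution over the vocabulary (sums to 1)
```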

Parameter optimisation (gradients are used):

The dimension of $\theta$: $\mathbb{R}^{2dV}$
Here, $d$ is the dimension of the word vectors, $V$ is the number of words in the vocabulary, and each word has 2 vectors (one as centre word, one as context word).
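
For example (with numbers chosen only for illustration), if $d = 300$ and $V = 400{,}000$, then $\theta$ contains $2dV = 2 \times 300 \times 400{,}000 = 2.4 \times 10^{8}$ parameters.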

Derivation of the gradient

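A sketch of the standard calculation (take the log of the softmax above and differentiate with respect to the centre vector $v_c$):

$$\frac{\partial}{\partial v_c}\log P(o\mid c)=\frac{\partial}{\partial v_c}\Big(u_o^{\top}v_c-\log\sum_{w\in V}\exp(u_w^{\top}v_c)\Big)=u_o-\sum_{x\in V}P(x\mid c)\,u_x$$

In other words, the gradient is the observed context vector minus the expected context vector under the current model; the gradients with respect to the $u$ vectors follow in the same way.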

Why two vectors?
Easier optimization. Average both at the end.

Two models:

  1. Skip-grams (SG): Predict context/outside words (position independent) given center word (used in the above example)
  2. Continuous Bag of Words (CBOW): Predict center word from (bag of) context words

Additional efficiency in training:
Negative sampling (the discussion above focuses on the naive softmax)
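
For reference (a sketch of the standard formulation, not covered in detail in these notes): for one (centre, context) pair, negative sampling replaces the full softmax with a binary logistic term for the true pair plus $k$ sampled “negative” words $w_i$ drawn from a noise distribution:

$$J_{neg}(o,c)=-\log\sigma(u_o^{\top}v_c)-\sum_{i=1}^{k}\log\sigma(-u_{w_i}^{\top}v_c),\qquad \sigma(x)=\frac{1}{1+e^{-x}}$$

With a small $k$, this avoids normalising over the whole vocabulary at every step.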

Gradient Descent:
$$\theta^{\text{new}}=\theta^{\text{old}}-\alpha\,\nabla_{\theta}J(\theta)$$

This part will not be covered in detail here; just pay attention to the learning rate $\alpha$ and use stochastic gradient descent (SGD) to save time.
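
A minimal sketch of one SGD step on a single (centre, context) pair under the softmax loss above (NumPy; the dimensions, learning rate, and initial values are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, V = 4, 6
U = rng.normal(scale=0.1, size=(V, d))   # context vectors u_w
Vc = rng.normal(scale=0.1, size=(V, d))  # centre vectors v_w
alpha = 0.05                             # learning rate

def sgd_step(c, o):
    """One stochastic gradient step for centre-word index c and context-word index o."""
    scores = U @ Vc[c]
    p = np.exp(scores) / np.exp(scores).sum()  # P(x | c) for every word x
    grad_vc = -(U[o] - p @ U)                  # d(-log P(o|c)) / d v_c
    grad_U = np.outer(p, Vc[c])                # d(-log P(o|c)) / d u_x for every x ...
    grad_U[o] -= Vc[c]                         # ... with the extra -v_c term when x = o
    Vc[c] -= alpha * grad_vc
    U[:] -= alpha * grad_U

sgd_step(c=2, o=3)  # e.g. one update for the pair (centre = word 2, context = word 3)
```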