[Notes] CS224N Natural Language Processing with Deep Learning

CS224n: Natural Language Processing with Deep Learning (Winter 2017), 1080p: https://www.bilibili.com/video/av28030942/

How to represent the meaning of words

meaning = denotation:
signifier (symbol) $\Leftrightarrow$ signified (idea or thing)

Usable meaning in a computer:
WordNet (synonym sets & hypernyms)

Shortcomings:

  1. misses nuance
  2. misses new meanings of words
  3. subjective
  4. requires human labour
  5. hard to compute accurate word similarity

Representing words as discrete symbols:
one-hot vectors
Vector dimension = number of words in the vocabulary

Shortcomings:
All vectors are orthogonal, so there is no natural notion of similarity.
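
A minimal NumPy sketch of this problem (the toy vocabulary below is made up purely for illustration):

```python
import numpy as np

# Toy vocabulary; in a real setting the vocabulary has hundreds of thousands of words.
vocab = ["motel", "hotel", "banking", "crisis"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot vector for a word; dimension = vocabulary size."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# Any two distinct words have dot product 0, so "motel" and "hotel"
# look completely unrelated even though their meanings are close.
print(one_hot("motel") @ one_hot("hotel"))  # 0.0
```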

Solution:
learn to encode similarity in the vectors themselves

Representing words by their context:
Since the meaning of a word can be represented by the words that appear around it, the nearby context can be used to build up a representation of the centre word.

Method of building word vectors: (word embedding/representation)
Build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts (words appearing in similar contexts share similar vectors).

Word2vec

Overview:
steps:

  1. Take a large corpus of text
  2. Represent every word in a fixed vocabulary by a vector
  3. Go through each position t (with a centre word c and context/outside words o) in the text (see the sketch after the example below)
  4. Use the similarity of the word vectors for c and o to calculate the probability $P(o|c)$ (the probability of o given c)
  5. Adjust the word vectors to maximize this probability

example:
(Figure from the slides: a sliding window moving through the text, with “into” as the current centre word.)
Here, $w_t$ = “into” is the previously mentioned centre word c.
The centre word keeps changing as the position changes (centre word “into” --> “banking” --> “crises”…)
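
To make steps 3 and 4 concrete, here is a minimal sketch (the helper name extract_pairs and the whitespace-tokenised toy sentence are illustrative assumptions) that walks over each position t and yields (centre, context) pairs for a window of size m = 2:

```python
def extract_pairs(tokens, window=2):
    """Yield (centre word, context word) pairs for every position t in the corpus."""
    for t, centre in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield centre, tokens[t + j]

# Toy sentence echoing the lecture example; the centre word slides from
# "problems" to "turning" to "into" and so on.
corpus = "problems turning into banking crises as".split()
for centre, context in extract_pairs(corpus):
    print(centre, "->", context)
```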

Objective function:
The likelihood of the context words given the centre word:
$$L(\theta)=\prod_{t=1}^{T}\prod_{-m\leq j\leq m,\,j\neq 0}P(w_{t+j}\mid w_t;\theta)$$
Here, $\theta$ represents all the variables to be optimised, $w_t$ is the centre word, $w_{t+j}$ is a context word, and $m$ is the window size.
The objective function $J(\theta)$ is the average negative log likelihood:
$$J(\theta)=-\frac{1}{T}\log L(\theta)=-\frac{1}{T}\sum_{t=1}^{T}\sum_{-m\leq j\leq m,\,j\neq 0}\log P(w_{t+j}\mid w_t;\theta)$$
Tip: minimising the objective function $\Leftrightarrow$ maximising predictive accuracy.
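
Numerically, the relation between $L(\theta)$ and $J(\theta)$ looks like this (a toy sketch; the probability values and the tiny corpus are invented purely for illustration):

```python
import numpy as np

# Hypothetical P(w_{t+j} | w_t; theta) values for a corpus of T = 2 positions
# with window m = 1, i.e. two context words per position.
probs = [[0.05, 0.10],   # position t = 1
         [0.02, 0.08]]   # position t = 2

T = len(probs)
L = np.prod([p for row in probs for p in row])                 # likelihood L(theta)
J = -(1.0 / T) * sum(np.log(p) for row in probs for p in row)  # average negative log likelihood J(theta)

print(L, J)  # making the probabilities larger increases L and decreases J
```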

How to calculate $P(w_{t+j}\mid w_t;\theta)$:

  1. Use two vectors for each word w:
    $v_w$ when w is a centre word
    $u_w$ when w is a context word
  2. For centre word c and context/outside word o:
    $$P(o\mid c)=\frac{\exp(u_o^{\top}v_c)}{\sum_{w\in V}\exp(u_w^{\top}v_c)}$$

example:
Here “into” is the centre word, so $v_{into}$ is used; the other words are context words, so their $u$ vectors are used; the window size is 2.

The probability of o given c above is an example of the softmax function.
(max: amplifies the probability of the largest value; soft: still assigns some probability to smaller values)
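
A minimal NumPy sketch of this softmax (the vectors are random placeholders; only the shape of the computation matters):

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 4, 6                       # toy embedding dimension and vocabulary size
U = rng.normal(size=(V, d))       # u_w: one context ("outside") vector per word
v_c = rng.normal(size=d)          # v_c: the centre word's vector, e.g. v_into

scores = U @ v_c                           # u_w^T v_c for every word w in the vocabulary
P = np.exp(scores) / np.exp(scores).sum()  # softmax: P(o | c) for every candidate context word o

print(P, P.sum())  # a probability distribution over the vocabulary (sums to 1)
```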

Parameter optimisation (gradients are used):

The dimension of $\theta$: $\mathbb{R}^{2dV}$
Here, $d$ is the dimension of the word vectors, $V$ is the number of words in the vocabulary, and each word has 2 vectors (one as centre word, one as context word).
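
For example (with numbers chosen only for illustration), if $d = 300$ and $V = 400{,}000$, then $\theta$ contains $2dV = 2 \times 300 \times 400{,}000 = 2.4 \times 10^{8}$ parameters.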

Derivation of the gradient

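A sketch of the standard calculation (take the log of the softmax above and differentiate with respect to the centre vector $v_c$):

$$\frac{\partial}{\partial v_c}\log P(o\mid c)=\frac{\partial}{\partial v_c}\Big(u_o^{\top}v_c-\log\sum_{w\in V}\exp(u_w^{\top}v_c)\Big)=u_o-\sum_{x\in V}P(x\mid c)\,u_x$$

In other words, the gradient is the observed context vector minus the expected context vector under the current model; the gradients with respect to the $u$ vectors follow in the same way.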

Why two vectors?
Easier optimization. Average both at the end.

Two models:

  1. Skip-grams (SG): Predict context/outside words (position independent) given center word (used in the above example)
  2. Continuous Bag of Words (CBOW): Predict center word from (bag of) context words

Additional efficiency in training:
Negative sampling (the discussion above focuses on the naive softmax)
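
For reference (a sketch of the standard formulation, not covered in detail in these notes): for one (centre, context) pair, negative sampling replaces the full softmax with a binary logistic term for the true pair plus $k$ sampled “negative” words $w_i$ drawn from a noise distribution:

$$J_{neg}(o,c)=-\log\sigma(u_o^{\top}v_c)-\sum_{i=1}^{k}\log\sigma(-u_{w_i}^{\top}v_c),\qquad \sigma(x)=\frac{1}{1+e^{-x}}$$

With a small $k$, this avoids normalising over the whole vocabulary at every step.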

Gradient Descent:
$$\theta^{\text{new}}=\theta^{\text{old}}-\alpha\,\nabla_{\theta}J(\theta)$$

This part will not be covered in detail here; just pay attention to the learning rate $\alpha$ and use stochastic gradient descent (SGD) to save time.
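
A minimal sketch of one SGD step on a single (centre, context) pair under the softmax loss above (NumPy; the dimensions, learning rate, and initial values are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, V = 4, 6
U = rng.normal(scale=0.1, size=(V, d))   # context vectors u_w
Vc = rng.normal(scale=0.1, size=(V, d))  # centre vectors v_w
alpha = 0.05                             # learning rate

def sgd_step(c, o):
    """One stochastic gradient step for centre-word index c and context-word index o."""
    scores = U @ Vc[c]
    p = np.exp(scores) / np.exp(scores).sum()  # P(x | c) for every word x
    grad_vc = -(U[o] - p @ U)                  # d(-log P(o|c)) / d v_c
    grad_U = np.outer(p, Vc[c])                # d(-log P(o|c)) / d u_x for every x ...
    grad_U[o] -= Vc[c]                         # ... with the extra -v_c term when x = o
    Vc[c] -= alpha * grad_vc
    U[:] -= alpha * grad_U

sgd_step(c=2, o=3)  # e.g. one update for the pair (centre = word 2, context = word 3)
```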