3.3 Skip-Gram Model
Another approach is to create a model that uses the center word to generate (predict) the surrounding context words.
Let’s discuss this Skip-Gram model. The setup is largely the same as for CBOW, but we essentially swap our x and y, i.e. the x in CBOW is now y and vice versa. The input one-hot vector (the center word) we will represent with x (since there is only one), and the output vectors with $y^{(j)}$. We define the matrices $\mathcal{V}$ and $\mathcal{U}$ the same as in CBOW.
How does it work? (A short code sketch follows these steps.)
1. We get our embedded word vector for the center word: $v_c = \mathcal{V}x \in \mathbb{R}^{n}$.
2. Generate a score vector $z = \mathcal{U}v_c \in \mathbb{R}^{|V|}$. As the dot product of similar vectors is higher, the model pushes similar words close to each other in order to achieve a high score.
3. Turn the scores into probabilities $\hat{y} = \operatorname{softmax}(z) \in \mathbb{R}^{|V|}$.
4. We desire the probabilities we generate, $\hat{y}$, to match the true probabilities, i.e. the one-hot vectors $y^{(c-m)}, \ldots, y^{(c-1)}, y^{(c+1)}, \ldots, y^{(c+m)}$ of the actual context words.
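To make these steps concrete, here is a minimal NumPy sketch of the forward pass. The names `V_emb` and `U_out`, the toy sizes, and the random initialization are illustrative assumptions, not something specified in the text.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the score vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy sizes, assumed only for illustration: vocabulary size |V| and embedding dim n.
V_size, n = 10, 4
rng = np.random.default_rng(0)
V_emb = rng.normal(size=(n, V_size))   # columns are the input (center-word) vectors v_i
U_out = rng.normal(size=(V_size, n))   # rows are the output (context-word) vectors u_j

c = 3                                  # index of the center word
x = np.zeros(V_size)                   # one-hot vector for the center word
x[c] = 1.0
v_c = V_emb @ x                        # step 1: embedded center word, v_c in R^n
z = U_out @ v_c                        # step 2: score vector z in R^{|V|}
y_hat = softmax(z)                     # step 3: probabilities over the vocabulary
```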
$$
\begin{aligned}
\text{minimize } J &= -\log P(w_{c-m},\ldots,w_{c-1},w_{c+1},\ldots,w_{c+m}\mid w_c) \\
&= -\log \prod_{j=0,\, j\neq m}^{2m} P(w_{c-m+j}\mid w_c) \\
&= -\log \prod_{j=0,\, j\neq m}^{2m} P(u_{c-m+j}\mid v_c) \\
&= -\log \prod_{j=0,\, j\neq m}^{2m} \frac{\exp(u_{c-m+j}^{T} v_c)}{\sum_{k=1}^{|V|} \exp(u_{k}^{T} v_c)} \\
&= -\sum_{j=0,\, j\neq m}^{2m} u_{c-m+j}^{T} v_c + 2m \log \sum_{k=1}^{|V|} \exp(u_{k}^{T} v_c)
\end{aligned}
$$
Note that

$$
J = -\sum_{j=0,\, j\neq m}^{2m} \log P(u_{c-m+j}\mid v_c) = \sum_{j=0,\, j\neq m}^{2m} H(\hat{y}, y_{c-m+j})
$$

where $H(\hat{y}, y_{c-m+j})$ is the cross-entropy between the probability vector $\hat{y}$ and the one-hot vector $y_{c-m+j}$.
Skip-gram treats each context word equally: the model computes the probability of each word appearing in the context independently of its distance to the center word.
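Continuing the NumPy sketch above, the loss for one center word is then just the sum of cross-entropies against the one-hot context words, i.e. $-\log \hat{y}_j$ at each true context index (the indices below are made up for illustration):

```python
# Made-up positions of the 2m words that actually appear in the context window.
context_idx = [1, 2, 5, 7]

# J = sum over context words of H(y_hat, y_true) = -log y_hat[true index]
J = -sum(np.log(y_hat[j]) for j in context_idx)
print(J)
```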

Shortcoming
The loss functions J for CBOW and Skip-Gram are expensive to compute because of the softmax normalization, where we sum over all $|V|$ scores!
To solve this problem, a simple idea is to approximate the normalization instead of computing it exactly. One such method is called Negative Sampling.
Negative Sampling
For every training step, instead of looping over the entire vocabulary, we can just sample several negative examples! We “sample” from a noise distribution $P_n(w)$ whose probabilities match the ordering of the vocabulary’s word frequencies.
While negative sampling is based on the Skip-Gram model (or CBOW), it in fact optimizes a different objective.
Consider a pair $(w, c)$ of word and context. Did this pair come from the training data? Let’s denote by $P(D=1\mid w,c)$ the probability that $(w, c)$ came from the corpus data. Correspondingly, $P(D=0\mid w,c)$ will be the probability that $(w, c)$ did not come from the corpus data. We model $P(D=1\mid w,c)$ with the sigmoid function:

$$
P(D=1\mid w,c,\theta) = \sigma(u_w^{T} v_c) = \frac{1}{1 + e^{-u_w^{T} v_c}}
$$
Now, we build a new objective function that tries to maximize the probability of a word and context being in the corpus data if it indeed is, and to maximize the probability of a word and context not being in the corpus data if it indeed is not. We take a simple maximum likelihood approach of these two probabilities. (Here we take $\theta$ to be the parameters of the model; in our case they are $\mathcal{V}$ and $\mathcal{U}$.)

$$
\theta = \underset{\theta}{\operatorname{argmax}} \prod_{(w,c)\in D} P(D=1\mid w,c,\theta) \prod_{(w,c)\in \tilde{D}} P(D=0\mid w,c,\theta)
$$
Note that maximizing the likelihood is the same as minimizing the negative log likelihood
$$
J = -\sum_{(w,c)\in D} \log \frac{1}{1+\exp(-u_w^{T} v_c)} \;-\; \sum_{(w,c)\in \tilde{D}} \log \frac{1}{1+\exp(u_w^{T} v_c)}
$$
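As a rough sanity check of this formula, the sketch below evaluates J over a handful of positive pairs (standing in for D) and negative pairs (standing in for D̃), where each pair is given directly as an output vector u_w and an input vector v_c; the helper names and the random placeholder vectors are my own.

```python
import numpy as np

def log_sigmoid(x):
    # log(1 / (1 + exp(-x))), computed in a numerically stable way
    return -np.logaddexp(0.0, -x)

def neg_sampling_objective(pos_pairs, neg_pairs):
    """J = -sum over D of log sigma(u_w^T v_c) - sum over D~ of log sigma(-u_w^T v_c)."""
    J = 0.0
    for u_w, v_c in pos_pairs:      # (w, c) pairs that do occur in the corpus (D)
        J -= log_sigmoid(u_w @ v_c)
    for u_w, v_c in neg_pairs:      # "false" pairs standing in for D~
        J -= log_sigmoid(-(u_w @ v_c))
    return J

# Random placeholder vectors, just to show the call.
rng = np.random.default_rng(0)
pos = [(rng.normal(size=4), rng.normal(size=4)) for _ in range(3)]
neg = [(rng.normal(size=4), rng.normal(size=4)) for _ in range(3)]
print(neg_sampling_objective(pos, neg))
```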
Note that $\tilde{D}$ is a “false” or “negative” corpus, containing unnatural sentences like “stock boil fish is toy” that should get a low probability of ever occurring. We can generate $\tilde{D}$ on the fly by randomly sampling these negatives from the word bank.
For skip-gram, our new objective function for observing the context word $c-m+j$ given the center word $c$ would be:

$$
-\log\sigma(u_{c-m+j}^{T} v_c) - \sum_{k=1}^{K} \log\sigma(-\tilde{u}_k^{T} v_c)
$$
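Reusing the `log_sigmoid` helper from the sketch above, a per-pair version of this loss, for one (center, context) pair plus K sampled negative output vectors, might look like the following (the name `ns_pair_loss` and its arguments are assumptions for illustration):

```python
def ns_pair_loss(v_pred, u_true, u_negs):
    # -log sigma(u_true^T v_pred) - sum_k log sigma(-u_negs[k]^T v_pred)
    loss = -log_sigmoid(u_true @ v_pred)
    for u_k in u_negs:              # K negative output vectors sampled from P_n(w)
        loss -= log_sigmoid(-(u_k @ v_pred))
    return loss
```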
For CBOW, our new objective function for observing the center word $u_c$ given the context vector $\hat{v} = \frac{v_{c-m} + v_{c-m+1} + \cdots + v_{c+m}}{2m}$ would be:

$$
-\log\sigma(u_c^{T} \hat{v}) - \sum_{k=1}^{K} \log\sigma(-\tilde{u}_k^{T} \hat{v})
$$
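The CBOW case can reuse the same `ns_pair_loss` sketch, with the averaged context vector $\hat{v}$ as the predictor and the center word’s output vector as the target; `v_context`, `u_center` and `u_neg_samples` are placeholder names, not defined in the text:

```python
# v_context: list of the 2m embedded context vectors v_{c-m}, ..., v_{c+m}
v_hat = sum(v_context) / len(v_context)              # hat{v}, the averaged context vector
loss = ns_pair_loss(v_hat, u_center, u_neg_samples)  # same per-pair loss as for skip-gram
```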
In the above formulation, $\{\tilde{u}_k \mid k=1\ldots K\}$ are sampled from $P_n(w)$. There is much discussion of what makes the best approximation; what seems to work best is the Unigram Model raised to the power of 3/4. Why 3/4? Here’s an example that might help build some intuition:
is: $0.9^{3/4} = 0.92$
constitution: $0.09^{3/4} = 0.16$
bombastic: $0.01^{3/4} = 0.032$
“bombastic” is now 3x more likely to be sampled while “is” only went up marginally.
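One common way to realize this in practice is to raise the unigram counts to the 3/4 power and renormalize to obtain $P_n(w)$, then draw the K negatives from it. The sketch below uses a made-up toy count table:

```python
import numpy as np

# Toy unigram counts, made up purely for illustration.
counts = {"is": 900, "constitution": 90, "bombastic": 10}
words = list(counts)
freq = np.array([counts[w] for w in words], dtype=float)

p_unigram = freq / freq.sum()      # plain unigram probabilities
p_noise = p_unigram ** 0.75        # raise to the 3/4 power ...
p_noise /= p_noise.sum()           # ... and renormalize to get P_n(w)

# Draw K = 5 negative samples from the noise distribution.
rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=p_noise)
print(dict(zip(words, np.round(p_noise, 3))), list(negatives))
```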