Hidden Markov Model (HMM) 详细推导及思路分析

往期文章链接目录

Before reading this post, make sure you are familiar with the EM Algorithm and decent among of knowledge of convex optimization. If not, please check out my previous post

Let’s get started!

Conditional independence

AA and BB are conditionally independent given CC if and only if, given knowledge that CC occurs, knowledge of whether AA occurs provides no information on the likelihood of BB occurring, and knowledge of whether BB occurs provides no information on the likelihood of AA occurring.

Formally, if we denote conditional independence of AA and BB given CC by (A ⁣ ⁣ ⁣B)C(A\perp \!\!\!\perp B)\mid C, then by definition, we have

(A ⁣ ⁣ ⁣B)C    P(A,BC)=P(AC)P(BC)(A\perp \!\!\!\perp B)\mid C\quad \iff \quad P(A, B\mid C)= P(A\mid C) \cdot P(B\mid C)

Given the knowledge that CC occurs, to show the knowledge of whether BB occurs provides no information on the likelihood of AA occurring, we have

P(AB,C)=P(A,B,C)P(B,C)=P(A,BC)P(C)P(B,C)=P(AC)P(BC)P(C)P(BC)P(C)=P(AC) \begin{aligned} P(A | B ,C) &=\frac{P(A , B , C)}{P(B , C)} \\ &=\frac{P(A , B | C) \cdot P(C)}{P(B , C)} \\ &=\frac{P(A | C) \cdot P(B | C) \cdot P(C)}{P(B | C) \cdot P(C)} \\ &=P(A | C) \end{aligned}

Two classical cases where XX and ZZ are conditionally independent

Case 1 :

Hidden Markov Model (HMM) 详细推导及思路分析

From the above directed graph, we have P(X,Y,Z)=P(X)P(YX)P(ZY)P(X,Y,Z) = P(X)\cdot P(Y|X)\cdot P(Z|Y). Hence we have

P(ZX,Y)=P(X,Y,Z)P(X,Y)=P(X)P(YX)P(ZY)P(X)P(YX)=P(ZY) \begin{aligned} P(Z|X,Y) &= \frac{P(X,Y,Z)}{P(X,Y)}\\ &= \frac{P(X)\cdot P(Y|X)\cdot P(Z|Y)}{P(X)\cdot P(Y|X)}\\ &= P(Z|Y) \end{aligned}

Therefore, XX and ZZ are conditionally independent.

Case 2 :

Hidden Markov Model (HMM) 详细推导及思路分析

From the above directed graph, we have P(X,Y,Z)=P(Y)P(XY)P(ZY)P(X,Y,Z) = P(Y)\cdot P(X|Y) \cdot P(Z|Y). Hence we have

P(ZX,Y)=P(X,Y,Z)P(X,Y)=P(Y)P(XY)P(ZY)P(Y)P(XY)=P(ZY) \begin{aligned} P(Z|X,Y) &= \frac{P(X,Y,Z)}{P(X,Y)}\\ &= \frac{P(Y)\cdot P(X|Y) \cdot P(Z|Y)}{P(Y)\cdot P(X|Y)}\\ &= P(Z|Y) \end{aligned}

Therefore, XX and ZZ are conditionally independent.

Settings of the Hidden Markov Model (HMM)

The HMM is based on augmenting the Markov chain. A Markov chain is a model that tells us something about the probabilities of sequences of random variables, states, each of which can take on values from some set. A Markov chain makes a very strong assumption that if we want to predict the future in the sequence, all that matters is the current state.

To put it formally, suppose we have a sequence of state variables z1,z2,...,znz_1, z_2, ..., z_n. Then the Markov assumption is

p(znz1z2...zn1)=p(znzn1) p(z_n | z_1z_2...z_{n-1}) = p(z_n | z_{n-1})

A Markov chain is useful when we need to compute a probability for a sequence of observable events. However, in many cases the events we are interested in are hidden. For example we don’t normally observe part-of-speech (POS) tags in a text. Rather, we see words, and must infer the tags from the word sequence. We call the tags hidden because they are not observed.

Hidden Markov Model (HMM) 详细推导及思路分析

A hidden Markov model (HMM) allows us to talk about both observed events (like words that we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our probabilistic model. An HMM is specified by the following components:

  • A sequence of hidden states zz, where zkz_k takes values from all possible hidden states Z={1,2,..,m}Z = \{1,2,..,m\}.

  • A sequence of observations xx, where x=(x1,x2,...,xn)x = (x_1, x_2, ..., x_n). Each one is drawn from a vocabulary VV.

  • A transition probability matrix AA, where AA is an m×mm \times m matrix. AijA_{ij} represents the probability of moving from state ii to state jj: Aij=p(zt+1=jzt=i)A_{ij} = p(z_{t+1}=j| z_t=i), and j=1mAij=1\sum_{j=1}^{m} A_{ij} = 1 for all ii.

  • An emission probability matrix BB, where BB is an m×Vm \times |V| matrix. BijB_{ij} represents the probability of an observation xjx_j being generated from a state ii: Bij=P(xt=Vjzt=i)B_{ij} = P(x_t = V_j|z_t = i)

  • An initial probability distribution π\pi over states, where π=(π1,π2,...,πm)\pi = (\pi_1, \pi_2, ..., \pi_m). πi\pi_i is the probability that the Markov chain will start in state ii. i=1mπi=1\sum_{i=1}^{m} \pi_i = 1.

Given a sequence xx and the corresponding hidden states zz (like one in the picture above), we have

P(x,zθ)=p(z1)[p(z2z1)p(z3z2)...(znzn1)][p(x1z1)p(x2z2)...p(xnzn)](0) P(x, z|\theta) = p(z_1) \cdot [p(z_2|z_1)\cdot p(z_3|z_2)\cdot ... \cdotp(z_n|z_{n-1})] \cdot [p(x_1|z_1)\cdot p(x_2|z_2)\cdot ... \cdot p(x_n|z_n)] \tag 0

We get p(z1)p(z_1) from π\pi, p(zk+1zk)p(z_{k+1}|z_k) from AA, and p(xkzk)p(x_k|z_k) from BB.

Useful probabilities p(zkx)p(z_k | x) and p(zk+1,zkx)p(z_{k+1}, z_k | x)

p(zkx)p(z_k | x) and p(zk+1,zkx)p(z_{k+1}, z_k | x) are useful probabilities and we are going to use them later.

Intuition: Once we have a sequence xx, we might be interested in find the probability of any hidden state zkz_k, i.e., find probabilities p(zk=1x),p(zk=2x),...,p(zk=mx)p(z_k =1| x), p(z_k =2| x), ..., p(z_k =m| x). we have the following

p(zkx)=p(zk,x)p(x)(1)p(zk,x)(2) \begin{aligned} p(z_k | x) &= \frac{p(z_k, x)}{p(x)} & & (1)\\ &\propto p(z_k, x) & & (2)\\ \end{aligned}

Note that from (1)(1) to (2), since p(x)p(x) doesn’t change for all values of zkz_k, p(zkx)p(z_k | x) is proportional to p(zk,x)p(z_k, x).

p(zk=i,x)=p(zk=i,x1:k,xk+1:n)=p(zk=i,x1:k)p(xk+1:nzk=i,x1:k)(3)=p(zk=i,x1:k)p(xk+1:nzk=i)(4.1)=αk(zk=i)βk(zk=i)(4.11) \begin{aligned} p(z_k=i, x) &= p(z_k=i, x_{1:k}, x_{k+1:n}) \\ &= p(z_k=i, x_{1:k}) \cdot p(x_{k+1:n}|z_k=i, x_{1:k}) & & (3)\\ &= p(z_k=i, x_{1:k}) \cdot p(x_{k+1:n}|z_k=i) & & (4.1) \\ &= \alpha_k(z_k=i) \cdot \beta_k(z_k=i) &&(4.11)\\ \end{aligned}

Hidden Markov Model (HMM) 详细推导及思路分析

From the above graph, we see that the second term (3)(3) is the 2nd classical cases. So xk+1:nx_{k+1:n} and x1:kx_{1:k} are conditionally independent. This is why we can go from (3)(3) to (4.1)(4.1). We are going to use the Forward Algorithm to compute p(zk,x1:k)p(z_k, x_{1:k}), and Backward Algorithm to compute p(xk+1:nzk)p(x_{k+1:n}|z_k) later.

We denote p(zk,x1:k)p(z_k, x_{1:k}) by αk(zk)\alpha_k(z_k) and p(xk+1:nzk)p(x_{k+1:n}|z_k) by βk(zk)\beta_k(z_k).

After we know how to calculate these two terms separately, we can calculate p(zkx)p(z_k | x) easily by introducing a normalization term. That is,

p(zk=ix)=p(zk=i,x)j=1mp(zk=j,x)=αk(zk=i)βk(zk=i)j=1mαk(zk=j)βk(zk=j)(4.2) \begin{aligned} p(z_k = i | x) &= \frac{p(z_k = i, x)}{\sum_{j=1}^m p(z_k = j, x)} \\ &= \frac{\alpha_k(z_k=i)\beta_k(z_k=i)}{\sum_{j=1}^{m} \alpha_k(z_k=j)\beta_k(z_k=j)} && (4.2) \end{aligned}

where jmp(zk=j,x)\sum_j^m p(z_k = j, x) is the normalization term which makes p(zk=1,x)p(z_k = 1, x) take values between 00 and 11 for all zkz_k.

Similarly, we are also interested in finding p(zk+1,zkx)p(z_{k+1}, z_k | x), where

p(zk+1,zkx)p(zk+1,zk,x) p(z_{k+1}, z_k | x) \propto p(z_{k+1}, z_k, x)

By using the property of conditional independence, we have

p(zk+1=j,zk=i,x)=p(zk=i,zk+1=j,x1:k,xk+1,xk+2:n)=p(zk=i,x1:k)p(xk+2:n)zk+1=j)p(zk+1=jzk=i)p(xk+1zk+1=j)(4.3)=αk(zk=i)βk+1(zk+1=j)p(zk+1=jzk=i)p(xk+1zk+1=j)(4.4) \begin{aligned} p(z_{k+1}=j, z_k=i, x) &= p(z_k=i, z_{k+1}=j, x_{1:k}, x_{k+1}, x_{k+2:n}) \\ &= p(z_k=i, x_{1:k}) \cdot p(x_{k+2:n)|z_{k+1}=j}) \cdot p(z_{k+1}=j|z_{k}=i) \cdot p(x_{k+1}| z_{k+1}=j) && (4.3)\\ &= \alpha_k(z_k=i) \cdot \beta_{k+1}(z_{k+1}=j) \cdot p(z_{k+1}=j|z_{k}=i) \cdot p(x_{k+1}| z_{k+1}=j) && (4.4) \\ \end{aligned}

Note that we can find the third and the forth term from the transition probability matrix and the emission probability matrix. Again, we can calculate p(zk+1,zkx)p(z_{k+1}, z_k | x) simply by introducing a normalization term. That is,

p(zk+1=s,zk=rx)=p(zk+1=s,zk=r,x)i=1mj=1mp(zk+1=j,zk=i,x)=αk(zk=r)βk+1(zk+1=s)p(zk+1=szk=r)p(xk+1zk+1=s)i=1mαk(zk=i)βk+1(zk+1=j)p(zk+1=jzk=i)p(xk+1zk+1=j)(4.42) \begin{aligned} p(z_{k+1}=s, z_k=r | x) &= \frac{p(z_{k+1}=s, z_k=r, x)}{\sum_{i=1}^{m} \sum_{j=1}^{m} p(z_{k+1}=j, z_k=i, x)} \\ &= \frac{\alpha_k(z_k=r) \cdot \beta_{k+1}(z_{k+1}=s) \cdot p(z_{k+1}=s|z_{k}=r) \cdot p(x_{k+1}| z_{k+1}=s)}{\sum_{i=1}^{m} \alpha_k(z_k=i) \cdot \beta_{k+1}(z_{k+1}=j) \cdot p(z_{k+1}=j|z_{k}=i) \cdot p(x_{k+1}| z_{k+1}=j)} && (4.42) \end{aligned}

Remark

  • We denote

    γk(i)=p(zk=ix)=p(zk=i,x)p(x)(4.43) \gamma_k(i) = p(z_k = i | x) = \frac{p(z_k=i, x)}{p(x)} \tag{4.43}

  • We denote

    ξk(i,j)=p(zk+1=j,zk=ix)=p(zk+1=j,zk=i,x)p(x)(4.44) \xi_k(i,j) = p(z_{k+1}=j, z_k=i | x) = \frac{p(z_{k+1}=j, z_k=i, x)}{p(x)} \tag{4.44}

Three fundamental problems of HMM

Problem 1 (Likelihood): Given an observation sequence xx and parameters θ=(A,B,π)\theta = (A, B, \pi), determine the likelihood p(xθ)p(x|\theta).

Problem 2 (Learning): Given an observation sequence xx, learn the parameters θ=(A,B,π)\theta = (A, B, \pi).

Problem 3 (Inference): Given an observation sequence xx and parameters θ=(A,B,π)\theta = (A, B, \pi), discover the best hidden state sequence zz.

Problem 1 (Likelihood)

Goal: Given an observation sequence xx and parameters θ=(A,B,π)\theta = (A, B, \pi), determine the likelihood p(xθ)p(x|\theta).

Naive Way:

From (0)(0), we have already know how to compute P(x,zθ)P(x, z|\theta), so we can compute p(xθ)p(x|\theta) by summing all possible sequence zz:

p(xθ)=zP(x,zθ)p(zθ) p(x|\theta) = \sum_z P(x, z|\theta) \cdot p(z|\theta)

This method is not applicable since there are mnm^n ways of combinations of sequence zz. So we introduce the following two algorithm: Forward Algorithm and Backward Algorithm.

Forward Algorithm

Hidden Markov Model (HMM) 详细推导及思路分析
  • Goal: Compute p(zk,x1:k)p(z_k, x_{1:k}), given θ=(A,B,π)\theta = (A, B, \pi).

From the picture above, it’s natural to compute p(zk,x1:k)p(z_k, x_{1:k}) by dynamic programming (DP). That is, to calculate it in terms of p(zk1,x1:k1)p(z_{k-1}, x_{1:k-1}):

p(zk,x1:k)=zk1p(zk,zk1,x1:k1,xk)(5)=zk1p(zk1,x1:k1)p(zk,xkzk1,x1:k1)(6)=zk1p(zk1,x1:k1)p(zkzk1,x1:k1)p(xkzk,zk1,x1:k1)(7)=zk1p(zk1,x1:k1)p(zkzk1)p(xkzk)(8) \begin{aligned} p(z_k , x_{1:k}) &= \sum_{z_{k-1}} p(z_k, z_{k-1}, x_{1:k-1}, x_k) & & (5)\\ &= \sum_{z_{k-1}} p(z_{k-1}, x_{1:k-1}) \cdot p(z_k, x_k|z_{k-1}, x_{1:k-1}) & & (6)\\ &= \sum_{z_{k-1}} p(z_{k-1}, x_{1:k-1}) \cdot p(z_k|z_{k-1}, x_{1:k-1}) \cdot p(x_k|z_k, z_{k-1}, x_{1:k-1}) & & (7)\\ &= \sum_{z_{k-1}} p(z_{k-1}, x_{1:k-1}) \cdot p(z_k|z_{k-1}) \cdot p(x_k|z_k) & & (8)\\ \end{aligned}

Ramark:

  • From (6)(6) to (7)(7), we use the fact that p(b,ca)=p(ba)p(ca,b)p(b,c|a) = p(b|a) \cdot p(c|a,b).

  • From (7)(7) to (8)(8), we use the conditional independence, which is visualized in the picture above.

  • We denote p(zk,x1:k)p(z_k , x_{1:k}) by αk(zk)\alpha_k(z_k), so

    αk(zk)=p(zk,x1:k)=zk1αk1(zk1)p(zkzk1)p(xkzk)(9) \alpha_k(z_k) = p(z_k , x_{1:k}) = \sum_{z_{k-1}} \alpha_{k-1}(z_{k-1}) \cdot p(z_k|z_{k-1}) \cdot p(x_k|z_k) \tag 9

  • In equation (9)(9), the term p(zkzk1)p(z_k|z_{k-1}) is the transition probability from state zk1z_{k-1} to state zkz_{k}; the term p(xkzk)p(x_k|z_k) is the emission probability of observing xkx_k given state zkz_k.

  • α1(z1=q)=p(z1=q,x1)=πqp(x1z1=q)\alpha_1(z_1=q) = p(z_1=q, x_1) = \pi_q \cdot p(x_1 | z_1 = q), where p(x1z1=q)p(x_1 | z_1 = q) is an emmission probability.

Knowing how to compute p(zk,x1:k)p(z_k , x_{1:k}) recurssively, we have

p(xθ)=p(x1:nθ)=znp(zn,x1:n)=znαn(zn)=q=1mαn(zn=q)p(x|\theta) = p(x_{1:n}|\theta) = \sum_{z_n} p(z_n, x_{1:n}) = \sum_{z_n} \alpha_n(z_n) = \sum_{q=1}^{m} \alpha_n(z_n=q)

Backward Algorithm

  • Goal: Compute p(xk+1:nzk)p(x_{k+1:n} | z_k), given θ=(A,B,π)\theta = (A, B, \pi).

Again, we are going to use DP to compute p(xk+1:nzk)p(x_{k+1:n} | z_k) in terms of p(xk+2:nzk+1)p(x_{k+2:n} | z_{k+1}):

p(xk+1:nzk)=zk+1p(xk+1,xk+2:n,zk+1zk)=zk+1p(xk+2:n,zk+1zk)p(xk+1zk,xk+2:n,zk+1)=zk+1p(zk+1zk)p(xk+2:nzk+1,zk)p(xk+1zk,xk+2:n,zk+1)(10)=zk+1p(xk+2:nzk+1)p(zk+1zk)p(xk+1zk+1)(11) \begin{aligned} p(x_{k+1:n} | z_k) &= \sum_{z_{k+1}} p(x_{k+1}, x_{k+2:n}, z_{k+1} | z_k) \\ &= \sum_{z_{k+1}} p(x_{k+2:n}, z_{k+1}| z_k) \cdot p(x_{k+1}| z_k, x_{k+2:n}, z_{k+1}) \\ &= \sum_{z_{k+1}} p(z_{k+1}|z_k) \cdot p(x_{k+2:n}|z_{k+1}, z_k) \cdot p(x_{k+1} | z_k, x_{k+2:n}, z_{k+1}) && (10)\\ &= \sum_{z_{k+1}} p(x_{k+2:n}|z_{k+1}) \cdot p(z_{k+1}|z_k) \cdot p(x_{k+1}|z_{k+1}) && (11)\\ \end{aligned}

Ramark:

  • From (10)(10) to (11)(11), we use the conditional independece similar to the one in forward algorithm.

  • We denote p(xk+1:nzk)p(x_{k+1:n} | z_k) by βk(zk)\beta_k(z_k), so

    βk(zk)=p(xk+1:nzk)=zk+1p(xk+2:nzk+1)p(zk+1zk)p(xk+1zk+1)(12) \beta_k(z_k) = p(x_{k+1:n} | z_k) = \sum_{z_{k+1}} p(x_{k+2:n}|z_{k+1}) \cdot p(z_{k+1}|z_k) \cdot p(x_{k+1}|z_{k+1}) \tag {12}

  • In equation (12)(12), the term p(zk+1zk)p(z_{k+1}|z_k) is the transition probability from state zkz_{k} to state zk+1z_{k+1}; the term p(xk+1zk+1)p(x_{k+1}|z_{k+1}) is the emission probability of observing xk+1x_{k+1} given state zk+1z_{k+1}.

  • βn(zn)=1\beta_n(z_n) = 1.

Knowing how to compute p(xk+1:nzk)p(x_{k+1:n} | z_k) recursively, we have

p(xθ)=z1p(x,z1)=z1p(xz1)p(z1)=z1p(x1,x1+1:nz1)p(z1)=z1p(x1z1)p(x1+1:nz1,x1)p(z1)(13)=z1p(x1z1)p(x1+1:nz1)p(z1)(14)=z1p(x1z1)β1(z1)p(z1)=q=1mβ1(z1=q)p(x1z1=q)πq \begin{aligned} p(x|\theta) &= \sum_{z_1} p(x, z_1) = \sum_{z_1} p(x | z_1) \cdot p(z_1) \\ &= \sum_{z_1} p(x_1, x_{1+1:n} | z_1) \cdot p(z_1) \\ &= \sum_{z_1} p(x_1 | z_1) \cdot p(x_{1+1:n}|z_1, x_1) \cdot p(z_1) &&(13)\\ &= \sum_{z_1} p(x_1 | z_1) \cdot p(x_{1+1:n}|z_1) \cdot p(z_1) &&(14)\\ &= \sum_{z_1} p(x_1 | z_1) \cdot \beta_1(z_1) \cdot p(z_1) \\ &= \sum_{q=1}^m \beta_1(z_1=q) \cdot p(x_1 | z_1=q) \cdot \pi_q \end{aligned}

From (13)(13) to (14)(14), we use the conditional independence. To make it clean, I didn’t include θ\theta in the above derivation, but keep in mind xx is conditioned on θ\theta.

Problem 2 (Learning)

Goal: Given an observation sequence xx, learn the parameters θ=(A,B,π)\theta = (A, B, \pi).

Given that the hidden states are unknown, it’s natural to use the EM Algorithm to solve parameters. Remind that the EM Algorithm consists of two steps:

  • An expectation (E) step, which creates a function Q(θ,θi)Q(\theta, \theta_i) for the expectation of the log-likelihood logp(x,zθ)\log p(x,z|\theta) evaluated using the current conditional distribution of zz given xx and the current estimate of the parameters θi\theta_i, where

Q(θ,θi)=EzP(zx,θi)[logp(x,zθ)]=zP(zx,θi)logp(x,zθ) \begin{aligned} Q(\theta, \theta_i) &= E_{z \sim P(z|x,\theta_i)}[\log p(x,z|\theta)] \\ &= \sum_z P(z|x,\theta_i) \cdot \log p(x,z|\theta) \\ \end{aligned}

  • A maximization (M) step, which computes parameters maximizing the expected log-likelihood Q(θ,θi)Q(\theta, \theta_i) found on the EE step and then update parameters to θi+1\theta_{i+1}.

We fist initialize parameters θ0=(A0,B0,π0)\theta_0 = (A_0, B_0, \pi_0)

E Step:

We are going to construct Q(θ,θi)Q(\theta, \theta_i).

Q(θ,θi)=EzP(zx,θi)[logp(x,zθ)]=zP(zx,θi)logp(x,zθ)=logp(x,zθ)p(x,zθi)p(xθi)(15)logp(x,zθ)p(x,zθi)(16) \begin{aligned} Q(\theta, \theta_i) &= E_{z \sim P(z|x,\theta_i)}[\log p(x,z|\theta)] \\ &= \sum_z P(z|x,\theta_i) \cdot \log p(x,z|\theta) \\ &= \log p(x,z|\theta) \cdot \frac{p(x,z|\theta_i)}{p(x|\theta_i)} &&(15)\\ &\propto \log p(x,z|\theta) \cdot p(x,z|\theta_i) &&(16)\\ \end{aligned}

Since we know xx and θi\theta_i, p(xθi)p(x|\theta_i) is a constant and therefore we can write from (15)(15) to (16)(16). In the earlier section “Settings of the Hidden Markov Model” of the post, we deduce that

P(x,zθ)=p(z1)[p(z2z1)p(z3z2)...(znzn1)][p(x1z1)p(x2z2)...p(xnzn)]P(x, z|\theta) = p(z_1) \cdot [p(z_2|z_1)\cdot p(z_3|z_2)\cdot ... \cdotp(z_n|z_{n-1})] \cdot [p(x_1|z_1)\cdot p(x_2|z_2)\cdot ... \cdot p(x_n|z_n)]

So we can formulate Q(θ,θi)Q(\theta, \theta_i) as

Q(θ,θi)=z(logπzi+t=1n1logp(zt+1zt)+t=1nlogp(xnzn))p(x,zθi)=zlogπzip(x,zθi)+zt=1n1logp(zt+1zt)p(x,zθi)+zt=1nlogp(xtzt)p(x,zθi) \begin{aligned} Q(\theta, \theta_i) &= \sum_z \left( \log \pi_{z_i} + \sum_{t=1}^{n-1} \log p(z_{t+1}|z_t) + \sum_{t=1}^{n} \log p(x_n|z_n)\right) \cdot p(x,z|\theta_i) \\ &= \sum_z \log \pi_{z_i} \cdot p(x,z|\theta_i) + \sum_z \sum_{t=1}^{n-1} \log p(z_{t+1}|z_t) \cdot p(x,z|\theta_i) + \sum_z \sum_{t=1}^{n} \log p(x_t|z_t) \cdot p(x,z|\theta_i) \\ \end{aligned}

M Step:

We are going to maximize Q(θ,θi)Q(\theta, \theta_i) and update θi+1\theta_{i+1}.

Note that we write Q(θ,θi)Q(\theta, \theta_i) as the sum of three terms. The first therm is related to π\pi, the second term is related to AA, and the third term is related to BB. Therefore we can maximize each term separately.

We can write the first term as

zlogπzip(x,zθi)=j=1mlogπjp(x,z1=jθi)\sum_z \log \pi_{z_i} \cdot p(x,z|\theta_i) = \sum_{j=1}^m \log \pi_j \cdot p(x, z_1 = j|\theta_i)

under the constraint j=1mπj=1\sum_{j=1}^m \pi_j = 1. Clearly, this is a convex optimization problem:

Hidden Markov Model (HMM) 详细推导及思路分析

The Lagrangian LL associated with the problem is

L(π,v)=j=1mlogπjp(x,z1=jθi)+v(j=1mπj1) L(\pi, v) = \sum_{j=1}^m \log \pi_j \cdot p(x, z_1 = j|\theta_i) + v \cdot (\sum_{j=1}^m \pi_j - 1)

Note that any pair of primal and dual optimal points must satisfy the KKT conditions. So we use one KKT property that the gradient must vanish at the optimal point to find π\pi. This might not be the optimal π\pi since “any pair of primal and dual optimal points must satisfy the KKT conditions” doesn’t imply that a point satisfying the KKT conditions is the optimal.

Lπj=p(x,z1=jθi)1πj+v=0p(x,z1=jθi)+vπj=0πj=p(x,z1=jθi)v(17) \begin{aligned} \frac{\partial L}{\partial \pi_j} = p(x, z_1=j|\theta_i) \cdot \frac{1}{\pi_j} + v & = 0 \\ p(x, z_1=j|\theta_i) + v \cdot \pi_j & = 0 \\ \pi_j & = \frac{-p(x, z_1=j|\theta_i)}{v} & & (17)\\ \end{aligned}

By setting Lπj=0\frac{\partial L}{\partial \pi_j} = 0 for all jj, we have

j=1mp(x,z1=jθi)+vj=1mπj=0p(xθi)+v=0v=p(xθi)(18) \begin{aligned} \sum_{j=1}^m p(x, z_1=j|\theta_i) + v \cdot \sum_{j=1}^m \pi_j &= 0 \\ p(x|\theta_i) + v &= 0\\ v &= - p(x|\theta_i) && (18)\\ \end{aligned}

By plugging (18)(18) into (17)(17), we have

πj=p(x,z1=jθi)v=p(x,z1=jθi)p(xθi)=γ1(j) \pi_j = \frac{-p(x, z_1=j|\theta_i)}{v} = \frac{p(x, z_1=j|\theta_i)}{p(x|\theta_i)} = \gamma_1(j).

In the similar way, we can write the second term as

zt=1n1logp(zt+1zt)p(x,zθi)=j=1mk=1mt=1n1logp(zt+1=kzt=j)p(x,zt=j,zt+1=kθi)zt=1nlogp(xtzt)p(x,zθi)=j=1mt=1nlogp(xtzt=j)p(x,zt=jθi) \sum_z \sum_{t=1}^{n-1} \log p(z_{t+1}|z_t) \cdot p(x,z|\theta_i) = \sum_{j=1}^m \sum_{k=1}^m \sum_{t=1}^{n-1} \log p(z_{t+1}=k|z_t=j) \cdot p(x,z_t=j, z_{t+1}=k|\theta_i) \\ \sum_z \sum_{t=1}^{n} \log p(x_t|z_t) \cdot p(x,z|\theta_i) = \sum_{j=1}^m \sum_{t=1}^n \log p(x_t|z_t=j) \cdot p(x,z_t=j|\theta_i)

with seperate constraints

k=1mp(zt+1=kzt=j)=1,j=1mp(xtzt=j)=1\sum_{k=1}^m p(z_{t+1}=k|z_t=j) = 1, \\ \sum_{j=1}^m p(x_t|z_t=j) = 1

We can solve for optimal parameters similar to solving for π\pi. After we set the gradient of corresponding Lagrangian to 00, we have

Ajk=p(zt+1=kzt=j)=t=1n1p(x,zt=j,zt+1=kθi)t=1n1p(x,zt=jθi)=t=1n1p(x,zt=j,zt+1=kθi)/p(xθi)t=1n1p(x,zt=jθi)/p(xθi)=t=1n1ξt(jk)t=1n1γt(j)(19)Bjk=p(xt=Vkzt=j)=t=1n1p(x,zt=jθi)I(xt=Vk)t=1n1p(x,zt=jθi)=t=1n1p(x,zt=jθi)/p(xθi)I(xt=Vk)t=1n1p(x,zt=jθi)/p(xθi)=t=1n1γt(j)I(xt=Vk)t=1n1γt(j)(20) \begin{aligned} A_{jk} &= p(z_{t+1}=k|z_t=j) \\ &= \frac{\sum_{t=1}^{n-1} p(x, z_t=j, z_{t+1}=k|\theta_i)}{\sum_{t=1}^{n-1} p(x, z_t=j|\theta_i)} \\ &= \frac{\sum_{t=1}^{n-1} p(x, z_t=j, z_{t+1}=k|\theta_i)/ p(x|\theta_i)}{\sum_{t=1}^{n-1} p(x, z_t=j|\theta_i)/p(x|\theta_i)} \\ &= \frac{\sum_{t=1}^{n-1} \xi_t(jk)}{\sum_{t=1}^{n-1} \gamma_t(j)} &&(19)\\\\ B_{jk} &= p(x_t= V_k|z_t=j) \\ &= \frac{\sum_{t=1}^{n-1} p(x, z_t=j|\theta_i)\cdot I(x_t= V_k)}{\sum_{t=1}^{n-1} p(x, z_t=j | \theta_i)} \\ &= \frac{\sum_{t=1}^{n-1} p(x, z_t=j|\theta_i)/ p(x|\theta_i) \cdot I(x_t= V_k)}{\sum_{t=1}^{n-1} p(x, z_t=j | \theta_i)/ p(x|\theta_i)} \\ &= \frac{\sum_{t=1}^{n-1} \gamma_t(j) \cdot I(x_t=V_k)}{\sum_{t=1}^{n-1} \gamma_t(j)} &&(20)\\ \end{aligned}

Remark: I(xt=Vk)I(x_t= V_k) is an indicator function. If xt=Vkx_t= V_k, then I(xt=Vk)=1I(x_t= V_k) = 1, and 00 otherwise. We get the results (19)(19) and (20)(20) from (4.43)(4.43) and (4.44)(4.44).

We also call this algorithm Baum-Welch algorithm.

Problem 3 (Inference)

Goal: Given an observation sequence xx and parameters θ=(A,B,π)\theta = (A, B, \pi), discover the best hidden state sequence zz.

Method 1:Brute force.

This is not applicable. Every hidden state has mm choices and the sequence has length nn. So there are mnm^n possible combinations.

Method 2: Use Forward/ Backward Algorithm.
Given a sequence xx, we know how to compute p(zkx)p(z_k | x) from the top of the post. Therefore, at each time kk, we can compute p(zk=ix)p(z_k=i| x) for all i{1,2,...,m}i \in \{1,2,...,m\} and choose the one with the highest probability. In the way, at every time, we chose the most possible hidden state. However, there is still a problem. It only finds the most possible hidden state locally and doesn’t take the whole sequence into account. Even if we chose the most possible hidden state at each time kk. The combination of them might not be the best one and even doesn’t make sense. Foe example, if Aij=0A_{ij} = 0, then if zk=iz_k=i, zk+1z_{k+1} cannot take state jj. But the Forward/ Backward Algorithm doesn’t take it into account. We can think of it as a greedy approach to approximate the best result.

Method 3: Viterbi algorithm:

The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states that results in a sequence of observed events in HMM.

We define δk(i)\delta_k(i) to be the maximum probability among all paths which are at state ii at time kk. That is,

δk(i)=maxi1:ik1p(ik=i,i1:ik1,x1:kθ),i=1,2,...,m\delta_k(i) = \mathop{\rm max}\limits_{i_1:i_{k-1}} p(i_k=i, i_1:i_{k-1}, x_{1:k}|\theta), \quad i= 1,2,...,m

We can construct a recurssion formula δk(i)\delta_k(i). That is,

δk+1(i)=maxi1:ikp(ik+1=i,i1:ik,x1:k+1θ)=max1jmδk(j)Ajip(xk+1zk+1=i)i=1,2,...,m;k=1,2,...,n1 \begin{aligned} \delta_{k+1}(i) &= \mathop{\rm max}\limits_{i_1:i_{k}} p(i_{k+1}=i, i_1:i_{k}, x_{1:k+1}|\theta) \\ &= \mathop{\rm max}\limits_{1\leq j \leq m} \delta_k(j) \cdot A_{ji} \cdot p(x_{k+1}|z_{k+1}=i) && i= 1,2,...,m; \, k=1,2,...,n-1\\ \end{aligned}

And the base case is δ1(i)=πip(x1z1=i)\delta_{1}(i) = \pi_i \cdot p(x_1| z_1=i).

Therefore, we compute the optimal probability PP^{*}

P=max1imδn(i)P^{*} = \mathop{\rm max}\limits_{1 \leq i \leq m} \delta_n(i)

by recursion. During the process of finding the highest probability of a path, we keep recording the hidden states associate with the path. So after we find the the highest probability, we also record the path associated with it and therefore the best sequence zz.


Reference:

  • https://towardsdatascience.com/conditional-independence-the-backbone-of-bayesian-networks-85710f1b35b
  • https://courses.cs.washington.edu/courses/cse473/16au/slides-16au/25-bn.pdf
  • https://en.wikipedia.org/wiki/Conditional_independence
  • https://web.stanford.edu/~jurafsky/slp3/A.pdf
  • https://www.cs.cmu.edu/~epxing/Class/10701-08s/recitation/em-hmm.pdf
  • https://people.eecs.berkeley.edu/~stephentu/writeups/hmm-baum-welch-derivation.pdf
  • 统计学习方法,李航著,清华大学出版社

往期文章链接目录