Introduction
This note introduces the Hidden Markov Model (HMM). What is an HMM used for? Consider the following problem:

Given an unknown observation $O$, recognize it as one of $N$ classes with minimum probability of error.

How, then, should the error and the error probability be defined?
Conditional Error: Given $O$, the risk associated with deciding that it is a class $i$ event is

$$R(S_i|O) = \sum_{j=1}^{N} e_{ij}\, P(S_j|O)$$

where $P(S_j|O)$ is the probability that $O$ is a class $S_j$ event, and $e_{ij}$ is the cost of classifying a class $j$ event as a class $i$ event, with $e_{ij} > 0$ for $i \ne j$ and $e_{ii} = 0$.
Expected Error:

$$E = \int R(S(O)|O)\, p(O)\, dO$$

where $S(O)$ is the decision made on $O$ under some policy. The question can then be posed as: how should $S(O)$ be chosen so that the error probability is minimized, i.e., so that $P(S(O)|O)$ is maximized?
Bayes Decision Theory
If we institute the policy $S(O) = S_i = \arg\max_{S_j} P(S_j|O)$, then $R(S(O)|O) = \min_{S_j} R(S_j|O)$. This is the so-called Maximum A Posteriori (MAP) decision. But how do we know $P(S_j|O)$, $j = 1, 2, \dots, N$, for any $O$?
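As a quick sanity check on the MAP rule, take the uniform 0-1 cost $e_{ij} = 1$ for $i \ne j$ (and $e_{ii} = 0$), a common special case. Then

$$R(S_i|O) = \sum_{j \ne i} P(S_j|O) = 1 - P(S_i|O),$$

so minimizing the conditional risk is exactly maximizing the posterior $P(S_i|O)$. By Bayes' rule, $P(S_j|O) \propto P(O|S_j)\, P(S_j)$, so it suffices to model the class-conditional likelihood $P(O|S_j)$; for sequential observations, that is where Markov models come in.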
Markov Model
States: $S = \{S_0, S_1, S_2, \dots, S_N\}$

Transition probabilities: $P(q_t = S_i \mid q_{t-1} = S_j)$

Markov Assumption:

$$P(q_t = S_i \mid q_{t-1} = S_j, q_{t-2} = S_k, \dots) = P(q_t = S_i \mid q_{t-1} = S_j) = a_{ji}, \qquad a_{ji} \ge 0,\; \sum_{i=1}^{N} a_{ji} = 1 \ \ \forall j$$
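As a small illustration, here is how a state sequence can be sampled from a Markov chain; the transition matrix below is made up for the example, and Python with numpy is used throughout these sketches:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up transition matrix: A[j][i] = P(q_t = S_i | q_{t-1} = S_j); rows sum to 1.
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])

state, path = 0, [0]
for _ in range(10):
    # Markov assumption: the next state depends only on the current state.
    state = int(rng.choice(len(A), p=A[state]))
    path.append(state)
print(path)
```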
Hidden Markov Model
States: $S = \{S_0, S_1, S_2, \dots, S_N\}$

Transition probabilities: $P(q_t = S_i \mid q_{t-1} = S_j) = a_{ji}$

Output probability distributions (at state $j$ for symbol $k$): $P(y_t = O_k \mid q_t = S_j) = b_j(k, \lambda_j)$, parameterized by $\lambda_j$.
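To make the "hidden" part concrete, here is a minimal sketch (all parameter values invented) that represents $\lambda = (A, B)$ with a start state $S_0$ and an end state $S_N$, and samples from the model; an observer sees only the output symbols, never the state path:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy lambda = (A, B): state 0 = start (non-emitting), state 3 = end, symbols {0, 1}.
A = np.array([[0, .6, .4, 0], [0, .5, .3, .2], [0, .2, .5, .3], [0, 0, 0, 1.]])
B = np.array([[.5, .5], [.7, .3], [.4, .6], [.5, .5]])

q, outputs = 0, []
while q != len(A) - 1:                      # walk until the end state is reached
    q = int(rng.choice(len(A), p=A[q]))     # hidden transition a_{ji}
    outputs.append(int(rng.choice(B.shape[1], p=B[q])))   # visible symbol from b_j
print(outputs)   # the state path stays hidden; only this sequence is observed
```

The same toy $\lambda$ is reused in the sketches below.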

HMM Problems and Solutions
- Evaluation: Given a model, compute the probability of an observation sequence.
- Decoding: Find the state sequence that maximizes the probability of the observation sequence.
- Training: Adjust the model parameters to maximize the probability of the observed sequences.
Evaluation
Compute the probability of the observation sequence $O = o_1 o_2 \dots o_T$, given the HMM parameters $\lambda$:

$$P(O|\lambda) = \sum_{\forall Q} P(O, Q|\lambda) = \sum_{\forall Q} a_{q_0 q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T), \qquad Q = q_0 q_1 q_2 \dots q_T$$
This is not practical, since the number of paths is $O(N^T)$. How can we deal with it?
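For intuition about the blow-up, here is a brute-force sketch that literally enumerates every path through the toy $\lambda$ above (invented numbers) and sums $P(O, Q|\lambda)$; the outer loop runs $O(N^T)$ times:

```python
import numpy as np
from itertools import product

# Toy lambda = (A, B): state 0 = start, state 3 = end, symbols {0, 1}.
A = np.array([[0, .6, .4, 0], [0, .5, .3, .2], [0, .2, .5, .3], [0, 0, 0, 1.]])
B = np.array([[.5, .5], [.7, .3], [.4, .6], [.5, .5]])
obs = [0, 1, 1]   # o_1 o_2 o_3, stored 0-indexed

total = 0.0
# Enumerate all assignments of q_1 .. q_T (q_0 = 0 fixed, q_T must be the end state).
for q in product(range(1, len(A)), repeat=len(obs)):
    if q[-1] != len(A) - 1:
        continue
    p, prev = 1.0, 0
    for t, j in enumerate(q):
        p *= A[prev, j] * B[j, obs[t]]   # a_{q_{t-1} q_t} * b_{q_t}(o_t)
        prev = j
    total += p
print(total)   # matches alpha_T(N) from the forward algorithm below
```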
Forward Algorithm
$$\alpha_t(j) = P(o_1 o_2 \dots o_t,\, q_t = S_j \mid \lambda)$$

Compute $\alpha$ recursively:

$$\alpha_0(j) = \begin{cases} 1, & \text{if } S_j \text{ is the start state} \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

$$\alpha_t(j) = \Big[ \sum_{i=0}^{N} \alpha_{t-1}(i)\, a_{ij} \Big] b_j(o_t), \quad t > 0 \tag{2}$$

Then $P(O|\lambda) = \alpha_T(N)$. The computation is $O(N^2 T)$.
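A direct transcription of equations (1) and (2) into the running Python/numpy sketch, using the toy $\lambda$ from above (invented numbers):

```python
import numpy as np

# Toy lambda = (A, B): state 0 = start, state 3 = end, symbols {0, 1}.
A = np.array([[0, .6, .4, 0], [0, .5, .3, .2], [0, .2, .5, .3], [0, 0, 0, 1.]])
B = np.array([[.5, .5], [.7, .3], [.4, .6], [.5, .5]])
obs = [0, 1, 1]   # o_1 o_2 o_3, stored 0-indexed

def forward(A, B, obs):
    """alpha[t][j] = P(o_1 .. o_t, q_t = S_j | lambda)."""
    n, T = A.shape[0], len(obs)
    alpha = np.zeros((T + 1, n))
    alpha[0, 0] = 1.0                                 # eq. (1): start in S_0
    for t in range(1, T + 1):
        for j in range(n):                            # eq. (2)
            alpha[t, j] = (alpha[t - 1] @ A[:, j]) * B[j, obs[t - 1]]
    return alpha

alpha = forward(A, B, obs)
print(alpha[-1, -1])   # P(O | lambda) = alpha_T(N)
```

Each of the $T$ steps touches $N^2$ transition terms, which is where the $O(N^2 T)$ cost comes from.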
Backward Algorithm
$$\beta_t(i) = P(o_{t+1} o_{t+2} \dots o_T \mid q_t = S_i, \lambda)$$

Compute $\beta$ recursively:

$$\beta_T(i) = \begin{cases} 1, & \text{if } S_i \text{ is the end state} \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

$$\beta_t(i) = \sum_{j=0}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad t < T \tag{4}$$

Then $P(O|\lambda) = \beta_0(0)$. The computation is $O(N^2 T)$.
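The mirror-image sketch for equations (3) and (4), on the same toy $\lambda$:

```python
import numpy as np

# Toy lambda = (A, B): state 0 = start, state 3 = end, symbols {0, 1}.
A = np.array([[0, .6, .4, 0], [0, .5, .3, .2], [0, .2, .5, .3], [0, 0, 0, 1.]])
B = np.array([[.5, .5], [.7, .3], [.4, .6], [.5, .5]])
obs = [0, 1, 1]   # o_1 o_2 o_3, stored 0-indexed

def backward(A, B, obs):
    """beta[t][i] = P(o_{t+1} .. o_T | q_t = S_i, lambda)."""
    n, T = A.shape[0], len(obs)
    beta = np.zeros((T + 1, n))
    beta[T, n - 1] = 1.0                              # eq. (3): end state S_N
    for t in range(T - 1, -1, -1):
        for i in range(n):                            # eq. (4); obs[t] is o_{t+1}
            beta[t, i] = np.sum(A[i] * B[:, obs[t]] * beta[t + 1])
    return beta

beta = backward(A, B, obs)
print(beta[0, 0])   # P(O | lambda) = beta_0(0); same value as alpha_T(N)
```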
Decoding
Find the state sequence $Q$ that maximizes $P(O, Q|\lambda)$.
Viterbi Algorithm
$$VP_t(i) = \max_{q_0 q_1 \dots q_{t-1}} P(o_1 o_2 \dots o_t,\, q_t = S_i \mid \lambda)$$

Compute $VP$ recursively:

$$VP_t(j) = \max_{i=0,1,\dots,N} VP_{t-1}(i)\, a_{ij}\, b_j(o_t), \quad t > 0$$

Then $\max_Q P(O, Q|\lambda) = VP_T(N)$. Save each maximizing predecessor $i$ so the best state sequence can be recovered by a backtrace at the end.
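This is the forward recursion with the sum replaced by a max, plus the saved argmax for the backtrace; again a sketch over the toy $\lambda$:

```python
import numpy as np

# Toy lambda = (A, B): state 0 = start, state 3 = end, symbols {0, 1}.
A = np.array([[0, .6, .4, 0], [0, .5, .3, .2], [0, .2, .5, .3], [0, 0, 0, 1.]])
B = np.array([[.5, .5], [.7, .3], [.4, .6], [.5, .5]])
obs = [0, 1, 1]   # o_1 o_2 o_3, stored 0-indexed

def viterbi(A, B, obs):
    """Return max_Q P(O, Q | lambda) and the maximizing sequence q_0 .. q_T."""
    n, T = A.shape[0], len(obs)
    vp = np.zeros((T + 1, n))
    back = np.zeros((T + 1, n), dtype=int)   # saved maxima for the backtrace
    vp[0, 0] = 1.0                           # start in S_0
    for t in range(1, T + 1):
        for j in range(n):
            scores = vp[t - 1] * A[:, j] * B[j, obs[t - 1]]
            back[t, j] = int(np.argmax(scores))      # best predecessor i
            vp[t, j] = scores[back[t, j]]
    path = [n - 1]                           # backtrace from the end state S_N
    for t in range(T, 0, -1):
        path.append(int(back[t, path[-1]]))
    return vp[T, n - 1], path[::-1]

prob, path = viterbi(A, B, obs)
print(prob, path)   # best joint probability and its state sequence
```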
Training
To tune $\lambda$ so as to maximize $P(O|\lambda)$, no efficient algorithm for finding the global optimum is known; nonetheless, an efficient iterative algorithm that finds a local optimum exists.
Baum-Welch Reestimation
Define the probability of transiting from $S_i$ to $S_j$ at time $t$, given $O$, as

$$\xi_t(i,j) = P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O|\lambda)}$$

Let

$$\bar{a}_{ij} = \frac{\text{expected num. of trans. from } S_i \text{ to } S_j}{\text{expected num. of trans. from } S_i} = \frac{\sum_{t=0}^{T-1} \xi_t(i,j)}{\sum_{t=0}^{T-1} \sum_{j'=0}^{N} \xi_t(i,j')} \tag{5}$$

$$\bar{b}_j(k) = \frac{\text{expected num. of times in } S_j \text{ with symbol } k}{\text{expected num. of times in } S_j} = \frac{\sum_{t:\, o_{t+1} = k} \sum_{i=0}^{N} \xi_t(i,j)}{\sum_{t=0}^{T-1} \sum_{i=0}^{N} \xi_t(i,j)} \tag{6}$$
Training Algorithm:
1. Initialize $\lambda = (A, B)$.
2. Compute $\alpha$, $\beta$, and $\xi$.
3. Estimate $\bar{\lambda} = (\bar{A}, \bar{B})$ from $\xi$.
4. Replace $\lambda$ with $\bar{\lambda}$.
5. If not converged, go to step 2. (One pass of steps 2 through 4 is sketched below.)
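A sketch of one reestimation pass on the toy $\lambda$ (invented numbers), in the same numpy style as the earlier sketches; the zero-denominator guards are an implementation detail, not part of equations (5) and (6):

```python
import numpy as np

# Toy lambda = (A, B): state 0 = start, state 3 = end, symbols {0, 1}.
A = np.array([[0, .6, .4, 0], [0, .5, .3, .2], [0, .2, .5, .3], [0, 0, 0, 1.]])
B = np.array([[.5, .5], [.7, .3], [.4, .6], [.5, .5]])
obs = [0, 1, 1]   # o_1 o_2 o_3, stored 0-indexed
n, T = A.shape[0], len(obs)

# Step 2: alpha and beta, as in the forward/backward sketches.
alpha = np.zeros((T + 1, n)); alpha[0, 0] = 1.0
for t in range(1, T + 1):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t - 1]]
beta = np.zeros((T + 1, n)); beta[T, n - 1] = 1.0
for t in range(T - 1, -1, -1):
    beta[t] = A @ (B[:, obs[t]] * beta[t + 1])
pO = alpha[T, n - 1]                         # P(O | lambda)

# xi[t][i][j] = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | lambda)
xi = np.zeros((T, n, n))
for t in range(T):
    xi[t] = alpha[t, :, None] * A * (B[:, obs[t]] * beta[t + 1])[None, :] / pO

# Step 3, eq. (5): expected transition counts, normalized per source state.
from_i = xi.sum(axis=(0, 2))                 # expected transitions out of S_i
A_bar = xi.sum(axis=0) / np.where(from_i > 0, from_i, 1.0)[:, None]

# Step 3, eq. (6): expected emission counts, normalized per state.
occ = xi.sum(axis=1)                         # occ[t][j]: time in S_j emitting o_{t+1}
B_bar = np.zeros_like(B)
for t in range(T):
    B_bar[:, obs[t]] += occ[t]               # numerator: times in S_j with symbol k
B_bar /= np.where(occ.sum(axis=0) > 0, occ.sum(axis=0), 1.0)[:, None]

# Step 4: replace lambda; step 5 would re-check P(O | lambda) and loop to step 2.
A, B = A_bar, B_bar
```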
References

For more details, refer to:

- “An Introduction to Hidden Markov Models”, by L. R. Rabiner and B. H. Juang.
- “Hidden Markov Models: Continuous Speech Recognition”, by Kai-Fu Lee.
- Thanks to B. H. Juang at the Georgia Institute of Technology.
- Thanks to Wayne Ward at Carnegie Mellon University.