Log-Linear model
Let x be an example, and let y be a possible label for it. A log-linear model assumes that
p(y∣x;w)=Z(x,w)exp[∑j=1JwjFj(x,y)]
where the partition function
Z(x,w)=y′∑exp[j=1∑JwjFj(x,y′)]
Note that in ∑y′, we make a summation over all possible y. Therefore, given x, the label predicted by the model is
y^=yargmaxp(y∣x;w)=yargmaxj=1∑JwjFj(x,y)
Each expression Fj(x,y) is called a feature-function. You can think of it as the j-th feature extracted from (x,y).
Remark of the log-linear model:
-
a linear combination ∑j=1JwjFj(x,y) can take any positive or negative real value; the exponential makes it positive.
-
The division makes the result p(y∣x;w) between 0 and 1, i.e. makes them be valid probabilities.
Conditional Random Fields (CRF)
Last time, we talked about Markov Random Fields. In this post, we are going to discuss Conditional Random Fields, which is an important special case of Markov Random Fields arises when they are applied to model a conditional probability distribution p(y∣x), where x and y are vactor-valued variables.

Formal definition of CRF
Formally, a CRF is a Markov network which specifies a conditional distribution
P(y∣x)=Z(x)1c∈C∏ϕc(xc,yc)
with partition function
Z=y∈Y∑c∈C∏ϕc(xc,yc)
we further assume that the factors ϕc(xc,yc) (maximal cliques) are of the form
ϕc(xc,yc)=exp[wcTfc(xc,yc)]
Since we require our potential function ϕ to be non-negative, it’s natural to use the exponential function. fc(xc,yc) can be an arbitrary set of features describing the compatibility between xc and yc. Note that these feature functions could be designed by manually doing feature engineering or using deep learning, LSTM, etc.
Log-linear model to linear-CRF
As a remainder, let x be an example, and let y be a possible label for it. Then a log-linear model assumes that
p(y∣x;w)=Z(x,w)exp[∑j=1JwjFj(x,y)]
From now on, we use the bar notation for sequences. Then to linear-CRF, we write the above equation as
p(yˉ∣xˉ;w)=Z(xˉ,w)exp[∑j=1JwjFj(xˉ,yˉ)]=Z(xˉ,w)exp[∑j=1Jwj∑i=2Tfj(yi−1,yi,xˉ)](1)
where y can take values from {1,2,...,m}. Here is an example:
Assume we have a sequence xˉ=(x1,x2,x3,x4) and the corresponding hidden sequence yˉ=(y1,y2,y3,y4).

We can divide each feature-function Fj(xˉ,yˉ) into fuctions for each maximal clique. That is,
Fj(xˉ,yˉ)=i=2∑Tfj(yi−1,yi,xˉ)(1.1)
Perticularly, from the above figure, since we have 3 maximal cliques, so
Fj(xˉ,yˉ)=fj(y1,y2,xˉ)+fj(y2,y3,xˉ)+fj(y3,y4,xˉ)
If we extract J feature functions from the (xˉ,yˉ) pair, then it becomes
j=1∑JwjFj(x,y)=j=1∑Jwji=2∑Tfj(yi−1,yi,xˉ)
Inference problem for CRF
- Goal: given a sequence xˉ, and parameter w, find the best hidden sequence yˉ. The condition probability of yˉ is
p(yˉ∣xˉ;w)=Z(xˉ,w)exp[∑j=1Jwj∑i=2Tfj(yi−1,yi,xˉ)]
Our objective is(check that the objective of CRF is the objective of Log-Linear model described above):
y^=yˉargmaxp(yˉ∣xˉ;w)=yˉargmaxj=1∑Jwji=2∑Tfj(yi−1,yi,xˉ)=yˉargmaxi=2∑Tgi(yi−1,yi)(2)(3)(4)
Note:
-
(2)→(3): we can ignore the denominator since it stays the same for all possible yˉ. Exponential function won’t affect our objective.
-
We set
gi(yi−1,yi)=j=1∑Jwj⋅fj(yi−1,yi,xˉ)(5)
Based on our objective in (5), we want to find the best path from y1 to yT such that the objective function is maximized. Clearly, we can use Dynamic Programming (DP) here.

Let u(k,v) denote the score of the best path from t=1 to t=k, where the tag of time k is v. Then the recursion formula can be easily visualized from the above figure and we can write it as
u(k,v)=smax[u(k−1,s)+gk(s,v)]
where s takes values from states {1,2,...,m}. The maximum of the objective is max{u(T,1),u(T,2),...,u(T,m)}.
Learning problem for CRF
Goal: Given the data set D={(x(1),y(1)),...,(x(n),y(n))}, we want to find parameter w to maximize p(D∣w). That is,
w^MLE=wmaxp(D∣w)=wmaxi=1∏np(y(i)∣x(i);w)
That is, we need to take derivatives and then use the gradient descent method.
Learning problem for general Log-Linear model
p(yˉ,xˉ;w)=Z(x,w)exp[∑j=1JwjFj(x,y)]
Take the derivative with respect to wj:
∂wj∂[logp(y∣x;w)]=∂wj∂[j=1∑JwjFj(x,y)−logZ(x,w)]=Fj(x,y)−Z(x,w)1⋅∂wj∂Z(x,w)(6)
where
∂wj∂Z(x,w)=∂wj∂y′∑exp[j=1∑JwjFj(x,y′)]=y′∑∂wj∂[expj=1∑JwjFj(x,y′)]=y′∑[expj=1∑JwjFj(x,y′)]⋅Fj(x,y′)(7)
Combining (6) and (7), we have
∂wj∂[logp(y∣x;w)]=Fj(x,y)−Z(x,w)1y′∑Fj(x,y′)[expj=1∑JwjFj(x,y′)]=Fj(x,y)−y′∑Fj(x,y′)Z(x,w)exp∑j=1JwjFj(x,y′)=Fj(x,y)−y′∑Fj(x,y′)⋅p(y′∣x;w)=Fj(x,y)−Ey′∼p(y′∣x;w)[Fj(x,y′)](8)
Learning problem for CRF
We can edit (8) to get the partial derivative for CRF:
∂wj∂[logp(yˉ∣xˉ;w)]=Fj(xˉ,yˉ)−Eyˉ′∼p(yˉ′∣x;w)[Fj(x,yˉ′)]=Fj(xˉ,yˉ)−Eyˉ′[i=2∑Tfj(yi−1,yi,xˉ)]=Fj(xˉ,yˉ)−i=2∑TEyˉ′[fj(yi−1,yi,xˉ)]=Fj(xˉ,yˉ)−i=2∑TEyi−1,yi[fj(yi−1,yi,xˉ)]=Fj(xˉ,yˉ)−i=2∑Tyi−1∑yi∑fj(yi−1,yi,xˉ)⋅p(yi−1,yi∣xˉ;w)(9)(10)(11)(12)(13)
Note:
-
(9)→(10): Use equation (1).
-
(11)→(12): each term fj(yi−1,yi,xˉ) is only related to yi−1 and yi.
-
In the equation (13), the only unknown term is p(yi−1,yi∣xˉ;w). Let’s now see how to compute it.
Compute Z(xˉ,w)
Z(xˉ,w)=yˉ∑exp[j=1∑JwjFj(xˉ,yˉ)]=yˉ∑exp[j=1∑Jwji=2∑Tfj(yi−1,yi,xˉ)]=yˉ∑[expi=2∑Tgi(yi−1,yi)](14)
We see the term gi(yi−1,yi) in the equation (14) again, and (14) is the sum of [exp∑i=2Tgi(yi−1,yi)] over all y. If we list all the possibilities, the time complexity is O(mT), which is not acceptable. So we should solve it in a similar way like what we did in the inference section (Dynamic Programming). There are two ways to solve it: forward algorithm and backward algorithm. Note that this is very similar to HMM we discussed before.
Forward Algorithm

Let α(k,v) denote the sum of all possible paths from t=1 to t=k, where the tag of time k is v. Then the recursion formula can be easily visualized from the above figure and we can write it as
α(k,v)=smax[α(k−1,s)⋅expgk(s,v)]
where s∈{1,2,...,m}. Then, we can write Z(xˉ,w) as
Z(xˉ,w)=s=1∑mα(T,s)
Backward Algorithm

Let β(k,v) denote the sum of all possible paths from t=k to t=T, where the tag of time t is v. Then the recursion formula can be easily visualized from the above figure and we can write it as
β(k,v)=smax[β(k+1,s)⋅expgk+1(v,s)]
where s∈{1,2,...,m}. Then, we can write Z(xˉ,w) as
Z(xˉ,w)=s=1∑mβ(1,s)
Compute p(yk=u∣xˉ;w)

From the figure above, we can divide it into the product of a forward term and a backward term:
p(yk=u∣xˉ;w)=Z(xˉ,w)α(k,u)⋅β(k,u)
where
Z(xˉ,w)=u∑α(k,u)⋅β(k,u)
- Note that we can also compute Z(xˉ,w) by write it as a product of an α term and a β term.
Compute p(yk=u,yk+1=v∣xˉ;w)

From the figure above, we can divide it into the product of a forward term, a backward term, and an term that represent the path going from yk=u to yk+1=v:
p(yk=u,yk+1=v∣xˉ;w)=Z(xˉ,w)α(k,u)⋅[expgk+1(u,v)]⋅β(k+1,v)
where
Z(xˉ,w)=u∑v∑α(k,u)⋅[expgk+1(u,v)]⋅β(k+1,v)
Now go back to where we stopped (equation (13)), and use what we just derived above, we have
∂wj∂[logp(yˉ∣xˉ;w)]=Fj(xˉ,yˉ)−i=2∑Tyi−1∑yi∑fj(yi−1,yi,xˉ)⋅p(yi−1,yi∣xˉ;w)=Fj(xˉ,yˉ)−i=2∑Tyi−1∑yi∑fj(yi−1,yi,xˉ)⋅Z(xˉ,w)α(i−1,yi−1)⋅[expgi(yi−1,yi)]⋅β(i,y1)(15)
Now every term in the equation (15) is known. So we can use SGD to update the parameter w.
Reference:
- https://ermongroup.github.io/cs228-notes/representation/undirected/
- http://cseweb.ucsd.edu/~elkan/250B/CRFs.pdf
- http://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf
- http://cseweb.ucsd.edu/~elkan/250Bfall2007/loglinear.pdf