Log-Linear Model & CRF 条件随机场详解

往期文章链接目录

Log-Linear model

Let xx be an example, and let yy be a possible label for it. A log-linear model assumes that

p(yx;w)=exp[j=1JwjFj(x,y)]Z(x,w) p(y | x ; w)=\frac{\exp [\sum_{j=1}^J w_{j} F_{j}(x, y)]}{Z(x, w)}

where the partition function

Z(x,w)=yexp[j=1JwjFj(x,y)] Z(x, w)=\sum_{y^{\prime}} \exp [\sum_{j=1}^J w_{j} F_{j}\left(x, y^{\prime}\right)]

Note that in y\sum_{y^{\prime}}, we make a summation over all possible yy. Therefore, given xx, the label predicted by the model is

y^=argmaxyp(yx;w)=argmaxyj=1JwjFj(x,y) \hat{y}=\underset{y}{\operatorname{argmax}} p(y | x ; w)=\underset{y}{\operatorname{argmax}} \sum_{j=1}^J w_{j} F_{j}(x, y)

Each expression Fj(x,y)F_j(x, y) is called a feature-function. You can think of it as the jj-th feature extracted from (x,y)(x,y).

Remark of the log-linear model:

  • a linear combination j=1JwjFj(x,y)\sum_{j=1}^J w_{j} F_{j}(x, y) can take any positive or negative real value; the exponential makes it positive.

  • The division makes the result p(yx;w)p(y | x ; w) between 0 and 1, i.e. makes them be valid probabilities.

Conditional Random Fields (CRF)

Last time, we talked about Markov Random Fields. In this post, we are going to discuss Conditional Random Fields, which is an important special case of Markov Random Fields arises when they are applied to model a conditional probability distribution p(yx)p(y|x), where xx and yy are vactor-valued variables.

Log-Linear Model & CRF 条件随机场详解

Formal definition of CRF

Formally, a CRF is a Markov network which specifies a conditional distribution

P(yx)=1Z(x)cCϕc(xc,yc)P(y\mid x) = \frac{1}{Z(x)} \prod_{c \in C} \phi_c(x_c,y_c)

with partition function

Z=yYcCϕc(xc,yc)Z = \sum_{y \in \mathcal{Y}} \prod_{c \in C} \phi_c(x_c,y_c)

we further assume that the factors ϕc(xc,yc)\phi_c(x_c,y_c) (maximal cliques) are of the form

ϕc(xc,yc)=exp[wcTfc(xc,yc)]\phi_c(x_c,y_c) = \exp[w_c^T f_c(x_c, y_c)]

Since we require our potential function ϕ\phi to be non-negative, it’s natural to use the exponential function. fc(xc,yc)f_c(x_c, y_c) can be an arbitrary set of features describing the compatibility between xcx_c and ycy_c. Note that these feature functions could be designed by manually doing feature engineering or using deep learning, LSTM, etc.

Log-linear model to linear-CRF

As a remainder, let xx be an example, and let yy be a possible label for it. Then a log-linear model assumes that

p(yx;w)=exp[j=1JwjFj(x,y)]Z(x,w) p(y | x ; w)=\frac{\exp [\sum_{j=1}^J w_{j} F_{j}(x, y)]}{Z(x, w)}

From now on, we use the bar notation for sequences. Then to linear-CRF, we write the above equation as

p(yˉxˉ;w)=exp[j=1JwjFj(xˉ,yˉ)]Z(xˉ,w)=exp[j=1Jwji=2Tfj(yi1,yi,xˉ)]Z(xˉ,w)(1) \begin{aligned} p(\bar y | \bar x; w) &= \frac{\exp [\sum_{j=1}^J w_{j} F_{j}(\bar x, \bar y)]}{Z(\bar x, w)}\\ &= \frac{\exp [\sum_{j=1}^J w_{j} \sum_{i=2}^{T} f_j (y_{i-1}, y_i, \bar x)]}{Z(\bar x, w)} &&\quad(1) \end{aligned}

where yy can take values from {1,2,...,m}\{1,2,...,m\}. Here is an example:

Assume we have a sequence xˉ=(x1,x2,x3,x4)\bar x = (x_1, x_2, x_3, x_4) and the corresponding hidden sequence yˉ=(y1,y2,y3,y4)\bar y = (y_1, y_2, y_3, y_4).

Log-Linear Model & CRF 条件随机场详解

We can divide each feature-function Fj(xˉ,yˉ)F_j(\bar x, \bar y) into fuctions for each maximal clique. That is,

Fj(xˉ,yˉ)=i=2Tfj(yi1,yi,xˉ)(1.1) F_j(\bar x, \bar y) = \sum_{i=2}^{T} f_j (y_{i-1}, y_i, \bar x) \tag {1.1}

Perticularly, from the above figure, since we have 33 maximal cliques, so

Fj(xˉ,yˉ)=fj(y1,y2,xˉ)+fj(y2,y3,xˉ)+fj(y3,y4,xˉ) F_j(\bar x, \bar y) = f_j(y_1, y_2, \bar x) + f_j(y_2, y_3, \bar x) + f_j(y_3, y_4, \bar x)

If we extract JJ feature functions from the (xˉ,yˉ)(\bar x, \bar y) pair, then it becomes

j=1JwjFj(x,y)=j=1Jwji=2Tfj(yi1,yi,xˉ)\sum_{j=1}^J w_{j} F_{j}(x, y) = \sum_{j=1}^J w_{j} \sum_{i=2}^{T} f_j (y_{i-1}, y_i, \bar x)

Inference problem for CRF

  • Goal: given a sequence xˉ\bar x, and parameter ww, find the best hidden sequence yˉ\bar y. The condition probability of yˉ\bar y is

p(yˉxˉ;w)=exp[j=1Jwji=2Tfj(yi1,yi,xˉ)]Z(xˉ,w) p(\bar y | \bar x; w) = \frac{\exp [\sum_{j=1}^J w_{j} \sum_{i=2}^{T} f_j (y_{i-1}, y_i, \bar x)]}{Z(\bar x, w)}

Our objective is(check that the objective of CRF is the objective of Log-Linear model described above):

y^=argmaxyˉp(yˉxˉ;w)(2)=argmaxyˉj=1Jwji=2Tfj(yi1,yi,xˉ)(3)=argmaxyˉi=2Tgi(yi1,yi)(4) \begin{aligned} \hat{y} &= \underset{\bar y}{\operatorname{argmax}} p(\bar y | \bar x ; w) &&(2)\\ &= \underset{\bar y}{\operatorname{argmax}} \sum_{j=1}^J w_{j} \sum_{i=2}^{T} f_j (y_{i-1}, y_i, \bar x) &&(3) \\ &= \underset{\bar y}{\operatorname{argmax}} \sum_{i=2}^{T} g_i(y_{i-1}, y_i) && (4) \end{aligned}

Note:

  • (2)(3)(2) \to (3): we can ignore the denominator since it stays the same for all possible yˉ\bar y. Exponential function won’t affect our objective.

  • We set
    gi(yi1,yi)=j=1Jwjfj(yi1,yi,xˉ)(5)g_i(y_{i-1}, y_i) = \sum_{j=1}^J w_{j} \cdot f_j (y_{i-1}, y_i, \bar x) \tag 5

Based on our objective in (5)(5), we want to find the best path from y1y_1 to yTy_T such that the objective function is maximized. Clearly, we can use Dynamic Programming (DP) here.

Log-Linear Model & CRF 条件随机场详解

Let u(k,v)u(k,v) denote the score of the best path from t=1t=1 to t=kt=k, where the tag of time kk is vv. Then the recursion formula can be easily visualized from the above figure and we can write it as

u(k,v)=maxs[u(k1,s)+gk(s,v)] u(k,v) = \underset{s}{\operatorname{max}} [u(k-1, s) + g_k(s,v)]

where ss takes values from states {1,2,...,m}\{1,2,...,m\}. The maximum of the objective is max{u(T,1),u(T,2),...,u(T,m)}\operatorname{max} \{u(T,1), u(T,2), ..., u(T,m)\}.

  • Time complexity: O(mT)O(m)=O(m2T)O(mT) \cdot O(m) = O(m^2 T).

  • Space complexity: O(mT)O(mT) since we need to track the path of the best sequence yˉ\bar y.

Learning problem for CRF

Goal: Given the data set D={(x(1),y(1)),...,(x(n),y(n))}D = \{ ({x^{(1)}}, {y^{(1)}}), ..., ({x^{(n)}}, {y^{(n)}})\}, we want to find parameter ww to maximize p(Dw)p(D|w). That is,

w^MLE=maxwp(Dw)=maxwi=1np(y(i)x(i);w) \begin{aligned} \hat{w}_{MLE} &= \underset{w}{\operatorname{max}} p(D|w) \\ &= \underset{w}{\operatorname{max}} \prod_{i=1}^{n} p( {y^{(i)}} | {x^{(i)}}; w) \end{aligned}

That is, we need to take derivatives and then use the gradient descent method.

Learning problem for general Log-Linear model

p(yˉ,xˉ;w)=exp[j=1JwjFj(x,y)]Z(x,w)p(\bar y, \bar x; w) = \frac{\exp [\sum_{j=1}^J w_{j} F_{j}(x, y)]}{Z(x, w)}

Take the derivative with respect to wjw_j:

wj[logp(yx;w)]=wj[j=1JwjFj(x,y)logZ(x,w)]=Fj(x,y)1Z(x,w)wjZ(x,w)(6) \begin{aligned} \frac{\partial}{\partial w_j} [\log p(y| x; w)] &= \frac{\partial}{\partial w_j} [\sum_{j=1}^J w_{j} F_j(x, y) - \log Z(x,w)] \\ &= F_j(x, y) - \frac{1}{Z(x,w)} \cdot \frac{\partial}{\partial w_j} Z(x,w) &&(6) \end{aligned}

where

wjZ(x,w)=wjyexp[j=1JwjFj(x,y)]=ywj[expj=1JwjFj(x,y)]=y[expj=1JwjFj(x,y)]Fj(x,y)(7) \begin{aligned} \frac{\partial}{\partial w_j} Z(x,w) &= \frac{\partial}{\partial w_j} \sum_{y^{\prime}} \exp [\sum_{j=1}^J w_{j} F_{j}\left(x, y^{\prime}\right)] \\ &= \sum_{y^{\prime}} \frac{\partial}{\partial w_j} [\exp \sum_{j=1}^J w_{j} F_{j}\left(x, y^{\prime}\right)] \\ &= \sum_{y^{\prime}} [\exp \sum_{j=1}^J w_{j} F_{j}\left(x, y^{\prime}\right)] \cdot F_{j}\left(x, y^{\prime}\right) &&(7) \end{aligned}

Combining (6)(6) and (7)(7), we have

wj[logp(yx;w)]=Fj(x,y)1Z(x,w)yFj(x,y)[expj=1JwjFj(x,y)]=Fj(x,y)yFj(x,y)expj=1JwjFj(x,y)Z(x,w)=Fj(x,y)yFj(x,y)p(yx;w)=Fj(x,y)Eyp(yx;w)[Fj(x,y)](8) \begin{aligned} \frac{\partial}{\partial w_j} [\log p(y| x; w)] &= F_{j}\left(x, y\right) - \frac{1}{Z(x,w)} \sum_{y^{\prime}} F_{j}\left(x, y^{\prime}\right) [\exp \sum_{j=1}^J w_{j} F_{j}\left(x, y^{\prime}\right)] \\ &= F_{j}\left(x, y\right) - \sum_{y^{\prime}} F_{j}\left(x, y^{\prime}\right) \frac{\exp \sum_{j=1}^J w_{j} F_{j}\left(x, y^{\prime}\right)}{Z(x,w)} \\ &= F_{j}\left(x, y\right) - \sum_{y^{\prime}} F_{j}\left(x, y^{\prime}\right) \cdot p(y^{\prime}|x;w) \\ &= F_{j}\left(x, y\right) - E_{y^{\prime} \sim p(y^{\prime}|x;w)}[F_{j}\left(x, y^{\prime}\right)] &&(8) \end{aligned}

Learning problem for CRF

We can edit (8)(8) to get the partial derivative for CRF:

wj[logp(yˉxˉ;w)]=Fj(xˉ,yˉ)Eyˉp(yˉx;w)[Fj(x,yˉ)](9)=Fj(xˉ,yˉ)Eyˉ[i=2Tfj(yi1,yi,xˉ)](10)=Fj(xˉ,yˉ)i=2TEyˉ[fj(yi1,yi,xˉ)](11)=Fj(xˉ,yˉ)i=2TEyi1,yi[fj(yi1,yi,xˉ)](12)=Fj(xˉ,yˉ)i=2Tyi1yifj(yi1,yi,xˉ)p(yi1,yixˉ;w)(13) \begin{aligned} \frac{\partial}{\partial w_j} [\log p(\bar y| \bar x; w)] &= F_{j}\left(\bar x, \bar y\right) - E_{\bar y^{\prime} \sim p(\bar y^{\prime}|x;w)}[F_{j}\left(x,\bar y^{\prime}\right)] &&(9)\\ &= F_{j}\left(\bar x, \bar y\right) - E_{\bar y^{\prime}}[\sum_{i=2}^T f_j(y_{i-1}, y_i, \bar x)] &&(10)\\ &= F_{j}\left(\bar x, \bar y\right) - \sum_{i=2}^T E_{\bar y^{\prime} }[f_j(y_{i-1}, y_i, \bar x)] &&(11)\\ &= F_{j}\left(\bar x, \bar y\right) - \sum_{i=2}^T E_{y_{i-1}, y_i}[f_j(y_{i-1}, y_i, \bar x)] &&(12)\\ &= F_{j}\left(\bar x, \bar y\right) - \sum_{i=2}^T \sum_{y_{i-1}} \sum_{y_{i}} f_j(y_{i-1}, y_i, \bar x) \cdot p(y_{i-1}, y_i| \bar x; w) &&(13) \end{aligned}

Note:

  • (9)(10)(9) \to (10): Use equation (1)(1).

  • (11)(12)(11) \to (12): each term fj(yi1,yi,xˉ)f_j(y_{i-1}, y_i, \bar x) is only related to yi1y_{i-1} and yiy_i.

  • In the equation (13)(13), the only unknown term is p(yi1,yixˉ;w)p(y_{i-1}, y_i| \bar x; w). Let’s now see how to compute it.

Compute Z(xˉ,w)Z(\bar x, w)

Z(xˉ,w)=yˉexp[j=1JwjFj(xˉ,yˉ)]=yˉexp[j=1Jwji=2Tfj(yi1,yi,xˉ)]=yˉ[expi=2Tgi(yi1,yi)](14) \begin{aligned} Z(\bar x, w) &= \sum_{\bar y} \exp \left[\sum_{j=1}^J w_{j} F_{j}\left(\bar x, \bar y \right)\right] \\ &= \sum_{\bar y} \exp \left[\sum_{j=1}^J w_{j} \sum_{i=2}^T f_j(y_{i-1}, y_i, \bar x)\right] \\ &= \sum_{\bar y} \left[\exp \sum_{i=2}^T g_i(y_{i-1}, y_i)\right] && (14) \end{aligned}

We see the term gi(yi1,yi)g_i(y_{i-1}, y_i) in the equation (14)(14) again, and (14)(14) is the sum of [expi=2Tgi(yi1,yi)]\left[\exp \sum_{i=2}^T g_i(y_{i-1}, y_i)\right] over all yy. If we list all the possibilities, the time complexity is O(mT)O(m^T), which is not acceptable. So we should solve it in a similar way like what we did in the inference section (Dynamic Programming). There are two ways to solve it: forward algorithm and backward algorithm. Note that this is very similar to HMM we discussed before.

Forward Algorithm
Log-Linear Model & CRF 条件随机场详解

Let α(k,v)\alpha(k,v) denote the sum of all possible paths from t=1t=1 to t=kt=k, where the tag of time kk is vv. Then the recursion formula can be easily visualized from the above figure and we can write it as

α(k,v)=maxs[α(k1,s)expgk(s,v)] \alpha(k,v) = \underset{s}{\operatorname{max}} \left[\alpha(k-1, s) \cdot \text{exp}\, g_k(s,v)\right]

where s{1,2,...,m}s \in \{1,2,...,m\}. Then, we can write Z(xˉ,w)Z(\bar x, w) as

Z(xˉ,w)=s=1mα(T,s) Z(\bar x, w) = \sum_{s=1}^m \alpha(T, s)

Backward Algorithm
Log-Linear Model & CRF 条件随机场详解

Let β(k,v)\beta(k,v) denote the sum of all possible paths from t=kt=k to t=Tt=T, where the tag of time tt is vv. Then the recursion formula can be easily visualized from the above figure and we can write it as

β(k,v)=maxs[β(k+1,s)expgk+1(v,s)] \beta(k,v) = \underset{s}{\operatorname{max}} \left[\beta(k+1, s) \cdot \text{exp}\, g_{k+1}(v,s)\right]

where s{1,2,...,m}s \in \{1,2,...,m\}. Then, we can write Z(xˉ,w)Z(\bar x, w) as

Z(xˉ,w)=s=1mβ(1,s) Z(\bar x, w) = \sum_{s=1}^m \beta(1, s)

Compute p(yk=uxˉ;w)p(y_k=u|\bar x; w)

Log-Linear Model & CRF 条件随机场详解

From the figure above, we can divide it into the product of a forward term and a backward term:

p(yk=uxˉ;w)=α(k,u)β(k,u)Z(xˉ,w) p(y_k=u|\bar x; w) = \frac{\alpha(k,u)\cdot \beta(k,u)}{Z(\bar x, w)}

where

Z(xˉ,w)=uα(k,u)β(k,u) Z(\bar x, w) = \sum_u \alpha(k,u)\cdot \beta(k,u)

  • Note that we can also compute Z(xˉ,w)Z(\bar x, w) by write it as a product of an α\alpha term and a β\beta term.

Compute p(yk=u,yk+1=vxˉ;w)p(y_k=u, y_{k+1}=v|\bar x; w)

Log-Linear Model & CRF 条件随机场详解

From the figure above, we can divide it into the product of a forward term, a backward term, and an term that represent the path going from yk=uy_k=u to yk+1=vy_{k+1}= v:

p(yk=u,yk+1=vxˉ;w)=α(k,u)[expgk+1(u,v)]β(k+1,v)Z(xˉ,w) p(y_k=u, y_{k+1}=v|\bar x; w) = \frac{\alpha(k,u)\cdot [\text{exp} \, g_{k+1} (u,v)] \cdot\beta(k+1,v)}{Z(\bar x, w)}

where

Z(xˉ,w)=uvα(k,u)[expgk+1(u,v)]β(k+1,v) Z(\bar x, w) = \sum_u \sum_v \alpha(k,u)\cdot [\text{exp} \, g_{k+1} (u,v)] \cdot\beta(k+1,v)

Now go back to where we stopped (equation (13)(13)), and use what we just derived above, we have

wj[logp(yˉxˉ;w)]=Fj(xˉ,yˉ)i=2Tyi1yifj(yi1,yi,xˉ)p(yi1,yixˉ;w)=Fj(xˉ,yˉ)i=2Tyi1yifj(yi1,yi,xˉ)α(i1,yi1)[expgi(yi1,yi)]β(i,y1)Z(xˉ,w)(15) \begin{aligned} \frac{\partial}{\partial w_j} [\log p(\bar y| \bar x; w)] &= F_{j}\left(\bar x, \bar y\right) - \sum_{i=2}^T \sum_{y_{i-1}} \sum_{y_{i}} f_j(y_{i-1}, y_i, \bar x) \cdot p(y_{i-1}, y_i| \bar x; w) \\ &= F_{j}\left(\bar x, \bar y\right) - \sum_{i=2}^T \sum_{y_{i-1}} \sum_{y_{i}} f_j(y_{i-1}, y_i, \bar x) \cdot \frac{\alpha(i-1,y_{i-1})\cdot [\text{exp} \, g_{i} (y_{i-1},y_i)] \cdot\beta(i,y_1)}{Z(\bar x, w)} &&(15) \end{aligned}

Now every term in the equation (15)(15) is known. So we can use SGD to update the parameter ww.


Reference:

  • https://ermongroup.github.io/cs228-notes/representation/undirected/
  • http://cseweb.ucsd.edu/~elkan/250B/CRFs.pdf
  • http://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf
  • http://cseweb.ucsd.edu/~elkan/250Bfall2007/loglinear.pdf

往期文章链接目录