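[Knowledge Graph Study Notes] (3) Named Entity Recognition (1)
Reference:
2013-AAAI-Effective bilingual constraints for semi-supervised learning of named entity recognizers (Mengqiu Wang, Wanxiang Che, Christopher D. Manning)
- Abstract
Most semi-supervised methods in natural language processing make use of unannotated resources in a single language. However, information can also be gained from parallel resources in more than one language, since translations of the same utterance into different languages help to disambiguate each other. We demonstrate a method that makes effective use of large amounts of bilingual text (a.k.a. bitext) to improve monolingual systems. We propose a factored probabilistic sequence model that encourages both cross-language and intra-document consistency. A simple Gibbs sampling algorithm is introduced for approximate inference. Experiments on English-Chinese named entity recognition (NER) using the OntoNotes dataset show that, in a bilingual test setting, our method is significantly more accurate than state-of-the-art monolingual CRF models. Our model also improves on the work of Burkett et al. (2010), with relative error reductions of 10.8% and 4.5% in Chinese and English, respectively. Furthermore, by annotating a moderate amount of unlabeled bi-text with our bilingual model and using the tagged data for uptraining, we achieve a 9.2% error reduction in Chinese over the state-of-the-art Stanford monolingual NER system.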
Introduction
A number of semi-supervised techniques have been introduced to tackle this problem, such as bootstrapping (Yarowsky 1995; Collins and Singer 1999; Riloff and Jones 1999), multi-view learning (Blum and Mitchell 1998; Ganchev et al. 2008) and structural learning (Ando and Zhang 2005). Most previous semi-supervised work is situated in a monolingual setting where all unannotated data are available only in a single language. However, in recent years, a vast amount of translated parallel texts have been generated in our increasingly connected multilingual world. While such bi-texts have primarily been leveraged to train statistical machine translation (SMT) systems, contemporary research has increasingly considered the possibilities of utilizing parallel corpora to improve systems outside of SMT. For example, Yarowsky and Ngai (2001) project the part-of-speech labels assigned by a supervised model in one language (e.g. English) onto word-aligned parallel text in another language (e.g. Chinese) where less manually annotated data is available. Similar ideas were also employed by Das and Petrov (2011) and Fu, Qin, and Liu (2011).
Copyright © 2013, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
A severe limitation of methods employing bilingual projection is that they can only be applied to test scenarios where parallel sentence pairs are available. It is more desirable to improve monolingual system performance, which is more broadly applicable. Previous work such as Li et al. (2012) and Kim, Toutanova, and Yu (2012) successfully demonstrated that manually-labeled bilingual corpora can be used to improve monolingual system performance. This approach, however, encounters the difficulty that manually annotated bilingual corpora are even harder to come by than monolingual ones.
In this work, we consider a semi-supervised learning scheme using unannotated bi-text. For a given language pair (e.g., English-Chinese), we expect one language (e.g. English) to have more annotated training resources than the other (e.g. Chinese), and thus there exists a strong monolingual model (for English) and a weaker model (for Chinese). Since bi-text contains translations across the two languages, an aligned sentence pair would exhibit some semantic and syntactic similarities. Thus we can constrain the two models to agree with each other by making joint predictions that are skewed towards the more informed model. In general, errors made in the lower-resource model will be corrected by the higher-resource model, but we also anticipate that these joint predictions will have higher quality for both languages than the output of a monolingual model alone. We can then apply this bilingual annotation method to a large amount of unannotated bi-text, and use the resulting annotated data as additional training data to train a new monolingual model with better coverage.[1]
Burkett et al. (2010) proposed a similar framework with a “multi-view” learning scheme where k-best outputs of two monolingual taggers are reranked using a complex self-trained reranking model. In our work, we propose a simple decoding method based on Gibbs sampling that eliminates the need for training complex reranking models. In particular, we construct a new factored probabilistic model by chaining together two Conditional Random Field monolingual models with a bilingual constraint model, which encourages soft label agreements. We then apply Gibbs sampling to find the best labels under the new factored model. We can further improve the quality of bilingual prediction by incorporating an additional model, expanding upon Finkel, Grenager, and Manning (2005), that enforces global label consistency for each language.
Experiments on Named Entity Recognition (NER) show that our bilingual method yields significant improvements over the state-of-the-art Stanford NER system. When evaluated over the standard OntoNotes English-Chinese dataset in a bilingual setting, our models achieve an F1 error reduction of 18.6% in Chinese and 9.9% in English. Our method also improves over Burkett et al. (2010) with a relative error reduction of 10.8% and 4.5% in Chinese and English, respectively. Furthermore, we automatically label a moderate-sized set of 80k sentence pairs using our bilingual model, and train new monolingual models using an uptraining scheme. The resulting monolingual models demonstrate an error reduction of 9.2% over the Stanford NER systems for Chinese.[2]
Monolingual NER with CRF
Named Entity Recognition is an important task in NLP. It serves as a first step in turning unstructured text into structured data, and has broad applications in news aggregation, question answering, and bioNLP. Given an input sentence, an NER tagger identifies words that are part of a named entity, and assigns the entity type and relative position information. For example, in the commonly used BIO tagging scheme, a tag such as B-PERSON indicates the word is the beginning of a person name entity, and an I-LOCATION tag marks the word as being inside a location entity. All words marked with tag O are not part of any entity. Figure 1 illustrates a tagged sentence pair in English and Chinese.
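To make the BIO scheme concrete, here is a minimal sketch that turns entity spans into per-token tags. The function name `spans_to_bio`, the span format, and the example sentence are all assumptions made for illustration; the sentence is not the one shown in Figure 1.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans (start, end_exclusive, type) into BIO tags."""
    tags = ["O"] * len(tokens)           # every token starts outside any entity
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"       # remaining tokens inside the entity
    return tags


# Hypothetical example sentence.
tokens = ["Barack", "Obama", "visited", "New", "York", "."]
spans = [(0, 2, "PERSON"), (3, 5, "LOCATION")]
for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(token, tag)
```

The tagger's job is the inverse direction: predicting one such tag per token, from which the spans can be read back off.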
Current state-of-the-art supervised NER systems employ an undirected graphical model called Conditional Random Field (CRF) (Lafferty, McCallum, and Pereira 2001). Given an input sentence x, a linear-chain structured CRF defines the following conditional probability for tag sequence y:
$$P(y \mid x) = \frac{1}{Z(x)} \prod_i \exp\Big(\sum_j \lambda_j f_j(y_i, y_{i-1}, x)\Big)$$
where $f_j$ is the jth feature function, $\lambda_j$ is the feature weight, and $Z(x)$ is the partition function.
Bilingual NER Constraints
A pair of aligned sentences in two languages contain complementary cues to aid the analysis of each other. For example, in Figure 1, it is not immediately obvious whether the phrase “Foreign Affairs” on the English side refers to an organization (Ministry of Foreign Affairs), or general foreign affairs. But the aligned word on the Chinese side is a lot less ambiguous, and can be easily identified as an organization entity.
Another example is that in the Chinese training data we have never seen the translation of the name “Kamyao”. As a result, the tagger cannot make use of lexical features, and so has to rely on less informative contextual features to predict if it is a geo-political entity (GPE) or a person. But we have seen the aligned word on the English side being tagged as person, and thus can infer that the Chinese aligned entity should also be a person.
It is straight-forward to see that accurate word alignment is essential in such an analysis. Fortunately, there are automatic word alignment systems used in MT research that produce robust and accurate alignment results, and our method will use the output of one (Liang, Taskar, and Klein 2006).
Hard Agreement Constraints
Drawing on the above observations, we first propose a simple bilingual constraint model that enforces hard agreements. For Chinese and English input sentences $x^c$ and $x^e$, we define the probability of an output sequence pair $y^c$ and $y^e$ as a product, over the set A of all aligned word pairs, of an indicator function that equals 1 if $y^c_a = y^e_a$ for the aligned pair a, and 0 otherwise.
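The hard constraint can be sketched in a few lines. This is an illustrative reading of the definition above, not the authors' code: the function name `hard_agreement`, the index-pair representation of alignments, and the use of bare entity types instead of full BIO tags are assumptions.

```python
from typing import List, Tuple


def hard_agreement(y_c: List[str], y_e: List[str],
                   alignments: List[Tuple[int, int]]) -> float:
    """Product of indicators over aligned pairs: 1.0 if all agree, else 0.0.

    `alignments` holds (chinese_index, english_index) pairs.
    """
    score = 1.0
    for i_c, i_e in alignments:
        score *= 1.0 if y_c[i_c] == y_e[i_e] else 0.0
    return score


# Hypothetical two-word sentence pair with crossing alignments.
alignments = [(0, 1), (1, 0)]
print(hard_agreement(["ORG", "O"], ["O", "ORG"], alignments))   # all pairs agree
print(hard_agreement(["ORG", "O"], ["O", "GPE"], alignments))   # one pair disagrees
```

A single disagreeing pair zeroes the whole product, which is exactly the rigidity that the soft constraints relax.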
Soft Agreement Constraints
If we apply hard agreement constraints, any output sequence pairs that disagree on any tag pair will be assigned zero probability. Such a hard constraint is not always satisfied in practice, since annotation standards in different languages can differ. An example is given in Figure 2, where the phrase mention of “bonded area” is considered a location in the Chinese gold-standard, but not in the English gold-standard.
We can soften these constraints by replacing the 1 and 0 values in the indicator function with a probability measure. We first tag a set of unannotated bilingual sentence pairs using two baseline monolingual CRF taggers. Then we collect counts of aligned entity tag pairs from the auto-generated tagged data. The value assigned to each aligned tag pair is chosen to be the pairwise mutual information score of that entity pair. This version of constraints is denoted as auto.
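As a rough illustration of how such scores could be collected, the sketch below computes pointwise mutual information in its standard log form from a list of aligned (Chinese tag, English tag) pairs. The function name `pmi_table` and the toy counts are assumptions, and these notes do not say how the paper scales the score into a factor of the model, so that step is left out.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def pmi_table(aligned_tag_pairs: List[Tuple[str, str]]) -> Dict[Tuple[str, str], float]:
    """log( p(c, e) / (p(c) * p(e)) ) for every observed aligned tag pair."""
    n = len(aligned_tag_pairs)
    pair_counts = Counter(aligned_tag_pairs)
    c_counts = Counter(c for c, _ in aligned_tag_pairs)   # Chinese-side marginals
    e_counts = Counter(e for _, e in aligned_tag_pairs)   # English-side marginals
    return {
        (c, e): math.log((k / n) / ((c_counts[c] / n) * (e_counts[e] / n)))
        for (c, e), k in pair_counts.items()
    }


# Hypothetical auto-tagged data: tag pairs read off word alignments.
pairs = [("PER", "PER")] * 3 + [("PER", "O")] + [("O", "O")] * 4
for pair, score in sorted(pmi_table(pairs).items()):
    print(pair, round(score, 3))
```

Pairs that co-occur more often than chance get a positive score and mismatched pairs a negative one, so agreement is rewarded without disagreement being ruled out.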
Alignment Uncertainty
When we consider the previous two sets of bilingual constraints, we assume the word alignments are given by some off-the-shelf alignment model which outputs a set of “hard” alignments. In practice, most statistical word alignment models assign a probability to each alignment pair, and “hard” alignments are produced by cutting off alignment pairs that fall below a threshold value.
To take into account alignment uncertainties, we modify the agreement function by exponentiating its value to the power of the alignment probability, which gives a new function U.
The intuition behind this modification is that pairs with a higher alignment probability will reflect more probability fluctuation when different label assignments are considered.
For example, consider an extreme case where a particular pair of aligned words has alignment probability 0. Then the value of the U function will always be 1 regardless of what tags are assigned to the two words, thus reducing the impact of different choices of tags for this pair in the overall tag sequence assignment.
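The effect of the exponent is easy to check numerically. In this sketch the function name `uncertain_agreement` is an assumption, and 0.02 is borrowed from the hand-picked disagreement value mentioned later in these notes purely as a sample input.

```python
def uncertain_agreement(score: float, align_prob: float) -> float:
    """Raise a soft-agreement score to the power of the alignment probability."""
    return score ** align_prob


# A disagreeing tag pair with soft-agreement score 0.02 at several alignment probabilities.
for p in (0.0, 0.1, 0.5, 1.0):
    print(p, uncertain_agreement(0.02, p))
```

At probability 0 the factor is 1 whatever the tags are, so the pair has no influence; at probability 1 the original score is applied in full.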
Gibbs Sampling with Factored Models
In a monolingual setting, exact inference in a standard linear-chain CRF can be done by applying the Viterbi algorithm to find the most likely output sequence. But when we consider the joint probability of an output sequence pair in a bilingual setting, especially when we apply the aforementioned bilingual constraints, cyclic cliques are introduced into the Markov random field, which makes exact inference intractable.
Markov Chain Monte Carlo (MCMC) methods offer a simple and elegant solution for approximate inference by constructing a Markov chain whose stationary distribution is the target distribution.
In this work, we adopt a specific MCMC sampling method called Gibbs sampling (Geman and Geman 1984). We define a Markov chain over output sequences by observing a simple transition rule: from a current sequence assignment at time t - 1, we can transition into the next sequence at time t by changing the label at any position i. The distribution over these transitions is defined as:
$$P(y^t \mid y^{t-1}) = P(y_i^t \mid y_{-i}^{t-1}, x) \qquad (3)$$
where $y_{-i}^{t-1}$ is the set of all labels except $y_i$ at time t - 1.
To apply the bilingual constraints during decoding, we formulate a new factored model by combining the two monolingual CRF models (one for each language) with the bilingual constraint model via a simple product.[3] The resulting model is of the following form:
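The formula itself is not reproduced in these notes. One way to write the product just described is sketched below; the factor names $P_{CRF}$ and $P_{bi}$ are assumed notation, and the proportionality sign reflects that the product is not renormalized (compare footnote 3).

```latex
P(y^c, y^e \mid x^c, x^e) \;\propto\;
  P_{CRF}(y^c \mid x^c)\;
  P_{CRF}(y^e \mid x^e)\;
  P_{bi}(y^c, y^e \mid x^c, x^e)
```

Each factor can be evaluated independently, which is what makes Gibbs sampling over the combined model straightforward.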
Obtaining the state transition model for the monolingual CRF models is straight-forward. In the case of a first order linear-chain CRF, the Markov blanket is the neighboring two cliques. Given the Markov blanket of state i, the label at position i is independent of all other states. Thus we can compute the transition model simply by normalizing the product of the neighboring clique potentials. Finkel, Grenager, and Manning (2005) gave a more detailed account of how to compute this quantity.
The transition probability of label $y^c_i$ in the bilingual constraint model is defined in terms of the labels $y^e_k$ of the English words aligned to position i.
At decoding time, we walk the Markov chain by taking samples at each step. We start from some random assignment of the label sequence, and at each step we randomly sample a new value for yi at a randomly chosen position i. After a fixed number of steps, we output a complete sequence as our final solution. In practice, MCMC sampling could be quite slow and inefficient, especially when the input sentence is long. To speed up the sampling process, we initialize the state sequence from the best sequences found by Viterbi decoding using only the monolingual models.
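The decoding loop described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: `gibbs_decode`, `joint_score` and the toy scores are hypothetical, and for brevity each candidate label is scored by re-evaluating the whole sequence instead of only the Markov blanket of position i.

```python
import random
from typing import Callable, List, Sequence


def gibbs_decode(init: Sequence[str], labels: Sequence[str],
                 score: Callable[[List[str]], float],
                 steps: int = 2000, seed: int = 0) -> List[str]:
    """Resample one randomly chosen position per step and return the final sequence.

    `score` gives an unnormalized, non-negative probability for a full sequence.
    """
    rng = random.Random(seed)
    y = list(init)                       # e.g. the monolingual Viterbi output
    for _ in range(steps):
        i = rng.randrange(len(y))
        old = y[i]
        weights = []
        for lab in labels:               # conditional of y[i] given all other labels
            y[i] = lab
            weights.append(score(y))
        if sum(weights) <= 0.0:          # no candidate has support: keep the old label
            y[i] = old
            continue
        y[i] = rng.choices(list(labels), weights=weights, k=1)[0]
    return y


# Toy factored model for one aligned word pair: y[0] is Chinese, y[1] is English.
LABELS = ["O", "PER", "GPE"]
mono_c = {"O": 0.2, "PER": 0.4, "GPE": 0.4}     # hypothetical weak Chinese model
mono_e = {"O": 0.05, "PER": 0.9, "GPE": 0.05}   # hypothetical strong English model


def joint_score(y: List[str]) -> float:
    agree = 1.0 if y[0] == y[1] else 0.02       # soft agreement factor
    return mono_c[y[0]] * mono_e[y[1]] * agree


print(gibbs_decode(["O", "O"], LABELS, joint_score, steps=200))
```

Because the last state is simply a sample, the output can vary with the seed; annealing is what pushes the chain towards the most likely sequence.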
A bigger problem with vanilla Gibbs sampling is that the random samples we draw do not necessarily give us the most likely state sequence, as given by Viterbi in the exact inference case. One way to tackle this problem is to borrow the simulated annealing technique from optimization research (Kirkpatrick, Gelatt, and Vecchi 1983). We redefine the transition probability in Eqn. 3 (i.e., Equation 3) by raising it to the power $1/c_t$ and renormalizing, where $c = \{c_0, \ldots, c_T\}$ is the schedule of annealing "temperature," with $0 \le c_i \le 1$. The distribution becomes sharper as the value of $c_i$ moves towards 0. In our experiments we adopted a linear cooling schedule, starting from $c_0 = 1$. This technique has been shown to be effective by Finkel, Grenager, and Manning (2005).
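A small sketch of the sharpening step and a linear schedule, under stated assumptions: the names `anneal` and `linear_schedule` are made up here, the schedule `1 - t/T` is one plausible linear form that stops short of 0 (the exact schedule is elided in these notes), and the computation is done in log space to avoid underflow at low temperatures.

```python
import math
from typing import List


def anneal(probs: List[float], temperature: float) -> List[float]:
    """Raise each probability to 1/temperature and renormalize.

    Assumes 0 < temperature <= 1 and at least one positive probability.
    """
    logs = [math.log(p) / temperature if p > 0 else float("-inf") for p in probs]
    top = max(logs)
    exps = [math.exp(l - top) for l in logs]    # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def linear_schedule(num_steps: int) -> List[float]:
    """Temperatures falling linearly from 1 towards 0 without reaching it."""
    return [1.0 - t / num_steps for t in range(num_steps)]


for c in linear_schedule(4):
    print(c, [round(p, 3) for p in anneal([0.6, 0.4], c)])
```

As the temperature falls, the probability mass concentrates on the currently best label, so late samples behave almost like greedy updates.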
(Simulated annealing technique. Related articles:
https://blog.****.net/Eric2016_Lv/article/details/79701646
http://www.gnu.org/software/gsl/doc/html/siman.html)
Global Consistency Constraints
A distinctive feature of the proposed factored model and Gibbs sampling inference is the ability to incorporate nonlocal constraints that are not easily captured in a traditional Markov network model. The bilingual constraint model described earlier is certainly a beneficiary of this unique characteristic.
Still, there are further linguistic constraints that we can apply to improve the NER system. For example, many previous papers have made the observation that occurrences of the same word sequence within a given document are unlikely to take on different entity types (Bunescu and Mooney 2004; Sutton and McCallum 2004; Finkel, Grenager, and Manning 2005; inter alia). Similar to Finkel, Grenager, and Manning (2005), we devise a global consistency model as follows:
Γ is the set of all possible entity type violations, φγ is the penalty parameter for violation type γ, and #(γ, y, x) is the count of violations γ in sequence y. For example, if the word sequence “China Daily” has occurred both as GPE and organization exactly once, then the penalty φγ for GPE-to-organization violation will apply once. The parameter values of φγ are estimated empirically by counting the occurrences of entity pairs of the same word sequence in the training data.
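The sketch below shows how the violation counts might be gathered and turned into a score. It assumes the multiplicative form used by Finkel, Grenager, and Manning (2005), in which the unnormalized score is the product of each penalty raised to its violation count. The function names, the choice to count a clash as the product of the two type frequencies, and the penalty value 0.1 are illustrative assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Tuple


def violation_counts(mentions: Iterable[Tuple[str, str]]) -> Counter:
    """Count entity-type clashes among (surface string, entity type) mentions."""
    types_by_text = defaultdict(Counter)
    for text, etype in mentions:
        types_by_text[text][etype] += 1
    counts = Counter()
    for type_counts in types_by_text.values():
        for a, b in combinations(sorted(type_counts), 2):
            counts[frozenset((a, b))] += type_counts[a] * type_counts[b]
    return counts


def consistency_score(mentions: Iterable[Tuple[str, str]],
                      penalties: Dict[FrozenSet[str], float]) -> float:
    """Unnormalized score: product of penalty ** count over violation types."""
    score = 1.0
    for violation, n in violation_counts(mentions).items():
        score *= penalties[violation] ** n
    return score


# Hypothetical document: "China Daily" tagged once as GPE and once as ORG.
doc = [("China Daily", "GPE"), ("China Daily", "ORG"), ("Beijing", "GPE")]
penalties = {frozenset(("GPE", "ORG")): 0.1}     # made-up penalty value
print(violation_counts(doc))
print(consistency_score(doc, penalties))
```

A fully consistent document keeps a score of 1, so the factor only ever pushes labelings away from clashes.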
We can now factor in one global consistency model for each language by taking the product of Eqn. 4 with Eqn. 6. The same Gibbs sampling procedure applies unchanged to this new factored model. At test time, instead of tagging one sentence at a time, we group together sentences that belong to the same document, and tag one document at a time.
Enhancing Recall
A flaw of the Finkel, Grenager, and Manning (2005) model described above is that consistency is enforced by applying penalties to entity type violations. But if a word is not tagged with an entity type, it will not receive any penalty since no entity type violations would occur. Therefore, this model has the tendency of favoring null annotations, which can result in losses in model recall.
We fix this deficiency in Finkel, Grenager, and Manning (2005) by introducing a new “reward” parameter δ, which has value > 0. δ is activated each time we see a matching pair of entities for the same word occurrence. The new Pglo is modified as:
where #(δ, y, x) is the activation count of δ in sequence y.
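The modified formula is not reproduced in these notes. Assuming the same multiplicative form as in Finkel, Grenager, and Manning (2005), the reward would enter as one extra factor, roughly as below; treat this as a reconstruction from the surrounding definitions, not a quotation.

```latex
P_{glo}(y \mid x) \;\propto\;
  \delta^{\#(\delta, y, x)} \prod_{\gamma \in \Gamma} \phi_\gamma^{\#(\gamma, y, x)}
```

Under this reading, δ only acts as a reward when it exceeds 1, as the value of 2 used in the experiments does: each matching pair of entities then doubles the unnormalized score.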
This model is in fact a naive Bayes model, where the parameters δ and φ are empirically estimated (a value of 2 is used for δ in our experiments, based on tuning on a development set). A similar global consistency model was shown to be effective in Rush et al. (2012), where parameters were also tuned on a development set.
Experimental Setup
To compare the proposed bilingual constraint decoding algorithm against traditional monolingual methods, we evaluate on a large, manually annotated parallel corpus that contains named entity annotation in both Chinese and English. The corpus we use is the latest version (v4.0) of the OntoNotes corpus (Hovy et al. 2006), which includes 401 pairs of Chinese and English documents (chtb_0001-0325, ectb_1001-1078). We use odd-numbered documents as the development set and even-numbered documents as the blind test set.
These document pairs are aligned at document level, but not at sentence or word level. To obtain sentence alignment, we use the Champollion Tool Kit (CTK)[4]. After discarding sentences with no aligned counterpart, a total of 8,249 sentence pairs were retained. We induce word alignment using the BerkeleyAligner toolkit (Liang, Taskar, and Klein 2006). The aligner outputs the posterior probability for each aligned word pair. To increase efficiency, we prune away all alignments that have probability less than 0.1.
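The pruning step amounts to a simple filter over the aligner's posteriors. In this sketch the dictionary layout and the name `prune_alignments` are assumptions, and no actual BerkeleyAligner API is used.

```python
from typing import Dict, Tuple

Alignment = Tuple[int, int]   # (chinese_index, english_index)


def prune_alignments(posteriors: Dict[Alignment, float],
                     threshold: float = 0.1) -> Dict[Alignment, float]:
    """Drop aligned word pairs whose posterior probability is below the threshold."""
    return {pair: p for pair, p in posteriors.items() if p >= threshold}


# Hypothetical posteriors for one sentence pair.
posteriors = {(0, 0): 0.95, (1, 2): 0.40, (1, 3): 0.05, (2, 1): 0.10}
print(prune_alignments(posteriors))
```

The surviving probabilities are kept because the alignment-uncertainty variant of the model uses them as exponents.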
We adopt the state-of-the-art monolingual Stanford NER tagger as a strong baseline for both English and Chinese. For English, we use the default tagger setting from Finkel, Grenager, and Manning (2005). For Chinese, we use an improved set of features over the default tagger, which are listed in Table 1. Both models make use of distributional similarity features taken from word clusters trained on large amounts of non-overlapping data. We train the two CRF models on all portions of the OntoNotes corpus that are annotated with named entity tags, except the parallel-aligned portion which we reserve for development and test purposes. In total, there are about 660 documents (~16k sentences) and 1,400 documents (~39k sentences) for Chinese and English, respectively.
Out of the 18 named entity types that are annotated in OntoNotes, which include person, location, date, money, and so on, we select the four most commonly seen named entity types for evaluation. They are person, location, organization and GPE. All entities of these four types are converted to the standard BIO format, and background tokens and all other entity types are marked with tag O.
In all of the Gibbs sampling experiments, a fixed number of 2000 sampling steps are taken, and a linear cooling schedule is used in the deterministic annealing procedure.
In order to compare our method with past work, we obtained code from Burkett et al. (2010) and reproduced their experiment setting for the OntoNotes data. An extra set of 5,000 unannotated parallel sentence pairs are used for training the reranker, and the reranker model selection was performed on the development dataset.
We report standard NER measures (entity precision (P), recall (R) and F1 score) on the test set. Statistical significance tests are done using the paired bootstrap resampling method (Efron and Tibshirani 1993), where we repeatedly draw random samples with replacement from the output of the two systems, and compare the test statistics (e.g. absolute difference in F1 score) of the new samples with the observed test statistics. We used 1000 sampling iterations in our experiments.
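For readers unfamiliar with the test, here is a sketch of one common variant of paired bootstrap resampling, which reports how often system A fails to beat system B on resampled test sets. It is not necessarily the exact statistic used in the paper. The per-sentence count representation and the names `f1` and `paired_bootstrap` are assumptions.

```python
import random
from typing import List, Tuple

Counts = Tuple[int, int, int]   # per sentence: (true positives, false positives, false negatives)


def f1(counts: List[Counts]) -> float:
    """Corpus-level F1 from per-sentence entity counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def paired_bootstrap(sys_a: List[Counts], sys_b: List[Counts],
                     iterations: int = 1000, seed: int = 0) -> float:
    """Fraction of resamples in which A's F1 does not exceed B's."""
    rng = random.Random(seed)
    n = len(sys_a)
    failures = 0
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]   # same sentences for both systems
        delta = f1([sys_a[i] for i in idx]) - f1([sys_b[i] for i in idx])
        if delta <= 0:
            failures += 1
    return failures / iterations


# Hypothetical outputs on a ten-sentence test set.
sys_a = [(2, 0, 0)] * 6 + [(1, 1, 1)] * 4
sys_b = [(2, 0, 0)] * 3 + [(1, 1, 1)] * 7
print(paired_bootstrap(sys_a, sys_b))
```

Sampling the same sentence indices for both systems is what makes the test paired: it compares the systems on identical resampled data.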
Bilingual NER Results
The main results on Chinese and English test sets are shown in Table 2. The first row (CRF) shows the baseline monolingual model performance. As we can see, the performance on Chinese is much lower than on English. This is partially attributed to the fact that the Chinese NER tagger was trained on less than half as much data, but it is also because NER in Chinese is a harder problem (e.g., there are no capitalization features in Chinese, which is a very strong indicator of named entities in English).
By enforcing hard agreement constraints, we can see from row hard that there is an increase of about 1.4% in absolute F1 score on the Chinese side, but at the expense of a 0.9% drop on the English side. The tradeoff mainly occurs in recall.
When we loosen the bilingual constraint to allow soft agreement by simply assigning a hand-picked value (0.02) to aligned entities of different types (row manual), we observe a significant increase in accuracy in both Chinese and English. This suggests that the soft alignment successfully accounted for the cases where annotation standards differ in the two languages. In particular, the Chinese results are 3.8% better than the monolingual baseline, a 12% relative error reduction.
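The two percentages are related by simple arithmetic: relative error reduction is the absolute F1 gain divided by the baseline error (100 minus the baseline F1). In the sketch below the baseline F1 of 68.3 is back-computed from the 3.8-point gain and the 12% figure, not quoted from Table 2, so treat it as an assumption; the function name is also made up.

```python
def relative_error_reduction(f1_base: float, f1_new: float) -> float:
    """Fraction of the baseline error (100 - F1) removed by the new system."""
    return (f1_new - f1_base) / (100.0 - f1_base)


# Assumed baseline of 68.3 F1 and a gain of 3.8 points.
print(round(relative_error_reduction(68.3, 68.3 + 3.8), 3))
```

The same relation explains why a modest absolute gain on a weak baseline and a smaller gain on a strong baseline can correspond to similar relative error reductions.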
When we replace the arbitrary hand-picked soft-agreement probabilities with empirical counts from the auto-tagged dataset (row auto), we see a small increase in recall on both sides, but a drop in precision for Chinese. However, accounting for alignment uncertainty (row auto+aP) increases both precision and recall for Chinese, resulting in another 1.2% increase in absolute F1 score over the auto model.
Comparing against Burkett et al. (2010) (second row from the top), we can see that both our method and Burkett et al. (2010) significantly outperform the monolingual CRF baseline. This suggests that methods that explore bilingual language cues do have great utility in the NER task. Our best model (auto+aP) gives a significant gain over Burkett et al. (2010) on Chinese (by 2.2%), but trails behind on English by 0.7%. However, we will show in the next section some further improvements to our method by modeling global label consistency, which allows us to outperform Burkett et al. (2010) on both languages.
Results on Global Consistency
Table 3 shows results on the test set after factoring in a global consistency model. Adding global consistency to the monolingual baseline (mono) increases performance on English (consistent with results from previous work (Finkel, Grenager, and Manning 2005)), but hurts Chinese results, especially in recall.
A possible explanation is that CRF models for English are more certain about which words are entities (by having strong indicative features such as word capitalization), and thus a penalty does not persuade the model to label a word as a non-entity. However, in the Chinese case, the CRF model is weaker, and thus less certain about words being an entity or not. It is also much more likely that the same word (string) will be both an entity and a common word in Chinese than English. In some cases, the model will be better off marking a word as a non-entity, than risking taking a penalty for labeling it inconsistently. By applying the “reward” function, we see a drastic increase in recall on both Chinese and English, with a relatively small sacrifice in precision on Chinese. The overall F1 score increases by about 3.1% and 0.8% in Chinese and English, respectively.
Similar results can be found when we apply global consistency to the bilingual model (auto). Again we see a recall-precision tradeoff between models with or without a “reward” function. But overall, we observe a significant increase in performance when global consistency with a reward function is factored in.
Modeling alignment uncertainty continues to improve the Chinese results when the global consistency model is added, but shows a small performance decrease on the English side. But the gain on the Chinese side is more significant than the loss on English side.
The best overall F1 scores are achieved when bilingual constraints, global consistency with reward, and alignment uncertainty are conjoined. The combined model outperforms the CRF monolingual baseline, with an error reduction of 18.6% for Chinese and 9.9% for English. This model also significantly improves over the method of Burkett et al. (2010) with an error reduction of 10.8% for Chinese and 4.5% for English.
Beyond the difference in model performance, our method is much easier to understand and implement than Burkett et al. (2010). Their method involves simulating a multi-view learning environment using “weakened” monolingual models to train a reranking model, and transplanting the parameters of the “weakened” models to “strong” models at test time in a practical but ad-hoc manner.
Semi-supervised NER Results
In the previous section we demonstrated the utility of our proposed method in a bilingual setting, where parallel sentence pairs are tagged together and directly evaluated. In reality, this is not the common use case. Most down-stream NLP applications operate in a monolingual environment. Therefore, in order to benefit general monolingual NLP systems, we propose a semi-supervised learning setting where we use the bilingual tagger to annotate a large amount of unannotated bilingual text, then we take the tagged sentences on the Chinese side to retrain a monolingual Chinese tagger.
To evaluate the effectiveness of this approach, we used the Chinese-English part of the Foreign Broadcast Information Service corpus (FBIS, LDC2003E14), and tagged it with the auto+aP model. Unlike the OntoNotes dataset, this corpus does not contain document boundaries. In order to apply the document-level label consistency model, we divide the test set into blocks of ten sentences, and use the blocks as pseudo-documents.
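Splitting a corpus without document boundaries into pseudo-documents is a one-line chunking operation. The function name `pseudo_documents` and the placeholder sentences are assumptions for illustration.

```python
from typing import List, Sequence


def pseudo_documents(sentences: Sequence[str], block_size: int = 10) -> List[List[str]]:
    """Group consecutive sentences into fixed-size blocks; the last block may be shorter."""
    return [list(sentences[i:i + block_size])
            for i in range(0, len(sentences), block_size)]


# Hypothetical corpus of 25 sentences.
corpus = [f"sentence {i}" for i in range(25)]
print([len(block) for block in pseudo_documents(corpus)])
```

Each block is then tagged as a unit, so the global consistency factor can compare mentions across the ten sentences it contains.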
Results from self-training, as well as results from uptraining using model outputs from Burkett et al. (2010) are shown in Table 4. We can see that by using 80,000 additional sentences, our method gives a significant boost (~2.9%, an error reduction of ~9.2%) over the CRF baseline. Our method also improves over Burkett et al. (2010) by a significant margin.
The gains are more pronounced in recall than precision, which suggests that the semi-supervised approach using bilingual data is very effective in increasing the coverage of the monolingual tagger. On the other hand, monolingual self-training hurts performance in both precision and recall.
We also report results on the effect of using increasing amounts of unannotated bilingual data. When only 10k sentences are added to the Chinese side, we already see a 5.2% error reduction over the CRF baseline.
Conclusions
We introduced a factored model with a Gibbs sampling inference algorithm that can be used to produce more accurate tagging results for a parallel corpus. Our model makes use of cross-language bilingual constraints and intra-document consistency constraints. We further demonstrated that unlabeled parallel corpora tagged with our bilingual model can then be used to improve monolingual tagging results, using an uptraining scheme. The model presented here is not restricted to the NER task only, but can be adopted to improve other natural language applications as well, such as syntactic parsing and semantic analysis.
[1] This training regimen is also referred to as “uptraining” (Petrov et al. 2010).
[2] All of our code is made available at nlp.stanford.edu/software/CRF-NER.shtml.
[3] This model double-counts the state sequence conditioned on a given observation, and therefore is likely deficient. However, we do not find this to be a problem in practice.
[4] champollion.sourceforge.net