Convolutional Neural Networks (CNN) from Theory to Practice (7)

1. Overview

In this post we turn to text classification with CNNs. There is no actual code yet: we first give a theoretical introduction to CNNs for NLP and survey the related literature. The hands-on implementation is left to the next post.
The main reference for this post is:
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/


2. Basic concepts of CNNs for NLP

As with CNNs for images, we treat the input sentence as a two-dimensional matrix, where each row is the word vector of one word. If the input is a whole document, we get a three-dimensional tensor whose third dimension indexes the sentences in the document. Equivalently, you can view it as a series of two-dimensional samples (note that a series of two-dimensional samples is not the same thing as the series of channels of one two-dimensional sample).
Word vectors (also known as word embeddings) can be obtained in many ways, for example with word2vec or GloVe, or even as one-hot vectors. So for an input sentence of 10 words, each represented by a 100-dimensional word vector, we obtain a 10×100 "image".
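To make these shapes concrete, here is a minimal NumPy sketch (the vocabulary size and the random embedding table are purely illustrative assumptions) of how a 10-word sentence becomes a 10×100 matrix:

```python
import numpy as np

vocab_size, embed_dim, sent_len = 5000, 100, 10

# Hypothetical embedding table; in practice this would come from word2vec, GloVe,
# or be learned jointly with the network.
embedding = np.random.randn(vocab_size, embed_dim).astype(np.float32)

# A sentence is a sequence of word ids; looking them up row by row gives the 2D "image".
word_ids = np.array([12, 7, 256, 3, 99, 41, 8, 530, 2, 17])
sentence_matrix = embedding[word_ids]                    # shape (10, 100)

# A document (several sentences of equal length) stacks such matrices into a 3D tensor.
document = np.stack([sentence_matrix, sentence_matrix])  # shape (2, 10, 100)
print(sentence_matrix.shape, document.shape)
```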
A typical CNN architecture for NLP is shown in Figure 2.1:

Figure 2.1: Illustration of a CNN architecture for sentence classification.
Here three kernel sizes are used, with heights 2, 3 and 4, and two kernels of each size (that is, the network extracts two features of height 2, two of height 3, and two of height 4). The convolution results are pooled with 1-max pooling, and the final softmax layer performs a binary classification. Image source: Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification.
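As a rough, hedged sketch of the architecture in Figure 2.1 (this is not the paper's code; all sizes are taken from the caption or assumed), here is how it could look in tf.keras, with kernel heights 2, 3 and 4, two filters per height, 1-max pooling and a binary softmax:

```python
import tensorflow as tf

sent_len, embed_dim = 10, 100
inputs = tf.keras.Input(shape=(sent_len, embed_dim))   # one sentence as a word-vector matrix

pooled = []
for height in (2, 3, 4):
    # Each 1D kernel spans `height` consecutive words and the full embedding dimension.
    conv = tf.keras.layers.Conv1D(filters=2, kernel_size=height, activation="relu")(inputs)
    # 1-max pooling: keep only the largest response of each filter.
    pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv))

features = tf.keras.layers.Concatenate()(pooled)       # 3 heights x 2 filters = 6 features
outputs = tf.keras.layers.Dense(2, activation="softmax")(features)

model = tf.keras.Model(inputs, outputs)
model.summary()
```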


For computer vision, CNNs have a clear physical interpretation. For example, convolution provides location invariance (which we have explained several times in earlier posts) and compositionality, i.e. local features are composed into higher-level ones. These properties make good sense for images, but in NLP it is less obvious. You may indeed not care where in a sentence a word appears. However, pixels that are adjacent in an image are usually semantically related, which is not necessarily true in NLP: in many languages the parts of a phrase can be separated by several other words (for example the phrase "buy ... for ..."), so composing purely local features does not have a clear interpretation. Words do compose in their own ways, for instance adjectives modify nouns, but what the higher-level feature representations actually mean under this mechanism is much harder to grasp than in the image case.
For these reasons, CNNs do not seem to be the best fit for NLP tasks. In later posts you will see an introduction to RNNs (Recurrent Neural Networks), which are intuitively more convincing than CNNs when it comes to processing sentences.
But don't be discouraged: this does not mean CNNs cannot handle NLP tasks. Practice shows that CNNs work quite well in NLP. As the saying goes, "All models are wrong, but some are useful." The bag-of-words model we use all the time, for example, rests on assumptions that are clearly unrealistic, yet it has achieved very good results in real applications.

2.1 narrow/wide convolution

In an earlier post we discussed the zero-padding strategy. A convolution that uses zero padding is called a wide convolution; one that does not is called a narrow convolution.
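A quick way to see the difference is the standard output-length formula; the helper below is just an illustrative calculation, with a 10-word sentence and a kernel of height 5:

```python
def conv_output_length(n, h, padding=0, stride=1):
    """Number of output positions for a 1D convolution over n words
    with kernel height h, `padding` zeros on each side, and the given stride."""
    return (n + 2 * padding - h) // stride + 1

n, h = 10, 5
print(conv_output_length(n, h, padding=0))      # narrow convolution: 6 positions
print(conv_output_length(n, h, padding=h - 1))  # wide convolution: 14 positions (n + h - 1)
```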

2.2 stride size

Another hyperparameter is the stride, i.e. how far the kernel slides at each step. We usually use a stride of 1, but keep in mind that the larger the stride, the fewer convolution operations we perform and the smaller the output becomes. A larger stride makes the CNN somewhat resemble an RNN in structure (the resulting computation looks a bit like a tree).
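Reusing the same output-length formula, a quick (illustrative) check of how the stride shrinks the output:

```python
def conv_output_length(n, h, padding=0, stride=1):
    # Standard 1D convolution output length.
    return (n + 2 * padding - h) // stride + 1

# 10-word sentence, kernel height 3, no padding:
print(conv_output_length(10, 3, stride=1))  # 8 positions
print(conv_output_length(10, 3, stride=2))  # 4 positions
```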

2.3 pooling layers

We covered pooling layers in detail in an earlier post. Pooling subsamples the output of the previous layer, and the most common pooling operation is taking the max. In NLP we usually pool over the entire output, so that each kernel produces a single value.
Why pool at all? We have touched on this in earlier posts; here is a brief recap.
First, pooling provides a fixed-size output, which the downstream classifier needs. For example, if you have 1000 kernels and apply pooling to each of them, you get a 1000-dimensional output, regardless of how long the input is or how large the kernels are. This lets you handle variable-length sentences and variable-size kernels.
Second, pooling reduces the output dimensionality while keeping the most salient information. You can think of each kernel as a detector for a specific feature, for example a kernel that detects whether the sentence contains a negation such as "not amazing". If this phrase occurs somewhere in the sentence, the convolution at that position yields a large value, while convolutions at other positions yield small values. After max pooling, only the largest value survives: we know whether the negation appeared in the sentence, but its exact position is discarded by the pooling operation. In most cases we don't really need that position anyway; many similar language models, such as n-gram models, behave in the same way. In other words, you lose global information about position but keep the local feature: you do not know where in the sentence a feature occurred, yet you can still easily tell "not amazing" apart from "amazing not".
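A minimal NumPy sketch of 1-max pooling (the sentence lengths below are made up): whatever the length of the convolution output, each filter contributes exactly one value.

```python
import numpy as np

def one_max_pool(conv_out):
    """conv_out has shape (positions, filters); keep only the peak response per filter."""
    return conv_out.max(axis=0)

short_out = np.random.randn(6, 2)    # convolution output for a short sentence, 2 filters
long_out  = np.random.randn(48, 2)   # convolution output for a much longer sentence

print(one_max_pool(short_out).shape, one_max_pool(long_out).shape)  # (2,) and (2,)
```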

In images, pooling also provides this kind of invariance: you can translate or rotate the image, or scale it by a few pixels, and the output remains stable.

2.4 channels

So what is a channel? For a single piece of input data, you can think of channels as different components of that data, just like the images we processed in earlier posts: a color image is two-dimensional, but every pixel has r, g and b components, so we can split the input into 3 channels. A convolutional layer then typically convolves the three channels separately and sums the results.
In TensorFlow an image batch is usually represented with 4 dimensions: [batch of images, rows, columns, channels]. This is also why, at the beginning of Section 2, we said that "a series of two-dimensional samples is not the same thing as the series of channels of one two-dimensional sample": they are different dimensions. We discussed input dimensions in earlier posts, and since it has come up again here, we will not repeat it in the future.
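For example, a sketch with made-up sizes: a batch of color images versus a batch of sentences treated as single-channel "images". The batch axis and the channel axis are clearly different dimensions.

```python
import tensorflow as tf

images = tf.zeros([32, 224, 224, 3])   # [batch, rows, columns, channels]: 32 RGB images
sents  = tf.zeros([32, 10, 100, 1])    # 32 sentences of 10 words, 100-dim vectors, 1 channel

print(images.shape, sents.shape)
```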

Channels are not restricted to the input; they appear in later layers as well. For example, a convolutional layer with 4 kernels applied to a single-channel image produces 4 output feature maps. If another convolutional layer follows, these 4 maps become its input, and it sees them as 4 channels.

For images the notion of multiple channels is easy to understand. How should we think about channels in NLP?
As with images, you can view channels as different perspectives on the same object. In NLP you can imagine the channels as different representations of the words: for word embeddings, one channel could use word2vec, another GloVe, and a third one-hot vectors; for a sentence, one channel could be its English version and another its Chinese translation, and so on.
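A small sketch of this idea, assuming two hypothetical pre-computed embedding matrices (say, one from word2vec and one from GloVe) for the same 10-word sentence:

```python
import numpy as np

sent_len, embed_dim = 10, 100
w2v_matrix   = np.random.randn(sent_len, embed_dim)   # the sentence encoded with word2vec
glove_matrix = np.random.randn(sent_len, embed_dim)   # the same sentence encoded with GloVe

# Stack the two representations along a trailing channel axis,
# exactly like the r, g, b channels of an image.
two_channel_input = np.stack([w2v_matrix, glove_matrix], axis=-1)
print(two_channel_input.shape)  # (10, 100, 2)
```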

3. Related work

[1] Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 1746–1751.
[2] Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A Convolutional Neural Network for Modelling Sentences. Proceedings of ACL 2014, 655–665.
[3] Santos, C. N. dos, & Gatti, M. (2014). Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts. In COLING-2014 (pp. 69–78).
[4] Johnson, R., & Zhang, T. (2015). Effective Use of Word Order for Text Categorization with Convolutional Neural Networks. NAACL 2015.
[5] Johnson, R., & Zhang, T. (2015). Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding.
[6] Wang, P., Xu, J., Xu, B., Liu, C., Zhang, H., Wang, F., & Hao, H. (2015). Semantic Clustering and Convolutional Neural Network for Short Text Categorization. Proceedings ACL 2015, 352–357.
[7] Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification.
[8] Nguyen, T. H., & Grishman, R. (2015). Relation Extraction: Perspective from Convolutional Neural Networks. Workshop on Vector Modeling for NLP, 39–48.
[9] Sun, Y., Lin, L., Tang, D., Yang, N., Ji, Z., & Wang, X. (2015). Modeling Mention, Context and Entity with Neural Networks for Entity Disambiguation. IJCAI 2015, 1333–1339.
[10] Zeng, D., Liu, K., Lai, S., Zhou, G., & Zhao, J. (2014). Relation Classification via Convolutional Deep Neural Network. COLING 2014, 2335–2344.
[11] Gao, J., Pantel, P., Gamon, M., He, X., & Deng, L. (2014). Modeling Interestingness with Deep Neural Networks.
[12] Shen, Y., He, X., Gao, J., Deng, L., & Mesnil, G. (2014). A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management – CIKM’14, 101–110.
[13] Weston, J., & Adams, K. (2014). #TagSpace: Semantic Embeddings from Hashtags, 1822–1827.
[14] Santos, C., & Zadrozny, B. (2014). Learning Character-level Representations for Part-of-Speech Tagging. Proceedings of the 31st International Conference on Machine Learning (ICML 2014), 1818–1826.
[15] Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level Convolutional Networks for Text Classification, 1–9.
[16] Zhang, X., & LeCun, Y. (2015). Text Understanding from Scratch. arXiv preprint.
[17] Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2015). Character-Aware Neural Language Models.


The tasks that suit CNNs best are probably classification tasks such as Sentiment Analysis, Spam Detection or Topic Categorization. Because convolution and pooling lose information about the order and position of words, sequence tagging problems such as PoS Tagging or Entity Extraction are harder to fit into a pure CNN framework.


Paper [1] uses a CNN for sentiment analysis and topic categorization. The network architecture is very simple. As we said earlier, in NLP different channels can correspond to different word-vector representations; here two channels are used, one holding fixed word2vec vectors and the other holding vectors that are adjusted dynamically during training (see the sketch below).
[Figure: the CNN architecture from Kim, Y. (2014), Convolutional Neural Networks for Sentence Classification.]
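A hedged tf.keras sketch of that two-channel idea (this is not Kim's original code; the vocabulary size and the stand-in pre-trained matrix are assumptions): one embedding layer is frozen, the other is fine-tuned during training.

```python
import numpy as np
import tensorflow as tf

vocab_size, embed_dim, sent_len = 5000, 100, 10
pretrained = np.random.randn(vocab_size, embed_dim).astype("float32")  # stand-in for word2vec

word_ids = tf.keras.Input(shape=(sent_len,), dtype="int32")

# Static channel: pre-trained vectors kept fixed during training.
static_emb = tf.keras.layers.Embedding(
    vocab_size, embed_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained),
    trainable=False)(word_ids)

# Non-static channel: same initialization, but fine-tuned by backpropagation.
tuned_emb = tf.keras.layers.Embedding(
    vocab_size, embed_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained),
    trainable=True)(word_ids)

# Stack the two views as channels of one (words x embedding) "image".
two_channels = tf.stack([static_emb, tuned_emb], axis=-1)  # (batch, 10, 100, 2)
print(two_channels.shape)
```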

The convolutional network in [2] is somewhat more complex than that of [1]:
[Figure: the CNN architecture from Kalchbrenner et al. (2014), A Convolutional Neural Network for Modelling Sentences.]

[6] Adds an additional layer that performs “semantic clustering” to this network architecture.


[4] Trains a CNN from scratch, without the need for pre-trained word vectors like word2vec or GloVe. It applies convolutions directly to one-hot vectors. The author also proposes a space-efficient bag-of-words-like representation for the input data, reducing the number of parameters the network needs to learn. In [5] the author extends the model with an additional unsupervised "region embedding" that is learned using a CNN predicting the context of text regions. The approach in these papers seems to work well for long-form texts (like movie reviews), but their performance on short texts (like tweets) isn't clear. Intuitively, it makes sense that using pre-trained word embeddings for short texts would yield larger gains than using them for long texts.

Building a CNN architecture means that there are many hyperparameters to choose from, some of which I presented above: input representations (word2vec, GloVe, one-hot), number and sizes of convolution filters, pooling strategies (max, average), and activation functions (ReLU, tanh). [7] performs an empirical evaluation of the effect of varying hyperparameters in CNN architectures, investigating their impact on performance and variance over multiple runs. If you are looking to implement your own CNN for text classification, using the results of this paper as a starting point would be an excellent idea. A few results that stand out are that max pooling always beats average pooling, that the ideal filter sizes are important but task-dependent, and that regularization doesn't seem to make a big difference in the NLP tasks that were considered. A caveat of this research is that all the datasets were quite similar in terms of their document length, so the same guidelines may not apply to data that looks considerably different.

[8] explores CNNs for Relation Extraction and Relation Classification tasks. In addition to the word vectors, the authors use the relative positions of words with respect to the entities of interest as an input to the convolutional layer. This model assumes that the positions of the entities are given, and that each example input contains one relation. [9] and [10] have explored similar models.

Another interesting use case of CNNs in NLP can be found in [11] and [12], coming out of Microsoft Research. These papers describe how to learn semantically meaningful representations of sentences that can be used for Information Retrieval. The example given in the papers includes recommending potentially interesting documents to users based on what they are currently reading. The sentence representations are trained based on search engine log data.

Most CNN architectures learn embeddings (low-dimensional representations) for words and sentences in one way or another as part of their training procedure. Not all papers though focus on this aspect of training or investigate how meaningful the learned embeddings are. [13] presents a CNN architecture to predict hashtags for Facebook posts, while at the same time generating meaningful embeddings for words and sentences. These learned embeddings are then successfully applied to another task – recommending potentially interesting documents to users, trained based on clickstream data.

CHARACTER-LEVEL CNNS

So far, all of the models presented were based on words. But there has also been research in applying CNNs directly to characters. [14] learns character-level embeddings, joins them with pre-trained word embeddings, and uses a CNN for Part-of-Speech tagging. [15] and [16] explore the use of CNNs to learn directly from characters, without the need for any pre-trained embeddings. Notably, the authors use a relatively deep network with a total of 9 layers, and apply it to Sentiment Analysis and Text Categorization tasks. Results show that learning directly from character-level input works very well on large datasets (millions of examples), but underperforms simpler models on smaller datasets (hundreds of thousands of examples). [17] explores the application of character-level convolutions to Language Modeling, using the output of the character-level CNN as the input to an LSTM at each time step. The same model is applied to various languages.

What’s amazing is that essentially all of the papers above were published in the past 1-2 years. Obviously there has been excellent work with CNNs on NLP before, as in Natural Language Processing (almost) from Scratch, but the pace of new results and state of the art systems being published is clearly accelerating.


welcome!

Xiangguo Sun
[email protected]
http://blog.****.net/github_36326955


Welcome to my blog column: Dive into ML/DL!


I devote myself to diving into typical algorithms in machine learning and deep learning, especially their application in the area of computational personality.

My research interests include computational personality, user portraits, online social networks, computational society, and ML/DL. In fact, you can find the internal connections between these concepts:

[Figure: the relationship between these research areas]

In this blog column, I will introduce some typical algorithms of machine learning and deep learning used in OSNs (Online Social Networks), which means we will cover NLP, network communities, information diffusion, and individual recommendation systems. Ultimately, our target is to dive into user portraits, especially the issues of personality analysis.


All essays are created by myself, and the copyright is reserved by me. You may use them for non-commercial purposes, and if you are kind enough to make a donation, you can scan the QR code below. All donations will go to a charity library for children in Lhasa.


Donations will be used for the children's book charity drive in Lhasa.
Public welfare: hear the voice of IT people.

Scan with your phone:
[QR code]

Appendix: "In Spring, Our Children's Library in Lhasa Needs Your Help"