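Author: Qian Yang, School of Management, Hefei University of Technology. Email: [email protected]
These are my personal reading notes on the paper. They may contain shortcomings, and discussion is welcome.
Reproduction without the author's permission is prohibited.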

Original paper

One of the visualizations in the original word embeddings paper uses PCA to display the data, as shown in the figure below.
The original paper is:

Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[C]//Advances in neural information processing systems. 2013: 3111-3119.

Training word2vec with gensim and visualizing the vectors in 2D with PCA

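Implementation

In Python, Gensim makes it easy to train a word2vec model on a corpus and obtain word vectors. The trained vectors can then be projected with PCA from sklearn and plotted with matplotlib. The following Python 3 program does this: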

# -*- coding: utf-8 -*-
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
# Training corpus: each sentence is a list of tokens
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]
# Train the model on the corpus
model = Word2Vec(sentences, window=5, min_count=1)

# Collect the vocabulary and its vectors.
# This uses the gensim 4.x API; in gensim 3.x the vocabulary was model.wv.vocab.
words = list(model.wv.index_to_key)
X = model.wv[words]
# Fit a 2D PCA to the word vectors
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# Plot the projected points and label each one with its word
pyplot.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

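The program produces the following visualization:

[Figure: 2D PCA scatter plot of the trained word vectors, each point labeled with its word]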
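
To make the PCA step in the program above less of a black box, here is a minimal sketch of the same projection written with numpy only. It is not how sklearn is implemented internally, just the standard approach of centering the data and taking an SVD. The function name pca_2d and the random array fake_vectors are assumptions made for illustration; fake_vectors stands in for trained word vectors (14 words, 100 dimensions, which should match the vocabulary and default vector size of the example above).

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    # Center each dimension so the components describe variance, not the mean
    centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal directions,
    # ordered by decreasing singular value
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Coordinates of every vector along the top two directions
    projected = centered @ vt[:2].T
    # Fraction of the total variance captured by each component
    explained = (s ** 2) / np.sum(s ** 2)
    return projected, explained[:2]

# Hypothetical stand-in for trained word vectors (not real word2vec output)
rng = np.random.default_rng(0)
fake_vectors = rng.normal(size=(14, 100))

coords, ratio = pca_2d(fake_vectors)
print(coords.shape)   # one (x, y) pair per word
print(ratio.sum())    # share of variance kept by the 2D plot
```

The second return value is worth checking on real vectors: when the two components keep only a small share of the variance, distances in the 2D plot say little about distances in the original space. The coordinates may also differ from sklearn's output by the sign of each axis, which does not change the shape of the plot.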

References:
https://machinelearningmastery.com/develop-word-embeddings-python-gensim/
Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[C]//Advances in neural information processing systems. 2013: 3111-3119.
