Finding the best-fitting sentence for a list of tokens
Problem description:
I have the following DataFrame for text mining:
import pandas as pd

df = pd.DataFrame({'text': ["Anyone who reads Old and Middle English literary texts will be familiar with the mid-brown volumes of the EETS, with the symbol of Alfreds jewel embossed on the front cover",
                            "Most of the works attributed to King Alfred or to Aelfric, along with some of those by bishop Wulfstan and much anonymous prose and verse from the pre-Conquest period, are to be found within the Society's three series",
                            "all of the surviving medieval drama, most of the Middle English romances, much religious and secular prose and verse including the English works of John Gower, Thomas Hoccleve and most of Caxton's prints all find their place in the publications",
                            "Without EETS editions, study of medieval English texts would hardly be possible."]})
text
0 Anyone who reads Old and Middle English litera...
1 Most of the works attributed to King Alfred or...
2 all of the surviving medieval drama, most of t...
3 Without EETS editions, study of medieval Engli...
And I have this list of token lists:
tokens = [['middl engl', 'mid-brown', 'symbol'], ["king", 'anonym', 'series'], ['mediev', 'romance', 'relig'], ['hocclev', 'edit', 'publ']]
For each token list in the list above, I am trying to find the sentence that fits it best.
Update: I was asked to explain my problem in more detail.
The problem is that I am doing this on non-English texts, which makes it awkward to illustrate my problem more concretely.
I am looking for some function x that takes each element of my tokens list as input and, for each such element, searches df.text for the sentence that fits it best (best perhaps in the sense of some metric). The exact form of the output is not the main point; I just want it to work :)
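For illustration only, here is a minimal sketch of the counting variant of such a function x: it scores every sentence by how many of a token list's stems occur in it (case-insensitive substring matching, since the tokens are stems such as 'mediev') and returns the highest-scoring sentence. The name best_sentence is hypothetical, not from any library:

def best_sentence(token_list, sentences):
    # Score a sentence by how many of the stems occur in it as substrings.
    def score(sentence):
        s = sentence.lower()
        return sum(tok in s for tok in token_list)
    return max(sentences, key=score)

for token_list in tokens:
    print(token_list, '->', best_sentence(token_list, df.text))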
Answer:
As I said above, this post is only an illustration of my problem. I am actually solving a clustering problem, using LDA and the K-means algorithm. To find the sentence that fits each of my token lists best, I use the K-means distance parameter.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import lda
from sklearn.feature_extraction.text import CountVectorizer
import logging
from sklearn.cluster import MiniBatchKMeans
from sklearn import preprocessing
df = pd.DataFrame({'text': ["Anyone who reads Old and Middle English literary texts will be familiar with the mid-brown volumes of the EETS, with the symbol of Alfreds jewel embossed on the front cover",
                            "Most of the works attributed to King Alfred or to Aelfric, along with some of those by bishop Wulfstan and much anonymous prose and verse from the pre-Conquest period, are to be found within the Society's three series",
                            "all of the surviving medieval drama, most of the Middle English romances, much religious and secular prose and verse including the English works of John Gower, Thomas Hoccleve and most of Caxton's prints all find their place in the publications",
                            "Without EETS editions, study of medieval English texts would hardly be possible."],
                   'tokens': [['middl engl', 'mid-brown', 'symbol'], ["king", 'anonym', 'series'], ['mediev', 'romance', 'relig'], ['hocclev', 'edit', 'publ']]})

# Join each token list into a single string so the vectorizers can consume it.
df['tokens'] = df.tokens.str.join(',')
# TF-IDF vectors of the joined token lists (unigrams and bigrams).
vectorizer = TfidfVectorizer(min_df=1, max_features=10000, ngram_range=(1, 2))
vz = vectorizer.fit_transform(df['tokens'])

logging.getLogger("lda").setLevel(logging.WARNING)

# Raw term counts for LDA, which expects count data rather than TF-IDF weights.
cvectorizer = CountVectorizer(min_df=1, max_features=10000, ngram_range=(1, 2))
cvz = cvectorizer.fit_transform(df['tokens'])

# Topic proportions for each document.
n_topics = 4
n_iter = 2000
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(cvz)
num_clusters = 4
kmeans_model = MiniBatchKMeans(n_clusters=num_clusters, init='k-means++', n_init=1,
                               init_size=1000, batch_size=1000, verbose=False, max_iter=1000)

# First clustering, on the TF-IDF vectors.
kmeans = kmeans_model.fit(vz)
kmeans_clusters = kmeans.predict(vz)
kmeans_distances = kmeans.transform(vz)        # distances to every cluster centre

# Second clustering, on the LDA topic proportions.
X_all = X_topics
kmeans1 = kmeans_model.fit(X_all)
kmeans_clusters1 = kmeans1.predict(X_all)
kmeans_distances1 = kmeans1.transform(X_all)   # shape: (n_docs, n_clusters)
# Track, for the cluster of interest, the sentence with the smallest
# distance to that cluster's centre seen so far.
d = dict()
l = 1
for i, desc in enumerate(df.text):
    if i < 3:
        num = 3   # the cluster we are interested in
        if kmeans_clusters1[i] == num:
            if l > kmeans_distances1[i][kmeans_clusters1[i]]:
                l = kmeans_distances1[i][kmeans_clusters1[i]]
                d['Cluster' + str(kmeans_clusters1[i])] = "distance: " + str(l) + " " + df.iloc[i]['text']
    print("Cluster " + str(kmeans_clusters1[i]) + ": " + desc +
          " (distance: " + str(kmeans_distances1[i][kmeans_clusters1[i]]) + ")")
print('---')
print("Cluster " + str(num) + " " + str(d.get('Cluster' + str(num))))
So, within a given cluster, the sentence with the smallest distance to the cluster centre is the one that fits best.
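For reference, the same selection can be written more compactly with the kmeans1 model and X_all from above: kmeans1.transform returns every document's distance to every cluster centre, so the best-fitting sentence for a cluster is simply the row with the smallest value in that cluster's column. A sketch:

import numpy as np

distances = kmeans1.transform(X_all)           # shape: (n_docs, n_clusters)
for cluster in range(num_clusters):
    best = np.argmin(distances[:, cluster])    # index of the closest document
    print("Cluster " + str(cluster) + ": " + df.iloc[best]['text'] +
          " (distance: " + str(distances[best, cluster]) + ")")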
Also, could you explain a bit more about your problem and add the expected output? –
Compute the similarity between each sentence and each token list, and pick the most similar sentence as the output for that token list. Or, even simpler, count how often the tokens of each token list occur in each sentence, and pick the sentence with the most occurrences as that token list's output. – mutux
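A minimal sketch of the similarity variant suggested above, assuming a single TF-IDF space shared by sentences and token lists: project the joined token lists into a vectorizer fitted on the sentences and take the sentence with the highest cosine similarity. Note that this only matches tokens that occur verbatim in the sentences; since the tokens here are stems ('mediev', 'hocclev'), in practice one would first stem the sentences with the same stemmer:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vec = TfidfVectorizer()
sent_vecs = vec.fit_transform(df['text'])                   # one vector per sentence
token_vecs = vec.transform([' '.join(t) for t in tokens])   # token lists in the same space
sim = cosine_similarity(token_vecs, sent_vecs)              # (n_token_lists, n_sentences)

for i, row in enumerate(sim):
    print(tokens[i], '->', df['text'].iloc[row.argmax()])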