Spam Identification - Naive Bayes Algorithm - Supplement
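So far we have used max_features = 5000. As a tuning step, we now look at how the bag-of-words vocabulary size max_features affects the result, by measuring accuracy for max_features from 1000 up to 20000 (in steps of 2000, so 1000, 3000, ..., 19000). We construct the following function: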
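# These imports are only needed if the earlier program does not already have them.
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def show_diffrent_max_features():
    # get_features_by_wordbag() and the global max_features come from the earlier program.
    global max_features
    a = []  # max_features values tried
    b = []  # accuracy obtained for each value
    for i in range(1000, 20000, 2000):
        max_features = i
        print("max_features=%d" % i)
        # Rebuild the bag-of-words features with the new vocabulary size.
        x, y = get_features_by_wordbag()
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
        gnb = GaussianNB()
        gnb.fit(x_train, y_train)
        y_pred = gnb.predict(x_test)
        score = metrics.accuracy_score(y_test, y_pred)
        a.append(max_features)
        b.append(score)
    # Plot once, after all values have been evaluated.
    plt.plot(a, b, 'r', label="accuracy")  # a label is required for legend() to show anything
    plt.xlabel("max_features")
    plt.ylabel("metrics.accuracy_score")
    plt.title("metrics.accuracy_score VS max_features")
    plt.legend()
    plt.show()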
Add this function to the code from the previous post, "Spam Identification - Naive Bayes Algorithm", and call it from main():
show_diffrent_max_features()
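To make it clearer what the swept parameter changes, here is a minimal pure-Python sketch of the idea behind max_features: keep only the most frequent terms in the corpus and count those per document. It is not the CountVectorizer implementation (no stop words, no accent stripping, and ties between equally frequent terms may be resolved differently). The names build_vocabulary, vectorize and toy_docs are hypothetical and do not appear in the original program.

```python
import re
from collections import Counter

# Tokens of two or more word characters, the same shape as the
# token_pattern printed in the CountVectorizer log.
TOKEN = re.compile(r"\b\w\w+\b")

def build_vocabulary(docs, max_features):
    """Return the max_features most frequent terms, sorted alphabetically."""
    counts = Counter()
    for doc in docs:
        counts.update(TOKEN.findall(doc.lower()))
    top = [term for term, _ in counts.most_common(max_features)]
    return sorted(top)

def vectorize(docs, vocabulary):
    """Turn each document into a list of term counts over the vocabulary."""
    index = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in TOKEN.findall(doc.lower()):
            if tok in index:  # terms outside the vocabulary are dropped
                row[index[tok]] += 1
        rows.append(row)
    return rows

# Hypothetical toy corpus, only for illustration.
toy_docs = ["free money free offer", "meeting schedule money", "free meeting"]
vocab = build_vocabulary(toy_docs, 3)
matrix = vectorize(toy_docs, vocab)
print(vocab)
print(matrix)
```

A larger max_features means more columns per document, which is why both the information available to the classifier and the cost of fitting it grow with this setting.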
Output:
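As the plot shows, accuracy generally rises as max_features grows, and so does the total run time. Once max_features exceeds about 13000 the accuracy starts to drop again, so max_features can be set to around 15000, where the accuracy is close to 96.7%. In practice, however, the run time becomes noticeably long once max_features exceeds 5000; at max_features=5000 the accuracy is 95.5%.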
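The trade-off between accuracy and run time described above can be measured rather than eyeballed. The following is a minimal sketch, assuming a hypothetical evaluate(n) callable that would wrap the feature extraction, training and scoring steps for one max_features value; sweep, best_within_budget and stand_in_evaluate are illustrative names, and the stand-in evaluator returns placeholder numbers, not real accuracies.

```python
import time

def sweep(values, evaluate):
    """Call evaluate(v) for each v and return (value, score, seconds) tuples."""
    results = []
    for v in values:
        start = time.perf_counter()
        score = evaluate(v)
        elapsed = time.perf_counter() - start
        results.append((v, score, elapsed))
    return results

def best_within_budget(results, max_seconds):
    """Return the highest-scoring entry whose run time fits the budget, or None."""
    affordable = [r for r in results if r[2] <= max_seconds]
    return max(affordable, key=lambda r: r[1]) if affordable else None

def stand_in_evaluate(n):
    # Placeholder so the sketch runs on its own; NOT a real accuracy.
    # In the real program this would rebuild the features with
    # max_features=n, fit GaussianNB and return the accuracy score.
    return n / (n + 1000.0)

results = sweep(range(1000, 20000, 2000), stand_in_evaluate)
for value, score, seconds in results:
    print("max_features=%d score=%.4f time=%.4fs" % (value, score, seconds))
print(best_within_budget(results, max_seconds=60.0))
```

Recording the elapsed time next to each score makes it possible to choose the setting by an explicit time budget instead of by impression.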
With max_features set to 15000, the program prints:
Hello spam-mail
get_features_by_wordbag
Load C:/Users/Administrator/PycharmProjects/tensortflow快速入门/tensorflow_study\MNIST_data_bak/enron1/ham/
Load C:/Users/Administrator/PycharmProjects/tensortflow快速入门/tensorflow_study\MNIST_data_bak/enron1/spam/
CountVectorizer(analyzer='word', binary=False, decode_error='ignore',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=15000, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words='english',
strip_accents='ascii', token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
NB and wordbag
0.9671338811019816
[[1419 44]
[ 24 582]]
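The accuracy printed above can be re-derived from the confusion matrix, which is a useful sanity check. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predicted labels) and additionally assumes that ham is label 0 and spam is label 1; the label assignment is not stated in this post, so the precision and recall readings depend on that assumption.

```python
# Confusion matrix copied from the log above.
cm = [[1419, 44],
      [24, 582]]

# Assumed layout: row = true label, column = predicted label,
# label 0 = ham, label 1 = spam (an assumption, see the text above).
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # correctly classified / all test mails
precision = tp / (tp + fp)     # of mails flagged as spam, share that really are spam
recall = tp / (tp + fn)        # of real spam mails, share that were flagged

print("test mails:", total)
print("accuracy:  %.10f" % accuracy)
print("precision: %.4f" % precision)
print("recall:    %.4f" % recall)
```

The accuracy computed this way, 2001/2069, agrees with the 0.9671338811019816 in the log.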