Machine learning portfolio in manufacturing intelligence practice 2
-
Attribute Interactions
You can change the size of the plots in Weka's Visualize tab.
We can see good separation between the classes on the scatter plots; petalwidth versus sepallength and petalwidth versus sepalwidth are good examples.
This suggests that linear methods, and perhaps decision trees and instance-based methods, may do well on this problem. It also suggests that we probably do not need to spend much time tuning or using advanced modeling techniques and ensembles. This may be a straightforward modeling problem.
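As a quick numerical check of that separation, here is a minimal sketch using the Weka Java API. It assumes the standard iris.arff file that ships with Weka on the current path, weka.jar on the classpath, and a hypothetical class name; it simply prints the mean petalwidth per class, and clearly separated means back up what the scatter plots show visually.

```java
import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassSeparationSketch {
    public static void main(String[] args) throws Exception {
        // Load the iris data set and mark the last attribute as the class.
        Instances data = DataSource.read("iris.arff");   // file path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        Attribute petalWidth = data.attribute("petalwidth");
        Attribute cls = data.classAttribute();

        // Mean petalwidth per class; well-separated means support the
        // visual separation seen in the scatter plots.
        double[] sum = new double[cls.numValues()];
        int[] count = new int[cls.numValues()];
        for (Instance inst : data) {
            int c = (int) inst.classValue();
            sum[c] += inst.value(petalWidth);
            count[c]++;
        }
        for (int c = 0; c < cls.numValues(); c++) {
            System.out.printf("%s: mean petalwidth = %.2f%n", cls.value(c), sum[c] / count[c]);
        }
    }
}
```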
Spot Check Algorithms (Evaluate Algorithms)
Weka has an Experimenter interface that helps compare the accuracy of multiple algorithms.
In this practice, I will add the multiclass classification algorithms below (a programmatic equivalent is sketched after the list):
-rules.ZeroR
-bayes.NaiveBayes
-functions.Logistic
-functions.SMO
-lazy.IBk (change the KNN parameter from 1 to 3)
-rules.PART
-trees.REPTree
-trees.J48
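The sketch below is a simplified programmatic equivalent of this experiment using the Weka Java API: a single run of 10-fold cross-validation for each algorithm on iris.arff. The file path and class name are assumptions for illustration, and note that the Experimenter itself averages over 10 repeated runs rather than one.

```java
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.rules.PART;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class SpotCheckSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");   // file path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        IBk ibk = new IBk();
        ibk.setKNN(3);   // change the KNN parameter from 1 to 3

        Classifier[] models = {
            new ZeroR(), new NaiveBayes(), new Logistic(), new SMO(),
            ibk, new PART(), new REPTree(), new J48()
        };

        // One run of 10-fold cross-validation per algorithm.
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.printf("%-45s %.2f%%%n",
                model.getClass().getName(), eval.pctCorrect());
        }
    }
}
```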
After running the experiment, we can analyse the results. Choosing ZeroR as the test base (the default), the results are as below:
We can see from the above that all of the models have skill. The scores of the other algorithms are better than ZeroR, and the difference is statistically significant.
The results suggest that both Logistic Regression and SVM achieved the highest accuracy. If there is no other reason to prefer one, we can choose the simpler Logistic Regression algorithm.
We can further compare all of the results using the Logistic Regression result as the test base.
We now see a very different story. Although the results for Logistic Regression look better, the analysis suggests that the difference between these results and the results from all of the other algorithms is not statistically significant. From here we could choose an algorithm based on other criteria, like understandability or complexity (a sketch of the corrected paired t-test behind this comparison follows below).
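For context, the significance judgement in the Analyse tab is based on a corrected paired t-test over per-fold accuracies. The sketch below illustrates the same idea for one arbitrarily chosen pair of algorithms (Logistic vs. Naive Bayes) evaluated on identical folds; it uses a single 10-fold run and an assumed file path and class name, whereas the Experimenter averages over 10 repeated runs, so treat it as an approximation of what Weka computes.

```java
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class CorrectedTTestSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");   // file path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        int folds = 10;
        Instances randData = new Instances(data);
        randData.randomize(new Random(1));
        randData.stratify(folds);

        // Per-fold accuracy differences between the two classifiers on identical folds.
        double[] diff = new double[folds];
        for (int i = 0; i < folds; i++) {
            Instances train = randData.trainCV(folds, i);
            Instances test = randData.testCV(folds, i);
            diff[i] = foldAccuracy(new Logistic(), train, test)
                    - foldAccuracy(new NaiveBayes(), train, test);
        }

        // Corrected resampled t-statistic (Nadeau & Bengio), the idea behind
        // Weka's "Paired T-Tester (corrected)".
        double mean = 0, var = 0;
        for (double d : diff) mean += d;
        mean /= folds;
        for (double d : diff) var += (d - mean) * (d - mean);
        var /= (folds - 1);
        double testTrainRatio = 1.0 / (folds - 1);        // n_test / n_train for k-fold CV
        double t = mean / Math.sqrt((1.0 / folds + testTrainRatio) * var);

        // Compare |t| to the critical value of a t-distribution with folds - 1
        // degrees of freedom at the chosen significance level.
        System.out.printf("mean difference = %.2f, corrected t = %.3f%n", mean, t);
    }

    static double foldAccuracy(Classifier cls, Instances train, Instances test) throws Exception {
        cls.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(cls, test);
        return eval.pctCorrect();
    }
}
```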
From the perspective of understandability and complexity, Logistic Regression and Naive Bayes are good candidates. We could also seek to further improve the results of one or more of these algorithms and see if we can achieve a significant improvement.
The "Significance" parameter in the Analyse tab has a default value of 0.05 (5%).
ower measurements. We can use the mean and standard deviation of the model accuracy collected in the last section to help quantify the expected variability in the estimated accuracy of the model on unseen data. For example, we know that about 95% of model accuracies will fall within two standard deviations of the mean model accuracy. Or, restated in a way we can explain to other people, we can generally expect that the performance of the model on unseen data will be 96.33% plus or minus 2 × 3.38, or 6.76, that is, between 89.57% and 100% accurate.
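As a quick check of that arithmetic, here is a tiny sketch using the 96.33% mean and 3.38 standard deviation quoted above (the class name is hypothetical):

```java
public class AccuracyIntervalSketch {
    public static void main(String[] args) {
        double mean = 96.33;    // mean cross-validation accuracy (%) from the text above
        double stdDev = 3.38;   // standard deviation of accuracy (%) from the text above
        double lower = mean - 2 * stdDev;                   // 96.33 - 6.76 = 89.57
        double upper = Math.min(100.0, mean + 2 * stdDev);  // capped at 100%
        System.out.printf("Expected accuracy on unseen data: %.2f%% to %.2f%%%n", lower, upper);
    }
}
```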